Hardware Acceleration of YOLOv3-tiny Object Detection

Author: V.V.S.Prithvi 2018900104
Date: 2023-04-28
Report no: IIIT/TH/2023/28
Advisor: Suresh Purini

Abstract

FPGAs are increasingly significant for deploying convolutional neural network (CNN) inference models because of performance demands and power constraints in embedded and data centre applications. The compute intensity of these models makes prototyping highly complex and time-consuming with traditional RTL approaches. The release of new-generation high-level synthesis (HLS) tools, such as the Intel FPGA SDK for OpenCL and Xilinx's Vitis Unified Software Development platform, has significantly reduced the time and complexity of prototyping complex designs on FPGAs. This work involves building custom FPGA accelerators for image recognition systems using OpenCL-HLS. Object detection and classification are vital steps in building image recognition systems. The first part of the work concerns building an FPGA accelerator for the traffic sign classification problem, a vital step in building traffic sign recognition (TSR) systems, which use vehicle-mounted cameras to identify traffic signs while driving on the road. However, CNNs used for classification still lack spatial invariance to the input data. Spatial transformers are learnable modules that, when integrated with a CNN, allow spatial manipulation of data within the network, making it invariant to affine transformations. General matrix multiply (GEMM) methods, which express convolution as matrix multiplication, are widely used in deep-learning frameworks with GPU support such as Caffe, Theano, and Torch. im2row is one of the commonly used GEMM methods. In this work, we built a GEMM-based accelerator for a CNN with a spatial transformer module. We proposed the channel-adaptive im2row method, which has a smaller on-chip memory footprint than im2row. The system attains a latency of 202 ms (~5 fps) running at 202 MHz on an Intel Arria 10 GX FPGA, a speedup of more than 5x compared to the CPU. The performance is not state-of-the-art and calls for more FPGA-specific optimizations.
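To illustrate how a GEMM method expresses convolution as matrix multiplication, the following is a minimal sketch of plain im2row in Python. It is not the channel-adaptive variant proposed in the thesis, and the function names (`im2row`, `conv_gemm`), the stride of 1, and the absence of padding are assumptions made for this example only.

```python
def im2row(x, kh, kw):
    """Unfold x ([C][H][W] nested lists) into one row per output pixel.

    Each row holds the flattened C*kh*kw input patch under the kernel
    window (stride 1, no padding; both are assumptions of this sketch).
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    oh, ow = h - kh + 1, w - kw + 1
    rows = []
    for i in range(oh):
        for j in range(ow):
            rows.append([x[ch][i + di][j + dj]
                         for ch in range(c)
                         for di in range(kh)
                         for dj in range(kw)])
    return rows


def conv_gemm(x, weights):
    """Convolve x with weights ([K][C][kh][kw]) as a matrix multiply.

    Returns the output feature maps as [K][OH][OW] nested lists.
    """
    kh, kw = len(weights[0][0]), len(weights[0][0][0])
    oh, ow = len(x[0]) - kh + 1, len(x[0][0]) - kw + 1
    rows = im2row(x, kh, kw)
    # Flatten each filter in the same (channel, row, column) order as the patches.
    flat_w = [[v for plane in f for line in plane for v in line] for f in weights]
    # GEMM: every output value is the dot product of a patch row and a filter row.
    out = [[sum(a * b for a, b in zip(row, fw)) for row in rows] for fw in flat_w]
    # Fold each flat output vector back into an OH x OW feature map.
    return [[o[i * ow:(i + 1) * ow] for i in range(oh)] for o in out]
```

The patch matrix has OH*OW rows of C*kh*kw values each, so overlapping windows duplicate input data. That duplication is the on-chip memory cost that a reduced-footprint variant aims to lower.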
Building on these lessons, we designed a systolic array accelerator with a novel load pattern for the widely used object detector YOLOv3-tiny, which is optimized explicitly for embedded applications. We built the accelerator for multiple precisions (FIXED8, FIXED16, FLOAT32) of YOLOv3-tiny. The design uses a homogeneous systolic array architecture with a synchronized pipelined adder tree for convolution, allowing it to scale to multiple variants of YOLO with only a change in the host driver. It is a deeply pipelined architecture that also exploits three-dimensional spatial parallelism. We evaluated the design on a Terasic DE5a-Net-DDR4 board. The fixed-point (FIXED8, FIXED16) implementations attain throughputs of 57 GOP/s (> 23 %) and 46.16 GOP/s (> 340 %), running at 234 MHz and 227 MHz, respectively. We synthesized the first FLOAT32 implementation, which attains 11.22 GFLOP/s running at 172 MHz.
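As a rough illustration of the systolic array idea, the sketch below simulates a generic output-stationary array that multiplies two matrices cycle by cycle. It does not reproduce the thesis's load pattern or its adder tree; the function name `systolic_matmul` and the output-stationary dataflow are assumptions chosen for this example.

```python
def systolic_matmul(a, b):
    """Simulate an M x N output-stationary systolic array computing a @ b.

    a is M x K and b is K x N (nested lists). Rows of a enter from the
    left and columns of b from the top, each skewed by one cycle per
    row or column. Every processing element (PE) multiplies the two
    values passing through it and accumulates the product locally.
    Returns (result, cycles).
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]    # one accumulator per PE
    a_reg = [[0] * n for _ in range(m)]  # operand registers moving right
    b_reg = [[0] * n for _ in range(m)]  # operand registers moving down
    cycles = m + n + k - 2               # fill, compute, and drain
    for t in range(cycles):
        for i in range(m):
            # Shift row i one PE to the right, then feed the skewed input.
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i
            a_reg[i][0] = a[i][s] if 0 <= s < k else 0
        for j in range(n):
            # Shift column j one PE down, then feed the skewed input.
            for i in range(m - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j
            b_reg[0][j] = b[s][j] if 0 <= s < k else 0
        for i in range(m):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles
```

All PEs in this sketch are identical and only talk to their neighbours, which is what makes such arrays regular to lay out and pipeline. The skewed feeding ensures that the operands meeting at PE (i, j) in any cycle share the same inner index.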

Full thesis: pdf

Centre for Others

IIIT Hyderabad Publications
