IIIT Hyderabad Publications
Designing Programmable Domain-Specific Overlay Accelerators on FPGAs

Author: Ziaul Choudhury
Date: 2024-06-18
Report no: IIIT/TH/2024/88
Advisor: Suresh Purini

Abstract

For about fifty years, hardware designers have relied on semiconductor scaling laws, such as Moore's Law and Dennard scaling, to achieve gains in performance. The industry became accustomed to processor performance per watt doubling approximately every 18 months. Over the past decade, we have seen the breakdown of these scaling predictions. With the old certainties of scaling silicon geometries gone forever, the industry is already changing. The number of cores in a single die has increased, and SoCs such as mobile phone processors combine application-specific co-processors, GPUs, and DSPs in different configurations to maintain the performance scaling trend. However, in a post-Dennard, post-Moore world, further processor specialization will be needed to achieve performance improvements. Emerging applications such as artificial intelligence and vision demand computational performance that conventional architectures cannot meet. This inevitably leads to special-purpose, domain-specific accelerators.

A domain-specific accelerator (DSA) is a processor, or set of processors, optimized to perform a narrow range of computations; it is tailored to the needs of the algorithms in its domain. For example, an AI accelerator might have an array of elements with multiply-accumulate functionality to efficiently undertake matrix operations. Google's Tensor Processing Unit (TPU), the Neural Engine in Apple's M1 processor, and the Xilinx Vitis-AI Engine are popular ASIC-based DSAs. ASIC-based DSAs provide significant gains in performance and power efficiency. However, due to their long design cycles and high engineering costs, they may not cope with the ever-evolving computation landscape.

Field-Programmable Gate Arrays (FPGAs) offer advantages over Application-Specific Integrated Circuits (ASICs) in certain scenarios due to their flexibility and quicker development times. FPGAs are programmable hardware, allowing users to configure their functionality after manufacturing, whereas ASICs are hardwired for specific tasks. This flexibility makes FPGAs suitable for prototyping, testing, and adapting to changing requirements without costly chip redesigns. Developing an ASIC is a complex and time-consuming process, often taking several months to years; FPGAs can be programmed and deployed much faster, making them advantageous when speed-to-market is crucial. ASICs can also be expensive to design and manufacture, especially for low-volume or rapidly changing applications, while FPGAs have a lower initial cost since they do not require custom chip fabrication. This cost-effectiveness is notable for small production runs and research projects. FPGAs are therefore rapidly evolving into an alternative to custom ASICs for designing DSAs, owing to their low power consumption and high degree of parallelism.

Designing DSAs on an FPGA requires carefully calibrating the FPGA's compute and memory resources to achieve optimal throughput from a given device. Hardware Description Languages (HDLs) like Verilog have traditionally been used to design FPGA hardware. HDLs are generic and not geared towards any domain, and the user must put in considerable effort to describe the hardware at the register-transfer level.
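As a concrete illustration of the kind of computation a DSA specializes in, the following minimal Python sketch emulates the multiply-accumulate (MAC) pattern that an AI accelerator's array of processing elements implements in hardware. It is an illustration of the computation pattern only, under assumed inputs; it does not describe any particular accelerator's design.

```python
# Minimal sketch of the computation a MAC-array DSA specializes in:
# every output element is built by repeated multiply-accumulate steps.
# Illustrative only; a real accelerator maps these loops onto a grid
# of hardware processing elements working in parallel.

def mac_array_matmul(A, B):
    """C = A x B, expressed purely as multiply-accumulate operations."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):          # each (i, j) pair maps to one PE
        for j in range(m):
            acc = 0
            for t in range(k):  # the MAC recurrence: acc += a * b
                acc += A[i][t] * B[t][j]
            C[i][j] = acc
    return C

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(mac_array_matmul(A, B))  # [[19, 22], [43, 50]]
```

In an actual DSA such as the TPU, the two outer loops are spatially unrolled across the grid of processing elements, so many multiply-accumulates complete in every cycle.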
A recent trend is emerging wherein existing HDLs are used to create carefully handwritten templates suited to a specific domain. A compiler framework weaves these templates together to generate the DSA for accelerating the domain's computations. This approach requires expensive design synthesis and FPGA re-flashing to accelerate different algorithms from the domain, which may not be feasible in many edge and deeply embedded applications. Furthermore, cloud companies these days offer FPGA-based acceleration as a service, supported at the backend by large clusters of custom accelerators. In contrast to this fixed-function hardware approach, where the DSA is tied to a specific function, an alternative design approach of overlay accelerators is gaining prominence. Overlays are DSAs resembling a processor: they are synthesized and flashed on the FPGA once, but are flexible enough to process a broad class of computations through soft reconfiguration.

Over the last couple of years, a few design approaches have developed for overlay accelerators. Some overlay designs resemble a processor controlled through an instruction set; the Xilinx Vitis-AI Engine is a prime example of such a design. However, this homogeneous approach often leads to inefficiencies arising from the fetch-decode-execute model of processor design: instruction-based overlays spend significant energy on instruction overhead rather than actual computation. To solve this problem of homogeneous overlay accelerators, a heterogeneous design methodology has been proposed. A heterogeneous overlay accelerator contains multiple small homogeneous units, each with a simple, non-instruction-based interface tailored to specific workload characteristics. A heterogeneous design mainly aims to optimize throughput by concurrently processing multiple inputs across the different homogeneous accelerator units in a pipelined fashion. While this approach reduces overhead, it comes at the cost of longer latency. Also, as the variation in domain workloads grows, so does the complexity of these heterogeneous architectures.

This thesis proposes the creation of uniform, non-instruction-based overlays, introducing two distinct overlays: FlowPix and FlexNN. FlowPix is specialized for image-processing pipelines, while FlexNN is optimized for applications employing Convolutional Neural Networks (CNNs). We believe overlays represent a promising path toward achieving domain-specific acceleration, presenting adaptable and efficient solutions to the ever-changing requirements of contemporary computing tasks. The concepts presented herein are expected to provide valuable insights for crafting efficient overlays with a superior performance-to-area ratio.

FlowPix is a DSL-based overlay accelerator for image-processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm; a sketch of this programming model appears below.
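The abstract does not give FlowPix's concrete syntax, so the following Python sketch only illustrates the stated programming model: an algorithm phrased as a pipeline whose stages are individual computational steps over a frame. Every name here (Pipeline, stage, the blur and threshold functions) is an assumption for illustration, not the FlowPix API.

```python
# Hypothetical pipeline-style frontend illustrating the programming
# model described above. A software emulation: an overlay would
# instead configure pre-synthesized hardware stages and stream
# frames through them.

class Pipeline:
    def __init__(self, name):
        self.name = name
        self.stages = []

    def stage(self, func):
        """Register a per-frame computational step (decorator)."""
        self.stages.append(func)
        return func

    def run(self, frame):
        for s in self.stages:
            frame = s(frame)
        return frame

pipe = Pipeline("blur_threshold")

@pipe.stage
def blur(frame):
    # 3x3 box-blur stencil over the interior pixels
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(frame[y + dy][x + dx]
                            for dy in (-1, 0, 1)
                            for dx in (-1, 0, 1)) // 9
    return out

@pipe.stage
def threshold(frame):
    return [[255 if p > 128 else 0 for p in row] for row in frame]

result = pipe.run([[200] * 8 for _ in range(8)])
```

Under the overlay approach, such a pipeline is mapped onto already-synthesized hardware through configuration, rather than by generating new fixed-function hardware for each program.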
We implement 15 image-processing benchmarks using FlowPix on a Virtex-7 690T FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kanade optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks, such as Hetero-Halide and the Vitis Vision library, that generate fixed-function hardware. On most benchmarks, we see up to a 25% degradation in latency with approximately a 1.7x to 2x increase in FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA reflashing justifies the slight performance loss and increased resource consumption. FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels across the implemented benchmarks.

The FlexNN overlay efficiently processes CNNs and can be scaled based on the available compute and memory resources of the FPGA. The overlay is configured on the fly through control words sent by the host on a per-layer basis. Unlike current overlays, our architecture exploits all forms of parallelism inside a convolution operation. A constraint system is employed at the host end to find the per-layer configuration of the overlay that uses all forms of parallelism in the processing of that layer, resulting in the highest throughput for the layer; a sketch of such a search appears below. We studied the effectiveness of our overlay by using it to process the AlexNet, VGG16, YOLO, MobileNet, and ResNet-50 CNNs, targeting a Virtex-7 and a larger UltraScale+ VU9P FPGA. The chosen CNNs have a mix of different convolution layer types and filter sizes, presenting good variation in model size and structure. Our accelerator reported a maximum throughput of 1200 GOps/sec on the Virtex-7, an improvement of 1.2x to 5x over recent designs. The reported performance density, measured in Giga-operations per second per KLUT, is a 1.3x to 4x improvement over existing works. Similar speed-ups and performance density are also observed on the UltraScale+ VU9P FPGA.
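The abstract describes the host-side constraint system only at a high level, so the following Python sketch shows one plausible form such a per-layer search could take: enumerating unroll factors along three parallelism dimensions under a compute-resource constraint and picking the configuration with the highest effective throughput. The knobs, the resource model, and the cost model are all illustrative assumptions, not FlexNN's actual formulation.

```python
# Illustrative host-side configuration search in the spirit of the
# constraint system described above. Parallelism knobs: pf (output
# filters), pc (input channels), pp (output pixels) per cycle.
from itertools import product

def best_layer_config(out_ch, in_ch, out_pixels, macs_available):
    """Pick per-layer unroll factors maximizing MACs kept busy per cycle."""
    best, best_rate = None, 0.0
    for pf, pc, pp in product(range(1, 65), repeat=3):
        if pf * pc * pp > macs_available:      # compute constraint
            continue
        # Cycles for the layer under this unrolling; ceil divisions
        # model the waste when a dimension does not divide evenly.
        cycles = (-(-out_ch // pf)) * (-(-in_ch // pc)) * (-(-out_pixels // pp))
        work = out_ch * in_ch * out_pixels     # total MAC operations
        rate = work / cycles                   # effective MACs per cycle
        if rate > best_rate:
            best, best_rate = (pf, pc, pp), rate
    return best, best_rate

# Example: a layer with 64 input and 128 output channels on a 56x56
# output map, given a hypothetical budget of 512 MAC units.
cfg, rate = best_layer_config(128, 64, 56 * 56, 512)
print(cfg, rate)
```

Because the search runs on the host and only emits control words, a new configuration can be applied for every layer without resynthesizing or reflashing the FPGA.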
Full thesis: pdf