Hardware Accelerator for Transformer based End-to-End Automatic Speech Recognition System

Author: D Shaarada Yamini 2020702003
Date: 2023-09-07
Report no: IIIT/TH/2023/148
Advisor: Suresh Purini

Abstract

Hardware accelerators are designed to offload compute-intensive tasks, such as deep neural network inference, from the CPU to improve an application's overall performance, particularly its performance per watt. As speech recognition applications have evolved, many deep learning models for Automatic Speech Recognition (ASR) have been proposed. Encoder-decoder-based sequence-to-sequence models such as the Transformer have demonstrated state-of-the-art results in end-to-end ASR systems. However, the Transformer is memory- and compute-intensive, which makes an FPGA implementation challenging. This thesis proposes an end-to-end architecture to accelerate a Transformer for an ASR system. The host CPU orchestrates the computations of the different encoder and decoder stages of the Transformer on the designed hardware accelerator, with no need for intervening FPGA reconfiguration. The intermediate stages, such as data pre-processing and feature extraction, run on the host, while the complex recognizer, i.e., the Transformer model, is offloaded to the FPGA. Communication latency is hidden by prefetching the weights of the next encoder/decoder block while the current block is being processed. The larger computations in the model are split across both Super Logic Regions (SLRs) of the FPGA, mitigating inter-SLR communication. The proposed design optimizes latency by exploiting the available resources. The accelerator is realized with the Vitis high-level synthesis tool using OpenCL, a language for heterogeneous computing, and is evaluated on an Alveo U50 FPGA card. The end-to-end ASR system has a latency of ~120 ms, which is suitable for real-time applications. The design demonstrates an average speed-up of 32× over an Intel Xeon E5-2640 CPU and 8.8× over an NVIDIA GeForce RTX 3080 Ti graphics card for a 32-bit single-precision floating-point model.
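The weight prefetching described in the abstract can be illustrated with a minimal host-side sketch. This is not the thesis implementation, which uses OpenCL on an FPGA; it only shows the general idea of overlapping the load of the next block's weights with the computation of the current block. The names `load_weights`, `compute_block`, and `run_blocks` are hypothetical stand-ins introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def load_weights(block_id):
    # Hypothetical stand-in for a host-to-device weight transfer.
    return {"block": block_id, "scale": block_id + 1}


def compute_block(weights, activations):
    # Hypothetical stand-in for one encoder/decoder block on the accelerator.
    return [a * weights["scale"] for a in activations]


def run_blocks(num_blocks, activations):
    """Process blocks in order, prefetching block i+1 while block i computes."""
    order = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start loading the first block's weights before any computation.
        pending = pool.submit(load_weights, 0)
        for i in range(num_blocks):
            # Wait only for the weights of the block about to run.
            weights = pending.result()
            # Kick off the next transfer so it overlaps with this computation.
            if i + 1 < num_blocks:
                pending = pool.submit(load_weights, i + 1)
            activations = compute_block(weights, activations)
            order.append(weights["block"])
    return activations, order
```

Calling `run_blocks(3, [1, 2])` processes three blocks in sequence; in this toy model the only wait for weights that is not overlapped with computation is the one for the first block.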

Full thesis: pdf

Centre for Others

IIIT Hyderabad Publications
