Building Efficient Tensor Accelerators for Sparse and Irregular Workloads

Qin, Eric
Krishna, Tushar
Popular Machine Learning (ML) and High Performance Computing (HPC) workloads account for a significant portion of runtime in data centers. Applications include image classification, speech recognition, recommendation systems, social network analysis, robotics, chemical process simulations, and so on. Recently, due to the large computational demands of emerging workloads, there has been a surge of custom hardware accelerator development for computing tensor kernels with high performance and energy efficiency. For example, the Google Tensor Processing Unit (TPU) is a custom hardware accelerator targeting efficient matrix multiplications for Deep Neural Networks (DNNs). However, state-of-the-art accelerators have limitations stemming from (1) a vast spectrum of sparsity across workloads and (2) irregularity of tensor dimensions (e.g. tall-skinny matrices). This thesis explores novel methodologies and architectures for building efficient accelerators for sparse tensor algebra. The first major contribution of this thesis is the proposal of using specialized on-chip interconnects to provide flexible computational mappings of sparse and irregular matrices onto processing elements (PEs). This enables close to full PE utilization and significantly improves performance over the TPU, which has a rigid on-chip interconnect. With the proposed specialized interconnects, this thesis presents SIGMA, a new sparse DNN accelerator targeting workloads with 30% to 100% density (percentage of nonzeros). Unlike popular DNNs, HPC workloads utilize tensors spanning from 10^-6% dense to fully dense. The second major contribution of this thesis explores the system impact of utilizing various compression formats across all sparsity regions. The key insight is that different workloads prefer different compression formats, and the best compression format for memory storage may not be the same as the best compression format for computation.
This thesis proposes a predictor to determine the best compression format combination and a custom hardware compression format converter named MINT. Together, they provide significant energy-delay product (EDP) improvement over state-of-the-art accelerators. The third major contribution of this thesis analyzes popular state-of-the-art sparse accelerators using a new tool named Hard TACO. This tool utilizes the open source Tensor Algebra Compiler (TACO) and High Level Synthesis (HLS) to generate functional sparse accelerators with different dataflows, e.g. inner product vs. outer product SpGEMM. The impact of Hard TACO is that it allows realistic architectural exploration of homogeneous and heterogeneous accelerators.
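
To make the two SpGEMM dataflows named in the abstract concrete, the following is a minimal software sketch of sparse matrix-matrix multiplication (C = A x B) computed once in inner-product order and once in outer-product order. It is not taken from the thesis and does not reflect the Hard TACO output or any accelerator design: the dict-of-dicts storage and the function names (`to_rows`, `to_cols`, `spgemm_inner`, `spgemm_outer`) are assumptions made purely for illustration.

```python
# Illustrative sketch only: plain-Python SpGEMM (C = A x B) in two dataflows.
# The dict-of-dicts storage and all function names are assumptions for this example.

def to_rows(dense):
    """Row-major sparse view: {row: {col: value}}, nonzeros only."""
    return {i: {j: v for j, v in enumerate(row) if v != 0}
            for i, row in enumerate(dense)}

def to_cols(dense):
    """Column-major sparse view: {col: {row: value}}, nonzeros only."""
    cols = {}
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                cols.setdefault(j, {})[i] = v
    return cols

def spgemm_inner(a_dense, b_dense):
    """Inner product: C[i][j] is the dot product of row i of A and column j of B."""
    a_rows, b_cols = to_rows(a_dense), to_cols(b_dense)
    c = {}
    for i, a_row in a_rows.items():
        for j, b_col in b_cols.items():
            # Only indices that are nonzero in both operands contribute.
            shared = a_row.keys() & b_col.keys()
            if shared:
                c[(i, j)] = sum(a_row[k] * b_col[k] for k in shared)
    return c

def spgemm_outer(a_dense, b_dense):
    """Outer product: column k of A times row k of B gives partial sums merged into C."""
    a_cols, b_rows = to_cols(a_dense), to_rows(b_dense)
    c = {}
    for k, a_col in a_cols.items():
        for i, a_val in a_col.items():
            for j, b_val in b_rows.get(k, {}).items():
                # Accumulate the partial product into the output entry.
                c[(i, j)] = c.get((i, j), 0) + a_val * b_val
    return c

A = [[1, 0, 2],
     [0, 0, 3]]
B = [[0, 4],
     [5, 0],
     [6, 0]]

result_inner = spgemm_inner(A, B)
result_outer = spgemm_outer(A, B)
```

Both functions produce the same nonzero outputs, but the work is organized differently: the inner-product version intersects index sets for every candidate output entry, while the outer-product version generates partial products with no index matching and merges them afterwards. This kind of trade-off is what makes dataflow choice a natural subject for architectural exploration.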