Hardware-Friendly Model Compression techniques for Deep Learning Accelerators
Author(s)
Karimzadeh, Foroozan
Advisor(s)
Editor(s)
Collections
Supplementary to:
Permanent Link
Abstract
The objective of the proposed research is to introduce solutions to make energy-efficient \gls{dnn} accelerators to be deployable on edge devices through developing hardware-aware \gls{dnn} compression methods. The rising popularity of intelligent mobile devices and the computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. In particular, we proposed four compression techniques for energy and memory efficient \gls{dnn} computing. In the first method, \gls{lgps}, we present a hardware-aware pruning method where the locations of non-zero weights are derived in real-time from a \gls{lfsr}. Using the proposed method, we demonstrate a total saving of energy and area up to 63.96\% and 64.23\% for VGG-16 network on down-sampled ImageNet, respectively for iso-compression-rate and iso-accuracy. Secondly, We achieved ultra-low bit-precision deep learning model by developing a quantization scheme through knowledge distillation and gradual quantization for pruned network. Thirdly, we propose a novel model compression scheme that allows inference to be carried out using bit-level sparsity, which can be efficiently implemented using in-memory computing macros. We introduce a method called BitS-Net to leverage the benefits of bit-sparsity (where the number of zeros is more than number of ones in binary representation of weight/activation values) when applied to Compute-In-Memory (\gls{cim}) with Resistive Random Access Memory (\gls{rram}) to develop energy efficient \gls{dnn} accelerators operating in the inference mode. We demonstrate that BitS-Net improves the energy efficiency by up to 5x for ResNet models on the ImageNet dataset.
In the last part, to achieve highly energy-efficient DNN, we introduce a novel twofold sparsity method (Twofold Sparsity, twofoldS-Net) to sparsify the DNN models in bit- and network-level, simultaneously. We added two separate regularizations to the loss function in order to achieve bit- and network-level sparsity at the same time. We sparsify the model in network-level, by adding a mask generated by \gls{lfsr}. For bit-level sparsity, we quantize the network to 8-bit representation in two's complement format. During inference we take advantage of \gls{cim} architecture and \gls{lfsr} indexing. We have shown that by using our proposed method we are able to sparsify the network and design a highly energy-efficient deep learning accelerator to eventually bring \gls{ai} to our daily lives.
Sponsor
Date
2022-12-19
Extent
Resource Type
Text
Resource Subtype
Dissertation