Structured Sparsity-Aware Hardware-Software Co-Design for Deep Neural Network Acceleration
Author(s)
Jeong, Geonhwa
Abstract
Deep Neural Networks (DNNs) have shown dramatic performance in diverse areas including, but not limited to, computer vision, natural language processing, and personal recommendation, even exceeding human performance on some tasks. Because DNNs are computationally demanding and are deployed across a wide range of applications, there is strong motivation to enhance both hardware and software to improve performance and energy efficiency. Exploiting various types of sparsity in DNNs has recently been proposed to reduce compute and memory requirements, but finding the right target sparsity to satisfy both hardware (HW) and software (SW) requirements remains an active area of research. In this thesis, we develop HW-SW co-design methods that leverage structured sparsity to accelerate various DNNs. We first present RASA, an efficient register-aware systolic array used as a matrix engine. We develop techniques that divide an execution stage into several sub-stages and overlap instructions to hide overheads and run them concurrently. Next, we present VEGETA, a flexible structured sparse matrix engine that extends a dense matrix engine with support for flexible structured sparsity. In addition, we show how VEGETA engines can be used at different sparsity granularities, such as network-wise, layer-wise, tile-wise, and row-wise. Next, we propose TASD, an approximation method that decomposes an unstructured sparse tensor into a sequence of structured sparse tensors. We also show how TASD can be applied to accelerate the execution of both dense and sparse DNNs on structured sparse matrix engines. Finally, we introduce SDQ, which combines sparsification and quantization so that the two complement each other through structured decomposition, accelerating Large Language Models on structured sparse HW.
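The abstract's notions of structured sparsity and structured decomposition can be made concrete with a short sketch. The NumPy code below is a minimal illustration, not the thesis's implementation: it prunes a tensor to N:M structured sparsity (keeping the N largest-magnitude values in each group of M consecutive elements) and then approximates an unstructured sparse tensor as a sum of such structured components by repeatedly pruning the residual, in the spirit of TASD. The function names (prune_n_m, tasd_decompose) and all parameter choices are illustrative assumptions.

```python
import numpy as np

def prune_n_m(tensor, n=2, m=4):
    """Keep the n largest-magnitude values in every group of m consecutive
    elements along the last axis (N:M structured sparsity).
    Assumes the last dimension is divisible by m."""
    flat = tensor.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(tensor.shape)

def tasd_decompose(tensor, terms, n=2, m=4):
    """Approximate an unstructured sparse tensor as a sum of N:M structured
    sparse tensors by repeatedly pruning the residual (TASD-style sketch)."""
    residual = tensor.astype(np.float64)
    components = []
    for _ in range(terms):
        comp = prune_n_m(residual, n, m)
        components.append(comp)
        residual = residual - comp
    return components

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8))
    w[rng.random(w.shape) < 0.5] = 0.0       # unstructured sparsity
    comps = tasd_decompose(w, terms=2)       # two 2:4 structured components
    approx = sum(comps)
    print("max abs error:", np.max(np.abs(w - approx)))
```

With n=2, m=4, and two terms, every group of four elements can be reconstructed exactly, so the error above is zero; using fewer terms trades approximation accuracy for the compute savings that structured sparse matrix engines such as VEGETA can exploit.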
Date
2024-04-28
Resource Type
Text
Resource Subtype
Dissertation