Title:
Accelerated deep learning for the edge-to-cloud continuum: A specialized full stack derived from algorithms

dc.contributor.advisor Esmaeilzadeh, Hadi
dc.contributor.advisor Kim, Hyesoon
dc.contributor.advisor Prvulovic, Milos
dc.contributor.advisor Krishna, Tushar
dc.contributor.advisor Chandra, Vikas
dc.contributor.author Sharma, Hardik
dc.contributor.department Electrical and Computer Engineering
dc.date.accessioned 2019-05-29T14:03:31Z
dc.date.available 2019-05-29T14:03:31Z
dc.date.created 2019-05
dc.date.issued 2019-03-29
dc.date.submitted May 2019
dc.date.updated 2019-05-29T14:03:31Z
dc.description.abstract Advances in high-performance computer architecture have been a major driver for the rapid evolution of Deep Neural Networks (DNNs). Due to their insatiable demand for compute power, both the research community and industry have naturally turned to accelerators to accommodate modern DNN computation. Furthermore, DNNs are gaining prevalence and have found applications across a wide spectrum of devices, from commodity smartphones to enterprise cloud platforms. However, there is no one-size-fits-all solution for this continuum of devices that can meet the strict energy/power/chip-area budgets of edge devices while also meeting the high performance requirements of enterprise-grade servers. To this end, this thesis designs a specialized compute stack for DNN acceleration across the edge-to-cloud continuum that flexibly matches the varying constraints of different devices and simultaneously exploits algorithmic properties to maximize the benefits of acceleration. First, this thesis explores a tight integration of Neural Network (NN) accelerators within massively-parallel GPUs with minimal area overhead. We show that tightly coupling NN accelerators and GPUs can provide significant gains in performance and energy efficiency across a diverse set of applications through neural acceleration, which approximates regions of approximation-amenable code using neural networks. Next, this thesis develops a full stack for accelerating DNN inference on FPGAs that aims to provide programmability, performance, and efficiency. We call our specialized compute stack DNNWEAVER; it encompasses (1) high-level algorithmic abstractions, (2) a flexible template accelerator architecture, and (3) a compiler that automatically and efficiently optimizes the template architecture to maximize DNN performance using the limited resources available on the FPGA die.
The third thrust of this thesis explores scale-out acceleration of training using cloud-scale FPGAs for a wide range of machine learning algorithms, including neural networks. The challenge here is to design an accelerator architecture that can scale up to efficiently use the large pool of compute resources available on modern cloud-grade FPGAs. To tackle this challenge, this thesis explores multi-threading to maximize the efficiency of FPGA acceleration by running multiple parallel threads of training. The fourth thrust of this thesis builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent loss of accuracy, the bitwidth varies significantly across DNNs and may even be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator would either offer limited benefits, since it must accommodate the worst-case bitwidth requirements, or inevitably degrade final accuracy. To alleviate these deficiencies, this thrust introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. The final thrust of this thesis explores mixed-signal acceleration to push accelerator efficiency to its limits. Specifically, it executes the low-bitwidth multiply-add operations prevalent in DNNs in the analog domain to gain significant efficiency benefits. Using low-bitwidth analog compute units enables us to overcome the limited range for information encoding, susceptibility to noise, and Analog-to-Digital (A/D) conversion overheads.
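The arithmetic behind bit-level fusion/decomposition can be illustrated with a small software sketch. This is a hedged illustration, not the thesis's hardware design: the function names are hypothetical, and the real accelerator composes low-bitwidth multipliers spatially in silicon rather than in a loop. The sketch shows how an 8-bit multiply can be reconstructed by shifting and adding 2-bit × 2-bit partial products, which is why an array of narrow multipliers can dynamically "fuse" to serve wider operands.

```python
def bit_slices(value, width=2, count=4):
    """Split an unsigned integer into `count` low-order-first slices of `width` bits."""
    mask = (1 << width) - 1
    return [(value >> (width * i)) & mask for i in range(count)]

def fused_multiply(x, y, width=2, count=4):
    """Compose an 8-bit x 8-bit product from 2-bit x 2-bit partial products.

    Each partial product is shifted by the combined slice positions and
    accumulated -- the software analogue of fusing narrow multipliers.
    """
    total = 0
    for i, xs in enumerate(bit_slices(x, width, count)):
        for j, ys in enumerate(bit_slices(y, width, count)):
            total += (xs * ys) << (width * (i + j))
    return total

# The fused result matches a native full-width multiply.
assert fused_multiply(173, 92) == 173 * 92
```

Because the slice width and count are parameters, the same hardware substrate can act as many 2-bit multipliers for aggressively quantized layers or fuse into fewer 4-bit or 8-bit multipliers for layers that need more precision.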
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/61267
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Bit level composability
dc.subject Dynamic composability
dc.subject Deep neural networks
dc.subject Accelerators
dc.subject DNN
dc.subject Convolutional neural networks
dc.subject CNN
dc.subject Long short-term memory
dc.subject LSTM
dc.subject Recurrent neural networks
dc.subject RNN
dc.subject Quantization
dc.subject Bit fusion
dc.subject DnnWeaver
dc.subject FPGA
dc.subject ASIC
dc.title Accelerated deep learning for the edge-to-cloud continuum: A specialized full stack derived from algorithms
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Prvulovic, Milos
local.contributor.advisor Kim, Hyesoon
local.contributor.advisor Krishna, Tushar
local.contributor.corporatename School of Electrical and Computer Engineering
local.contributor.corporatename College of Engineering
relation.isAdvisorOfPublication 2d678067-bb81-43c7-be94-bd87bced736e
relation.isAdvisorOfPublication ec222ec7-e853-445c-b356-51b942d36799
relation.isAdvisorOfPublication f80c3b14-cd42-456d-b440-addf20372fbc
relation.isOrgUnitOfPublication 5b7adef2-447c-4270-b9fc-846bd76f80f2
relation.isOrgUnitOfPublication 7c022d60-21d5-497c-b552-95e489a06569
thesis.degree.level Doctoral
Files
Original bundle
Name: SHARMA-DISSERTATION-2019.pdf
Size: 3.87 MB
Format: Adobe Portable Document Format
License bundle
Name: LICENSE.txt
Size: 3.87 KB
Format: Plain Text