HW-SW Methods for Modeling and Optimizing Communication for Scalable Training of Deep Learning Models

Author(s)
Rashidi, Saeed
Abstract
The objective of this thesis is to present novel HW-SW methods for modeling and optimizing communication for scalable training of deep learning (DL) models. DL applications are becoming an integral part of our society due to their broad adoption in domains such as vision, language processing, recommendation systems, and speech processing. Before being deployed, DL models need to be trained on training samples over many iterations to reach the desired accuracy. To improve accuracy, DL models are constantly growing in both model size and the number of training samples, making the task of training extremely challenging and causing a given model to take months or even years to train. Distributed training aims to improve training speed by distributing the training task across many accelerators. However, distributed training introduces new overheads, such as communication overhead, that can limit scalability if left unaddressed. We begin this thesis by introducing the challenges and the need for building distributed training platforms in Chapter 1. In Chapter 2, we provide the background necessary to understand the design space. In Chapter 3, we motivate the need for end-to-end evaluation methodologies for distributed training systems and propose ASTRA-SIM, a simulation methodology that enables researchers to explore the HW/SW design space of distributed training systems; Chapter 3 also demonstrates the capabilities of ASTRA-SIM through multiple case studies. To achieve maximum performance, it is important to overlap compute with communication operations, as we explain in Chapter 4, where we present ACE, microarchitectural support for the efficient simultaneous execution of communication and compute operations. ACE can autonomously execute the communication patterns of distributed training workloads, effectively reducing resource contention between compute and communication kernels. Chapter 5 motivates the need to design special-purpose networks for a set of target workloads and provides LIBRA, a framework for designing optimized networks for a given set of target training workloads. Large-scale distributed training platforms use multi-level networks with hybrid bandwidth (BW) and latency characteristics for maximum scalability and performance. However, as we show in Chapter 6, it is challenging to utilize the full network BW offered by the different network levels. In that chapter, we present Themis, a novel communication scheduling method that maximizes network BW utilization on hybrid networks for distributed training workloads. An important feature of the underlying interconnect for distributed training is its efficiency across various parallelization strategies. In Chapter 7, we motivate this point and propose FRED, an interconnect, along with its communication implementation algorithms, that is flexible across the various parallelization strategies of distributed training workloads. We demonstrate the benefits of FRED by applying it to wafer-scale platforms and comparing it against baseline systems. Finally, in Chapter 8, we conclude the thesis and propose directions for future work.
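To make the communication overhead discussed above concrete, the following is a minimal, self-contained sketch of the standard alpha-beta (latency-bandwidth) cost model for a ring all-reduce, the collective that typically dominates data-parallel training communication. This is a generic textbook model shown only for illustration; it is not ASTRA-SIM or any other tool from this thesis, and all numeric values in the example are assumptions.

```python
# Illustrative alpha-beta cost model for a ring all-reduce
# (generic textbook model; not ASTRA-SIM).

def ring_allreduce_time(num_gpus: int, msg_bytes: float,
                        alpha_s: float, beta_s_per_byte: float) -> float:
    """Estimate ring all-reduce time.

    num_gpus        : number of participating accelerators (p)
    msg_bytes       : gradient size being reduced (N)
    alpha_s         : per-step link latency in seconds
    beta_s_per_byte : inverse link bandwidth (seconds per byte)

    A ring all-reduce performs 2*(p-1) steps (reduce-scatter followed by
    all-gather), each moving N/p bytes over every link.
    """
    p = num_gpus
    steps = 2 * (p - 1)
    return steps * (alpha_s + (msg_bytes / p) * beta_s_per_byte)


if __name__ == "__main__":
    # Assumed example values: 64 accelerators, 1 GiB of gradients,
    # 5 microsecond link latency, 100 GB/s link bandwidth.
    t = ring_allreduce_time(num_gpus=64,
                            msg_bytes=1 << 30,
                            alpha_s=5e-6,
                            beta_s_per_byte=1 / 100e9)
    print(f"Estimated all-reduce time: {t * 1e3:.2f} ms")
```

Under the kind of perfect compute-communication overlap targeted in Chapter 4, only the portion of this collective time that cannot be hidden behind backward-pass compute is exposed to the critical path; that simplification is included here only to illustrate why overlap matters, not as a description of ACE's mechanism.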
Date
2023-01-25
Resource Type
Text
Resource Subtype
Dissertation