Software-Hardware Optimizations for Efficient Collective Communications in Distributed Machine Learning Platforms

Author(s)
Won, William Jonghoon
Associated Organization(s)
School of Computer Science
Abstract
Foundation machine learning (ML) models have emerged as one of the most prominent applications in modern computing, exemplified by mixture-of-experts–based large language models. The immense resource demands of these models have driven the development of large-scale, high-performance computing platforms tailored for artificial intelligence workloads. In such distributed platforms, both model parameters and data are partitioned and processed across numerous neural processing units, requiring frequent synchronization of activations and gradients through collective communication operations. As collective communication constitutes a primary bottleneck in distributed ML, optimizing its efficiency remains a critical research challenge. This dissertation explores software-hardware optimizations for collective communication to better understand the tightly coupled networking design space of distributed ML platforms. First, it introduces ASTRA-sim2.0, an end-to-end simulation and modeling framework that enables comprehensive design space exploration of distributed ML platforms with arbitrary parallelization strategies and multi-dimensional networks. Second, it presents LIBRA, which improves the bandwidth utilization of hierarchical collective communication algorithms by optimizing multi-dimensional network topologies through analytical modeling. Finally, it proposes TACOS and PCCL, two synthesizers that automatically generate collective communication algorithms optimized for arbitrary network topologies. Together, these contributions underscore the significance of judicious software-hardware optimization in achieving efficient collective communication for large-scale distributed ML platforms.
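The gradient synchronization the abstract refers to is typically realized by an all-reduce collective. As a concrete illustration (not code from the dissertation), the following Python sketch simulates the classic bandwidth-optimal ring all-reduce over n workers: a reduce-scatter phase that accumulates one chunk per worker, followed by an all-gather phase that circulates the reduced chunks. All names are illustrative.

```python
# Illustrative sketch only: ring all-reduce, the canonical collective
# used to synchronize gradients across workers. Each of the n workers
# starts with its own gradient vector; after the collective, every
# worker holds the element-wise sum of all vectors.
from typing import List

def ring_all_reduce(grads: List[List[float]]) -> List[List[float]]:
    n = len(grads)
    chunk = len(grads[0]) // n  # assumes vector length divisible by n

    def snapshot(step_chunk):
        # Capture payloads before applying them, so this sequential loop
        # mimics the concurrent sends of the real ring algorithm.
        out = []
        for i in range(n):
            c = step_chunk(i) % n
            out.append((c, grads[i][c * chunk:(c + 1) * chunk]))
        return out

    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s)
    # to its right neighbor, which accumulates it. After n - 1 steps,
    # worker i holds the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        for i, (c, payload) in enumerate(snapshot(lambda i: i - s)):
            dst = grads[(i + 1) % n]
            for k, v in enumerate(payload):
                dst[c * chunk + k] += v

    # Phase 2: all-gather. In step s, worker i forwards its fully
    # reduced chunk (i + 1 - s) to its right neighbor, which overwrites.
    for s in range(n - 1):
        for i, (c, payload) in enumerate(snapshot(lambda i: i + 1 - s)):
            grads[(i + 1) % n][c * chunk:(c + 1) * chunk] = payload

    return grads

if __name__ == "__main__":
    # Four workers, each holding a distinct 8-element gradient vector.
    workers = [[float(w + 1)] * 8 for w in range(4)]
    result = ring_all_reduce(workers)
    assert all(v == 10.0 for g in result for v in g)  # 1 + 2 + 3 + 4
    print(result[0])
```

Each worker transmits roughly 2(n-1)/n of its gradient size in total, independent of n, which is why ring-based schedules are a common baseline; synthesizers such as TACOS and PCCL move beyond such fixed schedules by generating algorithms tailored to arbitrary network topologies.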
Date
2025-12
Resource Type
Text
Resource Subtype
Dissertation (PhD)