Network-aware device placement search for distributed training

Author(s)
Venkata, Vishnu Varma
Abstract
As deep learning models grow in scale and complexity, efficient distributed training requires not only advanced parallelization strategies but also intelligent placement of model components across heterogeneous computing infrastructures. Existing device placement frameworks often assume simplified, uniform network topologies, leading to suboptimal performance in real-world data centers where communication costs vary significantly across nodes. This thesis presents NEST, a network-aware, efficient device placement framework based on structured dynamic programming techniques. NEST jointly optimizes device placement and parallelism configuration by explicitly modeling the hierarchical and over-subscribed nature of modern data center networks. It supports a broad range of parallelization strategies, including tensor, pipeline, data, expert, and Zero Redundancy Optimizer (ZeRO) parallelism, and integrates detailed memory and communication cost modeling. Through structured dynamic programming, NEST explores the vast placement space efficiently and offers provable optimality guarantees within its search scope. Evaluations across realistic workloads and network settings show that NEST consistently outperforms manual and network-unaware baselines, delivering significant improvements in training throughput and resource utilization.
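
To make the dynamic-programming idea in the abstract concrete, below is a minimal Python sketch under stated assumptions, not NEST's actual algorithm: it partitions a chain of layers into contiguous pipeline stages over an ordered device list, charging each cross-stage activation transfer at the bandwidth of the link between the hosting devices, with cross-node links modeled as over-subscribed (slower). Every name and number in it (NODE_OF, INTRA_NODE_BW, INTER_NODE_BW, comm_time, partition_layers) is a hypothetical illustration.

```python
# Illustrative sketch only: a toy network-aware pipeline-partitioning DP,
# not NEST's actual algorithm. All names and numbers below are
# hypothetical assumptions for this example.

# Hypothetical two-level hierarchy: GPUs 0-1 share node 0, GPUs 2-3 share
# node 1. Cross-node traffic rides an over-subscribed uplink, so its
# effective bandwidth is much lower than the intra-node link.
NODE_OF = {0: 0, 1: 0, 2: 1, 3: 1}
INTRA_NODE_BW = 100.0  # GB/s, e.g. an NVLink-class link
INTER_NODE_BW = 12.5   # GB/s, e.g. an over-subscribed uplink

def comm_time(volume_gb, src, dst):
    """Time to move an activation between the devices hosting two
    adjacent pipeline stages."""
    if src == dst:
        return 0.0
    bw = INTRA_NODE_BW if NODE_OF[src] == NODE_OF[dst] else INTER_NODE_BW
    return volume_gb / bw

def partition_layers(compute, boundary_gb, devices):
    """Split a layer chain into contiguous pipeline stages over an ordered
    device list, minimizing the bottleneck stage time (compute plus the
    incoming activation transfer). dp[k][d] is the best bottleneck for
    the first k layers ending on devices[d]; the DP is exact for this
    restricted space, echoing an 'optimal within its search scope'
    style of guarantee."""
    L, D = len(compute), len(devices)
    prefix = [0.0] * (L + 1)
    for i, t in enumerate(compute):
        prefix[i + 1] = prefix[i] + t
    INF = float("inf")
    dp = [[INF] * D for _ in range(L + 1)]
    cut = [[0] * D for _ in range(L + 1)]
    for k in range(1, L + 1):
        dp[k][0] = prefix[k]  # first device has no incoming transfer
    for d in range(1, D):
        for k in range(1, L + 1):
            for j in range(1, k):
                # Layers j..k-1 run on devices[d]; boundary_gb[j-1] is the
                # activation crossing the link devices[d-1] -> devices[d].
                t = max(dp[j][d - 1],
                        prefix[k] - prefix[j]
                        + comm_time(boundary_gb[j - 1],
                                    devices[d - 1], devices[d]))
                if t < dp[k][d]:
                    dp[k][d], cut[k][d] = t, j
    best = min(range(D), key=lambda x: dp[L][x])
    cuts, k = [], L
    for d in range(best, 0, -1):  # backtrack the chosen cut points
        cuts.append(cut[k][d])
        k = cut[k][d]
    return list(reversed(cuts)), dp[L][best]

if __name__ == "__main__":
    # Four equal layers; a 16 GB activation sits between layers 1 and 2.
    for devs in ([0, 1], [0, 2]):
        cuts, bottleneck = partition_layers(
            compute=[1.0, 1.0, 1.0, 1.0],
            boundary_gb=[0.1, 16.0, 0.1],
            devices=devs,
        )
        print(devs, cuts, round(bottleneck, 3))
```

Running this toy shows the network-aware effect: on the fast same-node pair [0, 1] the optimal cut lands at the compute-balanced point (2 + 2 layers, bottleneck 2.16), while on the slow cross-node pair [0, 2] it shifts to the small 0.1 GB boundary (3 + 1 layers, bottleneck 3.0), accepting compute imbalance to avoid shipping a large tensor over the over-subscribed link.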
Date
2025-04-30
Resource Type
Text
Resource Subtype
Thesis