Network-aware device placement search for distributed training
Author(s)
Venkata, Vishnu Varma
Abstract
As deep learning models grow in scale and complexity, efficient distributed training requires not only advanced parallelization strategies but also intelligent placement of model components across heterogeneous computing infrastructures. Existing device placement frameworks often assume simplified, uniform network topologies, leading to suboptimal performance in real-world data centers where communication costs vary significantly across nodes. This thesis presents NEST, a network-aware, efficient device placement framework based on structured dynamic programming techniques. NEST jointly optimizes device placement and parallelism configuration by explicitly modeling the hierarchical and over-subscribed nature of modern data center networks. It supports a broad range of parallelization strategies, including tensor, pipeline, data, expert, and Zero Redundancy Optimizer (ZeRO) parallelism, and integrates detailed memory and communication cost modeling. Through structured dynamic programming, NEST explores the vast placement space efficiently and offers provable optimality guarantees within its search scope. Evaluations across realistic workloads and network settings show that NEST consistently outperforms manual and network-unaware baselines, delivering significant improvements in training throughput and resource utilization.
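
To make the search concrete, the toy sketch below illustrates the kind of topology-aware optimization the abstract describes: given a hierarchical network with a bandwidth per level, it assigns parallelism styles to levels so that the most communication-heavy style lands on the fastest links. All bandwidths, communication volumes, and the exhaustive enumeration (a stand-in for NEST's actual structured dynamic programming) are hypothetical illustrations, not taken from the thesis.

```python
"""Toy, hypothetical sketch of topology-aware parallelism placement.

Not NEST's implementation: an exhaustive search over assignments of
parallelism styles to network levels, minimizing a simple ring
all-reduce communication cost model. All constants are illustrative.
"""
from itertools import permutations

# Hierarchical topology, innermost to outermost level (assumed values):
# bandwidth in GB/s and fan-out (devices per node, nodes per rack, racks).
LEVEL_BANDWIDTH = (600.0, 100.0, 12.5)
LEVEL_FANOUT = (8, 4, 2)

# Rough per-step communication volume (GB) each parallelism style moves
# among the devices in its group (illustrative numbers only).
COMM_VOLUME = {"tensor": 40.0, "pipeline": 2.0, "data": 10.0}


def level_cost(style: str, degree: int, bandwidth: float) -> float:
    """Seconds spent communicating for one style placed at one level,
    using the standard ring all-reduce factor (degree - 1) / degree."""
    if degree <= 1:
        return 0.0
    return COMM_VOLUME[style] * (degree - 1) / degree / bandwidth


def best_placement() -> tuple[float, tuple[str, ...]]:
    """Enumerate every assignment of styles to levels and keep the cheapest.
    A structured dynamic program would prune this space; with three levels
    brute force is enough to show the effect of the topology."""
    best_cost, best_order = float("inf"), ()
    for order in permutations(COMM_VOLUME):
        cost = sum(
            level_cost(style, LEVEL_FANOUT[i], LEVEL_BANDWIDTH[i])
            for i, style in enumerate(order)
        )
        if cost < best_cost:
            best_cost, best_order = cost, order
    return best_cost, best_order


if __name__ == "__main__":
    cost, order = best_placement()
    print(f"cheapest assignment (inner -> outer): {order}, "
          f"estimated comm time {cost:.4f} s")
```

Under these assumed numbers, the search keeps the heavy tensor-parallel traffic on the fast intra-node links and pushes the light pipeline traffic across the slow cross-rack links, the same intuition that motivates modeling over-subscribed networks explicitly.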
Date
2025-04-30
Resource Type
Text
Resource Subtype
Thesis