Partitioning and Scheduling Framework with Dynamic Memory Estimation for Multi-Instance GPUs

Author(s)
Saraha, Abhijeet
Organizational Unit
School of Computer Science
School established in 2007
Abstract
The problem of partitioning and scheduling to effectively utilize NVIDIA's Multi-Instance GPU (MIG) capabilities is quite challenging. On the one hand, tight partitions must be created to maximize concurrency and throughput; on the other, the memory needs of executing GPU processes must be adequately met to avoid out-of-memory errors. This is exacerbated by the dynamic memory behavior of modern ML workloads such as LLMs, as well as by the constraints MIG imposes on partition creation and the need to choose the right configuration for maximizing concurrency. This research proposes a comprehensive framework to address these challenges. It combines memory estimation analysis and scheduling to dynamically create and manage MIG partitions according to the resource needs of GPU jobs. For general programs and ML workloads, we propose two scheduling schemes: one that minimizes the number of repartitioning calls at runtime, and a second that reconfigures the GPU partitions to the needs of the next GPU job in the queue. This approach yields up to 6.20x throughput improvement and 5.93x energy improvement for general workloads, and 1.59x and 1.12x improvements in throughput and energy, respectively, for ML workloads on an A100 GPU.
Many workloads' memory requirements, however, are quite challenging to analyze. State-of-the-art ML model estimation methods are ineffective for workloads that allocate memory dynamically. To overcome this limitation, we design a time series-based profiling method that gathers memory allocation statistics during the initial part of the execution and then projects the future memory needs of the process. If the projected memory need is likely to exceed the allocated partition, the process is aborted and restarted on a larger partition. Memory needs are predicted as early as possible to minimize the delays a restart adds to execution.
We leverage this technique on LLM workloads and show substantial improvements, including up to 1.43x higher throughput and 1.11x energy savings. Lastly, we show that the framework is agnostic to the underlying MIG-enabled GPU and can be adapted to newer generations of GPU microarchitectures without any changes.
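The abstract's time series-based projection and abort-and-restart decision can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the thesis's actual method: a simple least-squares linear trend stands in for whatever forecasting model the work uses, and all function names, the safety margin, and the sample data are hypothetical.

```python
# Hypothetical sketch: sample a process's GPU memory usage during the
# initial phase of execution, fit a linear trend, and project whether
# future demand will exceed the current MIG partition's memory.

def project_peak_memory(samples, horizon):
    """Fit a least-squares line to (time, mem) samples and return the
    projected memory use `horizon` time units past the last sample."""
    n = len(samples)
    ts = [t for t, _ in samples]
    ms = [m for _, m in samples]
    t_mean = sum(ts) / n
    m_mean = sum(ms) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    slope = (sum((t - t_mean) * (m - m_mean) for t, m in zip(ts, ms)) / denom
             if denom else 0.0)
    intercept = m_mean - slope * t_mean
    return slope * (ts[-1] + horizon) + intercept

def needs_restart(samples, partition_mem_gib, horizon, safety=0.95):
    """Abort-and-restart decision: restart on a larger partition if the
    projected demand would exceed a safety fraction of capacity."""
    return project_peak_memory(samples, horizon) > safety * partition_mem_gib

# Example: memory growing ~1 GiB per 10 s on a (hypothetical) 10 GiB slice;
# the linear trend projects ~11 GiB at horizon 60, so a restart is flagged.
samples = [(0, 2.0), (10, 3.0), (20, 4.0), (30, 5.0)]
print(needs_restart(samples, partition_mem_gib=10, horizon=60))  # → True
```

In a real deployment the samples would come from periodic device-memory queries (e.g. via NVML), and a larger MIG profile would be selected before relaunching the job.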
Date
2025-12
Resource Type
Text
Resource Subtype
Thesis (Masters Degree)