Efficient AI Stack: Deployment-Aware Neural Architecture Search and Serving of Deep Neural Networks

Author(s)
Khare, Alind
Organizational Unit
School of Computer Science
Abstract
The increasing deployment of Deep Neural Networks (DNNs) on the critical path of production applications, both in the datacenter and at the edge, requires production systems to serve these DNNs under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency (R1) and accuracy (R2) requirements of the application and the efficient utilization of scarce resources (R3). To balance the R1-R3 trade-offs efficiently, production systems must navigate choices of models, hardware, and application contexts. This thesis proposes an efficient AI stack to resolve this tension in the R1-R3 trade-off space. The key idea of the efficient AI stack is to produce and consume Pareto-optimal (w.r.t. latency/accuracy) DNNs. On the production side, the thesis proposes several neural architecture search algorithms, namely CompOFA, DES, and SuperFedNAS, that automatically specialize DNNs to achieve the highest accuracy under different hardware and latency targets in centralized and federated data environments. On the consumption side, the thesis proposes a) SuperServe, an inference serving system that consumes these DNNs and schedules them resource-efficiently under bursty workloads, and b) DSched, which schedules data pipelines feeding DNNs in a timely and cost-efficient manner. Overall, the proposed efficient stack co-optimizes R1-R2 under dynamic workloads while maintaining resource efficiency (R3).
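The kind of decision the abstract describes on the consumption side, i.e., serving a Pareto-optimal (latency/accuracy) DNN under a latency requirement, can be illustrated with a minimal sketch. This is not code from the thesis; the model names, profiled numbers, and function names below are hypothetical, and a real serving system such as SuperServe makes this choice dynamically under bursty load rather than from a static table.

```python
def pareto_frontier(models):
    """Keep models that are not dominated: no other model is both
    faster (lower latency) and at least as accurate."""
    frontier = []
    best_acc = float("-inf")
    for m in sorted(models, key=lambda m: m["latency_ms"]):
        # A model enters the frontier only if it beats the accuracy
        # of every faster model seen so far.
        if m["accuracy"] > best_acc:
            frontier.append(m)
            best_acc = m["accuracy"]
    return frontier

def pick_model(models, latency_budget_ms):
    """Highest-accuracy Pareto-optimal model within the latency budget,
    or None if no model is feasible."""
    feasible = [m for m in pareto_frontier(models)
                if m["latency_ms"] <= latency_budget_ms]
    return max(feasible, key=lambda m: m["accuracy"]) if feasible else None

# Hypothetical profiled candidates (latency in ms, top-1 accuracy).
candidates = [
    {"name": "tiny",  "latency_ms": 5,  "accuracy": 0.70},
    {"name": "small", "latency_ms": 12, "accuracy": 0.76},
    {"name": "slow",  "latency_ms": 20, "accuracy": 0.74},  # dominated by "small"
    {"name": "large", "latency_ms": 40, "accuracy": 0.82},
]
```

Under a tight 15 ms budget the sketch selects "small"; relaxing the budget to 50 ms lets the more accurate "large" model serve the request, which is the latency/accuracy trade-off (R1 vs. R2) the stack navigates.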
Date
2024-12-02
Resource Type
Text
Resource Subtype
Dissertation