Efficient AI Stack: Deployment-Aware Neural Architecture Search and Serving of Deep Neural Networks
Author(s)
Khare, Alind
Abstract
The increasing deployment of Deep Neural Networks (DNNs) on the critical path of production applications, both in the datacenter and at the edge, requires production systems to serve these DNNs under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency (R1) and accuracy (R2) requirements of the application and the efficient utilization of scarce resources (R3). To balance the R1-R3 trade-offs efficiently, production systems must navigate across models, hardware choices, and application contexts. This thesis proposes an efficient AI stack to resolve this tension in the R1-R3 trade-off space. The key idea of the efficient AI stack is to produce and consume Pareto-optimal (w.r.t. latency and accuracy) DNNs. On the production side, the thesis proposes several neural architecture search algorithms, namely CompOFA, DES, and SuperFedNAS, that automatically specialize DNNs to achieve the highest accuracy under different hardware and latency targets in centralized and federated data environments. On the consumption side, the thesis proposes (a) SuperServe, an inference serving system that consumes these DNNs and schedules them with resource efficiency under bursty workloads, and (b) DSched, which schedules data pipelines for DNNs in a timely and cost-efficient manner. Overall, the proposed stack co-optimizes R1 and R2 under dynamic workloads while preserving resource efficiency (R3).
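To make the Pareto-optimal selection idea concrete, the following is a minimal Python sketch, not SuperServe's actual API; the names ModelProfile, pareto_frontier, and select_model, and all latency/accuracy numbers, are illustrative assumptions. It shows how a serving system in this spirit might pick the most accurate specialized subnetwork that still meets a request's latency target (R1), keeping accuracy (R2) as high as the budget allows.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelProfile:
    name: str          # e.g., one subnetwork specialized by NAS (hypothetical)
    latency_ms: float  # profiled inference latency on the target hardware
    accuracy: float    # validation accuracy of the subnetwork

def pareto_frontier(models: List[ModelProfile]) -> List[ModelProfile]:
    """Keep only models not dominated in both latency (R1) and accuracy (R2)."""
    frontier: List[ModelProfile] = []
    for m in sorted(models, key=lambda m: m.latency_ms):
        # Sorted by latency, so m joins the frontier iff it improves on the
        # best accuracy seen among all faster models.
        if not frontier or m.accuracy > frontier[-1].accuracy:
            frontier.append(m)
    return frontier

def select_model(frontier: List[ModelProfile], slo_ms: float) -> Optional[ModelProfile]:
    """Pick the most accurate frontier model that still meets the latency SLO."""
    feasible = [m for m in frontier if m.latency_ms <= slo_ms]
    return max(feasible, key=lambda m: m.accuracy) if feasible else None

if __name__ == "__main__":
    candidates = [
        ModelProfile("subnet-small", 4.0, 0.71),
        ModelProfile("subnet-medium", 9.0, 0.76),
        ModelProfile("subnet-large", 21.0, 0.80),
        ModelProfile("subnet-slow-weak", 25.0, 0.74),  # dominated: slower and less accurate
    ]
    frontier = pareto_frontier(candidates)
    print(select_model(frontier, slo_ms=10.0))  # -> subnet-medium

Under a burst, tightening slo_ms in this sketch shifts the choice to faster frontier points, trading a bounded amount of accuracy for latency; this is one simple reading of the R1-R2 co-optimization the abstract describes.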
Date
2024-12-02
Resource Type
Text
Resource Subtype
Dissertation