Title:
Error Resilient and Adaptive Deep Learning Systems
Author(s)
Ma, Kwondo
Advisor(s)
Chatterjee, Abhijit
Abstract
As deep learning systems become integral to a wide array of applications, including autonomous systems, healthcare, and finance, their complexity and hardware deployment bring new challenges. In particular, the susceptibility of deep learning systems to hardware-induced errors, manufacturing process variability, and resource constraints presents critical obstacles to reliable and efficient operation. This dissertation addresses these issues by introducing methodologies that enhance the error resilience, adaptability, and energy efficiency of deep learning systems. The motivation for this work stems from the increasing integration of deep learning into real-world applications where reliability and robustness are paramount. The inherent variability of hardware such as resistive RAM (RRAM), together with the cost of testing and tuning, calls for adaptive systems that can mitigate the impact of this variability. Additionally, the demand for low-power, high-performance hardware accelerators in edge computing environments poses the further challenge of balancing computational efficiency against energy consumption. In response, this research proposes a signature-based predictive testing framework for detecting performance degradation caused by process variability in hardware implementations of deep neural networks (DNNs). The framework introduces a compact, efficient testing mechanism that significantly improves the identification of defective devices during manufacturing and adapts to evolving manufacturing conditions through continuous retraining. Furthermore, a learning-assisted post-manufacture tuning framework is developed to optimize the performance of DNN accelerators, ensuring higher yields and greater reliability in fault-sensitive environments.
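The signature-based testing idea can be illustrated with a toy sketch: a handful of probe inputs yield a compact response signature, and a small learned classifier predicts whether the device will meet its specification. Everything below (the probe set, the logistic-regression model, and all names) is an illustrative assumption, not the dissertation's actual framework.

```python
# Illustrative sketch of signature-based predictive testing (hypothetical
# names and model choice; the dissertation's actual method may differ).
# A "signature" is the device's response to a few probe inputs; a tiny
# logistic-regression classifier then predicts pass (1) or fail (0).
import math

def extract_signature(device_response, probe_inputs):
    """Compact signature: the device's outputs on a small probe set."""
    return [device_response(x) for x in probe_inputs]

def train_pass_fail_classifier(signatures, labels, lr=0.5, epochs=1000):
    """Fit w, b so that sigmoid(w . s + b) predicts pass/fail from a signature."""
    n = len(signatures[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for s, y in zip(signatures, labels):
            z = sum(wi * si for wi, si in zip(w, s)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted pass probability
            g = p - y                        # gradient of log loss w.r.t. z
            w = [wi - lr * g * si for wi, si in zip(w, s)]
            b -= lr * g
    return w, b

def predict_pass(w, b, signature):
    """Classify a device from its signature: True = predicted good."""
    z = sum(wi * si for wi, si in zip(w, signature)) + b
    return z > 0.0
```

Continuous retraining, as described in the abstract, would correspond to periodically refitting the classifier as fresh labeled measurements arrive from the production line.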
This tuning framework allows the system to adapt its strategies over time, reducing the need for exhaustive retraining while maintaining operational efficiency. The dissertation also addresses the resilience of Transformer architectures to soft errors, a growing concern in high-performance applications such as natural language processing and computer vision. The proposed approach combines error detection and suppression techniques to restore model performance under various error conditions, demonstrating the robustness of Transformer networks when deployed in real-world, error-prone environments. Finally, the work presents a novel energy-efficient DNN accelerator design that replaces traditional multiplication operations with shift-add computations, substantially reducing power consumption and latency. This architecture is particularly suited to low-power applications in edge and Internet of Things (IoT) devices, offering a practical path to deploying deep learning models in energy-constrained settings. Overall, this research makes significant contributions toward improving the reliability and adaptability of deep learning systems, addressing key limitations in error resilience, manufacturing yield, and energy efficiency. These methodologies pave the way for robust, efficient AI technologies capable of operating in diverse and challenging environments.
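The shift-add idea can be sketched concretely: if each weight is quantized to a signed power of two, every multiply in a dot product reduces to a bit shift plus an accumulate. The sketch below is a minimal software illustration under that assumption, not the accelerator's actual datapath.

```python
# Minimal illustration of shift-add arithmetic (a software sketch, not the
# dissertation's hardware design): weights are quantized to signed powers
# of two, so each multiply becomes a single bit shift.
import math

def quantize_pow2(w):
    """Approximate w as sign * 2**exp; return (sign, exp)."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    return sign, exp

def shift_add_dot(activations, weights):
    """Dot product of integer activations with power-of-two weights,
    using shifts and adds in place of multiplications."""
    acc = 0
    for a, w in zip(activations, weights):
        sign, exp = quantize_pow2(w)
        if sign == 0:
            continue
        shifted = a << exp if exp >= 0 else a >> -exp  # shift replaces multiply
        acc += shifted if sign > 0 else -shifted
    return acc
```

For weights that are exact powers of two the result matches an ordinary dot product; otherwise the quantization introduces a bounded approximation error, the usual accuracy/efficiency trade-off in such designs.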
Date Issued
2024-12-09
Resource Type
Text
Resource Subtype
Dissertation