Designing and Automating Asynchronous, Localized, Multi-Level Fault-Tolerance at the Application Level
Author(s)
Whitlock, Matthew J.
Abstract
Moore’s law is dead or dying, but demand for compute continues to grow faster each year. The hardware scaling trends that have driven the growth of HPC are tapering off, forcing the industry to explore new approaches to continue scaling. Further, there is growing public concern about the environmental impact of extreme-scale computing. Consequently, researchers in cloud computing, machine learning, and embedded computing are exploring reduced-reliability computing as a means to improve both performance and efficiency. In HPC, however, the current Global Checkpoint/Recovery (GCR) approach to dealing with reduced hardware reliability is fundamentally unscalable. The costs of GCR are rising faster than the performance of leading supercomputers. It is critical for application resilience to scale with, rather than against, increasing hardware fault rates if HPC is to continue scaling while reining in its environmental footprint. To avoid the exponential scaling of GCR, applications must localize the cost of hardware faults, which requires several changes to the traditional approach to fault tolerance. First, fault tolerance must be flexible to application-specific refinements while accommodating application developers’ reluctance to implement complex resilience code. We describe a layer-based resilience taxonomy and approach that exposes the imperative configurability mechanisms needed to build fault-tolerance tools that combine flexibly to utilize general application- and platform-tailored fault recovery. We demonstrate this by extending contemporary resilience tools to bring flexible and simple online recovery to applications through a multi-layered approach. Next, we define the key properties of localized recovery by creating a general analytical model. We demonstrate a pseudo-local approach using modern ULFM MPI features and show high scalability despite ULFM’s collective nature. Finally, we design a task-based recovery model capable of extending pseudo-locality to applications that do not fit the ideal model. Together, these contributions build a path for HPC to maintain environmental accountability, meet growing compute demands, and benefit from novel upcoming hardware trends.
Date
2025-04-15
Resource Type
Text
Resource Subtype
Dissertation