Using Underutilized Cores for GPU DRAM Soft Error Correction: A Software-Based Approach

Author(s)
Chimmili, Chiranjeevi
Abstract
Modern high-performance computing (HPC) systems increasingly rely on Graphics Processing Units (GPUs) for their computational power and high parallelism. GPUs drive advances in artificial intelligence, big data analytics, and scientific simulation. However, GPU Dynamic Random-Access Memory (DRAM) is increasingly susceptible to soft errors (transient faults caused by cosmic radiation and electrical interference), which poses significant challenges for data integrity. Traditional error correction codes (ECC), such as Single Error Correction, Double Error Detection (SECDED), are inadequate for complex error patterns, and the problem is especially acute in older GPU models that lack hardware ECC. This research proposes a novel software-based approach to DRAM error correction that leverages underutilized GPU cores. By assigning these idle cores to error detection and correction through the kernel scheduler, our approach offers a scalable, adaptable solution that requires no hardware modifications. Extensive simulations demonstrate significant improvements in error correction efficiency and system reliability, providing a flexible and cost-effective way to ensure data integrity in HPC environments.
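For readers unfamiliar with SECDED, the following is a minimal sketch of the baseline scheme the abstract refers to, written in Python for readability rather than as GPU code. It uses an extended Hamming (8,4) code; the function names `encode` and `decode` and the bit layout are illustrative assumptions and are not taken from the thesis, whose own correction scheme is not specified in this record.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) & 1             # makes total parity even
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall = sum(bits) & 1                 # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, located at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status
```

A single flipped bit is located and repaired, while two flipped bits are only detected. Error patterns wider than that fall outside what this code can handle, which is the limitation the abstract points to.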
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis