Title:
KIMA: Hybrid Checkpointing for Recovery from a Wide Range of Errors and Detection Latencies
Author(s)
Doudalis, Ioannis
Prvulovic, Milos
Abstract
Full system reliability is a problem that spans multiple levels
of the software/hardware stack. The normal execution of
a program in a system can be disrupted by multiple factors,
ranging from transient errors in a processor and software bugs
to permanent hardware failures and human mistakes. A common
method for recovering from such errors is the creation of
checkpoints during the execution of the program, allowing the
system to restore the program to a previous error-free state and
resume execution. Different causes of errors, though, have different
occurrence frequencies and detection latencies, requiring
the creation of multiple checkpoints at different frequencies
in order to maximize the availability of the system.

In this paper we present KIMA, a novel checkpoint creation
and management technique that efficiently combines the
existing undo-log and redo-log checkpointing approaches, reducing
the overall bandwidth requirements for both the memory
and the hard disk. KIMA establishes DRAM-based undo-log
checkpoints every 10ms, then leverages the undo-log metadata
and checkpointed data to establish redo-log checkpoints
every 1 second in non-volatile memory (such as PCM). Our
results show that KIMA incurs average overheads of less than
1% while enabling efficient recovery from both transient and
hard errors that have a variety of detection latencies.
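
To make the two checkpoint types concrete, here is a minimal software sketch of the hybrid idea, not the mechanism described in the report. It assumes a toy memory modeled as a Python dict, an undo log that records the old value on the first write to an address in each epoch, and a redo log built at a coarser interval from the addresses named in the undo logs. The class name HybridCheckpointer, its methods, and the epochs_per_redo parameter are all hypothetical; in the abstract's terms an epoch would correspond to 10ms and a redo checkpoint to 1 second (100 epochs).

```python
_MISSING = object()  # marks "address had no value before this epoch"


class HybridCheckpointer:
    """Toy model: frequent undo-log checkpoints, infrequent redo-log ones."""

    def __init__(self, epochs_per_redo=100):
        self.memory = {}               # address -> value (stands in for DRAM)
        self.current_undo = {}         # old values for the epoch in progress
        self.undo_logs = []            # closed undo logs, newest last
        self.redo_logs = []            # stands in for non-volatile storage
        self.dirty_since_redo = set()  # addresses written since last redo log
        self.epochs_per_redo = epochs_per_redo
        self.epoch = 0

    def write(self, addr, value):
        # Undo logging: save the old value only on the first write per epoch.
        if addr not in self.current_undo:
            self.current_undo[addr] = self.memory.get(addr, _MISSING)
        self.memory[addr] = value

    def end_epoch(self):
        # Close the undo log; its keys are the metadata reused for the redo log.
        self.dirty_since_redo.update(self.current_undo)
        self.undo_logs.append(self.current_undo)
        self.current_undo = {}
        self.epoch += 1
        if self.epoch % self.epochs_per_redo == 0:
            # Redo logging: store the current values of all dirty addresses.
            self.redo_logs.append(
                {a: self.memory[a] for a in self.dirty_since_redo
                 if a in self.memory}
            )
            self.dirty_since_redo.clear()

    def rollback(self, epochs=0):
        """Undo the epoch in progress, then `epochs` closed epochs.

        Simplification: rolling back past a redo checkpoint is not
        reconciled with the redo logs in this sketch.
        """
        logs = [self.current_undo]
        for _ in range(epochs):
            logs.append(self.undo_logs.pop())
        for log in logs:  # newest log first
            for addr, old in log.items():
                if old is _MISSING:
                    self.memory.pop(addr, None)
                else:
                    self.memory[addr] = old
        self.current_undo = {}
        self.epoch -= epochs

    def recover_from_redo(self):
        """Rebuild memory by replaying every redo log, oldest first."""
        self.memory = {}
        for log in self.redo_logs:
            self.memory.update(log)
        self.current_undo = {}
        self.undo_logs = []
        self.dirty_since_redo = set()
```

In this sketch, rollback() corresponds to fast recovery from errors detected within a few epochs, while recover_from_redo() corresponds to recovery when the volatile state (memory and undo logs) is lost or the error was detected too late for the undo logs to help. The sketch says nothing about bandwidth or overhead, which are the focus of the report's evaluation.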
Date Issued
2010
Resource Type
Text
Resource Subtype
Technical Report