ResCheckpointer: Building Program Error Resilience-aware Checkpointing Mechanism for HPC System
-
Abstract
The reliability of high-performance computing (HPC) systems is essential for stable program execution. However, as hardware fault rates keep rising, fault-tolerance techniques such as Checkpoint/Restart (C/R) introduce significant system overhead. This paper proposes a Program Error Resilience-aware Checkpointing Mechanism (ResCheckpointer) to reduce the overhead of C/R. Our primary motivation is the observation that crash proneness (i.e., the probability that a program crashes after a fault occurs) varies significantly both across HPC programs and across execution stages within a single program, which allows checkpoint intervals to be adjusted flexibly to further reduce C/R overhead. Specifically, we first construct graph neural network (GNN)-based learning paradigms to capture the complex error propagation and effect mechanisms hidden within an HPC program's execution flow, and propose Crash-Predictor to efficiently predict a program's crash proneness. Building on this, we develop ResCheckpointer, which equips HPC programs with an intelligent checkpoint interval setting strategy: denser checkpoints during crash-prone stages and sparser checkpoints during error-resilient stages. Experimental results show that ResCheckpointer achieves up to a 55.37% reduction in C/R cost compared to the baseline C/R mechanism.
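To make the "denser when crash-prone, sparser when resilient" idea concrete, the following is a minimal sketch and not the strategy used by ResCheckpointer. It assumes Young's classical first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), and assumes that only the fraction of faults given by the crash proneness actually forces a restart. The function names and parameters are hypothetical and chosen for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def resilience_aware_interval(
    checkpoint_cost_s: float,
    fault_mtbf_s: float,
    crash_proneness: float,
) -> float:
    """Illustrative interval for one execution stage (assumption, not the paper's rule).

    crash_proneness is the probability that a fault leads to a crash.
    Only crashing faults require a restart, so the effective mean time
    between crashes is fault_mtbf_s / crash_proneness.
    """
    if not 0.0 < crash_proneness <= 1.0:
        raise ValueError("crash_proneness must be in (0, 1]")
    effective_mtbf_s = fault_mtbf_s / crash_proneness
    return young_interval(checkpoint_cost_s, effective_mtbf_s)
```

Under these assumptions, a stage whose crash proneness is 0.25 receives an interval twice as long as a stage whose crash proneness is 1.0, so checkpoints are taken half as often where the program tends to absorb faults.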
-
-