ResCheckpointer: Building Program Error Resilience-Aware Checkpointing Mechanism for HPC Systems

Xiao-Hui Wei; Shi-Yu Tong; Zhong-Ao Sun; Xiang Li; Heng-Shan Yue

doi:10.1007/s11390-025-4634-6

Citation: Wei XH, Tong SY, Sun ZA et al. ResCheckpointer: Building program error resilience-aware checkpointing mechanism for HPC systems. Journal of Computer Science and Technology, 40(3): 671-685, May 2025. DOI: 10.1007/s11390-025-4634-6

Abstract

Reliability is essential to stable program execution on high-performance computing (HPC) systems. However, as hardware fault rates keep rising, fault-tolerance techniques such as Checkpoint/Restart (C/R) introduce significant system overhead. This paper proposes a Program Error Resilience-Aware Checkpointing Mechanism (ResCheckpointer) to mitigate the overhead of C/R. Our primary motivation is the observation that crash proneness (i.e., the probability that a program crashes after a fault occurs) varies significantly both across HPC programs and across execution stages within a single program, which prompts us to adjust checkpoint intervals flexibly to further reduce C/R overhead. Specifically, we first construct graph neural network (GNN) based learning paradigms to uncover the complex error propagation and effect mechanisms hidden within an HPC program's execution flow, and propose Crash-Predictor to efficiently predict a program's crash proneness. Based on this, we build ResCheckpointer, which provides an intelligent checkpoint interval setting strategy for HPC programs: denser checkpoints in crash-prone stages and sparser checkpoints in error-resilient stages. Experimental results show that ResCheckpointer achieves up to a 55.37% reduction in C/R cost compared with the baseline C/R mechanism.
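
The idea of denser checkpoints in crash-prone stages and sparser ones in resilient stages can be illustrated with a minimal sketch. This is not the interval strategy from the paper, which the abstract does not specify. As a stand-in it uses Young's classical first-order approximation, interval = sqrt(2 * C * MTBF), and assumes that only the fraction of faults given by a stage's crash proneness forces a restart. The function names, checkpoint cost, fault MTBF, and per-stage crash-proneness values are all hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def resilience_aware_interval(
    checkpoint_cost_s: float, fault_mtbf_s: float, crash_proneness: float
) -> float:
    """Checkpoint interval for one execution stage.

    Assumption (not from the paper): a fault leads to a restart only with
    probability `crash_proneness`, so the mean time between crashes is
    fault_mtbf_s / crash_proneness.
    """
    if not 0.0 < crash_proneness <= 1.0:
        raise ValueError("crash_proneness must be in (0, 1]")
    crash_mtbf_s = fault_mtbf_s / crash_proneness
    return young_interval(checkpoint_cost_s, crash_mtbf_s)


# Hypothetical inputs: 60 s per checkpoint, one hardware fault per day,
# and made-up crash-proneness predictions for two program stages.
CHECKPOINT_COST_S = 60.0
FAULT_MTBF_S = 86400.0
stages = [("crash-prone", 0.8), ("resilient", 0.2)]

intervals = {
    name: resilience_aware_interval(CHECKPOINT_COST_S, FAULT_MTBF_S, p)
    for name, p in stages
}

for name, interval_s in intervals.items():
    print(f"{name}: checkpoint every {interval_s:.0f} s")
```

Under these assumptions the interval scales with the inverse square root of crash proneness, so the stage with the lower predicted crash proneness is checkpointed less often. In a real system, the per-stage probabilities would come from a predictor such as the paper's Crash-Predictor rather than from fixed constants.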
