A Lookahead-Time-Constrained Disk Failure Prediction Method for Large Data Centers

张鑫晏; 冯丹; 谭支鹏; 谢燕文; 赵少锋; 韦雅媛

doi:10.1007/s11390-025-3850-4

Structured abstract:

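Background: Disk failures, the most common and significant hardware failures in storage systems, raise the risk of service interruption, data loss, and economic loss, and seriously degrade system reliability. Disk failure prediction methods improve reliability by forecasting failures in advance, so that workloads can be migrated and disks replaced in time. Existing methods typically improve predictive ability through different sampling methods and modeling algorithms. However, current research still struggles to achieve both accuracy and stability within the lookahead window. Inaccurate sample labels, insufficient data sampling, and improper sample segmentation impair the predictive ability of existing models and also make it unstable within the lookahead window: predictive ability declines as the lookahead time increases.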

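Objective: This paper targets data center disks and proposes a lookahead-time-constrained failure prediction model that improves the accuracy and stability of disk failure prediction within the lookahead window.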

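Methods: We propose LWCM, a lookahead-time-constrained failure prediction model. It combines a backward feedback mechanism, built on dynamic relabeling and sample-weight reassignment, with a two-phase data sampling technique to address inaccurate sample labels, insufficient data sampling, and unstable prediction within the lookahead window. First, LWCM relabels samples with a dynamic relabeling technique based on the lookahead time constraint and the duration of failure symptoms. Second, it selects sample data thoroughly with a two-phase sampling technique that combines initial expectation sampling with subsequent segmented resampling. Finally, on the relabeled and resampled data, a backward feedback mechanism with dynamic weighted optimization improves the stability and accuracy of the disk failure prediction model within the lookahead window.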
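To make the idea of a lookahead window concrete, the sketch below labels the daily samples of one disk with a fixed window: samples taken within the last few days before the failure are marked positive. This is only the conventional fixed-window baseline, not LWCM's dynamic relabeling, which also uses failure symptom durations and whose exact rule is not given in this abstract. The function name `label_samples` and its parameters are hypothetical.

```python
from datetime import date, timedelta

def label_samples(sample_days, failure_day, lookahead_days):
    """Return 1 for samples inside the lookahead window before failure, else 0.

    Assumption: a fixed window of `lookahead_days` days ending on the failure
    day. A disk that never failed is passed as failure_day=None.
    """
    if failure_day is None:
        return [0] * len(sample_days)
    window_start = failure_day - timedelta(days=lookahead_days)
    return [1 if window_start < d <= failure_day else 0 for d in sample_days]

# Ten consecutive daily samples; the disk fails on the last day.
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(10)]
labels = label_samples(days, failure_day=date(2024, 1, 10), lookahead_days=3)
healthy_labels = label_samples(days, failure_day=None, lookahead_days=3)
print(labels)
```

With a three-day window, only the last three samples are positive. A relabeling scheme such as the one described above would adjust which samples count as positive instead of keeping this boundary fixed.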

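Results: Experiments show that time-ordered sample splitting outperforms disk-model-based sample splitting, improving model accuracy by 18% and reducing the false alarm rate by 72.3%. LWCM clearly outperforms other machine learning models on different datasets. Compared with existing methods, LWCM raises the detection rate by 26.66% on average and lowers the false alarm rate by 64.3% on average. LWCM also predicts well across disk models and remains consistently stable across different lookahead intervals.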
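As an illustration of time-ordered splitting, the sketch below puts every sample recorded before a cutoff day in the training set and every later sample in the test set, so the model is never evaluated on data older than its training data. The abstract does not describe the authors' exact splitting procedure; the cutoff rule and the record fields `day` and `disk` are assumptions made for this example.

```python
def split_by_time(samples, cutoff_day):
    """Split samples so that all training samples precede all test samples.

    Assumption: each sample is a dict with an integer 'day' field.
    """
    train = [s for s in samples if s["day"] < cutoff_day]
    test = [s for s in samples if s["day"] >= cutoff_day]
    return train, test

# Hypothetical records from three disks, deliberately out of order.
samples = [{"day": d, "disk": f"disk-{d % 3}"} for d in (5, 1, 9, 3, 7, 2, 8, 4, 6, 0)]
train, test = split_by_time(samples, cutoff_day=7)
print(len(train), len(test))
```

Splitting by disk model would instead group samples by disk type regardless of when they were recorded.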

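Conclusions: LWCM is a lookahead-time-constrained failure prediction method for data center disks. It improves accuracy and stability within the lookahead window through three key strategies: dynamic relabeling, backward feedback based on sample-weight optimization, and two-phase data sampling. Dynamic relabeling based on the lookahead time constraint and failure symptom duration addresses inaccurate sample labels. Two-phase sampling, which combines initial expectation sampling with subsequent segmented resampling, addresses insufficient data sampling. The backward feedback mechanism with dynamic weighted optimization addresses unstable predictive ability across the lookahead interval. Future work will consider combining dynamic relabeling and two-phase sampling with other modeling algorithms to further improve the effectiveness of disk failure prediction and thereby system reliability.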

Abstract: Disk failures, the most common and significant failures in storage systems, increase the risk of service interruption and data loss and add maintenance costs, all of which reduce system reliability. Disk failure prediction methods aim to forecast failures early enough to trigger prompt data migration and disk replacement. Existing methods keep refining their models with different sampling methods and modeling algorithms. However, because of inaccurate sample labeling, insufficient data sampling, and improper sample segmentation, the predictive capability of existing models within the lookahead window is unstable and declines as the lookahead time increases. To address this, we propose LWCM (Lookahead-Window Constrained Model), which improves the predictability and stability of failure prediction within the lookahead window. LWCM corrects inaccurate sample labels with dynamic relabeling based on lookahead-window time constraints and failure symptom durations. It selects effective sample data with a two-phase sampling method that combines initial expectation sampling and subsequent segmented resampling. It also applies dynamic weighted optimization in backpropagation to enhance the predictability and stability of the disk failure prediction model. Experimental results show that LWCM achieves better failure prediction performance: its true positive and false positive rates improve on those of the offline-RF model by 38.7% and 92.4%, respectively. Furthermore, LWCM is applicable across disk models and remains stable within the lookahead window.

LWCM: A Lookahead-Window Constrained Model for Disk Failure Prediction in Large Data Centers