AquaSee: Predict Load and Cooling System Faults of Supercomputers Using Chilled Water Data

  • Abstract: An analysis of real-world operational data from the Tianhe-1A (TH-1A) supercomputer system shows that chilled water data reflect not only the operating status of the chilled water system but also changes in supercomputer load. This study proposes AquaSee, a method that predicts supercomputer load and cooling system faults from chilled water pressure and temperature data. All data used in the method are real operational data collected from the TH-1A supercomputer system deployed at the National Supercomputer Center in Tianjin. A suitable set of hyperparameters is first selected by grid search. The dataset is then chosen by building prediction models on datasets of different compositions, and the prediction sequence length is chosen by comparing prediction performance across different lengths. Experimental results show that models built on combined pressure and temperature data are more effective than models built on pressure data or temperature data alone. The best prediction sequence length is two points (two minutes beyond the time window). Furthermore, an anomaly monitoring system is built on chilled water data to help engineers detect chiller system anomalies.
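The abstract does not give implementation details, so the following is only a minimal sketch of two steps it names: slicing chilled water series into fixed-length input windows with a short prediction sequence, and selecting hyperparameters by grid search. The function names `make_windows` and `grid_search`, the hyperparameter names, the toy readings, and the placeholder scoring function are all assumptions made for illustration. The prediction model itself is not specified in the abstract and is not modelled here.

```python
from itertools import product


def make_windows(features, target, window, horizon):
    """Slice aligned time series into (input window, future target) pairs.

    features: list of equal-length lists, e.g. [pressure, temperature]
    target:   list of load values aligned with the features
    window:   number of past samples given to the model
    horizon:  number of future samples to predict (prediction sequence length)
    """
    n = len(target)
    if any(len(series) != n for series in features):
        raise ValueError("all series must have the same length")
    pairs = []
    for start in range(n - window - horizon + 1):
        end = start + window
        x = [series[start:end] for series in features]  # one slice per feature
        y = target[end:end + horizon]                   # samples just past the window
        pairs.append((x, y))
    return pairs


def grid_search(param_grid, score):
    """Try every combination in param_grid and return (best_score, best_params).

    score is a caller-supplied function mapping a parameter dict to a
    validation error; lower is better. Ties keep the first combination seen.
    """
    names = sorted(param_grid)
    best_score, best_params = None, None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if best_score is None or current < best_score:
            best_score, best_params = current, params
    return best_score, best_params


# Toy, made-up readings (not data from TH-1A).
pressure = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
temperature = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
load = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# Combined pressure + temperature inputs, predicting two points ahead.
pairs = make_windows([pressure, temperature], load, window=3, horizon=2)
print(len(pairs))  # 2
print(pairs[0])

# Placeholder score standing in for a real validation error (assumption).
def toy_score(params):
    return abs(params["window"] - 4)

best = grid_search({"window": [2, 4, 8], "units": [16, 32]}, toy_score)
print(best)
```

Passing `[pressure]` or `[temperature]` alone instead of both series gives the single-source datasets that the abstract compares against the combined one. In a real setup, `score` would train a model on the windows and return its validation error.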

     
