

SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems

  • Abstract:
    Background: Emerging artificial intelligence (AI) technologies are influencing many research areas, especially scientific fields such as astronomy, physics, and bioinformatics, making AI applications an important category of workload on high-performance computing (HPC) systems. Evaluating the AI capability of HPC systems is therefore crucial.
    Objective: Current AI benchmarks mostly adopt static models and static datasets, so they cannot keep pace with rapidly evolving AI applications and provide little forward-looking guidance on the AI support capability of HPC systems. Surveying traditional HPC benchmarks, we find that workload scalability is an essential property of HPC benchmarking. We therefore want AI benchmark workloads to be scalable: as a task scales, its characteristics can be understood more comprehensively, and as the problem size keeps growing, the results can provide forward-looking guidance for future HPC systems to better support AI workloads.
    Methods: The size of an AI workload is jointly determined by the model size and the size of the dataset the model depends on. Through an extensive survey of existing model-generation and data-generation methods, we propose SAIH, an HPC AI performance-trend evaluation methodology in which both the model and the dataset are scalable. Briefly, the scalability of a model family can be achieved through expert design or a size-configurable model architecture search; the scalability of a particular dataset can be achieved through simulation or generative adversarial networks. Given a scalable AI workload, strong- and weak-scaling tests under our evaluation criteria can reveal the AI performance bottlenecks of existing HPC systems and inform the design of AI-friendly HPC systems.
    Results: Applying this scalable evaluation methodology to a three-dimensional convolutional neural network for an astronomy task reveals phenomena that existing static HPC-AI benchmarks can hardly observe. For example: 1) the achieved performance of an HPC system is closely related to model size; as the model scales up, training performance rises from 5.2% to 59.6% of the theoretical peak; 2) the convergence of data-parallel training is closely related to model size, with larger models converging better in multi-node training; 3) the centralized file system of current systems struggles with the intensive I/O of training on large datasets.
    Conclusion: The case study in this paper yields several valuable observations on training 3D convolutional neural networks on HPC systems. By extending this scalable methodology to the other neural network types involved in scientific AI tasks and evaluating different HPC systems, we can effectively identify the bottlenecks of current HPC systems in supporting AI, and provide guidance for building next-generation exascale HPC systems.
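The model-scalability idea described in Methods can be sketched as a model family controlled by a single size knob. The following is a minimal, hypothetical illustration (the layer structure, width factor, and numbers are assumptions for exposition, not SAIH's actual model generator):

```python
# Hypothetical sketch of SAIH-style model scaling: a 3D-CNN family whose
# size is controlled by a single base-width parameter, giving a scalable
# workload for strong/weak-scaling tests. Illustrative only.

def conv3d_params(c_in, c_out, k=3):
    """Parameter count of one 3D convolution layer (weights + biases)."""
    return c_out * (c_in * k ** 3 + 1)

def scaled_model_params(width, depth=4, in_channels=1):
    """Total parameters of a toy 3D CNN: `depth` conv layers whose
    channel counts double layer by layer, starting from `width`."""
    total, c_in = 0, in_channels
    for layer in range(depth):
        c_out = width * (2 ** layer)
        total += conv3d_params(c_in, c_out)
        c_in = c_out
    return total

# Doubling the base width roughly quadruples the parameter count,
# so `width` acts as a controllable knob on the AI workload size.
for w in (8, 16, 32):
    print(w, scaled_model_params(w))
```

Scaling the dataset side (via simulation or GANs, as in Methods) would then be varied independently of this model knob.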


    Abstract: Novel artificial intelligence (AI) technology has expedited research in various scientific fields, e.g., cosmology, physics, and bioinformatics, and AI applications have inevitably become a significant category of workload on high-performance computing (HPC) systems. Existing AI benchmarks tend to customize well-recognized AI applications so as to evaluate the AI performance of HPC systems at a predefined problem size, in terms of datasets and AI models. However, driven by novel AI technology, most AI applications evolve fast in their models and datasets to achieve higher accuracy and become applicable to more scenarios. Due to this lack of scalability in problem size, static AI benchmarks may fall short of explaining the performance trends of evolving AI applications on HPC systems, in particular scientific AI applications on large-scale systems. In this paper, we propose a scalable evaluation methodology (SAIH) for analyzing the AI performance trends of HPC systems by scaling the problem sizes of customized AI applications. To enable scalability, SAIH builds a set of novel mechanisms for augmenting problem sizes. As the data and model constantly scale, we can investigate the trend and range of AI performance on HPC systems, and further diagnose system bottlenecks. To verify our methodology, we augment a cosmological AI application to evaluate a real GPU-equipped HPC system as a case study of SAIH. With data and model augmentation, SAIH evaluates the AI performance trend of HPC systems progressively, e.g., performance rising from 5.2% to 59.6% of the theoretical peak hardware performance. The evaluation results are analyzed and summarized into insights on performance issues. For instance, we find that the AI application constantly consumes the I/O bandwidth of the shared parallel file system while iteratively training its model; under I/O contention, the shared parallel file system may become a bottleneck.
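The "% of theoretical peak" figures quoted above are conventionally derived by dividing sustained training throughput (FLOP/s) by the hardware's peak FLOP/s. A small sketch of that standard calculation follows; the concrete numbers in the example are made up for illustration and are not SAIH results:

```python
# Standard efficiency metric: achieved FLOP/s as a fraction of hardware peak.
# The formula is common HPC practice; the inputs below are invented examples.

def training_flops_per_sec(flops_per_sample, samples_per_sec):
    """Sustained training throughput in FLOP/s."""
    return flops_per_sample * samples_per_sec

def peak_fraction(achieved_flops, peak_flops):
    """Achieved throughput as a fraction of theoretical peak."""
    return achieved_flops / peak_flops

# Example: a 15 TFLOP/s-peak accelerator training a model that costs
# 3e12 FLOPs per sample at 2 samples/s reaches 40% of peak.
achieved = training_flops_per_sec(3e12, 2.0)
print(f"{peak_fraction(achieved, 15e12):.1%}")  # -> 40.0%
```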

