Journal of Computer Science and Technology, 2023, Vol. 38, Issue (1): 211-218. doi: 10.1007/s11390-023-2888-4

Special topic: Survey; Computer Architecture and Systems

Unified Programming Models for Heterogeneous High-Performance Computers

Zi-Xuan Ma (马子轩), Student Member, CCF, Yu-Yang Jin (金煜阳), Member, CCF, Shi-Zhi Tang (唐适之), Student Member, CCF, Hao-Jie Wang (王豪杰), Member, CCF, Wei-Cheng Xue (薛伟诚), Ji-Dong Zhai* (翟季冬), Senior Member, CCF, and Wei-Min Zheng (郑纬民), Fellow, CCF

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2022-10-05 Revised: 2022-10-28 Accepted: 2023-01-10 Online: 2023-02-28 Published: 2023-02-28
  • Contact: Ji-Dong Zhai, E-mail: zhaijidong@tsinghua.edu.cn
  • About author: Ji-Dong Zhai received his B.S. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, in 2003, and his Ph.D. degree in computer science from Tsinghua University, Beijing, in 2010. He is a tenured associate professor in the Department of Computer Science and Technology of Tsinghua University, Beijing. His research interests include performance evaluation for high-performance computers, performance analysis, and modeling of parallel applications.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China under Grant No. 62225206.

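1. Context
With the slowdown of Moore's Law, high-performance computers built on heterogeneous accelerators have become the mainstream trend. To exploit the hardware performance of heterogeneous systems, each hardware vendor has developed its own programming framework and proprietary acceleration libraries. This makes porting programs a significant challenge for application developers and greatly limits the development and adoption of high-performance computing. A unified programming model can effectively improve program portability across heterogeneous high-performance computers. However, although existing unified programming models have invested substantial effort in code portability, they remain far from achieving good performance portability.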
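2. Objective
Performance portability means that a single code base achieves comparable computational efficiency on different heterogeneous systems without further modification. Through unified syntactic abstractions, existing unified programming models allow, to some extent, one code base to be compiled and run directly on different heterogeneous high-performance computers, but they cannot deliver comparable computational efficiency across those systems. The main goal of this paper is to propose a design approach for a unified programming model for heterogeneous high-performance computers that builds on code portability and further explores performance portability.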
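As an illustration only (it is not part of this paper's proposal), one published way to quantify "comparable computational efficiency" is the performance-portability metric of Pennycook, Sewall, and Lee: the harmonic mean of the efficiencies an application achieves across a set of platforms. The symbols below are those of that metric, not of this paper: $a$ is the application, $p$ the problem, $H$ the set of platforms, and $e_i(a,p)$ the efficiency on platform $i$.

```latex
\mathrm{PP}(a, p, H) =
\begin{cases}
  \dfrac{|H|}{\displaystyle\sum_{i \in H} \dfrac{1}{e_i(a, p)}}
    & \text{if } a \text{ is supported on every } i \in H, \\[2ex]
  0 & \text{otherwise.}
\end{cases}
```

For example, a code that reaches 80% efficiency on one platform and 40% on another scores $2 / (1/0.8 + 1/0.4) = 2 / 3.75 \approx 0.53$, so the weaker platform dominates the result.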
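3. Method
This paper proposes a design approach for a unified programming model. The design rests on four key techniques:
1) Introduce domain-specific languages that bring domain knowledge to bear on application optimization, enabling in-depth optimization for different types of applications.
2) Provide a unified representation through a unified programming abstraction, so that common optimizations can be applied across different applications.
3) Use multi-level compilation optimization to decouple the optimization stages and apply a suitable optimization strategy at each stage.
4) Use a lightweight runtime scheduling system that automatically discovers the parallelism in applications and applies resource-aware scheduling policies to improve resource utilization.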
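The fourth technique can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' runtime: the `Task` class, the `reads`/`writes` sets, the per-device `cost` estimates, and the greedy earliest-finish policy are all hypothetical. Dependencies are inferred from declared data accesses (so independent tasks are found automatically), and each task is placed on the device where it would finish first.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    reads: set    # data items the task reads (hypothetical annotation)
    writes: set   # data items the task writes (hypothetical annotation)
    cost: dict    # assumed runtime estimate per device, e.g. {"cpu": 4.0, "gpu": 1.0}


def infer_dependencies(tasks):
    """Derive dependencies from data accesses, scanning tasks in program order."""
    deps = {t.name: set() for t in tasks}
    for i, later in enumerate(tasks):
        for earlier in tasks[:i]:
            raw = earlier.writes & later.reads    # read-after-write
            waw = earlier.writes & later.writes   # write-after-write
            war = earlier.reads & later.writes    # write-after-read
            if raw or waw or war:
                deps[later.name].add(earlier.name)
    return deps


def schedule(tasks, devices):
    """Greedy placement: put each task on the device where it finishes earliest."""
    deps = infer_dependencies(tasks)
    device_free = {d: 0.0 for d in devices}   # time at which each device becomes idle
    finish = {}                               # finish time of each scheduled task
    placement = {}
    for t in tasks:  # program order is a valid topological order of deps
        ready = max((finish[d] for d in deps[t.name]), default=0.0)
        best = min(devices, key=lambda d: max(ready, device_free[d]) + t.cost[d])
        start = max(ready, device_free[best])
        finish[t.name] = start + t.cost[best]
        device_free[best] = finish[t.name]
        placement[t.name] = (best, start, finish[t.name])
    return placement


# Toy workload: "a" and "b" are independent; "c" consumes both of their outputs.
tasks = [
    Task("a", reads={"x"}, writes={"y"}, cost={"cpu": 4.0, "gpu": 1.0}),
    Task("b", reads={"x"}, writes={"z"}, cost={"cpu": 2.0, "gpu": 1.5}),
    Task("c", reads={"y", "z"}, writes={"w"}, cost={"cpu": 3.0, "gpu": 1.0}),
]
placement = schedule(tasks, ["cpu", "gpu"])
for name, (device, start, end) in placement.items():
    print(name, device, start, end)
```

In this toy run the two independent tasks land on different devices and overlap in time, which is the utilization gain a resource-aware policy is after. A production runtime would also need to account for data movement between devices, which this sketch ignores.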
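4. Conclusions
This is a perspective paper. It analyzes the current state of research on unified programming models and proposes a design approach for a unified programming model for heterogeneous high-performance computers, with the aim of further exploring performance portability.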

Abstract: Unified programming models can effectively improve program portability across heterogeneous high-performance computers. Existing unified programming models have put considerable effort into code portability but are still far from achieving good performance portability. In this paper, we present a preliminary design of a performance-portable unified programming model covering four aspects: programming language, programming abstraction, compilation optimization, and scheduling system. Specifically, domain-specific languages introduce domain knowledge to decouple the optimizations for different applications and architectures. The unified programming abstraction captures the common features of different architectures to support common optimizations. Multi-level compilation optimization enables comprehensive performance optimization based on multi-level intermediate representations. A resource-aware lightweight runtime scheduling system improves the resource utilization of heterogeneous computers. This perspective paper presents our viewpoints on programming models for emerging heterogeneous systems.

Key words: performance portability, programming model, heterogeneous supercomputer
