Journal of Computer Science and Technology ›› 2023, Vol. 38 ›› Issue (1): 211-218. DOI: 10.1007/s11390-023-2888-4

Special Issue: Surveys; Computer Architecture and Systems

• Special Issue in Honor of Professor Kai Hwang’s 80th Birthday •

Unified Programming Models for Heterogeneous High-Performance Computers

Zi-Xuan Ma (马子轩), Student Member, CCF, Yu-Yang Jin (金煜阳), Member, CCF, Shi-Zhi Tang (唐适之), Student Member, CCF, Hao-Jie Wang (王豪杰), Member, CCF, Wei-Cheng Xue (薛伟诚), Ji-Dong Zhai* (翟季冬), Senior Member, CCF, and Wei-Min Zheng (郑纬民), Fellow, CCF

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2022-10-05; Revised: 2022-10-28; Accepted: 2023-01-10; Online: 2023-02-28; Published: 2023-02-28
  • Contact: Ji-Dong Zhai, E-mail: zhaijidong@tsinghua.edu.cn
  • About author:Ji-Dong Zhai received his B.S. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, in 2003, and his Ph.D. degree in computer science from Tsinghua University, Beijing, in 2010. He is a tenured associate professor in the Department of Computer Science and Technology of Tsinghua University, Beijing. His research interests include performance evaluation for high-performance computers, performance analysis, and modeling of parallel applications.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China under Grant No. 62225206.

Unified programming models can effectively improve program portability across heterogeneous high-performance computers. Existing unified programming models devote substantial effort to code portability but remain far from achieving good performance portability. In this paper, we present a preliminary design of a performance-portable unified programming model that covers four aspects: programming language, programming abstraction, compilation optimization, and scheduling system. Specifically, domain-specific languages introduce domain knowledge to decouple the optimizations for different applications and architectures. The unified programming abstraction unifies the common features of different architectures to support common optimizations. Multi-level compilation optimization enables comprehensive performance optimization based on multi-level intermediate representations. A resource-aware lightweight runtime scheduling system improves the resource utilization of heterogeneous computers. This is a perspective paper that presents our viewpoints on programming models for emerging heterogeneous systems.

Key words: performance portability; programming model; heterogeneous supercomputer
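To make the abstract’s distinction between code portability and performance portability concrete, the sketch below (ours, not taken from the paper) illustrates the style of existing unified programming models such as Kokkos: a single C++ kernel written against a portable parallel pattern, which can be recompiled for OpenMP, CUDA, or other backends, but whose performance still depends on backend-specific tuning. The kernel, array size, and names are illustrative assumptions.

```cpp
// Minimal Kokkos-style sketch of single-source code portability (illustrative only).
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Views are portable arrays; their memory space follows the backend
    // (host, GPU, ...) selected when the program is built.
    Kokkos::View<double*> x("x", n), y("y", n);
    const double a = 2.0;

    // One parallel pattern, dispatched to whichever execution space is enabled.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) += a * x(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```

The same source can target different architectures, which is the code-portability goal; the paper argues that performance portability additionally requires domain knowledge, unified abstractions, multi-level compilation, and resource-aware scheduling on top of such models.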
