Journal of Computer Science and Technology, 2023, Vol. 38, Issue (1): 211-218. doi: 10.1007/s11390-023-2888-4
Special topics: Survey; Computer Architecture and Systems
马子轩 (Student Member, CCF), 金煜阳 (Member, CCF), 唐适之 (Student Member, CCF), 王豪杰 (Member, CCF), 薛伟诚, 翟季冬* (Senior Member, CCF), and 郑纬民 (Fellow, CCF)
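1. Context
With the slowdown of Moore's Law, high-performance computers built on heterogeneous accelerators have become the mainstream trend. To exploit the hardware performance of heterogeneous systems, each hardware vendor has developed its own programming framework and proprietary acceleration libraries for its hardware. This poses a significant challenge for application developers who port programs and severely limits the development and adoption of high-performance computing. A unified programming model can effectively improve program portability across heterogeneous high-performance computers. However, although existing unified programming models have put great effort into code portability, they are still far from achieving good performance portability.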
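2. Objective
Performance portability means that a single code base achieves comparable computational efficiency on different heterogeneous systems without additional modification. Through unified syntactic abstractions, existing unified programming models allow, to some extent, one code base to be compiled and run directly on different heterogeneous high-performance computers, but they cannot yet deliver comparable computational efficiency across those systems. The main goal of this article is to propose a design approach for a unified programming model for heterogeneous high-performance computers that, building on code portability, further explores performance portability.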
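One common way to make "comparable computational efficiency" measurable is the harmonic-mean metric proposed by Pennycook et al.: the score is the harmonic mean of the efficiencies an application achieves on each platform in a set, and zero if the application fails to run on any of them. The sketch below assumes that metric, which this abstract itself does not prescribe; the function name and the efficiency numbers are hypothetical and serve only as illustration.

```python
from typing import Mapping, Optional


def performance_portability(efficiency: Mapping[str, Optional[float]]) -> float:
    """Harmonic mean of per-platform efficiencies; 0.0 if any platform is unsupported.

    `efficiency` maps a platform name to the efficiency (0 < e <= 1) the
    application achieves there, or None if the code does not run on it.
    """
    if not efficiency:
        raise ValueError("at least one platform is required")
    total = 0.0
    for e in efficiency.values():
        if e is None or e <= 0.0:
            # One unsupported platform makes the whole code non-portable.
            return 0.0
        total += 1.0 / e
    return len(efficiency) / total


# Hypothetical efficiencies, for illustration only (not measured data).
measured = {"cpu": 0.8, "gpu": 0.4}
score = performance_portability(measured)
```

Because the harmonic mean is dominated by the smallest value, a code that is fast on one platform and slow on another scores close to its worst case, which matches the notion that efficiency should be comparable everywhere.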
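3. Method
This article proposes a design approach for a unified programming model, which comprises four key techniques:
1) Introducing domain-specific languages so that domain knowledge can be applied to application optimization, enabling in-depth optimization for different types of applications.
2) Providing a unified representation through unified programming abstractions, so that general optimizations can be applied across different applications.
3) Using multi-level compilation optimization to decouple the optimization stages and applying a matching optimization strategy at each stage.
4) Using a lightweight runtime scheduling system to automatically discover parallelism in applications and applying resource-aware scheduling strategies to improve resource utilization.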
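The fourth technique can be illustrated with a minimal sketch. It is not the article's runtime system: the `Task` type, the rule for inferring dependencies from declared reads and writes, the unit task duration, and the greedy largest-first packing are all assumptions made for illustration. Parallelism is discovered automatically from data accesses, and each scheduling step only starts tasks that fit in the cores still free.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Task:
    name: str
    reads: Set[str]
    writes: Set[str]
    cores: int  # cores the task occupies while running (hypothetical resource model)


def discover_dependencies(tasks: List[Task]) -> Dict[str, Set[str]]:
    """Infer dependencies from data accesses, following program order."""
    deps: Dict[str, Set[str]] = {t.name: set() for t in tasks}
    for i, later in enumerate(tasks):
        for earlier in tasks[:i]:
            raw = earlier.writes & later.reads   # read after write
            war = earlier.reads & later.writes   # write after read
            waw = earlier.writes & later.writes  # write after write
            if raw or war or waw:
                deps[later.name].add(earlier.name)
    return deps


def schedule(tasks: List[Task], total_cores: int) -> List[List[str]]:
    """Greedy resource-aware schedule; tasks in one inner list run concurrently."""
    for t in tasks:
        if t.cores > total_cores:
            raise ValueError(f"task {t.name} needs more cores than available")
    deps = discover_dependencies(tasks)
    done: Set[str] = set()
    pending = list(tasks)
    waves: List[List[str]] = []
    while pending:
        ready = [t for t in pending if deps[t.name] <= done]
        # Place the largest ready tasks first so the cores are packed tightly.
        ready.sort(key=lambda t: -t.cores)
        free = total_cores
        wave: List[str] = []
        for t in ready:
            if t.cores <= free:
                wave.append(t.name)
                free -= t.cores
        waves.append(wave)
        done.update(wave)
        pending = [t for t in pending if t.name not in done]
    return waves


# Hypothetical four-task program on a 4-core node.
tasks = [
    Task("load_a", set(), {"a"}, 2),
    Task("load_b", set(), {"b"}, 2),
    Task("scale_a", {"a"}, {"a2"}, 4),
    Task("add", {"a2", "b"}, {"c"}, 2),
]
waves = schedule(tasks, total_cores=4)
```

In this example the two loads touch disjoint data, so they are found to be independent and share the four cores in the first step, while `scale_a` and `add` wait for the data they read.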
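4. Conclusions
This is a perspective article that analyzes the current state of research on unified programming models and proposes a design approach for a unified programming model for heterogeneous high-performance computers, in order to further explore performance portability.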