Journal of Computer Science and Technology, 2023, Vol. 38, Issue (1): 87-102. doi: 10.1007/s11390-023-2885-7

Special topics: Survey; Computer Architecture and Systems



The Paradigm of Power Bounded High-Performance Computing

Rong Ge¹, Xizhou Feng², Pengfei Zou³, and Tyler Allen⁴

  1. School of Computing, Clemson University, Clemson, SC 29634, U.S.A.
  2. Meta Platform, Inc., Menlo Park, CA 94025, U.S.A.
  3. Amazon, Inc., Seattle, WA 98170, U.S.A.
  4. University of North Carolina at Charlotte, NC 27599, U.S.A.
  • Received: 2022-10-04; Revised: 2022-10-26; Accepted: 2023-01-02; Online: 2023-02-28; Published: 2023-02-28
  • Contact: Rong Ge, E-mail: rge@clemson.edu
  • About author: Rong Ge received her B.S. and M.S. degrees in engineering mechanics from Tsinghua University, Beijing, in 1995 and 1998, respectively, and her Ph.D. degree in computer science from Virginia Tech, Washington, in 2007. She is the director of the Scalable Computing and Analytics Laboratory in the School of Computing at Clemson University, Clemson. Her research interests include parallel and distributed systems, machine learning and big data, heterogeneous computing, and performance evaluation and optimization.
  • Supported by:
    This work is supported in part by the U.S. National Science Foundation under Grant Nos. CCF-1551511 and CNS-1551262.


Abstract: Modern computer systems are increasingly bounded by the available or permissible power at multiple layers, from individual components to data centers. To cope with this reality, it is necessary to understand how power bounds impact performance, especially for systems built from high-end nodes, each consisting of multiple power-hungry components. Because placing an inappropriate power bound on a node or a component can lead to severe performance loss, coordinating power allocation among nodes and components is mandatory to achieve the desired performance given a total power budget. In this article, we describe the paradigm of power bounded high-performance computing, which considers coordinated power bound assignment to be a key factor in computer system performance analysis and optimization. We apply this paradigm to the problem of power coordination across multiple layers for both CPU and GPU computing. Using several case studies, we demonstrate how the principles of balanced power coordination can be applied and adapted to the interplay of workloads, hardware technology, and the available total power for performance improvement.

Key words: power bounded computing, cross-component power coordination, hierarchical power allocation
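
To make the idea of coordinated power allocation concrete, here is a minimal Python sketch that splits a fixed node power budget between a CPU and a memory component and compares the best split with a naive even split. The performance model, its constants, and the names `toy_performance` and `coordinate_power` are assumptions made for this illustration only; they are not taken from the article and do not reproduce the authors' method.

```python
from typing import Callable, Tuple


def toy_performance(cpu_watts: int, mem_watts: int) -> int:
    """Hypothetical throughput model (illustrative constants, not measured).

    Each component does no useful work below an idle floor and gains
    nothing above a saturation cap. Throughput is limited by whichever
    component is slower, so an unbalanced split wastes power.
    """
    cpu_rate = max(0, min(cpu_watts, 120) - 40)        # 40 W floor, 120 W cap
    mem_rate = 2 * max(0, min(mem_watts, 40) - 10)     # 10 W floor, 40 W cap
    return min(cpu_rate, mem_rate)


def coordinate_power(
    total_watts: int,
    perf: Callable[[int, int], int],
    step: int = 5,
) -> Tuple[int, int, int]:
    """Exhaustively search CPU/memory splits of total_watts.

    Returns (cpu_watts, mem_watts, performance) for the first split that
    achieves the highest modeled performance.
    """
    best = (0, total_watts, perf(0, total_watts))
    for cpu in range(0, total_watts + 1, step):
        mem = total_watts - cpu
        score = perf(cpu, mem)
        if score > best[2]:
            best = (cpu, mem, score)
    return best


if __name__ == "__main__":
    budget = 100
    cpu, mem, score = coordinate_power(budget, toy_performance)
    naive = toy_performance(budget // 2, budget // 2)
    print(f"coordinated split: CPU={cpu} W, memory={mem} W, performance={score}")
    print(f"even split:        CPU=50 W, memory=50 W, performance={naive}")
```

Under this toy model, an even split starves the CPU while giving memory more power than it can use, which is the kind of inappropriate power bound the abstract warns about. A real system would replace the exhaustive search and the toy model with measured or learned performance profiles.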
