Journal of Computer Science and Technology, 2023, Vol. 38, Issue (1): 64-79. doi: 10.1007/s11390-022-2911-1

Special Topics: Surveys; Computer Architecture and Systems



The Memory-Bounded Speedup Model and Its Impacts in Computing

Xian-He Sun (孙贤和), Fellow, IEEE, and Xiaoyang Lu (鲁潇阳), Member, IEEE        

  1. Department of Computer Science, Illinois Institute of Technology, Chicago 60616, U.S.A.
  • Received: 2022-10-17 Revised: 2022-11-12 Accepted: 2022-12-01 Online: 2023-02-28 Published: 2023-02-28
  • Contact: Xian-He Sun E-mail: sun@iit.edu
  • About author: Xian-He Sun is a University Distinguished Professor and the Ron Hochsprung Endowed Chair of the Department of Computer Science at the Illinois Institute of Technology (Illinois Tech), Chicago. Before joining Illinois Tech, he worked at DoE Ames National Laboratory, at ICASE, NASA Langley Research Center, and at Louisiana State University, Baton Rouge, and was an ASEE Fellow at Navy Research Laboratories. Dr. Sun is an IEEE Fellow and is known for his memory-bounded speedup model, also called Sun-Ni's Law, for scalable computing. His research interests include high-performance computing, memory and I/O systems, and performance evaluation and optimization. He has over 300 publications and six patents in these areas, and is currently leading multiple federally funded large software development projects in HPC I/O systems. Dr. Sun is the Editor-in-Chief of IEEE Transactions on Parallel and Distributed Systems, and a former chair of the Computer Science Department at Illinois Tech, Chicago. He received the Golden Core Award from IEEE CS Society in 2017, the Overseas Outstanding Contributions Award from CCF in 2018, the ACM Karsten Schwan Best Paper Award from ACM HPDC in 2019, the Ron Hochsprung Endowed Chairship from Illinois Tech in 2020, the First Prize Best Paper Award from ACM/IEEE CCGrid in 2021, and the CSE Distinguished Alumni Award from Michigan State University in 2022. More information about Dr. Sun can be found at his website: www.cs.iit.edu/~sun/.
  • Supported by:
    This work is supported in part by the U.S. National Science Foundation under Grant Nos. CCF-2029014 and CCF-2008907.

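With the surge of big data applications and the worsening of the memory-wall problem, the memory system has replaced the computing unit as the major concern of computer research. More than thirty years ago, the memory-bounded speedup model was the first model to identify data storage as the bottleneck of computing performance. The model provides a general method for calculating speedup and reveals that computing speedup is limited by memory capacity. It was adopted by the community as soon as it was proposed and was immediately included in several textbooks on parallel computing and advanced computer architecture, becoming required content for graduate students in computer science. These include Prof. Kai Hwang's book "Scalable Parallel Computing: Technology, Architecture, Programming", in which the memory-bounded speedup model is called Sun-Ni's Law and is listed alongside Amdahl's Law and Gustafson's Law as one of the three well-known laws of scalable computing. After years of development, the influence of the model has gone far beyond parallel computing and reached the fundamentals of computing. The memory-bounded speedup model has promoted the concept of data-centric computing, offered new insights for developing next-generation memory systems and optimization tools, and provided key ideas for solving "big data" problems. In this article, we review the progress and impacts of the memory-bounded speedup model and discuss its role and potential in the big data era.

Keywords: memory-bounded speedup model, scalable computing, memory wall, performance modeling and optimization, data-centric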

Abstract:

With the surge of big data applications and the worsening of the memory-wall problem, the memory system, instead of the computing unit, has become the commonly recognized major concern of computing. However, this "memory-centric" common understanding had a humble beginning. More than three decades ago, the memory-bounded speedup model was the first model to recognize memory as the bound of computing, and it provided a general bound of speedup and a computing-memory trade-off formulation. The memory-bounded model was well received even then. It was immediately introduced in several advanced computer architecture and parallel computing textbooks in the 1990s as a must-know for scalable computing. These include Prof. Kai Hwang's book "Scalable Parallel Computing", in which he introduced the memory-bounded speedup model as Sun-Ni's law, alongside Amdahl's law and Gustafson's law. Through the years, the impacts of this model have grown far beyond parallel processing and into the fundamentals of computing. In this article, we revisit the memory-bounded speedup model and discuss its progress and impacts in depth to make a unique contribution to this special issue, to stimulate new solutions for big data applications, and to promote data-centric thinking and rethinking.

Key words: memory-bounded speedup, scalable computing, memory-wall, performance modeling and optimization, data-centric design
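
The three laws named in the abstract can be compared numerically. The sketch below is an illustration and not code from the article: it assumes the commonly cited textbook forms of the laws, and the function names, the sequential fraction `f_seq`, and the workload-growth function `g` are hypothetical names introduced here. In this form, `g(p)` describes how much the parallel workload grows when memory capacity grows `p` times; setting `g(p) = 1` gives Amdahl's law and `g(p) = p` gives Gustafson's law.

```python
def amdahl_speedup(f_seq, p):
    """Fixed-size speedup: the problem size does not grow with p processors."""
    return 1.0 / (f_seq + (1.0 - f_seq) / p)


def gustafson_speedup(f_seq, p):
    """Fixed-time (scaled) speedup: the parallel work grows linearly with p."""
    return f_seq + (1.0 - f_seq) * p


def memory_bounded_speedup(f_seq, p, g):
    """Memory-bounded speedup in its commonly cited form.

    g is an assumed, user-supplied function: g(p) is the factor by which
    the parallel workload grows when memory capacity grows p times.
    """
    scaled_parallel = (1.0 - f_seq) * g(p)
    return (f_seq + scaled_parallel) / (f_seq + scaled_parallel / p)


if __name__ == "__main__":
    f_seq, p = 0.1, 16  # illustrative values, not taken from the article
    print("Amdahl:        ", amdahl_speedup(f_seq, p))
    print("Gustafson:     ", gustafson_speedup(f_seq, p))
    # Assumed example: workload grows faster than memory, g(p) = p ** 1.5
    print("Memory-bounded:", memory_bounded_speedup(f_seq, p, lambda n: n ** 1.5))
```

The choice `g(p) = p ** 1.5` is only one plausible example: it corresponds to a workload such as dense matrix multiplication, where memory use grows with the square of the matrix dimension while computation grows with its cube. Any other `g` can be passed in to explore a different computing-memory trade-off.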
