Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (1): 90-109. doi: 10.1007/s11390-020-0942-z

Special Topic: Computer Architecture and Systems

Unimem: Runtime Data Management on Non-Volatile Memory-Based Heterogeneous Main Memory for High Performance Computing

Kai Wu and Dong Li*        

  1. Department of Electrical Engineering and Computer Science, University of California Merced, Merced 95343, U.S.A.
  • Received: 2020-08-25 Revised: 2020-11-30 Online: 2021-01-05 Published: 2021-01-23
  • Contact: Dong Li E-mail: dli35g@ucmerced.edu
  • About author: Kai Wu is a Ph.D. candidate at the University of California Merced (UC Merced), Merced. Before coming to UC Merced, he earned his Master's degree in computer science and engineering from Michigan State University, East Lansing, in 2016. His research areas are computer systems and high performance computing (HPC), with a focus on hardware heterogeneity. He designs high performance computer systems with memory heterogeneity. His recent work focuses on designing system support for persistent memory-based big memory platforms. He has published in top-tier system/HPC conferences and journals, including FAST, HPCA, SC, PACT, ICPP, and CLUSTER.
  • Supported by:
    This work was partially supported by the U.S. National Science Foundation under Grant Nos. CNS-1617967, CCF-1553645, and CCF-1718194.

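Background:
Non-volatile memory (NVM) offers a scalable and energy-efficient alternative to DRAM as main memory. However, because NVM has relatively high latency and low bandwidth, it is usually paired with DRAM to build a heterogeneous memory system (HMS). As a result, an application's data objects must be placed carefully in NVM and DRAM to obtain the best performance.
Objective:
This work aims to design a data management system for HMS in HPC. There are three main requirements. First, we want to avoid disruptive changes to hardware: because of concerns about hardware cost, HPC data centers may find it difficult to adopt existing hardware-based solutions for managing data placement on HMS. Second, we want to minimize changes to applications and system software, so that legacy HPC applications can be ported to NVM-based HMS with little effort. Third, managing data placement should be as transparent as possible: we want to enable automatic data placement and relieve users of managing its details.
Methods:
In this paper, we introduce a runtime system named Unimem that automatically and transparently decides and enforces data placement. First, we use online profiling based on performance counters to capture the memory access patterns of execution phases, and on that basis characterize how sensitive the data objects in each phase are to memory bandwidth and latency. Second, we use lightweight performance models to predict the performance benefit and the cost of moving data objects between NVM and DRAM. Given the benefit and the movement cost, we formulate the problem of determining the optimal data placement as a knapsack problem. Based on the performance models and this formulation, we avoid unnecessary data movement while maximizing the benefit of the movement that does take place. Third, to avoid the impact of data movement on application performance, we introduce a proactive data movement mechanism. Given an execution phase and its data movement plan, the mechanism uses a helper thread to trigger data movement before the phase begins. The helper thread runs in parallel with the application, so that data movement overlaps with application execution. This reduces the overhead of data movement on the critical path of the main program. Fourth, to further improve performance, we introduce a series of techniques: (1) optimizing the initial data placement to reduce data movement cost at runtime; (2) exploring the trade-off between phase-local search and cross-phase global search for the optimal data placement; and (3) partitioning large data objects to enable fine-grained data movement.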
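The knapsack formulation mentioned above can be illustrated with a minimal sketch. It is not Unimem's implementation: the `DataObject` fields, the integer megabyte sizes, and the use of "predicted benefit minus predicted movement cost" as the value of an object are all assumptions made for illustration. DRAM capacity plays the role of the knapsack size, and a standard 0/1 dynamic program picks the objects to move into DRAM for one phase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    # All fields are hypothetical; the source does not specify the model's inputs.
    name: str
    size_mb: int      # DRAM space the object would occupy
    benefit: float    # predicted time saved in the phase if placed in DRAM
    move_cost: float  # predicted cost of moving the object from NVM to DRAM


def plan_placement(objects, dram_capacity_mb):
    """Return (total_net_gain, names) of the objects chosen for DRAM.

    0/1 knapsack solved by dynamic programming over DRAM capacity.
    """
    # Skip objects whose movement would cost more than it saves,
    # which is one way to avoid unnecessary data movement.
    candidates = [
        o for o in objects
        if o.benefit - o.move_cost > 0 and o.size_mb <= dram_capacity_mb
    ]
    # best[c] = (net gain, chosen names) using at most c MB of DRAM
    best = [(0.0, ())] * (dram_capacity_mb + 1)
    for obj in candidates:
        gain = obj.benefit - obj.move_cost
        # Walk capacity downwards so each object is selected at most once.
        for c in range(dram_capacity_mb, obj.size_mb - 1, -1):
            prev_gain, prev_names = best[c - obj.size_mb]
            if prev_gain + gain > best[c][0]:
                best[c] = (prev_gain + gain, prev_names + (obj.name,))
    return best[dram_capacity_mb]


if __name__ == "__main__":
    phase_objects = [
        DataObject("A", size_mb=4, benefit=10.0, move_cost=2.0),
        DataObject("B", size_mb=3, benefit=7.0, move_cost=1.0),
        DataObject("C", size_mb=2, benefit=5.0, move_cost=1.5),
        DataObject("D", size_mb=5, benefit=1.0, move_cost=3.0),
    ]
    print(plan_placement(phase_objects, dram_capacity_mb=6))
```

With 6 MB of DRAM in this toy input, A and B together do not fit, so the plan pairs A with the smaller object C, and D is never considered because moving it would cost more than it saves.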
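Results:
We present a lightweight runtime solution that automatically and transparently manages data placement on HMS without hardware modifications or disruptive changes to applications. Our runtime solution effectively narrows the performance gap between NVM and DRAM. Experimental results show that, with software-based data management, using NVM to replace most DRAM can be a feasible solution for future HPC systems.
Conclusions:
The limitations of NVM raise the question of whether NVM is a feasible solution for HPC applications. In this paper, we quantify the performance gap between NVM-based and DRAM-based systems and show that a carefully designed runtime can significantly reduce that gap. We hope our work can lay a foundation for future HPC systems to embrace NVM.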

Abstract: Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace dynamic random access memory (DRAM) as main memory. However, because of its relatively high latency and low bandwidth, NVM is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, the data objects of an application must be carefully placed in NVM and DRAM for the best performance. In this paper, we introduce a lightweight runtime solution that automatically and transparently manages data placement on HMS without requiring hardware modifications or disruptive changes to applications. Leveraging online profiling and performance models, the runtime solution characterizes the memory access patterns associated with data objects and minimizes unnecessary data movement. Our runtime solution effectively bridges the performance gap between NVM and DRAM. We demonstrate that using NVM to replace the majority of DRAM can be a feasible solution for future HPC systems with the assistance of software-based data management.

Key words: data management, non-volatile memory, runtime system
