Journal of Computer Science and Technology ›› 2019, Vol. 34 ›› Issue (1): 94-112. doi: 10.1007/s11390-019-1901-4

Special topic: Computer Architecture and Systems

Scaling out NUMA-Aware Applications with RDMA-Based Distributed Shared Memory

Yang Hong, Yang Zheng, Fan Yang, Bin-Yu Zang (Distinguished Member, CCF; Member, ACM, IEEE), Hai-Bing Guan (Senior Member, CCF, ACM, IEEE), and Hai-Bo Chen* (Distinguished Member, CCF; Senior Member, ACM, IEEE)

  1. Shanghai Key Laboratory for Scalable Computing Systems, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2018-06-13 Revised: 2018-11-21 Online: 2019-01-05 Published: 2019-01-12
  • Contact: Hai-Bo Chen, E-mail: haibochen@sjtu.edu.cn
  • About author: Yang Hong is currently a Ph.D. candidate at Shanghai Key Laboratory for Scalable Computing Systems, Shanghai Jiao Tong University, Shanghai. He received his B.S. degree in software engineering from Shanghai Jiao Tong University, Shanghai, in 2013. His research interests include computer architecture, virtualization, parallel computing, and networked systems.
  • Supported by:
    This work was supported in part by the National Key Research and Development Program of China under Grant No. 2016YFB1000500, the National Natural Science Foundation of China under Grant No. 61572314, and the National Youth Top-Notch Talent Support Program of China.

Abstract: The multicore evolution has stimulated renewed interest in scaling up applications on shared-memory multiprocessors, significantly improving the scalability of many applications. However, this scalability is limited to a single node, so programmers still have to redesign applications to scale out over multiple nodes. This paper revisits the design and implementation of distributed shared memory (DSM) as a way to scale out applications optimized for non-uniform memory access (NUMA) architecture over a well-connected cluster. It presents MAGI, an efficient DSM system that provides a transparent shared address space with scalable performance on a cluster with fast network interfaces. MAGI is unique in that it presents a NUMA abstraction to fully harness the multicore resources in each node through hierarchical synchronization and memory management. MAGI also exploits the memory access patterns of big-data applications and leverages a set of optimizations for remote direct memory access (RDMA) to reduce the number of page faults and the cost of the coherence protocol. MAGI has been implemented as a user-space library with pthread-compatible interfaces and can run existing multithreaded applications with minimal modifications. We deployed MAGI on an 8-node RDMA-enabled cluster. Experimental evaluation shows that MAGI achieves up to 9.25x speedup over an unoptimized implementation, yielding scalable performance for large-scale data-intensive applications.

Key words: distributed shared memory (DSM), scalability, multicore evolution, non-uniform memory access (NUMA), remote direct memory access (RDMA)
