计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (1): 123-139.doi: 10.1007/s11390-020-9826-z

所属专题: Computer Architecture and Systems

• • 上一篇    下一篇

申威众核处理器架构上粒子短程相互作用的高效计算

Jun-Shi Chen1, Member, CCF, Hong An1, Member, CCF, ACM, IEEE Wen-Ting Han1,*, Member, CCF, ACM, IEEE, Zeng Lin1, and Xin Liu2, Member, CCF   

  1. 1 School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China;
    2 National Research Center of Parallel Computer Engineering and Technology, Beijing 100080, China
  • 收稿日期:2019-07-08 修回日期:2020-07-08 出版日期:2021-01-05 发布日期:2021-01-23
  • 通讯作者: Wen-Ting Han E-mail:han@ustc.edu.cn
  • 作者简介:Jun-Shi Chen received his Ph.D. degree in computer science from the University of Science and Technology of China (USTC), Hefei, in 2020. He is a post-doctor with USTC, Hefei. His research interests include highperformance computing and computer architecture.
  • 基金资助:
    The work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB0204102.

Towards Efficient Short-Range Pair Interaction on Sunway Many-Core Architecture

Jun-Shi Chen1, Member, CCF, Hong An1, Member, CCF, ACM, IEEE Wen-Ting Han1,*, Member, CCF, ACM, IEEE, Zeng Lin1, and Xin Liu2, Member, CCF        

  1. 1 School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China;
    2 National Research Center of Parallel Computer Engineering and Technology, Beijing 100080, China
  • Received:2019-07-08 Revised:2020-07-08 Online:2021-01-05 Published:2021-01-23
  • Contact: Wen-Ting Han E-mail:han@ustc.edu.cn
  • About author:Jun-Shi Chen received his Ph.D. degree in computer science from the University of Science and Technology of China (USTC), Hefei, in 2020. He is a post-doctor with USTC, Hefei. His research interests include highperformance computing and computer architecture.
  • Supported by:
    The work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB0204102.

在分子动力学模拟中,粒子间的短程相互作用消耗了大部分的CPU 时间。其固有的计算稀疏性使得在新兴的众核架构上实现高性能的计算核心具有较大的挑战。在本文中,我们针对申威众核处理器架构提出了一种非常高效的粒子短程相互作用计算核心。目前在这款众核处理器上该算法的并行效率受到数据局部性差和写冲突的极大限制。为了增强数据局部性,我们提出了一种适应计算核心片上存储空间的超簇邻居列表。在处理器没有低开销锁机制的情况下,使用数据私有化力数组是避免写冲突的一种可行方法,但会带来较大的数据归约开销。为此我们提出了一种同时对硬件资源和计算任务进行划分的双切片分区方案,利用片上数据通信减少数据归约开销,同时保证负载平衡。此外,我们充分开发该处理器的SIMD并行性,并对作用力计算核心进行了指令重排优化。实验结果表明,在申威众核处理器上优化后的作用力计算核心性能相比之前提高226倍,其浮点计算性能达到理论值的20%。

关键词: 分子动力学, 申威众核, 粒子相互作用, 并行算法

Abstract: The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent computation sparsity makes it challenging to achieve high-performance kernel on the emerging many-core architecture. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by the poor data locality and write conflicts. To enhance the data locality, we propose a super-cluster-based neighbor list with an appropriate granularity that fits in the local memory of computing cores. In the absence of a low overhead locking mechanism, using data-privatization force array is a more feasible method to avoid write conflicts, but results in the large overhead of data reduction. We propose a dual-slice partitioning scheme for both hardware resources and computing tasks, which utilizes the on-chip data communication to reduce data reduction overhead and provide load balancing. Moreover, we exploit the single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20% of peak flop rate on the Sunway many-core processor.

Key words: molecular dynamics, sunway many-core, pair interaction, parallel algorithm

[1] Hollingsworth S A, Dror R O. Molecular dynamics simulation for all. Neuron, 2018, 99(6):1129-1143. DOI:10.1016/j.neuron.2018.08.011.
[2] Kumar S, Huang C, Zheng G et al. Scalable molecular dynamics with NAMD on the IBM Blue Gene/L system. IBM Journal of Research and Development, 2008, 52(1/2):177- 188. DOI:10.1147/rd.521.0177.
[3] Shaw D E, Grossman J P, Bank J A et al. Anton 2:Raising the bar for performance and programmability in a specialpurpose molecular dynamics supercomputer. In Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2014, pp.41-53. DOI:10.1109/SC.2014.9.
[4] Shaw D E, Deneroff M M, Dror R O et al. Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM, 2008, 51(7):91-97. DOI:10.1145/1364782.1364802.
[5] Götz A W, Williamson M J, Xu D et al. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 1. generalized born. Journal of Chemical Theory and Computation, 2012, 8(5):1542-1555. DOI:10.1021/ct200909j.
[6] Pennycook S J, Hughes C J, Smelyanskiy M, Jarvis S A. Exploring SIMD for molecular dynamics, using Intelr Xeonr and Intelr Xeon Phi coprocessors. In Proc. the 27th IEEE International Symposium on Parallel and Distributed Processing, May 2013, pp.1085-1097. DOI:10.1109/IPDPS.2013.44.
[7] Wang H Q, Peng S L, Zhu X Q et al. A method to accelerate GROMACS in offload mode on Tianhe-2 supercomputer. In Proc. the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2015, pp.781-784. DOI:10.1109/CCGrid.2015.65.
[8] Hu C J, Wang X M, Li J J et al. Kernel optimization for short-range molecular dynamics. Computer Physics Communications, 2017, 211:31-40. DOI:10.1016/j.cpc.2016.07.010.
[9] Law T R, Hancox J, Wright S A, Jarvis S A. An algorithm for computing short-range forces in molecular dynamics simulations with non-uniform particle densities. Journal of Parallel and Distributed Computing, 2019, 130:1-11. DOI:10.1016/j.jpdc.2019.03.008.
[10] Peng S L, Cui Y B, Yang S Y et al. A CPU/MIC collaborated parallel framework for GROMACS on Tianhe-2 supercomputer. IEEE/ACM Trans. Comput. Biology Bioinform., 2019, 16(2):425-433. DOI:10.1109/TCBB.2017.2713362.
[11] Anderson J A, Lorenz C D, Travesset A. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, 2008, 227(10):5342-5359. DOI:10.1016/j.jcp.2008.01.047.
[12] Friedrichs M S, Eastman P, Vaidyanathan V et al. Accelerating molecular dynamic simulation on graphics processing units. Journal of Computational Chemistry, 2009, 30(6):864-872. DOI:10.1002/jcc.21209.
[13] Minkin A S, Knizhnik A A, Potapkin B V. GPU implementations of some many-body potentials for molecular dynamics simulations. Advances in Engineering Software, 2017, 111:43-51. DOI:10.1016/j.advengsoft.2016.05.013.
[14] Spellings M, Marson R L, Anderson J A, Glotzer S C. GPU accelerated Discrete Element Method (DEM) molecular dynamics for conservative, faceted particle simulations. Journal of Computational Physics, 2017, 334:460-467. DOI:10.1016/j.jcp.2017.01.014.
[15] Fu H H, Liao J F, Yang J Z et al. The Sunway TaihuLight supercomputer:System and applications. Science China Information Sciences, 2016, 59(7):Article No. 072001. DOI:10.1007/s11432-016-5588-7.
[16] Dong W Q, Kang L T, Quan Z et al. Implementing molecular dynamics simulation on Sunway TaihuLight system. In Proc. the 18th IEEE International Conference on High Performance Computing and Communications, December 2016, pp.443-450. DOI:10.1109/HPCC-SmartCityDSS.2016.0070.
[17] Dong W Q, Li K L, Kang L T, Quan Z, Li K Q. Implementing molecular dynamics simulation on the Sunway TaihuLight system with heterogeneous many-core processors. Concurrency and Computation:Practice and Experience, 2018, 30(16):Article No. e4468. DOI:10.1002/cpe.4468.
[18] Yu Y, An H, Chen J S et al. Pipelining computation and optimization strategies for scaling GROMACS on the Sunway many-core processor. In Proc. the 17th International Conference on Algorithms and Architectures for Parallel Processing, August 2017, pp.18-32. DOI:10.1007/978-3-319-65482- 9_2.
[19] Duan X H, Gao P, Zhang T J et al. Redesigning LAMMPS for peta-scale and hundred-billion-atom simulation on Sunway TaihuLight. In Proc. the International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2018, Article No. 12. DOI:10.1109/SC.2018.00015.
[20] Páll S, Hess B. A flexible algorithm for calculating pair interactions on SIMD architectures. Computer Physics Communications, 2013, 184(12):2641-2650. DOI:10.1016/j.cpc.2013.06.003.
[21] Abraham M J, Murtola T, Schulz R et al. GROMACS:High performance molecular simulations through multilevel parallelism from laptops to supercomputers. SoftwareX, 2015, 1/2:19-25. DOI:10.1016/j.softx.2015.06.001.
[22] Phillips J C, Braun R, Wang W, Gumbart J et al. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 2005, 26:1781-1802. DOI:10.1002/jcc.20289.
[23] Plimpton S. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 1995, 117:1-19. DOI:10.1006/jcph.1995.1039.
[24] Yao Z H, Wang J S, Liu G R, Cheng M. Improved neighbor list algorithm in molecular simulations using cell decomposition and data sorting method. Computer Physics Communications, 2004, 161(1/2):27-35. DOI:10.1016/j.cpc.2004.04.004.
[25] Nguyen T D. GPU-accelerated Tersoff potentials for massively parallel Molecular Dynamics simulations. Computer Physics Communications, 2017, 212:113-122. DOI:10.1016/j.cpc.2016.10.020.
[26] Jia Z, Maggioni M, Staiger B, Scarpazza D P. Dissecting the NVIDIA volta GPU architecture via microbenchmarking. arXiv:1804.06826, 2018. https://arxiv.org/abs/1804.06826, April 2020.
[27] Kunaseth M, Richards D F, Glosli J N et al. Analysis of scalable data-privatization threading algorithms for hybrid MPI/OpenMP parallelization of molecular dynamics. The Journal of Supercomputing, 2013, 66(1):406-430. DOI:10.1007/s11227-013-0915-x.
[28] Lin J, Xu Z G, Cai L J, Nukada A, Satoshi M. Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations. Parallel Computing, 2018, 77:128-143. DOI:10.1016/j.parco.2018.06.001.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 周笛;. A Recovery Technique for Distributed Communicating Process Systems[J]. , 1986, 1(2): 34 -43 .
[2] 陈世华;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[3] 金兰; 杨元元;. A Modified Version of Chordal Ring[J]. , 1986, 1(3): 15 -32 .
[4] 王建潮; 魏道政;. An Effective Test Generation Algorithm for Combinational Circuits[J]. , 1986, 1(4): 1 -16 .
[5] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[6] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[7] 郑国梁; 李辉;. The Design and Implementation of the Syntax-Directed Editor Generator(SEG)[J]. , 1986, 1(4): 39 -48 .
[8] 黄学东; 蔡莲红; 方棣棠; 迟边进; 周立; 蒋力;. A Computer System for Chinese Character Speech Input[J]. , 1986, 1(4): 75 -83 .
[9] 许小曙;. Simplification of Multivalued Sequential SULM Network by Using Cascade Decomposition[J]. , 1986, 1(4): 84 -95 .
[10] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: