Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (1): 123-139. doi: 10.1007/s11390-020-9826-z

Special Issue: Computer Architecture and Systems

Towards Efficient Short-Range Pair Interaction on Sunway Many-Core Architecture

Jun-Shi Chen1, Member, CCF, Hong An1, Member, CCF, ACM, IEEE, Wen-Ting Han1,*, Member, CCF, ACM, IEEE, Zeng Lin1, and Xin Liu2, Member, CCF

  1 School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China;
  2 National Research Center of Parallel Computer Engineering and Technology, Beijing 100080, China
  • Received: 2019-07-08 Revised: 2020-07-08 Online: 2021-01-05 Published: 2021-01-23
  • Contact: Wen-Ting Han
  • About author: Jun-Shi Chen received his Ph.D. degree in computer science from the University of Science and Technology of China (USTC), Hefei, in 2020. He is a postdoctoral researcher at USTC, Hefei. His research interests include high-performance computing and computer architecture.
  • Supported by:
    The work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB0204102.

Short-range pair interactions consume most of the CPU time in molecular dynamics (MD) simulations. Their inherent computational sparsity makes it challenging to build a high-performance kernel on emerging many-core architectures. In this paper, we present a highly efficient short-range force kernel for Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance data locality, we propose a super-cluster-based neighbor list with an appropriate granularity that fits in the local memory of the computing cores. In the absence of a low-overhead locking mechanism, privatizing the force array is a more feasible way to avoid write conflicts, but it incurs a large data reduction overhead. We propose a dual-slice partitioning scheme for both hardware resources and computing tasks, which uses on-chip data communication to reduce the data reduction overhead and provide load balancing. Moreover, we exploit single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel achieves a speedup of 226x over the reference implementation and reaches 20% of the peak flop rate on the Sunway many-core processor.
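To make the write-conflict problem concrete, the following is a minimal sketch of a cutoff-based pair force loop with privatized force arrays and a final reduction. It is not the kernel described in the paper: the Lennard-Jones potential, the brute-force pair search (standing in for a neighbor list), the round-robin task split, and the names `lj_pair_force`, `short_range_forces`, and `n_workers` are all assumptions made for illustration, and the "workers" run sequentially here rather than on separate cores.

```python
def lj_pair_force(dx, dy, dz, r2, epsilon=1.0, sigma=1.0):
    """Force on particle i from particle j, where (dx, dy, dz) = r_i - r_j.

    Assumes a Lennard-Jones potential; the abstract does not name one.
    """
    inv_r2 = 1.0 / r2
    s6 = (sigma * sigma * inv_r2) ** 3          # (sigma / r)^6
    # scalar equals |F| / r, so multiplying by the displacement gives the vector
    scalar = 24.0 * epsilon * inv_r2 * s6 * (2.0 * s6 - 1.0)
    return scalar * dx, scalar * dy, scalar * dz


def short_range_forces(positions, cutoff, n_workers=4):
    """Accumulate pair forces using one private force array per worker."""
    n = len(positions)
    cutoff2 = cutoff * cutoff

    # Brute-force list of unique pairs inside the cutoff (illustrative only;
    # a real kernel would use a neighbor list instead of an O(n^2) scan).
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if r2 < cutoff2:
                pairs.append((i, j, d, r2))

    # Privatization: each worker writes only to its own copy of the force
    # array, so no locks are needed while pairs are processed.
    private = [[[0.0, 0.0, 0.0] for _ in range(n)] for _ in range(n_workers)]
    for w in range(n_workers):
        for i, j, d, r2 in pairs[w::n_workers]:   # round-robin split (assumed)
            f = lj_pair_force(d[0], d[1], d[2], r2)
            for k in range(3):
                private[w][i][k] += f[k]          # force on i
                private[w][j][k] -= f[k]          # Newton's third law for j

    # Reduction: sum the private copies. This is the overhead that grows
    # with the number of workers.
    return [[sum(private[w][i][k] for w in range(n_workers)) for k in range(3)]
            for i in range(n)]
```

Calling `short_range_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], cutoff=2.5)` returns one three-component force per particle. The reduction step touches `n_workers` copies of every force entry, which illustrates why privatization alone becomes costly as the number of cores grows.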

Key words: molecular dynamics; Sunway many-core; pair interaction; parallel algorithm

