Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (1): 33-43. doi: 10.1007/s11390-020-0741-6

Special Issue: Computer Architecture and Systems

• Special Section on Memory-Centric System Research for High-Performance Computing •

Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

Jian-Bin Fang, Member, CCF, Xiang-Ke Liao, Fellow, CCF, Chun Huang, and De-Zun Dong*        

  1. College of Computer, National University of Defense Technology, Changsha 410073, China
  • Received: 2020-06-24 Revised: 2020-12-09 Online: 2021-01-05 Published: 2021-01-23
  • Contact: De-Zun Dong
  • About author: Jian-Bin Fang is an assistant professor in computer science at National University of Defense Technology (NUDT), Changsha. He obtained his Ph.D. degree in computer science from Delft University of Technology in 2014. His research interests include parallel programming for many-cores, parallel compilers, performance modeling, and scalable algorithms. He is a member of CCF.
  • Supported by:
    This work is partially funded by the National Key Research and Development Program of China under Grant No. 2018YFB0204301, and the National Natural Science Foundation of China under Grant Nos. 61972408 and 61602501.

This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that affect high-performance computing applications. Through micro-benchmarking, we provide insights into the memory-related performance behaviors of the Phytium 2000+ system. Using the well-known roofline model, we then analyze the system while taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
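For readers unfamiliar with the roofline model mentioned in the abstract, the following is a minimal sketch of its basic bound: attainable performance is the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The function names and the machine parameters `PEAK_GFLOPS` and `BANDWIDTH_GBS` are illustrative placeholders chosen for this sketch, not values measured on Phytium 2000+ or taken from the article.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Basic roofline bound in GFLOP/s.

    intensity is the arithmetic intensity in FLOPs per byte moved from memory.
    """
    if intensity <= 0:
        raise ValueError("arithmetic intensity must be positive")
    # A kernel is limited either by compute (flat roof) or by memory (slanted roof).
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine parameters, assumed purely for illustration.
PEAK_GFLOPS = 500.0
BANDWIDTH_GBS = 100.0

if __name__ == "__main__":
    # Hypothetical kernels with assumed arithmetic intensities.
    kernels = [("low-intensity kernel", 0.125), ("high-intensity kernel", 16.0)]
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
    for name, ai in kernels:
        bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
        regime = "memory-bound" if ai < ridge else "compute-bound"
        print(f"{name}: {bound:.1f} GFLOP/s ({regime})")
```

Under these assumed parameters, kernels whose arithmetic intensity falls below the ridge point are limited by memory bandwidth, which is why an evaluation of this kind considers memory accesses and computations together.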

Key words: many-core architecture; memory-centric design; performance evaluation


ISSN 1000-9000(Print)

CN 11-2296/TP
