Jian-Bin Fang, Xiang-Ke Liao, Chun Huang, De-Zun Dong. Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+[J]. Journal of Computer Science and Technology, 2021, 36(1): 33-43. DOI: 10.1007/s11390-020-0741-6

Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

Funds: This work is partially funded by the National Key Research and Development Program of China under Grant No. 2018YFB0204301, and the National Natural Science Foundation of China under Grant Nos. 61972408 and 61602501.
  • Author Bio:

    Jian-Bin Fang is an assistant professor in computer science at National University of Defense Technology (NUDT), Changsha. He obtained his Ph.D. degree in computer science from Delft University of Technology in 2014. His research interests include parallel programming for many-cores, parallel compilers, performance modeling, and scalable algorithms. He is a member of CCF.

  • Corresponding author:

    De-Zun Dong E-mail: dongg@nudt.edu.cn

  • Received Date: June 23, 2020
  • Revised Date: December 08, 2020
  • Published Date: January 04, 2021
  • This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that affect high-performance computing applications. Through micro-benchmarking, we provide insights into the memory-related performance behaviours of the Phytium 2000+ system. Using the well-known roofline model, we then analyze the system with both memory accesses and computations taken into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.