Journal of Computer Science and Technology, 2021, Vol. 36, Issue (1): 33-43. doi: 10.1007/s11390-020-0741-6

Special Topic: Computer Architecture and Systems

Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

Jian-Bin Fang, Member, CCF, Xiang-Ke Liao, Fellow, CCF, Chun Huang, and De-Zun Dong*        

  1. College of Computer, National University of Defense Technology, Changsha 410073, China
  • Received: 2020-06-24  Revised: 2020-12-09  Online: 2021-01-05  Published: 2021-01-23
  • Contact: De-Zun Dong, E-mail: dongg@nudt.edu.cn
  • About author: Jian-Bin Fang is an assistant professor in computer science at National University of Defense Technology (NUDT), Changsha. He obtained his Ph.D. degree in computer science from Delft University of Technology in 2014. His research interests include parallel programming for many-cores, parallel compilers, performance modeling, and scalable algorithms. He is a member of CCF.
  • Supported by:
    This work is partially funded by the National Key Research and Development Program of China under Grant No. 2018YFB0204301, and the National Natural Science Foundation of China under Grant Nos. 61972408 and 61602501.

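1. Context: The high-performance computing (HPC) field has clearly shifted toward many-core architecture designs, and ARMv8-based many-core processors are typical candidates for building future HPC systems. This trend is visible in the use of the 64-core Phytium 2000+ processor to build the Tianhe-3 prototype system and of the 48-core A64FX processor to build the Fugaku supercomputer. It is therefore necessary to evaluate how typical HPC application kernels perform on this class of architecture.
2. Objective: Evaluating typical HPC application kernels on ARMv8 many-core processors not only helps exploit the computing potential of such processors but also provides a reference for further optimizing the architecture design.
3. Method: Taking the Phytium 2000+ processor as a case study, this article focuses on evaluating its cache and memory subsystems, and uses the roofline model to analyze the architectural characteristics that affect HPC application performance.
4. Results & Findings: Using micro-benchmarks, we systematically measure the memory access latency and bandwidth of the Phytium 2000+ processor. By instantiating the roofline model for Phytium 2000+, we visualize the balance between computation and memory access on this processor and show at a glance where the performance bottlenecks of different applications lie. By optimizing two typical application kernels, we find that Phytium 2000+ can achieve good performance.
5. Conclusions: The evaluation results show that ARMv8-based many-core processors can achieve good performance on several typical HPC application kernels, thanks to their balanced design between computation and memory access. In addition, for GEMM-like kernels, the shared L2 cache design is the bottleneck to further performance gains; we suggest that future designs adopt private L2 caches to improve the architecture.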

Abstract: This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that affect high-performance computing applications. Through micro-benchmarking, we provide insights into the memory-related performance behaviours of the Phytium 2000+ system. With the help of the well-known roofline model, we analyze the Phytium 2000+ system, taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.

Key words: many-core architecture, memory-centric design, performance evaluation
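
To make the roofline analysis mentioned in the abstract concrete, the following is a minimal sketch of the basic roofline bound: attainable performance is the smaller of the peak compute rate and the product of arithmetic intensity and memory bandwidth. The `Machine` class, the function names, and every number below are hypothetical placeholders chosen for illustration; they are not measurements of Phytium 2000+ and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Two parameters are enough for the basic roofline model."""
    peak_gflops: float    # peak compute throughput in GFLOP/s (hypothetical)
    bandwidth_gbs: float  # sustained memory bandwidth in GB/s (hypothetical)

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
        return self.peak_gflops / self.bandwidth_gbs


def attainable_gflops(machine: Machine, intensity: float) -> float:
    """Roofline bound: min(peak compute, intensity * bandwidth)."""
    if intensity <= 0:
        raise ValueError("arithmetic intensity must be positive")
    return min(machine.peak_gflops, intensity * machine.bandwidth_gbs)


def limiting_resource(machine: Machine, intensity: float) -> str:
    """Classify a kernel by which roof caps its performance."""
    if intensity < machine.ridge_point:
        return "memory-bound"
    return "compute-bound"


# Placeholder machine and kernels, for illustration only.
machine = Machine(peak_gflops=500.0, bandwidth_gbs=100.0)
kernels = {"low-intensity kernel": 0.25, "high-intensity kernel": 16.0}

for name, intensity in kernels.items():
    bound = attainable_gflops(machine, intensity)
    kind = limiting_resource(machine, intensity)
    print(f"{name}: at most {bound:.1f} GFLOP/s ({kind})")
```

Under these assumed parameters, a kernel whose arithmetic intensity lies below the ridge point is capped by memory bandwidth, while one above it is capped by peak compute. This is the kind of compute-versus-memory balance that a roofline plot makes visible.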
