Journal of Computer Science and Technology, 2021, Vol. 36, Issue (1): 33-43. doi: 10.1007/s11390-020-0741-6
Special topic: Computer Architecture and Systems
Jian-Bin Fang, Member, CCF, Xiang-Ke Liao, Fellow, CCF, Chun Huang, and De-Zun Dong*
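1. Context
High-performance computing (HPC) has clearly shifted toward many-core architecture designs, and ARMv8-based many-core processors are typical candidates for building future HPC systems. This trend is visible in the use of the 64-core Phytium 2000+ (FT-2000+) processor in the Tianhe-3 prototype system and of the 48-core A64FX processor in the Fugaku supercomputer. It is therefore necessary to evaluate how typical HPC application kernels perform on this class of architecture.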
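2. Objective: Evaluating typical HPC application kernels on ARMv8 many-core processors not only helps exploit the computing potential of such processors, but also provides a reference for further optimizing the architecture design.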
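3. Method: Taking the Phytium 2000+ processor as an example, the paper focuses on evaluating its cache and main-memory subsystems and uses the roofline model to analyze the architectural features that affect the performance of HPC applications.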
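4. Results & Findings: Using micro-benchmarks, the paper systematically measures the memory access latency and bandwidth of the Phytium 2000+ processor. By instantiating the roofline model for Phytium 2000+, it visualizes the balance between computation and memory access on the processor and shows the performance bottlenecks of different applications. By optimizing two typical application kernels, it finds that Phytium 2000+ can achieve good performance.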
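5. Conclusions: The evaluation shows that ARMv8-based many-core processors achieve good performance on several typical HPC application kernels, thanks to their balanced design of computation and memory access. In addition, for GEMM-like kernels, the shared L2 cache design is the bottleneck to further performance gains; the paper recommends private L2 caches in future designs to improve the architecture.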
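The roofline analysis mentioned in the Method and Results items can be illustrated with a minimal sketch. It assumes the basic roofline formulation, in which attainable performance is the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The peak and bandwidth figures and the two kernel intensities below are hypothetical placeholders chosen for easy arithmetic; they are not the values measured for Phytium 2000+ in the paper, and the function names are illustrative only.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Basic roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    if peak_gflops <= 0 or bandwidth_gbs <= 0 or intensity <= 0:
        raise ValueError("peak, bandwidth and intensity must be positive")
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


def classify(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> str:
    """Label a kernel by which roof limits it."""
    if intensity < ridge_point(peak_gflops, bandwidth_gbs):
        return "memory-bound"
    return "compute-bound"


# Hypothetical machine parameters (placeholders, NOT measured FT-2000+ values).
PEAK_GFLOPS = 500.0      # assumed peak compute rate, GFLOP/s
BANDWIDTH_GBS = 100.0    # assumed sustained memory bandwidth, GB/s

# Hypothetical kernels with assumed arithmetic intensities (FLOP/byte).
kernels = {"low-intensity kernel": 0.25, "high-intensity kernel": 16.0}

if __name__ == "__main__":
    for name, ai in kernels.items():
        bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
        kind = classify(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
        print(f"{name}: AI={ai} FLOP/byte, bound={bound} GFLOP/s, {kind}")
```

Plotting `attainable_gflops` over a range of intensities on log-log axes gives the familiar roofline shape. Kernels to the left of the ridge point are limited by memory bandwidth, and those to the right by peak compute.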