Citation: Jian-Bin Fang, Xiang-Ke Liao, Chun Huang, De-Zun Dong. Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+[J]. Journal of Computer Science and Technology, 2021, 36(1): 33-43. DOI: 10.1007/s11390-020-0741-6
[1] Laurenzano M A, Tiwari A, Cauble-Chantrenne A et al. Characterization and bottleneck analysis of a 64-bit ARMv8 platform. In Proc. the 2016 IEEE International Symposium on Performance Analysis of Systems and Software, April 2016, pp.36-45. DOI: 10.1109/ISPASS.2016.7482072.
[2] Stephens N. ARMv8-A next-generation vector architecture for HPC. In Proc. the 2016 IEEE Hot Chips 28 Symposium, August 2016. DOI: 10.1109/HOTCHIPS.2016.7936203.
[3] Zhang C. Mars: A 64-core ARMv8 processor. In Proc. the 2015 IEEE Hot Chips 27 Symposium, August 2015. DOI: 10.1109/HOTCHIPS.2015.7477454.
[4] You X, Yang H, Luan Z, Liu Y, Qian D. Performance evaluation and analysis of linear algebra kernels in the prototype Tianhe-3 cluster. In Proc. the 5th Asian Conference on Supercomputing Frontiers, March 2019, pp.86-105. DOI: 10.1007/978-3-030-18645-6_6.
[5] Dongarra J. Report on the Fujitsu Fugaku system. Technical Report, University of Tennessee, 2020. https://www.icl.utk.edu/files/publications/2020/icl-utk-1379-2020.pdf, Nov. 2020.
[6] Molka D, Hackenberg D, Schöne R, Müller M S. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In Proc. the 18th International Conference on Parallel Architectures and Compilation Techniques, September 2009, pp.261-270. DOI: 10.1109/PACT.2009.22.
[7] McCalpin J. Memory bandwidth and machine balance in current high performance computers. https://www.cs.virginia.edu/stream/analyses.html, Dec. 2020.
[8] Kamil S, Husbands P, Oliker L, Shalf J, Yelick K A. Impact of modern memory subsystems on cache optimizations for stencil computations. In Proc. the 2005 Workshop on Memory System Performance, June 2005, pp.36-43. DOI: 10.1145/1111583.1111589.
[9] Williams S, Waterman A, Patterson D A. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM, 2009, 52(4):65-76. DOI: 10.1145/1498765.1498785.
[10] Ilic A, Pratas F, Sousa L. Cache-aware roofline model: Upgrading the loft. IEEE Comput. Archit. Lett., 2014, 13(1):21-24. DOI: 10.1109/L-CA.2013.6.
[11] Liu X, Buono D, Checconi F, Choi J W, Que X, Petrini F, Gunnels J A, Stuecheli J. An early performance study of large-scale POWER8 SMP systems. In Proc. the 2016 IEEE International Parallel and Distributed Processing Symposium, May 2016, pp.263-272. DOI: 10.1109/IPDPS.2016.14.
[12] Goto K, van de Geijn R A. Anatomy of high performance matrix multiplication. ACM Trans. Math. Softw., 2008, 34(3):Article No. 12. DOI: 10.1145/1356052.1356053.
[13] Frison G, Kouzoupis D, Sartor T, Zanelli A, Diehl M. BLASFEO: Basic linear algebra subroutines for embedded optimization. ACM Trans. Math. Softw., 2018, 44(4):Article No. 42. DOI: 10.1145/3210754.
[14] Su X, Liao X, Jiang H, Yang C, Xue J. SCP: Shared cache partitioning for high-performance GEMM. ACM Transactions on Architecture and Code Optimization, 2019, 15(4):Article No. 43. DOI: 10.1145/3274654.
[15] Hollowell C, Caramarcu C, Strecker-Kellogg W, Wong A, Zaytsev A. The effect of NUMA tunings on CPU performance. Journal of Physics: Conference Series, 2015, 664(9):Article No. 092010. DOI: 10.1088/1742-6596/664/9/092010.
[16] Liu W, Vinter B. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proc. the 29th ACM on International Conference on Supercomputing, June 2015, pp.339-350. DOI: 10.1145/2751205.2751209.
[17] Grimes R, Kincaid D, Young D. ITPACK 2.0 user's guide. Technical Report, Center for Numerical Analysis, University of Texas, 1979.
[18] Kreutzer M, Hager G, Wellein G, Fehske H, Bishop A R. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM J. Sci. Comput., 2014, 36(5):401-423. DOI: 10.1137/130930352.
[19] Bell N, Garland M. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proc. the ACM/IEEE Conference on High Performance Computing, November 2009. DOI: 10.1145/1654059.1654078.
[20] Chen D, Fang J, Xu C, Chen S, Wang Z. Characterizing scalability of sparse matrix-vector multiplications on Phytium FT-2000+. Int. J. Parallel Program., 2020, 48(1):80-97. DOI: 10.1007/s10766-019-00646-x.
[21] Chen D, Fang J, Chen S, Xu C, Wang Z. Optimizing sparse matrix-vector multiplications on an ARMv8-based manycore architecture. Int. J. Parallel Program., 2019, 47(3):418-432. DOI: 10.1007/s10766-018-00625-8.
[22] Chen S, Fang J, Chen D, Xu C, Wang Z. Adaptive optimization of sparse matrix-vector multiplication on emerging many-core architectures. In Proc. the 20th IEEE International Conference on High Performance Computing, June 2018, pp.649-658. DOI: 10.1109/HPCC/SmartCity/DSS.2018.00116.
[23] Babka V, Tuma P. Investigating cache parameters of x86 family processors. In Proc. the 2009 SPEC Benchmark Workshop, January 2009, pp.77-96. DOI: 10.1007/978-3-540-93799-9_5.
[24] Fang J, Sips H J, Zhang L, Xu C, Che Y, Varbanescu A L. Test-driving Intel Xeon Phi. In Proc. the 5th ACM/SPEC International Conference on Performance Engineering, March 2014, pp.137-148. DOI: 10.1145/2568088.2576799.
[25] Ramos S, Hoefler T. Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi. In Proc. the 22nd International Symposium on High-Performance Parallel and Distributed Computing, June 2013, pp.97-108. DOI: 10.1145/2462902.2462916.
1. Fan Yuan, Xiaojian Yang, Shengguo Li, et al. Optimizing Multi-Grid Preconditioned Conjugate Gradient Method on Multi-Cores. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(5): 768. DOI:10.1109/TPDS.2024.3372473
2. Lanxin Zhao, Wanrong Gao, Jianbin Fang. Optimizing Large Language Models on Multi-Core CPUs: A Case Study of the BERT Model. Applied Sciences, 2024, 14(6): 2364. DOI:10.3390/app14062364
3. Weiling Yang, Jianbin Fang, Dezun Dong, et al. Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(3): 439. DOI:10.1109/TPDS.2024.3350368
4. Wan-Rong Gao, Jian-Bin Fang, Chun Huang, et al. wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems. Journal of Computer Science and Technology, 2023, 38(6): 1323. DOI:10.1007/s11390-021-1251-x
5. Shizhao Chen, Jianbin Fang, Chuanfu Xu, et al. Adaptive Hybrid Storage Format for Sparse Matrix–Vector Multiplication on Multi-Core SIMD CPUs. Applied Sciences, 2022, 12(19): 9812. DOI:10.3390/app12199812
6. Huayou Su, Kaifang Zhang, Songzhu Mei. On the Transformation Optimization for Stencil Computation. Electronics, 2021, 11(1): 38. DOI:10.3390/electronics11010038
7. Zhi Ma, Lei Qiao, Meng-Fei Yang, et al. Verification of Real Time Operating System Exception Management Based on SPARCv8. Journal of Computer Science and Technology, 2021, 36(6): 1367. DOI:10.1007/s11390-021-1644-x
8. Weiling Yang, Jianbin Fang, Dezun Dong, et al. LIBSHALOM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, DOI:10.1145/3458817.3476217
9. Weiling Yang, Jianbin Fang, Dezun Dong. Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), DOI:10.1109/IPDPS49936.2021.00019
10. Xiao Fu, Weiling Yang, Dezun Dong, et al. Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs. Proceedings of the 38th ACM International Conference on Supercomputing, DOI:10.1145/3650200.3656620
11. Xiaojian Yang, Shengguo Li, Fan Yuan, et al. Optimizing Multi-grid Computation and Parallelization on Multi-cores. Proceedings of the 37th International Conference on Supercomputing, DOI:10.1145/3577193.3593726
12. Chaorun Liu, Huayou Su, Yong Dou, et al. Optimizing GNN on ARM Multi-Core Processors. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), DOI:10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00206
13. Wanrong Gao, Jianbin Fang, Chun Huang, et al. Optimizing Barrier Synchronization on ARMv8 Many-Core Architectures. 2021 IEEE International Conference on Cluster Computing (CLUSTER), DOI:10.1109/Cluster48925.2021.00044
14. Pengyu Wang, Weiling Yang, Jianbin Fang, et al. Optimizing Direct Convolutions on ARM Multi-Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, DOI:10.1145/3581784.3607107
15. Kangkang Chen, Huayou Su, Chaorun Liu, et al. Algorithms and Architectures for Parallel Processing. Lecture Notes in Computer Science, DOI:10.1007/978-981-97-0811-6_4
16. Jintao Peng, Jianbin Fang, Jie Liu, et al. Optimizing MPI Collectives on Shared Memory Multi-Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, DOI:10.1145/3581784.3607074