2018, Vol. 33, Issue (1): 1-23. doi: 10.1007/s11390-018-1805-8

Topics: Computer Architecture and Systems; Artificial Intelligence and Pattern Recognition

• Special Section on Selected Paper from NPC 2011 •

BENCHIP: Benchmarking Intelligence Processors

Jin-Hua Tao1,2,3, Zi-Dong Du1,3,4, Qi Guo1,3,4, Member, CCF, Hui-Ying Lan1,3, Lei Zhang1,3, Sheng-Yuan Zhou1,3, Ling-Jie Xu5, Cong Liu6, Hai-Feng Liu7, Shan Tang8, Allen Rush9, Willian Chen9, Shao-Li Liu1,3,4, Yun-Ji Chen1,2,3, Distinguished Member, CCF, Tian-Shi Chen1,3,4   

    1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2 School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Intelligent Processor Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    4 Cambricon Ltd., Beijing 100190, China;
    5 Alibaba Infrastructure Service, Alibaba Group, Hangzhou 311121, China;
    6 Iflytek Co., Ltd., Hefei 230088, China;
    7 Beijing Jingdong Century Trading Co., Ltd., Beijing 100176, China;
    8 RDA Microelectronics, Inc., Shanghai 201203, China;
    9 Advanced Micro Devices, Inc., Sunnyvale, CA 94085, U.S.A.
  • Received: 2017-09-10 Revised: 2017-12-15 Online: 2018-01-05 Published: 2018-01-05
  • About author: Jin-Hua Tao received his B.S. degree in statistics from the University of Science and Technology of China, Hefei, in 2013. He is currently a Ph.D. student at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. His research interests include computer architecture and computational intelligence.
  • Supported by:

    This work is partially supported by the National Key Research and Development Program of China under Grant No. 2017YFB1003101, the National Natural Science Foundation of China under Grant Nos. 61472396, 61432016, 61473275, 61522211, 61532016, 61521092, 61502446, 61672491, 61602441, 61602446, 61732002, and 61702478, Beijing Science and Technology Projects under Grant No. Z151100000915072, the Science and Technology Service Network Initiative (STS) Projects of Chinese Academy of Sciences, and the National Basic Research 973 Program of China under Grant No. 2015CB358800.

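Objective: To provide benchmark workloads and a benchmarking methodology for intelligence processors.

Innovations: 1. Microbenchmarks and macrobenchmarks are proposed as evaluation workloads. Compared with previously proposed workloads, they are more representative and diverse, covering more computation patterns and application domains. 2. The microbenchmarks are used to analyze performance and power bottlenecks in intelligence processing hardware designs, and give more design guidance on performance, power, and area than earlier workloads. 3. The macrobenchmarks are used to compare the performance of different hardware platforms, and are fairer than earlier workloads. 4. An easy-to-use and efficient software environment is provided. Its input is a model file compatible with open-source frameworks such as Caffe and TensorFlow, and its output is the evaluation results for the hardware, such as performance and power.

Methods: 1. Through characterization analysis, correlation analysis, and application domain analysis of a variety of neural network algorithms, representative algorithm structures covering different computation patterns and application domains are selected as the microbenchmarks and macrobenchmarks. 2. An easy-to-use and efficient software environment and benchmark workload models are provided, together with a general API. Users can override the API, set the number of runs, and choose between the microbenchmark and macrobenchmark workloads. Based on these settings, the software runs the workloads and reports the corresponding performance and power results, which are used to find design bottlenecks or to compare performance.

Conclusions: With the easy-to-use, efficient software environment and the benchmark workloads, users can find performance and other bottlenecks in a hardware design and compare its performance fairly against other hardware platforms. The general API lets users plug in their own library files, which makes it convenient to test different hardware platforms. In addition, the results of benchmark workloads from different domains can guide the design of intelligence hardware targeted at a specific domain.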

Abstract: The increasing attention to deep learning has tremendously spurred the design of intelligence processing hardware. The variety of emerging intelligence processors requires standard benchmarks for fair comparison and system optimization (in both software and hardware). However, existing benchmarks are unsuitable for benchmarking intelligence processors because they lack diversity and representativeness. The lack of a standard benchmarking methodology further exacerbates this problem. In this paper, we propose BenchIP, a benchmark suite and benchmarking methodology for intelligence processors. The benchmark suite in BenchIP consists of two sets of benchmarks: microbenchmarks and macrobenchmarks. The microbenchmarks consist of single-layer networks and are mainly designed for bottleneck analysis and system optimization. The macrobenchmarks contain state-of-the-art industrial networks, so as to offer a realistic comparison of different platforms. We also propose a standard benchmarking methodology built upon an industrial software stack and evaluation metrics that comprehensively reflect various characteristics of the evaluated intelligence processors. BenchIP is utilized for evaluating various hardware platforms, including CPUs, GPUs, and accelerators. BenchIP will be open-sourced soon.
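
The workflow summarized above (a user-overridable API, a configurable number of runs, and a choice between the microbenchmark and macrobenchmark sets) can be illustrated with a minimal sketch. This is not BenchIP's actual interface: the names `Backend`, `DummyCpuBackend`, `benchmark`, and `SUITES`, as well as the workload names, are hypothetical placeholders, and the sketch measures only wall-clock time rather than the power or area metrics the paper mentions.

```python
import statistics
import time
from typing import Callable, Dict, List

# Hypothetical workload registry. The names are illustrative placeholders,
# not BenchIP's actual benchmark list.
SUITES: Dict[str, List[str]] = {
    "micro": ["conv_layer", "pooling_layer", "fc_layer"],  # single-layer networks
    "macro": ["network_a", "network_b"],                   # full networks
}


class Backend:
    """Base class a user would subclass to target their own hardware library."""

    def run(self, workload: str) -> None:
        raise NotImplementedError


class DummyCpuBackend(Backend):
    """Stand-in backend that burns a little CPU time instead of running a network."""

    def run(self, workload: str) -> None:
        sum(i * i for i in range(1000 * len(workload)))


def benchmark(backend: Backend, suite: str, runs: int = 3,
              clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Return the median elapsed time per workload in the chosen suite."""
    if suite not in SUITES:
        raise ValueError(f"unknown suite: {suite!r}")
    if runs < 1:
        raise ValueError("runs must be at least 1")
    results: Dict[str, float] = {}
    for workload in SUITES[suite]:
        samples = []
        for _ in range(runs):
            start = clock()
            backend.run(workload)
            samples.append(clock() - start)
        results[workload] = statistics.median(samples)
    return results


if __name__ == "__main__":
    for name, seconds in benchmark(DummyCpuBackend(), "micro", runs=5).items():
        print(f"{name}: {seconds:.6f} s")
```

Under these assumptions, testing a new platform would amount to subclassing `Backend` and passing the instance to `benchmark` with either `"micro"` (bottleneck analysis) or `"macro"` (cross-platform comparison).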

Copyright © Editorial Office of Journal of Computer Science and Technology