Citation: Li T, Jiang HL, Mo H et al. Approximate processing element design and analysis for the implementation of CNN accelerators. Journal of Computer Science and Technology, 2023, 38(2): 309–327. DOI: 10.1007/s11390-023-2548-8.
As a primary computation unit, a processing element (PE) is key to the energy efficiency of a convolutional neural network (CNN) accelerator. Taking advantage of the inherent error tolerance of CNNs, approximate computing with high hardware efficiency has been considered for implementing the computation units of CNN accelerators. However, individual approximate designs such as multipliers and adders can achieve only limited accuracy and hardware improvements. In this paper, an approximate PE is designed specifically for CNN accelerators by jointly considering the data representation, multiplication, and accumulation. An approximate data format is defined for the weights using stochastic rounding. This data format enables a simple implementation of multiplication by using small lookup tables, an adder, and a shifter. Two approximate accumulators are further proposed for the product accumulation in the PE. Compared with the exact 8-bit fixed-point design, the proposed PE saves more than 29% and 20% in power-delay product for 3 × 3 and 5 × 5 sums of products, respectively. Also, compared with PEs built from state-of-the-art approximate multipliers, the proposed design shows a significantly smaller error bias with lower hardware overhead. Moreover, the application of the approximate PEs in CNN accelerators is analyzed by implementing a multi-task CNN for face detection and alignment. We conclude that 1) an approximate PE is more effective for face detection than for alignment, 2) an approximate PE with high statistically-measured accuracy does not necessarily result in good quality in face detection, and 3) properly increasing the number of PEs in a CNN accelerator can improve its power and energy efficiency.
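The two ingredients named above, unbiased stochastic rounding for weight quantization and multiplication reduced to shifts and an add, can be illustrated with a short sketch. This is not the paper's actual data format or lookup-table circuit: `quantize_weights` and `shift_add_multiply` are hypothetical software analogues, the latter assuming that only the two most-significant set bits of a weight are kept so that a product costs two shifts and one addition.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each element of x to an integer stochastically:
    the fractional part is the probability of rounding up,
    so the rounding error is zero in expectation (unbiased)."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

def quantize_weights(w, bits=8, rng=None):
    """Quantize a weight array to signed fixed-point integers
    using stochastic rounding (a hypothetical illustration,
    not the paper's approximate format). Assumes w is nonzero."""
    if rng is None:
        rng = np.random.default_rng(0)
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    q = stochastic_round(w / scale, rng)
    q = np.clip(q, -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q.astype(np.int32), scale

def shift_add_multiply(a, w):
    """Approximate a*w by keeping only the two most-significant
    set bits of |w|, so the product becomes two shifts plus one
    add -- a hypothetical scheme mirroring the shifter+adder
    structure, with the error always toward zero."""
    sign = -1 if w < 0 else 1
    w = abs(w)
    result = 0
    for _ in range(2):          # keep at most two power-of-two terms
        if w == 0:
            break
        k = w.bit_length() - 1  # position of the leading set bit
        result += a << k
        w -= 1 << k
    return sign * result
```

For example, `shift_add_multiply(3, 7)` keeps 4 + 2 out of 7 and returns 18 instead of the exact 21, while weights with at most two set bits (e.g., 5 or 6) are multiplied exactly.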