Approximate Processing Element Design and Analysis for the Implementation of CNN Accelerators

李彤; 姜红兰; 莫海; 韩杰; 刘雷波; 毛志刚

doi:10.1007/s11390-023-2548-8

Summary:

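Background: As computation-intensive applications such as intelligent inference become more widespread, the energy required for computation keeps rising. As the primary computation unit, the processing element (PE) largely determines the computational energy efficiency of a convolutional neural network (CNN) accelerator. Moreover, because CNN algorithms are inherently error-tolerant, they do not require fully exact computation; bounded errors in intermediate results do not change the final outcome. This work therefore exploits the error tolerance of CNNs to design an approximate PE that trades some computational accuracy for a substantial reduction in hardware cost.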

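Objective: This work uses approximate computing to build an approximate PE tailored to CNNs, reducing the computational cost of CNNs while maintaining a given level of accuracy.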

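Methods: An approximate PE for CNNs is designed by jointly considering the data representation and the multiplication and accumulation operations involved in CNN computation. First, stochastic rounding is used to obtain an approximate representation of the CNN weights. The multiplication of a weight by the output of a neuron in the previous layer can then be implemented with simple lookup tables, an adder, and a shifter. In addition, two approximate accumulators are designed, one based on a Wallace tree and one based on an adder tree.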
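The stochastic rounding step mentioned above can be illustrated with a minimal sketch. The weight format actually used in the paper is not given in this summary, so the fixed-point grid, the function name `stochastic_round`, and the parameter `frac_bits` below are assumptions made only for illustration. The sketch shows the general property that makes stochastic rounding attractive: a value is rounded up with probability equal to its fractional remainder, so the rounding error averages out to zero.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a multiple of 2**-frac_bits (hypothetical grid).

    The value is rounded up with probability equal to its fractional
    remainder on the grid, so the expected result equals x.
    """
    scaled = x * (1 << frac_bits)
    low = math.floor(scaled)
    remainder = scaled - low  # lies in [0, 1)
    rounded = low + (1 if rng.random() < remainder else 0)
    return rounded / (1 << frac_bits)


# Illustrative check: 0.3 lies between the grid points 0.25 and 0.5,
# and the average of many stochastic roundings stays close to 0.3.
rng = random.Random(0)
samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would always map 0.3 to 0.25 on this grid, a fixed bias of -0.05; stochastic rounding removes that bias in expectation at the cost of higher variance per sample.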

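Results: Compared with an exact 8-bit fixed-point circuit, the proposed approximate PE reduces the power-delay product by 29% and 20% when computing 3 × 3 and 5 × 5 dot products, respectively. Compared with PEs built from state-of-the-art approximate multipliers, the design has advantages in both error bias and hardware cost. Finally, the approximate PE is used to implement a multi-task CNN (MTCNN) accelerator that performs face detection and alignment. The experiments show that, while improving hardware efficiency, the approximate PE achieves face detection accuracy close to that of the exact 8-bit fixed-point design, with a slightly higher face alignment error. In the MTCNN accelerator, the proposed approximate design also achieves higher accuracy and lower hardware cost than PEs built from other existing approximate multipliers.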
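For reference, the power-delay product (PDP) and the relative saving follow the standard definitions below. The symbols $P$ (power), $t_d$ (delay), and the subscripts are notation introduced here for illustration, not taken from the paper.

```latex
\mathrm{PDP} = P \cdot t_d, \qquad
\text{saving} = 1 - \frac{\mathrm{PDP}_{\text{approx}}}{\mathrm{PDP}_{\text{exact}}}
```

Under this definition, a 29% saving means the approximate PE's PDP is 0.71 times that of the exact design.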

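Conclusions: From the hardware implementation and accuracy simulation of the MTCNN accelerator, the following conclusions are drawn: 1) an approximate PE is more effective for face detection than for face alignment; 2) approximate multipliers with small statistical error do not necessarily yield higher face detection accuracy; 3) properly increasing the number of PEs in a CNN accelerator can improve its energy efficiency. The experiments also show that certain approximation characteristics can speed up the convergence of the face detection computation in the CNN accelerator, although the precise relationship between approximation characteristics and convergence speed remains unclear.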

Abstract: As a primary computation unit, a processing element (PE) is key to the energy efficiency of a convolutional neural network (CNN) accelerator. Taking advantage of the inherent error tolerance of CNNs, approximate computing with high hardware efficiency has been considered for implementing the computation units of CNN accelerators. However, individual approximate designs such as multipliers and adders can only achieve limited accuracy and hardware improvements. In this paper, an approximate PE is designed specifically for CNN accelerators by jointly considering the data representation, multiplication, and accumulation. An approximate data format is defined for the weights using stochastic rounding. This data format enables a simple implementation of multiplication by using small lookup tables, an adder, and a shifter. Two approximate accumulators are further proposed for the product accumulation in the PE. Compared with the exact 8-bit fixed-point design, the proposed PE saves more than 29% and 20% in power-delay product for 3 × 3 and 5 × 5 sums of products, respectively. Also, compared with the PEs consisting of state-of-the-art approximate multipliers, the proposed design shows significantly smaller error bias with lower hardware overhead. Moreover, the application of the approximate PEs in CNN accelerators is analyzed by implementing a multi-task CNN for face detection and alignment. We conclude that 1) an approximate PE is more effective for face detection than for alignment, 2) an approximate PE with high statistically measured accuracy does not necessarily result in good quality in face detection, and 3) properly increasing the number of PEs in a CNN accelerator can improve its power and energy efficiency.
