面向通用处理器结构优化的FFT性能模型

李玲; 陈云霁; 刘道福; 钱诚; 胡伟武

doi:10.1007/s11390-011-0186-z

摘要: 通用处理器(GPP)具有灵活性、可靠性和实用性的特点，已成为快速傅里叶变换(FFT)的重要实现平台之一。FFT是计算和访存密集性的典型应用，对通用处理器结构优化以提高FFT性能可以提升并平衡计算和访存能力，同时结构优化也有助于提升其它应用的性能。
为了有效优化通用处理器上FFT性能，本文建立了通用处理器上FFT性能模型，将FFT处理性能与不同的结构参数联系起来，以反映处理器结构对FFT性能的影响。FFT性能主要由计算时间和访存时间决定，通过分析FFT处理流程，可以得到每个步骤的时间。为了得到较优性能下的结构参数信息，本文考虑运算和访存的时间重叠，并给出运行时间的下界。通过求解运行时间的最优化问题，可以得到通用处理器上满足最优FFT处理性能的两个定理，即寄存器个数和访存带宽的下界。寄存器个数的下界与功能单元数和寄存器分配到释放的时间的乘积成正比；访存带宽的下界与功能部件数成正比，与每个蝶形运算所需的平均指令数和寄存器个数成反比。
本文在两个典型处理器平台上对该性能模型进行了验证，包括4核的Intel Core i7和8核的Godson-3B，估计的运行时间占实际运动时间的85%以上。本文以1GHz主频、256Gflops单精度浮点运算能力的Godson-3B为例，应用该性能模型优化其结构参数。优化前的Godson-3B上有128项256位宽的向量寄存器，两个256位宽的向量处理单元。优化前的Godson-3B处理器上FFT处理性能分析表明，大量的shuffle操作占28%以上的运算时间；此外根据本文定理二，访存带宽不能满足带宽下界要求，与计算能力不匹配，成为制约FFT处理性能的瓶颈。为了提高FFT运算效率、平衡计算和访存能力，本文提出两种优化策略，包括设计蝶形运算指令和直接寄存器-cache/外存的DMA通路。蝶形运算指令可有效减少蝶形运算时间和shuffle时间，降低了40%左右的运算时间；而DMA通路将访存带宽提高一倍，同时可完成倒序等功能，从而进一步减少shuffle时间。
优化后的Godson-3B上的性能结果表明，FFT性能提高40%左右，仅增加了0.8%的额外面积开销。本文以处理1K点单精度复数FFT为例，与Intel Core i7、Cell处理器、GPU (Nvidia CTX280)、FPGA (Altera Stratix IV)、DSP (TMS320C67x)和专用IP进行了性能比较分析。Intel Core i7和Cell处理器上的FFT性能是基于Spiral库得到；Godson-3B的性能在RTL仿真环境下测得。Godson-3B处理1K点FFT仅需0.368μs（仅次于GPU的0.232μs），功耗约为40w，性能功耗比为69.6（仅次于ASIP）。
本文工作表明，通过对通用处理器平台上FFT处理性能建模，可以建立性能与结构参数的关系，并可确定寄存器、访存带宽的下界。利用该模型可有效分析处理器平台在计算和访存方面的瓶颈，从而指导结构优化设计，可显著提高FFT处理器性能。

Abstract: General-purpose processor (GPP) is an important platform for fast Fourier transform (FFT), due to its flexibility, reliability and practicality. FFT is a representative application intensive in both computation and memory access, optimizing the FFT performance of a GPP also benefits the performances of many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of the FFT processing. The model gives out a tight lower bound of the runtime of FFT on a GPP, and guides the architecture optimization for GPP as well. Based on the model, two theorems on optimization of architecture parameters are deduced, which refer to the lower bounds of register number and memory bandwidth. Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model.
The above investigations were adopted in the development of Godson-3B, which is an industrial GPP. The optimization techniques deduced from our performance model improve the FFT performance by about 40%, while incurring only 0:8% additional area cost. Consequently, Godson-3B solves the 1024-point single-precision complex FFT in 0:368 μs with about 40Watt power consumption, and has the highest performance-per-watt in complex FFT among processors as far as we know. This work could benefit optimization of other GPPs as well.

面向通用处理器结构优化的FFT性能模型

An FFT Performance Model for Optimizing General-Purpose Processor Architecture