We use cookies to improve your experience with our site.

英特尔至强融核协处理器上高效访存的三维FFT算法

Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon PhiTM Coprocessor

  • 摘要: 英特尔集成众核(MIC)架构作为一种新兴的高性能计算体系结构,它配备了512位宽的向量化指令集和大量的计算核心,同时能提供高效的浮点计算能力和大量的访存带宽.三维FFT是一个被广泛研究的算法.传统的算法将三维FFT问题转化为每个维度上的一维FFT的计算,从而要求对整个数组遍历三次,同时也会产生很多有跨步的内存访问.在本文中,我们提出了一个只需对整个内存数组遍历两次的三维FFT算法,目的是为了最大限度地减少内存和片上高速缓存之间的数据传输量.其主要思想是对一个维度进行分解,然后将子维度的计算分别合并到另外两个维度的计算中.由于对不同的维度进行分解计算会产生不同的L2 TLB缺失次数,在文章中我们对此进行了仔细的分析.我们在该众核系统上采用了一种分层的多线程并行方法,以获得高并行度的同时,增强本地cache中数据的重用.此外,我们还采用了cache分块、内存填充、循环变换和向量化的方法进行优化.在Intel Xeon Phi协处理器平台上,我们获得了最高136 Gflops的性能,达到了该平台上Intel MKL性能的2.22倍,同时也是Intel Xeon E5-2680平台上Intel MKL性能的1.80倍.

     

    Abstract: Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel® Many Integrated Core (MIC) Architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of three dimensions, giving rise to plenty of non-unit strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the rest dimensions respectively. The difference in amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse of local cache. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel® Xeon PhiTM coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in o2oad mode, which beats the vendor-specific Intel® MKL library by a factor of up to 2.22X.

     

/

返回文章
返回