Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor

Yi-Qun Liu; Yan Li; Yun-Quan Zhang; Xian-Yi Zhang

doi:10.1007/s11390-014-1484-z

Yi-Qun Liu, Yan Li, Yun-Quan Zhang, Xian-Yi Zhang. Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor[J]. Journal of Computer Science and Technology, 2014, 29(6): 989-1002. DOI: 10.1007/s11390-014-1484-z

Abstract

Equipped with 512-bit wide SIMD instructions and a large number of computing cores, the emerging x86-based Intel® Many Integrated Core (MIC) Architecture provides not only high floating-point performance but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of the three dimensions, giving rise to many non-unit-stride memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with the transform along one of the two remaining dimensions. The difference in the number of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system to achieve a high degree of parallelism and better data reuse in the local cache. On top of this, a number of optimization techniques, such as memory padding, loop transformation, and vectorization, are employed in our implementation to further enhance performance. We evaluate the algorithm on the Intel® Xeon Phi™ coprocessor 7110P and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which beats the vendor-specific Intel® MKL library by a factor of up to 2.22X.
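The splitting idea can be illustrated with a small NumPy sketch. It is not the authors' implementation. It only checks the underlying identity: a length-N transform with N = P·Q factors (Cooley-Tukey style) into a length-P transform, a twiddle multiplication, and a length-Q transform, so each half can be grouped with one of the other two dimensions. The function name `two_pass_fft3d`, the choice to split the slowest-varying (z) dimension, and the pairing of sub-dimensions with x and y are assumptions made for illustration; the abstract does not specify them. NumPy's `fft2` also does not guarantee a single traversal of memory, so the sketch shows the arithmetic structure rather than the memory behavior.

```python
import numpy as np

def two_pass_fft3d(a, P):
    """3D FFT of a[z, y, x] using two grouped 2D transforms (illustrative only).

    The z dimension (length nz = P * Q) is split into sub-dimensions
    n1 (length P) and n2 (length Q) with z = Q * n1 + n2.
    """
    nz, ny, nx = a.shape
    if nz % P != 0:
        raise ValueError("P must divide the length of the z dimension")
    Q = nz // P

    # View z as (n1, n2); a C-order reshape gives z = Q * n1 + n2.
    b = a.astype(complex).reshape(P, Q, ny, nx)

    # Pass 1: transform along sub-dimension n1 together with x.
    b = np.fft.fft2(b, axes=(0, 3))

    # Twiddle factors exp(-2*pi*i * k1 * n2 / nz) between the two passes.
    k1 = np.arange(P).reshape(P, 1, 1, 1)
    n2 = np.arange(Q).reshape(1, Q, 1, 1)
    b = b * np.exp(-2j * np.pi * k1 * n2 / nz)

    # Pass 2: transform along sub-dimension n2 together with y.
    b = np.fft.fft2(b, axes=(1, 2))

    # Output frequency index is kz = k1 + P * k2, so put k2 first and flatten.
    return b.transpose(1, 0, 2, 3).reshape(nz, ny, nx)
```

A quick check against `np.fft.fftn` on a small random array, for example with shape (8, 4, 6) and P = 2 or P = 4, is a reasonable way to confirm that the two grouped passes reproduce the full 3D transform.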
