›› 2014, Vol. 29 ›› Issue (6): 989-1002.doi: 10.1007/s11390-014-1484-z

Special Issue: Computer Architecture and Systems

• Computer Architectures and Systems • Previous Articles     Next Articles

Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon PhiTM Coprocessor

Yi-Qun Liu1,2(刘益群), Yan Li3(李焱), Yun-Quan Zhang4(张云泉), Member, CCF, ACM, IEEE, Xian-Yi Zhang1,2(张先轶)   

  1. 1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    4 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China
  • Received:2013-12-27 Revised:2014-07-25 Online:2014-11-05 Published:2014-11-05
  • About author:Yi-Qun Liu is pursuing her Ph.D. degree in Institute of Software, Chinese Academy of Sciences, Beijing. Her research mainly concentrates on code generation, software adaption, and the design and optimization of some math algorithms such as FFT and HPCG, on heterogeneous manycore architectures.
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61221062, 61402441, 61432018, the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice under Grant No. 11000GBF01.

Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel® Many Integrated Core (MIC) Architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of three dimensions, giving rise to plenty of non-unit strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the rest dimensions respectively. The difference in amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse of local cache. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel® Xeon PhiTM coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in o2oad mode, which beats the vendor-specific Intel® MKL library by a factor of up to 2.22X.

[1] Tessendorf J. Simulating ocean water. In SIGGRAPH 2001 Course Notes, http://people.clemson.edu/~jtessen/reports. html, Oct. 2014.

[2] Ohno Y, Nishibori E, Narumi T, Koishi T, Tahirov T H, Ago H, Miyano M, Himeno R, Ebisuzaki T, Sakata M, Taiji M. A 281 Tflops calculation for X-ray protein structure analysis with special-purpose computers MDGRAPE-3. In Proc. SC, Nov. 2007, Article No. 56

[3] Omlor L, Giese M A. Anechoic blind source separation using wigner marginals. The Journal of Machine Learning Research, 2011, 12: 1111-1148.

[4] Cooley J W, Tukey J W. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 1965, 19: 297-301.

[5] Good I J. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society. Series B (Methodological), 1958, 20(2): 361-372.

[6] Thomas L H. Using a computer to solve problems in physics. Applications of Digital Computers, 1963: 44-45.

[7] Yavne R. An economical method for calculating the discrete Fourier transform. In Proc. AFIPS Fall Joint Comput. Conf., Dec. 1968, pp.115-125.

[8] Rader C M. Discrete Fourier transforms when the number of data samples is prime. Proceedings of the IEEE, 1968, 56(6): 1107-1108.

[9] Frigo M, Johnson S G. The design and implementation of FFTW3. Proceedings of the IEEE, 2005, 93(2): 216-231.

[10] Ali A, Johnsson L, Subhlok J. Scheduling FFT computation on SMP and multicore systems. In Proc. the 21st ICS, Jun. 2007, pp.293-301.

[11] Li Y, Zhang Y Q, Liu Y Q, Long G P, Jia H P. MPFFT: An auto-tuning FFT library for OpenCL GPUs. Journal of Computer Science and Technology, 2013, 28(1): 90-105.

[12] Nukada A, Matsuoka S. Auto-tuning 3-D FFT library for CUDA GPUs. In Proc. SC, Nov. 2009, Article No. 30.

[13] Ramos S, Hoefler T. Modeling communication in cachecoherent SMP systems——A case-study with Xeon Phi. In Proc. the 22nd HPDC, Jun. 2013, pp. 97-108.

[14] Van Loan C. Computational Frameworks for the Fast Fourier Transform. Philadelphia USA: SIAM, 1992.

[15] Takahashi D. A blocking algorithm for FFT on cache-based processors. In Proc. the 9th HPCN, Jun. 2001, pp.551-554.

[16] Takahashi D. Implementation and evaluation of parallel FFT using SIMD instructions on multi-core processors. In Proc. IWIA, Jan. 2007, pp.53-59.

[17] Frigo M, Leiserson C E, Prokop H, Ramachandran S. Cacheoblivious algorithms. In Proc. the 40th FOCS, Oct. 1999, pp.285-297.

[18] Gu L, Li X, Siegel J. An empirically tuned 2D and 3D FFT library on CUDA GPU. In Proc. the 24th ICS, Jun. 2010, pp.305-314.

[19] Nukada A, Ogata Y, Endo T, Matsuoka S. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In Proc. SC, Nov. 2008, Article No. 5.

[20] Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K. Auto-tuning of fast Fourier transform on graphics processors. In Proc. the 16th PPoPP, Feb. 2011, pp.257-266.

[21] Pjschel M, Moura J M F, Johnson J R, Padua D, Veloso M, Singer B W, Xiong J, Franchetti F, Gacic A, Voronenko Y, Chen K, Johnson R W, Rizzolo N. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 2005, 93(2): 232-275.

[22] Caballero D, Duran A, Martorell X. An OpenMP* barrier using SIMD instructions for Intelr® Xeon PhiTM coprocessor. In Proc. the 9th IWOMP, Sept. 2013, pp.99-113.

[23] Krishnaiyer R, Kultursay E, Chawla P, Preis S, Zvezdin A, Saito H. Compiler-based data prefetching and streaming nontemporal store generation for the Intelr® Xeon PhiTM coprocessor. In Proc. the 27th IPDPSW, May 2013, pp.1575-1586.

[24] Franchetti F, Puschel M, Voronenko Y, Chellappa S, Moura J M. Discrete Fourier transform on multicore. IEEE Signal Processing Magazine, 2009, 26(6): 90-102.

[25] Chen L, Hu Z, Lin J, Gao G R. Optimizing the fast Fourier transform on a multi-core architecture. In Proc. the 21st IPDPS, Mar. 2007.

[26] Chen L, Gao G R. Performance analysis of Cooley-Tukey FFT algorithms for a many-core architecture. In Proc. SpringSim, Apr. 2010, Article No. 81.

[27] Almaless G, Wajsburt F. Does shared-memory, highly multithreaded, single-application scale on many-cores? In Proc. the 4th HotPar, Jun. 2012.

[28] Heinecke A, Vaidyanathan K, Smelyanskiy M, Kobotov A, Dubtsov A, Henry G, Shet A G, Chrysos G, Dubey P. Design and implementation of the Linpack benchmark for single and multi-node systems based on Intelr® Xeon PhiTM coprocessor. In Proc. the 27th IPDPS, May 2013, pp.126-137.

[29] Liu X, Smelyanskiy M, Chow E, Dubey P. Effcient sparse matrix-vector multiplication on x86-based many-core processors. In Proc. the 27th ICS, Jun. 2013, pp.273-282.

[30] Park J, Bikshandi G, Vaidyanathan K, Tang P T P, Dubey P, Kim D. Tera-scale 1D FFT with low-communication algorithm and Intelr® Xeon PhiTM coprocessors. In Proc. SC, Nov. 2013, Article No. 34.
No related articles found!
Full text



[1] Chen Shihua;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[2] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[3] Chen Shihua;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[4] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[5] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[6] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[7] Min Yinghua; Han Zhide;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[8] Tang Tonggao; Zhao Zhaokeng;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[9] Min Yinghua;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[10] Zhu Hong;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved