›› 2013, Vol. 28 ›› Issue (1): 90-105.doi: 10.1007/s11390-013-1314-8

Special Issue: Computer Architecture and Systems

• Architecture and VLSI Design • Previous Articles     Next Articles

MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

Yan Li1,2 (李焱), Student Member, CCF, ACM, Yun-Quan Zhang1,* (张云泉), Member, CCF, ACM, IEEE Yi-Qun Liu1,2 (刘益群), Student Member, CCF, ACM, Guo-Ping Long1 (龙国平), and Hai-Peng Jia3 (贾海鹏)   

  1. 1. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
    2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China;
    3. School of Information Science and Engineering, Ocean University of China, Qingdao 266000, China
  • Received:2011-11-18 Revised:2012-09-25 Online:2013-01-05 Published:2013-01-05
  • Supported by:

    This work is supported in partial by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61100073, 61100066, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010902, 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice.

Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPU, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves comparable performance as the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two size. For 3D non-power-of-two FFTs, our library delivers 1.5x to 28x faster than FFTW with 4 threads and 20.01x average speedup over CUFFT 4.0 on Tesla C2050.

[1] Duhamel P, Vetterli M. Fast fourier transforms: A tutorialreview and a state of the art. Signal Processing, 1990, 9(14):259-299.
[2] Govindaraju N K, Lloyd B, Dotsenko Y, Smith B, ManferdelliJ. High performance discrete Fourier transforms on graphicsprocessors. In Proc. SC, Nov. 2008, Article No.2.
[3] Nukada A, Matsuoka S. Auto-tuning 3-D FFT library forCUDA GPUs. In Proc. SC, Nov. 2009, Article No.30.
[4] Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K.Auto-tuning of fast Fourier transform on graphics processors.In Proc PPoPP, Feb. 2011, pp.257-266.
[5] Gu L, Li X M, Siegel J. An empirically tuned 2D and 3D FFTlibrary on CUDA GPU. In Proc. the 24th ICS, June 2010,pp.305-314.
[6] Gaster B, Howes L, Kaeli D R, Mistry P, Schaa D. Heteroge-neous Computing with OpenCL. San Fransisco, USA: MorganKaufmann, 2011.
[7] Munshi A, Gaster B, Mattson T G, Fung J, Ginsburg D.OpenCL Programming Guide. Boston, USA: Addison-WesleyProfessional, 2011.
[8] Zhang E Z, Jiang Y L, Guo Z Y, Shen X P. StreamliningGPU applications on the fly: Thread divergence eliminationthrough runtime thread-data remapping. In Proc. the 24thICS, June 2010, pp.115-126.
[9] Yang Y, Xiang P, Kong J F, Zhou H Y. A GPGPU com-piler for memory optimization and parallelism management.In Proc. PLDI, June 2010, pp.86-97.
[10] Cooley J W, Tukey J W. An algorithm for the machine cal-culation of complex Fourier series. Mathematics of Compu-tation, 1965, 19: 297-301.
[11] Van Loan C. Computational Frameworks for the Fast FourierTransform. Philadelphia, USA: SIAM, 1992.
[12] Johnson J, Johnson R W, Rodriguez D, Tolimieri R. Amethodology for designing, modifying, and implementingFourier transform algorithms on various architectures. Cir-cuits, Systems and Signal Processing, 1990, 9(4): 449-500.
[13] Franchetti F, P黶chel M, Voronenko Y, Chellappa S, MouraJ M F. Discrete Fourier transform on multicore. IEEE SignalProcessing Magazine, 2009, 26(6): 90-102.
[14] Tolimieri R, An M, Lu C. Algorithms for Discrete FourierTransforms and Convolution. Berlin: Springer-Verlag, 1989.
[15] NVIDIA Corporation. NVIDIA CUDA compute uni-fied device architecture | Programming guide version4.2. 2008, http://developer.nvidia.com/cuda/nvidia-gpu-computing-documentation, Sept. 2012.
[16] Ji F, Ma X S. Using shared memory to accelerate MapRe-duce on graphics processing units. In Proc. IPDPS, May2011, pp.805-816.
[17] Jiao Y, Lin H, Balaji P, Feng W. Power and performance char-acterization of computational kernels on the GPU. In Proc.GreenCom-CPSCOM, Dec. 2010, pp.221-228.
[18] Baghsorkhi S S, Delahaye M, Patel S J, Gropp W D, Hwu WW. An adaptive performance modeling tool for GPU archi-tectures. In Proc. the 15th PPoPP, May 2010, pp.105-114.
[19] Schwarztrauber P N. Multiprocessor FFTs. Parallel Comput-ing, 1987, 5: 197-210.
[20] Frigo M, Johnson S G. The design and implementation ofFFTW3. In Proceedings of the IEEE, 2005, 93(2): 216-231.
[21] Frigo M. A fast Fourier transform compiler. In Proc. PLDI,May 1999, pp.169-180.
[22] Mesmay F, Franchetti F, Voronenko Y. Encyclopedia of Par-allel Computing. Berlin: Springer, 2011.
[23] de Mesmay F, Voronenko Y, P黶chel M. Offline library adap-tation using automatically generated heuristics. In Proc.IPDPS, Apr. 2010, pp.1-10.
[24] Kirk D B, Hwu W W. Programming Massively Parallel Pro-cessors: A Hands-on Approach. San Fransisco, USA: MorganKaufmann, 2010.
[25] Purnomo B, Rubin N, Houston M. ATI stream profiler: Atool to optimize an OpenCL kernel on ATI radeon GPUs. InProc. SIGGRAPH, July 2011, pp.26-30.
[26] Mirkovic D, Johnsson S L. Automatic performance tuning inthe UHFFT library. In Proc. ICCS, May 2001, pp.71-80.
[27] Ali A, Johnsson L, Mirkovic D. Empirical auto-tuning codegenerator for FFT and trigonometric transforms. In Proc.ODES, Mar. 2007.
[28] Mirkovic D, Mahasoom R, Johnsson L. An adaptive softwarelibrary for fast Fourier transforms. In Proc. the 14th ICS,May 2000, pp.215-224.
No related articles found!
Full text



[1] Zhang Bo; Zhang Ling;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[2] Meng Liming; Xu Xiaofei; Chang Huiyou; Chen Guangxi; Hu Mingzeng; Li Sheng;. A Tree-Structured Database Machine for Large Relational Database Systems[J]. , 1987, 2(4): 265 -275 .
[3] Lin Qi; Xia Peisu;. The Design and Implementation of a Very Fast Experimental Pipelining Computer[J]. , 1988, 3(1): 1 -6 .
[4] Sun Chengzheng; Tzu Yungui;. A New Method for Describing the AND-OR-Parallel Execution of Logic Programs[J]. , 1988, 3(2): 102 -112 .
[5] Zhang Bo; Zhang Tian; Zhang Jianwei; Zhang Ling;. Motion Planning for Robots with Topological Dimension Reduction Method[J]. , 1990, 5(1): 1 -16 .
[6] Wang Dingxing; Zheng Weimin; Du Xiaoli; Guo Yike;. On the Execution Mechanisms of Parallel Graph Reduction[J]. , 1990, 5(4): 333 -346 .
[7] Zhou Quan; Wei Daozheng;. A Complete Critical Path Algorithm for Test Generation of Combinational Circuits[J]. , 1991, 6(1): 74 -82 .
[8] Zhao Jinghai; Liu Shenquan;. An Environment for Rapid Prototyping of Interactive Systems[J]. , 1991, 6(2): 135 -144 .
[9] Shang Lujun; Xu Lihui;. Notes on the Design of an Integrated Object-Oriented DBMS Family[J]. , 1991, 6(4): 389 -394 .
[10] Xu Jianguo; Gou Yuchai; Lin Zongkai;. HEPAPS:A PCB Automatic Placement System[J]. , 1992, 7(1): 39 -46 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved