2013, Vol. 28 Issue (1): 90-105.

Special Issue: Computer Architecture and Systems

Architecture and VLSI Design

MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

Yan Li1,2 (李焱), Student Member, CCF, ACM, Yun-Quan Zhang1,* (张云泉), Member, CCF, ACM, IEEE, Yi-Qun Liu1,2 (刘益群), Student Member, CCF, ACM, Guo-Ping Long1 (龙国平), and Hai-Peng Jia3 (贾海鹏)

    1. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
    2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China;
    3. School of Information Science and Engineering, Ocean University of China, Qingdao 266000, China
  Received: 2011-11-18 Revised: 2012-09-25 Online: 2013-01-05 Published: 2013-01-05
  Supported by:

    This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61100073, and 61100066, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010902 and 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice.

Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology, and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method for computing the discrete Fourier transform. Emerging high-performance computing architectures, such as GPUs, seek much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs and implement a high-performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves performance comparable to that of the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two sizes. For 3D non-power-of-two FFTs, our library is 1.5x to 28x faster than FFTW with 4 threads and achieves an average speedup of 20.01x over CUFFT 4.0 on a Tesla C2050.

