›› 2013,Vol. 28 ›› Issue (1): 90-105.doi: 10.1007/s11390-013-1314-8

所属专题: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

MPFFT: 自动调优GPU FFT库

Yan Li1,2 (李焱), Student Member, CCF, ACM, Yun-Quan Zhang1,* (张云泉), Member, CCF, ACM, IEEE Yi-Qun Liu1,2 (刘益群), Student Member, CCF, ACM, Guo-Ping Long1 (龙国平), and Hai-Peng Jia3 (贾海鹏)   

  • 收稿日期:2011-11-18 修回日期:2012-09-25 出版日期:2013-01-05 发布日期:2013-01-05
  • 基金资助:

    This work is supported in partial by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61100073, 61100066, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010902, 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice.

MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

Yan Li1,2 (李焱), Student Member, CCF, ACM, Yun-Quan Zhang1,* (张云泉), Member, CCF, ACM, IEEE Yi-Qun Liu1,2 (刘益群), Student Member, CCF, ACM, Guo-Ping Long1 (龙国平), and Hai-Peng Jia3 (贾海鹏)   

  1. 1. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
    2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China;
    3. School of Information Science and Engineering, Ocean University of China, Qingdao 266000, China
  • Received:2011-11-18 Revised:2012-09-25 Online:2013-01-05 Published:2013-01-05
  • Supported by:

    This work is supported in partial by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61100073, 61100066, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010902, 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice.

傅里叶方法在天文学、医学影像、地震学和光谱学等科学与工程领域中有着广泛的应用.快速傅里叶变换(FFT)是计算离散傅里叶变换的快速算法.GPU作为一种新兴的高性能计算体系结构,其独特的存储层次给应用程序带来了较高的性能和效率,然而GPU需要软件对其存储进行管理,其编程复杂性对开发人员带来了重大的挑战.在本文中,我们基于GPU提出了一个FFT自适应性能优化框架,并且基于该框架实现了名为MPFFT的高性能FFT库.对于计算长度为2的幂时多维FFT,MPFFT的性能在AMD GPU上远远优于clAmdFft,并且在NVIDIA GPU上与CUFFT的性能相当.此外,MPFFT也支持规模非2的幂时FFT的计算.对于3维非2的幂FFT,MPFF是FFTW在4线程时性能的1.5倍至28倍,并且在Tesla C2050上相较于CUFFT 4.0取得了20.01倍的平均加速比.

Abstract: Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPU, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves comparable performance as the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two size. For 3D non-power-of-two FFTs, our library delivers 1.5x to 28x faster than FFTW with 4 threads and 20.01x average speedup over CUFFT 4.0 on Tesla C2050.

[1] Duhamel P, Vetterli M. Fast fourier transforms: A tutorialreview and a state of the art. Signal Processing, 1990, 9(14):259-299.
[2] Govindaraju N K, Lloyd B, Dotsenko Y, Smith B, ManferdelliJ. High performance discrete Fourier transforms on graphicsprocessors. In Proc. SC, Nov. 2008, Article No.2.
[3] Nukada A, Matsuoka S. Auto-tuning 3-D FFT library forCUDA GPUs. In Proc. SC, Nov. 2009, Article No.30.
[4] Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K.Auto-tuning of fast Fourier transform on graphics processors.In Proc PPoPP, Feb. 2011, pp.257-266.
[5] Gu L, Li X M, Siegel J. An empirically tuned 2D and 3D FFTlibrary on CUDA GPU. In Proc. the 24th ICS, June 2010,pp.305-314.
[6] Gaster B, Howes L, Kaeli D R, Mistry P, Schaa D. Heteroge-neous Computing with OpenCL. San Fransisco, USA: MorganKaufmann, 2011.
[7] Munshi A, Gaster B, Mattson T G, Fung J, Ginsburg D.OpenCL Programming Guide. Boston, USA: Addison-WesleyProfessional, 2011.
[8] Zhang E Z, Jiang Y L, Guo Z Y, Shen X P. StreamliningGPU applications on the fly: Thread divergence eliminationthrough runtime thread-data remapping. In Proc. the 24thICS, June 2010, pp.115-126.
[9] Yang Y, Xiang P, Kong J F, Zhou H Y. A GPGPU com-piler for memory optimization and parallelism management.In Proc. PLDI, June 2010, pp.86-97.
[10] Cooley J W, Tukey J W. An algorithm for the machine cal-culation of complex Fourier series. Mathematics of Compu-tation, 1965, 19: 297-301.
[11] Van Loan C. Computational Frameworks for the Fast FourierTransform. Philadelphia, USA: SIAM, 1992.
[12] Johnson J, Johnson R W, Rodriguez D, Tolimieri R. Amethodology for designing, modifying, and implementingFourier transform algorithms on various architectures. Cir-cuits, Systems and Signal Processing, 1990, 9(4): 449-500.
[13] Franchetti F, P黶chel M, Voronenko Y, Chellappa S, MouraJ M F. Discrete Fourier transform on multicore. IEEE SignalProcessing Magazine, 2009, 26(6): 90-102.
[14] Tolimieri R, An M, Lu C. Algorithms for Discrete FourierTransforms and Convolution. Berlin: Springer-Verlag, 1989.
[15] NVIDIA Corporation. NVIDIA CUDA compute uni-fied device architecture | Programming guide version4.2. 2008, http://developer.nvidia.com/cuda/nvidia-gpu-computing-documentation, Sept. 2012.
[16] Ji F, Ma X S. Using shared memory to accelerate MapRe-duce on graphics processing units. In Proc. IPDPS, May2011, pp.805-816.
[17] Jiao Y, Lin H, Balaji P, Feng W. Power and performance char-acterization of computational kernels on the GPU. In Proc.GreenCom-CPSCOM, Dec. 2010, pp.221-228.
[18] Baghsorkhi S S, Delahaye M, Patel S J, Gropp W D, Hwu WW. An adaptive performance modeling tool for GPU archi-tectures. In Proc. the 15th PPoPP, May 2010, pp.105-114.
[19] Schwarztrauber P N. Multiprocessor FFTs. Parallel Comput-ing, 1987, 5: 197-210.
[20] Frigo M, Johnson S G. The design and implementation ofFFTW3. In Proceedings of the IEEE, 2005, 93(2): 216-231.
[21] Frigo M. A fast Fourier transform compiler. In Proc. PLDI,May 1999, pp.169-180.
[22] Mesmay F, Franchetti F, Voronenko Y. Encyclopedia of Par-allel Computing. Berlin: Springer, 2011.
[23] de Mesmay F, Voronenko Y, P黶chel M. Offline library adap-tation using automatically generated heuristics. In Proc.IPDPS, Apr. 2010, pp.1-10.
[24] Kirk D B, Hwu W W. Programming Massively Parallel Pro-cessors: A Hands-on Approach. San Fransisco, USA: MorganKaufmann, 2010.
[25] Purnomo B, Rubin N, Houston M. ATI stream profiler: Atool to optimize an OpenCL kernel on ATI radeon GPUs. InProc. SIGGRAPH, July 2011, pp.26-30.
[26] Mirkovic D, Johnsson S L. Automatic performance tuning inthe UHFFT library. In Proc. ICCS, May 2001, pp.71-80.
[27] Ali A, Johnsson L, Mirkovic D. Empirical auto-tuning codegenerator for FFT and trigonometric transforms. In Proc.ODES, Mar. 2007.
[28] Mirkovic D, Mahasoom R, Johnsson L. An adaptive softwarelibrary for fast Fourier transforms. In Proc. the 14th ICS,May 2000, pp.215-224.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 张钹; 张铃;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[2] 孟力明; 徐晓飞; 常会友; 陈光熙; 胡铭曾; 李生;. A Tree-Structured Database Machine for Large Relational Database Systems[J]. , 1987, 2(4): 265 -275 .
[3] 林琦; 夏培肃;. The Design and Implementation of a Very Fast Experimental Pipelining Computer[J]. , 1988, 3(1): 1 -6 .
[4] 孙成政; 慈云桂;. A New Method for Describing the AND-OR-Parallel Execution of Logic Programs[J]. , 1988, 3(2): 102 -112 .
[5] 张钹; 张恬; 张建伟; 张铃;. Motion Planning for Robots with Topological Dimension Reduction Method[J]. , 1990, 5(1): 1 -16 .
[6] 王鼎兴; 郑纬民; 杜晓黎; 郭毅可;. On the Execution Mechanisms of Parallel Graph Reduction[J]. , 1990, 5(4): 333 -346 .
[7] 周权; 魏道政;. A Complete Critical Path Algorithm for Test Generation of Combinational Circuits[J]. , 1991, 6(1): 74 -82 .
[8] 赵靓海; 刘慎权;. An Environment for Rapid Prototyping of Interactive Systems[J]. , 1991, 6(2): 135 -144 .
[9] 商陆军; 许立辉;. Notes on the Design of an Integrated Object-Oriented DBMS Family[J]. , 1991, 6(4): 389 -394 .
[10] 许建国; 郭玉钗; 林宗楷;. HEPAPS:A PCB Automatic Placement System[J]. , 1992, 7(1): 39 -46 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: