|
›› 2013,Vol. 28 ›› Issue (1): 90-105.doi: 10.1007/s11390-013-1314-8
所属专题: Computer Architecture and Systems
• Special Section on Selected Paper from NPC 2011 • 上一篇 下一篇
Yan Li1,2 (李焱), Student Member, CCF, ACM, Yun-Quan Zhang1,* (张云泉), Member, CCF, ACM, IEEE Yi-Qun Liu1,2 (刘益群), Student Member, CCF, ACM, Guo-Ping Long1 (龙国平), and Hai-Peng Jia3 (贾海鹏)
Yan Li1,2 (李焱), Student Member, CCF, ACM, Yun-Quan Zhang1,* (张云泉), Member, CCF, ACM, IEEE Yi-Qun Liu1,2 (刘益群), Student Member, CCF, ACM, Guo-Ping Long1 (龙国平), and Hai-Peng Jia3 (贾海鹏)
傅里叶方法在天文学、医学影像、地震学和光谱学等科学与工程领域中有着广泛的应用.快速傅里叶变换(FFT)是计算离散傅里叶变换的快速算法.GPU作为一种新兴的高性能计算体系结构,其独特的存储层次给应用程序带来了较高的性能和效率,然而GPU需要软件对其存储进行管理,其编程复杂性对开发人员带来了重大的挑战.在本文中,我们基于GPU提出了一个FFT自适应性能优化框架,并且基于该框架实现了名为MPFFT的高性能FFT库.对于计算长度为2的幂时多维FFT,MPFFT的性能在AMD GPU上远远优于clAmdFft,并且在NVIDIA GPU上与CUFFT的性能相当.此外,MPFFT也支持规模非2的幂时FFT的计算.对于3维非2的幂FFT,MPFF是FFTW在4线程时性能的1.5倍至28倍,并且在Tesla C2050上相较于CUFFT 4.0取得了20.01倍的平均加速比.
[1] Duhamel P, Vetterli M. Fast fourier transforms: A tutorialreview and a state of the art. Signal Processing, 1990, 9(14):259-299.[2] Govindaraju N K, Lloyd B, Dotsenko Y, Smith B, ManferdelliJ. High performance discrete Fourier transforms on graphicsprocessors. In Proc. SC, Nov. 2008, Article No.2.[3] Nukada A, Matsuoka S. Auto-tuning 3-D FFT library forCUDA GPUs. In Proc. SC, Nov. 2009, Article No.30.[4] Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K.Auto-tuning of fast Fourier transform on graphics processors.In Proc PPoPP, Feb. 2011, pp.257-266.[5] Gu L, Li X M, Siegel J. An empirically tuned 2D and 3D FFTlibrary on CUDA GPU. In Proc. the 24th ICS, June 2010,pp.305-314.[6] Gaster B, Howes L, Kaeli D R, Mistry P, Schaa D. Heteroge-neous Computing with OpenCL. San Fransisco, USA: MorganKaufmann, 2011.[7] Munshi A, Gaster B, Mattson T G, Fung J, Ginsburg D.OpenCL Programming Guide. Boston, USA: Addison-WesleyProfessional, 2011.[8] Zhang E Z, Jiang Y L, Guo Z Y, Shen X P. StreamliningGPU applications on the fly: Thread divergence eliminationthrough runtime thread-data remapping. In Proc. the 24thICS, June 2010, pp.115-126.[9] Yang Y, Xiang P, Kong J F, Zhou H Y. A GPGPU com-piler for memory optimization and parallelism management.In Proc. PLDI, June 2010, pp.86-97.[10] Cooley J W, Tukey J W. An algorithm for the machine cal-culation of complex Fourier series. Mathematics of Compu-tation, 1965, 19: 297-301.[11] Van Loan C. Computational Frameworks for the Fast FourierTransform. Philadelphia, USA: SIAM, 1992.[12] Johnson J, Johnson R W, Rodriguez D, Tolimieri R. Amethodology for designing, modifying, and implementingFourier transform algorithms on various architectures. Cir-cuits, Systems and Signal Processing, 1990, 9(4): 449-500.[13] Franchetti F, P黶chel M, Voronenko Y, Chellappa S, MouraJ M F. Discrete Fourier transform on multicore. IEEE SignalProcessing Magazine, 2009, 26(6): 90-102.[14] Tolimieri R, An M, Lu C. Algorithms for Discrete FourierTransforms and Convolution. Berlin: Springer-Verlag, 1989.[15] NVIDIA Corporation. NVIDIA CUDA compute uni-fied device architecture | Programming guide version4.2. 2008, http://developer.nvidia.com/cuda/nvidia-gpu-computing-documentation, Sept. 2012.[16] Ji F, Ma X S. Using shared memory to accelerate MapRe-duce on graphics processing units. In Proc. IPDPS, May2011, pp.805-816.[17] Jiao Y, Lin H, Balaji P, Feng W. Power and performance char-acterization of computational kernels on the GPU. In Proc.GreenCom-CPSCOM, Dec. 2010, pp.221-228.[18] Baghsorkhi S S, Delahaye M, Patel S J, Gropp W D, Hwu WW. An adaptive performance modeling tool for GPU archi-tectures. In Proc. the 15th PPoPP, May 2010, pp.105-114.[19] Schwarztrauber P N. Multiprocessor FFTs. Parallel Comput-ing, 1987, 5: 197-210.[20] Frigo M, Johnson S G. The design and implementation ofFFTW3. In Proceedings of the IEEE, 2005, 93(2): 216-231.[21] Frigo M. A fast Fourier transform compiler. In Proc. PLDI,May 1999, pp.169-180.[22] Mesmay F, Franchetti F, Voronenko Y. Encyclopedia of Par-allel Computing. Berlin: Springer, 2011.[23] de Mesmay F, Voronenko Y, P黶chel M. Offline library adap-tation using automatically generated heuristics. In Proc.IPDPS, Apr. 2010, pp.1-10.[24] Kirk D B, Hwu W W. Programming Massively Parallel Pro-cessors: A Hands-on Approach. San Fransisco, USA: MorganKaufmann, 2010.[25] Purnomo B, Rubin N, Houston M. ATI stream profiler: Atool to optimize an OpenCL kernel on ATI radeon GPUs. InProc. SIGGRAPH, July 2011, pp.26-30.[26] Mirkovic D, Johnsson S L. Automatic performance tuning inthe UHFFT library. In Proc. ICCS, May 2001, pp.71-80.[27] Ali A, Johnsson L, Mirkovic D. Empirical auto-tuning codegenerator for FFT and trigonometric transforms. In Proc.ODES, Mar. 2007.[28] Mirkovic D, Mahasoom R, Johnsson L. An adaptive softwarelibrary for fast Fourier transforms. In Proc. the 14th ICS,May 2000, pp.215-224. |
No related articles found! |
版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn 总访问量: |