[1] Duhamel P, Vetterli M. Fast fourier transforms: A tutorialreview and a state of the art. Signal Processing, 1990, 9(14):259-299.[2] Govindaraju N K, Lloyd B, Dotsenko Y, Smith B, ManferdelliJ. High performance discrete Fourier transforms on graphicsprocessors. In Proc. SC, Nov. 2008, Article No.2.[3] Nukada A, Matsuoka S. Auto-tuning 3-D FFT library forCUDA GPUs. In Proc. SC, Nov. 2009, Article No.30.[4] Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K.Auto-tuning of fast Fourier transform on graphics processors.In Proc PPoPP, Feb. 2011, pp.257-266.[5] Gu L, Li X M, Siegel J. An empirically tuned 2D and 3D FFTlibrary on CUDA GPU. In Proc. the 24th ICS, June 2010,pp.305-314.[6] Gaster B, Howes L, Kaeli D R, Mistry P, Schaa D. Heteroge-neous Computing with OpenCL. San Fransisco, USA: MorganKaufmann, 2011.[7] Munshi A, Gaster B, Mattson T G, Fung J, Ginsburg D.OpenCL Programming Guide. Boston, USA: Addison-WesleyProfessional, 2011.[8] Zhang E Z, Jiang Y L, Guo Z Y, Shen X P. StreamliningGPU applications on the fly: Thread divergence eliminationthrough runtime thread-data remapping. In Proc. the 24thICS, June 2010, pp.115-126.[9] Yang Y, Xiang P, Kong J F, Zhou H Y. A GPGPU com-piler for memory optimization and parallelism management.In Proc. PLDI, June 2010, pp.86-97.[10] Cooley J W, Tukey J W. An algorithm for the machine cal-culation of complex Fourier series. Mathematics of Compu-tation, 1965, 19: 297-301.[11] Van Loan C. Computational Frameworks for the Fast FourierTransform. Philadelphia, USA: SIAM, 1992.[12] Johnson J, Johnson R W, Rodriguez D, Tolimieri R. Amethodology for designing, modifying, and implementingFourier transform algorithms on various architectures. Cir-cuits, Systems and Signal Processing, 1990, 9(4): 449-500.[13] Franchetti F, P黶chel M, Voronenko Y, Chellappa S, MouraJ M F. Discrete Fourier transform on multicore. IEEE SignalProcessing Magazine, 2009, 26(6): 90-102.[14] Tolimieri R, An M, Lu C. Algorithms for Discrete FourierTransforms and Convolution. Berlin: Springer-Verlag, 1989.[15] NVIDIA Corporation. NVIDIA CUDA compute uni-fied device architecture | Programming guide version4.2. 2008, http://developer.nvidia.com/cuda/nvidia-gpu-computing-documentation, Sept. 2012.[16] Ji F, Ma X S. Using shared memory to accelerate MapRe-duce on graphics processing units. In Proc. IPDPS, May2011, pp.805-816.[17] Jiao Y, Lin H, Balaji P, Feng W. Power and performance char-acterization of computational kernels on the GPU. In Proc.GreenCom-CPSCOM, Dec. 2010, pp.221-228.[18] Baghsorkhi S S, Delahaye M, Patel S J, Gropp W D, Hwu WW. An adaptive performance modeling tool for GPU archi-tectures. In Proc. the 15th PPoPP, May 2010, pp.105-114.[19] Schwarztrauber P N. Multiprocessor FFTs. Parallel Comput-ing, 1987, 5: 197-210.[20] Frigo M, Johnson S G. The design and implementation ofFFTW3. In Proceedings of the IEEE, 2005, 93(2): 216-231.[21] Frigo M. A fast Fourier transform compiler. In Proc. PLDI,May 1999, pp.169-180.[22] Mesmay F, Franchetti F, Voronenko Y. Encyclopedia of Par-allel Computing. Berlin: Springer, 2011.[23] de Mesmay F, Voronenko Y, P黶chel M. Offline library adap-tation using automatically generated heuristics. In Proc.IPDPS, Apr. 2010, pp.1-10.[24] Kirk D B, Hwu W W. Programming Massively Parallel Pro-cessors: A Hands-on Approach. San Fransisco, USA: MorganKaufmann, 2010.[25] Purnomo B, Rubin N, Houston M. ATI stream profiler: Atool to optimize an OpenCL kernel on ATI radeon GPUs. InProc. SIGGRAPH, July 2011, pp.26-30.[26] Mirkovic D, Johnsson S L. Automatic performance tuning inthe UHFFT library. In Proc. ICCS, May 2001, pp.71-80.[27] Ali A, Johnsson L, Mirkovic D. Empirical auto-tuning codegenerator for FFT and trigonometric transforms. In Proc.ODES, Mar. 2007.[28] Mirkovic D, Mahasoom R, Johnsson L. An adaptive softwarelibrary for fast Fourier transforms. In Proc. the 14th ICS,May 2000, pp.215-224. |