[1] Tessendorf J. Simulating ocean water. In SIGGRAPH 2001 Course Notes, http://people.clemson.edu/~jtessen/reports. html, Oct. 2014.[2] Ohno Y, Nishibori E, Narumi T, Koishi T, Tahirov T H, Ago H, Miyano M, Himeno R, Ebisuzaki T, Sakata M, Taiji M. A 281 Tflops calculation for X-ray protein structure analysis with special-purpose computers MDGRAPE-3. In Proc. SC, Nov. 2007, Article No. 56[3] Omlor L, Giese M A. Anechoic blind source separation using wigner marginals. The Journal of Machine Learning Research, 2011, 12: 1111-1148.[4] Cooley J W, Tukey J W. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 1965, 19: 297-301.[5] Good I J. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society. Series B (Methodological), 1958, 20(2): 361-372.[6] Thomas L H. Using a computer to solve problems in physics. Applications of Digital Computers, 1963: 44-45.[7] Yavne R. An economical method for calculating the discrete Fourier transform. In Proc. AFIPS Fall Joint Comput. Conf., Dec. 1968, pp.115-125.[8] Rader C M. Discrete Fourier transforms when the number of data samples is prime. Proceedings of the IEEE, 1968, 56(6): 1107-1108.[9] Frigo M, Johnson S G. The design and implementation of FFTW3. Proceedings of the IEEE, 2005, 93(2): 216-231.[10] Ali A, Johnsson L, Subhlok J. Scheduling FFT computation on SMP and multicore systems. In Proc. the 21st ICS, Jun. 2007, pp.293-301.[11] Li Y, Zhang Y Q, Liu Y Q, Long G P, Jia H P. MPFFT: An auto-tuning FFT library for OpenCL GPUs. Journal of Computer Science and Technology, 2013, 28(1): 90-105.[12] Nukada A, Matsuoka S. Auto-tuning 3-D FFT library for CUDA GPUs. In Proc. SC, Nov. 2009, Article No. 30.[13] Ramos S, Hoefler T. Modeling communication in cachecoherent SMP systems——A case-study with Xeon Phi. In Proc. the 22nd HPDC, Jun. 2013, pp. 97-108.[14] Van Loan C. Computational Frameworks for the Fast Fourier Transform. Philadelphia USA: SIAM, 1992.[15] Takahashi D. A blocking algorithm for FFT on cache-based processors. In Proc. the 9th HPCN, Jun. 2001, pp.551-554.[16] Takahashi D. Implementation and evaluation of parallel FFT using SIMD instructions on multi-core processors. In Proc. IWIA, Jan. 2007, pp.53-59.[17] Frigo M, Leiserson C E, Prokop H, Ramachandran S. Cacheoblivious algorithms. In Proc. the 40th FOCS, Oct. 1999, pp.285-297.[18] Gu L, Li X, Siegel J. An empirically tuned 2D and 3D FFT library on CUDA GPU. In Proc. the 24th ICS, Jun. 2010, pp.305-314.[19] Nukada A, Ogata Y, Endo T, Matsuoka S. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In Proc. SC, Nov. 2008, Article No. 5.[20] Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K. Auto-tuning of fast Fourier transform on graphics processors. In Proc. the 16th PPoPP, Feb. 2011, pp.257-266.[21] Pjschel M, Moura J M F, Johnson J R, Padua D, Veloso M, Singer B W, Xiong J, Franchetti F, Gacic A, Voronenko Y, Chen K, Johnson R W, Rizzolo N. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 2005, 93(2): 232-275.[22] Caballero D, Duran A, Martorell X. An OpenMP* barrier using SIMD instructions for Intelr® Xeon PhiTM coprocessor. In Proc. the 9th IWOMP, Sept. 2013, pp.99-113.[23] Krishnaiyer R, Kultursay E, Chawla P, Preis S, Zvezdin A, Saito H. Compiler-based data prefetching and streaming nontemporal store generation for the Intelr® Xeon PhiTM coprocessor. In Proc. the 27th IPDPSW, May 2013, pp.1575-1586.[24] Franchetti F, Puschel M, Voronenko Y, Chellappa S, Moura J M. Discrete Fourier transform on multicore. IEEE Signal Processing Magazine, 2009, 26(6): 90-102.[25] Chen L, Hu Z, Lin J, Gao G R. Optimizing the fast Fourier transform on a multi-core architecture. In Proc. the 21st IPDPS, Mar. 2007.[26] Chen L, Gao G R. Performance analysis of Cooley-Tukey FFT algorithms for a many-core architecture. In Proc. SpringSim, Apr. 2010, Article No. 81.[27] Almaless G, Wajsburt F. Does shared-memory, highly multithreaded, single-application scale on many-cores? In Proc. the 4th HotPar, Jun. 2012.[28] Heinecke A, Vaidyanathan K, Smelyanskiy M, Kobotov A, Dubtsov A, Henry G, Shet A G, Chrysos G, Dubey P. Design and implementation of the Linpack benchmark for single and multi-node systems based on Intelr® Xeon PhiTM coprocessor. In Proc. the 27th IPDPS, May 2013, pp.126-137.[29] Liu X, Smelyanskiy M, Chow E, Dubey P. Effcient sparse matrix-vector multiplication on x86-based many-core processors. In Proc. the 27th ICS, Jun. 2013, pp.273-282.[30] Park J, Bikshandi G, Vaidyanathan K, Tang P T P, Dubey P, Kim D. Tera-scale 1D FFT with low-communication algorithm and Intelr® Xeon PhiTM coprocessors. In Proc. SC, Nov. 2013, Article No. 34. |