›› 2014,Vol. 29 ›› Issue (6): 989-1002.doi: 10.1007/s11390-014-1484-z

所属专题: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

英特尔至强融核协处理器上高效访存的三维FFT算法

Yi-Qun Liu1,2(刘益群), Yan Li3(李焱), Yun-Quan Zhang4(张云泉), Member, CCF, ACM, IEEE, Xian-Yi Zhang1,2(张先轶)   

  1. 1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    4 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China
  • 收稿日期:2013-12-27 修回日期:2014-07-25 出版日期:2014-11-05 发布日期:2014-11-05
  • 作者简介:Yi-Qun Liu is pursuing her Ph.D. degree in Institute of Software, Chinese Academy of Sciences, Beijing. Her research mainly concentrates on code generation, software adaption, and the design and optimization of some math algorithms such as FFT and HPCG, on heterogeneous manycore architectures.
  • 基金资助:

    This work was supported by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61221062, 61402441, 61432018, the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice under Grant No. 11000GBF01.

Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon PhiTM Coprocessor

Yi-Qun Liu1,2(刘益群), Yan Li3(李焱), Yun-Quan Zhang4(张云泉), Member, CCF, ACM, IEEE, Xian-Yi Zhang1,2(张先轶)   

  1. 1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    4 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China
  • Received:2013-12-27 Revised:2014-07-25 Online:2014-11-05 Published:2014-11-05
  • About author:Yi-Qun Liu is pursuing her Ph.D. degree in Institute of Software, Chinese Academy of Sciences, Beijing. Her research mainly concentrates on code generation, software adaption, and the design and optimization of some math algorithms such as FFT and HPCG, on heterogeneous manycore architectures.
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61221062, 61402441, 61432018, the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice under Grant No. 11000GBF01.

英特尔集成众核(MIC)架构作为一种新兴的高性能计算体系结构,它配备了512位宽的向量化指令集和大量的计算核心,同时能提供高效的浮点计算能力和大量的访存带宽.三维FFT是一个被广泛研究的算法.传统的算法将三维FFT问题转化为每个维度上的一维FFT的计算,从而要求对整个数组遍历三次,同时也会产生很多有跨步的内存访问.在本文中,我们提出了一个只需对整个内存数组遍历两次的三维FFT算法,目的是为了最大限度地减少内存和片上高速缓存之间的数据传输量.其主要思想是对一个维度进行分解,然后将子维度的计算分别合并到另外两个维度的计算中.由于对不同的维度进行分解计算会产生不同的L2 TLB缺失次数,在文章中我们对此进行了仔细的分析.我们在该众核系统上采用了一种分层的多线程并行方法,以获得高并行度的同时,增强本地cache中数据的重用.此外,我们还采用了cache分块、内存填充、循环变换和向量化的方法进行优化.在Intel Xeon Phi协处理器平台上,我们获得了最高136 Gflops的性能,达到了该平台上Intel MKL性能的2.22倍,同时也是Intel Xeon E5-2680平台上Intel MKL性能的1.80倍.

Abstract: Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel® Many Integrated Core (MIC) Architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of three dimensions, giving rise to plenty of non-unit strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the rest dimensions respectively. The difference in amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse of local cache. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel® Xeon PhiTM coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in o2oad mode, which beats the vendor-specific Intel® MKL library by a factor of up to 2.22X.

[1] Tessendorf J. Simulating ocean water. In SIGGRAPH 2001 Course Notes, http://people.clemson.edu/~jtessen/reports. html, Oct. 2014.

[2] Ohno Y, Nishibori E, Narumi T, Koishi T, Tahirov T H, Ago H, Miyano M, Himeno R, Ebisuzaki T, Sakata M, Taiji M. A 281 Tflops calculation for X-ray protein structure analysis with special-purpose computers MDGRAPE-3. In Proc. SC, Nov. 2007, Article No. 56

[3] Omlor L, Giese M A. Anechoic blind source separation using wigner marginals. The Journal of Machine Learning Research, 2011, 12: 1111-1148.

[4] Cooley J W, Tukey J W. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 1965, 19: 297-301.

[5] Good I J. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society. Series B (Methodological), 1958, 20(2): 361-372.

[6] Thomas L H. Using a computer to solve problems in physics. Applications of Digital Computers, 1963: 44-45.

[7] Yavne R. An economical method for calculating the discrete Fourier transform. In Proc. AFIPS Fall Joint Comput. Conf., Dec. 1968, pp.115-125.

[8] Rader C M. Discrete Fourier transforms when the number of data samples is prime. Proceedings of the IEEE, 1968, 56(6): 1107-1108.

[9] Frigo M, Johnson S G. The design and implementation of FFTW3. Proceedings of the IEEE, 2005, 93(2): 216-231.

[10] Ali A, Johnsson L, Subhlok J. Scheduling FFT computation on SMP and multicore systems. In Proc. the 21st ICS, Jun. 2007, pp.293-301.

[11] Li Y, Zhang Y Q, Liu Y Q, Long G P, Jia H P. MPFFT: An auto-tuning FFT library for OpenCL GPUs. Journal of Computer Science and Technology, 2013, 28(1): 90-105.

[12] Nukada A, Matsuoka S. Auto-tuning 3-D FFT library for CUDA GPUs. In Proc. SC, Nov. 2009, Article No. 30.

[13] Ramos S, Hoefler T. Modeling communication in cachecoherent SMP systems——A case-study with Xeon Phi. In Proc. the 22nd HPDC, Jun. 2013, pp. 97-108.

[14] Van Loan C. Computational Frameworks for the Fast Fourier Transform. Philadelphia USA: SIAM, 1992.

[15] Takahashi D. A blocking algorithm for FFT on cache-based processors. In Proc. the 9th HPCN, Jun. 2001, pp.551-554.

[16] Takahashi D. Implementation and evaluation of parallel FFT using SIMD instructions on multi-core processors. In Proc. IWIA, Jan. 2007, pp.53-59.

[17] Frigo M, Leiserson C E, Prokop H, Ramachandran S. Cacheoblivious algorithms. In Proc. the 40th FOCS, Oct. 1999, pp.285-297.

[18] Gu L, Li X, Siegel J. An empirically tuned 2D and 3D FFT library on CUDA GPU. In Proc. the 24th ICS, Jun. 2010, pp.305-314.

[19] Nukada A, Ogata Y, Endo T, Matsuoka S. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In Proc. SC, Nov. 2008, Article No. 5.

[20] Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K. Auto-tuning of fast Fourier transform on graphics processors. In Proc. the 16th PPoPP, Feb. 2011, pp.257-266.

[21] Pjschel M, Moura J M F, Johnson J R, Padua D, Veloso M, Singer B W, Xiong J, Franchetti F, Gacic A, Voronenko Y, Chen K, Johnson R W, Rizzolo N. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 2005, 93(2): 232-275.

[22] Caballero D, Duran A, Martorell X. An OpenMP* barrier using SIMD instructions for Intelr® Xeon PhiTM coprocessor. In Proc. the 9th IWOMP, Sept. 2013, pp.99-113.

[23] Krishnaiyer R, Kultursay E, Chawla P, Preis S, Zvezdin A, Saito H. Compiler-based data prefetching and streaming nontemporal store generation for the Intelr® Xeon PhiTM coprocessor. In Proc. the 27th IPDPSW, May 2013, pp.1575-1586.

[24] Franchetti F, Puschel M, Voronenko Y, Chellappa S, Moura J M. Discrete Fourier transform on multicore. IEEE Signal Processing Magazine, 2009, 26(6): 90-102.

[25] Chen L, Hu Z, Lin J, Gao G R. Optimizing the fast Fourier transform on a multi-core architecture. In Proc. the 21st IPDPS, Mar. 2007.

[26] Chen L, Gao G R. Performance analysis of Cooley-Tukey FFT algorithms for a many-core architecture. In Proc. SpringSim, Apr. 2010, Article No. 81.

[27] Almaless G, Wajsburt F. Does shared-memory, highly multithreaded, single-application scale on many-cores? In Proc. the 4th HotPar, Jun. 2012.

[28] Heinecke A, Vaidyanathan K, Smelyanskiy M, Kobotov A, Dubtsov A, Henry G, Shet A G, Chrysos G, Dubey P. Design and implementation of the Linpack benchmark for single and multi-node systems based on Intelr® Xeon PhiTM coprocessor. In Proc. the 27th IPDPS, May 2013, pp.126-137.

[29] Liu X, Smelyanskiy M, Chow E, Dubey P. Effcient sparse matrix-vector multiplication on x86-based many-core processors. In Proc. the 27th ICS, Jun. 2013, pp.273-282.

[30] Park J, Bikshandi G, Vaidyanathan K, Tang P T P, Dubey P, Kim D. Tera-scale 1D FFT with low-communication algorithm and Intelr® Xeon PhiTM coprocessors. In Proc. SC, Nov. 2013, Article No. 34.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 陈世华;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[2] 刘明业; 洪恩宇;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[3] 陈世华;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[4] 高庆狮; 张祥; 杨树范; 陈树清;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[5] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[6] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[7] 闵应骅; 韩智德;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[8] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[9] 闵应骅;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[10] 朱鸿;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: