A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs
Abstract:
1. Contributions of This Paper
The contributions of this paper are as follows:
1) We propose a reuse pattern for stencil programs running on GPUs. This reuse pattern allows the circular queue method to be implemented with the GPU's on-chip registers, rather than only with indirectly addressable storage such as memory or scratchpad memory. The pattern applies to stencil applications that execute multiple consecutive time steps on chip.
2) Reducing the demand for shared memory and reducing the number of shared-memory access instructions are both important for improving the performance of programs running on GPUs. To achieve these two goals for stencil applications on GPUs, we propose two inter-thread communication schemes, SCR and MCR, respectively (a generic illustration of communication through shared memory is sketched after this list).
3) To automatically find the best placement of on-chip data between registers and shared memory, together with the best inter-thread communication scheme and its parameters, we propose a search algorithm. We also propose a pruning algorithm that effectively reduces the search space.
4) By studying four stencil applications of different dimensionalities and memory access patterns, we find that: a) our reuse pattern is applicable to a wide range of stencil applications; b) the proposed hybrid circular queue method effectively improves performance; c) the best placement of on-chip data between registers and shared memory depends on both the type of stencil and the GPU hardware parameters; and d) for the proposed reuse pattern, the register allocation algorithms in current compilers do not achieve the best allocation, and we suggest corresponding improvements.
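The extended abstract does not spell out how SCR and MCR work, so the fragment below is only a generic, assumed illustration of inter-thread communication for a stencil on a GPU: each thread keeps its own point in a register and publishes it to shared memory exactly once, so that neighboring threads can fetch it with a single shared-memory load each. The kernel name, the 3-point stencil, and its coefficients are hypothetical.

```cuda
// Generic sketch of register-to-shared-memory communication for a 1D 3-point
// stencil (illustrative only; not the paper's SCR or MCR scheme).  Each thread
// holds its own point in a register and writes it to shared memory once, so a
// neighbor's value costs one shared-memory load instead of a global-memory load.
__global__ void exchange_via_shared(const float* __restrict__ in,
                                    float* __restrict__ out, int n)
{
    extern __shared__ float row[];                  // one float per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index

    float center = (i < n) ? in[i] : 0.0f;          // register-resident value
    row[threadIdx.x] = center;                      // the only shared-memory store
    __syncthreads();
    if (i >= n) return;                             // safe: no barriers below

    // Interior threads read both neighbors from shared memory; threads on the
    // block boundary fall back to global memory (or clamp at the domain edge).
    float left  = (threadIdx.x > 0)              ? row[threadIdx.x - 1]
                                                 : (i > 0     ? in[i - 1] : center);
    float right = (threadIdx.x < blockDim.x - 1) ? row[threadIdx.x + 1]
                                                 : (i < n - 1 ? in[i + 1] : center);

    out[i] = 0.25f * left + 0.5f * center + 0.25f * right;
}
```

A launch would pass one float of dynamic shared memory per thread, e.g. `exchange_via_shared<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);`.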
2. Implementation Approach
The original circular queue is implemented in the GPU's on-chip shared memory. We instead view the circular queue's array data structure as a collection of scalar variables and, by analyzing the relationship between these scalars and the data they hold, derive a reuse pattern over the scalars that never accesses them through indirect (pointer-based) addressing. Because the circular queue is expressed with scalar variables that need no indirect access, it can be implemented with registers and shared memory at the same time.
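As a minimal sketch of this reuse pattern, assume a 1D 3-point stencil in which each thread sweeps a contiguous chunk of the grid and performs a single time step per sweep (the paper chains several such queues, one per in-flight time step; the kernel name and coefficients below are illustrative). The queue slots are the named scalars qm1, q0, and qp1, and advancing the queue is just scalar copies, so the compiler can keep every slot in a register:

```cuda
// Illustrative sketch of the scalar circular queue (assumed 3-point 1D stencil,
// one time step per sweep; not the paper's exact code).  "Rotating" the queue is
// done by copying scalars rather than by advancing an index, so no indirect
// access is needed and the slots can live in registers.
__global__ void stencil3_register_queue(const float* __restrict__ in,
                                        float* __restrict__ out, int n)
{
    // Each thread sweeps one contiguous chunk of the 1D grid.
    int total = gridDim.x * blockDim.x;
    int chunk = (n + total - 1) / total;
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    int end   = min(start + chunk, n - 1);
    if (start < 1) start = 1;
    if (start >= end) return;

    // Prime the queue: slots hold in[x-1], in[x], in[x+1] for x = start.
    float qm1 = in[start - 1];
    float q0  = in[start];
    float qp1 = in[start + 1];

    for (int x = start; x < end; ++x) {
        out[x] = 0.25f * qm1 + 0.5f * q0 + 0.25f * qp1;   // consume the queue
        // Rotate: drop the oldest slot, shift the others, load one new point.
        qm1 = q0;
        q0  = qp1;
        qp1 = (x + 2 < n) ? in[x + 2] : 0.0f;
    }
}
```

In the hybrid method, the search described below decides which of these slots stay in registers and which are moved to shared memory.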
A GPU provides two kinds of on-chip storage: registers and shared memory. The two have different characteristics, and their usage must be balanced to maximize the number of concurrently executing threads and thereby reach the best performance. For the circular queue, the best implementation typically keeps part of the data structure in registers and part in shared memory. To find the best data placement, we search over differently parameterized code versions. To shrink the search space, we gather occupancy information available at compile time and combine it with each version's parameters to rule out versions that will perform poorly, so that only a small number of versions need to be run to find the best-performing one.
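The exact pruning criterion is specific to the paper; the host-side fragment below is only a plausible sketch of the idea, assuming hypothetical device limits and a hypothetical rule that discards any candidate whose estimated occupancy falls noticeably below the best candidate's. The register and shared-memory figures per version would come from the compiler report (e.g., `ptxas -v`).

```cuda
// Host-side sketch: prune candidate code versions using compile-time occupancy.
// The device limits and the pruning rule below are assumptions for illustration.
#include <algorithm>
#include <vector>

struct Version {
    int threads_per_block;   // block size used by this version
    int regs_per_thread;     // register count reported by the compiler
    int smem_per_block;      // shared-memory bytes used per block
};

struct DeviceLimits {        // assumed values for a hypothetical GPU
    int max_threads_per_sm = 1536;
    int max_blocks_per_sm  = 8;
    int regs_per_sm        = 32768;
    int smem_per_sm        = 49152;  // bytes
};

// Estimated occupancy = resident threads per SM / maximum threads per SM.
double estimate_occupancy(const Version& v, const DeviceLimits& d)
{
    int by_threads = d.max_threads_per_sm / v.threads_per_block;
    int by_regs    = d.regs_per_sm /
                     std::max(1, v.regs_per_thread * v.threads_per_block);
    int by_smem    = v.smem_per_block > 0 ? d.smem_per_sm / v.smem_per_block
                                          : d.max_blocks_per_sm;
    int blocks = std::min({d.max_blocks_per_sm, by_threads, by_regs, by_smem});
    return double(blocks * v.threads_per_block) / d.max_threads_per_sm;
}

// Keep only versions whose estimated occupancy is within `slack` of the best,
// so that only a handful of candidates must be timed on the real hardware.
std::vector<Version> prune_by_occupancy(const std::vector<Version>& all,
                                        const DeviceLimits& d,
                                        double slack = 0.15)
{
    double best = 0.0;
    for (const auto& v : all) best = std::max(best, estimate_occupancy(v, d));
    std::vector<Version> kept;
    for (const auto& v : all)
        if (estimate_occupancy(v, d) + slack >= best) kept.push_back(v);
    return kept;
}
```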
3. Conclusions and Future Work
This paper proposes a reuse pattern for stencil programs that allows the circular queue method to be implemented with registers on GPUs. We also present a framework that automatically generates code for the hybrid circular queue method and automatically searches for the best placement of on-chip data between registers and shared memory. Experiments on four different types of stencil programs show that our method achieves speedups of up to 2.93x over an implementation that uses shared memory only.
Open problems include: 1) building a performance model to further reduce the search time; 2) extending our treatment of the register/shared-memory placement problem to more applications and developing a more general framework that helps programmers optimize GPU code.

Abstract: In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning the multiple time steps in stencil computations so that circular queues can be implemented effectively with both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93X over methods that use circular queues implemented with shared memory only.
Keywords:
- stencil computation
- circular queue
- GPU
- occupancy
- register