›› 2012, Vol. 27 ›› Issue (1): 57-74.doi: 10.1007/s11390-012-1206-3

Special Issue: Computer Architecture and Systems

• Architecture and VLSI Design • Previous Articles     Next Articles

A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

Yang Yang1,2 (杨杨), Hui-Min Cui1,2 (崔慧敏), Xiao-Bing Feng1 (冯晓兵), Member, CCF and Jing-Ling Xue3 (薛京灵), Senior Member, IEEE   

  1. 1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China;
    2. Graduate University of Chinese Academy of Sciences, Beijing 100190, China;
    3. Programming Languages and Compilers Group, School of Computer Science and Engineering University of New South Wales, Sydney, NSW 2052, Australia
  • Received:2011-05-09 Revised:2011-10-08 Online:2012-01-05 Published:2012-01-05
  • Supported by:

    Supported in part by the National Basic Research 973 Program of China under Grant Nos. 2011CB302504 and 2011ZX01028-001-002, the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01A129, the National Natural Science Foundation of China (NSFC) under Grant No. 60970024, and the Innovation Research Group of NSFC under Grant No. 60921002.

In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared-memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning across the multiple time steps in stencil computations so that circular queues can be implemented by both shared memory and registers effectively in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups up to 2.93X over methods that use circular queues implemented with shared-memory only.

[1] Wonnacott D. Achieving scalable locality with time skewing.Int. J. Parallel Program, 2002, 30(3): 181-221.

[2] Mccalpin J, Wonnacott D. Time skewing: A value-based ap-proach to optimizing for memory locality. Technical ReportDCS-TR-379, Department of Computer Science, Rugers Uni-versity. 1999.

[3] Strzodka R, Shaheen M, Pajak D et al. Cache obliviousparallelograms in iterative stencil computations. In Proc.the 24th ACM Int. Conf. Supercomputing, Tsukuba, Japan,Jun. 1-4, 2010, pp.49-59.

[4] Song Y, Li Z. New tiling techniques to improve cache temporallocality. In Proc. ACM SIGPLAN Conference on Program-ming Language Design and Implementation, Atlanta, USA,May 1-4, 1999, pp.215-228.

[5] Jin G, Mellor-Crummey J, Fowler R. Increasing tempo-ral locality with skewing and recursive blocking. In Proc.ACM/IEEE Conference on Supercomputing, Denver, USA,Nov. 10-16, 2001, pp.43-43.

[6] Datta K, Murphy M, Volkov V et al. Stencil computation op-timization and auto-tuning on state-of-the-art multicore ar-chitectures. In Proc. ACM/IEEE Conference on Supercom-puting, Austin, USA, Nov.15-21, 2008, Article 4.

[7] Williams S, Shalf J, Oliker L et al. Scientific computing Ker-nels on the cell processor. Int. J. Parallel Program, 2007,35(3): 263-298.

[8] Meng J, Skadron K. Performance modeling and automaticghost zone optimization for iterative stencil loops on GPUs.In Proc. the 23rd International Conference on Supercomput-ing, Yorktown Heights, USA, Jun. 8-12, 2009, pp.256-265.

[9] NVIDIA. NVIDIA CUDA programming guide 3.0, http://de-veloper.download.nvidia.com/compute/cuda/3 0/toolkit/do-cs/NVIDIA CUDA ProgrammingGuide-pdf, 2010.

[10] NVIDIA Corp. CUDA Occupancy Calculator, 2010.

[11] Hong S, Kim H. An analytical model for a GPU architecturewith memory-level and thread-level parallelism awareness. InProc. the 36th Annual Int. Symp. Computer Architecture,Austin, USA, Jun. 20-24, 2009, pp.152-163.

[12] Baghsorkhi S S, Delahaye M, Patel S J, Gropp W D, Hwu WW. An adaptive performance modeling tool for GPU archi-tectures. In Proc. the 15th ACM SIGPLAN Symposium onPrinciples and Practice of Parallel Programming, Bangalore,India, Jan. 9-14, 2010, pp.105-114.

[13] van der Laan W J. Decuda. http://wiki.github.com/laanwj/decuda/, Sept., 2010.

[14] Moazeni M, Bui A, Sarrafzadeh M. A memory optimizationtechnique for software-managed scratchpad memory in GPUs.In Proc. the 7th IEEE Symposium on Application SpecificProcessors, San Francisco, USA, Jul. 27-28, 2009, pp.43-49.

[15] Carrillo S, Siegel J, Li X. A control-structure splitting opti-mization for GPGPU. In Proc. the 6th ACM Conf. Comput-ing Frontiers, Ischia, Italy, May 18-20, 2009, pp.147-150.

[16] Park S J, Ross J, Shires D, Richie D, Henz B, Nguyen L. Hy-brid core acceleration of UWB SIRE radar signal processing.IEEE Trans. Parallel Distrib. Syst, 2011, 22(1): 46-57.

[17] Gu L, Li X, Siegel J. An empirically tuned 2D and 3D FFTlibrary on CUDA GPU. In Proc. the 24th ACM Int. Conf.Supercomputing, Tsukuba, Japan, Jun. 1-4, 2010, pp.305-314.

[18] Goorts P, Rogmans S, Bekaert P. Optimal data distribu-tion for versatile finite impulse response filtering on next-generation graphics hardware using CUDA. In Proc. the 15thInternational Conference on Parallel and Distributed Sys-tems, Shenzhen, China, Dec. 9-11, 2009, pp.300-307.

[19] Bailey P, Myre J, Walsh S D C, Lilja D J, Saar M O. Accele-rating lattice Boltzmann fluid flow simulations using graphicsprocessors. In Proc. International Conference on ParallelProcessing, Vienna, Austria, Sep. 22-25, 2009, pp.550-557.

[20] Venkatasubramanian S, Vuduc R W. Tuned and wildly asyn-chronous stencil kernels for hybrid CPU/GPU systems. InProc. the 23rd International Conference on Supercomputing,Yorktown Heights, USA, Jun. 8-12, 2009, pp.244-255.

[21] Christen M, Schenk O, Neufeld E et al. Parallel data-localityaware stencil computations on modern micro-architectures.In Proc. IEEE Int. Symp. Parallel & Distributed Process-ing, Rome, Italy, May 23-29, 2009, pp.1-10.

[22] Micikevicius P. 3D finite difference computation on GPUs us-ing CUDA. In Proc. the 2nd Workshop on General PurposeProcessing on Graphics Processing Units, Washington, USA,Mar. 8, 2009, pp.79-84.

[23] Di P, Wan Q, Zhang X et al. Toward harnessing DOACROSSparallelism for multi-GPGPUs. In Proc. the 39th Int. Conf.Parallel Processing, San Diego, USA, Sep. 13-16, 2010, pp.40-50.

[24] Christen M, Schenk O, Burkhart H. Patus: A code genera-tion and autotuning framework for parallel iterative stencilcomputations on modern microarchitectures. In Proc. IEEEInternational Parallel & Distributed Processing Symposium,Anchorage, USA, May 16-20, 2011, pp.676-687.

[25] Nguyen A, Satish N, Chhugani J, Kim C, Dubey P. 3.5-Dblocking optimization for stencil computations on modernCPUs and GPUs. In Proc. ACM/IEEE Int. Conf. for HighPerformance Computing, Networking, Storage and Analysis,New Orleans, USA, Nov. 13-19, 2010, pp.1-13.

[26] Qasem A, Kennedy K. Profitable loop fusion and tiling us-ing model-driven empirical search. In Proc. the 20th AnnualInternational Conference on Supercomputing, Cairns, Aus-tralia, Jun. 28-Jul. 1, 2006, pp.249-258.

[27] Knijnenburg P M W, Kisuki T, Gallivan K et al. The effect ofcache models on iterative compilation for combined tiling andunrolling: Research articles. Concurrency and Computation:Practice & Experience, 2004, 16(2-3): 247-270.

[28] Yotov K, Li X, Ren G et al. A comparison of empirical andmodel-driven optimization. In Proc. the ACM SIGPLAN2003 Conference on Programming Language Design and Im-plementation, San Diego, USA, Jun. 8-11, 2003, pp.63-76.

[29] Chen C, Chame J, Hall M. Combining models and guidedempirical search to optimize for multiple levels of the mem-ory hierarchy. In Proc. Int. Symp. Code Generation andOptimization, San Jose, USA, Mar. 20-23, 2005, pp.111-122.

[30] Epshteyn A, Garzaran M, DeJong G et al. Analytical modelsand empirical search: A hybrid approach to code optimiza-tion. In Proc. the 18th International Workshop on Lan-guages and Compilers for Parallel Computing, Hawthorne,USA, Oct. 20-22, 2005, pp.259-273.

[31] Agakov F, Bonilla E, Cavazos J et al. Using machine learningto focus iterative optimization. In Proc. Int. Symp. CodeGeneration and Optimization, New York, USA, Mar. 26-29,2006, pp.295-305.

[32] Almagor L, Cooper K D, Grosul A, Harvey T J, Reeves S W,Subramanian D, Torczon L, Waterman T. Finding effectivecompilation sequences. In Proc. ACM SIGPLAN/SIGBEDConference on Languages, Compilers, and Tools for Embed-ded Systems, Washington, USA, Jun. 11-13, 2004, pp.231-239.

[33] Vaswani K, Thazhuthaveetil M J, Srikant Y N et al. Mi-croarchitecture sensitive empirical models for compiler opti-mizations. In Proc. Int, Symp, Code Generation and Opti-mization, San Jose, USA, Mar. 11-14, 2007, pp.131-143.

[34] Ryoo S, Rodrigues C I, Stone S S et al. Program optimiza-tion space pruning for a multithreaded gpu. In Proc. the6th Annual IEEE/ACM Int. Symp. Code Generation andOptimization, Boston, USA, Apr. 6-9, 2008, pp.195-204.

[35] Choi J W, Singh A, Vuduc R W. Model-driven autotuning ofsparse matrix-vector multiply on GPUs. In Proc. the 15thACM SIGPLAN Symp. Principles and Practice of ParallelProgramming, Bangalore, India, Jan. 9-14, 2010, pp.115-126.

[36] Lam M. Software pipelining: An effective scheduling tech-nique for VLIW machines. In Proc. ACM SIGPLAN Confe-rence on Programming Language Design and Implementa-tion, Atlanta, USA, Jun. 20-24, 1988, pp.318-328.

[37] Callahan D, Carr S, Kennedy K. Improving register allocationfor subscripted variables. In Proc. ACM SIGPLAN Confer-ence on Programming Language Design and Implementation,White Plains, USA, Jun. 20-22, 1990, pp.53-65.
No related articles found!
Full text



[1] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] Chen Shihua;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] Min Yinghua; Han Zhide;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] Tang Tonggao; Zhao Zhaokeng;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] Min Yinghua;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] Zhu Hong;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] Li Minghui;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved