Wen-Jing Ma, Kan Gao, Guo-Ping Long. Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs[J]. Journal of Computer Science and Technology, 2016, 31(6): 1262-1274. DOI: 10.1007/s11390-016-1696-5

Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs

Funds: This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, and the National Natural Science Foundation of China under Grant No. 61303059.
  • Corresponding author: Guo-Ping Long, E-mail: guoping@iscas.ac.cn

  • Received Date: October 14, 2015
  • Revised Date: July 06, 2016
  • Published Date: November 04, 2016
  • Computation reuse is known as an effective optimization technique. However, due to the complexity of modern GPU architectures, there is not yet enough understanding of how the interplay between computation reuse and hardware specifics affects application performance. In this paper, we propose an automatic code generator for a class of stencil codes with inherent computation reuse on GPUs. For such applications, the proper reuse of intermediate results, combined with careful register and on-chip local memory usage, has profound implications for performance. The current state of the art does not address this problem in depth, partially due to the lack of a good program representation that can expose all potential computation reuse. We leverage the computation overlap graph (COG), a simple representation of data dependence and data reuse with an "element view", to expose potential reuse opportunities. Using COG, we propose a portable code generation and tuning framework for GPUs. Compared with current state-of-the-art code generators, our experimental results show up to 56.7% performance improvement on modern GPUs such as the NVIDIA C2050.
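
As a rough illustration of what computation reuse means for a stencil, consider a 3x3 box sum on a 2D grid. This is only a sketch in plain Python, not the paper's COG representation or its GPU code generator; the function names `box3x3_naive` and `box3x3_reuse` are hypothetical and chosen for this example. Neighboring output points share vertical 3-point partial sums, so computing each partial sum once and reusing it replaces roughly eight additions per point with about four.

```python
def box3x3_naive(a):
    """Sum the 3x3 neighborhood of every interior point independently."""
    n, m = len(a), len(a[0])
    out = [[0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = sum(
                a[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)
            )
    return out


def box3x3_reuse(a):
    """Same result, but vertical partial sums are computed once and shared."""
    n, m = len(a), len(a[0])
    # Intermediate result: 3-point column sums, one per grid element in
    # each interior row. Each is later read by up to three output points.
    col = [[0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(m):
            col[i][j] = a[i - 1][j] + a[i][j] + a[i + 1][j]
    out = [[0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Combine three shared column sums instead of nine inputs.
            out[i][j] = col[i][j - 1] + col[i][j] + col[i][j + 1]
    return out


if __name__ == "__main__":
    grid = [[(3 * i + 7 * j) % 11 for j in range(6)] for i in range(5)]
    print(box3x3_naive(grid) == box3x3_reuse(grid))
```

On a GPU, the shared column sums would have to be held in registers or on-chip local memory, which is where the trade-off between reuse and hardware resources mentioned in the abstract comes from.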