Citation: Wen-Jing Ma, Kan Gao, Guo-Ping Long. Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs[J]. Journal of Computer Science and Technology, 2016, 31(6): 1262-1274. DOI: 10.1007/s11390-016-1696-5
[1] Zhang Y, Mueller F. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In Proc. the 10th Int. Symp. Code Generation and Optimization, Mar. 2012, pp.155-164.
[2] Holewinski J, Pouchet L, Sadayappan P. High-performance code generation for stencil computations on GPU architectures. In Proc. the 26th ACM Int. Conf. Supercomputing, Jun. 2012, pp.311-320.
[3] Lutz T, Fensch C, Cole M. PARTANS: An autotuning framework for stencil computation on multi-GPU systems. ACM Trans. Archit. Code Optim., 2013, 9(4):59:1-59:24.
[4] Krotkiewski M, Dabrowski M. Efficient 3D stencil computations using CUDA. Parallel Computing, 2013, 39(10):533-548.
[5] Micikevicius P. 3D finite difference computation on GPUs using CUDA. In Proc. the 2nd Workshop on General Purpose Processing on Graphics Processing Units, Mar. 2009, pp.79-84.
[6] Nguyen A, Satish N, Chhugani J, Kim C, Dubey P. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proc. Int. Conf. High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2010, pp.1-13.
[7] Fan Z. Vectorization Theory. China Science Press, 1988. (in Chinese)
[8] Allen J, Kennedy K. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
[9] Cohen A, Sigler M, Girbal S, Temam O, Parello D, Vasilache N. Facilitating the search for compositions of program transformations. In Proc. the 19th Int. Conf. Supercomputing, Jun. 2005, pp.151-160.
[10] Pouchet L. Iterative optimization in the polyhedral model [Ph.D. Thesis]. University of Paris-Sud 11, Orsay, France, Jan. 2010.
[11] Deitz S, Chamberlain B, Snyder L. Eliminating redundancies in sum-of-product array computations. In Proc. the 15th International Conference on Supercomputing, Jun. 2001, pp.65-77.
[12] Basu P, Hall M, Williams S, Van Straalen B et al. Compiler-directed transformation for higher-order stencils. In Proc. the 29th Int. Parallel & Distributed Processing Symp., May 2015, pp.313-323.
[13] Größlinger A. Precise management of scratchpad memories for localising array accesses in scientific codes. In Proc. the 18th Int. Conf. Compiler Construction, Mar. 2009, pp.236-250.
[14] Issenin I, Brockmeyer E, Miranda M, Dutt N. DRDU: A data reuse analysis technique for efficient scratch-pad memory management. ACM Trans. Des. Autom. Electron. Syst., 2007, 12(2):Article No. 15.
[15] Ma W, Agrawal G. An integer programming framework for optimizing shared memory use on GPUs. In Proc. the 17th IEEE Int. Conf. High Performance Computing, Dec. 2010.
[16] Stock K, Kong M, Grosser T, Pouchet L, Rastello F, Ramanujam J, Sadayappan P. A framework for enhancing data reuse via associative reordering. In Proc. the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, Jun. 2014, pp.65-76.
[17] Tseng H, Tullsen D. Eliminating redundant computation and exposing parallelism through data-triggered threads. IEEE Micro, 2012, 32(3):38-47.
[18] Tseng H, Tullsen D. Software data-triggered threads. In Proc. the ACM International Conference on Object Oriented Programming Systems Languages and Applications, Oct. 2012, pp.703-716.
[19] Long G, Franklin D, Biswas S, Ortiz P, Oberg J, Fan D, Chong F. Minimal multi-threading: Finding and removing redundant instructions in multi-threaded processors. In Proc. the 43rd IEEE/ACM Int. Symp. Microarchitecture, Dec. 2010, pp.337-348.
[20] Ding Y, Li Z. A compiler scheme for reusing intermediate computation results. In Proc. Int. Symp. Code Generation and Optimization, Mar. 2004, pp.277-288.
[21] Hammer M, Acar U, Chen Y. CEAL: A C-based language for self-adjusting computation. In Proc. the ACM SIGPLAN Conf. Programming Language Design and Implementation, Jun. 2009, pp.25-37.
[22] Gautam, Rajopadhye S. Simplifying reductions. In Proc. the 33rd ACM SIGPLAN-SIGACT Symp. Principles of Programming Languages, Jan. 2006, pp.30-41.
[23] Fan Z. Investigation on vectorization problem. In Proc. China-US Symp. Computer Software Engineering, Apr. 1982.
[24] Su H, Wu N, Wen M, Zhang C, Cai X. On the GPU performance of 3D stencil computations implemented in OpenCL. In Lecture Notes in Computer Science 7905, Kunkel J, Ludwig T, Meuer H W (eds.), Springer Berlin Heidelberg, 2013, pp.125-135.
[25] Datta K, Murphy M, Volkov V, Williams S, Carter J, Oliker L, Patterson D, Shalf J, Yelick K. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proc. the 2008 ACM/IEEE Conference on Supercomputing, Mar. 2008.
[26] Luo Y, Tan G, Mo Z, Sun N. FAST: A fast stencil autotuning framework based on an optimal-solution space model. In Proc. the 29th ACM Int. Conf. Supercomputing, Jun. 2015, pp.187-196.
[27] Meng J, Skadron K. A performance study for iterative stencil loops on GPUs with ghost zone optimizations. International Journal of Parallel Programming, 2011, 39(1):115-142.
[28] Yang Y, Cui H, Feng X, Xue J. A hybrid circular queue method for iterative stencil computations on GPUs. Journal of Computer Science and Technology, 2012, 27(1):57-74.
[29] Cecilia J, García J, Ujaldón M. CUDA 2D stencil computations for the Jacobi method. In Proc. the 10th International Conference on Applied Parallel and Scientific Computing-Volume Part I, Jun. 2012, pp.173-183.
[30] Kurzak J, Bader D, Dongarra J. Scientific Computing with Multicore and Accelerators (1st edition). CRC Press, 2010.
[31] Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F, Amarasinghe S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proc. the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, Jun. 2013, pp.519-530.