2016, Vol. 31, Issue (2): 235-252. doi: 10.1007/s11390-016-1624-8

Special Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •


Pragma Directed Shared Memory Centric Optimizations on GPUs

Jing Li(李晶)1,2, Member, CCF, Lei Liu(刘雷)1, Member, CCF, Yuan Wu(吴远)3, Xiang-Hua Liu(刘向华)3, Yi Gao(高翊)3, Xiao-Bing Feng(冯晓兵)1, Member, CCF, ACM, IEEE, and Cheng-Yong Wu(吴承勇)1, Senior Member, CCF, Member, ACM   

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2. University of Chinese Academy of Sciences, Beijing 100049, China;
    3. Beijing Samsung Telecom Research and Development Center, Beijing 100028, China
  • Received: 2015-01-04; Revised: 2015-08-25; Online: 2016-03-05; Published: 2016-03-05
  • About author: Jing Li received her B.S. degree in software engineering from Wuhan University, Wuhan, in 2012. Currently she is a Ph.D. candidate at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing. Her research interests include programming languages and optimization on GPUs.
  • Supported by:

    This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, the National Natural Science Foundation of China (NSFC) under Grant No. 61432018, and the Innovation Research Group of NSFC under Grant No. 61221062.


Abstract: GPUs have become a ubiquitous choice as coprocessors owing to their excellent capability for concurrent processing. In GPU architectures, shared memory plays a very important role in system performance, as it can largely improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications that contain only regular access patterns, optimizing for shared memory is not easy: it often requires programmer expertise and nontrivial parameter selection, and improper shared memory usage may even leave GPU resources underutilized. Even state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP) make it hard to exploit shared memory, since they lack inherent support for describing shared memory optimizations and for selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared memory optimization on GPUs. We design a pragma extension to OpenACC that conveys programmers' data management hints to the compiler. Meanwhile, we devise a compiler framework that automatically selects optimal parameters for shared arrays using the polyhedral model. We further propose optimization techniques that expose higher memory-level and instruction-level parallelism. The experimental results show that our shared memory centric approaches effectively improve the performance of five typical GPU applications across four widely used platforms by 3.7x on average, without burdening programmers with lots of pragmas.

[1] Ruetsch G, Micikevicius P. Optimizing matrix transpose in CUDA. http://www.cs.colostate.edu/~cs675/MatrixTranspose.pdf, Jan. 2009.

[2] Fujimoto N. Faster matrix-vector multiplication on GeForce 8800GTX. In Proc. IEEE International Symposium on Parallel and Distributed Processing, Apr. 2008.

[3] Van Werkhoven B, Maassen J, Bal H E, Seinstra F J. Optimizing convolution operations on GPUs using adaptive tiling. Future Gener. Comput. Syst., 2014, 30: 14-26.

[4] Nguyen A, Satish N, Chhugani J, Kim C, Dubey P. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proc. the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2010.

[5] Yang Y, Xiang P, Kong J, Zhou H. A GPGPU compiler for memory optimization and parallelism management. In Proc. the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, Jun. 2010, pp.86-97.

[6] Kandemir M, Kadayif I, Sezer U. Exploiting scratch-pad memory using Presburger formulas. In Proc. the 14th International Symposium on Systems Synthesis, Sept. 2001, pp.7-12.

[7] Ueng S Z, Lathara M, Baghsorkhi S, Hwu W. CUDA-Lite: Reducing GPU programming complexity. In Proc. the Languages and Compilers for Parallel Computing, July 3-Aug. 2, 2008, pp.1-15.

[8] Yang Y, Xiang P, Mantor M, Rubin N, Zhou H. Shared memory multiplexing: A novel way to improve GPGPU throughput. In Proc. the 21st International Conference on Parallel Architectures and Compilation Techniques, Sept. 2012, pp.283-292.

[9] Jablin J A, Jablin T B, Mutlu O, Herlihy M. Warp-aware trace scheduling for GPUs. In Proc. the 23rd International Conference on Parallel Architectures and Compilation, Aug. 2014, pp.163-174.

[10] Schäfer A, Fey D. High performance stencil code algorithms for GPGPUs. Procedia Computer Science, 2011, 4: 2027-2036.

[11] Volkov V. Better performance at lower occupancy. www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf, Dec. 2014.

[12] Bondhugula U, Hartono A, Ramanujam J, Sadayappan P. A practical automatic polyhedral parallelizer and locality optimizer. In Proc. the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, Jun. 2008, pp.101-113.

[13] Bastoul C. Code generation in the polyhedral model is easier than you think. In Proc. the 13th International Conference on Parallel Architectures and Compilation Techniques, Sept. 29-Oct. 3, 2004, pp.7-16.

[14] Baskaran M M, Bondhugula U, Krishnamoorthy S, Ramanujam J, Rountev A, Sadayappan P. A compiler framework for optimization of affine loop nests for GPGPUs. In Proc. the 22nd Annual International Conference on Supercomputing, Jun. 2008, pp.225-234.

[15] Baskaran M, Ramanujam J, Sadayappan P. Automatic C-to-CUDA code generation for affine programs. In Proc. the 19th Joint European Conference on Theory and Practice of Software, International Conference on Compiler Construction, Mar. 2010, pp.244-263.

[16] Pouchet L N. Polyhedral compilation foundations. http://web.cs.ucla.edu/~pouchet/lectures/doc/888.11.2.pdf, Dec. 2014.

[17] Murthy G S, Ravishankar M, Baskaran M M, Sadayappan P. Optimal loop unrolling for GPGPU programs. In Proc. the 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Apr. 2010.

[18] Liu L, Li Y, Cui Z, Bao Y, Chen M, Wu C. Going vertical in memory management: Handling multiplicity by multipolicy. In Proc. the 41st ACM/IEEE International Symposium on Computer Architecture (ISCA), Jun. 2014, pp.169-180.

[19] Gao S. Improving GPU shared memory access efficiency [Ph.D. Thesis]. University of Tennessee, 2014.

[20] Gou C, Gaydadjiev G. Addressing GPU on-chip shared memory bank conflicts using elastic pipeline. International Journal of Parallel Programming, 2013, 41(3): 400-429.

[21] Ryoo S, Rodrigues C I, Baghsorkhi S S, Stone S S, Kirk D B, Hwu W W. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proc. the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb. 2008, pp.73-82.

[22] Lee S I, Johnson T, Eigenmann R. Cetus — An extensible compiler infrastructure for source-to-source transformation. In Lecture Notes in Computer Science 2958, Rauchwerger L (ed.), Springer Berlin Heidelberg, 2004, pp.539-553.

[23] Lee S, Min S, Eigenmann R. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proc. the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb. 2009, pp.101-110.

[24] Wienke S, Springer P, Terboven C, an Mey D. OpenACC — First experiences with real-world applications. In Lecture Notes in Computer Science 7484, Kaklamanis C, Papatheodorou T, Spirakis P G (eds.), Springer Berlin Heidelberg, 2012, pp.859-870.

[25] Catanzaro B, Garland M, Keutzer K. Copperhead: Compiling an embedded data parallel language. Technical Report, UCB/EECS-2010-124, EECS Department, University of California, Berkeley, Sept. 2010.

[26] Reyes R, López I, Fumero J, de Sande F. A preliminary evaluation of OpenACC implementations. The Journal of Supercomputing, 2013, 65(3): 1063-1075.

[27] Fang J, Varbanescu A, Sips H. A comprehensive performance comparison of CUDA and OpenCL. In Proc. the International Conference on Parallel Processing, Sept. 2011, pp.216-225.

[28] Karimi K, Dickson N G, Hamze F. A performance comparison of CUDA and OpenCL. arXiv:1005.2581, 2010. http://arxiv.org/abs/1005.2581, Jan. 2016.

[29] Li C, Yang Y, Dai H, Yan S, Mueller F, Zhou H. Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs. In Proc. the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Mar. 2014, pp.231-242.

[30] Chen G, Wu B, Li D, Shen X. PORPLE: An extensible optimizer for portable data placement on GPU. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.88-100.

[31] van den Braak G, Mesman B, Corporaal H. Compile-time GPU memory access optimizations. In Proc. the 2010 International Conference on Embedded Computer Systems (SAMOS), Jul. 2010, pp.200-207.

[32] Baskaran M M, Bondhugula U, Krishnamoorthy S, Ramanujam J, Rountev A, Sadayappan P. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In Proc. the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb. 2008, pp.1-10.

[33] Baghdadi S, Größlinger A, Cohen A. Putting automatic polyhedral compilation for GPGPU to work. In Proc. the 15th Workshop Compilers for Parallel Computers, Jul. 2010.

[34] Größlinger A. Precise management of scratchpad memories for localising array accesses in scientific codes. In Proc. the 18th International Conference on Compiler Construction, Mar. 2009, pp.236-250.