

Pragma Directed Shared Memory Centric Optimizations on GPUs


    Abstract: GPUs have become a ubiquitous choice as coprocessors owing to their excellent ability in concurrent processing. In GPU architectures, shared memory plays a very important role in system performance, as it can greatly improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications that contain only regular access patterns, optimizing for shared memory is not easy. It often requires programmer expertise and nontrivial parameter selection, and improper shared memory usage may even underutilize GPU resources. Even with state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP), shared memory remains hard to exploit, since these models lack inherent support for describing shared-memory optimizations and for selecting suitable parameters, let alone for maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared-memory optimization on GPUs. We design a pragma extension to OpenACC that conveys programmers' data-management hints to the compiler. Meanwhile, we devise a compiler framework, based on the polyhedral model, that automatically selects optimal parameters for shared arrays. We further propose optimization techniques that expose higher memory-level and instruction-level parallelism. Experimental results show that our shared-memory-centric approach effectively improves the performance of five typical GPU applications across four widely used platforms by 3.7x on average, without burdening programmers with many pragmas.
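    The proposed pragma extension itself is not spelled out in the abstract. As a hedged illustration of the kind of hint involved, standard OpenACC already provides a `cache` directive that asks the compiler to stage a reused sub-array in fast on-chip (shared) memory. The sketch below is a minimal 1-D stencil, not the paper's actual directive; the `acc` pragmas are ignored by compilers without OpenACC support, so the code also runs correctly as plain C.

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define N 1024

    /* 3-point stencil: out[i] = (in[i-1] + in[i] + in[i+1]) / 3.
     * The `cache` directive hints that the reused 3-element window of
     * `in` should be staged in GPU shared memory; without an OpenACC
     * compiler, the pragmas are ignored and the loop runs on the CPU. */
    void stencil(const float *in, float *out, int n)
    {
        #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
        for (int i = 1; i < n - 1; i++) {
            #pragma acc cache(in[i-1:3])
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
        }
    }

    int main(void)
    {
        float in[N], out[N] = {0};
        for (int i = 0; i < N; i++)
            in[i] = (float)i;

        stencil(in, out, N);

        /* interior points of a linear ramp average back to themselves */
        assert(out[1] == 1.0f);
        assert(out[N / 2] == (float)(N / 2));
        printf("ok\n");
        return 0;
    }
    ```

    The abstract's point is that writing such hints by hand, and picking tile sizes and staging parameters, is error-prone; the proposed framework moves that parameter selection into the compiler via the polyhedral model.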

     
