A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs
Abstract:
1. Contributions of This Paper
The contributions of this paper are as follows:
1) We propose a reuse pattern for stencil programs running on GPUs. This reuse pattern allows the circular queue method to be implemented with the GPU's on-chip registers, rather than only with indirectly addressable storage such as memory or scratchpad memory. The pattern applies to stencil applications that execute multiple consecutive time steps on chip.
2) Reducing the demand for shared memory and reducing the number of shared-memory access instructions are both important for improving the performance of programs running on GPUs. To achieve these two goals for stencil applications on GPUs, we propose two inter-thread communication schemes, SCR and MCR, respectively (a generic illustration of communication through shared memory is sketched after this list).
3) To automatically find the best placement of on-chip data between registers and shared memory, together with the best inter-thread communication scheme and its parameters, we propose a search algorithm. We also propose a pruning algorithm that effectively reduces the search space.
4) By studying four stencil applications of different dimensionalities and memory access patterns, we find that: a) our reuse pattern is applicable to a wide range of stencil applications; b) the proposed hybrid circular queue method effectively improves performance; c) the best placement of on-chip data between registers and shared memory depends on both the type of stencil and the GPU hardware parameters; and d) for the proposed reuse pattern, the register allocation algorithms in current compilers do not achieve the best allocation, and we suggest corresponding improvements.
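The extended abstract does not spell out how SCR and MCR work, so the fragment below is only a generic, assumed illustration of inter-thread communication for a stencil on a GPU: each thread keeps its own point in a register and publishes it to shared memory exactly once, so that neighboring threads can fetch it with a single shared-memory load each. The kernel name, the 3-point stencil, and its coefficients are hypothetical.

```cuda
// Generic sketch of register-to-shared-memory communication for a 1D 3-point
// stencil (illustrative only; not the paper's SCR or MCR scheme).  Each thread
// holds its own point in a register and writes it to shared memory once, so a
// neighbor's value costs one shared-memory load instead of a global-memory load.
__global__ void exchange_via_shared(const float* __restrict__ in,
                                    float* __restrict__ out, int n)
{
    extern __shared__ float row[];                  // one float per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index

    float center = (i < n) ? in[i] : 0.0f;          // register-resident value
    row[threadIdx.x] = center;                      // the only shared-memory store
    __syncthreads();
    if (i >= n) return;                             // safe: no barriers below

    // Interior threads read both neighbors from shared memory; threads on the
    // block boundary fall back to global memory (or clamp at the domain edge).
    float left  = (threadIdx.x > 0)              ? row[threadIdx.x - 1]
                                                 : (i > 0     ? in[i - 1] : center);
    float right = (threadIdx.x < blockDim.x - 1) ? row[threadIdx.x + 1]
                                                 : (i < n - 1 ? in[i + 1] : center);

    out[i] = 0.25f * left + 0.5f * center + 0.25f * right;
}
```

A launch would pass one float of dynamic shared memory per thread, e.g. `exchange_via_shared<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);`.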
2. Implementation Approach
The original circular queue is implemented in the GPU's on-chip shared memory. We instead view the circular queue's array data structure as a collection of scalar variables and, by analyzing the relationship between these scalars and the data they hold, derive a reuse pattern over the scalars that never accesses them through indirect (pointer-based) addressing. Because the circular queue is expressed with scalar variables that need no indirect access, it can be implemented with registers and shared memory at the same time.
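As a minimal sketch of this reuse pattern, assume a 1D 3-point stencil in which each thread sweeps a contiguous chunk of the grid and performs a single time step per sweep (the paper chains several such queues, one per in-flight time step; the kernel name and coefficients below are illustrative). The queue slots are the named scalars qm1, q0, and qp1, and advancing the queue is just scalar copies, so the compiler can keep every slot in a register:

```cuda
// Illustrative sketch of the scalar circular queue (assumed 3-point 1D stencil,
// one time step per sweep; not the paper's exact code).  "Rotating" the queue is
// done by copying scalars rather than by advancing an index, so no indirect
// access is needed and the slots can live in registers.
__global__ void stencil3_register_queue(const float* __restrict__ in,
                                        float* __restrict__ out, int n)
{
    // Each thread sweeps one contiguous chunk of the 1D grid.
    int total = gridDim.x * blockDim.x;
    int chunk = (n + total - 1) / total;
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    int end   = min(start + chunk, n - 1);
    if (start < 1) start = 1;
    if (start >= end) return;

    // Prime the queue: slots hold in[x-1], in[x], in[x+1] for x = start.
    float qm1 = in[start - 1];
    float q0  = in[start];
    float qp1 = in[start + 1];

    for (int x = start; x < end; ++x) {
        out[x] = 0.25f * qm1 + 0.5f * q0 + 0.25f * qp1;   // consume the queue
        // Rotate: drop the oldest slot, shift the others, load one new point.
        qm1 = q0;
        q0  = qp1;
        qp1 = (x + 2 < n) ? in[x + 2] : 0.0f;
    }
}
```

In the hybrid method, the search described below decides which of these slots stay in registers and which are moved to shared memory.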
A GPU provides two kinds of on-chip storage: registers and shared memory. The two have different characteristics, and their usage must be balanced to maximize the number of concurrently executing threads and thereby reach the best performance. For the circular queue, the best implementation typically keeps part of the data structure in registers and part in shared memory. To find the best data placement, we search over differently parameterized code versions. To shrink the search space, we gather occupancy information available at compile time and combine it with each version's parameters to rule out versions that will perform poorly, so that only a small number of versions need to be run to find the best-performing one.
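The exact pruning criterion is specific to the paper; the host-side fragment below is only a plausible sketch of the idea, assuming hypothetical device limits and a hypothetical rule that discards any candidate whose estimated occupancy falls noticeably below the best candidate's. The register and shared-memory figures per version would come from the compiler report (e.g., `ptxas -v`).

```cuda
// Host-side sketch: prune candidate code versions using compile-time occupancy.
// The device limits and the pruning rule below are assumptions for illustration.
#include <algorithm>
#include <vector>

struct Version {
    int threads_per_block;   // block size used by this version
    int regs_per_thread;     // register count reported by the compiler
    int smem_per_block;      // shared-memory bytes used per block
};

struct DeviceLimits {        // assumed values for a hypothetical GPU
    int max_threads_per_sm = 1536;
    int max_blocks_per_sm  = 8;
    int regs_per_sm        = 32768;
    int smem_per_sm        = 49152;  // bytes
};

// Estimated occupancy = resident threads per SM / maximum threads per SM.
double estimate_occupancy(const Version& v, const DeviceLimits& d)
{
    int by_threads = d.max_threads_per_sm / v.threads_per_block;
    int by_regs    = d.regs_per_sm /
                     std::max(1, v.regs_per_thread * v.threads_per_block);
    int by_smem    = v.smem_per_block > 0 ? d.smem_per_sm / v.smem_per_block
                                          : d.max_blocks_per_sm;
    int blocks = std::min({d.max_blocks_per_sm, by_threads, by_regs, by_smem});
    return double(blocks * v.threads_per_block) / d.max_threads_per_sm;
}

// Keep only versions whose estimated occupancy is within `slack` of the best,
// so that only a handful of candidates must be timed on the real hardware.
std::vector<Version> prune_by_occupancy(const std::vector<Version>& all,
                                        const DeviceLimits& d,
                                        double slack = 0.15)
{
    double best = 0.0;
    for (const auto& v : all) best = std::max(best, estimate_occupancy(v, d));
    std::vector<Version> kept;
    for (const auto& v : all)
        if (estimate_occupancy(v, d) + slack >= best) kept.push_back(v);
    return kept;
}
```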
3. Conclusions and Future Work
This paper proposes a reuse pattern for stencil programs that allows the circular queue method to be implemented with registers on GPUs. We also present a framework that automatically generates code for the hybrid circular queue method and automatically searches for the best placement of on-chip data between registers and shared memory. Experiments on four different types of stencil programs show that our method achieves speedups of up to 2.93x over an implementation that uses shared memory only.
Open problems include: 1) building a performance model to further reduce the search time; 2) extending our treatment of the register/shared-memory placement problem to more applications and developing a more general framework that helps programmers optimize GPU code.

Abstract: In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning the multiple time steps in stencil computations so that circular queues can be implemented effectively with both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93X over methods that use circular queues implemented with shared memory only.
Keywords:
- stencil computation
- circular queue
- GPU
- occupancy
- register