Citation: Yi Yang, Chao Li, Huiyang Zhou. CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications[J]. Journal of Computer Science and Technology, 2015, 30(1): 3-19. DOI: 10.1007/s11390-015-1500-y


CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications


     

    Abstract: Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations across threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times and by 2.01 times on average.
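The core idea in the abstract, that extra "slave" threads are launched up front and activated by control flow instead of launching child kernels, can be sketched as follows. This is a minimal illustrative kernel, not the paper's compiler-generated code: the group size `N_SLAVE`, the pragma spelling, and the kernel body are assumptions made for illustration.

```cuda
// Hypothetical sketch of the CUDA-NP transformation described above.
// N_SLAVE, the pragma name, and the workload are illustrative
// assumptions; the actual CUDA-NP compiler output differs.
#define N_SLAVE 4  // slave threads launched per original (master) thread

__global__ void kernel_np(const float* in, float* out, int len)
{
    int master = threadIdx.x / N_SLAVE;  // id of the original logical thread
    int slave  = threadIdx.x % N_SLAVE;  // id within the slave group

    // Sequential section: control flow activates only the master
    // thread of each group, so the original semantics are preserved.
    if (slave == 0) {
        out[master * len] = in[master * len];
    }
    __syncthreads();

    // Parallel loop, marked in the source with an OpenMP-like pragma
    // (e.g., "#pragma np parallel for"). All N_SLAVE threads of the
    // group are activated and each takes a strided subset of the
    // iterations -- no child kernel launch, no global-memory handoff.
    for (int i = slave + 1; i < len; i += N_SLAVE) {
        out[master * len + i] = in[master * len + i] * 2.0f;
    }
}
```

The key contrast with dynamic parallelism is that the slave threads already exist and share the master's on-chip state (registers, shared memory), so activating them costs only a branch rather than a kernel launch through global memory.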

     

