Journal of Computer Science and Technology, 2015, Vol. 30, Issue 1: 3-19. DOI: 10.1007/s11390-015-1500-y

Special Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •


CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Yi Yang1(杨毅), Chao Li2(李超), Huiyang Zhou2(周辉阳), Senior Member, ACM, IEEE   

  1. Department of Computing Systems Architecture, NEC Laboratories America, Princeton, NJ 08540, U.S.A.;
  2. Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27606, U.S.A.
  • Received: 2014-11-12  Revised: 2014-12-14  Online: 2015-01-05  Published: 2015-01-05
  • About author: Yi Yang received his B.S. degree from the University of Science and Technology of China (USTC), Hefei, his M.S. degree from the Chinese Academy of Sciences (CAS), Beijing, and his Ph.D. degree from North Carolina State University in 2012, all in computer science. Currently, Dr. Yang is a researcher in the Department of Computing Systems Architecture at NEC Laboratories America, working on compiler development and software optimization for manycore architectures.
  • Supported by:

    This work was supported by the National Science Foundation of USA under Grant No. CCF-1216569 and by a CAREER award of the National Science Foundation of USA under Grant No. CCF-0968667.


Abstract: Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times and by 2.01 times on average.
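To make the pattern concrete, below is a minimal hand-written CUDA sketch of the idea the abstract describes. It is our illustration, not the CUDA-NP compiler's actual output or directive syntax: the kernel names (nested_kernel, nested_kernel_np), the tuning knob NP_FACTOR, the placeholder work() function, and the pragma spelling are all hypothetical. The first kernel contains a parallel inner loop that each thread executes sequentially; the second launches NP_FACTOR times as many threads up front and uses control flow so that a group of threads cooperates on each task, with a shared-memory reduction standing in for the reduction primitive mentioned above.

    // Minimal sketch, assuming hypothetical names (nested_kernel, NP_FACTOR,
    // work); illustrative only, not the CUDA-NP compiler's generated code.
    #define NP_FACTOR 4   // cooperating threads per parallel loop (hypothetical knob)

    __device__ float work(int task, int j) { return (float)(task + j); }  // placeholder

    // Original kernel: each thread runs the inner loop sequentially.
    __global__ void nested_kernel(float* out, int loop_count)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;                              // sequential section
        // #pragma np parallel for reduction(+:sum)    // illustrative annotation
        for (int j = 0; j < loop_count; ++j)           // independent iterations
            sum += work(tid, j);
        out[tid] = sum;
    }

    // Transformed kernel: NP_FACTOR threads are enabled per task and activated
    // for the loop via control flow; no extra kernel launch and no
    // global-memory handshake, unlike dynamic parallelism.
    __global__ void nested_kernel_np(float* out, int loop_count)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int task = tid / NP_FACTOR;   // which original thread's work this group does
        int lane = tid % NP_FACTOR;   // position within the group

        __shared__ float partial[256];  // assumes blockDim.x <= 256, divisible by NP_FACTOR
        float sum = 0.0f;
        for (int j = lane; j < loop_count; j += NP_FACTOR)  // interleaved distribution
            sum += work(task, j);
        partial[threadIdx.x] = sum;
        __syncthreads();

        if (lane == 0) {                // master lane combines the partial sums
            float total = 0.0f;
            for (int k = 0; k < NP_FACTOR; ++k)
                total += partial[threadIdx.x + k];
            out[task] = total;
        }
    }

A caller would launch the transformed kernel with NP_FACTOR times as many threads, e.g., nested_kernel_np<<<num_tasks * NP_FACTOR / 256, 256>>>(d_out, n). The interleaved j += NP_FACTOR distribution is just one of the iteration-distribution choices the abstract says the compiler explores, and keeping the reduction in shared memory provides exactly the on-chip parent-child communication that dynamic parallelism's global-memory channel cannot.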
