2015, Vol. 30, Issue (1): 3-19. doi: 10.1007/s11390-015-1500-y
Special Topic: Computer Architecture and Systems
• Special Section on Selected Paper from NPC 2011 •
Yi Yang1(杨毅), Chao Li2(李超), Huiyang Zhou2(周辉阳), Senior Member, ACM, IEEE
Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, a thread in a parallel program, such as a CUDA program, may still contain both sequential code and parallel loops. To exploit such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel and thereby reduces the overhead of launching kernels from the CPU. However, with NVIDIA dynamic parallelism, a parent thread can only communicate with its child threads through global memory, which incurs significant overhead. In this paper, we first study a set of GPGPU programs containing parallel loops and show that these programs do not have high loop trip counts or high degrees of TLP; as a result, the benefit of using NVIDIA dynamic parallelism is too limited to offset its overhead. We then present our solution for exploiting nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, a large number of threads are launched when a GPU program starts, and control flow is used to activate different numbers of threads in different code sections. We implement the proposed CUDA-NP framework with a compiler-based approach: for a GPU kernel, the application developer only needs to add OpenMP-like pragmas, and our CUDA-NP compiler then automatically generates the optimized GPU kernel. The framework supports the reduction and scan primitives, explores different ways of distributing parallel loop iterations among threads, and manages on-chip resources efficiently. Our experiments show that, for a set of already optimized GPGPU programs containing nested parallelism, the proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
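As a rough illustration of the transformation described above, the sketch below shows a kernel with a pragma-annotated parallel loop and one possible shape of the code a CUDA-NP-style compiler could generate: extra "slave" threads are launched up front, activated by control flow for the parallel loop, and their partial results are combined in shared memory before the sequential section continues on the "master" thread. The `#pragma np` spelling, the fixed group size `NP`, and the transformed kernel are assumptions made for illustration, not the actual CUDA-NP syntax or compiler output.

```cuda
// Illustrative sketch only: the "#pragma np" spelling, the fixed group size NP,
// and the transformed kernel below are assumptions based on the abstract,
// not the actual CUDA-NP notation or compiler output.
#include <cstdio>
#include <cuda_runtime.h>

#define NP 4  // assumed number of slave threads activated per original (master) thread

// (a) What a developer might write: a per-row kernel whose inner loop is a
//     parallel reduction, annotated with an OpenMP-like pragma:
//
//     __global__ void row_sum(const float* in, float* out, int n, int m) {
//         int row = blockIdx.x * blockDim.x + threadIdx.x;   // sequential section
//         float sum = 0.0f;
//         #pragma np parallel for reduction(+:sum)           // hypothetical pragma
//         for (int j = 0; j < m; j++)
//             sum += in[row * m + j];
//         out[row] = sum;                                     // sequential section
//     }

// (b) Roughly what a transformed kernel could look like: NP times as many threads
//     are launched up front, the annotated loop is strided across the NP slave
//     threads of each group, and a shared-memory reduction recombines the partial
//     sums before the sequential section runs on the master thread only.
__global__ void row_sum_np(const float* in, float* out, int n, int m) {
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int row   = tid / NP;                 // original logical thread
    int slave = tid % NP;                 // position within the NP-thread group

    extern __shared__ float partial[];    // one slot per thread in the block
    float sum = 0.0f;
    if (row < n)
        for (int j = slave; j < m; j += NP)   // loop iterations split across slaves
            sum += in[row * m + j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    if (row < n && slave == 0) {          // master thread of each group
        for (int k = 1; k < NP; k++)      // combine the slaves' partial sums
            sum += partial[threadIdx.x + k];
        out[row] = sum;                   // sequential section, master only
    }
}

int main() {
    const int n = 256, m = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * m * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n * m; i++) in[i] = 1.0f;

    int threads = 128;                               // block size, a multiple of NP
    int blocks  = (n * NP + threads - 1) / threads;  // NP threads per logical row
    row_sum_np<<<blocks, threads, threads * sizeof(float)>>>(in, out, n, m);
    cudaDeviceSynchronize();
    printf("out[0] = %.0f (expected %d)\n", out[0], m);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The point of contrast with dynamic parallelism in this sketch is that no child kernel is launched and the partial results are exchanged through on-chip shared memory rather than global memory, which is the overhead the abstract argues against.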