›› 2012,Vol. 27 ›› Issue (1): 42-56.doi: 10.1007/s11390-012-1205-4

Special Section: Computer Architecture and Systems

  • About the authors: Chun-Tao Hong is a Ph.D. candidate in computer science and technology at Tsinghua University. His research interests include parallel and distributed computing, compiler technology, and cloud computing. He is a member of CCF.

Providing Source Code Level Portability Between CPU and GPU with MapCG

Chun-Tao Hong1 (洪春涛), Member, CCF, De-Hao Chen1 (陈德颢), Member, CCF, Yu-Bei Chen2 (陈羽北), Wen-Guang Chen1 (陈文光), Member, CCF, ACM, IEEE, Wei-Min Zheng1 (郑纬民), Member, CCF, ACM, IEEE and Hai-Bo Lin3 (林海波), Member, ACM, IEEE   

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    2. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China;
    3. IBM China Research Lab, Beijing 100094, China
  • Received:2011-02-23 Revised:2011-05-27 Online:2012-01-05 Published:2012-01-05
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant No. 60973143, the National High Technology Research and Development 863 Program of China under Grant No. 2008AA01A201, and the National Basic Research 973 Program of China under Grant No. 2007CB310900.

1. Contributions of This Paper
GPUs have grown into important processors, and a growing number of fields now use them to accelerate programs. At present, the most widely used way to program GPUs is CUDA, but programs written in CUDA cannot run on CPUs. As a result, many programs must maintain two versions, one for the CPU and one for the GPU, which greatly increases software development and maintenance costs. MapCG provides a unified programming interface for CPUs and GPUs: a programmer writes a single version of the code, and that code runs on both the CPU and the GPU. In addition, MapCG's MapReduce-based programming interface makes programming much easier.
2. Implementation
CPUs and GPUs differ in several main respects: 1) Thread model. GPUs are programmed at a much finer thread granularity than CPUs, and the relationships among GPU threads are considerably more complex. 2) Memory model. GPUs have more levels of memory hierarchy than CPUs, and their caches are smaller. 3) Programming language. GPUs are programmed in CUDA, while CPUs use languages such as C/C++.
MapCG masks most of these differences by adopting the MapReduce programming model. In a MapReduce program, the programmer does not manage thread scheduling, task assignment, or load balancing, so the difficulties caused by the differing thread models are greatly reduced. Moreover, since task scheduling in MapReduce is decided by the runtime system, the runtime can also control memory access behavior well, so programmers need not worry much about the GPU memory hierarchy. As for the programming language, CUDA differs from C/C++ in that it provides some GPU-specific language extensions. In MapCG, programmers write programs in a subset of C/C++, and a simple source-to-source translation tool provided by MapCG then converts the code into either CPU code or GPU code.
The MapCG runtime presents a unified interface to programmers, but its CPU and GPU implementations differ substantially. On the CPU, MapCG uses one hash table per thread; the hash table's main job is to group the intermediate values. On the GPU, the thread granularity makes a separate hash table per thread infeasible, so MapCG uses one hash table per GPU. To guarantee correctness under concurrent insertion, MapCG uses the GPU's atomic instructions to implement a hash table that supports concurrent inserts.
In addition, to optimize memory allocation performance, MapCG uses a custom memory allocator, dedicated to storing MapCG's intermediate pairs. It supports fast parallel malloc, but can only free all allocations at once. This design fits MapCG's memory usage pattern, but makes it unsuitable as a general-purpose allocator.
3. Conclusions and Future Work
Experiments show that with MapCG, programmers can write a single version of the code, run it on both the CPU and the GPU, and still achieve performance close to hand-written code. The tests also show that MapCG outperforms comparable frameworks: it is faster than both Mars and Phoenix. Separate experiments confirm that both the hash table design and the custom memory allocator used in MapCG are effective in improving performance.
MapCG uses the MapReduce programming model, which suits only certain kinds of programs; for programs with multiple nested loops, MapReduce is often inefficient. This is an inherent limitation of MapReduce.
At present, MapCG cannot use GPU shared memory to accelerate user-written code; shared memory is used only inside the memory allocator. Xiaosong Ma et al. proposed using shared memory to improve the memory access behavior of MapReduce programs on GPUs and thereby obtain better performance. We are working with them to add this feature to MapCG.
Because of the serialization/deserialization between the CPU and the GPU, MapCG cannot yet provide efficient CPU-GPU co-processing; we hope to improve this in future work.
4. Practical Value and Application Prospects
MapCG provides programmers with a unified CPU/GPU programming environment, so that a single version of the code can run on both the CPU and the GPU. At the same time, the MapReduce programming model that MapCG provides makes programming easier. We believe MapCG is of real practical value in reducing software development and maintenance costs on CPU/GPU systems and in popularizing the use of GPUs.

Abstract: Graphics processing units (GPUs) have taken an important role in the general purpose computing market in recent years. At present, the common approach to programming GPUs is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve good performance, it creates serious portability issues: programmers are required to write a specific version of the code for each potential target architecture, which results in high development and maintenance costs. We believe it is desirable to have a programming model which provides source code portability between CPUs and GPUs, as well as across different GPUs. This would allow programmers to write one version of the code, which can be compiled and executed on either CPUs or GPUs efficiently without modification. In this paper, we propose MapCG, a MapReduce framework to provide source code level portability between CPUs and GPUs. In contrast to other approaches such as OpenCL, our framework, based on MapReduce, provides a high-level programming model and makes programming much easier. We describe the design of MapCG, including the MapReduce-style high-level programming framework and the runtime system on the CPU and GPU. A prototype of the MapCG runtime, supporting multi-core CPUs and NVIDIA GPUs, was implemented. Our experimental results show that this implementation can execute the same source code efficiently on multi-core CPU platforms and GPUs, achieving an average speedup of 1.6x-2.5x over previous implementations of MapReduce on eight commonly used applications.

[1] NVIDIA. NVIDIA CUDA compute unified device architecture programming guide. http://developer.download.nvidia.com/compute/cuda/1-1/NVIDIA CUDA programming Guide 1.1.pdf, 2007.

[2] Eichenberger A E, O'Brien J K, O'Brien K M et al. Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture. IBM Systems Journal, 2006, 45(1): 59-84.

[3] Zhu W R, Sreedhar V C, Hu Z, Gao G R. Synchronization state buffer: Supporting efficient fine-grain synchronization on many-core architectures. In Proc. the 34th ISCA, June 2007, pp.35-45.

[4] Buck I, Foley T, Horn D et al. Brook for GPUs: Stream computing on graphics hardware. ACM Trans. Graph., 2004, 23(3): 777-786.

[5] Khronos Group. OpenCL specification. http://www.khronos.org/registry/cl/.

[6] Stratton J, Stone S S, Hwu W M. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proc. the 21st LCPC, July 31-Aug. 2, 2008, pp.16-30.

[7] He B S, Fang W B, Luo Q, Govindaraju N K, Wang T. Mars: A MapReduce framework on graphics processors. In Proc. the 17th PACT, Oct. 2008, pp.260-269.

[8] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. the 13th HPCA, Feb. 2007, pp.13-24.

[9] Berger E D, McKinley K S, Blumofe R D, Wilson P R. Hoard: A scalable memory allocator for multithreaded applications. SIGPLAN Not., 2000, 35(11): 117-128.

[10] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. In Proc. the 6th OSDI, Dec. 2004, pp.137-150.

[11] Ekanayake J, Pallickara S, Fox G. MapReduce for data intensive scientific analyses. In Proc. the 4th IEEE International Conference on eScience, Dec. 2008, pp.277-284.

[12] Chu C T, Kim S K, Lin Y A et al. Map-Reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 2007, 19: 281-288.

[13] Matthews S, Williams T. MrsRF: An efficient MapReduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinformatics, 2010, 11(Suppl. 1): S15.

[14] Panda B, Herbach J, Basu S, Bayardo R. PLANET: Massively parallel learning of tree ensembles with MapReduce. In Proc. VLDB, Aug. 2009, pp.1426-1437.

[15] Yoo R M, Romano A, Kozyrakis C. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proc. IISWC, Oct. 2009, pp.198-207.

[16] Fomitchev M, Ruppert E. Lock-free linked lists and skip lists. In Proc. the 23rd PODC, Jul. 2004, pp.50-59.

[17] Dice D, Garthwaite A. Mostly lock-free malloc. In Proc. the 3rd ISMM, Jun. 2002, pp.163-174.

[18] Huang X H, Rodrigues C I, Jones S et al. XMalloc: A scalable lock-free dynamic memory allocator for many-core machines. In Proc. the 10th CIT, June 29-July 1, 2010, pp.1134-1139.

[19] Fang W B, He B S, Luo Q et al. Mars: Accelerating MapReduce with graphics processors. IEEE Transactions on Parallel and Distributed Systems, 2010, 22(4): 608-620.

[20] Ji F, Ma X S. Using shared memory to accelerate MapReduce on graphics processing units. In Proc. the 25th IPDPS, May 2011, pp.805-816.

[21] Apache Hadoop. http://hadoop.apache.org/.

[22] Chen R, Chen H B, Zang B Y. Tiled-MapReduce: Optimizing resource usage of data-parallel applications on multicore with tiling. In Proc. the 19th PACT, Sep. 2010, pp.523-534.

[23] Shan Y, Wang B, Yan J et al. FPMR: MapReduce framework on FPGA. In Proc. the 18th FPGA, Feb. 2010, pp.93-102.

[24] Rafique M M, Rose B, Butt A R, Nikolopoulos D S. CellMR: A framework for supporting MapReduce on asymmetric Cell-based clusters. In Proc. the 23rd IPDPS, May 2009.

[25] Govindaraju N, Gray J, Kumar R et al. GPUTeraSort: High performance graphics co-processor sorting for large database management. In Proc. SIGMOD/PODS, Jun. 2006, pp.325-336.

[26] AMD CTM. http://www.amd.com/us/press-release/Pages/Press Release 114147.aspx, 2011.

[27] Yan Y H, Grossman M, Sarkar V. JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. In Proc. the 15th Euro-Par, Aug. 2009, pp.887-899.

[28] Wang P H, Collins J P, Chinya G N et al. EXOCHI: Architecture and programming environment for a heterogeneous multi-core multithreaded system. In Proc. PLDI, Jun. 2007, pp.156-166.

[29] Linderman M, Collins J P, Wang H, Meng T H. Merge: A programming model for heterogeneous multi-core systems. In Proc. the 13th ASPLOS, Mar. 2008, pp.287-296.

[30] Lee S, Min S J, Eigenmann R. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proc. the 14th PPoPP, Feb. 2009, pp.101-110.