2012, Vol. 27, Issue 1: 42-56. DOI: 10.1007/s11390-012-1205-4

Special Issue: Computer Architecture and Systems

• Architecture and VLSI Design •

Providing Source Code Level Portability Between CPU and GPU with MapCG

Chun-Tao Hong1 (洪春涛), Member, CCF, De-Hao Chen1 (陈德颢), Member, CCF, Yu-Bei Chen2 (陈羽北), Wen-Guang Chen1 (陈文光), Member, CCF, ACM, IEEE, Wei-Min Zheng1 (郑纬民), Member, CCF, ACM, IEEE and Hai-Bo Lin3 (林海波), Member, ACM, IEEE   

1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    2. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China;
    3. IBM China Research Lab, Beijing 100094, China
  • Received: 2011-02-23; Revised: 2011-05-27; Online: 2012-01-05; Published: 2012-01-05
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant No. 60973143, the National High Technology Research and Development 863 Program of China under Grant No. 2008AA01A201, and the National Basic Research 973 Program of China under Grant No. 2007CB310900.

Graphics processing units (GPUs) have taken an important role in the general-purpose computing market in recent years. At present, the common approach to programming GPUs is to write GPU-specific code using low-level GPU APIs such as CUDA. Although this approach can achieve good performance, it creates serious portability issues: programmers are required to write a specific version of the code for each potential target architecture, which results in high development and maintenance costs. We believe it is desirable to have a programming model that provides source code portability between CPUs and GPUs, as well as across different GPUs, so that programmers can write one version of the code which can be compiled and executed efficiently on either CPUs or GPUs without modification. In this paper, we propose MapCG, a MapReduce framework that provides source code level portability between CPUs and GPUs. In contrast to lower-level approaches such as OpenCL, our framework is based on MapReduce and thus offers a high-level programming model that makes programming much easier. We describe the design of MapCG, including its MapReduce-style high-level programming framework and its runtime system on the CPU and GPU. We implemented a prototype of the MapCG runtime supporting multi-core CPUs and NVIDIA GPUs. Our experimental results show that this implementation can execute the same source code efficiently on both multi-core CPU platforms and GPUs, achieving an average speedup of 1.6x-2.5x over previous implementations of MapReduce on eight commonly used applications.

[1] NVIDIA. NVIDIA CUDA compute unified device architecture programming guide. http://developer.download.nvidia.com/compute/cuda/1-1/NVIDIA_CUDA_Programming_Guide_1.1.pdf, 2007.

[2] Eichenberger A E, O'Brien J K, O'Brien K M et al. Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture. IBM Systems Journal, 2006, 45(1): 59-84.

[3] Zhu W R, Sreedhar V C, Hu Z, Gao G R. Synchronization state buffer: Supporting efficient fine-grain synchronization on many-core architectures. In Proc. the 34th ISCA, June 2007, pp.35-45.

[4] Buck I, Foley T, Horn D et al. Brook for GPUs: Stream computing on graphics hardware. ACM Trans. Graph., 2004, 23(3): 777-786.

[5] Khronos Group. OpenCL specification. http://www.khronos.org/registry/cl/.

[6] Stratton J, Stone S S, Hwu W M. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proc. the 21st LCPC, July 31-Aug. 2, 2008, pp.16-30.

[7] He B S, Fang W B, Luo Q, Govindaraju N K, Wang T. Mars: A MapReduce framework on graphics processors. In Proc. the 17th PACT, Oct. 2008, pp.260-269.

[8] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. the 13th HPCA, Feb. 2007, pp.13-24.

[9] Berger E D, McKinley K S, Blumofe R D, Wilson P R. Hoard: A scalable memory allocator for multithreaded applications. SIGPLAN Not., 2000, 35(11): 117-128.

[10] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. In Proc. the 6th OSDI, Dec. 2004, pp.137-150.

[11] Ekanayake J, Pallickara S, Fox G. MapReduce for data intensive scientific analyses. In Proc. the 4th IEEE International Conference on eScience, Dec. 2008, pp.277-284.

[12] Chu C T, Kim S K, Lin Y A et al. Map-Reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 2007, 19: 281-288.

[13] Matthews S, Williams T. MrsRF: An efficient MapReduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinformatics, 2010, 11(Suppl. 1): S15.

[14] Panda B, Herbach J, Basu S, Bayardo R. PLANET: Massively parallel learning of tree ensembles with MapReduce. In Proc. VLDB, Aug. 2009, pp.1426-1437.

[15] Yoo R M, Romano A, Kozyrakis C. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proc. IISWC, Oct. 2009, pp.198-207.

[16] Fomitchev M, Ruppert E. Lock-free linked lists and skip lists. In Proc. the 23rd PODC, Jul. 2004, pp.50-59.

[17] Dice D, Garthwaite A. Mostly lock-free malloc. In Proc. the 3rd ISMM, Jun. 2002, pp.163-174.

[18] Huang X H, Rodrigues C I, Jones S et al. XMalloc: A scalable lock-free dynamic memory allocator for many-core machines. In Proc. the 10th CIT, June 29-July 1, 2010, pp.1134-1139.

[19] Fang W B, He B S, Luo Q et al. Mars: Accelerating MapReduce with graphics processors. IEEE Transactions on Parallel and Distributed Systems, 2010, 22(4): 608-620.

[20] Ji F, Ma X S. Using shared memory to accelerate MapReduce on graphics processing units. In Proc. the 25th IPDPS, May 2011, pp.805-816.

[21] Apache Hadoop. http://hadoop.apache.org/.

[22] Chen R, Chen H B, Zang B Y. Tiled-MapReduce: Optimizing resource usage of data-parallel applications on multicore with tiling. In Proc. the 19th PACT, Sep. 2010, pp.523-534.

[23] Shan Y, Wang B, Yan J et al. FPMR: MapReduce framework on FPGA. In Proc. the 18th FPGA, Feb. 2010, pp.93-102.

[24] Rafique M M, Rose B, Butt A R, Nikolopoulos D S. CellMR: A framework for supporting MapReduce on asymmetric Cell-based clusters. In Proc. the 23rd IPDPS, May 2009.

[25] Govindaraju N, Gray J, Kumar R et al. GPUTeraSort: High performance graphics co-processor sorting for large database management. In Proc. SIGMOD/PODS, Jun. 2006, pp.325-336.

[26] AMD CTM. http://www.amd.com/us/press-release/Pages/Press_Release_114147.aspx, 2011.

[27] Yan Y H, Grossman M, Sarkar V. JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. In Proc. the 15th Euro-Par, Aug. 2009, pp.887-899.

[28] Wang P H, Collins J P, Chinya G N et al. EXOCHI: Architecture and programming environment for a heterogeneous multi-core multithreaded system. In Proc. PLDI, Jun. 2007, pp.156-166.

[29] Linderman M, Collins J P, Wang H, Meng T H. Merge: A programming model for heterogeneous multi-core systems. In Proc. the 13th ASPLOS, Mar. 2008, pp.287-296.

[30] Lee S, Min S J, Eigenmann R. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proc. the 14th PPoPP, Feb. 2009, pp.101-110.
ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn