›› 2015, Vol. 30 ›› Issue (1): 30-41.doi: 10.1007/s11390-015-1502-9

Special Issue: Computer Architecture and Systems

• Special Section on Computer Architecture and Systems for Big Data • Previous Articles     Next Articles

Using Memory in the Right Way to Accelerate Big Data Processing

Dong Yan1(阎栋), Xu-Sen Yin2(尹绪森), Cheng Lian2(连城), Xiang Zhong2(钟翔), Xin Zhou2(周鑫), Member, CCF, ACM, Gan-Sha Wu2(吴甘沙), Senior Member, CCF, Member, ACM   

  1. 1 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    2 Intel Labs China, Beijing 100091, China
  • Received:2014-07-14 Revised:2014-12-16 Online:2015-01-05 Published:2015-01-05
  • About author:Dong Yan is a Ph.D. candidate in Tsinghua University. He received his B.S. degree in computer science and technology from Wuhan University in 2009. His research interests include operating system kernel, debug, and big data.

Big data processing is becoming a standout part of data center computation. However, latest research has indicated that big data workloads cannot make full use of modern memory systems. We find that the dramatic inefficiency of the big data processing is from the enormous amount of cache misses and stalls of the depended memory accesses. In this paper, we introduce two optimizations to tackle these problems. The first one is the slice-and-merge strategy, which reduces the cache miss rate of the sort procedure. The second optimization is direct-memory-access, which reforms the data structure used in key/value storage. These optimizations are evaluated with both micro-benchmarks and the real-world benchmark HiBench. The results of our micro-benchmarks clearly demonstrate the effectiveness of our optimizations in terms of hardware event counts; and the additional results of HiBench show the 1.21X average speedup on the application-level. Both results illustrate that careful hardware/software co-design will improve the memory efficiency of big data processing. Our work has already been integrated into Intel distribution for Apache Hadoop.

[1] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1):107-113.

[2] Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein J M. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010. http://arxiv.org/abs/1006.4990, Dec. 2014.

[3] Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. the 9th USENIX Conference on Networked Systems Design and Implementation, April 2012, pp.15-28.

[4] Shafer J, Rixner S, Cox A L. The Hadoop distributed filesystem: Balancing portability and performance. In Proc. IEEE International Symposium on Performance Analysis of Systems and Software, March 2010, pp.122-133.

[5] Wang Y, Xu C, Li X, Yu W. JVM-bypass for efficient Hadoop shuffling. In Proc. the 27th IEEE International Symposium on Parallel and Distributed Processing, May 2013, pp.569-578.

[6] Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Toward dark silicon in servers. IEEE Micro, 2011, 31(4): 6-15.

[7] Horowitz M, Alon E, Patil D, Naffziger S, Kumar R, Bernstein K. Scaling, power, and the future of CMOS. In Proc. IEEE Int. Electron Devices Meeting, December 2005, pp.7-15.

[8] Ferdman M, Adileh A, Kocberber O, Volos S, Alisafaee M, Jevdjic D, Kaynak C, Popescu A D, Ailamaki A, Falsafi B. Quantifying the mismatch between emerging scale-out applications and modern processors. ACM Transactions on Computer Systems, 2012, 30(4): Article No. 15.

[9] Huang S, Huang J, Dai J, Xie T, Huang B. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Proc. the 26th IEEE International Conference on Data Engineering Workshops, March 2010, pp.41-51.

[10] Yang D, Zhong X, Yan D, Dai F, Yin X, Lian C, Zhu Z, Jiang W, Wu G. NativeTask: A Hadoop compatible framework for high performance. In Proc. IEEE International Conference on Big Data, October 2013, pp.94-101.

[11] Chen R, Chen H, Zang B. Tiled-MapReduce: Optimizing resource usages of data-parallel applications on multicore with tiling. In Proc. the 19th International Conference on Parallel Architectures and Compilation Techniques, September 2010, pp.523-534.

[12] Brodal G S, Fagerberg R, Vinther K. Engineering a cacheoblivious sorting algorithm. Journal of Experimental Algorithmics, 2008, 12(2): Article No. 22.

[13] Levinthal D. Cycle accounting analysis on Intel® CoreTM 2 processors. Intel Corp., 2012. https://software.intel.com/sites/ products/collateral/hpc/vtune/cycle accounting analysis. pdf, Dec. 2014.

[14] Levinthal D. Performance analysis guide for Intel® CoreTM i7 processor and Intel® XeonTM 5500 processors. Intel Performance Analysis Guide, 2009. https://software. intel.com/sites/products/collateral/hpc/vtune/performance analysis guide.pdf, Dec. 2014.

[15] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. the 13th IEEE International Symposium on High Performance Computer Architecture, February 2007, pp.13-24.

[16] Yoo R M, Romano A, Kozyrakis C. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proc. IEEE International Symposium on Workload Characterization, October 2009, pp.198-207.

[17] He B, Fang W, Luo Q, Govindaraju N K, Wang T. Mars: A MapReduce framework on graphics processors. In Proc. the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008, pp.260-269.

[18] Hong C, Chen D, Chen W, Zheng W, Lin H. MapCG: Writing parallel program portable between CPU and GPU. In Proc. the 19th International Conference on Parallel Architectures and Compilation Techniques, September 2010, pp.217-226.
No related articles found!
Full text



[1] Zhang Bo; Zhang Ling;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[2] Meng Liming; Xu Xiaofei; Chang Huiyou; Chen Guangxi; Hu Mingzeng; Li Sheng;. A Tree-Structured Database Machine for Large Relational Database Systems[J]. , 1987, 2(4): 265 -275 .
[3] Lin Qi; Xia Peisu;. The Design and Implementation of a Very Fast Experimental Pipelining Computer[J]. , 1988, 3(1): 1 -6 .
[4] Sun Chengzheng; Tzu Yungui;. A New Method for Describing the AND-OR-Parallel Execution of Logic Programs[J]. , 1988, 3(2): 102 -112 .
[5] Zhang Bo; Zhang Tian; Zhang Jianwei; Zhang Ling;. Motion Planning for Robots with Topological Dimension Reduction Method[J]. , 1990, 5(1): 1 -16 .
[6] Wang Dingxing; Zheng Weimin; Du Xiaoli; Guo Yike;. On the Execution Mechanisms of Parallel Graph Reduction[J]. , 1990, 5(4): 333 -346 .
[7] Zhou Quan; Wei Daozheng;. A Complete Critical Path Algorithm for Test Generation of Combinational Circuits[J]. , 1991, 6(1): 74 -82 .
[8] Zhao Jinghai; Liu Shenquan;. An Environment for Rapid Prototyping of Interactive Systems[J]. , 1991, 6(2): 135 -144 .
[9] Shang Lujun; Xu Lihui;. Notes on the Design of an Integrated Object-Oriented DBMS Family[J]. , 1991, 6(4): 389 -394 .
[10] Xu Jianguo; Gou Yuchai; Lin Zongkai;. HEPAPS:A PCB Automatic Placement System[J]. , 1992, 7(1): 39 -46 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved