2015, Vol. 30, Issue (1): 30-41. doi: 10.1007/s11390-015-1502-9

Special Section: Computer Architecture and Systems

• Special Section on Selected Papers from NPC 2011 •


Using Memory in the Right Way to Accelerate Big Data Processing

Dong Yan1(阎栋), Xu-Sen Yin2(尹绪森), Cheng Lian2(连城), Xiang Zhong2(钟翔), Xin Zhou2(周鑫), Member, CCF, ACM, Gan-Sha Wu2(吴甘沙), Senior Member, CCF, Member, ACM   

  1 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
  2 Intel Labs China, Beijing 100091, China
  • Received: 2014-07-14; Revised: 2014-12-16; Online: 2015-01-05; Published: 2015-01-05
  • About the author: Dong Yan is a Ph.D. candidate at Tsinghua University. He received his B.S. degree in computer science and technology from Wuhan University in 2009. His research interests include operating system kernels, debugging, and big data.


Abstract: Big data processing accounts for a growing share of data center computation. However, recent research has indicated that big data workloads cannot make full use of modern memory systems. We find that the dramatic inefficiency of big data processing stems from the enormous number of cache misses and the stalls caused by dependent memory accesses. In this paper, we introduce two optimizations to tackle these problems. The first is the slice-and-merge strategy, which reduces the last-level cache miss rate of the sort procedure. The second is direct-memory-access, which redesigns the storage format of key/value pairs to eliminate memory access dependencies. These optimizations are evaluated with both micro-benchmarks and the real-world benchmark suite HiBench. The results of our micro-benchmarks clearly demonstrate the effectiveness of our optimizations in terms of hardware event counts, and the results on HiBench show a 1.21X average speedup at the application level. Both results illustrate that careful hardware/software co-design can greatly improve the memory efficiency of big data processing. Our work has already been integrated into the Intel Distribution for Apache Hadoop.
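The slice-and-merge idea from the abstract can be illustrated with a minimal sketch: rather than sorting one large buffer whose working set overflows the last-level cache, sort small cache-sized slices and then stream-merge them. The slice size and record shape here are illustrative assumptions, not the paper's actual parameters or implementation.

```python
import heapq

def slice_and_merge_sort(records, slice_size=4096):
    """Sort `records` by sorting cache-sized slices, then k-way merging them.

    Each slice's working set fits in cache during the sort phase; the
    merge phase then touches memory mostly sequentially, which is the
    cache-friendliness argument behind the slice-and-merge strategy.
    """
    slices = [sorted(records[i:i + slice_size])
              for i in range(0, len(records), slice_size)]
    # heapq.merge lazily merges the already-sorted slices.
    return list(heapq.merge(*slices))
```

The output is identical to a plain sort; only the memory access pattern differs, which is why the paper evaluates the technique with hardware event counts rather than correctness tests.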

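The direct-memory-access optimization redesigns the key/value storage format to remove dependent memory accesses. A minimal sketch of the underlying idea: keep length-prefixed key/value pairs contiguous in one flat buffer, so iteration is a sequential scan instead of chasing per-object pointers. The record layout (`FlatKVBuffer`, a 4-byte length header per field) is an assumption for illustration, not the paper's actual format.

```python
import struct

class FlatKVBuffer:
    """Flat, length-prefixed key/value storage: [klen][vlen][key][value]..."""
    HEADER = struct.Struct("<II")  # key length, value length (little-endian)

    def __init__(self):
        self.buf = bytearray()

    def append(self, key: bytes, value: bytes):
        # Records are packed back-to-back; no per-record heap objects.
        self.buf += self.HEADER.pack(len(key), len(value)) + key + value

    def __iter__(self):
        off, end = 0, len(self.buf)
        while off < end:
            klen, vlen = self.HEADER.unpack_from(self.buf, off)
            off += self.HEADER.size
            yield (bytes(self.buf[off:off + klen]),
                   bytes(self.buf[off + klen:off + klen + vlen]))
            off += klen + vlen
```

Because the next record's position is computed arithmetically from the lengths just read, the scan avoids the pointer-dereference chains that stall the memory pipeline in an object-per-pair layout.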
[1] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1):107-113.

[2] Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein J M. GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010. http://arxiv.org/abs/1006.4990, Dec. 2014.

[3] Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. the 9th USENIX Conference on Networked Systems Design and Implementation, April 2012, pp.15-28.

[4] Shafer J, Rixner S, Cox A L. The Hadoop distributed filesystem: Balancing portability and performance. In Proc. IEEE International Symposium on Performance Analysis of Systems and Software, March 2010, pp.122-133.

[5] Wang Y, Xu C, Li X, Yu W. JVM-bypass for efficient Hadoop shuffling. In Proc. the 27th IEEE International Symposium on Parallel and Distributed Processing, May 2013, pp.569-578.

[6] Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Toward dark silicon in servers. IEEE Micro, 2011, 31(4): 6-15.

[7] Horowitz M, Alon E, Patil D, Naffziger S, Kumar R, Bernstein K. Scaling, power, and the future of CMOS. In Proc. IEEE Int. Electron Devices Meeting, December 2005, pp.7-15.

[8] Ferdman M, Adileh A, Kocberber O, Volos S, Alisafaee M, Jevdjic D, Kaynak C, Popescu A D, Ailamaki A, Falsafi B. Quantifying the mismatch between emerging scale-out applications and modern processors. ACM Transactions on Computer Systems, 2012, 30(4): Article No. 15.

[9] Huang S, Huang J, Dai J, Xie T, Huang B. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Proc. the 26th IEEE International Conference on Data Engineering Workshops, March 2010, pp.41-51.

[10] Yang D, Zhong X, Yan D, Dai F, Yin X, Lian C, Zhu Z, Jiang W, Wu G. NativeTask: A Hadoop compatible framework for high performance. In Proc. IEEE International Conference on Big Data, October 2013, pp.94-101.

[11] Chen R, Chen H, Zang B. Tiled-MapReduce: Optimizing resource usages of data-parallel applications on multicore with tiling. In Proc. the 19th International Conference on Parallel Architectures and Compilation Techniques, September 2010, pp.523-534.

[12] Brodal G S, Fagerberg R, Vinther K. Engineering a cache-oblivious sorting algorithm. Journal of Experimental Algorithmics, 2008, 12(2): Article No. 22.

[13] Levinthal D. Cycle accounting analysis on Intel® Core™ 2 processors. Intel Corp., 2012. https://software.intel.com/sites/products/collateral/hpc/vtune/cycle_accounting_analysis.pdf, Dec. 2014.

[14] Levinthal D. Performance analysis guide for Intel® Core™ i7 processor and Intel® Xeon™ 5500 processors. Intel Performance Analysis Guide, 2009. https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf, Dec. 2014.

[15] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. the 13th IEEE International Symposium on High Performance Computer Architecture, February 2007, pp.13-24.

[16] Yoo R M, Romano A, Kozyrakis C. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proc. IEEE International Symposium on Workload Characterization, October 2009, pp.198-207.

[17] He B, Fang W, Luo Q, Govindaraju N K, Wang T. Mars: A MapReduce framework on graphics processors. In Proc. the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008, pp.260-269.

[18] Hong C, Chen D, Chen W, Zheng W, Lin H. MapCG: Writing parallel program portable between CPU and GPU. In Proc. the 19th International Conference on Parallel Architectures and Compilation Techniques, September 2010, pp.217-226.