|  Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1):107-113. Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein J M. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010. http://arxiv.org/abs/1006.4990, Dec. 2014. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. the 9th USENIX Conference on Networked Systems Design and Implementation, April 2012, pp.15-28. Shafer J, Rixner S, Cox A L. The Hadoop distributed filesystem: Balancing portability and performance. In Proc. IEEE International Symposium on Performance Analysis of Systems and Software, March 2010, pp.122-133. Wang Y, Xu C, Li X, Yu W. JVM-bypass for efficient Hadoop shuffling. In Proc. the 27th IEEE International Symposium on Parallel and Distributed Processing, May 2013, pp.569-578. Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Toward dark silicon in servers. IEEE Micro, 2011, 31(4): 6-15. Horowitz M, Alon E, Patil D, Naffziger S, Kumar R, Bernstein K. Scaling, power, and the future of CMOS. In Proc. IEEE Int. Electron Devices Meeting, December 2005, pp.7-15. Ferdman M, Adileh A, Kocberber O, Volos S, Alisafaee M, Jevdjic D, Kaynak C, Popescu A D, Ailamaki A, Falsafi B. Quantifying the mismatch between emerging scale-out applications and modern processors. ACM Transactions on Computer Systems, 2012, 30(4): Article No. 15. Huang S, Huang J, Dai J, Xie T, Huang B. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Proc. the 26th IEEE International Conference on Data Engineering Workshops, March 2010, pp.41-51. Yang D, Zhong X, Yan D, Dai F, Yin X, Lian C, Zhu Z, Jiang W, Wu G. NativeTask: A Hadoop compatible framework for high performance. In Proc. IEEE International Conference on Big Data, October 2013, pp.94-101. Chen R, Chen H, Zang B. Tiled-MapReduce: Optimizing resource usages of data-parallel applications on multicore with tiling. In Proc. the 19th International Conference on Parallel Architectures and Compilation Techniques, September 2010, pp.523-534. Brodal G S, Fagerberg R, Vinther K. Engineering a cacheoblivious sorting algorithm. Journal of Experimental Algorithmics, 2008, 12(2): Article No. 22. Levinthal D. Cycle accounting analysis on Intel® CoreTM 2 processors. Intel Corp., 2012. https://software.intel.com/sites/ products/collateral/hpc/vtune/cycle accounting analysis. pdf, Dec. 2014. Levinthal D. Performance analysis guide for Intel® CoreTM i7 processor and Intel® XeonTM 5500 processors. Intel Performance Analysis Guide, 2009. https://software. intel.com/sites/products/collateral/hpc/vtune/performance analysis guide.pdf, Dec. 2014. Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. the 13th IEEE International Symposium on High Performance Computer Architecture, February 2007, pp.13-24. Yoo R M, Romano A, Kozyrakis C. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proc. IEEE International Symposium on Workload Characterization, October 2009, pp.198-207. He B, Fang W, Luo Q, Govindaraju N K, Wang T. Mars: A MapReduce framework on graphics processors. In Proc. the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008, pp.260-269. Hong C, Chen D, Chen W, Zheng W, Lin H. MapCG: Writing parallel program portable between CPU and GPU. In Proc. the 19th International Conference on Parallel Architectures and Compilation Techniques, September 2010, pp.217-226.