›› 2013, Vol. 28 ›› Issue (6): 989-1011.doi: 10.1007/s11390-013-1393-6

Special Issue: Data Management and Data Mining

• Special Section on Cloud Data Management • Previous Articles     Next Articles

Partition-Based Online Aggregation with Shared Sampling in the Cloud

Yu-Xiang Wang (王宇翔), Jun-Zhou Luo (罗军舟), Senior Member, CCF, Member, ACM, IEEE Ai-Bo Song* (宋爱波), Member, CCF, ACM, and Fang Dong (东方), Member, CCF, ACM   

  1. School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
  • Received:2012-12-01 Revised:2013-05-07 Online:2013-11-05 Published:2013-11-05
  • About author:Yu-Xiang Wang received the B.S. degree in software engineering from Tianjin University, China, in 2008. He is currently a Ph.D. student in the School of Computer Science and Engineering, Southeast University, Nanjing. His current research interests include cloud computing, big data processing and query optimization.
  • Supported by:

    This work is supported by the National Basic Research 973 Program of China under Grant No. 2010CB328104, the National Natural Science Foundation of China under Grant Nos. 61070161, 61202449, 61320106007, the National High Technology Research and Development 863 Program of China under Grant No. 2013AA013503, the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No. 20110092130002, the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201, the Key Laboratory of Computer Network and Information Integration of Ministry of Education of China under Grant No. 93K-9, and the Shanghai Key Laboratory of Scalable Computing and Systems of China under Grant No. 2010DS680095.

Online aggregation is an attractive sampling-based technology to response aggregation queries by an estimate to the final result, with the confidence interval becoming tighter over time. It has been built into a MapReduce-based cloud system for big data analytics, which allows users to monitor the query progress, and save money by killing the computation early once sufficient accuracy has been obtained. However, there are several limitations that restrict the performance of online aggregation generated from the gap between the current mechanism of MapReduce paradigm and the requirements of online aggregation, such as: 1) the low sampling efficiency due to the lack of consideration of skewed data distribution for online aggregation in MapReduce, and 2) the large redundant I/O cost of online aggregation caused by the independent job execution mechanism of MapReduce. In this paper, we present OLACloud, a MapReduce-based cloud system to well support online aggregation for different data distributions and large-scale concurrent query processing. We propose a content-aware repartition method with a fair-allocation block placement strategy to increase the sampling efficiency and guarantee the storage and computation load balancing simultaneously. We also develop a shared sampling method to share the sampling opportunities among multiple queries to reduce redundant I/O cost. We also implement OLACloud in Hadoop, and conduct an extensive experimental study on the TPC-H benchmark for skewed data distribution. Our results demonstrate the efficiency and effectiveness of OLACloud.

[1] Herodotou H, Lim H, Luo G et al. Starfish: A self-tuning system for big data analytics. In Proc. the 15th CIDR, Apr. 2011, pp.261-272.

[2] Wu S, Ooi B C, Tan K L. Continuous sampling for online aggregation over multiple queries. In Proc. the 2010 International Conference on Management of Data (SIGMOD), June 2010, pp.651-662.

[3] Chaudhuri S, Das G, Datar M et al. Overcoming limitations of sampling for aggregation queries. In Proc. the 17th Int. Conf. Data Engineering, Apr. 2001, pp.534-544.

[4] Laptev N, Zeng K, Zaniolo C. Early accurate results for advanced analytics on MapReduce. PVLDB, 2012, 5(10): 10281039.

[5] Hellerstein J M, Haas P J, Wang H J. Online aggregation. ACM SIGMOD Record., 1997, 26(2): 171-182.

[6] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107-113.

[7] Borkar V, Carey M, Grover R et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proc. the 27th International Conference on Data Engineering, Apr. 2011, pp.1151-1162.

[8] Pansare N, Borkar V R, Jermaine C et al. Online aggregation for large MapReduce jobs. PVLDB, 2011, 4(11): 1135-1145.

[9] Böose J H, Andrzejak A, Höogqvist M. Beyond online aggregation: Parallel and incremental data mining with online mapreduce. In Proc. MDAC, Apr. 2010, Article No.3.

[10] Condie T, Conway N, Alvaro P et al. Online aggregation and continuous query support in MapReduce. In Proc. the 2010 International Conference on Management of Data, June 2010, pp.1115-1118.

[11] Shi Y, Meng X,Wang F et al. You can stop early with COLA: Online processing of aggregate queries in the cloud. In Proc. the 21st ACM International Conference on Information and Knowledgy Management, Oct. 29-Nov. 2, 2012, pp.1223-1232.

[12] Grover R, Carey M J. Extending MapReduce for efficient predicate-based sampling. In Proc. the 28th International Conference on Data Engineering, Apr. 2012, pp.486-497.

[13] Wang Y, Luo J, Song A, Jin J H, Dong F. Improving online aggregation performance for skewed data distribution. In Proc. Database Systems for Advanced Applications, Apr. 2012, pp.18-32.

[14] Chaudhuri S, Das G, Srivastava U. Effective use of blocklevel sampling in statistics estimation. In Proc. the 2004 International Conference on Management of Data, June 2004, pp.287-298.

[15] Jacobs A. The pathologies of big data. Communications of the ACM, 2009, 52(8): 36-44.

[16] Soroush E, Balazinska M, Wang D. Arraystore: A storage manager for complex parallel array processing. In Proc. the 2011 International Conference on Management of Data, June 2011, pp.253-264.

[17] Eltabakh M Y, Tian Y, Ozcan F et al. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB, 2011, 4(9): 575-585.

[18] Nykiel T, Potamias M, Mishra C et al. MRShare: Sharing across multiple queries in MapReduce. PVLDB, 2010, 3(1/2): 494-505.

[19] Wu S, Jiang S, Ooi B C et al. Distributed online aggregations. PVLDB, 2009, 2(1): 443-454.

[20] Zaharia M, Borthakur D, Sen Sarma J et al. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proc. the 5th European Conference on Computer System, Apr. 2010, pp.265-278.

[21] Chaudhuri S, Narasyrra V. Program for tpc-d data generation with skew. Technical Report, ftp://ftp.research.microsoft.com/pub/user./viveknar/tpcdskew, Dec. 2012.

[22] Haas P J. Large-sample and deterministic confidence intervals for online aggregation. In Proc. the 9th International Conference on Scientific and Statistical Database Management, Aug. 1997, pp.51-62.

[23] Haas P J, Hellerstein J M. Ripple joins for online aggregation. ACM SIGMOD Record, 1999, 28(2): 287-298.

[24] Luo G, Ellmann C J, Haas P J et al. A scalable hash ripple join algorithm. In Proc. the 2002 International Conference on Management of Data, June 2002, pp.252-262.
No related articles found!
Full text



[1] Yao Shu; Zhang Bo;. Situated Learning of a Behavior-Based Mobile Robot Path Planner[J]. , 1995, 10(4): 375 -379 .
[2] Hu Shimin;. A Subdivision Scheme for Rational Triangular Bézier Surfaces[J]. , 1996, 11(1): 9 -16 .
[3] Yi Bo; Tao Xianping; G.Cioni; A.Colagrossi;. Intuitive Minimal Abduction in Sequent Calculi[J]. , 1998, 13(3): 209 -219 .
[4] Dun-Ren Che. Accomplishing Deterministic XML Query Optimization[J]. , 2005, 20(3): 357 -366 .
[5] Bao-Xia Fan, Liang Yang, Jiang-Mei Wang, Ru Wang, Bin Xiao, Ying Xu, Dong Liu, and Ji-Ye Zhao. Physical Implementation of the 1GHz Godson-3 Quad-Core Microprocessor[J]. , 2010, 25(2): 192 -199 .
[6] Hua Huang (黄华), Senior Member, CCF, Member, IEEE and Xiang-Wang Ma (马湘旺). Frontal and Semi-Frontal Facial Caricature Synthesis Using Non-Negative Matrix Factorization[J]. , 2010, 25(6): 1282 -1292 .
[7] Wei Jiang (姜伟), Tian Wu (吴甜), Song-Lin Hu (虎嵩林), Senior Member, CCF, and Zhi-Yong Liu (刘志勇), Senior Member, CCF. QoS-Aware Automatic Service Composition: A Graph View[J]. , 2011, 26(5): 837 -853 .
[8] Wen-Yong Zhao, Shao-Lin Chen, Yuan Zheng, and Si-Long Peng. Lighting Estimation of a Convex Lambertian Object Using Redundant Spherical Harmonic Frames[J]. , 2013, 28(3): 454 -467 .
[9] Cheng-Lin Fan, Jun Luo, Wen-Cheng Wang, Fa-Rong Zhong, Binhai Zhu. On some proximity problems of colored sets[J]. , 2014, 29(5): 879 -886 .
[10] Chong Cao, Hai-Zhou Ai. Facial similarity learning with humans in the loop[J]. , 2015, 30(3): 499 -510 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved