›› 2013, Vol. 28 ›› Issue (6): 973-988.doi: 10.1007/s11390-013-1392-7

Special Issue: Data Management and Data Mining

• Special Section on Cloud Data Management • Previous Articles     Next Articles

HEDC++:An Extended Histogram Estimator for Data in the Cloud

Ying-Jie Shi1 (史英杰), Xiao-Feng Meng1 (孟小峰), Senior Member, CCF, Member, ACM, IEEE Fusheng Wang2, 3, and Yan-Tao Gan1 (干艳桃)   

  1. 1 School of Information, Renmin University of China, Beijing 100872, China;
    2 Department of Biomedical Informatics, Emory University, Atlanta 30322, U.S.A.;
    3 Department of Mathematics and Computer Science, Emory University, Atlanta 30322, U.S.A.
  • Received:2012-12-02 Revised:2013-05-03 Online:2013-11-05 Published:2013-11-05
  • Supported by:

    This research was partially supported by the National Natural Science Foundation of China under Grant Nos. 61070055, 91024032, 91124001, the Fundamental Research Funds for the Central Universities of China, the Research Funds of Renmin University of China under Grant No. 11XNL010, and the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010701, 2013AA013204.

With increasing popularity of cloud-based data management, improving the performance of queries in the cloud is an urgent issue to solve. Summary of data distribution and statistical information has been commonly used in traditional databases to support query optimization, and histograms are of particular interest. Naturally, histograms could be used to support query optimization and efficient utilization of computing resources in the cloud. Histograms could provide helpful reference information for generating optimal query plans, and generate basic statistics useful for guaranteeing the load balance of query processing in the cloud. Since it is too expensive to construct an exact histogram on massive data, building an approximate histogram is a more feasible solution. This problem, however, is challenging to solve in the cloud environment because of the special data organization and processing mode in the cloud. In this paper, we present HEDC++, an extended histogram estimator for data in the cloud, which provides efficient approximation approaches for both equi-width and equi-depth histograms. We design the histogram estimate workflow based on an extended MapReduce framework, and propose novel sampling mechanisms to leverage the sampling efficiency and estimate accuracy. We experimentally validate our techniques on Hadoop and the results demonstrate that HEDC++ can provide promising histogram estimate for massive data in the cloud.

[1] Thusoo A, Sarma J, Jain N et al. Hive: A warehousing solution over a Map-Reduce framework. In Proc. the 35th Conference of Very Large Databases (VLDB2009), August 2009, pp.1626-1629.

[2] Olston C, Reed B, Srivastava U et al. Pig latin: A notso-foreign language for data processing. In Proc. the ACM Int. Conf. Management of Data (SIGMOD2008), June 2008, pp.1099-1110.

[3] Abadi D J. Data management in the cloud: Limitations and opportunities. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2009, 32(1): 3-12.

[4] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. In Proc. the 6th Symposium on Opearting Systems Design and Implementation (OSDI2004), December 2004, pp.137-150.

[5] Blanas S, Patel J, Ercegovac V et al. A comparison of join algorithms for log processing in MapReduce. In Proc. the ACM Int. Conf. Management of Data (SIGMOD2010), June 2010, pp.975-986.

[6] Okcan A, Riedewald M. Processing theta-joins using MapReduce. In Proc. the ACM International Conference on Management of Data (SIGMOD2011), June 2011, pp.949-960.

[7] Shi Y J, Meng X F, Wang F S et al. HEDC: A histogram estimator for data in the cloud. In Proc. the 4th Int. Workshop on Cloud Data Management (CloudDB2012), Oct. 29-Nov. 2, 2012, pp.51-58.

[8] Poosala V, Ioannidis Y E, Haas P J, Shekita E J. Improved histograms for selectivity estimation of range predicates. In Proc. the ACM International Conference on Management of Data (SIGMOD1996), June 1996, pp.294-305.

[9] Ioannidis Y E. The history of histograms (abridged). In Proc. the 29th Conference of Very Large Databases (VLDB2003), September 2003, pp.19-30.

[10] Piatetsky-Shapiro G, Connell C. Accurate estimation of the number of tuples satisfying a condition. In Proc. the ACM International Conference on Management of Data (SIGMOD1984), June 1984, pp.256-276.

[11] Gibbons P B, Matias Y, Poosala V. Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems, 2002, 27(3): 261-298.

[12] Chaudhuri S, Motwani R, Narasayya V. Random sampling for histogram construction: How much is enough? In Proc. ACM International Conference on Management of Data (SIGMOD1998), June 1998, pp.436-447.

[13] Chaudhuri S, Motwani R, Narasayya V. Using random sampling for histogram construction. Technical Report, Microsoft, http://citeseerx.ist.psu.edu/showciting?cid=467221, 1997.

[14] Chaudhuri S, Das G, Srivastava U. Effective use of block-level sampling in statistics estimation. In Proc. ACM International Conference on Management of Data (SIGMOD2004), June 2004, pp.287-298.

[15] Jestes J, Yi K, Li F F. Building wavelet histograms on large data in MapReduce. In Proc. the 37th International Conference of Very Large Databases (VLDB2011), August 29September 3, 2011, pp.109-120.

[16] Mousavi H, Zaniolo C. Fast and accurate computation of equi-depth histograms over data streams. In Proc. the 14th International Conference on Extending Database Technology (EDBT2011), March 2011, pp.69-80.

[17] Cochran W G. Sampling Techniques. John Wiley and Sons, 1977.

[18] Francisco C A, Fuller W A. Quantile estimation with a complex survey design. The Annals of Statistics, 1991, 19(1): 454-469.

[19] Woodruff R S. Confidence intervals for medians and other position measures. Journal of the American Statistical Association, 1952, 47(260): 635-646.
No related articles found!
Full text



[1] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] Chen Shihua;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] Min Yinghua; Han Zhide;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] Tang Tonggao; Zhao Zhaokeng;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] Min Yinghua;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] Zhang Bo; Zhang Ling;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[10] Zhu Hong;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved