1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China;
2 University of Chinese Academy of Sciences, Beijing 100049, China;
3 Tencent Inc., Beijing 100080, China
Abstract Many NoSQL (Not Only SQL) databases were proposed to store and query on a huge amount of data. Some of them like BigTable, PNUTS, and HBase, can be modeled as distributed ordered tables (DOTs). Many additional indexing techniques have been presented to support queries on non-key columns for DOTs. However, there was no comprehensive analysis or comparison of these techniques, which brings troubles to users in selecting or proposing a proper indexing technique for a certain workload. This paper proposes a taxonomy based on six indexing issues to classify indexing techniques on DOTs and provides a comprehensive review of the state-of-the-art techniques. Based on the taxonomy, we propose a performance model named QSModel to estimate the query time and storage cost of these techniques and run experiments on a practical workload from Tencent to evaluate this model. The results show that the maximum error rates of the query time and storage cost are 24.2% and 9.8%, respectively. Furthermore, we propose IndexComparator, an open source project that implements representative indexing techniques. Therefore, users can select the best-fit indexing technique based on both theoretical analysis and practical experiments.
This work is partially supported by the Strategic Priority Program of Chinese Academy of Sciences under Grant No. XDB02040009, the Key Program of the National Natural Science Foundation of China under Grant No. 61532016, the Key Program of Cloud Computing and Big Data of the Ministry of the Science and Technology of China under Grant No. 2016YFB1000200, and Tencent Inc.
About author: Chen Feng is a Ph.D. candidate of Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He received his B.S. degree in software engineering from Nankai University, Tianjin, in 2011. His current research interests include big data computing and distributed system. He is a student member of CCF.
Cite this article:
Chen Feng, Chun-Dian Li, Rui Li.Indexing Techniques of Distributed Ordered Tables: A Survey and Analysis[J] Journal of Computer Science and Technology, 2018,V33(1): 169-189
 Chang F, Dean J, Ghemawat S, Hsieh W C, Wallach D A, Burrows M, Chandra T, Fikes A, Gruber R E. BigTable:A distributed storage system for structured data. ACM Trans. Computer Systems (TOCS), 2008, 26(2):Article No. 4. Lakshman A, Malik P. Cassandra:A decentralized structured storage system. ACM SIGOPS Operating Systems Review, 2010, 44(2):35-40. Cooper B F, Ramakrishnan R, Srivastava U, Silberstein A, Bohannon P, Jacobsen H A, Puz N, Weaver D, Yerneni R. PNUTS:Yahoo!'s hosted data serving platform. Proceedings of the VLDB Endowment, 2008, 1(2):1277-1288. Harter T, Borthakur D, Dong S Y, Aiyer A, Tang L Y, Arpaci-Dusseau A C, Arpaci-Dusseau R H. Analysis of HDFS under HBase:A Facebook messages case study. In Proc. the 12th USENIX Conf. File and Storage Technologies, February 2014, pp.199-212. George L. HBase:The Definitive Guide. O'Reilly Media, Inc., 2011. Feng C, Yang X, Liang F, Sun X H, Xu Z W. LCIndex:A local and clustering index on distributed ordered tables for flexible multi-dimensional range queries. In Proc. the 44th Int. Conf. Parallel Processing, September 2015, pp.719-728. Nishimura S, Das S, Agrawal D, Abbadi A E. MD-HBase:A scalable multi-dimensional data infrastructure for location aware services. In Proc. the 12th IEEE Int. Conf. Mobile Data Management, June 2011, pp.7-16. Ma Y Z, Rao J, Hu W S, Meng X F, Han X, Zhang Y, Chai Y P, Liu C Q. An efficient index for massive IOT data in cloud environment. In Proc. the 21st ACM Int. Conf. Information and Knowledge Management, October 2012, pp.2129-2133. Zou Y Q, Liu J, Wang S C, Zha L, Xu Z W. CCIndex:A complemental clustering index on distributed ordered tables for multi-dimensional range queries. In Proc. the 7th IFIP Int. Conf. Network and Parallel Computing, September 2010, pp.247-261. Zhang C, Chen X Y, Shi Z L, Ge B. Algorithms for spatiotemporal queries in HBase. Journal of Chinese Computer Systems, 2016, 37(11):2409-2415. (in Chinese) Hsu Y T, Pan Y C, Wei L Y, Peng W C, Lee W C. Key formulation schemes for spatial index in cloud data managements. In Proc. the 13th IEEE Int. Conf. Mobile Data Management, July 2012, pp.21-26. Zhou X, Zhang X, Wang Y H, Li R, Wang S. Efficient distributed multi-dimensional index for big data management. In Proc. the 14th Int. Conf. Web-Age Information Management, June 2013, pp.130-141. O'Neil P, Cheng E, Gawlick D, O'Neil E. The log-structured merge-tree (LSM-tree). Acta Informatica, 1996, 33(4):351-385. Guttman A. R-trees:A dynamic index structure for spatial searching. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 1984, pp.47-57. Sellis T K, Roussopoulos N, Faloutsos C. The R+-tree:A dynamic index for multi-dimensional objects. In Proc. the 13th Int. Conf. Very Large Data Bases, September 1987, pp.507-518. Silberschatz A, Korth H F, Sudarshan S. Database System Concepts (3rd edition). McGraw-Hill, 1997. Wang L, Chen B, Liu Y H. Distributed storage and index of vector spatial data based on HBase. In Proc. the 21st Int. Conf. Geoinformatics, June 2013. Ge W, Luo S M, Zhou W H, Zhao D, Tang Y, Zhou J, Qu W W, Yuan C F, Huang Y H. HiBase:A hierarchical indexing mechanism and system for efficient HBase query. Chinese Journal of Computers, 2016, 39(1):140-153. (in Chinese) Wang S C, Mares M A, Guo Y K. CGDM:Collaborative genomic data model for molecular profiling data using NoSQL. Bioinformatics, 2016, 32(23):3654-3660. Serrano D, Han D, Stroulia E. From relations to multidimensional maps:Towards an SQL-to-HBase transformation methodology. In Proc. the 8th IEEE Int. Conf. Cloud Computing, July 2015, pp.81-89. Chang C R, Hsieh M J, Wu J J, Wu P Y, Liu P F. HSQL:A highly scalable cloud database for multi-user query processing. In Proc. the 5th IEEE Int. Conf. Cloud Computing, June 2012, pp.943-944. Agrawal P, Silberstein A, Cooper B F, Srivastava U, Ramakrishnan R. Asynchronous view maintenance for VLSD databases. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2009, pp.179-192. Zhou X, Li H, Zhang X, Wang S, Ma Y Y, Liu K Y, Zhu M, Huang M L. ABR-Tree:An efficient distributed multidimensional indexing approach for massive data. In Proc. Int. Conf. Algorithms and Architectures for Parallel Processing, November 2015, pp.781-790. Zhang Y, Ma Y Z, Meng X F. Efficient processing of spatial keyword queries on HBase. Journal of Chinese Computer Systems, 2012, 33(10):2141-2146. (in Chinese) Zhang N Y, Zheng G Z, Chen H J, Chen J Y, Chen X. HBaseSpatial:A scalable spatial data storage based on HBase. In Proc. the 13th IEEE Int. Conf. Trust Security and Privacy in Computing and Communications, September 2014, pp.644-651. Han D, Stroulia E. HGrid:A data model for large geospatial data sets in HBase. In Proc. the 6th IEEE Int. Conf. Cloud Computing, June 28-July 3, 2013, pp.910-917. Zhou W, Han J Z, Zhang Z, Dai J, Xu Z Y. HDKV:Supporting efficient high-dimensional similarity search in keyvalue stores. Concurrency and Computation:Practice and Experience, 2013, 25(12):1675-1698. Ma Y Z, Zhang Y, Meng X F. ST-HBase:A scalable data management system for massive geo-tagged objects. In Proc. the 14th Int. Conf. Web-Age Information Management, June 2013, pp.155-166. Tang X S, Han B D, Chen H. A hybrid index for multidimensional query in HBase. In Proc. the 4th Int. Conf. Cloud Computing and Intelligence Systems, August 2016, pp.332-336. Ma Y Z, Meng X F, Wang S Y, Hu W S, Han X, Zhang Y. An efficient index method for multi-dimensional query in cloud environment. In Proc. the 2nd Int. Conf. Cloud Computing and Big Data in Asia, June 2015, pp.307-318. Chen X Y, Zhang C, Ge B, Xiao W D. Efficient historical query in HBase for spatio-temporal decision support. International Journal of Computers Communications & Control, 2016, 11(5):613-630. Lee K, Ganti R K, Srivatsa M, Liu L. Efficient spatial query processing for big data. In Proc. the 22nd ACM SIGSPATIAL Int. Conf. Advances in Geographic Information Systems, November 2014, pp.469-472. Wang H P, Ci X, Meng X F. Fast multi-fields query processing in bigtable based cloud systems. In Proc. the 14th Int. Conf. Web-Age Information Management, June 2013, pp.142-154. Huang S, Wang B T, Zhu J Y, Wang G R, Yu G. R-HBase:A multi-dimensional indexing framework for cloud computing environment. In Proc. IEEE Int. Conf. Data Mining Workshop, December 2014, pp.569-574. Li Q C, Lu Y, Gong X L, Zhang J. Optimizational method of HBase multi-dimensional data query based on Hilbert space-filling curve. In Proc. the 9th Int. Conf. P2P Parallel Grid Cloud and Int. Conf. Computing, November 2014, pp.469-474. Li S, Hu S H, Ganti R, Srivatsa M, Abdelzaher T. Pyro:A spatial-temporal big-data storage system. In Proc. the USENIX Annual Technical Conf., July 2015, pp.97-109. Du N B, Zhan J F, Zhao M, Xiao D R, Xie Y C. Spatiotemporal data index model of moving objects on fixed networks using HBase. In Proc. IEEE Int. Conf. Computational Intelligence & Communication Technology, February 2015, pp.247-251.