›› 2018,Vol. 33 ›› Issue (1): 169-189.doi: 10.1007/s11390-018-1813-8

所属专题: 不能删除 Computer Architecture and Systems Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

分布式顺序表的索引技术:索引和分析

Chen Feng1,2, Student Member, CCF, Chun-Dian Li1,2, Student Member, CCF, Rui Li3   

  1. 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Tencent Inc., Beijing 100080, China
  • 收稿日期:2017-03-27 修回日期:2017-05-12 出版日期:2018-01-05 发布日期:2018-01-05
  • 作者简介:Chen Feng is a Ph.D. candidate of Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He received his B.S. degree in software engineering from Nankai University, Tianjin, in 2011. His current research interests include big data computing and distributed system. He is a student member of CCF.
  • 基金资助:

    This work is partially supported by the Strategic Priority Program of Chinese Academy of Sciences under Grant No. XDB02040009, the Key Program of the National Natural Science Foundation of China under Grant No. 61532016, the Key Program of Cloud Computing and Big Data of the Ministry of the Science and Technology of China under Grant No. 2016YFB1000200, and Tencent Inc.

Indexing Techniques of Distributed Ordered Tables: A Survey and Analysis

Chen Feng1,2, Student Member, CCF, Chun-Dian Li1,2, Student Member, CCF, Rui Li3   

  1. 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Tencent Inc., Beijing 100080, China
  • Received:2017-03-27 Revised:2017-05-12 Online:2018-01-05 Published:2018-01-05
  • About author:Chen Feng is a Ph.D. candidate of Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He received his B.S. degree in software engineering from Nankai University, Tianjin, in 2011. His current research interests include big data computing and distributed system. He is a student member of CCF.
  • Supported by:

    This work is partially supported by the Strategic Priority Program of Chinese Academy of Sciences under Grant No. XDB02040009, the Key Program of the National Natural Science Foundation of China under Grant No. 61532016, the Key Program of Cloud Computing and Big Data of the Ministry of the Science and Technology of China under Grant No. 2016YFB1000200, and Tencent Inc.

业界提出了许多NoSQL数据库以存储和查询大量数据。其中如BigTable、PNUTS和HBase等一些数据库可以被归纳为分布式顺序表(Distributed Ordered Table,DOT)。分布式顺序表之上,存在许多额外的索引技术以支持非主键查询。然而,目前并没有针对这些索引技术的分析和比较,这为用户针对特定工作负载选择或者设计的索引技术制造了难点。本文提出了一个基于6项分布式顺序表的索引要素的分类方法,并对现有技术进行了对比。基于该分类方法,本文提出了一个性能模型QSModel,以预测特定索引技术的查询时间和存储开销,同时本文还采用来自腾讯的真实数据集对该模型进行了验证。实验结果显示,查询时间和存储开销的最大误差分别是24.2和9.8%。本文还提出了IndexComparator,一个继承了代表性索引技术的开源项目。基于此,用户可以从理论分析和实际实验两个方面选择最合适的索引技术。

Abstract: Many NoSQL (Not Only SQL) databases were proposed to store and query on a huge amount of data. Some of them like BigTable, PNUTS, and HBase, can be modeled as distributed ordered tables (DOTs). Many additional indexing techniques have been presented to support queries on non-key columns for DOTs. However, there was no comprehensive analysis or comparison of these techniques, which brings troubles to users in selecting or proposing a proper indexing technique for a certain workload. This paper proposes a taxonomy based on six indexing issues to classify indexing techniques on DOTs and provides a comprehensive review of the state-of-the-art techniques. Based on the taxonomy, we propose a performance model named QSModel to estimate the query time and storage cost of these techniques and run experiments on a practical workload from Tencent to evaluate this model. The results show that the maximum error rates of the query time and storage cost are 24.2% and 9.8%, respectively. Furthermore, we propose IndexComparator, an open source project that implements representative indexing techniques. Therefore, users can select the best-fit indexing technique based on both theoretical analysis and practical experiments.

[1] Chang F, Dean J, Ghemawat S, Hsieh W C, Wallach D A, Burrows M, Chandra T, Fikes A, Gruber R E. BigTable:A distributed storage system for structured data. ACM Trans. Computer Systems (TOCS), 2008, 26(2):Article No. 4.

[2] Lakshman A, Malik P. Cassandra:A decentralized structured storage system. ACM SIGOPS Operating Systems Review, 2010, 44(2):35-40.

[3] Cooper B F, Ramakrishnan R, Srivastava U, Silberstein A, Bohannon P, Jacobsen H A, Puz N, Weaver D, Yerneni R. PNUTS:Yahoo!'s hosted data serving platform. Proceedings of the VLDB Endowment, 2008, 1(2):1277-1288.

[4] Harter T, Borthakur D, Dong S Y, Aiyer A, Tang L Y, Arpaci-Dusseau A C, Arpaci-Dusseau R H. Analysis of HDFS under HBase:A Facebook messages case study. In Proc. the 12th USENIX Conf. File and Storage Technologies, February 2014, pp.199-212.

[5] George L. HBase:The Definitive Guide. O'Reilly Media, Inc., 2011.

[6] Feng C, Yang X, Liang F, Sun X H, Xu Z W. LCIndex:A local and clustering index on distributed ordered tables for flexible multi-dimensional range queries. In Proc. the 44th Int. Conf. Parallel Processing, September 2015, pp.719-728.

[7] Nishimura S, Das S, Agrawal D, Abbadi A E. MD-HBase:A scalable multi-dimensional data infrastructure for location aware services. In Proc. the 12th IEEE Int. Conf. Mobile Data Management, June 2011, pp.7-16.

[8] Ma Y Z, Rao J, Hu W S, Meng X F, Han X, Zhang Y, Chai Y P, Liu C Q. An efficient index for massive IOT data in cloud environment. In Proc. the 21st ACM Int. Conf. Information and Knowledge Management, October 2012, pp.2129-2133.

[9] Zou Y Q, Liu J, Wang S C, Zha L, Xu Z W. CCIndex:A complemental clustering index on distributed ordered tables for multi-dimensional range queries. In Proc. the 7th IFIP Int. Conf. Network and Parallel Computing, September 2010, pp.247-261.

[10] Zhang C, Chen X Y, Shi Z L, Ge B. Algorithms for spatiotemporal queries in HBase. Journal of Chinese Computer Systems, 2016, 37(11):2409-2415. (in Chinese)

[11] Hsu Y T, Pan Y C, Wei L Y, Peng W C, Lee W C. Key formulation schemes for spatial index in cloud data managements. In Proc. the 13th IEEE Int. Conf. Mobile Data Management, July 2012, pp.21-26.

[12] Zhou X, Zhang X, Wang Y H, Li R, Wang S. Efficient distributed multi-dimensional index for big data management. In Proc. the 14th Int. Conf. Web-Age Information Management, June 2013, pp.130-141.

[13] O'Neil P, Cheng E, Gawlick D, O'Neil E. The log-structured merge-tree (LSM-tree). Acta Informatica, 1996, 33(4):351-385.

[14] Guttman A. R-trees:A dynamic index structure for spatial searching. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 1984, pp.47-57.

[15] Sellis T K, Roussopoulos N, Faloutsos C. The R+-tree:A dynamic index for multi-dimensional objects. In Proc. the 13th Int. Conf. Very Large Data Bases, September 1987, pp.507-518.

[16] Silberschatz A, Korth H F, Sudarshan S. Database System Concepts (3rd edition). McGraw-Hill, 1997.

[17] Wang L, Chen B, Liu Y H. Distributed storage and index of vector spatial data based on HBase. In Proc. the 21st Int. Conf. Geoinformatics, June 2013.

[18] Ge W, Luo S M, Zhou W H, Zhao D, Tang Y, Zhou J, Qu W W, Yuan C F, Huang Y H. HiBase:A hierarchical indexing mechanism and system for efficient HBase query. Chinese Journal of Computers, 2016, 39(1):140-153. (in Chinese)

[19] Wang S C, Mares M A, Guo Y K. CGDM:Collaborative genomic data model for molecular profiling data using NoSQL. Bioinformatics, 2016, 32(23):3654-3660.

[20] Serrano D, Han D, Stroulia E. From relations to multidimensional maps:Towards an SQL-to-HBase transformation methodology. In Proc. the 8th IEEE Int. Conf. Cloud Computing, July 2015, pp.81-89.

[21] Chang C R, Hsieh M J, Wu J J, Wu P Y, Liu P F. HSQL:A highly scalable cloud database for multi-user query processing. In Proc. the 5th IEEE Int. Conf. Cloud Computing, June 2012, pp.943-944.

[22] Agrawal P, Silberstein A, Cooper B F, Srivastava U, Ramakrishnan R. Asynchronous view maintenance for VLSD databases. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2009, pp.179-192.

[23] Zhou X, Li H, Zhang X, Wang S, Ma Y Y, Liu K Y, Zhu M, Huang M L. ABR-Tree:An efficient distributed multidimensional indexing approach for massive data. In Proc. Int. Conf. Algorithms and Architectures for Parallel Processing, November 2015, pp.781-790.

[24] Zhang Y, Ma Y Z, Meng X F. Efficient processing of spatial keyword queries on HBase. Journal of Chinese Computer Systems, 2012, 33(10):2141-2146. (in Chinese)

[25] Zhang N Y, Zheng G Z, Chen H J, Chen J Y, Chen X. HBaseSpatial:A scalable spatial data storage based on HBase. In Proc. the 13th IEEE Int. Conf. Trust Security and Privacy in Computing and Communications, September 2014, pp.644-651.

[26] Han D, Stroulia E. HGrid:A data model for large geospatial data sets in HBase. In Proc. the 6th IEEE Int. Conf. Cloud Computing, June 28-July 3, 2013, pp.910-917.

[27] Zhou W, Han J Z, Zhang Z, Dai J, Xu Z Y. HDKV:Supporting efficient high-dimensional similarity search in keyvalue stores. Concurrency and Computation:Practice and Experience, 2013, 25(12):1675-1698.

[28] Ma Y Z, Zhang Y, Meng X F. ST-HBase:A scalable data management system for massive geo-tagged objects. In Proc. the 14th Int. Conf. Web-Age Information Management, June 2013, pp.155-166.

[29] Tang X S, Han B D, Chen H. A hybrid index for multidimensional query in HBase. In Proc. the 4th Int. Conf. Cloud Computing and Intelligence Systems, August 2016, pp.332-336.

[30] Ma Y Z, Meng X F, Wang S Y, Hu W S, Han X, Zhang Y. An efficient index method for multi-dimensional query in cloud environment. In Proc. the 2nd Int. Conf. Cloud Computing and Big Data in Asia, June 2015, pp.307-318.

[31] Chen X Y, Zhang C, Ge B, Xiao W D. Efficient historical query in HBase for spatio-temporal decision support. International Journal of Computers Communications & Control, 2016, 11(5):613-630.

[32] Lee K, Ganti R K, Srivatsa M, Liu L. Efficient spatial query processing for big data. In Proc. the 22nd ACM SIGSPATIAL Int. Conf. Advances in Geographic Information Systems, November 2014, pp.469-472.

[33] Wang H P, Ci X, Meng X F. Fast multi-fields query processing in bigtable based cloud systems. In Proc. the 14th Int. Conf. Web-Age Information Management, June 2013, pp.142-154.

[34] Huang S, Wang B T, Zhu J Y, Wang G R, Yu G. R-HBase:A multi-dimensional indexing framework for cloud computing environment. In Proc. IEEE Int. Conf. Data Mining Workshop, December 2014, pp.569-574.

[35] Li Q C, Lu Y, Gong X L, Zhang J. Optimizational method of HBase multi-dimensional data query based on Hilbert space-filling curve. In Proc. the 9th Int. Conf. P2P Parallel Grid Cloud and Int. Conf. Computing, November 2014, pp.469-474.

[36] Li S, Hu S H, Ganti R, Srivatsa M, Abdelzaher T. Pyro:A spatial-temporal big-data storage system. In Proc. the USENIX Annual Technical Conf., July 2015, pp.97-109.

[37] Du N B, Zhan J F, Zhao M, Xiao D R, Xie Y C. Spatiotemporal data index model of moving objects on fixed networks using HBase. In Proc. IEEE Int. Conf. Computational Intelligence & Communication Technology, February 2015, pp.247-251.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] Jose K- Raphel; Siu Cheung Hui; Angela Goh;. Class Based Contextual Logic for DOOD[J]. , 1996, 11(2): 161 -170 .
[2] 庄旗铭; 蒋定安; 杨庆;. A Comparative Analysis of Different Arbitration Protocols for Multiple-Bus Multiprocessors[J]. , 1996, 11(3): 313 -325 .
[3] 马宗民; Yan Li;. Using Multivalued Logic in Relational Database Containing Null Value[J]. , 1996, 11(4): 421 -426 .
[4] . 不相容多流体的动画模拟[J]. , 2007, 22(1): 156 -160 .
[5] . 一种新的公钥加密方案[J]. , 2007, 22(1): 95 -02 .
[6] Xin Liu (刘欣) and Tsuyoshi Murata, Member, ACM, IEEE. K部K一致(超)网络中的社区发现[J]. , 2011, 26(5): 778 -791 .
[7] Wei Hu (胡伟), Zhao Dong (董朝), and Guo-Dong Yuan (袁国栋). 基于边缘保持滤波的编辑传播[J]. , 2012, 27(4): 830 -840 .
[8] Bin Xiao, Yi-Fu Zhang, Yan-Ping Gao, Liang Yang, Dong-Mei Wu, and Bao-Xia Fan. 一种健壮和高能效的65nm SOC物理实现方法[J]. , 2013, 28(4): 682 -688 .
[9] Yu-Xin Ma, Jia-Yi Xu, Di-Chao Peng, Ting Zhang, Cheng-Zhe Jin, Hua-Min Qu, Wei C. 基于可视分析的多上下文移动社交网络社区发现[J]. , 2013, 28(5): 797 -809 .
[10] Zhi-Ping Jiang, Wei Xi, Xiangyang Li, Shaojie Tang, Ji-Zhong Zhao, Jin-Song Han,. 通信既众包:带有基于CSI测速功能的Wi-Fi室内定位系统[J]. , 2014, 29(4): 589 -604 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: