Journal of Computer Science and Technology, 2017, Vol. 32, Issue (1): 41-54. doi: 10.1007/s11390-017-1704-4

Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •

dCompaction: Speeding up Compaction of the LSM-Tree via Delayed Compaction

Feng-Feng Pan¹,² (潘锋烽), Student Member, CCF, ACM, IEEE, Yin-Liang Yue³ (岳银亮), Member, CCF, ACM, IEEE, and Jin Xiong¹,² (熊劲), Senior Member, CCF, Member, ACM, IEEE

  • 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
  • Received: 2016-08-01 Revised: 2016-11-09 Online: 2017-01-05 Published: 2017-01-05
  • About the author: Feng-Feng Pan received his B.S. degree in computer science and technology from Central South University, Changsha, in 2010. He is now a Ph.D. candidate in the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. His research interests include big data storage and management, storage systems, and big data analysis.
  • Supported by:

    This work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000202 and the National Natural Science Foundation of China under Grant Nos. 61303056 and 61379042.

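Key-value stores play a major role in today's Internet applications. Write-optimized data structures, such as the log-structured merge-tree (LSM-tree) and its variants, are widely used in key-value stores such as Bigtable and RocksDB. The conventional LSM-tree defers and batches index changes across two or more levels of sorted data sets, and efficiently merges multiple on-disk sorted data sets in a manner similar to merge sort. During multi-level merging, however, a large number of key-value pairs are read and written repeatedly, which causes write amplification and degrades performance. To address write amplification during merging, this paper proposes a new compaction method, dCompaction, whose core idea is to delay the scheduling of some merge operations in order to reduce repeated reads and writes of key-value pairs during merging, thereby improving system throughput. We implement the dCompaction policy on top of RocksDB as a prototype and run extensive tests with YCSB. The results show that, compared with RocksDB, dCompaction improves write performance by about 40% while keeping read performance unchanged.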

Abstract: Key-value (KV) stores have become a backbone of large-scale applications in today's data centers. Write-optimized data structures such as the Log-Structured Merge-tree (LSM-tree) and its variants are widely used in KV storage systems such as Bigtable and RocksDB. The conventional LSM-tree organizes KV items into multiple, successively larger components, and uses compaction to push KV items from a smaller component to the adjacent larger component until the KV items reach the largest component. Unfortunately, the current compaction scheme incurs significant write amplification due to repeated KV item reads and writes, which results in poor throughput. We propose a new compaction scheme, delayed compaction (dCompaction), that decreases write amplification. dCompaction postpones some compactions and gathers them into the following compaction. In this way, it reduces KV item reads and writes during compaction, and consequently improves the throughput of LSM-tree based KV stores. We implement dCompaction on RocksDB and conduct extensive experiments. Validation using the YCSB framework shows that, compared with RocksDB, dCompaction improves write performance by about 40% while delivering comparable read performance.
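To make the intuition concrete, the following is a minimal toy sketch in Python, not the authors' RocksDB implementation. It assumes a single large sorted component that is rewritten in full on every merge, and it counts only the items written by merges. The function name `write_amplification` and the `delay` parameter are hypothetical, introduced here for illustration only. With `delay=1` every incoming batch is merged immediately; with a larger `delay`, merges are postponed and several batches are gathered into one merge, so each item in the large component is rewritten fewer times.

```python
from typing import List


def write_amplification(batches: List[List[int]], delay: int) -> float:
    """Toy model: items written by merges divided by items ingested.

    Assumption (not from the paper): each merge rewrites the whole
    large component once. `delay` is the number of batches gathered
    before a merge is actually carried out.
    """
    if delay < 1:
        raise ValueError("delay must be at least 1")

    large: List[int] = []          # keys in the larger component
    pending: List[List[int]] = []  # postponed batches awaiting a merge
    ingested = 0
    written = 0

    def merge_pending() -> None:
        nonlocal large, pending, written
        # Merge all postponed batches into the large component at once.
        large = sorted(set(large).union(*pending))
        written += len(large)      # the whole component is rewritten
        pending = []

    for batch in batches:
        ingested += len(batch)
        pending.append(batch)
        if len(pending) == delay:
            merge_pending()

    if pending:                    # merge any leftover batches
        merge_pending()

    return written / ingested if ingested else 0.0


if __name__ == "__main__":
    # Eight batches of ten distinct, interleaved keys each (80 keys total).
    demo = [list(range(i, 80, 8)) for i in range(8)]
    print(write_amplification(demo, delay=1))  # eager merging: 4.5
    print(write_amplification(demo, delay=4))  # delayed merging: 1.5
```

In this toy setting, gathering four batches per merge cuts the number of rewritten items from 360 to 120 for the same 80 ingested keys. The model ignores reads, partial (range-limited) merges, and multiple levels, so the numbers illustrate the direction of the effect rather than the roughly 40% gain the paper reports.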
