计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (5): 1037-1050.doi: 10.1007/s11390-021-1247-6

所属专题: Computer Architecture and Systems

• • 上一篇    下一篇

面向持久性内存的写优化B+Tree索引结构

Rui-Xiang Ma1,2,3,4, Fei Wu1,2,3,4,*, Senior Member, CCF, Member, IEEE, Bu-Rong Dong1,2,3,4 Meng Zhang1,2,3,4, Wei-Jun Li5, Senior Member, CCF, IEEE, and Chang-Sheng Xie1,2,3,4, Member, IEEE   

  1. 1 Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430070, China;
    2 Key Laboratory of Information Storage System, Ministry of Education of China, Wuhan 430070, China;
    3 Engineering Research Center of Data Storage Systems and Technology, Huazhong University of Science and Technology Wuhan 430070, China;
    4 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430070, China;
    5 Shenzhen DAPU Microelectronics Co., Ltd, Shenzhen 518116, China
  • 收稿日期:2020-12-31 修回日期:2021-08-28 出版日期:2021-09-30 发布日期:2021-09-30
  • 作者简介:Rui-Xiang Ma is currently pursuing his Ph.D. degree with the Wuhan National Laboratory for Optoelectronics (WNLO), Huazhong University of Science and Technology (HUST), Wuhan. His current research interests include intelligent storage, machine learning, and non-volatile memory.
  • 基金资助:
    This work was supported in part by the National Natural Science Foundation of China under Grant Nos. U1709220, U2001203, 61821003, 61872413, and 61902137, in part by the National Key Research and Development Program of China under Grant No. 2018YFB1003305, and in part by the Key-Area Research and Development Program of Guangdong Province of China under Grant No. 2019B010107001.

Write-Optimized B+ Tree Index Technology for Persistent Memory

Rui-Xiang Ma1,2,3,4, Fei Wu1,2,3,4,*, Senior Member, CCF, Member, IEEE, Bu-Rong Dong1,2,3,4 Meng Zhang1,2,3,4, Wei-Jun Li5, Senior Member, CCF, IEEE, and Chang-Sheng Xie1,2,3,4, Member, IEEE        

  1. 1 Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430070, China;
    2 Key Laboratory of Information Storage System, Ministry of Education of China, Wuhan 430070, China;
    3 Engineering Research Center of Data Storage Systems and Technology, Huazhong University of Science and Technology Wuhan 430070, China;
    4 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430070, China;
    5 Shenzhen DAPU Microelectronics Co., Ltd, Shenzhen 518116, China
  • Received:2020-12-31 Revised:2021-08-28 Online:2021-09-30 Published:2021-09-30
  • About author:Rui-Xiang Ma is currently pursuing his Ph.D. degree with the Wuhan National Laboratory for Optoelectronics (WNLO), Huazhong University of Science and Technology (HUST), Wuhan. His current research interests include intelligent storage, machine learning, and non-volatile memory.
  • Supported by:
    This work was supported in part by the National Natural Science Foundation of China under Grant Nos. U1709220, U2001203, 61821003, 61872413, and 61902137, in part by the National Key Research and Development Program of China under Grant No. 2018YFB1003305, and in part by the Key-Area Research and Development Program of Guangdong Province of China under Grant No. 2019B010107001.

1、研究背景(context
近年来,以PCM为代表的持久性内存技术(Persistent Memory,PM)正在快速的发展,特别是Intel Optane DC Persistent Memory的商业使用,成为学术界和工业界近年来研究的热点。持久性内存具有可字节寻址的特点,这使得持久性内存可以直接挂载到内存总线上,并通过load/store指令对其直接访问。另外持久性内存具有类似DRAM的读写延迟以及高密度、低功耗等优点,有望取代DRAM成为新一代的主存设备。此外,持久性内存的非易失性,可以直接将数据持久化到上面。因此,持久化内存技术的出现在理论上可以将内存和主存合并,简化存储架构。
2、目的(Objective):
尽管持久性内存有诸多优点,但它也具有读写不对称,写次数有限以及掉电不易失引起的数据不一致问题。因此,持久性内存的出现虽然给计算机架构的发展带来了机遇,但是也给传统的软件技术带来了挑战。特别是以B+Tree为代表的索引技术,主要是针对DRAM和磁盘的特点设计的,直接将其迁移到持久性内存架构上,会导致其快速的磨破。比如,B+Tree中重排序会引起大量移位写操作,对于掉电易失、无擦写次数限制和读写对称的DRAM来说,它不会显著的影响索引性能。但是在持久性内存中,重排序和日志预写操作会引起大量的写操作。由于CPU的乱序写操作以及原子写粒度不匹配的问题,为了维护数据一致性,持久性内存中的写操作需要频繁的调用mfence和clflush指令,造成索引性能大幅下降。因此,针对持久性内存的特性设计一种面向持久性内存的B+Tree索引结构具有非常重要的应用价值。
3、方法(Method
针对持久性内存的特性,本文提出了一种面向持久性内存的写优化B+Tree索引技术WO-Tree。WO-Tree对叶子节点采用“追加写”机制,可以减少插入过程中维护节点内部有序性引起的大量写操作。WO-Tree对内部节点采用延迟刷新机制,以脏缓存行大小为刷新粒度,可以减少内部节点更新过程中的写次数。WO-tree在叶子节点分裂时,当所有的写操作完成后,再进行数据刷新操作,可以减少频繁的数据刷新。WO-Tree采用了部分写日志机制,只对叶子节点上的关键数据写日志,内部节点通过读操作识别数据不一致,并利用叶子节点数据进行索引重建,可以减少日志写开销。WO-Tree在内部节点采用无锁搜索,可以减少线程访问时的锁开销。
4、结果(Result&Findings):
在YCSB生成的负载上进行插入、删除等测试,实验结果表明,以插入操作为例,相比传统的B+Tree、wB Tree和Fast-Fair,当节点大小为1KB时,WO-Tree插入操作引起的cacheline的刷新次数分别减少了84.7%、22.2%和30.8%。在多线程并发的情况,当线程数为2时,相比传统的B+Tree、wB Tree和Fast-Fair插入操作完成的时间分别减少了84.3%、27.3%和44.7%,并且随着线程数的增加,WO-Tree表现出更显著的并发能力。
5、结论(Conclusions):
WO-Tree通过“追加写“的方式来解决重排序的问题,并且内部节点采取了延迟刷新的策略减少数据迁移引起的刷新操作。针对叶子节点分裂操作引起的大量写,同样采取了延迟刷新的机制。为了减少日志写开销,WO-Tree采取了部分写日志的方法,仅对叶子节点的操作进行写日志,内部节点通过读操作识别不一致状态,并通过叶子节点数据重建索引结构。WO-Tree在内部节点中,采用了无锁搜索的机制减少锁开销。实WO-Tree在插入、删除等操作中,都能有效的减少cacheline的刷新次数,并提升了执行效率。在并发能力测试时,相比wB-Tree等索引算法,WO-Tree也获得了显著的性能提升。WO-Tree能够减少对持久性内存的写操作,并充分的发挥新硬件的特性,具有十分显著的索引性能。

关键词: 持久性内存, B+ Tree, 写放大, 一致性, 日志

Abstract: Due to its low latency, byte-addressable, non-volatile, and high density, persistent memory (PM) is expected to be used to design a high-performance storage system. However, PM also has disadvantages such as limited endurance, thereby proposing challenges to traditional index technologies such as B+ tree. B+ tree is originally designed for dynamic random access memory (DRAM)-based or disk-based systems and has a large write amplification problem. The high write amplification is detrimental to a PM-based system. This paper proposes WO-tree, a write-optimized B+ tree for PM. WO-tree adopts an unordered write mechanism for the leaf nodes, and the unordered write mechanism can reduce a large number of write operations caused by maintaining the entry order in the leaf nodes. When the leaf node is split, WO-tree performs the cache line flushing operation after all write operations are completed, which can reduce frequent data flushing operations. WO-tree adopts a partial logging mechanism and it only writes the log for the leaf node. The inner node recognizes the data inconsistency by the read operation and the data can be recovered using the leaf node information, thereby significantly reducing the logging overhead. Furthermore, WO-tree adopts a lock-free search for inner nodes, which reduces the locking overhead for concurrency operation. We evaluate WO-tree using the Yahoo! Cloud Serving Benchmark (YCSB) workloads. Compared with traditional B+ tree, wB-tree, and Fast-Fair, the number of cache line flushes caused by WO-tree insertion operations is reduced by 84.7%, 22.2%, and 30.8%, respectively, and the execution time is reduced by 84.3%, 27.3%, and 44.7%, respectively.

Key words: persist memory, B+ tree, write amplication, consistency, YCSB (Yahoo! Cloud Serving Benchmark)

[1] Mueller W, Aichmayr G, Bergner W et al. Challenges for the DRAM cell scaling to 40nm. In Proc. IEEE International Electron Devices Meeting, December 2005, pp.336-339. DOI:10.1109/IEDM.2005.1609344.
[2] Mandelman A J, Dennard H R, Bronner B G et al. Challenges and future directions for the scaling of dynamic random-access memory (DRAM). IBM Journal of Research and Development, 2002, 46(2.3):187-212. DOI:10.1147/rd.462.0187.
[3] Freitas R, Wilcke W. Storage-class memory:The next storage system technology. IBM Journal of Research and Development, 2008, 52(4.5):439-447. DOI:10.1147/rd.524.0439.
[4] Arulraj J, Pavlo A, Dulloor S. Let's talk about storage & recovery methods for non-volatile memory database systems. In Proc. the 2015 ACM SIGMOD International Conference on Management of Data, May 31-June 4, 2015, pp.707-722. DOI:10.1145/2723372.2749441.
[5] Harter T, Borthakur D, Dong S et al. Analysis of HDFS under HBase:A facebook messages case study. In Proc. the 12th USENIX Conference on File and Storage Technologies, February 2014, pp.199-212.
[6] Lepers B, Balmau O, Gupta K et al. KVell:The design and implementation of a fast persistent key-value store. In Proc. the 27th ACM Symposium on Operating Systems Principles, October 2019, pp.447-461. DOI:10.1145/3341301.3359628.
[7] Wang Y, Tan J, Mao R et al. Temperature-aware persistent data management for LSM-Tree on 3-D NAND flash memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(12):4611-4622. DOI:10.1109/TCAD.2020.2982623.
[8] Lu L, Pillai S T, Arpaci-Dusseau C A et al. WiscKey:Separating keys from values in SSD conscious storage. ACM Transactions on Storage, 2017, 13(1):1-28. DOI:10.1145/3033273.
[9] Li Y, Chan H, Lee P et al. HashKV:Enabling efficient updates in KV storage via hashing. In Proc. the 2018 USENIX Annual Technical Conference, June 2018, pp.1007-1019. DOI:10.5555/3277355.3277451.
[10] Raju P, Kadekodi R, Chidambaram V et al. PebblesDB:Building key-value stores using fragmented log-structured merge trees. In Proc. the 26th Symposium on Operating Systems Principle, October 2017, pp.497-514. DOI:10.1145/3132747.3132765.
[11] Chen S, Jin Q. Persistent B+-trees in non-volatile main memory. Proc. the VLDB Endowment, 2015, 8(7):786-797. DOI:10.14778/2752939.2752947.
[12] Lee B, Ipek E, Mutlu O et al. Phase change memory architecture and the quest for scalability. Communications of the ACM, 2010, 53(7):99-106. DOI:10.1145/1785414.1785441.
[13] Zhou P, Zhao B, Yan J et al. A durable and energy efficient main memory using phase change memory technology. In Proc. the 36th Annual International Symposium on Computer Architecture, June 2009, pp.14-23. DOI:10.1145/1555754.1555759.
[14] Yu S. Resistive Random Access Memory (RRAM). Morgan & Claypool, 2016. 10.2200/S00681ED1V01Y201510EET006.
[15] Apalkov D, Khvalkovskiy A, Watts S et al. Spintransfer torque magnetic random access memory (STTMRAM). ACM Journal on Emerging Technologies in Computing Systems, 2013, 9(2):Article No. 13. DOI:10.1145/2463585.2463589.
[16] Venkataraman S, Tolia N, Ranganathan P et al. Consistent and durable data structures for non-volatile byteaddressable memory. In Proc. the 9th USENIX Conference on File and Storage Technologies, February 2011, pp.61-75.
[17] Yang J, Wei Q, Cheng C et al. NV-tree:Reducing consistency cost for NVM-based single level systems. In Proc. the 13th USENIX Conference on File and Storage Technologies, February 2015, pp.167-181. DOI:10.5555/2750482.2750495.
[18] Oukid I, Lasperas J, Nica A et al. FPTree:A hybrid SCMDRAM persistent and concurrent B-tree for storage class memory. In Proc. the 2016 International Conference on Management of Data, June 26-July 1, 2016, pp.371-386. DOI:10.1145/2882903.2915251.
[19] Comer D. Ubiquitous B-tree. ACM Comput. Surv., 1979, 11(2):121-137. DOI:10.1145/356770.356776.
[20] Bayer R. Binary B-trees for virtual memory. In Proc. the 1971 ACM SIGFIDET Workshop on Data Description, Access and Control, November 1971, pp.219-235. DOI:10.1145/1734714.1734731.
[21] Ni J, Hu W, Li G et al. Bp-tree:A predictive B+-tree for reducing writes on phase change memory. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(10):2368-2381. DOI:10.1109/TKDE.2014.5.
[22] Hwang D, Kim W H, Won Y et al. Endurable transient inconsistency in byte-addressable persistent B+-tree. In Proc. the 16th USENIX Conference on File and Storage Technologies, February 2018, pp.187-200. DOI:10.5555/3189759.3189777.
[23] Silberschatz A, Korth H, Sudarshan S. Database Systems Concepts (5th edition). McGraw-Hill, 2005.
[24] Dulloo R S, Kumar S, Keshavamurthy A et al. System software for persistent memory. In Proc. the 9th European Conference on Computer Systems, April 2014, Article No. 15. DOI:10.1145/2592798.2592814.
[25] Volos1 H, Magalhaes G, Cherkasova L et al. Quartz:A lightweight performance emulator for persistent memory software. In Proc. the 16th Annual Middleware Conference, November 2015, pp.37-49. DOI:10.1145/2814576.2814806.
[26] Cooper B F, Silberstein A, Tam E et al. Benchmarking cloud serving systems with YCSB. In Proc. the 1st ACM Symposium on Cloud Computing, June 2010, pp.143-154. DOI:10.1145/1807128.1807152.
[1] 欧阳鸿荣, 魏恒峰, 李海翔, 潘安群, 黄宇. MongoDB因果一致性协议测试[J]. 计算机科学技术学报, 2022, 37(1): 128-146.
[2] Lie-Huang Zhu, Bao-Kun Zheng, Meng Shen, Feng Gao, Hong-Yu Li, Ke-Xin Shi. 比特币系统的数据安全和隐私综述[J]. 计算机科学技术学报, 2020, 35(4): 843-862.
[3] Xu-Ran Zhao, Xun Wang, Qi-Chao Chen. 基于深度卷积神经网络和条件随机场的时空一致性深度恢复[J]. , 2017, 32(3): 443-456.
[4] Feng-Feng Pan, Yin-Liang Yue, Jin Xiong. dCompaction:基于延迟的日志结构合并树的合并方法[J]. , 2017, 32(1): 41-54.
[5] Xusheng Xiao, Jian-Guang Lou, Shan Lu, David C. Shepherd, Xin Peng, Qian-Xiang W. 大规模软件系统的研究机遇与挑战[J]. , 2016, 31(5): 851-860.
[6] De-Qing Zou, Hao Qin, Hai Jin. UiLog:基于日志的故障诊断系统[J]. , 2016, 31(5): 1038-1052.
[7] Yue-Feng Du, De-Rong Shen, Tie-Zheng Nie, Yue Kou, Ge Yu. 一种分布式环境下基于内容相关数据的一致性修复方法[J]. , 2016, 31(4): 741-758.
[8] Yuan-Chao Xu, Hu Wan, Ke-Ni Qiu, Tao Li, Wei-Gong Zhang. 降低单级存储移动终端系统的同步开销[J]. , 2016, 31(4): 836-848.
[9] Lengdong Wu, Liyan Yuan, Jiahuai You. 大规模数据管理系统综述[J]. , 2015, 30(1): 163-183.
[10] 邵冰清, 张军伟, 郑彩平, 张浩, 刘振军, 许鲁. 一种异步的机群文件系统元数据一致性保证协议[J]. , 2014, 29(2): 303-315.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 周笛;. A Recovery Technique for Distributed Communicating Process Systems[J]. , 1986, 1(2): 34 -43 .
[2] 陈世华;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[3] 王建潮; 魏道政;. An Effective Test Generation Algorithm for Combinational Circuits[J]. , 1986, 1(4): 1 -16 .
[4] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] 郑国梁; 李辉;. The Design and Implementation of the Syntax-Directed Editor Generator(SEG)[J]. , 1986, 1(4): 39 -48 .
[7] 黄学东; 蔡莲红; 方棣棠; 迟边进; 周立; 蒋力;. A Computer System for Chinese Character Speech Input[J]. , 1986, 1(4): 75 -83 .
[8] 许小曙;. Simplification of Multivalued Sequential SULM Network by Using Cascade Decomposition[J]. , 1986, 1(4): 84 -95 .
[9] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[10] 衷仁保; 邢林; 任朝阳;. An Interactive System SDI on Microcomputer[J]. , 1987, 2(1): 64 -71 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: