|
计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (5): 1037-1050.doi: 10.1007/s11390-021-1247-6
所属专题: Computer Architecture and Systems
Rui-Xiang Ma1,2,3,4, Fei Wu1,2,3,4,*, Senior Member, CCF, Member, IEEE, Bu-Rong Dong1,2,3,4 Meng Zhang1,2,3,4, Wei-Jun Li5, Senior Member, CCF, IEEE, and Chang-Sheng Xie1,2,3,4, Member, IEEE
Rui-Xiang Ma1,2,3,4, Fei Wu1,2,3,4,*, Senior Member, CCF, Member, IEEE, Bu-Rong Dong1,2,3,4 Meng Zhang1,2,3,4, Wei-Jun Li5, Senior Member, CCF, IEEE, and Chang-Sheng Xie1,2,3,4, Member, IEEE
1、研究背景(context)
近年来,以PCM为代表的持久性内存技术(Persistent Memory,PM)正在快速的发展,特别是Intel Optane DC Persistent Memory的商业使用,成为学术界和工业界近年来研究的热点。持久性内存具有可字节寻址的特点,这使得持久性内存可以直接挂载到内存总线上,并通过load/store指令对其直接访问。另外持久性内存具有类似DRAM的读写延迟以及高密度、低功耗等优点,有望取代DRAM成为新一代的主存设备。此外,持久性内存的非易失性,可以直接将数据持久化到上面。因此,持久化内存技术的出现在理论上可以将内存和主存合并,简化存储架构。
2、目的(Objective):
尽管持久性内存有诸多优点,但它也具有读写不对称,写次数有限以及掉电不易失引起的数据不一致问题。因此,持久性内存的出现虽然给计算机架构的发展带来了机遇,但是也给传统的软件技术带来了挑战。特别是以B+Tree为代表的索引技术,主要是针对DRAM和磁盘的特点设计的,直接将其迁移到持久性内存架构上,会导致其快速的磨破。比如,B+Tree中重排序会引起大量移位写操作,对于掉电易失、无擦写次数限制和读写对称的DRAM来说,它不会显著的影响索引性能。但是在持久性内存中,重排序和日志预写操作会引起大量的写操作。由于CPU的乱序写操作以及原子写粒度不匹配的问题,为了维护数据一致性,持久性内存中的写操作需要频繁的调用mfence和clflush指令,造成索引性能大幅下降。因此,针对持久性内存的特性设计一种面向持久性内存的B+Tree索引结构具有非常重要的应用价值。
3、方法(Method)
针对持久性内存的特性,本文提出了一种面向持久性内存的写优化B+Tree索引技术WO-Tree。WO-Tree对叶子节点采用“追加写”机制,可以减少插入过程中维护节点内部有序性引起的大量写操作。WO-Tree对内部节点采用延迟刷新机制,以脏缓存行大小为刷新粒度,可以减少内部节点更新过程中的写次数。WO-tree在叶子节点分裂时,当所有的写操作完成后,再进行数据刷新操作,可以减少频繁的数据刷新。WO-Tree采用了部分写日志机制,只对叶子节点上的关键数据写日志,内部节点通过读操作识别数据不一致,并利用叶子节点数据进行索引重建,可以减少日志写开销。WO-Tree在内部节点采用无锁搜索,可以减少线程访问时的锁开销。
4、结果(Result&Findings):
在YCSB生成的负载上进行插入、删除等测试,实验结果表明,以插入操作为例,相比传统的B+Tree、wB Tree和Fast-Fair,当节点大小为1KB时,WO-Tree插入操作引起的cacheline的刷新次数分别减少了84.7%、22.2%和30.8%。在多线程并发的情况,当线程数为2时,相比传统的B+Tree、wB Tree和Fast-Fair插入操作完成的时间分别减少了84.3%、27.3%和44.7%,并且随着线程数的增加,WO-Tree表现出更显著的并发能力。
5、结论(Conclusions):
WO-Tree通过“追加写“的方式来解决重排序的问题,并且内部节点采取了延迟刷新的策略减少数据迁移引起的刷新操作。针对叶子节点分裂操作引起的大量写,同样采取了延迟刷新的机制。为了减少日志写开销,WO-Tree采取了部分写日志的方法,仅对叶子节点的操作进行写日志,内部节点通过读操作识别不一致状态,并通过叶子节点数据重建索引结构。WO-Tree在内部节点中,采用了无锁搜索的机制减少锁开销。实WO-Tree在插入、删除等操作中,都能有效的减少cacheline的刷新次数,并提升了执行效率。在并发能力测试时,相比wB-Tree等索引算法,WO-Tree也获得了显著的性能提升。WO-Tree能够减少对持久性内存的写操作,并充分的发挥新硬件的特性,具有十分显著的索引性能。
[1] Mueller W, Aichmayr G, Bergner W et al. Challenges for the DRAM cell scaling to 40nm. In Proc. IEEE International Electron Devices Meeting, December 2005, pp.336-339. DOI:10.1109/IEDM.2005.1609344. [2] Mandelman A J, Dennard H R, Bronner B G et al. Challenges and future directions for the scaling of dynamic random-access memory (DRAM). IBM Journal of Research and Development, 2002, 46(2.3):187-212. DOI:10.1147/rd.462.0187. [3] Freitas R, Wilcke W. Storage-class memory:The next storage system technology. IBM Journal of Research and Development, 2008, 52(4.5):439-447. DOI:10.1147/rd.524.0439. [4] Arulraj J, Pavlo A, Dulloor S. Let's talk about storage & recovery methods for non-volatile memory database systems. In Proc. the 2015 ACM SIGMOD International Conference on Management of Data, May 31-June 4, 2015, pp.707-722. DOI:10.1145/2723372.2749441. [5] Harter T, Borthakur D, Dong S et al. Analysis of HDFS under HBase:A facebook messages case study. In Proc. the 12th USENIX Conference on File and Storage Technologies, February 2014, pp.199-212. [6] Lepers B, Balmau O, Gupta K et al. KVell:The design and implementation of a fast persistent key-value store. In Proc. the 27th ACM Symposium on Operating Systems Principles, October 2019, pp.447-461. DOI:10.1145/3341301.3359628. [7] Wang Y, Tan J, Mao R et al. Temperature-aware persistent data management for LSM-Tree on 3-D NAND flash memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(12):4611-4622. DOI:10.1109/TCAD.2020.2982623. [8] Lu L, Pillai S T, Arpaci-Dusseau C A et al. WiscKey:Separating keys from values in SSD conscious storage. ACM Transactions on Storage, 2017, 13(1):1-28. DOI:10.1145/3033273. [9] Li Y, Chan H, Lee P et al. HashKV:Enabling efficient updates in KV storage via hashing. In Proc. the 2018 USENIX Annual Technical Conference, June 2018, pp.1007-1019. DOI:10.5555/3277355.3277451. [10] Raju P, Kadekodi R, Chidambaram V et al. PebblesDB:Building key-value stores using fragmented log-structured merge trees. In Proc. the 26th Symposium on Operating Systems Principle, October 2017, pp.497-514. DOI:10.1145/3132747.3132765. [11] Chen S, Jin Q. Persistent B+-trees in non-volatile main memory. Proc. the VLDB Endowment, 2015, 8(7):786-797. DOI:10.14778/2752939.2752947. [12] Lee B, Ipek E, Mutlu O et al. Phase change memory architecture and the quest for scalability. Communications of the ACM, 2010, 53(7):99-106. DOI:10.1145/1785414.1785441. [13] Zhou P, Zhao B, Yan J et al. A durable and energy efficient main memory using phase change memory technology. In Proc. the 36th Annual International Symposium on Computer Architecture, June 2009, pp.14-23. DOI:10.1145/1555754.1555759. [14] Yu S. Resistive Random Access Memory (RRAM). Morgan & Claypool, 2016. 10.2200/S00681ED1V01Y201510EET006. [15] Apalkov D, Khvalkovskiy A, Watts S et al. Spintransfer torque magnetic random access memory (STTMRAM). ACM Journal on Emerging Technologies in Computing Systems, 2013, 9(2):Article No. 13. DOI:10.1145/2463585.2463589. [16] Venkataraman S, Tolia N, Ranganathan P et al. Consistent and durable data structures for non-volatile byteaddressable memory. In Proc. the 9th USENIX Conference on File and Storage Technologies, February 2011, pp.61-75. [17] Yang J, Wei Q, Cheng C et al. NV-tree:Reducing consistency cost for NVM-based single level systems. In Proc. the 13th USENIX Conference on File and Storage Technologies, February 2015, pp.167-181. DOI:10.5555/2750482.2750495. [18] Oukid I, Lasperas J, Nica A et al. FPTree:A hybrid SCMDRAM persistent and concurrent B-tree for storage class memory. In Proc. the 2016 International Conference on Management of Data, June 26-July 1, 2016, pp.371-386. DOI:10.1145/2882903.2915251. [19] Comer D. Ubiquitous B-tree. ACM Comput. Surv., 1979, 11(2):121-137. DOI:10.1145/356770.356776. [20] Bayer R. Binary B-trees for virtual memory. In Proc. the 1971 ACM SIGFIDET Workshop on Data Description, Access and Control, November 1971, pp.219-235. DOI:10.1145/1734714.1734731. [21] Ni J, Hu W, Li G et al. Bp-tree:A predictive B+-tree for reducing writes on phase change memory. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(10):2368-2381. DOI:10.1109/TKDE.2014.5. [22] Hwang D, Kim W H, Won Y et al. Endurable transient inconsistency in byte-addressable persistent B+-tree. In Proc. the 16th USENIX Conference on File and Storage Technologies, February 2018, pp.187-200. DOI:10.5555/3189759.3189777. [23] Silberschatz A, Korth H, Sudarshan S. Database Systems Concepts (5th edition). McGraw-Hill, 2005. [24] Dulloo R S, Kumar S, Keshavamurthy A et al. System software for persistent memory. In Proc. the 9th European Conference on Computer Systems, April 2014, Article No. 15. DOI:10.1145/2592798.2592814. [25] Volos1 H, Magalhaes G, Cherkasova L et al. Quartz:A lightweight performance emulator for persistent memory software. In Proc. the 16th Annual Middleware Conference, November 2015, pp.37-49. DOI:10.1145/2814576.2814806. [26] Cooper B F, Silberstein A, Tam E et al. Benchmarking cloud serving systems with YCSB. In Proc. the 1st ACM Symposium on Cloud Computing, June 2010, pp.143-154. DOI:10.1145/1807128.1807152. |
[1] | 欧阳鸿荣, 魏恒峰, 李海翔, 潘安群, 黄宇. MongoDB因果一致性协议测试[J]. 计算机科学技术学报, 2022, 37(1): 128-146. |
[2] | Lie-Huang Zhu, Bao-Kun Zheng, Meng Shen, Feng Gao, Hong-Yu Li, Ke-Xin Shi. 比特币系统的数据安全和隐私综述[J]. 计算机科学技术学报, 2020, 35(4): 843-862. |
[3] | Xu-Ran Zhao, Xun Wang, Qi-Chao Chen. 基于深度卷积神经网络和条件随机场的时空一致性深度恢复[J]. , 2017, 32(3): 443-456. |
[4] | Feng-Feng Pan, Yin-Liang Yue, Jin Xiong. dCompaction:基于延迟的日志结构合并树的合并方法[J]. , 2017, 32(1): 41-54. |
[5] | Xusheng Xiao, Jian-Guang Lou, Shan Lu, David C. Shepherd, Xin Peng, Qian-Xiang W. 大规模软件系统的研究机遇与挑战[J]. , 2016, 31(5): 851-860. |
[6] | De-Qing Zou, Hao Qin, Hai Jin. UiLog:基于日志的故障诊断系统[J]. , 2016, 31(5): 1038-1052. |
[7] | Yue-Feng Du, De-Rong Shen, Tie-Zheng Nie, Yue Kou, Ge Yu. 一种分布式环境下基于内容相关数据的一致性修复方法[J]. , 2016, 31(4): 741-758. |
[8] | Yuan-Chao Xu, Hu Wan, Ke-Ni Qiu, Tao Li, Wei-Gong Zhang. 降低单级存储移动终端系统的同步开销[J]. , 2016, 31(4): 836-848. |
[9] | Lengdong Wu, Liyan Yuan, Jiahuai You. 大规模数据管理系统综述[J]. , 2015, 30(1): 163-183. |
[10] | 邵冰清, 张军伟, 郑彩平, 张浩, 刘振军, 许鲁. 一种异步的机群文件系统元数据一致性保证协议[J]. , 2014, 29(2): 303-315. |
版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn 总访问量: |