Improving Space Efficiency by Enhancing Physical Locality in Deduplication Systems

李鹏飞; 华宇; 曹钦

doi:10.1007/s11390-023-2646-7

Summary:

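Background: Data deduplication saves a large amount of space by not storing duplicate chunks, and is therefore widely used in backup storage systems. However, as more and more data versions are stored, the restore speed of new versions drops significantly because of fragmentation. The main cause is that a deduplication system stores only the new data of each new version, while data that already exists is linked to its existing location, so the data of a new version ends up scattered across different locations. During a restore, the system must read data from many different places, which lowers restore performance. In addition, because the data of different versions are interleaved, deleting expired versions incurs high overhead for expired-data detection and garbage collection, which cannot meet high-performance requirements.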
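The fragmentation effect described above can be illustrated with a minimal Python sketch. It assumes a toy store in which unique chunks are appended to fixed-size containers in arrival order; the names `BaselineStore`, `CONTAINER_CAPACITY`, and `containers_read` are hypothetical and are not taken from the paper.

```python
import hashlib

CONTAINER_CAPACITY = 2  # chunks per container; a toy value chosen for illustration


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-1 is used here only as an example)."""
    return hashlib.sha1(chunk).hexdigest()


class BaselineStore:
    """Toy deduplicating store (an assumption, not the paper's system)."""

    def __init__(self):
        self.index = {}         # fingerprint -> id of the container holding the chunk
        self.containers = [[]]  # each container is a list of fingerprints
        self.recipes = {}       # version name -> ordered list of fingerprints

    def backup(self, version, chunks):
        recipe = []
        for chunk in chunks:
            fp = fingerprint(chunk)
            if fp not in self.index:
                # Only new chunks are written; open a new container when full.
                if len(self.containers[-1]) >= CONTAINER_CAPACITY:
                    self.containers.append([])
                self.containers[-1].append(fp)
                self.index[fp] = len(self.containers) - 1
            # Duplicates are only linked to their existing location.
            recipe.append(fp)
        self.recipes[version] = recipe

    def containers_read(self, version):
        """Number of distinct containers a restore of `version` must read."""
        return len({self.index[fp] for fp in self.recipes[version]})


store = BaselineStore()
store.backup("v1", [b"a", b"b", b"c", b"d"])
store.backup("v2", [b"a", b"b", b"c", b"e"])
store.backup("v3", [b"a", b"c", b"f", b"g"])
reads = {v: store.containers_read(v) for v in ("v1", "v2", "v3")}
print(reads)
```

In this toy run every version holds four chunks, which would fit in two containers, yet a restore touches 2, 3, and 4 containers for `v1`, `v2`, and `v3` respectively: the newest version is the most fragmented.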

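Objective: By tracing the storage patterns of chunks across versions, we find that some of the data that does not appear in the current version will not appear in subsequent versions either. Based on this finding, we propose storing the chunks of different versions in designated locations to enhance the physical locality of the chunks of new versions, thereby providing high performance for both data restore and expired-data deletion.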

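Methods: This paper proposes a high-performance deduplication system called HiDeStore. The main idea is to identify cold and hot chunks through a fingerprint cache during the deduplication phase, store the hot chunks in active containers, and store the chunks that have become cold in archival containers. While moving chunks, HiDeStore updates the locations of the corresponding chunks in the data recipe, so that the system can find the required data efficiently and accurately. When expired data is deleted, the chunks of old versions have already been stored together in archival containers, so HiDeStore can delete the corresponding containers directly, avoiding the overhead of expired-chunk detection and garbage collection.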
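The hot/cold separation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not HiDeStore's implementation: the fingerprint cache is modeled as two sets (`prev` for the previous version, `cur` for the version being backed up), each archival container is tagged with the last version that used its chunks, and versions are deleted oldest first. The class name `HotColdStore` and all method names are hypothetical.

```python
import hashlib

ACTIVE = None  # recipe location marker meaning "in the active containers"


def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


class HotColdStore:
    def __init__(self):
        self.active = {}    # fingerprint -> chunk (hot chunks)
        self.archival = {}  # version tag -> {fingerprint: chunk} (cold chunks)
        self.recipes = {}   # version -> list of [fingerprint, location]
        self.prev = set()   # fingerprints seen in the previous version
        self.last_version = None

    def backup(self, version, chunks):
        cur, recipe = set(), []
        for chunk in chunks:
            fp = fingerprint(chunk)
            if fp in self.prev:
                self.prev.discard(fp)    # still hot: carry over to `cur`
                cur.add(fp)
            elif fp not in cur:
                self.active[fp] = chunk  # new unique chunk, stored as hot
                cur.add(fp)
            recipe.append([fp, ACTIVE])
        # Whatever is left in `prev` was not referenced by this version: cold.
        self._archive(list(self.prev), self.last_version)
        self.recipes[version] = recipe
        self.prev, self.last_version = cur, version

    def _archive(self, cold, tag):
        if not cold:
            return
        container = {fp: self.active.pop(fp) for fp in cold}
        self.archival[tag] = container
        # Update older recipes so they point at the archival container.
        for recipe in self.recipes.values():
            for entry in recipe:
                if entry[1] is ACTIVE and entry[0] in container:
                    entry[1] = tag

    def restore(self, version):
        return [self.active[fp] if loc is ACTIVE else self.archival[loc][fp]
                for fp, loc in self.recipes[version]]

    def delete_oldest(self):
        oldest = next(iter(self.recipes))  # dicts keep insertion order
        if oldest == self.last_version:
            raise ValueError("the newest version still owns the active chunks")
        del self.recipes[oldest]
        # Chunks last used by `oldest` live only in its archival container,
        # so the container is dropped without scanning for expired chunks.
        self.archival.pop(oldest, None)
        return oldest


store = HotColdStore()
store.backup("v1", [b"a", b"b", b"c"])
store.backup("v2", [b"a", b"b", b"d"])  # c becomes cold
store.backup("v3", [b"a", b"d", b"e"])  # b becomes cold
```

In this sketch a cold chunk that reappears in a later version is simply stored again as a new hot chunk; that behavior is a simplification chosen for the example and relies on the observation that such reappearances are rare.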

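Results: Experiments show that adjacent versions are the most similar, and that chunks that do not appear in the current version do not appear in subsequent versions either. Compared with existing schemes, the design of HiDeStore enhances the physical locality of new-version data, improving restore speed by 2.6x and space utilization by 1.5x.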

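Conclusions: Current deduplication systems store the data of a single backup version in different locations, which lowers the restore performance of the original data. To address this fragmentation problem, this paper traces the storage patterns of data across versions and finds that chunks that do not appear in the current version do not appear in subsequent versions either. Based on this finding, this paper designs a double-hash-based fingerprint cache to identify cold and hot chunks and stores them in separate containers, thereby enhancing the physical locality of new-version data and improving both the restore performance and the expired-data deletion performance of the deduplication system.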

Abstract: An abundance of data has been generated by various embedded devices, applications, and systems, and requires cost-efficient storage services. Data deduplication removes duplicate chunks and has become an important technique for improving the space efficiency of storage systems. However, the stored unique chunks are heavily fragmented, which decreases restore performance and incurs high overheads for garbage collection. Existing schemes fail to achieve an efficient trade-off among deduplication, restore, and garbage collection performance because they do not explore and exploit the physical locality of different chunks. In this paper, we trace the storage patterns of the fragmented chunks in backup systems and propose a high-performance deduplication system called HiDeStore. The main insight is to enhance the physical locality of new backup versions during the deduplication phase by identifying hot chunks and storing them in active containers. Chunks that do not appear in new backups become cold and are gathered together in archival containers. Moreover, we remove expired data with an isolated container deletion scheme, avoiding the high overheads of expired data detection. Compared with state-of-the-art schemes, HiDeStore improves deduplication and restore performance by up to 1.4x and 1.6x, respectively, without decreasing the deduplication ratio or incurring high garbage collection overheads.

An Enhanced Physical-Locality Deduplication System for Space Efficiency