Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization
Abstract: We describe a data deduplication system for backup storage of PC disk images, named in-RAM metadata utilizing deduplication (IR-MUD). First, IR-MUD applies in-RAM hash granularity adaptation and miniLZO-based metadata compression to shrink the in-RAM metadata and thereby reduce the RAM space required by the metadata caches. Second, in contrast to the traditional metadata read cache, IR-MUD introduces an in-RAM metadata write cache that further reduces metadata-related disk I/O operations and improves deduplication throughput. During deduplication, the metadata write cache is managed under the LRU replacement policy. Whenever a manifest (a sequence of metadata hash index entries) is hit in the write cache by a newly arriving duplicate hash, the expensive disk I/O operation that would otherwise reload that manifest from disk into RAM is avoided. After deduplication finishes, all manifests remaining in the write cache are stored on disk and the cache is cleared.
Our experimental results on a 1.5 TB real-world disk image dataset show that: (1) IR-MUD reduced the size of the deduplication metadata by about 95% at the cost of a small time overhead; (2) when the metadata write cache was not used and the metadata read cache was given the same amount of RAM, IR-MUD achieved a 400% higher RAM hit ratio and a correspondingly 50% higher deduplication throughput than the classic Sparse Indexing deduplication system, which applies no metadata utilization techniques; and (3) when both the metadata write cache and the metadata read cache were used and enough RAM was available, IR-MUD achieved a 500% higher RAM hit ratio than Sparse Indexing and a further 70% higher deduplication throughput than IR-MUD with only the metadata read cache. The in-RAM metadata harnessing and metadata write caching approaches of IR-MUD can be applied in most parallel deduplication systems to improve metadata caching efficiency.
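The behavior of the metadata write cache described in the abstract can be illustrated with a minimal Python sketch. It is not IR-MUD's implementation: the class name `ManifestWriteCache`, its methods, and the use of a plain dictionary as a stand-in for on-disk manifest storage are all assumptions made for illustration. The sketch only shows the three stated properties: LRU management, a disk read avoided on each hit, and a final flush to disk.

```python
from collections import OrderedDict


class ManifestWriteCache:
    """Illustrative LRU write cache for manifests (hypothetical, not IR-MUD code)."""

    def __init__(self, capacity, disk):
        self.capacity = capacity    # maximum number of manifests kept in RAM
        self.disk = disk            # dict standing in for on-disk manifest storage
        self.cache = OrderedDict()  # manifest_id -> list of chunk hashes, LRU order
        self.hits = 0               # lookups served from RAM
        self.disk_reads = 0         # lookups that had to go to disk
        self.disk_writes = 0        # manifests written out to disk

    def put(self, manifest_id, hashes):
        """Buffer a newly created manifest in RAM instead of writing it at once."""
        self.cache[manifest_id] = hashes
        self.cache.move_to_end(manifest_id)  # mark as most recently used
        while len(self.cache) > self.capacity:
            # Evict the least recently used manifest and persist it.
            old_id, old_hashes = self.cache.popitem(last=False)
            self.disk[old_id] = old_hashes
            self.disk_writes += 1

    def get(self, manifest_id):
        """Look up a manifest; a hit avoids reloading it from disk."""
        if manifest_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(manifest_id)
            return self.cache[manifest_id]
        # Miss: a real system would load the manifest into a read cache;
        # here it is simply returned from the stand-in disk.
        self.disk_reads += 1
        return self.disk[manifest_id]

    def flush(self):
        """After deduplication, store all cached manifests on disk and clear."""
        for manifest_id, hashes in self.cache.items():
            self.disk[manifest_id] = hashes
            self.disk_writes += 1
        self.cache.clear()


disk = {}
cache = ManifestWriteCache(capacity=2, disk=disk)
cache.put("m1", ["h1", "h2"])
cache.put("m2", ["h3"])
cache.get("m1")          # hit: no disk read needed
cache.put("m3", ["h4"])  # evicts "m2", the least recently used manifest
cache.get("m2")          # miss: served from the stand-in disk
cache.flush()            # remaining manifests are written out
```

In this sketch the capacity is counted in manifests for simplicity; a real system would more likely bound the cache by bytes of RAM, which is where the metadata size reduction would translate into more cached manifests and a higher hit ratio.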