
A Data Deduplication Framework of Disk Images with Adaptive Block Skipping

  • Abstract: This paper proposes an adaptive data deduplication framework for disk-image backup storage, together with a data-locality-based uniform block skipping algorithm designed for it. The framework aims to reduce the time and space overheads of deduplication on real-world datasets such as disk-image collections, thereby improving deduplication throughput while maintaining deduplication efficiency. It comprises three submodules: adaptive block skipping, bidirectional duplicate extension, and hysteresis hash indexing. The adaptive block skipping module uses a heuristic prediction algorithm to skip deduplication processing for data chunks judged to be non-duplicates, saving deduplication time and space overheads. The bidirectional duplicate extension module uses data-locality information to detect subsequently arriving data that duplicates the skipped chunks. The hysteresis hash indexing module regenerates complete metadata for skipped chunks that are re-encountered during bidirectional duplicate extension, for use by later arriving data. The bidirectional duplicate extension and hysteresis hash indexing modules are introduced to correct mispredictions made by the heuristic prediction algorithm. We implemented the adaptive deduplication framework and the uniform block skipping algorithm in the Data Domain and Sparse Indexing prototype systems and evaluated them on a 1 TB real-world disk-image dataset. The experimental results show that the framework effectively reduces deduplication-related time and space overheads: it improves deduplication throughput by 30%-80% for Data Domain (which stores all metadata on disk), and saves 25%-40% of RAM while improving deduplication throughput by 15%-20% for Sparse Indexing (which keeps part of the metadata in an in-RAM hash index). In both cases, the loss in deduplication ratio is kept below 5%.

     

    Abstract: We describe an efficient and easily applicable data deduplication framework with heuristic-prediction-based adaptive block skipping for real-world datasets such as disk images, which saves deduplication-related overheads and improves deduplication throughput while maintaining good deduplication efficiency. Under the framework, deduplication operations are skipped for data chunks determined as likely non-duplicates via heuristic prediction, in conjunction with a hit-and-matching extension process for duplicate identification within skipped blocks and a hysteresis-mechanism-based hash indexing process that updates the hash indices for re-encountered skipped chunks. For performance evaluation, the proposed framework was integrated and implemented in the existing Data Domain and Sparse Indexing deduplication algorithms. The experimental results on a real-world dataset of 1.0 TB of disk images showed that deduplication-related overheads were significantly reduced by adaptive block skipping, leading to a 30%-80% improvement in deduplication throughput when deduplication metadata were stored on disk for Data Domain, and 25%-40% RAM space saving with a 15%-20% improvement in deduplication throughput when an in-RAM sparse index was used in Sparse Indexing. In both cases, the corresponding reductions in deduplication ratio were below 5%.
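To make the skipping idea concrete, the following is a minimal sketch (not the paper's implementation; the class, the `window` parameter, and the run-of-misses heuristic are illustrative assumptions): after a run of consecutive non-duplicate chunks, the next chunk is predicted to also be new, so its fingerprint lookup and index insertion are skipped entirely, trading a small loss in deduplication ratio for lower indexing overhead.

```python
import hashlib


class SkippingDeduplicator:
    """Illustrative sketch of deduplication with adaptive block skipping.

    Hypothetical simplification: when the last `window` chunks were all
    non-duplicates, the heuristic predicts the next chunk is also new and
    skips the fingerprint lookup and index update for it. Skipped chunks
    are stored directly, so a few duplicates may slip through; the paper's
    extension and hysteresis-indexing mechanisms (not modeled here)
    correct such mispredictions.
    """

    def __init__(self, window: int = 4):
        self.index = {}          # fingerprint -> id of stored chunk
        self.store = []          # stored chunks (unique or skipped)
        self.window = window     # run length that triggers a skip
        self.recent_misses = 0   # consecutive non-duplicate chunks seen
        self.lookups = 0         # index lookups actually performed

    def put(self, chunk: bytes) -> int:
        """Store one chunk, returning the id it is deduplicated to."""
        # Heuristic prediction: a long run of misses => likely another miss,
        # so skip hashing/indexing for this chunk and store it directly.
        if self.recent_misses >= self.window:
            self.recent_misses = 0        # re-enable lookups afterwards
            self.store.append(chunk)      # stored without dedup metadata
            return len(self.store) - 1
        fp = hashlib.sha1(chunk).hexdigest()
        self.lookups += 1
        if fp in self.index:
            self.recent_misses = 0        # a hit resets the miss run
            return self.index[fp]         # duplicate: reference old copy
        self.recent_misses += 1
        self.store.append(chunk)
        self.index[fp] = len(self.store) - 1
        return self.index[fp]
```

Feeding a chunk stream with few duplicates through `put` performs fewer index lookups than chunks, which is the source of the throughput gain; the cost is that any duplicate arriving inside a skipped run is stored twice until a correction mechanism re-indexes it.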

     

