›› 2016, Vol. 31 ›› Issue (4): 805-819.doi: 10.1007/s11390-016-1664-0

Special Issue: Computer Architecture and Systems; Data Management and Data Mining

• Computer Architecture and Systems • Previous Articles     Next Articles

Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization

Bing Zhou, and Jiang-Tao Wen, Fellow, IEEE   

  1. 1 State Key Laboratory on Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China;
    2 Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;
    3 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received:2015-07-08 Revised:2016-01-10 Online:2016-07-05 Published:2016-07-05
  • About author:Bing Zhou received his B.S. degree in computer science and technology from Nanjing University, Nanjing, in 2009, and his M.S. and Ph.D. degrees in computer science and technology from Tsinghua University, Beijing, in 2012 and 2015 respectively. He is currently working at 2012 Laboratory, Huawei Technologies Co., Ltd., Beijing. His research interests include data communication/deduplication and high-performance distributed systems.
  • Supported by:

    This work is supported by the National Science Fund for Distinguished Young Scholars of China under Grant No. 61125102 and the Key Program of National Natural Science Foundation of China under Grant No. 61133008.

We describe a data deduplication system for backup storage of PC disk images, named in-RAM metadata utilizing deduplication (IR-MUD). In-RAM hash granularity adaptation and miniLZO based data compression are firstly proposed to reduce the in-RAM metadata size and thereby reduce the space overheads required by the in-RAM metadata caches. Secondly, an in-RAM metadata write cache, as opposed to the traditional metadata read cache, is proposed for further reducing metadata-related disk I/O operations and improving deduplication throughput. During deduplication, the metadata write cache is managed following the LRU caching policy. For each manifest that is hit in the metadata write cache, an expensive manifest reloading operation from the disk is avoided. After deduplication, all the manifests in the metadata write cache are cleared and stored on the disk. Our experimental results using 1.5 TB real-world disk image dataset show that 1) IR-MUD achieved about 95% size reduction for the deduplication metadata, with a small time overhead introduced, 2) when the metadata write cache was not utilized, with the same RAM space size for the metadata read cache, IR-MUD achieved a 400% higher RAM hit ratio and a 50% higher deduplication throughput, as compared with the classic Sparse Indexing deduplication system where no metadata utilization approaches are utilized, and 3) when the metadata write cache was utilized and enough RAM space was available, IR-MUD achieved a 500% higher RAM hit ratio compared with Sparse Indexing and a 70% higher deduplication throughput compared with IR-MUD with only a single metadata read cache. The in-RAM metadata harnessing and metadata write caching approaches of IR-MUD can be applied in most parallel deduplication systems for improving metadata caching efficiency.

[1] Black J. Compare-by-hash:A reasoned analysis. In Proc. the USENIX Annual Technical Conference (ATC), May 2006, pp.85-90.

[2] Meister D, Kaiser J, Brinkmann A, Cortes T, Kuhn M, Kunkel J. A study on data deduplication in HPC storage systems. In Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2012, Article No. 7.

[3] Bloom B H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, July 1970, 13(7):422-426.

[4] Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse Indexing:Large scale, inline deduplication using sampling and locality. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009, pp.111-123.

[5] Tanenbaum A S. Modern Operating Systems (2nd edition). Prentice Hall PTR, 2001.

[6] Zhou B, Wen J. Hysteresis re-chunking based metadata harnessing deduplication of disk images. In Proc. the 42nd IEEE International Conference on Parallel Processing (ICPP), October 2013, pp.389-398.

[7] Rabin M O. Fingerprinting by random polynomials. Technical Report, TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.

[8] Muthitacharoen A, Chen B, Maziéres D. A low-bandwidth network file system. In Proc. the 18th ACM Symposium on Operating Systems Principles, October 2001, pp.174-187.

[9] Romański B, Heldt T, Kilian W et al. Anchor-driven subchunk deduplication. In Proc. the 4th Annual International Conference on Systems and Storage (SYSTOR), May 2011, pp.16:1-6:13.

[10] Tolia N, Kozuch M, Satyanarayanan M, Karp B, Bressoud T, Perrig A. Opportunistic use of content addressable storage for distributed file systems. In Proc. the USENIX Annual Technical Conference (ATC), June 2003, pp.127-140.

[11] Bhagwat D, Eshghi K, Long D D E, Lillibridge M. Extreme Binning:Scalable, parallel deduplication for chunk-based file backup. In Proc. the IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), September 2009.

[12] Shilane P, Huang M, Wallace G, Hsu W. WAN-optimized replication of backup datasets using stream-informed delta compression. Transactions on Storage, 2012, 8(4):13:1-13:26.

[13] Tuduce I C, Gross T. Adaptive main memory compression. In Proc. the USENIX Annual Technical Conference (ATEC), April 2005, pp.237-250.

[14] Wilson P R, Kaplan S F, Smaragdakis Y. The case for compressed caching in virtual memory systems. In Proc. the USENIX Annual Technical Conference (ATEC), June 1999, pp.101-116.

[15] Gupta D, Lee S, Vrable M, Savage S, Snoeren A C, Varghese G, Voelker G M, Vahdat A. Difference engine:Harnessing memory redundancy in virtual machines. Commun. ACM, 2010, 53(10):85-93.

[16] Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3):337-343.

[17] Jin K, Miller E L. The effectiveness of deduplication on virtual machine disk images. In Proc. the 2nd Annual International Systems and Storage Conference:The Israeli Experimental Systems Conference (SYSTOR), May 2009, pp.7:1-7:12.

[18] Zhu B, Li K, Patterson H. Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. the 6th USENIX Conference on File and Storage Technologies (FAST), February 2008, Article No. 18.

[19] Wildani A, Miller E, Rodeh O. HANDS:A heuristically arranged non-backup inline deduplication system. In Proc. the 29th IEEE International Conference on Data Engineering, April 2013, pp.446-457.

[20] Botelho F C, Shilane P, Gang N, Hsu W. Memory efficient sanitization of a deduplicated storage system. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013, pp.81-84.

[21] Alameldeen A R, Wood D A. Adaptive cache compression for high-performance processors. In Proc. the 31st Annual International Symposium on Computer Architecture (ISCA), June 2004, pp.212-223.

[22] Debnath B, Sengupta S, Li J. ChunkStash:Speeding up inline storage deduplication using flash memory. In Proc. the USENIX Annual Technical Conference (ATC), June 2010, Article No. 16.

[23] Meister D, Brinkmann A. dedupv1:Improving deduplication throughput using solid state drives (SSD). In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), May 2010.

[24] Chen F, Luo T, Zhang X. CAFTL:A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. the 9th USENIX Conference on File and Stroage Technologies (FAST), February 2011, Article No. 6.

[25] Agrawal N, Prabhakaran V, Wobber T, Davis J D, Manasse M, Panigrahy R. Design tradeoffs for SSD performance. In Proc. the USENIX Annual Technical Conference (ATC), June 2008, pp.57-70.

[26] Kaiser J, Brinkmann A, Süβ T, Meister D. Deriving and comparing deduplication techniques using a model-based classification. In Proc. the 10th European Conference on Computer Systems (EuroSys), April 2015, pp.11:1-11:13.

[27] Srinivasan K, Bisson T, Goodson G, Voruganti K. iDedup:Latency-aware, inline data deduplication for primary storage. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), February 2012, pp.299-312.

[28] Bobbarjung D R, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Transactions on Storage, 2006, 2(4):424-448.

[29] Kruus E, Ungureanu C, Dubnicki C. Bimodal content defined chunking for backup streams. In Proc. the 8th USENIX Conference on File and Storage Technologies (FAST), February 2010, Article No. 18.

[30] Lu G, Jin Y, Du D. Frequency based chunking for data de-duplication. In Proc. IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), August 2010, pp.287-296.
No related articles found!
Full text



[1] Ma Songde; Wei Guoqing; Huang Jinfeng;. Segment Based Camera Calibration[J]. , 1993, 8(1): 11 -16 .
[2] Zhang Yongyue; Peng Zhenyun; You Suya; Xu Guangyou;. A Multi-View Face Recognition System[J]. , 1997, 12(5): 400 -407 .
[3] LI Xiaoshan;. Decidability of Mean Value Calculus[J]. , 1999, 14(2): 173 -180 .
[4] HUANG Xiong; LI wei;. On k-Positive Satisfiability Problem[J]. , 1999, 14(4): 309 -313 .
[5] QI Yuesheng; WANG Baozhong; KANG Lishan;. Genetic Programming with Simple Loops[J]. , 1999, 14(4): 429 -433 .
[6] WEI Xiaohui; JU Jiubin;. SFT: A Consistent Checkpointing Algorithm with Short Freezing Time[J]. , 2000, 15(2): 169 -175 .
[7] LIU Bin; LU Zengxiang; GAN Quan; FENG Ao; WANG Pu;. Infomarker—A New Internet Information Service System[J]. , 2000, 15(3): 300 -304 .
[8] Jian-Wen Chen, Guo-Ping Li, and Yun He. A Novel MBAFF Scheme of AVS[J]. , 2006, 21(3): 323 -331 .
[9] Hao-Jun Ai, Shui-Xian Chen, and Rui-Min Hu. Introduction to AVS Audio[J]. , 2006, 21(3): 360 -365 .
[10] Rafiullah Chamlawi, Asifullah Khan, and Adnan Idris. Wavelet Based Image Authentication and Recovery[J]. , 2007, 22(6): 795 -804 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved