Data deduplication (dedupe for short) is a specialized data compression technique. It has been widely adopted, particularly in backup storage systems, to save both backup time and storage space. Consequently, most dedupe research has focused on improving dedupe write performance. However, dedupe read performance is also crucial for backup storage, because it determines how quickly data can be recovered. This paper designs a new dedupe storage read cache for backup applications that improves read performance by exploiting a special characteristic: the read sequence is the same as the write sequence. For better cache utilization, the cache looks ahead at future references within a moving window and evicts the cached entries with the fewest future accesses. Moreover, to further improve read cache performance, it maintains a small log buffer that judiciously caches data chunks with future accesses. Extensive experiments with real-world backup workloads demonstrate that the proposed read cache scheme improves read performance by up to 64.3%.
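The eviction idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `simulate_lookahead_cache` and its parameters `read_sequence`, `capacity`, and `window` are hypothetical, the cached unit is abstracted to an opaque ID, ties between victims are broken arbitrarily by ID, and the log buffer is not modeled. The sketch assumes only what the abstract states, namely that the read sequence is known in advance because it matches the write sequence.

```python
from collections import Counter


def simulate_lookahead_cache(read_sequence, capacity, window):
    """Replay read_sequence through a fixed-size cache and return the hit count.

    On a miss with a full cache, the victim is the cached ID with the fewest
    references in the next `window` positions of the read sequence.
    (Hypothetical helper for illustration only.)
    """
    n = len(read_sequence)
    cache = set()
    hits = 0
    # Reference counts for positions 1..window, i.e. the lookahead window
    # as seen from index 0.
    future = Counter(read_sequence[1:1 + window])

    for i, item in enumerate(read_sequence):
        if item in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                # Evict the entry with the smallest future access count;
                # the ID itself is an arbitrary, deterministic tie-breaker.
                victim = min(cache, key=lambda k: (future[k], k))
                cache.remove(victim)
            cache.add(item)

        # Slide the window forward by one position: position i+1 leaves it,
        # position i+1+window enters it.
        if i + 1 < n:
            future[read_sequence[i + 1]] -= 1
        if i + 1 + window < n:
            future[read_sequence[i + 1 + window]] += 1

    return hits


if __name__ == "__main__":
    sequence = ["A", "B", "C", "A", "B", "D", "A", "B"]
    print(simulate_lookahead_cache(sequence, capacity=2, window=4))
```

A recency-based policy such as LRU has no knowledge of upcoming reads, whereas this policy can use them because a backup restore replays a sequence recorded at write time. The window bounds how much of that sequence must be inspected at once.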
This work is partially supported by the National Science Foundation Awards of USA under Grant Nos. 121756, 1305237, 142191 and 1439622.
About the author: Dongchul Park is currently a research scientist in the Memory Solutions Laboratory (MSL) at Samsung Semiconductor Inc. in San Jose, California. He received his Ph.D. degree in computer science and engineering from the University of Minnesota-Twin Cities, Minneapolis, in 2012, where he was a member of the Center for Research in Intelligent Storage (CRIS) group, advised by Professor David H. C. Du. His research interests focus on storage system design and applications, including non-volatile memories, in-storage computing, big data processing, Hadoop MapReduce, data deduplication, key-value stores, cloud computing, and shingled magnetic recording (SMR) technology.
Dongchul Park, Ziqi Fan, Young Jin Nam, David H. C. Du. A Lookahead Read Cache: Improving Read Performance for Deduplication Backup Storage[J]. Journal of Computer Science and Technology, 2017, 32(1): 26-40.