SCIE, EI, Scopus, INSPEC, DBLP, CSCD, etc.
Citation: | Li PF, Hua Y, Cao Q. An enhanced physical-locality deduplication system for space efficiency. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 39(6): 1361−1379 Nov. 2024. DOI: 10.1007/s11390-023-2646-7. |
An abundance of data have been generated from various embedded devices, applications, and systems, and require cost-efficient storage services. Data deduplication removes duplicate chunks and becomes an important technique for storage systems to improve space efficiency. However, stored unique chunks are heavily fragmented, decreasing restore performance and incurs high overheads for garbage collection. Existing schemes fail to achieve an efficient trade-off among deduplication, restore and garbage collection performance, due to failing to explore and exploit the physical locality of different chunks. In this paper, we trace the storage patterns of the fragmented chunks in backup systems, and propose a high-performance deduplication system, called HiDeStore. The main insight is to enhance the physical-locality for the new backup versions during the deduplication phase, which identifies and stores hot chunks in the active containers. The chunks not appearing in new backups become cold and are gathered together in the archival containers. Moreover, we remove the expired data with an isolated container deletion scheme, avoiding the high overheads for expired data detection. Compared with state-of-the-art schemes, HiDeStore improves the deduplication and restore performance by up to 1.4x and 1.6x, respectively, without decreasing the deduplication ratios and incurring high garbage collection overheads.
[1] |
Khorasani S O, Rellermeyer J S, Epema D. Self-adaptive executors for big data processing. In Proc. the 20th International Middleware Conference, Dec. 2019, pp.176–188. DOI: 10.1145/3361525.3361545.
|
[2] |
Birke R, Rocha I, Perez J, Schiavoni V, Felber P, Chen L Y. Differential approximation and sprinting for multi-priority big data engines. In Proc. the 20th International Middleware Conference, Dec. 2019, pp.202–214. DOI: 10.1145/3361525.3361547.
|
[3] |
Akbari A, Martinez J, Jafari R. Facilitating human activity data annotation via context-aware change detection on smartwatches. ACM Trans. Embedded Computing Systems, 2021, 20(2): 15. DOI: 10.1145/3431503.
|
[4] |
Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Zhang Y, Tan Y. Design tradeoffs for data deduplication performance in backup workloads. In Proc. the 13th USENIX Conference on File and Storage Technologies, Feb. 2015, pp.331–344.
|
[5] |
Li Y K, Xu M, Ng C H, Lee P P C. Efficient hybrid inline and out-of-line deduplication for backup storage. ACM Trans. Storage, 2015, 11(1): Article No. 2. DOI: 10.1145/2641572.
|
[6] |
Park D, Fan Z, Nam Y J, Du D H C. A lookahead read cache: Improving read performance for deduplication backup storage. Journal of Computer Science and Technology, 2017, 32(1): 26–40. DOI: 10.1007/s11390-017-1680-8.
|
[7] |
Duggal A, Jenkins F, Shilane P, Chinthekindi R, Shah R, Kamat M. Data domain cloud tier: Backup here, backup there, deduplicated everywhere! In Proc. the 2019 USENIX Annual Technical Conference, Jul. 2019, pp.647–660.
|
[8] |
Meyer D T, Bolosky W J. A study of practical deduplication. ACM Trans. Storage, 2012, 7(4): Article No. 14. DOI: 10.1145/2078861.2078864.
|
[9] |
Muthitacharoen A, Chen B, Mazières D. A low-bandwidth network file system. In Proc. the 18th ACM Symposium on Operating Systems Principles, Oct. 2001, pp.174–187. DOI: 10.1145/502034.502052.
|
[10] |
Wallace G, Douglis F, Qian H, Shilane P, Smaldone S, Chamness M, Hsu W. Characteristics of backup workloads in production systems. In Proc. the 10th USENIX Conference on File and Storage Technologies, Feb. 2012, p.4.
|
[11] |
Yang Q, Jin R, Zhao M. SmartDedup: Optimizing deduplication for resource-constrained devices. In Proc. the 2019 USENIX Annual Technical Conference, Jul. 2019, pp.633–646.
|
[12] |
Quinlan S, Dorward S. Venti: A new approach to archival storage. In Proc. the FAST 2002 Conference on File and Storage Technologies, Jan. 2002, pp.89–101.
|
[13] |
Zhu B, Li K, Patterson R H. Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. the 6th USENIX Conference on File and Storage Technologies, Feb. 2008, pp.269–282.
|
[14] |
Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezis G, Camble P. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proc. the 7th USENIX Conference on File and Storage Technologies, Feb. 2009, pp.111–123.
|
[15] |
Fu M, Feng D, Hua Y, He X, Chen Z, Xia W, Huang F, Liu Q. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proc. the 2014 USENIX Annual Technical Conference, Jun. 2014, pp.181–192.
|
[16] |
Kaczmarczyk M, Barczynski M, Kilian W, Dubnicki C. Reducing impact of data fragmentation caused by in-line deduplication. In Proc. the 5th Annual International Systems and Storage Conference, Jun. 2012, Article No. 15. DOI: 10.1145/2367589.2367600.
|
[17] |
Lillibridge M, Eshghi K, Bhagwat D. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proc. the 11th USENIX conference on File and Storage Technologies, Feb. 2013, pp.183–198.
|
[18] |
Cao Z, Wen H, Wu F, Du D H C. ALACC: Accelerating restore performance of data deduplication systems using adaptive look-ahead window assisted chunk caching. In Proc. the 16th USENIX Conference on File and Storage Technologies, Feb. 2018, pp.309–324.
|
[19] |
Mao B, Jiang H, Wu S, Fu Y, Tian L. SAR: SSD assisted restore optimization for deduplication-based storage systems in the cloud. In Proc. the 7th IEEE International Conference on Networking, Architecture, Jun. 2012, pp.328–337. DOI: 10.1109/NAS.2012.48.
|
[20] |
Nam Y J, Park D, Du D H C. Assuring demanded read performance of data deduplication storage with backup datasets. In Proc. the 20th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Aug. 2012, pp.201–208. DOI: 10.1109/MASCOTS.2012.32.
|
[21] |
Cao Z, Liu S, Wu F, Wang G, Li B, Du D H C. Sliding look-back window assisted data chunk rewriting for improving deduplication restore performance. In Proc. the 17th USENIX Conference on File and Storage Technologies, Feb. 2019, pp.129–142.
|
[22] |
Nam Y, Lu G, Park N, Xiao W, Du D H C. Chunk fragmentation level: An effective indicator for read performance degradation in deduplication storage. In Proc. the 13th IEEE International Conference on High Performance Computing and Communications, Sept. 2011, pp.581–586. DOI: 10.1109/HPCC.2011.82.
|
[23] |
Ng C H, Lee P P C. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proc. the 4th Asia-Pacific Workshop on Systems, Jul. 2013, Article No. 15. DOI: 10.1145/2500727.2500731.
|
[24] |
Li P, Hua Y, Cao Q, Zhang M. Improving the restore performance via physical-locality middleware for backup systems. In Proc. the 21st International Middleware Conference, Dec. 2020, pp.341–355. DOI: 10.1145/3423211.3425691.
|
[25] |
Debnath B K, Sengupta S, Li J. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proc. the 2010 USENIX Annual Technical Conference, Jun. 2010, Article No. 16.
|
[26] |
Meister D, Kaiser J, Brinkmann A. Block locality caching for data deduplication. In Proc. the 6th International Systems and Storage Conference, Jul. 2013, Article No. 15. DOI: 10.1145/2485732.2485748.
|
[27] |
Eshghi K, Tang H K. A framework for analyzing and improving content-based chunking algorithms. Technical Report, HP Laboratory, 2005. http://shiftleft.com/mirrors/www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf. Oct. 2024.
|
[28] |
Xia W, Zhou Y, Jiang H, Feng D, Hua Y, Hu Y, Liu Q, Zhang Y. Fastcdc: A fast and efficient content-defined chunking approach for data deduplication. In Proc. the 2016 USENIX Annual Technical Conference, Jun. 2016, pp.101–114.
|
[29] |
Bhagwat D, Eshghi K, Long D D E, Lillibridge M. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proc. the 17th IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, Sept. 2009. DOI: 10.1109/MASCOT.2009.5366623.
|
[30] |
Xia W, Jiang H, Feng D, Hua Y. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proc. the 2011 USENIX Conference on USENIX Annual Technical Conference, Jun. 2011.
|
[31] |
Xu G, Tang B, Lu H, Yu Q, Sung C W. LIPA: A learning-based indexing and prefetching approach for data deduplication. In Proc. the 35th Symposium on Mass Storage Systems and Technologies, May 2019, pp.299–310. DOI: 10.1109/MSST.2019.00010.
|
[32] |
Wei J, Jiang H, Zhou K, Feng D. MAD2: A scalable high-throughput exact deduplication approach for network backup services. In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies, May 2010. DOI: 10.1109/MSST.2010.5496987.
|
[33] |
Guo F, Efstathopoulos P. Building a high-performance deduplication system. In Proc. the 2011 USENIX Conference on USENIX Annual Technical Conference, Jun. 2011.
|
[34] |
Zhang Y, Jiang H, Feng D, Xia W, Fu M, Huang F, Zhou Y. AE: An asymmetric extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication. In Proc. the 2015 IEEE Conference on Computer Communications, Apr. 26–May 1, 2015, pp.1337–1345. DOI: 10.1109/INFOCOM.2015.7218510.
|
[35] |
Xia W, Jiang H, Feng D, Hua Y. Similarity and locality based indexing for high performance data deduplication. IEEE Trans. Computers, 2015, 64(4): 1162–1176. DOI: 10.1109/TC.2014.2308181.
|
[1] | Zheng Chen, Xiao-Nan Fang, Song-Hai Zhang. Local Homography Estimation on User-Specified Textureless Regions[J]. Journal of Computer Science and Technology, 2022, 37(3): 615-625. DOI: 10.1007/s11390-022-2185-7 |
[2] | Tong Lin, Yao Liu, Bo Wang, Li-Wei Wang, Hong-Bin Zha. Nonlinear Dimensionality Reduction by Local Orthogonality Preserving Alignment[J]. Journal of Computer Science and Technology, 2016, 31(3): 512-524. DOI: 10.1007/s11390-016-1644-4 |
[3] | Ji-Liang Zhang, Qiang Wu, Yi-Peng Ding, Yong-Qiang Lv, Qiang Zhou, Zhi-Hua Xia, Xing-Ming Sun, Xing-Wei Wang. Techniques for Design and Implementation of an FPGA-Specific Physical Unclonable Function[J]. Journal of Computer Science and Technology, 2016, 31(1): 124-136. DOI: 10.1007/s11390-016-1616-8 |
[4] | Yan-Chao Zhao, Jie Wu, Wen-Zhong Li, Sang-Lu Lu. Throughput Optimization in Cognitive Radio Networks Ensembling Physical Layer Measurement[J]. Journal of Computer Science and Technology, 2015, 30(6): 1290-1305. DOI: 10.1007/s11390-015-1599-x |
[5] | Ru Wang, Bao-Xia Fan, Liang Yang, Yan-Ping Gao, Dong Liu, Bin Xiao, Jiang-Mei Wang, Yi-Fu Zhang, Hong Wang, Wei-Wu Hu. Physical Implementation of the Eight-Core Godson-3B Microprocessor[J]. Journal of Computer Science and Technology, 2011, 26(3): 520-527. DOI: 10.1007/s11390-011-1151-6 |
[6] | Yunhao Liu, Zheng Yang, Xiaoping Wang, Lirong Jian. Location, Localization, and Localizability[J]. Journal of Computer Science and Technology, 2010, 25(2): 274-297. |
[7] | Ji-Ye Zhao, Dong Liu, Dan-Dan Huan, Meng-Hao Su, Bin Xiao, Ying Xu, Feng Shi, Chen Chen, Song Wang. Physical Design Methodology for Godson-2G Microprocessor[J]. Journal of Computer Science and Technology, 2010, 25(2): 225-231. |
[8] | Bao-Xia Fan, Liang Yang, Jiang-Mei Wang, Ru Wang, Bin Xiao, Ying Xu, Dong Liu, Ji-Ye Zhao. Physical Implementation of the 1GHz Godson-3 Quad-Core Microprocessor[J]. Journal of Computer Science and Technology, 2010, 25(2): 192-199. |
[9] | Wei Mi, Xiao-Bing Feng, Yao-Cang Jia, Li Chen, Jing-Ling Xue. PARBLO: Page-Allocation-Based DRAM Row Buffer Locality Optimization[J]. Journal of Computer Science and Technology, 2009, 24(6): 1086-1097. |
[10] | Dong Feng, Cai Wenli, Chen Tianzhou, Shi Jiaoying. Three-Dimensional Volume Datafield Reconstruction from Physical Model[J]. Journal of Computer Science and Technology, 1997, 12(3): 217-230. |