2014, Vol. 29, Issue (6): 947-961. DOI: 10.1007/s11390-014-1481-2

Special Issue: Computer Architecture and Systems


Retention Benefit Based Intelligent Cache Replacement

Ling-Da Li1,2,3(李凌达), Student Member, CCF, ACM, Jun-Lin Lu1,2,3(陆俊林), Member, CCF, Xu Cheng1,2,3(程旭), Member, CCF   

  1. Microprocessor Research and Development Center, Peking University, Beijing 100871, China;
  2. Engineering Research Center of Microprocessor and System, Ministry of Education, Beijing 100871, China;
  3. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  • Received: 2013-12-19  Revised: 2014-05-06  Online: 2014-11-05  Published: 2014-11-05
  • About author: Ling-Da Li received his B.E. degree in computer science from Harbin Institute of Technology in 2008. He is now a Ph.D. candidate in computer architecture at Peking University. His research interests include cache systems, processor architecture, and multi-core systems. He is a student member of CCF and ACM.
  • Supported by:

    The work was supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant No. 2009ZX01029-001-002-2.

The performance loss caused by different cache misses varies in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the latency tolerance of processor cores differs across misses. Compared with parallel misses and store misses, isolated fetch and load misses are more costly. This variation in cache miss penalty suggests that the cache replacement policy should take it into account. To that end, we first propose the notion of retention benefit. Retention benefit can quantify not only the increase in processor stall cycles caused by cache misses, but also the reduction in processor stall cycles due to cache hits. We then propose Retention Benefit Based Replacement (RBR), which aims to maximize the aggregate retention benefit of the blocks retained in the cache. RBR tracks the total retention benefit of each block in the cache and, on replacement, preferentially evicts the block with the minimum total retention benefit. The evaluation shows that RBR improves cache performance significantly in both single-core and multi-core environments while requiring low storage overhead, and that it outperforms other state-of-the-art techniques.
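The eviction rule described in the abstract can be sketched as follows. This is an illustrative toy model, not the paper's actual hardware mechanism: it assumes each block in a set accumulates a per-access benefit value (in a real RBR design this would be derived from the stall cycles the hit saved, weighted by miss type), and the victim is simply the block with the smallest accumulated benefit.

```python
class RBRCacheSet:
    """Toy model of one set under a minimum-retention-benefit policy.

    Assumptions (not from the paper): `benefit` is supplied by the caller,
    and a newly inserted block starts with its insertion benefit.
    """

    def __init__(self, ways):
        self.ways = ways
        self.blocks = {}  # tag -> accumulated retention benefit

    def access(self, tag, benefit):
        """Return True on a hit; on a miss, insert the block, evicting
        the minimum-benefit block if the set is full."""
        if tag in self.blocks:
            # Hit: credit the block with the stall cycles it saved.
            self.blocks[tag] += benefit
            return True
        if len(self.blocks) >= self.ways:
            # Replacement: evict the block with the least total benefit.
            victim = min(self.blocks, key=self.blocks.get)
            del self.blocks[victim]
        self.blocks[tag] = benefit
        return False
```

For example, in a 2-way set, a block that has accumulated benefit 7 survives a replacement while a co-resident block with benefit 3 is evicted, regardless of recency — which is exactly how a cost-sensitive policy can depart from LRU.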


ISSN 1000-9000 (Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190, P.R. China
E-mail: jcst@ict.ac.cn