›› 2014, Vol. 29 ›› Issue (4): 679-691.doi: 10.1007/s11390-014-1459-0

Special Issue: Computer Architecture and Systems

• Computer Architectures and Systems • Previous Articles     Next Articles

DEAM:Decoupled, Expressive, Area-Efficient Metadata Cache

Peng Liu1,2 (刘鹏), Member, CCF, IEEE, Lei Fang1 (方磊), and Michael C. Huang3 (黄巍), Member, ACM, IEEE   

  1. 1. Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China;
    2. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China;
    3. Department of Electrical and Computer Engineering, University of Rochester, NY 14627-0231, U.S.A.
  • Online:2014-07-05 Published:2014-07-05
  • About author:Peng Liu received his B.S. and M.S. degrees both in optical engineering, and Ph.D. degree in communication and electrical engineering from Zhejiang University, Hangzhou, in 1992, 1996, and 1999, respectively. In 1999, he joined the faculty of the Information Science and Electronic Engineering Department, Zhejiang University, where he was promoted to associate professor in 2002. From 2009 to 2010, he was a visiting scholar at University of Rochester working on high-performance computer architectures.
  • Supported by:

    The work is supported by the Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao of the National Natural Science Foundation of China under Grant No. 61028004, and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2014A08.

Chip multiprocessor presents brand new opportunities for holistic on-chip data and coherence management solutions. An intelligent protocol should be adaptive to the fine-grain accessing behavior. And in terms of storage of metadata, the size of conventional directory grows as the square of the number of processors, making it very expensive in large-scale systems. In this paper, we propose a metadata cache framework to achieve three goals (1) reducing the latency of data access and coherence activities, (2) saving the storage of metadata, and (3) providing support for other optimization techniques. The metadata is implemented with compact structures and tracks the dynamically changing access pattern. The pattern information is used to guide the delegation and replication of decoupled data and metadata to allow fast access. We also use our metadata cache as a building block to enhance the stream prefetching. Using detailed execution-driven simulation, we demonstrate that our protocol achieves an average speedup of 1.12X compared to a shared cache protocol with 1/5 of the storage of metadata.

[1] Conway P, Kalyanasundharam N, Donley G, Lepak K, Hughes B. Cache hierarchy and memory subsystem of the AMD Opteron processor. IEEE Micro, 2010, 30(2): 16-29.

[2] Kalla R, Sinharoy B, Starke W, Floyd M. Power7: IBM's next-generation server processor. IEEE Micro, 2010, 30(2): 7-15.

[3] Shah M, Barren J, Brooks J et al. UltraSPARC T2: A highly-treaded, power-efficient, SPARC SOC. In Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2007, pp.22-25.

[4] Mcnairy C, Bhatia R. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, 2005, 25(2): 10-20.

[5] Keltcher C N, Mcgrath K J, Ahmed A, Conway P. The AMD Opteron processor for multiprocessor servers. IEEE Micro, 2003, 23(2): 66-76.

[6] Barroso L A, Gharachorloo K, Mcnamara R, Nowatzyk A, Qadeer S, Sano B, Smith S, Stets R, Verghese B. Piranha: A scalable architecture based on single-chip multiprocessing. In Proc. the 27th Int. Symp. Comp. Arch., Jun. 2000, pp.282-293.

[7] Tendler J M, Dodson J S, Fields J S, Le H, Sinharoy B. POWER4 system microarchitecture. IBM Journal of Research and Development, 2002, 46(1): 5-25.

[8] Agarwal V, Hrishikesh M S, Keckler S W, Burger D. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proc. the 27th Int. Symp. Comp. Arch., Jun. 2000, pp.248-259.

[9] Ho R, Mai K W, Horowitz M A. The future of wires. Proceedings of the IEEE, 2001, 89(4): 490-504.

[10] Zhang M, Asanovi′c K. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proc. the 32nd Int. Symp. Comp. Arch., Jun. 2005, pp.336-345.

[11] Huh J, Kim C, Shafi H, Zhang L, Burger D, Keckler S W. A NUCA substrate for flexible CMP cache sharing. IEEE Trans. Parallel and Distributed Systems, 2007, 18(8): 10281040.

[12] Chishti Z, Powell M D, Vijaykumar T N. Optimizing replication, communication, and capacity allocation in CMPs. In Proc. the 32nd Int. Symp. Comp. Arch., June 2005, pp.357368.

[13] Hossain H, Dwarkadas S, Huang M C. DDCache: Decoupled and delegable cache data and metadata. In Proc. the 18th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2009, pp.227-236.

[14] Marty M R, Hill M D. Virtual hierarchies to support server consolidation. In Proc. the 34th Int. Symp. Comp. Arch., Jun. 2007, pp.46-56.

[15] Censier L M, Feautrier P. A new solution to coherence problems in multicache systems. IEEE Trans. Computers, 1978, 27(12): 1112-1118.

[16] Zebchuk J, Moshovos A. RegionTracker: A case for dualgrain tracking in the memory system. Technical Report, University of Toronto, 2006.

[17] Zebchuk J, Safi E, Moshovos A. A framework for coarsegrain optimizations in the on-chip memory hierarchy. In Proc. the 40th Int. Symp. Microarch., Dec. 2007, pp.314327.

[18] Bowen N S, Pradham D K. Processorand memory-based checkpoint and rollback recovery. Computer, 1993, 26(2): 22-31.

[19] Xue J, Garg A, Ciftcioglu B et al. An intra-chip free-space optical interconnect: Extended technical report. Technical Report, Dept. Electrical & Computer Engineering, University of Rochester, 2010.

[20] Compaq Computer Corporation. Alpha 21264/EV67 microprocessor hardware reference manual. Sept. 2000. http://download.majix.org/dec/21264ev67 hrm.pdf, May 2014.

[21] Burger D, Austin T M. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 1997, 25(3): 13-25.

[22] Sheng L, Ahn J, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Int. Symp. Microarch., Dec. 2009, pp.469-480.

[23] Woo S, Ohara M, Torrie E, Singh J, Gupta A. The SPLASH2 programs: Characterization and methodological considerations. In Proc. the 22nd Int. Symp. Comp. Arch., Jun. 1995, pp.24-36.

[24] Bienia C, Kumar S, Singh J, Li K. The PARSEC benchmark suite: Characterization and architectural implications. In Proc. the 17th Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2008, pp.72-81.

[25] Zhang M, Asanovic K. Victim migration: Dynamically adapting between private and shared CMP caches. Technical Report, MIT-CSAIL, Oct. 2005.

[26] Jouppi N P. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. the 17th Int. Symp. Comp. Arch., May 1990, pp.364-373.

[27] Ganusov I, Burtscher M. On the importance of optimizing the configuration of stream prefetchers. In Proc. the 2005 Workshop on Memory System Performance, Jun. 2005, pp.54-61.

[28] Ros A, Acacio M E, Garcia J M. Direct coherence: Bringing together performance and scalability in shared-memory multiprocessors. In Proc. the 14th Int. Conf. High Performance Computing, Dec. 2007, pp.147-160.

[29] Ros A, Acacio M E, Garcia J M. DiCo-CMP: Efficient cache coherency in tiled CMP architectures. In Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2008.

[30] Hossain H, Dwarkadas S, Huang M C. Improving support for locality and fine-grain sharing in chip multiprocessors. In Proc. the 17th Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2008, pp.155-165.

[31] Ros A, Kaxiras S. Complexity-effective multicore coherence. In Proc. the 21st Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2012, pp.241-252.

[32] Pugsley S H, Spjut J B, Nellans D W, Balasubramonian R. SWEL: Hardware cache coherence protocols to map shared data onto shared caches. In Proc. the 19th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2010, pp.465-476.

[33] Khan O, Hoffmann H, Lis M, Hijaz F, Agarwal A, Devadas S. ARCc: A case for an architecturally redundant cachecoherence architecture for large multicores. In Proc. the 29th IEEE Int. Conf. Computer Design, Oct. 2011, pp.411-418.

[34] Beckmann B M, Marty M R, Wood D A. ASR: Adaptive selective replication for CMP caches. In Proc. the 39th Int. Symp. Microarch., Dec. 2006, pp.443-454.

[35] Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proc. the 36th Int. Symp. Comp. Arch., Jun. 2009, pp.184-195.

[36] Agarwal A, Simoni R, Hennessy J et al. An evaluation of directory schemes for cache coherence. In Proc. the 15th Int. Symp. Comp. Arch., May 30-June 2, 1988, pp.280-289.

[37] Chaiken D, Kubiatowicz J, Agarwal A. LimitLESS directories: A scalable cache coherence scheme. In Proc. the 4th Int. Conf. Arch. Support for Prog. Lang. and Operating Systems, Apr. 1991, pp.224-234.

[38] Maa Y, Pradhan D, Thiebaut D. Two economical directory schemes for large-scale cache coherent multiprocessors. ACM SIGARCH Computer Architecture News, 1991, 19(5): 1018.

[39] Gupta A, Weber W, Mowry T. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proc. Int. Conf. Parallel Processing, Aug. 1990, pp.312-321.

[40] Sanchez D, Kozyrakis C. SCD: A scalable coherence directory with flexible sharer set encoding. In Proc. the 18th IEEE Int. Symp. High-Perf. Comp. Arch., Feb. 2012, pp.112.

[41] Zhao H, Shriraman A, Dwarkadas S. SPACE: Sharing pattern-based directory coherence for multicore scalability. In Proc. the 19th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2010, pp.135-146.

[42] Zhao H, Shriraman A, Dwarkadas S, Srinivasan V. SPATL: Honey, I shrunk the coherence directory. In Proc. Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2011, pp.33-44.

[43] Choi J, Park K. Segment directory enhancing the limited directory cache coherence schemes. In Proc. the 10th Int. Parallel and Distributed Processing Symp., Apr. 1999, pp.258267.

[44] Cuesta B, Ros A, G′omez M, Robles A, Duato J. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Proc. Int. Symp. Comp. Arch., Jun. 2011, pp.93-104.

[45] Alisafaee M. Spatiotemporal coherence tracking. In Proc. the 45th Int. Symp. Microarch., Dec. 2012, pp.341-350.

[46] Fang L, Liu P, Hu Q, Huang M C, Jiang G F. Building expressive, area-efficient coherence directories. In Proc. the 22nd Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2013, pp.299-308.

[47] Ebrahimi E, Mutlu O, Chang J L, Patt Y N. Coordinated control of multiple prefetchers in multi-core systems. In Proc. the 42nd Int. Symp. Microarch., Dec. 2009, pp.316326.

[48] Dahlgren F, Dubois M, Stenstrom P. Fixed and adaptive sequential prefetching in shared memory multiprocessors. In Proc. Int. Conf. Parallel Processing, Aug. 1993, pp.56-63.

[49] Nesbit K J, Dhodapkar A S, Smith J E. AC/DC: An adaptive data cache prefetcher. In Proc. the 13th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 29-Oct. 3, 2004, pp.135-145.

[50] Gendler A, Mendelson A, Birk Y. A PAB-based multiprefetcher mechanism. International Journal of Parallel Programming, 2006, 34(2): 171-188.
No related articles found!
Full text



[1] Zhang Bo; Zhang Ling;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[2] Meng Liming; Xu Xiaofei; Chang Huiyou; Chen Guangxi; Hu Mingzeng; Li Sheng;. A Tree-Structured Database Machine for Large Relational Database Systems[J]. , 1987, 2(4): 265 -275 .
[3] Lin Qi; Xia Peisu;. The Design and Implementation of a Very Fast Experimental Pipelining Computer[J]. , 1988, 3(1): 1 -6 .
[4] Sun Chengzheng; Tzu Yungui;. A New Method for Describing the AND-OR-Parallel Execution of Logic Programs[J]. , 1988, 3(2): 102 -112 .
[5] Zhang Bo; Zhang Tian; Zhang Jianwei; Zhang Ling;. Motion Planning for Robots with Topological Dimension Reduction Method[J]. , 1990, 5(1): 1 -16 .
[6] Zhu Mingyuan;. Two Congruent Semantics for Prolog with CUT[J]. , 1990, 5(1): 82 -91 .
[7] Wang Dingxing; Zheng Weimin; Du Xiaoli; Guo Yike;. On the Execution Mechanisms of Parallel Graph Reduction[J]. , 1990, 5(4): 333 -346 .
[8] Zhou Quan; Wei Daozheng;. A Complete Critical Path Algorithm for Test Generation of Combinational Circuits[J]. , 1991, 6(1): 74 -82 .
[9] Guo Hengchang;. On the Characterization and Fault Identification of Sequentially t-Diagnosable System Under PMC Model[J]. , 1991, 6(1): 83 -90 .
[10] Zhao Jinghai; Liu Shenquan;. An Environment for Rapid Prototyping of Interactive Systems[J]. , 1991, 6(2): 135 -144 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved