›› 2014,Vol. 29 ›› Issue (4): 679-691.doi: 10.1007/s11390-014-1459-0

所属专题: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

面积有效和表现力强的解耦元数据高速缓存

Peng Liu1,2 (刘鹏), Member, CCF, IEEE, Lei Fang1 (方磊), and Michael C. Huang3 (黄巍)Member, ACM, IEEE   

  1. 1. Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China;
    2. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China;
    3. Department of Electrical and Computer Engineering, University of Rochester, NY 14627-0231, U.S.A.
  • 出版日期:2014-07-05 发布日期:2014-07-05
  • 作者简介:Peng Liu received his B.S. and M.S. degrees both in optical engineering, and Ph.D. degree in communication and electrical engineering from Zhejiang University, Hangzhou, in 1992, 1996, and 1999, respectively. In 1999, he joined the faculty of the Information Science and Electronic Engineering Department, Zhejiang University, where he was promoted to associate professor in 2002. From 2009 to 2010, he was a visiting scholar at University of Rochester working on high-performance computer architectures.
  • 基金资助:

    The work is supported by the Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao of the National Natural Science Foundation of China under Grant No. 61028004, and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2014A08.

DEAM:Decoupled, Expressive, Area-Efficient Metadata Cache

Peng Liu1,2 (刘鹏), Member, CCF, IEEE, Lei Fang1 (方磊), and Michael C. Huang3 (黄巍), Member, ACM, IEEE   

  1. 1. Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China;
    2. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China;
    3. Department of Electrical and Computer Engineering, University of Rochester, NY 14627-0231, U.S.A.
  • Online:2014-07-05 Published:2014-07-05
  • About author:Peng Liu received his B.S. and M.S. degrees both in optical engineering, and Ph.D. degree in communication and electrical engineering from Zhejiang University, Hangzhou, in 1992, 1996, and 1999, respectively. In 1999, he joined the faculty of the Information Science and Electronic Engineering Department, Zhejiang University, where he was promoted to associate professor in 2002. From 2009 to 2010, he was a visiting scholar at University of Rochester working on high-performance computer architectures.
  • Supported by:

    The work is supported by the Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao of the National Natural Science Foundation of China under Grant No. 61028004, and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2014A08.

片上多处理器的发展为数据和元数据管理提供了广阔的探索空间。一种高效的管理机制应该从细粒度的数据访问模式出发来进行设计。传统的元数据存储和处理器核数成平方倍增长关系,对于大规模片上多核系统显得开销较大。本文提出了一种元数据高速缓存框架来解决以下三个问题:(1)降低数据访问和一致性维护的时延,(2)减少元数据高速缓存的存储开销,(3)为其他访存技术提供优化支持。元数据高速缓存利用高效的硬件结构来动态地记录数据的访问模式。在此基础上管理解耦的数据和元数据的迁移和复制来降低访问时延。我们还利用元数据高速缓存框架作为系统构件对流式预取进行了优化。仿真结果表明:与传统的共享高速缓存机制相比,我们的解决方案在1/5的元数据高速缓存开销下实现了1.12倍的加速比。

Abstract: Chip multiprocessor presents brand new opportunities for holistic on-chip data and coherence management solutions. An intelligent protocol should be adaptive to the fine-grain accessing behavior. And in terms of storage of metadata, the size of conventional directory grows as the square of the number of processors, making it very expensive in large-scale systems. In this paper, we propose a metadata cache framework to achieve three goals (1) reducing the latency of data access and coherence activities, (2) saving the storage of metadata, and (3) providing support for other optimization techniques. The metadata is implemented with compact structures and tracks the dynamically changing access pattern. The pattern information is used to guide the delegation and replication of decoupled data and metadata to allow fast access. We also use our metadata cache as a building block to enhance the stream prefetching. Using detailed execution-driven simulation, we demonstrate that our protocol achieves an average speedup of 1.12X compared to a shared cache protocol with 1/5 of the storage of metadata.

[1] Conway P, Kalyanasundharam N, Donley G, Lepak K, Hughes B. Cache hierarchy and memory subsystem of the AMD Opteron processor. IEEE Micro, 2010, 30(2): 16-29.

[2] Kalla R, Sinharoy B, Starke W, Floyd M. Power7: IBM's next-generation server processor. IEEE Micro, 2010, 30(2): 7-15.

[3] Shah M, Barren J, Brooks J et al. UltraSPARC T2: A highly-treaded, power-efficient, SPARC SOC. In Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2007, pp.22-25.

[4] Mcnairy C, Bhatia R. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, 2005, 25(2): 10-20.

[5] Keltcher C N, Mcgrath K J, Ahmed A, Conway P. The AMD Opteron processor for multiprocessor servers. IEEE Micro, 2003, 23(2): 66-76.

[6] Barroso L A, Gharachorloo K, Mcnamara R, Nowatzyk A, Qadeer S, Sano B, Smith S, Stets R, Verghese B. Piranha: A scalable architecture based on single-chip multiprocessing. In Proc. the 27th Int. Symp. Comp. Arch., Jun. 2000, pp.282-293.

[7] Tendler J M, Dodson J S, Fields J S, Le H, Sinharoy B. POWER4 system microarchitecture. IBM Journal of Research and Development, 2002, 46(1): 5-25.

[8] Agarwal V, Hrishikesh M S, Keckler S W, Burger D. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proc. the 27th Int. Symp. Comp. Arch., Jun. 2000, pp.248-259.

[9] Ho R, Mai K W, Horowitz M A. The future of wires. Proceedings of the IEEE, 2001, 89(4): 490-504.

[10] Zhang M, Asanovi′c K. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proc. the 32nd Int. Symp. Comp. Arch., Jun. 2005, pp.336-345.

[11] Huh J, Kim C, Shafi H, Zhang L, Burger D, Keckler S W. A NUCA substrate for flexible CMP cache sharing. IEEE Trans. Parallel and Distributed Systems, 2007, 18(8): 10281040.

[12] Chishti Z, Powell M D, Vijaykumar T N. Optimizing replication, communication, and capacity allocation in CMPs. In Proc. the 32nd Int. Symp. Comp. Arch., June 2005, pp.357368.

[13] Hossain H, Dwarkadas S, Huang M C. DDCache: Decoupled and delegable cache data and metadata. In Proc. the 18th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2009, pp.227-236.

[14] Marty M R, Hill M D. Virtual hierarchies to support server consolidation. In Proc. the 34th Int. Symp. Comp. Arch., Jun. 2007, pp.46-56.

[15] Censier L M, Feautrier P. A new solution to coherence problems in multicache systems. IEEE Trans. Computers, 1978, 27(12): 1112-1118.

[16] Zebchuk J, Moshovos A. RegionTracker: A case for dualgrain tracking in the memory system. Technical Report, University of Toronto, 2006.

[17] Zebchuk J, Safi E, Moshovos A. A framework for coarsegrain optimizations in the on-chip memory hierarchy. In Proc. the 40th Int. Symp. Microarch., Dec. 2007, pp.314327.

[18] Bowen N S, Pradham D K. Processorand memory-based checkpoint and rollback recovery. Computer, 1993, 26(2): 22-31.

[19] Xue J, Garg A, Ciftcioglu B et al. An intra-chip free-space optical interconnect: Extended technical report. Technical Report, Dept. Electrical & Computer Engineering, University of Rochester, 2010.

[20] Compaq Computer Corporation. Alpha 21264/EV67 microprocessor hardware reference manual. Sept. 2000. http://download.majix.org/dec/21264ev67 hrm.pdf, May 2014.

[21] Burger D, Austin T M. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 1997, 25(3): 13-25.

[22] Sheng L, Ahn J, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Int. Symp. Microarch., Dec. 2009, pp.469-480.

[23] Woo S, Ohara M, Torrie E, Singh J, Gupta A. The SPLASH2 programs: Characterization and methodological considerations. In Proc. the 22nd Int. Symp. Comp. Arch., Jun. 1995, pp.24-36.

[24] Bienia C, Kumar S, Singh J, Li K. The PARSEC benchmark suite: Characterization and architectural implications. In Proc. the 17th Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2008, pp.72-81.

[25] Zhang M, Asanovic K. Victim migration: Dynamically adapting between private and shared CMP caches. Technical Report, MIT-CSAIL, Oct. 2005.

[26] Jouppi N P. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. the 17th Int. Symp. Comp. Arch., May 1990, pp.364-373.

[27] Ganusov I, Burtscher M. On the importance of optimizing the configuration of stream prefetchers. In Proc. the 2005 Workshop on Memory System Performance, Jun. 2005, pp.54-61.

[28] Ros A, Acacio M E, Garcia J M. Direct coherence: Bringing together performance and scalability in shared-memory multiprocessors. In Proc. the 14th Int. Conf. High Performance Computing, Dec. 2007, pp.147-160.

[29] Ros A, Acacio M E, Garcia J M. DiCo-CMP: Efficient cache coherency in tiled CMP architectures. In Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2008.

[30] Hossain H, Dwarkadas S, Huang M C. Improving support for locality and fine-grain sharing in chip multiprocessors. In Proc. the 17th Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2008, pp.155-165.

[31] Ros A, Kaxiras S. Complexity-effective multicore coherence. In Proc. the 21st Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2012, pp.241-252.

[32] Pugsley S H, Spjut J B, Nellans D W, Balasubramonian R. SWEL: Hardware cache coherence protocols to map shared data onto shared caches. In Proc. the 19th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2010, pp.465-476.

[33] Khan O, Hoffmann H, Lis M, Hijaz F, Agarwal A, Devadas S. ARCc: A case for an architecturally redundant cachecoherence architecture for large multicores. In Proc. the 29th IEEE Int. Conf. Computer Design, Oct. 2011, pp.411-418.

[34] Beckmann B M, Marty M R, Wood D A. ASR: Adaptive selective replication for CMP caches. In Proc. the 39th Int. Symp. Microarch., Dec. 2006, pp.443-454.

[35] Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proc. the 36th Int. Symp. Comp. Arch., Jun. 2009, pp.184-195.

[36] Agarwal A, Simoni R, Hennessy J et al. An evaluation of directory schemes for cache coherence. In Proc. the 15th Int. Symp. Comp. Arch., May 30-June 2, 1988, pp.280-289.

[37] Chaiken D, Kubiatowicz J, Agarwal A. LimitLESS directories: A scalable cache coherence scheme. In Proc. the 4th Int. Conf. Arch. Support for Prog. Lang. and Operating Systems, Apr. 1991, pp.224-234.

[38] Maa Y, Pradhan D, Thiebaut D. Two economical directory schemes for large-scale cache coherent multiprocessors. ACM SIGARCH Computer Architecture News, 1991, 19(5): 1018.

[39] Gupta A, Weber W, Mowry T. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proc. Int. Conf. Parallel Processing, Aug. 1990, pp.312-321.

[40] Sanchez D, Kozyrakis C. SCD: A scalable coherence directory with flexible sharer set encoding. In Proc. the 18th IEEE Int. Symp. High-Perf. Comp. Arch., Feb. 2012, pp.112.

[41] Zhao H, Shriraman A, Dwarkadas S. SPACE: Sharing pattern-based directory coherence for multicore scalability. In Proc. the 19th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2010, pp.135-146.

[42] Zhao H, Shriraman A, Dwarkadas S, Srinivasan V. SPATL: Honey, I shrunk the coherence directory. In Proc. Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2011, pp.33-44.

[43] Choi J, Park K. Segment directory enhancing the limited directory cache coherence schemes. In Proc. the 10th Int. Parallel and Distributed Processing Symp., Apr. 1999, pp.258267.

[44] Cuesta B, Ros A, G′omez M, Robles A, Duato J. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Proc. Int. Symp. Comp. Arch., Jun. 2011, pp.93-104.

[45] Alisafaee M. Spatiotemporal coherence tracking. In Proc. the 45th Int. Symp. Microarch., Dec. 2012, pp.341-350.

[46] Fang L, Liu P, Hu Q, Huang M C, Jiang G F. Building expressive, area-efficient coherence directories. In Proc. the 22nd Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2013, pp.299-308.

[47] Ebrahimi E, Mutlu O, Chang J L, Patt Y N. Coordinated control of multiple prefetchers in multi-core systems. In Proc. the 42nd Int. Symp. Microarch., Dec. 2009, pp.316326.

[48] Dahlgren F, Dubois M, Stenstrom P. Fixed and adaptive sequential prefetching in shared memory multiprocessors. In Proc. Int. Conf. Parallel Processing, Aug. 1993, pp.56-63.

[49] Nesbit K J, Dhodapkar A S, Smith J E. AC/DC: An adaptive data cache prefetcher. In Proc. the 13th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 29-Oct. 3, 2004, pp.135-145.

[50] Gendler A, Mendelson A, Birk Y. A PAB-based multiprefetcher mechanism. International Journal of Parallel Programming, 2006, 34(2): 171-188.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 张钹; 张铃;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[2] 孟力明; 徐晓飞; 常会友; 陈光熙; 胡铭曾; 李生;. A Tree-Structured Database Machine for Large Relational Database Systems[J]. , 1987, 2(4): 265 -275 .
[3] 林琦; 夏培肃;. The Design and Implementation of a Very Fast Experimental Pipelining Computer[J]. , 1988, 3(1): 1 -6 .
[4] 孙成政; 慈云桂;. A New Method for Describing the AND-OR-Parallel Execution of Logic Programs[J]. , 1988, 3(2): 102 -112 .
[5] 张钹; 张恬; 张建伟; 张铃;. Motion Planning for Robots with Topological Dimension Reduction Method[J]. , 1990, 5(1): 1 -16 .
[6] 朱明远;. Two Congruent Semantics for Prolog with CUT[J]. , 1990, 5(1): 82 -91 .
[7] 王鼎兴; 郑纬民; 杜晓黎; 郭毅可;. On the Execution Mechanisms of Parallel Graph Reduction[J]. , 1990, 5(4): 333 -346 .
[8] 周权; 魏道政;. A Complete Critical Path Algorithm for Test Generation of Combinational Circuits[J]. , 1991, 6(1): 74 -82 .
[9] 郭恒昌;. On the Characterization and Fault Identification of Sequentially t-Diagnosable System Under PMC Model[J]. , 1991, 6(1): 83 -90 .
[10] 赵靓海; 刘慎权;. An Environment for Rapid Prototyping of Interactive Systems[J]. , 1991, 6(2): 135 -144 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: