2010, Vol. 25, Issue (2): 246-256.

Special Issue: Computer Architecture and Systems

• Special Section on CPU Researches in China •

Hierarchical Cache Directory for CMP

Song-Liu Guo¹ (郭松柳); Hai-Xia Wang² (王海霞), Senior Member, CCF; Yi-Bo Xue² (薛一波), Senior Member, CCF; Chong-Min Li¹ (李崇民), Student Member, CCF; and Dong-Sheng Wang¹,² (汪东升), Senior Member, CCF

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  2. Tsinghua National Laboratory of Information Science and Technology, Beijing 100084, China
  • Received: 2009-06-10; Revised: 2009-10-27; Online: 2010-03-05; Published: 2010-03-05
  • About the authors:
    Song-Liu Guo received his B.S. degree from Tsinghua University. His research interests include high performance computing and multiprocessor architecture.
    Hai-Xia Wang received her B.S. degree from Nankai University and her Ph.D. degree from the Chinese Academy of Sciences. She is an associate professor at Tsinghua University. Her research interests include high performance computing, multiprocessor architecture, and formal verification.
    Yi-Bo Xue received his B.S. and M.S. degrees from Harbin Institute of Technology and his Ph.D. degree from the Chinese Academy of Sciences. He is a professor at Tsinghua University. His research interests include high performance computing, parallel processing, and network security.
    Chong-Min Li received his B.S. and M.S. degrees from Daqing Petroleum Institute and Tsinghua University, respectively. He is a student member of the China Computer Federation. His research interests include high performance computing and multiprocessor architecture.
    Dong-Sheng Wang received his B.S. and Ph.D. degrees from Harbin Institute of Technology. He is a professor at Tsinghua University and a senior member of the China Computer Federation. His research interests include computer architecture, high performance computing, storage and file systems, and network security.
  • Supported by:

    This work is supported by the National Natural Science Foundation of China under Grant Nos. 60673145, 60773146 and 60833004.

As more processing cores are integrated onto a single chip and feature sizes continue to shrink, the average latency of accessing remote nodes under a directory-based coherence protocol rises, which significantly degrades system performance. Previous techniques such as data replication and data migration optimize performance for the requesting core but offer little improvement for neighboring nodes. Other techniques, such as in-transit optimization, reduce latency at the cost of increased storage. This paper introduces a hierarchical cache directory into the CMP (chip multiprocessor), which divides the CMP tiles hierarchically into multiple regions, and combines it with data replication. A new directory organization is proposed to record the sharing status within a region and to help the regional home complete operations efficiently. Simulation results show that, for a 16-core CMP, the hierarchical cache directory reduces average access latency by 9% and on-chip network traffic by 34% on average compared with a traditional directory, while using less storage. Theoretical analysis shows that, for a $2^n \times 2^n$ tiled CMP, the average access latency of the hierarchical cache directory asymptotically approaches a function that is independent of $n$; hence the architecture is highly scalable.
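The scaling concern raised in the abstract can be illustrated with a small Python sketch. It is not the paper's latency model: it only computes the mean Manhattan hop count between two tiles of a 2D mesh, assuming uniformly random requester and home tiles and a mesh interconnect (both are assumptions made here). The helper `region_of` is a hypothetical way to group tiles into power-of-two square regions and is not taken from the paper.

```python
from itertools import product


def mean_hops(k: int) -> float:
    """Mean Manhattan distance between two uniformly chosen tiles of a k x k mesh.

    Self-pairs are included, so the closed form is 2 * (k*k - 1) / (3 * k).
    """
    tiles = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for ax, ay in tiles for bx, by in tiles)
    return total / len(tiles) ** 2


def region_of(x: int, y: int, level: int) -> tuple[int, int]:
    """Hypothetical region id: the 2^level x 2^level block containing tile (x, y)."""
    return (x >> level, y >> level)


if __name__ == "__main__":
    # Mean hop count grows roughly linearly with the mesh side length.
    for n in range(1, 5):
        k = 2 ** n
        print(f"{k}x{k} mesh: mean hops = {mean_hops(k):.3f}")
    # Tile (5, 6) at two levels of the assumed region hierarchy.
    print(region_of(5, 6, 1), region_of(5, 6, 2))
```

Under these assumptions, the mean distance to a uniformly placed home tile keeps growing with the mesh, while a request served inside a fixed-size region sees a distance that does not depend on the total tile count. That is the intuition behind keeping sharing information at a regional home.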


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences