[1] Conway P, Kalyanasundharam N, Donley G, Lepak K, Hughes B. Cache hierarchy and memory subsystem of the AMD Opteron processor. IEEE Micro, 2010, 30(2): 16-29.[2] Kalla R, Sinharoy B, Starke W, Floyd M. Power7: IBM's next-generation server processor. IEEE Micro, 2010, 30(2): 7-15.[3] Shah M, Barren J, Brooks J et al. UltraSPARC T2: A highly-treaded, power-efficient, SPARC SOC. In Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2007, pp.22-25.[4] Mcnairy C, Bhatia R. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, 2005, 25(2): 10-20.[5] Keltcher C N, Mcgrath K J, Ahmed A, Conway P. The AMD Opteron processor for multiprocessor servers. IEEE Micro, 2003, 23(2): 66-76.[6] Barroso L A, Gharachorloo K, Mcnamara R, Nowatzyk A, Qadeer S, Sano B, Smith S, Stets R, Verghese B. Piranha: A scalable architecture based on single-chip multiprocessing. In Proc. the 27th Int. Symp. Comp. Arch., Jun. 2000, pp.282-293.[7] Tendler J M, Dodson J S, Fields J S, Le H, Sinharoy B. POWER4 system microarchitecture. IBM Journal of Research and Development, 2002, 46(1): 5-25.[8] Agarwal V, Hrishikesh M S, Keckler S W, Burger D. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proc. the 27th Int. Symp. Comp. Arch., Jun. 2000, pp.248-259.[9] Ho R, Mai K W, Horowitz M A. The future of wires. Proceedings of the IEEE, 2001, 89(4): 490-504.[10] Zhang M, Asanovi′c K. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proc. the 32nd Int. Symp. Comp. Arch., Jun. 2005, pp.336-345.[11] Huh J, Kim C, Shafi H, Zhang L, Burger D, Keckler S W. A NUCA substrate for flexible CMP cache sharing. IEEE Trans. Parallel and Distributed Systems, 2007, 18(8): 10281040.[12] Chishti Z, Powell M D, Vijaykumar T N. Optimizing replication, communication, and capacity allocation in CMPs. In Proc. the 32nd Int. Symp. Comp. Arch., June 2005, pp.357368.[13] Hossain H, Dwarkadas S, Huang M C. DDCache: Decoupled and delegable cache data and metadata. In Proc. the 18th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2009, pp.227-236.[14] Marty M R, Hill M D. Virtual hierarchies to support server consolidation. In Proc. the 34th Int. Symp. Comp. Arch., Jun. 2007, pp.46-56.[15] Censier L M, Feautrier P. A new solution to coherence problems in multicache systems. IEEE Trans. Computers, 1978, 27(12): 1112-1118.[16] Zebchuk J, Moshovos A. RegionTracker: A case for dualgrain tracking in the memory system. Technical Report, University of Toronto, 2006.[17] Zebchuk J, Safi E, Moshovos A. A framework for coarsegrain optimizations in the on-chip memory hierarchy. In Proc. the 40th Int. Symp. Microarch., Dec. 2007, pp.314327.[18] Bowen N S, Pradham D K. Processorand memory-based checkpoint and rollback recovery. Computer, 1993, 26(2): 22-31.[19] Xue J, Garg A, Ciftcioglu B et al. An intra-chip free-space optical interconnect: Extended technical report. Technical Report, Dept. Electrical & Computer Engineering, University of Rochester, 2010.[20] Compaq Computer Corporation. Alpha 21264/EV67 microprocessor hardware reference manual. Sept. 2000. http://download.majix.org/dec/21264ev67 hrm.pdf, May 2014.[21] Burger D, Austin T M. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 1997, 25(3): 13-25.[22] Sheng L, Ahn J, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Int. Symp. Microarch., Dec. 2009, pp.469-480.[23] Woo S, Ohara M, Torrie E, Singh J, Gupta A. The SPLASH2 programs: Characterization and methodological considerations. In Proc. the 22nd Int. Symp. Comp. Arch., Jun. 1995, pp.24-36.[24] Bienia C, Kumar S, Singh J, Li K. The PARSEC benchmark suite: Characterization and architectural implications. In Proc. the 17th Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2008, pp.72-81.[25] Zhang M, Asanovic K. Victim migration: Dynamically adapting between private and shared CMP caches. Technical Report, MIT-CSAIL, Oct. 2005.[26] Jouppi N P. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. the 17th Int. Symp. Comp. Arch., May 1990, pp.364-373.[27] Ganusov I, Burtscher M. On the importance of optimizing the configuration of stream prefetchers. In Proc. the 2005 Workshop on Memory System Performance, Jun. 2005, pp.54-61.[28] Ros A, Acacio M E, Garcia J M. Direct coherence: Bringing together performance and scalability in shared-memory multiprocessors. In Proc. the 14th Int. Conf. High Performance Computing, Dec. 2007, pp.147-160.[29] Ros A, Acacio M E, Garcia J M. DiCo-CMP: Efficient cache coherency in tiled CMP architectures. In Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2008.[30] Hossain H, Dwarkadas S, Huang M C. Improving support for locality and fine-grain sharing in chip multiprocessors. In Proc. the 17th Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2008, pp.155-165.[31] Ros A, Kaxiras S. Complexity-effective multicore coherence. In Proc. the 21st Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2012, pp.241-252.[32] Pugsley S H, Spjut J B, Nellans D W, Balasubramonian R. SWEL: Hardware cache coherence protocols to map shared data onto shared caches. In Proc. the 19th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2010, pp.465-476.[33] Khan O, Hoffmann H, Lis M, Hijaz F, Agarwal A, Devadas S. ARCc: A case for an architecturally redundant cachecoherence architecture for large multicores. In Proc. the 29th IEEE Int. Conf. Computer Design, Oct. 2011, pp.411-418.[34] Beckmann B M, Marty M R, Wood D A. ASR: Adaptive selective replication for CMP caches. In Proc. the 39th Int. Symp. Microarch., Dec. 2006, pp.443-454.[35] Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proc. the 36th Int. Symp. Comp. Arch., Jun. 2009, pp.184-195.[36] Agarwal A, Simoni R, Hennessy J et al. An evaluation of directory schemes for cache coherence. In Proc. the 15th Int. Symp. Comp. Arch., May 30-June 2, 1988, pp.280-289.[37] Chaiken D, Kubiatowicz J, Agarwal A. LimitLESS directories: A scalable cache coherence scheme. In Proc. the 4th Int. Conf. Arch. Support for Prog. Lang. and Operating Systems, Apr. 1991, pp.224-234.[38] Maa Y, Pradhan D, Thiebaut D. Two economical directory schemes for large-scale cache coherent multiprocessors. ACM SIGARCH Computer Architecture News, 1991, 19(5): 1018.[39] Gupta A, Weber W, Mowry T. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proc. Int. Conf. Parallel Processing, Aug. 1990, pp.312-321.[40] Sanchez D, Kozyrakis C. SCD: A scalable coherence directory with flexible sharer set encoding. In Proc. the 18th IEEE Int. Symp. High-Perf. Comp. Arch., Feb. 2012, pp.112.[41] Zhao H, Shriraman A, Dwarkadas S. SPACE: Sharing pattern-based directory coherence for multicore scalability. In Proc. the 19th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2010, pp.135-146.[42] Zhao H, Shriraman A, Dwarkadas S, Srinivasan V. SPATL: Honey, I shrunk the coherence directory. In Proc. Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2011, pp.33-44.[43] Choi J, Park K. Segment directory enhancing the limited directory cache coherence schemes. In Proc. the 10th Int. Parallel and Distributed Processing Symp., Apr. 1999, pp.258267.[44] Cuesta B, Ros A, G′omez M, Robles A, Duato J. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Proc. Int. Symp. Comp. Arch., Jun. 2011, pp.93-104.[45] Alisafaee M. Spatiotemporal coherence tracking. In Proc. the 45th Int. Symp. Microarch., Dec. 2012, pp.341-350.[46] Fang L, Liu P, Hu Q, Huang M C, Jiang G F. Building expressive, area-efficient coherence directories. In Proc. the 22nd Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2013, pp.299-308.[47] Ebrahimi E, Mutlu O, Chang J L, Patt Y N. Coordinated control of multiple prefetchers in multi-core systems. In Proc. the 42nd Int. Symp. Microarch., Dec. 2009, pp.316326.[48] Dahlgren F, Dubois M, Stenstrom P. Fixed and adaptive sequential prefetching in shared memory multiprocessors. In Proc. Int. Conf. Parallel Processing, Aug. 1993, pp.56-63.[49] Nesbit K J, Dhodapkar A S, Smith J E. AC/DC: An adaptive data cache prefetcher. In Proc. the 13th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 29-Oct. 3, 2004, pp.135-145.[50] Gendler A, Mendelson A, Birk Y. A PAB-based multiprefetcher mechanism. International Journal of Parallel Programming, 2006, 34(2): 171-188. |