2016, Vol. 31, Issue (1): 20-35. doi: 10.1007/s11390-016-1609-7

Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •

BACH: A Bandwidth-Aware Hybrid Cache Hierarchy Design with Nonvolatile Memories

Jishen Zhao¹, Member, ACM, IEEE; Cong Xu², Member, ACM, IEEE; Tao Zhang³, Member, ACM, IEEE; and Yuan Xie⁴, Fellow, IEEE, Member, ACM

  ¹ Department of Computer Engineering, University of California at Santa Cruz, Santa Cruz, CA 95064, U.S.A.;
  ² Hewlett-Packard Labs, Palo Alto, CA 94304, U.S.A.;
  ³ NVIDIA Corporation, Santa Clara, CA 95050, U.S.A.;
  ⁴ Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, CA 93106, U.S.A.
  • Received: 2015-09-08 Revised: 2015-12-10 Online: 2016-01-05 Published: 2016-01-05
  • About the author: Jishen Zhao received her Ph.D. degree in computer engineering from Pennsylvania State University, University Park, in 2014. She is now an assistant professor at the University of California at Santa Cruz. Her research interests are computer architecture and electronic design automation, with an emphasis on emerging technologies and high-performance computing. She is a member of ACM and IEEE.

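Limited memory bandwidth is increasingly one of the main bottlenecks of multicore performance. However, raising memory bandwidth directly by changing the memory hardware design increases memory cost and power consumption. To address this problem, this paper uses an effective cache design to reduce the bandwidth that software demands from memory. We design a cache system based on nonvolatile memory technologies and give it a reconfiguration capability to make it more flexible. The hybrid cache mixes several memory device technologies, that is, it builds different cache levels with different technologies, in order to optimize the performance and energy consumption of the whole cache system. The reconfiguration capability predicts the cache bandwidth that software will demand and, while the software runs, changes the available capacity of each cache level and the bandwidth it can provide. In this way, our design better matches the bandwidth demands of software. Simulation results show that, compared with a traditional SRAM-based cache design, our design improves performance by 58% for multithreaded workloads and by 14% for multiprogrammed workloads.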

Abstract: Limited main memory bandwidth is becoming a fundamental performance bottleneck in chip-multiprocessor (CMP) design. Yet directly increasing the peak memory bandwidth can incur high cost and power consumption. In this paper, we address this problem by proposing BACH, a bandwidth-aware reconfigurable cache hierarchy built with hybrid memory technologies. The BACH design consists of a hybrid cache hierarchy, a reconfiguration mechanism, and a statistical prediction engine. The hybrid cache hierarchy configures each level with a memory technology chosen for its bandwidth characteristics, such as spin-transfer torque memory (STT-MRAM), resistive memory (ReRAM), or embedded DRAM (eDRAM), so that the peak bandwidth of the overall cache hierarchy is optimized. The reconfiguration mechanism dynamically adjusts the cache capacity of each level based on the predicted bandwidth demands of the running workloads; the prediction engine produces these bandwidth predictions. We evaluate the system performance gain obtained by the BACH design with a set of multithreaded and multiprogrammed workloads, both with and without a system power budget limit. Compared with a traditional SRAM-based cache design, BACH improves system throughput by 58% with multithreaded workloads and by 14% with multiprogrammed workloads.
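
The interplay between bandwidth prediction and capacity reconfiguration can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the exponentially weighted moving average stands in for the statistical prediction engine, which the abstract does not specify. The `CacheConfig` type, the selection rule (largest capacity whose peak bandwidth still covers the predicted demand), and all capacity and bandwidth figures are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """One hypothetical configuration of a single cache level."""
    capacity_kb: int
    peak_bw_gbps: float


def predict_demand(history, alpha=0.5):
    """Stand-in predictor: EWMA over past bandwidth samples (GB/s)."""
    if not history:
        raise ValueError("history must contain at least one sample")
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


def choose_config(configs, demand_gbps):
    """Pick the largest capacity whose peak bandwidth meets the demand.

    Falls back to the highest-bandwidth configuration if none suffices.
    """
    feasible = [c for c in configs if c.peak_bw_gbps >= demand_gbps]
    if feasible:
        return max(feasible, key=lambda c: c.capacity_kb)
    return max(configs, key=lambda c: c.peak_bw_gbps)


# Made-up numbers; assumes a smaller capacity sustains a higher peak bandwidth.
L2_CONFIGS = [
    CacheConfig(capacity_kb=256, peak_bw_gbps=64.0),
    CacheConfig(capacity_kb=512, peak_bw_gbps=48.0),
    CacheConfig(capacity_kb=1024, peak_bw_gbps=32.0),
]

demand = predict_demand([20.0, 40.0, 40.0])  # predicted demand in GB/s
chosen = choose_config(L2_CONFIGS, demand)   # configuration for the next interval
```

In this toy setting the selection runs once per interval for each cache level. A real design would also have to account for the cost of resizing, such as flushing or migrating cache lines, which the sketch ignores.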

Copyright © Editorial Office of Journal of Computer Science and Technology