The Value of a Small Microkernel for Dreamy Memory and the RAMpage Memory Hierarchy

Philip Machanick

Abstract: This paper explores the potential for the RAMpage memory hierarchy to use a microkernel with a small memory footprint, held in a specialized cache-speed static RAM (tightly-coupled memory, TCM). Dreamy memory is DRAM kept in low-power mode unless referenced. Simulations show that a small microkernel suits RAMpage well, in that it achieves significantly better speed and energy gains than a standard hierarchy from adding TCM. RAMpage, in its best 128KB L2 case, gained 11% speed using TCM and reduced energy by 14%. Equivalent conventional hierarchy gains were under 1%. While a 1MB L2 was significantly faster than the lower-energy cases with the smaller L2, the larger SRAM's energy does not justify the speed gain. Using a 128KB L2 cache in a conventional architecture resulted in a best-case overall run time of 2.58s, compared with the best dreamy mode run time (RAMpage without context switches on misses) of 3.34s, a speed penalty of 29%. Energy in the fastest 128KB L2 case was 2.18J vs. 1.50J, a reduction of 31%. The same RAMpage configuration without dreamy mode took 2.83s as simulated and used 2.39J, an acceptable trade-off (penalty under 10%) for being able to switch easily to a lower-energy mode.

The RAMpage memory hierarchy moves main memory up a level: the lowest-level cache is replaced by an SRAM main memory, and DRAM becomes a paging device. Previous work has shown that RAMpage is a potentially viable design in terms of hardware-software trade-offs, and that it scales as the CPU-DRAM speed gap grows, particularly when context switches are taken on misses to DRAM. Earlier work on low-energy design with RAMpage has also shown promise. This paper examines the value of RAMpage for low-energy design further, through a more complete study of the dreamy memory idea. Although waking main memory incurs a significant overhead, RAMpage has previously been shown to hide overheads of this kind in compensating for the CPU-DRAM speed gap. In desktop and server designs, where processors consume tens of watts or even 100W, reducing memory power is not a major concern. In low-energy designs, however, DRAM energy becomes significant. The approach studied here implements dreamy memory with the self-refresh mode of double data rate synchronous DRAM (DDR-DRAM), which retains the DRAM contents at 1% of normal power. The simulations use parameters suited to mobile devices, with the aim of bringing DRAM energy as close as possible to that of self-refresh mode while keeping DRAM performance as close as possible to that of full-power mode.
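The percentages quoted in the abstract can be checked from the reported run times and energies. The following is a minimal sketch of that arithmetic only. The function and constant names are illustrative and do not come from the paper, and the pairing of each figure with a configuration (conventional 128KB L2 as the baseline for every comparison) is an assumption based on one reading of the abstract.

```python
# Figures as quoted in the abstract. The labels are an assumed reading:
# the conventional 128KB L2 hierarchy is taken as the baseline throughout.
CONVENTIONAL_TIME_S = 2.58    # best-case run time, conventional 128KB L2
CONVENTIONAL_ENERGY_J = 2.18  # energy, fastest 128KB L2 case
DREAMY_TIME_S = 3.34          # best dreamy mode run time (RAMpage)
DREAMY_ENERGY_J = 1.50        # energy in dreamy mode
NON_DREAMY_TIME_S = 2.83      # same RAMpage configuration, no dreamy mode
NON_DREAMY_ENERGY_J = 2.39


def relative_increase(baseline: float, value: float) -> float:
    """Fraction by which value exceeds baseline."""
    return (value - baseline) / baseline


def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which value falls below baseline."""
    return (baseline - value) / baseline


# Dreamy mode against the conventional hierarchy.
speed_penalty = relative_increase(CONVENTIONAL_TIME_S, DREAMY_TIME_S)
energy_saving = relative_reduction(CONVENTIONAL_ENERGY_J, DREAMY_ENERGY_J)

# RAMpage without dreamy mode against the conventional hierarchy.
non_dreamy_time_penalty = relative_increase(CONVENTIONAL_TIME_S, NON_DREAMY_TIME_S)
non_dreamy_energy_penalty = relative_increase(CONVENTIONAL_ENERGY_J, NON_DREAMY_ENERGY_J)

if __name__ == "__main__":
    print(f"dreamy speed penalty:      {speed_penalty:.1%}")
    print(f"dreamy energy reduction:   {energy_saving:.1%}")
    print(f"non-dreamy time penalty:   {non_dreamy_time_penalty:.1%}")
    print(f"non-dreamy energy penalty: {non_dreamy_energy_penalty:.1%}")
```

Under this reading, the first two values round to the 29% and 31% stated in the abstract, and both non-dreamy penalties fall below 10%.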
