P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings

池也; 郭人通; 廖小飞; 刘海坤; 岳建辉

doi:10.1007/s11390-023-2561-y

Summary:

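Background: Compared with conventional DRAM, 3D-stacked memory offers more channels, higher bandwidth, and lower energy consumption, and it occupies less area. Because it is expensive, however, using it to replace main memory entirely is not a sound choice. Academia and industry therefore typically combine a small-capacity 3D-stacked memory with a large-capacity DRAM to form a hybrid 3D DRAM-DRAM memory. In this organization, the 3D-stacked memory usually serves as a cache for DRAM: it sits between the on-chip caches and the DRAM main memory and acts as a high-bandwidth, large-capacity fourth-level cache (L4 cache). A 3D DRAM cache is usually organized as either a direct-mapped or a set-associative cache, and each approach has its trade-offs. A direct-mapped cache can fetch the tag and the data concurrently, so its hit latency is low, but its replacement flexibility is limited, so its hit rate is low. A set-associative cache, by contrast, achieves a higher hit rate but must access the tag and then the data serially, so its hit latency is high.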
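The trade-off described above can be illustrated with a minimal sketch. It is a toy model rather than the evaluation setup of the paper: the function names, the modulo index function, the LRU policy, and the latency units are all assumptions chosen for illustration.

```python
from collections import OrderedDict

# Hypothetical cost units for one DRAM-cache access; not measured values.
TAG_ACCESS = 1
DATA_ACCESS = 1

# Direct-mapped: tag and data are fetched concurrently.
DIRECT_MAPPED_HIT_LATENCY = max(TAG_ACCESS, DATA_ACCESS)
# Set-associative: the tag must be read before the data location is known.
SET_ASSOCIATIVE_HIT_LATENCY = TAG_ACCESS + DATA_ACCESS


def direct_mapped_hits(trace, num_slots):
    """Count hits when every block has exactly one possible slot."""
    slots = [None] * num_slots
    hits = 0
    for block in trace:
        idx = block % num_slots          # assumed index function
        if slots[idx] == block:
            hits += 1
        else:
            slots[idx] = block           # conflict: overwrite the old block
    return hits


def set_associative_hits(trace, num_sets, ways):
    """Count hits with LRU replacement inside each set (assumed policy)."""
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for block in trace:
        lines = sets[block % num_sets]
        if block in lines:
            hits += 1
            lines.move_to_end(block)     # mark as most recently used
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[block] = True
    return hits


# Two blocks that collide in the direct-mapped cache, accessed alternately.
trace = [0, 8, 0, 8, 0, 8, 0, 8]
dm_hits = direct_mapped_hits(trace, num_slots=8)
sa_hits = set_associative_hits(trace, num_sets=4, ways=2)
```

With equal capacity (8 blocks), the alternating trace never hits in the direct-mapped model because both blocks compete for one slot, while the 2-way model hits on every access after the first two misses. The price is the serialized tag-then-data access on each hit.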

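Objective: We aim to find a new hybrid mapping scheme that achieves a hit latency as low as that of a direct-mapped cache and a hit rate as high as that of a set-associative cache.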

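Methods: We organize the DRAM cache as a set-associative cache, as in prior work, but further classify data blocks into leading blocks and following blocks. We find that the vast majority of leading blocks bear the tag access overhead for the entire cache set, whereas the blocks that follow benefit from the tag cache and can be looked up quickly. We therefore design a hybrid mapping scheme, P3DC. For leading blocks, we use a static mapping similar to that of a direct-mapped cache to reduce hit latency. For following blocks, we use a dynamic mapping similar to that of a set-associative cache to preserve replacement flexibility and raise the hit rate. For cache replacement, we design a global clock algorithm that gives leading blocks priority to stay in the DRAM cache and lowers their risk of eviction. We also design a high-frequency filtering mechanism that prevents blocks from changing type too often and thus reduces DRAM cache thrashing.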
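The placement idea can be sketched for a single cache set as follows. This is a simplified illustration under stated assumptions, not the P3DC implementation: the class `HybridSet`, the modulo static position, and the credit values are hypothetical, the clock here is local to one set rather than global, and the high-frequency filter is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    block: int
    leading: bool
    credit: int  # clock sweeps the line survives before it may be evicted


class HybridSet:
    """One set: static slots for leading blocks, any slot for following blocks."""

    LEADING_CREDIT = 2    # assumption: leading blocks get one extra chance
    FOLLOWING_CREDIT = 1

    def __init__(self, ways: int):
        self.lines: List[Optional[Line]] = [None] * ways
        self.hand = 0     # clock hand used to pick victims for following blocks

    def static_way(self, block: int) -> int:
        # Assumption: the static position is a simple modulo of the block number.
        return block % len(self.lines)

    def lookup(self, block: int) -> Optional[str]:
        """Return 'static' or 'dynamic' on a hit, None on a miss."""
        line = self.lines[self.static_way(block)]
        if line is not None and line.leading and line.block == block:
            line.credit = self.LEADING_CREDIT
            return "static"       # position known up front, like direct mapping
        for line in self.lines:
            if line is not None and line.block == block:
                line.credit = self.FOLLOWING_CREDIT
                return "dynamic"  # position found through the tags
        return None

    def insert(self, block: int, leading: bool) -> Optional[int]:
        """Place a missing block; return the evicted block number, if any."""
        way = self.static_way(block) if leading else self._clock_victim()
        victim = self.lines[way]
        credit = self.LEADING_CREDIT if leading else self.FOLLOWING_CREDIT
        self.lines[way] = Line(block, leading, credit)
        return victim.block if victim is not None else None

    def _clock_victim(self) -> int:
        # Sweep until an empty way or a line with no credit left is found.
        while True:
            way = self.hand
            self.hand = (self.hand + 1) % len(self.lines)
            line = self.lines[way]
            if line is None or line.credit == 0:
                return way
            line.credit -= 1
```

In this sketch a leading block is only ever probed at its static way, which is what would allow tag and data to be fetched together, while a following block may sit in any way and is located through the tags. Giving leading blocks a larger credit is one simple way to approximate their priority to stay in the cache; the paper's global clock algorithm may differ in its details.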

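Results: In terms of system performance, P3DC improves average performance by 12% (up to 66%) over the state-of-the-art direct-mapped cache and by 6% (up to 19%) over the state-of-the-art set-associative cache. Compared with the set-associative cache, P3DC lowers the average hit latency by 20.5%, to about 1.17 times that of the direct-mapped cache. At the same time, P3DC raises the cache hit rate by 32% to 63% over the direct-mapped cache and is only 3% below the set-associative cache. In summary, P3DC combines low cache hit latency with a high cache hit rate and performs well across all types of applications.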
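The two latency figures can be placed on one scale with a short calculation. This is a back-of-the-envelope sketch that assumes both figures refer to the same average hit latency; the normalization to 1.0 and the variable names are illustrative, and the derived direct-mapped value is not a number reported in the paper.

```python
# Normalize the set-associative average hit latency to 1.0 (assumed baseline).
set_associative_latency = 1.0

# P3DC lowers the average hit latency by 20.5% relative to set-associative.
p3dc_latency = set_associative_latency * (1 - 0.205)

# P3DC's hit latency is about 1.17 times that of the direct-mapped cache.
direct_mapped_latency = p3dc_latency / 1.17

print(f"P3DC:          {p3dc_latency:.3f}")
print(f"direct-mapped: {direct_mapped_latency:.3f}")
```

On this scale P3DC sits at roughly 0.80 and the direct-mapped cache at roughly 0.68, so P3DC recovers most, though not all, of the latency gap between the two baseline organizations.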

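Conclusions: The results show that the hybrid mapping scheme P3DC delivers low cache hit latency and a high cache hit rate at the same time. This means that hybrid mapping can make effective use of the system even when the memory access characteristics of an application are unknown in advance. Note, however, that when handling following blocks, hybrid mapping must access the tag row and the data row concurrently, which consumes more bandwidth than direct mapping. Bandwidth optimization strategies are therefore worth further study.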

Abstract: Die-stacked dynamic random access memory (DRAM) caches are increasingly advocated to bridge the performance gap between the on-chip cache and the main memory. To fully realize their potential, it is essential to improve the DRAM cache hit rate and lower its hit latency. To combine the high hit rate of set-associative mapping with the low hit latency of direct mapping, we propose a partial direct-mapped die-stacked DRAM cache called P3DC. This design is motivated by a key observation: applying a unified mapping policy to different types of blocks cannot achieve a high cache hit rate and low hit latency simultaneously. To address this problem, P3DC classifies data blocks into leading blocks and following blocks, and places them at static positions and dynamic positions, respectively, in a unified set-associative structure. We also propose a replacement policy to balance the miss penalty and the temporal locality of different blocks. In addition, P3DC provides a policy to mitigate cache thrashing due to block type variations. Experimental results demonstrate that P3DC can reduce the cache hit latency by 20.5% while achieving a similar cache hit rate compared with typical set-associative caches. P3DC improves the instructions per cycle (IPC) by up to 66% (12% on average) compared with the state-of-the-art direct-mapped cache, BEAR, and by up to 19% (6% on average) compared with the tag-data decoupled set-associative cache, DEC-A8.


P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings