wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems

高琬蓉; 方建滨; 黄春; 徐传福; 王峥

doi:10.1007/s11390-021-1251-x

Abstract

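Background: In recent years, ARMv8-based many-core CPUs have become a strong alternative for building high-performance computing (HPC) systems. Cache performance matters greatly on HPC many-core systems, because the workloads they run often cause frequent inter-core communication, which affects program execution time. To exploit the hardware's potential performance, an important software-optimization task is to match memory access patterns to the underlying cache architecture and coherency protocol. However, the cache usually works as a "black box" and contains many implementation details that are not available to software developers. Micro-benchmarking is an effective way to reveal the underlying cache architecture. Micro-benchmarks have been widely used to characterize and evaluate the memory hierarchy of traditional x86 many-core systems, but there is still little work on dissecting the memory hierarchy design of high-performance ARMv8 many-core systems.
Objective: This study aims to fill the gap in research on the memory hierarchy of ARMv8 many-core systems by developing a dedicated benchmark suite for characterizing it. The benchmarks reveal micro-architectural details in terms of latency and bandwidth, compare the cache architecture designs of three representative ARMv8 processors, and yield guidelines for optimizing software memory access on ARMv8 many-core systems.
Methods: We extended the BenchIT benchmark suite according to the architectural and ISA differences between x86 and ARMv8, adapting it to ARMv8 systems in how it obtains architecture parameters, sets the initial state of cache lines, times code with assembly instructions, transfers data, and invalidates cache lines. These extensions and refinements form a new open-source benchmark suite, wrBench, which measures inter-core communication performance between the cores of an ARMv8 system.
Results: We selected three representative ARMv8 systems, Phytium 2000+, ThunderX2, and KP920, as experimental platforms to demonstrate the potential of wrBench. The experimental results show that wrBench provides a quantitative performance characterization of the ARMv8 many-core memory hierarchy. The experimental data are listed in detail in the paper.
Conclusions: The experiments show that our extended benchmarks are effective. Based on the measurements, we find that the three ARMv8 processors each have strengths and weaknesses in cache organization and coherency protocol. The communication performance we measure cannot be used to directly identify the best architecture design, but it can help identify the communication bottlenecks of parallel algorithms. In addition, based on the communication performance, we propose guidelines for improving parallel program performance by optimizing memory access. In the future, once ARMv9 machines become available, we will extend wrBench to support the latest ARM machines. We will also combine the memory access patterns of different applications with the communication performance of different architectures to determine the best correspondence between them. In this way, we hope to find the ARM architecture design that performs best across the widest range of applications.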

Abstract: Cache performance is a critical design constraint for modern many-core systems. Since the cache often works in a "black-box" manner, it is difficult for software developers to reason about cache behavior and match the running software to the underlying hardware. To better support code optimization, we need to understand and characterize the cache behavior. While cache performance characterization has been heavily studied on traditional x86 architectures, there is little work on understanding the cache implementations of emerging ARMv8-based many-cores. This paper presents a comprehensive study that evaluates the cache architecture design of three representative ARMv8 many-cores: Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite that measures the realized latency and bandwidth of caches at different levels of the memory hierarchy during core-to-core communication. Our evaluation provides inter-core latency and bandwidth at different cache levels and in different coherency states for the three ARMv8 many-cores. The quantitative performance data is presented in tables. By analyzing the data, we characterize the caches and coherency protocols of the three processors. Our paper also provides discussions and guidelines for optimizing memory access on ARMv8 many-cores.
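
As an illustration of how a latency micro-benchmark of this kind is commonly structured, the sketch below builds a pointer-chasing chain and times dependent accesses along it. This is a generic technique and not wrBench's actual implementation; the function names (`build_chase_chain`, `chase`, `time_per_access_ns`) are hypothetical. A real measurement would be written in C or assembly with hardware timers and pinned threads, since Python interpreter overhead dominates the numbers reported here.

```python
import random
import time


def build_chase_chain(num_lines, seed=0):
    """Return a list where chain[i] is the next slot to visit after slot i.

    Sattolo's algorithm produces a random permutation consisting of a single
    cycle, so the chase visits every slot once per lap and a hardware
    prefetcher cannot easily predict the next access.
    """
    rng = random.Random(seed)
    chain = list(range(num_lines))
    for i in range(num_lines - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps, start=0):
    """Follow the chain for `steps` dependent accesses; return the final slot."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]  # each access depends on the previous result
    return idx


def time_per_access_ns(chain, laps=10):
    """Average wall-clock time per dependent access, in nanoseconds."""
    steps = laps * len(chain)
    t0 = time.perf_counter_ns()
    chase(chain, steps)
    t1 = time.perf_counter_ns()
    return (t1 - t0) / steps
```

In a native version, the size of the chain would be swept across the capacity of each cache level, and the lines would first be placed in a chosen coherency state by another core before the measuring core starts the chase.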
