NapFS: A High-Performance Persistent Memory File System for Non-Uniform Memory Access Architectures

贾文庆; 蒋德钧; 熊劲

doi:10.1007/s11390-024-4075-7

NapFS: A High-Performance Persistent Memory File System for Non-Uniform Memory Access Architectures

Summary:
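Background: Persistent memory (PM) offers byte addressability, near-DRAM latency, and persistence. Building on these advantages, many projects build file systems on PM so that file data can be persisted directly in memory. As application data volumes grow, PM file systems must provide larger capacity. Because the number of memory slots per CPU socket and the capacity of a single PM module are both limited, a principal way to expand file system capacity is to build the file system across the PM attached to multiple sockets, that is, to build a PM file system on a NUMA architecture.
Objective: On a NUMA architecture, accessing file data on PM across sockets degrades performance. The goal is to address the impact of NUMA on PM access when building a PM file system, so that capacity can grow while high performance is preserved.
Methods: The authors first run experiments to understand how the NUMA architecture affects the construction of PM file systems. They then propose four design principles for building a high-performance PM file system for NUMA architectures and design NapFS, a NUMA-aware PM file system. NapFS has a local PM file system on each socket and a dedicated IO thread pool for each socket. This allows applications to delegate data accesses to IO threads and thereby avoid remote PM accesses, and it also reuses existing single-socket PM file systems to reduce implementation complexity. In addition, NapFS uses fast DRAM to improve performance by adding a global cache, and it adopts selective caching to eliminate the redundant double-copy overhead in data synchronization operations. Finally, the authors extend NapFS with fine-grained locking and a scheduling policy to improve scalability and the IO performance of critical requests.
Results: The authors evaluate NapFS against other PM file systems. The results show that NapFS improves throughput by 2.2x on Filebench and by 1.0x on RocksDB.
Conclusion: NapFS uses IO delegation to avoid remote PM accesses, uses a DRAM cache to accelerate IO, and reuses mature file systems to reduce implementation complexity. Experiments show that NapFS outperforms other multi-socket PM file systems on both micro-benchmarks and macro-benchmarks.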

Abstract: Persistent memory (PM) allows file systems to directly persist data on the memory bus. To increase the capacity of PM file systems, building a file system across sockets, each with its own attached PM, is attractive. However, accessing data across sockets is subject to the effects of the non-uniform memory access (NUMA) architecture, which lead to significant performance degradation. In this paper, we first use experiments to understand the NUMA impacts on building PM file systems. We then propose four design principles for building a high-performance PM file system for the NUMA architecture, and present NapFS, which follows them. We architect NapFS with per-socket local PM file systems and per-socket dedicated IO thread pools. This not only allows applications to delegate data accesses to IO threads to avoid remote PM accesses, but also fully reuses existing single-socket PM file systems to reduce implementation complexity. Additionally, NapFS utilizes fast DRAM to accelerate performance by adding a global cache, and adopts a selective cache mechanism to eliminate the redundant double-copy overhead for synchronization operations. Lastly, we show that NapFS can adopt extended optimizations to improve scalability and the performance of critical requests. We evaluate NapFS against other multi-socket PM file systems. The evaluation results show that NapFS achieves 2.2x and 1.0x throughput improvement for Filebench and RocksDB, respectively.
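
The delegation idea in the abstract (per-socket local file systems plus per-socket IO thread pools) can be illustrated with a small toy model. This is only a sketch under stated assumptions and is not NapFS code: the class `DelegatingStore`, the two-socket layout, the pool size, and the path-based placement policy are all hypothetical, a Python dict stands in for each socket's local PM file system, and no real NUMA pinning is performed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

NUM_SOCKETS = 2          # assumption: a two-socket machine
THREADS_PER_SOCKET = 2   # assumption: illustrative pool size


class DelegatingStore:
    """Toy model: each socket owns a local store and a dedicated IO thread pool."""

    def __init__(self):
        # One dict per socket stands in for that socket's local PM file system.
        self.local_fs = [dict() for _ in range(NUM_SOCKETS)]
        # One pool per socket; in a real system its threads would be pinned
        # to that socket's cores (not done here).
        self.pools = [
            ThreadPoolExecutor(max_workers=THREADS_PER_SOCKET,
                               thread_name_prefix=f"io-socket{s}")
            for s in range(NUM_SOCKETS)
        ]

    def home_socket(self, path):
        # Hypothetical placement policy: derive the owning socket from the path.
        return sum(path.encode()) % NUM_SOCKETS

    def write(self, path, data):
        # Delegate the write to an IO thread of the socket that owns the data.
        s = self.home_socket(path)
        self.pools[s].submit(self.local_fs[s].__setitem__, path, bytes(data)).result()

    def read(self, path):
        # Delegate the read the same way, so the data is touched only locally.
        s = self.home_socket(path)
        return self.pools[s].submit(self.local_fs[s].get, path).result()

    def served_by(self, path):
        # Name of an IO thread that would serve requests for this path.
        s = self.home_socket(path)
        return self.pools[s].submit(lambda: threading.current_thread().name).result()

    def close(self):
        for pool in self.pools:
            pool.shutdown()
```

In this model the calling thread never touches a socket's store directly; it always hands the request to a thread from the owning socket's pool, which mirrors the goal of avoiding remote PM accesses described above.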
