Fatman: A Reliable Archival Storage System Built on Low-Cost Volunteer Resources

覃安; 胡殿明; 刘俊; 杨文君; 谭待

doi:10.1007/s11390-015-1521-6

Fatman: Building Reliable Archival Storage Based on Low-Cost Volunteer Resources

Abstract

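Summary: This article presents Fatman, an enterprise-grade archival storage system built from volunteer-contributed resources. The volunteers are typically web servers with idle storage space, deployed across tens of thousands of nodes. Fatman maximizes the use of volunteer resources without affecting existing service levels, and minimizes redundant resources without compromising reliability. In doing so it raises the resource utilization of already-deployed clusters and reduces the hardware cost that applications incur for storage.
Fatman is now widely deployed on tens of thousands of nodes across several data centers, providing up to 100 PB of storage capacity and serving dozens of internal big-data applications. Through strong resource isolation and quota-based resource limiting and scheduling, Fatman maximizes the use of volunteer-contributed resources with no loss to application service levels. It also pioneers the modeling of disk failures to guide data recovery and scheduling policies, which reduces the system's mean time to repair (MTTR) by 76.3% and lowers the file corruption rate by 35% under real production workloads, achieving high reliability.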

Abstract: We present Fatman, an enterprise-scale archival storage system built on resources volunteered by underutilized web servers, which are usually deployed on thousands of nodes with spare storage capacity. Fatman is specifically designed to improve the utilization of existing storage resources and to cut hardware purchase costs. The system design addresses two major concerns: maximizing the resource utilization of volunteer nodes without violating service level objectives (SLOs), and minimizing cost without reducing the availability of the archival system. Fatman has been widely deployed on tens of thousands of server nodes across several datacenters, providing more than 100 PB of storage capacity and serving dozens of internal mass-data applications. The system achieves efficient storage quota consolidation through strong isolation and budget limitation, supporting maximal resource contribution without any degradation of host-level SLOs. It also improves data reliability with a novel approach called fault-aware data management, which applies disk failure prediction to reduce failure recovery cost. On a real-life production workload, this reduces the mean time to repair (MTTR) by 76.3% and decreases the file crash ratio by 35%.
