We use cookies to improve your experience with our site.

Bigflow:一种分布式计算框架的通用优化层

Bigflow: A General Optimization Layer for Distributed Computing Frameworks

  • 摘要: 随着数据量的大幅增加,分布式计算已成为数据中心中处理海量数据集的通用方法。虽然已经有了很多相关的计算模型和数据集抽象,但绝大多数都采用了单层的数据分区模型,这导致开发人员经常需要手动实现一些多次分区的操作,这样既增加了开发难度,又可能导致程序难以维护。为了解决这一问题,我们提出了NDD编程模型,它可以很大程度上减轻程序员在数据分区方面的的开发困难。
    NDD模型允许用户对某一数据集进行多次分区,最后形成一个多层的分区结构,如先按颜色分大类,再按形状分小类。除了对数据集进一步细分,NDD也支持对底层小类进行重合并,回归上一层分区结构。
    基于NDD模型,我们实现了一个开源的编程框架Bigflow。它可以看作是现有分布式系统的一个通用优化层,能够提供许多自动优化策略。我们将Bigflow对接在Spark和Hadoop上,发现对多层分区的基准测试,性能分别提升了30%和50%,同时与Hadoop相比,代码量也有很大的降低。目前,Bigflow已被应用于百度的数据中心,日均处理数据为3PB量级。根据用户的反馈,Bigflow可以极大减轻他们的开发负担,且为应用提供了非常好的性能。
    NDD在Spark、Hadoop等批处理分布式计算引擎上工作练好,我们目前在探索如何在Flink等流式数据引擎上使用相同的思路,在不降低性能的前提下为用户提供更加方便的多层分区模型。

     

    Abstract: As data volumes grow rapidly, distributed computations are widely employed in data-centers to provide cheap and efficient methods to process large-scale parallel datasets. Various computation models have been proposed to improve the abstraction of distributed datasets and hide the details of parallelism. However, most of them follow the single-layer partitioning method, which limits developers to express a multi-level partitioning operation succinctly. To overcome the problem, we present the NDD (Nested Distributed Dataset) data model. It is a more compact and expressive extension of Spark RDD (Resilient Distributed Dataset), in order to remove the burden on developers to manually write the logic for multi-level partitioning cases. Base on the NDD model, we develop an open-source framework called Bigflow, which serves as an optimization layer over computation engines from most widely used processing frameworks. With the help of Bigflow, some advanced optimization techniques, which may only be applied by experienced programmers manually, are enabled automatically in a distributed data processing job. Currently, Bigflow is processing about 3 PB data volumes daily in the data-centers of Baidu. According to customer experience, it can significantly save code length and improve performance over the intuitive programming style.

     

/

返回文章
返回