An MPI-Based Acceleration Method for Iterative Big Data Computing

Accelerating Iterative Big Data Computing Through MPI

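  • Summary: Many iterative big data algorithms are currently implemented as applications on Apache Hadoop using the MapReduce programming paradigm. Apache Spark has been proposed as a newer framework for iterative big data computing, and it improves performance through in-memory techniques. Our previous work showed that MPI can be extended to support big data computing through communication based on key-value pairs. In this paper, building on DataMPI, we propose an event-driven design to support iterative big data computing. This design lets DataMPI execute communication and computation tasks concurrently and efficiently in iterative workloads. Experimental results show that DataMPI-Iteration achieves a 9x to 21x speedup over Apache Hadoop and a 2x to 3x speedup over Apache Spark.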

     

    Abstract: The currently popular systems Hadoop and Spark cannot achieve satisfactory performance on iterative big data applications because they overlap computation and communication inefficiently. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running a PageRank workload, and then propose DataMPI-Iteration, an MPI-based library for iterative big data computing that uses an event-driven pipeline and an in-memory shuffle design to better overlap computation and communication. Our performance evaluation shows that DataMPI-Iteration achieves a 9x to 21x speedup over Apache Hadoop and a 2x to 3x speedup over Apache Spark for PageRank and K-means.
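To make the idea of overlapping computation and communication concrete, the following is a minimal single-process sketch of one PageRank iteration in which key-value pairs are aggregated as they arrive instead of after the emit phase has finished. It is not the DataMPI-Iteration API: the function names, the `queue.Queue` standing in for the network shuffle, and the dictionary used as an in-memory shuffle buffer are all assumptions made for illustration. Python threads also do not run CPU-bound code in parallel, so the sketch shows only the structure of the pipeline, not its performance.

```python
import queue
import threading
from collections import defaultdict


def pagerank_iteration(graph, ranks, damping=0.85):
    """Run one PageRank iteration with emit and aggregate steps overlapped.

    graph maps each vertex to the list of vertices it links to.
    Assumption: every vertex has at least one outgoing link; vertices
    without links are skipped here, so their rank mass would be lost.
    """
    channel = queue.Queue()      # hypothetical stand-in for the shuffle transport
    sums = defaultdict(float)    # in-memory shuffle buffer keyed by vertex
    done = object()              # sentinel marking the end of the stream

    def receiver():
        # Event-driven side: fold each (key, value) pair in on arrival.
        while True:
            item = channel.get()
            if item is done:
                break
            key, value = item
            sums[key] += value

    worker = threading.Thread(target=receiver)
    worker.start()

    # Compute side: emit rank contributions while the receiver aggregates.
    for vertex, neighbors in graph.items():
        if not neighbors:
            continue
        share = ranks[vertex] / len(neighbors)
        for neighbor in neighbors:
            channel.put((neighbor, share))

    channel.put(done)
    worker.join()

    n = len(graph)
    return {v: (1.0 - damping) / n + damping * sums[v] for v in graph}


def pagerank(graph, iterations=10):
    """Iterate from a uniform start; ranks stay in memory between iterations."""
    ranks = {v: 1.0 / len(graph) for v in graph}
    for _ in range(iterations):
        ranks = pagerank_iteration(graph, ranks)
    return ranks
```

Calling `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})` returns a rank per vertex. In a distributed setting the queue would be replaced by message passing between processes, which is where the overlap described in the abstract would pay off.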

     
