A Distributed Computing Acceleration Method Based on Reduce-Phase Task Scheduling

董加卿; 何泽昊; 龚媛媛; 于沛文; 田臣; 窦万春; 陈贵海; 夏耐; 管浩然

doi:10.1007/s11390-022-2118-5

SMART: Speedup Job Completion Time by Scheduling Reduce Tasks

Abstract

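As information technology develops rapidly, data volumes are growing exponentially, and distributed computing systems are widely used for data processing and analysis.
Job completion time (JCT) is an important metric of how effectively a distributed computing system processes data jobs, and reducing it has become a concern shared by academia and industry.
Data skew is very common in massive-data workloads and readily degrades the performance of such systems. Based on observation and analysis of data skew, the authors propose SMART, a JCT optimization method based on task scheduling in the reduce phase.
The authors first show theoretically that largest-data-first scheduling in the reduce phase can improve performance by up to 30% over the original scheduling. Building on this result, SMART exploits the prevalence of data skew: it uses the data volumes of the already-completed portion of the work to predict the relative data sizes of the remaining unfinished tasks, and thereby schedules reduce tasks largest-data-first.
SMART is implemented on Hadoop, a widely used distributed computing framework, and can be deployed with only a very small change to the system code. Its effectiveness and robustness are validated in extensive simulations and real-environment experiments.
The experimental results show that SMART significantly reduces the JCT of typical distributed data processing jobs. For example, for the typical jobs Terasort, WordCount, and InvertedIndex with 280 GB of data, SMART reduces JCT by 6.47%, 9.26%, and 13.66%, respectively, which shows good practical engineering value.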
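
To illustrate why start order matters under data skew, here is a minimal sketch that compares an arbitrary start order with largest-first order on a fixed number of reduce slots. It is not the authors' analysis or implementation: the function name `makespan`, the task sizes, and the slot count are hypothetical, and run time is assumed to be proportional to task size.

```python
import heapq


def makespan(task_sizes, num_slots):
    """Return the finish time when tasks start in the given order,
    each on whichever slot becomes free first.

    Assumption: a task's run time equals its size (arbitrary units).
    """
    slots = [0.0] * num_slots      # finish time of each slot
    heapq.heapify(slots)
    for size in task_sizes:
        free_at = heapq.heappop(slots)        # earliest-free slot
        heapq.heappush(slots, free_at + size)
    return max(slots)


# Hypothetical skewed reduce-task sizes: one task dominates.
sizes = [2, 3, 2, 4, 3, 10]

arrival_order = makespan(sizes, num_slots=2)
largest_first = makespan(sorted(sizes, reverse=True), num_slots=2)

print(arrival_order, largest_first)
```

In this toy input the large task arrives last, so starting it first lets the small tasks fill the other slot while it runs.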

Abstract: Distributed computing systems have been widely used as the amount of data grows exponentially in the era of information explosion. Job completion time (JCT) is a major metric for assessing their effectiveness. How to reduce the JCT of these systems through reasonable scheduling has become a hot issue in both industry and academia. Data skew is a common phenomenon that can compromise the performance of such distributed computing systems. This paper proposes SMART, which effectively reduces the JCT by handling data skew during the reduce phase. SMART predicts the sizes of reduce tasks from the portion of map tasks that have completed and then enforces largest-first scheduling in the reduce phase according to the predicted sizes. SMART makes minimal modifications to the original Hadoop, with only 20 additional lines of code, and is readily deployable. The robustness and effectiveness of SMART have been evaluated on a real-world cluster against a large number of datasets. Experiments show that SMART reduces JCT by up to 6.47%, 9.26%, and 13.66% for Terasort, WordCount, and InvertedIndex, respectively, with the Purdue MapReduce benchmarks suite (PUMA) dataset.
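
The prediction step can be sketched as follows, assuming each finished map task reports how many bytes it produced for each reduce partition. This is an illustrative sketch rather than SMART's actual code: `predict_reduce_order` and the input layout are hypothetical, and the prediction is only the relative ranking of partitions, taken from the partial totals.

```python
def predict_reduce_order(completed_map_outputs, num_reducers):
    """Rank reduce partitions from largest to smallest predicted size.

    completed_map_outputs: list of dicts {partition_id: bytes}, one per
    map task that has already finished (hypothetical input format).
    Assumption: partitions that are large in the finished map outputs
    remain large once all map tasks finish, because the skew is stable.
    """
    totals = [0] * num_reducers
    for output in completed_map_outputs:
        for partition, size in output.items():
            totals[partition] += size
    # Largest partial total first; ties keep the lower partition id first.
    return sorted(range(num_reducers), key=lambda p: -totals[p])


# Two finished map tasks out of many; partition 1 is clearly the heaviest.
finished = [
    {0: 10, 1: 50, 2: 5},
    {0: 12, 1: 40, 2: 8},
]
order = predict_reduce_order(finished, num_reducers=3)
print(order)
```

The scheduler would then launch reduce tasks in the returned order, so the partitions predicted to be heaviest start first.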
