

A Task Allocation Method for Stream Processing with Recovery Latency Constraint


Abstract: Stream processing applications continuously process large amounts of online streaming data in real time or near real time. They have strict latency constraints. However, the continuous processing model makes them vulnerable to failures, and recoveries may slow down the entire processing pipeline or even break the latency constraints. The upstream backup scheme is one of the most widely applied fault-tolerant schemes for stream processing systems. It introduces complex backup dependencies among upstream and downstream tasks, which increases the difficulty of controlling recovery latencies. Moreover, when dependent tasks are located on the same processor, they fail simultaneously in a processor-level failure, bringing extra recovery latency that amplifies the impact of the failure. This paper studies the relationship between task allocation and the recovery latency of a stream processing application. We present a correlated failure effect model to describe the recovery latency of a stream topology under processor-level failures for a given task allocation plan. We introduce the recovery-latency-aware task allocation problem (RTAP), which seeks task allocation plans for stream topologies that achieve guaranteed recovery latencies. We discuss how RTAP differs from classic task allocation problems and present a heuristic algorithm with a computational complexity of O(n log^2 n) to solve it. Extensive experiments verify the correctness and effectiveness of our approach, which improves resource usage by 15%-20% on average compared with existing approaches.
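
The abstract only states the idea of RTAP; as a purely illustrative aid, the following is a minimal, hypothetical sketch (in Python) of a greedy recovery-latency-aware allocation. It is not the paper's RTAP heuristic: the latency model (recovery cost of a processor failure approximated by the longest chain of co-located dependent tasks) and all names (allocate, longest_colocated_chain, latency_bound, etc.) are assumptions made for illustration only.

    """
    Hypothetical sketch of recovery-latency-aware greedy task allocation.
    NOT the paper's RTAP algorithm; it only illustrates that co-locating
    dependent tasks on one processor lengthens the recovery chain after a
    processor-level failure, so an allocator should bound that chain.
    """
    from collections import defaultdict

    def longest_colocated_chain(tasks_on_proc, upstream, recovery_time):
        """Estimated recovery latency of one processor failure: the longest
        dependency chain among tasks on that processor, where each task
        contributes its individual recovery time (assumed additive)."""
        on_proc = set(tasks_on_proc)
        memo = {}

        def chain(t):
            if t in memo:
                return memo[t]
            best = 0
            for u in upstream.get(t, ()):
                if u in on_proc:
                    best = max(best, chain(u))
            memo[t] = best + recovery_time[t]
            return memo[t]

        return max((chain(t) for t in on_proc), default=0)

    def allocate(tasks, upstream, recovery_time, load, capacity,
                 num_procs, latency_bound):
        """Greedy first-fit: place each task on the least-loaded processor
        whose capacity and estimated recovery latency both stay within
        bounds. Returns {task: processor} or None if no plan is found."""
        placement = {}
        proc_tasks = defaultdict(list)
        proc_load = defaultdict(float)

        # Heavier tasks first, a common bin-packing heuristic.
        for t in sorted(tasks, key=lambda t: -load[t]):
            candidates = sorted(range(num_procs), key=lambda p: proc_load[p])
            for p in candidates:
                if proc_load[p] + load[t] > capacity:
                    continue
                trial = proc_tasks[p] + [t]
                if longest_colocated_chain(trial, upstream,
                                           recovery_time) > latency_bound:
                    continue  # co-locating t here would break the bound
                placement[t] = p
                proc_tasks[p].append(t)
                proc_load[p] += load[t]
                break
            else:
                return None  # no feasible processor under this heuristic
        return placement

    if __name__ == "__main__":
        # Toy 4-task pipeline: a -> b -> c -> d
        tasks = ["a", "b", "c", "d"]
        upstream = {"b": ["a"], "c": ["b"], "d": ["c"]}
        recovery_time = {t: 1.0 for t in tasks}   # seconds to replay one task
        load = {t: 0.4 for t in tasks}            # CPU share per task
        plan = allocate(tasks, upstream, recovery_time, load,
                        capacity=1.0, num_procs=3, latency_bound=2.0)
        print(plan)

In this toy model, placing a whole pipeline segment on one processor raises that processor's estimated recovery latency, which is exactly the correlated failure effect the abstract refers to; the paper's actual model and its O(n log^2 n) heuristic are more involved than this sketch.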
