HybridTune: Performance Diagnosis of Big Data Systems Based on Spatio-Temporal Data Correlation

doi:10.1007/s11390-019-1968-y


HybridTune: Spatio-Temporal Performance Data Correlation for Performance Diagnosis of Big Data Systems

Abstract

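Abstract (from the Chinese version): As Big Data grows, improving the performance of Big Data systems becomes increasingly important. The first step toward improving performance is usually to analyze and diagnose the system's performance bottlenecks. A purely data-driven diagnosis approach can be very time-consuming, while a rule-based analysis method usually requires prior knowledge.
For Big Data applications such as Spark, we observe that tasks in the same stage execute the same or similar code on different data partitions. Based on this stage similarity and the distributed nature of Big Data systems, we analyze the behavior of Big Data applications using the system-level and microarchitecture-level metrics of each stage. For different performance problems, we propose a hybrid approach that combines prior rules with machine learning algorithms to detect performance anomalies such as straggler tasks, task assignment imbalance, data skew, abnormal nodes, and outlier metrics. We also design and implement a lightweight, extensible tool named HybridTune, and use the BigDataBench benchmark suite to measure its overhead and anomaly detection effectiveness. Experimental results show that the overhead of HybridTune is only 5% and that anomaly detection accuracy reaches 93%. Finally, we report several diagnosis use cases, which show that HybridTune can effectively detect a variety of performance anomalies.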

Abstract: With the tremendous growth of interest in Big Data, improving the performance of Big Data systems is becoming more and more important. Among many steps, the first one is to analyze and diagnose the performance bottlenecks of Big Data systems. Currently, there are two major solutions. One is the pure data-driven diagnosis approach, which may be very time-consuming; the other is the rule-based analysis method, which usually requires prior knowledge. For Big Data applications like Spark workloads, we observe that the tasks in the same stage normally execute the same or similar code on each data partition. On the basis of the stage similarity and distributed characteristics of Big Data systems, we analyze the behavior of Big Data applications in terms of both the system and micro-architectural metrics of each stage. Furthermore, for different performance problems, we propose a hybrid approach that combines prior rules and machine learning algorithms to detect performance anomalies, such as straggler tasks, task assignment imbalance, data skew, abnormal nodes, and outlier metrics. Following this methodology, we design and implement a lightweight, extensible tool named HybridTune, and measure the overhead and anomaly detection effectiveness of HybridTune using the BigDataBench benchmarks. Our experiments show that the overhead of HybridTune is only 5%, and the accuracy of the outlier detection algorithm reaches up to 93%. Finally, we report several use cases diagnosing Spark and Hadoop workloads using BigDataBench, which demonstrate the potential use of HybridTune.
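The stage-similarity observation suggests one simple kind of prior rule: since tasks in the same stage run similar code, a task that takes much longer than its stage peers is a candidate straggler. The following is a minimal sketch of such a rule, not HybridTune's actual detector. The function name `find_stragglers`, the input layout, and the 1.5x-median threshold are all assumptions made for illustration.

```python
from statistics import median


def find_stragglers(durations_by_stage, threshold=1.5):
    """Flag tasks that run much longer than their stage peers.

    durations_by_stage maps a stage id to a list of task durations (seconds).
    A task is flagged when its duration exceeds threshold * stage median.
    The threshold value is an illustrative assumption, not taken from the paper.
    """
    stragglers = {}
    for stage_id, durations in durations_by_stage.items():
        if not durations:
            continue  # nothing to compare in an empty stage
        cutoff = threshold * median(durations)
        slow = [i for i, d in enumerate(durations) if d > cutoff]
        if slow:
            stragglers[stage_id] = slow
    return stragglers


# Hypothetical task durations for two stages.
example = {
    "stage-0": [10.2, 9.8, 10.5, 31.0],
    "stage-1": [4.0, 4.1, 3.9, 4.2],
}
print(find_stragglers(example))
```

Comparing each task against the median of its own stage, rather than against a global value, is what makes use of stage similarity: stages with different workloads are never compared with each other.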
