Journal of Computer Science and Technology, 2019, Vol. 34, Issue (6): 1167-1184. doi: 10.1007/s11390-019-1968-y

Special Topic: Data Management and Data Mining


HybridTune: Spatio-Temporal Performance Data Correlation for Performance Diagnosis of Big Data Systems

Rui Ren (1,2), Member, CCF, IEEE; Jiechao Cheng (3); Xi-Wen He (1); Lei Wang (1), Member, CCF; Jian-Feng Zhan (1,*), Member, CCF, ACM, IEEE; Wan-Ling Gao (1), Member, CCF, ACM, IEEE; Chun-Jie Luo (1,2), Member, CCF

  • 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 School of Computing, National University of Singapore, Singapore 117417, Singapore
  • Received: 2018-09-06; Revised: 2019-09-04; Online: 2019-11-16; Published: 2019-11-16
  • Contact: Jian-Feng Zhan, E-mail: zhanjianfeng@ict.ac.cn
  • About the author: Rui Ren received her B.S. degree in computer science from Sichuan University, Chengdu, in 2009, her M.S. degree in computer architecture from the Chinese Academy of Sciences, Beijing, in 2012, and her Ph.D. degree in computer software and theory from the Chinese Academy of Sciences, Beijing, in 2019. She is currently an engineer at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. Her research interests include big data, performance analysis, and optimization.
  • Supported by:
    This work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000601.

Abstract: With rapidly growing interest in Big Data, improving the performance of Big Data systems is becoming increasingly important. The first step is to analyze and diagnose the performance bottlenecks of these systems. Currently, there are two major solutions: the purely data-driven diagnosis approach, which may be very time-consuming, and the rule-based analysis method, which usually requires prior knowledge. For Big Data applications such as Spark workloads, we observe that the tasks in the same stage normally execute the same or similar code on each data partition. On the basis of this stage similarity and the distributed characteristics of Big Data systems, we analyze the behavior of Big Data applications in terms of both the system and micro-architectural metrics of each stage. Furthermore, for different performance problems, we propose a hybrid approach that combines prior rules and machine learning algorithms to detect performance anomalies such as straggler tasks, task assignment imbalance, data skew, abnormal nodes, and outlier metrics. Following this methodology, we design and implement a lightweight, extensible tool named HybridTune, and measure its overhead and anomaly detection effectiveness using the BigDataBench benchmarks. Our experiments show that the overhead of HybridTune is only 5%, and the accuracy of the outlier detection algorithm reaches up to 93%. Finally, we report several use cases diagnosing Spark and Hadoop workloads with BigDataBench, which demonstrate the potential use of HybridTune.

Key words: Big Data system, spatio-temporal correlation, rule-based diagnosis, machine learning
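
The abstract describes the stage-similarity idea only at a high level. As a rough illustration, and not the HybridTune implementation, the following Python sketch shows how a simple prior rule could exploit that similarity: since tasks in one stage run similar code, a task far slower than the stage median is suspicious, and if its input is also much larger than the median the cause may be data skew. The function names, the 1.5x threshold, and the sample numbers are all assumptions made for this example.

```python
from statistics import median

def find_stragglers(durations, ratio=1.5):
    """Return indices of tasks whose duration exceeds ratio * stage median.

    The ratio of 1.5 is an assumed, illustrative threshold.
    """
    cutoff = ratio * median(durations)
    return [i for i, d in enumerate(durations) if d > cutoff]

def classify_stragglers(durations, input_sizes, ratio=1.5):
    """Label each straggler as 'data_skew' or 'other'.

    A straggler is attributed to data skew when its input size is also
    more than ratio * the median input size of the stage (hypothetical rule).
    """
    size_cutoff = ratio * median(input_sizes)
    return {
        i: ("data_skew" if input_sizes[i] > size_cutoff else "other")
        for i in find_stragglers(durations, ratio)
    }

# Made-up measurements for the seven tasks of one stage.
stage_durations = [10.0, 11.0, 9.5, 10.5, 31.0, 10.2, 24.0]  # seconds
stage_input = [64, 64, 60, 66, 200, 64, 65]                  # MB

result = classify_stragglers(stage_durations, stage_input)
print(result)
```

In this sketch, a straggler labeled "other" would be the kind of case handed to further analysis, for instance the node-level or metric-level outlier detection that the abstract mentions.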
