2016, Vol. 31, Issue (5): 1038-1052. doi: 10.1007/s11390-016-1678-7

Topic: Artificial Intelligence and Pattern Recognition

Special Section on Selected Paper from NPC 2011

UiLog: Improving Log-Based Fault Diagnosis by Log Analysis

De-Qing Zou, Member, CCF, Hao Qin, and Hai Jin, Fellow, CCF, Senior Member, IEEE, Member, ACM   

  1. Services Computing Technology and System Laboratory, Huazhong University of Science and Technology, Wuhan 430074, China;
    Big Data Technology and System Laboratory, Huazhong University of Science and Technology, Wuhan 430074, China;
    Cluster and Grid Computing Laboratory, Huazhong University of Science and Technology, Wuhan 430074, China;
    School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
  • Received: 2015-05-18 Revised: 2016-02-04 Online: 2016-09-05 Published: 2016-09-05
  • About author: De-Qing Zou received his Ph.D. degree in computer architecture from the Huazhong University of Science and Technology (HUST), Wuhan, in 2004. He is a professor of computer science at HUST. His main research interests include system security, trusted computing, virtualization, and cloud security. He has led one National High Technology Research and Development 863 Program of China project and three NSFC (National Natural Science Foundation of China) projects, and has been a core member of several important national projects, such as projects of the National Basic Research 973 Program of China. He has applied for almost 20 patents and has published two books (“Xen Virtualization Technologies” and “Trusted Computing Technologies and Principles”) and more than 50 high-quality papers, including papers in IEEE Transactions on Dependable and Secure Computing and the IEEE Symposium on Reliable Distributed Systems. He has served as a reviewer for several prestigious journals, such as IEEE TPDS, IEEE TOC, IEEE TDSC, and IEEE TCC. He is on the editorial boards of four international journals and has served as a PC chair or PC member for more than 40 international conferences.
  • Supported by:

    This work was supported by the National Basic Research 973 Program of China under Grant No. 2014CB340600, the National Natural Science Foundation of China under Grant No. 61272072, and the Program for New Century Excellent Talents in University of China under Grant No. NCET-13-0241.


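In modern computer systems, logs have always been the primary source for obtaining a system's running status and diagnosing system faults. However, with the development of cloud computing and cluster environments, system and software architectures have become increasingly complex. Frequent interactions among software and hardware at different layers make the system tightly coupled and make it harder to diagnose the faults that occur in it. This paper proposes UiLog, a unified log management and analysis system that manages the fault logs generated by all components of the entire cloud environment in one place and analyzes the current running status of the system in real time. When a fault occurs, UiLog determines the fault type and orders the fault logs by the causal order in which the faults occurred, helping administrators diagnose the fault. UiLog first collects and manages all logs in the system centrally, which ensures that fault information remains available even if the system crashes. UiLog adopts a new fault classification method that reduces the classification process's dependence on a manually built knowledge base, and it uses a fault keyword matrix to classify fault logs effectively in real time. In addition, UiLog improves traditional time-based fault correlation analysis: it uses the results of log fault classification to determine the sizes of different time windows, which improves the accuracy of fault correlation analysis and helps administrators find the root cause of a fault. Experimental results show that UiLog can comprehensively manage the logs in a system, classify logs by fault type, and mine the root causes of specific faults.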
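The fault keyword matrix mentioned above can be pictured as a table whose rows are fault types and whose columns are keywords. The following is a minimal sketch under stated assumptions, not UiLog's implementation: the fault types, keywords, and weights are hypothetical and filled in by hand purely for illustration, whereas UiLog aims to reduce reliance on a manually built knowledge base. Each log message becomes a 0/1 keyword vector, each fault type is scored by the dot product of its matrix row with that vector, and the highest-scoring type is returned (or `unknown` if no keyword matches).

```python
from __future__ import annotations

# Hypothetical fault types and keywords, for illustration only.
FAULT_TYPES = ["disk", "network", "memory"]
KEYWORDS = ["i/o", "sector", "timeout", "unreachable", "oom", "segfault"]

# KEYWORD_MATRIX[t][k]: hypothetical weight of keyword k as evidence for
# fault type t (rows follow FAULT_TYPES, columns follow KEYWORDS).
KEYWORD_MATRIX = [
    [1.0, 1.0, 0.3, 0.0, 0.0, 0.0],  # disk
    [0.0, 0.0, 0.7, 1.0, 0.0, 0.0],  # network
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.8],  # memory
]


def keyword_vector(message: str) -> list[int]:
    """Return a 0/1 vector marking which keywords occur in the message
    (case-insensitive substring match)."""
    text = message.lower()
    return [1 if kw in text else 0 for kw in KEYWORDS]


def classify(message: str) -> str:
    """Score each fault type by the dot product of its matrix row with the
    message's keyword vector; return the best type, or "unknown"."""
    vec = keyword_vector(message)
    scores = [sum(w * x for w, x in zip(row, vec)) for row in KEYWORD_MATRIX]
    best = max(range(len(FAULT_TYPES)), key=lambda i: scores[i])
    return FAULT_TYPES[best] if scores[best] > 0 else "unknown"
```

Because each message needs only a keyword scan and a small matrix-vector product, a classifier of this shape is cheap enough to run as logs arrive, which is what real-time classification depends on. For example, `classify("kernel: end_request: I/O error, dev sda, sector 1234")` returns `"disk"` with the hypothetical weights above.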

Abstract: In modern computer systems, system event logs have always been the primary source for checking system status. As computer systems become more complex, interactions between software and hardware become more frequent, and the components generate enormous amounts of log information, including running reports and fault information. Analyzing this volume of data manually is a great challenge. In this paper, we implement a log information management and analysis system that helps system administrators understand the real-time status of the entire system, classifies logs into different fault types, and determines the root causes of faults. In addition, we improve the existing fault correlation analysis method by using the results of system log classification. We evaluate the system in a cloud computing environment. The results show that our system classifies fault logs automatically and effectively, and that with it, administrators can easily detect the root causes of faults.
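Classic time-based correlation (time coalescence) merges log events that fall within one fixed time window; the improvement described above instead uses the classification result to choose different window sizes. The minimal sketch below assumes one hypothetical window per fault type, selected by the type of the first event in a group, and treats the earliest event of each group as a root-cause candidate. The window values, the `Event` type, and the earliest-event rule are illustrative assumptions, not the paper's exact method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since a common epoch
    fault_type: str   # fault-type label assigned by log classification
    message: str


# Hypothetical per-fault-type windows in seconds (illustrative values only).
WINDOWS = {"disk": 60.0, "network": 10.0, "memory": 30.0}
DEFAULT_WINDOW = 5.0  # used for fault types missing from WINDOWS


def correlate(events: list[Event]) -> list[list[Event]]:
    """Group events in time order. An event joins the current group when its
    gap to the group's latest event is within the window of the group's
    first (initiating) fault type; otherwise it starts a new group."""
    groups: list[list[Event]] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if groups:
            current = groups[-1]
            window = WINDOWS.get(current[0].fault_type, DEFAULT_WINDOW)
            if ev.timestamp - current[-1].timestamp <= window:
                current.append(ev)
                continue
        groups.append([ev])
    return groups


def root_cause_candidates(groups: list[list[Event]]) -> list[Event]:
    """Simplifying assumption: the earliest event in a group is the candidate."""
    return [group[0] for group in groups]
```

With these hypothetical values, a disk error at t = 0 s followed by network timeouts at 40 s and 45 s stays in one group because the disk window is 60 s, whereas a single fixed 10 s window would separate the disk error from the timeouts that may be its symptoms.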

Copyright © Editorial Office of Journal of Computer Science and Technology