2011, Vol. 26, Issue (2): 328-342. DOI: 10.1007/s11390-011-1135-6

Special Issue: Software Systems

• Software Engineering •

Software Defect Detection with ROCUS

Yuan Jiang (姜远), Member, CCF, Ming Li (黎铭), Member, CCF, ACM, IEEE, and Zhi-Hua Zhou (周志华), Senior Member, CCF, IEEE, Member, ACM   

  1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
  • Received: 2009-05-15; Revised: 2010-10-26; Online: 2011-03-05; Published: 2011-03-05
  • About author: Yuan Jiang received the Ph.D. degree in computer science from Nanjing University, China, in 2004. She is now an associate professor at the Department of Computer Science and Technology, Nanjing University. Her research interests include machine learning, information retrieval and data mining. In these areas, she has published more than 30 technical papers in refereed journals or conferences. She served as the publication chair of PAKDD'07, and served as a program committee member for many conferences such as CCTA'07, BIBE'07, ICNC'08, ICNC'09, CCDM'09, NCIIP'09 and CCML'10. She is a committee member of the Machine Learning Society of the Chinese Association of Artificial Intelligence (CAAI), and a committee member of the Artificial Intelligence Society of the Jiangsu Computer Association. She is a member of CCF. She was selected for the Program for New Century Excellent Talents in University by the Ministry of Education in 2009.
    Ming Li received the B.Sc. and Ph.D. degrees in computer science from Nanjing University, China, in 2003 and 2008 respectively. He is currently an assistant professor with the LAMDA Group, Department of Computer Science and Technology, Nanjing University. His major research interests include machine learning, data mining and information retrieval, especially learning with labeled and unlabeled data. He has been granted various awards including the CCF Outstanding Doctoral Dissertation Award (2009) and the Microsoft Fellowship Award (2005). He served on the program committees of a number of important international conferences including KDD'10, ACML'10, ACML'09, ACM CIKM'09, IEEE ICME'10 and AI'10, and served as a reviewer for a number of refereed journals including IEEE Trans. KDE, IEEE Trans. NN, IEEE Trans. SMCC, ACM Trans. IST, Pattern Recognition, Knowledge and Information Systems, and Journal of Computer Science and Technology. He is a committee member of the Machine Learning Society of the CAAI, and a member of ACM, IEEE, the IEEE Computer Society, CCF and CAAI.
    Zhi-Hua Zhou received the B.Sc., M.Sc. and Ph.D. degrees in computer science from Nanjing University, China, in 1996, 1998 and 2000, respectively, all with the highest honors. He joined the Department of Computer Science and Technology at Nanjing University as an assistant professor in 2001, and is currently Cheung Kong Professor and Director of the LAMDA Group. His research interests are in artificial intelligence, machine learning, data mining, pattern recognition, information retrieval, evolutionary computation and neural computation. In these areas he has published over 70 papers in leading international journals or conference proceedings. Dr. Zhou has won various awards/honors including the National Science and Technology Award for Young Scholars of China (2006), the Award of National Science Fund for Distinguished Young Scholars of China (2003), the National Excellent Doctoral Dissertation Award of China (2003), and the Microsoft Young Professorship Award (2006). He is an associate editor of IEEE Transactions on Knowledge and Data Engineering, associate editor-in-chief of Chinese Science Bulletin, and on the editorial boards of Artificial Intelligence in Medicine, Intelligent Data Analysis, and Science in China. He is the founder of ACML, a Steering Committee member of PAKDD and PRICAI, Program Committee Chair/Co-Chair of PAKDD'07, PRICAI'08 and ACML'09, Vice Chair or Area Chair of conferences including IEEE ICDM'06, IEEE ICDM'08, SIAM DM'09 and ACM CIKM'09, and general chair/co-chair or program committee chair/co-chair of a dozen domestic conferences. He is the chair of the Machine Learning Society of the CAAI, vice chair of the Artificial Intelligence and Pattern Recognition Society of the CCF, and chair of the IEEE Computer Society Nanjing Chapter. He is a fellow of IET, a member of AAAI and ACM, and a senior member of IEEE, the IEEE Computer Society, the IEEE Computational Intelligence Society, and CCF.
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant Nos. 60975043, 60903103, and 60721002.

Software defect detection aims to automatically identify defective software modules for efficient software testing, in order to improve the quality of a software system. Although many machine learning methods have been successfully applied to this task, most of them fail to consider two practical yet important issues in software defect detection. First, it is rather difficult to collect a large amount of labeled training data for learning a well-performing model; second, in a software system there are usually far fewer defective modules than defect-free modules, so learning has to be conducted over an imbalanced data set. In this paper, we address these two practical issues simultaneously by proposing a novel semi-supervised learning approach named Rocus. The method exploits the abundant unlabeled examples to improve the detection accuracy, and employs under-sampling to tackle the class-imbalance problem in the learning process. Experimental results on real-world software defect detection tasks show that Rocus is effective for software defect detection. Its performance is better than that of a semi-supervised learning method that ignores the class-imbalance nature of the task, and that of a class-imbalance learning method that does not make effective use of unlabeled data.
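The two ideas the abstract combines — exploiting unlabeled modules via iterative self-labeling and under-sampling the majority (defect-free) class — can be illustrated with a minimal sketch. This is not the authors' implementation: Rocus uses a disagreement-based ensemble with confidence estimation, whereas this toy stands in a single nearest-centroid learner and pseudo-labels the whole unlabeled pool. All function names (`under_sample`, `rocus_sketch`, etc.) are hypothetical.

```python
import random
from collections import Counter

def under_sample(xs, ys, seed=0):
    """Randomly drop majority-class examples until all classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(xs, ys):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(g) for g in by_class.values())
    bx, by = [], []
    for y, group in sorted(by_class.items()):
        for x in rng.sample(group, n_min):
            bx.append(x)
            by.append(y)
    return bx, by

def centroid_fit(xs, ys):
    """Toy learner: one mean feature vector (centroid) per class."""
    sums, counts = {}, Counter(ys)
    for x, y in zip(xs, ys):
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def centroid_predict(model, x):
    """Predict the class whose centroid is nearest (squared Euclidean)."""
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda y: d2(model[y], x))

def rocus_sketch(lab_x, lab_y, unlab_x, rounds=3):
    """Under-sample the labeled data, then iteratively pseudo-label the
    unlabeled modules and retrain on a re-balanced set -- a crude stand-in
    for the Rocus learning loop."""
    model = centroid_fit(*under_sample(lab_x, lab_y))
    for _ in range(rounds):
        pseudo = [centroid_predict(model, x) for x in unlab_x]
        model = centroid_fit(*under_sample(lab_x + unlab_x, lab_y + pseudo))
    return model
```

On a toy data set with eight defect-free and two defective labeled modules plus a handful of unlabeled ones, the balanced retraining keeps the rare defective class from being swamped by the majority class, which is the point of combining the two techniques.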

[1] Dai Y S, Xie M, Long Q, Ng S H. Uncertainty analysis in software reliability modeling by Bayesian approach with maximum-entropy principle. IEEE Transactions on Software Engineering, 2007, 33(11): 781-795.

[2] Guo L, Ma Y, Cukic B, Singh H. Robust prediction of fault-proneness by random forests. In Proc. the 15th International Symposium on Software Reliability Engineering, Nov. 2-5, 2004, pp.417-428.

[3] Khoshgoftaar T M, Allen E B, Jones W D, Hudepohl J P. Classification-tree models of software-quality over multiple releases. IEEE Transactions on Reliability, 2000, 49(1): 4-11.

[4] Lessmann S, Baesens B, Mues C, Pietsch S. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 2008, 34(4): 485-496.

[5] Menzies T, Greenwald J, Frank A. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 2007, 33(1): 2-13.

[6] Zhang H, Zhang X. Comments on "Data mining static code attributes to learn defect predictors". IEEE Transactions on Software Engineering, 2007, 33(9): 635-637.

[7] Zhou Y, Leung H. Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Transactions on Software Engineering, 2006, 32(10): 771-789.

[8] Seliya N, Khoshgoftaar T M. Software quality estimation with limited fault data: A semi-supervised learning perspective. Software Quality Journal, 2007, 15: 327-344.

[9] Pelayo L, Dick S. Applying novel resampling strategies to software defect prediction. In Proc. the 2007 Annual Meeting of the North American Fuzzy Information Processing Society, San Diego, USA, Jun. 24-27, 2007, pp.69-72.

[10] Zhou Z H, Li M. Semi-supervised learning by disagreement. Knowledge and Information Systems, 2010, 24(3): 415-439.

[11] Drummond C, Holte R C. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Working Notes of the ICML'03 Workshop on Learning from Imbalanced Data Sets, Washington DC, USA, Jul. 21, 2003.

[12] Zheng A X, Jordan M I, Liblit B, Naik M, Aiken A. Statistical debugging: Simultaneous identification of multiple bugs. In Proc. the 23rd International Conference on Machine Learning, Pittsburgh, USA, Jun. 25-29, 2006, pp.1105-1112.

[13] Andrzejewski D, Mulhern A, Liblit B, Zhu X. Statistical debugging using latent topic models. In Proc. the 18th European Conference on Machine Learning, Warsaw, Poland, Sept. 17-21, 2007, pp.6-17.

[14] Chilimbi T M, Liblit B, Mehra K K et al. HOLMES: Effective statistical debugging via efficient path profiling. In Proc. the 31st International Conference on Software Engineering, Vancouver, Canada, May 16-24, 2009, pp.34-44.

[15] Basili V R, Briand L C, Melo W L. A validation of object-oriented design metrics as quality indicators. IEEE Transactions on Software Engineering, 1996, 22(10): 751-761.

[16] Khoshgoftaar T M, Allen E B. Neural Networks for Software Quality Prediction. Computational Intelligence in Software Engineering, Pedrycz W, Peters J F (eds.), World Scientific, Singapore, 1998, pp.33-63.

[17] Halstead M H. Elements of Software Science, Elsevier, 1977.

[18] McCabe T J. A complexity measure. IEEE Transactions on Software Engineering, 1976, 2(4): 308-320.

[19] Gyimóthy T, Ferenc R, Siket I. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 2005, 31(10): 897-910.

[20] Ganesan K, Khoshgoftaar T M, Allen E B. Verifying requirements through mathematical modelling and animation. International Journal of Software Engineering and Knowledge Engineering, 2000, 10(2): 139-152.

[21] Khoshgoftaar T M, Seliya N. Fault prediction modeling for software quality estimation: Comparing commonly used techniques. Empirical Software Engineering, 2003, 8(3): 255-283.

[22] Fenton N E, Neil M. A critique of software defect prediction models. IEEE Transactions on Software Engineering, 1999, 25(5): 675-689.

[23] Pérez-Miñana E, Gras J J. Improving fault prediction using Bayesian networks for the development of embedded software applications. Software Testing, Verification and Reliability, 2006, 16(3): 157-174.

[24] Chapelle O, Schölkopf B, Zien A. Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.

[25] Zhu X. Semi-supervised learning literature survey. Technical Report 1530, Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, 2006, http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf.

[26] Miller D J, Uyar H S. A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data. Advances in Neural Information Processing Systems 9, Mozer M, Jordan M I, Petsche T (eds.), MIT Press, Cambridge, MA, 1997, pp.571-577.

[27] Nigam K, McCallum A K, Thrun S, Mitchell T. Text classification from labeled and unlabeled documents using EM. Machine Learning, 2000, 39(2/3): 103-134.

[28] Shahshahani B, Landgrebe D. The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 1994, 32(5): 1087-1095.

[29] Chapelle O, Zien A. Semi-supervised learning by low density separation. In Proc. the 10th International Workshop on Artificial Intelligence and Statistics, Barbados, Jan. 6-8, 2005, pp.57-64.

[30] Grandvalet Y, Bengio Y. Semi-Supervised Learning by Entropy Minimization. Advances in Neural Information Processing Systems, Saul L K, Weiss Y, Bottou L (eds.), MIT Press, Cambridge, MA, 2005, pp.529-536.

[31] Joachims T. Transductive inference for text classification using support vector machines. In Proc. the 16th International Conference on Machine Learning, Bled, Slovenia, Jun. 27-30, 1999, pp.200-209.

[32] Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 2006, 7(11): 2399-2434.

[33] Zhou D, Bousquet O, Lal T N, Weston J, Schölkopf B. Learning with Local and Global Consistency. Advances in Neural Information Processing Systems 16, Thrun S, Saul L, Schölkopf B (eds.), MIT Press, Cambridge, MA, 2004.

[34] Zhu X, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. the 20th International Conference on Machine Learning, Washington, DC, USA, Aug. 21-24, 2003, pp.912-919.

[35] Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In Proc. the 11th Annual Conference on Computational Learning Theory, Madison, USA, Jul. 24-26, 1998, pp.92-100.

[36] Goldman S, Zhou Y. Enhancing supervised learning with unlabeled data. In Proc. the 17th International Conference on Machine Learning, San Francisco, USA, Jun. 29-Jul. 2, 2000, pp.327-334.

[37] Li M, Zhou Z H. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Transactions on Systems, Man and Cybernetics—Part A: Systems and Humans, 2007, 37(6): 1088-1098.

[38] Zhou Z H, Li M. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(11): 1529-1541.

[39] Zhou Z H, Li M. Semi-supervised regression with co-training style algorithms. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(11): 1479-1493.

[40] Steedman M, Osborne M, Sarkar A et al. Bootstrapping statistical parsers from small data sets. In Proc. the 11th Conference on the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, Apr. 12-17, 2003, pp.331-338.

[41] Li M, Zhou Z H. Semi-supervised document retrieval. Information Processing & Management, 2009, 45(3): 341-355.

[42] Zhou Z H, Chen K J, Dai H B. Enhancing relevance feedback in image retrieval using unlabeled data. ACM Transactions on Information Systems, 2006, 24(2): 219-244.

[43] Chawla N V, Bowyer K W, Hall L O, Kegelmeyer W P. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002, 16: 321-357.

[44] Kubat M, Matwin S. Addressing the curse of imbalanced training sets: One-sided selection. In Proc. the 14th Int. Conf. Machine Learning, Nashville, USA, 1997, pp.179-186.

[45] Domingos P. MetaCost: A general method for making classifiers cost-sensitive. In Proc. the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA, Aug. 15-18, 1999, pp.155-164.

[46] Elkan C. The foundations of cost-sensitive learning. In Proc. the 17th International Joint Conference on Artificial Intelligence, Seattle, USA, Aug. 4-10, 2001, pp.973-978.

[47] Batista G, Prati R C, Monard M C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations, 2004, 6(1): 20-29.

[48] Liu X Y, Wu J X, Zhou Z H. Exploratory under-sampling for class-imbalance learning. IEEE Transactions on Systems, Man and Cybernetics—Part B: Cybernetics, 2009, 39(2): 539-550.

[49] Angluin D, Laird P. Learning from noisy examples. Machine Learning, 1988, 2(4): 343-370.

[50] Ho T K. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(8): 832-844.

[51] Breiman L. Bagging predictors. Machine Learning, 1996, 24(2): 123-140.

[52] Chapman M, Callis P, Jackson W. Metrics data program. NASA IV and V Facility, 2004, http://mdp.ivv.nasa.gov/.

[53] Schapire R E. A brief introduction to Boosting. In Proc. the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, Jul. 31-Aug. 6, 1999, pp.1401-1406.

[54] Witten I H, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

[55] Bradley A P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 1997, 30(6): 1145-1159.

[56] Zhou Z H, Wu J, Tang W. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 2002, 137(1/2): 239-263.

[57] Khoshgoftaar T M, Seliya N. Tree-based software quality estimation models for fault prediction. In Proc. the 8th IEEE International Symp. Software Metrics, Ottawa, Canada, Jun. 4-7, 2002, pp.203-214.

[58] Dietterich T G, Lathrop R H, Lozano-Pérez T. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 1997, 89(1/2): 31-71.

Journal of Computer Science and Technology, ISSN 1000-9000 (Print), CN 11-2296/TP. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.