2017, Vol. 32, Issue (6): 1090-1107. doi: 10.1007/s11390-017-1785-0

Special Issue: Software Systems

• Special Section on Software Systems 2017 •

A Cluster Based Feature Selection Method for Cross-Project Software Defect Prediction

Chao Ni¹ (Student Member, IEEE), Wang-Shu Liu¹, Xiang Chen¹,² (Senior Member, CCF), Qing Gu² (Senior Member, CCF), Dao-Xu Chen¹ (Fellow, CCF; Member, ACM, IEEE), Qi-Guo Huang¹ (Member, CCF)

  ¹ State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China;
  ² School of Computer Science and Technology, Nantong University, Nantong 226019, China
  • Received: 2017-04-21 Revised: 2017-09-27 Online: 2017-11-05 Published: 2017-11-05
  • Contact: Qing Gu, E-mail: guq@nju.edu.cn
  • About author: Chao Ni received his B.S. degree in computer science from Nantong University, Nantong, in 2014, and his M.S. degree in computer science from Nanjing University, Nanjing, in 2017. He is now a Ph.D. candidate at the State Key Laboratory for Novel Software Technology and the Department of Computer Science and Technology, Nanjing University, Nanjing. His research interests are mainly in software defect prediction and machine learning.
  • Supported by:

    This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61373012, 91218302, 61321491 and 61202006, the Collaborative Innovation Center of Novel Software Technology and Industrialization, the Open Project of State Key Laboratory for Novel Software Technology at Nanjing University under Grant No. KFKT2016B18, and the National Basic Research 973 Program of China under Grant No. 2009CB320705.

Cross-project defect prediction (CPDP) uses labeled data from external source software projects to compensate for the shortage of useful data in the target project, so that a meaningful classification model can be built. However, the distribution gap between the software features extracted from the source and target projects may be too large for the mixed data to be useful for training. In this paper, we propose FeSCH (Feature Selection Using Clusters of Hybrid-Data), a novel cluster-based method that alleviates these distribution differences through feature selection. FeSCH has two phases: the feature clustering phase clusters features using a density-based clustering method, and the feature selection phase selects features from each cluster using a ranking strategy. For CPDP, we design three heuristic ranking strategies for the second phase. To investigate the prediction performance of FeSCH, we design experiments based on real-world software projects and study the effects of its design options (ranking strategy, feature selection ratio, and classifier). The experimental results demonstrate the effectiveness of FeSCH. First, compared with state-of-the-art baseline methods, FeSCH achieves better performance, and its performance is less affected by the classifier used. Second, FeSCH improves performance by effectively selecting features across feature categories, and it provides guidelines for selecting useful features for defect prediction.
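The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names a density-based clustering method and a ranking strategy without giving details, so the feature distance (one minus absolute Pearson correlation), the cutoff `d_c`, the simplified density-peaks-style assignment, and the externally supplied ranking scores are all assumptions, and the function names `cluster_features` and `select_features` are hypothetical.

```python
import numpy as np


def cluster_features(X, n_clusters, d_c=0.5):
    """Phase 1 sketch: group feature columns with a density-peaks-style rule.

    X is an (instances x features) array, e.g. source and target rows stacked.
    Assumes no column is constant (otherwise the correlation is undefined).
    """
    # Assumed feature distance: 1 - |Pearson correlation| between columns.
    dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    n = dist.shape[0]
    # Local density: number of features (including itself) closer than d_c.
    rho = (dist < d_c).sum(axis=1)
    # Visit features from densest to sparsest; stable sort keeps ties ordered.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)        # distance to the nearest earlier-visited feature
    parent = np.full(n, -1)    # index of that earlier (denser or tied) feature
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()
            continue
        earlier = order[:rank]
        parent[i] = earlier[np.argmin(dist[i, earlier])]
        delta[i] = dist[i, parent[i]]
    # Centers are the features with the largest rho * delta; the densest
    # feature is forced to be a center so every parent chain ends at one.
    gamma = rho * delta
    gamma[order[0]] = np.inf
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    # Non-center features inherit the label of their parent feature.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


def select_features(labels, scores, ratio):
    """Phase 2 sketch: keep the top-scoring share of features in each cluster."""
    selected = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        k = max(1, int(np.ceil(ratio * len(members))))
        top = members[np.argsort(-scores[members], kind="stable")[:k]]
        selected.extend(top.tolist())
    return sorted(selected)


# Synthetic data: features 0-2 follow one latent signal, features 3-5 another.
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([a + 0.1 * rng.normal(size=200) for _ in range(3)] +
                    [b + 0.1 * rng.normal(size=200) for _ in range(3)])
labels = cluster_features(X, n_clusters=2)
scores = np.array([0.1, 0.9, 0.5, 0.3, 0.2, 0.8])  # hypothetical ranking scores
selected = select_features(labels, scores, ratio=0.5)
print(labels, selected)
```

Selecting within each cluster, rather than ranking all features globally, keeps at least one representative of every group of mutually similar features, which is the intuition behind selecting "from each cluster" in the second phase. Any per-feature score could be passed as `scores`; the values used here are placeholders.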


ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences