2015, Vol. 30, Issue (5): 969-980. doi: 10.1007/s11390-015-1575-5

Topic: Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 •

A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction

Duksan Ryu, Jong-In Jang, Jongmoon Baik, Member, ACM, IEEE   

  1. School of Computing, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon 305-701, Korea
  • Received:2015-03-20 Revised:2015-07-07 Online:2015-09-05 Published:2015-09-05
  • About author:Duksan Ryu earned his Bachelor's degree in computer science from Hanyang University, Seoul, in 1999, and Master's dual degree in software engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, and Carnegie Mellon University, Pittsburgh, in 2012. He is a Ph.D. student in the School of Computing at KAIST. His research areas are software defect prediction and software reliability engineering.
  • Supported by:

    This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science, ICT and Future Planning (MSIP)) under Grant No. NRF-2013R1A1A2006985 and Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) under Grant No. R0101-15-0144, Development of Autonomous Intelligent Collaboration Framework for Knowledge Bases and Smart Devices.

Abstract: Software defect prediction (SDP) is an active research field in software engineering that aims to identify defect-prone modules. With SDP, limited testing resources can be allocated effectively to defect-prone modules. Although SDP requires sufficient local data within a company, there are cases where local data are not available, e.g., pilot projects. Companies without local data can employ cross-project defect prediction (CPDP), which uses external data to build classifiers. The major challenge of CPDP is that training and test data follow different distributions. To tackle this, instances of source data similar to the target data are selected to build classifiers. Software datasets also have a class imbalance problem: the ratio of the defective class to the clean class is very low, which usually lowers the performance of classifiers. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs a hybrid classification, selectively learning local knowledge (via k-nearest neighbor) and global knowledge (via naïve Bayes). Instances having strong local knowledge are identified via nearest neighbors with the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm), which makes them impractical to use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF.
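
The hybrid idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' HISNN implementation: the unanimity rule, the Euclidean distance, the default `k`, the toy data, and all function names (`hybrid_predict`, `fit_gaussian_nb`, `predict_nb`, `pd_pf`) are assumptions made for illustration only. A target instance takes the label of its nearest source neighbors when they all agree (local knowledge) and otherwise falls back to a Gaussian naive Bayes model fitted on all source data (global knowledge). PD and PF use their usual definitions, PD = TP / (TP + FN) and PF = FP / (FP + TN).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_gaussian_nb(X, y):
    """Estimate per-class prior, feature means, and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(cols, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_nb(model, x):
    """Return the class with the highest Gaussian naive Bayes log-posterior."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=log_posterior)


def hybrid_predict(X_src, y_src, x, k=3):
    """Local knowledge if the k nearest source labels agree, else naive Bayes."""
    order = sorted(range(len(X_src)), key=lambda i: euclidean(X_src[i], x))
    labels = {y_src[i] for i in order[:k]}
    if len(labels) == 1:          # unanimous neighbors: trust local knowledge
        return labels.pop()
    return predict_nb(fit_gaussian_nb(X_src, y_src), x)  # global knowledge


def pd_pf(y_true, y_pred):
    """Probability of detection and probability of false alarm (1 = defective)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pd = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    return pd, pf


# Toy source project: label 0 = clean, label 1 = defective (made-up data).
X_src = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
y_src = [0, 0, 0, 1, 1, 1]

print(hybrid_predict(X_src, y_src, [8.5, 8.5]))     # neighbors agree
print(hybrid_predict(X_src, y_src, [4, 4], k=5))    # neighbors disagree
print(pd_pf([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

In this sketch the naive Bayes model is refitted on every fallback call to keep the code short; a practical version would fit it once and reuse it.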

Copyright © Editorial Office of Journal of Computer Science and Technology