Journal of Computer Science and Technology, 2019, Vol. 34, Issue (5): 1039-1062. doi: 10.1007/s11390-019-1959-z

Special topic: Data Management and Data Mining Software Systems


Cross Project Defect Prediction via Balanced Distribution Adaptation Based Transfer Learning

Zhou Xu1,2,3, Shuai Pang2, Tao Zhang1,4*, Senior Member, CCF, Xia-Pu Luo3*, Member, ACM, IEEE, Jin Liu2,4,5, Member, CCF, IEEE, Yu-Tian Tang3, Xiao Yu2,6, Lei Xue3, Member, IEEE   

  • 1 College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China;
    2 School of Computer Science, Wuhan University, Wuhan 430072, China;
    3 Department of Computing, The Hong Kong Polytechnic University, Hong Kong 999077, China;
    4 Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100190, China;
    5 Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China;
    6 Department of Computer Science, City University of Hong Kong, Hong Kong 999077, China
  • Received: 2018-10-22; Revised: 2019-07-11; Online: 2019-08-31; Published: 2019-08-31
  • Contact: Tao Zhang, Xia-Pu Luo. E-mail: cstzhang@hrbeu.edu.cn; csxluo@comp.polyu.edu.hk
  • About author: Zhou Xu received his B.S. degree in computer science and technology from Huazhong Agricultural University, Wuhan, in 2014. He is now a joint Ph.D. candidate in the School of Computer Science at Wuhan University, Wuhan, and in the Department of Computing at The Hong Kong Polytechnic University, Hong Kong. He is also a temporary research assistant at the College of Computer Science and Technology, Harbin Engineering University, Harbin. His research interests include software defect prediction, feature engineering, and data mining.
  • Supported by:
    This work was partially supported by the National Key Research and Development Program of China under Grant No. 2018YFC1604000, the National Natural Science Foundation of China under Grant Nos. 61602258, 61572374, and U163620068, the China Postdoctoral Science Foundation under Grant No. 2017M621247, the Natural Science Foundation of Heilongjiang Province of China under Grant No. LH2019F008, the Heilongjiang Postdoctoral Science Foundation under Grant No. LBH-Z17047, the Open Fund of the Key Laboratory of Network Assessment Technology of the Chinese Academy of Sciences, the Guangxi Key Laboratory of Trusted Software under Grant No. kx201607, the Academic Team Building Plan for Young Scholars from Wuhan University under Grant No. WHU2016012, and the Hong Kong RGC (Research Grants Council) Project under Grant Nos. PolyU 152223/17E and PolyU 152239/18E.

Abstract: Defect prediction helps allocate testing resources rationally by detecting potentially defective software modules before a product is released. When a project has no historical labeled defect data, cross-project defect prediction (CPDP) offers an alternative: it uses labeled defect data from an external project to build a classification model that predicts the module labels of the current project. Transfer learning based CPDP methods are the current mainstream. In general, such methods aim to minimize the distribution differences between the data of the two projects. However, previous methods mainly focus on the marginal distribution difference and ignore the conditional distribution difference, which leads to unsatisfactory performance. In this work, we use a novel balanced distribution adaptation (BDA) based transfer learning method to narrow this gap. BDA considers both kinds of distribution difference simultaneously and adaptively assigns different weights to them. To evaluate the effectiveness of BDA for CPDP, we conduct experiments on 18 projects from four datasets using six indicators (F-measure, g-means, Balance, AUC, EARecall, and EAF-measure). Compared with 12 baseline methods, BDA achieves average improvements of 23.8%, 12.5%, 11.5%, 4.7%, 34.2%, and 33.7% on the six indicators, respectively, over the four datasets.
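
The weighting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a linear-kernel maximum mean discrepancy (squared distance between feature means) as the distribution distance, a single balance factor `mu` in [0, 1], and pseudo-labels for the target project. The function names `mmd_linear` and `balanced_distance` are hypothetical.

```python
import numpy as np


def mmd_linear(a, b):
    """Squared distance between the feature means of two sample sets."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    return float(diff @ diff)


def balanced_distance(xs, ys, xt, yt_pseudo, mu):
    """Weighted sum of marginal and per-class conditional distances.

    xs, ys     : source-project features and labels
    xt         : target-project features
    yt_pseudo  : pseudo-labels for the target (assumed to come from some
                 base classifier; the target has no true labels)
    mu         : balance factor; 0 uses only the marginal term,
                 1 uses only the conditional term
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    marginal = mmd_linear(xs, xt)
    conditional = 0.0
    for c in np.unique(ys):
        src_c = xs[ys == c]
        tgt_c = xt[yt_pseudo == c]
        if len(src_c) == 0 or len(tgt_c) == 0:
            continue  # class absent on one side: no conditional term for it
        conditional += mmd_linear(src_c, tgt_c)
    return (1.0 - mu) * marginal + mu * conditional


# Toy case: the overall feature means coincide, but the classes are swapped.
xs = np.array([[0.0, 0.0], [2.0, 0.0]])
ys = np.array([0, 1])
xt = np.array([[2.0, 0.0], [0.0, 0.0]])
yt = np.array([0, 1])
print(balanced_distance(xs, ys, xt, yt, mu=0.0))  # marginal term only
print(balanced_distance(xs, ys, xt, yt, mu=0.5))  # both terms
```

In this toy case the marginal term alone reports no difference between the projects, while the conditional term exposes the per-class shift. This is the kind of gap that a marginal-only method would miss.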

Key words: cross-project defect prediction, transfer learning, balancing distribution, effort-aware indicator
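
Three of the six indicators named in the abstract can be computed directly from a confusion matrix. The sketch below uses definitions that are common in the defect prediction literature: F-measure as the harmonic mean of precision and recall, g-means as the geometric mean of recall and specificity, and Balance as one minus the normalized distance to the ideal point (pf = 0, pd = 1). That the paper uses exactly these definitions is an assumption. AUC and the effort-aware indicators are omitted because they need prediction scores and module sizes. The function name `indicators` is hypothetical.

```python
import math


def indicators(y_true, y_pred):
    """Return F-measure, g-means and Balance for binary labels (1 = defective)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # pd: probability of detection (recall); pf: probability of false alarm
    pd_ = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0

    f_measure = (
        2 * precision * pd_ / (precision + pd_) if precision + pd_ else 0.0
    )
    g_means = math.sqrt(pd_ * (1.0 - pf))
    balance = 1.0 - math.sqrt(pf ** 2 + (1.0 - pd_) ** 2) / math.sqrt(2.0)
    return {"F-measure": f_measure, "g-means": g_means, "Balance": balance}


# Ten modules, four of them defective; the model finds three and raises one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(indicators(y_true, y_pred))
```

All three values lie in [0, 1], with higher being better, so they can be compared side by side across projects.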
