Cross Project Defect Prediction via Balanced Distribution Adaptation Based Transfer Learning

doi:10.1007/s11390-019-1959-z

Abstract

Defect prediction supports the rational allocation of testing resources by detecting potentially defective software modules before a product is released. When a project has no historical labeled defect data, cross project defect prediction (CPDP) is an alternative: it uses the labeled defect data of an external project to build a classification model that predicts the module labels of the current project. Transfer learning based CPDP methods are the current mainstream. In general, such methods aim to minimize the distribution difference between the data of the two projects. However, previous methods mainly focus on the marginal distribution difference and ignore the conditional distribution difference, which leads to unsatisfactory performance. In this work, we use a novel balanced distribution adaptation (BDA) based transfer learning method to narrow this gap. BDA considers both kinds of distribution difference simultaneously and adaptively assigns different weights to them. To evaluate the effectiveness of BDA for CPDP, we conduct experiments on 18 projects from four datasets using six indicators (F-measure, g-means, Balance, AUC, EARecall, and EAF-measure). Compared with 12 baseline methods, BDA achieves average improvements of 23.8%, 12.5%, 11.5%, 4.7%, 34.2%, and 33.7%, respectively, on the six indicators across the four datasets.
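The idea of weighting the marginal and conditional distribution differences can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the function names, the balance factor `mu` in [0, 1], the convex combination `(1 - mu) * marginal + mu * conditional`, the use of the squared distance between feature means as the discrepancy measure, and the use of pseudo-labels for the unlabeled target project. The paper's actual formulation may differ.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Column-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(dim)]


def squared_mean_gap(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Squared Euclidean distance between the feature means of a and b.

    Assumption: this simple statistic stands in for the distribution
    discrepancy; it is chosen only to keep the sketch short.
    """
    mean_a, mean_b = mean_vector(a), mean_vector(b)
    return sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))


def balanced_discrepancy(
    src_x: Sequence[Vector],
    src_y: Sequence[int],
    tgt_x: Sequence[Vector],
    tgt_pseudo_y: Sequence[int],
    mu: float,
) -> float:
    """Weighted sum of marginal and per-class (conditional) discrepancies.

    mu is a hypothetical balance factor: 0 uses only the marginal term,
    1 uses only the conditional term.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")

    marginal = squared_mean_gap(src_x, tgt_x)

    # Conditional term: compare the two projects class by class, using
    # pseudo-labels for the target project because it has no true labels.
    conditional = 0.0
    for label in sorted(set(src_y) & set(tgt_pseudo_y)):
        src_rows = [x for x, y in zip(src_x, src_y) if y == label]
        tgt_rows = [x for x, y in zip(tgt_x, tgt_pseudo_y) if y == label]
        conditional += squared_mean_gap(src_rows, tgt_rows)

    return (1.0 - mu) * marginal + mu * conditional
```

In this sketch, setting `mu` close to 0 reproduces the behavior the abstract attributes to earlier methods, which attend mainly to the marginal difference, while larger values give more weight to the class-conditional difference.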
