Journal of Computer Science and Technology ›› 2019, Vol. 34 ›› Issue (5): 1020-1038. doi: 10.1007/s11390-019-1958-0

Special Topics: Data Management and Data Mining Software Systems

• Special Section on Software Systems 2019 •

DP-Share: Privacy-Preserving Software Defect Prediction Model Sharing Through Differential Privacy

Xiang Chen (1,2,3), Senior Member, CCF; Dun Zhang (1); Zhan-Qi Cui (2,4), Member, CCF; Qing Gu (2), Senior Member, CCF; Xiao-Lin Ju (1,2), Member, CCF

  • 1 School of Information Science and Technology, Nantong University, Nantong 226019, China;
    2 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China;
    3 School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore;
    4 Computer School, Beijing Information Science and Technology University, Beijing 100101, China
  • Received: 2018-11-29; Revised: 2019-04-03; Online: 2019-08-31; Published: 2019-08-31
  • About author: Xiang Chen received his B.S. degree in information management and information system from Xi'an Jiaotong University, Xi'an, in 2002, and his M.S. and Ph.D. degrees in computer software and theory from Nanjing University, Nanjing, in 2008 and 2011, respectively. He is an associate professor with the School of Information Science and Technology at Nantong University, Nantong. His research interests are mainly in software maintenance and software testing, including software defect prediction, security vulnerability prediction, combinatorial testing, regression testing, and software fault localization. He has published over 40 papers in refereed journals and conferences, including Information and Software Technology, Journal of Systems and Software, IEEE Transactions on Reliability, Software Quality Journal, Journal of Computer Science and Technology, COMPSAC, APSEC, and SAC.
  • Supported by:
    This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61702041 and 61872263, the Open Project of State Key Laboratory for Novel Software Technology at Nanjing University under Grant No. KFKT2019B14, the Science and Technology Project of Beijing Municipal Education Commission under Grant No. KM201811232016, the Nantong Application Research Plan under Grant No. JC2018134, and Jiangsu Government Scholarship for Overseas Studies.

Abstract: In current software defect prediction (SDP) research, most empirical studies use only the datasets provided by the PROMISE repository, which threatens the external validity of their results. Sharing SDP models rather than SDP datasets is a potential way to alleviate this problem, and it can encourage researchers in the research community and practitioners in the industrial community to share more models. However, sharing models directly may result in privacy disclosure, for example through model inversion attacks. Differential privacy (DP) mechanisms can prevent this attack when the privacy budget is carefully selected. To the best of our knowledge, we are the first to apply DP to privacy-preserving SDP model sharing, and we propose a novel method, DP-Share. DP-Share first preprocesses the dataset: it over-samples minority instances (i.e., defective modules) and discretizes continuous features to optimize privacy budget allocation. It then uses a novel sampling strategy to create a set of training sets. Finally, it constructs decision trees based on these training sets, and these trees form a random forest (i.e., the model). This last phase uses the Laplace and exponential mechanisms to satisfy the requirements of DP. In our empirical studies, we choose nine experimental subjects from real software projects, use AUC (area under the ROC curve) as the performance measure, and use holdout as the model validation technique. After privacy and utility analysis, we find that DP-Share achieves better performance than the baseline method DF-Enhance in most cases under the same privacy budget. We also provide guidelines for using the proposed method effectively. Our work attempts to fill the research gap in differential privacy for SDP; it can encourage researchers and practitioners to share more SDP models and thereby advance the state of the art of SDP.

Key words: software defect prediction, model sharing, differential privacy, cross project defect prediction, empirical study
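
The abstract names the Laplace and exponential mechanisms but does not describe how DP-Share applies them. As a rough illustration only, and not the authors' implementation, the two standard mechanisms could be sketched in Python as follows. The function names, the sensitivity of 1, the example counts, and the toy feature scores are all assumptions made for this example.

```python
import math
import random


def laplace_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return true_count plus Laplace(0, sensitivity / epsilon) noise."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


def exponential_choice(candidates, score, epsilon, rng, sensitivity=1.0):
    """Pick a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scores = [score(c) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
# Hypothetical leaf holding 12 defective and 30 clean modules.
noisy_counts = {label: laplace_count(n, epsilon=0.5, rng=rng)
                for label, n in {"defective": 12, "clean": 30}.items()}
# Hypothetical quality scores for three candidate split features.
toy_scores = {"loc": 3.0, "wmc": 5.0, "cbo": 4.0}
split = exponential_choice(list(toy_scores), toy_scores.get, epsilon=0.5, rng=rng)
print(noisy_counts, split)
```

In this sketch a smaller `epsilon` adds more noise to the counts and makes the split choice closer to uniform, which is the usual privacy and utility trade-off. How DP-Share divides the privacy budget across trees and tree levels is not stated in the abstract and is not modeled here.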
