2015, Vol. 30, Issue (5): 981-997. doi: 10.1007/s11390-015-1576-4

Topics: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011

Multi-Factor Duplicate Question Detection in Stack Overflow

Yun Zhang¹ (张芸), David Lo² (Member, ACM, IEEE), Xin Xia¹* (夏鑫; Member, CCF, ACM, IEEE), Jian-Ling Sun¹ (孙建伶; Member, CCF, ACM)

  ¹ College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;
  ² School of Information Systems, Singapore Management University, Singapore
  • Received: 2015-03-20 Revised: 2015-07-17 Online: 2015-09-05 Published: 2015-09-05
  • Contact: Xin Xia, E-mail: xxia@zju.edu.cn
  • About author: Yun Zhang is a Ph.D. candidate in the College of Computer Science and Technology, Zhejiang University, Hangzhou. Her research interests include mining software repositories and empirical studies.
  • Supported by:

    This work was partially supported by the China Knowledge Centre for Engineering Sciences and Technology under Grant No. CKCEST-2014-1-5, the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant Nos. 2015BAH17F01 and 2013BAH01B01, and the Fundamental Research Funds for the Central Universities of China.

Abstract: Stack Overflow is a popular online question-and-answer site where software developers share their experience and expertise. Among the numerous questions posted on Stack Overflow, two or more may express the same point and are thus duplicates of one another. Duplicate questions make the site harder to maintain, waste resources that could have been used to answer other questions, and cause developers to wait unnecessarily for answers that are already available. To reduce the problem, Stack Overflow allows questions to be manually marked as duplicates of others. Because thousands of questions are submitted to Stack Overflow every day, manually identifying duplicates is difficult work, so there is a need for an automated approach that helps detect them. To address this need, we propose an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of it by considering multiple factors. DupPredictor extracts the title and description of a question, along with the tags attached to it. These pieces of information (title, description, and a few tags) are mandatory when a user posts a question. DupPredictor then computes the latent topics of each question using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four scores are finally combined into a new similarity score that comprehensively considers the multiple factors. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset containing more than 2 million questions. The results show that DupPredictor achieves a recall-rate@20 score of 63.8%. We compare our approach with the standard search engine of Stack Overflow, and DupPredictor improves its recall-rate@10 score by 40.63%. We also compare our approach with approaches that use only title, description, topic, or tag similarity, and with Runeson et al.'s approach, which has been used to detect duplicate bug reports; DupPredictor improves on their recall-rate@10 scores by 27.2%, 97.4%, 746.0%, 231.1%, and 16.4%, respectively.
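The multi-factor scoring and the recall-rate@k metric described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how each similarity is computed or how the four scores are weighted, so several things here are assumptions made purely for illustration: cosine similarity over whitespace-tokenized word counts, equal weights, and the function names `combined_similarity`, `rank_candidates`, and `recall_rate_at_k`. Topic distributions are assumed to be precomputed by some topic model and supplied as plain lists. The metric follows the usual definition from duplicate-report retrieval: the fraction of duplicate questions whose known duplicate appears among the top k candidates.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(key, 0.0) for key, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(text):
    """Word counts from naive whitespace tokenization (an assumption)."""
    return Counter(text.lower().split())


def combined_similarity(q1, q2, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of title, description, topic, and tag similarity.

    The equal weights are hypothetical placeholders, not values
    taken from the paper.
    """
    scores = (
        cosine(bag_of_words(q1["title"]), bag_of_words(q2["title"])),
        cosine(bag_of_words(q1["description"]), bag_of_words(q2["description"])),
        # Topic distributions are assumed to be precomputed lists.
        cosine(dict(enumerate(q1["topics"])), dict(enumerate(q2["topics"]))),
        cosine(Counter(q1["tags"]), Counter(q2["tags"])),
    )
    return sum(w * s for w, s in zip(weights, scores))


def rank_candidates(new_question, history, k=20):
    """Return the ids of the k historical questions most similar to new_question."""
    ranked = sorted(
        history,
        key=lambda q: combined_similarity(new_question, q),
        reverse=True,
    )
    return [q["id"] for q in ranked[:k]]


def recall_rate_at_k(ranked_lists, true_duplicates, k):
    """Fraction of questions with at least one known duplicate in the top k.

    ranked_lists maps a question id to its ranked candidate ids;
    true_duplicates maps the same id to a set of known duplicate ids.
    """
    hits = sum(
        1
        for qid, ranked in ranked_lists.items()
        if true_duplicates[qid] & set(ranked[:k])
    )
    return hits / len(ranked_lists)
```

In this sketch each question is a dict with the keys `id`, `title`, `description`, `topics`, and `tags`. Scoring a new question against every historical question is linear in the size of the history, which is enough for illustration but would need indexing or candidate filtering at the scale of millions of questions.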

[1] Xia X, Lo D, Wang X, Zhou B. Tag recommendation in software information sites. In Proc. the 10th Working Conference on Mining Software Repositories (MSR), May 2013, pp.287-296.

[2] Begel A, DeLine R, Zimmermann T. Social media for software engineering. In Proc. the FSE/SDP Workshop on Future of Software Engineering Research, November 2010, pp.33-38.

[3] Storey M A, Treude C, Deursen A, Cheng L T. The impact of social media on software engineering practices and tools. In Proc. the FSE/SDP Workshop on Future of Software Engineering Research, November 2010, pp.359-364.

[4] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3:993-1022.

[5] Bacchelli A. Mining challenge 2013:Stack Overflow. In Proc. the 10th MSR, May 2013.

[6] Runeson P, Alexandersson M, Nyholm O. Detection of duplicate defect reports using natural language processing. In Proc. the 29th International Conference on Software Engineering (ICSE), May 2007, pp.499-510.

[7] Porter M. An algorithm for suffix stripping. Program, 1980, 14(3):130-137.

[8] Kochhar P S, Thung F, Lo D. Automatic fine-grained issue report reclassification. In Proc. the 19th International Conference on Engineering of Complex Computer Systems (ICECCS), August 2014, pp.126-135.

[9] Thung F, Lo D, Jiang L. Automatic defect categorization. In Proc. the 19th Working Conference on Reverse Engineering (WCRE), October 2012, pp.205-214.

[10] Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval:The Concepts and Technology Behind Search (2nd edition). Addison Wesley, 2011.

[11] Heinrich G. Parameter estimation for text analysis. Technical Report, University of Leipzig, 2005. http://www.arbulon.net/publications/text-est.pdf, Aug. 2015.

[12] Steyvers M, Griffiths T. Probabilistic topic models. In Handbook of Latent Semantic Analysis, Landauer T, McNamara D, Dennis S et al. (eds.), Routledge, 2007.

[13] Wurst M. The word vector tool user guide operator reference developer tutorial. http://www-ai.cs.uni-dortmund.de/SOFTWARE/WVTOOL/doc/wvtool-1.0.pdf, July 2015.

[14] Correa D, Sureka A. Chaff from the wheat:Characterization and modeling of deleted questions on Stack Overflow. In Proc. the 23rd International Conference on World Wide Web, April 2014, pp.631-642.

[15] Han J, Kamber M. Data Mining:Concepts and Techniques (2nd edition). San Francisco, CA, USA:Morgan Kaufmann, 2006.

[16] Sun C, Lo D, Khoo S C, Jiang J. Towards more accurate retrieval of duplicate bug reports. In Proc. the 26th IEEE/ACM International Conference on Automated Software Engineering, November 2011, pp.253-262.

[17] Sun C, Lo D, Wang X, Jiang J, Khoo S C. A discriminative model approach for accurate duplicate bug report retrieval. In Proc. the 32nd ICSE, Volume 1, May 2010, pp.45-54.

[18] Wang X, Zhang L, Xie T, Anvik J, Sun J. An approach to detecting duplicate bug reports using natural language and execution information. In Proc. the 30th ICSE, May 2008, pp.461-470.

[19] Alipour A, Hindle A, Stroulia E. A contextual approach towards more accurate duplicate bug report detection. In Proc. the 10th MSR, May 2013, pp.183-192.

[20] Klein N, Corley C S, Kraft N A. New features for duplicate bug detection. In Proc. the 11th MSR, May 31-June 1, 2014, pp.324-327.

[21] Manning C D, Raghavan P, Schütze H. Introduction to Information Retrieval, Volume 1. Cambridge University Press, 2008.

[22] Lazar A, Ritchey S, Sharif B. Improving the accuracy of duplicate bug report detection using textual similarity measures. In Proc. the 11th MSR, May 31-June 1, 2014, pp.308-311.

[23] Anvik J, Hiew L, Murphy G C. Coping with an open bug repository. In Proc. the 2005 OOPSLA Workshop on Eclipse Technology eXchange, October 2005, pp.35-39.

[24] Lo D, Cheng H, Lucia. Mining closed discriminative dyadic sequential patterns. In Proc. the 14th International Conference on Extending Database Technology (EDBT), March 2011, pp.21-32.

[25] Zanetti M S, Scholtes I, Tessone C J, Schweitzer F. Categorizing bugs with social networks:A case study on four open source software communities. In Proc. the 35th ICSE, May 2013, pp.1032-1041.

[26] Xuan J, Jiang H, Hu Y, Ren Z, Zou W, Luo Z, Wu X. Towards effective bug triage with software data reduction techniques. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(1):264-280.

[27] Bougie G, Starke J, Storey M A, German D M. Towards understanding Twitter use in software engineering:Preliminary findings, ongoing challenges and future questions. In Proc. the 2nd International Workshop on Web 2.0 for Software Engineering, May 2011, pp.31-36.

[28] Tian Y, Achananuparp P, Lubis I N, Lo D, Lim E P. What does software engineering community microblog about? In Proc. the 9th MSR, June 2012, pp.247-250.

[29] Prasetyo P K, Lo D, Achananuparp P, Tian Y, Lim E P. Automatic classification of software related microblogs. In Proc. the 28th ICSM, September 2012, pp.596-599.

[30] Surian D, Lo D, Lim E P. Mining collaboration patterns from a large developer network. In Proc. the 17th Working Conference on Reverse Engineering (WCRE), October 2010, pp.269-273.

[31] Surian D, Liu N, Lo D, Tong H, Lim E P, Faloutsos C. Recommending people in developers' collaboration network. In Proc. the 18th WCRE, October 2011, pp.379-388.

[32] Hong Q, Kim S, Cheung S, Bird C. Understanding a developer social network and its evolution. In Proc. the 27th IEEE International Conference on Software Maintenance (ICSM), September 2011, pp.323-332.

[33] Wang S, Lo D, Vasilescu B, Serebrenik A. EnTagRec:An enhanced tag recommendation system for software information sites. In Proc. the 30th ICSME, September 29-October 31, 2014, pp.291-300.

[34] Barua A, Thomas S W, Hassan A E. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering, 2014, 19(3):619-654.

[35] Gottipati S, Lo D, Jiang J. Finding relevant answers in software forums. In Proc. the 26th IEEE/ACM International Conference on Automated Software Engineering, November 2011, pp.323-332.

[36] Henß S, Monperrus M, Mezini M. Semi-automatically extracting FAQs to improve accessibility of software development knowledge. In Proc. the 34th ICSE, June 2012, pp.793-803.

[37] Correa D, Sureka A. Fit or unfit:Analysis and prediction of 'closed questions' on Stack Overflow. In Proc. the 1st ACM Conference on Online Social Networks, October 2013, pp.201-212.

[38] Zhou B, Xia X, Lo D, Tian C, Wang X. Towards more accurate content categorization of API discussions. In Proc. the 22nd International Conference on Program Comprehension, June 2014, pp.95-105.

[39] Hou D, Mo L. Content categorization of API discussions. In Proc. the 29th ICSM, September 2013, pp.60-69.

[40] Hou D, Li L. Obstacles in using frameworks and APIs:An exploratory study of programmers' newsgroup discussions. In Proc. the 19th IEEE International Conference on Program Comprehension (ICPC), June 2011, pp.91-100.

[41] Rupakheti C R, Hou D. Evaluating forum discussions to inform the design of an API critic. In Proc. the 20th ICPC, July 2012, pp.53-62.

[42] Zhang Y, Hou D. Extracting problematic API features from forum discussions. In Proc. the 21st ICPC, May 2013, pp.142-151.