Journal of Computer Science and Technology, 2015, Vol. 30, Issue (5): 981-997. doi: 10.1007/s11390-015-1576-4

Special Issue: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining

• Special Section on Software Systems •

Multi-Factor Duplicate Question Detection in Stack Overflow

Yun Zhang1(张芸), David Lo2, Member, ACM, IEEE, Xin Xia1*(夏鑫), Member, CCF, ACM, IEEE, Jian-Ling Sun1(孙建伶), Member, CCF, ACM   

  1 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;
  2 School of Information Systems, Singapore Management University, Singapore, Singapore
  • Received: 2015-03-20 Revised: 2015-07-17 Online: 2015-09-05 Published: 2015-09-05
  • Contact: Xin Xia, E-mail: xxia@zju.edu.cn
  • About author: Yun Zhang is a Ph.D. candidate in the College of Computer Science and Technology, Zhejiang University, Hangzhou. Her research interests include mining software repositories and empirical studies.
  • Supported by:

    This work was partially supported by the China Knowledge Centre for Engineering Sciences and Technology under Grant No. CKCEST-2014-1-5, the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant Nos. 2015BAH17F01 and 2013BAH01B01, and the Fundamental Research Funds for the Central Universities of China.

Stack Overflow is a popular online question-and-answer site where software developers share their experience and expertise. Among the numerous questions posted on Stack Overflow, two or more may express the same point and are thus duplicates of one another. Duplicate questions make site maintenance harder, waste resources that could have been used to answer other questions, and cause developers to wait unnecessarily for answers that are already available. To reduce the problem of duplicate questions, Stack Overflow allows questions to be manually marked as duplicates of others. Since thousands of questions are submitted to Stack Overflow every day, manually identifying duplicate questions is difficult. Thus, there is a need for an automated approach that can help detect these duplicate questions. To address this need, we propose an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of it by considering multiple factors. DupPredictor extracts the title and description of a question, along with the tags attached to it. These pieces of information (title, description, and a few tags) are mandatory when a user posts a question. DupPredictor then computes the latent topics of each question by using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four similarity scores are finally combined into a new similarity score that comprehensively considers the multiple factors. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset that contains more than two million questions. The results show that DupPredictor can achieve a recall-rate@20 score of 63.8%.
We compare our approach with the standard search engine of Stack Overflow, and DupPredictor improves its recall-rate@10 score by 40.63%. We also compare our approach with approaches that use only title, description, topic, or tag similarity, and with Runeson et al.'s approach, which has been used to detect duplicate bug reports; DupPredictor improves their recall-rate@10 scores by 27.2%, 97.4%, 746.0%, 231.1%, and 16.4%, respectively.
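
The pipeline described in the abstract (four per-factor similarities combined into one score, then evaluated with recall-rate@k) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the whitespace tokenizer, the use of cosine similarity for every factor, the equal weights, the weighted-sum combination, the field names (`title`, `description`, `topics`, `tags`), and the reading of recall-rate@k as the fraction of duplicate questions whose known master question appears among the top k candidates. The topic vectors are assumed to be supplied by some external topic model.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag(text):
    """Bag-of-words term counts (hypothetical tokenizer: lowercase + split)."""
    return Counter(text.lower().split())


def combined_score(q1, q2, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of title, description, topic, and tag similarity.

    The equal weights are placeholders, not values from the paper.
    """
    sims = (
        cosine(bag(q1["title"]), bag(q2["title"])),
        cosine(bag(q1["description"]), bag(q2["description"])),
        # Topic vectors are assumed to come from a topic model run elsewhere.
        cosine(dict(enumerate(q1["topics"])), dict(enumerate(q2["topics"]))),
        cosine(Counter(q1["tags"]), Counter(q2["tags"])),
    )
    return sum(w * s for w, s in zip(weights, sims))


def rank_candidates(new_q, history, k=20):
    """Return the ids of the k historical questions most similar to new_q."""
    ranked = sorted(history, key=lambda q: combined_score(new_q, q), reverse=True)
    return [q["id"] for q in ranked[:k]]


def recall_rate_at_k(results, k):
    """Fraction of queries whose known master id is within the top-k ranking.

    `results` is a list of (ranked_ids, master_id) pairs.
    """
    hits = sum(1 for ranked, master in results if master in ranked[:k])
    return hits / len(results)


if __name__ == "__main__":
    history = [
        {"id": 1, "title": "how to reverse a list in python",
         "description": "I want to reverse a list",
         "topics": [0.9, 0.1], "tags": ["python", "list"]},
        {"id": 2, "title": "java null pointer exception",
         "description": "my program crashes with an exception",
         "topics": [0.1, 0.9], "tags": ["java"]},
    ]
    new_q = {"id": 3, "title": "reverse list python",
             "description": "how can I reverse a list",
             "topics": [0.8, 0.2], "tags": ["python"]}
    print(rank_candidates(new_q, history, k=2))
```

In this sketch, changing the `weights` tuple shifts how much each factor contributes to the ranking; how the actual approach chooses its combination is not stated in the abstract.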

[1] Xia X, Lo D, Wang X, Zhou B. Tag recommendation in software information sites. In Proc. the 10th Working Conference on Mining Software Repositories (MSR), May 2013, pp.287-296.

[2] Begel A, DeLine R, Zimmermann T. Social media for software engineering. In Proc. the FSE/SDP Workshop on Future of Software Engineering Research, November 2010, pp.33-38.

[3] Storey M A, Treude C, Deursen A, Cheng L T. The impact of social media on software engineering practices and tools. In Proc. the FSE/SDP Workshop on Future of Software Engineering Research, November 2010, pp.359-364.

[4] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3:993-1022.

[5] Bacchelli A. Mining challenge 2013:Stack Overflow. In Proc. the 10th MSR, May 2013.

[6] Runeson P, Alexandersson M, Nyholm O. Detection of duplicate defect reports using natural language processing. In Proc. the 29th International Conference on Software Engineering (ICSE), May 2007, pp.499-510.

[7] Porter M. An algorithm for suffix stripping. Program, 1980, 14(3):130-137.

[8] Kochhar P S, Thung F, Lo D. Automatic fine-grained issue report reclassification. In Proc. the 19th International Conference on Engineering of Complex Computer Systems (ICECCS), August 2014, pp.126-135.

[9] Thung F, Lo D, Jiang L. Automatic defect categorization. In Proc. the 19th Working Conference on Reverse Engineering (WCRE), October 2012, pp.205-214.

[10] Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval:The Concepts and Technology Behind Search (2nd edition). Addison Wesley, 2011.

[11] Heinrich G. Parameter estimation for text analysis. Technical Report, University of Leipzig, 2005. http://www.arbulon.net/publications/text-est.pdf, Aug. 2015.

[12] Steyvers M, Griffiths T. Probabilistic topic models. In Handbook of Latent Semantic Analysis, Landauer T, McNamara D, Dennis S et al. (eds.), Routledge, 2007.

[13] Wurst M. The word vector tool user guide operator reference developer tutorial. http://www-ai.cs.uni-dortmund.de/SOFTWARE/WVTOOL/doc/wvtool-1.0.pdf, July 2015.

[14] Correa D, Sureka A. Chaff from the wheat:Characterization and modeling of deleted questions on Stack Overflow. In Proc. the 23rd International Conference on World Wide Web, April 2014, pp.631-642.

[15] Han J, Kamber M. Data Mining:Concepts and Techniques (2nd edition). San Francisco, CA, USA:Morgan Kaufmann, 2006.

[16] Sun C, Lo D, Khoo S C, Jiang J. Towards more accurate retrieval of duplicate bug reports. In Proc. the 26th IEEE/ACM International Conference on Automated Software Engineering, November 2011, pp.253-262.

[17] Sun C, Lo D, Wang X, Jiang J, Khoo S C. A discriminative model approach for accurate duplicate bug report retrieval. In Proc. the 32nd ICSE, Volume 1, May 2010, pp.45-54.

[18] Wang X, Zhang L, Xie T, Anvik J, Sun J. An approach to detecting duplicate bug reports using natural language and execution information. In Proc. the 30th ICSE, May 2008, pp.461-470.

[19] Alipour A, Hindle A, Stroulia E. A contextual approach towards more accurate duplicate bug report detection. In Proc. the 10th MSR, May 2013, pp.183-192.

[20] Klein N, Corley C S, Kraft N A. New features for duplicate bug detection. In Proc. the 11th MSR, May 31-June 1, 2014, pp.324-327.

[21] Manning C D, Raghavan P, Schütze H. Introduction to Information Retrieval, Volume 1. Cambridge University Press, Cambridge, 2008.

[22] Lazar A, Ritchey S, Sharif B. Improving the accuracy of duplicate bug report detection using textual similarity measures. In Proc. the 11th MSR, May 31-June 1, 2014, pp.308-311.

[23] Anvik J, Hiew L, Murphy G C. Coping with an open bug repository. In Proc. the 2005 OOPSLA Workshop on Eclipse Technology eXchange, October 2005, pp.35-39.

[24] Lo D, Cheng H, Lucia. Mining closed discriminative dyadic sequential patterns. In Proc. the 14th International Conference on Extending Database Technology (EDBT), March 2011, pp.21-32.

[25] Zanetti M S, Scholtes I, Tessone C J, Schweitzer F. Categorizing bugs with social networks:A case study on four open source software communities. In Proc. the 35th ICSE, May 2013, pp.1032-1041.

[26] Xuan J, Jiang H, Hu Y, Ren Z, Zou W, Luo Z, Wu X. Towards effective bug triage with software data reduction techniques. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(1):264-280.

[27] Bougie G, Starke J, Storey M A, German D M. Towards understanding Twitter use in software engineering:Preliminary findings, ongoing challenges and future questions. In Proc. the 2nd International Workshop on Web 2.0 for Software Engineering, May 2011, pp.31-36.

[28] Tian Y, Achananuparp P, Lubis I N, Lo D, Lim E P. What does software engineering community microblog about? In Proc. the 9th MSR, June 2012, pp.247-250.

[29] Prasetyo P K, Lo D, Achananuparp P, Tian Y, Lim E P. Automatic classification of software related microblogs. In Proc. the 28th ICSM, September 2012, pp.596-599.

[30] Surian D, Lo D, Lim E P. Mining collaboration patterns from a large developer network. In Proc. the 17th Working Conference on Reverse Engineering (WCRE), October 2010, pp.269-273.

[31] Surian D, Liu N, Lo D, Tong H, Lim E P, Faloutsos C. Recommending people in developers' collaboration network. In Proc. the 18th WCRE, October 2011, pp.379-388.

[32] Hong Q, Kim S, Cheung S, Bird C. Understanding a developer social network and its evolution. In Proc. the 27th IEEE International Conference on Software Maintenance (ICSM), September 2011, pp.323-332.

[33] Wang S, Lo D, Vasilescu B, Serebrenik A. EnTagRec:An enhanced tag recommendation system for software information sites. In Proc. the 30th ICSME, September 29-October 3, 2014, pp.291-300.

[34] Barua A, Thomas S W, Hassan A E. What are developers talking about? An analysis of topics and trends in stack overflow. Empirical Software Engineering, 2014, 19(3):619-654.

[35] Gottipati S, Lo D, Jiang J. Finding relevant answers in software forums. In Proc. the 26th IEEE/ACM International Conference on Automated Software Engineering, November 2011, pp.323-332.

[36] Henß S, Monperrus M, Mezini M. Semi-automatically extracting FAQs to improve accessibility of software development knowledge. In Proc. the 34th ICSE, June 2012, pp.793-803.

[37] Correa D, Sureka A. Fit or unfit:Analysis and prediction of 'closed questions' on stack overflow. In Proc. the 1st ACM Conference on Online Social Networks, October 2013, pp.201-212.

[38] Zhou B, Xia X, Lo D, Tian C, Wang X. Towards more accurate content categorization of API discussions. In Proc. the 22nd International Conference on Program Comprehension, June 2014, pp.95-105.

[39] Hou D, Mo L. Content categorization of API discussions. In Proc. the 29th ICSM, September 2013, pp.60-69.

[40] Hou D, Li L. Obstacles in using frameworks and APIs:An exploratory study of programmers' newsgroup discussions. In Proc. the 19th IEEE International Conference on Program Comprehension (ICPC), June 2011, pp.91-100.

[41] Rupakheti C R, Hou D. Evaluating forum discussions to inform the design of an API critic. In Proc. the 20th ICPC, July 2012, pp.53-62.

[42] Zhang Y, Hou D. Extracting problematic API features from forum discussions. In Proc. the 21st ICPC, May 2013, pp.142-151.

ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved