Multi-Factor Duplicate Question Detection in Stack Overflow
Abstract: Stack Overflow is a popular online question-and-answer site where software developers share their experience and expertise. Among the numerous questions posted on Stack Overflow, two or more may express the same point and are thus duplicates of one another. Duplicate questions make the site harder to maintain, waste resources that could have been used to answer other questions, and cause developers to wait unnecessarily for answers that are already available. To reduce the problem of duplicate questions, Stack Overflow allows questions to be manually marked as duplicates of others. Since thousands of questions are submitted to Stack Overflow every day, manually identifying duplicate questions is difficult work. Thus, there is a need for an automated approach that can help detect these duplicate questions. To address this need, we propose in this paper an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of it by considering multiple factors.
DupPredictor extracts the title and description of a question, along with the tags attached to it. These pieces of information (title, description, and a few tags) are mandatory: a user must supply them when posting a question. DupPredictor then computes the latent topics of each question by using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four scores are finally combined into a single similarity score that takes all of these factors into account. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset that contains more than 2 million questions. The results show that DupPredictor achieves a recall-rate@20 score of 63.8%. We compare our approach with the standard search engine of Stack Overflow, and DupPredictor improves its recall-rate@10 score by 40.63%. We also compare our approach with four approaches that each use only one of title, description, topic, or tag similarity, and with Runeson et al.'s approach, which has been used to detect duplicate bug reports; DupPredictor improves their recall-rate@10 scores by 27.2%, 97.4%, 746.0%, 231.1%, and 16.4%, respectively.
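The pipeline described above (four per-factor similarities, one combined score, and a recall-rate@k evaluation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several things are assumptions made only for illustration: whitespace tokenization, cosine similarity as the measure for every factor, a plain weighted sum as the combination rule, the equal weights, the topic vectors (taken here as already produced by some topic model), and the field name `master_id`. The sketch also reads recall-rate@k as the fraction of duplicate questions whose known original appears among the top k candidates, which is the usual meaning of the metric but is not defined in this abstract.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(text):
    """Term counts over lowercased whitespace tokens (assumed preprocessing)."""
    return Counter(text.lower().split())


def combined_similarity(q1, q2, weights):
    """Weighted sum of title, description, topic, and tag similarities."""
    scores = {
        "title": cosine(bag_of_words(q1["title"]), bag_of_words(q2["title"])),
        "description": cosine(bag_of_words(q1["description"]),
                              bag_of_words(q2["description"])),
        # Topic vectors are assumed to come from a separately trained topic model.
        "topics": cosine(dict(enumerate(q1["topics"])),
                         dict(enumerate(q2["topics"]))),
        "tags": cosine(Counter(q1["tags"]), Counter(q2["tags"])),
    }
    return sum(weights[name] * score for name, score in scores.items())


def top_k(new_question, candidates, weights, k):
    """IDs of the k candidates most similar to the new question."""
    ranked = sorted(
        candidates,
        key=lambda c: combined_similarity(new_question, c, weights),
        reverse=True,
    )
    return [c["id"] for c in ranked[:k]]


def recall_rate_at_k(queries, candidates, weights, k):
    """Fraction of queries whose known original ('master_id') is in the top k."""
    hits = sum(
        1 for q in queries if q["master_id"] in top_k(q, candidates, weights, k)
    )
    return hits / len(queries)


# Hypothetical toy data; the weights are illustrative, not tuned values.
weights = {"title": 0.25, "description": 0.25, "topics": 0.25, "tags": 0.25}
candidates = [
    {"id": 1, "title": "How to reverse a list in Python",
     "description": "I want to reverse a list in place",
     "topics": [0.9, 0.1], "tags": ["python", "list"]},
    {"id": 2, "title": "Null pointer exception in Java",
     "description": "My program throws a null pointer exception",
     "topics": [0.1, 0.9], "tags": ["java", "exception"]},
]
query = {"id": 3, "title": "Reverse a Python list",
         "description": "What is the way to reverse a list",
         "topics": [0.8, 0.2], "tags": ["python"], "master_id": 1}

print(top_k(query, candidates, weights, k=1))
print(recall_rate_at_k([query], candidates, weights, k=1))
```

In this sketch the weights are fixed by hand; in practice they would have to be chosen from data, for example by searching for the combination that maximizes recall-rate@k on questions already marked as duplicates.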