Duplicate Pull-Request Detection Based on Textual and Change Similarity

doi:10.1007/s11390-020-9935-1

Detecting Duplicate Contributions in Pull-based Model Combining Textual and Change Similarities

Abstract

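Abstract (translated from Chinese): In the distributed collaborative development of open-source software, communication and coordination among developers has long been a closely watched research problem. As the most advanced collaborative development mechanism to date, the pull-request-based development model gives open-source developers a high degree of openness and transparency and makes their work more visible. However, because this model is parallel and has no central coordination, multiple developers still submit duplicate pull-requests. If duplicate pull-requests are not detected in time, contributors and reviewers may waste time and effort on redundant review and update work. In this paper, we propose a method that combines textual and change similarity to detect duplicate pull-requests automatically. For a given pull-request, we first compute its textual similarity and change similarity to historical pull-requests, then use a greedy search strategy to obtain a hybrid similarity, and finally return a list of the pull-requests with the highest hybrid similarity. Experimental results show that recall reaches 83.4% with the hybrid similarity, compared with 54.8% using textual similarity alone and 78.2% using change similarity alone.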

Abstract: Communication and coordination among OSS developers who do not work in the same physical location has always been a challenging issue. The pull-based development model, as the state-of-the-art collaborative development mechanism, provides high openness and transparency and improves the visibility of contributors' work. However, because of the parallel and uncoordinated nature of this model, more than one contributor may still submit duplicate contributions that solve the same problem. If not detected in time, duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant work. In this paper, we propose an approach that combines textual and change similarities to automatically detect duplicate contributions in the pull-based model at submission time. For a newly arriving contribution, we first compute the textual similarity and the change similarity between it and the existing contributions. Our method then returns a list of candidate duplicates that are most similar to the new contribution in terms of the combined textual and change similarity. The evaluation shows that, on average, 83.4% of the duplicates can be found with the combined similarity, compared with 54.8% using only textual similarity and 78.2% using only change similarity.
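
The ranking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measures or how they are combined, so the sketch assumes Jaccard overlap of word tokens for textual similarity, Jaccard overlap of changed file paths for change similarity, and a fixed weight `alpha` for the combination. The `PullRequest` class, `rank_candidates`, `alpha`, and `k` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PullRequest:
    """Hypothetical record for one contribution."""
    number: int
    text: str                                   # title plus description
    changed_files: Set[str] = field(default_factory=set)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def textual_similarity(p: PullRequest, q: PullRequest) -> float:
    # Assumption: lowercase whitespace tokens stand in for the paper's text model.
    return jaccard(set(p.text.lower().split()), set(q.text.lower().split()))


def change_similarity(p: PullRequest, q: PullRequest) -> float:
    # Assumption: overlap of changed file paths stands in for the paper's change model.
    return jaccard(p.changed_files, q.changed_files)


def rank_candidates(new_pr: PullRequest, history: List[PullRequest],
                    alpha: float = 0.5, k: int = 3) -> List[int]:
    """Return the numbers of the k existing contributions most similar to new_pr.

    The combined score is a weighted sum; alpha is an assumed fixed weight.
    """
    scored = []
    for old in history:
        score = (alpha * textual_similarity(new_pr, old)
                 + (1 - alpha) * change_similarity(new_pr, old))
        scored.append((score, old.number))
    # Highest score first; ties broken by the lower contribution number.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [number for _, number in scored[:k]]
```

Called with a new contribution and a list of existing ones, `rank_candidates` returns the candidate duplicates in descending order of combined score, which a reviewer could then inspect manually.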
