Detecting and Untangling Composite Commits via Attributed Graph Modeling

徐圣斌; 陈思宇; 姚远; 徐锋

doi:10.1007/s11390-024-2943-9

Structured abstract:

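Background: During software development, developers commonly use version control systems to track software changes in the form of commits. An atomic commit contains a single change intent (for example, fixing one bug or adding one feature). Compared with composite commits, which contain multiple change intents, atomic commits are easier to review and reuse, and are therefore recommended as a best practice. However, prior studies show that composite commits are widespread in both commercial and open-source software projects, so methods that automatically detect and untangle composite changes are needed. Existing work has proposed several untangling methods based on code change graphs, but these methods underuse both the structural information of the code change graph and the natural-language semantic information contained in the code. Moreover, they focus only on the untangling algorithm and do not consider composite commit detection, so they are likely to split atomic commits further.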

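Objective: This study aims to propose a more accurate and efficient method for detecting and untangling composite commits, to help developers better follow the practice of atomic commits.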

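Methods: This paper proposes a composite commit detection and untangling method with three stages: commit graph construction, composite commit detection, and composite commit untangling. In the commit graph construction stage, an attributed commit graph is built for each commit, in which nodes correspond to the code statements related to the commit and edges correspond to code dependencies. Compared with existing work, this paper is the first to propose an edge relation based on subword co-occurrence to capture the natural-language similarity between code statements, and it also uses a pre-trained model to generate embeddings for nodes as node attributes. In the detection stage, a neural network is first trained on the constructed commit graphs and their labels, and the trained network is then used to identify composite commits, which are passed on to the untangling stage. In the untangling stage, deep convolution operations and the affinity propagation clustering algorithm are used to untangle composite commits. Compared with existing methods, this approach can capture high-order neighbor information in the commit graph.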
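As an illustration of the subword co-occurrence edge relation mentioned in the methods, the following is a minimal sketch and not the authors' implementation. The tokenization rule (splitting identifiers on underscores and camelCase boundaries), the keyword stop list, the `min_shared` threshold, and the function names `subwords` and `cooccurrence_edges` are all assumptions made for this example; the abstract does not specify them.

```python
import re
from itertools import combinations
from typing import List, Set, Tuple

# Assumed stop list: a few C# keywords that carry no domain meaning.
STOP_WORDS = {"var", "int", "string", "new", "return", "void"}


def subwords(statement: str) -> Set[str]:
    """Split the identifiers in a code statement into lowercase subwords."""
    words = set()
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", statement):
        # Break on underscores and camelCase/PascalCase boundaries,
        # keeping acronym runs such as "HTTP" together.
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", ident):
            words.add(part.lower())
    return words - STOP_WORDS


def cooccurrence_edges(
    statements: List[str], min_shared: int = 2
) -> List[Tuple[int, int]]:
    """Link two statements when they share at least `min_shared` subwords.

    The threshold is a hypothetical parameter chosen for illustration.
    """
    subs = [subwords(s) for s in statements]
    edges = []
    for i, j in combinations(range(len(statements)), 2):
        if len(subs[i] & subs[j]) >= min_shared:
            edges.append((i, j))
    return edges


if __name__ == "__main__":
    changed = [
        "var userName = GetUserName(id);",
        "Log.Info(userName);",
        "int retryCount = 0;",
    ]
    # Statements 0 and 1 share the subwords "user" and "name".
    print(cooccurrence_edges(changed))
```

In a full pipeline, edges like these would be added to the commit graph alongside the code dependency edges, so that statements with no direct dependency but similar vocabulary can still be connected.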

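Results: When the input contains both atomic and composite commits, the proposed detect-then-untangle method improves untangling accuracy by 5.1% on average over existing untangling methods. Separate experiments were also run on the detection module and the untangling module. The detection module achieves relative improvements of 89.7% in F1-score and 7.6% in accuracy and generalizes better, and the untangling module on its own is also more accurate and more efficient than existing untangling methods on composite commit data.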

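Conclusions: The experimental results demonstrate the effectiveness of the detect-then-untangle approach for the composite commit untangling task. They also confirm that existing untangling methods underuse the structural information of commits and the natural-language semantic information of code, and that the proposed improvements address these shortcomings.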

Abstract: During software development, developers tend to tangle multiple concerns into a single commit, resulting in many composite commits. This paper studies the problem of detecting and untangling composite commits, so as to improve the maintainability and understandability of software. Our approach is built upon the observation that both the textual content of code statements and the dependencies between code statements are helpful in comprehending the code commit. Based on this observation, we first construct an attributed graph for each commit, where code statements and various code dependencies are modeled as nodes and edges, respectively, and the textual bodies of code statements are maintained as node attributes. Based on the attributed graph, we propose graph-based learning algorithms that first detect whether the given commit is a composite commit, and then untangle the composite commit into atomic ones. We evaluate our approach on nine C# projects, and the results demonstrate the effectiveness and efficiency of our approach.

Detecting and Untangling Composite Commits via Attributed Graph Modeling