CHANN: A Hierarchical Neural Network for Clone Consistency Prediction

张凡龙; 陈宇琛; 邱少青; 姜文超

doi:10.1007/s11390-023-2831-8

Summary:

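Background: Because the code fragments in a clone group are similar to one another, modifying one clone may raise a clone consistency change problem. Recent studies show that consistent changes to clones not only incur extra maintenance costs but can also introduce clone defects if developers forget to maintain consistency. To address this problem, researchers have proposed clone consistency prediction methods based on handcrafted features. Although these methods can predict clone consistency to some extent, their capability is generally weak, and the results are especially unsatisfactory in the early stage of software development, when little data is available.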

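Objective: In recent years, deep learning has been applied to software engineering tasks with considerable success. This motivates us to explore using deep learning to extract better features from cloned code and its evolution, and thus to model and predict clone consistency more effectively.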

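Methods: We propose the Clone Hierarchical Attention Neural Network (CHANN), which represents and extracts features of cloned code and its evolution from the perspectives of code, context, and code evolution, thereby improving the effectiveness of clone consistency prediction. Specifically, at the clone fragment level, we adopt the AST-based neural network (ASTNN) to capture the syntactic and semantic features of each cloned code fragment. At the clone group level, we adopt a clone group neural network to capture the contextual features of the clone group and reveal the relationships among the clone fragments within it. At the clone evolution level, we adopt a clone evolution neural network to capture the historical evolution features of the cloned code before it changes. In addition, we introduce an attention mechanism at both the clone group level and the clone evolution level, which helps extract the relatively important information from the group and evolution representations.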
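The three levels described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each clone fragment has already been encoded as a fixed-length vector (standing in for the ASTNN output), and it replaces the clone group and clone evolution networks with plain attention pooling against fixed context vectors. The names `attention_pool`, `encode_clone_history`, `predict_consistency`, and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weight each vector by softmax(dot(vector, context)) and sum them.

    Assumption: a learned model would produce `context` during training;
    here it is a fixed list of floats.
    """
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def encode_clone_history(history, group_context, evolution_context):
    """Encode a clone group's history into one vector.

    history: list of versions, oldest first; each version is a list of
    fragment vectors (hypothetical stand-ins for ASTNN embeddings).
    """
    # Group level: pool the fragments of each version into one group vector.
    group_vectors = [attention_pool(version, group_context)[0] for version in history]
    # Evolution level: pool the group vectors across versions.
    return attention_pool(group_vectors, evolution_context)


def predict_consistency(evolution_vector, classifier_weights, bias=0.0):
    """Logistic output: a score in (0, 1) for 'this change needs consistency'."""
    z = sum(x * w for x, w in zip(evolution_vector, classifier_weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Two versions of a clone group; the second version gained a third fragment.
history = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
]
vector, version_weights = encode_clone_history(history, [0.5, 0.5], [1.0, 0.0])
probability = predict_consistency(vector, [0.3, -0.2])
```

The attention weights at each level sum to one, so they can be read as the relative importance assigned to each fragment or each historical version in this simplified setting.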

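Results: To evaluate the effectiveness of CHANN, we conducted experiments on a dataset collected from eight open-source projects. The experimental results show that CHANN is highly effective in predicting clone consistency, with precision, recall, and F-measure all reaching around 82%.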
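For readers unfamiliar with the three reported metrics, the following sketch computes precision, recall, and F-measure for a binary prediction task using their standard definitions. The labels below are made up for illustration and are not data from the paper; whether the paper averages these measures across classes or projects is not stated here, so this shows only the single-class case.

```python
def precision_recall_f_measure(y_true, y_pred, positive=1):
    """Return (precision, recall, F-measure) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Hypothetical labels: 1 = change requires consistency, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
precision, recall, f_measure = precision_recall_f_measure(y_true, y_pred)
```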

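Conclusions: CHANN is a distinctive neural architecture designed to capture multi-level feature information from cloned code and its evolution. The experimental results support our hypothesis that a hierarchical neural network can help developers predict clone consistency more effectively in cross-project settings. When there is not enough data in the early stage of software development, the hierarchical neural network helps developers work around the lack of data and keeps clone consistency prediction effective, which in turn helps them avoid software defects and improve software quality.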

Abstract: Modifying a code segment may give rise to a consistency issue when the segment belongs to a clone group of closely similar code segments. Recent studies have demonstrated that such consistent changes can incur extra maintenance costs, because clones must be checked for consistency, and can introduce defects if developers forget to change clones consistently when needed. To address this problem, researchers have proposed predicting clone consistency in advance from handcrafted attributes, notably using machine learning methods. Although these attributes help predict clone consistency to some extent, the capability of such an approach is generally weak and unsatisfactory in practice. The limitation is especially severe at a project's infancy stage, when there is not enough within-project data to model clone consistency behavior and cross-project data have not been helpful in supporting prediction. In this paper, we propose the Clone Hierarchical Attention Neural Network (CHANN), which represents code clones and their evolution from a hierarchical perspective of code, context, and code evolution, thereby enhancing the effectiveness of clone consistency prediction. To assess the effectiveness of CHANN, we conduct experiments on a dataset collected from eight open-source projects. The experimental results show that CHANN is highly effective in predicting clone consistency: the precision, recall, and F-measure attained in prediction are all around 82%. These findings support our hypothesis that the hierarchical neural network can help developers predict clone consistency effectively in the case of cross-project incubation, when insufficient data are available at the early stage of software development.

CHANN: A Hierarchical Neural Network for Clone Consistency Prediction