基于深度神经网络的多特征融合Stack Overflow问题回复时间预测方法

郭世凯; 王思文; 李辉; 范玉龙; 刘亚清; 张斌

doi:10.1007/s11390-023-1438-4

基于深度神经网络的多特征融合Stack Overflow问题回复时间预测方法

Multi-Feature Fusion Based Structural Deep Neural Network for Predicting Answer Time on Stack Overflow

摘要

摘要:
研究背景 程序开发人员在软件开发的过程中往往会遇到许多技术问题，提出具体问题并从在线专家那里得到有针对性回答的方式是当前最常用的方式之一。但是程序开发人员所提问题回复时间的长短取决于很多因素，包括问题的表述方式、问题表述的细致度、问题类别的数量、在线并对问题感兴趣的用户数，等等。目前的相关研究工作集中在预测问题是否会在给定的时间间隔内被回复，而预测出其具体的回复时间在目前尚未见报道。如果能够准确高效地预测问题的回复时间，且能够让用户对其回复时间有一个明确的认知，开发人员就能够更合理安排自己的工作，既能提高工作效率，也能提升用户对平台的体验感受。
目的我们的研究目标是预测问答网站上问题的回复时间。由于目前在线问答网站有一个明显的缺点，对于用户发布的问题它没有一个明确的预计回复时间。如果问答网站可以为用户发布的问题提供一个预计的回复时间，可以帮助用户更合理的安排他们的时间，提高工作效率，也可以提升用户对平台的体验感受，问答网站也会变得越来越受用户欢迎。
方法我们提出了PRT模型，这是一种将深度神经网络与多特征融合相结合的方法，提取出问题的多种特征并进行分析，将相关的特征进行提取和融合，再结合全连接神经网络模型来预测问答网站上问题的回复时间，然后通过平均相对误差来衡量PRT模型的性能。通过这种方法，我们采用Stack Overflow平台上的问题数据进行实验分析，以证明PRT模型的有效性。
结果对于问答网站上用户提出的问题，PRT模型预测的问题的回复时间与该问题的实际回复时间相比平均误差在5.5个小时左右，而传统的回归模型的预测误差在15个小时左右。PRT模型使得平均相对误差缩短了将近10个小时，因此我们提出的PRT模型在预测Stack Overflow上问题的回复时间方面相比于传统的回归算法具有更好的性能。
结论我们将预测回复时间这类问题当作回归问题来处理，找到影响问题的回复时间的特征集合，通过特征融合和深度神经网络方法相结合，用于预测问题的具体回复时间。对于一个新发布的问题帖子，通过该模型可以直接预测出其具体的回复时间，用户可以根据模型的预测结果决定选择另一个解决方案还是继续等待一个可接受的答案，可以帮助用户更好的安排时间。通过一系列的实验分析，表明本文提出的用于预测问题回复时间的模型框架具有较好的性能。此外，我们还讨论了其它潜在的改进方向，例如使用卷积神经网络或者递归神经网络代替全连接神经网络模型，通过模型改进和参数优化过程使得模型具有更好的性能。

Abstract: Stack Overflow provides a platform for developers to seek suitable solutions by asking questions and receiving answers on various topics. However, many questions are usually not answered quickly enough. Since the questioners are eager to know the specific time interval at which a question can be answered, it becomes an important task for Stack Overflow to feedback the answer time to the question. To address this issue, we propose a model for predicting the answer time of questions, named Predicting Answer Time (i.e., PAT model), which consists of two parts: a feature acquisition and fusion model, and a deep neural network model. The framework uses a variety of features mined from questions in Stack Overflow, including the question description, question title, question tags, the creation time of the question, and other temporal features. These features are fused and fed into the deep neural network to predict the answer time of the question. As a case study, post data from Stack Overflow are used to assess the model. We use traditional regression algorithms as the baselines, such as Linear Regression, K-Nearest Neighbors Regression, Support Vector Regression, Multilayer Perceptron Regression, and Random Forest Regression. Experimental results show that the PAT model can predict the answer time of questions more accurately than traditional regression algorithms, and shorten the error of the predicted answer time by nearly 10 hours.

HTML全文

参考文献()

施引文献

资源附件()