Multi-Feature Fusion Based Structural Deep Neural Network for Predicting Answer Time on Stack Overflow
-
Extended Abstract
Background: During software development, developers often encounter technical problems, and posting a specific question and receiving targeted answers from online experts is currently one of the most common ways to solve them. However, the answer time of a posted question depends on many factors, including how the question is phrased, how detailed its description is, how many categories (tags) it carries, how many interested users are online, and so on. Existing studies focus on predicting whether a question will be answered within a given time interval, whereas predicting the specific answer time has not yet been reported. If the answer time of a question could be predicted accurately and efficiently, so that users have a clear expectation of it, developers could schedule their work more reasonably, improving both their productivity and their experience with the platform.
Objective: Our goal is to predict the answer time of questions on question-and-answer (Q&A) websites. A notable drawback of current online Q&A websites is that they provide no expected answer time for posted questions. If a Q&A website could offer an expected answer time for each posted question, users could arrange their time more reasonably and work more efficiently, the user experience of the platform would improve, and the website would become increasingly popular.
Methods: We propose the PAT model, a method that combines a deep neural network with multi-feature fusion. It extracts and analyzes multiple features of a question, fuses the relevant features, and feeds them into a fully-connected neural network to predict the answer time of questions on Q&A websites; the performance of the PAT model is measured by the mean relative error. We conduct experiments on question data from the Stack Overflow platform to demonstrate the effectiveness of the PAT model.
Results: For questions posted on Q&A websites, the answer time predicted by the PAT model deviates from the actual answer time by about 5.5 hours on average, while the prediction error of traditional regression models is about 15 hours. The PAT model thus shortens the mean relative error by nearly 10 hours; therefore, the proposed PAT model outperforms traditional regression algorithms in predicting the answer time of questions on Stack Overflow.
Conclusions: We treat answer time prediction as a regression problem, identify the feature set that affects the answer time of questions, and combine feature fusion with a deep neural network to predict the specific answer time. For a newly posted question, the model directly predicts its answer time, so the user can decide whether to adopt an alternative solution or keep waiting for an acceptable answer, which helps users manage their time better. A series of experiments shows that the proposed framework performs well in predicting the answer time of questions. In addition, we discuss potential improvements, such as replacing the fully-connected neural network with a convolutional or recurrent neural network, and further improving performance through model refinement and parameter optimization.
Abstract: Stack Overflow provides a platform for developers to seek suitable solutions by asking questions and receiving answers on various topics. However, many questions are usually not answered quickly enough. Since questioners are eager to know the specific time interval within which a question can be answered, it becomes an important task for Stack Overflow to feed back the expected answer time of a question. To address this issue, we propose a model for predicting the answer time of questions, named the Predicting Answer Time (PAT) model, which consists of two parts: a feature acquisition and fusion model, and a deep neural network model. The framework uses a variety of features mined from questions on Stack Overflow, including the question description, the question title, the question tags, the creation time of the question, and other temporal features. These features are fused and fed into the deep neural network to predict the answer time of a question. As a case study, post data from Stack Overflow are used to assess the model. We use traditional regression algorithms as the baselines, including Linear Regression, K-Nearest Neighbors Regression, Support Vector Regression, Multilayer Perceptron Regression, and Random Forest Regression. Experimental results show that the PAT model can predict the answer time of questions more accurately than traditional regression algorithms, and shortens the error of the predicted answer time by nearly 10 hours.
-
Keywords:
- answer time
- structural deep neural network
- Stack Overflow
- feature acquisition
- feature fusion
-
1. Introduction
During the process of software development, developers often spend a large amount of time searching for assistance in various ways, such as handbook querying, forum discussions, and online questions. Nowadays, asking specific questions and getting targeted answers from online experts is generally considered the most effective way to find appropriate answers to technical questions[1]. Therefore, many online forums and platforms have emerged to provide this service.
Stack Overflow is one of the most famous and reliable online Community Question and Answer (CQA) sites, exchanging knowledge and solving problems for developers[2, 3]. Some community users can post programming and technical questions, while others can easily find the corresponding posts according to their interests and demands. Furthermore, Stack Overflow is one of the largest CQAs for computer programming[4, 5]. All the records about questions are openly available. These records are organized as datasets, which contain “Posts”, “Users”, “Votes”, “Comments”, “PostHistory”, “PostLinks”, etc.
Among them, the “Posts” datasets contain the most valuable information, including all the questions, answers, and their interactions. There are more than 15 million posts written by 8.5 million users, with a total size of 15 GB, on Stack Overflow, covering more than 500 programming languages[6]. The rich feature set of Stack Overflow has attracted the attention of many professional software developers: users can edit questions, answer questions, vote on the quality of answers, and comment on individual questions and answers. Besides, a growing number of users are sharing their programming algorithms, library technologies, and programming problems through Stack Overflow[7]. The open datasets can also be used in a variety of ways to perform statistical analysis on the posted questions, evaluate the quality of questions and answers, and help the developer community obtain better technical support[8].
When a developer posts a question, he (or she) is often eager to receive an answer as soon as possible. But many questions are usually not answered quickly enough for various reasons. Therefore, providing the specific time interval within which a question will be answered, which is named the “answer time” in this paper, can relieve the questioners' anxiety. However, the answer time of a question actually depends on many factors, including how the developer describes the question, whether the question is described in detail, how many tags are used to categorize the question, whether the question is recommended to related developers[9, 10], how many developers are online and interested in the question, etc.[11]. One obvious drawback of Stack Overflow is that it does not give a clear expected answer time for questions. As a result, the developers who post questions do not know the specific answer time, and thus may have to wait for a long time to get answers[12]. It is reported that 92% of the questions on Stack Overflow were answered, but the average answer time is about 24 days[13]. In other words, if someone posts a question, he (or she) may have to wait for about 24 days on average to receive an answer, because he (or she) does not know the specific time when the question will be answered, which prevents the question from being solved in a timely manner. Therefore, predicting the answer time of a question on Stack Overflow has become a challenging task.
In recent years, some machine learning techniques have been used to address this challenge. Previous work formulated the problem in different ways and reported different accuracy measures for predicting the answer time. For example, Bhat et al.[12] formulated it as a classification problem of predicting 1) whether a given question will be answered in less than 16 minutes or not, and 2) whether a given question will be answered in less than or equal to one hour, or in greater than or equal to one day. They studied multiple factors of questions on Stack Overflow and reported that popularity (i.e., the usage frequency of the tag) and the number of subscribers (i.e., how many users can answer the question containing the tag) played a key role in predicting the answer time of questions, which also proves the importance of tags in predicting the answer time. On this basis, Wu et al.[1] labeled the time into four different answer time groups: within one hour, one to four hours, four to 12 hours, and 12 hours or more. The datasets were then used for training classification models (including Support Vector Machine, Random Forest Classifier, Logistic Regression, Decision Tree, Neural Network, Gaussian Naive Bayes, and K-Nearest Neighbors) and evaluating the classification accuracy of each algorithm. However, these researchers[1, 12] all formulated the problem as a classification problem: they focused on whether the question would be answered within a specific time frame, rather than predicting the specific time interval within which the question receives an acceptable answer.
In this work, we conduct a comprehensive study of the features of the question. We define a new problem formulation, which re-formulates the answer time prediction as a regression problem. Then we propose a new regression model named Predicting Answer Time (PAT) model. Specifically, we extract multiple text features and time features from the question, including the question description (Body), question title (Title), question tags (Tags), the creation time of the question (Time-rate), and question week feature (Week). Consequently, we use the Doc2vec model to convert text features into vectors. Then the normalization method is used to calculate the value of the time feature. We fuse them to get the new feature vector. Finally, we feed the new feature vector into the fully-connected neural network to predict the answer time of the question. We evaluate the performance of the PAT model by the relative error of the answer time. Finally, we assess the validity of the PAT model through experimental studies based on datasets of Stack Overflow.
The main contributions of this work are as follows.
1) Considering the practical implementation of Stack Overflow, we reconstruct the problem as a regression problem to accurately formulate the research question.
2) We propose a multi-feature fusion model based on a deep neural network, (i.e., the PAT model), to predict the answer time of questions on Stack Overflow.
3) We analyze and design features that may affect the answer time of questions. As a result, we identify a new feature set for predicting the answer time of questions. We experimentally prove that the PAT model outperforms Linear Regression, K-Nearest Neighbors Regression, Support Vector Regression, Multilayer Perceptron (MLP) Regression, and Random Forest Regression, in terms of the relative error of the answer time on Stack Overflow.
The remainder of this paper is organized as follows. Related work and motivation are discussed in Section 2. The design of the PAT model is described in Section 3. The experimental design and results are presented in Sections 4 and 5, respectively. The threats to validity are discussed in Section 6. The conclusions are given in Section 7.
2. Related Work and Motivation
2.1 Related Work
Prediction of the answer time on CQAs has attracted more and more attention from scientific researchers, from software engineering to artificial intelligence. Bhat et al.[12] studied multiple factors of questions on Stack Overflow and reported that popularity (the usage frequency of the tag) and the number of subscribers (how many users can answer the question containing the tag) play the key role in predicting the answer time of questions. Treude et al.[14] studied the questions on Stack Overflow and reported that 72.30% of the questions have two to four tags. A tag can reveal which topic a question belongs to, and developers can encode questions with tags to allow navigation to their questions. On this basis, Goderie et al.[15] reported that the answer time of questions could be predicted based on the features of question tags. They derived ideas from the model of Bhat et al. and presented three tag-related features associated with the answer time, namely the active user ratio of each tag (ASR), the responsive subscribers ratio of each tag (RSR), and the popularity level of each tag (PR). Then they classified the questions based on the tags' metrics and used the supervised learning algorithm K-nearest neighbors to calculate the expected answer time of questions.
As we know, the answer time may depend on whether a question is easy to answer. Therefore, it is worth investigating which kinds of questions are easy or difficult to answer. Teevan et al.[16] discussed the number of replied questions, the quality of the answers, and the speed of response on Facebook. They studied the punctuation of the question, the number of clauses, and the scope of the questions. It is reported that a question with a single clause is more likely to receive a faster response; namely, the description of the question has an impact on the predicted answer time[16]. Arguello et al.[17] investigated the factors affecting the communication between individuals and online communities in various aspects, such as the ability and scale of group identification, the status of new users and their contributions, the rhetorical strategies for publishing content, the coherence of topics, and the semantic complexity. It is revealed that questions with unclear semantics, questions with complex topics, and questions posted by novices are not easily replied to. Conversely, questions with simple language content, or questions whose posters have a greater degree of contribution, are more likely to be replied to.
On this basis, studies on answer time prediction for questions have emerged. Dror et al.[18] presented a prediction method based on multiple features to predict whether a question will be answered and how many answers it will receive. The purpose of this prediction is to help the user re-express his/her question (if it is unlikely to be answered) and reduce the frustration of waiting for an answer. However, it does not consider when the question would be answered. Arunapuram et al.[19] studied the answer time based on more than two million question-and-answer threads, and discussed the distribution and relevance prediction of the answer time for the questions on Stack Overflow. They derived the characteristics associated with the answer time by analyzing the length of the question title, keywords, punctuation, time of day, etc., and then employed a weighted average algorithm to predict the distribution range of the answer time. However, they only considered the impact of a single feature on the answer time.
Subsequently, Bhat et al.[12] formulated the answer time prediction problem as two separate classification tasks: 1) whether a given question will be answered in less than 16 minutes or not, and 2) whether a given question will be answered in less than or equal to one hour, or greater than or equal to one day. They reported that the tag features have an influence on predicting the answer time of the question. Wu et al.[1] conducted a comprehensive study on this basis, and labeled the time into four different answer time groups, which are within one hour, one to four hours, four to 12 hours, and 12 hours or more. They used a variety of classification algorithms for training and evaluated the performance of the algorithms through classification accuracy. Although many factors affecting the answer time of questions have been investigated in the previous studies, the features of the questions they considered are still not comprehensive. Thus we propose a new feature set to predict the answer time of questions, and take the prediction of the answer time as a regression task. The relative error of the answer time predicted by the model can be used for more intuitively understanding the answer time of questions.
2.2 Motivation
It is valuable to understand the answer time of a question on CQAs, because users are often eager to know the answer to the question. Most CQAs are not able to guarantee that users can receive satisfying answers to their questions on time, resulting in disappointment and frustration of users. Bhat et al.[20] reported that the answer time of about 37.7% questions on Stack Overflow is over one hour. Even worse, the answer time of 11.81% questions is longer than one day. It indicates that the answer time of questions is with a larger range of fluctuation. The above issues make it difficult for questioners to decide whether to switch focus to other parts of software development or to keep waiting for answers. This dilemma has brought great inconvenience for questioners to manage their time. Actually, the mechanism of providing users with an accurate time of answering their questions can not only help them manage their time reasonably, but also prompt them to rephrase their questions for obtaining answers faster.
Therefore, it is important to figure out the factors affecting the answer time of questions on CQAs, and then we can shorten the answer time of questions by adjusting the factors. These factors include changing the label of the question, shortening the content of the question, and posting a question at a specific time of day[21]. If CQAs provide the expected answer time of a question, it can help users better schedule their work hours and increase their productivity, and CQAs will also become more popular[22, 23]. At present, the studies on predicting the answer time of questions for CQAs, such as Stack Overflow, are still rare. Previous studies take the answer time prediction as a classification problem, in which the answer time is divided into several time intervals, and the performance of the model is usually determined by the accuracy of the classification. These studies only predict whether the question will be answered within a specified time interval. However, what users expect more is to know the specific time when the question will be answered. Thus, the previous studies do not fundamentally solve the problem of predicting the answer time of questions for users.
In this work, the problem is converted to a regression task, in which the relative error of the answer time is used to measure the performance of the proposed model. Hence, users can also understand the answer time of questions more intuitively. That is the motivation of carrying out this study.
3. Proposed Framework
3.1 Problem Statement
Whether the answer to a question can be accepted by the users depends on the quality of the question and the answer. The accepted answers are chosen and studied in this work, because we can only obtain the necessary time stamps from them. Thus the answer time is defined as the time span between the point when a question is posted and the point when the question receives an acceptable answer. Specifically, q_i denotes the i-th question, a_i denotes the acceptable answer for question q_i, and the answer time is defined as T_i = t(a_i) - t(q_i), where t(a_i) is the creation time of the acceptable answer and t(q_i) is the creation time of the question. Therefore, we create a set of features F = {F_1, F_2, ..., F_n} to predict the target variable T_i.
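To make the definition concrete, the following is a minimal Python sketch (our illustration, not code from the paper), assuming the two creation timestamps are available as ISO-format strings as in the Stack Overflow data dump:

```python
from datetime import datetime

def answer_time_seconds(question_creation: str, answer_creation: str) -> float:
    """Answer time T_i = t(a_i) - t(q_i), in seconds."""
    t_q = datetime.fromisoformat(question_creation)
    t_a = datetime.fromisoformat(answer_creation)
    return (t_a - t_q).total_seconds()

# Example: a question posted at 09:00 whose answer is accepted at 14:30.
print(answer_time_seconds("2020-01-06T09:00:00", "2020-01-06T14:30:00"))  # 19800.0
```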
3.2 Overview
In this subsection, we present the multi-feature fusion network based on the deep neural network for predicting the answer time of questions, named the Predicting Answer Time (PAT) model. It consists of two parts: 1) a feature acquisition and fusion model, and 2) a deep neural network model. The feature acquisition and fusion model covers both multi-feature extraction and multi-feature fusion. The entire framework is shown in Fig.1. We extract a variety of features from questions, divided into two types, namely text features and time features. We extract the body, title, and tags of questions as text features, and the creation time and week features of questions as time features. We then use the Doc2vec model[24] to convert the text features of questions into vectors, and use the normalization method to convert each time feature of a question into a specific value, which we expand into a vector. Then we use feature fusion to process these two types of vectors and obtain a new feature vector. In the deep neural network model, we feed the obtained feature vector into the three-layer fully-connected neural network model to predict the answer time of questions.
3.3 Feature Acquisition and Fusion Model
3.3.1 Multi-Feature Extraction
We conduct a comprehensive study for the questions on Stack Overflow and present a new feature set to predict the answer time of questions. Specifically, we extract text features and time features of questions as shown in Fig.1(a), where the text features include the body, title and tags of questions, and the time features include the creation time and week feature of questions. The mentioned features are listed below.
1) Body Feature (Body). It refers to the description of the question. The body of a question expands the summary provided by its title. The text should be well-written, engaging, and informative, and contains properly formatted sentences[25].
2) Title Feature (Title). The title is equivalent to a summary of the question. Since many Stack Overflow members may create content of a question that mismatches the title, we also need to consider this feature.
3) Tags Feature (Tags). Tags reflect related topics of the questions, and some tags may appear in the same questions[26]. Tags are the words or phrases that can highlight the main topics of the questions. They can also be used to help users rapidly identify interesting or self-related questions[26]. The posters have to specify the tag when creating the question on Stack Overflow. Specifically, each question must be labeled with one to five tags. With the help of tags, all the questions can be categorized clearly.
The purpose of using subject tags on Stack Overflow is to target questions to specific users. For example, a developer could add the tag “Java” when he or she posts a question on the topic of Java, so that developers who are interested in Java or usually answer Java-related questions can view it more quickly. Therefore, it is possible to make questions be answered faster by adjusting the factors that directly impact the answer time of the questions. For instance, Arguello et al.[17] suggested that the answer time can be shortened by cross-posting messages. Besides, Arunapuram et al.[19] reported that using more specific tags (for example, using visual-studio-2010/2008 and ruby-on-rails-3 instead of visual-studio and ruby-on-rails, respectively) can greatly reduce the answer time.
4) Creation Time of the Question (Time-rate). This is the time stamp for a question to be posted. We use the creation time of the question to determine and predict how long it will take for the question to get an acceptable answer. The creation time of questions may be in the morning, noon, or evening. Avrahami et al.[27] reported that developers answer questions more actively in the morning and at noon, compared with their performance in the afternoon. Therefore, the creation time of the question is a feature that needs to be considered for predicting the answer time of the questions.
We extract the numbers of hours, minutes, and seconds of the question creation time through the built-in time functions of Python. The Time-rate feature time_{\rm feature} can be expressed by

time_{\rm feature} = 3600 \times hours + 60 \times minutes + seconds,

where hours, minutes, and seconds denote the numbers of hours, minutes, and seconds of the question creation time, respectively.
5) Week Feature (Week). This represents on which day of the week the question is posted. It is known that the number of created questions can differ for each day of a week. For instance, the number of new questions may be relatively small on Monday, and the answer time could be relatively long, because many Stack Overflow users are busy working. Conversely, the number of new questions may also be small on Sunday, but the answer time may be shorter, because many Stack Overflow users rest at home. The week feature of a question can be extracted from the creation time of the question through the built-in time functions of Python, as sketched below. The values are enumerated by weekday ∈ {1, 2, 3, 4, 5, 6, 0}, whose elements denote Monday to Sunday, respectively.
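As a concrete illustration of the two time features, the sketch below extracts them with Python's standard datetime module (a minimal sketch; the paper only states that Python's built-in time functions are used):

```python
from datetime import datetime

def time_features(creation_date: str):
    """Extract the Time-rate and Week features from a post's CreationDate."""
    t = datetime.fromisoformat(creation_date)
    # Time-rate: seconds elapsed since midnight, as defined above.
    time_feature = 3600 * t.hour + 60 * t.minute + t.second
    # Week: Python's weekday() returns 0 for Monday ... 6 for Sunday;
    # shift it to the paper's encoding 1..6 for Monday..Saturday and 0 for Sunday.
    weekday = (t.weekday() + 1) % 7
    return time_feature, weekday

print(time_features("2013-03-04T10:15:30"))  # (36930, 1), a Monday morning
```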
In summary, there are two types of question features. The first type is the textual features, including the Body feature, the Title feature and the Tags feature. The second type is the time features, including the Time-rate feature and the Week feature. We convert the text features of the question into vectors through the Doc2vec model, and use the normalization method to convert the time features into vectors. Then we use the feature fusion method to fuse them into a new feature vector.
3.3.2 Multi-Feature Fusion
First, for the text features of questions, i.e., the Body feature, the Title feature, and the Tags feature, we use the Doc2vec model to convert the processed text sequences of questions into high-dimensional vectors. Each paragraph is represented by a unique vector, named the paragraph vector, and each word is also represented by a unique vector, named the word vector. We concatenate the paragraph vectors and word vectors, and then average the integrated vectors to get a new vector, which is used to predict the next word in the paragraph. The paragraph vector can also be considered as a word: it acts as a memory unit of the context or topic of the paragraph. Thus this method is generally named the Distributed Memory Model of Paragraph Vectors (PV-DM)[24]. The PV-DM method slides over and samples fixed-length word windows from one paragraph at a time, taking one of the words as the predicted word and the other words as the input words. Here we set the embedding vector dimensions of the Body feature, the Title feature, and the Tags feature to 50, 20, and 5, respectively.
The Doc2vec model involves two main stages. First, in the training stage, the word vectors, the softmax parameters, and the paragraph vectors are learned from the training data. Each paragraph has a unique paragraph vector \boldsymbol{d}, and each word has a unique word vector \boldsymbol{w}. More formally, given a sequence of training words w_1, w_2, ..., w_T in a paragraph with paragraph vector d, the objective of the Doc2vec model is to maximize the average log probability

\frac{1}{T}\sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}, d),

where the probability is given by the softmax

p(w_t \mid w_{t-k}, \ldots, w_{t+k}, d) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}},

and y_i is the unnormalized output value for word i. Each y is computed as

y = b + U\boldsymbol{h}(w_{t-k}, \ldots, w_{t+k}, d; \boldsymbol{w}, \boldsymbol{d}),

where U and b are the softmax parameters, and \boldsymbol{h} is constructed by concatenating or averaging the word vectors of the context words extracted from \boldsymbol{w} and the paragraph vector extracted from \boldsymbol{d}. The paragraph vector \boldsymbol{d} is trained jointly with the word vectors \boldsymbol{w}; after training, a vectorized representation of every training paragraph is available.
Second, in the inference stage, the vector of a new paragraph is obtained by gradient descent, while \boldsymbol{w}, U, b, and \boldsymbol{h} remain fixed.
Finally, we obtain the feature vectors of Body, Title, and Tags through the Doc2vec model.
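The paper does not name a particular Doc2vec implementation; the sketch below uses gensim's Doc2Vec as one plausible choice (dm=1 selects the PV-DM scheme described above, and the vector sizes follow the dimensions stated in this subsection):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# A toy corpus standing in for the preprocessed question bodies.
bodies = [
    "how to sort a list of dictionaries by value in python",
    "nullpointerexception when calling a method on an autowired bean",
]
corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(bodies)]

# PV-DM model for the Body feature (vector_size=50 as in the paper);
# the Title and Tags models would use vector_size=20 and 5, respectively.
body_model = Doc2Vec(corpus, dm=1, vector_size=50, window=5,
                     min_count=1, epochs=40)

# Inference stage: derive the vector of a new (unseen) question body
# by gradient descent while the trained parameters stay fixed.
new_body = "how do i merge two dictionaries in a single expression".split()
body_vec = body_model.infer_vector(new_body)
print(body_vec.shape)  # (50,)
```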
For the time features of the question, we use the normalization method to get the feature values. We obtain the eigenvalues of the Time-rate and Week features through the following formulas:

time_{\rm time\text{-}rate} = time_{\rm feature} / (3600 \times 24),

and

time_{\rm week} = weekday / 7,

where time_{\rm time\text{-}rate} and time_{\rm week} denote the Time-rate eigenvalue and the Week eigenvalue after normalization, respectively. We convert them into vectors by expanding the dimension.
We use the feature fusion algorithm to fuse the textual feature vectors and the time feature vectors of the question as follows. Frequently-used feature fusion methods include concatenation, element-wise addition, and element-wise multiplication[28]. Since concatenation can combine feature vectors of different dimensions, we use concatenation for feature fusion in this work. Let \boldsymbol{tf}_1 be the Body feature vector, \boldsymbol{tf}_2 the Title feature vector, \boldsymbol{tf}_3 the Tags feature vector, \boldsymbol{tf}_4 the Time-rate feature vector, and \boldsymbol{tf}_5 the Week feature vector. Then the high-level feature vector \boldsymbol{X} after combination is expressed by (1):

\boldsymbol{X} = \boldsymbol{tf}_1 \circ \boldsymbol{tf}_2 \circ \boldsymbol{tf}_3 \circ \boldsymbol{tf}_4 \circ \boldsymbol{tf}_5 = (\boldsymbol{tf}_1, \boldsymbol{tf}_2, \boldsymbol{tf}_3, \boldsymbol{tf}_4, \boldsymbol{tf}_5),  (1)

where \circ denotes the concatenation operator. The new feature vector can then be fed into the neural network model to predict the answer time of questions.
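Concretely, the concatenation in (1) simply stacks the five vectors end to end. A minimal numpy sketch with the dimensions used in this section (the random vectors stand in for real Doc2vec outputs):

```python
import numpy as np

tf1 = np.random.rand(50)                # Body vector (Doc2vec, dim 50)
tf2 = np.random.rand(20)                # Title vector (dim 20)
tf3 = np.random.rand(5)                 # Tags vector (dim 5)
tf4 = np.array([36930 / (3600 * 24)])   # normalized Time-rate, expanded to a vector
tf5 = np.array([1 / 7])                 # normalized Week value, expanded to a vector

X = np.concatenate([tf1, tf2, tf3, tf4, tf5])
print(X.shape)  # (77,) -> fed into the neural network
```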
3.4 Deep Neural Network Model
Neural networks simulate many interconnected processing units that resemble abstract versions of neurons[29-32]. The processing units are usually distributed in different layers. Typically, a neural network includes three parts. The first part is an input layer that contains the units representing the input fields; the second part includes one or more hidden layers; the third part is an output layer which contains a unit or units representing the target fields. The units in a neural network are connected with varying connection strengths (or weights). The input data is sent to the first layer, and then the corresponding values are propagated from each neuron to every neuron in the next layer. Finally, the result will be delivered from the output layer.
In this work, we use a three-layer fully-connected neural network model to predict the answer time of a question. The input of the fully-connected neural network model is the new feature vector \boldsymbol{X} obtained in Subsection 3.3. The structure of the neural network includes an input layer, two hidden layers, and an output layer, where the nodes in each layer receive input from the previous layer, i.e., the output of the nodes in the previous layer is the input of the nodes in the next layer. The inputs to each node are combined using a weighted linear combination, and the activation function of the hidden layers is ReLU in the implementation. Finally, the answer time of the question is obtained through the sigmoid function at the output.
As shown in Fig.1(b), the input data \boldsymbol{X} = \{x_1, x_2, \ldots, x_i, \ldots, x_f\} is given, where f is the number of input neurons. The neurons b_{11}, b_{12}, \ldots, b_{1h}, \ldots, b_{1q} form the first hidden layer with q neurons, and v_{1h}, \ldots, v_{fh} are the input weights of the corresponding nodes of the first hidden layer, with a bias value b'. The neurons b_{21}, b_{22}, \ldots, b_{2h}, \ldots, b_{2s} form the second hidden layer with s neurons, and u_1, \ldots, u_q are the input weights of the corresponding nodes of the second hidden layer, with a bias value b''. Additionally, y_j is the true value of the answer time of the question, and w_{h1}, \ldots, w_{hs} are the input weights of the output node, with a bias value b'''.
The computation process of the neural network is described as follows (for clarity, the sigmoid function is used as the activation throughout the derivation). Let \alpha_h be the input value of the h-th neuron in the first hidden layer:

\alpha_h = \sum_{i=1}^{f} v_{ih} x_i + b'.

The output value \alpha_{oh} of the h-th neuron in the first hidden layer is \alpha_{oh} = \varphi(\alpha_h), where \varphi(x) is the sigmoid activation function

\varphi(x) = \frac{1}{1 + e^{-x}},

whose derivative is

\varphi'(x) = \varphi(x)(1 - \varphi(x)).

Let \alpha'_h be the input value of the h-th neuron in the second hidden layer:

\alpha'_h = \sum_{l=1}^{q} u_l \alpha_{ol} + b'',

and let \alpha'_{oh} = \varphi(\alpha'_h) be its output value. Then \beta_j, the input value of the output neuron, is

\beta_j = \sum_{h=1}^{s} w_{hj} \alpha'_{oh} + b'''.

Therefore, the predicted value \hat{y}_j of the neural network is

\hat{y}_j = \varphi(\beta_j).

The loss E of the neural network is defined as

E = \frac{1}{2}(\hat{y}_j - y_j)^2.

Based on the loss function, the update formulas of the weights are deduced. During the training of the neural network model, the weights are updated as shown in (2), (3) and (4):

\bar{w}_{hj} = w_{hj} - \eta \frac{\partial E}{\partial w_{hj}} = w_{hj} - \eta \left(\frac{\partial E}{\partial \hat{y}_j} \times \frac{\partial \hat{y}_j}{\partial \beta_j} \times \frac{\partial \beta_j}{\partial w_{hj}}\right),  (2)

\bar{u}_h = u_h - \eta \frac{\partial E}{\partial u_h} = u_h - \eta \left(\frac{\partial E}{\partial \hat{y}_j} \times \frac{\partial \hat{y}_j}{\partial \beta_j} \times \frac{\partial \beta_j}{\partial \alpha'_{oh}} \times \frac{\partial \alpha'_{oh}}{\partial \alpha'_h} \times \frac{\partial \alpha'_h}{\partial u_h}\right),  (3)

\bar{v}_{ih} = v_{ih} - \eta \frac{\partial E}{\partial v_{ih}} = v_{ih} - \eta \left(\frac{\partial E}{\partial \hat{y}_j} \times \frac{\partial \hat{y}_j}{\partial \beta_j} \times \frac{\partial \beta_j}{\partial \alpha'_{oh}} \times \frac{\partial \alpha'_{oh}}{\partial \alpha'_h} \times \frac{\partial \alpha'_h}{\partial \alpha_{oh}} \times \frac{\partial \alpha_{oh}}{\partial \alpha_h} \times \frac{\partial \alpha_h}{\partial v_{ih}}\right),  (4)

where \bar{w}_{hj} is the updated connection weight between the second hidden layer and the output layer, \bar{u}_h is the updated connection weight between the two hidden layers, \bar{v}_{ih} is the updated connection weight between the input layer and the first hidden layer, and \eta is the learning rate; the bias values are updated in the same way as the connection weights.
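To make the derivation concrete, the following from-scratch numpy sketch performs one training step of this two-hidden-layer network with sigmoid activations, implementing the update rules (2)-(4) via the chain rule (the layer sizes and data are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# f input features, q and s hidden neurons, one output neuron.
f, q, s = 77, 8, 6
V = rng.normal(0, 0.1, (f, q)); b1 = np.zeros(q)   # input -> hidden 1 (v_ih, b')
U = rng.normal(0, 0.1, (q, s)); b2 = np.zeros(s)   # hidden 1 -> hidden 2 (u, b'')
W = rng.normal(0, 0.1, (s, 1)); b3 = np.zeros(1)   # hidden 2 -> output (w_hj, b''')

def train_step(x, y, eta=0.01):
    """One forward/backward pass; returns the loss E = (y_hat - y)^2 / 2."""
    global V, b1, U, b2, W, b3
    a1 = sigmoid(x @ V + b1)        # alpha_oh
    a2 = sigmoid(a1 @ U + b2)       # alpha'_oh
    y_hat = sigmoid(a2 @ W + b3)    # predicted answer time
    d3 = (y_hat - y) * y_hat * (1 - y_hat)   # dE/dbeta_j, as in (2)
    d2 = (d3 @ W.T) * a2 * (1 - a2)          # chain rule back to hidden 2, as in (3)
    d1 = (d2 @ U.T) * a1 * (1 - a1)          # chain rule back to hidden 1, as in (4)
    W -= eta * np.outer(a2, d3); b3 -= eta * d3
    U -= eta * np.outer(a1, d2); b2 -= eta * d2
    V -= eta * np.outer(x, d1);  b1 -= eta * d1
    return (0.5 * (y_hat - y) ** 2).item()

x, y = rng.random(f), 0.3           # one fused feature vector and its target
print(train_step(x, y))
```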
4. Experimental Design
4.1 Experimental Dataset
In order to extract the data from Stack Overflow, we start with the file named posts.xml from the Stack Overflow data dump, which contains all the user posts (i.e., questions and answers) on Stack Overflow. The detailed information of the posts is shown in Table 1. Firstly, we select the first 100000 questions posted on Stack Overflow in 2013. In order to ensure the timeliness of the data, we append 376685 and 372075 questions posted in January 2020 and February 2020, respectively, from the Stack Exchange website. Secondly, all the questions without an AcceptedAnswerId are removed during pre-processing, so that the remaining questions all have acceptable answers. Thirdly, we remove the HTML and other rich-text tags from the question descriptions, because these tags contain useless information that can increase the prediction error. Fourthly, we remove the questions with an answer time of more than 400000 seconds (i.e., more than about four days) to avoid excessive time variance that could affect the experimental results. Finally, we get three datasets for the experiments, which contain the questions of 2013, January 2020, and February 2020, respectively. The statistics of these three datasets are listed in Table 2.

Table 1. Attribute Information and Values of a Post

Name | Description
ID | ID of the post
PostTypeId | Type of the post: PostTypeId = 1 means a question; PostTypeId = 2 means an answer
AcceptedAnswerId | ID of the accepted answer post for the question post (exists only when PostTypeId = 1)
ParentId | ID of the related question post for the answer post (exists only when PostTypeId = 2)
CreationDate | Creation time of the post
Score | Average score given by viewers for the post
ViewCount | Total number of views of the post
Body | Description (body) of the post
OwnerUserId | ID of the post owner
OwnerDisplayName | Username of the post owner
LastEditorUserId | ID of the user who last edited the post
LastEditorDisplayName | Username of the user who last edited the post
LastEditDate | Date when the post was last edited
LastActivityDate | Date when the status of the post last changed
Title | Title of the post (exists only when PostTypeId = 1)
Tags | Tags of the post (exists only when PostTypeId = 1)
AnswerCount | Number of answers to the question post (exists only when PostTypeId = 1)
CommentCount | Number of comments on the post
FavoriteCount | Number of users who favorited the post (exists only when PostTypeId = 1)
ClosedDate | Date when the post was closed

Table 2. Statistics for the Three Datasets on Stack Overflow

Dataset | Number of Questions | Number of Answers | Number of Question-Answer Pairs After Pre-Processing
2013 | 100000 | 675611 | 32592
January 2020 | 376685 | 1048575 | 63530
February 2020 | 372075 | 846646 | 61799

After data pre-processing, each dataset is divided into a training set and a test set at a ratio of 9:1. As a result, 29332 questions are randomly sampled from the 2013 dataset as the training set. Similarly, 57177 and 55619 questions are sampled from the January 2020 and February 2020 datasets, respectively, for training. The datasets and the related code for the experiments can be found on GitHub.
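A minimal sketch of the pre-processing described above, assuming the rows of posts.xml are read with Python's standard library and using the attribute names of Table 1 (for the real multi-gigabyte dump, one would stream and discard elements instead of keeping everything in memory, and also strip HTML tags from the Body field, which is omitted here):

```python
import xml.etree.ElementTree as ET
from datetime import datetime

MAX_ANSWER_TIME = 400_000  # seconds, i.e., about four days

def load_question_answer_pairs(path="posts.xml"):
    """Keep questions that have an accepted answer and answer time <= 400000 s."""
    posts = {row.get("Id"): row.attrib
             for _, row in ET.iterparse(path, events=("end",))
             if row.tag == "row"}
    pairs = []
    for post in posts.values():
        if post.get("PostTypeId") != "1":              # keep questions only
            continue
        answer = posts.get(post.get("AcceptedAnswerId"))
        if answer is None:                             # no accepted answer
            continue
        t_q = datetime.fromisoformat(post["CreationDate"])
        t_a = datetime.fromisoformat(answer["CreationDate"])
        seconds = (t_a - t_q).total_seconds()
        if 0 <= seconds <= MAX_ANSWER_TIME:
            pairs.append((post, answer, seconds))
    return pairs
```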
4.2 Experimental Setup
In the process of encoding, the Doc2vec model is employed, in which the embedding vector dimension of the Body feature is set to 50, that of the Title feature to 20, that of the Tags feature to 5, and those of the time features to 1. We use a fully-connected network with three hidden layers, in which the numbers of neurons in the first, second, and third hidden layers are 100, 200, and 100, respectively, and the activation function of the hidden layers is ReLU. We also use the Dropout method to randomly exclude some neurons during each training pass to avoid overfitting of the neural network, which further improves the prediction performance. The Dropout parameter is set to 0.8. The activation function of the output layer is sigmoid.
In the process of training, the optimization uses the mean square error as the loss function, and the AdamOptimizer is used to dynamically adjust the model parameters during the training process[33]. Additionally, the learning rate is set to 0.01. This method helps the model achieve better convergence by dynamically adjusting the learning rate. Linear Regression[34], K-Nearest Neighbors Regression[35], Support Vector Regression[36], MLP Regression[37], and Random Forest Regression[38] are employed as the baseline algorithms. We build the system using the Python library scikit-learn for training the baselines with default parameter settings.
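The paper does not state which deep learning framework is used (the name AdamOptimizer suggests TensorFlow 1.x); the sketch below reproduces the stated architecture and hyper-parameters in PyTorch as one possible equivalent. The input dimension 77 follows the embedding sizes of Subsection 3.3 plus the two time values, and the dropout probability assumes the stated 0.8 is a keep probability:

```python
import torch
import torch.nn as nn

class PATNet(nn.Module):
    """Fully-connected network: 100-200-100 hidden units, ReLU, sigmoid output."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 100), nn.ReLU(),
            nn.Linear(100, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
            nn.Dropout(p=0.2),   # drop probability, if the stated 0.8 is a keep rate
            nn.Linear(100, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

model = PATNet(in_dim=77)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

X = torch.rand(32, 77)      # a batch of fused feature vectors
Y = torch.rand(32, 1)       # normalized answer times
for _ in range(10):         # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optimizer.step()
```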
4.3 Evaluation Metrics
As mentioned in Subsection 3.1, the answer time of a question is defined as the time span between the creation time of an acceptable answer and the creation time of the question. Thus we normalize the answer time interval and convert it to a value between 0 and 1, and the actual answer time after normalization is
Y_i = (T_i - time_{\rm min}) / (time_{\rm max} - time_{\rm min}).

Since we select the questions whose answer time is within four days, the maximum time time_{\rm max} is 400000 seconds, and the minimum time time_{\rm min} is set to 0 seconds by default.
We use Mean Square Error (MSE) as an indicator to evaluate the performance of the PAT model. Assuming the predicted value is y = \left\{ {{y_1},{y_2}, ..., {y_n}} \right\} and the true value is Y = \{ {Y_1},{Y_2}, ..., {Y_n}\} , MSE is defined by (5).
MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - Y_i)^2.  (5)

Aiming at characterizing the error of the predicted answer time more clearly, the relative error of the answer time within four days, i.e., the deviation between the time predicted by the PAT model and the actual answer time, is used for measuring the performance of the PAT model. Specifically, the relative error of the answer time is defined as MSE' = \sqrt{MSE} \times 400000 / 3600, and the unit of MSE' is hour.
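Putting the normalization and the metric together, a small numpy sketch (with illustrative values, not data from the experiments):

```python
import numpy as np

TIME_MAX = 400_000  # seconds (about four days); time_min defaults to 0

def normalize(seconds):
    """Map answer times in seconds to Y_i in [0, 1]."""
    return np.asarray(seconds, dtype=float) / TIME_MAX

def mse_prime_hours(y_pred, y_true):
    """Relative error MSE' = sqrt(MSE) * 400000 / 3600, in hours."""
    mse = np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)
    return np.sqrt(mse) * TIME_MAX / 3600

y_true = normalize([18_000, 90_000, 250_000])   # actual answer times in seconds
y_pred = np.array([0.06, 0.21, 0.65])           # model outputs in [0, 1]
print(mse_prime_hours(y_pred, y_true))          # error expressed in hours
```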
5. Experimental Results
In this section, the experimental results are discussed in relation to the specific research questions (RQs).
5.1 RQ1: Can PAT Model Better Predict the Answer Time of Questions?
In this research question, we plan to explore whether the PAT model improves the prediction of the answer time of questions on Stack Overflow compared with previous regression algorithms. Similar to the comparison experiments conducted by Burlutskiy et al.[11], we compare the PAT model with traditional regression algorithms, namely Linear Regression[34], K-Nearest Neighbors Regression[35], Support Vector Regression[36], MLP Regression[37], and Random Forest Regression[38]. Previous studies show that these classic regression algorithms play an important role in data analysis, function fitting, and time series prediction. We record the experimental results of each regression algorithm, and by comparing the results we observe whether the PAT model is superior to the traditional algorithms in predicting the answer time of questions. For the baseline models, we use the same features as the PAT model to make predictions, extracting the Body, Title, Tags, Time-rate, and Week features of the questions. Table 3 shows the values of the relative error MSE' for the answer time of the PAT model and the baseline regression models on the three datasets. The unit of error in the table is hour.
Table 3. Values of Relative Error MSE' for Answer Time (h) for Three Datasets

Model | 2013 | January 2020 | February 2020
Linear Regression | 15.533570 | 19.709493 | 18.801359
K-Nearest Neighbors Regression | 16.545609 | 20.354959 | 19.860232
Random Forest Regression | 16.673747 | 23.814190 | 19.860628
Support Vector Regression | 16.539869 | 19.559889 | 19.323035
MLP Regression | 15.923066 | 33.428693 | 19.027602
PAT model | 5.597671 | 5.500320 | 5.499918

We can see from Table 3 that the values of the relative error MSE' of the PAT model are much smaller than those of the traditional regression models, and thus the PAT model performs better in predicting the answer time of questions on all three datasets. The optimal performance is marked in bold in Table 3. Besides, it can be seen from Table 3 that the gap of the prediction error across the different datasets is very small, which reveals that the prediction ability of the PAT model is stable across datasets.
Among the baseline models, the best prediction models are Linear Regression and Support Vector Regression. For the 2013 dataset, their prediction errors reach 15.533570 hours and 16.539869 hours, respectively, which is about three times the error of the PAT model. In other words, given a question, the error of the answer time predicted by the PAT model is about 5.5 hours compared with the actual answer time of the question, while the best result of the traditional regression models is around 16 hours. Therefore, the PAT model shortens the error by nearly 10 hours.
5.2 RQ2: How Does a Single Feature Extracted from a Question Affect the Prediction of Answer Time?
In this subsection, five experiments are carried out to explore the impact of the features on predicting the answer time of questions. We aim to figure out the most important feature of the questions. In each experiment, one feature is removed, namely, only the remaining four features are used as the input. The experimental results are compared with the result of the PAT model that considers all the features. Through these experiments, we observe the impact of each feature on the performance of the PAT model and identify the most important feature for predicting the answer time of questions. Table 4 shows the values of the relative error MSE' between the predicted values and the actual values of the answer time of questions after removing one feature from the questions. The bold values indicate the minimum error predicted by the model. The first column lists the features used, and the second column lists the feature that is not considered.
Table 4. Values of Relative Error MSE' for Answer Time (h) for PAT Model after Removing a Feature

Features Used from Questions | Feature Removed from the Questions | 2013 | January 2020 | February 2020
Body, Title, Tags, Week, Time-rate | None | 5.597671 | 5.500320 | 5.499918
Title, Tags, Week, Time-rate | Body | 6.401961 | 6.290593 | 6.356981
Body, Tags, Week, Time-rate | Title | 5.611655 | 5.508178 | 5.506810
Body, Title, Week, Time-rate | Tags | 5.598784 | 5.511303 | 5.505734
Body, Title, Tags, Week | Time-rate | 5.631999 | 5.537991 | 5.523851
Body, Title, Tags, Time-rate | Week | 5.614476 | 5.525467 | 5.522045

For the three datasets, it can be seen from Table 4 that the results containing all the features (i.e., Body, Title, Tags, Time-rate, Week) are optimal (the error of the answer time is about 5.5 hours), while the results after removing the Body feature are the worst (about 6.3-6.4 hours). Therefore, for predicting the answer time of questions, we need to consider as many features as possible, and each feature has a certain impact on the answer time of questions. Besides, it also reveals that the Body feature is the most important feature, because the Body feature represents the description of the question, which is the most informative one among all the features; the clarity or ambiguity of the question description directly affects the answer time of the question. Also, it can be seen from Table 4 that the Time-rate feature and the Week feature are relatively more important than the remaining features. In other words, the creation time of the question and the day of the week on which the question is posted are important for the answer time.
To analyze the impact of the Week feature on the performance of the PAT model in a more fine-grained way, we record the number of questions for each day of the week and the average answer time of the questions for the three datasets. Fig.2 shows the number of questions posted and the average answer time of questions on each day of the week for the three datasets, where Fig.2(a) shows the number of questions posted on each day of the week, and Fig.2(b) shows the average answer time of questions on each day of the week. It can be seen from Fig.2(a) that the number of questions decreases significantly on weekends, falling to between one-half and one-third of the weekday peak. Thus only a few people post questions on weekends. For the questions of 2013 and January 2020, it can be seen from Fig.2(b) that the average answer time is the shortest during the weekend. For the questions of February 2020, more questions are posted on weekends than in January 2020, but the average answer time in February is less than that in January, indicating that the data fluctuate greatly. For the three datasets, although there are few questions on weekends, the average answer time of questions per day does not differ much. It can be seen from Fig.2 that the Week feature can affect the answer time of questions, and it is thus an effective feature for predicting the answer time of questions.
Then we analyze the number of questions and the average answer time of questions in each hour for the three datasets, in order to study the impact of each time period of the day on the answer time of questions in more detail. It can be seen from Fig.3(a) that the number of questions is normally distributed and peaks in the 14–16 time period. However, it can be seen from Fig.3(b) that the average answer time of questions does not change significantly during this time period. Therefore, it is necessary to explore whether the hour of the day has an effect on the answer time of a question in the following research.
5.3 RQ3: How Does the Hour of the Day Affect the Answer Time of a Question?
To explore the impact of the hour in a day for the answer time of questions, we analyze the number of questions and the average answer time in each hour and each day in the above data analysis (as shown in Fig.3). We extract a new time feature from question data (named the Weekall feature) for representing which hour of the day the question was posted. We get the Weekall feature by
time_{\rm weekall} = h_o / 24,

where h_o is the hour extracted from the creation time of the question. We expand the Weekall feature value into a vector by expanding the dimension, and fuse it with the other feature vectors through the feature fusion model to form a new feature vector. Then, the feature set used for predicting the answer time of the question includes the Body, Title, Tags, Time-rate, Week, and Weekall features.
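For completeness, the Weekall feature can be extracted in the same way as the other time features (a minimal sketch in the style of the earlier examples):

```python
from datetime import datetime

def weekall_feature(creation_date: str) -> float:
    """Weekall: the posting hour of the day, normalized by 24."""
    return datetime.fromisoformat(creation_date).hour / 24

print(weekall_feature("2020-01-06T14:30:00"))  # 14 / 24 = 0.5833...
```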
We design the following two cases of comparisons for the three datasets. Table 5 shows the values of relative error MS{E'} for the answer time in these two cases: without the Weekall feature, and with the Weekall feature. The optimal results are marked in bold.
Table 5. Values of Relative Error MSE' for Answer Time (h) after Adding the Weekall Feature for Three Datasets

Features of Questions Used | 2013 | January 2020 | February 2020
Body, Title, Tags, Week, Time-rate | 5.597671 | 5.500321 | 5.499918
Body, Title, Tags, Week, Time-rate, Weekall | 5.593785 | 5.478284 | 5.497325

It can be seen from Table 5 that the values of the relative error MSE' for the answer time with all the features (Body, Title, Tags, Time-rate, Week, Weekall) are the smallest, and the improvement is the most obvious for the questions of January 2020. Therefore, the Weekall feature has a positive effect on predicting the answer time of questions. Furthermore, we can use the new feature set (including the Body, Title, Tags, Time-rate, Week, and Weekall features) to predict the answer time of questions.
In the following, we study the Tags feature of the question and analyze the number of questions with each specified tag. Figs.4-6 show the number of questions containing each of the top-100 tags for the three datasets. Due to space limitations, the 100 tags cannot be fully displayed, but we can still see the trend in the number of questions containing each of the top-100 tags. The greater the number of questions containing a tag, the more active the tag. We then aim to figure out whether the activity of tags impacts the answer time of questions, and whether the questions with active tags have shorter answer time. It can be seen from Figs.4-6 that the number of questions containing the top-10 tags is the largest, and these tags are active. Additionally, the number of questions containing the top 60-100 tags is small, and these tags are inactive. Therefore, we choose the questions with the top-50 tags for further study.
5.4 RQ4: How Does the Tag Activity Affect the Prediction of Answer Time?
To study the effect of tag activity on predicting the answer time of questions, we first select the questions with the top-k (k = 10, 20, 50) tags as the test set for each of the three datasets. In the previous experiments, we used all the processed question data for training and testing at a ratio of 9:1. In this experiment, the training set contains all the processed question data, and the test set consists of the questions with the top-k (k = 10, 20, 50) tags. As concluded in Subsection 5.3, the Weekall feature can improve the performance of the PAT model, and thus it is added in this experiment. We extract the Body, Title, Tags, Time-rate, Week, and Weekall features from the questions as the input of the deep neural network model to analyze the performance of the PAT model under different test sets. Table 6 shows the values of the relative error MSE' of the PAT model in predicting the answer time of questions for the three datasets. The first column is the test data, and the optimal results are marked in bold.
Table 6. Values of Relative Error MSE' for Answer Time (h) Under the Top-k (k = 10, 20, 50) Test Sets

Test Set | 2013 | January 2020 | February 2020
Questions with top-10 tags | 5.559093 | 5.522607 | 5.515567
Questions with top-20 tags | 5.567227 | 5.518562 | 5.517117
Questions with top-50 tags | 5.576814 | 5.526546 | 5.525517

It can be seen from Table 6 that the value of the relative error MSE' of the predicted answer time is the smallest, 5.515567 hours, when the questions with the top-10 tags are used as the test set for the February 2020 dataset. When the questions with the top-20 tags are used as the test set, the value of the relative error MSE' of the predicted answer time is 5.518562 hours for the January 2020 dataset. Therefore, the activity of tags impacts the performance of the PAT model, which produces better prediction results on the test sets with the top-10 and top-20 tags than on the test set with the top-50 tags. The results also reveal that the PAT model is more effective on test sets with active tags. It also suggests that labeling a question with a popular tag makes it easier to catch attention and get answers, if a user plans to ask a question for advice on Stack Overflow.
5.5 RQ5: How Does an Active Dataset Affect the Prediction of Answer Time?
In order to explore the impact of active datasets on the performance of the PAT model, we use all the questions with the top-k (k = 10, 20, 50) tags to predict the answer time of questions for the three datasets. We take the questions with the top-10 tags, the top-20 tags, and the top-50 tags as the datasets, and then divide each of them into a training set and a test set at a ratio of 9:1. Next, we extract the Body, Title, Tags, Time-rate, Week, and Weekall features of the questions as the input to the deep neural network model, and train the model to predict the answer time of questions. Finally, we record the results of the three experiments separately. Table 7 shows the values of the relative error MSE' for the answer time on the questions with the top-10 tags, top-20 tags, and top-50 tags for the three datasets, and the optimal results are marked in bold.
Table 7. Values of Relative Error MSE' for Answer Time (h) for PAT Model Under Different Datasets

Dataset | 2013 | January 2020 | February 2020
Questions with top-10 tags | 5.526785 | 5.501368 | 5.479926
Questions with top-20 tags | 5.553070 | 5.494471 | 5.475987
Questions with top-50 tags | 5.556707 | 5.487882 | 5.498280

It can be seen from Table 7 that the prediction performance using the questions with the top-k (k = 10, 20, 50) tags as the dataset is better than that using all the questions, for the 2013, January 2020, and February 2020 datasets. It reveals that the performance of the PAT model can be improved by using only the questions with active tags for the experiments. However, there are also differences between the datasets of different periods. It can be seen from Table 7 that there is no direct relationship between the activity of tags and the answer time of the question. In other words, it is not true that the more active the tags a question contains, the shorter its answer time. Actually, the answer time of questions fluctuates considerably. Therefore, we should not only consider the Tags feature, but also comprehensively consider multiple features to obtain the feature set of the questions.
5.6 RQ6: How Does the Activity of a Single Tag Affect the Prediction of Answer Time?
To explore the impact of a single specific active tag on the performance of the PAT model, we use the questions containing the top-10 tags as the training set, and questions with a single tag in the top-10 tags as the test set for the three datasets. We investigate the impact of a single tag on predicting the answer time of questions.
As before, we extract the Body, Title, Tags, Time-rate, Week, and Weekall features from the questions as the input to the deep neural network model, and predict the answer time of questions through model training. Table 8 shows the values of the relative error MSE' for the answer time predicted by the PAT model, using the questions with the top-10 tags as the training set and the questions with an individual tag as the test set, for the three datasets. The optimal results are marked in bold.
Table 8. Values of Relative Error MSE' for Answer Time (h) of Questions with Top-10 Tags

Dataset | Tag of Questions in Test Set | Relative Error MSE'
2013 | javascript | 5.535771
2013 | java | 5.503521
2013 | php | 5.494104
2013 | c# | 5.512172
2013 | android | 5.544136
2013 | jquery | 5.501302
2013 | html | 5.523103
2013 | python | 5.528663
2013 | ios | 5.570634
2013 | c++ | 5.490342
January 2020 | python | 5.496512
January 2020 | javascript | 5.516533
January 2020 | java | 5.501913
January 2020 | c# | 5.507224
January 2020 | html | 5.488374
January 2020 | reactjs | 5.499581
January 2020 | android | 5.531280
January 2020 | r | 5.449537
January 2020 | php | 5.500238
January 2020 | python-3.x | 5.538648
February 2020 | python | 5.536867
February 2020 | javascript | 5.503452
February 2020 | java | 5.493337
February 2020 | c# | 5.496660
February 2020 | html | 5.506823
February 2020 | r | 5.427288
February 2020 | reactjs | 5.506511
February 2020 | php | 5.495022
February 2020 | sql | 5.475149
February 2020 | android | 5.514656

It can be seen from Table 8 that, for the 2013 data, the questions with the “c++” tag perform the best in predicting the answer time of questions, with an error of 5.490342 hours, followed by the questions with the “php” tag, with an error of 5.494104 hours. For the data of January 2020 and February 2020, the values of the relative error MSE' for the answer time are the smallest for the questions with the “r” tag. Compared with the results of the experiment for RQ1, the prediction performance is improved. Therefore, the category of the tag can also impact the prediction of the answer time of questions. Additionally, it can be seen that the questions with the “c++” and “r” tags have smaller prediction errors than those with the other tags.
6. Threats to Validity
Internal Validity. A threat to internal validity is the user status on Stack Overflow. The degree of a user's contribution, that is, the user's honor status, may affect the answer time of questions. Users who have contributed substantially to Stack Overflow, and thus hold more badges and honor, are more likely to have their questions answered quickly, so the answer time of their questions is relatively short; for novice users, the answer time may be longer. Because the user population of Stack Overflow is uncertain, there is corresponding uncertainty in the answer time of questions.
Construct Validity. The question data from different periods differ considerably on Stack Overflow, which leads to differences in the experimental results; subsequent studies could address this from the perspective of data imbalance. In addition, when the Doc2vec model is used for text vectorization, the default embedding dimension is 100. Relying on default parameter settings may yield too few or redundant dimensions, and in our setting 100 dimensions already proved redundant. We therefore verified the performance of the model with embedding vectors of different dimensions through experiments; the results show that changing the embedding dimension has little influence on the performance of the model. Accordingly, we set an appropriate embedding dimension through experimental analysis in order to save space, time, and cost.
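As an illustration of this check, the sketch below (our assumption of the workflow, not the authors' script) trains Doc2vec with several embedding dimensions using gensim (version 4 or later) and returns one vector per question body; the downstream model can then be retrained on each setting and the resulting MSE' values compared.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def embed_bodies(bodies: list, dim: int) -> list:
    """Train Doc2vec on tokenized question bodies; return one vector per body."""
    docs = [TaggedDocument(words=body.lower().split(), tags=[i])
            for i, body in enumerate(bodies)]
    model = Doc2Vec(docs, vector_size=dim, window=5, min_count=2, epochs=20)
    return [model.dv[i] for i in range(len(bodies))]

# Sweep the embedding dimension and compare downstream prediction error:
# for dim in (50, 100, 200):
#     vectors = embed_bodies(bodies, dim)
#     ...feed `vectors` (plus the other features) into the model and record MSE'
```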
7. Conclusions
In this paper, we treated the prediction of answer time as a regression task and identified the feature set that affects the answer time of questions. Combining feature fusion with a deep neural network, we proposed the PAT model to predict the specific answer time of questions. For a new question post, the specific answer time can be predicted directly by the PAT model; based on the prediction, a user can decide whether to choose another solution or continue waiting for an acceptable answer, which helps the user manage time better. We conducted extensive experiments on real datasets from Stack Overflow, and the results showed that the PAT model predicts the answer time of questions well and outperforms the state-of-the-art models.
The PAT model can help users manage their time effectively by providing a concrete estimate of the time needed to answer their questions. It can also encourage users to rephrase their questions in order to get answers more quickly. As a result, users can obtain prompt and satisfactory answers, while the CQA site can attract more users through the improved user experience.
In a follow-up study, we plan to improve the PAT model by replacing the neural network model with another efficient model, such as the BERT model. We can also combine our feature acquisition and fusion model with traditional regression models to achieve better performance through model improvement and parameter optimization.
top-50 tags5.576814 5.526546 5.525517 Table 7 Values of Relative Error MS{E^{'}} for Answer Time (h) for PAT Model Under Different Datasets
Test Set Dataset 2013 January 2020 February 2020 Questions with
top-10 tags5.526785 5.501368 5.479926 Questions with
top-20 tags5.553070 5.494471 5.475987 Questions with
top-50 tags5.556707 5.487882 5.498280 Table 8 Values of Relative Error MS{E^{'}} for Answer Time (h) of Questions with Top-10 Tags
Dataset Tags of Questions
in Test SetValues of Relative
Error MSE'2013 javascript 5.535771 java 5.503521 php 5.494104 c# 5.512172 android 5.544136 jquery 5.501302 html 5.523103 python 5.528663 ios 5.570634 c++ 5.490342 January 2020 python 5.496512 javascript 5.516533 java 5.501913 c# 5.507224 html 5.488374 reactjs 5.499581 android 5.531280 r 5.449537 php 5.500238 python-3.x 5.538648 February 2020 python 5.536867 javascript 5.503452 java 5.493337 c# 5.496660 html 5.506823 r 5.427288 reactjs 5.506511 php 5.495022 sql 5.475149 android 5.514656 -