Self-Supervision-Based Unsupervised Domain Adaptation for Sentence Matching
Structured Abstract
Background: With the development of deep learning, neural network models have achieved good results on the sentence matching task. Deep learning models are data-driven: although they can be trained on the existing annotated data of one domain and perform well there, their performance drops dramatically when they face a new domain, due to the discrepancy between the source domain and the target domain. Moreover, a new domain often lacks large amounts of ready-made annotated data, so how to achieve domain adaptation with the labeled data of the source domain and the unlabeled data of the target domain is a problem that has to be faced. In past studies, adversarial domain adaptation has been a classic solution to unsupervised domain adaptation, but such adversarial training is usually difficult to converge in practice and does not specifically consider the characteristics of the sentence matching task. Therefore, unsupervised domain adaptation for sentence matching is a valuable challenge.
Objective: Our goal is to find a method that suits the characteristics of the sentence matching task, achieves domain transfer on sentence matching, and is easier to optimize than previous methods.
Methods: We propose self-supervision-based domain adaptation. We present four different auxiliary tasks, including tasks tailored to the characteristics of sentence matching, to help align the two domains without supervision and alleviate the performance drop of deep learning models in new domains.
Results: We conduct experiments on six datasets. Our method outperforms previous methods by 6.3% on average, which demonstrates its effectiveness. In addition, we experimentally explore how to use self-supervised tasks to improve performance. We find that domain-related self-supervised tasks are the most useful, self-supervised tasks that lead to domain separation are harmful, and using more self-supervised tasks is better.
Conclusions: Sentence matching models based on deep learning inevitably suffer performance drops when facing new domains. For unsupervised domain adaptation, we propose a self-supervision-based method that is easier to optimize and is also experimentally effective. In addition, we find that when using self-supervised tasks, domain-related tasks are the most useful, tasks that lead to domain separation are harmful, and using more self-supervised tasks is better.
Abstract: Although neural approaches have yielded state-of-the-art results in the sentence matching task, their performance inevitably drops dramatically when applied to unseen domains. To tackle this cross-domain challenge, we address unsupervised domain adaptation on sentence matching, in which the goal is to have good performance on a target domain with only unlabeled target domain data as well as labeled source domain data. Specifically, we propose to perform self-supervised tasks to achieve it. Different from previous unsupervised domain adaptation methods, self-supervision can not only flexibly suit the characteristics of sentence matching with a special design, but also be much easier to optimize. When training, each self-supervised task is performed on both domains simultaneously in an easy-to-hard curriculum, which gradually brings the two domains closer together along the direction relevant to the task. As a result, the classifier trained on the source domain is able to generalize to the unlabeled target domain. In total, we present three types of self-supervised tasks and the results demonstrate their superiority. In addition, we further study the performance of different usages of self-supervised tasks, which would inspire how to effectively utilize self-supervision for cross-domain scenarios.
Keywords:
- unsupervised domain adaptation
- sentence matching
- self-supervision
1. Introduction
Sentence matching is a fundamental task in natural language processing. It aims to judge the semantic relationship between two input sentences and is widely used in many applications such as natural language inference[1, 2], paraphrase identification[3-5], and question answering[6-8]. Recently, with the development of deep learning[9-18], neural approaches have achieved significant progress in this task.
Unfortunately, these neural approaches always require large-scale data annotation, which often makes their application to new domains prohibitively expensive. In real-world applications, sentence matching usually needs to face an unseen domain, and only unlabeled data is available in this new domain. If the neural network model is directly trained on a labeled source domain and makes predictions on the unseen target domain, the performance will inevitably drop dramatically due to the distribution shift between the two domains. Even with the powerful pre-trained model BERT[19], the domain shift problem is still serious, e.g., the performance of a model trained on a daily-life-domain dataset[5] drops dramatically on an education-domain dataset[4]. Therefore, it is an important challenge to achieve good sentence matching performance on a target domain with only unlabeled data.
To this end, we address unsupervised domain adaptation[20, 21] on sentence matching. Formally, in this task, apart from annotated data in the source domain, there is only unlabeled data in the target domain. Although unsupervised domain adaptation on sentence matching deserves effort, there are very few studies. Previous approaches[20-23] to unsupervised domain adaptation usually focus on the conventional classification task without considering the specific sentence matching task. In sentence matching, the input is a sentence pair instead of a single sentence, and making predictions requires capturing the internal contrast between the two input sentences. Therefore, previous methods are not well suited to unsupervised domain adaptation for sentence matching.
Moreover, most existing approaches to unsupervised domain adaptation suffer from the difficulty of optimization. Typically, the philosophy for addressing unsupervised domain adaptation is to induce alignment between the source and target domains through some transformation. These approaches generally implement it by minimizing a measurement of the distributional discrepancy, e.g., adversarial learning[22, 23] employs a learned discriminator of the source and the target as an approximation to the total variation distance. However, such measurements turn the training objective into a minimax optimization problem, which is known to be very difficult to solve[24]. Without careful balance, this kind of objective with opposite optimization directions often causes wild fluctuations in the discrepancy loss and leads to sudden divergence.
To tackle the above-mentioned issues, this paper proposes to utilize self-supervision[25, 26] to address unsupervised domain adaptation on sentence matching.
First, we present three types of self-supervised tasks to achieve alignment between the source and target domains. Specifically, we train the model on the same auxiliary task in both domains simultaneously in an easy-to-hard curriculum[27], and each task can facilitate the alignment between the two domains along a direction of variation relevant to the adopted self-supervised task. As a result, the classifier trained on the source domain is able to gradually generalize to the unlabeled target domain, which is shown in Fig.1, where S and T denote the source domain data and the target domain data, respectively. Due to the lack of labels in the target domain, self-supervised tasks, which create labels directly from the data itself without manual annotation, are naturally a good choice for the auxiliary tasks. Different from previous methods[20-23], self-supervision is able to flexibly provide special auxiliary tasks to suit the specific sentence matching task. Moreover, the objective of self-supervised tasks for domain alignment is straightforward to implement without a minimax optimization, which is different from previous methods such as adversarial learning. Therefore, it is much easier to optimize.
Figure 1. Visualization of how two domains are aligned by self-supervision. (a) Source classifier only. Without any self-supervised task, the source domain is far away from the target domain, and a source classifier can hardly generalize to the target. (b) Adding a self-supervised task. Performing one self-supervised task on both domains in a shared feature space can align the source and target along one direction. (c) Adding more self-supervised tasks. Performing more self-supervised tasks can further achieve domain alignment along multiple directions. Finally, the two domains are much closer and the source classifier is expected to be generalized to the target domain.

Second, we further analyze what should be noticed to effectively use self-supervision in cross-domain scenarios. Specifically, self-supervised tasks are divided into three types: general, domain-related, and task-related. Experimental results reveal that the domain-related task is the most effective, especially for smaller data sizes. Besides, designing a self-supervised task that leads to domain separation is not useful. Moreover, combining more self-supervised tasks is shown to be helpful.
In brief, our contributions are as follows.
● We present a self-supervised method for unsupervised domain adaptation on sentence matching, which can suit sentence matching and reduce the difficulty of optimization.
● We conduct experiments on six datasets across different domains and the results demonstrate that our method significantly outperforms previous state-of-the-art methods.
● We study different usages of self-supervision and discuss how to effectively utilize them for cross-domain scenarios.
2. Related Work
2.1 Sentence Matching
Sentence matching plays an important role in natural language processing. With the progress of neural networks, sentence matching approaches are divided into two types. One type is the encoding-based models[9-12, 17, 18]. They achieve semantic matching by encoding each sentence into a fixed-length vector without incorporating any information from the other sentence. Then, a classifier is applied to decide the relationship according to these independent sentence representations. The other type is joint models[13-16], which make up for the encoding-based models' lack of information from the other sentence. To capture the semantic relationship, this type of method uses cross-features between the two sentences, via an attention mechanism, to express word-level alignments.
2.2 Unsupervised Domain Adaptation
Despite the advantages of neural approaches, the performance of neural sentence matching models inevitably drops dramatically when applied to new domains due to the domain discrepancy. It is a common challenge that models are often trained on a labeled source domain and then applied to another target domain with only unlabeled data, which is known as unsupervised domain adaptation[20, 21]. Although some previous work studied unsupervised domain adaptation on several other natural language processing tasks, such as text classification[28, 29], reading comprehension[30, 31], and word segmentation[32], very little work has studied unsupervised domain adaptation on the sentence matching task. Recently, Rücklé et al.[33] first studied zero-shot transfer capabilities of text matching models and leveraged training signals of datasets to perform sentence matching across domains. However, it still does not present a universal solution to unsupervised domain adaptation on sentence matching.
2.3 Self-Supervision
In this paper, we utilize self-supervision[25, 26] to address unsupervised domain adaptation on sentence matching, which is able to consider the characteristics of sentence matching in the unsupervised domain adaptation setting and is easy to optimize. Self-supervised learning is able to train a network on auxiliary tasks where labels can be automatically obtained without manual annotation. In the natural language processing domain, self-supervision is often used to learn word embedding[34, 35] and language models[19, 36-38]. Motivated by the success of self-supervised learning, we use a self-supervised learning method to align the labeled source domain and unlabeled target domain.
2.4 Curriculum Learning
Curriculum learning[27] aims to train a model in a manner analogous to the human learning process, progressing from easy examples to hard ones, which benefits learning. It has been applied to some natural language processing tasks[39-43]. Inspired by this idea, we also align the source domain and the target domain in an easy-to-hard order. In our method, we organize the curriculum according to the domain transfer difficulty of the unlabeled target data.
3. Problem Definition
In the unsupervised domain adaptation setting, our goal is to train a sentence matching model that reduces the distribution shift across domains and achieves good classification performance on the target domain. Formally, the data for training consists of labeled instances from the source domain $D_S=\{(x_i^s, y_i^s)\}_{i=1}^{N_S}$ and unlabeled instances from the target domain $D_T=\{x_i^t\}_{i=1}^{N_T}$, where $S$ denotes the source domain and $T$ denotes the target domain. $N_S$ and $N_T$ are the data sizes of the source domain and the target domain, respectively. This is a common setup in many real scenarios, where annotations for the target domain are often expensive and time-consuming.
For the specific sentence matching task, an input instance $x$ is made up of two sentences $P=\{p_1,p_2,\ldots,p_I\}$ and $Q=\{q_1,q_2,\ldots,q_J\}$, where $p_i$ ($q_j$) is the $i$-th ($j$-th) word of sentence $P$ ($Q$), and $I$ ($J$) is the word length of $P$ ($Q$). The label $y$ indicates the relationship between the two sentences and covers $N_c$ categories. We assume that tasks across domains share the same label categories.
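For concreteness, the following is a minimal sketch of this data format; the field names and container types are illustrative assumptions, not from the original implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Instance:
    P: List[str]              # first sentence, words p_1 ... p_I
    Q: List[str]              # second sentence, words q_1 ... q_J
    y: Optional[int] = None   # relation label in {0, ..., Nc - 1}; None if unlabeled

# Labeled source domain D_S and unlabeled target domain D_T.
D_S = [Instance(["a", "man", "is", "running"], ["a", "person", "moves"], y=1)]
D_T = [Instance(["what", "is", "gravity"], ["define", "gravity"])]   # no label
```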
4. Methodology
The purpose of our approach is to provide a way to transfer a sentence matching model from the labeled source domain to the unlabeled target domain. To achieve this, we propose a self-supervised method in the unsupervised domain adaptation setting. Specifically, we present four effective self-supervised tasks to achieve alignment between two domains. Enhanced by an easy-to-hard curriculum, the classifier trained on the source domain can gradually generalize to the unlabeled target domain with these self-supervised tasks. It is illustrated in Fig.2.
4.1 Self-Supervised Tasks
The design of auxiliary self-supervised tasks is a valuable area of research in itself[19, 37, 38, 44, 45]. Likewise, designing effective auxiliary tasks for our needs deserves careful thought. However, not all auxiliary tasks are suitable for unsupervised domain adaptation on sentence matching. First, the designed self-supervised tasks need to consider the characteristics of the goal. In our work, we focus on unsupervised domain adaptation on sentence matching. Therefore, the self-supervised tasks should consider the factors of both domain transfer and sentence matching.
Second, like most previous unsupervised domain adaptation methods[20-23], the model needs to eliminate the distribution shift between domains and learn domain-invariant representations for the classifier[22], which is trained on the source domain and generalized to the target domain. Therefore, in order to induce alignment between the source domain and the target domain, the labels created by the designed self-supervised tasks should not cause domain separation. If the auxiliary self-supervised tasks separate the two domains, the performance of unsupervised domain adaptation may be even worse. For example, we conducted experiments with an auxiliary task that only predicts the domain of each sample and, unlike previous work[22], has no objective to eliminate the domain discrepancy, and found that the performance tends to be worse than that of the source-only zero-shot baseline.
Considering the above requirements, we present four self-supervised tasks, which are illustrated in Fig.3. According to the differences in the leveraged self-supervised signals, these tasks are divided into three types: general, domain-related, and task-related.
Language Model (General). Learning language models is the most common self-supervised task in natural language processing. Thus, following BERT[19], we first employ the masked language model as an auxiliary self-supervised task. With shared networks in the same feature space, training this general task on the source domain S and the target domain T together can align the two domains along the direction relevant to this task.
In this self-supervised task, 15% of tokens are randomly sampled and replaced with "[MASK]" elements. The goal of this task is to predict the masked tokens from the context for reconstruction. We use $L_1(S,T)$ to denote this objective. Formally, it is calculated as:
$L_1(S,T)=-\sum_{x\in S,T}\sum_{i=1}^{N_m}\ln p(w_i\mid x_m)$, where $x_m$ denotes the masked sentence-pair sample based on the original text $x$ in either the source domain or the target domain, $w_i$ denotes the $i$-th masked token in the sample $x_m$, $p(w_i\mid x_m)$ denotes the probability of reconstructing the masked token $w_i$ given the context $x_m$, and $N_m$ denotes the number of masked tokens in $x_m$.
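As an illustration, here is a minimal sketch of how such masked inputs could be constructed, assuming whitespace-tokenized text; an implementation following BERT would operate on WordPiece tokens instead.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly sample 15% of the positions and replace them with [MASK].

    Returns the corrupted sequence x_m and the (position, original token)
    pairs to reconstruct; L1 is the summed negative log-likelihood of the
    originals under the model, as in the formula above."""
    masked, targets = list(tokens), []
    n_mask = max(1, int(len(tokens) * mask_prob))
    for i in random.sample(range(len(tokens)), n_mask):
        targets.append((i, tokens[i]))
        masked[i] = MASK
    return masked, targets
```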
Rotation Detection (General). Inspired by the pre-trained language model BART[45], we also employ a rotation detection task as an auxiliary self-supervised task. Rotation detection is a variation of language modeling and can help the networks learn linguistics in a different way. Thus, rotation detection can also achieve domain alignment between S and T along another direction. In rotation detection, a token in one of the two input sentences is chosen at random, and the sentence containing the token is rotated so that it begins with that token. The goal of this self-supervised task is to identify the real start of a sentence. In practice, although only one sentence is rotated, the position of rotation may be predicted over both sentences, which makes the model learn to judge whether a sentence is linguistically well-formed. We use $L_2(S,T)$ to denote this objective. Formally,
$L_2(S,T)=-\sum_{x\in S,T}\sum_{i=1}^{|x_r|}\ln p(r_i\mid x_r)$, where $x_r$ denotes the rotated sentence pair derived from the original sample $x$ in either domain, $r_i\in\{0,1\}$ denotes whether the $i$-th token in $x_r$ is the start position of the original sample, $|x_r|$ is the number of tokens in the input sentence pair, and $p(r_i\mid x_r)$ denotes the probability that the $i$-th token in $x_r$ is the start position of the original sample.
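A minimal sketch of how a rotation-detection training example could be built from a tokenized sentence pair follows; the helper name and input format are assumptions.

```python
import random

def make_rotation_example(p, q):
    """Build a rotation detection example from two token lists.

    One sentence is rotated to begin at a randomly chosen token; the label
    r_i = 1 marks the position of the original first token inside the
    concatenated pair, and 0 elsewhere (the target of loss L2)."""
    sents = [list(p), list(q)]
    j = random.randrange(2)               # which sentence to rotate
    s = sents[j]
    k = random.randrange(len(s))          # rotation offset (k = 0 keeps it intact)
    sents[j] = s[k:] + s[:k]
    x_r = sents[0] + sents[1]
    labels = [0] * len(x_r)
    offset = 0 if j == 0 else len(sents[0])
    labels[offset + (len(s) - k) % len(s)] = 1   # true start of the rotated sentence
    return x_r, labels
```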
Insensitive Word Prediction (Domain-Related). In the domain adaptation setting, there are two datasets from different domains. Thus, we can also utilize the domain factor to construct self-supervision. To avoid separating the source and target domains during training, we perform this self-supervised task by employing a classifier to judge whether each word in the current sample is a domain-insensitive word, instead of directly predicting the specific domain of the sample. We use $L_3(S,T)$ to denote this objective. Formally, it is defined as follows:
$L_3(S,T)=-\sum_{x\in S,T}\sum_{i=1}^{|x|}\ln p(g_i\mid x)$, where $x$ denotes the sentence-pair sample in either domain, $g_i\in\{0,1\}$ denotes the general-word label of the $i$-th token in $x$ ($g_i=0$ for a domain-related word, $g_i=1$ for a general word), and $p(g_i\mid x)$ denotes the probability of the corresponding label of the $i$-th token in $x$.
To decide whether a word is domain-insensitive, we count the frequency of each token in both the source and target domains, and regard the tokens whose frequencies are relatively stable across the two domains as domain-insensitive words. Formally,

$$g=\begin{cases}0, & \text{if}\ \frac{fre_S(w)}{fre_T(w)}>\Gamma\ \text{or}\ \frac{fre_S(w)}{fre_T(w)}<1/\Gamma,\\ 1, & \text{otherwise},\end{cases}$$

where $fre_S(w)$ and $fre_T(w)$ denote the normalized frequencies of the token $w$ in the source domain and the target domain, respectively, and $\Gamma$ is the threshold that decides whether the token is an insensitive one, i.e., the label $g\in\{0,1\}$. If a word's frequency is markedly higher in one of the two domains, we regard it as a domain-related word ($g=0$); otherwise, it is a general word ($g=1$).
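This frequency-based labeling could be implemented as in the following sketch; the smoothing constant `eps` for words that appear in only one domain is our assumption, since the formula leaves such words unspecified.

```python
from collections import Counter

def insensitive_word_labels(source_tokens, target_tokens, gamma=5.0, eps=1e-9):
    """Assign g = 0 (domain-related) or g = 1 (general) to each word by the
    ratio of its normalized frequencies in the two domains, as in the formula
    above (Gamma = 5 in our experiments). `eps` smooths words unseen in one
    domain (an assumption, not specified by the formula)."""
    f_s, f_t = Counter(source_tokens), Counter(target_tokens)
    n_s, n_t = sum(f_s.values()), sum(f_t.values())
    labels = {}
    for w in set(f_s) | set(f_t):
        ratio = (f_s[w] / n_s + eps) / (f_t[w] / n_t + eps)
        labels[w] = 0 if (ratio > gamma or ratio < 1.0 / gamma) else 1
    return labels
```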
Through such a domain-related self-supervised task, the networks can learn to recognize the domain-related words in the two domains. As a result, the shared networks obtain a new direction along which to align the feature spaces of the two domains, which helps perform domain transfer.
Longest Common Subsequence Prediction (Task-Related). Intuitively, a task-related self-supervised task should be well suited to the sentence matching task. Although we have presented three self-supervised tasks for domain alignment, including one that considers the domain factor, we still lack a task that focuses on the characteristics of the sentence matching task itself. Thus, we introduce a task-related self-supervised task.
In sentence matching, if the two input sentences are completely identical, they must share the same meaning. It is the differing subsequences that decide the relationship between two sentences. Thus, we present longest common subsequence prediction as a task-related self-supervised task. First, given the input sentence pair, we obtain the longest common subsequence $L=\{w_1,\ldots,w_n\}$. Then we mask the words of the longest common subsequence $L$ with "[MASK]" elements in one sentence, and predict whether each token in the other sentence is in $L$. We use $L_4(S,T)$ to denote this objective. Formally,
$L_4(S,T)=-\sum_{x\in S,T}\sum_{i=1}^{|x_l|}\ln p(l_i\mid x_l)$, where $x_l$ denotes the sentence pair in either domain in which the words of the longest common subsequence are masked in one of the two sentences, $|x_l|$ is the length of the input sentence pair, $l_i\in\{0,1\}$ denotes whether the $i$-th token in $x_l$ belongs to the longest common subsequence $L=\{w_1,\ldots,w_n\}$, and $p(l_i\mid x_l)$ denotes the corresponding probability. For the case that the longest common subsequence is empty, i.e., $L=\varnothing$, we predict a special class for the sample; for example, with BERT[19] we can make this special prediction on "[CLS]" with a multi-layer perceptron.
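A sketch of how such training examples could be constructed, using the standard dynamic-programming algorithm for the longest common subsequence; the helper names are ours.

```python
MASK = "[MASK]"

def lcs_positions(a, b):
    """Dynamic-programming LCS; returns the token positions in a and in b
    that belong to one longest common subsequence."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    pos_a, pos_b = set(), set()
    i, j = m, n
    while i > 0 and j > 0:                  # backtrack one optimal path
        if a[i - 1] == b[j - 1]:
            pos_a.add(i - 1); pos_b.add(j - 1)
            i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pos_a, pos_b

def make_lcs_example(p, q):
    """Mask the LCS words in P and label each token of Q with l_i in {0, 1}
    (whether it belongs to the LCS), the target of loss L4."""
    pos_p, pos_q = lcs_positions(p, q)
    masked_p = [MASK if i in pos_p else w for i, w in enumerate(p)]
    labels_q = [1 if j in pos_q else 0 for j in range(len(q))]
    return masked_p, list(q), labels_q
```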
By distinguishing the shared subsequences from the differing ones through this task-related self-supervised task, the networks can learn to recognize the difference between two sentences, which is very important information for sentence matching. As a result, the shared feature spaces in both domains can be aligned in a task-related way that suits the sentence matching task itself.
4.2 Unsupervised Domain Adaptation with Curriculum Learning Framework
While many samples require complex knowledge transfer for domain adaptation, some are inherently adaptable to different domains, such as those that consist of common words. Therefore, when performing self-supervised tasks to align domains, the difficulty of domain alignment may vary significantly across samples in the unlabeled target domain. Inspired by curriculum learning[27], we start self-supervised domain alignment from easy samples and gradually move on to hard ones, which benefits the learning process.
Specifically, we decompose our curriculum learning framework for unsupervised domain adaptation into two stages: 1) difficulty evaluation and 2) curriculum arrangement. Let $D_S$ and $D_T$ be the labeled source dataset and the unlabeled target dataset for training, respectively, and $M$ be our sentence matching model. In the first stage, the goal is to assign each sample $d_i$ in $D_T$ a score $s_i$, which reflects its difficulty of domain transfer. In the second stage, according to the score $s_i$, $D_T$ is organized into a sequence of ordered learning stages in an easy-to-hard fashion, resulting in the final curriculum, in which the unlabeled target data is sorted in a special order for performing the self-supervised tasks.
Difficulty Evaluation. The metric should reflect the domain transfer difficulty. A direct way to measure the domain transfer difficulty of an unlabeled target sample is to refer to the model itself. Therefore, we judge the difficulty score $s_i$ by the uncertainty (entropy) of the predictions of the current model $M$, i.e., $s_i=-\sum_c P(y_i=c\mid d_i;M)\ln P(y_i=c\mid d_i;M)$, where $y_i$ denotes the label of the $i$-th sample and $c$ ranges over the label classes. According to this measurement, we are able to distinguish easy samples from hard ones.
Curriculum Arrangement. After obtaining the score $s_i$ for each sample $d_i$ in the unlabeled target dataset $D_T$, we arrange the samples into an easy-to-hard learning curriculum according to low-to-high difficulty scores $s_i$. Because the model parameters change during training, we recompute the scores and rearrange the curriculum at each epoch. During training, the sorted unlabeled target samples in $D_T$ and random labeled source samples in $D_S$ are jointly used to perform the self-supervised tasks for domain alignment.
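Both stages could be realized as in the following sketch, assuming a PyTorch classifier `model` that returns class logits and an `encode` helper that turns a raw sample into model inputs (both assumed interfaces).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def curriculum_order(model, target_samples, encode):
    """Arrange unlabeled target samples in an easy-to-hard order by the
    prediction entropy s_i of the current model M; recomputed every epoch
    because the model parameters change."""
    model.eval()
    scores = []
    for d in target_samples:
        probs = F.softmax(model(encode(d)), dim=-1).squeeze(0)   # (Nc,)
        s_i = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        scores.append(s_i)
    order = sorted(range(len(target_samples)), key=scores.__getitem__)
    return [target_samples[i] for i in order]    # low-to-high difficulty
```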
4.3 Optimization
Training. The total objective for unsupervised domain adaptation on sentence matching is to jointly train a supervised classifier on the source domain and perform self-supervised tasks on both the source and target domains. It is formulated as follows:
$L=L_0(S)+\sum_{k=1}^{K}L_k(S,T)$, where $L_k(S,T)$ denotes the $k$-th self-supervised task on both the source and target domains, $K=4$ in our work, and $L_0(S)$ denotes the objective for the sentence matching task with labeled source domain data.
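A sketch of one joint training step under this objective; the `matching_loss` method for $L_0(S)$ and the per-task loss callables are assumed interfaces, not the original code.

```python
def training_step(model, src_batch, tgt_batch, self_sup_losses):
    """One step of the joint objective L = L0(S) + sum_k Lk(S, T): the
    supervised matching loss on labeled source data plus the K = 4
    self-supervised losses on both domains, summed without extra weights
    (no weighting scheme is described)."""
    loss = model.matching_loss(src_batch)                 # L0(S)
    for L_k in self_sup_losses:                           # L1 .. L4
        loss = loss + L_k(model, src_batch) + L_k(model, tgt_batch)
    return loss
```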
Selecting Hyper-Parameters and Early Stopping. In the unsupervised domain adaptation setting, there are no labels and no validation set for the target domain. Thus, how to select hyper-parameters and when to stop training remain open problems. Even in previous work, it is often unclear how the hyper-parameters are selected or how the end of training is decided. Hyper-parameter selection and early stopping are important factors that affect performance, especially for complex methods using adversarial learning. For example, previous methods may rely only on the labeled source domain validation set, which cannot reflect the performance on the target domain.
In this paper, we present a simple but effective heuristic to solve this problem. We measure both the performance on the validation set of the labeled source domain and the distributional discrepancy between the two domains. When the source validation performance is high and the distributional discrepancy is low, the classifier is the most likely to generalize well to the target domain. Thus, the combination is a good and direct reference for selecting hyper-parameters and early stopping. Different from previous work[20-23], we do not explicitly optimize any measurement of distributional discrepancy (such as the domain classifier in adversarial learning), and thus the distributional discrepancy in our method is a reliable metric for selecting hyper-parameters and early stopping.
Specifically, we simply use the distance between the mean of the source samples and the mean of the target samples in the learned representation space as the distributional discrepancy. Formally, it is expressed as:
$d=\left\|\frac{1}{N_{S'}}\sum_{x\in S'}\Phi(x)-\frac{1}{N_{T'}}\sum_{x\in T'}\Phi(x)\right\|$, where $S'$ and $T'$ are the unlabeled source and target datasets for validation, respectively, and $\Phi(x)$ denotes the representation of the sample. Taking BERT as an example, we can use the representation of "[CLS]" as the representation of a sentence-pair sample in sentence matching. Finally, we combine the measurement of the distributional discrepancy with the performance on the labeled source domain validation set to select hyper-parameters and to stop early.
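This measurement could be computed as in the following sketch, assuming a `model.phi` helper that returns the "[CLS]" representation of a sample (an assumed interface).

```python
import torch

@torch.no_grad()
def domain_discrepancy(model, src_val, tgt_val):
    """Distance d between the mean representations of unlabeled source and
    target validation samples; used jointly with source validation accuracy
    for hyper-parameter selection and early stopping."""
    mu_s = torch.stack([model.phi(x) for x in src_val]).mean(dim=0)
    mu_t = torch.stack([model.phi(x) for x in tgt_val]).mean(dim=0)
    return torch.norm(mu_s - mu_t).item()
```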
5. Experiments
5.1 Dataset
We evaluate our unsupervised domain adaptation method on three commonly used sentence matching tasks, covering six popular benchmark datasets.
Natural Language Inference. SNLI[1] is built on image captions (the caption genre). MultiNLI[2] is built on the Open American National Corpus and contemporary fiction, whose genres, such as reports, differ from SNLI. This task decides whether a hypothesis sentence can reasonably be inferred from a given premise sentence. The label types of sentence pairs are 0 (dissimilar), 1 (entailment), and 2 (neutral).
Paraphrase Identification. SRA[4] collects sentence pairs from the education domain, which is a relatively specialized domain. This task aims to identify whether two sentences have identical meanings. QQS[5] collects sentence pairs about daily life from an online forum. The label types of sentence pairs are 0 (dissimilar) and 1 (similar).
Question Answering. TrecQA[6] is based on official resources such as journals. WikiQA[46] is based on English Wikipedia. It is required to select matched answers from candidates for the given question (label 0/1).
To perform unsupervised domain adaptation on sentence matching, we conduct experiments by: SNLI → MultiNLI, MultiNLI → SNLI, SRA → QQS, QQS → SRA, TrecQA → WikiQA, WikiQA → TrecQA. The dataset on the left/right of the arrow denotes the source/target dataset.
5.2 Implementation Detail
We implement the self-supervised method based on the pre-trained model BERT[19], i.e., both the classification objective and the self-supervision objectives are implemented with BERT, using the base-uncased pre-trained model with 12 layers and a 768-dimensional hidden state. The intermediate dimension for the self-supervised tasks is also 768. The maximum input length is 80. $\Gamma$ is set to 5 as the threshold for deciding whether a word is domain-insensitive. The Adam optimizer[47] is employed for optimization. The learning rate is 0.00002 during training. We set the maximum number of epochs to 20.
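For reference, the stated hyper-parameters gathered in one place; the dictionary layout itself is illustrative, not from the original code.

```python
# Hyper-parameters stated in Section 5.2.
CONFIG = {
    "pretrained_model": "bert-base-uncased",  # 12 layers, 768-dim hidden state
    "self_sup_hidden_dim": 768,               # intermediate dim for self-supervised heads
    "max_input_length": 80,
    "gamma": 5,                               # threshold for domain-insensitive words
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    "max_epochs": 20,
}
```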
5.3 Comparison
We employ classification accuracy for evaluation and conduct experiments on the following state-of-the-art methods for comparison.
1) Zero-shot. To compare the cross-domain generalization capability of our self-supervised method (SS), we first compare it with the pre-trained model BERT[19], which is fine-tuned on one source dataset and directly applied to another target dataset without any change. This model represents the lower bound of cross-domain performance.
2) IR. We employ BM25 as an information retrieval baseline. According to the source domain, a threshold is decided to judge the 0/1 label (a minimal sketch of this baseline follows the list).
3) Upper. It denotes the upper bound of cross-domain performance, which directly trains on the target domain with the labeled data.
4) DANN. We employ the domain-adversarial neural network[22] with BERT for comparison.
5) DSN. It learns to extract shared and private components of each domain[48].
6) TRL. It iteratively trains a pivot-based language model while solving increasingly complex tasks in subsequent stages[49].
7) JMMD. It learns a transfer network by aligning the distributions of domains based on a joint maximum mean discrepancy criterion[50].
8) CMD. It minimizes the difference between feature representations by utilizing the equivalent representation of probability distributions via moment sequences[51].
9) MT-Tri. It uses a tri-training framework and multi-task learning for domain adaptation[52].
10) MMT. It utilizes refined pseudo labels in a collaborative training manner[53].
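As referenced in 2), the following is a minimal sketch of a BM25 baseline for the answer selection setting, assuming the rank_bm25 package; the paper does not specify the exact BM25 configuration, so this is illustrative only.

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

def bm25_predict(question_tokens, candidate_answers, threshold):
    """Score tokenized candidate answers against a question with BM25 and
    label each candidate 1 (matched) if its score clears a threshold fit on
    the labeled source domain, else 0."""
    bm25 = BM25Okapi(candidate_answers)        # list of tokenized candidates
    scores = bm25.get_scores(question_tokens)
    return [1 if s >= threshold else 0 for s in scores]
```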
5.4 Effectiveness of Self-Supervision
The cross-domain results across the six datasets are shown in Table 1. As observed, the performance of the models drops dramatically when they are applied to another domain, even for a powerful pre-trained model. The results show that unsupervised domain adaptation on sentence matching is a challenging problem. We can see that some methods such as DANN sometimes do not work, which confirms that directly training a minimax objective is a difficult problem. In contrast, our self-supervised method avoids this problem and achieves better performance.
Table 1. Results of Unsupervised Domain Adaptation on Three Sentence Matching Tasks

| Method | S → M | M → S | R → Q | Q → R | T → W | W → T | All (Avg) |
|---|---|---|---|---|---|---|---|
| Upper | 84.6 | 91.2 | 80.8 | 75.6 | 95.3 | 87.8 | 85.9 |
| Zero-shot[19] | 64.1/20.5↓ | 75.1/16.1↓ | 64.4/16.4↓ | 52.1/23.5↓ | 89.6/5.7↓ | 80.3/7.5↓ | 70.9/15.0↓ |
| IR | - | - | 63.6/17.2↓ | 63.7/11.9↓ | 89.1/6.2↓ | 81.3/6.5↓ | - |
| DANN[22] | 67.3/17.3↓ | 76.5/14.7↓ | 63.7/17.1↓ | 52.8/22.8↓ | 93.0/2.3↓ | 81.3/6.5↓ | 72.4/13.5↓ |
| DSN[48] | 69.5/15.1↓ | 77.1/14.1↓ | 63.5/17.3↓ | 55.4/20.2↓ | 92.1/3.2↓ | 81.2/6.6↓ | 73.1/12.8↓ |
| TRL[49] | 70.0/14.6↓ | 77.8/13.4↓ | 66.1/14.7↓ | 56.3/19.3↓ | 90.3/5.0↓ | 82.5/5.3↓ | 73.8/12.1↓ |
| JMMD[50] | 68.5/16.1↓ | 76.7/14.5↓ | 63.7/17.1↓ | 50.8/24.8↓ | 90.2/5.1↓ | 81.6/6.2↓ | 71.9/14.0↓ |
| CMD[51] | 68.7/15.9↓ | 78.4/12.8↓ | 62.4/18.4↓ | 53.1/22.5↓ | 91.7/3.6↓ | 81.9/5.9↓ | 72.7/13.2↓ |
| MT-Tri[52] | 68.8/15.8↓ | 78.3/12.9↓ | 67.7/13.1↓ | 50.7/24.9↓ | 90.9/4.4↓ | 82.0/5.8↓ | 73.1/12.8↓ |
| MMT[53] | 66.2/18.4↓ | 76.9/14.3↓ | 66.2/14.6↓ | 55.9/19.7↓ | 93.2/2.1↓ | 80.8/7.0↓ | 73.2/12.7↓ |
| SS (ours) | 73.3/11.3↓ | 81.5/9.7↓ | 76.9/3.9↓ | 61.0/14.6↓ | 95.2/0.1↓ | 84.1/3.7↓ | 78.7/7.2↓ |
| w/o curriculum | 72.1/12.5↓ | 80.4/10.8↓ | 74.5/6.3↓ | 60.4/15.2↓ | 95.2/0.1↓ | 83.6/4.2↓ | 77.7/8.2↓ |

Note: S → M and M → S are natural language inference, R → Q and Q → R are paraphrase identification, and T → W and W → T are question answering. Upper is learned on the target domain with labeled data and represents the upper bound of cross-domain performance. The numerical value before "↓" denotes the accuracy drop with respect to Upper. "w/o curriculum" indicates the performance of the combination of all self-supervised tasks without the curriculum learning framework. IR can only deal with tasks with 0/1 labels decided by a threshold; thus, IR cannot deal with natural language inference due to its additional label types (0/1/2). S, M, R, Q, T, and W denote SNLI, MultiNLI, SRA, QQS, TrecQA, and WikiQA, respectively. "Avg" indicates the average performance.

In addition, the special design of self-supervision to suit sentence matching is also an important factor in enhancing the cross-domain model. Previous methods[20-23] for cross-domain classification do not consider the characteristics of sentence matching. Thus, our method significantly outperforms them in all the cross-domain tasks. In general, the results demonstrate that self-supervision is effective in alleviating the domain shift problem of sentence matching.
In addition, as observed, when we remove the curriculum learning framework, the performance declines. It indicates that an easy-to-hard order for domain alignment is effective and can benefit the learning process of unsupervised domain adaptation.
Moreover, IR is a strong baseline that outperforms some neural methods in the cross-domain setting. This further verifies that neural models are weak when applied to new domains. However, unsupervised IR has a clear defect: it can only deal with semantic similarity tasks, whose label types are 0 (dissimilar) and 1 (similar). This defect limits its application to other sentence matching tasks with more label types, such as natural language inference, whose label types are 0 (dissimilar), 1 (entailment), and 2 (neutral).
Besides, previous methods[20-23] have better performance than Zero-shot in most cases, which demonstrates that learning domain-invariant representations and achieving domain alignment are useful to alleviate the cross-domain problem.
5.5 Influence of Pre-Training
To explore the influence of pre-training for cross-domain sentence matching, we also conduct experiments with randomly initialized parameters. The results are shown in Fig.4.
By observing the Upper and Zero-shot models, we can see that both the pre-trained and un-pretrained models suffer from the serious domain shift problem. In addition, in all methods, the pre-trained model outperforms the un-pretrained model. The improvement benefits from pre-trained parameters, which indicates that pre-training is useful to improve the performance in cross-domain scenarios. However, we can see that the degrees of performance drop of pre-trained and un-pretrained models are at the same level, which indicates that pre-training cannot directly tackle the domain shift problem.
6. Analysis for Usages of Self-Supervision
Due to the open style of self-supervised signals, there may be various heuristic self-supervised tasks that can be candidates. Obviously, not all self-supervised tasks are useful for domain alignment. In this section, we further study different usages of self-supervision for cross-domain scenarios. There is no statistical guarantee, but it may be practically helpful for thinking about how to effectively design and utilize self-supervised tasks. In total, we find three empirical views in the experiments, which are shown in Subsections 6.1-6.3.
6.1 Domain Signal for Self-Supervision Tasks
To investigate the effectiveness of different self-supervised tasks, we conduct ablation studies. The results are shown in Table 2. Overall, compared with the Zero-shot baseline, we can observe that all four self-supervised tasks can improve the performance. The results demonstrate that auxiliary self-supervised tasks are useful to align two domains with unlabeled data and achieve unsupervised domain adaptation.
Table 2. Results of Ablation Studies

| Method | S → M | M → S | R → Q | Q → R | T → W | W → T | All (Avg) |
|---|---|---|---|---|---|---|---|
| SS (ours) | 73.3 | 81.5 | 76.9 | 61.0 | 95.2 | 84.1 | 78.7 |
| Zero-shot | 64.1 | 75.1 | 64.4 | 52.1 | 89.6 | 80.3 | 70.9 |
| LM (general) | 69.5 | 78.1 | 65.7 | 53.2 | 92.5 | 81.8 | 73.5 |
| w/o LM | 72.4 | 80.9 | 76.1 | 60.2 | 95.0 | 83.5 | 78.0 |
| Rotation (general) | 69.4 | 77.4 | 70.6 | 52.7 | 90.9 | 81.9 | 73.8 |
| w/o Rotation | 71.6 | 80.0 | 76.3 | 60.6 | 95.0 | 83.9 | 77.9 |
| Insensitive (domain-related) | 68.2 | 77.7 | 73.8 | 60.0 | 92.5 | 81.3 | 75.6 |
| w/o Insensitive | 71.0 | 79.3 | 76.0 | 60.4 | 94.8 | 83.2 | 77.4 |
| LCS (task-related) | 69.8 | 76.7 | 71.2 | 54.9 | 90.3 | 81.5 | 74.1 |
| w/o LCS | 72.7 | 80.7 | 76.3 | 60.7 | 94.9 | 83.7 | 78.1 |
| All (w/o curriculum) | 72.1 | 80.4 | 74.5 | 60.4 | 95.2 | 83.6 | 77.7 |
| Separate | 65.7 | 75.2 | 63.4 | 50.8 | 88.9 | 80.6 | 70.8 |

Note: "LM", "Rotation", "Insensitive", and "LCS" refer to our presented self-supervised tasks of $L_1$, $L_2$, $L_3$, and $L_4$, respectively. "w/o" denotes deleting a self-supervised task. "w/o curriculum" indicates the performance of the combination of all self-supervised tasks without the curriculum learning framework. "All" indicates the combination of the above four tasks. "Separate" indicates the method that employs a self-supervised task leading to domain separation. S, M, R, Q, T, and W denote SNLI, MultiNLI, SRA, QQS, TrecQA, and WikiQA, respectively. "Avg" indicates the average performance.

To be more specific, we can see that the domain-related self-supervised task is the most effective in most cases. Especially for the smaller datasets (SRA → QQS, QQS → SRA), the domain-related self-supervised task is much more helpful. This shows that using the direct domain signal for self-supervision may be a good choice for domain alignment. Moreover, the two general self-supervised tasks have similar performance. Besides, the task-related self-supervised task is comparable with the general self-supervised tasks, which shows that it is useful to design special self-supervised tasks for a specific task. A specific task typically has its own characteristics, and special self-supervised tasks can suit these characteristics and further facilitate domain alignment in the task. This paper focuses on sentence matching, and the results verify that such self-supervised tasks are useful for cross-domain sentence matching.
6.2 Domain Separation
Intuitively, in order to induce alignment between two domains, self-supervised tasks should not cause domain separation. If auxiliary self-supervised tasks cause domain separation, the discrepancy between the two domains in the feature space will increase, leading to worse domain transfer performance. To explore this, we conduct experiments with a self-supervised task "Separate", which directly predicts the specific domain (source or target) of each sample and has no mechanism to eliminate the domain discrepancy. The results are shown in Table 2.
We can find that "Separate", which leads to domain separation, is at best barely better than the Zero-shot baseline and is sometimes even worse. This verifies that self-supervised tasks designed for unsupervised domain adaptation should not cause domain separation. Besides, the results also demonstrate that not all self-supervised tasks are useful, and designing auxiliary self-supervised tasks needs careful consideration.
6.3 Multiple Self-Supervised Tasks
To investigate the influence of the number of self-supervised tasks, we show the curves for different numbers of self-supervised tasks in Fig.5. We can see that the performance increases as the number of self-supervised tasks increases. The results demonstrate that more self-supervised tasks lead to better domain adaptation performance.
Similarly, the performance of “All” in Table 2 shows that the combination of multiple tasks facilitates domain alignment and performs better than every single one. These results demonstrate that more self-supervised tasks can further achieve domain alignment along multiple directions and help the source classifier generalize to the target.
7. Conclusions
To address unsupervised domain adaptation on sentence matching, we proposed to perform auxiliary self-supervised tasks to achieve alignment between domains, so that the classifier trained on the source domain can generalize to the unlabeled target domain more easily and effectively. We conducted experiments on three sentence matching tasks across six datasets, and the results showed that our method outperforms previous state-of-the-art methods, which demonstrates that self-supervision is useful to alleviate the domain shift problem. In addition, we showed the effectiveness of different self-supervised tasks with ablation studies and derived three experimental conclusions for unsupervised domain adaptation. These experimental conclusions may be helpful for other domain adaptation tasks. In future work, we plan to explore how to design more effective self-supervised tasks.
Acknowledgements
We would like to sincerely thank all reviewers for their kind and constructive suggestions.
[1] Bowman S R, Angeli G, Potts C, Manning C D. A large annotated corpus for learning natural language inference. arXiv: 1508.05326, 2015. https://arxiv.org/abs/1508.05326, Nov. 2023.
[2] Williams A, Nangia N, Bowman S R. A broad-coverage challenge corpus for sentence understanding through inference. arXiv: 1704.05426, 2017. https://arxiv.org/abs/1704.05426, Nov. 2023.
[3] Rus V, Banjade R, Lintean M. On paraphrase identification corpora. In Proc. the 9th International Conference on Language Resources and Evaluation, May 2014, pp.2422–2429.
[4] Dzikovska M, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Dang H T. SemEval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Proc. the 2nd Joint Conference on Lexical and Computational Semantics, Jun. 2013, pp.263–274.
[5] Nakov P, Hoogeveen D, Màrquez L, Moschitti A, Mubarak H, Baldwin T, Verspoor K. SemEval-2017 task 3: Community question answering. arXiv: 1912.00730, 2019. https://arxiv.org/abs/1912.00730, Nov. 2023.
[6] Wang M Q, Smith N A, Mitamura T. What is the jeopardy model? A quasi-synchronous grammar for QA. In Proc. the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jun. 2007, pp.22–32.
[7] Yang Y, Yih W T, Meek C. WikiQA: A challenge dataset for open-domain question answering. In Proc. the 2015 Conference on Empirical Methods in Natural Language Processing, Sept. 2015, pp.2013–2018. DOI: 10.18653/v1/D15-1237.
[8] Bao X Q, Wu Y F. A tensor neural network with layerwise pretraining: Towards effective answer retrieval. Journal of Computer Science and Technology, 2016, 31(6): 1151–1160. DOI: 10.1007/s11390-016-1689-4.
[9] Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A. Supervised learning of universal sentence representations from natural language inference data. arXiv: 1705.02364, 2017. https://arxiv.org/abs/1705.02364, Nov. 2023.
[10] Choi J, Yoo K M, Lee S. Learning to compose task-specific tree structures. arXiv: 1707.02786, 2017. https://arxiv.org/abs/1707.02786, Nov. 2023.
[11] Nie Y X, Bansal M. Shortcut-stacked sentence encoders for multi-domain inference. arXiv: 1708.02312, 2017. https://arxiv.org/abs/1708.02312, Nov. 2023.
[12] Shen T, Zhou T Y, Long G D, Jiang J, Wang S, Zhang C Q. Reinforced self-attention network: A hybrid of hard and soft attention for sequence modeling. arXiv: 1801.10296, 2018. https://arxiv.org/abs/1801.10296, Nov. 2023.
[13] Chen Q, Zhu X D, Ling Z H, Wei S, Jiang H, Inkpen D. Enhanced LSTM for natural language inference. arXiv: 1609.06038, 2016. https://arxiv.org/abs/1609.06038, Nov. 2023.
[14] Yang L, Ai Q Y, Guo J F, Croft W B. aNMM: Ranking short answer texts with attention-based neural matching model. In Proc. the 25th ACM International on Conference on Information and Knowledge Management, Oct. 2016, pp.287–296. DOI: 10.1145/2983323.2983818.
[15] Wang Z G, Hamza W, Florian R. Bilateral multi-perspective matching for natural language sentences. arXiv: 1702.03814, 2017. https://arxiv.org/abs/1702.03814, Nov. 2023.
[16] Gong Y C, Luo H, Zhang J. Natural language inference over interaction space. arXiv: 1709.04348, 2017. https://arxiv.org/abs/1709.04348, Nov. 2023.
[17] Liang D, Zhang F B, Zhang Q, Huang X J. Asynchronous deep interaction network for natural language inference. In Proc. the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Nov. 2019, pp.2692–2700. DOI: 10.18653/v1/D19-1271.
[18] Chen L, Zhao Y B, Lyu B E, Jin L S, Chen Z, Zhu S, Yu K. Neural graph matching networks for Chinese short text matching. In Proc. the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020, pp.6152–6158. DOI: 10.18653/v1/2020.acl-main.547.
[19] Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv: 1810.04805, 2018. https://arxiv.org/abs/1810.04805, Nov. 2023.
[20] Pan S J, Yang Q. A survey on transfer learning. IEEE Trans. Knowledge and Data Engineering, 2010, 22(10): 1345–1359. DOI: 10.1109/TKDE.2009.191.
[21] Saenko K, Kulis B, Fritz M, Darrell T. Adapting visual category models to new domains. In Proc. the 11th European Conference on Computer Vision, Sept. 2010, pp.213–226. DOI: 10.1007/978-3-642-15561-1_16.
[22] Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 2016, 17(1): 2096–2030. DOI: 10.1007/978-3-319-58347-1_10.
[23] Wang Y Y, Gu J M, Wang C, Chen S C, Xue H. Discrimination-aware domain adversarial neural network. Journal of Computer Science and Technology, 2020, 35(2): 259–267. DOI: 10.1007/s11390-020-9969-4.
[24] Arjovsky M, Bottou L. Towards principled methods for training generative adversarial networks. arXiv: 1701.04862, 2017. https://arxiv.org/abs/1701.04862, Nov. 2023.
[25] Raina R, Battle A, Lee H, Packer B, Ng A Y. Self-taught learning: Transfer learning from unlabeled data. In Proc. the 24th International Conference on Machine Learning, Jun. 2007, pp.759–766. DOI: 10.1145/1273496.1273592.
[26] Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798–1828. DOI: 10.1109/TPAMI.2013.50.
[27] Bengio Y, Louradour J, Collobert R, Weston J. Curriculum learning. In Proc. the 26th Annual International Conference on Machine Learning, Jun. 2009, pp.41–48. DOI: 10.1145/1553374.1553380.
[28] Peng M L, Zhang Q, Jiang Y G, Huang X J. Cross-domain sentiment classification with target domain specific information. In Proc. the 56th Annual Meeting of the Association for Computational Linguistics, Jul. 2018, pp.2505–2513. DOI: 10.18653/v1/P18-1233.
[29] Ghosal D, Hazarika D, Roy A, Majumder N, Mihalcea R, Poria S. KinGDOM: Knowledge-guided DOMain adaptation for sentiment analysis. In Proc. the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020, pp.3198–3210. DOI: 10.18653/v1/2020.acl-main.292.
[30] Cao Y, Fang M, Yu B S, Zhou J T. Unsupervised domain adaptation on reading comprehension. In Proc. the 34th AAAI Conference on Artificial Intelligence, Feb. 2020, pp.7480–7487. DOI: 10.1609/aaai.v34i05.6245.
[31] Kamath A, Jia R B, Liang P. Selective question answering under domain shift. In Proc. the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020, pp.5684–5696. DOI: 10.18653/v1/2020.acl-main.503.
[32] Ding N, Long D K, Xu G W, Zhu M H, Xie P J, Wang X B, Zheng H T. Coupling distant annotation and adversarial training for cross-domain Chinese word segmentation. In Proc. the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020, pp.6662–6671. DOI: 10.18653/v1/2020.acl-main.595.
[33] Rücklé A, Pfeiffer J, Gurevych I. MultiCQA: Zero-shot transfer of self-supervised text matching models on a massive scale. In Proc. the 2020 Conference on Empirical Methods in Natural Language Processing, Nov. 2020, pp.2471–2486. DOI: 10.18653/v1/2020.emnlp-main.194.
[34] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv: 1301.3781, 2013. https://arxiv.org/abs/1301.3781, Nov. 2023.
[35] Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. In Proc. the 26th International Conference on Neural Information Processing Systems, Dec. 2013, pp.3111–3119.
[36] Bengio Y, Ducharme R, Vincent P, Janvin C. A neural probabilistic language model. The Journal of Machine Learning Research, 2003, 3: 1137–1155. DOI: 10.1007/3-540-33486-6_6.
[37] Peters M E, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep contextualized word representations. arXiv: 1802.05365, 2018. https://arxiv.org/abs/1802.05365, Nov. 2023.
[38] Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training, 2018. https://www.bibsonomy.org/bibtex/15c343ed9a31ac52fd17a898f72af228f/lepsky?lang=en, Nov. 2023.
[39] Kumar M P, Packer B, Koller D. Self-paced learning for latent variable models. In Proc. the 23rd International Conference on Neural Information Processing Systems, Dec. 2010, pp.1189–1197.
[40] Sachan M, Xing E. Easy questions first? A case study on curriculum learning for question answering. In Proc. the 54th Annual Meeting of the Association for Computational Linguistics, Aug. 2016, pp.453–463. DOI: 10.18653/v1/P16-1043.
[41] Sachan M, Xing E. Self-training for jointly learning to ask and answer questions. In Proc. the 16th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jun. 2018, pp.629–640. DOI: 10.18653/v1/N18-1058.
[42] Tay Y, Wang S H, Tuan L A, Fu J, Phan M C, Yuan X D, Rao J F, Hui S C, Zhang A. Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives. arXiv: 1905.10847, 2019. https://arxiv.org/abs/1905.10847, Nov. 2023.
[43] Xu B F, Zhang L, Mao Z, Wang Q, Xie H, Zhang Y. Curriculum learning for natural language understanding. In Proc. the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020, pp.6095–6104. DOI: 10.18653/v1/2020.acl-main.542.
[44] Wu J W, Wang X, Wang W Y. Self-supervised dialogue learning. arXiv: 1907.00448, 2019. https://arxiv.org/abs/1907.00448, Nov. 2023.
[45] Lewis M, Liu Y H, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv: 1910.13461, 2019. https://arxiv.org/abs/1910.13461, Nov. 2023.
[46] Jurczyk T, Zhai M, Choi J D. SelQA: A new benchmark for selection-based question answering. In Proc. the 28th International Conference on Tools with Artificial Intelligence, Nov. 2016, pp.820–827. DOI: 10.1109/ICTAI.2016.0128.
[47] Kingma D P, Ba J. Adam: A method for stochastic optimization. arXiv: 1412.6980, 2014. https://arxiv.org/abs/1412.6980, Nov. 2023.
[48] Bousmalis K, Trigeorgis G, Silberman N, Krishnan D, Erhan D. Domain separation networks. In Proc. the 30th International Conference on Neural Information Processing Systems, Dec. 2016, pp.343–351.
[49] Ziser Y, Reichart R. Task refinement learning for improved accuracy and stability of unsupervised domain adaptation. In Proc. the 57th Annual Meeting of the Association for Computational Linguistics, Jul. 2019, pp.5895–5906. DOI: 10.18653/v1/P19-1591.
[50] Long M S, Zhu H, Wang J M, Jordan M I. Deep transfer learning with joint adaptation networks. In Proc. the 34th International Conference on Machine Learning, Aug. 2017, pp.2208–2217.
[51] Zellinger W, Grubinger T, Lughofer E, Natschläger T, Saminger-Platz S. Central moment discrepancy (CMD) for domain-invariant representation learning. arXiv: 1702.08811, 2017. https://arxiv.org/abs/1702.08811, Dec. 2023.
[52] Ruder S, Plank B. Strong baselines for neural semi-supervised learning under domain shift. arXiv: 1804.09530, 2018. https://arxiv.org/abs/1804.09530, Nov. 2023.
[53] Ge Y X, Chen D P, Li H S. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv: 2001.01526, 2020. https://arxiv.org/abs/2001.01526, Nov. 2023.