

Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

  • Abstract:
    In recent years, the pre-trained language model BERT and its variants have performed strongly on many natural language processing tasks and achieved major breakthroughs on the corresponding downstream tasks. At present, fine-tuning remains the primary way to transfer the knowledge of a pre-trained language model to a downstream task. In this paper, we propose a BERT fine-tuning method based on self-ensemble and self-distillation, which effectively exploits the information contained in the intermediate parameters produced during training and improves the fine-tuning performance of BERT. During training, the self-ensemble mechanism records intermediate model parameters to update an experience pool and samples from this pool to construct a teacher model; the self-distillation mechanism then transfers the knowledge of the teacher model to the model currently being trained via knowledge distillation, thereby improving the robustness of the model. Experiments on a series of text classification datasets and on the General Language Understanding Evaluation (GLUE) benchmark show that the proposed fine-tuning method significantly improves BERT fine-tuning results without introducing any external knowledge or data: on the text classification datasets, it reduces the relative error rate by 6.26% compared with the baseline model; on GLUE, it raises the average score from the baseline's 79.7 to 80.6; on the SNLI dataset, it achieves 92.6% accuracy, clearly surpassing the previous best result (92.1%). These findings indicate that, without introducing additional external knowledge or training data, optimizing the training process of a pre-trained model can further improve its robustness and strengthen its generalization ability. Moreover, the proposed method does not conflict with approaches that do rely on external knowledge or training data, so introducing such knowledge or data can further improve BERT fine-tuning.
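    One common way to write the combined objective described above is sketched below; the weight $\lambda$ and the squared-error form on the logits are illustrative assumptions, not necessarily the paper's exact formulation:

    $\mathcal{L} = \mathcal{L}_{\mathrm{CE}}\bigl(y, \mathrm{softmax}(z_{\mathrm{student}})\bigr) + \lambda\,\lVert z_{\mathrm{student}} - z_{\mathrm{teacher}} \rVert_2^2$

    where $z_{\mathrm{student}}$ are the logits of the model being trained and $z_{\mathrm{teacher}}$ are the logits of the teacher built by averaging checkpoints sampled from the experience pool.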

     

    Abstract: Fine-tuning pre-trained language models like BERT has become an effective approach in natural language processing (NLP) and yields state-of-the-art results on many downstream tasks. Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure, re-designing the pre-training tasks, and leveraging external data and knowledge; the fine-tuning strategy itself has yet to be fully explored. In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation. The self-ensemble mechanism builds the teacher model from checkpoints stored in an experience pool. To transfer knowledge from the teacher model to the student model efficiently, we further use knowledge distillation, which we call self-distillation because the distilled knowledge comes from the model itself along the time dimension. Experiments on the GLUE benchmark and several text classification benchmarks show that our proposed approach significantly improves the adaptation of BERT without any external data or knowledge. We conduct exhaustive experiments to investigate the effectiveness of the self-ensemble and self-distillation mechanisms, and our approach achieves a new state-of-the-art result on the SNLI dataset.
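As a concrete illustration of the training loop described above, the following PyTorch sketch maintains an experience pool of intermediate parameters, averages sampled checkpoints into a teacher (self-ensemble), and adds a distillation term between student and teacher logits (self-distillation). The backbone, the hyper-parameters, and the MSE form of the distillation loss are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of fine-tuning with self-ensemble + self-distillation.
# "model" is any classifier mapping a batch of inputs to logits (a stand-in
# for BERT); pool size, sample size, and the distillation weight are
# illustrative assumptions, not the paper's reported settings.
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F


def build_teacher(model, pool, num_samples=3):
    """Self-ensemble: average parameters of checkpoints sampled from the pool."""
    teacher = copy.deepcopy(model)
    sampled = random.sample(pool, k=min(len(pool), num_samples))
    avg_state = {
        name: torch.stack([ckpt[name].float() for ckpt in sampled]).mean(dim=0)
        for name in sampled[0]
    }
    teacher.load_state_dict(avg_state)
    teacher.eval()
    return teacher


def fine_tune(model, loader, epochs=3, pool_size=5, lambda_kd=1.0, lr=2e-5, device="cpu"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Experience pool holding the most recent intermediate checkpoints.
    pool = deque(maxlen=pool_size)
    pool.append({k: v.detach().cpu().clone() for k, v in model.state_dict().items()})

    for _ in range(epochs):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)

            # Rebuilt every step for simplicity; refreshing the teacher less
            # often would be cheaper in practice.
            teacher = build_teacher(model, list(pool))

            logits = model(inputs)
            with torch.no_grad():
                teacher_logits = teacher(inputs)

            ce_loss = F.cross_entropy(logits, labels)     # task loss
            kd_loss = F.mse_loss(logits, teacher_logits)  # self-distillation loss
            loss = ce_loss + lambda_kd * kd_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Record the current intermediate parameters in the experience pool.
            pool.append({k: v.detach().cpu().clone() for k, v in model.state_dict().items()})
    return model
```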

     

