

Efficient Model Store and Reuse in an OLML Database System

    1. Context
    Deep learning has improved performance on a wide range of machine learning tasks. However, training neural networks requires large amounts of data, which inevitably incurs high training costs.
    2. Objective
    When the training data is small, improving the training effect of neural network models helps reduce training costs.
    3. Method
    We propose the OnLine Machine Learning (OLML) database, a natural extension of a traditional database that supports efficient model storage and reuse. First, the reuse potential of each trained model is estimated from two aspects: its domain relatedness to the target task and its model quality (see the selection sketch after this summary). Then, multiple models with high reuse potential are selected and combined to improve the training effect on the target task. Finally, an OLML database prototype system is built to demonstrate fast, real-time responses to training requests.
    4. Results & Findings
    On the text emotion recognition task in our experiments, the proposed method achieves a better training effect than training from scratch while using only 1/5 of the training data.
    5. Conclusions
    The domain relatedness of trained models to the target task, together with their quality, predicts reuse potential well; reusing multiple trained models improves the training effect when the training data is small. By integrating model storage and reuse, the OLML database can automatically construct training plans, respond to training requests in real time, and improve the training effect through model reuse.
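The abstract does not give AdaReuse's concrete scoring rule, so the following is only a minimal Python sketch of the selection stage. The multiplicative combination of the two signals and every identifier here (StoredModel, relatedness, quality, top_k) are illustrative assumptions, not the paper's actual algorithm:

```python
# Minimal sketch of the selection stage described in the Method item above.
# The scoring rule and all names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StoredModel:
    name: str
    relatedness: float  # domain relatedness to the target task, in [0, 1]
    quality: float      # model quality, e.g. held-out accuracy, in [0, 1]

def reuse_potential(m: StoredModel) -> float:
    # Assumption: a model must be both related to the target domain and of
    # high quality, so the two signals are combined multiplicatively.
    return m.relatedness * m.quality

def select_models(store: list[StoredModel], top_k: int = 3) -> list[StoredModel]:
    """Return the top_k stored models with the highest reuse potential."""
    return sorted(store, key=reuse_potential, reverse=True)[:top_k]

# Example: pick two of three stored models for a new sentiment task.
store = [
    StoredModel("news-topic", relatedness=0.4, quality=0.9),
    StoredModel("product-reviews", relatedness=0.9, quality=0.8),
    StoredModel("tweet-sentiment", relatedness=0.8, quality=0.7),
]
print([m.name for m in select_models(store, top_k=2)])
# -> ['product-reviews', 'tweet-sentiment']
```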

     

    Abstract: Deep learning has shown significant improvements on various machine learning tasks by introducing a wide spectrum of neural network models. Yet, these neural network models require a tremendous amount of labeled training data, which is prohibitively expensive in practice. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in new training tasks to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, through which a group of trained models with high reuse potential for the training task can be selected efficiently. Then, the selected models are trained iteratively, with the training data reweighted in each iteration to encourage model diversity, so that a better training effect can be achieved by ensembling. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse significantly improves the training effect compared with training from scratch when the training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies illustrate that the OLML database can properly store trained models and reuse them efficiently in new training tasks.
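The abstract names the mechanism of the reuse stage (iterative training on reweighted data, then ensembling) but not the exact update rule. The following NumPy sketch assumes an AdaBoost-style reweighting and hypothetical fine_tune/predict callbacks standing in for real training code; it illustrates the idea rather than the paper's implementation:

```python
# Boosting-style sketch of iterative reuse with data reweighting.
# fine_tune(model, X, y, w) and predict(model, X) are assumed callbacks;
# X is a feature array and y a NumPy array of labels.
import numpy as np

def train_reuse_ensemble(selected_models, fine_tune, predict, X, y):
    n = len(y)
    w = np.full(n, 1.0 / n)              # uniform sample weights to start
    ensemble = []
    for model in selected_models:
        model = fine_tune(model, X, y, w)
        wrong = predict(model, X) != y   # boolean mask of mistakes
        err = w[wrong].sum()
        if err >= 0.5:                   # no better than chance under w: skip
            continue
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        ensemble.append((alpha, model))
        # Up-weight the examples this model got wrong so the next model is
        # pushed toward a different error profile; this is what "reweighted
        # to encourage diversity" means in this sketch.
        w *= np.exp(np.where(wrong, alpha, -alpha))
        w /= w.sum()
    return ensemble

def ensemble_predict(ensemble, predict, X, labels):
    """Weighted vote of the fine-tuned models over a fixed label set."""
    votes = np.zeros((len(X), len(labels)))
    for alpha, model in ensemble:
        preds = predict(model, X)
        for j, lab in enumerate(labels):
            votes[:, j] += alpha * (preds == lab)
    return np.asarray(labels)[votes.argmax(axis=1)]
```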
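The abstract also states that a training task arrives as an SQL-like query, which OLML compiles into a training plan. The concrete syntax is not given here, so the statement and the compile_training_plan() helper below are purely hypothetical illustrations of that workflow:

```python
# Hypothetical SQL-like training request and plan generation; the query
# syntax and all names below are assumptions made for this sketch.
import re

TRAIN_QUERY = "TRAIN MODEL review_sentiment ON reviews(text, label) REUSING STORED MODELS"

def compile_training_plan(query: str) -> list[str]:
    m = re.match(r"TRAIN MODEL (\w+) ON (\w+)\((\w+), (\w+)\)", query)
    task, table, feature, label = m.groups()
    return [
        f"rank stored models by reuse potential for task '{task}'",
        f"iteratively fine-tune the selected models on {table}.{feature} -> "
        f"{table}.{label}, reweighting the data each round",
        "ensemble the fine-tuned models and store the result for future reuse",
    ]

for step in compile_training_plan(TRAIN_QUERY):
    print(step)
```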

     
