Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (4): 792-805. DOI: 10.1007/s11390-021-1353-5

Special Issue: Data Management and Data Mining

• Special Section on AI4DB and DB4AI •

Efficient Model Store and Reuse in an OLML Database System

Jian-Wei Cui, Member, CCF, Wei Lu, Member, CCF, Xin Zhao, and Xiao-Yong Du*, Fellow, CCF        

  1. Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872, China; School of Information, Renmin University of China, Beijing 100872, China
  • Received: 2021-02-04; Revised: 2021-06-27; Online: 2021-07-05; Published: 2021-07-30
  • Contact: Xiao-Yong Du, E-mail: duyong@ruc.edu.cn
  • About author:Jian-Wei Cui is currently a Ph.D. candidate in the Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, and School of Information at Renmin University of China, Beijing. His research interests include natural language processing, machine translation and DB4AI. He is a member of CCF.
  • Supported by:
    The work was supported by the National Natural Science Foundation of China under Grant No. 62072458.

Deep learning has achieved significant improvements on various machine learning tasks by introducing a wide spectrum of neural network models. However, these neural network models require a tremendous amount of labeled training data, which is prohibitively expensive to obtain in practice. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in a new training task to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, through which a group of trained models with high reuse potential for the training task can be selected efficiently. Then, multiple selected models are trained iteratively to encourage model diversity, so that a better training effect can be achieved by ensembling them. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse improves the training effect significantly over models trained from scratch when the training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies show that the OLML database can properly store trained models and reuse them efficiently in new training tasks.
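To make the workflow concrete, the following is a minimal Python sketch of the selection-and-ensemble idea described above. It is an illustration under stated assumptions, not the paper's implementation: all names (TrainedModel, domain_relatedness, reuse_potential, select_models, ensemble_predict), the cosine-similarity relatedness measure, and the weighted combination of relatedness and quality are hypothetical placeholders; AdaReuse's actual scoring function and its iterative fine-tuning step are not reproduced here.

    # Hypothetical sketch of a model-store lookup followed by ensembling.
    from dataclasses import dataclass
    from typing import Callable, List, Sequence

    @dataclass
    class TrainedModel:
        name: str
        predict: Callable[[Sequence[str]], List[int]]  # maps inputs to predicted labels
        domain_vector: List[float]                     # summary of the model's source domain
        quality: float                                 # e.g., held-out accuracy in [0, 1]

    def domain_relatedness(model: TrainedModel, task_vector: List[float]) -> float:
        # Cosine similarity between the model's domain summary and the new task's summary.
        dot = sum(a * b for a, b in zip(model.domain_vector, task_vector))
        norm = (sum(a * a for a in model.domain_vector) ** 0.5) * \
               (sum(b * b for b in task_vector) ** 0.5)
        return dot / norm if norm else 0.0

    def reuse_potential(model: TrainedModel, task_vector: List[float],
                        alpha: float = 0.5) -> float:
        # Combine domain relatedness and model quality; the weighting is an assumption.
        return alpha * domain_relatedness(model, task_vector) + (1.0 - alpha) * model.quality

    def select_models(store: List[TrainedModel], task_vector: List[float],
                      k: int = 3) -> List[TrainedModel]:
        # Keep the k stored models with the highest estimated reuse potential.
        return sorted(store, key=lambda m: reuse_potential(m, task_vector), reverse=True)[:k]

    def ensemble_predict(models: List[TrainedModel], inputs: Sequence[str]) -> List[int]:
        # Majority vote over the selected (and, in the paper, iteratively fine-tuned) models.
        votes = [m.predict(inputs) for m in models]
        return [max(set(column), key=column.count) for column in zip(*votes)]

    # Toy usage: score three stored models against a new task and ensemble their votes.
    store = [
        TrainedModel("news",    lambda xs: [1] * len(xs), [1.0, 0.0], 0.90),
        TrainedModel("reviews", lambda xs: [0] * len(xs), [0.8, 0.6], 0.85),
        TrainedModel("tweets",  lambda xs: [1] * len(xs), [0.1, 1.0], 0.60),
    ]
    print(ensemble_predict(select_models(store, [0.9, 0.3]), ["doc a", "doc b"]))

In the OLML prototype described in the abstract, such a plan would be generated automatically from an SQL-like training query rather than invoked by hand; the query syntax is not reproduced here.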

Key words: model selection; model reuse; OnLine Machine Learning (OLML) database
