Journal of Computer Science and Technology, 2019, Vol. 34, Issue (5): 957-971. doi: 10.1007/s11390-019-1954-4

Special topics: Data Management and Data Mining Software Systems

• Special Section on Software Systems 2019 •

Semi-Supervised Learning Based Tag Recommendation for Docker Repositories

Wei Chen1,2, Member, CCF, Jia-Hong Zhou1,2, Jia-Xin Zhu1,2, Member, CCF, Guo-Quan Wu1,2,3, Member, CCF, Jun Wei1,2,3, Member, CCF   

  1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
  2 University of Chinese Academy of Sciences, Beijing 100049, China;
  3 State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2019-02-28; Revised: 2019-07-12; Online: 2019-08-31; Published: 2019-08-31
  • About author: Wei Chen received his Ph.D. degree in computer software and theory from the Institute of Software, Chinese Academy of Sciences, Beijing, in 2013. He is currently an associate professor at the Institute of Software, Chinese Academy of Sciences, Beijing. He is a member of CCF. His research interests include service-oriented computing, cloud computing, and DevOps.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000803, and the National Natural Science Foundation of China under Grant Nos. 61732019 and 61572480.

Abstract: Docker has recently become a mainstream technology for providing reusable software artifacts. Developers can easily build and deploy their applications using Docker. A large number of reusable Docker images are now publicly shared in online communities, and semantic tags can help developers reuse these images effectively. However, the communities do not provide tagging services, and manual tagging is exhausting and time-consuming. This paper addresses the problem with a semi-supervised learning-based approach named SemiTagRec. SemiTagRec contains four components: (1) the predictor, which calculates the probability of assigning a specific tag to a given Docker repository; (2) the extender, which introduces new candidate tags based on tag correlation analysis; (3) the evaluator, which measures the quality of the candidate tags with a logistic regression model; and (4) the integrator, which calculates a final score by combining the results of the predictor and the evaluator, and then assigns the high-scoring tags to the given Docker repositories. SemiTagRec adds the newly tagged repositories to the training data for the next round of training. In this way, SemiTagRec iteratively trains the predictor with the cumulative tagged repositories and the extended tag vocabulary to achieve high tag recommendation accuracy. The experimental results show that SemiTagRec outperforms the other approaches, achieving a Recall@5 of 0.688 and a Recall@10 of 0.781.
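
The iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the predictor's model, the evaluator's features, or how the integrator weights its inputs, so the word-vote predictor, the single "tag appears in the text" feature, the fixed logistic weights, and the parameters `min_corr`, `alpha`, `threshold`, and `rounds` are all hypothetical stand-ins chosen only to make the control flow concrete.

```python
import math
from collections import Counter, defaultdict


def train_predictor(tagged):
    """Toy stand-in model: count how often each word co-occurs with each tag."""
    word_tag = defaultdict(Counter)
    for words, tags in tagged:
        for word in set(words):
            for tag in tags:
                word_tag[word][tag] += 1
    return word_tag


def predict(model, words):
    """Predictor: normalised votes of a repository's words for each known tag."""
    votes = Counter()
    for word in set(words):
        votes.update(model.get(word, {}))
    total = sum(votes.values())
    return {tag: count / total for tag, count in votes.items()} if total else {}


def extend(candidates, tagged, min_corr=0.5):
    """Extender: add tags that often co-occur with a candidate tag (assumed rule)."""
    tag_count, pair_count = Counter(), Counter()
    for _, tags in tagged:
        for a in tags:
            tag_count[a] += 1
            for b in tags:
                if a != b:
                    pair_count[(a, b)] += 1
    extended = set(candidates)
    for (a, b), n in pair_count.items():
        if a in candidates and n / tag_count[a] >= min_corr:
            extended.add(b)
    return extended


def evaluate(tag, words, weights=(-1.0, 3.0)):
    """Evaluator: logistic score over one hypothetical feature with fixed weights."""
    bias, w_mention = weights
    feature = 1.0 if tag in words else 0.0
    return 1.0 / (1.0 + math.exp(-(bias + w_mention * feature)))


def integrate(p_score, e_score, alpha=0.5):
    """Integrator: weighted combination of predictor and evaluator scores."""
    return alpha * p_score + (1 - alpha) * e_score


def self_train(tagged, untagged, rounds=3, threshold=0.5):
    """Retrain the predictor each round on the growing set of tagged repositories."""
    tagged, untagged = list(tagged), list(untagged)
    for _ in range(rounds):
        model = train_predictor(tagged)
        newly_tagged, remaining = [], []
        for words in untagged:
            probs = predict(model, words)
            scores = {tag: integrate(probs.get(tag, 0.0), evaluate(tag, words))
                      for tag in extend(probs, tagged)}
            chosen = sorted(tag for tag, s in scores.items() if s >= threshold)
            if chosen:
                newly_tagged.append((words, chosen))
            else:
                remaining.append(words)
        if not newly_tagged:  # nothing new was tagged, so further rounds change nothing
            break
        tagged.extend(newly_tagged)
        untagged = remaining
    return tagged
```

Calling `self_train(tagged, untagged)` with `tagged` as a list of `(words, tags)` pairs and `untagged` as a list of word lists returns the enlarged tagged set. Newly tagged repositories are only merged at the end of each round, so every repository in a round is scored against the same training data.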

Key words: tag recommendation, Docker repository, Dockerfile, semi-supervised learning
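
The Recall@5 and Recall@10 figures quoted in the abstract can be read with the following sketch in mind. It assumes the common definition of Recall@k (the fraction of a repository's true tags that appear among the top k recommendations, averaged over repositories); the abstract does not state the exact formula, and the paper may use a variant, for example one that caps the denominator at k. The function names are illustrative.

```python
def recall_at_k(ranked_tags, true_tags, k):
    """Fraction of the true tags found among the top-k recommended tags."""
    true_set = set(true_tags)
    if not true_set:
        raise ValueError("true_tags must not be empty")
    hits = len(set(ranked_tags[:k]) & true_set)
    return hits / len(true_set)


def mean_recall_at_k(recommendations, ground_truth, k):
    """Average Recall@k over repositories; both lists are aligned by index."""
    scores = [recall_at_k(ranked, truth, k)
              for ranked, truth in zip(recommendations, ground_truth)]
    return sum(scores) / len(scores)
```

Under this definition, a repository with true tags `nginx` and `proxy` whose top two recommendations are `web` and `nginx` has a Recall@2 of 0.5.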
