Semi-Supervised Learning Based Tag Recommendation for Docker Repositories

doi:10.1007/s11390-019-1954-4

Abstract

Docker has recently become a mainstream technology for providing reusable software artifacts. With Docker, developers can easily build and deploy their applications. A large number of reusable Docker images are now publicly shared in online communities, and semantic tags can help developers reuse these images effectively. However, the communities do not provide tagging services, and manual tagging is difficult and time-consuming. This paper addresses the problem with SemiTagRec, a semi-supervised learning-based tag recommendation approach for Docker repositories. SemiTagRec contains four components: (1) the predictor, which calculates the probability of assigning a specific tag to a given Docker repository; (2) the extender, which introduces new tags as candidates based on tag correlation analysis; (3) the evaluator, which measures the quality of the candidate tags based on a logistic regression model; and (4) the integrator, which calculates a final score for each tag by combining the results of the predictor and the evaluator. For the given untagged Docker repositories, SemiTagRec recommends the tags with high scores, which yields a new set of tagged repositories. This set is added to the training data of the previous round to form a larger training set, and the next round of training begins. In this way, SemiTagRec iteratively trains the predictor with the cumulative tagged repositories and the extended tag vocabulary to achieve high-accuracy tag recommendation. The experimental results show that SemiTagRec outperforms the other tag recommendation approaches, reaching 0.688 for Recall@5 and 0.781 for Recall@10.
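The iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the co-occurrence predictor, the fixed-weight logistic stand-in for the evaluator, the weighted-sum integrator, and all names and parameters (`train_predictor`, `extend`, `evaluate`, `semi_tag_rec`, `alpha`, `threshold`, `min_cooccur`) are assumptions chosen only to show how the four components and the self-training rounds might fit together.

```python
import math
from collections import Counter
from itertools import combinations


def train_predictor(tagged):
    """Toy predictor (assumption): estimate P(tag | word) from co-occurrence counts."""
    word_tag, word_total = Counter(), Counter()
    for words, tags in tagged:
        for w in set(words):
            word_total[w] += 1
            for t in tags:
                word_tag[(w, t)] += 1

    def predict(words):
        scores = {}
        for (w, t), count in word_tag.items():
            if w in words:
                scores[t] = max(scores.get(t, 0.0), count / word_total[w])
        return scores

    return predict


def extend(candidates, tagged, min_cooccur=2):
    """Toy extender (assumption): add tags that often co-occur with a candidate tag."""
    pair_counts = Counter()
    for _, tags in tagged:
        for a, b in combinations(sorted(tags), 2):
            pair_counts[(a, b)] += 1
    extended = set(candidates)
    for (a, b), count in pair_counts.items():
        if count >= min_cooccur:
            if a in candidates:
                extended.add(b)
            if b in candidates:
                extended.add(a)
    return extended


def evaluate(tag, words):
    """Stand-in evaluator: a logistic function with fixed, hypothetical weights.

    The paper trains a logistic regression model; here the single feature is
    whether the tag literally appears among the repository's words.
    """
    x = 1.0 if tag in words else 0.0
    return 1.0 / (1.0 + math.exp(-(2.0 * x - 1.0)))


def semi_tag_rec(tagged, untagged, rounds=3, threshold=0.6, alpha=0.5):
    """Self-training loop: retrain the predictor on the growing tagged set."""
    tagged, pending = list(tagged), list(untagged)
    for _ in range(rounds):
        predict = train_predictor(tagged)
        newly_tagged, still_pending = [], []
        for words in pending:
            predicted = predict(words)
            candidates = extend(set(predicted), tagged)
            # Integrator (assumption): weighted sum of predictor and evaluator scores.
            final = {
                t: alpha * predicted.get(t, 0.0) + (1 - alpha) * evaluate(t, words)
                for t in candidates
            }
            chosen = {t for t, score in final.items() if score >= threshold}
            if chosen:
                newly_tagged.append((words, chosen))
            else:
                still_pending.append(words)
        if not newly_tagged:
            break  # nothing new to learn from, so stop early
        tagged.extend(newly_tagged)
        pending = still_pending
    return tagged, pending
```

Each repository is represented here as a set of words, and each tagged repository as a `(words, tags)` pair. Repositories that receive no tag above the threshold stay in the pending list and are retried in the next round, after the predictor has been retrained on the enlarged training set.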
