A Parallel and Online Supervised Topic Model Based on Stochastic Variational Inference for Large-Scale Text Processing

doi:10.1007/s11390-018-1871-y

Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

Abstract

Topic modeling is a mainstream and effective technique for processing text data, with wide applications in text analysis, natural language processing, personalized recommendation, computer vision, and other fields. Among the known topic models, supervised Latent Dirichlet Allocation (sLDA) is widely regarded as one of the most influential supervised topic models. However, as datasets grow in scale, sLDA becomes increasingly inefficient and time-consuming, which restricts it to small text collections and greatly narrows its range of applications. To address this problem, this study proposes a parallel online sLDA named PO-sLDA (Parallel and Online sLDA). PO-sLDA uses stochastic variational inference as its learning method so that parameter estimation is fast and efficient, and on that basis implements a parallel parameter computation mechanism with the MapReduce framework to improve its capacity for cloud computing and big data processing. The online training capacity of PO-sLDA removes the application limits of sLDA and makes the approach better suited to real-world online applications with high real-time demands. Experiments on two datasets of different sizes show that, compared with sLDA, PO-sLDA attains similar or even higher accuracy with much less training time. Moreover, its good convergence and online training capacity give it greater advantages and potential for large-scale text data processing and analysis.
