A New Retrieval Model Based on TextTiling for Document Similarity Search

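Abstract (translated from the Chinese): Document similarity search retrieves, from a document collection, the documents that are similar to a given query document. For a given query document, a document similarity search system is expected to return a list of similar documents ranked by similarity. The technique is already widely used in digital libraries, search engines, and similar systems, for example the similar-document recommendation feature of the CiteSeer.IST scientific literature digital library and Google's similar-pages query. Traditional retrieval models can solve this problem to some extent. They include the information retrieval models common in TREC (the Text REtrieval Conference), such as the BM25 model of the Okapi system and the vector space model of the SMART system, and popular document similarity models such as the cosine model. BM25 and the SMART model perform well on traditional keyword queries, but similarity search takes a whole document as the query, which differs from keyword querying. The cosine model measures the similarity between documents well and is therefore regarded as a comparatively good model for similarity search. To date, no published work has compared these models experimentally on the document similarity search problem. This paper makes that experimental comparison and, to address the cosine model's inability to account for structural similarity between documents, proposes a similarity retrieval model based on TextTiling that takes the subtopic structure of documents into account. The model first uses TextTiling to split each document into text blocks that represent subtopics, then computes the similarity between the text blocks of the two documents, and finally combines the block similarities through the optimal matching method from graph theory to obtain the overall similarity between the two documents. Our experiments verify three points: 1) the information retrieval models commonly used in TREC do not solve document similarity search well; 2) the proposed TextTiling-based model is effective and outperforms the other models; 3) the methods used within the proposed model are effective, namely TextTiling for subtopic segmentation, the cosine formula for computing the similarity between text blocks, and optimal matching for computing the overall similarity between documents.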

Abstract: Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and SMART's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. This paper compares the performance of these models experimentally. Because the Cosine measure cannot reflect the structural similarity between documents, the paper also proposes a new retrieval model based on TextTiling that takes the subtopic structure of documents into account. The model first splits each document into text segments with TextTiling, then calculates the similarity of each pair of text segments drawn from the two documents, and finally obtains the overall similarity between the documents by combining the segment-pair similarities with the optimal matching method. The experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and SMART's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are appropriate.
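
The three steps named in the abstract (segmentation, segment-pair similarity, optimal matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: blank-line paragraph splitting stands in for TextTiling, segments are weighted by raw term frequency, the optimal matching is found by brute force over permutations rather than by a polynomial-time assignment algorithm, and the matched weight is normalized by the larger segment count. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import permutations


def split_into_segments(text):
    # Assumption: blank-line-separated paragraphs stand in for TextTiling segments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def term_vector(segment):
    # Assumption: raw term frequencies, with no stemming, stop words, or idf weighting.
    return Counter(re.findall(r"[a-z0-9]+", segment.lower()))


def cosine(u, v):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def optimal_matching_weight(weights):
    # Maximum total weight of a one-to-one matching between rows and columns.
    # Brute force over permutations: exact, but only practical for a few segments.
    if not weights or not weights[0]:
        return 0.0
    rows, cols = len(weights), len(weights[0])
    if rows > cols:
        # Transpose so that every row can be matched to a distinct column.
        weights = [list(col) for col in zip(*weights)]
        rows, cols = cols, rows
    return max(
        sum(weights[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(cols), rows)
    )


def document_similarity(doc_a, doc_b):
    segs_a = [term_vector(s) for s in split_into_segments(doc_a)]
    segs_b = [term_vector(s) for s in split_into_segments(doc_b)]
    if not segs_a or not segs_b:
        return 0.0
    weights = [[cosine(a, b) for b in segs_b] for a in segs_a]
    # Assumption: normalize by the larger segment count so scores stay in [0, 1].
    return optimal_matching_weight(weights) / max(len(segs_a), len(segs_b))
```

Under these assumptions, two documents that contain the same subtopics in a different order still score close to 1, because the matching pairs each segment with its best counterpart regardless of position. The matching is solved globally rather than greedily: for the weights `[[0.9, 0.8], [0.9, 0.1]]`, taking the largest entry first would give a total of 1.0, while the optimal matching gives 1.7.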

A New Retrieval Model Based on TextTiling for Document Similarity Search