一个用于点对点网络的基于语义的多样化内容搜索模型

黄俊成; 李秀琦; 吴杰

doi:10.1007/s11390-011-1190-z

摘要: 随着点对点网络(P2P Networks)的兴起和普及，如何在其中有效的搜索成了一个热门的研究话题。除了盲目类型的随机行走式搜索(Random Walk)，逐层递深式搜索(Iterative Deepening)，还有智能类型的索引寻路式搜索(Routing Indices)，指导式广度优先搜索(Directed BFS)等等。研究者还探索了许多基于语义的搜索方法，它们可以用于关键字全文搜索。比如在GES这个语义搜索模型中，作者借鉴了信息检索的原理，使用向量空间模型(Vector Space Model)，将一个点对点节点(P2P Node)上的内容提炼成一个节点语义向量(Vector)。根据节点语义向量的相似性，提出了节点网络结构自调节和相对应的搜索算法协议。节点网络结构自调节指得是将点对点节点按语义向量的相似性分类成语义簇(Semantic Cluster)，在内容相似的语义簇之间建立相似连接(Semantic Link)，不相似的语义簇之间建立随机连接(Random Link)。随着节点上内容的动态变动，语义簇也会动态的调整与重组。在搜索时，查询将沿着随机连接被导向至目标语义簇，并在其中泛滥式散布(Flooding)，这样就能提高在点对点网络搜索相似内容的效率。这个模型在点对点节点上的内容属性高度一致时非常有效，但是如果一个节点上存在多样化的内容时，这个模型就很难准确地提炼节点语义向量。我们进而提出了基于虚拟类的语义搜索模型。其核心想法是将一个点对点节点上的内容精确分类成多个虚拟类(Virtual Class)，在不同的虚拟类之间建立虚拟连接(Virtual Link)。因此，虚拟类取代了节点语义向量，虚拟连接又衍生出了新的网络结构自调节与搜索算法。不仅如此，我们还提出了两个搜索算法，CSS(1)和CSS(2)。CSS(1)是完全基于虚拟类和虚拟连接的，搜索查询将沿着虚拟连接从一个虚拟类被导向至另一个虚拟类。CSS(2)是部分基于虚拟类的，搜索查询将根据节点上虚拟类的相似性沿着物理连接从一个节点被导向至另一个节点。CSS(2)可以被看作是CSS(1)的变体。在实验环境上，我们使用了在信息检索领域权威性的TREC文档集合，将这些标准化又多样化的文档置于多个仿真节点上，使用有效的OSKM分类算法分类并建立虚拟连接网络，最后考察新的算法协议的各项性能。通过调节实验环境设置和参数，比如节点度(Node Degree)，网络类型，网络规模等等，我们有说服力地证明了新的模型(CSS)实现了更高的搜索效率，搜索精度并且降低了搜索开销。我们还研究了节点网络结构自调节所带来的消息负荷(Message Overhead)，通过设定合适的消息半径值(Bounded TTL)，CSS的消息负荷要低于GES。我们最后通过实验比较了两个搜索算法CSS(1)和CSS(2)的各项指标，比如命中比率和消息负荷等，揭示了它们的优劣。我们的主要贡献可归纳为提出了一种新的动态的语义搜索模型，尤其适用于内容各向异性分布的无结构式点对点网络(Unstructured P2P Networks)。据我们所知，这是第一个运用信息检索方法去搜索各向异性的单节点文档分布的节点网络的模型。

Abstract: Semantic-based searching in peer-to-peer (P2P) networks has drawn significant attention recently. A number of semantic searching schemes, such as GES proposed by Zhu Y et al., employ search models in Information Retrieval (IR). All these IR-based schemes use one vector to summarize semantic contents of all documents on a single node. For example, GES derives a node vector based on the IR model: VSM (Vector Space Model). A topology adaptation algorithm and a search protocol are then designed according to the similarity between node vectors of different nodes. Although the single semantic vector is suitable when the distribution of documents in each node is uniform, it may not be efficient when the distribution is diverse. When there are many categories of documents at each node, the node vector representation may be inaccurate. We extend the idea of GES and present a new class-based semantic searching scheme (CSS) specifically designed for unstructured P2P networks with heterogeneous single-node document collection. It makes use of a state-of-the-art data clustering algorithm, online spherical k-means clustering (OSKM), to cluster all documents on a node into several classes. Each class can be viewed as a virtual node. Virtual nodes are connected through virtual links. As a result, the class vector replaces the node vector and plays an important role in the class-based topology adaptation and search process. This makes CSS very efficient. Our simulation using the IR benchmark TREC collection demonstrates that CSS outperforms GES in terms of higher recall, higher precision, and lower search cost.

一个用于点对点网络的基于语义的多样化内容搜索模型

A Semantic Searching Scheme in Heterogeneous Unstructured P2P Networks