-
摘要: 随着点对点网络(P2P Networks)的兴起和普及,如何在其中有效的搜索成了一个热门的研究话题。除了盲目类型的随机行走式搜索(Random Walk),逐层递深式搜索(Iterative Deepening),还有智能类型的索引寻路式搜索(Routing Indices),指导式广度优先搜索(Directed BFS)等等。研究者还探索了许多基于语义的搜索方法,它们可以用于关键字全文搜索。比如在GES这个语义搜索模型中,作者借鉴了信息检索的原理,使用向量空间模型(Vector Space Model),将一个点对点节点(P2P Node)上的内容提炼成一个节点语义向量(Vector)。根据节点语义向量的相似性,提出了节点网络结构自调节和相对应的搜索算法协议。节点网络结构自调节指得是将点对点节点按语义向量的相似性分类成语义簇(Semantic Cluster),在内容相似的语义簇之间建立相似连接(Semantic Link),不相似的语义簇之间建立随机连接(Random Link)。随着节点上内容的动态变动,语义簇也会动态的调整与重组。在搜索时,查询将沿着随机连接被导向至目标语义簇,并在其中泛滥式散布(Flooding),这样就能提高在点对点网络搜索相似内容的效率。这个模型在点对点节点上的内容属性高度一致时非常有效,但是如果一个节点上存在多样化的内容时,这个模型就很难准确地提炼节点语义向量。我们进而提出了基于虚拟类的语义搜索模型。其核心想法是将一个点对点节点上的内容精确分类成多个虚拟类(Virtual Class),在不同的虚拟类之间建立虚拟连接(Virtual Link)。因此,虚拟类取代了节点语义向量,虚拟连接又衍生出了新的网络结构自调节与搜索算法。不仅如此,我们还提出了两个搜索算法,CSS(1)和CSS(2)。CSS(1)是完全基于虚拟类和虚拟连接的,搜索查询将沿着虚拟连接从一个虚拟类被导向至另一个虚拟类。CSS(2)是部分基于虚拟类的,搜索查询将根据节点上虚拟类的相似性沿着物理连接从一个节点被导向至另一个节点。CSS(2)可以被看作是CSS(1)的变体。在实验环境上,我们使用了在信息检索领域权威性的TREC文档集合,将这些标准化又多样化的文档置于多个仿真节点上,使用有效的OSKM分类算法分类并建立虚拟连接网络,最后考察新的算法协议的各项性能。通过调节实验环境设置和参数,比如节点度(Node Degree),网络类型,网络规模等等,我们有说服力地证明了新的模型(CSS)实现了更高的搜索效率,搜索精度并且降低了搜索开销。我们还研究了节点网络结构自调节所带来的消息负荷(Message Overhead),通过设定合适的消息半径值(Bounded TTL),CSS的消息负荷要低于GES。我们最后通过实验比较了两个搜索算法CSS(1)和CSS(2)的各项指标,比如命中比率和消息负荷等,揭示了它们的优劣。我们的主要贡献可归纳为提出了一种新的动态的语义搜索模型,尤其适用于内容各向异性分布的无结构式点对点网络(Unstructured P2P Networks)。据我们所知,这是第一个运用信息检索方法去搜索各向异性的单节点文档分布的节点网络的模型。Abstract: Semantic-based searching in peer-to-peer (P2P) networks has drawn significant attention recently. A number of semantic searching schemes, such as GES proposed by Zhu Y et al., employ search models in Information Retrieval (IR). All these IR-based schemes use one vector to summarize semantic contents of all documents on a single node. For example, GES derives a node vector based on the IR model: VSM (Vector Space Model). A topology adaptation algorithm and a search protocol are then designed according to the similarity between node vectors of different nodes. Although the single semantic vector is suitable when the distribution of documents in each node is uniform, it may not be efficient when the distribution is diverse. When there are many categories of documents at each node, the node vector representation may be inaccurate. We extend the idea of GES and present a new class-based semantic searching scheme (CSS) specifically designed for unstructured P2P networks with heterogeneous single-node document collection. It makes use of a state-of-the-art data clustering algorithm, online spherical k-means clustering (OSKM), to cluster all documents on a node into several classes. Each class can be viewed as a virtual node. Virtual nodes are connected through virtual links. As a result, the class vector replaces the node vector and plays an important role in the class-based topology adaptation and search process. This makes CSS very efficient. Our simulation using the IR benchmark TREC collection demonstrates that CSS outperforms GES in terms of higher recall, higher precision, and lower search cost.
-
Keywords:
- class-based search /
- GES /
- semantic clustering /
- topology adaptation /
- P2P networks
-
-
[1] Li X, Wu J. Searching techniques in peer-to-peer networks.Handbook of Theoretical and Algorithmic Aspects of Sensor,Ad Hoc Wireless, and Peer-to-Peer Networks, CRC Press,2005, http://www.kiv.zcn.lz/~ledvina/DHT/p2psurvey.pdf.
[2] Ratnasamy S, Francis P, Handley M, Karp R, Shenker S. Ascalable content addressable network. In Proc. the 2003Conf. Applications, Technologies, Architecture, and Proto-cols for Computer Communications (SIGCOMM 2001), SanDiego, USA, August 27-31, 2001, pp.161-172.
[3] Rowstron A, Druschel P. Pastry: Scalable, distributed ob-ject location and routing for large-scale peer-to-peer systems.In Proc. the 18th IFIP/ACM International Conference onDistributed System Platforms (Middleware 2001), Heidelberg,Germany, November 12-16, 2001, pp.329-350.
[4] Stoica I, Morris R, Nowell D L, Karger D, Kaashoek M, DabekF, Balakrishnan H. Chord: A scalable peer-to-peer lookupprotocol for internet applications. IEEE/ACM Transactionson Networking, 2003, 11(1): 17-32.
[5] Yu D, Chen X, Chang Y. An improved P2P model based onChord. In Proc. the 6th International Conference on Paral-lel and Distributed Computing Applications and Technologies(PDCAT2005), Dalian, China, December 5-8, 2005, pp.807-811.
[6] Xue K, Hongk P, Li J. FS-chord: A new P2P model with frac-tional steps joining. In Proc. Advanced International Con-ference on Telecommunications and International Conferenceon Internet and Web Applications and Services (AICT-ICIW2006), Guadeloupe, French Caribbean, February 19-25, 2006.
[7] Cai J, Shao X, Ma W. Ontology driven semantic search overstructured P2P network. In Proc. the 9th International Con-ference on Hybrid Intelligent Systems (HIS 2009), Shenyang,China, August 12-14, 2009, pp.29-34.
[8] Dragan F, Gardarin G, Yeh L. A semantic layer for publish-ing and localizing XML data for a P2P XQuery mediator. InProc. the 17th International World Wide Web Conference(WWW2008), Beijing, China, April 21-25, 2008, pp.1105-1106.
[9] Zhu Y, Hu Y. Efficient semantic search on DHT overlays.Journal of Parallel and Distributed Computing, 2007, 67(5):604-616.
[10] Clarke I, Sandberg O, Wiley B, Hong T W. Freenet: A dis-tributed anonymous information storage and retrieval system.In Proc. the 2000 Workshop on Design Issues in Anonymityand Unobservability, Berkeley, USA, July 25-26, 2000, pp.46-66.
[11] Manku G S, Bawa M, Raghavan P. Symphony: Distributedhashing in a small world. In Proc. the 4th USENIX Sym-posium on Internet Technology and Systems (USITS 2003),Seattle, USA, March 26-28, 2003.
[12] The Gnutella Protocol Specification V0.4. http://www.stanfo-rd.edu/class/cs244b/gnutella protocol 0.4.pdf.
[13] Lv Q, Cao P, Cohen E et al. Search and replication in un-structured peer-to-peer networks. In Proc. the 16th ACM In-ternational Conference on Supercomputing (ACM ICS 2002),New York, USA, June 22-26, 2002, pp.84-95.
[14] Yang B, Garcia-Molina H. Improving search in peer-to-peernetworks. In Proc. the 22nd IEEE International Conferenceon Distributed Computing (IEEE ICDCS 2002), Vienna, Aus-tria, July 2-5, 2002.
[15] Crespo A, Garcia-Molina H. Routing indices for peer-to-peersystems. In Proc. the 22nd International Conference on Dis-tributed Computing Systems (IEEE ICDCS 2002), Vienna,Austria, July 2-5, 2002.
[16] Zhu Y, Yang X, Hu Y. Making search efficient on Gnutella-likeP2P systems. In Proc. the 19th IEEE International Parallel& Distributed Processing Symposium (IPDPS 2005), Denver,USA, April 3-8, 2005.
[17] Faye D, Nachouki G, Valduriez P. Semantic query routing inSenPeer, a P2P data management system. In Lecture Notesin Computer Science 4658, Enokido T et al. (eds.), Springer-Verlag, 2007, pp.365-374.
[18] Dohnal V, Sedmidubsky J. Query routing mechanisms inself-organizing search systems. In Proc. the 2nd Inter-national Workshop on Similarity Search and Applications(SISAP 2009), Prague, Czech Republic, August 29-30, 2009,pp.132-139.
[19] Haase P, Siebes R, Harmelen F V. Expertise-based peer se-lection in Peer-to-Peer networks. Knowledge and InformationSystems, 2008, 15(1): 75-107.
[20] Pirro G, Ruffolo M, Talia D. Advanced semantic searchand retrieval in a collaborative peer-to-peer system. InProc. the 2008 International Workshop on Content Mana-gement and Delivery in Large-Scale Networks (UPGRADE-CN2008), Boston, USA, June 23-27, 2008, pp.65-72.
[21] Bawa M, Manku G, Raghavan P. SETS: Search enhanced bytopic segmentation. In Proc. the 26th Annual InternationalACM SIGIR Conference (SIGIR 2003), Toronto, Canada,July 28-August 1, 2003, pp.306-313.
[22] Shen H T, Shu Y, Yu B. Efficient semantic-based contentsearch in P2P network. IEEE Transactions on Knowledgeand Data Engineering, 2004, 16(7): 813-826.
[23] Zhou Y, Croft W B, Levine B N. Content-based search in peer-to-peer networks. Technical Report, University of Mas-sachusetts, 2004.
[24] Witschel H F. Content-oriented topology restructuring forsearch in P2P networks. Technical Report, University ofLeipzig, Germany, 2005.
[25] Zhu Y, Hu Y. Enhancing search performance on Gnutella-likeP2P systems. IEEE Transactions on Parallel and DistributedSystems, 2006, 17(12): 1482-1495.
[26] Yang X, Hu Y. SEIF: Search enhanced by intelligent feedbackin unstructured P2P networks. In Proc. International Con-ference on Parallel Processing, Vienna, Austria, September22-25, 2009, pp.494-501.
[27] Zhong S. Efficient online spherical K-means clustering.In Proc. IEEE Int. Joint Conf. Neural Networks(IJCNN 2005), Montreal, Canada, July 31-August 4, 2005,pp.3180-3185.
[28] Wang Q, Li R, Chen L, Lian J, ? Ozsu M T. Speed up semanticsearch in P2P networks. In Proc. the ACM 17th Conferenceon Information and Knowledge Management (CIKM 2008),Napa Valley, USA, October 26-30, 2008, pp.1341-1342.
[29] Kacimi M, Yetongnon K. Similarity search in a hybrid over-lay P2P network. In Proc. the 11th IEEE Symposiumon Computers and Communications (ISCC 2006), Cagliari,Italy, June 26-29, 2006.
[30] Comito C, Patarin S, Talia D. A semantic overlay net-work for P2P schema-based data integration. In Proc. the11th IEEE Symposium on Computers and Communications(ISCC 2006), Cagliari, Italy, June 26-29, 2006.
[31] Yang X, Hu Y. Search enhanced by distributed semantic clus-tering in Gnutella-like P2P systems. In Proc. the 15th Inter-national Symposium on Modelling, Analysis, and Simulationof Computer and Telecommunication (MASCOTS 2007), Is-tanbul, Turkey, October 24-26, 2007, pp.318-324.
[32] Huang J, Li X, Wu J. A class-based search system in unstruc-tured P2P networks. In Proc. the 21st International Con-ference on Advanced Networking and Applications, NiagaraFalls, Canada, May 21-23, 2007, pp.76-83.
[33] Ng C H, Sia K C. Peer clustering and firework query model.In Proc. the 11th International World Wide Web Conference(WWW 2002), Honolulu, Hawaii, USA, May 7-11, 2002.
[34] Crespo A, Garcia-Molina H. Semantic overlay networks forP2P systems. In Lecture Notes in Computer Science 3601,Garbonell J G, Siekmann J (eds.), Springer-Verlag, 2005,pp.1-13.
[35] Lin K, Wang C, Chou C, Golubchik L. SocioNet: A social-based multimedia access system for unstructured P2P net-works. IEEE Transactions on Parallel and Distributed Sys-tems, 2010, 21(7): 1027-1041.
[36] Deconinck G, Vanthournout K. Agora: A semantic overlaynetwork. International Journal of Critical Infrastructures,2009, 5(1/2): 175-195.
[37] Chawathe Y, Ratnasamy S, Breslau L, Lanham N, Sheaker S.Making gnutella-like P2P systems scalable. In Proc. the 2003Conf. Applications, Technologies, Architecture, and Proto-cols for Computer Communications (SIGCOMM2003), Karl-sruhe, Germany, August 25-29, 2003, pp.407-418.
[38] Berry M W, Drmac Z, Jessup E R. Matrices, vector spaces,and information retrieval. SIAM Review, 1999, 41(2): 335-362.
[39] Text REtrieval Conference (TREC). http://trec.nist.gov,May, 2010.
[40] McCallum A K. Rainbow toolkit. http://www.cs.cmu.edu/~mccallum/bow/, May, 2010.
计量
- 文章访问数: 35
- HTML全文浏览量: 0
- PDF下载量: 1535