计算机科学技术学报 ›› 2019,Vol. 34 ›› Issue (4): 727-746.doi: 10.1007/s11390-019-1939-3

所属专题: 综述 Data Management and Data Mining

• • 上一篇    下一篇

微博位置与轨迹识别

Na Ta1, Member, CCF, Guo-Liang Li2,*, Member, CCF, ACM, IEEE, Jun Hu3, Jian-Hua Feng2   

  1. 1 School of Journalism and Communication, Renmin University of China, Beijing 100872, China;
    2 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    3 Airbnb Network(Beijing) Inc., Beijing 100020, China
  • 收稿日期:2019-01-31 修回日期:2019-03-10 出版日期:2019-07-11 发布日期:2019-07-11
  • 通讯作者: Guo-Liang Li E-mail:liguoliang@tsinghua.edu.cn
  • 作者简介:Na Ta received her Ph.D.degree in computer science from Tsinghua University,Beijing,in 2017,her M.E.and B.E.degrees in computer science from Tsinghua University,Beijing,and Beijing University of Posts and Telecommunications,Beijing,in 2007 and 2004,respectively.She is currently an assistant professor at the School of Journalism and Communications in Renmin University of China,Beijing.Her main research interests include computational communicaitons and social networks.She is a member of CCF.
  • 基金资助:
    This work is supported by the National Natural Science Foundation of China under Grant Nos. 61802414, 61632016, 61521002 and 61661166012, the National Basic Research 973 Program of China under Grant No. 2015CB358700, the Social Science Foundation of Beijing under Grant No. 18XCC011, the Humanities and Social Sciences Base Foundation of Ministry of Education of China under Grant No. 16JJD860008, Huawei, and TAL (Tomorrow Advancing Life) education.

Location and Trajectory Identification from Microblogs

Na Ta1, Member, CCF, Guo-Liang Li2,*, Member, CCF, ACM, IEEE, Jun Hu3, Jian-Hua Feng2   

  1. 1 School of Journalism and Communication, Renmin University of China, Beijing 100872, China;
    2 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    3 Airbnb Network(Beijing) Inc., Beijing 100020, China
  • Received:2019-01-31 Revised:2019-03-10 Online:2019-07-11 Published:2019-07-11
  • Contact: Guo-Liang Li E-mail:liguoliang@tsinghua.edu.cn
  • Supported by:
    This work is supported by the National Natural Science Foundation of China under Grant Nos. 61802414, 61632016, 61521002 and 61661166012, the National Basic Research 973 Program of China under Grant No. 2015CB358700, the Social Science Foundation of Beijing under Grant No. 18XCC011, the Humanities and Social Sciences Base Foundation of Ministry of Education of China under Grant No. 16JJD860008, Huawei, and TAL (Tomorrow Advancing Life) education.

社交网络的迅速发展导致了用户生成内容(UGC)的激增,UGC数据有利于多种应用。例如,从微博中识别用户的位置对于有效的基于位置的广告和推荐非常重要。本文研究了从微博中识别用户位置的问题。解决这一问题存在的难点在于,微博中的位置信息并不完整,因此无法从单一微博中获得准确的位置。为了应对这一挑战,我们提出了一种全局的位置识别方法Glitter,结合用户的多条微博识别用户位置信息。Glitter不仅提高了用户位置识别的质量,还可以准确地补全微博位置。为了辅助位置识别,Glitter将兴趣点(POI)数据组织成一个树结构,其中叶节点是POI,而非叶节点是POI的分段,例如国家、州、城市、地区和街道。使用树结构,Glitter首先从用户的每条微博中提取对应于某些树节点的候选位置。然后,glitter聚集这些候选位置并识别微博用户的Top-K位置。使用确定的Top-K用户位置,Glitter进一步优化候选位置并计算每条微博的Top-K位置。为了达到高召回率,我们启用了位置和微博之间的模糊匹配,并提出了一种支持微博动态更新的增量算法。此外,我们还研究了如何根据提取的位置来识别用户的轨迹,并提出一种有效的轨迹提取算法,解决高质量用户轨迹的识别问题。在真实数据集的实验结果表明,该方法具有较高的质量和良好的性能,且具有良好的扩展性。

关键词: 位置识别, 微博, 轨迹识别

Abstract: The rapid development of social networks has resulted in a proliferation of user-generated content (UGC), which can benefit many applications. In this paper, we study the problem of identifying a user's locations from microblogs, to facilitate effective location-based advertisement and recommendation. Since the location information in a microblog is incomplete, we cannot get an accurate location from a local microblog. As such, we propose a global location identification method, Glitter. Glitter combines multiple microblogs of a user and utilizes them to identify the user's locations. Glitter not only improves the quality of identifying a user's location but also supplements the location of a microblog so as to obtain an accurate location of a microblog. To facilitate location identification, Glitter organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, cities, and streets. Using the tree structure, Glitter first extracts candidate locations from each microblog of a user which correspond to some tree nodes. Then Glitter aggregates these candidate locations and identifies top-k locations of the user. Using the identified top-k user locations, Glitter refines the candidate locations and computes top-k locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. We also study how to identify users' trajectories based on the extracted locations. We propose an effective algorithm to extract high-quality trajectories. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales well.

Key words: location identification, microblog, trajectory identification

[1] Li R, Wang S, Deng H, Wang R, Chang K. Towards social user profiling:Unified and discriminative influence model for inferring home locations. In Proc. the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2012, pp.1023-1031.
[2] Chakrabarti K, Chaudhuri S, Ganti V, Xin D. An efficient filter for approximate membership checking. In Proc. the 2008 ACM SIGMOD International Conference on Management of Data, June 2008, pp.805-818.
[3] Li G, Deng D, Feng J. Faerie:Efficient filtering algorithms for approximate dictionary-based entity extraction. In Proc. the 2011 ACM SIGMOD International Conference on Management of Data, June 2011, pp.529-540.
[4] Li G, Deng D, Feng J. An efficient trie-based method for approximate entity extraction with edit-distance constraints. In Proc. the 28th International Conference on Data Engineering, April 2012, pp.762-773.
[5] Hoffart J, Suchanek F M, Berberich K, Weikum G. YAGO2:A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 2013, 194:28-61.
[6] Cheng Z, Caverlee J, Lee K. You are where you tweet:A content-based approach to geo-locating Twitter users. In Proc. the 19th ACM International Conference on Information and Knowledge Management, June 2010, pp.759-768.
[7] Chandra S, Khan L, Muhaya F B. Estimating Twitter user location using social interactions-A content based approach. In Proc. the 3rd International Conference on Social Computing, October 2011, pp.838-843.
[8] Amitay E, Harel N, Sivan R, Soffer A. Web-a-where:Geotagging web content. In Proc. the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2004, pp.273-280.
[9] Backstrom L, Kleinberg J M, Kumar R, Novak J. Spatial variation in search engine queries. In Proc. the 17th International Conference on World Wide Web, April 2008, pp.357-366.
[10] Li G, Hu J, Feng J, Tan K. Effective location identification from microblogs. In Proc. the 30th International Conference on Data Engineering, March 2014, pp.880-891.
[11] Chandel A, Nagesh P C, Sarawagi S. Efficient batch top-k search for dictionary-based entity recognition. In Proc. the 22nd International Conference on Data Engineering, April 2006, Article No. 28.
[12] Sun C, Naughton J F. The token distribution filter for approximate string membership. In Proc. the 14th International Workshop on the Web and Databases, June 2011, Article No. 5.
[13] Lu J, Han J, Meng X. Efficient algorithms for approximate member extraction using signature-based inverted lists. In Proc. the 18th ACM Conference on Information and Knowledge Management, November 2009, pp.315-324.
[14] Wang W, Xiao C, Lin X, Zhang C. Efficient approximate entity extraction with edit distance constraints. In Proc. the 2009 ACM SIGMOD International Conference on Management of Data, June 2009, pp.759-770.
[15] Chaudhuri S, Ganti V, Xin D. Mining document collections to facilitate accurate approximate entity matching. Proceedings of the VLDB Endowment, 2009, 2(1):395-406.
[16] Agrawal S, Chakrabarti K, Chaudhuri S, Ganti V. Scalable ad-hoc entity extraction from text collections. Proceedings of the VLDB Endowment, 2008, 1(1):945-957.
[17] Deng D, Li G, Feng J, Duan Y, Gong Z. A unified framework for approximate dictionary-based entity extraction. Proceedings of the VLDB Endowment, 2015, 24(1):143-167.
[18] Li K, Li G. Approximate query processing:What is new and where to go?-A survey on approximate query processing. Data Science and Engineering, 2018, 3(4):379-397.
[19] Gao D, Tong Y, She J, Song T, Chen L, Xu K. Top-k team recommendation and its variants in spatial crowdsourcing. Data Science and Engineering, 2017, 2(2):136-150.
[20] Leal F, Malheiro B, González-Vélez H, Burguillo J. Trustbased modelling of multi-criteria crowdsourced data. Data Science and Engineering, 2017, 2(3):199-209.
[21] Mei Q, Liu C, Su H, Zhai C. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proc. the 15th International Conference on World Wide Web, May 2006, pp.533-542.
[22] Rattenbury T, Good N, Naaman M, Towards automatic extraction of event and place semantics from flickr tags. In Proc. the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2007, pp.103-110.
[23] Backstrom L, Sun E, Marlow C. Find me if you can:Improving geographical prediction with social and spatial proximity. In Proc. the 19th International Conference on World Wide Web, April 2010, pp.61-70.
[24] Hao Q, Cai R, Wang C, Xiao R, Yang J, Pang Y, Zhang L. Equip tourists with knowledge mined from travelogues. In Proc. the 19th International Conference on World Wide Web, April 2010, pp.401-410.
[25] Yin Z, Cao L, Han J, Zhai C, Huang T. Geographical topic discovery and comparison. In Proc. the 20th International Conference on World Wide Web, March 2011, pp.247-256.
[26] Hong L, Ahmed A, Gurumurthy S, Smola A, Tsioutsiouliklis K. Discovering geographical topics in the Twitter stream. In Proc. the 21st International Conference on World Wide Web, April 2012, pp.769-778.
[27] Li G, Deng D, Wang J, Feng J. PASS-JOIN:A partitionbased method for similarity joins. Proceedings of the VLDB Endowment, 2011, 5(3):253-264.
[28] Li G, Deng D, Feng J. A partition-based method for string similarity joins with edit-distance constraints. ACM Transactions on Database Systems, 2013, 38(2):Article No. 9.
[29] Wang J, Li G, Feng J. Trie-join:Efficient trie-based string similarity joins with edit-distance constraints. Proceedings of the VLDB Endowment, 2010, 3(1):1219-1230.
[30] Xu L, Ling T W, Wu H, Bao Z. DDE:From Dewey to a fully dynamic XML labeling scheme. In Proc. the 2009 ACM SIGMOD International Conference on Management of Data, June 2009, pp.719-730.
[1] Fei-Fei Kou, Jun-Ping Du, Cong-Xian Yang, Yan-Song Shi, Wan-Qiu Cui. 基于微博多特征的标签推荐[J]. , 2018, 33(4): 711-726.
[2] Jin-Qi Zhu, Li Lu, Chun-Mei Ma. 从兴趣到位置:微博系统中基于近邻关系的朋友推荐[J]. , 2015, 30(6): 1188-1200.
[3] Xian Wu, Wei Fan, Jing Gao Zi-Ming Feng, Yong Yu. 检测僵尸用户提高微博数据的可靠度[J]. , 2015, 30(5): 1082-1096.
[4] Wu Yang Guo-Wei Shen, Wei Wang, Liang-Yi Gong, Miao Yu, Guo-Zhong Dong. 基于联合聚类的微博异常检测[J]. , 2015, 30(5): 1097-1108.
[5] Fei Jiang, Yi-Qun Liu, Huan-Bo LuanJia-Shen Sun, Xuan Zhu, Min Zhang, Shao-Ping Ma. 利用表情符空间进行微博情感分析[J]. , 2015, 30(5): 1120-1129.
[6] Yan-Tao Jia, Yuan-Zhuo Wang, Xue-Qi Cheng. 微博中融合结构和交互信息的链接预测学习[J]. , 2015, 30(4): 829-842.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 闵应骅;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[2] 向东; 魏道政; 陈世松;. Probabilistic Models for Estimation of Random and Pseudo-Random Test Length[J]. , 1992, 7(2): 164 -174 .
[3] 马军; 马绍汉;. Efficient Parallel Algorithms for Some Graph Theory Problems[J]. , 1993, 8(4): 76 -80 .
[4] 张明义;. Some Results on Default Logic[J]. , 1994, 9(3): 267 -274 .
[5] 马光胜; 张忠伟; 黄少滨;. A New Method of Solving Kernels in Algebraic Decomposition for the Synthesis of Logic Cell Array[J]. , 1995, 10(6): 569 -573 .
[6] 沈理;. Fuzzy Logic Control ASIC Chip[J]. , 1997, 12(3): 263 -270 .
[7] 陈阳军;. Graph Traversal and Top-Down Evaluation of Logic Queries[J]. , 1998, 13(4): 300 -316 .
[8] David de Frutos-Escrig; Luis Liana-Diaz; Manuel Nunez;. An invitation to Friendly Testing[J]. , 1998, 13(6): 531 -545 .
[9] 彭国强; 程虎;. A Causal Model for Diagnostic Reasoning[J]. , 2000, 15(3): 287 -294 .
[10] . 一种基于TextTiling的文档相似搜索模型[J]. , 2005, 20(4): 552 -558 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: