2016, Vol. 31, Issue (5): 1053-1068. doi: 10.1007/s11390-016-1679-6

Special Issue: Data Management and Data Mining


Topological Features Based Entity Disambiguation

Chen-Chen Sun, Student Member, CCF, Member, ACM, De-Rong Shen, Senior Member, CCF, Member, ACM, IEEE, Yue Kou, Member, CCF, ACM, IEEE, Tie-Zheng Nie, Member, CCF, ACM, IEEE, and Ge Yu, Fellow, CCF, Member, ACM, IEEE   

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received: 2015-06-25; Revised: 2016-03-06; Online: 2016-09-05; Published: 2016-09-05
  • About author: Chen-Chen Sun is a Ph.D. candidate in computer software and theory at the School of Computer Science and Engineering, Northeastern University, Shenyang. He received his B.S. degree in software engineering and his M.S. degree in computer application technology from the same university in 2010 and 2012, respectively. His research interests are entity disambiguation and entity resolution.
  • Supported by:

    This work is supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the Fundamental Research Funds for the Central Universities of China under Grant No. N120816001, and the National Natural Science Foundation of China under Grant Nos. 61472070 and 61402213.

This work proposes an unsupervised entity disambiguation solution based on topological features. Most existing studies leverage semantic information to resolve ambiguous references. However, semantic information is not always accessible, either because of privacy constraints or because it is too expensive to obtain. We therefore consider the problem in a setting where only the relationships between references are available. A structure similarity algorithm based on random walk with restarts is proposed to measure the similarity of references. Disambiguation is treated as a clustering problem, and a family of graph walk based clustering algorithms is proposed to group ambiguous references. We evaluate our solution extensively on two real datasets and show that it is more accurate than two state-of-the-art approaches.
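To make the idea of measuring reference similarity by random walk with restarts more concrete, the following is a minimal sketch and not the authors' algorithm. The function names `rwr_scores` and `structure_similarity`, the restart probability of 0.15, the symmetric averaging of the two walk scores, and the toy graph are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
def rwr_scores(adjacency, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Approximate the visiting probabilities of a random walk that
    starts at `seed` and jumps back to `seed` with probability `restart`
    at every step (power iteration on an unweighted graph)."""
    nodes = list(adjacency)
    scores = {v: 0.0 for v in nodes}
    scores[seed] = 1.0
    for _ in range(max_iter):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if not neighbours:
                # Assumption: a node without neighbours returns its mass to the seed.
                nxt[seed] += (1.0 - restart) * scores[u]
                continue
            share = (1.0 - restart) * scores[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        # The restart step moves a fixed fraction of the total mass to the seed.
        nxt[seed] += restart
        delta = sum(abs(nxt[v] - scores[v]) for v in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores


def structure_similarity(adjacency, a, b, restart=0.15):
    """Hypothetical symmetric similarity: the average of the score that a
    walk from `a` assigns to `b` and the score a walk from `b` assigns to `a`."""
    return 0.5 * (rwr_scores(adjacency, a, restart)[b]
                  + rwr_scores(adjacency, b, restart)[a])


def build_graph(edges):
    """Build an undirected adjacency list from a list of (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


# Toy relationship graph: two triangles joined by a single bridge edge.
graph = build_graph([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b1"),
])

same_side = structure_similarity(graph, "a1", "a2")
other_side = structure_similarity(graph, "a1", "b2")
print(f"sim(a1, a2) = {same_side:.4f}, sim(a1, b2) = {other_side:.4f}")
```

In this toy graph, references inside the same triangle score higher than references on opposite sides of the bridge. A clustering step could then group references whose pairwise similarity is high, which is the role the abstract assigns to the graph walk based clustering algorithms.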

ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 