|  Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In Proc. the 32nd VLDB, September 2006, pp.918-929. Hadjieleftheriou M, Yu X, Koudas N, Srivastava D. Hashed samples:Selectivity estimators for set similarity selection queries. PVLDB, 2008, 1(1):201-212. Lee H, Ng R T, Shim K. Power-law based estimation of set similarity join size. PVLDB, 2009, 2(1):658-669. White R W, Jose J M. A study of topic similarity measures. In Proc. the 27th SIGIR, July 2004, pp.520-521. Zhu X, Song S, Lian X, Wang J, Zou L. Matching heterogeneous event data. In Proc. SIGMOD, June 2014, pp.1211-1222. Zhu X, Song S, Wang J, Yu P S, Sun J. Matching heterogeneous events with patterns. In Proc. the 30th ICDE, March 31-April 4, 2014, pp.376-387. Wang J, Song S, Zhu X, Lin X. Efficient recovery of missing events. PVLDB, 2013, 6(10):841-852. Wang J, Song S, Lin X, Zhu X, Pei J. Cleaning structured event logs:A graph repair approach. In Proc. the 31st ICDE, April 2015, pp.30-41. Song S, Chen L. Similarity joins of text with incomplete information formats. In Proc. the 12th DASFAA, April 2007, pp.313-324. Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In Proc. the 22nd ICDE, April 2006, p.5. Beckmann J L, Halverson A, Krishnamurthy R, Naughton J F. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In Proc. the 22nd ICDE, April 2006, p.58. Jain A, Doan A, Gravano L. SQL queries over unstructured text databases. In Proc. the 23rd ICDE, April 2007, pp.1255-1257. Dong X, Halevy A Y. Indexing dataspaces. In Proc. SIGMOD, June 2007, pp.43-54. Song S, Chen L, Yuan M. Materialization and decomposition of dataspaces for efficient search. IEEE Trans. Knowl. Data Eng., 2011, 23(12):1872-1887. Song S, Chen L, Yu P S. On data dependencies in dataspaces. In Proc. the 27th ICDE, April 2011, pp.470-481. Dong X, Halevy A Y, Madhavan J, Nemes E, Zhang J. Similarity search for web services. In Proc. the 30th VLDB, August 29-September 3, 2004, pp.372-383. Song S, Chen L. Probabilistic correlation-based similarity measure of unstructured records. In Proc. the 16th CIKM, November 2007, pp.967-970. Song S, Zhu H, Chen L. Probabilistic correlation-based similarity measure on text records. Inf. Sci., 2014, 289:8-24. Sahami M, Heilman T D. A web-based kernel function for measuring the similarity of short text snippets. In Proc. the 15th WWW, May 2006, pp.377-386. Liu S, Liu F, Yu C, Meng W. An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In Proc. the 27th SIGIR, July 2004, pp.266-272. Jin R, Chai J Y, Si L. Learn to weight terms in information retrieval using category information. In Proc. the 22nd ICML, August 2005, pp.353-360. Xiong H, Shekhar S, Tan P N, Kumar V. Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proc. the 10th KDD, August 2004, pp.334-343. Song S, Chen L. Efficient set-correlation operator inside databases. In Proc. CIKM, October 2010, pp.139-148. Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In Proc. the 27th VLDB, September 2001, pp.491-500. Cohen W W. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proc. SIGMOD, June 1998, pp.201-212. Gravano L, Ipeirotis P G, Koudas N, Srivastava D. Text joins in an RDBMS for web data integration. In Proc. the 12th WWW, May 2003, pp.90-101. Salton G. Automatic Text Processing:The Transformation, Analysis, and Retrieval of Information by Computer. AddisonWesley, 1989. Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In Proc. the 9th KDD, August 2003, pp.39-48. Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In Proc. the 8th KDD, July 2002, pp.269-278. Hofmann T. Probabilistic latent semantic analysis. In Proc. UAI, July 1999, pp.289-296. Hofmann T. Probabilistic latent semantic indexing. In Proc. the 22nd SIGIR, August 1999, pp.50-57. Deerwester S C, Dumais S T, Landauer T K, Furnas G W, Harshman R A. Indexing by latent semantic analysis. JASIS, 1990, 41(6):391-407. Brin S, Motwani R, Silverstein C. Beyond market baskets:Generalizing association rules to correlations. In Proc. SIGMOD, May 1997, pp.265-276. Jermaine C. The computational complexity of high dimensional correlation search. In Proc. ICDM, November 2001, pp.249-256. Xiong H, Shekhar S, Tan P N, Kumar V. TAPER:A two-step approach for all-strong-pairs correlation query in large databases. IEEE Trans. Knowl. Data Eng., 2006, 18(4):493-508. Sparck Jones K. Index term weighting. Information Storage and Retrieval, 1973, 9(11):619-633. Robertson S. Understanding inverse document frequency:On theoretical argument for IDF. Journal of Documentation, 2004, 60(5):503-520. Chaudhuri S, Das G, Hristidis V, Weikum G. Probabilistic ranking of database query results. In Proc. the 30th VLDB, August 29-September 3, 2004, pp.888-899. Chirita P A, Firan C S, Nejdl W. Personalized query expansion for the web. In Proc. the 30th SIGIR, July 2007, pp.7-14. Theobald M, Schenkel R, Weikum G. Efficient and self-tuning incremental query expansion for top-k query processing. In Proc. the 28th SIGIR, August 2005, pp.242-249. Metzler D, Dumais S T, Meek C. Similarity measures for short segments of text. In Proc. the 29th ECIR, April 2007, pp.16-27. Allan J, Wade C, Bolivar A. Retrieval and novelty detection at the sentence level. In Proc. the 26th SIGIR, August 2003, pp.314-321. Balasubramanian N, Allan J, Croft W B. A comparison of sentence retrieval techniques. In Proc. the 30th SIGIR, July 2007, pp.813-814. Li X, Croft W B. Improving novelty detection for general topics using sentence level information patterns. In Proc. the 15th CIKM, November 2006, pp.238-247. Li X, Croft W B. Novelty detection based on sentence level patterns. In Proc. the 14th CIKM, November 2005, pp.744-751. Murdock V, Croft W B. A translation model for sentence retrieval. In Proc. HLT/EMNLP, October 2005, pp.684-691. Fung P, Yee Lo Y. An IR approach for translating new words from nonparallel, comparable texts. In Proc. the 36th COLING-ACL, August 1998, pp.414-420. Cao G, Nie J Y, Bai J. Integrating word relationships into language models. In Proc. the 28th SIGIR, July 2005, pp.298-305. Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In Proc. the 32nd VLDB, September 2006, pp.918-929. Lewis D D, Yang Y, Rose T G, Li F. RCV1:A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004, 5:361-397. Lang K. NewsWeeder:Learning to filter netnews. In Proc. the 12th ICML, June 1995, pp.331-339. Van Rijsbergen C J. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979.