›› 2016, Vol. 31 ›› Issue (4): 683-701.doi: 10.1007/s11390-016-1657-z

Special Issue: Data Management and Data Mining

• Special Section on Data Management and Data Mining 2016 • Previous Articles     Next Articles

Efficient Set-Correlation Operator Inside Databases

Fei Gao1,2, Shao-Xu Song1,2,*, Lei Chen3, and Jian-Min Wang1,2   

  1. 1 Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;
    2 School of Software, Tsinghua University, Beijing 100084, China;
    3 Department of Computer Science and Engineering, Hong Kong University of Science and Technology Hong Kong, China
  • Received:2016-02-26 Revised:2016-05-05 Online:2016-07-05 Published:2016-07-05
  • Contact: Shao-Xu Song E-mail:sxsong@tsinghua.edu.cn
  • About author:Fei Gao is a Ph.D. student in the School of Software, Tsinghua University, Beijing. Her research interests include data quality and record matching.
  • Supported by:

    The work was supported by the National Key Technology R&D Program of China under Grant No. 2015BAH14F02, the National Natural Science Foundation of China under Grant Nos. 61572272, 61202008, 61325008, and 61370055, and the Tsinghua University Initiative Scientific Research Program.

Large scale of short text records are now prevalent, such as news highlights, scientific paper citations, and posted messages in a discussion forum, and are often stored as set records in hidden-Web databases. Many interesting information retrieval tasks are correspondingly raised on the correlation query over these short text records, such as finding hot topics over news highlights and searching related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly provide support on set correlation query. Thus, in this paper, we address both the effectiveness and the efficiency issues of set correlation query over set records in databases. First, we present a framework of set correlation query inside databases. To the best of our knowledge, only the Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. Thereby, we propose a novel correlation coefficient to extend Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies to set up correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that with a proper setting of filtering threshold, we can improve the query efficiency with a little effectiveness loss. Finally, we conduct extensive experiments to show the effectiveness and the efficiency of proposed correlation query and optimization strategies.

[1] Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In Proc. the 32nd VLDB, September 2006, pp.918-929.

[2] Hadjieleftheriou M, Yu X, Koudas N, Srivastava D. Hashed samples:Selectivity estimators for set similarity selection queries. PVLDB, 2008, 1(1):201-212.

[3] Lee H, Ng R T, Shim K. Power-law based estimation of set similarity join size. PVLDB, 2009, 2(1):658-669.

[4] White R W, Jose J M. A study of topic similarity measures. In Proc. the 27th SIGIR, July 2004, pp.520-521.

[5] Zhu X, Song S, Lian X, Wang J, Zou L. Matching heterogeneous event data. In Proc. SIGMOD, June 2014, pp.1211-1222.

[6] Zhu X, Song S, Wang J, Yu P S, Sun J. Matching heterogeneous events with patterns. In Proc. the 30th ICDE, March 31-April 4, 2014, pp.376-387.

[7] Wang J, Song S, Zhu X, Lin X. Efficient recovery of missing events. PVLDB, 2013, 6(10):841-852.

[8] Wang J, Song S, Lin X, Zhu X, Pei J. Cleaning structured event logs:A graph repair approach. In Proc. the 31st ICDE, April 2015, pp.30-41.

[9] Song S, Chen L. Similarity joins of text with incomplete information formats. In Proc. the 12th DASFAA, April 2007, pp.313-324.

[10] Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In Proc. the 22nd ICDE, April 2006, p.5.

[11] Beckmann J L, Halverson A, Krishnamurthy R, Naughton J F. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In Proc. the 22nd ICDE, April 2006, p.58.

[12] Jain A, Doan A, Gravano L. SQL queries over unstructured text databases. In Proc. the 23rd ICDE, April 2007, pp.1255-1257.

[13] Dong X, Halevy A Y. Indexing dataspaces. In Proc. SIGMOD, June 2007, pp.43-54.

[14] Song S, Chen L, Yuan M. Materialization and decomposition of dataspaces for efficient search. IEEE Trans. Knowl. Data Eng., 2011, 23(12):1872-1887.

[15] Song S, Chen L, Yu P S. On data dependencies in dataspaces. In Proc. the 27th ICDE, April 2011, pp.470-481.

[16] Dong X, Halevy A Y, Madhavan J, Nemes E, Zhang J. Similarity search for web services. In Proc. the 30th VLDB, August 29-September 3, 2004, pp.372-383.

[17] Song S, Chen L. Probabilistic correlation-based similarity measure of unstructured records. In Proc. the 16th CIKM, November 2007, pp.967-970.

[18] Song S, Zhu H, Chen L. Probabilistic correlation-based similarity measure on text records. Inf. Sci., 2014, 289:8-24.

[19] Sahami M, Heilman T D. A web-based kernel function for measuring the similarity of short text snippets. In Proc. the 15th WWW, May 2006, pp.377-386.

[20] Liu S, Liu F, Yu C, Meng W. An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In Proc. the 27th SIGIR, July 2004, pp.266-272.

[21] Jin R, Chai J Y, Si L. Learn to weight terms in information retrieval using category information. In Proc. the 22nd ICML, August 2005, pp.353-360.

[22] Xiong H, Shekhar S, Tan P N, Kumar V. Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proc. the 10th KDD, August 2004, pp.334-343.

[23] Song S, Chen L. Efficient set-correlation operator inside databases. In Proc. CIKM, October 2010, pp.139-148.

[24] Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In Proc. the 27th VLDB, September 2001, pp.491-500.

[25] Cohen W W. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proc. SIGMOD, June 1998, pp.201-212.

[26] Gravano L, Ipeirotis P G, Koudas N, Srivastava D. Text joins in an RDBMS for web data integration. In Proc. the 12th WWW, May 2003, pp.90-101.

[27] Salton G. Automatic Text Processing:The Transformation, Analysis, and Retrieval of Information by Computer. AddisonWesley, 1989.

[28] Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In Proc. the 9th KDD, August 2003, pp.39-48.

[29] Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In Proc. the 8th KDD, July 2002, pp.269-278.

[30] Hofmann T. Probabilistic latent semantic analysis. In Proc. UAI, July 1999, pp.289-296.

[31] Hofmann T. Probabilistic latent semantic indexing. In Proc. the 22nd SIGIR, August 1999, pp.50-57.

[32] Deerwester S C, Dumais S T, Landauer T K, Furnas G W, Harshman R A. Indexing by latent semantic analysis. JASIS, 1990, 41(6):391-407.

[33] Brin S, Motwani R, Silverstein C. Beyond market baskets:Generalizing association rules to correlations. In Proc. SIGMOD, May 1997, pp.265-276.

[34] Jermaine C. The computational complexity of high dimensional correlation search. In Proc. ICDM, November 2001, pp.249-256.

[35] Xiong H, Shekhar S, Tan P N, Kumar V. TAPER:A two-step approach for all-strong-pairs correlation query in large databases. IEEE Trans. Knowl. Data Eng., 2006, 18(4):493-508.

[36] Sparck Jones K. Index term weighting. Information Storage and Retrieval, 1973, 9(11):619-633.

[37] Robertson S. Understanding inverse document frequency:On theoretical argument for IDF. Journal of Documentation, 2004, 60(5):503-520.

[38] Chaudhuri S, Das G, Hristidis V, Weikum G. Probabilistic ranking of database query results. In Proc. the 30th VLDB, August 29-September 3, 2004, pp.888-899.

[39] Chirita P A, Firan C S, Nejdl W. Personalized query expansion for the web. In Proc. the 30th SIGIR, July 2007, pp.7-14.

[40] Theobald M, Schenkel R, Weikum G. Efficient and self-tuning incremental query expansion for top-k query processing. In Proc. the 28th SIGIR, August 2005, pp.242-249.

[41] Metzler D, Dumais S T, Meek C. Similarity measures for short segments of text. In Proc. the 29th ECIR, April 2007, pp.16-27.

[42] Allan J, Wade C, Bolivar A. Retrieval and novelty detection at the sentence level. In Proc. the 26th SIGIR, August 2003, pp.314-321.

[43] Balasubramanian N, Allan J, Croft W B. A comparison of sentence retrieval techniques. In Proc. the 30th SIGIR, July 2007, pp.813-814.

[44] Li X, Croft W B. Improving novelty detection for general topics using sentence level information patterns. In Proc. the 15th CIKM, November 2006, pp.238-247.

[45] Li X, Croft W B. Novelty detection based on sentence level patterns. In Proc. the 14th CIKM, November 2005, pp.744-751.

[46] Murdock V, Croft W B. A translation model for sentence retrieval. In Proc. HLT/EMNLP, October 2005, pp.684-691.

[47] Fung P, Yee Lo Y. An IR approach for translating new words from nonparallel, comparable texts. In Proc. the 36th COLING-ACL, August 1998, pp.414-420.

[48] Cao G, Nie J Y, Bai J. Integrating word relationships into language models. In Proc. the 28th SIGIR, July 2005, pp.298-305.

[49] Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In Proc. the 32nd VLDB, September 2006, pp.918-929.

[50] Lewis D D, Yang Y, Rose T G, Li F. RCV1:A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004, 5:361-397.

[51] Lang K. NewsWeeder:Learning to filter netnews. In Proc. the 12th ICML, June 1995, pp.331-339.

[52] Van Rijsbergen C J. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979.
No related articles found!
Full text



[1] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] Chen Shihua;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] Min Yinghua; Han Zhide;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] Tang Tonggao; Zhao Zhaokeng;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] Min Yinghua;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] Zhu Hong;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] Li Minghui;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved