We use cookies to improve your experience with our site.

Indexed in:

SCIE, EI, Scopus, INSPEC, DBLP, CSCD, etc.

Submission System
(Author / Reviewer / Editor)
Fei Gao, Shao-Xu Song, Lei Chen, Jian-Min Wang. Efficient Set-Correlation Operator Inside Databases[J]. Journal of Computer Science and Technology, 2016, 31(4): 683-701. DOI: 10.1007/s11390-016-1657-z
Citation: Fei Gao, Shao-Xu Song, Lei Chen, Jian-Min Wang. Efficient Set-Correlation Operator Inside Databases[J]. Journal of Computer Science and Technology, 2016, 31(4): 683-701. DOI: 10.1007/s11390-016-1657-z

Efficient Set-Correlation Operator Inside Databases

Funds: The work was supported by the National Key Technology R&D Program of China under Grant No. 2015BAH14F02, the National Natural Science Foundation of China under Grant Nos. 61572272, 61202008, 61325008, and 61370055, and the Tsinghua University Initiative Scientific Research Program.
More Information
  • Author Bio:

    Fei Gao is a Ph.D. student in the School of Software, Tsinghua University, Beijing. Her research interests include data quality and record matching.

  • Corresponding author:

    Shao-Xu Song E-mail: sxsong@tsinghua.edu.cn

  • Received Date: February 25, 2016
  • Revised Date: May 04, 2016
  • Published Date: July 04, 2016
  • Large scale of short text records are now prevalent, such as news highlights, scientific paper citations, and posted messages in a discussion forum, and are often stored as set records in hidden-Web databases. Many interesting information retrieval tasks are correspondingly raised on the correlation query over these short text records, such as finding hot topics over news highlights and searching related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly provide support on set correlation query. Thus, in this paper, we address both the effectiveness and the efficiency issues of set correlation query over set records in databases. First, we present a framework of set correlation query inside databases. To the best of our knowledge, only the Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. Thereby, we propose a novel correlation coefficient to extend Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies to set up correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that with a proper setting of filtering threshold, we can improve the query efficiency with a little effectiveness loss. Finally, we conduct extensive experiments to show the effectiveness and the efficiency of proposed correlation query and optimization strategies.
  • [1]
    Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In Proc. the 32nd VLDB, September 2006, pp.918-929.
    [2]
    Hadjieleftheriou M, Yu X, Koudas N, Srivastava D. Hashed samples:Selectivity estimators for set similarity selection queries. PVLDB, 2008, 1(1):201-212.
    [3]
    Lee H, Ng R T, Shim K. Power-law based estimation of set similarity join size. PVLDB, 2009, 2(1):658-669.
    [4]
    White R W, Jose J M. A study of topic similarity measures. In Proc. the 27th SIGIR, July 2004, pp.520-521.
    [5]
    Zhu X, Song S, Lian X, Wang J, Zou L. Matching heterogeneous event data. In Proc. SIGMOD, June 2014, pp.1211-1222.
    [6]
    Zhu X, Song S, Wang J, Yu P S, Sun J. Matching heterogeneous events with patterns. In Proc. the 30th ICDE, March 31-April 4, 2014, pp.376-387.
    [7]
    Wang J, Song S, Zhu X, Lin X. Efficient recovery of missing events. PVLDB, 2013, 6(10):841-852.
    [8]
    Wang J, Song S, Lin X, Zhu X, Pei J. Cleaning structured event logs:A graph repair approach. In Proc. the 31st ICDE, April 2015, pp.30-41.
    [9]
    Song S, Chen L. Similarity joins of text with incomplete information formats. In Proc. the 12th DASFAA, April 2007, pp.313-324.
    [10]
    Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In Proc. the 22nd ICDE, April 2006, p.5.
    [11]
    Beckmann J L, Halverson A, Krishnamurthy R, Naughton J F. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In Proc. the 22nd ICDE, April 2006, p.58.
    [12]
    Jain A, Doan A, Gravano L. SQL queries over unstructured text databases. In Proc. the 23rd ICDE, April 2007, pp.1255-1257.
    [13]
    Dong X, Halevy A Y. Indexing dataspaces. In Proc. SIGMOD, June 2007, pp.43-54.
    [14]
    Song S, Chen L, Yuan M. Materialization and decomposition of dataspaces for efficient search. IEEE Trans. Knowl. Data Eng., 2011, 23(12):1872-1887.
    [15]
    Song S, Chen L, Yu P S. On data dependencies in dataspaces. In Proc. the 27th ICDE, April 2011, pp.470-481.
    [16]
    Dong X, Halevy A Y, Madhavan J, Nemes E, Zhang J. Similarity search for web services. In Proc. the 30th VLDB, August 29-September 3, 2004, pp.372-383.
    [17]
    Song S, Chen L. Probabilistic correlation-based similarity measure of unstructured records. In Proc. the 16th CIKM, November 2007, pp.967-970.
    [18]
    Song S, Zhu H, Chen L. Probabilistic correlation-based similarity measure on text records. Inf. Sci., 2014, 289:8-24.
    [19]
    Sahami M, Heilman T D. A web-based kernel function for measuring the similarity of short text snippets. In Proc. the 15th WWW, May 2006, pp.377-386.
    [20]
    Liu S, Liu F, Yu C, Meng W. An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In Proc. the 27th SIGIR, July 2004, pp.266-272.
    [21]
    Jin R, Chai J Y, Si L. Learn to weight terms in information retrieval using category information. In Proc. the 22nd ICML, August 2005, pp.353-360.
    [22]
    Xiong H, Shekhar S, Tan P N, Kumar V. Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proc. the 10th KDD, August 2004, pp.334-343.
    [23]
    Song S, Chen L. Efficient set-correlation operator inside databases. In Proc. CIKM, October 2010, pp.139-148.
    [24]
    Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In Proc. the 27th VLDB, September 2001, pp.491-500.
    [25]
    Cohen W W. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proc. SIGMOD, June 1998, pp.201-212.
    [26]
    Gravano L, Ipeirotis P G, Koudas N, Srivastava D. Text joins in an RDBMS for web data integration. In Proc. the 12th WWW, May 2003, pp.90-101.
    [27]
    Salton G. Automatic Text Processing:The Transformation, Analysis, and Retrieval of Information by Computer. AddisonWesley, 1989.
    [28]
    Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In Proc. the 9th KDD, August 2003, pp.39-48.
    [29]
    Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In Proc. the 8th KDD, July 2002, pp.269-278.
    [30]
    Hofmann T. Probabilistic latent semantic analysis. In Proc. UAI, July 1999, pp.289-296.
    [31]
    Hofmann T. Probabilistic latent semantic indexing. In Proc. the 22nd SIGIR, August 1999, pp.50-57.
    [32]
    Deerwester S C, Dumais S T, Landauer T K, Furnas G W, Harshman R A. Indexing by latent semantic analysis. JASIS, 1990, 41(6):391-407.
    [33]
    Brin S, Motwani R, Silverstein C. Beyond market baskets:Generalizing association rules to correlations. In Proc. SIGMOD, May 1997, pp.265-276.
    [34]
    Jermaine C. The computational complexity of high dimensional correlation search. In Proc. ICDM, November 2001, pp.249-256.
    [35]
    Xiong H, Shekhar S, Tan P N, Kumar V. TAPER:A two-step approach for all-strong-pairs correlation query in large databases. IEEE Trans. Knowl. Data Eng., 2006, 18(4):493-508.
    [36]
    Sparck Jones K. Index term weighting. Information Storage and Retrieval, 1973, 9(11):619-633.
    [37]
    Robertson S. Understanding inverse document frequency:On theoretical argument for IDF. Journal of Documentation, 2004, 60(5):503-520.
    [38]
    Chaudhuri S, Das G, Hristidis V, Weikum G. Probabilistic ranking of database query results. In Proc. the 30th VLDB, August 29-September 3, 2004, pp.888-899.
    [39]
    Chirita P A, Firan C S, Nejdl W. Personalized query expansion for the web. In Proc. the 30th SIGIR, July 2007, pp.7-14.
    [40]
    Theobald M, Schenkel R, Weikum G. Efficient and self-tuning incremental query expansion for top-k query processing. In Proc. the 28th SIGIR, August 2005, pp.242-249.
    [41]
    Metzler D, Dumais S T, Meek C. Similarity measures for short segments of text. In Proc. the 29th ECIR, April 2007, pp.16-27.
    [42]
    Allan J, Wade C, Bolivar A. Retrieval and novelty detection at the sentence level. In Proc. the 26th SIGIR, August 2003, pp.314-321.
    [43]
    Balasubramanian N, Allan J, Croft W B. A comparison of sentence retrieval techniques. In Proc. the 30th SIGIR, July 2007, pp.813-814.
    [44]
    Li X, Croft W B. Improving novelty detection for general topics using sentence level information patterns. In Proc. the 15th CIKM, November 2006, pp.238-247.
    [45]
    Li X, Croft W B. Novelty detection based on sentence level patterns. In Proc. the 14th CIKM, November 2005, pp.744-751.
    [46]
    Murdock V, Croft W B. A translation model for sentence retrieval. In Proc. HLT/EMNLP, October 2005, pp.684-691.
    [47]
    Fung P, Yee Lo Y. An IR approach for translating new words from nonparallel, comparable texts. In Proc. the 36th COLING-ACL, August 1998, pp.414-420.
    [48]
    Cao G, Nie J Y, Bai J. Integrating word relationships into language models. In Proc. the 28th SIGIR, July 2005, pp.298-305.
    [49]
    Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In Proc. the 32nd VLDB, September 2006, pp.918-929.
    [50]
    Lewis D D, Yang Y, Rose T G, Li F. RCV1:A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004, 5:361-397.
    [51]
    Lang K. NewsWeeder:Learning to filter netnews. In Proc. the 12th ICML, June 1995, pp.331-339.
    [52]
    Van Rijsbergen C J. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979.
  • Related Articles

    [1]Jun-Feng Fan, Mei-Ling Wang, Chang-Liang Li, Zi-Qiang Zhu, Lu Mao. Intent-Slot Correlation Modeling for Joint Intent Prediction and Slot Filling[J]. Journal of Computer Science and Technology, 2022, 37(2): 309-319. DOI: 10.1007/s11390-020-0326-4
    [2]Rui Ren, Jiechao Cheng, Xi-Wen He, Lei Wang, Jian-Feng Zhan, Wan-Ling Gao, Chun-Jie Luo. HybridTune: Spatio-Temporal Performance Data Correlation for Performance Diagnosis of Big Data Systems[J]. Journal of Computer Science and Technology, 2019, 34(6): 1167-1184. DOI: 10.1007/s11390-019-1968-y
    [3]Yu-Rong Cheng, Ye Yuan, Lei Chen, Guo-Ren Wang. Threshold-Based Shortest Path Query over Large Correlated Uncertain Graphs[J]. Journal of Computer Science and Technology, 2015, 30(4): 762-780. DOI: 10.1007/s11390-015-1559-5
    [4]Cheng-De Zhang, Xiao Wu, Mei-Ling Shyu, Qiang Peng. A Novel Web Video Event Mining Framework with the Integration of Correlation and Co-Occurrence Information[J]. Journal of Computer Science and Technology, 2013, 28(5): 788-796. DOI: 10.1007/s11390-013-1377-6
    [5]Yan-Hui Ding, Qing-Zhong Li, Yong-Quan Dong, Zhao-Hui Peng. 2D Correlative-Chain Conditional Random Fields for Semantic Annotation of Web Objects[J]. Journal of Computer Science and Technology, 2010, 25(4): 761-770. DOI: 10.1007/s11390-010-1059-6
    [6]Fei Wu, Ya-Hong Han, Yue-Ting Zhuang. Multiple Hypergraph Clustering ofWeb Images by MiningWord2Image Correlations[J]. Journal of Computer Science and Technology, 2010, 25(4): 750-760. DOI: 10.1007/s11390-010-1058-7
    [7]Zheng Fang, Wu Wenhu, Fang Ditang. Center-Distance Continuous Probability Models and the Distance Measure[J]. Journal of Computer Science and Technology, 1998, 13(5): 426-437.
    [8]Zheng Fang, Wu Wenhu, Fang Ditang. A Log-Index Weighted Cepstral Distance Measure for Speech Recognition[J]. Journal of Computer Science and Technology, 1997, 12(2): 177-184.
    [9]Yin Yanzhi, Chen Changlin, Shen Baoning, Ding Shixiang. Ring Delay Chain for Exact Measure of Pulse Cycle Error[J]. Journal of Computer Science and Technology, 1988, 3(2): 147-152.
    [10]Wang Jianchao, Wei Daozheng. Reconvergent-Fanout-Oriented Testability Measure[J]. Journal of Computer Science and Technology, 1988, 3(1): 16-28.
  • Cited by

    Periodical cited type(1)

    1. Farzaneh Asadzadeh, Akram Reza, Midia Reshadi, et al. Thermal-aware application mapping using genetic and fuzzy logic techniques for minimizing temperature in three-dimensional network-on-chip. The Journal of Supercomputing, 2024. DOI:10.1007/s11227-023-05869-x

    Other cited types(0)

Catalog

    Article views (21) PDF downloads (820) Cited by(1)
    Related

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return