Efficient Set-Correlation Operator Inside Databases

doi:10.1007/s11390-016-1657-z

Abstract: Large-scale short text records are now prevalent, such as news headlines, scientific paper citations, and messages posted in discussion forums, and they are often stored as set records in hidden-Web databases. Correlation queries over these short text records support many information retrieval tasks, such as finding hot topics in news headlines and searching for scientific papers related to a given topic. However, current relational database management systems (RDBMS) do not directly support set correlation queries. In this paper, we address both the effectiveness and the efficiency of set correlation queries over set records in databases. First, we present a framework for set correlation queries inside databases. To the best of our knowledge, only Pearson's correlation can be implemented to construct token correlations using RDBMS facilities. We therefore propose a novel correlation coefficient that extends Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimization strategies for setting the correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis shows that, with a proper setting of the filtering threshold, query efficiency can be improved with only a small loss in effectiveness. Finally, we conduct extensive experiments to demonstrate the effectiveness and efficiency of the proposed correlation query and optimization strategies.
