Approximate Continuous Top-k Query Algorithm over Sliding Windows

doi:10.1007/s11390-017-1708-0


Approximate Continuous Top-k Query over Sliding Window

Abstract

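Abstract: Continuous top-k query over data streams is a classic problem in stream data management. It returns the k objects with the highest scores in the window. The core idea of existing algorithms is to maintain a subset of the stream data such that, when the window slides, the new query result can be found within that subset. However, these algorithms are all sensitive to query parameters and data distribution. Their incremental maintenance cost is high, so they cannot meet users' real-time requirements. To address these problems, this paper first proposes the concept of the (ε,δ)-approximate continuous top-k query. For this query, three filtering algorithms suited to different data distributions are proposed. Theoretical analysis shows that each of the three algorithms filters out O(s-k) stream objects at a computational cost of O(s). At the same time, they guarantee that the probability of not mistakenly deleting a query result is ε. A multi-phase merging algorithm is then proposed, which reduces the maintenance cost of candidate objects through a merging strategy and a compression strategy. Assuming the sliding window has length N, the computational cost of processing N objects with this algorithm is O(N log k + (Nk/s) log φ(R/(εk)) + N × cost_F). Finally, the performance of the proposed algorithms is evaluated through simulation experiments.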

Abstract: Continuous top-k query over a sliding window is a fundamental problem in databases: it retrieves the k objects with the highest scores each time the window slides. Existing studies mainly adopt exact algorithms for this type of query, whose key idea is to maintain a subset of the objects in the window and to retrieve answers from it. However, all existing algorithms are sensitive to query parameters and data distribution. In addition, they incur expensive overhead for incremental maintenance and thus cannot satisfy real-time requirements. In this paper, we define a novel query, the (ε,δ)-approximate continuous top-k query, which returns approximate answers to the top-k query. To support this query efficiently over a sliding window, we propose a framework named PABF (Probabilistic Approximate Based Framework). We first maintain a self-adaptive pruning value, which filters out newly arrived objects that have a probability less than 1-δ of being a query result. Objects that are not filtered are combined if the score difference among them is less than a threshold. To maintain these combined results efficiently, PABF also includes a multi-phase merging algorithm. Theoretical analysis indicates that, even in the worst case, maintaining each candidate requires only logarithmic complexity.
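To make the query semantics concrete, the following is a minimal sketch of an exact continuous top-k query, assuming a count-based window that holds the most recent `window_size` scores. It is a naive baseline that rescans the window on every arrival, not the PABF framework or any algorithm from the paper; the function name `continuous_top_k` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import deque
import heapq


def continuous_top_k(stream, window_size, k):
    """Yield the k highest scores in the window after each new arrival.

    Assumption: a count-based sliding window over plain numeric scores.
    This is an exact, naive baseline (full rescan per slide), not PABF.
    """
    # deque(maxlen=...) drops the oldest score automatically when full.
    window = deque(maxlen=window_size)
    for score in stream:
        window.append(score)
        # Rescan the whole window; returns scores in descending order.
        yield heapq.nlargest(k, window)


if __name__ == "__main__":
    for result in continuous_top_k([5, 1, 9, 3, 7], window_size=3, k=2):
        print(result)
```

Each slide of this baseline touches every object in the window, which illustrates the kind of per-update maintenance cost that candidate-maintenance and approximate approaches aim to reduce.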
