2017, Vol. 32, Issue (1): 93-109. doi: 10.1007/s11390-017-1708-0

Special topic: Data Management and Data Mining



Approximate Continuous Top-k Query over Sliding Window

Rui Zhu, Member, CCF, ACM, Bin Wang*, Member, CCF, Shi-Ying Luo, Member, CCF, ACM, Xiao-Chun Yang, Senior Member, CCF, IEEE, Member, ACM, and Guo-Ren Wang, Member, CCF, ACM, IEEE   

  1. College of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received: 2016-02-29 Revised: 2016-08-17 Online: 2017-01-05 Published: 2017-01-05
  • Contact: Bin Wang E-mail: binwang@mail.neu.edu.cn
  • About the author: Rui Zhu received his M.S. degree in computer science from the Department of Computer Science, Northeastern University, Shenyang, in 2008. He is currently a Ph.D. candidate at Northeastern University, Shenyang. His research interests include the design and analysis of algorithms, databases, data quality, and distributed systems.
  • Supported by:

    This work is partially supported by the National Natural Science Fund for Distinguished Young Scholars of China under Grant No. 61322208, the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61272178 and 61572122, and the Key Program of the National Natural Science Foundation of China under Grant No. 61532021.

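Continuous top-k query over data streams is a classic problem in stream data management: it returns the k highest-scoring objects in the window. The key idea of existing algorithms is to maintain a subset of the stream data in which the new query results can be found when the window slides. However, these algorithms are all sensitive to query parameters and data distribution. Their incremental maintenance cost is high, so they cannot meet users' real-time requirements. To address these problems, this paper first proposes the concept of the (ε,δ)-approximate continuous top-k query. For this query, three filtering algorithms suited to different data distributions are proposed. Theoretical analysis shows that each of the three algorithms can filter out O(s-k) stream objects at a computational cost of O(s). At the same time, they guarantee that query results are not mistakenly deleted with probability ε. A multi-phase merging algorithm is then proposed, which lowers the maintenance cost of candidate objects through a merging strategy and a compression strategy. Assuming the sliding window has length N, the computational cost of processing N objects is O(N log k + (NK/s) log φ(R/(εk)) + N × cost_F). Finally, the performance of the proposed algorithms is evaluated through simulation experiments.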

Abstract: Continuous top-k query over a sliding window is a fundamental problem in databases: it retrieves the k objects with the highest scores each time the window slides. Existing studies mainly adopt exact algorithms for this type of query, whose key idea is to maintain a subset of the objects in the window and to retrieve answers from that subset. However, all existing algorithms are sensitive to query parameters and data distribution. In addition, they incur expensive overhead for incremental maintenance and thus cannot satisfy real-time requirements. In this paper, we define a novel query, the (ε,δ)-approximate continuous top-k query, which returns approximate answers for top-k queries. To support this query efficiently, we propose a framework named PABF (Probabilistic Approximate Based Framework) for approximate top-k queries over a sliding window. We first maintain a self-adaptive pruning value, which filters out newly arrived objects that have a probability of less than 1-δ of being a query result. Objects that are not filtered are combined if the score difference among them is less than a threshold. To maintain these combined results efficiently, PABF also includes a multi-phase merging algorithm. Theoretical analysis indicates that, even in the worst case, maintaining each candidate requires only logarithmic complexity.
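The "maintain a subset of objects in the window" idea behind the exact algorithms can be illustrated with a small sketch. This is not PABF and not code from the paper: it is a simple exact baseline, assuming a count-based window and a candidate rule in the spirit of the k-skyband (an object is dropped once k newer objects have strictly higher scores, since those objects outlive it in the window). The class name `SlidingTopK` and its methods are hypothetical.

```python
from collections import deque


class SlidingTopK:
    """Exact continuous top-k over a count-based sliding window.

    Illustrative baseline only (hypothetical names, not the paper's PABF).
    """

    def __init__(self, window_size, k):
        self.n = window_size
        self.k = k
        self.t = 0              # number of objects seen so far
        self.cands = deque()    # entries: [arrival_time, score, dominators]

    def push(self, score):
        """Slide the window by one object and update the candidate subset."""
        self.t += 1
        # Expire candidates that are no longer among the last n arrivals.
        while self.cands and self.cands[0][0] <= self.t - self.n:
            self.cands.popleft()
        kept = deque()
        for cand in self.cands:
            # The new object is newer; if it also scores strictly higher,
            # it dominates this candidate for the rest of its lifetime.
            if cand[1] < score:
                cand[2] += 1
            # A candidate with k dominators can never be a top-k result.
            if cand[2] < self.k:
                kept.append(cand)
        kept.append([self.t, score, 0])
        self.cands = kept

    def top_k(self):
        """Return the k highest scores in the current window, descending."""
        return sorted((c[1] for c in self.cands), reverse=True)[: self.k]
```

Each arrival here scans the whole candidate set, which is the kind of incremental maintenance cost that depends on k, the window length, and the score distribution.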

Copyright © Editorial Office of Journal of Computer Science and Technology