• Database and Knowledge-Based Systems • Previous Articles     Next Articles

Improved Approximate Detection of Duplicates for Data Streams Over Sliding Windows

Hong Shen1,2 and Yu Zhang1   

  1. 1Department of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China 2School of Computer Science, University of Adelaide, SA 5005, Australia
  • Received:2008-01-14 Revised:2008-07-22 Online:2008-11-10 Published:2008-11-10

Detecting duplicates in data streams is an important problem that has a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios, and, on the other hand, the elements in data streams are always time sensitive. These make it particular significant approximately detecting duplicates among newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, Decaying Bloom Filter (DBF), as an extension of the Counting Bloom Filter, that effectively removes stale elements as new elements continuously arrive over sliding windows. On the DBF basis we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors, but not false negative errors as in many previous results. We analyze the time complexity and detection accuracy, and give a tight upper bound of false positive rate. For a given space $G$ bits and sliding window size $W$, our algorithm has an amortized time complexity of $O(\sqrt{G/W})$. Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to the previous results.

Key words: Web; information retrieval; search engine; cooperative;


[1] Babcock B, Babu S, Datar M, Windom J. Model and issues in data stream systems. In {\it Proc. Principles of Database Systems $($PODS$)$}, Wisconsin, USA, June 2002, pp.1--16.
[2]} Golab L, Ozsu M T. Issues in data stream management. {\it SIGMOD Record}, June 2003, 32(2): 5--14.
[3]} Metwally A, Agrawal D, Abbadi A E. Duplicate detection in click streams. In {\it Proc. 14th Int. Conf. World Wide Web}, Chiba, Japan, May 2005, pp.12--21.
[4]} Deng F, Rafiei D. Approximately detecting duplicates for streaming data using stable Bloom filters. In {\it Proc. ACM SIGMOD Conf}., New York, USA, June 2006, pp.25--36.
[5]} Anupam V, Mayer A, Nissim K, Pinkas B, Reiter M. On the security of pay-per-click and other web advertising schemes. In {\it Proc. 8th Int. Conf. World Wide Web}, Toronto, Canada, May 1999, pp.1091--1100.
[6]}Heydon A, Najork M Mercator: A scalable, extensible web crawler. In {\it Proc. 8th Int. Conf. World Wide Web}, Toronto, Canada, May 1999, pp.219--229.
[7]} Broder A Z, Najork M, Wiener J L. Efficient URL caching for world wide web crawling. In {\it Proc. 12th Int. Conf. World Wide Web}, Budapest, Hungary, May 2003, pp.680--689.
[8]} Cisco network accounting services white paper. Cisco System Inc, 2002, pp.1--20, http://www.cisco.com /warp/public/cc/pd/iosw/prodlit/nwactwp.pdf.
[9]} Garcia-Molina H, Ullman J D, Widom J. Database System Implementation. Upper Saddle River: Prentice Hall, Inc., NJ, USA, 1999.
[10]} Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In {\it Proc. ACM SIGKDD Conf.}, Washington DC, USA, August 2003, pp.39--48.
[11]} Weis M, Naumann F. Dogmatix tracks down duplicates in XML. In {\it Proc. ACM SIGMOD Conf.}, Baltimore, Maryland, USA, June 2005, pp.431--442.
[12]} Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating fuzzy duplicates in data warehouses. In {\it Proc. 28th Int. Conf. Very Large Data Bases}, Hong Kong, China, 2002, pp.586--597.
[13]} Chaudhuri S, Ganti V, Motwani R. Robust identification of fuzzy duplicates. In {\it Proc. 21st Int. Conf. Data Engineering $($ICDE$)$}, Tokyo, Japan, 2005, pp.865--876.
[14]} Bloom B H. Space/time trade-offs in hash coding with allowable errors. {\it Communications of the ACM}, July 1970, 13(7): 422--426.
[15]} Fan L, Cao P, Almeida J, Broder A Z. Summary cache: A scalable wide-area Web cache sharing protocol. {\it IEEE/ACM Trans. Networking}, 2000, 8(3): 281--293.
[16]} Mitzenmacher M. Compressed Bloom filters. In {\it Proc. 20th ACM Symposium on Principles of Distributed Computing} Netw. Rhode Island, 2001, pp.144--150.
[17]} Cohen S, Matias Y. Spectral Bloom filters. In {\it Proc. ACM SIGMOD Conf.}, California, USA, June 2003, pp.241--252.
[18]} Whang K, Zenden B T V, Taylor H M. A linear-time probabilistic counting algorithm for database applications. {\it ACM Trans. Database Syst.}, 1990, 15(2): 208--299.
[19]} Cheng K, Xiang L, Iwaihara M. Time-decaying Bloom filters for data streams with skewed distributions. In {\it Proc. 15th Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications $($RIDE-SDMA$)$}, Tokyo, Japan, 2005, pp.63--69.
[20]} Cohen E, Strauss M. Maintaining time-decaying stream aggregates. In {\it Proc. Principles of Database Systems $($PODS$)$}, San Diego, California, USA, June 2003, pp.223--233.
[21]} Cormode G, Tirthapura S, Xu B. Time-decaying sketches for sensor data aggregation. In {\it Proc. Principles of Distributed Computing $($PODC$)$}, Portland, Oregon, USA, May 2007, pp.215--224.
[22]} Arasu A, Manku G. Approximate counts and quanlites over sliding windows. In {\it Proc. Principles of Database Systems $($PODS$)$}, Paris, France, June 2004, pp.286--296.
[1] Dong-Hui Yang, Zhen-Yu Li, Xiao-Hui Wang, Kavé Salamatian, Gao-Gang Xie. Exploiting the Community Structure of Fraudulent Keywords for Fraud Detection in Web Search [J]. Journal of Computer Science and Technology, 2021, 36(5): 1167-1183.
[2] Yi-Ting Wang, Jie Shen, Zhi-Xu Li, Qiang Yang, An Liu, Peng-Peng Zhao, Jia-Jie Xu, Lei Zhao, Xun-Jie Yang. Enriching Context Information for Entity Linking with Web Data [J]. Journal of Computer Science and Technology, 2020, 35(4): 724-738.
[3] Yan Zheng, Jian-Ye Hao, Zong-Zhang Zhang, Zhao-Peng Meng, Xiao-Tian Hao. Efficient Multiagent Policy Optimization Based on Weighted Estimators in Stochastic Cooperative Environments [J]. Journal of Computer Science and Technology, 2020, 35(2): 268-280.
[4] Ting Bai, Hong-Jian Dou, Wayne Xin Zhao, Ding-Yi Yang, Ji-Rong Wen. An Experimental Study of Text Representation Methods for Cross-Site Purchase Preference Prediction Using the Social Text Data [J]. , 2017, 32(4): 828-842.
[5] Xiao-Fang Qi, Zi-Yuan Wang, Jun-Qiang Mao, Peng Wang. Automated Testing of Web Applications Using Combinatorial Strategies [J]. , 2017, 32(1): 199-210.
[6] Gong-Qing Wu, Lei Li, Li Li, Xindong Wu. Web News Extraction via Tag Path Feature Fusion Using DS Theory [J]. , 2016, 31(4): 661-672.
[7] Zeinab Hmedeh, Harry Kourdounakis, Vassilis Christophides, Cédric du Mouza, et al. Content-Based Publish/Subscribe System for Web Syndication [J]. , 2016, 31(2): 359-380.
[8] Yong Li, Jiang Zhang, Xiao-Feng Meng, Chang-Qing Wang. Quantifying the Influence of Websites Based on Online Collective Attention Flow [J]. , 2015, 30(6): 1175-1187.
[9] Zhi-Neng Chen, Chong-Wah Ngo, Wei Zhang, Juan Cao, Yu-Gang Jiang. Name-Face Association in Web Videos: A Large-Scale Dataset, Baselines, and Open Issues [J]. , 2014, 29(5): 785-798.
[10] Chen-Da Hou, Dong Li, Jie-Fan Qiu, Hai-Long Shi, Li Cui. SeaHttp:A Resource-Oriented Protocol to Extend REST Style for Web of Things [J]. , 2014, 29(2): 205-215.
[11] Cheng-De Zhang, Xiao Wu, Mei-Ling Shyu, and Qiang Peng. A Novel Web Video Event Mining Framework with the Integration of Correlation and Co-Occurrence Information [J]. , 2013, 28(5): 788-796.
[12] Kun Xie, Jian-Nong Cao, and Ji-Gang Wen. Optimal Relay Assignment and Power Allocation for Cooperative Communications [J]. , 2013, 28(2): 343-356.
[13] Godfrey Winster Sathianesan and Swamynathan Sankaranarayanan. Personalized Semantic Based Blog Retrieval [J]. , 2012, 27(3): 591-598.
[14] Jia Chen (陈佳), Yi-He Zhu (朱一和), Hao-Fen Wang (王昊奋), Wei Jin (晋薇), and Yong Yu (俞勇). Effective and Efficient Multi-Facet Web Image Annotation [J]. , 2012, 27(3): 541-553.
[15] Peng Li (李鹏), Bin Wang (王斌), Senior Member, CCF, Member, ACM, IEEE, and Wei Jin (晋薇). ImprovingWeb Document Clustering through Employing User-Related Tag Expansion Techniques [J]. , 2012, 27(3): 554-566.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] Ma Jun; Ma Shaohan;. Efficient Parallel Algorithms for Some Graph Theory Problems[J]. , 1993, 8(4): 76 -80 .
[2] Jin Guohua; Chen Fujie;. On the Problem of Optimizing Parallel Programs for Complex Memory Hierarchies[J]. , 1994, 9(1): 1 -26 .
[3] Zhou Jianqiang; Xie Li; Dai Fei; Sun Zhongxiu;. Adaptive Memory Coherence Algorithms in DSVM[J]. , 1994, 9(4): 365 -372 .
[4] Zhang Songmao;. Weak Precedence Story Parsing Grammar[J]. , 1995, 10(1): 53 -64 .
[5] NIE Xumin; GUO Qing;. Renaming a Set of Non-Horn Clauses[J]. , 2000, 15(5): 409 -415 .
[6] How Jiann Teo and Kok Cheong Wong. Texture Pattern Generation Using Clonal Mosaic Texture Pattern Generation Using Clonal Mosaic[J]. , 2006, 21(2): 173 -180 .
[7] Hui-Zhan Yi and Xue-Jun Yang. Toward the Optimal Configuration of Dynamic Voltage Scaling Points in Real-Time Applications[J]. , 2006, 21(6): 893 -900 .
[8] Xiang Gao, Yun-Ji Chen, Huan-Dong Wang, Dan Tang, and Wei-Wu Hu. System Architecture of Godson-3 Multi-Core Processors[J]. , 2010, 25(2): 181 -191 .
[9] Xue-Jun Yang (杨学军), Senior Member, CCF, Member, ACM, IEEE, Xiang-Ke Liao (廖湘科), Senior Member CCF, Member, ACM, Kai Lu . The TianHe-1A Supercomputer: Its Hardware and Software[J]. , 2011, 26(3): 344 -351 .
[10] Min-Yi Guo, Zi-Li Shao, Edwin Hsing-Mean Sha. Preface[J]. , 2011, 26(3): 373 -374 .

ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP

Home
Editorial Board
Author Guidelines
Subscription
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved