Improved Approximate Detection of Duplicates forData Streams Over Sliding Windows
-
Abstract
Detecting duplicates indata streams is an important problem that has a wide range ofapplications. In general, precisely detecting duplicates in anunbounded data stream is not feasible in most streaming scenarios,and, on the other hand, the elements in data streams are always timesensitive. These make it particular significant approximatelydetecting duplicates among newly arrived elements of a data streamwithin a fixed time frame. In this paper, we present a novel datastructure, Decaying Bloom Filter (DBF), as an extension of theCounting Bloom Filter, that effectively removes stale elements asnew elements continuously arrive over sliding windows. On the DBF basiswe present an efficient algorithm to approximately detect duplicatesover sliding windows. Our algorithm may produce false positiveerrors, but not false negative errors as in many previous results. Weanalyze the time complexity and detection accuracy, and give a tightupper bound of false positive rate. For a given space G bits andsliding window size W, our algorithm has an amortized timecomplexity of O(\sqrtG/W). Both analytical and experimentalresults on synthetic data demonstrate that our algorithm is superiorin both execution time and detection accuracy to the previousresults.
-
-