Effect of Count Estimation in Finding Frequent Itemsets over Online Transactional Data Streams
-
Abstract
A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This issue becomes more serious in finding frequent itemsets or frequency counting over an online transactional data stream since there can be a large number of itemsets to be monitored. We have proposed a method called the estDec method for finding frequent itemsets over an online data stream. In order to reduce the number of monitored itemsets in this method, monitoring the count of an itemset is delayed until its support is large enough to become a frequent itemset in the near future. For this purpose, the count of an itemset should be estimated. Consequently, how to estimate the count of an itemset is a critical issue in minimizing memory usage as well as processing time. In this paper, the effects of various count estimation methods for finding frequent itemsets are analyzed in terms of mining accuracy, memory usage and processing time.
-
-