Multi-Scaling Sampling: An Adaptive Sampling Method for Discovering Approximate Association Rules
-
Abstract
One of the obstacles of the efficient association rule mining is theexplosive expansion of data sets since it is costly or impossible toscan large databases, esp., for multiple times. A popular solution toimprove the speed and scalability of the association rule mining is todo the algorithm on a random sample instead of the entire database. Buthow to effectively define and efficiently estimate the degree of errorwith respect to the outcome of the algorithm, and how to determine the samplesize needed are entangling researches until now. In this paper,an effective and efficient algorithm is given based on the PAC(Probably Approximate Correct) learning theory to measure and estimatesample error. Then, a new adaptive, on-line, fast samplingstrategy --- multi-scaling sampling --- is presented inspired by MRA(Multi-Resolution Analysis) and Shannon sampling theorem, for quicklyobtaining acceptably approximate association rules at appropriate samplesize. Both theoretical analysis and empirical study have showed that thesampling strategy can achieve a very good speed-accuracy trade-off.
-
-