• Articles • Previous Articles     Next Articles

Multi-Scaling Sampling: An Adaptive Sampling Method for Discovering Approximate Association Rules

Cai-Yan Jia1,2 and Xie-Ping Gao3   

  1. 1The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, P.R. China
    2Graduate School of the Chinese Academy of Sciences, Beijing 100039, P.R. China
    3Information Engineering College, Xiangtan University, Xiangtan 411105, P.R. China
  • Received:2003-07-08 Revised:2004-02-20 Online:2005-05-10 Published:2005-05-10

One of the obstacles of the efficient association rule mining is theexplosive expansion of data sets since it is costly or impossible toscan large databases, esp., for multiple times. A popular solution toimprove the speed and scalability of the association rule mining is todo the algorithm on a random sample instead of the entire database. Buthow to effectively define and efficiently estimate the degree of errorwith respect to the outcome of the algorithm, and how to determine the samplesize needed are entangling researches until now. In this paper,an effective and efficient algorithm is given based on the PAC(Probably Approximate Correct) learning theory to measure and estimatesample error. Then, a new adaptive, on-line, fast samplingstrategy --- multi-scaling sampling --- is presented inspired by MRA(Multi-Resolution Analysis) and Shannon sampling theorem, for quicklyobtaining acceptably approximate association rules at appropriate samplesize. Both theoretical analysis and empirical study have showed that thesampling strategy can achieve a very good speed-accuracy trade-off.

Key words: Expansion net; process of elementary net system; contact;

[1] Evfimievski A, Srikant R, Agrawal R, Gehrke J. Privacypreserving mining of association rules. In Proc. The 8thACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining,Edmonton, Alberta, Canada, July 2002, pp.217--228.

[2]Agrawal R, Mannila H, Srikant R et al. Fast Discovery of the Association Rules. Advances inKnowledge Discovery and Data Mining, AAAI Press, 1996, pp.307--328.

[3]Li Q, Wang H et al. Efficient mining ofassociation rules by reducing the number of passes over thedatabase. Journal of Computer Science and Technology, 2001,16(2): 182--188.

[4]Zaki M J. Parallel and distributed association mining: Asurvey. IEEE Concurrency, 1999, 7(4): 14--25.

[5]Agrawal R, Shafer J C. Parallel mining of association rules. IEEE Trans. Knowledge and Data Engineering, 1996, 8(6):962--969.

[6]SAS Institute Inc. Data mining and case for sampling: Solvingbusiness problems using SAS enterprise miner software. SASInstitute White Paper, 1998.

[7]Chen B, Haas P, Scheuermann P. A new two-phase sampling basedalgorithms for discovery association rules. In Proc. The 8th ACMSIGKDD Int. Conf. Knowledge Discovery and Data Mining,Edmonton, Alberta, Canada, July 2002, pp.462--468.

[8] Parthasarathy S. Efficient progressive sampling forassociation rules. In Proc. The IEEE Int. Conf.Data Mining (ICDM'02), Maebashi City, Japan, Dec., 2002, pp.354--361.

[9] Toivonen H. Sampling large databases forassociation rules. In Proc. The 22nd Int. Conf. Very LargeData Bases, Mumbai, Bombay, India, Sept. 1996, pp.134--145.

[10] Zaki M J, Parthasarathy S, Li W et al. Evaluation of sampling for data mining of association rules. In Proc. The 7th Workshop on Research Issues in DataEngineer, Birmingham, UK, April 1997, pp.42--50.

[11] Watanabe O. Simple sampling techniques for discoveryscience. IEICE Trans. Information and Systems, 2000,E83-D(1): 19--26.

[12]Zhang C, Zhang S, Webb G I. Identifying approximate itemsetsof interest in large databases. Applied Intelligence, 2003, 18:91--104.

[13]Valiant L G. A theory of the learnable. Communications of the ACM, 1984, 27: 1134--1142.

[14] John G H, Langley P. Static versus dynamic sampling for datamining. In Proc. The 2nd Int. Conf. Knowledge Discovery andData Mining, KDD-96, Portland, OR, Aug. 1996, pp.367--370.

[15]Suzuki E. Sampling theories for rule discovery basedon generality and accuracy, the worst case and adistribution-based case. Communication of Institute ofInformation and Computing Machinery, May, 2002, 5(2): 83--88.

[16]Zaki M J, Hsiao C J. CHARM: An efficient algorithm for closedassociation rule mining. Technical Report 99-10, ComputerScience Dept., Rensselaer Polytechnic Institute, Oct., 1999.

[17]Burdick D, Calimlim M, Gehrke J. MAFIA: A maximalfrequent itemset algorithm for transactional databases. In Proc. The 17th Int. Conf. Data Engineering, Heidelberg, Germany, April,2001, pp.443--452.

[18] Agrawal R, Srikant R. Fast algorithms for mining associationrules. In Proc. The 20th Int. Conf. Very LargeData Bases, Santiago, Chile, 1994, pp.487--499.

[19]Han J, Pei J, Yin Y. Mining frequent patterns without candidategeneration. In Proc. The ACM SIGMOD Int. Conf.the Management of Data, Dallas, TX, May 2000, pp.1--12.

[20] Pei J, Han J, Mao R. CLOSET: An efficient algorithmfor mining frequent closed itemsets. In Proc. theACM-SIGMOD Workshop on Research Issues in Data Mining and KnowledgeDiscovery, Dallas, TX, May, 2000, pp.21--30.

[21] Provost F, Jensen D, Oates T. Efficient progressivesampling. In Proc. The 5th ACM SIGKDD Int. Conf.Knowledge Discovery and Data Mining, San Diego, CA,USA, Aug. 1999, pp.23--32.

[22]Vitter J S. An efficient algorithm for sequentialrandom sampling. ACM Trans. Mathematical Software,1987, 13(1): 58--67.

[23] http://fuzzy.cs.uni-magdeburg.de/~borgelt/
[1] Hong Tang, Shuai Mu, Jin Huang, Jia Zhu, Jian Chen, Rui Ding. Zip: An Algorithm Based on Loser Tree for Common Contacts Searching in Large Graphs [J]. , 2015, 30(4): 799-809.
[2] Cao Cungen;. Expansion Nets and Expansion Processes of Elementary Net Systems [J]. , 1995, 10(4): 325-333.
Full text



No Suggested Reading articles found!

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved