›› 2016,Vol. 31 ›› Issue (4): 776-786.doi: 10.1007/s11390-016-1662-2

所属专题: Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

HUITWU:一个在事务数据库中有效挖掘高效用项集的算法

Shi-Ming Guo, and Hong Gao, Member, CCF   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • 收稿日期:2016-02-29 修回日期:2016-06-06 出版日期:2016-07-05 发布日期:2016-07-05
  • 作者简介:Shi-Ming Guo received his B.S. degree in software engineering from Harbin Institute of Technology, Harbin, in 2004. He is currently a Ph.D. candidate in the School of Computer Science and Technology at Harbin Institute of Technology, Harbin. His current research interests include high-utility pattern mining and massive data management.
  • 基金资助:

    This work is partly supported by the National Basic Research 973 Program of China under Grant No. 2012CB316200 and the National Natural Science Foundation of China under Grant Nos. 61190115 and 61173022.

HUITWU: An Efficient Algorithm for High-Utility Itemset Mining in Transaction Databases

Shi-Ming Guo, and Hong Gao, Member, CCF   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Received:2016-02-29 Revised:2016-06-06 Online:2016-07-05 Published:2016-07-05
  • About author:Shi-Ming Guo received his B.S. degree in software engineering from Harbin Institute of Technology, Harbin, in 2004. He is currently a Ph.D. candidate in the School of Computer Science and Technology at Harbin Institute of Technology, Harbin. His current research interests include high-utility pattern mining and massive data management.
  • Supported by:

    This work is partly supported by the National Basic Research 973 Program of China under Grant No. 2012CB316200 and the National Natural Science Foundation of China under Grant Nos. 61190115 and 61173022.

从事务数据库中挖掘高效用项集是指寻找所有具有高效用属性的项集,例如具有高利润的项集。大部分存在的研究方法从事务数据库中以两个阶段寻找高效用项集。第一阶段,采用不同的高估技术,计算项集效用的上届。选择高估效用不小于用户指定阈值的项集做为候选高效用项集。第二阶段,对以第一阶段产生的候选高效用项集扫描数据库,计算候选高效用项集的精确效用,从而获得所有高效用项集。可是,候选高效用项集数量巨大,招致如下两个问题:1)大量的内存用于存储候选高效用项集;2)计算候选高效用项集的精确效用耗时巨大。最近垂直数据格式应用于高效用项集挖掘。可是这类方法不能有效的处理具有相同项的事务,从而导致数据库尺寸不能充分的减小,进而影响算法的性能。因此本文提出一个新算法HUITWU挖掘高效用项集。采用一种新的树结构HUITWU-Tree计算数据库中项集的效用。我们在稀松数据集和稠密数据集上进行了大量的实验,实验结果表明我们提出的算法在运行时间上比当前最好的算法快一个数量级,同时在内存使用方面少于当前最好的算法。

Abstract: Mining high-utility itemsets (HUIs) from a transaction database refers to the discovery of itemsets with high utilities like profits. Most of existing studies discover HUIs from a transaction database in two phases. In phase 1, different overestimation methods are applied to calculate the upper bounds of the utilities of itemsets. Since the overestimated utilities of itemsets are adopted, the itemsets whose overestimated utilities are no less than a user-specified threshold are selected as candidate HUIs, and they are verified by scanning the database one more time in phase 2. However, a large number of candidate HUIs incur two problems:1) it requires excessive memory to store these candidates; 2) it needs a large amount of running time to calculate their exact utilities. Vertical data format has been applied to mine HUIs recently. However this kind of method cannot deal with transactions with the same items effectively so that the size of database cannot be reduced sufficiently. The overall performance of algorithms is degraded consequently. Thus an algorithm HUITWU is proposed in this paper for mining HUIs. A novel data structure HUITWU-Tree is adopted to efficiently calculate the utilities of itemsets in a database. Extensive studies with both sparse and dense datasets have demonstrated that our proposed algorithm is more than an order of magnitude faster and consumes less memory than the state-of-the-art algorithms.

[1] Yao H, Hamilton H J, Butz C J. A foundational approach to mining itemset utilities from databases. In Proc. the 4th SIAM Int. Conf. Data Min., April 2004, pp.482-486.

[2] Ahmed C F, Tanbeer S K, Jeong B S et al. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Trans. Knowl. Data Eng., 2009, 21(12):1708-1721.

[3] Tseng V A, Shie B E, Wu C W, Yu P S. Efficient algorithms for mining high utility itemsets from transactional databases. IEEE Trans. Knowl. Data Eng., 2013, 25(8):1772-1786.

[4] Pei J, Han J, Lakshmanan L V S. Mining frequent itemsets with convertible constraints. In Proc. the 17th Int. Conf. Data Eng., April 2001, pp.433-442.

[5] Li Y X, Yeh J S, Chang C C. Isolated items discarding strategy for discovering high utility itemsets. Data Knowl. Eng., 2008, 64(1):198-217.

[6] Liu Y, Liao W, Choudhary A. A two-phase algorithm for fast discovery of high utility of itemsets. In Proc. the 9th Pacific-Asia Conf. Knowl. Discov. Data Min., May 2005, pp.689-695.

[7] Liu M, Qu J. Mining high utility itemsets without candidate generation. In Proc. the 21st ACM Int. Conf. Inf. Knowl. Man., October 29-November 2, 2012, pp.55-64.

[8] Fournier-Viger P, Wu C W, Zida S, Tseng V S. FHM:Faster high-utility itemset mining using estimated utility co-occurrence pruning. In Proc. the 21th Int. Symp. Meth. Intel. Sys., June 2014, pp.83-92.

[9] Han J, Pei J, Yin Y, Mao R. Mining frequent patterns without candidate generation:A frequent-pattern tree approach. Data Min. Knowl. Discov., 2004, 8(1):53-87.

[10] Wang J, Han J, Pei J. CLOSET+:Searching for the best strategies for mining frequent closed itemsets. In Proc. the 9th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., August 2003, pp.236-245.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 刘明业; 洪恩宇;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] 陈世华;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] 高庆狮; 张祥; 杨树范; 陈树清;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] 章萃; 赵沁平; 徐家福;. Kernel Language KLND[J]. , 1986, 1(3): 65 -79 .
[5] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[6] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[7] 闵应骅; 韩智德;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[8] 卢学妙;. On the Complexity of Induction of Structural Descriptions[J]. , 1987, 2(1): 12 -21 .
[9] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[10] 闵应骅;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: