? 针对大数据的并行增量式频繁项集挖掘算法
Journal of Computer Science and Technology
Quick Search in JCST
 Advanced Search 
      Home | PrePrint | SiteMap | Contact Us | Help
 
Indexed by   SCIE, EI ...
Bimonthly    Since 1986
Journal of Computer Science and Technology 2017, Vol. 32 Issue (2) :368-385    DOI: 10.1007/s11390-017-1726-y
Regular Paper << Previous Articles | Next Articles >>
针对大数据的并行增量式频繁项集挖掘算法
Yu-Geng Song1,2, Hui-Min Cui1,2, Member, CCF, Xiao-Bing Feng1, Member, CCF, ACM, IEEE
1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
2 University of Chinese Academy of Sciences, Beijing 100049, China
Parallel Incremental Frequent Itemset Mining for Large Data
Yu-Geng Song1,2, Hui-Min Cui1,2, Member, CCF, Xiao-Bing Feng1, Member, CCF, ACM, IEEE
1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
2 University of Chinese Academy of Sciences, Beijing 100049, China

摘要
参考文献
相关文章
Download: [PDF 2209KB]  
摘要 频繁项集挖掘是一个流行的数据挖掘问题,应用于很多领域,比如零售业的商品推荐、网络搜索的日志分析、查询推荐(或相关搜索)。目前已经有很多频繁项集挖掘算法被提出,以获得更高的性能,其中包括用于处理大数据的并行算法。另外,人们也提出了增量式算法来应对数据库的增量式更新。然而,这些增量式算法往往并行度很低,处理大数据的时候效率低下。本文提出了两个在MapReduce框架上实现的并行增量式频繁项集挖掘算法:IncMiningPFP和IncBuildingPFP。IncMiningPFP在原始计算过程中保存FP-tree的挖掘结果,并且在增量计算的时候利用它们。具体来说,我们提出了一种在增量计算过程中生成局部FP-tree的方法,以避免不必要的挖掘操作。此外,当新插入的数据包含的项目种类比较少时,增量计算过程中的某些并行化任务可以被省略。IncBuildingPFP在原始计算过程中保存构建好的各个CanTree,然后在增量计算时将新的事务添加到它们之上。我们的实验结果表明,在大多数情况下,IncMiningPFP面对数据库增量输入的情景与PFP和一个单线程增量算法(CanTree)相比均可以获得显著的加速,在其他情况下IncBuildingPFP可以做到这一点。
关键词增量并行FPGrowth   数据挖掘   频繁项集挖掘   MapReduce     
Abstract: Frequent itemset mining (FIM) is a popular data mining issue adopted in many fields, such as commodity recommendation in the retail industry, log analysis in web searching, and query recommendation (or related search). A large number of FIM algorithms have been proposed to obtain better performance, including parallelized algorithms for processing large data volumes. Besides, incremental FIM algorithms are also proposed to deal with incremental database updates. However, most of these incremental algorithms have low parallelism, causing low efficiency on huge databases. This paper presents two parallel incremental FIM algorithms called IncMiningPFP and IncBuildingPFP, implemented on the MapReduce framework. IncMiningPFP preserves the FP-tree mining results of the original pass, and utilizes them for incremental calculations. In particular, we propose a method to generate a partial FP-tree in the incremental pass, in order to avoid unnecessary mining work. Further, some of the incremental parallel tasks can be omitted when the inserted transactions include fewer items. IncbuildingPFP preserves the CanTrees built in the original pass, and then adds new transactions to them during the incremental passes. Our experimental results show that IncMiningPFP can achieve significant speedup over PFP (Parallel FPGrowth) and a sequential incremental algorithm (CanTree) in most cases of incremental input database, and in other cases IncBuildingPFP can achieve it.
Keywordsincremental parallel FPGrowth   data mining   frequent itemset mining   MapReduce     
Received 2015-11-18;
本文基金:

This work was supported by the National High Technology Research and Development 863 Program of China under Grant Nos. 2015AA011505, 2015AA015306, and 2012AA010902, the National Natural Science Foundation of China under Grant Nos. 61202055, 61221062, 61521092, 61303053, 61432016, 61402445, and 61672492, and the National Key Research and Development Program of China under Grant No. 2016YFB1000402.

About author: Yu-Geng Song received his B.S. degree in automation from Tsinghua University, Beijing, in 2012. He is currently working toward his Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. His research interests include parallel computing and parallel compiling.
引用本文:   
Yu-Geng Song, Hui-Min Cui, Xiao-Bing Feng.针对大数据的并行增量式频繁项集挖掘算法[J]  Journal of Computer Science and Technology , 2017,V32(2): 368-385
Yu-Geng Song, Hui-Min Cui, Xiao-Bing Feng.Parallel Incremental Frequent Itemset Mining for Large Data[J]  Journal of Computer Science and Technology, 2017,V32(2): 368-385
链接本文:  
http://jcst.ict.ac.cn:8080/jcst/CN/10.1007/s11390-017-1726-y
Copyright 2010 by Journal of Computer Science and Technology