无冗余地挖掘频繁广义元素集和广义关联规则

摘要: 本文首次提出了有效挖掘“最大广义频繁集”以及“必要广义关联规则”的算法。最大广义频繁集和必要广义关联规则分别是广义频繁集和广义关联规则的一种压缩表示。本文的结果把两个概念合二为一而填充了挖掘频繁集和关联规则研究的一项空白。第一个概念是“广义”。这个概念的意思是说存在一个元素的分类树。比如牛肉和鸡肉都是肉。事务数据库里纪录牛肉，鸡肉等的信息。但挖掘出的广义频繁集和广义关联规则可以存在肉，比如（肉，奶）可能频繁出现，但（牛肉，奶）和（鸡肉，奶）都不频繁。第二个概念是“压缩表示”。我们用的压缩表示可以指数极地减少结果空间。同时，从压缩表示可以不经由事务数据库而直接获得全部结果。实验结果表明，跟直接获得全部结果的算法比，计算压缩结果再导出全部结果可以获得高达百万倍的速度提高。对这两个问题中的每一个，我们都提出了一个基于概念格的简单算法和一个基于分类树的算法。挖掘最大广义频繁集的分类树算法命名为MFGI_class。挖掘必要广义关联规则的分类树算法命名为EGR_class。以MFGI_class为例，分类树算法由一个概念上的分类树和一系列剪枝策略组成。我们给出了一个概念上的分类树，其每个节点都对应一个广义元素集的集合。根节点所对应的集合是全集。每个父节点所对应的集合被其所有子节点所对应的集合分类。也就是说一个父节点中的任意元素集都一定出现在某个子节点所对应的集合里，而任意两个兄弟节点所对应的集合都没有交集。虽然一个节点对应一个元素集的集合，并且元素集的个数无上限，但每个节点都可以用常量空间表示。分类树算法从根节点开始动态产生子节点。随着子节点的产生，我们提出的剪枝策略可以动态剪掉不需产生的子树。实验结果证明了分类树算法的有效性。

Abstract: This paper presents some new algorithms to efficiently mine maxfrequent generalized itemsets (g-itemsets) and essential generalizedassociation rules (g-rules). These are compact and generalrepresentations for all frequent patterns and all strong associationrules in the generalized environment. Our results fill an importantgap among algorithms for frequent patterns and association rules bycombining two concepts. First, generalized itemsets employ a taxonomyof items, rather than a flat list of items. This produces morenatural frequent itemsets and associations such as (\em meat, milk)instead of (beef, milk), (chicken, milk), etc. Second, compactrepresentations of frequent itemsets and strong rules, whose resultsize is exponentially smaller, can solve a standard dilemma in miningpatterns: with small threshold values for support and confidence, theuser is overwhelmed by the extraordinary number of identifiedpatterns and associations; but with large threshold values, someinteresting patterns and associations fail to be identified.Our algorithms can also expand those max frequent g-itemsets andessential g-rules into the much larger set of ordinary frequentg-itemsets and strong g-rules. While that expansion is notrecommended in most practical cases, we do so in order to present acomparison with existing algorithms that only handle ordinaryfrequent g-itemsets. In this case, the new algorithm is shown to bethousands, and in some cases millions, of the time faster than previousalgorithms. Further, the new algorithm succeeds in analyzing deepertaxonomies, with the depths of seven or more. Experimental results forprevious algorithms limited themselves to taxonomies with depth atmost three or four.In each of the two problems, a straightforward lattice-based approach isbriefly discussed and then a classification-based algorithm is developed. Inparticular, the two classification-based algorithms are MFGI\_class formining max frequent g-itemsets and EGR\_class for mining essentialg-rules. The classification-based algorithms are featured with conceptualclassification trees and dynamic generation and pruning algorithms.

无冗余地挖掘频繁广义元素集和广义关联规则

Mining Frequent Generalized Itemsets and Generalized Association Rules Without Redundancy