相关性不确定数据库中的频繁项集挖掘

童咏昕; 陈雷; 佘洁莹

doi:10.1007/s11390-015-1555-9

摘要: 最近随着物联网与普适计算技术的普及,大量如RFID数据,传感器数据,实时视频数据等不确定数据已被收集。作为不确定数据挖掘的基础性问题之一的不确定频繁模式挖掘已经吸引了数据库与数据挖掘领域的众多关注。虽然已有一些不确定频繁模式挖掘的解决方案,但是它们中的大多数采用了数据独立假设,这一假设在现实世界中是不真实的。因此,对于相关性不确定数据当前基于独立假设的求解方法可能导致不精准的结果。本文聚焦于相关性不确定数据中的频繁项集挖掘问题,其中相关性可存在于任意一对不确定对象之中。我们提出了一项称为“相关频繁概率模型”(简称为CFP模型)的新型概率模型,其用于表达相关性不确定数据集中支持度的概率分布。基于得自CFP模型的支持度概率分布,我们观察到一些概率频繁项集仅在部分具有高度正相关的事务中频繁。特别的是,全局频繁项集在消除数据噪音与相关性的影响时将更为重要。为了消减冗余的频繁项集,我们进一步提出了一种被成为全局概率频繁项集的新型模式,其旨在发现将相关性不确定数据库根据相关性拆分为不相交的组中皆频繁的项集。为了加速挖掘过程,我们还设计了一个动态规划的求解方案和两个剪枝与定界结束。通过大量真实与人造数据集中的实验验证了本文所提出模型与算法的高效性。

Abstract: Recently, with the growing popularity of Internet of Things (IoT) and pervasive computing, a large amount of uncertain data, e.g. RFID data, sensor data, real-time video data, etc., has been collected. As one of the most fundamental issues of uncertain data mining, uncertain frequent pattern mining has attracted much attention in the database and the data mining communities. Although there have been some solutions for uncertain frequent pattern mining, most of them assume that the data is independent, which is not true in most real-world scenarios. Therefore, current methods that are based on the independent assumption may generate inaccurate results for correlated uncertain data. In this paper, we focus on the problem of mining frequent itemsets over correlated uncertain data, where correlation can exist in any pair of uncertain data objects (transactions). We propose a novel probabilistic model, called correlated frequent probability model (CFP model) to represent the probability distribution of support in a given correlated uncertain data set. Based on the distribution of support derived from the CFP model, we observe that some probabilistic frequent itemsets are only frequent in several transactions with high positive correlation. In particular, the itemsets, which are global probabilistic frequent, have more significance in eliminating the influence of the existing noise and correlation in data. In order to reduce redundant frequent itemsets, we further propose a new type of patterns, called global probabilistic frequent itemsets, to identify itemsets that are always frequent in each group of transactions if the whole correlated uncertain database is divided into disjoint groups based on their correlation. To speed up the mining process, we also design a dynamic programming solution, as well as two pruning and bounding techniques. Extensive experiments on both real and synthetic datasets verify the effectiveness and efficiency of the proposed model and algorithms.

相关性不确定数据库中的频繁项集挖掘

Mining Frequent Itemsets in Correlated Uncertain Databases