We use cookies to improve your experience with our site.

一种高效的面向大规模多维数据集的分布式离群点检测算法

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

  • 摘要: 基于距离的离群点是一种被广泛使用的离群点定义方法。一个数据点是否被认定为离群点, 主要依据该点与其临近点的距离大小。在本文中, 为解决在分布式环境中的离群点检测问题, 提出了一种基于ZH树的DBOZ算法。首先, 为有效管理分布式环境中的数据, 我们提出了一种基于Z-order空间填充曲线的索引机制, 称为ZH树。ZH树具有如下两个主要优势: a)聚类特性: 有助于邻近点的查找;b)层次结构: 有助于进行空间剪枝。我们还设计了一种自底向上的ZH树建立方法。该方法的时间复杂度与维度和数据集规模呈线性关系。接下来, 提出了用于分布式离群点检测的DBOZ算法, 主要包括2个步骤: a)为避免计算所有数据点的最近邻点, 我们设计了一种贪心算法和一种基于ZH树的k近邻查找算法(简称为ZHkNN), 可以快速得到一个阈值LW;b)我们提出了一种“先过滤后精炼”的方法。该方法首先使用阈值LW过滤掉大量的非离群点, 然后对剩余数据点进一步精炼, 得到最终的离群点。最后, 使用一系列实验验证了所提出的ZH树和DBOZ算法的有效性。

     

    Abstract: The distance-based outlier is a widely used definition of outlier. A point is distinguished as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computing in distributed environments, DBOZ, a distributed algorithm for distance-based outlier detection using Z-curve hierarchical tree (ZH-tree) is proposed. First, we propose a new index, ZH-tree, to effectively manage the data in a distributed environment. ZH-tree has two desirable advantages, including clustering property to help search the neighbors of a point, and hierarchical structure to support space pruning. We also design a bottom-up approach to build ZH-tree in parallel, whose time complexity is linear to the number of dimensions and the size of dataset. Second, DBOZ is proposed to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all the points, we design a greedy method and a new ZH-tree based k-nearest neighbor searching algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW, and then outputs the final outliers through refining the remaining points. At last, the efficiency and the effectiveness of ZH-tree and DBOZ are testified through a series of experiments.

     

/

返回文章
返回