An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Xi-Te Wang; De-Rong Shen; Mei Bai; Tie-Zheng Nie; Yue Kou; Ge Yu

doi:10.1007/s11390-015-1596-0

Xi-Te Wang, De-Rong Shen, Mei Bai, Tie-Zheng Nie, Yue Kou, Ge Yu. An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets[J]. Journal of Computer Science and Technology, 2015, 30(6): 1233-1248. DOI: 10.1007/s11390-015-1596-0

Citation:

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Abstract

Abstract

The distance-based outlier is a widely used definition of outlier. A point is distinguished as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computing in distributed environments, DBOZ, a distributed algorithm for distance-based outlier detection using Z-curve hierarchical tree (ZH-tree) is proposed. First, we propose a new index, ZH-tree, to effectively manage the data in a distributed environment. ZH-tree has two desirable advantages, including clustering property to help search the neighbors of a point, and hierarchical structure to support space pruning. We also design a bottom-up approach to build ZH-tree in parallel, whose time complexity is linear to the number of dimensions and the size of dataset. Second, DBOZ is proposed to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all the points, we design a greedy method and a new ZH-tree based k-nearest neighbor searching algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW, and then outputs the final outliers through refining the remaining points. At last, the efficiency and the effectiveness of ZH-tree and DBOZ are testified through a series of experiments.

FullText(HTML)

References (27)

Relative Articles

Supplements (0)

Cited By

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Abstract

Catalog

Export File

Citation

Format

Content