›› 2015,Vol. 30 ›› Issue (6): 1233-1248.doi: 10.1007/s11390-015-1596-0

所属专题: Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

一种高效的面向大规模多维数据集的分布式离群点检测算法

Xi-Te Wang(王习特), Student Member, CCF, Member, ACM, De-Rong Shen(申德荣), Senior Member, CCF, Member, ACM, IEEE, Mei Bai(白梅), Tie-Zheng Nie(聂铁铮), Member, CCF, ACM, Yue Kou(寇月), Member, CCF, ACM, Ge Yu(于戈), Fellow, CCF, Member, ACM, IEEE   

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
  • 收稿日期:2015-05-15 修回日期:2015-10-14 出版日期:2015-11-05 发布日期:2015-11-05
  • 作者简介:Xi-Te Wang is a Ph.D. candidate in computer software and theory, Northeastern University, Shenyang, from which he also received his M.S. degree in computer architecture in 2011. He is a student member of CCF, and a member of ACM. His research interests include cloud computing and big data management.
  • 基金资助:

    This work was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N120816001.

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Xi-Te Wang(王习特), Student Member, CCF, Member, ACM, De-Rong Shen(申德荣), Senior Member, CCF, Member, ACM, IEEE, Mei Bai(白梅), Tie-Zheng Nie(聂铁铮), Member, CCF, ACM, Yue Kou(寇月), Member, CCF, ACM, Ge Yu(于戈), Fellow, CCF, Member, ACM, IEEE   

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received:2015-05-15 Revised:2015-10-14 Online:2015-11-05 Published:2015-11-05
  • About author:Xi-Te Wang is a Ph.D. candidate in computer software and theory, Northeastern University, Shenyang, from which he also received his M.S. degree in computer architecture in 2011. He is a student member of CCF, and a member of ACM. His research interests include cloud computing and big data management.
  • Supported by:

    This work was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N120816001.

基于距离的离群点是一种被广泛使用的离群点定义方法。一个数据点是否被认定为离群点, 主要依据该点与其临近点的距离大小。在本文中, 为解决在分布式环境中的离群点检测问题, 提出了一种基于ZH树的DBOZ算法。首先, 为有效管理分布式环境中的数据, 我们提出了一种基于Z-order空间填充曲线的索引机制, 称为ZH树。ZH树具有如下两个主要优势: a)聚类特性: 有助于邻近点的查找;b)层次结构: 有助于进行空间剪枝。我们还设计了一种自底向上的ZH树建立方法。该方法的时间复杂度与维度和数据集规模呈线性关系。接下来, 提出了用于分布式离群点检测的DBOZ算法, 主要包括2个步骤: a)为避免计算所有数据点的最近邻点, 我们设计了一种贪心算法和一种基于ZH树的k近邻查找算法(简称为ZHkNN), 可以快速得到一个阈值LW;b)我们提出了一种“先过滤后精炼”的方法。该方法首先使用阈值LW过滤掉大量的非离群点, 然后对剩余数据点进一步精炼, 得到最终的离群点。最后, 使用一系列实验验证了所提出的ZH树和DBOZ算法的有效性。

Abstract: The distance-based outlier is a widely used definition of outlier. A point is distinguished as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computing in distributed environments, DBOZ, a distributed algorithm for distance-based outlier detection using Z-curve hierarchical tree (ZH-tree) is proposed. First, we propose a new index, ZH-tree, to effectively manage the data in a distributed environment. ZH-tree has two desirable advantages, including clustering property to help search the neighbors of a point, and hierarchical structure to support space pruning. We also design a bottom-up approach to build ZH-tree in parallel, whose time complexity is linear to the number of dimensions and the size of dataset. Second, DBOZ is proposed to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all the points, we design a greedy method and a new ZH-tree based k-nearest neighbor searching algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW, and then outputs the final outliers through refining the remaining points. At last, the efficiency and the effectiveness of ZH-tree and DBOZ are testified through a series of experiments.

[1] Hawkins D M. Identification of Outliers. Springer, 1980.

[2] Barnett V, Lewis T. Outliers in Statistical Data.Wiley New York, 1994.

[3] Rousseeuw P J, Leroy A M. Robust Regression and Outlier Detection. John Wiley & Sons, 2003.

[4] Knorr E M, Ng R T. Algorithms for mining distancebased outliers in large datasets. In Proc. the 24th International Conference on Very Large Data Bases, August 1998, pp.392-403.

[5] Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record, 2000, 29(2): 427-438.

[6] Angiulli F, Pizzuti C. Outlier mining in large highdimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(2): 203-215.

[7] Angiulli F, Pizzuti C. Fast outlier detection in high dimensional spaces. In Proc. the 6th European Conference on Principles of Data Mining and Knowledge Discovery, August 2002, pp.15-27.

[8] Schubert E, Zimek A, Kriegel H P. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery, 2014, 28(1): 190-237.

[9] Shiokawa H, Fujiwara Y, Onizuka M. Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proceedings of the VLDB Endowment, 2015, 8(11): 1178-1189.

[10] Aggarwal C C, Yu P S. Outlier detection for high dimensional data. ACM SIGMOD Record, 2001, 30(2): 37-46.

[11] Breunig M M, Kriegel H P, Ng R T, Sander J. LOF: Identifying density-based local outliers. ACM SIGMOD Record, 2000, 29(2): 93-104.

[12] He Z, Xu X, Deng S. Discovering cluster-based local outliers. Pattern Recognition Letters, 2003, 24(9/10): 1641- 1650.

[13] Bay S D, Schwabacher M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proc. the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2003, pp.29-38.

[14] Angiulli F, Fassetti F. Very efficient mining of distancebased outliers. In Proc. the 16th ACM Conference on Information and Knowledge Management, November 2007, pp.791-800.

[15] Guttman A. R-trees: A dynamic index structure for spatial searching. ACM SIGMOD Record, 1984, 14(2): 47-57.

[16] Ciaccia P, Patella M, Zezula P. M-tree: An efficient access method for similarity search in metric spaces. In Proc. the 23rd International Conference on Very Large Databases, August 1997, pp.426-435.

[17] Beyer K, Goldstein J, Ramakrishnan R, Shaft U. When is “nearest neighbor” meaningful? In Proc. the 7th International Conference on Database Theory, January 1999, pp.217-235.

[18] Wang B, Yang X C, Wang G R, Yu G. Outlier detection over sliding windows for probabilistic data streams. Journal of Computer Science and Technology, 2010, 25(3): 389-400.

[19] Cao K Y, Wang G R, Han D H, Ding G H, Wang A X, Shi L X. Continuous outlier monitoring on uncertain data streams. Journal of Computer Science and Technology, 2014, 29(3): 436-448.

[20] Gupta M, Gao J, Aggarwal C, Han J. Outlier detection for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery, 2014, 5(1): 1-129.

[21] Yan Y, Zhang J, Huang B, Sun X, Mu J, Zhang Z, Moscibroda T. Distributed outlier detection using compressive sensing. In Proc. the 2015 ACM SIGMOD International Conference on Management of Data, May 31-June 04, 2015, pp.3-16.

[22] Hung E, Cheung D W. Parallel mining of outliers in large database. Distributed and Parallel Databases, 2002, 12(1): 5-26.

[23] Lozano E, Acuña E. Parallel algorithms for distance-based and density-based outliers. In Proc. the 5th IEEE International Conference on Data Mining, November 2005, pp.729- 732.

[24] Angiulli F, Basta S, Lodi S, Sartori C. Distributed strategies for mining outliers in large data sets. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(7): 1520-1532.

[25] Jagadish H V, Ooi B C, Tan K, Yu C, Zhang R. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transaction Database System, 2005, 30(2): 364-397.

[26] Ramsak F, Markl V, Fenk R, Zirkel M, Elhardt K, Bayer R. Integrating the UB-tree into a database system kernel. In Proc. the 26th International Conference on Very Large Data Bases, September 2000, pp.263-272.

[27] O'Malley O. Terabyte sort on apache hadoop. http://sortbenchmark.org/YahooHadoop.pdf, October 2015.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 刘明业; 洪恩宇;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] 陈世华;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] 高庆狮; 张祥; 杨树范; 陈树清;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] 闵应骅; 韩智德;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] 闵应骅;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] 朱鸿;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] 李明慧;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: