›› 2014,Vol. 29 ›› Issue (3): 408-422.doi: 10.1007/s11390-014-1439-4

所属专题: Computer Architecture and Systems Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

一种利用GPU处理加速异常探测特征选择的方法

Fatemeh Azmandian1, Member, IEEE, Ayse Yilmazer1, Student Member, IEEE, Jennifer G. Dy1, Member, IEEE Javed A. Aslam2, Member, ACM, and David R. Kaeli1, Fellow, IEEE, Member, ACM   

  1. 1 Department of Electrical and Computer Engineering, Northeastern University, Boston 02115-5096, U.S.A.;
    2 College of Computer and Information Science, Northeastern University, Boston 02115-5096, U.S.A.
  • 收稿日期:2013-08-31 修回日期:2014-01-28 出版日期:2014-05-05 发布日期:2014-05-05
  • 作者简介:Fatemeh Azmandian received her M.S. and Ph.D. degrees in computer engineering from Northeastern University, Boston. Dr. Azmanian's publications present the results of her research in utilizing data mining and machine learning techniques for a wide array of applications, including cloud storage, virtualization, cybersecurity, and radiotherapy treatment planning. Since 2012, Dr. Azmandian has brought her skills and talent to EMC Corporation. She is a member of IEEE.

Harnessing the Power of GPUs to Speed Up Feature Selection for Outlier Detection

Fatemeh Azmandian1, Member, IEEE, Ayse Yilmazer1, Student Member, IEEE, Jennifer G. Dy1, Member, IEEE Javed A. Aslam2, Member, ACM, and David R. Kaeli1, Fellow, IEEE, Member, ACM   

  1. 1 Department of Electrical and Computer Engineering, Northeastern University, Boston 02115-5096, U.S.A.;
    2 College of Computer and Information Science, Northeastern University, Boston 02115-5096, U.S.A.
  • Received:2013-08-31 Revised:2014-01-28 Online:2014-05-05 Published:2014-05-05
  • About author:Fatemeh Azmandian received her M.S. and Ph.D. degrees in computer engineering from Northeastern University, Boston. Dr. Azmanian's publications present the results of her research in utilizing data mining and machine learning techniques for a wide array of applications, including cloud storage, virtualization, cybersecurity, and radiotherapy treatment planning. Since 2012, Dr. Azmandian has brought her skills and talent to EMC Corporation. She is a member of IEEE.

获取能够用于区分正常数据和异常数据之间差异的特征将有助于识别异常数据。本文提出了一种新的非参数评估方法以用于基于过滤的特征选择,其目的是为了探测异常数据。本文所提出的方法一方面寻求能够准确表示正常数据内在特点的特征子集。另一方面,该方法也突出显示异常数据,以使得它们更加容易的被异常探测算法识别。在真实数据集上的实验结果表明,本文所提出的方法优于已有特征选择方法。而且,本文所提出的方法能够解决小样本空间问题并在高度非均衡数据集上表现良好。进而,由于特征选择的高度并行性,我们将本文所提出的方法在GPU上实现以对串行版本进行加速。 GPU实现带来了两方面的好处:无论是在特征数量还是在数据规模上都为本文所提出的特征选择方法带来了良好的可扩展性。

Abstract: Acquiring a set of features that emphasize the differences between normal data points and outliers can drastically facilitate the task of identifying outliers. In our work, we present a novel non-parametric evaluation criterion for filter-based feature selection which has an eye towards the final goal of outlier detection. The proposed method seeks the subset of features that represent the inherent characteristics of the normal dataset while forcing outliers to stand out, making them more easily distinguished by outlier detection algorithms. Experimental results on real datasets show the advantage of our feature selection algorithm compared with popular and state-of-the-art methods. We also show that the proposed algorithm is able to overcome the small sample space problem and perform well on highly imbalanced datasets. Furthermore, due to the highly parallelizable nature of the feature selection, we implement the algorithm on a graphics processing unit (GPU) to gain significant speedup over the serial version. The benefits of the GPU implementation are two-fold, as its performance scales very well in terms of the number of features, as well as the number of data points.

[1] Schölkopf B, Smola A, Müller K R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998, 10(5): 1299-1319.

[2] Kira K, Rendell L A. A practical approach to feature selection. In Proc. the 9th ICML, July 1992, pp.249-256.

[3] Liu H, Motoda H. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.

[4] Kohavi R, John G H. Wrappers for feature subset selection. Artificial Intelligence, 1997, 97(1/2): 273-324.

[5] Dash M, Liu H. Feature selection for classification. Intelligent Data Analysis, 1997, 1(1/4): 131-156.

[6] Guyon I, Elisseeff A. An introduction to variable and feature selection. J. Machine Learning Research, 2003, 3: 1157-1182.

[7] Nguyen H V, Gopalkrishnan V. Feature extraction for outlier detection in high-dimensional spaces. In Proc. the 4th Int. Workshop. Feature Selection in Data Mining, June 2010, pp.66-75.

[8] Azmandian F, Yilmazer A, Dy J et al. GPU-accelerated feature selection for outlier detection using the local kernel density ratio. In Proc. the 12th ICDM, December 2012, pp.51-60.

[9] Tibshirani R. Regression shrinkage and selection via the lasso. J. Royal Statistical Society, Series B, 1996, 58(1): 267-288.

[10] Song L, Bedo J, Borgwardt K M et al. Gene selection via the BAHSIC family of algorithms. Bioinformatics, 2007, 23(3): i490-i498.

[11] Wu X, Yu K, Ding W et al. Online feature selection with streaming features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(5): 1178-1192.

[12] Chen X W, Wasikowski M. FAST: A ROC-based feature selection metric for small samples and imbalanced data classification problems. In Proc. the 14th KDD, August 2008, pp.124-132.

[13] Aggarwal C, Yu S. An effective and effcient algorithm for high-dimensional outlier detection. The VLDB Journal, 2005, 14(2): 211-221.

[14] de Vries T, Chawla S, Houle M E. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems, 2012, 32(1): 25-52.

[15] Branch J W, Giannella C, Szymanski B K et al. In-network outlier detection in wireless sensor networks. Knowledge and Information Systems, 2013, 34(1): 23-54.

[16] Hido S, Tsuboi Y, Kashima H et al. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 2011, 26(2): 309-336.

[17] Sugiyama M, Yamada M, von Bünau P et al. Direct densityratio estimation with dimensionality reduction via leastsquares hetero-distributional subspace search. Neural Networks, 2011, 24(2): 183-198.

[18] Smola A, Song L, Teo C H. Relative novelty detection. In Proc. the 12th AISTATS, April 2009, pp.536-543.

[19] Hawkins D M. Identification of Outliers. London, New York: Chapman and Hall, 1980.

[20] Breunig M M, Kriegel H P, Ng R T et al. LOF: Identifying density-based local outliers. ACM SIGMOD Record, 2000, 29(2): 93-104.

[21] Horn R A, Johnson C R. Matrix Analysis. Cambridge, New York: Cambridge University Press, 1985.

[22] Balcan M F, Blum A. On a theory of learning with similarity functions. In Proc. the 23rd International Conference on Machine Learning (ICML), June 2006, pp.73-80.

[23] Parzen E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 1962, 33(3): 1065-1076.

[24] Rosenblatt M. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 1956, 27(3): 832-837.

[25] Devijver P A, Kittler J. Pattern Recognition: A Statistical Approach. Englewood Cliffs, NJ: Prentice Hall, 1982.

[26] Masaeli M, Fung G, Dy J G. From transformation-based dimensionality reduction to feature selection. In Proc. the 27th ICML, June 2010, pp.751-758.

[27] Kirk D B, Hwu W W. Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series). Morgan Kaufmann Publishers, 2010.

[28] Lv Q, Josephson W,Wang Z et al. Multi-probe LSH: Effcient indexing for high-dimensional similarity search. In Proc. the 33rd VLDB, Sept. 2007, pp.950-961.

[29] Arya S, Mount D M, Netanyahu N S et al. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM, 1998, 45(6): 891-923.

[30] Garcia V, Debreuve E, Barlaud M. Fast k nearest neighbor search using GPU. In Proc. Workshop on Computer Vision and Pattern Recognition, June 2008.

[31] Azmandian F, Dy J G, Aslam J A et al. Local kernel density ratio-based feature selection for outlier detection. In Proc. the 4th Asian Conf. Machine Learning, Nov. 2012, pp.49-64.

[32] Güvenir H A, Acar B, Demiröz G et al. A supervised machine learning algorithm for arrhythmia analysis. In Proc. Computers in Cardiology Conference, September 1998, pp.433-436.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 刘明业; 洪恩宇;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] 陈世华;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] 高庆狮; 张祥; 杨树范; 陈树清;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] 闵应骅; 韩智德;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] 闵应骅;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] 朱鸿;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] 李明慧;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: