2016, Vol. 31, Issue 4: 673-682. doi: 10.1007/s11390-016-1656-0

Special Topic: Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 •


Online Feature Selection of Class Imbalance via PA Algorithm

Chao Han, Yun-Kun Tan, Jin-Hui Zhu, Member, IEEE, Yong Guo, Jian Chen*, Member, CCF, ACM, IEEE, and Qing-Yao Wu*, Member, CCF, IEEE   

  1. School of Software Engineering, South China University of Technology, Guangzhou 510000, China
  • Received: 2016-03-06; Revised: 2016-05-18; Online: 2016-07-05; Published: 2016-07-05
  • Contact: Jian Chen, Qing-Yao Wu. E-mail: ellachen@scut.edu.cn; qyw@scut.edu.cn
  • About author: Chao Han received her B.S. degree in computer science and technology from Yunnan University, Kunming, in 2013, and enrolled as a Ph.D. student in the School of Software Engineering at the South China University of Technology, Guangzhou, in the same year. Her research interests mainly include data mining, machine learning, social networks, and big data processing.
  • Supported by:

    This research was supported by the Guangzhou Key Laboratory of Robotics and Intelligent Software under Grant No. 15180007, the Fundamental Research Funds for the Central Universities of China under Grant Nos. D215048w and 2015ZZ029, and the National Natural Science Foundation of China under Grant Nos. 61005061 and 61502177.

In many practical application domains of machine learning, datasets suffer from the class imbalance problem: the number of examples in the majority (or positive) class is far larger than that in the minority (or negative) class. Among the many techniques for handling class imbalance, feature selection can improve classifier performance and reduce computational cost on high-dimensional imbalanced datasets. However, most studies of feature selection and class imbalance are restricted to batch learning, which does not suit all practical applications, such as online learning. In this paper, we aim to use feature selection to improve the performance of online classifiers under class imbalance, that is, to classify high-dimensional online imbalanced data accurately and efficiently using only a small subset of features. We propose two online learning algorithms that enable an online classifier to select a small subset of features to classify a sequence of imbalanced data arriving online. Both algorithms formulate the learning task as an optimization problem solved iteratively using the passive-aggressive (PA) algorithm and the truncated gradient (TG) method. We provide two strategies to modify the PA algorithm and improve its classification performance on imbalanced datasets: one modifies the PA loss function to set different margin thresholds for majority-class and minority-class examples; the other synthesizes additional minority-class examples. We evaluate the algorithms on several real-world datasets, and the experimental results demonstrate that the proposed algorithms effectively solve the online feature selection problem under class imbalance.
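The first strategy above (class-dependent margin thresholds in the PA loss) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation; the parameter names and margin values (`C`, `margin_minority`, `margin_majority`) are our own assumptions, and we follow the paper's wording in treating the minority class as the negative one.

```python
import numpy as np

def pa_imbalanced_update(w, x, y, C=1.0, margin_minority=2.0, margin_majority=1.0):
    """One PA-I style update with class-dependent margin thresholds.

    A larger margin is demanded for the minority class (labeled -1 here),
    so minority examples trigger larger corrective updates.
    """
    margin = margin_minority if y < 0 else margin_majority
    loss = max(0.0, margin - y * np.dot(w, x))   # hinge loss with class-dependent margin
    if loss > 0.0:
        tau = min(C, loss / np.dot(x, x))        # PA-I step size, capped by C
        w = w + tau * y * x                      # aggressive correction
    return w
```

With a zero initial weight vector, a minority example is pushed to score twice as far past the decision boundary as a majority example, which is the intended asymmetry.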

Abstract: Imbalanced classification arises in many machine learning application domains where the number of examples in the majority (or positive) class of a dataset is much larger than that in the minority (or negative) class. Meanwhile, feature selection (FS) is a key technique for high-dimensional classification tasks, as it can greatly improve both classification performance and computational efficiency. However, most studies of feature selection and imbalanced classification are restricted to offline batch learning, which is not well suited to some practical scenarios. In this paper, we aim to solve the high-dimensional imbalanced classification problem accurately and efficiently with only a small number of active features in an online fashion, and we propose two novel online learning algorithms for this purpose. In our approach, a classifier that involves only a small and fixed number of features is constructed to classify a sequence of imbalanced data received in an online manner. We formulate the construction of such an online learner as an optimization problem and solve it iteratively based on the passive-aggressive (PA) algorithm together with a truncated gradient (TG) method. We evaluate the proposed algorithms on several real-world datasets, and the experimental results demonstrate their effectiveness in comparison with the baselines.
