2016, Vol. 31, Issue (4): 673-682. doi: 10.1007/s11390-016-1656-0
Special Topic: Data Management and Data Mining
• Special Section on Selected Paper from NPC 2011 •
Chao Han, Yun-Kun Tan, Jin-Hui Zhu, Member, IEEE, Yong Guo, Jian Chen*, Member, CCF, ACM, IEEE, and Qing-Yao Wu*, Member, CCF, IEEE
In many practical applications of machine learning, datasets suffer from the class imbalance problem: the number of samples in the majority (or positive) class far exceeds the number in the minority (or negative) class. Many techniques have been proposed to address class imbalance; for high-dimensional class-imbalanced datasets in particular, feature selection can improve classifier performance and reduce computational cost. However, most research on feature selection for class-imbalanced data is restricted to batch learning, which does not suit all real-world applications, such as online learning. In this paper, we aim to use feature selection to improve the performance of online classifiers on class-imbalanced data, that is, to classify high-dimensional online class-imbalanced data accurately and efficiently using only a small subset of features. We propose two online learning algorithms that allow an online classifier to select a small number of features from a class-imbalanced dataset and classify a sequence of class-imbalanced online instances. Both algorithms are formulated as optimization problems and solved iteratively, using the Passive-Aggressive (PA) algorithm and the Truncated Gradient (TG) algorithm. We provide two strategies for modifying the PA algorithm to improve its classification performance on class-imbalanced data: one modifies PA's loss function to set different margin thresholds for majority-class and minority-class samples; the other synthesizes additional minority-class samples. We evaluate the proposed algorithms on several real-world datasets, and the experimental results demonstrate that they effectively solve the online feature selection problem on class-imbalanced data.
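The approach the abstract describes, a Passive-Aggressive update with class-dependent margin thresholds followed by truncated-gradient shrinkage to keep the active feature set small, can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' exact formulation: `cs_pa_tg`, `truncate`, and the default values of `rho_min`, `rho_maj`, `alpha`, and `theta` are all assumptions made for the sketch.

```python
import numpy as np

def truncate(w, alpha, theta):
    """Truncated-gradient shrinkage in the style of Langford et al.:
    weights whose magnitude is at most theta are pulled toward zero
    by alpha; larger weights are left untouched."""
    out = w.copy()
    small_pos = (w > 0) & (w <= theta)
    small_neg = (w < 0) & (w >= -theta)
    out[small_pos] = np.maximum(0.0, w[small_pos] - alpha)
    out[small_neg] = np.minimum(0.0, w[small_neg] + alpha)
    return out

def cs_pa_tg(stream, n_features, C=1.0, rho_min=1.0, rho_maj=0.5,
             alpha=0.01, theta=0.1):
    """Cost-sensitive PA-I with per-class margin thresholds plus
    truncated-gradient sparsification. rho_min > rho_maj demands a
    wider margin on minority examples (labeled -1 here); the names
    and default values are illustrative, not the paper's."""
    w = np.zeros(n_features)
    for x, y in stream:                        # y in {+1, -1}
        rho = rho_min if y == -1 else rho_maj
        loss = max(0.0, rho - y * np.dot(w, x))  # class-dependent hinge loss
        if loss > 0.0:
            tau = min(C, loss / np.dot(x, x))    # PA-I step size
            w = w + tau * y * x
        w = truncate(w, alpha, theta)          # keep few features active
    return w
```

Because `rho` is chosen per example, the rarer class can be forced to a wider margin, which is one of the two strategies the abstract mentions; the second strategy, synthesizing extra minority-class samples in the spirit of SMOTE, is omitted from this sketch.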