Journal of Computer Science and Technology

   

Random Subspace Sampling for Classification with Missing Data

Yun-Hao Cao1 (曹云浩) and Jian-Xin Wu1,∗ (吴建鑫), Member, CCF, IEEE

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
  • Received: 2021-05-26 Revised: 2023-02-01 Accepted: 2023-02-04
  • Contact: Jian-Xin Wu E-mail: wujx2001@nju.edu.cn
  • About author: Jian-Xin Wu is currently a professor in the School of Artificial Intelligence at Nanjing University, Nanjing, and is associated with the State Key Laboratory for Novel Software Technology, China. He received his BS and MS degrees from Nanjing University, and his PhD degree from the Georgia Institute of Technology, all in computer science. He has served as a (senior) area chair for CVPR, ICCV, ECCV, AAAI and IJCAI, and as an associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence. His research interests are computer vision and machine learning.

Many real-world datasets contain missing values, so classification with missing data must be handled carefully: inadequate treatment of missing values can cause large errors. In this paper, we propose a random subspace sampling (RSS) method that fills missing items by sampling from the corresponding feature histogram distributions in random subspaces, and that is effective and efficient at different levels of missing data. Unlike most established approaches, RSS does not train on fixed imputed datasets. Instead, we design a dynamic training strategy in which the filled values change through resampling during training. Moreover, thanks to the sampling strategy, we design an ensemble testing strategy that combines the results of multiple runs of a single model, which is more efficient and resource-saving than previous ensemble methods. Finally, we combine these two strategies with the random subspace method, which makes our estimations more robust and accurate. The effectiveness of the proposed method is validated by experimental studies.
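The histogram sampling and dynamic refilling described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `fit_histograms` and `sample_fill`, the bin count, the uniform draw inside a bin, and the use of NaN to mark missing items are all assumptions made for illustration. The sketch also assumes every feature has at least one observed value.

```python
import numpy as np

def fit_histograms(X, bins=10):
    """Build one histogram per feature from its observed (non-NaN) values."""
    hists = []
    for j in range(X.shape[1]):
        observed = X[~np.isnan(X[:, j]), j]
        counts, edges = np.histogram(observed, bins=bins)
        hists.append((counts / counts.sum(), edges))  # bin probabilities, bin edges
    return hists

def sample_fill(X, hists, rng):
    """Return a copy of X with each NaN replaced by a draw from its feature histogram."""
    X_filled = X.copy()
    for j, (probs, edges) in enumerate(hists):
        missing = np.isnan(X[:, j])
        n = int(missing.sum())
        if n == 0:
            continue
        idx = rng.choice(len(probs), size=n, p=probs)  # pick a bin per missing item
        # Assumption: draw uniformly inside the chosen bin.
        X_filled[missing, j] = rng.uniform(edges[idx], edges[idx + 1])
    return X_filled

rng = np.random.default_rng(0)
X = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 7.0], [4.0, 6.0]])
hists = fit_histograms(X, bins=3)
for epoch in range(3):
    # Dynamic training idea: the filled values are resampled in every epoch,
    # so the learner never sees one fixed imputed dataset.
    X_epoch = sample_fill(X, hists, rng)
```

Because the histograms are computed once and each fill is a single draw, refilling per epoch costs little compared with iterative imputation, which is the efficiency argument the abstract makes.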


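Extended Abstract (translated from Chinese)

1. Research Background
Classification is one of the most important tasks in machine learning and data mining. Many algorithms exist for classification, but most of them require complete data and cannot be applied directly to data with missing values. Even for algorithms that can handle incomplete data, missing values often lead to large classification errors. Missing values nevertheless occur frequently in the real world, so handling them correctly is a crucial problem. At the algorithm level, single imputation methods such as mean imputation are usually efficient but not accurate enough. By contrast, multiple imputation methods create several imputed datasets to better reflect the uncertainty of incomplete data; they are usually more accurate but computationally expensive. Designing a way to combine classification algorithms with imputation that is both effective and efficient remains a challenge. In terms of the degree of missingness, existing methods for classification with missing data usually do not work well on datasets with a large number of missing values, so designing algorithms that cope with many missing values also remains a challenge.
2. Objective
We aim to combine the strengths of multiple imputation and ensemble learning to design an efficient method for handling missing values that copes effectively with classification datasets at different levels of missingness.
3. Methods
We propose a random subspace sampling (RSS) method for classification with missing data. The method first constructs different random subspaces and their corresponding base learners. Then, for each missing item in each random subspace, we fill it by sampling directly from the corresponding feature histogram distribution. In the training stage, we design a dynamic training strategy in which we resample and probabilistically change the filled value of each missing item. In the inference stage, we design an ensemble testing strategy in which we combine the results of multiple runs of a single model, which is both efficient and effective. Compared with multiple imputation methods, which need iterative steps to estimate the imputed values, our direct sampling is more efficient. In addition, the dynamic training strategy distinguishes our method from most methods, which train on fixed data.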
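The random subspace and ensemble testing steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `random_subspaces`, `ensemble_predict`, `toy_fill` and `toy_predict` are hypothetical names, the fill and predictor are simple stand-ins, and averaging class probabilities is one plausible way to combine runs. How the paper shares parameters among base learners is not stated here, so the sketch simply takes one prediction callable per subspace.

```python
import numpy as np

def random_subspaces(n_features, n_subspaces, subspace_size, rng):
    """Draw feature-index subsets, each sampled without replacement."""
    return [np.sort(rng.choice(n_features, size=subspace_size, replace=False))
            for _ in range(n_subspaces)]

def ensemble_predict(X, subspaces, fill_fn, predict_fns, n_runs, rng):
    """Average class probabilities over subspaces and over repeated resampled fills."""
    total = None
    for feats, predict in zip(subspaces, predict_fns):
        for _ in range(n_runs):
            X_sub = fill_fn(X[:, feats], rng)  # fresh fill of missing items per run
            probs = predict(X_sub)
            total = probs if total is None else total + probs
    return total / (len(subspaces) * n_runs)

def toy_fill(X_sub, rng):
    """Stand-in fill: draw each missing entry from the column's observed values."""
    out = X_sub.copy()
    for j in range(out.shape[1]):
        miss = np.isnan(out[:, j])
        if miss.any():
            out[miss, j] = rng.choice(out[~miss, j], size=int(miss.sum()))
    return out

def toy_predict(X_sub):
    """Stand-in base learner: logistic score of the row mean, two classes."""
    p = 1.0 / (1.0 + np.exp(-X_sub.mean(axis=1)))
    return np.column_stack([1.0 - p, p])

rng = np.random.default_rng(1)
X = np.array([[0.5, np.nan, 1.0, -1.0],
              [np.nan, 0.2, -0.3, 0.4],
              [1.5, -0.7, np.nan, 0.0]])
subspaces = random_subspaces(n_features=4, n_subspaces=3, subspace_size=2, rng=rng)
probs = ensemble_predict(X, subspaces, toy_fill, [toy_predict] * 3, n_runs=5, rng=rng)
```

Since each run only refills the missing items and repeats a forward pass, the ensemble comes from repeated inference rather than from training several models.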
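4. Results
Experimental results validate the effectiveness of the proposed method. We ran experiments on 6 incomplete datasets with inherent missing values and on 7 complete datasets into which we introduced 4 levels of missingness; our method performs significantly better than comparable methods. Notably, its advantage grows further as the level of missingness increases.
5. Conclusions
We propose a random subspace sampling method for classification with missing data. Unlike most existing methods, ours does not train on a fixed dataset. Instead, we adopt a novel dynamic training strategy that fills each missing value dynamically by resampling in every round of training. We do not need to train multiple models for the ensemble; we design an efficient ensemble testing strategy in which running a single model multiple times achieves the ensemble effect. In addition, the random subspace method lets us use multiple values for each missing feature in different random subspaces, which better reflects uncertainty and yields more robust estimates. We conducted experiments on incomplete and complete datasets at different levels of missing values. The results show that our method outperforms the compared methods. In future work, we will study our method further from a theoretical perspective.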

Key words: missing data; random subspace; neural networks; ensemble learning


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP
