Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (4): 806-821. doi: 10.1007/s11390-021-1344-6

Special Issue: Data Management and Data Mining

• Special Section on AI4DB and DB4AI •

Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation

Zhi-Xin Qi, Hong-Zhi Wang*, Distinguished Member, CCF, Member, ACM, IEEE, and An-Jie Wang        

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Received: 2021-01-31 Revised: 2021-06-27 Online: 2021-07-05 Published: 2021-07-30
  Contact: Hong-Zhi Wang
  • About author: Zhi-Xin Qi is a Ph.D. candidate in the School of Computer Science and Technology, Harbin Institute of Technology, Harbin. She received her B.S. degree in information security from Harbin Engineering University, Harbin, in 2016, and her M.S. degree in computer technology from Harbin Institute of Technology, Harbin, in 2018. Her research interests include databases, graph data management, and knowledge graphs.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China under Grant Nos. U1866602 and 71773025, the CCF-Huawei Database System Innovation Research Plan under Grant No. CCF-HuaweiDBIR2020007B, and the National Key Research and Development Program of China under Grant No. 2020YFB1006104.

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results can guide both the selection of an appropriate model given the quality of the data and the determination of the share of data to clean. However, little research has explored this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering models. From the experimental results, we observe that the impacts of dirty data are related to the error type, the error rate, and the data size. Based on these findings, we suggest that users leverage our proposed metrics, sensibility and data quality inflection point, for model selection and data cleaning.

Key words: data quality; classification; clustering; model selection; data cleaning

