Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (4): 806-821. doi: 10.1007/s11390-021-1344-6

Special Issue: Data Management and Data Mining

• Special Section on AI4DB and DB4AI •

Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation

Zhi-Xin Qi, Hong-Zhi Wang*, Distinguished Member, CCF, Member, ACM, IEEE, and An-Jie Wang        

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Received: 2021-01-31 Revised: 2021-06-27 Online: 2021-07-05 Published: 2021-07-30
  • Contact: Hong-Zhi Wang
  • About author: Zhi-Xin Qi is a Ph.D. candidate in the School of Computer Science and Technology, Harbin Institute of Technology, Harbin. She received her B.S. degree in information security from Harbin Engineering University, Harbin, in 2016, and her M.S. degree in computer technology from Harbin Institute of Technology, Harbin, in 2018. Her research interests include databases, graph data management, and knowledge graphs.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China under Grant Nos. U1866602 and 71773025, the CCF-Huawei Database System Innovation Research Plan under Grant No. CCF-HuaweiDBIR2020007B, and the National Key Research and Development Program of China under Grant No. 2020YFB1006104.

Data quality issues have attracted widespread attention because dirty data degrades data mining and machine learning results. Understanding the relationship between data quality and result accuracy can guide both the selection of an appropriate model for data of a given quality and the determination of the share of data to clean. However, little research has explored this relationship. Motivated by this, this paper presents an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering models. The experimental results show that the impact of dirty data depends on the error type, the error rate, and the data size. Based on these findings, we suggest that users leverage our proposed metrics, sensibility and data quality inflection point, for model selection and data cleaning.
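
As a rough illustration of the kind of experiment the abstract describes, the following sketch injects missing values into a synthetic dataset at increasing error rates and records the accuracy of a simple nearest-centroid classifier. The dataset, the classifier, every function name, and the choice to inject errors into both the training and test sets are assumptions made for illustration only. They are not the paper's datasets, models, or metrics, and the sketch does not compute sensibility or the data quality inflection point.

```python
import random


def make_data(n, seed=0):
    """Synthetic data (assumption): two 2-D Gaussian classes centred at (0, 0) and (3, 3)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels so the classes are balanced
        centre = 3.0 * label
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows


def inject_missing(rows, rate, seed=1):
    """Return a copy in which each feature value becomes None with probability `rate`."""
    rng = random.Random(seed)
    return [([None if rng.random() < rate else v for v in x], y) for x, y in rows]


def fit_centroids(rows):
    """Per-class mean of each feature, ignoring missing values."""
    centroids = {}
    for label in {y for _, y in rows}:
        means = []
        for j in range(2):
            vals = [x[j] for x, y in rows if y == label and x[j] is not None]
            # Fall back to 0.0 when a feature has no observed value for this class.
            means.append(sum(vals) / len(vals) if vals else 0.0)
        centroids[label] = means
    return centroids


def predict(centroids, x):
    """Nearest centroid, measured over the observed features only."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c) if a is not None)
    # Ties (e.g. a row with no observed features) resolve to the smallest label.
    return min(sorted(centroids), key=lambda label: dist(centroids[label]))


def accuracy(train, test):
    """Fraction of test rows whose predicted label matches the true label."""
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)


def error_rate_curve(rates, n_train=400, n_test=200):
    """Map each missing-value rate to the accuracy obtained at that rate."""
    train = make_data(n_train, seed=0)
    test = make_data(n_test, seed=42)
    return {r: accuracy(inject_missing(train, r), inject_missing(test, r)) for r in rates}


if __name__ == "__main__":
    for rate, acc in error_rate_curve([0.0, 0.1, 0.3, 0.5, 0.7, 0.9]).items():
        print(f"missing rate {rate:.1f}: accuracy {acc:.3f}")
```

Running the script prints one accuracy value per error rate. Repeating the run with a different model or a different `n_train` gives a simple way to see how error rate and data size interact, which are the factors the abstract names.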

Key words: data quality; classification; clustering; model selection; data cleaning


ISSN 1000-9000(Print)

CN 11-2296/TP
