Abstract: Missing value imputation with crowdsourcing is a novel data cleaning method for capturing missing values that can hardly be filled by automatic approaches. However, the time cost and overhead of crowdsourcing are high, so we must reduce cost while guaranteeing the accuracy of crowdsourced imputation. To achieve this optimization goal, we present COSSET+, a crowdsourced framework optimized by a knowledge base. We combine the advantages of a knowledge-based filter and a crowdsourcing platform to capture missing values. Since the number of crowdsourced values affects the cost of COSSET+, we aim to select only a subset of the missing values to be crowdsourced. We prove that the crowd value selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approaches.
This work was supported by the National Natural Science Foundation of China under Grant Nos. U1509216 and 61472099, the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH10F01, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province of China under Grant No. LC2016026, and the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech of Harbin Institute of Technology.
About author: Hong-Zhi Wang is a professor and doctoral supervisor at Harbin Institute of Technology, Harbin. He received his Ph.D. degree in computer science and technology from Harbin Institute of Technology, Harbin, in 2008. He was awarded the Microsoft Fellowship, Chinese Excellent Database Engineer, and the IBM Ph.D. Fellowship. His research interests include big data management, data quality, and graph data management.
Cite this article:
Hong-Zhi Wang, Zhi-Xin Qi, Ruo-Xi Shi, Jian-Zhong Li, Hong Gao. COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base[J]. Journal of Computer Science and Technology, 2017, 32(5): 845-857.