EntityManager: Managing Dirty Data Based on Entity Resolution
Journal of Computer Science and Technology
Journal of Computer Science and Technology, 2017, Vol. 32, Issue 3: 644-661    DOI: 10.1007/s11390-017-1731-1
Regular Paper
EntityManager: Managing Dirty Data Based on Entity Resolution
Xue-Li Liu, Hong-Zhi Wang*, Member, CCF, Jian-Zhong Li, Fellow, CCF, Hong Gao, Senior Member, CCF
Massive Data Computing Laboratory, Harbin Institute of Technology, Harbin 150001, China

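Abstract (Chinese version, translated): Data quality is critical in data-driven applications such as data mining, data analysis, and decision systems. Current work mainly improves data quality through data cleaning, which may lose useful information and introduce new errors. In view of this, we designed EntityManager, a system for managing dirty data. Rather than cleaning the data directly, the system organizes the results of entity resolution and stores and processes data with the entity as the unit. Each attribute of an entity is uncertain and retains every possibly conflicting value. During query processing, the system returns query results according to the quality level the user requires. This paper gives an overview of EntityManager, including the main purpose of its development, how it stores and processes data, the practical requirements, and the challenges it faces. It also outlines the main framework, the data model, the new query processing techniques, and the optimization techniques. Finally, experiments verify the effectiveness and efficiency of the system.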
Abstract: Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system that manages dirty data without data cleaning. The system takes the real-world entity as the basic storage unit and retrieves query results according to the quality requirements of users. It is able to handle all kinds of inconsistencies recognized by entity resolution. We describe the EntityManager system in detail, covering its architecture, data model, and query processing techniques. To process queries efficiently, the system adopts novel indices, a similarity operator, and query optimization techniques. Finally, we verify the efficiency and effectiveness of the system and present future research challenges.
Keywords: dirty data, entity resolution, uncertain attribute, query processing, query optimization
Received: 2016-02-29
Funding:

This work was partially supported by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH10F01, the National Natural Science Foundation of China under Grant Nos. U1509216, 61472099, and 61133002, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province of China under Grant No. LC2016026, and the Ministry of Education (MOE)-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.

Corresponding author: Hong-Zhi Wang     Email: wangzh@hit.edu.cn
About the author: Xue-Li Liu is a Ph.D. candidate in computer technology and science at Harbin Institute of Technology, Harbin. Her research interests include data quality and massive data management.
Cite this article:
Xue-Li Liu, Hong-Zhi Wang, Jian-Zhong Li, Hong Gao. EntityManager: Managing Dirty Data Based on Entity Resolution[J]. Journal of Computer Science and Technology, 2017, V32(3): 644-661
Link to this article:
http://jcst.ict.ac.cn:8080/jcst/CN/10.1007/s11390-017-1731-1