Ding Y, Guo YH, Lu W et al. Context-aware semantic type identification for relational attributes. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 38(4): 927−946 July 2023. DOI: 10.1007/s11390-021-1048-y.

Context-Aware Semantic Type Identification for Relational Attributes

Funds: The work was supported by the National Key Research and Development Program of China under Grant No. 2020YFB2104100, the National Natural Science Foundation of China under Grant Nos. 61972403 and U1711261, the Fundamental Research Funds for the Central Universities of China, the Research Funds of Renmin University of China, and Tencent Rhino-Bird Joint Research Program.
  • Author Bio:

    Yue Ding received her B.S. degree in Internet of Things from Nanjing University of Posts and Telecommunications, Nanjing, in 2018. She is currently pursuing her M.S. degree in the Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education and School of Information at Renmin University of China, Beijing. Her research interests include data integration and machine learning.

    Yu-He Guo received her B.S. degree in computer science from Renmin University of China, Beijing, in 2018. She is currently pursuing her M.S. degree in the Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education and School of Information at Renmin University of China, Beijing. Her research interests lie in natural language processing and data integration.

    Wei Lu is currently an associate professor in the Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education and School of Information at Renmin University of China, Beijing. He received his Ph.D. degree in computer applied technology from Renmin University of China, Beijing, in 2011. His research interests include query processing in the context of spatiotemporal, cloud database systems and applications. He is a member of CCF.

    Hai-Xiang Li is currently a senior expert at Tencent (Beijing) Technology Company Limited, Beijing. His research interests include transaction processing, query optimization, distributed consistency, high availability, database system architecture, cloud databases, and distributed database systems. He is a member of CCF.

    Mei-Hui Zhang received her Ph.D. degree in computer science from National University of Singapore, Singapore, in 2013. She is currently a professor with Beijing Institute of Technology, Beijing, and was an assistant professor with Singapore University of Technology and Design, Singapore, from 2014 to 2017. Her research interests include big data management and analytics, large-scale data integration, modern database systems, blockchain, and AI. She has served as PC Vice-Chair of ICDE 2018 and associate editor of VLDB 2018, VLDB 2019, VLDB 2020, and SIGMOD 2021. She is a winner of the VLDB 2020 Early Career Research Contribution Award. She is a member of CCF, ACM, and IEEE.

    Hui Li is currently a professor in the College of Computer Science and Technology at Guizhou University, Guiyang. He received his Ph.D. degree in computer software and theory from Renmin University of China, Beijing, in 2012. His research interests include large-scale data analytics, high-performance database systems, and data-driven intelligent applications. He is a member of CCF, ACM, and IEEE.

    An-Qun Pan is a technical director at Tencent (Shenzhen) Technology Company Limited, Shenzhen, with more than 15 years of experience in the research and development of distributed computing and storage systems. He is currently responsible for the research and development of the distributed database system TDSQL. He is a member of CCF.

    Xiao-Yong Du is a professor in the Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education and School of Information at Renmin University of China, Beijing. He received his Ph.D. degree in computer science from Nagoya Institute of Technology, Nagoya, in 1997. His research focuses on intelligent information retrieval, high-performance databases, and unstructured data management. He is a fellow of CCF.

  • Corresponding author:

    lu-wei@ruc.edu.cn

  • Received Date: October 04, 2020
  • Accepted Date: June 08, 2021
  • Abstract: Identifying semantic types for attributes in relations, known as attribute semantic type (AST) identification, plays an important role in many data analysis tasks, such as data cleaning, schema matching, and keyword search in databases. However, due to the lack of unified naming standards across prevalent information systems (a.k.a. information islands), AST identification remains an open problem. To tackle this problem, we propose a context-aware method to identify the ASTs for relations. We transform AST identification into a multi-class classification problem and propose a schema context aware (SCA) model to learn the representation from a collection of relations associated with attribute values and schema context. Based on the learned representation, we predict the AST for a given attribute from an underlying relation, wherein the predicted AST is mapped to one of the labeled ASTs. To improve the performance of AST identification, especially when the predicted semantic types of attributes are not included in the labeled ASTs, we then introduce knowledge base embeddings (a.k.a. KBVec) to enhance the above representation and construct a knowledge-base-enhanced schema context aware model (SCA-KB) to obtain a stable and robust model. Extensive experiments on real datasets demonstrate that our context-aware method outperforms the state-of-the-art approaches by a large margin, up to 6.14% and 25.17% in terms of macro average F1 score, and up to 0.28% and 9.56% in terms of weighted F1 score over high-quality and low-quality datasets, respectively.
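
    The abstract reports both macro average and weighted F1 scores for a multi-class task. The following is a minimal sketch of how these two averages are commonly computed and why they can differ. It is not the authors' evaluation code; the function name `f1_scores` and the example labels ("city", "name") are hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for multi-class labels.

    Illustrative helper only; names and labels are assumptions,
    not taken from the paper.
    """
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true instances per class
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Macro average: every class counts equally, regardless of frequency.
    macro = sum(per_class.values()) / len(labels)
    # Weighted average: each class is weighted by its share of true instances.
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


# Hypothetical semantic-type labels for four attributes.
y_true = ["city", "city", "city", "name"]
y_pred = ["city", "city", "name", "name"]
macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro_f1:.4f}, weighted F1 = {weighted_f1:.4f}")
```

    Because the macro average gives rare classes the same weight as frequent ones, it tends to move more than the weighted average when performance on infrequent semantic types changes.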

