计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (4): 762-777.doi: 10.1007/s11390-021-1351-7

所属专题: Data Management and Data Mining

• • 上一篇    下一篇

基数估计器:利用垂直扫描卷积神经网络处理SQL

Shao-Jie Qiao1, Senior Member, CCF, Guo-Ping Yang1, Nan Han2,*, Hao Chen3, Fa-Liang Huang4, Member, CCF, Kun Yue5, Member, CCF, Yu-Gen Yi6, Member, CCF, IEEE, and Chang-An Yuan7, Member, CCF   

  1. 1 School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China;
    2 School of Management, Chengdu University of Information Technology, Chengdu 610225, China;
    3 Beijing Huawei Digital Technologies Co., Ltd., Beijing 100085, China;
    4 School of Computer and Information Engineering, Nanning Normal University, Nanning 530299, China;
    5 School of Information Science and Engineering, Yunnan University, Kunming 650500, China;
    6 School of Software, Jiangxi Normal University, Nanchang 330022, China;
    7 Guangxi College of Education, Nanning 530007, China
  • 收稿日期:2021-02-01 修回日期:2021-07-01 出版日期:2021-07-05 发布日期:2021-07-30
  • 通讯作者: Nan Han E-mail:hannan@cuit.edu.cn
  • 作者简介:Shao-Jie Qiao received his B.S. and Ph.D. degrees in computer application technology from Sichuan University, Chengdu, in 2004 and 2009, respectively. From 2007 to 2008, he worked as a visiting scholar in the School of Computing at the National University of Singapore, Singapore. He is a distinguished young scholar of Sichuan Province of China. He is currently a professor with the School of Software Engineering, Chengdu University of Information Technology, Chengdu. He has authored more than 150 high quality papers, including IEEE Transactions on ITS, IEEE TKDE, IEEE TNNLS, IEEE TCSVT, IEEE Transactions on SMC:Systems, ACM TKDD, ACM TIST, etc.. His research interests include databases, data mining and artificial intelligence. He is a senior member of CCF.
  • 基金资助:
    The work was supported in part by the CCF-Huawei Database System Innovation Research Plan under Grant No. CCFHuaweiDBIR2020004A, the National Natural Science Foundation of China under Grant Nos. 61772091, 61802035, 61962006 and 61962038, the Sichuan Science and Technology Program under Grant Nos. 2021JDJQ0021 and 2020YJ0481, and the Digital Media Art, Key Laboratory of Sichuan Province, Sichuan Conservatory of Music, Chengdu, China under Grant No. 21DMAKL02.

Cardinality Estimator: Processing SQL with a Vertical Scanning Convolutional Neural Network

Shao-Jie Qiao1, Senior Member, CCF, Guo-Ping Yang1, Nan Han2,*, Hao Chen3, Fa-Liang Huang4, Member, CCF, Kun Yue5, Member, CCF, Yu-Gen Yi6, Member, CCF, IEEE, and Chang-An Yuan7, Member, CCF        

  1. 1 School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China;
    2 School of Management, Chengdu University of Information Technology, Chengdu 610225, China;
    3 Beijing Huawei Digital Technologies Co., Ltd., Beijing 100085, China;
    4 School of Computer and Information Engineering, Nanning Normal University, Nanning 530299, China;
    5 School of Information Science and Engineering, Yunnan University, Kunming 650500, China;
    6 School of Software, Jiangxi Normal University, Nanchang 330022, China;
    7 Guangxi College of Education, Nanning 530007, China
  • Received:2021-02-01 Revised:2021-07-01 Online:2021-07-05 Published:2021-07-30
  • Contact: Nan Han E-mail:hannan@cuit.edu.cn
  • About author:Shao-Jie Qiao received his B.S. and Ph.D. degrees in computer application technology from Sichuan University, Chengdu, in 2004 and 2009, respectively. From 2007 to 2008, he worked as a visiting scholar in the School of Computing at the National University of Singapore, Singapore. He is a distinguished young scholar of Sichuan Province of China. He is currently a professor with the School of Software Engineering, Chengdu University of Information Technology, Chengdu. He has authored more than 150 high quality papers, including IEEE Transactions on ITS, IEEE TKDE, IEEE TNNLS, IEEE TCSVT, IEEE Transactions on SMC:Systems, ACM TKDD, ACM TIST, etc.. His research interests include databases, data mining and artificial intelligence. He is a senior member of CCF.
  • Supported by:
    The work was supported in part by the CCF-Huawei Database System Innovation Research Plan under Grant No. CCFHuaweiDBIR2020004A, the National Natural Science Foundation of China under Grant Nos. 61772091, 61802035, 61962006 and 61962038, the Sichuan Science and Technology Program under Grant Nos. 2021JDJQ0021 and 2020YJ0481, and the Digital Media Art, Key Laboratory of Sichuan Province, Sichuan Conservatory of Music, Chengdu, China under Grant No. 21DMAKL02.

研究背景
查询优化是数据库管理系统中提高查询效率的重要手段。查询优化器的优化过程对数据库开发人员是透明的,可以自动进行等价转换,并对物理执行计划进行选择。查询优化器的目的是从理论上获得SQL查询的最佳执行计划。传统的查询优化器不能准确地估计基数,导致选择的执行计划较差,进而影响查询性能。主要原因是基数估计的误差在于模型不能准确地捕捉多个表连接操作之间的复杂关系。如何准确地捕获不同表或列之间的关系直接决定了查询结果的准确性。
目的
基数估计问题可视为有监督学习,标签是真正的基数。最大的挑战是如何捕获SQL语义及多表连接关系。目前,已有的基于学习方法的数据库优化技术效果不佳。本文提出了一种基数估计模型VSCNN (Vertical Scanning Convolutional Neural Network,垂直扫描卷积神经网络)来应对上述挑战。该模型不但能够捕获复杂SQL查询中每个词的语义,而且能够捕获表之间的连接关系,从而得到准确的基数估计。
方法
本文提出的基于学习的基数估计器将SQL查询从一个句子转换为词向量,采用one-hot方法对表名进行编码,将样本分别编码成位图,然后将上述两种编码合并,可以从数据样本中获得足够的语义信息。由垂直扫描卷积神经网络获得的特征向量包含语义信息,包括:关于SQL查询的表、连接和谓词。通过与数据库系统中流行的基数估计器进行比较,大量的实验证明了所提出的基于垂直扫描卷积神经网络的技术估计模型的准确性和时效性。
结果
从实验结果来看,传统的数据库系统往往会高估基数,q-error非常大,甚至高达几十万。与传统数据库中的估计结果相比,基于学习的基数估计模型的q-error至少可以降低14.6%。MSCN (Multi-Set Convolutional Network)可以处理简单的查询,但是对于JOB (Join Order Benchmark)包含字符串的复杂查询,MSCN不起作用。可以发现,所提模型在所有实验中都取得了很好的效果。MSCN无法处理字符串类型字段,因为它没有能够识别字符串的神经网络结构。MSCN (string)可以处理字符串,但是q-error比VSCNN大。VSCNN模型可以很好地处理字符串,因为它通过嵌入技术将字符串作为词向量嵌入,并且能够很好地捕获SQL语义信息。
结论
本文介绍了垂直扫描卷积神经网络模型VSCNN处理基数估计问题的工作原理。VSCNN可以将SQL查询从一个句子转换为词向量,使用垂直扫描卷积神经网络来捕获词向量中词之间的关系。它在处理多个表连接查询时性能优越,并支持处理字符串类型的谓词。为了提高基数估计的精度,采用从基表中抽取样本的方法,并将其压缩成位图。为了将该模型应用于更多类型的查询,未来将从以下几个方面对所提模型进行改进:与数值类型数据相比,数值数据字段在表中具有最大值和最小值,所有值都在最大值和最小值之间。字符串类型很难处理,因为字符串中的字符数不同,而且字符本身也不同。这样,字符串类型的数据比数字类型的数据更稀疏。未来将设计一种更好基数估计方法来处理字符串类型的数据。

关键词: 基数估计, 字向量, 垂直扫描卷积神经网络, 采样方法

Abstract: Although the popular database systems perform well on query optimization, they still face poor query execution plans when the join operations across multiple tables are complex. Bad execution planning usually results in bad cardinality estimations. The cardinality estimation models in traditional databases cannot provide high-quality estimation, because they are not capable of capturing the correlation between multiple tables in an effective fashion. Recently, the state-ofthe-art learning-based cardinality estimation is estimated to work better than the traditional empirical methods. Basically, they used deep neural networks to compute the relationships and correlations of tables. In this paper, we propose a vertical scanning convolutional neural network (abbreviated as VSCNN) to capture the relationships between words in the word vector in order to generate a feature map. The proposed learning-based cardinality estimator converts Structured Query Language (SQL) queries from a sentence to a word vector and we encode table names in the one-hot encoding method and the samples into bitmaps, separately, and then merge them to obtain enough semantic information from data samples. In particular, the feature map obtained by VSCNN contains semantic information including tables, joins, and predicates about SQL queries. Importantly, in order to improve the accuracy of cardinality estimation, we propose the negative sampling method for training the word vector by gradient descent from the base table and compress it into a bitmap. Extensive experiments are conducted and the results show that the estimation quality of q-error of the proposed vertical scanning convolutional neural network based model is reduced by at least 14.6% when compared with the estimators in traditional databases.

Key words: cardinality estimation, word vector, vertical scanning convolutional neural network, sampling method

[1] Leis V, Radke B, Gubichev A, Kemper A, Neumann T. Cardinality estimation done right:Index-based join sampling. In Proc. the 8th Biennial Conference on Innovative Data Systems Research, Jan. 2017.
[2] Li G, Zhou X, Li S. XuanYuan:An AI-native database. IEEE Data Eng. Bull., 2019, 42(2):70-81.
[3] Kipf A, Kipf T, Radke B, Leis V, Boncz P A, Kemper A. Learned cardinalities:Estimating correlated joins with deep learning. In Proc. the 9th Biennial Conference on Innovative Data Systems Research, Jan. 2019.
[4] Ioannidis Y E. The history of histograms (abridged). In Proc. the 29th International Conference on Very Large Data Bases, Sept. 2003, pp.19-30. DOI:10.1016/B978-012722442-8/50011-2.
[5] Giroire F. Order statistics and estimating cardinalities of massive data sets. Discret. Appl. Math., 2009, 157(2):406-427. DOI:10.1016/j.dam.2008.06.020.
[6] Flajolet P, Martin G N. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 1985, 31(2):182-209. DOI:10.1016/0022-0000(85)90041-8.
[7] Durand M, Flajolet P. Loglog counting of large cardinalities. In Proc. the 11th Annual European Symposium, Sept. 2003, pp.605-617. DOI:10.1007/978-3-540-39658-155.
[8] Flajolet P, Fusy É, Gandouet O, Meunier F. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Proc. the 2007 Conference on Analysis of Algorithm, Jun. 2007, pp.137-156.
[9] Whang K, Zanden B T V, Taylor H M. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 1990, 15(2):208-229. DOI:10.1145/78922.78925.
[10] Wu W, Naughton J F, Singh H. Sampling-based query reoptimization. In Proc. the 2016 International Conference on Management of Data, June 26-July 1, 2016, pp.1721-1736. DOI:10.1145/2882903.2882914.
[11] Lipton R J, Naughton J F, Schneider D A. Practical selectivity estimation through adaptive sampling. In Proc. the 1990 ACM SIGMOD International Conference on Management of Data, May 1990, pp.1-11. DOI:10.1145/936-05.93611.
[12] Olken F, Rotem D. Random sampling from database files:A survey. In Proc. the 5th International Conference on Statistical and Scientific Database Management, Apr. 1990, pp.92-111. DOI:10.1007/3-540-52342-123.
[13] Estan C, Naughton J F. End-biased samples for join cardinality estimation. In Proc. the 22nd International Conference on Data Engineering, Apr. 2006, Article No. 20. DOI:10.1109/ICDE.2006.61.
[14] Neumann T, Leis V, Kemper A. The complete story of joins (in hyper). In Proc. the Datenbanksysteme für Business, Technologie und Web, Mar. 2017, pp.31-50.
[15] Neumann T, Radke B. Adaptive optimization of very large join queries. In Proc. the 2018 International Conference on Management of Data, Jun. 2018, pp.677-692. DOI:10.1145/3183713.3183733.
[16] Zhang W E, Sheng Q Z, Qin Y, Taylor K, Yao L. Learningbased SPARQL query performance modeling and prediction. World Wide Web, 2018, 21(4):1015-1035. DOI:10.1007/s11280-017-0498-1.
[17] Leis V, Gubichev A, Mirchev A, Boncz P A, Kemper A, Neumann T. How good are query optimizers, really? Proc. VLDB Endow., 2015, 9(3):204-215. DOI:10.14778/28505-83.2850594.
[18] Lakshmi M S, Zhou S. Selectivity estimation in extensible databases-A neural network approach. In Proc. the 24th International Conference on Very Large Data Bases, Aug. 1998, pp.623-627.
[19] Malik T, Burns R C, Chawla N V. A black-box approach to query cardinality estimation. In Proc. the 3rd Biennial Conference on Innovative Data Systems Research, Jan. 2007, pp.56-67.
[20] Yang Z, Liang E, Kamsetty A, Wu C, Duan Y, Chen X, Abbeel P, Hellerstein J M, Krishnan S, Stoica I. Selectivity estimation with deep likelihood models. arXiv:1905.04278, 2019. http://arxiv.org/abs/1905.04278, Aug. 2020.
[21] Liu H, Xu M, Yu Z, Corvinelli V, Zuzarte C. Cardinality estimation using neural networks. In Proc. the 25th Annual International Conference on Computer Science and Software Engineering, Nov. 2015, pp.53-59.
[22] Knagenhjelm P, Brauer P. Classification of vowels in continuous speech using MLP and a hybrid net. Speech Commun., 1990, 9(1):31-34. DOI:10.1016/0167-6393(90)90042-8.
[23] Mahmoud M A B, Guo P. DNA sequence classification based on MLP with PILAE algorithm. Soft Comput., 2021, 25(5):4003-4014. DOI:10.1007/s00500-020-05429-y.
[24] Sun J, Li G. An end-to-end learning-based cost estimator. Proc. VLDB Endow., 2019, 13(3):307-319. DOI:10.14778/3368289.3368296.
[25] Yu X, Li G, Chai C, Tang N. Reinforcement learning with tree-LSTM for join order selection. In Proc. the 36th IEEE International Conference on Data Engineering, Apr. 2020, pp.1297-1308. DOI:10.1109/ICDE48307.2020.00116.
[26] Zhang J, Liu Y, Zhou K, Li G, Xiao Z, Cheng B, Xing J, Wang Y, Cheng T, Liu L, Ran M, Li Z. An end-to-end automatic cloud database tuning system using deep reinforcement learning. In Proc. the 2019 International Conference on Management of Data, Jun. 2019, pp.415-432. DOI:10.1145/3299869.3300085.
[27] Li G, Zhou X, Li S, Gao B. QTune:A query-aware database tuning system with deep reinforcement learning. Proc. VLDB Endow., 2019, 12(12):2118-2130. DOI:10.14778/3352063.3352129.
[28] Li G, Chai C, Fan J, Weng X, Li J, Zheng Y, Li Y, Yu X, Zhang X, Yuan H. CDB:Optimizing queries with crowdbased selections and joins. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.1463-1478. DOI:10.1145/3035918.3064036.
[29] Fan J, Li G, Zhou L. Interactive SQL query suggestion:Making databases user-friendly. In Proc. the 27th International Conference on Data Engineering, Apr. 2011, pp.351-362. DOI:10.1109/ICDE.2011.5767843.
[30] Mikolov T, Sutskever I, Chen K, Corrado G S, Dean J. Distributed representations of words and phrases and their compositionality. In Proc. the 27th International Conference on Neural Information Processing Systems, Dec. 2013, pp.3111-3119.
[31] Zimmer R, Pellegrini T, Singh S F, Masquelier T. Supervised training of convolutional spiking neural networks with PyTorch. arXiv:1911.10124, 2019. https://arxiv.org/abs/1911.10124, Nov. 2020.
[32] Al-Mouhamed M A, Hasan Khan A, Mohammad N. A review of CUDA optimization techniques and tools for structured grid computing. Computing, 2020, 102(4):977-1003. DOI:10.1007/s00607-019-00744-1.
[33] Liu B, Liang Y. Optimal function approximation with ReLU neural networks. Neurocomputing, 2021, 435:216-227. DOI:10.1016/j.neucom.2021.01.007.
[34] Hinton G E, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012. https://arxiv.org/abs/1207.0580, May 2021.
[35] Kingma D P, Ba J. Adam:A method for stochastic optimization. In Proc. the 3rd International Conference on Learning Representations, May 2015.
[36] Moerkotte G, Neumann T, Steidl G. Preventing bad plans by bounding the impact of cardinality estimation errors. Proc. VLDB Endow., 2009, 2(1):982-993. DOI:10.14778/1687627.1687738.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 周笛;. A Recovery Technique for Distributed Communicating Process Systems[J]. , 1986, 1(2): 34 -43 .
[2] 陈世华;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[3] 李万学;. Almost Optimal Dynamic 2-3 Trees[J]. , 1986, 1(2): 60 -71 .
[4] 章萃; 赵沁平; 徐家福;. Kernel Language KLND[J]. , 1986, 1(3): 65 -79 .
[5] 王建潮; 魏道政;. An Effective Test Generation Algorithm for Combinational Circuits[J]. , 1986, 1(4): 1 -16 .
[6] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[7] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[8] 郑国梁; 李辉;. The Design and Implementation of the Syntax-Directed Editor Generator(SEG)[J]. , 1986, 1(4): 39 -48 .
[9] 黄学东; 蔡莲红; 方棣棠; 迟边进; 周立; 蒋力;. A Computer System for Chinese Character Speech Input[J]. , 1986, 1(4): 75 -83 .
[10] 许小曙;. Simplification of Multivalued Sequential SULM Network by Using Cascade Decomposition[J]. , 1986, 1(4): 84 -95 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: