Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (4): 762-777.doi: 10.1007/s11390-021-1351-7

Special Issue: Data Management and Data Mining

• Special Section on AI4DB and DB4AI • Previous Articles     Next Articles

Cardinality Estimator: Processing SQL with a Vertical Scanning Convolutional Neural Network

Shao-Jie Qiao1, Senior Member, CCF, Guo-Ping Yang1, Nan Han2,*, Hao Chen3, Fa-Liang Huang4, Member, CCF, Kun Yue5, Member, CCF, Yu-Gen Yi6, Member, CCF, IEEE, and Chang-An Yuan7, Member, CCF        

  1. 1 School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China;
    2 School of Management, Chengdu University of Information Technology, Chengdu 610225, China;
    3 Beijing Huawei Digital Technologies Co., Ltd., Beijing 100085, China;
    4 School of Computer and Information Engineering, Nanning Normal University, Nanning 530299, China;
    5 School of Information Science and Engineering, Yunnan University, Kunming 650500, China;
    6 School of Software, Jiangxi Normal University, Nanchang 330022, China;
    7 Guangxi College of Education, Nanning 530007, China
  • Received:2021-02-01 Revised:2021-07-01 Online:2021-07-05 Published:2021-07-30
  • Contact: Nan Han
  • About author:Shao-Jie Qiao received his B.S. and Ph.D. degrees in computer application technology from Sichuan University, Chengdu, in 2004 and 2009, respectively. From 2007 to 2008, he worked as a visiting scholar in the School of Computing at the National University of Singapore, Singapore. He is a distinguished young scholar of Sichuan Province of China. He is currently a professor with the School of Software Engineering, Chengdu University of Information Technology, Chengdu. He has authored more than 150 high quality papers, including IEEE Transactions on ITS, IEEE TKDE, IEEE TNNLS, IEEE TCSVT, IEEE Transactions on SMC:Systems, ACM TKDD, ACM TIST, etc.. His research interests include databases, data mining and artificial intelligence. He is a senior member of CCF.
  • Supported by:
    The work was supported in part by the CCF-Huawei Database System Innovation Research Plan under Grant No. CCFHuaweiDBIR2020004A, the National Natural Science Foundation of China under Grant Nos. 61772091, 61802035, 61962006 and 61962038, the Sichuan Science and Technology Program under Grant Nos. 2021JDJQ0021 and 2020YJ0481, and the Digital Media Art, Key Laboratory of Sichuan Province, Sichuan Conservatory of Music, Chengdu, China under Grant No. 21DMAKL02.

Although the popular database systems perform well on query optimization, they still face poor query execution plans when the join operations across multiple tables are complex. Bad execution planning usually results in bad cardinality estimations. The cardinality estimation models in traditional databases cannot provide high-quality estimation, because they are not capable of capturing the correlation between multiple tables in an effective fashion. Recently, the state-ofthe-art learning-based cardinality estimation is estimated to work better than the traditional empirical methods. Basically, they used deep neural networks to compute the relationships and correlations of tables. In this paper, we propose a vertical scanning convolutional neural network (abbreviated as VSCNN) to capture the relationships between words in the word vector in order to generate a feature map. The proposed learning-based cardinality estimator converts Structured Query Language (SQL) queries from a sentence to a word vector and we encode table names in the one-hot encoding method and the samples into bitmaps, separately, and then merge them to obtain enough semantic information from data samples. In particular, the feature map obtained by VSCNN contains semantic information including tables, joins, and predicates about SQL queries. Importantly, in order to improve the accuracy of cardinality estimation, we propose the negative sampling method for training the word vector by gradient descent from the base table and compress it into a bitmap. Extensive experiments are conducted and the results show that the estimation quality of q-error of the proposed vertical scanning convolutional neural network based model is reduced by at least 14.6% when compared with the estimators in traditional databases.

Key words: cardinality estimation; word vector; vertical scanning convolutional neural network; sampling method;

[1] Leis V, Radke B, Gubichev A, Kemper A, Neumann T. Cardinality estimation done right:Index-based join sampling. In Proc. the 8th Biennial Conference on Innovative Data Systems Research, Jan. 2017.
[2] Li G, Zhou X, Li S. XuanYuan:An AI-native database. IEEE Data Eng. Bull., 2019, 42(2):70-81.
[3] Kipf A, Kipf T, Radke B, Leis V, Boncz P A, Kemper A. Learned cardinalities:Estimating correlated joins with deep learning. In Proc. the 9th Biennial Conference on Innovative Data Systems Research, Jan. 2019.
[4] Ioannidis Y E. The history of histograms (abridged). In Proc. the 29th International Conference on Very Large Data Bases, Sept. 2003, pp.19-30. DOI:10.1016/B978-012722442-8/50011-2.
[5] Giroire F. Order statistics and estimating cardinalities of massive data sets. Discret. Appl. Math., 2009, 157(2):406-427. DOI:10.1016/j.dam.2008.06.020.
[6] Flajolet P, Martin G N. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 1985, 31(2):182-209. DOI:10.1016/0022-0000(85)90041-8.
[7] Durand M, Flajolet P. Loglog counting of large cardinalities. In Proc. the 11th Annual European Symposium, Sept. 2003, pp.605-617. DOI:10.1007/978-3-540-39658-155.
[8] Flajolet P, Fusy É, Gandouet O, Meunier F. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Proc. the 2007 Conference on Analysis of Algorithm, Jun. 2007, pp.137-156.
[9] Whang K, Zanden B T V, Taylor H M. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 1990, 15(2):208-229. DOI:10.1145/78922.78925.
[10] Wu W, Naughton J F, Singh H. Sampling-based query reoptimization. In Proc. the 2016 International Conference on Management of Data, June 26-July 1, 2016, pp.1721-1736. DOI:10.1145/2882903.2882914.
[11] Lipton R J, Naughton J F, Schneider D A. Practical selectivity estimation through adaptive sampling. In Proc. the 1990 ACM SIGMOD International Conference on Management of Data, May 1990, pp.1-11. DOI:10.1145/936-05.93611.
[12] Olken F, Rotem D. Random sampling from database files:A survey. In Proc. the 5th International Conference on Statistical and Scientific Database Management, Apr. 1990, pp.92-111. DOI:10.1007/3-540-52342-123.
[13] Estan C, Naughton J F. End-biased samples for join cardinality estimation. In Proc. the 22nd International Conference on Data Engineering, Apr. 2006, Article No. 20. DOI:10.1109/ICDE.2006.61.
[14] Neumann T, Leis V, Kemper A. The complete story of joins (in hyper). In Proc. the Datenbanksysteme für Business, Technologie und Web, Mar. 2017, pp.31-50.
[15] Neumann T, Radke B. Adaptive optimization of very large join queries. In Proc. the 2018 International Conference on Management of Data, Jun. 2018, pp.677-692. DOI:10.1145/3183713.3183733.
[16] Zhang W E, Sheng Q Z, Qin Y, Taylor K, Yao L. Learningbased SPARQL query performance modeling and prediction. World Wide Web, 2018, 21(4):1015-1035. DOI:10.1007/s11280-017-0498-1.
[17] Leis V, Gubichev A, Mirchev A, Boncz P A, Kemper A, Neumann T. How good are query optimizers, really? Proc. VLDB Endow., 2015, 9(3):204-215. DOI:10.14778/28505-83.2850594.
[18] Lakshmi M S, Zhou S. Selectivity estimation in extensible databases-A neural network approach. In Proc. the 24th International Conference on Very Large Data Bases, Aug. 1998, pp.623-627.
[19] Malik T, Burns R C, Chawla N V. A black-box approach to query cardinality estimation. In Proc. the 3rd Biennial Conference on Innovative Data Systems Research, Jan. 2007, pp.56-67.
[20] Yang Z, Liang E, Kamsetty A, Wu C, Duan Y, Chen X, Abbeel P, Hellerstein J M, Krishnan S, Stoica I. Selectivity estimation with deep likelihood models. arXiv:1905.04278, 2019., Aug. 2020.
[21] Liu H, Xu M, Yu Z, Corvinelli V, Zuzarte C. Cardinality estimation using neural networks. In Proc. the 25th Annual International Conference on Computer Science and Software Engineering, Nov. 2015, pp.53-59.
[22] Knagenhjelm P, Brauer P. Classification of vowels in continuous speech using MLP and a hybrid net. Speech Commun., 1990, 9(1):31-34. DOI:10.1016/0167-6393(90)90042-8.
[23] Mahmoud M A B, Guo P. DNA sequence classification based on MLP with PILAE algorithm. Soft Comput., 2021, 25(5):4003-4014. DOI:10.1007/s00500-020-05429-y.
[24] Sun J, Li G. An end-to-end learning-based cost estimator. Proc. VLDB Endow., 2019, 13(3):307-319. DOI:10.14778/3368289.3368296.
[25] Yu X, Li G, Chai C, Tang N. Reinforcement learning with tree-LSTM for join order selection. In Proc. the 36th IEEE International Conference on Data Engineering, Apr. 2020, pp.1297-1308. DOI:10.1109/ICDE48307.2020.00116.
[26] Zhang J, Liu Y, Zhou K, Li G, Xiao Z, Cheng B, Xing J, Wang Y, Cheng T, Liu L, Ran M, Li Z. An end-to-end automatic cloud database tuning system using deep reinforcement learning. In Proc. the 2019 International Conference on Management of Data, Jun. 2019, pp.415-432. DOI:10.1145/3299869.3300085.
[27] Li G, Zhou X, Li S, Gao B. QTune:A query-aware database tuning system with deep reinforcement learning. Proc. VLDB Endow., 2019, 12(12):2118-2130. DOI:10.14778/3352063.3352129.
[28] Li G, Chai C, Fan J, Weng X, Li J, Zheng Y, Li Y, Yu X, Zhang X, Yuan H. CDB:Optimizing queries with crowdbased selections and joins. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.1463-1478. DOI:10.1145/3035918.3064036.
[29] Fan J, Li G, Zhou L. Interactive SQL query suggestion:Making databases user-friendly. In Proc. the 27th International Conference on Data Engineering, Apr. 2011, pp.351-362. DOI:10.1109/ICDE.2011.5767843.
[30] Mikolov T, Sutskever I, Chen K, Corrado G S, Dean J. Distributed representations of words and phrases and their compositionality. In Proc. the 27th International Conference on Neural Information Processing Systems, Dec. 2013, pp.3111-3119.
[31] Zimmer R, Pellegrini T, Singh S F, Masquelier T. Supervised training of convolutional spiking neural networks with PyTorch. arXiv:1911.10124, 2019., Nov. 2020.
[32] Al-Mouhamed M A, Hasan Khan A, Mohammad N. A review of CUDA optimization techniques and tools for structured grid computing. Computing, 2020, 102(4):977-1003. DOI:10.1007/s00607-019-00744-1.
[33] Liu B, Liang Y. Optimal function approximation with ReLU neural networks. Neurocomputing, 2021, 435:216-227. DOI:10.1016/j.neucom.2021.01.007.
[34] Hinton G E, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012., May 2021.
[35] Kingma D P, Ba J. Adam:A method for stochastic optimization. In Proc. the 3rd International Conference on Learning Representations, May 2015.
[36] Moerkotte G, Neumann T, Steidl G. Preventing bad plans by bounding the impact of cardinality estimation errors. Proc. VLDB Endow., 2009, 2(1):982-993. DOI:10.14778/1687627.1687738.
No related articles found!
Full text



[1] Zhou Di;. A Recovery Technique for Distributed Communicating Process Systems[J]. , 1986, 1(2): 34 -43 .
[2] Chen Shihua;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[3] Li Wanxue;. Almost Optimal Dynamic 2-3 Trees[J]. , 1986, 1(2): 60 -71 .
[4] Zhang Cui; Zhao Qinping; Xu Jiafu;. Kernel Language KLND[J]. , 1986, 1(3): 65 -79 .
[5] Wang Jianchao; Wei Daozheng;. An Effective Test Generation Algorithm for Combinational Circuits[J]. , 1986, 1(4): 1 -16 .
[6] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[7] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[8] Zheng Guoliang; Li Hui;. The Design and Implementation of the Syntax-Directed Editor Generator(SEG)[J]. , 1986, 1(4): 39 -48 .
[9] Huang Xuedong; Cai Lianhong; Fang Ditang; Chi Bianjin; Zhou Li; Jiang Li;. A Computer System for Chinese Character Speech Input[J]. , 1986, 1(4): 75 -83 .
[10] Xu Xiaoshu;. Simplification of Multivalued Sequential SULM Network by Using Cascade Decomposition[J]. , 1986, 1(4): 84 -95 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
  Copyright ©2015 JCST, All Rights Reserved