基数估计器:利用垂直扫描卷积神经网络处理SQL
Cardinality Estimator: Processing SQL with a Vertical Scanning Convolutional Neural Network
-
摘要: 研究背景
查询优化是数据库管理系统中提高查询效率的重要手段。查询优化器的优化过程对数据库开发人员是透明的,可以自动进行等价转换,并对物理执行计划进行选择。查询优化器的目的是从理论上获得SQL查询的最佳执行计划。传统的查询优化器不能准确地估计基数,导致选择的执行计划较差,进而影响查询性能。主要原因是基数估计的误差在于模型不能准确地捕捉多个表连接操作之间的复杂关系。如何准确地捕获不同表或列之间的关系直接决定了查询结果的准确性。
目的
基数估计问题可视为有监督学习,标签是真正的基数。最大的挑战是如何捕获SQL语义及多表连接关系。目前,已有的基于学习方法的数据库优化技术效果不佳。本文提出了一种基数估计模型VSCNN (Vertical Scanning Convolutional Neural Network,垂直扫描卷积神经网络)来应对上述挑战。该模型不但能够捕获复杂SQL查询中每个词的语义,而且能够捕获表之间的连接关系,从而得到准确的基数估计。
方法
本文提出的基于学习的基数估计器将SQL查询从一个句子转换为词向量,采用one-hot方法对表名进行编码,将样本分别编码成位图,然后将上述两种编码合并,可以从数据样本中获得足够的语义信息。由垂直扫描卷积神经网络获得的特征向量包含语义信息,包括:关于SQL查询的表、连接和谓词。通过与数据库系统中流行的基数估计器进行比较,大量的实验证明了所提出的基于垂直扫描卷积神经网络的技术估计模型的准确性和时效性。
结果
从实验结果来看,传统的数据库系统往往会高估基数,q-error非常大,甚至高达几十万。与传统数据库中的估计结果相比,基于学习的基数估计模型的q-error至少可以降低14.6%。MSCN (Multi-Set Convolutional Network)可以处理简单的查询,但是对于JOB (Join Order Benchmark)包含字符串的复杂查询,MSCN不起作用。可以发现,所提模型在所有实验中都取得了很好的效果。MSCN无法处理字符串类型字段,因为它没有能够识别字符串的神经网络结构。MSCN (string)可以处理字符串,但是q-error比VSCNN大。VSCNN模型可以很好地处理字符串,因为它通过嵌入技术将字符串作为词向量嵌入,并且能够很好地捕获SQL语义信息。
结论
本文介绍了垂直扫描卷积神经网络模型VSCNN处理基数估计问题的工作原理。VSCNN可以将SQL查询从一个句子转换为词向量,使用垂直扫描卷积神经网络来捕获词向量中词之间的关系。它在处理多个表连接查询时性能优越,并支持处理字符串类型的谓词。为了提高基数估计的精度,采用从基表中抽取样本的方法,并将其压缩成位图。为了将该模型应用于更多类型的查询,未来将从以下几个方面对所提模型进行改进:与数值类型数据相比,数值数据字段在表中具有最大值和最小值,所有值都在最大值和最小值之间。字符串类型很难处理,因为字符串中的字符数不同,而且字符本身也不同。这样,字符串类型的数据比数字类型的数据更稀疏。未来将设计一种更好基数估计方法来处理字符串类型的数据。Abstract: Although the popular database systems perform well on query optimization, they still face poor query execution plans when the join operations across multiple tables are complex. Bad execution planning usually results in bad cardinality estimations. The cardinality estimation models in traditional databases cannot provide high-quality estimation, because they are not capable of capturing the correlation between multiple tables in an effective fashion. Recently, the state-ofthe-art learning-based cardinality estimation is estimated to work better than the traditional empirical methods. Basically, they used deep neural networks to compute the relationships and correlations of tables. In this paper, we propose a vertical scanning convolutional neural network (abbreviated as VSCNN) to capture the relationships between words in the word vector in order to generate a feature map. The proposed learning-based cardinality estimator converts Structured Query Language (SQL) queries from a sentence to a word vector and we encode table names in the one-hot encoding method and the samples into bitmaps, separately, and then merge them to obtain enough semantic information from data samples. In particular, the feature map obtained by VSCNN contains semantic information including tables, joins, and predicates about SQL queries. Importantly, in order to improve the accuracy of cardinality estimation, we propose the negative sampling method for training the word vector by gradient descent from the base table and compress it into a bitmap. Extensive experiments are conducted and the results show that the estimation quality of q-error of the proposed vertical scanning convolutional neural network based model is reduced by at least 14.6% when compared with the estimators in traditional databases.