
VLDB 2021 EA&B Best Paper: led by IEEE Data Engineering Rising Star Award winner Wang Jiannan, a deep analysis of why machine-learned cardinality estimation is not yet ready for production

2021-08-26 09:02:20 Xinzhiyuan

Xinzhiyuan Report

Author: Qu Changbo

Editor: Hao Kun

[Xinzhiyuan Introduction] Recently, the paper "Are We Ready for Learned Cardinality Estimation?" from Wang Jiannan's team won the EA&B Best Paper Award at VLDB 2021, a top database conference.

A database is a complex software system that enterprises rely on to manage and query data.

In recent years, with advances in machine learning and deep learning and their successful application in other fields, the topic of ML for DB has grown increasingly popular. However, most of the proposed methods remain at the exploratory stage within academia.

Looking 5-10 years ahead, one possible outcome is that these models turn out to be unusable in production environments, and interest in the topic gradually fades.

Another possible outcome is that mainstream database systems officially begin to adopt these machine learning models and see large performance gains.

So answering the question "Can an ML-based module X (Learned X) be deployed in real database products?" is of the utmost importance.

Recently, the database group at SFU (Simon Fraser University) published an experimental analysis paper that answers this question from the angle of cardinality estimation.

The paper appeared at VLDB, a top database conference, and received the conference's EA&B (Experiment, Analysis and Benchmark) Best Paper Award.

Paper: http://www.vldb.org/pvldb/vol14/p1640-wang.pdf

GitHub: https://github.com/sfu-db/AreCELearnedYet

Research background

The query optimizer is one of the most important modules of a database. It chooses the query plan for a SQL statement, and thereby directly affects, and often determines, query speed.

Simply put, the goal of cardinality estimation is to predict how many rows a query is likely to return, without actually executing the SQL query.

For example, given a simple SQL statement:

SELECT * FROM DB WHERE A = 1;

cardinality estimation amounts to estimating the result of the following query:

SELECT COUNT(*) FROM DB WHERE A = 1;

The query optimizer uses cardinality estimates to generate query plans; more accurate estimates usually lead to better plans.

But cardinality estimation is the "Achilles' heel" of the query optimizer, and it is the focus of this paper. The problem has been studied for more than thirty years and is still not fully solved.

Query Optimizer Principle

Research purpose and contribution

In recent years, more and more research has used machine learning and deep learning models to predict cardinalities, showing great potential to surpass traditional database cardinality estimation methods in accuracy.

However, can learned cardinality estimation methods actually be applied in production database environments?

The answer is not clear .

To that end, the paper answers this question from three angles:

(1)Are Learned Methods Ready For Static Environments?

(2)Are Learned Methods Ready For Dynamic Environments?

(3)When Do Learned Methods Go Wrong?

The paper comprehensively compares learned cardinality estimation methods under a unified experimental setup, answers the above questions in detail, and points out future research directions.

Main experiments

Classification of experimental methods

The paper surveys 7 machine learning methods published at top database conferences in recent years and divides them into two categories:

Classification of learned cardinality estimation methods

(1) Query-driven (regression) models: these methods treat cardinality estimation as a regression problem and predict the cardinality with a regression model.

During training, featurized SQL statements and their result counts serve as features and labels, respectively. At prediction time, a featurized SQL statement is fed to the model to obtain the predicted cardinality.
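To make the idea concrete, here is a minimal sketch of a query-driven estimator, assuming single-table range queries on a toy two-column table; the featurization, the choice of regressor, and all names are illustrative rather than taken from any of the surveyed papers. Predicate bounds are normalized into [0, 1], and a standard gradient-boosted regressor is fit on log-scale cardinalities:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
table = rng.integers(0, 100, size=(10_000, 2))      # toy 2-column table
col_min, col_max = table.min(axis=0), table.max(axis=0)

def featurize(lows, highs):
    # Encode each predicate's [low, high] bounds, normalized to [0, 1].
    lo = (np.asarray(lows) - col_min) / (col_max - col_min)
    hi = (np.asarray(highs) - col_min) / (col_max - col_min)
    return np.concatenate([lo, hi])

def true_card(lows, highs):
    return np.all((table >= lows) & (table <= highs), axis=1).sum()

# Training set: random range queries labeled with log(1 + true cardinality).
X, y = [], []
for _ in range(2_000):
    lows = rng.integers(0, 50, size=2)
    highs = lows + rng.integers(1, 50, size=2)
    X.append(featurize(lows, highs))
    y.append(np.log1p(true_card(lows, highs)))

model = GradientBoostingRegressor().fit(np.array(X), np.array(y))

# Prediction: featurize the incoming query, invert the log transform.
q = featurize([10, 20], [40, 60])
print("estimated rows:", np.expm1(model.predict([q])[0]))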

(2) Data-driven (joint distribution) models: these methods treat cardinality estimation as estimating the joint probability distribution of the data itself.

During training, the joint probability distribution of the data is estimated. At prediction time, a SQL query is interpreted as a probability query against that distribution, which yields the predicted cardinality.
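Again purely as a sketch: a data-driven estimator in miniature can model the joint distribution with a multidimensional histogram and answer a range query by summing probability mass over the query rectangle. The surveyed methods, such as Naru, use far richer models (e.g. deep autoregressive networks); the histogram here only illustrates the principle.

import numpy as np

rng = np.random.default_rng(0)
table = rng.integers(0, 100, size=(10_000, 2))       # toy 2-column table
n_rows = len(table)

# "Training": estimate the joint distribution as a normalized 2-D histogram.
hist, (edges_a, edges_b) = np.histogramdd(table, bins=(20, 20))
joint = hist / n_rows

def estimate(low_a, high_a, low_b, high_b):
    # Sum the probability mass of buckets fully inside the query rectangle,
    # then scale by the row count (a crude sketch; ignores partial buckets).
    in_a = (edges_a[:-1] >= low_a) & (edges_a[1:] <= high_a)
    in_b = (edges_b[:-1] >= low_b) & (edges_b[1:] <= high_b)
    return joint[np.ix_(in_a, in_b)].sum() * n_rows

# SELECT COUNT(*) FROM T WHERE A BETWEEN 10 AND 40 AND B BETWEEN 20 AND 60
print("estimated rows:", estimate(10, 40, 20, 60))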

Dataset selection

Because different methods were previously evaluated on different datasets, there was no uniform standard of comparison. The paper selects 4 real datasets that differ in size, number of rows, and number of columns, each of which was used in at least one of the surveyed papers.

Dataset details

Query generation framework

Because different papers generate queries in different ways, the paper proposes a unified generation framework after weighing the alternatives. (#Predicates is the number of predicates in the SQL WHERE clause; Operators indicates whether a predicate is an equality ("=") or a range; OOD indicates whether queries may probe values that do not exist in the data.)

The unified test set consists of 10K queries; for the query-driven models, the training set size is 100K.

Differences in query workloads
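As a rough, hypothetical sketch of what such a unified generator can look like (the paper's actual algorithm differs in its details): anchor each query on a row sampled from the data, then draw the number of predicates, an operator per predicate, and, optionally, out-of-domain values.

import numpy as np

rng = np.random.default_rng(0)
table = rng.integers(0, 100, size=(10_000, 3))
cols = ["A", "B", "C"]

def gen_query(n_predicates, allow_ood=False):
    center = table[rng.integers(len(table))]       # anchor on an existing row
    chosen = rng.choice(len(cols), size=n_predicates, replace=False)
    preds = []
    for i in chosen:
        val = int(center[i])
        if allow_ood and rng.random() < 0.2:
            val = 100 + int(rng.integers(0, 50))   # value absent from the data
        if rng.random() < 0.5:                     # operator: '=' vs. range
            preds.append(f"{cols[i]} = {val}")
        else:
            width = int(rng.integers(1, 30))
            preds.append(f"{cols[i]} BETWEEN {val} AND {val + width}")
    return "SELECT COUNT(*) FROM T WHERE " + " AND ".join(preds)

print(gen_query(n_predicates=2, allow_ood=True))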

Evaluation metric

The experiments use Q-error, the standard metric for cardinality estimation.

Q-error is the ratio of the larger to the smaller of the actual and estimated values, i.e. max(actual, estimate) / min(actual, estimate). The closer it is to 1, the more accurate the estimate.

Q_error
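In code, the metric is a one-liner; clamping counts to at least 1 is a common convention to guard against zero results, assumed here rather than quoted from the paper:

def q_error(actual, estimate):
    # Symmetric ratio: penalizes over- and under-estimation equally.
    actual, estimate = max(actual, 1), max(estimate, 1)
    return max(actual, estimate) / min(actual, estimate)

print(q_error(1000, 250))   # 4.0 -- a 4x under-estimate
print(q_error(250, 1000))   # 4.0 -- a 4x over-estimate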

Are Learned Methods Ready For Static Environments?

In a static environment (where the data in the database is never updated), the authors compare learned cardinality estimation methods against traditional methods along two dimensions: accuracy and efficiency (both training time and inference time).

The paper selects 3 widely used database products and 6 common and state-of-the-art traditional cardinality estimation methods as baselines.

It uses the 4 datasets with different characteristics described above, together with the unified query generation framework, to produce the required training and test sets.

Accuracy comparison (lower error is better)

Observations:

Naru achieves the best accuracy.

The latest machine learning methods are more accurate than the traditional methods.

Training time comparison (using the Power dataset as an example)

The figure below shows how long each method takes to complete training.

Differences in training time

Observations:

All the databases can refresh their statistics within a few seconds. Among the machine learning methods, only LW-XGB is comparable; the others can be more than 1000x slower.

Inference time comparison (using the Power dataset as an example)

The figure below shows the average time each method takes for a single inference.

Differences in inference time

Observations:

All the databases respond within milliseconds. Among the machine learning methods, only the regression-based methods are comparable; the others can be more than 30x slower.

Main conclusions:

Machine-learning-based methods improve accuracy substantially over traditional methods. However, with the exception of LW-XGB, they need far more time to train their models and are also slower at inference time.

Are Learned Methods Ready For Dynamic Environments?

In a real production environment, the data in a database changes frequently.

The difficulty in designing this experiment lies in accounting for the many kinds of change that data updates bring:

1. On the data side:

(1) how frequently the data is updated;

(2) how the data changes.

2. On the model side:

(1) how different models are updated;

(2) how long a model update takes;

(3) how to measure accuracy comprehensively, given that a model's accuracy differs before and after an update.

Therefore, in the second part of the experiments, the paper designs a dynamic evaluation framework to compare the machine learning methods with existing database products more fairly. A sketch of the idea and selected results follow.
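For intuition, here is a simplified, hypothetical version of such an evaluation loop; model.ingest() and model.estimate() are stand-ins, not the paper's actual interfaces. Update batches arrive on a fixed schedule, and a model that cannot finish retraining in time must keep answering queries from stale state:

import time

UPDATE_BUDGET_S = 5.0    # time allowed before the next update batch arrives

def evaluate_dynamic(model, update_batches, query_batches):
    q_errors, missed_updates = [], 0
    for updates, queries in zip(update_batches, query_batches):
        start = time.monotonic()
        model.ingest(updates)              # retrain or incrementally update
        if time.monotonic() - start > UPDATE_BUDGET_S:
            missed_updates += 1            # the model fell behind the data
        for query, actual in queries:
            est = model.estimate(query)    # may reflect stale statistics
            q_errors.append(max(actual, est) / max(min(actual, est), 1))
    return q_errors, missed_updates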

First, when the data update frequency is set high, we check whether each model can keep up; the results are as follows:

Whether each method can complete its update within the allotted time

Observations:

When the update frequency is high, most machine learning methods cannot keep up with the updates, while the databases can complete theirs most of the time.

Next, when the data update frequency is set especially low, so that every machine learning method has time to update, the accuracy of the different methods is ranked (1 is best, 5 is worst); the results are as follows:

Observations:

Even when all machine learning methods manage to update, no single method is consistently best across the different datasets.

Main conclusions:

Machine learning methods take longer to update their models and therefore often cannot keep up with the rate of data updates. Meanwhile, the estimation errors of the different methods vary across datasets; no method is clearly optimal.

When Do Learned Methods Go Wrong?

Compared with traditional heuristic methods, one drawback of machine learning and deep learning models is that they are relatively hard to understand and explain intuitively.

This also makes it hard to predict when they will perform poorly, which in turn makes it hard for users to truly trust and adopt them.

The authors also found some illogical model behaviors. For example:

1. The estimates are not monotonic. Consider the following two queries, where Q1's query range contains Q2's:

Since Q1's query range contains Q2's, Q1's estimate should logically be >= Q2's estimate. Yet in the actual experiments, Q2's estimate came out 60% larger than Q1's.
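This failure mode can be screened for with a simple consistency check; the estimator interface below is hypothetical:

def check_monotonic(estimator, outer, inner):
    # outer and inner are (low, high) ranges with inner contained in outer;
    # a logically consistent estimator must not give inner a larger estimate.
    assert outer[0] <= inner[0] and inner[1] <= outer[1]
    return estimator.estimate(*outer) >= estimator.estimate(*inner)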

2. The estimates are unstable. Repeating the same query 2000 times yields the following distribution of estimates:

Estimates for the same query range from 0 to 6000. Such behavior makes system performance unstable and erodes users' trust in the system.

Main conclusions:

Because the models are black boxes, the sources of their errors cannot be explained and their results are hard to predict.

Conclusions of the paper

Q: Can an ML-based module X (Learned X) be deployed in database products?

A: Not yet.

Although current machine-learning-based cardinality estimation methods are not yet ready for production, they show real potential in accuracy and could have a huge impact on the query optimizer.

To actually deploy them in real database products, the authors believe two problems must be solved:

1. Improve model training and query speed

At present, most machine learning models lag far behind database products in training and query time, and tuning the models themselves also costs database administrators time and effort.

Because their updates are too slow, many highly accurate models cannot cope with rapidly changing data.

Therefore, improving model training and query speed is an important prerequisite for deploying this class of methods in database products.

2. Improve model interpretability

Most existing learned methods rely on complex black-box models: it is hard to predict when they will go wrong, and hard to debug them when they do.

The paper's experiments also exposed some illogical behaviors, all of which stand in the way of real-world adoption.

Therefore, improving model interpretability will be a very important topic going forward. The authors believe one direction is to study more transparent models; another is to invest more effort in model interpretation and debugging.

About the authors

Wang Xiaoying, Qu Changbo, Wu Weiyuan: Ph.D. students, Simon Fraser University

Wang Jiannan: Associate Professor, Simon Fraser University

He received his Ph.D. from Tsinghua University in 2013 and did postdoctoral research at UC Berkeley's AMPLab from 2013 to 2015.

His honors include the 2020 Outstanding Young Researcher Award from the Canadian computer science association, the 2018 IEEE Data Engineering Rising Star Award, a 2016 ACM SIGMOD best demonstration award, the 2013 CCF Best Doctoral Dissertation Award, and a 2011 Google PhD Fellowship.

Zhou Qingqing: Tencent

References:

http://www.vldb.org/pvldb/vol14/p1640-wang.pdf
