This is the initial commit and is a work in progress. See the README for a
list of possible improvements.
As an overview of how the files are related:
model.py: This is the base upon which the other files are built. It
contains something like a grammar for queries.
query_generator.py: Generates random permutations of the model.
model_translator.py: Produces SQL based on the model.
discrepancy_searcher.py: Uses the above to generate, run, and compare
query results.
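As a rough sketch of how those pieces fit together (every name in this sketch is
made up for illustration; the real modules define their own interfaces):

    # Purely illustrative pseudo-flow, not the package's actual API.
    import random

    def generate_query_model(rng, columns):
        """Stand-in for query_generator.py: pick a random query shape from the model."""
        return {'select': rng.sample(columns, 2), 'from': 'table_1'}

    def translate_to_sql(query_model):
        """Stand-in for model_translator.py: render a query model as SQL text."""
        return 'SELECT %s FROM %s' % (', '.join(query_model['select']), query_model['from'])

    def results_match(impala_cursor, reference_cursor, sql):
        """Stand-in for discrepancy_searcher.py: run the SQL on both engines and diff."""
        impala_cursor.execute(sql)
        reference_cursor.execute(sql)
        return sorted(impala_cursor.fetchall()) == sorted(reference_cursor.fetchall())

    model = generate_query_model(random.Random(1234), ['col_1', 'col_2', 'col_3'])
    print(translate_to_sql(model))   # e.g. SELECT col_3, col_1 FROM table_1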
Change-Id: Iaca6277766f5a86568eaa3f05b99c832942ab38b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1648
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
Purpose:
This package is intended to augment the standard test suite. The standard tests are
more efficient with regard to features tested versus execution time, but they still
leave gaps in query coverage. This package provides a random query generator to
compare the results of a wide range of queries against a reference database engine.
The queries range from very simple single-table selects to extremely complicated
queries with multiple levels of nesting. This method of testing is slower but has a
larger coverage area.
Requirements:
1) It's assumed that Impala is running locally.
2) Impyla -- an implementation of DB API 2 for Impala.
sudo pip install git+http://github.com/laserson/impyla.git#impyla
3) At least one Python driver for a reference database.
sudo apt-get install python-mysqldb # MySQL
sudo apt-get install python-psycopg2 # Postgresql
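A quick way to sanity-check the drivers above (the hosts, ports, credentials, and
database names below are assumptions; adjust them for your setup):

    from impala.dbapi import connect as impala_connect   # Impyla's DB API 2 module
    import MySQLdb

    # Impala's HiveServer2 endpoint, assumed to be running on the default local port.
    impala_cursor = impala_connect(host='localhost', port=21050).cursor()
    impala_cursor.execute('SELECT 1 + 1')
    print(impala_cursor.fetchone())   # a single row containing 2

    # The reference database; the credentials and database name are placeholders.
    mysql_conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='test')
    mysql_cursor = mysql_conn.cursor()
    mysql_cursor.execute('SELECT 1 + 1')
    print(mysql_cursor.fetchone())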
Usage:
1) Generate test data
./data_generator.py --use-mysql
This will generate tables and data in MySQL and Impala.
2) Run the comparison
./discrepancy_searcher.py
This will generate queries using the test database and compare the results against
MySQL (the default).
Known Issues:
1) Floats will produce false positives. For example, a query that has
SELECT FLOOR(COUNT(...) * AVG(...)) AS col_1
will produce different results on Impala and MySQL if COUNT() == 3 and AVG() == 1/3.
One of the databases will compute FLOOR(1) while the other will compute FLOOR(0.999...);
see the illustration after this list.
Maybe this could be worked around or reduced by replacing all uses of AVG() with
"AVG() + foo", where foo is some number that makes it unlikely that
"COUNT() * (AVG() + foo)" will result in an int.
I'd guess this issue comes up in 1 out of 10-20k queries.
2) Impyla may fail with "Invalid query handle". Some queries will fail every time when run
through Impyla but run fine through the impala-shell. I need to research more and file
an issue with Impyla.
3) Impyla will fail with "Invalid session". I'm pretty sure this is also an Impyla issue
but it needs more investigation.
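A rough illustration of issue 1, using Python's float and Decimal types as stand-ins
for the two engines' arithmetic (the types Impala and MySQL actually use differ; this
only shows how landing on either side of 1.0 changes the FLOOR() result):

    import math
    from decimal import Decimal

    rows = [0, 0, 1]                 # an integer column where AVG(...) == 1/3
    cnt = len(rows)                  # COUNT(...) == 3

    # Binary floating point: 3 * (1/3) happens to round up to exactly 1.0 here.
    avg_as_float = sum(rows) / float(cnt)
    print(math.floor(cnt * avg_as_float))     # 1

    # Fixed-precision decimal: 3 * 0.3333... stays just below 1.0.
    avg_as_decimal = Decimal(sum(rows)) / Decimal(cnt)
    print(math.floor(cnt * avg_as_decimal))   # 0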
Things to Know:
1) A good number of queries to run seems to be about 5k. Ideally each test run would
discover the complete list of known issues. From experience a 1k query test run may
complete without finding any issues that were discovered in previous runs. 5k seems
to be about the magic number where most issues will be rediscovered. This can take 1-2
hours. However as of this writing it's rare to run 1k queries without finding at
least one discrepancy.
2) It's possible to provide a randomization seed so that the randomness is actually
reproducible. The data generation currently has a default seed so it will always produce
the same tables. This also means that if a new data type is added, the generated tables
will change. (A seeding sketch follows this list.)
3) There is a query log. It's possible that a sequence of queries is required to expose
a bug. If you come across a failure that can't be reproduced by rerunning the failed
query, try running the queries leading up to that query as well.
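A minimal sketch of why a fixed seed makes the generated data reproducible (the
function and value ranges below are made up):

    import random

    def generate_int_column(seed, num_rows=5):
        # An isolated, seeded RNG always yields the same pseudo-random sequence.
        rng = random.Random(seed)
        return [rng.randint(-1000, 1000) for _ in range(num_rows)]

    print(generate_int_column(1234))   # identical output on every run
    print(generate_int_column(1234))   # same list as the line above
    print(generate_int_column(5678))   # a different seed gives different data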
Miscellaneous:
1) Instead of generating new random queries with each run, it may be better to reuse a
list of queries from a previous run that are known to produce results. As of this
writing only about 50% of queries produce results, so it may be worth trading high
randomness for higher quality queries. For example, it would be possible to build up a
library of 100k queries that produce results, then randomly select 2.5k of those.
Maybe that would provide testing equivalent to 5k totally random queries in less
time.
This would also be useful in eliminating queries that hit the known issues above.
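A sketch of what that reuse could look like, assuming the result-producing queries
were saved to a plain text file with one query per line (the file name and format are
assumptions):

    import random

    # Load a previously built library of queries known to produce results.
    with open('result_producing_queries.sql') as query_file:   # hypothetical file
        query_library = [line.strip() for line in query_file if line.strip()]

    # Trade some randomness for higher quality queries: run a random 2.5k subset
    # instead of generating 5k brand new queries.
    queries_to_run = random.sample(query_library, min(2500, len(query_library)))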
Postgresql:
1) Supports basically all Impala language features.
2) Does integer division: 1 / 2 == 0.
3) Has strange sorting of strings, such as '-1' > '1'. This may be important if ORDER BY
is ever used. The databases being compared would need to have the same collation, which
is probably configurable. (A quick check of items 2 and 3 follows this list.)
4) This was the original reference database but I moved to MySQL while trying to add
support for floats and never moved back.
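A quick check of those two quirks through psycopg2 (the connection parameters are
placeholders, and the string comparison result depends on the database's collation):

    import psycopg2

    cursor = psycopg2.connect(host='localhost', dbname='postgres',
                              user='postgres', password='postgres').cursor()

    cursor.execute('SELECT 1 / 2')
    print(cursor.fetchone()[0])               # 0 -- integer division truncates

    cursor.execute("SELECT '-1'::text > '1'::text")
    print(cursor.fetchone()[0])               # True or False, depending on collation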
MySQL:
1) Supports basically all Impala language features, except that the WITH clause requires
emulation with inline views (see the sketch below).
2) Has poor boolean support. It may be worth switching back to Postgresql for this.
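For item 1 above, this is the kind of rewrite the SQL translator has to produce for
MySQL; the table and column names are made up:

    # The WITH form that Impala accepts directly.
    with_clause_sql = """
        WITH t1 AS (SELECT col_1, col_2 FROM table_1)
        SELECT MAX(col_1) FROM t1 GROUP BY col_2
    """

    # The equivalent inline view (derived table) form needed for MySQL.
    inline_view_sql = """
        SELECT MAX(col_1)
        FROM (SELECT col_1, col_2 FROM table_1) AS t1
        GROUP BY col_2
    """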
Improvements:
1) Add support for simplifying buggy queries. When a random query fails the comparison
check it is basically always much too complex to post directly in a bug report. It
is also time consuming to simplify the queries because there is a lot of trial and
error and manual editing of queries.
2) Add more language features
a) SEMI JOIN
b) ORDER BY
c) LIMIT, OFFSET
d) CASE / WHEN
3) Add common built-in functions. Ex: CAST, IF, NVL, ...
4) Make randomization of the query generation configurable. As of this writing all the
probabilities are hard-coded. At a very minimum it should be easy to disable or force
the use of some language features such as CROSS JOIN, GROUP BY, etc.
5) More investigation of using the existing "functional" test datasets. A very quick
trial run wasn't successful but another attempt with more effort should be made before
introducing a new dataset.
I suspect the problem with using the functional dataset was that I only imported a few
tables, maybe alltypes, alltypesagg, and something else. I don't think I imported the
tiny tables since the odds of them producing results from a random query would be
very low.
6) If the functional dataset cannot be used, someone should think more about what the
random data should be like. Only a few minutes of thought were put into selecting
random value ranges (including number of tables and columns), and it's not clear how
important those ranges are.
7) Add support for comparing results with codegen enabled and disabled. Uri recently added
support for query options in Impyla.
8) Consider adding Oracle or SQL Server support; these could be useful in the future for
analytic queries.
9) Try running with tables in various formats. Ex: parquet and/or avro.
10) Support for more data types. Only int types are known to give good results.
Floats may work but non-numeric types are not supported yet.