mirror of
https://github.com/apache/impala.git
synced 2025-12-31 15:00:10 -05:00
This is the intital commit and is a work in progress. See the README for a
list of possible improvements.
As an overview of how the files are related:
model.py: This is the base upon which the other files are built. It
contains something like a grammer for queries.
query_generator.py: Generates random permutations of the model.
model_translator.py: Produces SQL based on the model
discrepancy_searcher.py: Uses the above to generate, run, and compare
query results.
Change-Id: Iaca6277766f5a86568eaa3f05b99c832942ab38b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1648
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
165 lines
6.2 KiB
Plaintext
165 lines
6.2 KiB
Plaintext
Purpose:
|
|
|
|
This package is intended to augment the standard test suite. The standard tests are
|
|
more efficient with regards to features tested versus execution time. However their
|
|
coverage as a test suite still leaves gaps in query coverage. This package provides a
|
|
random query generator to compare the results of a wide range of queries against a
|
|
reference database engine. The queries will range from very simple single table selects to
|
|
extremely complicated with multiple level of nesting. This method of testing will be
|
|
slower but has a larger coverage area.
|
|
|
|
|
|
Requirements:
|
|
|
|
1) It's assumed that Impala is running locally.
|
|
|
|
2) Impyla -- an implementation of DB API 2 for Impala.
|
|
|
|
sudo pip install git+http://github.com/laserson/impyla.git#impyla
|
|
|
|
3) At least one python driver for a reference database.
|
|
|
|
sudo apt-get install python-mysqldb
|
|
sudo apt-get install python-psycopg2 # Postgresql
|
|
|
|
|
|
Usage:
|
|
|
|
1) Generate test data
|
|
|
|
./data_generator.py --use-mysql
|
|
|
|
This will generate tables and data in MySQL and Impala
|
|
|
|
|
|
2) Run the comparison
|
|
|
|
./discrepancy_searcher.py
|
|
|
|
This will generate queries using the test database and compare the results against
|
|
MySQL (the default).
|
|
|
|
|
|
Known Issues:
|
|
|
|
1) Floats will produce false-positives. For example the results of a query that has
|
|
|
|
SELECT FLOOR(COUNT(...) * AVG(...)) AS col_1
|
|
|
|
will produce different results on Impala and MySQL if COUNT() == 3 and AVG() == 1/3.
|
|
One of the databasses will FLOOR(1) while the other will FLOOR(0.999).
|
|
|
|
Maybe this could be worked around or reduced by replacing all uses of AVG() with
|
|
"AVG() + foo", where foo is some number that makes it unlikely that
|
|
"COUNT() * (AVG() + foo)" will result in an int.
|
|
|
|
I'd guess this issue comes up in 1 out of 10-20k queries.
|
|
|
|
2) Impyla may fail with "Invalid query handle". Some queries will fail every time when run
|
|
through Impyla but run fine through the impala-shell. I need to research more and file
|
|
an issue with Impyla.
|
|
|
|
3) Impyla will fail with "Invalid session". I'm pretty sure this is also an Impyla issue
|
|
but also need to investigate more.
|
|
|
|
|
|
Things to Know:
|
|
|
|
1) A good number of queries to run seems to be about 5k. Ideally each test run would
|
|
discover the complete list of known issues. From experience a 1k query test run may
|
|
complete without finding any issues that were discovered in previous runs. 5k seems
|
|
to be about the magic number were most issues will be rediscovered. This can take 1-2
|
|
hours. However as of this writing it's rare to run 1k queries without finding at
|
|
least one discrepancy.
|
|
|
|
2) It's possible to provide a randomization seed so that the randomness is actually
|
|
reproducable. The data generation currently has a default seed so will always produce
|
|
the same tables. This also mean if a new data type is added those generated tables
|
|
will change.
|
|
|
|
3) There is a query log. It's possible that a sequence of queries is required to expose
|
|
a bug. If you come across a failure that can't be reproduced by rerunning the failed
|
|
query, try running the queries leading up to that query as well.
|
|
|
|
|
|
Miscellaneous:
|
|
|
|
1) Instead of generating new random queries with each run, it may be better to reuse a
|
|
list of queries from a previous run that are known to produce results. As of this
|
|
writing only about 50% of queries produce results. So it may be better to trade high
|
|
randomness for higher quality queries. For example it would be possible to build up a
|
|
library of 100k queries that produce results then randomly select 2.5k of those.
|
|
Maybe that would provide testing equivalent to 5k totally random queries in less
|
|
time.
|
|
|
|
This would also be useful in eliminating queries that have known issues above.
|
|
|
|
|
|
Postgresql:
|
|
|
|
1) Supports bascially all Impala language features
|
|
|
|
2) Does int division, 1 / 2 == 0
|
|
|
|
3) Has strange sorting of strings, '-1' > '1'. This may be important if ORDER BY is ever
|
|
used. The databases being compared would need to have the same collation, which is
|
|
probably configurable.
|
|
|
|
4) This was the original reference database but I moved to MySQL while trying to add
|
|
support for floats and never moved back.
|
|
|
|
|
|
MySQL:
|
|
|
|
1) Supports bascially all Impala language features, except WITH clause requires emulation
|
|
with inline views.
|
|
|
|
2) Has poor boolean support. It may be worth switching back to Postgresql for this.
|
|
|
|
|
|
Improvements:
|
|
|
|
1) Add support for simplifing buggy queries. When a random query fails the comparison
|
|
check it is basically always much too complex for directly posting a bug report. It
|
|
is also time consuming to simplify the queries because there is a lot of trial and
|
|
error and manually editing queries.
|
|
|
|
2) Add more language features
|
|
|
|
a) SEMI JOIN
|
|
b) ORDER BY
|
|
c) LIMIT, OFFSET
|
|
d) CASE / WHEN
|
|
|
|
3) Add common built-in functions. Ex: CAST, IF, NVL, ...
|
|
|
|
4) Make randomization of the query generation configurable. As of this writing all the
|
|
probabilities are hard-coded. At a very minimum it should be easy to disable or force
|
|
the use of some language features such as CROSS JOIN, GROUP BY, etc.
|
|
|
|
5) More investingation of using the existing "functional" test datasets. A very quick
|
|
trial run wasn't successful but another attempt with more effort should be made before
|
|
introducing a new dataset.
|
|
|
|
I suspect the problem with using the functional dataset was that I only imported a few
|
|
tables, maybe alltypes, alltypesagg, and something else. I don't think I imported the
|
|
tiny tables since the odds of them producing results from a random query would be
|
|
very low.
|
|
|
|
6) If the functional dataset cannot be used, someone should think more about what the
|
|
random data should be like. Only a few minutes of thought were put into selecting
|
|
random value ranges (including number of tables and columns), and it's not clear how
|
|
important those ranges are.
|
|
|
|
7) Add support for comparing results with codegen enabled and disabled. Uri recently added
|
|
support for query options in Impyla.
|
|
|
|
8) Consider adding Oracle or SQL Server support, these could be useful in the future for
|
|
analytic queries.
|
|
|
|
9) Try running with tables in various formats. Ex: parquet and/or avro.
|
|
|
|
10) Support for more data types. Only int types are known to give good results.
|
|
Floats may work but non-numeric types are not supported yet.
|
|
|