impala

mirror of https://github.com/apache/impala.git synced 2026-02-03 09:00:39 -05:00

Riza Suminto 78ec9903cc IMPALA-14668: Upgrade to pytest 6.2.5

This patch upgrades pytest version from 2.9.2 to 6.2.5, the highest
pytest version available without the need to upgrade setuptools.

The pytest requirement is moved to py3-requirements.txt. We cannot go
back to testing with Python2 + pytest-2.9.2 after this patch because our
test scripts need to be adjusted as well. This is OK since testing has
defaulted to Python3 since IMPALA-14333 and impala-shell has not been
built for Python2 since IMPALA-14606. The adjustments are as follows:

- Replace deprecated @pytest.yield_fixture with plain @pytest.fixture.
- Replace --resultlog parameter (removed in version 6.0) with
  --report-log from pytest-reportlog plugin.
- Make impala-shell.sh bootstrap Python3
  venv (infra/python/env-gcc10.4.0-py3/) by default. Python2
  venv (infra/python/env-gcc10.4.0/) is not bootstrapped automatically
  anymore.
- Upgrade execnet to version 1.9.0. This is required by
  pytest-xdist==2.4.0.
- Remove unused pytest-runner plugin
- Add -Wonce argument in pytest.ini to slightly suppress warnings.
- Add "junit_logging = system-err" option at pytest.ini to continue
  logging stderr output to junit xml file.
- Fix rootdir and pytest.ini path programmatically at run-tests.py
  because individual pytest-xdist worker often does not pick this up
  correctly during parallel run.

With pytest-xdist==2.4.0, parallel EE tests do not show verbose
individual test names. They only show pytest progress markers like the
following line:

...ss....s................................s.............................

This is a known limitation in pytest-xdist because execnet, the
underlying library used for communication between master and workers,
does not support transferring stdout/stderr from workers.
https://pytest-xdist.readthedocs.io/en/stable/known-limitations.html

Read the following link for more detail about the deprecations:
https://docs.pytest.org/en/stable/deprecations.html

Change SKIP_SSL_MSG default to empty string because skipif does not
accept None reason anymore. Removed run-process-failure-tests.sh (unused
after IMPALA-5534) and unused pytest fixtures. Fixed erroneous log
formatting in test_restart_services.py.

Fixed small warnings found by pytest-6.2.5 that help stabilize
exhaustive tests run at:
- test_automatic_invalidation.py
- test_calcite_planner.py
- test_events_custom_configs.py
- test_session_expiration.py
- test_shell_interactive.py
- auto_scaler.py
- concurrent_workload.py
- hdfs_util.py

Most of the remaining warnings are about not closing resources properly
at the end of a test. These warnings should be addressed in follow-up
patches.

Testing:
- Pass exhaustive tests.

Change-Id: Ic3812fe976ef09ac48753dee30151714f4752c24
Reviewed-on: http://gerrit.cloudera.org:8080/23842
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2026-01-14 05:35:02 +00:00

leopard

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

tests

IMPALA-14668: Upgrade to pytest 6.2.5

2026-01-14 05:35:02 +00:00

util

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

__init__.py

Testing: Generate queries and compare results against other databases

2014-05-01 14:20:35 -07:00

cli_options.py

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

cluster.py

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

common.py

IMPALA-11977: Fix Python 3 broken imports and object model differences

2023-03-09 17:17:57 +00:00

compat.py

IMPALA-11973: Add absolute_import, division to all eligible Python files

2023-03-09 17:17:57 +00:00

data_generator_mapper.py

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

data_generator_mapred_common.py

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

data_generator_reducer.py

IMPALA-3918: Remove Cloudera copyrights and add ASF license header

2016-08-09 08:19:41 +00:00

data_generator.py

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

db_connection.py

IMPALA-14398: Deflake test_load_table_with_primary_key_attr

2025-09-11 21:12:14 +00:00

db_types.py

IMPALA-14333: Run impala-py.test using Python3

2025-09-03 10:01:29 +00:00

discrepancy_searcher.py

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

funcs.py

IMPALA-11974: Fix lazy list operators for Python 3 compatibility

2023-03-09 17:17:57 +00:00

model_translator.py

IMPALA-11973: Add absolute_import, division to all eligible Python files

2023-03-09 17:17:57 +00:00

ORACLE.txt

IMPALA-14501: Migrate most scripts from impala-python to impala-python3

2025-10-22 16:30:17 +00:00

POSTGRES.txt

IMPALA-4340: explain how to install postgresql-9.5 or higher

2016-10-31 23:18:55 +00:00

query_flattener.py

IMPALA-11973: Add absolute_import, division to all eligible Python files

2023-03-09 17:17:57 +00:00

query_generator.py

IMPALA-11975: Fix Dictionary methods to work with Python 3

2023-03-09 17:17:57 +00:00

query_profile.py

IMPALA-11975: Fix Dictionary methods to work with Python 3

2023-03-09 17:17:57 +00:00

query.py

IMPALA-11977: Fix Python 3 broken imports and object model differences

2023-03-09 17:17:57 +00:00

random_val_generator.py

IMPALA-11973: Add absolute_import, division to all eligible Python files

2023-03-09 17:17:57 +00:00

README

IMPALA-7170: Update data_generator.py for Hadoop 3

2018-07-29 02:25:30 +00:00

statement_generator.py

IMPALA-11974: Fix lazy list operators for Python 3 compatibility

2023-03-09 17:17:57 +00:00

README

Purpose:

This package is intended to augment the standard test suite. The standard tests are
more efficient with regard to features tested versus execution time. However, they
still leave gaps in query coverage. This package provides a
random query generator to compare the results of a wide range of queries against a
reference database engine. The queries range from very simple single-table selects to
extremely complicated ones with multiple levels of nesting. This method of testing is
slower but covers a larger area.

Requirements:

1) It's assumed that Impala is running locally. The minicluster should either be run with
Yarn (by setting INCLUDE_YARN=true and running ./buildall.sh -start_minicluster), or
mapreduce should be configured to use local mode (by modifying mapreduce.framework.name
in testdata/cluster/node_templates/common/etc/hadoop/conf/mapred-site.xml to 'local'
and running ./buildall.sh -start_minicluster)

2) Impyla -- an implementation of DB API 2 for Impala.

sudo pip install impyla

3) At least one python driver for a reference database.

sudo apt-get install python-mysqldb
sudo apt-get install python-psycopg2 # Postgresql

For Impala/Kudu CRUD random query generation and comparison, please see
the supplemental POSTGRES.txt on setting up PostgreSQL 9.5 or higher as
a reference database.

Please see the supplemental ORACLE.txt on setting up Oracle as a reference
database.

Usage:

1) Generate test data

./data_generator.py --use-postgresql

This will generate tables and data in Postgresql and Impala.

2) Run the comparison

./discrepancy_searcher.py

This will generate queries using the test database and compare the results against
Postgresql (the default).
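
The comparison idea can be illustrated with a small, self-contained sketch. This is
not the logic of discrepancy_searcher.py. Two in-memory sqlite3 databases stand in for
Impala and the reference database, and the helper names (make_db, fetch_multiset,
find_discrepancy) are hypothetical.

```python
import sqlite3
from collections import Counter


def make_db(rows):
    """Create an in-memory database with one small table (stand-in for a real engine)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


def fetch_multiset(conn, sql):
    """Run a query and return its rows as a multiset, so row order is ignored."""
    return Counter(conn.execute(sql).fetchall())


def find_discrepancy(test_conn, ref_conn, sql):
    """Return (rows only in test, rows only in reference); both are empty on agreement."""
    test_rows = fetch_multiset(test_conn, sql)
    ref_rows = fetch_multiset(ref_conn, sql)
    return test_rows - ref_rows, ref_rows - test_rows


rows = [(1, "a"), (2, "b"), (2, "b")]
test_db = make_db(rows)
ref_db = make_db(rows[:-1])  # deliberately drop one row to force a mismatch

only_test, only_ref = find_discrepancy(
    test_db, ref_db, "SELECT id, name FROM t WHERE id > 1")
print(only_test, only_ref)
```

Comparing multisets rather than lists keeps duplicate rows significant while ignoring
row order, which is generally unspecified without ORDER BY.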

Things to Know:

1) A good number of queries to run seems to be about 5k. Ideally each test run would
discover the complete list of known issues. From experience, a 1k query test run may
complete without finding any issues that were discovered in previous runs. 5k seems
to be about the magic number where most issues will be rediscovered. This can take 1-2
hours. However, as of this writing it's rare to run 1k queries without finding at
least one discrepancy.

2) It's possible to provide a randomization seed so that the randomness is actually
reproducible. The data generation currently has a default seed, so it will always produce
the same tables. This also means that if a new data type is added, those generated tables
will change.

3) There is a query log. It's possible that a sequence of queries is required to expose
a bug. If you come across a failure that can't be reproduced by rerunning the failed
query, try running the queries leading up to that query as well.
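
Points 2 and 3 work together: with a fixed seed, the same sequence of queries can be
regenerated, so the queries leading up to a failure can be replayed. The sketch below
only illustrates that idea. The generator, table names, and column names are
hypothetical and far simpler than the real query generator.

```python
import random

TABLES = ["t1", "t2", "t3"]       # hypothetical table names
COLUMNS = ["int_col", "str_col"]  # hypothetical column names


def generate_queries(seed, count):
    """Produce a deterministic sequence of trivial queries for a given seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return ["SELECT %s FROM %s" % (rng.choice(COLUMNS), rng.choice(TABLES))
            for _ in range(count)]


first_run = generate_queries(seed=42, count=10)
second_run = generate_queries(seed=42, count=10)

failed_index = 6  # pretend the 7th query of the first run exposed a discrepancy
replay = second_run[:failed_index + 1]  # the failed query plus everything before it
```

Using a dedicated random.Random instance, rather than the module-level functions,
keeps the sequence reproducible even if other code also draws random numbers.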

Miscellaneous:

1) Instead of generating new random queries with each run, it may be better to reuse a
list of queries from a previous run that are known to produce results. As of this
writing only about 50% of queries produce results, so it may be better to trade high
randomness for higher quality queries. For example, it would be possible to build up a
library of 100k queries that produce results and then randomly select 2.5k of those.
That might provide testing equivalent to 5k totally random queries in less
time.

This would also be useful for eliminating queries that have known issues.
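
A minimal sketch of the selection step, scaled down from 100k/2.5k to 100/25 so it
runs instantly. The library contents and the select_queries helper are hypothetical;
a real library would hold saved queries that are known to return rows.

```python
import random


def select_queries(library, count, seed):
    """Pick `count` distinct queries from the library, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(library, count)


# Placeholder library standing in for saved queries known to produce results.
library = ["SELECT %d" % i for i in range(100)]
batch = select_queries(library, 25, seed=7)
```

Queries with known issues could be filtered out of the library before sampling.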

Postgresql:

1) Supports basically all Impala language features. Exceptions include:

a) IGNORE NULLS clause for analytic functions

2) Sorts strings in a surprising order, e.g. '-1' > '1'. This may be important if ORDER BY
is ever used. The databases being compared would need to use the same collation, which is
probably configurable.
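
When the order of rows is not itself under test, one way to sidestep collation
differences is to put both result sets into the same canonical order on the client
before comparing them. This is only a sketch of that idea, not something the package
is known to do, and it does not help when ORDER BY is combined with LIMIT. The
canonical_order helper is hypothetical.

```python
def canonical_order(rows):
    """Sort rows by code point, with NULLs (None) last, independent of any database."""
    def key(row):
        return tuple((value is None, "" if value is None else str(value))
                     for value in row)
    return sorted(rows, key=key)


# The same rows as two engines with different collations might return them.
rows_a = [("1",), ("-1",), (None,)]
rows_b = [("-1",), (None,), ("1",)]

print(canonical_order(rows_a) == canonical_order(rows_b))
```

Python compares strings by code point, so here '-1' sorts before '1', the opposite
of the ordering noted above.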

MySQL:

1) Does not support analytics.

2) Has poor boolean support.

Oracle:

1) No Boolean support.

2) Better analytic function support, e.g. "IGNORE NULLS" is supported.

3) Oddities abound, such as no LIMIT clause.

Improvements:

1) Add the ability to incrementally increase the complexity of query profiles
automatically during execution. For example, the profile could start with no joins,
then after 100 or so queries, the number of joins could be increased. This would
lead to bugs that are not related to joins being found by much simpler queries.

2) Add support for simplifying buggy queries. When a random query fails the comparison
check, it is basically always much too complex to post directly in a bug report. It
is also time consuming to simplify the queries because there is a lot of trial and
error and manual editing of queries.

3) Add common built-in functions. Ex: NVL, ...

4) Add support for comparing results with codegen enabled and disabled. Uri recently added
support for query options in Impyla.
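
For improvement 2, one possible approach is a greedy reduction: repeatedly drop one
piece of the query and keep the smaller version whenever the failure still
reproduces. The sketch below assumes the query has already been split into removable
parts (here, WHERE predicates) and that a still_fails callback reruns the comparison.
Both are hypothetical and nothing like this exists in the package yet.

```python
def simplify(parts, still_fails):
    """Greedily remove parts one at a time while the failure keeps reproducing."""
    parts = list(parts)
    changed = True
    while changed:
        changed = False
        for i in range(len(parts)):
            candidate = parts[:i] + parts[i + 1:]
            # Never reduce to nothing; keep the candidate only if it still fails.
            if candidate and still_fails(candidate):
                parts = candidate
                changed = True
                break
    return parts


predicates = ["a > 1", "b IS NULL", "c = 'x'", "d < 5"]


def still_fails(candidate):
    """Stand-in for rerunning the comparison: pretend 'b IS NULL' triggers the bug."""
    return "b IS NULL" in candidate


minimal = simplify(predicates, still_fails)
print(" AND ".join(minimal))
```

Each call to still_fails costs one query against both databases, so the number of
reruns grows roughly quadratically with the number of parts in the worst case.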