This puts all of the thrift-generated python code into the
impala_thrift_gen package. This is similar to what Impyla
does for its thrift-generated python code, except that it
uses the impala_thrift_gen package rather than impala._thrift_gen.
This is a preparatory patch for fixing the absolute import
issues.
This patches all of the thrift files to add the python namespace,
and includes code to apply the same patching to the thirdparty
thrift files (hive_metastore.thrift, fb303.thrift).
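The namespace patching can be sketched roughly like this (a
hypothetical helper for illustration only; the module-name handling
is an assumption, not the patch's actual logic):

```python
import re

def add_python_namespace(thrift_source, module_name):
    # Hypothetical sketch: route the generated python code into the
    # impala_thrift_gen package via a 'namespace py' declaration.
    namespace_line = "namespace py impala_thrift_gen." + module_name
    if re.search(r"^namespace py ", thrift_source, flags=re.MULTILINE):
        # Rewrite an existing python namespace declaration in place.
        return re.sub(r"^namespace py .*$", namespace_line,
                      thrift_source, flags=re.MULTILINE)
    # No python namespace yet: prepend one.
    return namespace_line + "\n" + thrift_source
```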
Putting all the generated python into a package makes it easier
to understand where imports are coming from. When the
subsequent change rearranges the shell code, the thrift generated
code can stay in a separate directory.
This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.
Testing:
- Ran a core job
Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 for why the workload
affects exploration_strategy(). An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and
exhaustive runs before this change.
get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope for this patch.
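The behavior change can be modeled in a small sketch
(exploration_strategy()'s real signature and logic differ; this only
captures the workload dependence described in IMPALA-3947):

```python
# Simplified model: the workload determines which exploration strategy
# applies. Names other than get_workload() are illustrative.
def exploration_strategy(workload, requested_strategy):
    # Historically, non-'functional-query' workloads fell back to
    # 'core' even when an 'exhaustive' run was requested.
    if requested_strategy == 'exhaustive' and workload != 'functional-query':
        return 'core'
    return requested_strategy

class ImpalaTestSuite(object):
    @classmethod
    def get_workload(cls):
        # Default added by this change; suites with other workloads
        # still override this.
        return 'functional-query'

class CustomClusterTestSuite(ImpalaTestSuite):
    # Previously returned 'tpch' here; now inherits 'functional-query',
    # so exhaustive runs stay exhaustive.
    pass
```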
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This introduces the IMPALA_USE_PYTHON3_TESTS environment variable
to select whether to run tests using the toolchain Python 3.
This is an experimental option, so it defaults to false,
continuing to run tests with Python 2.
This fixes a first batch of Python 2 vs 3 issues:
- Deciding whether to open a file in bytes mode or text mode
- Adapting to APIs that operate on bytes in Python 3 (e.g. codecs)
- Eliminating 'basestring' and 'unicode' locations in tests/ by using
the recommendations from future
( https://python-future.org/compatible_idioms.html#basestring and
https://python-future.org/compatible_idioms.html#unicode )
- Using impala-python3 for bin/start-impala-cluster.py
All fixes leave the Python 2 path working normally.
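The basestring/unicode replacements follow the python-future
compatible idioms linked above; the pattern can be sketched
stdlib-only like this (names here are illustrative):

```python
import sys

# Python 2/3 compatible string checks, per the python-future
# compatible idioms. On Python 3, str covers what basestring and
# unicode did on Python 2.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821 (Python 2 only)

def is_text(value):
    # True for text strings on both Python versions, False for bytes
    # on Python 3.
    return isinstance(value, string_types)
```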
Testing:
- Ran an exhaustive run with Python 2 to verify nothing broke
- Verified that the new environment variable works and that
it uses Python 3 from the toolchain when specified
Change-Id: I177d9b8eae9b99ba536ca5c598b07208c3887f8c
Reviewed-on: http://gerrit.cloudera.org:8080/21474
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Fixes writing an empty parquet page when a page fills (or reaches
parquet_page_row_count_limit) at the same time that its dictionary
fills.
When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect that the page
was full and create a new page. It would then detect that the dictionary
was full and create another page, resulting in an empty page.
Parquet readers like Hive raise an error if they encounter an empty
page. This patch attempts to make it impossible to generate an empty
page by reworking AppendRow and adding DCHECKs for empty pages.
Dictionary size is now checked in FinalizeCurrentPage, so whenever a
page is written, the dictionary is also flushed if it is full.
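The corrected control flow can be modeled in a few lines (a toy
sketch with invented names; the real logic lives in Impala's C++
Parquet writer):

```python
# Toy model of the fix: finalizing a page also flushes a full
# dictionary, so appending a row never creates two new pages back to
# back (one of them empty).
class ToyPageWriter(object):
    def __init__(self, page_limit, dict_limit):
        self.page_limit = page_limit
        self.dict_limit = dict_limit
        self.current_page = []
        self.dictionary = set()
        self.flushed_pages = []

    def finalize_current_page(self):
        # Mirrors the DCHECK added by the patch: never write an
        # empty page.
        assert self.current_page, "empty page"
        self.flushed_pages.append(list(self.current_page))
        self.current_page = []
        # Key part of the fix: check dictionary size here, so a full
        # dictionary is flushed together with the page.
        if len(self.dictionary) >= self.dict_limit:
            self.dictionary.clear()

    def append_row(self, value):
        self.current_page.append(value)
        self.dictionary.add(value)
        if (len(self.current_page) >= self.page_limit
                or len(self.dictionary) >= self.dict_limit):
            self.finalize_current_page()
```

With page and dictionary limits that fill simultaneously, the model
flushes exactly one non-empty page per overflow instead of two.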
Addresses clang-tidy by adding override in source files.
Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by a large value.
This seems useful as a theoretical corner case; it currently writes
the too-large value to the page anyway, but if we ever start checking
whether the first value will fit in the page, this could become an
issue.
Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Reviewed-on: http://gerrit.cloudera.org:8080/19898
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some of them
to use the integer division '//' operator where an integer result was
needed (e.g. for indices, counts of records, etc.). Some code was also
using relative imports and needed to be adjusted to handle
absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
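The import line added to each file, and the division difference it
papers over, look like this:

```python
# The three __future__ imports added to every Python file by this
# change. They are no-ops on Python 3 but change Python 2's behavior.
from __future__ import absolute_import, division, print_function

# With 'division' imported, / is true division even on Python 2 ...
assert 7 / 2 == 3.5
# ... so code that needs an integer result (indices, record counts)
# must use the integer division operator //.
assert 7 // 2 == 3
```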
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
IMPALA-7307 disabled column index writing for floating point columns
until PARQUET-1222 is resolved. However, only the NaN values are
problematic, so we can write the column index when no NaNs are
present in the data.
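The gating condition amounts to the following (a hypothetical
helper, not the actual C++ writer code):

```python
import math

def can_write_column_index(values):
    # Per PARQUET-1222, NaN has no well-defined ordering in Parquet
    # statistics, so skip the column index whenever the column
    # contains a NaN (illustrative sketch).
    return not any(isinstance(v, float) and math.isnan(v)
                   for v in values)
```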
Testing:
* Added tests which should fail if a column index is
present while the column contains NaN values.
Change-Id: Ic9d367500243c8ca142a16ebfeef6c841f013434
Reviewed-on: http://gerrit.cloudera.org:8080/14264
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds two new query options to Impala. One enables or
disables Parquet page index writing (enabled by default); the other
sets a row count limit per Parquet page (no limit by default).
It removes the old command-line flag that controlled page index
writing.
Since page index writing has been enabled by default since IMPALA-5843,
I moved the tests from the "custom cluster" test suite to the "query
test" test suite. This way the tests run faster because we don't need
to restart the Impala daemons.
Testing:
Added new test cases to test the effect of the query options.
Change-Id: Ib9ec8b16036e1fd35886e887809be8eca52a6982
Reviewed-on: http://gerrit.cloudera.org:8080/13361
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon that toggles Impala's ability to write the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
I removed 'num_data_pages_' from BaseColumnWriter since it was rather
confusing and didn't provide any measurable performance improvement.
This commit fixes the ASAN error produced by the first IMPALA-7644
commit which was reverted later.
Change-Id: Ib4a9098a2085a385351477c715ae245d83bf1c72
Reviewed-on: http://gerrit.cloudera.org:8080/11694
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon that toggles Impala's ability to write the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
Change-Id: If9994882aa59cbaf3ae464100caa8211598287bc
Reviewed-on: http://gerrit.cloudera.org:8080/11563
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Impala master branch can already write the Parquet
page index. However, we still don't have a well-defined
ordering for floating-point numbers in Parquet; see
PARQUET-1222.
Currently Impala writes the page index with
fmax()/fmin() semantics, but this might contradict the
future semantics that will be defined once PARQUET-1222
is resolved.
With this patch, Impala won't write the column index
for floating-point columns until PARQUET-1222 is
resolved and implemented.
I updated the python test accordingly.
Change-Id: I50aa2e6607de6a8943eb068b8162b0506763078b
Reviewed-on: http://gerrit.cloudera.org:8080/10951
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test in the title fails when the local filesystem is used.
Looking at the error message, it seems that the detected
Parquet file size is too small when the local filesystem
is used. There is already an annotation for that:
'SkipIfLocal.parquet_file_size'
I added this annotation to the TestHdfsParquetTableIndexWriter
class, so these tests won't be executed when the
test-warehouse directory of Impala resides on the local
filesystem.
Change-Id: Idd3be70fb654a49dda44309a8914fe1f2b48a1af
Reviewed-on: http://gerrit.cloudera.org:8080/10476
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/
The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.
I updated Pooja's python tests as well.
Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Reviewed-on: http://gerrit.cloudera.org:8080/9693
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>