This puts all of the thrift-generated python code into the
impala_thrift_gen package. This is similar to what Impyla
does for its thrift-generated python code, except that it
uses the impala_thrift_gen package rather than impala._thrift_gen.
This is a preparatory patch for fixing the absolute import
issues.
This patches all of the thrift files to add the python namespace,
and includes code to apply the same patching to the thirdparty
thrift files (hive_metastore.thrift, fb303.thrift).
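The namespace patching can be sketched roughly like this (a
hypothetical helper for illustration only; the module-name handling
is an assumption, not the patch's actual logic):

```python
import re

def add_python_namespace(thrift_source, module_name):
    # Hypothetical sketch: route the generated python code into the
    # impala_thrift_gen package via a 'namespace py' declaration.
    namespace_line = "namespace py impala_thrift_gen." + module_name
    if re.search(r"^namespace py ", thrift_source, flags=re.MULTILINE):
        # Rewrite an existing python namespace declaration in place.
        return re.sub(r"^namespace py .*$", namespace_line,
                      thrift_source, flags=re.MULTILINE)
    # No python namespace yet: prepend one.
    return namespace_line + "\n" + thrift_source
```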
Putting all the generated python into a package makes it easier
to understand where imports are coming from. When the
subsequent change rearranges the shell code, the thrift generated
code can stay in a separate directory.
This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.
Testing:
- Ran a core job
Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 for why the workload
affects exploration_strategy(). An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and
exhaustive runs before this change.
get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope for this patch.
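The behavior change can be modeled in a small sketch
(exploration_strategy()'s real signature and logic differ; this only
captures the workload dependence described in IMPALA-3947):

```python
# Simplified model: the workload determines which exploration strategy
# applies. Names other than get_workload() are illustrative.
def exploration_strategy(workload, requested_strategy):
    # Historically, non-'functional-query' workloads fell back to
    # 'core' even when an 'exhaustive' run was requested.
    if requested_strategy == 'exhaustive' and workload != 'functional-query':
        return 'core'
    return requested_strategy

class ImpalaTestSuite(object):
    @classmethod
    def get_workload(cls):
        # Default added by this change; suites with other workloads
        # still override this.
        return 'functional-query'

class CustomClusterTestSuite(ImpalaTestSuite):
    # Previously returned 'tpch' here; now inherits 'functional-query',
    # so exhaustive runs stay exhaustive.
    pass
```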
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This introduces the IMPALA_USE_PYTHON3_TESTS environment variable
to select whether to run tests using the toolchain Python 3.
This is an experimental option, so it defaults to false,
continuing to run tests with Python 2.
This fixes a first batch of Python 2 vs 3 issues:
- Deciding whether to open a file in bytes mode or text mode
- Adapting to APIs that operate on bytes in Python 3 (e.g. codecs)
- Eliminating 'basestring' and 'unicode' locations in tests/ by using
the recommendations from future
( https://python-future.org/compatible_idioms.html#basestring and
https://python-future.org/compatible_idioms.html#unicode )
- Using impala-python3 for bin/start-impala-cluster.py
All fixes leave the Python 2 path working normally.
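The basestring/unicode replacements follow the python-future
compatible idioms linked above; the pattern can be sketched
stdlib-only like this (names here are illustrative):

```python
import sys

# Python 2/3 compatible string checks, per the python-future
# compatible idioms. On Python 3, str covers what basestring and
# unicode did on Python 2.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821 (Python 2 only)

def is_text(value):
    # True for text strings on both Python versions, False for bytes
    # on Python 3.
    return isinstance(value, string_types)
```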
Testing:
- Ran an exhaustive run with Python 2 to verify nothing broke
- Verified that the new environment variable works and that
it uses Python 3 from the toolchain when specified
Change-Id: I177d9b8eae9b99ba536ca5c598b07208c3887f8c
Reviewed-on: http://gerrit.cloudera.org:8080/21474
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Fixes writing an empty parquet page when a page fills (or reaches
parquet_page_row_count_limit) at the same time that its dictionary
fills.
When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect that the page
was full and create a new page. It would then detect that the dictionary
was full and create another page, resulting in an empty page.
Parquet readers like Hive raise an error if they encounter an empty
page. This patch attempts to make it impossible to generate an empty
page by reworking AppendRow and adding DCHECKs for empty pages.
Dictionary size is now checked in FinalizeCurrentPage, so whenever a
page is written, the dictionary is also flushed if it is full.
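The corrected control flow can be modeled in a few lines (a toy
sketch with invented names; the real logic lives in Impala's C++
Parquet writer):

```python
# Toy model of the fix: finalizing a page also flushes a full
# dictionary, so appending a row never creates two new pages back to
# back (one of them empty).
class ToyPageWriter(object):
    def __init__(self, page_limit, dict_limit):
        self.page_limit = page_limit
        self.dict_limit = dict_limit
        self.current_page = []
        self.dictionary = set()
        self.flushed_pages = []

    def finalize_current_page(self):
        # Mirrors the DCHECK added by the patch: never write an
        # empty page.
        assert self.current_page, "empty page"
        self.flushed_pages.append(list(self.current_page))
        self.current_page = []
        # Key part of the fix: check dictionary size here, so a full
        # dictionary is flushed together with the page.
        if len(self.dictionary) >= self.dict_limit:
            self.dictionary.clear()

    def append_row(self, value):
        self.current_page.append(value)
        self.dictionary.add(value)
        if (len(self.current_page) >= self.page_limit
                or len(self.dictionary) >= self.dict_limit):
            self.finalize_current_page()
```

With page and dictionary limits that fill simultaneously, the model
flushes exactly one non-empty page per overflow instead of two.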
Addresses clang-tidy by adding override in source files.
Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by a large value.
This seems useful as a theoretical corner case; it currently writes
the too-large value to the page anyway, but if we ever start checking
whether the first value will fit in the page, this could become an
issue.
Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Reviewed-on: http://gerrit.cloudera.org:8080/19898
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some of them
to use the integer division '//' operator where an integer result was
needed (e.g. for indices, counts of records, etc.). Some code was also
using relative imports and needed to be adjusted to handle
absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
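The import line added to each file, and the division difference it
papers over, look like this:

```python
# The three __future__ imports added to every Python file by this
# change. They are no-ops on Python 3 but change Python 2's behavior.
from __future__ import absolute_import, division, print_function

# With 'division' imported, / is true division even on Python 2 ...
assert 7 / 2 == 3.5
# ... so code that needs an integer result (indices, record counts)
# must use the integer division operator //.
assert 7 // 2 == 3
```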
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
IMPALA-7307 disabled column index writing for floating point columns
until PARQUET-1222 is resolved. However, only the NaN values are
problematic, so we can write the column index when no NaNs are
present in the data.
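The gating condition amounts to the following (a hypothetical
helper, not the actual C++ writer code):

```python
import math

def can_write_column_index(values):
    # Per PARQUET-1222, NaN has no well-defined ordering in Parquet
    # statistics, so skip the column index whenever the column
    # contains a NaN (illustrative sketch).
    return not any(isinstance(v, float) and math.isnan(v)
                   for v in values)
```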
Testing:
* Added tests which should fail if a column index is
present while the column contains NaN values.
Change-Id: Ic9d367500243c8ca142a16ebfeef6c841f013434
Reviewed-on: http://gerrit.cloudera.org:8080/14264
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds two new query options to Impala. One enables or
disables Parquet page index writing (enabled by default); the other
sets a row count limit per Parquet page (no limit by default).
It removes the old command-line flag that controlled page index
writing.
Since page index writing has been enabled by default since IMPALA-5843,
I moved the tests from the "custom cluster" test suite to the "query
test" test suite. This way the tests run faster because we don't need
to restart the Impala daemons.
Testing:
Added new test cases to test the effect of the query options.
Change-Id: Ib9ec8b16036e1fd35886e887809be8eca52a6982
Reviewed-on: http://gerrit.cloudera.org:8080/13361
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon that toggles Impala's ability to write the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
I removed 'num_data_pages_' from BaseColumnWriter since it was rather
confusing and didn't provide any measurable performance improvement.
This commit fixes the ASAN error produced by the first IMPALA-7644
commit which was reverted later.
Change-Id: Ib4a9098a2085a385351477c715ae245d83bf1c72
Reviewed-on: http://gerrit.cloudera.org:8080/11694
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon that toggles Impala's ability to write the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
Change-Id: If9994882aa59cbaf3ae464100caa8211598287bc
Reviewed-on: http://gerrit.cloudera.org:8080/11563
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Impala master branch can already write the Parquet
page index. However, we still don't have a well-defined
ordering for floating-point numbers in Parquet; see
PARQUET-1222.
Currently Impala writes the page index with
fmax()/fmin() semantics, but this might contradict the
future semantics that will be defined once PARQUET-1222
is resolved.
With this patch, Impala won't write the column index
for floating-point columns until PARQUET-1222 is
resolved and implemented.
I updated the python test accordingly.
Change-Id: I50aa2e6607de6a8943eb068b8162b0506763078b
Reviewed-on: http://gerrit.cloudera.org:8080/10951
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test in the title fails when the local filesystem is used.
Looking at the error message, it seems that the detected
Parquet file size is too small when the local filesystem
is used. There is already an annotation for that:
'SkipIfLocal.parquet_file_size'
I added this annotation to the TestHdfsParquetTableIndexWriter
class, so these tests won't be executed when the
test-warehouse directory of Impala resides on the local
filesystem.
Change-Id: Idd3be70fb654a49dda44309a8914fe1f2b48a1af
Reviewed-on: http://gerrit.cloudera.org:8080/10476
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/
The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.
I updated Pooja's python tests as well.
Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Reviewed-on: http://gerrit.cloudera.org:8080/9693
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>