impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
Riza Suminto	28cff4022d	IMPALA-14333: Run impala-py.test using Python3 Running exhaustive tests with env var IMPALA_USE_PYTHON3_TESTS=true reveals some tests that require adjustment. This patch made such adjustment, which mostly revolves around encoding differences and string vs bytes type in Python3. This patch also switch the default to run pytest with Python3 by setting IMPALA_USE_PYTHON3_TESTS=true. The following are the details: Change hash() function in conftest.py to crc32() to produce deterministic hash. Hash randomization is enabled by default since Python 3.3 (see https://docs.python.org/3/reference/datamodel.html#object.__hash__). This cause test sharding (like --shard_tests=1/2) produce inconsistent set of tests per shard. Always restart minicluster during custom cluster tests if --shard_tests argument is set, because test order may change and affect test correctness, depending on whether running on fresh minicluster or not. Moved one test case from delimited-latin-text.test to test_delimited_text.py for easier binary comparison. Add bytes_to_str() as a utility function to decode bytes in Python3. This is often needed when inspecting the return value of subprocess.check_output() as a string. Implement DataTypeMetaclass.__lt__ to substitute DataTypeMetaclass.__cmp__ that is ignored in Python3 (see https://peps.python.org/pep-0207/). Fix WEB_CERT_ERR difference in test_ipv6.py. Fix trivial integer parsing in test_restart_services.py. Fix various encoding issues in test_saml2_sso.py, test_shell_commandline.py, and test_shell_interactive.py. Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1. Switch to binary comparison in test_iceberg.py where needed. Specify text mode when calling tempfile.NamedTemporaryFile(). Simplify create_impala_shell_executable_dimension to skip testing dev and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. 
The reason is that several UTF-8 related tests in test_shell_commandline.py break in Python3 pytest + Python2 impala-shell combo. This skipping already happen automatically in build OS without system Python2 available like RHEL9 (IMPALA_SYSTEM_PYTHON2 env var is empty). Removed unused vector argument and fixed some trivial flake8 issues. Several test logic require modification due to intermittent issue in Python3 pytest. These include: Add _run_query_with_client() in test_ranger.py to allow reusing a single Impala client for running several queries. Ensure clients are closed when the test is done. Mark several tests in test_ranger.py with SkipIfFS.hive because they run queries through beeline + HiveServer2, but Ozone and S3 build environment does not start HiveServer2 by default. Increase the sleep period from 0.1 to 0.5 seconds per iteration in test_statestore.py and mark TestStatestore to execute serially. This is because TServer appears to shut down more slowly when run concurrently with other tests. Handle the deprecation of Thread.setDaemon() as well. Always force_restart=True each test method in TestLoggingCore, TestShellInteractiveReconnect, and TestQueryRetries to prevent them from reusing minicluster from previous test method. Some of these tests destruct minicluster (kill impalad) and will produce minidump if metrics verifier for next tests fail to detect healthy minicluster state. Testing: Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true. Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4 Reviewed-on: http://gerrit.cloudera.org:8080/23319 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-03 10:01:29 +00:00
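A minimal sketch of why the switch from hash() to crc32() makes sharding deterministic. This is not the actual conftest.py code: the helper names shard_index and select_shard and the test ids used below are hypothetical, and only zlib.crc32 from the standard library is assumed.

```python
import zlib


def shard_index(test_id, num_shards):
    """Map a test id to a 0-based shard using crc32 (hypothetical helper).

    Unlike the built-in hash() on str, crc32 is not affected by Python 3
    hash randomization, so the result is the same in every interpreter run.
    """
    # crc32 works on bytes, so encode the id first. The mask keeps the value
    # unsigned, which also matches Python 2 behavior.
    checksum = zlib.crc32(test_id.encode("utf-8")) & 0xFFFFFFFF
    return checksum % num_shards


def select_shard(test_ids, shard, num_shards):
    """Return the ids that belong to the 1-based `shard` of `num_shards`."""
    return [t for t in test_ids if shard_index(t, num_shards) == shard - 1]
```

With this scheme, an argument such as --shard_tests=1/2 selects the same subset on every run, and the shards together cover every test exactly once.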
Mihaly Szjatinya	e0cb533c25	IMPALA-13912: Use SHARED_CLUSTER_ARGS in more custom cluster tests In addition to IMPALA-13503 which allowed having the single cluster running for the entire test class, this attempts to minimize restarting between the existing tests without modifying any of their code. This changeset saves the command line with which 'start-impala-cluster.py' has been run and skips the restarting if the command line is the same for the next test. Some tests however do require restart due to the specific metrics being tested. Such tests are defined with the 'force_restart' flag within the 'with_args' decorator. NOTE: there might be more tests like that revealed after running the tests in different order resulting in test failures. Experimentally, this results in ~150 fewer restarts, mostly coming from restarts between tests. As for restarts between different variants of the same test, most of the cluster tests are restricted to single variant, although multi-variant tests occur occasionally. Change-Id: I7c9115d4d47b9fe0bfd9dbda218aac2fb02dbd09 Reviewed-on: http://gerrit.cloudera.org:8080/22901 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-06-19 17:48:25 +00:00
Riza Suminto	f8a1f6046a	IMPALA-14091: Migrate test_query_retries.py to HS2 test_query_retries.py still pinned to test using beeswax protocol by default. This patch refactor to test using hs2 protocol. Testing: - Run and pass test_query_retries.py in exhaustive mode. Change-Id: If12eeb47b843f0d1faca47994b2001e6d4c8ac58 Reviewed-on: http://gerrit.cloudera.org:8080/22939 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-05-29 03:03:57 +00:00
Riza Suminto	f28a32fbc3	IMPALA-13916: Change BaseTestSuite.default_test_protocol to HS2 This is the final patch to move all Impala e2e and custom cluster tests to use HS2 protocol by default. Only beeswax-specific test remains testing against beeswax protocol by default. We can remove them once Impala officially remove beeswax support. HS2 error message formatting in impala-hs2-server.cc is adjusted a bit to match with formatting in impala-beeswax-server.cc. Move TestWebPageAndCloseSession from webserver/test_web_pages.py to custom_cluster/test_web_pages.py to disable glog log buffering. Testing: - Pass exhaustive tests, except for some known and unrelated flaky tests. Change-Id: I42e9ceccbba1e6853f37e68f106265d163ccae28 Reviewed-on: http://gerrit.cloudera.org:8080/22845 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Jason Fehr <jfehr@cloudera.com>	2025-05-20 14:32:10 +00:00
Joe McDonnell	c5a0ec8bdf	IMPALA-11980 (part 1): Put all thrift-generated python code into the impala_thrift_gen package This puts all of the thrift-generated python code into the impala_thrift_gen package. This is similar to what Impyla does for its thrift-generated python code, except that it uses the impala_thrift_gen package rather than impala._thrift_gen. This is a preparatory patch for fixing the absolute import issues. This patches all of the thrift files to add the python namespace. This has code to apply the patching to the thirdparty thrift files (hive_metastore.thrift, fb303.thrift) to do the same. Putting all the generated python into a package makes it easier to understand where the imports are getting code. When the subsequent change rearranges the shell code, the thrift generated code can stay in a separate directory. This uses isort to sort the imports for the affected Python files with the provided .isort.cfg file. This also adds an impala-isort shell script to make it easy to run. Testing: - Ran a core job Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0 Reviewed-on: http://gerrit.cloudera.org:8080/20169 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>	2025-04-15 17:03:02 +00:00
Csaba Ringhofer	f98b697c7b	IMPALA-13929: Make 'functional-query' the default workload in tests This change adds get_workload() to ImpalaTestSuite and removes it from all test suites that already returned 'functional-query'. get_workload() is also removed from CustomClusterTestSuite which used to return 'tpch'. All other changes besides impala_test_suite.py and custom_cluster_test_suite.py are just mass removals of get_workload() functions. The behavior is only changed in custom cluster tests that didn't override get_workload(). By returning 'functional-query' instead of 'tpch', exploration_strategy() will no longer return 'core' in 'exhaustive' test runs. See IMPALA-3947 on why workload affected exploration_strategy. An example for affected test is TestCatalogHMSFailures which was skipped both in core and exhaustive runs before this change. get_workload() functions that return a different workload than 'functional-query' are not changed - it is possible that some of these also don't handle exploration_strategy() as expected, but individually checking these tests is out of scope in this patch. Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115 Reviewed-on: http://gerrit.cloudera.org:8080/22726 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-04-08 07:12:55 +00:00
Riza Suminto	4617c2370f	IMPALA-13908: Remove reference to ImpalaBeeswaxException This patch replaces references to ImpalaBeeswaxException with IMPALA_CONNECTION_EXCEPTION wherever possible. It also fixes some easy flake8 issues caught through this command: git show HEAD --name-only | grep '^tests.*py' \ | xargs -I {} impala-flake8 {} \ | grep -e U100 -e E111 -e E301 -e E302 -e E303 -e F... Testing: - Pass exhaustive tests. Change-Id: I676a9954404613a1cc35ebbc9ffa73e8132f436a Reviewed-on: http://gerrit.cloudera.org:8080/22701 Reviewed-by: Jason Fehr <jfehr@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-03-30 00:15:43 +00:00
Riza Suminto	00dc79adf6	IMPALA-13907: Remove reference to create_beeswax_client This patch replaces references to create_beeswax_client() with create_hs2_client() or vector-based client creation, in preparation for the hs2 test migration. test_session_expiration_with_queued_query is changed to use impala.dbapi directly from Impyla due to a limitation in ImpylaHS2Connection. TestAdmissionControllerRawHS2 is migrated to use hs2 as the default test protocol. Modify test_query_expiration.py to set query options through the client instead of a SET query. test_query_expiration is slightly modified due to a behavior difference in the hs2 ImpylaHS2Connection. Remove the remaining references to BeeswaxConnection.QueryState. Fixed a bug in ImpylaHS2Connection.wait_for_finished_timeout(). Fix some easy flake8 issues caught through this command: git show HEAD --name-only | grep '^tests.*py' \ | xargs -I {} impala-flake8 {} \ | grep -e U100 -e E111 -e E301 -e E302 -e E303 -e F... Testing: - Pass exhaustive tests. Change-Id: I1d84251835d458cc87fb8fedfc20ee15aae18d51 Reviewed-on: http://gerrit.cloudera.org:8080/22700 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-03-29 18:37:45 +00:00
Riza Suminto	8324201acd	IMPALA-13847: Remove beeswax-specific way to obtain query id With IMPALA-13682 merged, checking for query state can be done via ImpalaConnection.handle_id() that works for beeswax, hs2, and hs2-http protocol. This patch apply such change. ImpalaTestSuite.wait_for_progress() is refactored a bit to make client parameter required. Testing: - Run and pass the affected tests. Change-Id: I0a2bac1011f5a0e058f88f973ac403cce12d2b86 Reviewed-on: http://gerrit.cloudera.org:8080/22606 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-03-12 07:14:19 +00:00
Riza Suminto	71feb617e4	IMPALA-13835: Remove reference to protocol-specific states With IMPALA-13682 merged, checking for query state can be done via wait_for_impala_state(), wait_for_any_impala_state() and other helper methods of ImpalaConnection. This patch remove all reference to protocol-specific states such as BeeswaxService.QueryState. Also fix flake8 errors and unused variable in modified test files. Testing: - Run and pass all affected tests. Change-Id: Id6b56024fbfcea1ff005c34cd146d16e67cb6fa1 Reviewed-on: http://gerrit.cloudera.org:8080/22586 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-03-09 00:04:05 +00:00
Riza Suminto	95f353ac4a	IMPALA-13507: Allow disabling glog buffering via with_args fixture We have plenty of custom_cluster tests that assert against content of Impala daemon log files while the process is still running using assert_log_contains() and it's wrappers. The method specifically mention about disabling glog buffering ('-logbuflevel=-1'), but not all custom_cluster tests do that. This often result in flaky test that hard to triage and often neglected if it does not frequently run in core exploration. This patch adds boolean param 'disable_log_buffering' into CustomClusterTestSuite.with_args for test to declare intention to inspect log files in live minicluster. If it is True, start minicluster with '-logbuflevel=-1' for all daemons. If it is False, log WARNING on any calls to assert_log_contains(). There are several complex custom_cluster tests that left unchanged and print out such WARNING logs, such as: - TestQueryLive - TestQueryLogTableBeeswax - TestQueryLogOtherTable - TestQueryLogTableHS2 - TestQueryLogTableAll - TestQueryLogTableBufferPool - TestStatestoreRpcErrors - TestWorkloadManagementInitWait - TestWorkloadManagementSQLDetails This patch also fixed some small flake8 issues on modified tests. There is a flakiness sign at test_query_live.py where test query is submitted to coordinator and fail because sys.impala_query_live table has not exist yet from coordinator's perspective. This patch modify test_query_live.py to wait for few seconds until sys.impala_query_live is queryable. Testing: - Pass custom_cluster tests in exhaustive exploration. Change-Id: I56fb1746b8f3cea9f3db3514a86a526dffb44a61 Reviewed-on: http://gerrit.cloudera.org:8080/22015 Reviewed-by: Jason Fehr <jfehr@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-11-05 04:49:05 +00:00
Riza Suminto	9c87cf41bf	IMPALA-13396: Unify tmp dir management in CustomClusterTestSuite There are many custom cluster tests that require creating temporary directory. The temporary directory typically live within a scope of test method and cleaned afterwards. However, some test do create temporary directory directly and forgot to clean them afterwards, leaving junk dirs under /tmp/ or $LOG_DIR. This patch unify the temporary directory management inside CustomClusterTestSuite. It introduce new 'tmp_dir_placeholders' arg in CustomClusterTestSuite.with_args() that list tmp dirs to create. 'impalad_args', 'catalogd_args', and 'impala_log_dir' now accept formatting pattern that is replaceable by a temporary dir path, defined through 'tmp_dir_placeholders'. There are few occurrences where mkdtemp is called and not replaceable by this work, such as tests/comparison/cluster.py. In that case, this patch change them to supply prefix arg so that developer knows that it comes from Impala test script. This patch also addressed several flake8 errors in modified files. Testing: - Pass custom cluster tests in exhaustive mode. - Manually run few modified tests and observe that the temporary dirs are created and removed under logs/custom_cluster_tests/ as the tests go. Change-Id: I8dd665e8028b3f03e5e33d572c5e188f85c3bdf5 Reviewed-on: http://gerrit.cloudera.org:8080/21836 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-10-02 01:25:39 +00:00
Michael Smith	f05eac6476	IMPALA-12602: Unregister queries on idle timeout Queries cancelled due to idle_query_timeout/QUERY_TIMEOUT_S are now also Unregistered to free any remaining memory, as you cannot fetch results from a cancelled query. Adds a new structure - idle_query_statuses_ - to retain Status messages for queries closed this way so that we can continue to return a clear error message if the client returns and requests query status or attempts to fetch results. This structure must be global because HS2 server can only identify a session ID from a query handle, and the query handle no longer exists. SessionState tracks queries added to idle_query_statuses_ so they can be cleared when the session is closed. Also ensures MarkInactive is called in ClientRequestState when Wait() completes. Previously WaitInternal would only MarkInactive on success, leaving any failed requests in an active state until explicitly closed or the session ended. The beeswax get_log RPC will not return the preserved error message or any warnings for these queries. It's also possible the summary and profile are rotated out of query log as the query is no longer inflight. This is an acceptable outcome as a client will likely not look for a log/summary/profile after it times out. Testing: - updates test_query_expiration to verify number of waiting queries is only non-zero for queries cancelled by EXEC_TIME_LIMIT_S and not yet closed as an idle query - modified test_retry_query_timeout to use exec_time_limit_s because queries closed by idle_timeout_s don't work with get_exec_summary Change-Id: Iacfc285ed3587892c7ec6f7df3b5f71c9e41baf0 Reviewed-on: http://gerrit.cloudera.org:8080/21074 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-04-03 03:25:10 +00:00
Michael Smith	6c6142ba2e	IMPALA-12633: Remove DCHECK for slow SetQueryInflight Removes the DCHECK that the original query is inflight before trying to close it during a query retry. SetQueryInflight is a separate operation the server performs after a query has started executing async, and it's possible for the query to fail and retry before the server calls SetQueryInflight. When that happens, we still need to perform cleanup or the original request_state is never closed and we hit a different DCHECK: "BlockOnWait() needs to be called!" Adds an option to CloseClientRequestState for when we close a ClientRequestState but the query is retrying with a new state. It ensures that we bypass most of SetQueryInflight in case CloseClientRequestState was called first. Updates the message from DCHECK in ClientRequestState's destructor to reflect that wait_thread_ is only reset in Finalize. Adds a debug action and test where just the original query is delayed during the SetQueryInflight call. Change-Id: Ic17a5e12d9db61cb19306270174518a8dfd281a7 Reviewed-on: http://gerrit.cloudera.org:8080/20799 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com> Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-12-21 17:23:34 +00:00
Riza Suminto	0c8fc997ef	IMPALA-12395: Override scan cardinality for optimized count star The cardinality estimate in HdfsScanNode.java for count queries does not account for the fact that the count optimization only scans metadata and not the actual columns. Optimized count star scan will return only 1 row per parquet row group. This patch override the scan cardinality with total number of files, which is the closest estimate to number of row group. Similar override already exist in IcebergScanNode.java. Testing: - Add count query testcases in test_query_cpu_count_divisor_default - Pass core tests Change-Id: Id5ce967657208057d50bd80adadac29ebb51cbc5 Reviewed-on: http://gerrit.cloudera.org:8080/20406 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-08-28 20:32:13 +00:00
Riza Suminto	7ca20b3c94	Revert "IMPALA-11123: Optimize count(star) for ORC scans" This reverts commit `f932d78ad0`. The commit is reverted because it cause significant regression for non-optimized counts star query in parquet format. There are several conflicts that need to be resolved manually: - Removed assertion against 'NumFileMetadataRead' counter that is lost with the revert. - Adjust the assertion in test_plain_count_star_optimization, test_in_predicate_push_down, and test_partitioned_insert of test_iceberg.py due to missing improvement in parquet optimized count star code path. - Keep the "override" specifier in hdfs-parquet-scanner.h to pass clang-tidy - Keep python3 style of RuntimeError instantiation in test_file_parser.py to pass check-python-syntax.sh Change-Id: Iefd8fd0838638f9db146f7b706e541fe2aaf01c1 Reviewed-on: http://gerrit.cloudera.org:8080/19843 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>	2023-05-06 22:55:05 +00:00
Joe McDonnell	eb66d00f9f	IMPALA-11974: Fix lazy list operators for Python 3 compatibility Python 3 changes list operators such as range, map, and filter to be lazy. Some code that expects the list operators to happen immediately will fail. e.g. Python 2: range(0,5) == [0,1,2,3,4] True Python 3: range(0,5) == [0,1,2,3,4] False The fix is to wrap locations with list(). i.e. Python 3: list(range(0,5)) == [0,1,2,3,4] True Since the base operators are now lazy, Python 3 also removes the old lazy versions (e.g. xrange, ifilter, izip, etc). This uses future's builtins package to convert the code to the Python 3 behavior (i.e. xrange -> future's builtins.range). Most of the changes were done via these futurize fixes: - libfuturize.fixes.fix_xrange_with_import - lib2to3.fixes.fix_map - lib2to3.fixes.fix_filter This eliminates the pylint warnings: - xrange-builtin - range-builtin-not-iterating - map-builtin-not-iterating - zip-builtin-not-iterating - filter-builtin-not-iterating - reduce-builtin - deprecated-itertools-function Testing: - Ran core job Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f Reviewed-on: http://gerrit.cloudera.org:8080/19589 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
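A short illustration of the division difference described above, written for Python 3. The function names are hypothetical and not taken from the Impala code base. On Python 2, the same results would require "from __future__ import division" at the top of the file.

```python
def average_row_size(total_bytes, num_rows):
    """True division: the result keeps its fractional part."""
    return total_bytes / num_rows


def middle_index(items):
    """Floor division: indices and counts must stay integers."""
    return len(items) // 2


print(average_row_size(7, 2))      # 3.5
print(middle_index([10, 20, 30]))  # 1
```

Locations that index into a list or count records are the ones that need '//', since a float index raises TypeError in Python 3.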
Joe McDonnell	2b550634d2	IMPALA-11952 (part 2): Fix print function syntax Python 3 now treats print as a function and requires the parenthesis in invocation. print "Hello World!" is now: print("Hello World!") This fixes all locations to use the function invocation. This is more complicated when the output is being redirected to a file or when avoiding the usual newline. print >> sys.stderr , "Hello World!" is now: print("Hello World!", file=sys.stderr) To support this properly and guarantee equivalent behavior between python 2 and python 3, all files that use print now add this import: from __future__ import print_function This also fixes random flake8 issues that intersect with the changes. Testing: - check-python-syntax.sh shows no errors related to print Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351 Reviewed-on: http://gerrit.cloudera.org:8080/19552 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-02-28 17:11:50 +00:00
Joe McDonnell	c71de994b0	IMPALA-11952 (part 1): Fix except syntax Python 3 does not support this old except syntax: except Exception, e: Instead, it needs to be: except Exception as e: This uses impala-futurize to fix all locations of the old syntax. Testing: - The check-python-syntax.sh no longer shows errors for except syntax. Change-Id: I1737281a61fa159c8d91b7d4eea593177c0bd6c9 Reviewed-on: http://gerrit.cloudera.org:8080/19551 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-02-28 17:11:50 +00:00
Michael Smith	a870a11e64	IMPALA-7098: Re-enable tests under EC Re-enables tests under erasure coding, or provides more specific exceptions. Erasure coding uses multiple data blocks to construct a block group. Our tests use RS-3-2-1024k, which includes 3 data blocks in a block group. Each of these blocks is sized according to `dfs.block.size`, so block groups by default hold up to 384MB of data. Impala schedules work to executors based on blocks reported by HDFS, which for EC actually represent block groups. So with default block size, a file in EC has 1/3rd the number of schedulable blocks. In the case of tpch.lineitem, this produces 2 parquet files instead of 3 and reduces the number of executors scheduled to read parquet lineitem as 1. lineitem.tbl is loaded via Hive. With EC it uses 2 block groups, without EC it uses 6 blocks. 2. parquet lineitem is created by select/insert from lineitem.tbl. Impala schedules reads to executors based on available blocks, so with EC this gets scheduled across 2 executors instead of 3 and each executor writes a separate parquet file. Change-Id: Ib452024993e35d5a8d2854c6b2085115b26e40df Reviewed-on: http://gerrit.cloudera.org:8080/19172 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2022-11-04 22:13:50 +00:00
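The block arithmetic above can be sketched as follows. The 128MB block size is inferred from the 384MB block group size stated in the message. The 720MB file size is an assumed, illustrative value chosen to be consistent with the "6 blocks versus 2 block groups" figures, not a measured size of lineitem.tbl.

```python
import math

BLOCK_SIZE_MB = 128   # inferred default dfs.block.size (384MB / 3 data blocks)
EC_DATA_BLOCKS = 3    # RS-3-2-1024k uses 3 data blocks per block group


def schedulable_units(file_size_mb, erasure_coded):
    """Number of blocks (or EC block groups) HDFS reports for a file."""
    unit_mb = BLOCK_SIZE_MB * (EC_DATA_BLOCKS if erasure_coded else 1)
    return math.ceil(file_size_mb / unit_mb)


file_size_mb = 720  # assumed size, for illustration only
print(schedulable_units(file_size_mb, erasure_coded=False))  # 6
print(schedulable_units(file_size_mb, erasure_coded=True))   # 2
```

Because Impala schedules scan work per reported block, the drop from 6 to 2 schedulable units is what leaves only 2 executors reading the file under EC.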
Michael Smith	1eb0510eaa	IMPALA-11456: Collapse filesystem Skip logic Combines all SkipIf* classes for different filesystems into a single SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the rest as filesystem-specific special cases. The 'jira' option is removed in favor of specific flags for each issue. Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1 Reviewed-on: http://gerrit.cloudera.org:8080/18781 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-08-10 22:37:08 +00:00
stiga-huang	ae00781983	IMPALA-10895: Fix flakiness of test_retrying_query_cancel test_retrying_query_cancel tests canceling the query when it's in the RETRYING state. The test first runs the query and waits for the state to become RETRYING. A debug action is added to make the RETRYING state last longer than 1s, which should be sufficient for the test. However, when waiting for the RETRYING state, the polling interval is 0.5s, which can waste the majority of that time. In ASAN builds, the remaining time is not enough for the following steps, so the query state becomes RETRIED and the test fails. This patch reduces the wait interval to 0.1s. It also adds some logs and modifies the code to get the state after the wait instead of before the wait. Tests: - Run the test more than 1000 times in an ASAN build. Before this patch it fails in around 30 runs. Change-Id: Id069091c94160d09868fcdc36ac7195b1deb337a Reviewed-on: http://gerrit.cloudera.org:8080/18659 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-06-24 04:25:13 +00:00
xqhe	dfc2f175bd	IMPALA-10414: fix memory leak when canceling the retried query The query retry is launched in a separate thread. This thread may not have finished when the query is deleted from the given QueryDriverMap if the query retry failed to launch. In this case, the resources for the query retry thread are not released, so the reference count of QueryDriver (via the shared_ptr) never goes to 0 and it is never destroyed. We need to wait until the query retry thread has finished executing when deleting the query from the given QueryDriverMap. Testing: Modify test_query_retries.py to verify there is no memory leak by checking the memz page of the debug web UI. Change-Id: If804ca65da1794c819a6b2e6567ea7651ab5112f Reviewed-on: http://gerrit.cloudera.org:8080/17735 Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>	2022-04-15 01:36:59 +00:00
Riza Suminto	f932d78ad0	IMPALA-11123: Optimize count(star) for ORC scans This patch provides count(star) optimization for ORC scans, similar to the work done in IMPALA-5036 for Parquet scans. We use the stripes num rows statistics when computing the count star instead of materializing empty rows. The aggregate function changed from a count to a special sum function initialized to 0. This count(star) optimization is disabled for the full ACID table because the scanner might need to read and validate the 'currentTransaction' column in table's special schema. This patch drops 'parquet' from names related to the count star optimization. It also improves the count(star) operation in general by serving the result just from the file's footer stats for both Parquet and ORC. We unify the optimized count star and zero slot scan functions into HdfsColumnarScanner. The following table shows a performance comparison before and after the patch. primitive_count_star query target tpch10_parquet.lineitem table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc query is a modified primitive_count_star query that targets tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly. 
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        | -10.45%    | 2.87%      | * 25.51% *     | 9     | -1.47%         | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        | -22.37%    | 6.22%      | * 30.95% *     | 9     | -1.85%         | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  | 2.68%      | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
Testing: - Add PlannerTest.testOrcStatsAgg - Add TestAggregationQueries::test_orc_count_star_optimization - Exercise count(star) in TestOrc::test_misaligned_orc_stripes - Pass core tests Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091 Reviewed-on: http://gerrit.cloudera.org:8080/18327 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-05 13:27:10 +00:00
Fucun Chu	157086cb80	IMPALA-10771: Add Tencent COS support This patch adds support for COS (Cloud Object Storage). Using hadoop-cos, the implementation is similar to other remote FileSystems. New flags for COS: - num_cos_io_threads: Number of COS I/O threads. Defaults to 16. Follow-up: - Support for caching COS file handles will be addressed in IMPALA-10772. - test_concurrent_inserts and test_failing_inserts in test_acid_stress.py are skipped due to slow file listing on COS (IMPALA-10773). Tests: - Upload hdfs test data to a COS bucket. Modify all locations in HMS DB to point to the COS bucket. Remove some hdfs caching params. Run CORE tests. Change-Id: Idce135a7591d1b4c74425e365525be3086a39821 Reviewed-on: http://gerrit.cloudera.org:8080/17503 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-12-08 16:32:02 +00:00
xqhe	5cc01ed8b2	IMPALA-10825: fix impalad crashes when closing the retrying query The crash happens when canceling the retrying query in web UI. The canceling action will call ImpalaServer#UnregisterQuery. The QueryDriver will be null if the query has already been unregistered. Testing: Add test in tests/custom_cluster/test_query_retries.py and manually tested 100 times to make sure that there was no Impalad crash Change-Id: I3b9a2cccbfbdca00b099e0f8d5f2d4bcb4d0a8c3 Reviewed-on: http://gerrit.cloudera.org:8080/17729 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-08-12 07:38:40 +00:00
stiga-huang	d111443e8f	IMPALA-10704: Fix retried query id not being unregistered when retry fails When query retry fails in RetryQueryFromThread(), the retried query id may not be unregistered if the failure happens before we store the retry_request_state. In this case, QueryDriver::Unregister() has no way to get the retried query id so it's not deleted. Note that the retried query id is registered in RetryQueryFromThread() so should be deleted later. This finally results in a leak in the query driver map, where queries in it are shown as in-flight queries. test_retry_query_result_cacheing_failed and test_retry_query_set_query_in_flight_failed (added in IMPALA-10413) asserts one in-flight query at the end. This is satisfied by the leak. Instead, we should verify no running queries at the end. This patch adds a new field in QueryDriver to remember the registered retry query id as a backup way for getting it when query retry fails before we store the ClientRequestState of the retried query (so retried_client_request_state_ is null). Tests: - Run test_retry_query_result_cacheing_failed and test_retry_query_set_query_in_flight_failed 100 times. Change-Id: I074526799d68041a425b2379e74f8d8b45ce892a Reviewed-on: http://gerrit.cloudera.org:8080/17465 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-05-18 13:54:33 +00:00
xqhe	42684b44d3	IMPALA-10413: fix impalad crashes when canceling the retrying query The crash happens when canceling the retrying query. If the original query was unregistered while the new query was being created, it will call HandleRetryFailure to abort the new query. But the status is ok, so when calling Status::AddDetail impalad will crash. If the new retry query is aborted after the WaitAsync interface is called and before retry_request_state is moved to retried_client_request_state_, retry_request_state needs to call Finalize; otherwise the wait-thread will leak. In some cases, such as when the original query is canceled or the session is closed, we may not create the new query, so we also check whether the query is retried. Tests: Add test in tests/custom_cluster/test_query_retries.py and manually tested 100 times to make sure that there was no Impalad crash Change-Id: I4fd7228acd0a70d33859029052239f9b9f795e5d Reviewed-on: http://gerrit.cloudera.org:8080/16911 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-05-12 06:37:43 +00:00
stiga-huang	2dfc68d852	IMPALA-7712: Support Google Cloud Storage This patch adds support for GCS (Google Cloud Storage). Using the gcs-connector, the implementation is similar to other remote FileSystems. New flags for GCS: - num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16. Follow-up: - Support for spilling to GCS will be addressed in IMPALA-10561. - Support for caching GCS file handles will be addressed in IMPALA-10568. - test_concurrent_inserts and test_failing_inserts in test_acid_stress.py are skipped due to slow file listing on GCS (IMPALA-10562). - Some tests are skipped due to issues introduced by /etc/hosts setting on GCE instances (IMPALA-10563). Tests: - Compile and create hdfs test data on a GCE instance. Upload test data to a GCS bucket. Modify all locations in HMS DB to point to the GCS bucket. Remove some hdfs caching params. Run CORE tests. - Compile and load snapshot data to a GCS bucket. Run CORE tests. Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b Reviewed-on: http://gerrit.cloudera.org:8080/17121 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-13 11:20:08 +00:00
Riza Suminto	49ac55fb69	IMPALA-9856: Enable result spooling by default. Result spooling has been relatively stable since it was introduced, and it has several benefits described in IMPALA-8656. This patch enables the result spooling (SPOOL_QUERY_RESULTS) query option by default. Furthermore, some tests need to be adjusted to account for result spooling by default. The following are the adjustment categories and the list of tests that fall under each category. Change in assertions: PlannerTest#testAcidTableScans PlannerTest#testBloomFilterAssignment PlannerTest#testConstantFolding PlannerTest#testFkPkJoinDetection PlannerTest#testFkPkJoinDetectionWithHDFSNumRowsEstDisabled PlannerTest#testKuduSelectivity PlannerTest#testMaxRowSize PlannerTest#testMinMaxRuntimeFilters PlannerTest#testMinMaxRuntimeFiltersWithHDFSNumRowsEstDisabled PlannerTest#testMtDopValidation PlannerTest#testParquetFiltering PlannerTest#testParquetFilteringDisabled PlannerTest#testPartitionPruning PlannerTest#testPreaggBytesLimit PlannerTest#testResourceRequirements PlannerTest#testRuntimeFilterQueryOptions PlannerTest#testSortExprMaterialization PlannerTest#testSpillableBufferSizing PlannerTest#testTableSample PlannerTest#testTpch PlannerTest#testKuduTpch PlannerTest#testTpchNested PlannerTest#testUnion TpcdsPlannerTest custom_cluster/test_admission_controller.py::TestAdmissionController::test_dedicated_coordinator_planner_estimates custom_cluster/test_admission_controller.py::TestAdmissionController::test_memory_rejection custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_mem_limit_configs metadata/test_explain.py::TestExplain::test_explain_level2 metadata/test_explain.py::TestExplain::test_explain_level3 metadata/test_stats_extrapolation.py::TestStatsExtrapolation::test_stats_extrapolation Increase BUFFER_POOL_LIMIT: query_test/test_queries.py::TestQueries::test_analytic_fns query_test/test_runtime_filters.py::TestRuntimeRowFilters::test_row_filter_reservation 
query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output query_test/test_spilling.py::TestSpillingBroadcastJoins::test_spilling_broadcast_joins query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_aggs query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_regression_exhaustive query_test/test_udfs.py::TestUdfExecution::test_mem_limits Increase MEM_LIMIT: query_test/test_mem_usage_scaling.py::TestExchangeMemUsage::test_exchange_mem_usage_scaling query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling Increase MAX_ROW_SIZE: custom_cluster/test_parquet_max_page_header.py::TestParquetMaxPageHeader::test_large_page_header_config query_test/test_insert.py::TestInsertQueries::test_insert_large_string query_test/test_query_mem_limit.py::TestQueryMemLimit::test_mem_limit query_test/test_scanners.py::TestTextSplitDelimiters::test_text_split_across_buffers_delimiter query_test/test_scanners.py::TestWideRow::test_wide_row Disable result spooling to maintain assertion: custom_cluster/test_admission_controller.py::TestAdmissionController::test_set_request_pool custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_host_memory custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_pool_memory custom_cluster/test_admission_controller.py::TestAdmissionController::test_queue_reasons_memory custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_config_change_while_queued custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_fetched_rows custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_finished_query custom_cluster/test_scratch_disk.py::TestScratchDir::test_no_dirs custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_existing_dirs custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_writable_dirs 
query_test/test_insert.py::TestInsertQueries::test_insert_large_string (the last query only) query_test/test_kudu.py::TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_kudu_scan_mem_usage query_test/test_queries.py::TestQueriesParquetTables::test_very_large_strings query_test/test_query_mem_limit.py::TestCodegenMemLimit::test_codegen_mem_limit shell/test_shell_client.py::TestShellClient::test_fetch_size Testing: - Pass exhaustive tests. Change-Id: I9e360c1428676d8f3fab5d95efee18aca085eba4 Reviewed-on: http://gerrit.cloudera.org:8080/16755 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-02 04:58:51 +00:00
wzhou-code	b5e2a0ce2e	IMPALA-9224: Blacklist nodes with faulty disk for spilling This patch extends blacklist functionality by adding an executor node to the blacklist if a query fails because of a disk failure during spill-to-disk. It also classifies disk error codes and defines a blacklistable error set for non-transient disk errors. The coordinator blacklists an executor only if the executor hit a blacklistable error during spill-to-disk. Adds a new debug action to simulate a disk write error during spill-to-disk. To use, specify in query options as: 'debug_action': 'IMPALA_TMP_FILE_WRITE:<hostname>:<port>:<action>' where <hostname> and <port> identify the impalad which executes the fragment instances; <port> is the BE krpc port (default 27000). Adds new test cases for blacklist and query-retry to cover the code changes. Testing: - Passed new test cases. - Passed exhaustive test. - Manually simulated disk failures in scratch directories on nodes of a cluster, verified that the nodes were blacklisted as expected. Change-Id: I04bfcb7f2e0b1ef24a5b4350f270feecd8c47437 Reviewed-on: http://gerrit.cloudera.org:8080/16949 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-02-04 05:12:42 +00:00
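The debug-action string format described in the IMPALA-9224 entry above can be assembled with a small helper. This is a minimal sketch: `tmp_file_write_debug_action` is a hypothetical helper name, and the action value is treated as an opaque string supplied by the caller (the commit message does not list the valid actions).

```python
def tmp_file_write_debug_action(hostname: str, action: str,
                                krpc_port: int = 27000) -> dict:
    """Build the query-options entry for the IMPALA_TMP_FILE_WRITE debug action.

    hostname and krpc_port identify the impalad that executes the fragment
    instances; 27000 is the default BE krpc port per the commit message.
    The action string is passed through unchanged (its valid values are
    not specified here and are an assumption of the caller).
    """
    value = "IMPALA_TMP_FILE_WRITE:{}:{}:{}".format(hostname, krpc_port, action)
    return {"debug_action": value}
```

A test would pass the returned dictionary as query options for the query whose spill-to-disk write should fail on the chosen impalad.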
Riza Suminto	2004a87edf	IMPALA-10337: Consider MAX_ROW_SIZE when computing max reservation PlanRootSink can fail silently if result spooling is enabled and maxMemReservationBytes is less than 2 * MAX_ROW_SIZE. This happens because results are spilled using a SpillableRowBatchQueue which needs 2 buffers (read and write) with at least MAX_ROW_SIZE bytes per buffer. This patch fixes this by setting a lower bound of 2 * MAX_ROW_SIZE while computing the min reservation for the PlanRootSink. Testing: - Pass exhaustive tests. - Add e2e TestResultSpoolingMaxReservation. - Lower MAX_ROW_SIZE on tests where MAX_RESULT_SPOOLING_MEM is set to an extremely low value. Also verify that PLAN_ROOT_SINK's ReservationLimit remains unchanged after lowering the MAX_ROW_SIZE. Change-Id: Id7138e1e034ea5d1cd15cf8de399690e52a9d726 Reviewed-on: http://gerrit.cloudera.org:8080/16765 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-12-04 23:55:25 +00:00
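The lower bound described in the IMPALA-10337 entry above amounts to a simple clamp: the queue needs one read buffer and one write buffer, each at least MAX_ROW_SIZE bytes. The sketch below only illustrates that arithmetic; `plan_root_sink_reservation` and its parameter names are hypothetical, and where the planner actually applies the bound is not shown in the commit message.

```python
def plan_root_sink_reservation(requested_bytes: int, max_row_size: int) -> int:
    """Clamp a reservation so it can hold a read and a write buffer.

    Each of the two buffers must fit at least one row of max_row_size
    bytes, so the reservation may never drop below 2 * max_row_size.
    """
    lower_bound = 2 * max_row_size
    return max(requested_bytes, lower_bound)
```

For example, with a 512 KB MAX_ROW_SIZE a requested reservation of 100 KB would be raised to 1 MB, while a request of 4 MB would be left unchanged.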
wzhou-code	1af60a1560	IMPALA-9180 (part 3): Remove legacy backend port The legacy Thrift based Impala internal service has been removed so the backend port 22000 can be freed up. This patch sets flag be_port as a REMOVED_FLAG, and all infrastructure around it is cleaned up. StatestoreSubscriber::subscriber_id is set as hostname + krpc_port. Testing: - Passed the exhaustive test. Change-Id: Ic6909a8da449b4d25ee98037b3eb459af4850dc6 Reviewed-on: http://gerrit.cloudera.org:8080/16533 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-11-03 00:56:26 +00:00
wzhou-code	6bb3b88d05	IMPALA-9180 (part 1): Remove legacy ImpalaInternalService The legacy Thrift based Impala internal service has been deprecated and can be removed now. This patch removes ImpalaInternalService. All infrastructure around it is cleaned up, except one place for flag be_port. StatestoreSubscriber::subscriber_id contains be_port, but we cannot change the format of subscriber_id now. This remaining be_port issue will be fixed in a succeeding patch (part 4). TQueryCtx.coord_address is changed to TQueryCtx.coord_hostname since the port in TQueryCtx.coord_address is set as be_port and is unused now. Also rename TQueryCtx.coord_krpc_address to TQueryCtx.coord_ip_address. Testing: - Passed the exhaustive test. - Passed Quasar-L0 test. Change-Id: I5fa83c8009590124dded4783f77ef70fa30119e6 Reviewed-on: http://gerrit.cloudera.org:8080/16291 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-30 22:41:00 +00:00
Thomas Tauber-Marshall	28181cbe6c	IMPALA-9930 (part 1): Initial refactor for admission control service This patch contains the following refactors that are needed for the admission control service, in order to make the main patch easier to review: - Adds a new class AdmissionControlClient which will be used to abstract the logic for submitting queries to either a local or remote admission controller out from ClientRequestState/Coordinator. Currently only local submission is supported. - SubmitForAdmission now takes a BackendId representing the coordinator instead of assuming that the local impalad will be the coordinator. - The CRS_BEFORE_ADMISSION debug action is moved into SubmitForAdmission() so that it will be executed on whichever daemon is performing admission control rather than always on the coordinator (needed for TestAdmissionController.test_cancellation). - ShardedQueryMap is extended to allow keys to be either TUniqueId or UniqueIdPB and Add(), Get(), and Delete() convenience functions are added. - Some utils related to serializing Thrift objects into sidecars are added. Testing: - Passed a run of existing core tests. Change-Id: I7974a979cf05ed569f31e1ab20694e29fd3e4508 Reviewed-on: http://gerrit.cloudera.org:8080/16411 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-29 23:14:50 +00:00
wzhou-code	40777b706b	IMPALA-9636: Don't run retried query on the blacklisted nodes When a node is blacklisted, it is only placed on the blacklist for a certain period of time. In the current implementation, it is possible that the retried query could end up running on the node that it blacklisted during its original attempt. To avoid the same failure for the retried query, we should not schedule query fragment instances on the blacklisted nodes which caused the original query to fail. When making the schedule for the retried query, this patch filters out of the executor group the executors on nodes that were blacklisted during the original attempt. Adds new test cases test_retry_exec_rpc_failure_before_admin_delay() and test_retry_query_failure_all_executors_blacklisted() for retried queries which are triggered by RPC failure; the blacklist timeout is triggered by adding a delay before admission. Testing: - Passed test_query_retries.py, including the new test cases. - Passed core tests. Change-Id: I00bc1b5026efbd0670ffbe57bcebc457d34cb105 Reviewed-on: http://gerrit.cloudera.org:8080/16369 Reviewed-by: Sahil Takiar <stakiar@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-15 20:31:50 +00:00
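The filtering step described in the IMPALA-9636 entry above can be pictured as removing, from the executor group, every executor that the original attempt blacklisted. This is a simplified sketch: executors are modelled as plain address strings, and `executors_for_retry` is a hypothetical name rather than a function in Impala's scheduler.

```python
from typing import Iterable, List, Set


def executors_for_retry(executor_group: Iterable[str],
                        blacklisted_by_original: Set[str]) -> List[str]:
    """Return the executors a retried query may be scheduled on.

    Executors blacklisted during the original attempt are skipped even if
    their blacklist period has already expired, so the retry does not hit
    the same failing node again.
    """
    return [e for e in executor_group if e not in blacklisted_by_original]
```

If every executor in the group was blacklisted by the original attempt, the result is empty and the retried query has nowhere to run, which is the situation the test name test_retry_query_failure_all_executors_blacklisted() suggests is covered.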
stiga-huang	578933fe74	IMPALA-10065: Fix DCHECK when retrying a query in FINISHED state A query will come into the FINISHED state when some rows are available, even when some fragment instances are still executing. When a retryable query comes into the FINISHED state and the client hasn't fetched any results, we are still able to retry it for any retryable failures. This patch fixes a DCHECK when retrying a FINISHED state query. Tests: - Add a test in test_query_retries.py for retrying a query in FINISHED state. Change-Id: I11d82bf80640760a47325833463def8a3791bdda Reviewed-on: http://gerrit.cloudera.org:8080/16351 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-31 13:28:29 +00:00
stiga-huang	61dcc805e5	IMPALA-9225: Query option for retryable queries to spool all results before returning any to the client If we have returned any results to the client in the original query, query retry will be skipped to avoid incorrect results. This patch adds a query option, spool_all_results_for_retries, for retryable queries to spool all results before returning any to the client. It defaults to true. If all query results cannot be contained in the allocated result spooling space, we'll return results and thus disable query retry on the query. Setting spool_all_results_for_retries to false will fall back to the original behavior: the client can fetch results when any of them are ready. So we explicitly set it to false in the retried query since it won't be retried. For non-retryable queries or queries that don't enable results spooling, the spool_all_results_for_retries option has no effect. To implement this, this patch defers the time when results are ready to be fetched. By default, the “rows available” event happens when any results are ready. For a retryable query, when spool_query_results and spool_all_results_for_retries are both true, the “rows available” event happens after all results are spooled or after an error stops us from doing so, e.g. the batch queue is full, cancellation, or failures. After the root fragment instance’s Open() finishes, the coordinator will wait until the results of BufferedPlanRootSink are ready. BufferedPlanRootSink sets the results ready signal in its Send(), Close(), Cancel(), FlushFinal() methods. Tests: - Add a test to verify that a retryable query will spool all its results when results spooling and spool_all_results_for_retries are enabled. - Add a test to verify that query retry succeeds when a retryable query is still spooling its results (spool_all_results_for_retries=true). 
- Add a test to verify that the retried query won't spool all results even when results spooling and spool_all_results_for_retries are enabled in the original query. - Add a test to verify that the original query can be canceled correctly. We need this because the logic added for spool_all_results_for_retries is related to the cancellation code path. - Add a test to verify results will be returned when all of them can't fit into the result spooling space, and query retry will be skipped. Change-Id: I462dbfef9ddab9060b30a6937fca9122484a24a5 Reviewed-on: http://gerrit.cloudera.org:8080/16323 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-27 04:09:58 +00:00
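The timing rule for the "rows available" event in the IMPALA-9225 entry above can be written as a small predicate. This is a sketch of the decision as described in the commit message, not of the coordinator code; the function name and its boolean parameters are hypothetical.

```python
def rows_available(retryable: bool,
                   spool_query_results: bool,
                   spool_all_results_for_retries: bool,
                   any_rows_ready: bool,
                   all_rows_spooled: bool,
                   spooling_stopped: bool) -> bool:
    """Decide whether the "rows available" event may fire.

    spooling_stopped stands for any condition that prevents spooling all
    results (batch queue full, cancellation, or a failure).
    """
    if retryable and spool_query_results and spool_all_results_for_retries:
        # Deferred mode: wait for every result, or for spooling to give up.
        return all_rows_spooled or spooling_stopped
    # Default mode: fire as soon as any results are ready.
    return any_rows_ready
```

The deferred branch is what keeps a query retryable for longer: no rows reach the client until the full result set is spooled or spooling can no longer continue.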
Sahil Takiar	70c2073d02	IMPALA-9834: De-flake TestQueryRetries on EC builds This patch skips all tests in TestQueryRetries on EC builds. The tests in TestQueryRetries run queries that run on three instances during regular builds (HDFS, S3, etc.), but only two instances on EC builds. This causes some non-determinism during the test because killing an impalad in the mini-cluster won't necessarily cause a retry to be triggered. It bumps up the timeout used when waiting for a query to be retried. It improves the assertion in __get_query_id_from_profile so that it dumps the full profile when the assertion fails. This should help debuggability of any test failures that fail in this assertion. Testing: * Ran TestQueryRetries locally Change-Id: Id5c73c2cbd0ef369175856c41f36d4b0de4b8d71 Reviewed-on: http://gerrit.cloudera.org:8080/16149 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-07-09 21:10:32 +00:00
stiga-huang	1fbca6d43b	IMPALA-9569: Fix progress bar and live_summary to show info of the retried query Impala-shell periodically calls GetExecSummary() when the query is queuing or running. If the query is being retried, GetExecSummary() should return the TExecSummary of the retried query. So the progress bar and live_summary can reflect the most recent state. This patch also modifies get_summary() to return retry information in error_logs of TExecSummary. Impala-shell and other clients can print the info right after the query starts being retried. Modified impala-shell to print the retried query link when the retried query is running. Example output when the retried query is running: Query: select count(*) from functional.alltypes where bool_col = sleep(60) Query submitted at: 2020-06-18 22:08:49 (Coordinator: http://quanlong-OptiPlex-BJ:25000) Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=9444fe7f0df0da28:29134b0800000000 Failed due to unreachable impalad(s): quanlong-OptiPlex-BJ:22001 Retrying query using query id: 5748d9a3ccc28ba8:a75e2fab00000000 Retried query link: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=5748d9a3ccc28ba8:a75e2fab00000000 [############################### ] 50% Tests: - Manually verify the progress bar and live_summary work when the query is being retried. - Add tests in test_query_retries.py to validate the get_summary() results. Change-Id: I8f96919f00e0b64d589efd15b6b5ec82fb725d56 Reviewed-on: http://gerrit.cloudera.org:8080/16096 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-06-30 12:11:24 +00:00
stiga-huang	931063f0f2	IMPALA-9213: Add query retry info to GetLog result Beeswax clients use get_log() to retrieve the warning/error message after the query finishes. HS2 clients use GetLog() for the same purpose. This patch adds the retry information into the returned result if the query is retried. So clients that print the log can show the original query failure and the retried query id. This patch also modifies impala-shell to extract the retried query id and print the retried query link. Here's an example of the impala-shell output: Query: select count(*) from functional.alltypes where bool_col = sleep(60) Query submitted at: 2020-06-18 21:23:52 (Coordinator: http://quanlong-OptiPlex-BJ:25000) Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=7944ffee4d81cdd4:e7f9357a00000000 +----------+ | count(*) | +----------+ | 3650 | +----------+ WARNINGS: Original query failed: Failed due to unreachable impalad(s): quanlong-OptiPlex-BJ:22001 Query has been retried using query id: 934b2734f67a1161:a0dbd60200000000 Retried query link: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=934b2734f67a1161:a0dbd60200000000 Tests: - Add tests in test_query_retries.py to verify client logs returned from GetLog(). - Run test_query_retries.py. - Manually run queries in impala-shell and kill impalads. Verify printed messages when the retried queries succeed or fail. Change-Id: I58cf94f91a0b92eb9a3088bee3894ac157a954dc Reviewed-on: http://gerrit.cloudera.org:8080/16093 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-06-30 05:58:02 +00:00
Sahil Takiar	bd4d01a379	IMPALA-9199: Add support for single query retries on cluster membership changes Adds the core logic for transparently retrying queries that fail due to cluster membership changes (IMPALA-9124). Query retries are triggered if (1) a node has been removed from the cluster membership by a statestore update (rather than cancelling all queries running on the removed node, queries are retried), or (2) a query fails and, as a result, blacklists a node. Either event is considered a cluster membership change as it affects what nodes a query will be scheduled on. The assumption is that a retry of the query with the updated cluster membership will succeed. A query retry is modelled as a brand new query, with its own query id. This simplifies the implementation and the resulting runtime profiles when queries are retried. Core Features: * Retries are transparent to the user; no modifications to client libraries are necessary to support query retries * Retried queries skip all fe/ parsing, planning, authorization, etc. 
* Retries are configurable ('retry_failed_queries') and are off by default Implementation: * When a query is retried, the original query is cancelled, the new query is created, registered, and started, and then the original query is closed * A new layer of abstraction between the ImpalaServer and ClientRequestState has been added; it is called the QueryDriver * Each ClientRequestState is treated as a single attempt of a query, and the QueryDriver owns all ClientRequestStates for a query * ClientRequestState has a new state object called RetryState; a ClientRequestState can either be NOT_RETRIED, RETRYING, or RETRIED * The QueryDriver owns the TExecRequest for the query as well, it is re-used for each query retry * QueryDrivers and ClientRequestStates are now referenced using a QueryHandle Observability: * Users can tell if a query is retried using runtime profiles and the Impala Web UI * Runtime profiles of queries that fail and then are retried will have: * "Retry Status: RETRIED" * "Retry Cause: [the error that triggered the retry]" * "Retried Query Id: [the query id of the retried query]" * Runtime profiles of the retried query (e.g. 
the second attempt of the query) will include: * "Original Query Id: [the query id of the original query]" * The Impala Web UI will list all retried queries as being in the "RETRIED" state Testing: * Added E2E tests in test_query_retries.py; looped tests for a few days * Added a stress test query_retries_stress_runner.py that runs concurrent streams of a TPC-{H,DS} workload and randomly kills impalads * Ran the stress test with various configurations: tpch on parquet, tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on parquet (one stream), tpch on text, tpcds on text * Ran exhaustive tests * Ran exhaustive tests with 'retry_failed_queries' set to true, no unexpected failures * Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted impalads, and manually verified that queries were retried * Manually tested retries work with various clients, specifically the impala-shell and Hue * Ran core tests and query retry stress test against an ASAN build * Ran concurrent_select.py to stress query cancellation * Ran be/ tests against a TSAN build Limitations: * There are several limitations that are listed out in the parent JIRA Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd Reviewed-on: http://gerrit.cloudera.org:8080/14824 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Sahil Takiar <stakiar@cloudera.com>	2020-05-15 20:11:07 +00:00
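The ownership model described in the IMPALA-9199 entry above (one QueryDriver per query, one ClientRequestState per attempt, and a RetryState on each attempt) can be sketched as follows. This is a simplified Python model of the C++ design, written under assumptions: the order of state transitions, the field names, and the `start`/`retry` methods are illustrative and not taken from the patch.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class RetryState(Enum):
    NOT_RETRIED = auto()
    RETRYING = auto()
    RETRIED = auto()


@dataclass
class ClientRequestState:
    """One attempt of a query; each attempt has its own query id."""
    query_id: str
    retry_state: RetryState = RetryState.NOT_RETRIED


@dataclass
class QueryDriver:
    """Owns the exec request and every attempt made for one query."""
    exec_request: str  # stand-in for TExecRequest, re-used on each retry
    attempts: List[ClientRequestState] = field(default_factory=list)

    def start(self, query_id: str) -> ClientRequestState:
        attempt = ClientRequestState(query_id)
        self.attempts.append(attempt)
        return attempt

    def retry(self, new_query_id: str) -> ClientRequestState:
        original = self.attempts[-1]
        original.retry_state = RetryState.RETRYING   # assumed transition order
        retried = self.start(new_query_id)           # brand new query id
        original.retry_state = RetryState.RETRIED
        return retried
```

Because the driver keeps the exec request, the second attempt does not need to go through parsing or planning again, which matches the "retried queries skip all fe/ parsing, planning, authorization" point in the commit message.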

43 Commits