This patch modifies the string overload of
IcebergFunctions::TruncatePartitionTransform so that it always handles
strings as UTF-8 encoded, because the Iceberg specification states
that strings are UTF-8 encoded.
Also, for Iceberg tables UrlEncode is now called not in the
Hive-compatible way but in the standard way, similar to Java's
URLEncoder.encode() (which the Iceberg API also uses), to conform with
existing practice in Hive, Spark and Trino. This included a change in
the set of characters which are not escaped, to follow the URL
Standard's application/x-www-form-urlencoded format. [1] The function
is also renamed from ShouldNotEscape to IsUrlSafe for better
readability.
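As an illustration (a Python sketch, not the patch's C++): truncation
must count code points, never raw bytes, so a multi-byte sequence is
never split.
  # Illustrative only: Iceberg's truncate transform on a UTF-8 string
  # shortens to a number of code points, never splitting a character.
  def truncate_partition_transform(value, width):
      return value[:width]  # Python str slicing operates on code points
  # Truncating raw bytes instead could split '运' (0xE8 0xBF 0x90) in half.
  assert truncate_partition_transform(u"运行中", 2) == u"运行"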
Testing:
* add and extend e2e tests to check partitions with Unicode characters
* add be tests to coding-util-test.cc
[1]: https://url.spec.whatwg.org/#application-x-www-form-urlencoded-percent-encode-set
Change-Id: Iabb39727f6dd49b76c918bcd6b3ec62532555755
Reviewed-on: http://gerrit.cloudera.org:8080/23190
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
test_parallel_checksum only needs to run over a single exec option
dimension and the text/none format. Leaving it in TestInsertQueries will
exercise test_parallel_checksum over 'compression_codec' query
option (in exhaustive builds). The CTAS fails when
compression_codec != none since the target table is in text format
and writing to compressed text table is not supported.
This patch moves test_parallel_checksum under
TestInsertNonPartitionedTable, which has such a limited test dimension.
It also adds an assertion that the CTAS query is successful.
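For reference, a sketch of such a constrained suite (helper names
follow the test framework's usual conventions; illustrative, not the
patch's exact code):
  from tests.common.impala_test_suite import ImpalaTestSuite
  from tests.common.test_dimensions import create_single_exec_option_dimension
  class TestInsertNonPartitionedTable(ImpalaTestSuite):
    @classmethod
    def add_test_dimensions(cls):
      super(TestInsertNonPartitionedTable, cls).add_test_dimensions()
      cls.ImpalaTestMatrix.add_dimension(create_single_exec_option_dimension())
      cls.ImpalaTestMatrix.add_constraint(lambda v:
          v.get_value('table_format').file_format == 'text'
          and v.get_value('table_format').compression_codec == 'none')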
Change-Id: I2b2bc34ae48a2355ee1e6f6e9e42da9076adf96b
Reviewed-on: http://gerrit.cloudera.org:8080/22948
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Refactors ParallelFileMetadataLoader to be usable for multiple types of
metadata. Uses it to collect checksums for new files in parallel.
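The loader itself is Java; the pattern, sketched in Python for
illustration:
  from concurrent.futures import ThreadPoolExecutor
  def load_in_parallel(paths, load_fn, num_threads=8):
      # Apply one metadata-loading function (file status, checksum, ...)
      # to every path across a shared thread pool.
      with ThreadPoolExecutor(max_workers=num_threads) as pool:
          return dict(zip(paths, pool.map(load_fn, paths)))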
Testing: adds a test that multiple loading threads are used and that
checksum collection does not take too long.
Change-Id: I314621104e4757620c0a90d41dd6875bf8855b51
Reviewed-on: http://gerrit.cloudera.org:8080/22872
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the final patch to move all Impala e2e and custom cluster tests
to use the HS2 protocol by default. Only beeswax-specific tests remain
testing against the beeswax protocol by default. We can remove them once
Impala officially removes beeswax support.
HS2 error message formatting in impala-hs2-server.cc is adjusted a bit
to match the formatting in impala-beeswax-server.cc.
Move TestWebPageAndCloseSession from webserver/test_web_pages.py to
custom_cluster/test_web_pages.py to disable glog log buffering.
Testing:
- Pass exhaustive tests, except for some known and unrelated flaky
tests.
Change-Id: I42e9ceccbba1e6853f37e68f106265d163ccae28
Reviewed-on: http://gerrit.cloudera.org:8080/22845
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Jason Fehr <jfehr@cloudera.com>
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 on why workload affected
exploration_strategy. An example of an affected test is
TestCatalogHMSFailures which was skipped both in core and exhaustive
runs before this change.
get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope in this patch.
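Sketched, the new default looks like this (illustrative; the exact code
may differ):
  class ImpalaTestSuite(object):
    @classmethod
    def get_workload(cls):
      # Defined once here; subclasses that returned 'functional-query'
      # simply drop their overrides.
      return 'functional-query'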
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements building the exec summary table for
ImpylaHS2Connection. It adds a fetch_exec_summary argument to
ImpalaConnection.execute(). If this argument is True, an exec summary
table will be added to the returned result object.
fetch_exec_summary is also implemented for BeeswaxConnection. Thus,
BeeswaxConnection will no longer fetch the exec summary all the time by
default.
Tests that validate the exec summary table are updated to set
fetch_exec_summary=True and migrated to test against the HS2 protocol.
Change TestExecutorGroup._set_query_options() to do query option setting
through hs2_client iconfig instead of a SET query. Some flake8 issues
are addressed as well.
Move build_exec_summary_table to a separate exec_summary.py file. Tweak
it a bit to return early if the given TExecSummary is empty.
Fixed a bug in ImpalaBeeswaxClient.fetch_results() where the fetch would
not happen at all if the discard_result argument is True.
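Hypothetical usage of the new argument (the result attribute name is
assumed):
  result = client.execute('select count(*) from functional.alltypes',
                          fetch_exec_summary=True)
  assert result.exec_summary is not None  # populated only when requested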
Testing:
- Run and pass affected tests locally.
Change-Id: I7d88f78e58eeda29ce21e7828884c7a129d7efe6
Reviewed-on: http://gerrit.cloudera.org:8080/22626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes the remaining tests that have unexercised exec_options.
Some test reorganization is done to clarify their test dimension
declarations. The WARNING log added by IMPALA-13323 is turned into
pytest.fail() with an error message suggesting how to fix it.
Fixed some flake8 warnings and errors as well.
Testing:
- Pass EE and custom cluster tests in exhaustive exploration.
Change-Id: I33bb4b6c4ff50b55a082460dd9944d2aa3511e11
Reviewed-on: http://gerrit.cloudera.org:8080/21743
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
An error came from an issue with URL encoding, where certain Unicode
characters were being incorrectly encoded due to their UTF-8
representation matching characters in the set of characters to escape.
For example, the string '运', which consists of the three bytes
0xE8 0xBF 0x90, was wrongly getting encoded into '%E8%FFFFFFBF%90',
because the middle byte matched one of the two bytes that
represented the "\u00FF" literal. Inclusion of "\u00FF" was likely
a mistake from the beginning and it should have been '\x7F'.
The patch makes three key changes:
1. Before the change, the set of characters that need to be escaped
was stored as a string. The current patch uses an unordered_set
instead.
2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was
erroneous from the beginning, is replaced with '\x7F', which is a
control character for DELETE, ensuring consistency and correctness in
URL encoding.
3. The list of characters to be escaped is extended to match the
current list in Hive.
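The mechanism reproduces in a few lines of Python (the character list
below is an illustrative approximation, not Hive's exact set):
  # Buggy layout: the escape set kept as a UTF-8 string. "\u00FF"
  # contributes two bytes, 0xC3 0xBF, so the UTF-8 continuation byte
  # 0xBF of '运' (0xE8 0xBF 0x90) matches and gets percent-encoded.
  buggy_escape_bytes = u' "#%\'*/:=?\\\u00ff{[]^'.encode('utf-8')
  assert 0xBF in buggy_escape_bytes        # accidental hit
  # Fixed layout: a set of byte values, with '\x7F' instead of '\u00FF'.
  fixed_escape_set = {0x7F} | set(b' "#%\'*/:=?\\{[]^')
  assert 0xBF not in fixed_escape_set      # multi-byte UTF-8 passes through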
Testing: Tests on both traditional Hive tables and Iceberg tables
are included in unicode-column-name.test, insert.test,
coding-util-test.cc and test_insert.py.
Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Reviewed-on: http://gerrit.cloudera.org:8080/21131
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch improves the accuracy of the CPU ProcessingCost estimates for
several of the CPU intensive operators by basing the costs on benchmark
data. The general approach for a given operator was to run a set of queries
that exercised the operator under various conditions (e.g. large vs small
row sizes and row counts, varying NDV, different file formats, etc) and
capture the CPU time spent per unit of work (the unit of work might be
measured as some number of rows, some number of bytes, some number of
predicates evaluated, or some combination of these). The data was then
analyzed in an attempt to fit a simple model that would allow us to
predict CPU consumption of a given operator based on information available
at planning time.
For example, the CPU ProcessingCost for a Parquet scan is estimated as:
TotalCost = (0.0144 * BytesMaterialized) + (0.0281 * Rows * Predicate Count)
The coefficients (0.0144 and 0.0281) are derived from benchmarking
scans under a variety of conditions. Similar cost functions and coefficients
were derived for all of the benchmarked operators. The coefficients for all
the operators are normalized such that a single unit of cost equates to
roughly 100 nanoseconds of CPU time on an r5d.4xlarge instance. So we would
predict an operator with a cost of 10,000,000 would complete in roughly one
second on a single core.
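A back-of-envelope check of the model above (coefficients from the
text; one cost unit ~= 100ns of CPU on an r5d.4xlarge):
  def parquet_scan_cost(bytes_materialized, rows, predicate_count):
      return 0.0144 * bytes_materialized + 0.0281 * rows * predicate_count
  cost = parquet_scan_cost(1 << 30, 10 * 1000 * 1000, 2)  # 1GB, 10M rows, 2 preds
  est_seconds_on_one_core = cost * 100e-9  # ~1.6s for this example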
Limitations:
* Costing only addresses CPU time spent and doesn't account for any IO
or other wait time.
* Benchmarking scenarios didn't provide comprehensive coverage of the
full range of data types, distributions, etc. More thorough
benchmarking could improve the costing estimates further.
* This initial patch only covers a subset of the operators, focusing
on those that are most common and most CPU intensive. Specifically
the following operators are covered by this patch. All others
continue to use the previous ProcessingCost code:
AggregationNode
DataStreamSink (exchange sender)
ExchangeNode
HashJoinNode
HdfsScanNode
HdfsTableSink
NestedLoopJoinNode
SortNode
UnionNode
Benchmark-based costing of the remaining operators will be covered by
a future patch.
Future patches will automate the collection and analysis of the benchmark
data and the computation of the cost coefficients to simplify maintenance
of the costing as performance changes over time.
Change-Id: Icf1edd48d4ae255b7b3b7f5b228800d7bac7d2ca
Reviewed-on: http://gerrit.cloudera.org:8080/21279
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit addresses an issue in the CastExpr class where the clone
constructor was not properly preserving compatibility settings. The
clone constructor assigned the default compatibility regardless of the
source expression, causing substitution errors for partitioned tables.
Example:
'insert into unsafe_insert_partitioned(int_col, string_col)
values("1", null), (null, "1")'
Throws:
ERROR: IllegalStateException: Failed analysis after expr substitution.
CAUSED BY: IllegalStateException: cast STRING to INT
Tests:
- new test case added to insert-unsafe.test
Change-Id: Iff64ce02539651fcb3a90db678f74467f582648f
Reviewed-on: http://gerrit.cloudera.org:8080/20385
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds an experimental query option called ALLOW_UNSAFE_CASTS
which allows implicit casting between some numeric types and string
types. A new type of compatibility is introduced for this purpose, and
the compatibility rule handling is also refactored. The new approach
uses an enum to differentiate the compatibility levels, and to make it
easier to pass them through methods. The unsafe compatibility is used
only in two cases: for set operations and for insert statements. The
insert statements and set operations accept unsafe implicitly casted
expressions only when the source expressions are constant.
The following implicit type casts are enabled in unsafe mode:
- String -> Float, Double
- String -> Tinyint, Smallint, Int, Bigint
- Float, Double -> String
- Tinyint, Smallint, Int, Bigint -> String
The patch also covers IMPALA-3217, and adds two more rules to handle
implicit casting in set operations and insert statements between string
types:
- String -> Char(n)
- String -> Varchar(n)
The unsafe implicit casting requires that the source expression must be
constant in this case as well.
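The acceptance rule reduces to something like this sketch (names
hypothetical, type pairs abbreviated):
  def unsafe_cast_allowed(src, dst, src_is_constant, allow_unsafe_casts):
      UNSAFE_PAIRS = {('STRING', 'INT'), ('STRING', 'DOUBLE'),
                      ('INT', 'STRING'), ('DOUBLE', 'STRING')}  # subset
      # Unsafe casts apply only to constant source expressions, and only
      # when the ALLOW_UNSAFE_CASTS option is set.
      return (allow_unsafe_casts and src_is_constant
              and (src, dst) in UNSAFE_PAIRS)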
Tests:
- tests added to AnalyzeExprsTest.java
- new test class added to test_insert.py
Change-Id: Iee5db2301216c2e088b4b3e4f6cb5a1fd10600f7
Reviewed-on: http://gerrit.cloudera.org:8080/19881
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When inserting into non-partitioned tables, the catalog update
request could fail due to file-not-found exceptions. At that
point we have reset (cleared) the partition map, so it becomes
empty after the failure, which is an illegal state and will
cause failures in later operations. Currently, users have to
manually invalidate the metadata of the table to recover. We
can improve this by making all the updates happen after all
the external loading succeeds, so any failures in loading the
file metadata won't leave the table metadata in a partially
updated state.
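The ordering change amounts to this pattern (hypothetical names, Python
for brevity):
  def update_catalog(table, new_files):
      # Perform every external load first; any file-not-found error
      # propagates before the table is touched.
      loaded = [load_file_metadata(f) for f in new_files]
      # Only mutate the in-memory state once all loads have succeeded.
      table.apply_file_descriptors(loaded)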
Testing:
1. Added a test which simulates a failure in a catalog update
request by throwing an exception through the debug action and
confirms that subsequent catalog update requests are not
affected by the failure.
Change-Id: I28e76a73b7905c24eb93b935124d20ea7abe8513
Reviewed-on: http://gerrit.cloudera.org:8080/19878
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The new processing cost-based planner changes (IMPALA-11604,
IMPALA-12091) will impact output writer parallelism for insert queries,
with the potential for more small files if the processing cost-based
planning results in too many writer fragments. It can further exacerbate
a problem introduced by MT_DOP (see IMPALA-8125).
The MAX_FS_WRITERS query option can help mitigate this. But even without
the MAX_FS_WRITERS set, the default output writer parallelism should
avoid creating excessive writer parallelism for partitioned and
unpartitioned inserts.
This patch implements such a limit when using the cost-based planner. It
limits the number of writer fragments such that each writer fragment
writes at least 256MB of rows. This patch also allows CTAS (a kind of
DDL query) to be eligible for auto-scaling.
This patch also removes comments about NUM_SCANNER_THREADS added by
IMPALA-12029, since they no longer apply after IMPALA-12091.
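The cap works out to a sketch like this (names hypothetical):
  MIN_BYTES_PER_WRITER = 256 * 1024 * 1024
  def max_writer_fragments(estimated_input_bytes, cost_based_parallelism):
      by_volume = max(1, estimated_input_bytes // MIN_BYTES_PER_WRITER)
      return min(cost_based_parallelism, by_volume)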
Testing:
- Add test cases in test_query_cpu_count_divisor_default
- Add test_processing_cost_writer_limit in test_insert.py
- Pass test_insert.py::TestInsertHdfsWriterLimit
- Pass test_executor_groups.py
Change-Id: I289c6ffcd6d7b225179cc9fb2f926390325a27e0
Reviewed-on: http://gerrit.cloudera.org:8080/19880
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch, Impala still relies on MT_DOP option to decide the
degree of parallelism of the scan fragment when a query runs with
COMPUTE_PROCESSING_COST=1. This patch adds the scan node's processing
cost as another consideration to raise scan parallelism beyond MT_DOP.
Scan node cost is now adjusted to also consider the number of effective
scan ranges. Each scan range is given a weight of (0.5% *
min_processing_per_thread), which roughly means that one scan node
instance can handle at most 200 scan ranges.
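In sketch form (illustrative):
  def weighted_scan_cost(base_cost, num_ranges, min_processing_per_thread):
      per_range_weight = 0.005 * min_processing_per_thread  # 0.5%
      return base_cost + num_ranges * per_range_weight
  # One instance absorbs at most ~1/0.005 = 200 ranges' worth of weight
  # per min_processing_per_thread of cost.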
Query option MAX_FRAGMENT_INSTANCES_PER_NODE is added as an upper
bound on scan parallelism if COMPUTE_PROCESSING_COST=true. If the number
of scan ranges is fewer than the maximum parallelism allowed by the scan
node's processing cost, that processing cost will be clamped down
to (min_processing_per_thread / number of scan ranges). Lowering
MAX_FRAGMENT_INSTANCES_PER_NODE can also clamp down the scan processing
cost in a similar way. For interior fragments, a combination of
MAX_FRAGMENT_INSTANCES_PER_NODE, PROCESSING_COST_MIN_THREADS, and the
number of available cores per node is taken into account to determine maximum
fragment parallelism per node. For scan fragment, only the first two are
considered to encourage Frontend to choose a larger executor group as
needed.
Two new pieces of static state are added to exec-node.h: is_mt_fragment_
and num_instances_per_node_. The backend code that refers to the MT_DOP
option is replaced with either is_mt_fragment_ or
num_instances_per_node_.
Two new criteria are added during effective parallelism calculation in
PlanFragment.adjustToMaxParallelism():
- If a fragment has UnionNode, its parallelism is the maximum between
its input fragments and its collocated ScanNode's expected
parallelism.
- If a fragment only has a single ScanNode (and no UnionNode), its
parallelism is calculated in the same fashion as the interior fragment
but will not be lowered anymore since it will not have any child
fragment to compare with.
Admission control slots remain unchanged. This may cause a query to fail
admission if Planner selects scan parallelism that is higher than the
configured admission control slots value. Setting
MAX_FRAGMENT_INSTANCES_PER_NODE equal to or lower than configured
admission control slots value can help lower scan parallelism and pass
the admission controller.
The previous workaround to control scan parallelism by IMPALA-12029 is
now removed. This patch also disables IMPALA-10287 optimization if
COMPUTE_PROCESSING_COST=true. This is because IMPALA-10287 relies on a
fixed number of fragment instances in DistributedPlanner.java. However,
effective parallelism calculation is done much later and may change the
final number of instances of hash join fragment, rendering
DistributionMode selected by IMPALA-10287 inaccurate.
This patch is benchmarked using single_node_perf_run.py with the
following parameters:
args="-gen_experimental_profile=true -default_query_options="
args+="mt_dop=4,compute_processing_cost=1,processing_cost_min_threads=1 "
./bin/single_node_perf_run.py --num_impalads=3 --scale=10 \
--workloads=tpcds --iterations=5 --table_formats=parquet/none/none \
--impalad_args="$args" \
--query_names=TPCDS-Q3,TPCDS-Q14-1,TPCDS-Q14-2,TPCDS-Q23-1,TPCDS-Q23-2,TPCDS-Q49,TPCDS-Q76,TPCDS-Q78,TPCDS-Q80A \
"IMPALA-12091~1" IMPALA-12091
The benchmark result is as follows:
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| TPCDS(10) | TPCDS-Q23-1 | parquet / none / none | 4.62 | 4.54 | +1.92% | 0.23% | 1.59% | 5 | +2.32% | 1.15 | 2.67 |
| TPCDS(10) | TPCDS-Q14-1 | parquet / none / none | 5.82 | 5.76 | +1.08% | 5.27% | 3.89% | 5 | +2.04% | 0.00 | 0.37 |
| TPCDS(10) | TPCDS-Q23-2 | parquet / none / none | 4.65 | 4.58 | +1.38% | 1.97% | 0.48% | 5 | +0.81% | 0.87 | 1.51 |
| TPCDS(10) | TPCDS-Q49 | parquet / none / none | 1.49 | 1.48 | +0.46% | * 36.02% * | * 34.95% * | 5 | +1.26% | 0.58 | 0.02 |
| TPCDS(10) | TPCDS-Q14-2 | parquet / none / none | 3.76 | 3.75 | +0.39% | 1.67% | 0.58% | 5 | -0.03% | -0.58 | 0.49 |
| TPCDS(10) | TPCDS-Q78 | parquet / none / none | 2.80 | 2.80 | -0.04% | 1.32% | 1.33% | 5 | -0.42% | -0.29 | -0.05 |
| TPCDS(10) | TPCDS-Q80A | parquet / none / none | 2.87 | 2.89 | -0.51% | 1.33% | 0.40% | 5 | -0.01% | -0.29 | -0.82 |
| TPCDS(10) | TPCDS-Q3 | parquet / none / none | 0.18 | 0.19 | -1.29% | * 15.26% * | * 15.87% * | 5 | -0.54% | -0.87 | -0.13 |
| TPCDS(10) | TPCDS-Q76 | parquet / none / none | 1.08 | 1.11 | -2.98% | 0.92% | 1.70% | 5 | -3.99% | -2.02 | -3.47 |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
Testing:
- Pass PlannerTest.testProcessingCost
- Pass test_executor_groups.py
- Reenable test_tpcds_q51a in TestTpcdsQueryWithProcessingCost with
MAX_FRAGMENT_INSTANCES_PER_NODE set to 5
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost
- Pass core tests
Change-Id: If948e45455275d9a61a6cd5d6a30a8b98a7c729a
Reviewed-on: http://gerrit.cloudera.org:8080/19807
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
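The whole change boils down to two lines per file plus targeted '//'
conversions:
  from __future__ import absolute_import, division, print_function
  lo, hi = 0, 9
  mid = (lo + hi) // 2  # floor division keeps index math integral on 2 and 3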
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Python 3 does not support this old except syntax:
except Exception, e:
Instead, it needs to be:
except Exception as e:
This uses impala-futurize to fix all locations of
the old syntax.
Testing:
- The check-python-syntax.sh no longer shows errors
for except syntax.
Change-Id: I1737281a61fa159c8d91b7d4eea593177c0bd6c9
Reviewed-on: http://gerrit.cloudera.org:8080/19551
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Re-enables tests under erasure coding, or provides more specific
exceptions.
Erasure coding uses multiple data blocks to construct a block group. Our
tests use RS-3-2-1024k, which includes 3 data blocks in a block group.
Each of these blocks is sized according to `dfs.block.size`, so block
groups by default hold up to 384MB of data.
Impala schedules work to executors based on blocks reported by HDFS,
which for EC actually represent block groups. So with default block
size, a file in EC has 1/3rd the number of schedulable blocks. In the
case of tpch.lineitem, this produces 2 parquet files instead of 3 and
reduces the number of executors scheduled to read parquet lineitem:
1. lineitem.tbl is loaded via Hive. With EC it uses 2 block groups,
without EC it uses 6 blocks.
2. parquet lineitem is created by select/insert from lineitem.tbl.
Impala schedules reads to executors based on available blocks, so
with EC this gets scheduled across 2 executors instead of 3 and each
executor writes a separate parquet file.
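Spelled out, the arithmetic above is:
  dfs_block_size = 128 * 1024 * 1024   # default dfs.block.size
  data_blocks_per_group = 3            # RS-3-2-1024k
  group_capacity = data_blocks_per_group * dfs_block_size  # 384MB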
Change-Id: Ib452024993e35d5a8d2854c6b2085115b26e40df
Reviewed-on: http://gerrit.cloudera.org:8080/19172
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Combines all SkipIf* classes for different filesystems into a single
SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the
rest as filesystem-specific special cases. The 'jira' option is removed
in favor of specific flags for each issue.
Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1
Reviewed-on: http://gerrit.cloudera.org:8080/18781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The original issue is that the strict HS2 shell tests
are not running in precommit or nightly jobs, but they
do run in local developer environments. Investigating
this showed that the shell tests were running with a
weird set of test dimensions that includes
table_format_and_file_extension. That dimension is only
used in test_insert.py::TestInsertFileExtension.
What is happening is that the shell tests and other
locations are running add_test_dimensions() without
calling super(..., cls).add_test_dimensions(). The
behavior is unclear, but there is clearly cross-talk
between the different tests that do this.
This changes all add_test_dimensions() locations to
call super(..., cls).add_test_dimensions() if they
don't already. Each location has been tuned to run
the same set of tests as before (except the shell
tests which now run the strict HS2 tests).
As part of this, several shell tests need to be
skipped or fixed for strict HS2.
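The pattern every suite now follows (sketch):
  class TestExample(ImpalaTestSuite):
    @classmethod
    def add_test_dimensions(cls):
      # Without this super() call, dimension state leaks between suites,
      # e.g. table_format_and_file_extension showing up in shell tests.
      super(TestExample, cls).add_test_dimensions()
      # ...suite-specific dimensions and constraints follow...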
Testing:
- Ran core job
- Ran tests locally to verify the set of tests
didn't change.
Change-Id: Ib20fd479d3b91ed0ed89a0bc5623cd2a5a458614
Reviewed-on: http://gerrit.cloudera.org:8080/18557
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Result spooling has been relatively stable since it was introduced, and
it has several benefits described in IMPALA-8656. This patch enables the
result spooling query option (SPOOL_QUERY_RESULTS) by default.
Furthermore, some tests need to be adjusted to account for result
spooling by default. The following are the adjustment categories and the
lists of tests that fall under each category.
Change in assertions:
PlannerTest#testAcidTableScans
PlannerTest#testBloomFilterAssignment
PlannerTest#testConstantFolding
PlannerTest#testFkPkJoinDetection
PlannerTest#testFkPkJoinDetectionWithHDFSNumRowsEstDisabled
PlannerTest#testKuduSelectivity
PlannerTest#testMaxRowSize
PlannerTest#testMinMaxRuntimeFilters
PlannerTest#testMinMaxRuntimeFiltersWithHDFSNumRowsEstDisabled
PlannerTest#testMtDopValidation
PlannerTest#testParquetFiltering
PlannerTest#testParquetFilteringDisabled
PlannerTest#testPartitionPruning
PlannerTest#testPreaggBytesLimit
PlannerTest#testResourceRequirements
PlannerTest#testRuntimeFilterQueryOptions
PlannerTest#testSortExprMaterialization
PlannerTest#testSpillableBufferSizing
PlannerTest#testTableSample
PlannerTest#testTpch
PlannerTest#testKuduTpch
PlannerTest#testTpchNested
PlannerTest#testUnion
TpcdsPlannerTest
custom_cluster/test_admission_controller.py::TestAdmissionController::test_dedicated_coordinator_planner_estimates
custom_cluster/test_admission_controller.py::TestAdmissionController::test_memory_rejection
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_mem_limit_configs
metadata/test_explain.py::TestExplain::test_explain_level2
metadata/test_explain.py::TestExplain::test_explain_level3
metadata/test_stats_extrapolation.py::TestStatsExtrapolation::test_stats_extrapolation
Increase BUFFER_POOL_LIMIT:
query_test/test_queries.py::TestQueries::test_analytic_fns
query_test/test_runtime_filters.py::TestRuntimeRowFilters::test_row_filter_reservation
query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output
query_test/test_spilling.py::TestSpillingBroadcastJoins::test_spilling_broadcast_joins
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_aggs
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_regression_exhaustive
query_test/test_udfs.py::TestUdfExecution::test_mem_limits
Increase MEM_LIMIT:
query_test/test_mem_usage_scaling.py::TestExchangeMemUsage::test_exchange_mem_usage_scaling
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling
Increase MAX_ROW_SIZE:
custom_cluster/test_parquet_max_page_header.py::TestParquetMaxPageHeader::test_large_page_header_config
query_test/test_insert.py::TestInsertQueries::test_insert_large_string
query_test/test_query_mem_limit.py::TestQueryMemLimit::test_mem_limit
query_test/test_scanners.py::TestTextSplitDelimiters::test_text_split_across_buffers_delimiter
query_test/test_scanners.py::TestWideRow::test_wide_row
Disable result spooling to maintain assertion:
custom_cluster/test_admission_controller.py::TestAdmissionController::test_set_request_pool
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_host_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_pool_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_queue_reasons_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_config_change_while_queued
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_fetched_rows
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_finished_query
custom_cluster/test_scratch_disk.py::TestScratchDir::test_no_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_existing_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_writable_dirs
query_test/test_insert.py::TestInsertQueries::test_insert_large_string (the last query only)
query_test/test_kudu.py::TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_kudu_scan_mem_usage
query_test/test_queries.py::TestQueriesParquetTables::test_very_large_strings
query_test/test_query_mem_limit.py::TestCodegenMemLimit::test_codegen_mem_limit
shell/test_shell_client.py::TestShellClient::test_fetch_size
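The opt-out used in the last category looks roughly like this (exact
mechanics vary by suite):
  def test_example(self, vector):
      # Keep the original assertion by disabling spooling for this test:
      vector.get_value('exec_option')['spool_query_results'] = '0'
      self.run_test_case('QueryTest/insert', vector)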
Testing:
- Pass exhaustive tests.
Change-Id: I9e360c1428676d8f3fab5d95efee18aca085eba4
Reviewed-on: http://gerrit.cloudera.org:8080/16755
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
These tests were disabled due to S3's eventually consistent
behavior. Now that S3 is strongly consistent, these tests do
not need to be disabled.
Testing:
- Ran s3 core job
Change-Id: Ie9041f530bf3a818f8954b31a3d01d9f6753d7d4
Reviewed-on: http://gerrit.cloudera.org:8080/16931
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds a new query option MAX_FS_WRITERS that limits the
number of HDFS writer instances.
Highlights:
- Depending on the plan, it either restricts the num of instances of
the root fragment or adds an exchange and then limits the num of
instances of that.
- Assigns instances evenly across available backends.
- "no-shuffle" query hint is ignored when using query option.
- Change in behavior of plans is only when this query option is used.
- The only exception to the previous point is that the optimization
logic that decides to add an exchange now looks at the num of
instances instead of the number of nodes.
Limitation:
A mismatch between the cluster state during query planning and during
scheduling can result in more or fewer fragment instances being
scheduled than expected. E.g. if max_fs_writers is 2 and the planner
sees only 2 executors, then it might not add an exchange between a scan
node and the table sink, but if there are 3 nodes during scheduling,
that scan+tablesink instance will be scheduled on 3 backends.
Testing:
- Added planner tests to cover all cases where this enforcement kicks
in and to highlight the behavior.
- Added e2e tests to confirm that the scheduler is enforcing the limit
and distributing the instance evenly across backends for different
plan shapes.
Change-Id: I17c8e61b9a32d908eec82c83618ff9caa41078a5
Reviewed-on: http://gerrit.cloudera.org:8080/16204
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Modified the insert test files to determine dynamically which database
they need to use for 'CREATE TABLE LIKE'.
Tests:
Did targeted exhaustive testruns in test_insert.py and
test_mt_dop.py and did a full exhaustive testrun.
Change-Id: Ib3c7ba02190f57a7ed40311c95a3dd9eca9b474d
Reviewed-on: http://gerrit.cloudera.org:8080/15816
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Fixed the usage of unique_database in test_insert.py to make the tests
wait until the database is synced.
Testing:
-tests/run-tests.py query_test/test_insert.py --exploration_strategy=exhaustive
Change-Id: I9b7aa3775dd4375f536d76f2e236ce126f8c78cd
Reviewed-on: http://gerrit.cloudera.org:8080/15766
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
-Modified 'test_insert.py' so the tests can run in parallel.
-Every test will create its own temporary tables for insert testing.
-Swapped out the SETUP tags for TRUNCATE TABLE query statements.
-Because the SETUP tag is not used anymore, the corresponding
code was removed.
-Fixed a test query in 'insert.test'. The test was incorrect, so it
was modified to test for the right behavior.
Testing:
-tests/run-tests.py query_test/test_insert.py
-impala-py.test tests/query_test/test_insert.py
-the same for test_insert_permutation.py and test_load.py
Change-Id: I257e936868917a2fcc6c030f6c855b247e8a0eea
Reviewed-on: http://gerrit.cloudera.org:8080/15529
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Writes to text tables on ABFS are failing because HADOOP-15860 recently
changed the ABFS behavior when writing files / folders that end with a
'.'. ABFS explicitly does not allow files / folders that end with a dot.
From the ABFS docs: "Avoid blob names that end with a dot (.), a forward
slash (/), or a sequence or combination of the two."
The behavior prior to HADOOP-15860 was to simply drop any trailing dots
when writing files or folders, but that can lead to various issues
because clients may try to read back a file that should exist on ABFS,
but doesn't. HADOOP-15860 changed the behavior so that any attempt to
write a file or folder with a trailing dot fails on ABFS.
Impala writes all text files with a trailing dot due to some odd
behavior in hdfs-table-sink.cc. The table sink writes files with
a "file extension" which is dependent on the file type. For example,
Parquet files have a file extension of ".parq". For some reason, text
files had no file extension, so Impala would try to write text files of
the following form:
"244c5ee8ece6f759-8b1a1e3b00000000_45513034_data.0.".
Several tables created during dataload, such as alltypes, already use
the '.txt' extension for their files. These tables are not created via
Impala's INSERT code path, they are copied into the table. However,
there are several tables created during dataload, such as
alltypesinsert, that are created via Impala. This patch will change
the files in these tables so that they end in '.txt'.
This patch adds the ".txt" extension to all written text files and
modifies the hdfs-table-sink.cc so that it doesn't add a trailing dot to
a filename if there is no file extension.
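The fixed name construction, sketched (hypothetical helper):
  def output_file_name(prefix, seq_num, ext):
      name = '{0}_data.{1}'.format(prefix, seq_num)
      # No trailing dot when there is no extension.
      return (name + '.' + ext) if ext else name
  # output_file_name('244c5ee8...', 0, 'txt') -> '244c5ee8..._data.0.txt'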
Testing:
* Ran core tests
* Re-ran affected ABFS tests
* Added test to validate that the correct file extension is used for
Parquet and text tables
* Manually validated that without the addition of the '.txt' file
extension, files are not written with a trailing dot
Change-Id: I2a9adacd45855cde86724e10f8a131e17ebf46f8
Reviewed-on: http://gerrit.cloudera.org:8080/14621
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch makes the following changes / fixes when running Impala tests
on ABFS:
* Skips some tests in test_lineage.py that don't work on ABFS / ADLS
(they were already skipped for S3)
* Skips some tests in test_mt_dop.py; the test creates a directory that
ends with a period (and ABFS does not support writing files or
directories that end with a period)
* Removes the ABFS skip flag SkipIfABFS.trash (IMPALA-7726: "Drop with
purge tests fail against ABFS due to trash misbehavior"); I removed
these flags and looped the tests overnight with no failures, so it is
likely that whatever bug was causing this has now been fixed
* Now that HADOOP-15860 has been resolved, and the agreed upon behavior
for ABFS is that it will fail if a client tries to write a file /
directory that ends with a period, I added a new entry to the SkipIfABFS
class called file_or_folder_name_ends_with_period and applied it where
necessary
Testing:
* Ran core tests on ABFS
Change-Id: I18ae5b0f7de6aa7628a1efd780ff30a0cc3c5285
Reviewed-on: http://gerrit.cloudera.org:8080/14636
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This gives us some additional coverage for using admission
control in a simple but realistic configuration.
What are the implications of this change for test stability and
flakiness?
On one hand, we are adding some more unpredictability
to tests, because they may be queued for an arbitrary amount of
time. On the other, we can prevent queries from contending over
memory. Currently we rely on luck to prevent concurrent queries
from forcing each other out-of-memory.
I think the unpredictability from the queueing is
preferable, because we can generally work around these by
fixing tests that are sensitive to being queued, whereas
contention over memory requires us to use crude workarounds
like forcing tests to execute serially.
Added observability for the configured queue wait time for each pool.
I noticed that I did not have a direct way to observe the effective
value when I set configs. This is IMPALA-8905.
I had to tweak tests in a few ways:
* Tests with large strings needed higher memory limits.
* Hardcoded instances of default-pool had to handle root.default
as well.
* test_query_mem_limit needed to run without a mem_limit. I
created a special pool root.no-limits with no memory limits
to allow that.
Testing:
Ran the dockerised build 5-6 times to flush out flaky tests.
Change-Id: I7517673f9e348780fcf7cd6ce1f12c9c5a55373a
Reviewed-on: http://gerrit.cloudera.org:8080/13942
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
test_acid_nonacid_insert has been failing lately. HMS became more
strict about checking the capabilities of its clients. Seems like
the Python client doesn't set any capabilities for itself, therefore
HMS rejects its attempts to create and drop tables.
Now instead of using the RESET utility from the e2e test framework
(to drop and re-create tables), the test is using a unique database
and creates the tables through Impala. Different file formats are
exercised with the help of the DEFAULT_FILE_FORMAT query option.
Change-Id: I3a82338a7820d0ee748c961c8656fa3319c3929c
Reviewed-on: http://gerrit.cloudera.org:8080/14064
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I had to add @UniqueDatabase.parametrize(sync_ddl=True) to some e2e
tests because they were broken in exhaustive mode. When the tests run
with sync_ddl=True then the test files are executed against multiple
impalads which means that each statement in the .test file is executed
against a random impalad.
Change-Id: Ic724e77833ed9ea58268e1857de0d33f9577af8b
Reviewed-on: http://gerrit.cloudera.org:8080/13966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds INSERT support for insert-only ACID tables.
The Frontend opens a transaction for INSERT statements when the target
table is transactional. It also allocates a write ID for the target
table. The Frontend aborts the transaction if an error occurs during
analysis/planning.
The Backend gets the transaction id and the write id in TFinalizeParams.
The write id is also set for the HDFS table sinks. The sinks write
the files at their final destination which is an ACID base or delta
directory. There is no need for finalization of transactional INSERTS.
When the sinks have finished writing the data, the Coordinator invokes
updateCatalog() on catalogd which also commits the transaction if
everything went well, otherwise the Coordinator aborts the transaction.
Testing:
* added new tables during dataload
* added acid-insert.test file with INSERT statements against the new
tables
* test insertions between ACID and non-ACID tables
* test error scenarios via debug actions
* added integration test with Hive to test_hms_integration.py. The test
inserts data with Impala and reads with Hive. (These integration
tests only run with exhaustive exploration strategy)
TODO in following commits:
* add locks and heartbeats (without heartbeats long-running transactions
might be aborted by HMS)
* implement TRUNCATE
* CTAS creates files in the 'root' directory of the table/partition. It
is handled correctly during SELECT, but would be better to create a
base directory from the beginning. Hive creates a delta directory
for CTAS.
Change-Id: Id6c36fa6902676f06b4e38730f737becfc7c06ad
Reviewed-on: http://gerrit.cloudera.org:8080/13559
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A new enum value LZ4_BLOCKED was added to the THdfsCompression enum to
distinguish it from the existing LZ4 codec. The LZ4_BLOCKED codec
represents the block compression scheme used by Hadoop. It's similar to
SNAPPY_BLOCKED as far as the block format is concerned, with the only
difference being the codec used for compression and decompression.
Added Lz4BlockCompressor and Lz4BlockDecompressor classes for
compressing and decompressing parquet data using Hadoop's
lz4 block compression scheme.
The Lz4BlockCompressor treats the input
as a single block and generates a compressed block with the following
layout:
<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>
The HDFS parquet table writer should call the Lz4BlockCompressor
using the ideal input size (the unit of compression in parquet is a
page), so the Lz4BlockCompressor does not further break down the input
into smaller blocks.
The Lz4BlockDecompressor, on the other hand, should be compatible with
blocks written by Impala and other engines in the Hadoop ecosystem. It
can decompress compressed data in the following format:
<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>
...
<4 byte big endian compressed size>
<lz4 compressed block>
...
<repeated until the uncompressed size from the outer block is consumed>
Externally users can now set the lz4 codec for parquet using:
set COMPRESSION_CODEC=lz4
This gets translated into the LZ4_BLOCKED codec for the
HdfsParquetTableWriter. Similarly, when reading lz4 compressed parquet
data, the LZ4_BLOCKED codec is used.
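A sketch of the framing in Python (assumes the third-party 'lz4'
package; the real implementation is the C++ classes above):
  import struct
  import lz4.block  # assumption: pip package 'lz4'
  def lz4_blocked_compress(page):
      # Parquet hands the writer one page at a time, so the whole input
      # becomes a single inner block.
      comp = lz4.block.compress(page, store_size=False)
      # <4B BE uncompressed size><4B BE compressed size><lz4 block>
      return struct.pack('>II', len(page), len(comp)) + comp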
Testing:
- Added unit tests for LZ4_BLOCKED in decompress-test.cc
- Added unit tests for Hadoop compatibility in decompress-test.cc,
basically being able to decompress an outer block with multiple inner
blocks (the Lz4BlockDecompressor description above)
- Added interoperability tests for Hive and Impala for all parquet
codecs. New test added to
tests/custom_cluster/test_hive_parquet_codec_interop.py
Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
Reviewed-on: http://gerrit.cloudera.org:8080/13582
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Makefile was updated to include zstd in the ${IMPALA_HOME}/toolchain
directory. Other changes were made to make the zstd headers and libs
accessible.
The ZstandardCompressor/ZstandardDecompressor classes were added to
provide interfaces for calling the ZSTD_compress/ZSTD_decompress
functions. Zstd supports different compression levels (clevel) from 1 to
ZSTD_maxCLevel(). Zstd also supports negative clevels, but since the
negative values represent uncompressed data they won't be supported. The
default clevel is ZSTD_CLEVEL_DEFAULT.
HdfsParquetTableWriter was updated to support the ZSTD codec. The
new codec can be set using the existing query option as follows:
set COMPRESSION_CODEC=ZSTD:<clevel>;
set COMPRESSION_CODEC=ZSTD; // uses ZSTD_CLEVEL_DEFAULT
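For illustration, the level handling via the 'zstandard' Python package
as a stand-in for the C API the patch calls:
  import zstandard as zstd
  DEFAULT_CLEVEL = 3  # ZSTD_CLEVEL_DEFAULT in the zstd library
  assert 1 <= DEFAULT_CLEVEL <= zstd.MAX_COMPRESSION_LEVEL
  data = zstd.ZstdCompressor(level=DEFAULT_CLEVEL).compress(b'payload')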
Testing:
- Added unit test in DecompressorTest class with ZSTD_CLEVEL_DEFAULT
clevel and a random clevel. The test unit decompresses an input
compressed data and validates the result. It also tests for
expected behavior when passing an over/under sized buffer for
decompressing.
- Added unit tests for valid/invalid values for COMPRESSION_CODEC.
- Added an e2e test in test_insert_parquet.py which tests
writing/reading (null/non-null) data into/from a table (with columns
of different data types) using multiple codecs. Other existing e2e
tests were updated to also use the parquet/zstd table format.
- Manual interoperability tests were run between Impala and Hive.
Change-Id: Id2c0e26e6f7fb2dc4024309d733983ba5197beb7
Reviewed-on: http://gerrit.cloudera.org:8080/13507
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
test_insert_mem_limit in test_insert.py will wait for all fragments
to exit before proceeding after the test query hits a memory limit.
On some slower EC2 instances with Centos6, the default wait time of
60s may occasionally cause the test to fail due to timing out while
waiting for all fragments to exit.
This change increases the timeout to 180 seconds.
Testing done:
- Looped test_insert_mem_limit for 100 times on Centos6;
Change-Id: I2e14bef79c6c6fb0004270319f1c491194260438
Reviewed-on: http://gerrit.cloudera.org:8080/13292
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-4865 is fixed so these now pass. I noticed
that the IMPALA-4874 test occasionally hit
"Memory Limit Exceeded" when looped, so I reduced
the data size there slightly.
Testing:
Looped the tests locally against a dockerised minicluster
for a while.
Change-Id: I030f4eff2d3fb771fc92b760efb13170e68285dc
Reviewed-on: http://gerrit.cloudera.org:8080/13233
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fixes all core e2e tests running on my local dockerised
minicluster build. I do not yet have a CI job or script running
but I wanted to get feedback on these changes sooner. The second
part of the change will include the CI script and any follow-on
fixes required for the exhaustive tests.
The following fixes were required:
* Detect docker_network from TEST_START_CLUSTER_ARGS
* get_webserver_port() does not depend on the caller passing in
the default webserver port. It failed previously because it
relied on start-impala-cluster.py setting -webserver_port
for *all* processes.
* Add SkipIf markers for tests that don't make sense or are
non-trivial to fix for containerised Impala.
* Support loading Impala-lzo plugin from host for tests that depend on
it.
* Fix some tests that had 'localhost' hardcoded - instead it should
be $INTERNAL_LISTEN_HOST, which defaults to localhost.
* Fix bug with sorting impala daemons by backend port, which is
the same for all dockerised impalads.
Testing:
I ran tests locally as follows after having set up a docker network and
starting other services:
./buildall.sh -noclean -notests -ninja
ninja -j $IMPALA_BUILD_THREADS docker_images
export TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster"
export FE_TEST=false
export BE_TEST=false
export JDBC_TEST=false
export CLUSTER_TEST=false
./bin/run-all-tests.sh
Change-Id: Iee86cbd2c4631a014af1e8cef8e1cd523a812755
Reviewed-on: http://gerrit.cloudera.org:8080/12639
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Using LIBHDFS_OPTS+="-Xcheck:jni" revealed a handful of warnings related to
(a) checking for exceptions and (b) leaking local references.
Checking for exceptions required sprinkling RETURN_ERROR_IF_EXC
left and right. I chose not to expand the JniCall infrastructure
to handle this more generally at the moment.
The leaky local references are a bit harder. In the logs, they show up
as "WARNING: JNI local refs: 2597, exceeds capacity: 35" or similar. A
few of these errors seem to be not in our code. The ones that I've
found in our code stemmed from HBaseTableScanner::GetRowKey(): this
method uses local references and wasn't returning them. Using a
JniLocalFrame seems to have taken care of the warnings.
I have added code to skip test_large_strings when JNI checking is
enabled. This test takes forever (presumably because JNI is checking
bounds on strings very aggressively), and times out. The time out also
causes some metric-related checks to fail (since a query is still in
flight).
Debugging this required customizing my JDK to give stack traces
when these warnings occurred. The following diff facilitated
this.
diff -r 76a9c9cf14f1 src/share/vm/prims/jniCheck.cpp
--- a/src/share/vm/prims/jniCheck.cpp Tue Jan 15 10:43:31 2019 +0000
+++ b/src/share/vm/prims/jniCheck.cpp Wed Feb 27 11:57:13 2019 -0800
@@ -143,11 +143,30 @@
static const char * fatal_instance_field_mismatch = "Field type (instance) mismatch in JNI get/set field operations";
static const char * fatal_non_string = "JNI string operation received a non-string";
+// thisone: whether to print every time, or maybe, depending on future
+// how many future stacks we want printed (totally racy); helps catch
+// missing exception handling if there's a way to tickle that code
+// reliably.
+static inline void dump_native_stack(JavaThread* thr, bool thisone, int future) {
+ static int fut_stacks = 0; // racy!
+ if (fut_stacks > 0) {
+ thisone = true;
+ fut_stacks--;
+ }
+ if (future > 0) fut_stacks = future;
+ if (thisone) {
+ frame fr = os::current_frame();
+ char buf[6000];
+ tty->print_cr("Thread: %s %d", thr->get_thread_name(), thr->osthread()->thread_id());
+ print_native_stack(tty, fr, thr, buf, sizeof(buf));
+ }
+}
// When in VM state:
static void ReportJNIWarning(JavaThread* thr, const char *msg) {
tty->print_cr("WARNING in native method: %s", msg);
thr->print_stack();
+ dump_native_stack(thr, true, 0);
}
// When in NATIVE state:
@@ -199,11 +218,14 @@
tty->print_cr("WARNING in native method: JNI call made without checking exceptions when required to from %s",
thr->get_pending_jni_exception_check());
thr->print_stack();
+ dump_native_stack(thr, true, 10);
)
thr->clear_pending_jni_exception_check(); // Just complain once
}
}
+
+
/**
* Add to the planned number of handles. I.e. plus current live & warning threshold
*/
@@ -254,9 +276,12 @@
tty->print_cr("WARNING: JNI local refs: %zu, exceeds capacity: %zu",
live_handles, planned_capacity);
thr->print_stack();
+ dump_native_stack(thr, true, 0);
)
// Complain just the once, reset to current + warn threshold
add_planned_handle_capacity(handles, 0);
+ } else {
+ dump_native_stack(thr, false, 0);
}
}
Change-Id: Idd1709f749a764c1d947704bc64306493863b45f
Reviewed-on: http://gerrit.cloudera.org:8080/12660
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
test_insert_large_string uses up to 4GB of untracked memory, which
results in random OOMs during exhaustive testing on release builds.
Queries run faster on release builds, which might result in a different
set of tests running together when compared to those on debug builds.
This can result in queries requiring more memory running together with
test_insert_large_string and eventually encountering OOM errors.
Testing:
Successfully ran exhaustive tests twice on release build.
Change-Id: I6c950f6860b2f86865dbc5ce60055175e2c0bebc
Reviewed-on: http://gerrit.cloudera.org:8080/12110
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HADOOP-15407 adds a new FileSystem implementation called "ABFS" for the
ADLS Gen2 service. It's in the hadoop-azure module as a replacement for
WASB. Filesystem semantics should be the same, so skipped tests and
other behavior changes have simply mirrored what is done for ADLS Gen1
by default. Tests skipped on ADLS Gen1 due to eventual consistency of
the Python client can be run against ADLS Gen2.
Change-Id: I5120b071760e7655e78902dce8483f8f54de445d
Reviewed-on: http://gerrit.cloudera.org:8080/11630
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I used some ideas from Alex Leblang's abandoned patch:
https://gerrit.cloudera.org/#/c/137/ in order to run .test files through
HS2. The advantage of using Impyla is that much of the code will be
reusable for any Python client implementing the standard Python dbapi
and does not require us implementing yet another thrift client.
This gives us better coverage of non-trivial result sets from HS2,
including handling of NULLs, error logs and more interesting result
sets than the basic HS2 tests.
I added HS2 coverage to TestQueries, which has a reasonable variety of
queries and covers the data types in alltypes. I also added
TestDecimalQueries, TestStringQuery and TestCharFormats to get coverage
of DECIMAL, CHAR and VARCHAR that aren't in alltypes. Coverage of
results sets with NULLs was limited so I added a couple of queries.
Places where results differ from Beeswax:
* Impyla is a Python dbapi client so must convert timestamps into python datetime
objects, which only have microsecond precision. Therefore result
timestamps with nanosecond precision are truncated to microseconds.
* The HS2 interface reports the NULL type as BOOLEAN as a workaround for
IMPALA-914.
* The Beeswax interface reported VARCHAR as STRING, but HS2 reports
VARCHAR.
I dealt with different results by adding additional result sections so
that the expected differences between the clients/protocols were
explicit.
Limitations:
* Not all of the same methods are implemented as for beeswax, so some
tests that have more complicated interactions with the client will not
work with HS2 yet.
* We don't have a way to get the affected row count for inserts.
I also simplified the ImpalaConnection API by removing some unnecessary
methods and moved some generic methods to the base class.
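A minimal Impyla session of the kind the framework now wraps (standard
impyla dbapi; host/port are the usual local defaults):
  from impala.dbapi import connect
  conn = connect(host='localhost', port=21050)  # Impala's HS2 port
  cur = conn.cursor()
  cur.execute('select 1')
  print(cur.fetchall())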
Testing:
* Confirmed that it detected IMPALA-7588 by re-applying the buggy patch.
* Ran exhaustive and CentOS6 tests.
Change-Id: I9908ccc4d3df50365be8043b883cacafca52661e
Reviewed-on: http://gerrit.cloudera.org:8080/11546
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch schedules HDFS EC files without considering locality. Failed
tests are disabled and a jenkins build should succeed with export
ERASURE_CODING=true.
Testing: It passes core tests.
Cherry-picks: not for 2.x.
Change-Id: I138738d3e28e5daa1718c05c04cd9dd146c4ff84
Reviewed-on: http://gerrit.cloudera.org:8080/10413
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The arrangement of tests in test_insert.py changed with
IMPALA-7010, splitting out the memory limit tests into
test_insert_mem_limit(). On exhaustive, the combination
of test dimensions means test_insert_mem_limit() executes
11 different combinations. Each of these statements can
use a large amount of memory and this is not cleaned
up immediately. This has been causing
test_insert_overwrite(), which immediately follows
test_insert_mem_limit(), to hit the process memory limit.
This changes test_insert_mem_limit() to make it wait
for its fragments to finish.
Change-Id: I5642e9cb32dd02afd74dde7e0d3b31bddbee3ccd
Reviewed-on: http://gerrit.cloudera.org:8080/10426
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Moved a number of tests with tuned mem_limits. In some cases
this required separating the tests from non-tuned functional
tests.
TestQueryMemLimit used very high and very low limits only, so it seemed
safe to run in all configurations.
Change-Id: I9686195a29dde2d87b19ef8bb0e93e08f8bee662
Reviewed-on: http://gerrit.cloudera.org:8080/10370
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change moves the creation of the runtime profile from DataSink::Prepare()
to the ctor of DataSink derived classes. This makes sure that DataSink::Close()
and other functions can access the profile even if the DataSink fails to initialize.
Testing done: Added a test case which triggers failure in the initialization of output
expressions in a HdfsTableSink. Impalad crashed consistently without the fix.
Change-Id: I2a683000ef180027b929dbebe78bc2a530a4767e
Reviewed-on: http://gerrit.cloudera.org:8080/8770
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
This addresses a gap in test coverage. There are no known bugs here so
we expect this to work.
Testing:
Ran exhaustive build.
Change-Id: I4bea8bac37bb1e72f3ba0b2e162e6fc544aec8a8
Reviewed-on: http://gerrit.cloudera.org:8080/7398
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Impala Public Jenkins