impala

mirror of https://github.com/apache/impala.git synced 2025-12-25 02:03:09 -05:00

Author	SHA1	Message	Date
Jim Apple	103774a8e5	Update Python requests package to 2.20.0 See https://2.python-requests.org/en/master/community/updates/#id8. This is currently only used in the tests, but it's best to fix this now. While here, remove now-false not about required support for Python 2.6. Change-Id: I092a641a12f38cdb45b0062c31ffb51c0c664800 Reviewed-on: http://gerrit.cloudera.org:8080/17215 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>	2021-03-30 01:27:04 +00:00
Vihang Karajgaonkar	7198f07c92	IMPALA-10605: Deflake test_refresh_native This test uses a regex to parse the output of describe database and extract the db properties. The regex assumes that there will only be 1 key value pair which is broken when events processor is running. The fix is to modify the regex so that it only extracts the relevant function name prefix and its value. Testing: 1. The test fails when events processor is enabled. After the patch the test works as expected. Change-Id: I1df35b9c5f2b21cc7172f03ff8611d46070d64c2 Reviewed-on: http://gerrit.cloudera.org:8080/17227 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-30 01:05:40 +00:00
Zoltan Borok-Nagy	dbc2fc14d8	IMPALA-10597: Enable setting 'iceberg.file_format' Currently we prohibit setting the following properties: * iceberg.catalog * iceberg.catalog_location * iceberg.file_format * iceberg.table_identifier This patch enables setting 'iceberg.file_format', therefore if a table was created by another engine, but using HiveCatalog, we'll be able to set the data file format to the proper value and make the table readable by Impala. Setting the other properties are not needed for HiveCatalog tables. If the table wasn't created by HiveCatalog, then we cannot load the table, therefore we cannot invoke any ALTER TABLE statement at all. In that case we need to create an external table. If the table already contains data files, then Impala checks if all of them have the proper file format. If not, the ALTER TABLE statement fails. Before this patch a CREATE TABLE statement accepted any string for 'iceberg.file_format', and in case of invalid file formats the frontend silently used Parquet. This patch also adds a check to only allow valid file formats. Testing: * added e2e test Change-Id: I4b3506be4562a1ace3e6435867aadb3bdde7a8e2 Reviewed-on: http://gerrit.cloudera.org:8080/17207 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-29 18:32:31 +00:00
Fucun Chu	77d6acd032	IMPALA-10581: Implement ds_theta_intersect_f() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Computes the intersection of two sketches of same or different column and returns the resulting sketch of intersection. Example: select ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) from sketch_tbl; +-----------------------------------------------------------+ \| ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) \| +-----------------------------------------------------------+ \| 5 \| +-----------------------------------------------------------+ Change-Id: I335eada00730036d5433775cfe673e0e4babaa01 Reviewed-on: http://gerrit.cloudera.org:8080/17186 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-29 15:59:49 +00:00
Steve Carlin	8f8668aaf0	IMPALA-10593: Conditionally skip runtime filter for outer joins Currently there is code that asserts that an Expr is not constant after substituting SlotRefs with constant nulls. For External FE, this restriction to be weakened. In a case where an Expr is checked and the Expr is not constant even after substituting nulls, the result will be to not generate a runtime filter for that Expr. Testing: Manually tested with this query in the External FE: select id, int_col, year, month from alltypessmall s where s.int_col = (select count(*) from alltypestiny t where s.id = t.id) order by id Change-Id: I46462e2030731d97c4c88e364148c0093c025ab3 Reviewed-on: http://gerrit.cloudera.org:8080/17200 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-28 05:51:18 +00:00
stiga-huang	cf463f28c3	IMPALA-10609: Fix NPE in resolving table masks in StmtMetadataLoader When creating a StmtMetadataLoader, the 'user' field could be null in places that don't need to resolve column-masking/row-filtering policies. E.g. in processing GetColumns HS2 operation, or some FE tests. This patch skips resolving the table masks in such cases to avoid NullPointerException. Tests: - Run CORE tests and verified no NullPointerException found. Change-Id: I7aa20458b02e8a93a871b6dd875decfab82c4eae Reviewed-on: http://gerrit.cloudera.org:8080/17235 Reviewed-by: Aman Sinha <amsinha@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-27 12:26:23 +00:00
Aman Sinha	2e5589d85f	IMPALA-10116: Allow unwrapping a builtin cast function similar to CastExpr This change allows unwrapping a builtin cast function such as casttobigint(col) similar to a CAST(col as bigint). Unwrapping is useful to access the SlotRef of the column and this in turn is needed to compute predicate selectivity correctly. Without unwrapping, the cast function uses default 10 % selectivity for a predicate such as 'casttobigint(l_quantity) is NOT NULL' which is not accurate. Note that Impala does not allow a user query to directly call the builtin cast function. Rather, they have to use the explicit CAST syntax. However, since the frontend jar can be used by an external frontend module as a library, the builtin function can be called and this patch makes the behavior consistent. Testing: - Ran PlannerTest - Manual testing by commenting out the code in FunctionCallExpr.analyzeImpl() that throws an AnalysisException if builtin cast function is called. I haven't added a new test for this reason. Cardinality before this change: explain select * from date_dim d1, date_dim d2 where d1.d_week_seq = d2.d_week_seq - 52 and casttobigint(d1.d_week_seq) is not null and casttobigint(d2.d_week_seq) is not null SCAN HDFS [tpcds.date_dim d1] HDFS partitions=1/1 files=1 size=9.84MB predicates: casttobigint(d1.d_week_seq) IS NOT NULL runtime filters: RF000 -> d1.d_week_seq row-size=255B cardinality=7.30K Cardinality after this change: SCAN HDFS [tpcds.date_dim d1] HDFS partitions=1/1 files=1 size=9.84MB predicates: casttobigint(d1.d_week_seq) IS NOT NULL runtime filters: RF000 -> d1.d_week_seq row-size=255B cardinality=73.05K Change-Id: Idf82b2de78c6a7051ea036062f177d69e2558940 Reviewed-on: http://gerrit.cloudera.org:8080/16407 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-27 09:05:32 +00:00
Bikramjeet Vig	0b79464d9c	IMPALA-10397: Fix test_single_workload The logs on failed runs indicated that the autoscaler never started another cluster. This can only happen if it never notices a queued query which is possible since this test was only failing in release builds. This patch increases the runtime of the sample query to make execution more predictable. Testing: Looped on my local on a release build Change-Id: Ide3c7fb4509ce9a797b4cbdd141b2a319b923d4e Reviewed-on: http://gerrit.cloudera.org:8080/17218 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-25 22:59:16 +00:00
wzhou-code	281a47caad	IMPALA-10564 (part 2): Fixed test_ctas_exprs failure for S3 build New test case TestDecimalOverflowExprs::test_ctas_exprs was added in the first patch for IMPALA-10564. But it failed in S3 build with Parquet format since the table was not successfully created when CTAS query failed. This patch fixed the test failure by skipping checking if NULL is inserted into table after CTAS failed for S3 build with Parquet. Testing: - Reproduced the test failure in local box with defaultFS as s3a. Verified the fixing was working with defaultFS as s3a. - Passed EE_TEST. Change-Id: Ia627ca70ed41764e86be348a0bc19e330b3334d2 Reviewed-on: http://gerrit.cloudera.org:8080/17228 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-25 21:45:10 +00:00
Vihang Karajgaonkar	5b27b7ca72	IMPALA-10598: Deflake test_cache_reload_validation This patch deflakes the test test_cache_reload_validation in test_hdfs_caching.py e2e test. The util method which the test relies on to get the count of list of cache directives by parsing the output of command "hdfs cacheadmin -listDirectives -stats" does not consider that the output may contain trailing new lines or headers. Hence the test fails because the expected number of cache directives does not match the number of lines of the output. The fix parses the line "Found <int> entries" in the output when available and returns the count from that line. If the line is not found, it fallbacks to the earlier implementation of using the number of lines. Testing: 1. The test was failing for me when run individually. After the patch, I looped the test 10 times without any errors. Change-Id: I2d491e90af461d5db3575a5840958d17ca90901c Reviewed-on: http://gerrit.cloudera.org:8080/17210 Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-25 02:51:54 +00:00
Thomas Tauber-Marshall	e3bafcbef4	IMPALA-10590: Introduce admission service heartbeat mechanism Currently, if a ReleaseQuery rpc fails, it's possible for the admission service to think that some resources are still being used that are actually free. This patch fixes the issue by introducing a periodic heartbeat rpc from coordinators to the admission service which contains a list of queries registered at that coordinator. If there is a query that the admission service thinks is running but is not included in the heartbeat, the admission service can conclude that the query must have already completed and release its resources. Testing: - Added a test that uses a debug action to simulate ReleaseQuery rpcs failing and checks that query resources are released properly. Change-Id: Ia528d92268cea487ada20b476935a81166f5ad34 Reviewed-on: http://gerrit.cloudera.org:8080/17194 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-24 22:41:27 +00:00
Thomas Tauber-Marshall	452c2f1f7f	IMPALA-10604: Allow setting KuduClient's verbose log level directly This patch adds a flag --kudu_client_v which allows setting the verbose logging level for the KuduClient to a value other than the level for the rest of Impala (set by -v) in order to enable debugging of issues in the KuduClient without producing the enormous amount of logging that comes with setting a high -v value on all of Impala. Testing: - Manually set --kudu_client_v and confirmed that the expected logging is produced. Change-Id: Ib39358709ee714b8cdffd72a0ee58f66d5fab37e Reviewed-on: http://gerrit.cloudera.org:8080/17222 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-24 19:06:09 +00:00
Fucun Chu	622e3c95ad	IMPALA-10580: Implement ds_theta_union_f() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Union two sketches and returns the resulting sketch of union. Example: select ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) from sketch_tbl; +-------------------------------------------------------+ \| ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) \| +-------------------------------------------------------+ \| 15 \| +-------------------------------------------------------+ Change-Id: I8329979b81ceeaad739a43fab79768ca9c2916fa Reviewed-on: http://gerrit.cloudera.org:8080/17179 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-24 15:16:07 +00:00
wzhou-code	410c3e79e4	IMPALA-10564: Return error when inserting an invalid decimal value When using CTAS statements or INSERT-SELECT statements to insert rows to table with decimal columns, Impala insert NULL for overflowed decimal values, instead of returning error. This issue happens when the data expression for the decimal column in SELECT sub-query consists at least one alias. This issue is similar as IMPALA-6340, but IMPALA-6340 only fixed the issue for the cases with the data expression for the decimal columns as constants. This patch fixed the issue by calling RuntimeState::CheckQueryState() in the end of HdfsTableWriter::AppendRows() and KuduTableSink::Send(). If there is an invalid decimal error, the query will be failed without inserting NULL for decimal column. We did not change the behaviour for decimal_v1. NULL will be inserted to the table for invalid decimal values with warning message. Tests: - Added unit-tests for INSERT-SELECT and CTAS statements with overflowed decimal values to be inserted into tables. The overflowed decimal values are expressed as a constant expression, or as an expression with aliases. Also added cases to verify behaviour of decimal_v1 is unchanged. - Passed exhaustive tests. Change-Id: I64ce4ed194af81ef06401ffc1124e12f05b8da98 Reviewed-on: http://gerrit.cloudera.org:8080/17168 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-23 22:52:38 +00:00
Andrew Sherman	b28da054f3	IMPALA-10592: prevent pytest from hanging at exit. In TestAdmissionControllerStress mark worker threads as daemons so that an exception in teardown() will not cause pytest to hang just after printing the test results. https://stackoverflow.com/questions/19219596/py-test-hangs-after-showing-test-results TESTING: Simulated the failure in IMPALA-10596 by throwing an exception during teardown(). Without this fix the pytest invocation hangs. Change-Id: I74cca8f577c7fbc4d394311e2f039cf4f68b08df Reviewed-on: http://gerrit.cloudera.org:8080/17212 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-22 23:33:55 +00:00
Jim Apple	e5d5dbc30a	Update Paramiko to 2.4.2. See https://www.paramiko.org/changelog.html#2.4.2. This shouldn't directly apply to Impala deployments, but it is best to fix this in test now. Change-Id: If9cc9ea4a0763c8b5303ca4e8482761ee2f53efa Reviewed-on: http://gerrit.cloudera.org:8080/17214 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-22 19:34:00 +00:00
stiga-huang	a0f77680c5	IMPALA-10483: Support subqueries in Ranger masking policies This patch adds support for using subqueries in Ranger masking policies, i.e. column-masking/row-filtering policies. The subquery can reference either the current table or other tables. However, masking policies on these tables won't be applied recursively. This is consistent with Hive. One motivation is to avoid infinitely masking if it references the same table. Another motivation I think is to simplify the masking behavior, so when the admin is setting a masking expression, it can be considered as running in the admin's perspective (i.e. no masking). Implementation Before analyzing the query, the coordinator loads the metadata of all possibly used tables into the query's StmtTableCache. Table masking takes place after the analyzing phase. If the subquery filter introduces any new tables, the analyzer will fail to resolve them since their metadata is not loaded in the StmtTableCache. This patch modified the StmtMetadataLoader to also load those tables introduced by masking policies. So they can be resolved correctly. Tests - Add more complex tests in test_row_filtering Change-Id: I254df9f684c95c660f402abd99ca12dded7e764f Reviewed-on: http://gerrit.cloudera.org:8080/17185 Reviewed-by: Aman Sinha <amsinha@cloudera.com> Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-22 15:52:03 +00:00
stiga-huang	c9d7bcb4a1	IMPALA-9661: Avoid introducing unused columns in table masking view Previously, if a table has column masking policies, we replace its unanalyzed TableRef with an analyzed InlineViewRef (table masking view) in FromClause.analyze(). However, we can't detect which columns are actually used in the original query at this point. In fact, analyze() for SelectList, WhereClause, GroupByClause and other clauses containing SlotRefs happen after FromClause.analyze(). After the whole query block is analyzed, we can get the exact set of required columns. This patch refactor the codes to do table masking after analyze() to avoid introducing unused columns. Referenced columns of a TableRef are registered in analyze(), which helps to figure out what columns are actually needed. With this, we don't need to revert table masking in FromClause.reset(). The doTableMasking flag in AST is also removed since now the table mask is resolved once after analyze(). Tests: - Add more e2e tests in test_ranger.py - Run CORE tests Change-Id: Ib015a8ab528065907b27fbdceb8e2818deb814e1 Reviewed-on: http://gerrit.cloudera.org:8080/17199 Reviewed-by: Aman Sinha <amsinha@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-22 08:41:00 +00:00
John Sherman	879986ea6f	IMPALA-10552: Support external frontends supplying timeline for profile - Add EXTERNAL_FRONTEND as a client session type - Use EXTERNAL_FRONTEND session type for clients connected to external frontend interface. - Rename Query Timeline to Impala Backend Timeline for external frontends - the query timeline is no longer an end to end timeline when executing a plan from an external frontend - External frontends can provide timeline information through a TExecRequest by filling in the timeline field with a valid TEventSequence - The frontend timeline and backend timeline are completely separate entities, meaning there is no overall attempt to capture the timing end to end - This is due to the fact that the frontend and Impala may not share the same time source (or even machine). - It is safe to add together the backend + frontend timeline times to get a rough idea how long the query took end to end to execute, but keep in mind that this number does not capture the time it took the frontend to send the plan to the backend (Impala) nor does it capture how long it took the end user to read the results. Example timeline with external frontend: Frontend Timeline: 3s016ms - Analysis finished: 1s130ms (1s130ms) - Calcite plan generated: 2s170ms (1s040ms) - Metadata load started: 2s245ms (74.486ms) - Metadata load finished. loaded-tables=1: 2s654ms (409.847ms) - Single node plan created: 2s726ms (71.659ms) - Runtime filters computed: 2s756ms (30.000ms) - Distributed plan created: 2s761ms (5.265ms) - Execution request created: 2s890ms (128.387ms) - Impala plan generated: 2s891ms (1.508ms) - Planning finished: 2s893ms (1.894ms) - Submitted query: 3s016ms (122.377ms) Impala Backend Timeline: 79.998ms - Query submitted: 0.000ns (0.000ns) - Submit for admission: 0.000ns (0.000ns) - Completed admission: 0.000ns (0.000ns) - Ready to start on 1 backends: 3.999ms (3.999ms) - All 1 execution backends (2 fragment instances) started: 7.999ms (3.999ms) - Rows available: 55.999ms (47.999ms) - Execution cancelled: 79.998ms (23.999ms) - Released admission control resources: 79.998ms (0.000ns) - Unregister query: 79.998ms (0.000ns) Testing done: - Manual inspection of profiles on the Impala web UI - test_hs2.py - test_tpch_queries.py - test_tpcds_queries.py::TestTpcdsDecimalV2Query Co-authored-by: Kurt Deschler <kdeschle@cloudera.com> Change-Id: I2b3692b4118ea23c0f9f8ec4bcc27b0b68bb32ec Reviewed-on: http://gerrit.cloudera.org:8080/17183 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-19 22:24:29 +00:00
stiga-huang	98de1c5436	IMPALA-9234: Support Ranger row filtering policies Ranger row filtering policies provide customized expressions to filter out rows for specific users when reading from a table. This patch adds support for this feature. A new feature flag, enable_row_filtering, is added to disable this experimental feature. It defaults to be true so the feature is enabled by default. Enabling row-filtering requires --enable_column_masking=true since it depends on the column masking implementation. Note that row filtering policies take effects prior to any column masking policies, because column masking policies apply on result data. Implementation: The existing table masking view infrastructure can be extended to support row filtering. Currently when analyzing a table with column masking policies, we replace the TableRef with an InlineViewRef which contains a SelectStmt wrapping the columns with masking expressions. This patch adds the row filtering expressions to the WhereClause of the SelectStmt. Limitations: - Expressions using subqueries are not supported (IMPALA-10483). - Row filtering policies on nested tables will not be applied when nested collection columns are used directly in the FROM clause. This will leak data so we forbid such kinds of queries until IMPALA-10484 is resolved. Tests: - Add FE test for error message when disabling row filtering. - Add e2e test with row filtering policies. - Add e2e test with column masking and row filtering policies both take place. - Verified audits in a CDP cluster with Ranger and Solr set up. Change-Id: I580517be241225ca15e45686381b78890178d7cc Reviewed-on: http://gerrit.cloudera.org:8080/16976 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-18 21:08:14 +00:00
Zoltan Borok-Nagy	6162343842	IMPALA-10512: ALTER TABLE ADD PARTITION should bump the write id for ACID tables ALTER TABLE ADD PARTITION should bump the write id for ACID tables. Both for INSERT-only and full ACID tables. For transational tables we are adding partitions in an ACID transaction in the following sequence: 1. open transaction 2. allocate write id for table 3. add partitions to HMS table 4. commit transaction However, please note that table metadata modifications are independent of ACID transactions. I.e. if add partitions succeed, but we cannot commit the transaction, then we the newly added partitions won't get removed. So why are we opening a txn then? We are doing it in order to bump the write id in a best-effort way. This aids table metadata caching, so by looking at the table write id we can determine if the cached table metadata is up-to-date. Testing: * added e2e test Change-Id: Iad247008b7c206db00516326c1447bd00a9b34bd Reviewed-on: http://gerrit.cloudera.org:8080/17081 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-18 19:35:58 +00:00
Kurt Deschler	ac4984b74f	Revert "IMPALA-10503: testdata load hits hive memory limit errors during hive inserts" This reverts commit `c60a626ac6`. Change-Id: I896c7b2457d537fa1bfe8dc29063da0b7b3df199 Reviewed-on: http://gerrit.cloudera.org:8080/17191 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-18 07:24:31 +00:00
Kurt Deschler	231d41ae1f	IMPALA-10549: Register transactions from external frontend DML This change registers transactions that were started by an external frontend so that coordinator keepalive can track them properly. Testing: manually tested using DMLs from external frontend Reviewed-by: John Sherman <jfs@cloudera.com> Change-Id: Ia8863b8d9d281a5d164f10de9c5ee52cf3be63db Reviewed-on: http://gerrit.cloudera.org:8080/17122 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-18 01:31:25 +00:00
Aman Sinha	decb79d032	IMPALA-10518: Add ImpalaServer interface to retrieve executor membership. This patch adds an interface to ImpalaServer to retrieve the current executor membership snapshot from impalad for use by an external frontend. This involves sending a thrift request to impalad and receiving a thrift response. Refactored some code in exec-env into a separate function in the impala namespace which makes it easier to populate the needed information for an external frontend. Testing: - Ran selected tests for sanity check (no impact is expected since this is adding a new interface): - Frontend tests (PlannerTest, CardinalityTest) - Backend tests under custom_cluster/test_executor_groups.py - Manually tested with external frontend to ensure it gets the executor membership snapshot. Change-Id: Ie89b71f4555c368869ee7b9d6341756c60af12b5 Reviewed-on: http://gerrit.cloudera.org:8080/17181 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-18 00:39:22 +00:00
Thomas Tauber-Marshall	c823f17f2c	IMPALA-10577: Add retrying of AdmitQuery This patch adds retries of the AdmitQuery rpc by coordinators. This helps to ensure that if an admissiond goes down and is restarted or is temporarily unreachable, queries won't fail. The retries are done with backoff and jitter to avoid overloading the admissiond in these scenarios. A new flag, --admission_max_retry_time_s, is added to control how long queries will continue retrying before giving up. The AdmitQuery rpc is made idempotent - if a query is submitted with the same query id as one the admissiond already knows about, AdmitQuery will return OK without submitting the query to be scheduled again. Testing: - Added a custom cluster test that checks that queries won't fail when the admissiond goes down. Change-Id: I8bc0cac666bbd613a1143c0e2c4f84d3b0ad003a Reviewed-on: http://gerrit.cloudera.org:8080/17188 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-18 00:36:54 +00:00
Fucun Chu	3e82501531	IMPALA-10558: Implement ds_theta_exclude() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Computes the a-not-b set operation given two sketches of same or different column. Example: select ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) from sketch_tbl; +-------------------------------------------------------+ \| ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) \| +-------------------------------------------------------+ \| 5 \| +-------------------------------------------------------+ Change-Id: I05119fd8c652c07ff248a99e44b0da3541e46ca3 Reviewed-on: http://gerrit.cloudera.org:8080/17153 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-17 22:14:44 +00:00
John Sherman	122bd9ea06	IMPALA-10553: External Frontend CTAS support - Adds the concept of an external staging dir to HdfsTableSink - This allows an external to specify the destination of the sink - When this is set, the external frontend is responsible for for moving and managing the results - Any DDL related operations are assumed to be handled by the external frontend - External frontends may optionally supply a partition depth which acts as a hint to skip a certain number of partitions while creating directories for partitions. This is for when the external frontend has pre-created a certain number of the directories in staging (usually the static portion of a partition specification)/ - Modifies delta/base naming to include 0 prefix padding to match Hive for dynamic partitioning detection - External frontends are responsible for doing authorization checks against the staging directory and it is assumed the external frontend service is not exposed directly to users. Co-authored-by: Kurt Deschler <kdechle@cloudera.com> Change-Id: Iae0ea4a832d8281c563427d0d7da1623bfce437b Reviewed-by: Kurt Deschler <kdeschle@cloudera.com> Reviewed-on: http://gerrit.cloudera.org:8080/17145 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-16 15:17:26 +00:00
John Sherman	768722b868	IMPALA-10551: Add result sink support for external frontends - The intended purpose of these changes is to allow external frontends to receive query results via files rather than streaming the results through the thrift interface. - External frontends are expected to provide an FeFsTable implementation that describes the desired location to store results. - External frontends are responsible for managing the files after the query is completed. - Testing has been manual and through an implementation of an external frontend. Change-Id: I024bf41d77bb81f1ab0debdbd31ec3687c83f072 Reviewed-by: Aman Sinha <amsinha@cloudera.com> Reviewed-on: http://gerrit.cloudera.org:8080/17144 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2021-03-16 15:17:26 +00:00
Abhishek Rawat	573c60a298	IMPALA-10367: Impala-shell internal error - UnboundLocalError, local variable 'retry_msg' referenced before assign ImpalaHS2Client._open_session() has a 'retry_msg' variable which was not initialized in the code-path where retry was disabled. If an exception was hit with retry disabled, a compile time error was generated. The fix is to initialize 'retry_msg' in the non retry code-path. Testing: - Forced exception in ImpalaHS2Client._open_session() and verified that proper error message was generated. - Ran impala-shell e2e and custom cluster tests. Change-Id: I50a08a62a332de759022d0a4862e74f5a81945d9 Reviewed-on: http://gerrit.cloudera.org:8080/17172 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-16 01:14:06 +00:00
Kurt Deschler	c60a626ac6	IMPALA-10503: testdata load hits hive memory limit errors during hive inserts Changed the following hive settings to avoid hitting Hive container limit errors: hive.tez.container.size: 2048 hive.tez.java.opts: -Xmx1700m With these settings, testdata load completes without errors on a 32GB host. Reviewed-by: Fang-Yu Rao <fangyu.rao@cloudera.com> Change-Id: Idac5f054e814070b983f7f57aef4ea9d54252bb2 Reviewed-on: http://gerrit.cloudera.org:8080/17061 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Aman Sinha <amsinha@cloudera.com>	2021-03-15 20:55:29 +00:00
Riza Suminto	37ec96e72a	IMPALA-10559: Fix flakiness in TestScratchLimit. TestScratchLimit has been flaky in ubuntu-16.04-dockerised-tests environment since results spooling is enabled by default in IMPALA-9856. A combination of result spooling, sort query, and low buffer_pool_limit in TestScratchLimit::test_with_unlimited_scratch_limit seems to reveal a memory reservation bug in BufferedTutpleStream. This patch disables result spooling for tests under TestScratchLimit until the underlying bug is found. We will investigate the bug in a separate JIRA. Testing: - Disable result spooling in all tests of TestScratchLimit before IMPALA-9856 gets in. - Run and pass TestScratchLimit locally. Change-Id: I68736d6bfb0001423fd138000670ac60b2117fbe Reviewed-on: http://gerrit.cloudera.org:8080/17182 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-14 13:41:40 +00:00
Riza Suminto	47219ec366	IMPALA-10565: Adjust result spooling memory based on scratch_limit IMPALA-9856 enables result spooling by default. Result spooling depends on the ability to spill its entire BufferedTupleStream to disk once it hits maximum memory reservation. However, if the query option scratch_limit is set lower than max_spilled_result_spooling_mem, the query might fail in the middle of execution due to insufficient scratch space. This patch adds planner change to consider scratch_limit and scratch_dirs query option when computing resource used by result spooling. The algorithm is as follow: * If scratch_dirs is empty or scratch_limit < minMemReservationBytes required to use BufferedPlanRootSink, we set spool_query_results to false and fallback to use BlockingPlanRootSink. * If scratch_limit > minMemReservationBytes but still fairly low, we lower the max_result_spooling_mem (default is 100MB) and max_spilled_result_spooling_mem (default is 1GB) to fit scratch_limit. * if scratch_limit > max_spilled_result_spooling_mem, do nothing. Testing: - Add TestScratchLimit::test_result_spooling_and_varying_scratch_limit - Verify that spool_query_results query option is disabled in TestScratchDir::test_no_dirs - Pass exhaustive tests. Change-Id: I541f46e6911694e14c0fc25be1a6982fd929d3a9 Reviewed-on: http://gerrit.cloudera.org:8080/17166 Reviewed-by: Aman Sinha <amsinha@cloudera.com> Tested-by: Aman Sinha <amsinha@cloudera.com>	2021-03-14 03:35:40 +00:00
stiga-huang	2dfc68d852	IMPALA-7712: Support Google Cloud Storage This patch adds support for GCS(Google Cloud Storage). Using the gcs-connector, the implementation is similar to other remote FileSystems. New flags for GCS: - num_gcs_io_threads: Number of GCS I/O threads. Defaults to be 16. Follow-up: - Support for spilling to GCS will be addressed in IMPALA-10561. - Support for caching GCS file handles will be addressed in IMPALA-10568. - test_concurrent_inserts and test_failing_inserts in test_acid_stress.py are skipped due to slow file listing on GCS (IMPALA-10562). - Some tests are skipped due to issues introduced by /etc/hosts setting on GCE instances (IMPALA-10563). Tests: - Compile and create hdfs test data on a GCE instance. Upload test data to a GCS bucket. Modify all locations in HMS DB to point to the GCS bucket. Remove some hdfs caching params. Run CORE tests. - Compile and load snapshot data to a GCS bucket. Run CORE tests. Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b Reviewed-on: http://gerrit.cloudera.org:8080/17121 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-13 11:20:08 +00:00
Zoltan Borok-Nagy	6c6b0ee869	IMPALA-10222: CREATE TABLE AS SELECT for Iceberg tables This patch adds support for CREATE TABLE AS SELECT statements for Iceberg tables. CTAS statements work like the following in Impala: 1. Analysis of the whole CTAS statement 2. Divide CTAS to CREATE stmt and INSERT stmt 3. Create temporary in-memory target table from the CREATE stmt 4. Analyse the INSERT statement by using the temporary target table 5. If everything is OK so far, create the target table 6. Execute the INSERT query For Iceberg tables the non-trivial thing was to create the temporary target table without actually creating it via Iceberg API. I've created a new class 'IcebergCtasTarget' that mimics an FeIceberg table. It can be used with catalog V1 and V2 as well. Testing * e2e CTAS tests in iceberg-ctas.test * SHOW CREATE TABLE stmts in show-create-table.test Change-Id: I81d2084e401b9fa74d5ad161b51fd3e2aa3fcc67 Reviewed-on: http://gerrit.cloudera.org:8080/17130 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-12 19:28:19 +00:00
stiga-huang	d5f67fce41	IMPALA-10523: Fix impala-shell crash in printing error messages that contain UTF-8 characters In Python2, print() converts all non-keyword arguments to strings like str() does and writes them to the stream. str() on QueryStateException returns its value(i.e. error message) which could be in unicode type. Python2 will implicitly encode it to str type using the default encoding, 'ascii'. This could result in UnicodeEncodeError when there are non-ascii characters in the error message. This patch explicitly encodes the error message using 'utf-8' encoding if it's in unicode type and the shell is run in Python2. Tests: - Add test in test_shell_interactive.py Change-Id: Ie10f5b03ecc5877053c2fbada1afaf256b423a71 Reviewed-on: http://gerrit.cloudera.org:8080/17099 Reviewed-by: Tamas Mate <tmate@cloudera.com> Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-12 18:19:11 +00:00
Kurt Deschler	2e53c11484	IMPALA-10546: Add ImpalaServer interface to retrieve BackendConfig from impalad This patch add a new interface ImpalaServer::GetBackendConfig() that returns the current TBackendGflags from impalad. Testing: Called new interface from external frontend. Verified that TBackendGflags were populated correctly. Reviewed-by: John Sherman <jfs@cloudera.com> Change-Id: I14a3cee29f1fc91f4431b7ea89053bb3fbfa5e69 Reviewed-on: http://gerrit.cloudera.org:8080/17116 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-12 17:49:08 +00:00
Kurt Deschler	b3d95d8f32	IMPALA-10522: Support external use of frontend libraries This patch enables the Impala frontend jar and dependent library libfesupport.so to be used by an external Java frontend. Calling FeSupport.setExternalFE() will cause external frontend initialization mode to be used during FeSupport.loadLibrary(). This mode builds upon logic that is used to initialize the frontend jar for unit tests. Initialization in external frontend mode differs as follows: - Skip instantiating Frontend object and it's dependents - Skip loading libhdfs - Skip starting JVM Pause monitor - Disable Minidumper - Initialize TimezoneDatabase for external frontends - Disable redirect of stderr/stdout to libfesupport.so glog - Log messages from libfesupport.so to stderr - Use libfesupport.so for JNI symbol look up Null check were added in places where objects were assumed to be instantiated but are now skipped during initialization. Additional change: 1) Add libfesupport.lib path to JAVA_LIBRARY_PATH in test driver Testing: - Initialized frontend jar from external frontend - Verified that frontend Java objects can be used externally without issues - Verified that exceptions thrown from Impala Java or libfesupport can be caught or propagated correctly by the external frontend - Manual verification of minicluster logs - Ran queries with external frontend Co-authored-by: John Sherman <jfs@cloudera.com> Co-authored-by: Aman Sinha <amsinha@cloudera.com> Change-Id: I4e3a84721ba196ec00773ce2923b19610b90edd9 Reviewed-on: http://gerrit.cloudera.org:8080/17115 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>	2021-03-12 17:49:08 +00:00
Kurt Deschler	311938b4f5	IMPALA-10535: Add interface to ImpalaServer for execution of externally compiled statements The ExecutePlannedStatement interface allows an externally supplied TExecRequest to be executed by impalad. The TExecRequest must be fully populated and will be sent directly to the backend for execution. The following fields in the TExecRequest are updated by the coordinator: - Hostname - KRPC address - Local Timezone In order to add the interface to ImpalaInternalService.thrift, several of the thrift classes were moved to Query.thrift to avoid a circular dependency with Frontend.thrift. Added functionality to format and dump TExecRequest structures to path specified in debug flag dump_exec_request_path. A start timestamp field has been added to TExecRequest to represent the interval in the query profile between when the request was sent by the external frontend and handled by the backend. A local timestamp field has been added to the Ping result struct to return the current backend timestamp. This is used by the external to frontend to populate the start timestamp. Also included is a change to avoid generating silent AnalysisExceptions during table resolution. Tested with TExecRequest structures populated by external frontend. Local timezone change tested withe INT64 TIMESTAMP datatype Reviewed-by: John Sherman <jfs@cloudera.com> Change-Id: Iace716dd67290f08441857dc02d2428b0e335eaa Reviewed-on: http://gerrit.cloudera.org:8080/17104 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>	2021-03-12 17:49:08 +00:00
Fucun Chu	0d22e89df4	IMPALA-10520: Implement ds_theta_intersect() function This function receives a set of serialized Apache DataSketches Theta sketches produced by ds_theta_sketch() and intersects them into a single sketch. An example usage is to create a sketch for each partition of a table, write these sketches to a separate table and intersect them to get estimates based on the partitions the user is interested in related sketches. E.g.: SELECT ds_theta_estimate(ds_theta_intersect(sketch_col)) FROM sketch_tbl WHERE partition_col=1 OR partition_col=5; Testing: - Apart from the automated tests I added to this patch I also tested ds_theta_intersect() on a bigger dataset to check that serialization, deserialization and merging steps work well. I took TPCH25.linelitem, created a number of sketches with grouping by l_shipdate and called ds_theta_intersect() on those sketches Change-Id: I80e68c2151c4604f0386d3dfb004c82b10293f97 Reviewed-on: http://gerrit.cloudera.org:8080/17088 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-12 16:13:48 +00:00
liuyao	1a01bfe831	IMPALA-10377: Improve the accuracy of resource estimation PlanNode does not consider some factors when estimating memory, this will cause a large error rate AggregationNode 1.MemoryEstimate = Ndv * (AvgRowSize + SizeOfBucket) 2.When estimating the Ndv of merge aggregation, Ndv should be divided only once. 3.If there is no grouping exprs, MemoryEstimate = MIN_PLAIN_AGG_MEM SortNode 1.MemoryEstimate = Cardinality * AvgRowSize. Memory used when there is enough memory HashJoinNode 1.MemoryEstimate= DataRows + Buckets + DuplicateNodes, DataRows = RightTableCardinality * AvgRowSize, Buckets= roundUpToPowerOf2(RightTableCardinality) * SizeOfBucket, DuplicateNodes = (RightTableCardinality - RightNdv) * SizeOfDuplicateNode KuduScanNode 1.MemoryEstimate = Columns * BytesPerColumn * MaxScannerThreads, Columns are scanned in query, not all the columns of the table UnitTest 1.CardinalityTest adds test cases to test memory estimation. Modify existing test cases related to memory estimation Change-Id: Ic01db168ff2c6d6de33ee553a8175599f035d7a1 Reviewed-on: http://gerrit.cloudera.org:8080/16842 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-12 14:23:04 +00:00
John Sherman	a29d06db53	IMPALA-9218: Add support for locally compiled Hive - Add HIVE_VERSION_OVERRIDE, HIVE_STORAGE_API_VERSION_OVERRIDE, HIVE_METASTORE_THRIFT_DIR_OVERRIDE, HIVE_HOME_OVERRIDE environment variable support to impala-config.sh - When used together with HIVE_SRC_DIR_OVERRIDE allows a user to specify a locally compiled version of Hive for development and the minicluster - Hive jars are expected to have been installed into the local maven repository - Currently only version 3 of Hive is supported due to the absence of API shims for Hive 4.0 Example: ~/hive $ mvn package install -Pdist -DskipTests Example configuration: export HIVE_VERSION_OVERRIDE=3.1.0-SNAPSHOT export HIVE_STORAGE_API_VERSION_OVERRIDE=2.6.0 export HIVE_HOME_OVERRIDE=\ ~/hive/packaging/target/apache-hive-3.1.0-SNAPSHOT-bin/apache-hive-3.1.0-SNAPSHOT-bin export HIVE_SRC_DIR_OVERRIDE=~/hive export HIVE_METASTORE_THRIFT_DIR_OVERRIDE=~/hive/standalone-metastore/src/main/thrift/ Change-Id: I21892c153c445e3a5d93f2bc8f5e0b799929dd34 Reviewed-on: http://gerrit.cloudera.org:8080/17094 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-12 03:15:44 +00:00
Fang-Yu Rao	2039746ebe	IMPALA-10576: Add refresh authorization to make a test case less flaky We found that a test case run in test_grant_revoke_with_role() that is used to verify a requesting user does not possess the necessary privilege to perform the GRANT operation could fail since the expected AuthorizationException is not returned after the query. Since the privilege of GRANT was revoked immediately before this test case, we suspect the authorization-related metadata has not been updated. To make this test case less flaky, in this patch we add a REFRESH AUTHORIZATION after the query that revoked the GRANT privilege from the requesting user. Testing: - Verified that this patch passes the core tests in an ASAN build. Change-Id: I7407bac0407e162ab5ba623505bd7ee49bdf3abf Reviewed-on: http://gerrit.cloudera.org:8080/17165 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-12 00:00:03 +00:00
Yida Wu	b2088b30d7	IMPALA-10555: Fix Hit DCHECK in TmpFileGroup::RecoverWriteError The DCHECK error happens when there is an IO error during the spilling process if the scratch directory is in a remote filesystem and doing an error recovery(rewrite). Because currently the DCHECK only consider the file number of local scratch files, it leads to a file number requirement mismatch in the DCHECK. Because the implementation of spilling to the local fs and the remote fs are quite different, for simplify, we don't recover write error for spilling to a remote fs in the current version. Instead, the errors generated during spilling to remote would be returned directly to the upper layer. So, we avoid the DCHECK logic for spilling to remote. Tests: * Added a unit test: TmpFileMgrTest::TestRemoteRemoveBuffer. * Ran Unit Tests: $IMPALA_HOME/be/build/latest/runtime/tmp-file-mgr-test Change-Id: Ifd9aea4bf2fff634ea9a30bf6e87987be4e1c611 Reviewed-on: http://gerrit.cloudera.org:8080/17140 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-10 05:07:25 +00:00
Zoltan Borok-Nagy	06fbb0d629	IMPALA-10571: ImpalaJdbcClient might silently choose a different driver than the one specified ImpalaJdbcClient might silently choose the HiveDriver when the connection string is not specified. It's because the default connection string is 'jdbc:hive2://...'. This patch adds a check to ImpalaJdbcClient to make sure the driver being used is the one specified by the user. If not, it raises an error. I also modified bin/run-jdbc-client.sh to make it easier to use different drivers. Users are now able to specify the classpath of their custom driver via the environment variable IMPALA_JDBC_DRIVER_CLASSPATH. Testing: * tested manually Change-Id: If7fdf49b7f04f4d9ae6286df5c8df6b205cbce8f Reviewed-on: http://gerrit.cloudera.org:8080/17164 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-09 23:03:05 +00:00
Yida Wu	d89c04bf80	IMPALA-10529: Fix hit DCHECK in DiskIoMgr::AssignQueue in core-s3 build For start option "scratch_dirs", it only considers local filesystem as the default filesystem, regardless of the setting of DefaultFS(for a remote scratch dir, it needs to explicitly set it with the remote fs prefix). However, the function AssignQueue() would assign the queue based on not only the path string but also the default filesystem setting. For example, if scratch_dirs is set as "/tmp", the scratch dir is supposed to be in the local filesystem, but the AssignQueue() would consider it as "s3a://xxx/tmp" if a s3 path is set as the default fs. To fix this, the solution is to add a bool variable to AssignQueue() to decide whether or not to check the default fs setting when parsing the file path. For all of the scratch dirs, AssignQueue() won't check the default fs. Tests: Added a unit testcase: TmpFileMgrTest::TestSpillingWithRemoteDefaultFS. Ran and Passed TmpFileMgrTest. Change-Id: Ic07945abe65d90235aa8dea92dd3c3821a4f1f53 Reviewed-on: http://gerrit.cloudera.org:8080/17136 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-04 01:03:04 +00:00
John Sherman	ca17e307ab	IMPALA-10550: Add External Frontend service port - If external_fe_port flag is >0, spins up a new HS2 compatible service port - Added enable_external_fe_support option to start-impala-cluster.py - which when detected will start impala clusters with external_fe_port on 21150-21152 - Modify impalad_coordinator Dockerfile to expose external frontend port at 21150 - The intent of this commit is to separate external frontend connections from normal hs2 connections - This allows different security policy to be applied to each type of connection. The external_fe_port should be considered a privileged service and should only be exposed to an external frontend that does user authentication and does authorization checks on generated plans Change-Id: I991b5b05e12e37d8739e18ed1086bbb0228acc40 Reviewed-by: Aman Sinha <amsinha@cloudera.com> Reviewed-on: http://gerrit.cloudera.org:8080/17125 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-03 22:46:05 +00:00
Vihang Karajgaonkar	a47700ed79	IMPALA-10450: Catalogd crashes due to exception in ThriftDebugString This patch adds a wrapper around ThriftDebugString method provided in the Thrift library. The thrift's method can throw exceptions like (bad_alloc or TProtocolException) when the object cannot be serialized into a string representation. This exception is not caught on the catalogd side and it crashes the catalogd. The error was specifically seen in the catalogd's debug UI which provides a way to display a Table object. An exception thrown when rendering the table on the UI would have crashed the catalogd before the patch. In order to simulate this crash a new debug action called EXCEPTION was added. A new custom cluster test was added which simulates a exception thrown in this method and makes sure that fetching the table from catalogd's debug UI does not crash the catalogd. Tests: 1. Added a new custom cluster test which reproduces the crash. 2. Created a large table which has ~270K partitions and reduced the memory of the catalogd to 16GB. This configuration throws bad_alloc exception in the ThriftDebugString method and crashes the catalogd. After the patch the crash is averted and we see a error message on the debug UI instead. I also looped around the catalog web UI call for more than an hour to see if there are any other stability issues. I could not see any problems. Change-Id: I42cee6186a3d5bacc1117bae5961ac60ac9f7a66 Reviewed-on: http://gerrit.cloudera.org:8080/17110 Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com> Tested-by: Vihang Karajgaonkar <vihang@cloudera.com>	2021-03-02 21:44:24 +00:00
Steve Carlin	3554b0752d	IMPALA-10524: Changes to HdfsPartition for third party extensions. Some changes are needed to HdfsPartition and other related classes to allow for third party extensions. These changes include: - A protected constructor which will allow a subclass to instantiate HdfsPartition using its own Builder. - Various changes of permissions to methods and variables to allow third party extension visibility. - Creation of the getHostIndex() method to allow the subclass to override how the hostIndexes are retrieved. - Added a new default method "getFileSystem()" to FeFsPartition which will allow the third party extension to override how the filesystem is obtained from the partition object. Change-Id: I5a792642f27228118ac8f2e8ef98e8ba7aee4a46 Reviewed-on: http://gerrit.cloudera.org:8080/17092 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-02 05:57:26 +00:00
Riza Suminto	49ac55fb69	IMPALA-9856: Enable result spooling by default. Result spooling has been relatively stable since it was introduced, and it has several benefits described in IMPALA-8656. This patch enable result spooling (SPOOL_QUERY_RESULTS) query options by default. Furthermore, some tests need to be adjusted to account for result spooling by default. The following are the adjustment categories and list of tests that fall under such category. Change in assertions: PlannerTest#testAcidTableScans PlannerTest#testBloomFilterAssignment PlannerTest#testConstantFolding PlannerTest#testFkPkJoinDetection PlannerTest#testFkPkJoinDetectionWithHDFSNumRowsEstDisabled PlannerTest#testKuduSelectivity PlannerTest#testMaxRowSize PlannerTest#testMinMaxRuntimeFilters PlannerTest#testMinMaxRuntimeFiltersWithHDFSNumRowsEstDisabled PlannerTest#testMtDopValidation PlannerTest#testParquetFiltering PlannerTest#testParquetFilteringDisabled PlannerTest#testPartitionPruning PlannerTest#testPreaggBytesLimit PlannerTest#testResourceRequirements PlannerTest#testRuntimeFilterQueryOptions PlannerTest#testSortExprMaterialization PlannerTest#testSpillableBufferSizing PlannerTest#testTableSample PlannerTest#testTpch PlannerTest#testKuduTpch PlannerTest#testTpchNested PlannerTest#testUnion TpcdsPlannerTest custom_cluster/test_admission_controller.py::TestAdmissionController::test_dedicated_coordinator_planner_estimates custom_cluster/test_admission_controller.py::TestAdmissionController::test_memory_rejection custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_mem_limit_configs metadata/test_explain.py::TestExplain::test_explain_level2 metadata/test_explain.py::TestExplain::test_explain_level3 metadata/test_stats_extrapolation.py::TestStatsExtrapolation::test_stats_extrapolation Increase BUFFER_POOL_LIMIT: query_test/test_queries.py::TestQueries::test_analytic_fns query_test/test_runtime_filters.py::TestRuntimeRowFilters::test_row_filter_reservation query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output query_test/test_spilling.py::TestSpillingBroadcastJoins::test_spilling_broadcast_joins query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_aggs query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_regression_exhaustive query_test/test_udfs.py::TestUdfExecution::test_mem_limits Increase MEM_LIMIT: query_test/test_mem_usage_scaling.py::TestExchangeMemUsage::test_exchange_mem_usage_scaling query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling Increase MAX_ROW_SIZE: custom_cluster/test_parquet_max_page_header.py::TestParquetMaxPageHeader::test_large_page_header_config query_test/test_insert.py::TestInsertQueries::test_insert_large_string query_test/test_query_mem_limit.py::TestQueryMemLimit::test_mem_limit query_test/test_scanners.py::TestTextSplitDelimiters::test_text_split_across_buffers_delimiter query_test/test_scanners.py::TestWideRow::test_wide_row Disable result spooling to maintain assertion: custom_cluster/test_admission_controller.py::TestAdmissionController::test_set_request_pool custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_host_memory custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_pool_memory custom_cluster/test_admission_controller.py::TestAdmissionController::test_queue_reasons_memory custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_config_change_while_queued custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_fetched_rows custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_finished_query custom_cluster/test_scratch_disk.py::TestScratchDir::test_no_dirs custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_existing_dirs custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_writable_dirs query_test/test_insert.py::TestInsertQueries::test_insert_large_string (the last query only) query_test/test_kudu.py::TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_kudu_scan_mem_usage query_test/test_queries.py::TestQueriesParquetTables::test_very_large_strings query_test/test_query_mem_limit.py::TestCodegenMemLimit::test_codegen_mem_limit shell/test_shell_client.py::TestShellClient::test_fetch_size Testing: - Pass exhaustive tests. Change-Id: I9e360c1428676d8f3fab5d95efee18aca085eba4 Reviewed-on: http://gerrit.cloudera.org:8080/16755 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-02 04:58:51 +00:00
Riza Suminto	801efdcf09	IMPALA-10492: Lower default MAX_CNF_EXPRS query option MAX_CNF_EXPRS was set to unlimited by default. The CNF rewrite can lead to significant frontend memory usage and eventually OutOfMemory for a complex query that contain many predicates. We need to lower the default value to avoid this memory problem while maintaining performance for our TPC-DS and TPC-H workloads. We investigate the maximum number of CNF expressions in TPC-DS and TPC-H by printing out the final value of 'numCnfExprs_' from ConvertToCNFRule.java to the query profile. We found 5 queries that applies CNF rewrite rules as follow: \| Query \| numCnfExprs_ \| \|-----------+--------------\| \| TPCDS-Q13 \| 168 \| \| TPCDS-Q85 \| 100 \| \| TPCDS-Q48 \| 34 \| \| TPCH-Q19 \| 124 \| \| TPCH-Q7 \| 3 \| This patch lower the default value from unlimited to 200 based on the result above. Testing: - Manually verify that MAX_CNF_EXPRS 200 is enough for our TPC-DS and TPC-H worloads. - Pass core tests. Change-Id: I7ca3d0e094ac01c24a046c25d6a1b56bf134faa8 Reviewed-on: http://gerrit.cloudera.org:8080/17132 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-02-28 00:27:40 +00:00

1 2 3 4 5 ...

9742 Commits