9742 Commits

Author SHA1 Message Date
Jim Apple
103774a8e5 Update Python requests package to 2.20.0
See https://2.python-requests.org/en/master/community/updates/#id8.
This is currently only used in the tests, but it's best to fix
this now.

While here, remove the now-false note about required support for Python
2.6.

Change-Id: I092a641a12f38cdb45b0062c31ffb51c0c664800
Reviewed-on: http://gerrit.cloudera.org:8080/17215
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2021-03-30 01:27:04 +00:00
Vihang Karajgaonkar
7198f07c92 IMPALA-10605: Deflake test_refresh_native
This test uses a regex to parse the output of
describe database and extract the db properties. The regex assumes that there
will be only one key-value pair, which no longer holds when the events
processor is running. The fix is to modify the regex so that it only extracts
the relevant function name prefix and its value.
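
The idea can be sketched as follows. This is only an illustration, not the
regex used in the test: the key prefix, the helper name and the sample
output strings are all hypothetical.

```python
import re

# Hypothetical key prefix; the real property name is not given in this message.
FUNCTION_KEY_PREFIX = "registered_function_"


def extract_function_property(describe_output, prefix=FUNCTION_KEY_PREFIX):
    """Return (key, value) for the first property whose key starts with
    prefix, ignoring any other key-value pairs; None if there is no match."""
    pattern = re.compile(r"(" + re.escape(prefix) + r"[^=,{}\s]*)=([^,{}\s]*)")
    match = pattern.search(describe_output)
    return match.groups() if match else None


# Made-up outputs: one with a single property, one with extra properties
# such as those another component might add.
single = "{registered_function_foo=abc123}"
multiple = "{events.marker=42, registered_function_foo=abc123, owner=test}"

print(extract_function_property(single))
print(extract_function_property(multiple))
```

Anchoring the pattern on the key prefix rather than on the whole property
map is what makes the match independent of how many other pairs are present.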

Testing:
1. The test fails when events processor is enabled. After the patch
the test works as expected.

Change-Id: I1df35b9c5f2b21cc7172f03ff8611d46070d64c2
Reviewed-on: http://gerrit.cloudera.org:8080/17227
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-30 01:05:40 +00:00
Zoltan Borok-Nagy
dbc2fc14d8 IMPALA-10597: Enable setting 'iceberg.file_format'
Currently we prohibit setting the following properties:

* iceberg.catalog
* iceberg.catalog_location
* iceberg.file_format
* iceberg.table_identifier

This patch enables setting 'iceberg.file_format', therefore if
a table was created by another engine, but using HiveCatalog,
we'll be able to set the data file format to the proper value
and make the table readable by Impala. Setting the other
properties is not needed for HiveCatalog tables.

If the table wasn't created by HiveCatalog, then we cannot load the
table, therefore we cannot invoke any ALTER TABLE statement at all.
In that case we need to create an external table.

If the table already contains data files, then Impala checks if
all of them have the proper file format. If not, the ALTER TABLE
statement fails.

Before this patch a CREATE TABLE statement accepted any string
for 'iceberg.file_format', and in case of invalid file formats the
frontend silently used Parquet. This patch also adds a check to only
allow valid file formats.

Testing:
 * added e2e test

Change-Id: I4b3506be4562a1ace3e6435867aadb3bdde7a8e2
Reviewed-on: http://gerrit.cloudera.org:8080/17207
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-29 18:32:31 +00:00
Fucun Chu
77d6acd032 IMPALA-10581: Implement ds_theta_intersect_f() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. It computes the intersection of the two
sketches, which may come from the same or different columns, and returns
the resulting sketch.

Example:
select ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2))
from sketch_tbl;
+-----------------------------------------------------------+
| ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) |
+-----------------------------------------------------------+
| 5                                                         |
+-----------------------------------------------------------+

Change-Id: I335eada00730036d5433775cfe673e0e4babaa01
Reviewed-on: http://gerrit.cloudera.org:8080/17186
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-29 15:59:49 +00:00
Steve Carlin
8f8668aaf0 IMPALA-10593: Conditionally skip runtime filter for outer joins
Currently there is code that asserts that an Expr is not constant after
substituting SlotRefs with constant nulls.

For External FE, this restriction needs to be weakened. In a case where
an Expr is checked and the Expr is not constant even after substituting
nulls, the result will be to not generate a runtime filter for that Expr.

Testing:

Manually tested with this query in the External FE:

select id, int_col, year, month from alltypessmall s
where s.int_col = (select count(*) from alltypestiny t where s.id = t.id)
order by id

Change-Id: I46462e2030731d97c4c88e364148c0093c025ab3
Reviewed-on: http://gerrit.cloudera.org:8080/17200
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-28 05:51:18 +00:00
stiga-huang
cf463f28c3 IMPALA-10609: Fix NPE in resolving table masks in StmtMetadataLoader
When creating a StmtMetadataLoader, the 'user' field could be null in
places that don't need to resolve column-masking/row-filtering policies.
E.g. in processing GetColumns HS2 operation, or some FE tests.

This patch skips resolving the table masks in such cases to avoid
NullPointerException.

Tests:
 - Run CORE tests and verified no NullPointerException found.

Change-Id: I7aa20458b02e8a93a871b6dd875decfab82c4eae
Reviewed-on: http://gerrit.cloudera.org:8080/17235
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-27 12:26:23 +00:00
Aman Sinha
2e5589d85f IMPALA-10116: Allow unwrapping a builtin cast function similar to CastExpr
This change allows unwrapping a builtin cast function such as
casttobigint(col) similar to a CAST(col as bigint). Unwrapping
is useful to access the SlotRef of the column and this in turn
is needed to compute predicate selectivity correctly.  Without
unwrapping, the cast function uses the default 10% selectivity
for a predicate such as 'casttobigint(l_quantity) is NOT NULL'
which is not accurate.

Note that Impala does not allow a user query to directly call the
builtin cast function. Rather, they have to use the explicit CAST
syntax. However, since the frontend jar can be used by an external
frontend module as a library, the builtin function can be called
and this patch makes the behavior consistent.

Testing:
 - Ran PlannerTest
 - Manual testing by commenting out the code in
   FunctionCallExpr.analyzeImpl() that throws an AnalysisException
   if builtin cast function is called. I haven't added a new test
   for this reason.

Cardinality before this change:
explain select * from date_dim d1, date_dim d2
   where d1.d_week_seq = d2.d_week_seq - 52
    and casttobigint(d1.d_week_seq) is not null
    and casttobigint(d2.d_week_seq) is not null

  SCAN HDFS [tpcds.date_dim d1]
    HDFS partitions=1/1 files=1 size=9.84MB
    predicates: casttobigint(d1.d_week_seq) IS NOT NULL
    runtime filters: RF000 -> d1.d_week_seq
    row-size=255B cardinality=7.30K

Cardinality after this change:
  SCAN HDFS [tpcds.date_dim d1]
    HDFS partitions=1/1 files=1 size=9.84MB
    predicates: casttobigint(d1.d_week_seq) IS NOT NULL
    runtime filters: RF000 -> d1.d_week_seq
    row-size=255B cardinality=73.05K
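
The two cardinalities above are consistent with simple arithmetic, sketched
here. The row count of 73049 is an assumption (it matches the 73.05K shown
in the plan), and the selectivity of 1.0 for the "after" case is inferred
from the plan output rather than stated in this message.

```python
DEFAULT_SELECTIVITY = 0.10  # the default selectivity mentioned above
DATE_DIM_ROWS = 73049       # assumed row count of tpcds.date_dim


def estimate_cardinality(row_count, selectivity):
    """Estimated output rows of a scan with a single predicate."""
    return row_count * selectivity


def format_k(rows):
    """Format a row count in thousands with two decimals."""
    return "{:.2f}K".format(rows / 1000.0)


# Before: the predicate on casttobigint(...) falls back to the default.
before = estimate_cardinality(DATE_DIM_ROWS, DEFAULT_SELECTIVITY)
# After: unwrapping exposes the SlotRef; selectivity 1.0 is inferred.
after = estimate_cardinality(DATE_DIM_ROWS, 1.0)

print(format_k(before))  # 7.30K
print(format_k(after))   # 73.05K
```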

Change-Id: Idf82b2de78c6a7051ea036062f177d69e2558940
Reviewed-on: http://gerrit.cloudera.org:8080/16407
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-27 09:05:32 +00:00
Bikramjeet Vig
0b79464d9c IMPALA-10397: Fix test_single_workload
The logs on failed runs indicated that the autoscaler never started
another cluster. This can only happen if it never notices a queued
query which is possible since this test was only failing in release
builds. This patch increases the runtime of the sample query to
make execution more predictable.

Testing:
Looped the test locally on a release build

Change-Id: Ide3c7fb4509ce9a797b4cbdd141b2a319b923d4e
Reviewed-on: http://gerrit.cloudera.org:8080/17218
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-25 22:59:16 +00:00
wzhou-code
281a47caad IMPALA-10564 (part 2): Fixed test_ctas_exprs failure for S3 build
New test case TestDecimalOverflowExprs::test_ctas_exprs was added
in the first patch for IMPALA-10564. But it failed in the S3 build with
Parquet format since the table was not successfully created when
the CTAS query failed.
This patch fixes the test failure by skipping the check for whether NULL
is inserted into the table after CTAS fails for the S3 build with Parquet.

Testing:
 - Reproduced the test failure on a local box with defaultFS as s3a.
   Verified that the fix works with defaultFS as s3a.
 - Passed EE_TEST.

Change-Id: Ia627ca70ed41764e86be348a0bc19e330b3334d2
Reviewed-on: http://gerrit.cloudera.org:8080/17228
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-25 21:45:10 +00:00
Vihang Karajgaonkar
5b27b7ca72 IMPALA-10598: Deflake test_cache_reload_validation
This patch deflakes the test test_cache_reload_validation in the
test_hdfs_caching.py e2e test. The util method which the test relies on to
count the cache directives by parsing the output of the command
"hdfs cacheadmin -listDirectives -stats" does not consider that the output
may contain trailing new lines or headers. Hence the test fails because the
expected number of cache directives does not match the number of lines
of the output.

The fix parses the line "Found <int> entries" in the output when available
and returns the count from that line. If the line is not found, it falls back
to the earlier implementation of using the number of lines.
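
A minimal sketch of that parsing logic is shown below. The function name and
the sample output are hypothetical, and the fallback here counts non-empty
lines, which only approximates the earlier line-count behaviour.

```python
import re


def count_cache_directives(listing_output):
    """Return the number of cache directives reported in the given output."""
    # Prefer the explicit summary line when it is present.
    match = re.search(r"Found (\d+) entr(?:y|ies)", listing_output)
    if match:
        return int(match.group(1))
    # Fallback (an approximation): count the non-empty lines.
    return len([line for line in listing_output.splitlines() if line.strip()])


# Made-up output with a summary line, a header row and trailing newlines.
sample = (
    "Found 2 entries\n"
    " ID POOL  REPL PATH\n"
    "  1 pool     1 /test-warehouse/a\n"
    "  2 pool     1 /test-warehouse/b\n"
    "\n\n"
)
print(count_cache_directives(sample))
```

Reading the count from the summary line avoids depending on how many header
or blank lines the command happens to print.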

Testing:
1. The test was failing for me when run individually. After the patch, I looped
the test 10 times without any errors.

Change-Id: I2d491e90af461d5db3575a5840958d17ca90901c
Reviewed-on: http://gerrit.cloudera.org:8080/17210
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-25 02:51:54 +00:00
Thomas Tauber-Marshall
e3bafcbef4 IMPALA-10590: Introduce admission service heartbeat mechanism
Currently, if a ReleaseQuery rpc fails, it's possible for the
admission service to think that some resources are still being used
that are actually free.

This patch fixes the issue by introducing a periodic heartbeat rpc
from coordinators to the admission service which contains a list of
queries registered at that coordinator.

If there is a query that the admission service thinks is running but
is not included in the heartbeat, the admission service can conclude
that the query must have already completed and release its resources.
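
The reconciliation step can be sketched as a set difference. The data
structure and names below are hypothetical; a real implementation would also
have to deal with queries admitted while a heartbeat is in flight, which
this sketch ignores.

```python
def reconcile_with_heartbeat(running_by_coordinator, coordinator_id,
                             heartbeat_query_ids):
    """Drop queries the service believes are running on a coordinator but
    that the coordinator's heartbeat no longer lists; return the dropped ids
    so that their resources can be released."""
    known = running_by_coordinator.get(coordinator_id, set())
    stale = known - set(heartbeat_query_ids)
    running_by_coordinator[coordinator_id] = known - stale
    return stale


# The service still tracks q2, e.g. because its ReleaseQuery rpc was lost.
state = {"coord-1": {"q1", "q2", "q3"}}
released = reconcile_with_heartbeat(state, "coord-1", ["q1", "q3"])
print(sorted(released))
```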

Testing:
- Added a test that uses a debug action to simulate ReleaseQuery rpcs
  failing and checks that query resources are released properly.

Change-Id: Ia528d92268cea487ada20b476935a81166f5ad34
Reviewed-on: http://gerrit.cloudera.org:8080/17194
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-24 22:41:27 +00:00
Thomas Tauber-Marshall
452c2f1f7f IMPALA-10604: Allow setting KuduClient's verbose log level directly
This patch adds a flag --kudu_client_v which allows setting the
verbose logging level for the KuduClient to a value other than the
level for the rest of Impala (set by -v) in order to enable debugging
of issues in the KuduClient without producing the enormous amount of
logging that comes with setting a high -v value on all of Impala.

Testing:
- Manually set --kudu_client_v and confirmed that the expected logging
  is produced.

Change-Id: Ib39358709ee714b8cdffd72a0ee58f66d5fab37e
Reviewed-on: http://gerrit.cloudera.org:8080/17222
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-24 19:06:09 +00:00
Fucun Chu
622e3c95ad IMPALA-10580: Implement ds_theta_union_f() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. It unions the two sketches and returns the
resulting sketch.

Example:
select ds_theta_estimate(ds_theta_union_f(sketch1, sketch2))
from sketch_tbl;
+-------------------------------------------------------+
| ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) |
+-------------------------------------------------------+
| 15                                                    |
+-------------------------------------------------------+

Change-Id: I8329979b81ceeaad739a43fab79768ca9c2916fa
Reviewed-on: http://gerrit.cloudera.org:8080/17179
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-24 15:16:07 +00:00
wzhou-code
410c3e79e4 IMPALA-10564: Return error when inserting an invalid decimal value
When using CTAS statements or INSERT-SELECT statements to insert rows into a
table with decimal columns, Impala inserts NULL for overflowed decimal
values instead of returning an error. This issue happens when the data
expression for the decimal column in the SELECT sub-query contains at least
one alias.
This issue is similar to IMPALA-6340, but IMPALA-6340 only fixed the
issue for cases where the data expression for the decimal column is a
constant.

This patch fixes the issue by calling RuntimeState::CheckQueryState()
at the end of HdfsTableWriter::AppendRows() and KuduTableSink::Send().
If there is an invalid decimal error, the query fails without
inserting NULL for the decimal column.
We did not change the behaviour for decimal_v1. NULL will still be inserted
into the table for invalid decimal values, with a warning message.

Tests:
 - Added unit-tests for INSERT-SELECT and CTAS statements with
   overflowed decimal values to be inserted into tables. The
   overflowed decimal values are expressed as a constant expression,
   or as an expression with aliases.
   Also added cases to verify behaviour of decimal_v1 is unchanged.
 - Passed exhaustive tests.

Change-Id: I64ce4ed194af81ef06401ffc1124e12f05b8da98
Reviewed-on: http://gerrit.cloudera.org:8080/17168
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-23 22:52:38 +00:00
Andrew Sherman
b28da054f3 IMPALA-10592: prevent pytest from hanging at exit.
In TestAdmissionControllerStress mark worker threads as daemons so that
an exception in teardown() will not cause pytest to hang just after
printing the test results.
https://stackoverflow.com/questions/19219596/py-test-hangs-after-showing-test-results
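
The mechanism can be shown with a small standalone sketch. The helper and
worker names are hypothetical and unrelated to the actual test code.

```python
import threading
import time


def start_worker(target, *args):
    """Start target in a daemon thread and return the thread."""
    worker = threading.Thread(target=target, args=args)
    # Daemon threads do not keep the interpreter alive at exit, so a worker
    # that is never joined cannot block the process from terminating.
    worker.daemon = True
    worker.start()
    return worker


def poll_until_stopped(stop_event):
    """Stand-in for a worker loop that runs until asked to stop."""
    while not stop_event.is_set():
        time.sleep(0.01)


stop = threading.Event()
thread = start_worker(poll_until_stopped, stop)
print(thread.daemon)
```

With a non-daemon thread, the interpreter would wait for the loop above to
finish even if teardown never got the chance to set the stop event.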

TESTING:

Simulated the failure in IMPALA-10596 by throwing an exception during
teardown(). Without this fix the pytest invocation hangs.

Change-Id: I74cca8f577c7fbc4d394311e2f039cf4f68b08df
Reviewed-on: http://gerrit.cloudera.org:8080/17212
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-22 23:33:55 +00:00
Jim Apple
e5d5dbc30a Update Paramiko to 2.4.2.
See https://www.paramiko.org/changelog.html#2.4.2. This shouldn't
directly apply to Impala deployments, but it is best to fix this in
test now.

Change-Id: If9cc9ea4a0763c8b5303ca4e8482761ee2f53efa
Reviewed-on: http://gerrit.cloudera.org:8080/17214
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-22 19:34:00 +00:00
stiga-huang
a0f77680c5 IMPALA-10483: Support subqueries in Ranger masking policies
This patch adds support for using subqueries in Ranger masking policies,
i.e. column-masking/row-filtering policies. The subquery can reference
either the current table or other tables. However, masking policies on
these tables won't be applied recursively. This is consistent with Hive.
One motivation is to avoid infinite masking if the subquery references the
same table. Another motivation, I think, is to simplify the masking behavior,
so when the admin is setting a masking expression, it can be considered
as running from the admin's perspective (i.e. no masking).

Implementation
Before analyzing the query, the coordinator loads the metadata of all
possibly used tables into the query's StmtTableCache. Table masking
takes place after the analyzing phase. If the subquery filter introduces
any new tables, the analyzer will fail to resolve them since their
metadata is not loaded in the StmtTableCache. This patch modifies the
StmtMetadataLoader to also load the tables introduced by masking
policies, so that they can be resolved correctly.

Tests
 - Add more complex tests in test_row_filtering

Change-Id: I254df9f684c95c660f402abd99ca12dded7e764f
Reviewed-on: http://gerrit.cloudera.org:8080/17185
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-22 15:52:03 +00:00
stiga-huang
c9d7bcb4a1 IMPALA-9661: Avoid introducing unused columns in table masking view
Previously, if a table has column masking policies, we replace its
unanalyzed TableRef with an analyzed InlineViewRef (table masking view)
in FromClause.analyze(). However, we can't detect which columns are
actually used in the original query at this point. In fact, analyze()
for SelectList, WhereClause, GroupByClause and other clauses containing
SlotRefs happen after FromClause.analyze(). After the whole query block
is analyzed, we can get the exact set of required columns.

This patch refactors the code to do table masking after analyze() to
avoid introducing unused columns. Referenced columns of a TableRef are
registered in analyze(), which helps to figure out which columns are
actually needed.

With this, we don't need to revert table masking in FromClause.reset().
The doTableMasking flag in AST is also removed since now the table mask
is resolved once after analyze().

Tests:
 - Add more e2e tests in test_ranger.py
 - Run CORE tests

Change-Id: Ib015a8ab528065907b27fbdceb8e2818deb814e1
Reviewed-on: http://gerrit.cloudera.org:8080/17199
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-22 08:41:00 +00:00
John Sherman
879986ea6f IMPALA-10552: Support external frontends supplying timeline for profile
- Add EXTERNAL_FRONTEND as a client session type
- Use EXTERNAL_FRONTEND session type for clients connected to
  external frontend interface.
- Rename Query Timeline to Impala Backend Timeline for external
  frontends
  - the query timeline is no longer an end to end timeline when
    executing a plan from an external frontend
- External frontends can provide timeline information through a
  TExecRequest by filling in the timeline field with a valid
  TEventSequence
- The frontend timeline and backend timeline are completely separate
  entities, meaning there is no overall attempt to capture the timing
  end to end
  - This is due to the fact that the frontend and Impala may not share
    the same time source (or even machine).
  - It is safe to add together the backend + frontend timeline times
    to get a rough idea how long the query took end to end to execute,
    but keep in mind that this number does not capture the time it
    took the frontend to send the plan to the backend (Impala) nor does
    it capture how long it took the end user to read the results.

Example timeline with external frontend:
  Frontend Timeline: 3s016ms
     - Analysis finished: 1s130ms (1s130ms)
     - Calcite plan generated: 2s170ms (1s040ms)
     - Metadata load started: 2s245ms (74.486ms)
     - Metadata load finished. loaded-tables=1: 2s654ms (409.847ms)
     - Single node plan created: 2s726ms (71.659ms)
     - Runtime filters computed: 2s756ms (30.000ms)
     - Distributed plan created: 2s761ms (5.265ms)
     - Execution request created: 2s890ms (128.387ms)
     - Impala plan generated: 2s891ms (1.508ms)
     - Planning finished: 2s893ms (1.894ms)
     - Submitted query: 3s016ms (122.377ms)
  Impala Backend Timeline: 79.998ms
     - Query submitted: 0.000ns (0.000ns)
     - Submit for admission: 0.000ns (0.000ns)
     - Completed admission: 0.000ns (0.000ns)
     - Ready to start on 1 backends: 3.999ms (3.999ms)
     - All 1 execution backends (2 fragment instances) started: 7.999ms (3.999ms)
     - Rows available: 55.999ms (47.999ms)
     - Execution cancelled: 79.998ms (23.999ms)
     - Released admission control resources: 79.998ms (0.000ns)
     - Unregister query: 79.998ms (0.000ns)
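
As a worked example of adding the two timelines, the sketch below sums the
totals from the example above. The parser is hypothetical and only handles
the duration forms that appear in this example.

```python
import re

# Matches forms such as "3s016ms", "79.998ms" and "0.000ns".
_DURATION = re.compile(r"(?:(\d+)s)?(\d+(?:\.\d+)?)(ms|ns)")


def parse_duration_s(text):
    """Convert a duration string from the timeline into seconds."""
    match = _DURATION.fullmatch(text)
    if match is None:
        raise ValueError("unsupported duration: " + text)
    seconds, value, unit = match.groups()
    scale = 1e-3 if unit == "ms" else 1e-9
    return int(seconds or 0) + float(value) * scale


frontend_s = parse_duration_s("3s016ms")   # Frontend Timeline total
backend_s = parse_duration_s("79.998ms")   # Impala Backend Timeline total
rough_total_s = frontend_s + backend_s
print(round(rough_total_s, 3))
```

The sum is roughly 3.096 seconds, which, as noted above, still excludes the
time needed to hand the plan to the backend and to read the results.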

Testing done:
- Manual inspection of profiles on the Impala web UI
- test_hs2.py
- test_tpch_queries.py
- test_tpcds_queries.py::TestTpcdsDecimalV2Query

Co-authored-by: Kurt Deschler <kdeschle@cloudera.com>

Change-Id: I2b3692b4118ea23c0f9f8ec4bcc27b0b68bb32ec
Reviewed-on: http://gerrit.cloudera.org:8080/17183
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-19 22:24:29 +00:00
stiga-huang
98de1c5436 IMPALA-9234: Support Ranger row filtering policies
Ranger row filtering policies provide customized expressions to filter
out rows for specific users when reading from a table. This patch adds
support for this feature. A new feature flag, enable_row_filtering, is
added so that this experimental feature can be disabled. It defaults to
true, so the feature is enabled by default. Enabling row-filtering requires
--enable_column_masking=true since it depends on the column masking
implementation.

Note that row filtering policies take effect prior to any column
masking policies, because column masking policies apply to the result data.

Implementation:
The existing table masking view infrastructure can be extended to
support row filtering. Currently when analyzing a table with column
masking policies, we replace the TableRef with an InlineViewRef which
contains a SelectStmt wrapping the columns with masking expressions.
This patch adds the row filtering expressions to the WhereClause of the
SelectStmt.
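
The shape of the resulting view can be illustrated by building the
equivalent SQL text. This is only a sketch: the function name, the example
policies and the choice to AND several filters together are assumptions, and
the real implementation works on analyzed AST nodes rather than strings.

```python
def build_table_masking_view(table, columns, column_masks=None,
                             row_filters=None):
    """Return SQL text for a view that applies column masks in the select
    list and row filters in the WHERE clause."""
    column_masks = column_masks or {}
    select_items = []
    for col in columns:
        if col in column_masks:
            # Masked column: wrap it and keep the original column name.
            select_items.append("{} AS {}".format(column_masks[col], col))
        else:
            select_items.append(col)
    sql = "SELECT {} FROM {}".format(", ".join(select_items), table)
    if row_filters:
        # Assumption: multiple filters are combined with AND.
        sql += " WHERE " + " AND ".join("({})".format(f) for f in row_filters)
    return sql


view_sql = build_table_masking_view(
    "db.tbl", ["id", "name"],
    column_masks={"name": "mask(name)"},
    row_filters=["id > 10"])
print(view_sql)
```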

Limitations:
 - Expressions using subqueries are not supported (IMPALA-10483).
 - Row filtering policies on nested tables will not be applied when
   nested collection columns are used directly in the FROM clause. This
   will leak data so we forbid such kinds of queries until IMPALA-10484
   is resolved.

Tests:
 - Add FE test for error message when disabling row filtering.
 - Add e2e test with row filtering policies.
 - Add e2e test where column masking and row filtering policies both
   apply.
 - Verified audits in a CDP cluster with Ranger and Solr set up.

Change-Id: I580517be241225ca15e45686381b78890178d7cc
Reviewed-on: http://gerrit.cloudera.org:8080/16976
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-18 21:08:14 +00:00
Zoltan Borok-Nagy
6162343842 IMPALA-10512: ALTER TABLE ADD PARTITION should bump the write id for ACID tables
ALTER TABLE ADD PARTITION should bump the write id for ACID tables.
Both for INSERT-only and full ACID tables.

For transactional tables we are adding partitions in an ACID
transaction in the following sequence:

1. open transaction
2. allocate write id for table
3. add partitions to HMS table
4. commit transaction

However, please note that table metadata modifications are
independent of ACID transactions. I.e. if adding the partitions succeeds,
but we cannot commit the transaction, then the newly added
partitions won't get removed.

So why are we opening a txn then? We are doing it in order to bump
the write id in a best-effort way. This aids table metadata caching,
so by looking at the table write id we can determine if the cached
table metadata is up-to-date.
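
The sequence and its best-effort nature can be sketched with an in-memory
stand-in for HMS. Every class and method name below is hypothetical, and how
a failed commit is actually handled is an assumption of this sketch.

```python
class FakeMetastore:
    """In-memory stand-in for HMS; not the real client API."""

    def __init__(self, fail_commit=False):
        self.partitions = []
        self.write_id = 0
        self.next_txn_id = 1
        self.fail_commit = fail_commit

    def open_txn(self):
        txn_id = self.next_txn_id
        self.next_txn_id += 1
        return txn_id

    def allocate_write_id(self, txn_id, table):
        self.write_id += 1
        return self.write_id

    def add_partitions(self, table, partitions):
        # Metadata changes take effect immediately, outside the transaction.
        self.partitions.extend(partitions)

    def commit_txn(self, txn_id):
        if self.fail_commit:
            raise RuntimeError("commit failed")


def add_partitions_with_write_id_bump(hms, table, new_partitions):
    """Run the four steps above; return True if the commit succeeded."""
    txn_id = hms.open_txn()                    # 1. open transaction
    hms.allocate_write_id(txn_id, table)       # 2. allocate write id
    hms.add_partitions(table, new_partitions)  # 3. add partitions
    try:
        hms.commit_txn(txn_id)                 # 4. commit transaction
        return True
    except RuntimeError:
        # The partitions added in step 3 are not rolled back.
        return False


ok = FakeMetastore()
bad = FakeMetastore(fail_commit=True)
print(add_partitions_with_write_id_bump(ok, "tbl", ["p=1"]), ok.partitions)
print(add_partitions_with_write_id_bump(bad, "tbl", ["p=1"]), bad.partitions)
```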

Testing:
 * added e2e test

Change-Id: Iad247008b7c206db00516326c1447bd00a9b34bd
Reviewed-on: http://gerrit.cloudera.org:8080/17081
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-18 19:35:58 +00:00
Kurt Deschler
ac4984b74f Revert "IMPALA-10503: testdata load hits hive memory limit errors during hive inserts"
This reverts commit c60a626ac6.

Change-Id: I896c7b2457d537fa1bfe8dc29063da0b7b3df199
Reviewed-on: http://gerrit.cloudera.org:8080/17191
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-18 07:24:31 +00:00
Kurt Deschler
231d41ae1f IMPALA-10549: Register transactions from external frontend DML
This change registers transactions that were started by an external
frontend so that coordinator keepalive can track them properly.

Testing: manually tested using DMLs from external frontend

Reviewed-by: John Sherman <jfs@cloudera.com>
Change-Id: Ia8863b8d9d281a5d164f10de9c5ee52cf3be63db
Reviewed-on: http://gerrit.cloudera.org:8080/17122
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-18 01:31:25 +00:00
Aman Sinha
decb79d032 IMPALA-10518: Add ImpalaServer interface to retrieve executor membership.
This patch adds an interface to ImpalaServer to retrieve the current
executor membership snapshot from impalad for use by an external
frontend. This involves sending a thrift request to impalad and
receiving a thrift response. Refactored some code in exec-env into
a separate function in the impala namespace which makes it easier to
populate the needed information for an external frontend.

Testing:
 - Ran selected tests for sanity check (no impact is expected
   since this is adding a new interface):
    - Frontend tests (PlannerTest, CardinalityTest)
    - Backend tests under custom_cluster/test_executor_groups.py
 - Manually tested with external frontend to ensure it gets
   the executor membership snapshot.

Change-Id: Ie89b71f4555c368869ee7b9d6341756c60af12b5
Reviewed-on: http://gerrit.cloudera.org:8080/17181
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-18 00:39:22 +00:00
Thomas Tauber-Marshall
c823f17f2c IMPALA-10577: Add retrying of AdmitQuery
This patch adds retries of the AdmitQuery rpc by coordinators.
This helps to ensure that if an admissiond goes down and is restarted
or is temporarily unreachable, queries won't fail.

The retries are done with backoff and jitter to avoid overloading the
admissiond in these scenarios.

A new flag, --admission_max_retry_time_s, is added to control how long
queries will continue retrying before giving up.

The AdmitQuery rpc is made idempotent - if a query is submitted with
the same query id as one the admissiond already knows about,
AdmitQuery will return OK without submitting the query to be scheduled
again.
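
The combination of retries with backoff and jitter and an idempotent rpc can
be sketched as follows. All names, the delay constants and the jitter
formula are assumptions for illustration, and elapsed time is approximated
by summing the sleeps.

```python
import random
import time


class FakeAdmissiond:
    """Stand-in service that fails a few times, then admits idempotently."""

    def __init__(self, failures_before_success=0):
        self.known = set()
        self.scheduled = []
        self.failures_left = failures_before_success

    def admit_query(self, query_id):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("admissiond unreachable")
        if query_id not in self.known:
            # Only the first submission of a query id is scheduled.
            self.known.add(query_id)
            self.scheduled.append(query_id)
        return "OK"


def admit_with_retry(admissiond, query_id, max_retry_time_s,
                     base_delay_s=0.1, max_delay_s=5.0,
                     sleep=time.sleep, rng=random.random):
    """Retry admit_query with exponential backoff and jitter until it
    succeeds or the retry budget would be exceeded."""
    elapsed = 0.0
    attempt = 0
    while True:
        try:
            return admissiond.admit_query(query_id)
        except ConnectionError:
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            delay *= 0.5 + 0.5 * rng()  # jitter: 50% to 100% of the delay
            if elapsed + delay > max_retry_time_s:
                raise
            sleep(delay)
            elapsed += delay
            attempt += 1


service = FakeAdmissiond(failures_before_success=2)
print(admit_with_retry(service, "q1", 60, sleep=lambda s: None))
print(admit_with_retry(service, "q1", 60, sleep=lambda s: None))
print(service.scheduled)
```

Because a retry may reach the service after an earlier attempt already
succeeded, the idempotent check is what keeps the query from being scheduled
twice.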

Testing:
- Added a custom cluster test that checks that queries won't fail when
  the admissiond goes down.

Change-Id: I8bc0cac666bbd613a1143c0e2c4f84d3b0ad003a
Reviewed-on: http://gerrit.cloudera.org:8080/17188
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-18 00:36:54 +00:00
Fucun Chu
3e82501531 IMPALA-10558: Implement ds_theta_exclude() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. It computes the a-not-b set operation on the
two sketches, which may come from the same or different columns.

Example:
select ds_theta_estimate(ds_theta_exclude(sketch1, sketch2))
from sketch_tbl;
+-------------------------------------------------------+
| ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) |
+-------------------------------------------------------+
| 5                                                     |
+-------------------------------------------------------+
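
The a-not-b semantics can be illustrated with exact Python sets. This is
only an analogy: real Theta sketches are approximate summaries, and the
values below are made up rather than taken from sketch_tbl.

```python
def a_not_b(a, b):
    """Values present in a but absent from b (set difference)."""
    return a - b


first = set(range(0, 10))    # 10 distinct values: 0..9
second = set(range(5, 20))   # 15 distinct values: 5..19

print(len(a_not_b(first, second)))  # 5
print(len(a_not_b(second, first)))  # 10
```

Unlike union and intersection, the operation is not symmetric, so the order
of the two arguments matters.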

Change-Id: I05119fd8c652c07ff248a99e44b0da3541e46ca3
Reviewed-on: http://gerrit.cloudera.org:8080/17153
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-17 22:14:44 +00:00
John Sherman
122bd9ea06 IMPALA-10553: External Frontend CTAS support
- Adds the concept of an external staging dir to HdfsTableSink
  - This allows an external frontend to specify the destination of the
  sink
  - When this is set, the external frontend is responsible
  for moving and managing the results
  - Any DDL related operations are assumed to be handled by
  the external frontend
  - External frontends may optionally supply a partition
  depth which acts as a hint to skip a certain number of
  partitions while creating directories for partitions. This
  is for when the external frontend has pre-created a
  certain number of the directories in staging (usually the
  static portion of a partition specification).
- Modifies delta/base naming to include 0 prefix padding to
  match Hive for dynamic partitioning detection
- External frontends are responsible for doing authorization
  checks against the staging directory and it is assumed the
  external frontend service is not exposed directly to users.

Co-authored-by: Kurt Deschler <kdechle@cloudera.com>

Change-Id: Iae0ea4a832d8281c563427d0d7da1623bfce437b
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-on: http://gerrit.cloudera.org:8080/17145
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-16 15:17:26 +00:00
John Sherman
768722b868 IMPALA-10551: Add result sink support for external frontends
- The intended purpose of these changes is to allow external frontends
  to receive query results via files rather than streaming the results
  through the thrift interface.
- External frontends are expected to provide an FeFsTable implementation
  that describes the desired location to store results.
- External frontends are responsible for managing the files after the
  query is completed.
- Testing has been manual and through an implementation of an external
  frontend.

Change-Id: I024bf41d77bb81f1ab0debdbd31ec3687c83f072
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-on: http://gerrit.cloudera.org:8080/17144
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2021-03-16 15:17:26 +00:00
Abhishek Rawat
573c60a298 IMPALA-10367: Impala-shell internal error -
UnboundLocalError, local variable 'retry_msg' referenced before assign

ImpalaHS2Client._open_session() has a 'retry_msg' variable which was
not initialized in the code-path where retry was disabled. If an
exception was hit with retry disabled, an UnboundLocalError was
raised at runtime.

The fix is to initialize 'retry_msg' in the non retry code-path.
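
The failure mode and the fix can be reproduced in a few lines. The functions
below are simplified, hypothetical stand-ins and not the impala-shell code.

```python
def open_session_buggy(retry_enabled):
    """retry_msg is only assigned on the retry path."""
    try:
        if retry_enabled:
            retry_msg = "will retry"
        raise IOError("connection failed")
    except IOError as e:
        # With retry disabled, retry_msg was never assigned, so this line
        # raises UnboundLocalError instead of reporting the real error.
        return "{} ({})".format(e, retry_msg)


def open_session_fixed(retry_enabled):
    """retry_msg is assigned on every path before it is used."""
    if retry_enabled:
        retry_msg = "will retry"
    else:
        retry_msg = "retries disabled"
    try:
        raise IOError("connection failed")
    except IOError as e:
        return "{} ({})".format(e, retry_msg)


print(open_session_fixed(False))
```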

Testing:
- Forced exception in ImpalaHS2Client._open_session() and verified that
proper error message was generated.
- Ran impala-shell e2e and custom cluster tests.

Change-Id: I50a08a62a332de759022d0a4862e74f5a81945d9
Reviewed-on: http://gerrit.cloudera.org:8080/17172
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-16 01:14:06 +00:00
Kurt Deschler
c60a626ac6 IMPALA-10503: testdata load hits hive memory limit errors during hive inserts
Changed the following hive settings to avoid hitting Hive container
limit errors:

hive.tez.container.size: 2048
hive.tez.java.opts: -Xmx1700m

With these settings, testdata load completes without errors on a
32GB host.

Reviewed-by: Fang-Yu Rao <fangyu.rao@cloudera.com>
Change-Id: Idac5f054e814070b983f7f57aef4ea9d54252bb2
Reviewed-on: http://gerrit.cloudera.org:8080/17061
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
2021-03-15 20:55:29 +00:00
Riza Suminto
37ec96e72a IMPALA-10559: Fix flakiness in TestScratchLimit.
TestScratchLimit has been flaky in ubuntu-16.04-dockerised-tests
environment since results spooling is enabled by default in IMPALA-9856.
A combination of result spooling, sort query, and low buffer_pool_limit
in TestScratchLimit::test_with_unlimited_scratch_limit seems to reveal a
memory reservation bug in BufferedTupleStream. This patch disables
result spooling for tests under TestScratchLimit until the underlying
bug is found. We will investigate the bug in a separate JIRA.

Testing:
- Disable result spooling in all tests of TestScratchLimit before
  IMPALA-9856 gets in.
- Run and pass TestScratchLimit locally.

Change-Id: I68736d6bfb0001423fd138000670ac60b2117fbe
Reviewed-on: http://gerrit.cloudera.org:8080/17182
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-14 13:41:40 +00:00
Riza Suminto
47219ec366 IMPALA-10565: Adjust result spooling memory based on scratch_limit
IMPALA-9856 enables result spooling by default. Result spooling depends
on the ability to spill its entire BufferedTupleStream to disk once it
hits maximum memory reservation. However, if the query option
scratch_limit is set lower than max_spilled_result_spooling_mem, the
query might fail in the middle of execution due to insufficient scratch
space. This patch changes the planner to consider the scratch_limit and
scratch_dirs query options when computing the resources used by result
spooling. The algorithm is as follows:

* If scratch_dirs is empty or scratch_limit < minMemReservationBytes
  required to use BufferedPlanRootSink, we set spool_query_results to
  false and fall back to using BlockingPlanRootSink.

* If scratch_limit > minMemReservationBytes but still fairly low, we
  lower the max_result_spooling_mem (default is 100MB) and
  max_spilled_result_spooling_mem (default is 1GB) to fit scratch_limit.

* If scratch_limit > max_spilled_result_spooling_mem, do nothing.
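
The three rules can be sketched as a small Python function. This is only an illustration under stated assumptions, not the planner code: the function name is hypothetical, all sizes are non-negative byte counts (an "unlimited" scratch_limit is not modelled), and the way the two limits are clamped in the middle case is an assumption, since the commit message only says they are lowered to fit scratch_limit.

```python
def adjust_result_spooling(scratch_dirs, scratch_limit, min_mem_reservation,
                           max_result_spooling_mem=100 * 1024 ** 2,
                           max_spilled_result_spooling_mem=1024 ** 3):
    """Return (spool_query_results, max_result_spooling_mem,
    max_spilled_result_spooling_mem) after applying the three rules."""
    # Rule 1: no scratch dirs, or too little scratch for BufferedPlanRootSink.
    if not scratch_dirs or scratch_limit < min_mem_reservation:
        return False, max_result_spooling_mem, max_spilled_result_spooling_mem
    # Rule 3: scratch_limit already exceeds the spilled-results cap.
    if scratch_limit > max_spilled_result_spooling_mem:
        return True, max_result_spooling_mem, max_spilled_result_spooling_mem
    # Rule 2 (assumed clamping): shrink both limits so they fit scratch_limit.
    spilled = scratch_limit
    in_memory = min(max_result_spooling_mem, spilled)
    return True, in_memory, spilled
```

With the defaults above (100MB and 1GB), a 50MB scratch_limit lowers both limits to 50MB, while a 500MB scratch_limit only lowers the spilled limit.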

Testing:
- Add TestScratchLimit::test_result_spooling_and_varying_scratch_limit
- Verify that spool_query_results query option is disabled in
  TestScratchDir::test_no_dirs
- Pass exhaustive tests.

Change-Id: I541f46e6911694e14c0fc25be1a6982fd929d3a9
Reviewed-on: http://gerrit.cloudera.org:8080/17166
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Aman Sinha <amsinha@cloudera.com>
2021-03-14 03:35:40 +00:00
stiga-huang
2dfc68d852 IMPALA-7712: Support Google Cloud Storage
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to other remote
FileSystems.

New flags for GCS:
 - num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.

Follow-up:
 - Support for spilling to GCS will be addressed in IMPALA-10561.
 - Support for caching GCS file handles will be addressed in
   IMPALA-10568.
 - test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   GCS (IMPALA-10562).
 - Some tests are skipped due to issues introduced by /etc/hosts setting
   on GCE instances (IMPALA-10563).

Tests:
 - Compile and create hdfs test data on a GCE instance. Upload test data
   to a GCS bucket. Modify all locations in HMS DB to point to the GCS
   bucket. Remove some hdfs caching params. Run CORE tests.
 - Compile and load snapshot data to a GCS bucket. Run CORE tests.

Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-13 11:20:08 +00:00
Zoltan Borok-Nagy
6c6b0ee869 IMPALA-10222: CREATE TABLE AS SELECT for Iceberg tables
This patch adds support for CREATE TABLE AS SELECT statements
for Iceberg tables.

CTAS statements work like the following in Impala:

1. Analysis of the whole CTAS statement
2. Divide CTAS to CREATE stmt and INSERT stmt
3. Create temporary in-memory target table from the CREATE stmt
4. Analyse the INSERT statement by using the temporary target table
5. If everything is OK so far, create the target table
6. Execute the INSERT query

For Iceberg tables the non-trivial thing was to create the temporary
target table without actually creating it via Iceberg API. I've created
a new class 'IcebergCtasTarget' that mimics an FeIceberg table. It can be
used with catalog V1 and V2 as well.

Testing
 * e2e CTAS tests in iceberg-ctas.test
 * SHOW CREATE TABLE stmts in show-create-table.test

Change-Id: I81d2084e401b9fa74d5ad161b51fd3e2aa3fcc67
Reviewed-on: http://gerrit.cloudera.org:8080/17130
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 19:28:19 +00:00
stiga-huang
d5f67fce41 IMPALA-10523: Fix impala-shell crash in printing error messages that contain UTF-8 characters
In Python2, print() converts all non-keyword arguments to strings like
str() does and writes them to the stream. str() on QueryStateException
returns its value (i.e. the error message), which could be of unicode type.
Python2 will implicitly encode it to str type using the default
encoding, 'ascii'. This could result in UnicodeEncodeError when there
are non-ascii characters in the error message.

This patch explicitly encodes the error message using 'utf-8' encoding
if it's in unicode type and the shell is run in Python2.
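
The idea can be sketched as follows. This is a minimal illustration, not the impala-shell implementation: printable_error is a hypothetical helper name, and the block is written so that it also runs unchanged on Python 3.

```python
import sys


def printable_error(message):
    # On Python 2, encode unicode explicitly so print() does not fall back
    # to the default 'ascii' codec. The name 'unicode' is only evaluated
    # when running on Python 2, so this is safe on Python 3 as well.
    if sys.version_info[0] == 2 and isinstance(message, unicode):  # noqa: F821
        return message.encode("utf-8")
    # On Python 3, str is already Unicode and can be printed as-is.
    return message
```

A caller would then use print(printable_error(msg)) instead of printing the exception value directly.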

Tests:
 - Add test in test_shell_interactive.py

Change-Id: Ie10f5b03ecc5877053c2fbada1afaf256b423a71
Reviewed-on: http://gerrit.cloudera.org:8080/17099
Reviewed-by: Tamas Mate <tmate@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 18:19:11 +00:00
Kurt Deschler
2e53c11484 IMPALA-10546: Add ImpalaServer interface to retrieve BackendConfig from impalad
This patch adds a new interface ImpalaServer::GetBackendConfig() that
returns the current TBackendGflags from impalad.

Testing:
Called new interface from external frontend. Verified that
TBackendGflags were populated correctly.

Reviewed-by: John Sherman <jfs@cloudera.com>
Change-Id: I14a3cee29f1fc91f4431b7ea89053bb3fbfa5e69
Reviewed-on: http://gerrit.cloudera.org:8080/17116
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 17:49:08 +00:00
Kurt Deschler
b3d95d8f32 IMPALA-10522: Support external use of frontend libraries
This patch enables the Impala frontend jar and dependent library
libfesupport.so to be used by an external Java frontend.

Calling FeSupport.setExternalFE() will cause external frontend
initialization mode to be used during FeSupport.loadLibrary(). This
mode builds upon logic that is used to initialize the frontend jar for
unit tests.

Initialization in external frontend mode differs as follows:

- Skip instantiating Frontend object and its dependents
- Skip loading libhdfs
- Skip starting JVM Pause monitor
- Disable Minidumper
- Initialize TimezoneDatabase for external frontends
- Disable redirect of stderr/stdout to libfesupport.so glog
- Log messages from libfesupport.so to stderr
- Use libfesupport.so for JNI symbol look up

Null checks were added in places where objects were assumed to be
instantiated but are now skipped during initialization.

Additional change:
1) Add libfesupport.lib path to JAVA_LIBRARY_PATH in test driver

Testing:
 - Initialized frontend jar from external frontend
 - Verified that frontend Java objects can be used externally without
   issues
 - Verified that exceptions thrown from Impala Java or libfesupport
   can be caught or propagated correctly by the external frontend
 - Manual verification of minicluster logs
 - Ran queries with external frontend

Co-authored-by: John Sherman <jfs@cloudera.com>
Co-authored-by: Aman Sinha <amsinha@cloudera.com>

Change-Id: I4e3a84721ba196ec00773ce2923b19610b90edd9
Reviewed-on: http://gerrit.cloudera.org:8080/17115
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
2021-03-12 17:49:08 +00:00
Kurt Deschler
311938b4f5 IMPALA-10535: Add interface to ImpalaServer for execution of externally compiled statements
The ExecutePlannedStatement interface allows an externally supplied
TExecRequest to be executed by impalad. The TExecRequest must be fully
populated and will be sent directly to the backend for execution.

The following fields in the TExecRequest are updated by the coordinator:
- Hostname
- KRPC address
- Local Timezone

In order to add the interface to ImpalaInternalService.thrift, several of
the thrift classes were moved to Query.thrift to avoid a circular
dependency with Frontend.thrift.

Added functionality to format and dump TExecRequest structures to path
specified in debug flag dump_exec_request_path.

A start timestamp field has been added to TExecRequest to represent the
interval in the query profile between when the request was sent by the
external frontend and handled by the backend.

A local timestamp field has been added to the Ping result struct to
return the current backend timestamp. This is used by the external
frontend to populate the start timestamp.

Also included is a change to avoid generating silent AnalysisExceptions
during table resolution.

Tested with TExecRequest structures populated by external frontend.
Local timezone change tested with the INT64 TIMESTAMP datatype.

Reviewed-by: John Sherman <jfs@cloudera.com>
Change-Id: Iace716dd67290f08441857dc02d2428b0e335eaa
Reviewed-on: http://gerrit.cloudera.org:8080/17104
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
2021-03-12 17:49:08 +00:00
Fucun Chu
0d22e89df4 IMPALA-10520: Implement ds_theta_intersect() function
This function receives a set of serialized Apache DataSketches Theta
sketches produced by ds_theta_sketch() and intersects them into a
single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table and intersect the relevant
sketches to get estimates for the partitions the user is interested
in. E.g.:
  SELECT
      ds_theta_estimate(ds_theta_intersect(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;
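
To illustrate what intersecting Theta sketches means, here is a toy Python model. It is a sketch of the general theta-sketch idea under simplifying assumptions, not the Apache DataSketches implementation used by Impala; make_sketch, intersect, estimate and the SHA-256 based hash are all hypothetical choices made for this example.

```python
import hashlib


def _hash01(item):
    # Map an item to a deterministic value in [0, 1) using 53 hash bits.
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / float(1 << 53)


def make_sketch(items, k=4096):
    # A toy sketch is (theta, retained hashes); all retained hashes < theta.
    hashes = sorted({_hash01(x) for x in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)
    # Keep the k smallest hashes; the next one becomes the threshold.
    return hashes[k], set(hashes[:k])


def intersect(sketches):
    # Use the smallest theta and keep hashes present in every sketch.
    theta = min(t for t, _ in sketches)
    common = set.intersection(*(s for _, s in sketches))
    return theta, {h for h in common if h < theta}


def estimate(sketch):
    # Retained count scaled by the sampling fraction theta.
    theta, retained = sketch
    return len(retained) / theta
```

When every input fits within k entries, theta stays at 1.0 and the estimate is exact; for larger inputs the result is an approximation whose accuracy depends on k.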

Testing:
  - Apart from the automated tests I added to this patch I also
    tested ds_theta_intersect() on a bigger dataset to check that
    serialization, deserialization and merging steps work well. I
    took TPCH25.lineitem, created a number of sketches with grouping
    by l_shipdate and called ds_theta_intersect() on those sketches

Change-Id: I80e68c2151c4604f0386d3dfb004c82b10293f97
Reviewed-on: http://gerrit.cloudera.org:8080/17088
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 16:13:48 +00:00
liuyao
1a01bfe831 IMPALA-10377: Improve the accuracy of resource estimation
PlanNode does not consider some factors when estimating memory,
which causes a large error rate.

AggregationNode
1.MemoryEstimate = Ndv * (AvgRowSize + SizeOfBucket)
2.When estimating the Ndv of merge aggregation, Ndv should be
  divided only once.
3.If there is no grouping exprs, MemoryEstimate =
  MIN_PLAIN_AGG_MEM

SortNode
1.MemoryEstimate = Cardinality * AvgRowSize. Memory used when
  there is enough memory

HashJoinNode
1.MemoryEstimate= DataRows + Buckets + DuplicateNodes,
  DataRows = RightTableCardinality * AvgRowSize,
  Buckets= roundUpToPowerOf2(RightTableCardinality) *
           SizeOfBucket,
  DuplicateNodes = (RightTableCardinality - RightNdv) *
                    SizeOfDuplicateNode

KuduScanNode
1.MemoryEstimate = Columns * BytesPerColumn * MaxScannerThreads,
  Columns are scanned in query, not all the columns of the table
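
The formulas above can be written out as plain Python functions. This is only an illustration of the arithmetic: the function and parameter names are hypothetical, and the constants (SizeOfBucket, SizeOfDuplicateNode, BytesPerColumn, MIN_PLAIN_AGG_MEM) are passed in as arguments because their actual values are not given here.

```python
def round_up_to_power_of_2(n):
    # Smallest power of two that is >= n (treats n < 1 as 1).
    return 1 << (max(n, 1) - 1).bit_length()


def aggregation_estimate(ndv, avg_row_size, size_of_bucket,
                         min_plain_agg_mem, has_grouping_exprs=True):
    # Without grouping exprs, the estimate is MIN_PLAIN_AGG_MEM.
    if not has_grouping_exprs:
        return min_plain_agg_mem
    return ndv * (avg_row_size + size_of_bucket)


def sort_estimate(cardinality, avg_row_size):
    # Memory used when there is enough memory to hold all rows.
    return cardinality * avg_row_size


def hash_join_estimate(right_cardinality, right_ndv, avg_row_size,
                       size_of_bucket, size_of_duplicate_node):
    data_rows = right_cardinality * avg_row_size
    buckets = round_up_to_power_of_2(right_cardinality) * size_of_bucket
    duplicate_nodes = (right_cardinality - right_ndv) * size_of_duplicate_node
    return data_rows + buckets + duplicate_nodes


def kudu_scan_estimate(scanned_columns, bytes_per_column, max_scanner_threads):
    # Only the columns scanned by the query count, not the whole table.
    return scanned_columns * bytes_per_column * max_scanner_threads
```

For example, a build side with 1000 rows of 10 bytes, 800 distinct keys and 16-byte buckets and duplicate nodes gives 10000 + 1024 * 16 + 200 * 16 bytes.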

UnitTest
1.CardinalityTest adds test cases to test memory estimation.
  Modify existing test cases related to memory estimation

Change-Id: Ic01db168ff2c6d6de33ee553a8175599f035d7a1
Reviewed-on: http://gerrit.cloudera.org:8080/16842
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 14:23:04 +00:00
John Sherman
a29d06db53 IMPALA-9218: Add support for locally compiled Hive
- Add HIVE_VERSION_OVERRIDE, HIVE_STORAGE_API_VERSION_OVERRIDE,
  HIVE_METASTORE_THRIFT_DIR_OVERRIDE, HIVE_HOME_OVERRIDE environment
  variable support to impala-config.sh
- When used together with HIVE_SRC_DIR_OVERRIDE allows a user to
  specify a locally compiled version of Hive for development and the
  minicluster
- Hive jars are expected to have been installed into the local maven
  repository
- Currently only version 3 of Hive is supported due to the absence of
  API shims for Hive 4.0
Example:
  ~/hive $ mvn package install -Pdist -DskipTests

Example configuration:
export HIVE_VERSION_OVERRIDE=3.1.0-SNAPSHOT
export HIVE_STORAGE_API_VERSION_OVERRIDE=2.6.0
export HIVE_HOME_OVERRIDE=\
~/hive/packaging/target/apache-hive-3.1.0-SNAPSHOT-bin/apache-hive-3.1.0-SNAPSHOT-bin
export HIVE_SRC_DIR_OVERRIDE=~/hive
export HIVE_METASTORE_THRIFT_DIR_OVERRIDE=~/hive/standalone-metastore/src/main/thrift/

Change-Id: I21892c153c445e3a5d93f2bc8f5e0b799929dd34
Reviewed-on: http://gerrit.cloudera.org:8080/17094
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 03:15:44 +00:00
Fang-Yu Rao
2039746ebe IMPALA-10576: Add refresh authorization to make a test case less flaky
We found that a test case run in test_grant_revoke_with_role() that is
used to verify a requesting user does not possess the necessary
privilege to perform the GRANT operation could fail since the expected
AuthorizationException is not returned after the query. Since the
privilege of GRANT was revoked immediately before this test case, we
suspect the authorization-related metadata has not been updated. To make
this test case less flaky, in this patch we add a REFRESH AUTHORIZATION
after the query that revoked the GRANT privilege from the requesting
user.

Testing:
 - Verified that this patch passes the core tests in an ASAN build.

Change-Id: I7407bac0407e162ab5ba623505bd7ee49bdf3abf
Reviewed-on: http://gerrit.cloudera.org:8080/17165
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 00:00:03 +00:00
Yida Wu
b2088b30d7 IMPALA-10555: Fix Hit DCHECK in TmpFileGroup::RecoverWriteError
The DCHECK error happens when an IO error occurs during spilling while
the scratch directory is on a remote filesystem and error recovery
(rewrite) is attempted. Because the DCHECK currently only considers
the file number of local scratch files, it leads to a file number
requirement mismatch in the DCHECK.
Because the implementations of spilling to the local fs and the remote fs
are quite different, for simplicity we don't recover write errors
for spilling to a remote fs in the current version. Instead, the errors
generated during spilling to remote would be returned directly to the
upper layer. So, we avoid the DCHECK logic for spilling to remote.

Tests:
* Added a unit test: TmpFileMgrTest::TestRemoteRemoveBuffer.
* Ran Unit Tests:
$IMPALA_HOME/be/build/latest/runtime/tmp-file-mgr-test

Change-Id: Ifd9aea4bf2fff634ea9a30bf6e87987be4e1c611
Reviewed-on: http://gerrit.cloudera.org:8080/17140
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-10 05:07:25 +00:00
Zoltan Borok-Nagy
06fbb0d629 IMPALA-10571: ImpalaJdbcClient might silently choose a different driver than the one specified
ImpalaJdbcClient might silently choose the HiveDriver when the
connection string is not specified. It's because the default
connection string is 'jdbc:hive2://...'.

This patch adds a check to ImpalaJdbcClient to make sure the driver
being used is the one specified by the user. If not, it raises an
error.

I also modified bin/run-jdbc-client.sh to make it easier to use
different drivers. Users are now able to specify the classpath
of their custom driver via the environment variable
IMPALA_JDBC_DRIVER_CLASSPATH.

Testing:
 * tested manually

Change-Id: If7fdf49b7f04f4d9ae6286df5c8df6b205cbce8f
Reviewed-on: http://gerrit.cloudera.org:8080/17164
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-09 23:03:05 +00:00
Yida Wu
d89c04bf80 IMPALA-10529: Fix hit DCHECK in DiskIoMgr::AssignQueue in core-s3 build
For start option "scratch_dirs", it only considers local filesystem as
the default filesystem, regardless of the setting of DefaultFS(for a
remote scratch dir, it needs to explicitly set it with the remote fs
prefix). However, the function AssignQueue() would assign the queue
based on not only the path string but also the default filesystem
setting. For example, if scratch_dirs is set as "/tmp", the scratch dir
is supposed to be in the local filesystem, but the AssignQueue() would
consider it as "s3a://xxx/tmp" if an S3 path is set as the default fs.
To fix this, the solution is to add a bool variable to AssignQueue() to
decide whether or not to check the default fs setting when parsing the
file path. For all of the scratch dirs, AssignQueue() won't check the
default fs.
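
The effect of the new flag can be sketched in Python. This is a simplified model with hypothetical names (resolve_path, check_default_fs), not the DiskIoMgr::AssignQueue() code, and it assumes absolute paths and a scheme marker of "://".

```python
def resolve_path(path, default_fs, check_default_fs=True):
    # Paths with an explicit scheme (e.g. "s3a://bucket/dir") are kept as-is.
    if "://" in path:
        return path
    # Other files inherit the default filesystem when one is configured.
    if check_default_fs and default_fs:
        return default_fs.rstrip("/") + path
    # Scratch dirs pass check_default_fs=False and therefore stay local.
    return path
```

With a default fs of "s3a://xxx", the scratch dir "/tmp" resolves to the local path only when check_default_fs is False.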

Tests:
Added a unit testcase: TmpFileMgrTest::TestSpillingWithRemoteDefaultFS.
Ran and Passed TmpFileMgrTest.

Change-Id: Ic07945abe65d90235aa8dea92dd3c3821a4f1f53
Reviewed-on: http://gerrit.cloudera.org:8080/17136
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-04 01:03:04 +00:00
John Sherman
ca17e307ab IMPALA-10550: Add External Frontend service port
- If external_fe_port flag is >0, spins up a new HS2 compatible
  service port
- Added enable_external_fe_support option to start-impala-cluster.py
  - which when detected will start impala clusters with
  external_fe_port on 21150-21152
- Modify impalad_coordinator Dockerfile to expose external frontend
  port at 21150
- The intent of this commit is to separate external frontend
  connections from normal hs2 connections
  - This allows different security policy to be applied to
  each type of connection. The external_fe_port should be considered
  a privileged service and should only be exposed to an external
  frontend that does user authentication and does authorization
  checks on generated plans

Change-Id: I991b5b05e12e37d8739e18ed1086bbb0228acc40
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-on: http://gerrit.cloudera.org:8080/17125
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-03 22:46:05 +00:00
Vihang Karajgaonkar
a47700ed79 IMPALA-10450: Catalogd crashes due to exception in ThriftDebugString
This patch adds a wrapper around ThriftDebugString method provided
in the Thrift library. The thrift's method can throw exceptions
like (bad_alloc or TProtocolException) when the object cannot be
serialized into a string representation. This exception is not
caught on the catalogd side and it crashes the catalogd.

The error was specifically seen in the catalogd's debug UI
which provides a way to display a Table object. An exception
thrown when rendering the table on the UI would have crashed
the catalogd before the patch. In order to simulate this crash a new debug
action called EXCEPTION was added. A new custom cluster test
was added which simulates an exception thrown in this method and
makes sure that fetching the table from catalogd's debug UI
does not crash the catalogd.
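
The wrapper pattern can be illustrated in Python. This is an analogy only, not the C++ wrapper added by the patch: safe_debug_string is a hypothetical name, and MemoryError and ValueError stand in for bad_alloc and TProtocolException.

```python
def safe_debug_string(obj, to_debug_string=repr):
    """Return (text, error); exactly one of the two is None."""
    try:
        return to_debug_string(obj), None
    except (MemoryError, ValueError) as e:
        # Report the failure to the caller instead of letting it
        # propagate and take down the whole process.
        return None, "Unexpected exception: %s" % e


def failing_serializer(obj):
    # Simulates a serializer that runs out of memory on a huge object.
    raise MemoryError("out of memory")
```

A web handler using such a wrapper can render the error string on the page when the serialized text is missing.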

Tests:
1. Added a new custom cluster test which reproduces the crash.
2. Created a large table which has ~270K partitions and reduced
the memory of the catalogd to 16GB. This configuration throws
bad_alloc exception in the ThriftDebugString method and crashes
the catalogd. After the patch the crash is averted and we see
an error message on the debug UI instead. I also looped around
the catalog web UI call for more than an hour to see if there
are any other stability issues. I could not see any problems.

Change-Id: I42cee6186a3d5bacc1117bae5961ac60ac9f7a66
Reviewed-on: http://gerrit.cloudera.org:8080/17110
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Vihang Karajgaonkar <vihang@cloudera.com>
2021-03-02 21:44:24 +00:00
Steve Carlin
3554b0752d IMPALA-10524: Changes to HdfsPartition for third party extensions.
Some changes are needed to HdfsPartition and other related classes
to allow for third party extensions.  These changes include:

- A protected constructor which will allow a subclass to instantiate
  HdfsPartition using its own Builder.
- Various changes of permissions to methods and variables to allow
  third party extension visibility.
- Creation of the getHostIndex() method to allow the subclass to
  override how the hostIndexes are retrieved.
- Added a new default method "getFileSystem()" to FeFsPartition which
  will allow the third party extension to override how the filesystem
  is obtained from the partition object.

Change-Id: I5a792642f27228118ac8f2e8ef98e8ba7aee4a46
Reviewed-on: http://gerrit.cloudera.org:8080/17092
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-02 05:57:26 +00:00
Riza Suminto
49ac55fb69 IMPALA-9856: Enable result spooling by default.
Result spooling has been relatively stable since it was introduced, and
it has several benefits described in IMPALA-8656. This patch enables the
result spooling (SPOOL_QUERY_RESULTS) query option by default.

Furthermore, some tests need to be adjusted to account for result
spooling by default. The following are the adjustment categories and
the tests that fall under each category.

Change in assertions:
PlannerTest#testAcidTableScans
PlannerTest#testBloomFilterAssignment
PlannerTest#testConstantFolding
PlannerTest#testFkPkJoinDetection
PlannerTest#testFkPkJoinDetectionWithHDFSNumRowsEstDisabled
PlannerTest#testKuduSelectivity
PlannerTest#testMaxRowSize
PlannerTest#testMinMaxRuntimeFilters
PlannerTest#testMinMaxRuntimeFiltersWithHDFSNumRowsEstDisabled
PlannerTest#testMtDopValidation
PlannerTest#testParquetFiltering
PlannerTest#testParquetFilteringDisabled
PlannerTest#testPartitionPruning
PlannerTest#testPreaggBytesLimit
PlannerTest#testResourceRequirements
PlannerTest#testRuntimeFilterQueryOptions
PlannerTest#testSortExprMaterialization
PlannerTest#testSpillableBufferSizing
PlannerTest#testTableSample
PlannerTest#testTpch
PlannerTest#testKuduTpch
PlannerTest#testTpchNested
PlannerTest#testUnion
TpcdsPlannerTest
custom_cluster/test_admission_controller.py::TestAdmissionController::test_dedicated_coordinator_planner_estimates
custom_cluster/test_admission_controller.py::TestAdmissionController::test_memory_rejection
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_mem_limit_configs
metadata/test_explain.py::TestExplain::test_explain_level2
metadata/test_explain.py::TestExplain::test_explain_level3
metadata/test_stats_extrapolation.py::TestStatsExtrapolation::test_stats_extrapolation

Increase BUFFER_POOL_LIMIT:
query_test/test_queries.py::TestQueries::test_analytic_fns
query_test/test_runtime_filters.py::TestRuntimeRowFilters::test_row_filter_reservation
query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output
query_test/test_spilling.py::TestSpillingBroadcastJoins::test_spilling_broadcast_joins
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_aggs
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_regression_exhaustive
query_test/test_udfs.py::TestUdfExecution::test_mem_limits

Increase MEM_LIMIT:
query_test/test_mem_usage_scaling.py::TestExchangeMemUsage::test_exchange_mem_usage_scaling
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling

Increase MAX_ROW_SIZE:
custom_cluster/test_parquet_max_page_header.py::TestParquetMaxPageHeader::test_large_page_header_config
query_test/test_insert.py::TestInsertQueries::test_insert_large_string
query_test/test_query_mem_limit.py::TestQueryMemLimit::test_mem_limit
query_test/test_scanners.py::TestTextSplitDelimiters::test_text_split_across_buffers_delimiter
query_test/test_scanners.py::TestWideRow::test_wide_row

Disable result spooling to maintain assertion:
custom_cluster/test_admission_controller.py::TestAdmissionController::test_set_request_pool
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_host_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_pool_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_queue_reasons_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_config_change_while_queued
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_fetched_rows
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_finished_query
custom_cluster/test_scratch_disk.py::TestScratchDir::test_no_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_existing_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_writable_dirs
query_test/test_insert.py::TestInsertQueries::test_insert_large_string (the last query only)
query_test/test_kudu.py::TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_kudu_scan_mem_usage
query_test/test_queries.py::TestQueriesParquetTables::test_very_large_strings
query_test/test_query_mem_limit.py::TestCodegenMemLimit::test_codegen_mem_limit
shell/test_shell_client.py::TestShellClient::test_fetch_size

Testing:
- Pass exhaustive tests.

Change-Id: I9e360c1428676d8f3fab5d95efee18aca085eba4
Reviewed-on: http://gerrit.cloudera.org:8080/16755
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-02 04:58:51 +00:00
Riza Suminto
801efdcf09 IMPALA-10492: Lower default MAX_CNF_EXPRS query option
MAX_CNF_EXPRS was set to unlimited by default. The CNF rewrite can lead
to significant frontend memory usage and eventually OutOfMemory for a
complex query that contains many predicates. We need to lower the default
value to avoid this memory problem while maintaining performance for our
TPC-DS and TPC-H workloads.

We investigated the maximum number of CNF expressions in TPC-DS and TPC-H
by printing out the final value of 'numCnfExprs_' from
ConvertToCNFRule.java to the query profile. We found 5 queries that
apply CNF rewrite rules, as follows:

| Query     | numCnfExprs_ |
|-----------+--------------|
| TPCDS-Q13 |          168 |
| TPCDS-Q85 |          100 |
| TPCDS-Q48 |           34 |
| TPCH-Q19  |          124 |
| TPCH-Q7   |            3 |

This patch lowers the default value from unlimited to 200 based on the
result above.

Testing:
- Manually verify that MAX_CNF_EXPRS 200 is enough for our TPC-DS and
  TPC-H workloads.
- Pass core tests.

Change-Id: I7ca3d0e094ac01c24a046c25d6a1b56bf134faa8
Reviewed-on: http://gerrit.cloudera.org:8080/17132
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-28 00:27:40 +00:00