Each Ranger privilege grant/revoke request has a resources map to
describe the target resource, e.g. db, table, url, etc. The keys of the
resources map are predefined strings like "database", "table", "column",
"url", etc. The values are the actual resource name or "*" as a wildcard
for all such kinds of resources.
Our FE tests unintentionally set null values on "url" when the resources
map doesn't have such a key. See AuthorizationTestBase#updateUri().
Using null as the value is undefined behavior and causes a
NullPointerException in newer versions of Ranger when granting/revoking
privileges (possibly due to RANGER-4638, which adds code to split the
string value).
This patch fixes AuthorizationTestBase#updateUri() to only update the
URL when it's not null and starts with "/".
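A minimal sketch of the guarded update, assuming a helper shaped like
the described test code (names and signature are illustrative, not the
actual Impala source):
```java
import java.util.Map;

class AuthorizationTestSketch {
  // Hypothetical shape of the fix: leave the resources map untouched
  // unless it already carries a non-null, absolute "url" value. Writing
  // null back is what triggered the NullPointerException in Ranger.
  static void updateUri(Map<String, String> resources, String newUri) {
    String uri = resources.get("url");
    if (uri != null && uri.startsWith("/")) {
      resources.put("url", newUri);
    }
  }
}
```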
Tests:
- Verified the fix in a downstream build with a newer Ranger version
that hits the issue.
Change-Id: Ibed929bcd25ffdf54fa5f0fc848a0cc13c1fb0a2
Reviewed-on: http://gerrit.cloudera.org:8080/22435
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive uses URL encoding to format the partition strings when creating
the partition folders, e.g. "00:00:00" is encoded into "00%3A00%3A00".
When a partition of a string-type partition column "p" is created with
"00:00:00" as the partition value, the underlying partition folder is
"p=00%3A00%3A00".
When parsing the partition folders, Impala URL-decodes the partition
folder names to get the correct partition values. This is correct for
the ALTER TABLE RECOVER PARTITIONS command, which gets the partition
strings from the file paths. However, partition strings that come from
HMS events should not be URL-decoded, since they are not URL-encoded;
they are already the original partition values. Decoding them causes
HMS events on partitions whose value strings contain percent signs to
be matched to the wrong partitions.
This patch fixes the issue by only URL-decoding the partition strings
that come from file paths.
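A minimal illustration of the distinction (class name and literals are
illustrative):
```java
import java.net.URLDecoder;

public class PartitionValueDemo {
  public static void main(String[] args) throws Exception {
    // Folder name written by Hive for partition value "00:00:00":
    String folderValue = "00%3A00%3A00";
    // Correct for names parsed from file paths: decode to the original.
    System.out.println(URLDecoder.decode(folderValue, "UTF-8"));  // 00:00:00
    // Partition strings in HMS events are NOT encoded. A value like "50%"
    // must be used as-is; URL-decoding it throws IllegalArgumentException
    // because "%" must start a two-digit hex escape.
    String eventValue = "50%";
    System.out.println(eventValue);
  }
}
```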
Tests:
- Ran tests/metadata/test_recover_partitions.py
- Added custom-cluster test.
Change-Id: I7ba7fbbed47d39b02fa0b1b86d27dcda5468e344
Reviewed-on: http://gerrit.cloudera.org:8080/22388
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the existing AdminFnStmt to support operations on
EventProcessor. E.g. to pause the EventProcessor:
impala-shell> :event_processor('pause');
to resume the EventProcessor:
impala-shell> :event_processor('start');
or to resume the EventProcessor from a given event id (1000):
impala-shell> :event_processor('start', 1000);
Admins can also resume the EventProcessor at the latest event id by
using -1:
impala-shell> :event_processor('start', -1);
Supported command actions in this patch: pause, start, status.
The command output of all actions will show the latest status of
EventProcessor, including
- EventProcessor status:
PAUSED / ACTIVE / ERROR / NEEDS_INVALIDATE / STOPPED / DISABLED.
- LastSyncedEventId: The last HMS event id which we have synced to.
- LatestEventId: The event id of the latest event in HMS.
Example output:
[localhost:21050] default> :event_processor('pause');
+--------------------------------------------------------------------------------+
| summary |
+--------------------------------------------------------------------------------+
| EventProcessor status: PAUSED. LastSyncedEventId: 34489. LatestEventId: 34489. |
+--------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.01s
If authorization is enabled, only admin users that have ALL privilege on
SERVER can run this command.
Note that there is a restriction in MetastoreEventsProcessor#start(long)
that resuming EventProcessor back to a previous event id is only allowed
when it's not in the ACTIVE state. This patch aims to expose the control
of EventProcessor to the users so MetastoreEventsProcessor is not
changed. We can investigate the restriction and see if we want to relax
it.
Note that resuming EventProcessor at a newer event id can be done in
any state. Admins can use this to manually resolve the lag of HMS event
processing, after they have made sure all (or important) tables are
manually invalidated/refreshed.
A new catalogd RPC, SetEventProcessorStatus, is added for coordinators
to control the status of EventProcessor.
Tests:
- Added e2e tests
Change-Id: I5a19f67264cfe06a1819a22c0c4f0cf174c9b958
Reviewed-on: http://gerrit.cloudera.org:8080/22250
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.
In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.
This commit also renames the startup flag 'disable_reading_puffin_stats'
to 'enable_reading_puffin_stats' and the table property
'impala.iceberg_disable_reading_puffin_stats' to
'impala.iceberg_read_puffin_stats' to make them more intuitive. The
default values are flipped to keep the same behaviour as before.
The documentation of Puffin reading is updated in
docs/topics/impala_iceberg.xml
Testing:
- updated existing test cases and added new ones in
test_iceberg_with_puffin.py
- reorganised the tests in TestIcebergTableWithPuffinStats in
test_iceberg_with_puffin.py: tests that modify table properties and
other state that other tests rely on are now run separately to
provide a clean environment for all tests.
Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Reviewed-on: http://gerrit.cloudera.org:8080/22177
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Enables tuple caching on aggregates directly above scan nodes. Caching
aggregates requires that their children are also eligible for caching,
so this excludes aggregates above an exchange, union, or hash join.
Testing:
- Adds Planner tests for different aggregate cases to confirm they have
stable tuple cache keys and are valid for caching.
- Adds custom cluster tests that cached aggregates are used, and can be
re-used in slightly different statements.
Change-Id: I9bd13c2813c90d23eb3a70f98068fdcdab97a885
Reviewed-on: http://gerrit.cloudera.org:8080/22322
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
To support killing queries programmatically, this patch adds a new
type of SQL statement, called the KILL QUERY statement, to cancel and
unregister a query on any coordinator in the cluster.
A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
the syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.
A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.
Implementation:
Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".
Currently, a KILL QUERY statement is not interruptible. IMPALA-13663
was created to track this.
For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.
To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.
Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
statement.
- Added file tests/common/cluster_config.py and made
CustomClusterTestSuite.with_args() composable so that common cluster
configs can be reused in custom cluster tests.
Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
IMPALA-13154 added the method getFileMetadataStats() to
HdfsPartition.java, which returns the file metadata statistics. The
method requires the corresponding HdfsPartition instance to have a
non-null 'fileMetadataStats_' field.
This patch revises two existing constructors of HdfsPartition to provide
a non-null value for 'fileMetadataStats_'. This makes it easier for a
third-party extension to set up and update the field. A third-party
extension has to update 'fileMetadataStats_' if it wants to use this
field to get the size of the partition, since all three fields in
'fileMetadataStats_' default to 0.
A new constructor was also added to HdfsPartition that allows a
third-party extension to provide its own FileMetadataStats when
instantiating an HdfsPartition. To facilitate instantiating a
FileMetadataStats, a new constructor was added to FileMetadataStats
that takes a List of FileDescriptors.
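A rough sketch of what such a constructor aggregates, using simplified
stand-ins for the actual Impala classes (field names and types are
assumptions):
```java
import java.util.List;

// Simplified stand-ins for HdfsPartition.FileDescriptor and
// FileMetadataStats; the real classes carry more fields.
class FileDescriptorSketch {
  final long fileLength;
  FileDescriptorSketch(long fileLength) { this.fileLength = fileLength; }
}

class FileMetadataStatsSketch {
  long numFiles;
  long totalFileBytes;
  // Aggregate stats from a list of file descriptors, as the new
  // FileMetadataStats(List<FileDescriptor>) constructor is described
  // to do, so the partition's size is available without extra setup.
  FileMetadataStatsSketch(List<FileDescriptorSketch> fds) {
    for (FileDescriptorSketch fd : fds) {
      numFiles++;
      totalFileBytes += fd.fileLength;
    }
  }
}
```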
Change-Id: I7e690729fcaebb1e380cc61f2b746783c86dcbf7
Reviewed-on: http://gerrit.cloudera.org:8080/22340
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch makes the inclusion of the Iceberg position fields
conditional: they are included only if the MERGE statement lists UPDATE
or DELETE merge clauses or the target table has existing delete files.
The fields can be omitted when the sink of the MERGE statement creates
no delete files and the table has no existing delete files.
Additionally, this change disables MERGE for Iceberg target tables
that contain equality delete files; see IMPALA-13674.
Tests:
- iceberg-merge-insert-only planner test added
Change-Id: Ib62c78dab557625fa86988559b3732591755106f
Reviewed-on: http://gerrit.cloudera.org:8080/21931
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
AggregationNode.computeStats() estimates cardinality under a
single-node assumption. This can be an underestimation in the
preaggregation node case because the same grouping key may exist on
multiple nodes during preaggregation.
This patch adjusts the cardinality estimate using the following model
for the number of distinct values in a random sample of k rows,
previously used to calculate the ProcessingCost model by IMPALA-12657
and IMPALA-13644.
Assume we are picking k rows from an infinite population with NDV
distinct values, uniformly distributed. The probability of a given
value not appearing in the sample is
((NDV - 1) / NDV) ^ k
This is because we are making k choices, and each of them has a
((NDV - 1) / NDV) chance of not being our value. Therefore the
probability of a given value appearing in the sample is:
1 - ((NDV - 1) / NDV) ^ k
And the number of distinct values in the sample is:
(1 - ((NDV - 1) / NDV) ^ k) * NDV
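As a worked example, a small sketch of the formula (not the Impala
source):
```java
public class DistinctInSample {
  // Expected number of distinct values seen in a sample of k rows drawn
  // from a population with ndv equally likely values.
  static double estimate(double ndv, double k) {
    double pValueAppears = 1.0 - Math.pow((ndv - 1.0) / ndv, k);
    return pValueAppears * ndv;
  }

  public static void main(String[] args) {
    // With NDV = 1000, a 100-row sample is expected to contain only
    // ~95 distinct values, not min(1000, 100) = 100.
    System.out.println(estimate(1000, 100));  // ~95.2
  }
}
```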
Query option ESTIMATE_DUPLICATE_IN_PREAGG is added to control whether to
use the new estimation logic or not.
Testing:
- Pass core tests.
Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Reviewed-on: http://gerrit.cloudera.org:8080/22047
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
getPerInstanceNdvForCpuCosting is a method to estimate the number of
distinct values of exprs per fragment instance while accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is used
exclusively by AggregationNode.
getPerInstanceNdvForCpuCosting runs the probabilistic formula
individually for each grouping expression and then multiplies the
results together. That matches how we estimated group NDV in the past,
where we simply did NDV multiplication of each grouping expression.
Recently, we added tuple-based analysis to lower the cardinality
estimate for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function that returns the globalNdv estimate for a specific aggregation
class.
To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(), which
applies the probabilistic formula over this single globalNdv number
rather than the old way, which often returned an overestimate from the
NDV multiplication method. Its use is still limited to calculating
ProcessingCost. Using it for the preagg output cardinality will be done
by IMPALA-2945.
estimatePreaggCardinality is skipped if the data partition of the input
is a subset of the grouping expressions.
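A sketch of how a preaggregation estimate can be derived from a single
globalNdv using the sampling model above (the per-instance sample size
and the final aggregation across instances are assumptions, not the
exact Impala formula):
```java
public class PreaggCardinalitySketch {
  // Expected distinct groups seen by one instance consuming k of the
  // input rows, out of globalNdv total groups (same model as above).
  static double perInstanceNdv(double globalNdv, double k) {
    return (1.0 - Math.pow((globalNdv - 1.0) / globalNdv, k)) * globalNdv;
  }

  public static void main(String[] args) {
    double globalNdv = 10_000;
    double inputRows = 1_000_000;
    int numInstances = 10;
    // Each instance sees keys that other instances also see, so total
    // preagg output can exceed globalNdv.
    double preaggOutput =
        numInstances * perInstanceNdv(globalNdv, inputRows / numInstances);
    System.out.println(preaggOutput);  // approaches 10 * globalNdv here
  }
}
```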
Testing:
- Run and pass PlannerTest with COMPUTE_PROCESSING_COST=True.
ProcessingCost changes, but all cardinality numbers stay the same.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
runtime profile debugging.
Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When IcebergMergeImpl created the table sink, it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected random
input and kept the output writers open for every partition, which
resulted in high memory consumption and potentially a Memory Limit
Exceeded error when the number of partitions is high.
Since we actually sort the rows before the sink, we can set
'inputIsClustered' to true, which means HdfsTableSink can write files
one by one: whenever it gets a row that belongs to a new partition, it
knows it can close the current output writer and open a new one.
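Why clustered input caps the number of open writers can be seen in a
minimal sketch of the writer loop (Row, Writer, and the open function
are hypothetical stand-ins for the sink's internals):
```java
import java.util.List;
import java.util.function.Function;

interface Row { String partitionKey(); }
interface Writer { void write(Row r); void close(); }

class ClusteredSinkSketch {
  // With input sorted by partition key, at most one writer is open at a
  // time: seeing a new key proves the previous partition is complete.
  static void writeClustered(List<Row> sortedRows,
      Function<String, Writer> open) {
    String currentKey = null;
    Writer writer = null;
    for (Row row : sortedRows) {
      if (!row.partitionKey().equals(currentKey)) {
        if (writer != null) writer.close();
        currentKey = row.partitionKey();
        writer = open.apply(currentKey);
      }
      writer.write(row);
    }
    if (writer != null) writer.close();
  }
}
```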
Testing:
- e2e regression test
Change-Id: I7bad0310e96eb482af9d09ba0d41e44c07bf8e4d
Reviewed-on: http://gerrit.cloudera.org:8080/22332
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds OAuth support with the following functionality:
* Load and parse the OAuth JWKS from a configured JSON file or URL.
* Read the OAuth access token from the HTTP header, which has the
same format as the JWT Authorization Bearer token.
* Verify the OAuth token's signature with the public key in the JWKS.
* Get the username out of the payload of the OAuth access token.
* If Kerberos or LDAP is enabled, then both JWT and OAuth are
supported together; otherwise only one of JWT or OAuth is supported.
This has been the pre-existing flow for JWT, so OAuth follows the
same policy.
* Impala shell side changes: OAuth options -a and --oauth_cmd.
Testing:
- Added 3 custom cluster BE tests in test_shell_jwt_auth.py:
- test_oauth_auth_valid: authenticate with a valid token.
- test_oauth_auth_expired: authentication failure with an
expired token.
- test_oauth_auth_invalid_jwk: authentication failure due to an
invalid JWK.
- Added 1 custom cluster FE test in JwtWebserverTest.java:
- testWebserverOAuthAuth: basic tests for OAuth.
- Added 1 custom cluster FE test in LdapHS2Test.java:
- testHiveserver2JwtAndOAuthAuth: tests all combinations of
JWT and OAuth token verification with separate JWKS keys.
- Manually tested with valid, invalid, and expired OAuth
access tokens.
- Passed core run.
Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6
Reviewed-on: http://gerrit.cloudera.org:8080/21728
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change enables MERGE statements with source expressions containing
subqueries that require rewrite. The change adds implementations of the
reset methods for each merge case and properly handles resets for
MergeStmt and IcebergMergeImpl.
Tests:
- Planner test added with a merge query that requires a rewrite
- Analyzer test modified
Change-Id: I26e5661274aade3f74a386802c0ed20e5cb068b5
Reviewed-on: http://gerrit.cloudera.org:8080/22039
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13620 increased datanucleus.connectionPool.maxPoolSize of HMS
from 10 to 30. When running all tests on a single node, this seems to
exhaust all 100 of PostgreSQL's max_connections and interferes with
authorization/test_ranger.py and query_test/test_ext_data_sources.py.
This patch lowers datanucleus.connectionPool.maxPoolSize to 20.
Testing:
- Pass exhaustive tests in single node.
Change-Id: I98eb27cbd141d5458a26d05d1decdbc7f918abd4
Reviewed-on: http://gerrit.cloudera.org:8080/22326
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When IcebergUpdateImpl created the table sink, it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected random
input and kept the output writers open for every partition, which
resulted in high memory consumption and potentially an OOM error when
the number of partitions is high.
Since we actually sort the rows before the sink, we can set
'inputIsClustered' to true, which means HdfsTableSink can write files
one by one: whenever it gets a row that belongs to a new partition, it
knows it can close the current output writer and open a new one.
Testing:
- e2e regression test
Change-Id: I9bad335cc946364fc612e8aaf90858eaabd7c4af
Reviewed-on: http://gerrit.cloudera.org:8080/22325
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-12487 adds an optimization: if an ALTER_TABLE event has only
trivial changes in the StorageDescriptor (e.g. removing the optional
field 'storedAsSubDirectories'=false, which defaults to false), the
file metadata reload is skipped, no matter what changes are in the
table properties. This is problematic since some HMS clients (e.g.
Spark) can modify both the table properties and the StorageDescriptor.
If there is a non-trivial change in the table properties (e.g. a
'location' change), we shouldn't skip reloading file metadata.
Testing:
- Added a unit test to verify the same
Change-Id: Ia969dd32385ac5a1a9a65890a5ccc8cd257f4b97
Reviewed-on: http://gerrit.cloudera.org:8080/21971
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For a partition-level event, isOlderEvent() in catalogd needs to check
whether the corresponding partition was reloaded after the event. This
should be done while holding the table read lock. Otherwise,
EventProcessor could hit a ConcurrentModificationException when there
are concurrent DDLs/DMLs modifying the partition list.
Note: created IMPALA-13650 for a cleaner solution to clear the
in-flight events list for partitioned table events.
Testing:
- Added an end-to-end stress test to verify the above scenario
Change-Id: I26933f98556736f66df986f9440ebb64be395bc1
Reviewed-on: http://gerrit.cloudera.org:8080/21663
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg supports multiple writers with optimistic concurrency.
Each writer can write new files which are then added to the table
after a validation check to ensure that the commit does not conflict
with other modifications made during the execution.
When there is a conflicting change that cannot be resolved, the newly
written files cannot be committed to the table, so they used to become
orphan files on the file system. Orphan files can accumulate over time,
taking up a lot of storage space. They do not belong to the table
because they are not referenced by any snapshot, and therefore they
can't be removed by expiring snapshots.
This change introduces automatic cleanup of uncommitted files
after an unsuccessful DML operation to prevent creating orphan files.
No cleanup is done if Iceberg throws CommitStateUnknownException
because the update success or failure is unknown in this case.
Testing:
- E2E test: Injected ValidationException with debug option.
- stress test: Added a method to check that no orphan files were
created after failed conflicting commits.
Change-Id: Ibe59546ebf3c639b75b53dfa1daba37cef50eb21
Reviewed-on: http://gerrit.cloudera.org:8080/22189
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds INSERT * and UPDATE SET * language elements for
WHEN NOT MATCHED and WHEN MATCHED clauses. INSERT * enumerates all
source expressions from the source table/subquery and analyzes the clause
similarly to the regular WHEN NOT MATCHED THEN INSERT case. UPDATE SET
* creates assignments for each target table column by enumerating the
table columns and assigning source expressions by index.
If the target column count and the source expression count mismatch, or
the types mismatch, both clauses report analysis errors.
Tests:
- parser tests added
- analyzer tests added
- E2E tests added
Change-Id: I31cb771f2355ba4acb0f3b9f570ec44fdececdf3
Reviewed-on: http://gerrit.cloudera.org:8080/22051
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch refreshes compute_table_stats.py script with the following
changes:
- Limit parallelism to at most IMPALA_BUILD_THREADS if the --parallelism
argument is not set.
- Change its default connection to hs2, leveraging existing
ImpylaHS2Connection.
- Change OptionParser to ArgumentParser.
- Use impala-python3 to run the script.
- Add --exclude_table_names to skip running COMPUTE STATS on certain
tables/views.
- continue_on_error is False by default.
This patch also improves query handle logging in ImpylaHS2Connection.
A collect_profile_and_log argument is added to control whether to pull
logs and the runtime profile at the end of __fetch_results(). The
default behavior remains unchanged.
Skip COMPUTE STATS for functional_kudu.alltypesagg and
functional_kudu.manynulls because it is invalid to run COMPUTE STATS
on a view.
Customized hive-site.xml to set datanucleus.connectionPool.maxPoolSize
to 30 and hikaricp.connectionTimeout to 60000 ms. Also set hive.log.dir
to ${IMPALA_CLUSTER_LOGS_DIR}/hive.
Testing:
Repeatedly ran compute-table-stats.sh from a cold state and confirmed
that no errors occur. This is the script to do so on an active
minicluster:
cd $IMPALA_HOME
./bin/start-impala-cluster.py --kill
./testdata/bin/kill-hive-server.sh
./testdata/bin/run-hive-server.sh
./bin/start-impala-cluster.py
./testdata/bin/compute-table-stats.sh > /tmp/compute-stats.txt 2>&1
grep error /tmp/compute-stats.txt
Core tests ran and passed.
Change-Id: I1ebf02f95b957e7dda3a30622b87e8fca3197699
Reviewed-on: http://gerrit.cloudera.org:8080/22231
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The NDV of a grouping column can be reduced if there is a predicate
over that column. If the predicate is a constant equality predicate or
an is-null predicate, then the NDV must be 1. If the predicate is a
simple in-list predicate, the NDV is at most the number of items in the
list.
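A sketch of the reduction rule itself (the Predicate interface is a
hypothetical stand-in for Impala's conjunct analysis):
```java
class GroupingNdvSketch {
  interface Predicate {
    boolean isConstantEquality();
    boolean isIsNull();
    boolean isSimpleInList();
    int inListSize();
  }

  // Reduce a grouping column's NDV based on a predicate bound to it:
  // equality/is-null pins it to 1, an in-list caps it at the list size.
  static long reducedNdv(Predicate p, long baseNdv) {
    if (p.isConstantEquality() || p.isIsNull()) return 1;
    if (p.isSimpleInList()) return Math.min(baseNdv, p.inListSize());
    return baseNdv;
  }
}
```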
This patch adds this consideration by leveraging the existing analysis
in HdfsScanNode.computeStatsTupleAndConjuncts(). It records in the
Analyzer the first ScanNode/UnionNode that produces a TupleId,
registered during Init()/computeStats() of the PlanNode. At the
AggregationNode, it looks up the PlanNode that produces a TupleId. If
the originating PlanNode is an HdfsScanNode, it analyzes whether any
grouping expression is listed in statsOriginalConjuncts_ and reduces
them accordingly. If HdfsScanNode.computeStatsTupleAndConjuncts() can
be made generic for all ScanNode implementations in the future, we can
apply this same analysis to all kinds of ScanNode and achieve the same
reduction.
In terms of tracking the producer PlanNode, this patch makes an
exception for Iceberg PlanNodes that handle positional or equality
deletes. In that scenario, it is possible to have two ScanNodes sharing
the same TupleId to force UnionNode passthrough. Therefore, the
UnionNode will be acknowledged as the first producer of that TupleId.
This patch also removes some redundant operations in HdfsScanNode, and
fixes a typo in the method name
MathUtil.saturatingMultiplyCardinalities().
Testing:
- Add new test cases in aggregation.test
- Pass core tests.
Change-Id: Ia840d68f1c4f126d4e928461ec5c44545dbf25f8
Reviewed-on: http://gerrit.cloudera.org:8080/22032
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The message of a COMMIT_TXN event just contains the transaction id
(txnid). In the logs of top-10 expensive events and top-10 targets that
contribute to the lag, we show the target as CLUSTER_WIDE.
However, when processing the events, catalogd actually finds
the involved tables and reloads them. It'd be helpful to show the names
of the tables involved in the transaction.
This patch overrides the getTargetName() method in CommitTxnEvent to
show the table names. They are collected after the event is processed.
Tests:
- Add tests in MetastoreEventsProcessorTest
Change-Id: I4a7cb5e716453290866a4c3e74c0d269f621144f
Reviewed-on: http://gerrit.cloudera.org:8080/22036
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Sai Hemanth Gantasala <saihemanth@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
IMPALA-13405 does tuple analysis to lower AggregationNode cardinality.
It began by focusing on the simple column SlotRef case, but we can
improve this further by tracing the origin TupleId across views and
intermediate aggregation tuples. This patch implements deeper TupleId
tracing to achieve further cardinality reduction. With this deeper
TupleId resolution, it is now possible to narrow the TupleId search
down to children ScanNodes and UnionNodes only.
Note that this optimization still runs ONLY IF there are at least two
grouping expressions that refer to the same TupleId. There would be a
benefit to running the same optimization even when there is only a
single expression per TupleId, but we defer that work until we can
provide a faster TupleId-to-PlanNode mapping without repeating the plan
tree traversal.
This patch also makes tuple-based reduction more conservative by
capping it at the input cardinality/limit, or using the output
cardinality if the producer node is a UnionNode or has hard estimates.
aggInputCardinality is still indirectly influenced by the predicates
and limits of child nodes.
The following PlannerTests (under
testdata/workloads/functional-planner/queries/PlannerTest/) revert
their cardinality estimation to their state prior to IMPALA-13405:
tpcds/tpcds-q19.test
tpcds/tpcds-q55.test
tpcds_cpu_cost/tpcds-q03.test
tpcds_cpu_cost/tpcds-q31.test
tpcds_cpu_cost/tpcds-q47.test
tpcds_cpu_cost/tpcds-q52.test
tpcds_cpu_cost/tpcds-q57.test
tpcds_cpu_cost/tpcds-q89.test
Several other planner tests have increased cardinality after this
change, but the numbers are still below their pre-IMPALA-13405 values.
Removed the nested-view planner test in agg-node-max-mem-estimate.test
that was first added by IMPALA-13405. That same test has been
duplicated by IMPALA-13480 in aggregation.test.
Testing:
- Pass core tests.
Change-Id: I11f59ccc469c24c1800abaad3774c56190306944
Reviewed-on: http://gerrit.cloudera.org:8080/21955
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13405 adds a new tuple-analysis algorithm in AggregationNode to
lower cardinality estimation when planning multi-column grouping. This
patch adds a query option ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE that allows
users to enable/disable the algorithm if necessary. Default is True.
Testing:
- Add testAggregationNoTupleAnalysis.
This test is based on TpcdsPlannerTest#testQ19 but with
ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE set to false.
Change-Id: Iabd8daa3d9414fc33d232643014042dc20530514
Reviewed-on: http://gerrit.cloudera.org:8080/22294
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Without IMPALA-12832, the Event Processor (EP) goes into an error state
when there is an issue while obtaining a table write lock, because the
finally clause of releaseWriteLock() is always invoked even if the lock
is not held by the current thread. This patch addresses the problem by
checking whether the current thread holds the table write lock before
releasing it.
Note: with IMPALA-12832, the EP invalidates the table when an error is
encountered, which is still an overhead. With this patch, the EP
neither goes into the error state nor invalidates the table when this
issue is encountered.
Testing:
- Added an end-to-end test to verify the same.
Change-Id: Ib2e4c965796dd515ab8549efa616f72510ca447f
Reviewed-on: http://gerrit.cloudera.org:8080/22080
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, some table metrics, such as the estimated memory usage
and the number of files, were only updated when a "FULL" Thrift object
of the table was requested. As a result, if a user ran a DESCRIBE
command on a table, and then tried to find the table on the Top-N page
of the web UI, the user would not find it.
This patch fixes the issue by updating the table metrics as soon as
an HDFS table is loaded. With this, no matter what Thrift object type of
the table is requested, the metrics will always be updated and
displayed on the web UI.
Testing:
- Added two custom cluster tests in test_web_pages.py to make sure that
table stats can be viewed on the web UI after DESCRIBE, for both
legacy and local catalog modes.
Change-Id: I6e2eb503b0f61b1e6403058bc5dc78d721e7e940
Reviewed-on: http://gerrit.cloudera.org:8080/22014
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds support for a new MERGE clause that covers the
condition when the source statement's rows do not match the target
table's rows. Example: MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET t.column = "a";
This change also adds support for using WHEN NOT MATCHED BY TARGET
explicitly; this is equivalent to WHEN NOT MATCHED.
Tests:
- Parser tests for the new language elements.
- Analyzer and planner test for WHEN NOT MATCHED BY SOURCE/TARGET
clauses.
- E2E tests for WHEN NOT MATCHED BY SOURCE clause.
Change-Id: Ia0e0607682a616ef6ad9eccf499dc0c5c9278c5f
Reviewed-on: http://gerrit.cloudera.org:8080/21988
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The fair-scheduler file contains part of the configuration for Admission
Control. This change adds some better error handling to the parsing of
this file. Where it is safe to do so, new exceptions are thrown; this
will cause Impala to refuse to start. This is consistent with other
serious configuration errors. Where new exceptions might cause problems
with existing configurations, or for less dangerous faults, new warnings
are written to the server log.
For the recently added User Quota configuration (IMPALA-12345), an
exception is now thrown when a duplicate snippet of configuration is
found.
New warning log messages are added for these cases:
- when a user quota at the leaf level is completely ignored because of
a user quota at the root level
- when there is no user ACL on a leaf level queue. This prevents any
queries from being submitted to the queue.
Change-Id: Idcd50442ce16e7c4346c6da1624216d694f6f44d
Reviewed-on: http://gerrit.cloudera.org:8080/22209
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
An incomplete COMPUTE STATS during data loading revealed a bug in
AggregationNode.java where estimateNumGroups() can return a value less
than -1.
This patch fixes the bug by implementing
PlanNode.smallestValidCardinality() and
MathUtil.saturatingMultiplyCardinalities(). Both functions validate
that their arguments are valid cardinality numbers.
smallestValidCardinality() correctly compares two cardinality numbers
and returns the smaller valid one. It generalizes and replaces the
static function PlanNode.capCardinalityAtLimit().
saturatingMultiplyCardinalities() adds validation and normalization
over MathUtil.saturatingMultiply().
This patch also reorders the logic of tuple-based estimation from
IMPALA-13405 such that negative estimates are handled properly.
Testing:
- Added more preconditions in AggregationNode.java.
- Added CardinalityTest.testSmallestValidCardinality and
MathUtilTest.testSaturatingMultiplyCardinality.
- Added test in resource-requirements.test that will consistently fail
without this fix.
- Pass testResourceRequirement.
Change-Id: Ib862a010b2094daa2cbdd5d555e46443009672ad
Reviewed-on: http://gerrit.cloudera.org:8080/22235
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Within DistributedPlanner.java, there are several places where the
planner needs to insert an extra merge aggregation node. This requires
transferring HAVING conjuncts from the preaggregation node to the merge
aggregation node, unsetting the limit, and recomputing the stats of the
preaggregation node. However, the stats recomputation is not
consistently done, and an inefficient recomputation might happen.
This patch fixes the AggregationNode creation order in
DistributedPlanner.java so that stats recomputation is done
consistently and efficiently.
Testing:
- Pass core tests.
Change-Id: Ica8227fdc46a1ef59bef5ae5424ba3907827411d
Reviewed-on: http://gerrit.cloudera.org:8080/22046
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
This patch enables the VALIDATE_CARDINALITY test option in several
planner tests that touch aggregation nodes. Enabling it has revealed
three bugs.
First, in IMPALA-13405, the cardinality estimate of the MERGE phase
aggregation is not capped against the output cardinality of the
EXCHANGE node. This patch fixes it by adding such a cap.
Second, the tuple-based optimization of IMPALA-13405 can cause
cardinality underestimation if a HAVING predicate exists. This is due
to the default selectivity of 10% applied for each HAVING predicate.
This patch skips tuple-based optimization if AggregationNode.conjuncts_
is ever non-empty. It stays skipped on stats recomputation, even if
conjuncts_ is transferred into the next merge AggregationNode above in
the plan. The optimization skip causes the following PlannerTests
(under testdata/workloads/functional-planner/queries/PlannerTest/) to
revert their cardinality estimation to their state prior to
IMPALA-13405:
- tpcds/tpcds-q39a.test
- tpcds/tpcds-q39b.test
- tpcds_cpu_cost/tpcds-q39a.test
- tpcds_cpu_cost/tpcds-q39b.test
In the future, we should consider raising the default selectivity for
HAVING predicates and undoing this skipping logic (IMPALA-13542).
Third, stats recomputation is missing after conjunct transfer in
multi-phase aggregation. This will be fixed separately by IMPALA-13526.
Testing:
- Enable cardinality validation in testMultipleDistinct*
- Update aggregation.test to reflect current PlannerTest output.
Added some test cases in aggregation.test.
- Run and pass TpcdsPlannerTest and TpcdsCpuPlannerTest.
- Selectively run some more planner tests that touch AggregationNode and
pass them.
Change-Id: Iadb4af9fd65fdb85b66fae1e403ccec8ca5eb102
Reviewed-on: http://gerrit.cloudera.org:8080/22184
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The unit test `JdbcDataSourceTest.java` was originally
implemented using the H2 database, which is no longer
available in Impala's environment. The test code was
also outdated and erroneous.
This commit fixes the failure of JdbcDataSourceTest.java
and rewrites it against Postgres, ensuring compatibility
with Impala's current environment and alignment with the
JDBC and external data source APIs. Please note, this test
is moved to the fe folder to fix the 'BackendConfig
instance not initialized' error.
To run this test, use the following command:
pushd fe && mvn -fae test -Dtest=JdbcDataSourceTest
Please note that the tests in JdbcDataSourceTest depend on
previous tests, and individual tests cannot be run
separately for this class.
Change-Id: Ie07173d256d73c88f5a6c041f087db16b6ff3127
Reviewed-on: http://gerrit.cloudera.org:8080/21805
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When OptimizeStmt created the table sink, it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected random
input and kept the output writers open for every partition, which
resulted in high memory consumption and potentially an OOM error when
the number of partitions is high.
Since we actually sort the rows before the sink, we can set
'inputIsClustered' to true, which means HdfsTableSink can write files
one by one: whenever it gets a row that belongs to a new partition, it
knows it can close the current output writer and open a new one.
Testing:
* added e2e test
Change-Id: I8d451c50c4b6dff9433ab105493051bee106bc63
Reviewed-on: http://gerrit.cloudera.org:8080/22192
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Document the feature introduced in IMPALA-12345. Add a few more tests
to the QuotaExamples test that demonstrate the examples used in the
docs.
Clarify in docs and code the behavior when a user is a member of more
than one group for which there are rules. In this case the least
restrictive rule applies.
Also document the '--max_hs2_sessions_per_user' flag introduced in
IMPALA-12264.
Change-Id: I82e044adb072a463a1e4f74da71c8d7d48292970
Reviewed-on: http://gerrit.cloudera.org:8080/22100
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala has several PlannerTests that validate the EXTENDED explain
output, including cardinality. At the EXTENDED level, the plan displays
stored table stats from HMS like 'numRows' and 'totalSize', which can
vary between data loads. They are not validated by PlannerTest, but
frequent changes of these lines can disturb the code review process
because they are mostly noise.
This patch provides a python script, restore-stats-on-planner-tests.py,
to fix the table stats information in selected .test files. The test
files to check and the fixed table stats are declared inside the
script. It currently focuses on tests under
functional-planner/queries/PlannerTest/tpcds/ and some that test
against the tpcds_partitioned_parquet_snap table.
critique-gerrit-review.py is updated to run with python3, trigger
restore-stats-on-planner-tests.py, and warn if any unnecessary table
stats change is detected.
This patch also fixes the table sizes for tests under
functional-planner/queries/PlannerTest/tpcds_cpu_cost/ because all
tests there run with the synthetic stats declared in stats-3TB.json.
Before this patch, the table stats printed in the plan were the real
stats from HMS. After this patch, the displayed table stats are
calculated from stats-3TB.json. See IMPALA-12726 for more detail on
large-scale planner test simulation.
Testing:
- Manually ran the script and confirmed that the stats lines are
replaced correctly.
- Ran the affected PlannerTests and all passed.
Change-Id: I27bab7cee93880cd59f01b9c2d1614dfcabdc682
Reviewed-on: http://gerrit.cloudera.org:8080/22045
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As agreed in the JIRA discussions, the current PR extends the existing
TRIM functionality with support for the SQL-standard TRIM-FROM syntax:
TRIM({[LEADING / TRAILING / BOTH] | [STRING characters]} FROM expr).
It is implemented based on the existing LTRIM / RTRIM / BTRIM family of
functions prepared earlier in IMPALA-6059 and extended for UTF-8 in
IMPALA-12718. It is also partly based on the abandoned PR
https://gerrit.cloudera.org/#/c/4474 and the similar EXTRACT-FROM
functionality from
https://github.com/apache/impala/commit/543fa73f3a846f0e4527514c993cb0985912b06c.
Supported syntaxes:
Syntax #1 TRIM(<where> FROM <string>);
Syntax #2 TRIM(<charset> FROM <string>);
Syntax #3 TRIM(<where> <charset> FROM <string>);
"where": Case-insensitive trim direction. Valid options are "leading",
"trailing", and "both". "leading" means trimming characters from the
start; "trailing" means trimming characters from the end; "both" means
trimming characters from both sides. For Syntax #2, since no "where"
is specified, the option "both" is implied by default.
"charset": Case-sensitive characters to be removed. This argument is
regarded as a character set going to be removed. The occurrence order
of each character doesn't matter and duplicated instances of the same
character will be ignored. NULL argument implies " " (standard space)
by default. Empty argument ("" or '') makes TRIM return the string
untouched. For Syntax #1, since no "charset" is specified, it trims
" " (standard space) by default.
"string": Case-sensitive target string to trim. This argument can be
NULL.
The UTF8_MODE query option is honored by TRIM-FROM, similarly to the
existing TRIM(). UTF8_TRIM-FROM can be used to force UTF-8 mode
regardless of the query option.
Design Notes:
1. No-BE. Since the existing LTRIM / RTRIM / BTRIM functions fully cover
all needed use-cases, no backend logic is required. This differs from
similar EXTRACT-FROM.
2. Syntax wrapper. TrimFromExpr class was introduced as a syntax
wrapper around FunctionCallExpr, which instantiates one of the regular
LTRIM / RTRIM / BTRIM functions. TrimFromExpr's role is to maintain
the integrity of the "phantom" TRIM-FROM built-in function.
3. No TRIM keyword. Following EXTRACT-FROM, no "TRIM" keyword was
added to the language. Although a keyword would generally allow easier
and better parsing, on the negative side it restricts the token's usage
in general contexts. However, leading/trailing/both, which were
previously reserved words, are now added as keywords to make their
usage possible without escaping.
Change-Id: I3c4fa6d0d8d0684c4b6d8dac8fd531d205e4f7b4
Reviewed-on: http://gerrit.cloudera.org:8080/21825
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
The test test_admission_controller_with_quota_configs() is designed to
be similar to test_admission_controller_with_configs(), which uses the
'queueB' queue. The newer test uses a newly added queue, 'queueF'.
Because the Admission Control configuration is split across two files,
and because of user stupidity, the queue timeout configuration for
'queueB' was not copied when the new test was created. This causes
queued queries to be timed out while waiting for admission, which
confuses the test.
Set pool-queue-timeout-ms.root to 600000 for queueF in
llama-site-test2.xml.
Change-Id: I1378cd4e42ed5629b92b1c16dd17d4d16ec4a19d
Reviewed-on: http://gerrit.cloudera.org:8080/22126
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The file descriptors in HdfsPartition are cached as byte arrays to keep
the memory footprint low. They are transformed into actual
FileDescriptor objects once queried.
This patch changes IcebergContentFileStore to similarly use byte arrays
as an internal representation for file descriptors. Note, file
descriptors for Iceberg tables have 2 components: one is the same as in
HdfsPartition and the other stores Iceberg specific file metadata in an
additional byte array.
Measurements and observations:
- I have a test table that has 110k data files. For this table the JVM
memory usage in the catalogd got reduced from 80MB to 65MB.
- Both HdfsPartition.FileDescriptor and IcebergContentFileStore use
flatbuffers and in turn byte arrays to represent file descriptors and
these byte arrays are shared between these 2 places. As a result
there is no redundancy in storing the file descriptors both for the
Iceberg and the Hdfs table.
- There is no measurable difference in planning times with this patch.
Change-Id: I9d7794df999bdaf118158eace26cea610f911c0a
Reviewed-on: http://gerrit.cloudera.org:8080/21869
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
When Trino writes Puffin stats for a column, it includes the NDV as a
property (with key "ndv") in the "statistics" section of the
metadata.json file, in addition to the Theta sketch in the Puffin file.
When we are only reading the stats and not writing/updating them, it is
enough to read this property if it is present.
After this change, Impala only opens and reads a Puffin stats file if it
contains stats for at least one column for which the "ndv" property is
not set in the metadata.json file.
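A sketch of the resulting short-circuit (the map shape and method are
illustrative, not Impala's actual Iceberg metadata API):
```java
import java.util.Map;
import java.util.Set;

class PuffinReadSketch {
  // Decide whether the Puffin file must be opened at all: it is needed
  // only if some required column lacks an "ndv" property in the
  // metadata.json statistics entry.
  static boolean needPuffinRead(
      Map<Integer, Map<String, String>> statsPropsByFieldId,
      Set<Integer> requiredFieldIds) {
    for (int fieldId : requiredFieldIds) {
      Map<String, String> props = statsPropsByFieldId.get(fieldId);
      if (props == null || !props.containsKey("ndv")) return true;
    }
    return false;  // every NDV is available without Puffin I/O
  }
}
```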
Testing:
- added a test in test_iceberg_with_puffin.py that verifies that the
Puffin stats file is not read if the metadata.json file contains
the NDV property. It uses the newly added stats file with corrupt
datasketches: 'metadata_ndv_ok_sketches_corrupt.stats'.
Change-Id: I5e92056ce97c4849742db6309562af3b575f647b
Reviewed-on: http://gerrit.cloudera.org:8080/21959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The intention of test_admission_controller_with_quota_configs() is to
run a workload with a variety of outcomes in a pool that has Admission
Control User Quotas configured. The idea was that the User Quotas
configuration would not affect the workload that is run by
run_admission_test(). The configuration for 'queueF' limits the number
of concurrent queries that can be run by any user to 30. In the test
there is only one user, and the number of queries that are run is 50,
so there is potential for the User Quotas configuration to affect the
operation of the test. Fix this by bumping the Quota limit to 50.
TESTING
I ran tests in a similar environment to that where failures were
observed. Without the fix I saw a failure, and with the fix there were
no failures. This isn't sufficient to prove this fix is all that is
needed, but the change is safe and isolated.
Change-Id: Ie2cc81a5b95d07154b73d32daf67617c79283ac8
Reviewed-on: http://gerrit.cloudera.org:8080/22096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allow administrators to configure per user limits on queries that can
run in the Impala system.
In order to do this, there are two parts. Firstly we must track the
total counts of queries in the system on a per-user basis. Secondly
there must be a user model that allows rules that control per-user
limits on the number of queries that can be run.
In a Kerberos environment the user names that are used for both the user
model and at runtime are short user names, e.g. testuser when the
Kerberos principal is testuser/scm@EXAMPLE.COM
TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue and we do not want to block other queries due to quotas, and
(2) it is easy for users to understand what is going on when queries
are rejected at submission time.
Note that when running in configurations without an Admission Daemon
then Admission Control does not have perfect information about the
system and over-admission is possible for User-Level Admission Quotas
in the same way that it is for other Admission Control controls.
The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.
Two new elements 'userQueryLimit' and 'groupQueryLimit' can be added to
the fair-scheduler.xml file. These elements can be placed on the root
configuration, which applies to all pools, or on a pool configuration.
The 'userQueryLimit' element has 2 child elements: "user"
and "totalCount". The 'user' element contains the short names of users,
and can be repeated, or have the value "*" for a wildcard name which
matches all users. The 'groupQueryLimit' element has 2 child
elements: "group" and "totalCount". The 'group' element contains group
names.
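An illustrative snippet (the element names follow the description
above; the surrounding fair-scheduler.xml structure and the values are
assumptions):
```xml
<allocations>
  <!-- Root-level rule: any single user may run at most 20 queries. -->
  <userQueryLimit>
    <user>*</user>
    <totalCount>20</totalCount>
  </userQueryLimit>
  <queue name="root.poolA">
    <!-- Pool-level rule: members of group 'analysts' get at most 5. -->
    <groupQueryLimit>
      <group>analysts</group>
      <totalCount>5</totalCount>
    </groupQueryLimit>
  </queue>
</allocations>
```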
The root level rules and pool level rules must both be passed for a new
query to be queued. The rules dictate a maximum number of queries that
can be run by a user. When evaluating rules at either the root level, or
at the pool level, when a rule matches a user then there is no more
evaluation done.
To support reading the 'userQueryLimit' and 'groupQueryLimit' elements,
RequestPoolService is enhanced.
If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.
More comprehensive documentation of the user model will be provided in
IMPALA-12943.
TESTING
New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.
Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Non-deterministic functions should make a location ineligible
for caching. Unlike existing definitions of non-determinism
like FunctionCallExpr.isNondeterministicBuiltinFn(),
the non-determinism needs to apply over time and across query
boundaries, so it is a broader list of functions.
The following are considered non-deterministic in this change:
1. Random functions like rand/random/uuid
2. Current time functions like now/current_timestamp
3. Session/system information like current_user/pid/coordinator
4. AI functions
5. UDFs
With enable_expr_rewrites=true, constant folding can replace
some of these with a single constant (e.g. now() becomes a specific
timestamp). This is not a correctness problem for tuple caching,
because the specific value is incorporated into the cache key.
Testing:
- Added test cases to TupleCacheTest
Change-Id: I9601dba87b3c8f24cbe42eca0d8070db42b50488
Reviewed-on: http://gerrit.cloudera.org:8080/22011
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch makes Impala create only one Ranger policy for a GRANT
statement when multiple columns are specified, to reduce the number of
policies created on the Ranger server.
Note that this patch relies on RANGER-4585 and RANGER-4638.
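A rough sketch of the resulting request shape (class and helper names
are hypothetical; the map keys match the Ranger resource keys described
at the top of this log):
```java
import java.util.HashMap;
import java.util.Map;

class MultiColumnGrantSketch {
  // One resources map (and thus one Ranger policy) covering every
  // column in the GRANT, instead of one request per column. The
  // "column" value is the comma-separated column list described below.
  static Map<String, String> resourcesFor(String db, String table,
      String... columns) {
    Map<String, String> resources = new HashMap<>();
    resources.put("database", db);
    resources.put("table", table);
    resources.put("column", String.join(",", columns));
    return resources;
  }
}
```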
Testing:
- Manually verified that Impala's catalog daemon only sends one
GrantRevokeRequest to the Ranger plug-in and that the value of the
key 'column' is a comma-separated list of column names involved in
the GRANT statement.
- Added an end-to-end test to verify only one Ranger policy will be
created in a multi-column GRANT statement.
Change-Id: I2b0ebba256c7135b4b0d2160856202292d720c6d
Reviewed-on: http://gerrit.cloudera.org:8080/21940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This makes several changes to the Calcite planner to improve the
generated exceptions when there are errors:
1. When the Calcite parser produces a SqlParseException, it is
converted to Impala's regular ParseException.
2. When Calcite validation fails, it produces a CalciteContextException,
which is a wrapper around the real cause. These validation errors are
converted into AnalysisExceptions.
3. This produces UnsupportedFeatureException for non-HDFS table types
like Kudu, HBase, Iceberg, and views. It also produces
UnsupportedFeatureException for HDFS tables with complex types (which
would otherwise hit ClassCastException).
4. This changes exception handling in CalciteJniFrontend.java so it
does not convert exceptions to InternalException. The JNI code prints
the stacktrace for exceptions, so this drops the existing call to
print the exception stack trace.
Testing:
- Ran some end-to-end tests with a mode that continues past failures
and examined the output.
Change-Id: I6702ceac1d1d67c3d82ec357d938f12a6cf1c828
Reviewed-on: http://gerrit.cloudera.org:8080/21989
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
The following query was failing.
select sum(len_orderkey), sum(len_comment)
from (
select
length(group_concat(distinct cast(l_orderkey as string))) len_orderkey,
length(group_concat(distinct(l_comment))) len_comment
from tpch.lineitem
group by l_comment
) v
There is code in AggregationFunction for group_concat that ignores an
implicit cast. The 'isImplicitCast_' member was being used directly in
this function, but its accessor, isImplicitCast(), is overridden in the
Calcite planner. A small change was needed to call the isImplicitCast()
method rather than use the member variable.
Change-Id: Idec41597b40a533bc0774b4ff2ab5059c7f324e2
Reviewed-on: http://gerrit.cloudera.org:8080/22025
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch improves PlannerTest by printing the path to the .test file
that failed. It also skips printing the VERBOSE plan if
PlannerTestOption.EXTENDED_EXPLAIN is specified, since the EXTENDED
level already contains sufficient details, including tuples, sizes, and
cardinality.
This patch also changes the target path for saving the updated
end-to-end test file when the --update_results parameter is set. For
example:
Before:
$EE_TEST_LOGS_DIR/tpcds-decimal_v2-q98.test
After:
$EE_TEST_LOGS_DIR/impala_updated_results/tpcds/queries/tpcds-decimal_v2-q98.test
Also ensure that the updated test file ends with a newline.
Testing:
- Manually ran a modified PlannerTest that fails with the
EXTENDED_EXPLAIN test option set. Verified that the .test file name is
printed and the VERBOSE plan is not printed.
- Manually ran TestTpcdsDecimalV2Query with the --update_results
parameter and confirmed that the updated test file path is correct.
Change-Id: I5e15af93d9016d78ac0575c433146c8513a11949
Reviewed-on: http://gerrit.cloudera.org:8080/22030
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>