IMPALA-13770 added code to call Close() on
IcebergMergeCase::{output_exprs_,filter_conjuncts_}. However, these
expressions are created by IcebergMergeCasePlan, with pointers to the
expressions copied to possibly multiple IcebergMergeCase objects.
Therefore, although it does not cause errors in practice, it is better
to close the expressions in IcebergMergeCasePlan.
This change adds a Close() method to IcebergMergeCasePlan that closes
these expressions.
This patch also calls Close() on IcebergMergeSinkConfig::merge_action_
and IcebergMergeSink::merge_action_evaluator_, which were not closed
previously.
Change-Id: Iefa998dea173051702ef08c03b489178a17a653f
Reviewed-on: http://gerrit.cloudera.org:8080/22522
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When using a native UDF in the target value of an UPDATE statement or in
a filter predicate or target value of a MERGE statement, Impala crashes
with the following DCHECK:
be/src/exprs/expr.cc:47 47 DCHECK(cache_entry_ == nullptr);
This DCHECK is in the destructor of Expr, and it fires because Close()
has not been called for the expression. In the UPDATE case this is
caused by MultiTableSinkConfig: it creates child DataSinkConfig objects
but does not call Close() on them, and consequently these child sink
configs do not call Close() on their output expressions.
In the MERGE case it is because various expressions are not closed in
IcebergMergeCase and IcebergMergeNode.
This patch fixes the issue by overriding Close() in MultiTableSinkConfig
to call Close() on the child sinks, and by closing the expressions in
IcebergMergeCase and IcebergMergeNode.
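The ownership rule behind both fixes can be sketched in Python as a
stand-in for the C++ sink configs (all names here are illustrative, not
Impala's real API): a config that creates child objects is responsible
for closing them.

```python
class SinkConfig:
    """Stand-in for a DataSinkConfig that owns child configs/expressions."""

    def __init__(self, children=None):
        self.children = children or []
        self.closed = False

    def close(self):
        # The fix: a parent config must close the children it created,
        # otherwise their expressions are destroyed without Close() and
        # trip the DCHECK in ~Expr().
        for child in self.children:
            child.close()
        self.closed = True

child_a, child_b = SinkConfig(), SinkConfig()
multi_sink = SinkConfig(children=[child_a, child_b])
multi_sink.close()
assert child_a.closed and child_b.closed
```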
Testing:
- Added EE regression tests for the UPDATE and MERGE cases in
iceberg-update-basic.test and iceberg-merge.test
Change-Id: Id86638c8d6d86062c68cc9d708ec9c7b0a4e95eb
Reviewed-on: http://gerrit.cloudera.org:8080/22508
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IcebergDeleteBuilder assumes that it should only receive delete
records for paths of data files that are scheduled for its
corresponding SCAN operator.
This assumption does not hold when any of the following happens:
* number of output channels in sender is 1
(currently no DIRECTED mode, no filtering)
* hit bug in DIRECTED mode, see below
* single node plan is being used (no DIRECTED mode, no filtering)
With this patch, KrpcDataStreamSender::Send() will use DIRECTED mode
even if the number of output channels is 1. It also fixes the bug in
DIRECTED mode (which was due to an unused variable 'skipped_prev_row')
and simplifies the logic a bit.
The patch also relaxes the assumption in IcebergDeleteBuilder: an error
is now returned for dangling delete records only when we are in a
distributed plan where we can assume DIRECTED distribution mode of
position delete records.
Testing
* added e2e tests
Change-Id: I695c919c9a74edec768e413a02b2ef7dbfa0d6a5
Reviewed-on: http://gerrit.cloudera.org:8080/22500
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The ASAN tests identified a heap-use-after-free issue. This patch
fixes that issue by moving the memory release to after its last use.
Local execution of custom-cluster-mgr-test under ASAN replicated the
same heap-use-after-free failure.
Testing:
Existing custom-cluster-mgr-test passed locally using debug build.
Existing custom-cluster-mgr-test passed locally using ASAN build.
Change-Id: I4fd2c9faa6daba9274f38238b952c377a07794e9
Reviewed-on: http://gerrit.cloudera.org:8080/22503
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A base directory created by INSERT OVERWRITE / TRUNCATE should be
treated differently from bases created by compaction because
IOW/TRUNCATE bases must be accepted even if there is an earlier
open writeId. This scenario can easily occur if there is
a pending write to a single partition, as this doesn't block
an IOW/TRUNCATE to another partition, while the global
minOpenWrite affects whether the base is accepted.
This change updates Impala logic to consider these bases
valid similarly to Hive.
Note that differentiating IOW/TRUNCATE from compaction is
different than in Hive, as metadata files are not considered
in Impala (IMPALA-13769). This can only cause problems when
interacting with earlier Hive versions that did not use
visibilityTxnId in the base path. I don't consider this
to be a significant regression that should block the current
critical fix.
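The distinction described above can be sketched as follows (a heavily
simplified illustration; the real validity check consults the full
write-id list, and the function name is hypothetical):

```python
def base_is_valid(base_write_id, min_open_write_id, created_by_compaction):
    """Illustrative base-directory validity check.

    IOW/TRUNCATE bases are accepted even when an earlier writeId is still
    open (e.g. a pending write to another partition), while compaction
    bases must stay below the global minimum open writeId.
    """
    if not created_by_compaction:
        return True  # INSERT OVERWRITE / TRUNCATE base: always accept
    return base_write_id < min_open_write_id

# A pending write holds writeId 5 open on another partition:
assert base_is_valid(7, 5, created_by_compaction=False)     # IOW base accepted
assert not base_is_valid(7, 5, created_by_compaction=True)  # compaction rejected
assert base_is_valid(4, 5, created_by_compaction=True)      # old compaction ok
```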
Testing:
- added regression EE/FE tests
Change-Id: I838eaf4f41bae148e558f64288a1370c0908efa4
Reviewed-on: http://gerrit.cloudera.org:8080/22499
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13201 adds test_coord_only_pool_exec_groups. This test failed in
TestAdmissionControllerWithACService due to an unaccounted AdmissionD as
the extra statestore subscriber. This patch fixes the issue by using the
expected_subscribers and expected_num_impalads arguments of
CustomClusterTestSuite._start_impala_cluster() and relying on the wait
mechanism inside it.
This patch also makes some adjustments:
1. Tweaked __run_assert_systables_query to not use
execute_query_using_vector from IMPALA-13694.
2. Removed the default_impala_client() call, first added by IMPALA-13668,
using self.client instead.
3. Fixed a minor flake8 issue at test_coord_only_pool_happy_path.
These make it possible to backport IMPALA-13201 and IMPALA-13761
together to older release/maintenance branch.
Testing:
Run the test with this command:
impala-py.test --exploration=exhaustive \
-k test_coord_only_pool custom_cluster/test_admission_controller.py
Change-Id: I00b83e706aca3325890736133b2d1dcf735b19df
Reviewed-on: http://gerrit.cloudera.org:8080/22486
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Jason Fehr <jfehr@cloudera.com>
Queries that run only against in-memory system tables are currently
subject to the same admission control process as all other queries.
Since these queries do not use any resources on executors, admission
control does not need to consider the state of executors when
deciding to admit these queries.
This change adds a boolean configuration option 'onlyCoordinators'
to the fair-scheduler.xml file for specifying that a request pool
applies only to the coordinators. When a query is submitted to a
coordinator-only request pool, no executors are required to be
running. Instead, all fragment instances are executed exclusively on
the coordinators.
A new member was added to the ClusterMembershipMgr::Snapshot struct
to hold the ExecutorGroup of all coordinators. This object is kept up
to date by processing statestore messages and is used when executing
queries that either require the coordinators (such as queries against
sys.impala_query_live) or that use a coordinator-only request pool.
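The resulting selection logic can be sketched roughly as follows
(hypothetical names; the real logic lives in the C++ scheduler and
ClusterMembershipMgr):

```python
class Snapshot:
    """Stand-in for ClusterMembershipMgr::Snapshot."""

    def __init__(self, executor_groups, all_coordinators):
        self.executor_groups = executor_groups    # group name -> backends
        self.all_coordinators = all_coordinators  # new member from this patch

def select_exec_group(pool_only_coordinators, group_name, snapshot):
    # Coordinator-only pools schedule all fragment instances on the
    # coordinators, so no executors need to be running.
    if pool_only_coordinators:
        return snapshot.all_coordinators
    return snapshot.executor_groups[group_name]

snap = Snapshot({"default": ["exec1", "exec2"]}, ["coord1"])
assert select_exec_group(True, "default", snap) == ["coord1"]
assert select_exec_group(False, "default", snap) == ["exec1", "exec2"]
```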
Testing was accomplished by:
1. Adding cluster membership manager ctests to assert cluster
membership manager correctly builds the list of non-quiescing
coordinators.
2. RequestPoolService JUnit tests to assert the new optional
<onlyCoords> config in the fair scheduler xml file is correctly
parsed.
3. ExecutorGroup ctests modified to assert the new function.
4. Custom cluster admission controller tests to assert queries with a
coordinator only request pool only run on the active coordinators.
Change-Id: I5e0e64db92bdbf80f8b5bd85d001ffe4c8c9ffda
Reviewed-on: http://gerrit.cloudera.org:8080/22249
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Just like the IMPALA-13634 bug, the unique_database fixture may taint
ImpalaTestSuite.client by setting the 'sync_ddl=True' option and never
cleaning it up before the test method uses the client.
This patch fixes the unique_database fixture by always creating a fresh
HS2 client for CREATE DATABASE and DROP DATABASE.
Testing:
- Caught an assert error at test_set_fallback_db_for_functions
Expected: "default.fn() unknown for database default"
Actual: "functional.fn() unknown for database functional"
In this case, the shared ImpalaTestSuite.client may have changed its
database via ImpalaTestSuite.run_test_case() or
ImpalaTestSuite.execute_wrapper() in another test method.
Removing the unused 'vector' argument somehow fixed the issue.
- For both query_test/test_udfs.py and metadata/test_ddl.py:
- Fixed flake8 issues and removed an unused pytest fixture.
- Run and pass the tests in exhaustive exploration.
Change-Id: Ib503e829552d436035c57b489ffda0d0299f8405
Reviewed-on: http://gerrit.cloudera.org:8080/22471
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch migrates query_test/test_kudu.py to use the hs2 client protocol.
Here are the steps taken:
- Override default_test_protocol() to return 'hs2'.
See documentation in ImpalaTestSuite about what this method does.
- Remove usage of deprecated cursor and unique_cursor fixture.
- Replace all direct ImpalaTestSuite.client usage with helper
function call such as execute_query() or execute_query_using_vector().
- Remove all "SET" query invocation and replace it with passing
exec_option dictionary to helper method.
- Verify Kudu modified / inserted rows by reading runtime profile
counters instead of reading query output.
- Add HS2_TYPES section at test cases where only TYPES exist.
- Remove all drop_impala_table_after_context() calls and replace them
with proper use of the unique_database fixture.
KuduTestSuite is pinned to the hs2 protocol dimension. Meanwhile,
CustomKuduTest is pinned to the beeswax protocol dimension until a
proper migration can be done.
Added following convenience methods:
- ImpalaTestSuite.default_test_protocol() to allow an individual test
class to override its default test protocol.
- ImpylaHS2ResultSet.tuples() to access the raw HS2 result set that is
a list of tuples.
This patch also added several literal constants around test vector
dimension to help with traceability.
Fixed a bug where "SHOW PARTITIONS" over a Kudu table via hs2 shows a
NULL '#Replicas' column because TResultRowBuilder does not have an
overload for int values. Adjusted the numFiles variable inside
HdfsTable.getTableStats() from int to long to match the Type.BIGINT of
column '#Files'.
Fixed py.test classes that do not inherit from BaseTestSuite. Fixed
flake8 issues in test_statestore.py.
Testing:
- Run and pass all tests extended from KuduTestSuite in exhaustive mode.
Change-Id: I5f38baf5a0bbde1a1ad0bb4666c300f4f3cabd33
Reviewed-on: http://gerrit.cloudera.org:8080/22358
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The custom cluster tests utilize the retry() function defined in
retry.py. This function takes as input another function that does the
assertions. This assertion function used to have a single boolean
parameter indicating whether the retry was on its last attempt. In
actuality, this boolean was never used and thus caused flake8 failures.
This change removes this unused parameter from the assertion function
passed in to the retry function.
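The retry() contract after this change can be sketched as follows (a
hypothetical simplification; the real helper lives in retry.py and its
signature may differ):

```python
import time

def retry(assert_fn, max_attempts=3, sleep_sec=0):
    """Re-run assert_fn until it stops raising AssertionError.

    The assertion function now takes no parameters; previously it
    received an unused 'last attempt' boolean.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return assert_fn()
        except AssertionError:
            if attempt == max_attempts:
                raise
            time.sleep(sleep_sec)

calls = []
def flaky_assert():
    calls.append(1)
    assert len(calls) >= 2, "not ready yet"
    return "ok"

assert retry(flaky_assert) == "ok"
assert len(calls) == 2  # failed once, succeeded on the second attempt
```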
Change-Id: I1bce9417b603faea7233c70bde3816beed45539e
Reviewed-on: http://gerrit.cloudera.org:8080/22452
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The RemoveExecutor() function within the ExecutorGroup class has a
potential use-after-free bug. Since the function takes an object
reference as input, the iterator that erases the backend could destroy
the very object referenced by the function input.
This change fixes the issue by storing the necessary data from the
provided input object and then referencing that stored data after the
erase has occurred.
Change-Id: If14a3c89ee631ebb05efc9a47745f7e63ab98690
Reviewed-on: http://gerrit.cloudera.org:8080/22453
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds a __reset_impala_clients() method in ImpalaConnection.
__reset_impala_clients() simply clears the configuration. It is called
in each setup_method() to ensure that each EE test uses a clean test
client. All subclasses of ImpalaTestSuite that declare a setup() method
are refactored to declare setup_method() instead, to match the newer
py.test convention. teardown_method() is also implemented to complement
setup_method(). See "Method and function level setup/teardown" at
https://docs.pytest.org/en/stable/how-to/xunit_setup.html.
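The xunit-style hooks adopted here follow this shape (a minimal sketch;
the hook bodies are illustrative, not Impala's real client handling):

```python
class TestExample:
    def setup_method(self, method):
        # py.test runs this before every test method: each test starts
        # with a fresh client rather than inheriting tainted state.
        self.client = "fresh-client"

    def teardown_method(self, method):
        # py.test runs this after every test method, even on failure.
        self.client = None

    def test_uses_clean_client(self):
        assert self.client == "fresh-client"
```

py.test invokes these hooks automatically around each test method;
driving them by hand shows the lifecycle.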
CustomClusterTestSuite fully overrides setup_method() and
teardown_method() because its subclasses can be destructive. Custom
cluster test methods often restart the whole Impala cluster, rendering
the default Impala clients initialized at setup_class() unusable. Each
subclass of CustomClusterTestSuite is responsible for ensuring that the
Impala clients it uses are in a good state.
This patch improves BeeswaxConnection and ImpylaHS2Connection to only
consider non-REMOVED options as their default options. They look up
valid (not REMOVED) query options in their own appropriate way,
memorizing the option names as lowercase strings and the values as
strings. List values are wrapped in double quotes. The log line in
ImpalaConnection.set_configuration_option() is differentiated from how
a SET query looks.
Note that ImpalaTestSuite.run_test_case() modifies and restores query
options written in .test files by issuing SET queries, not by calling
ImpalaConnection.set_configuration_option(). This remains unchanged.
Query options are now consistently lowercased everywhere in the Impala
test infrastructure. Fixed several tests that had unknowingly overridden
the 'exec_option' vector dimension due to a case-sensitivity mismatch.
Also fixed some flake8 issues.
Added convenience methods execute_query_using_vector() and
create_impala_client_from_vector() in ImpalaTestSuite.
Testing:
- Pass core tests.
Change-Id: Ieb47fec9f384cb58b19fdbd10ff7aa0850ad6277
Reviewed-on: http://gerrit.cloudera.org:8080/22404
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13039 added support for the aes_encrypt() and aes_decrypt()
functions, where the user can choose the AES operation mode. Currently
if the user chooses a mode that is not supported by the OpenSSL library
at runtime, we fall back to a supported mode. There are two issues
related to this:
1. If the mode chosen by the user requires a 16 byte encryption key but
the fallback mode needs a 32 byte one, we will read past the provided
buffer.
2. If the tests are run in an environment where some modes are not
supported, we will get incorrect results. This can for example be the
case with GCM.
The first problem is solved by disabling falling back to a supported
mode in case of aes_encrypt() and aes_decrypt(). If the mode is not
supported, an error is returned instead. Note that this does not affect
the case when the user explicitly specifies NULL as the operation mode -
a default encoding will still be chosen in that case.
The second problem is solved by dividing the test queries into separate
".test" files based on operation mode (and one for queries that are
expected to fail). Each operation mode that is not always supported is
tested first with a probing query and the corresponding tests are only
run if the probing query succeeds, i.e. the mode is supported.
Change-Id: I27146cda4fa41965de35821315738e5502d1b018
Reviewed-on: http://gerrit.cloudera.org:8080/22419
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Closing the transports could hang in TAcceptQueueServer if there was
an error during the SSL handshake. As the TSSLSocket is wrapped in a
TBufferedTransport and TBufferedTransport::close() calls flush(),
TSSLSocket::flush() was also called, which led to retrying the
handshake in an unclean state. This led to hanging indefinitely with
OpenSSL 3.2. Another potential error is that if flush() throws an
exception, the underlying TTransport's close() won't be called.
Ideally this would be solved in Thrift (THRIFT-5846). As a quick
fix, this change adds a subclass of TBufferedTransport that doesn't
call flush(). This is safe to do because generated TProcessor
subclasses call flush() every time the client/server sends
a message.
Testing:
- the issue was caught by thrift-server-test/KerberosOnAndOff
and TestClientSsl::test_ssl hanging till killed
Change-Id: I4879a1567f7691711d73287269bf87f2946e75d2
Reviewed-on: http://gerrit.cloudera.org:8080/22368
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Hive uses URL encoding to format partition strings when creating
partition folders, e.g. "00:00:00" is encoded into "00%3A00%3A00".
When you create a partition with "00:00:00" as the value of a
string-typed partition column "p", the underlying partition folder is
"p=00%3A00%3A00".
When parsing partition folders, Impala URL-decodes the folder names to
get the correct partition values. This is correct for the ALTER TABLE
RECOVER PARTITIONS command, which gets the partition strings from file
paths. However, Impala shouldn't URL-decode partition strings coming
from HMS events, since those are not URL-encoded; they are the original
partition values. Decoding them causes HMS events on partitions whose
value strings contain percent signs to be matched to the wrong
partitions.
This patch fixes the issue by only URL-decoding the partition strings
that come from file paths.
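Python's urllib illustrates the asymmetry the patch fixes (Hive's exact
escape character set differs slightly from urllib's, so this is only a
sketch of the mechanism):

```python
from urllib.parse import quote, unquote

# Hive percent-encodes special characters when building partition folders.
partition_value = "00:00:00"
folder_name = "p=" + quote(partition_value, safe="")
assert folder_name == "p=00%3A00%3A00"

# Decoding is correct for strings parsed from file paths...
assert unquote("00%3A00%3A00") == "00:00:00"

# ...but an HMS event already carries the raw partition value; decoding
# it again corrupts values that contain a literal percent sign.
raw_event_value = "50%3A"              # the partition value really is "50%3A"
assert unquote(raw_event_value) == "50:"  # decoded: matches the wrong partition
```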
Tests:
- Ran tests/metadata/test_recover_partitions.py
- Added custom-cluster test.
Change-Id: I7ba7fbbed47d39b02fa0b1b86d27dcda5468e344
Reviewed-on: http://gerrit.cloudera.org:8080/22388
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
AES (Advanced Encryption Standard) is a widely recognized and respected
encryption algorithm used to protect sensitive data. It operates by
transforming plaintext into ciphertext using a symmetric key, ensuring
confidentiality and integrity. The standard specifies the Rijndael
algorithm, a symmetric block cipher that processes data blocks of 128
bits using cipher keys with lengths of 128 and 256 bits. The patch
makes use of the EVP_*() functions from the OpenSSL library.
The patch includes:
1. AES-GCM, AES-CTR, and AES-CFB encryption functionalities and
AES-GCM, AES-ECB, AES-CTR, and AES-CFB decryption functionalities.
2. Support for both 128-bit and 256-bit key sizes for GCM and ECB modes.
3. Enhancements to EncryptionKey class to accommodate various AES modes.
The aes_encrypt() and aes_decrypt() functions serve as entry
points for encryption and decryption operations, handling
encryption and decryption based on user-provided keys, AES modes,
and initialization vectors (IVs). The implementation includes key
length validation and IV vector size checks to ensure data
integrity and confidentiality.
Multiple AES modes: GCM, CFB, CTR for encryption, and GCM, CFB, CTR
and ECB for decryption are supported to provide flexibility and
compatibility with various use cases and OpenSSL features. AES-GCM
is set as the default mode due to its strong security properties.
AES-CTR and AES-CFB are provided as fallbacks for environments where
AES-GCM may not be supported. Note that AES-GCM is not available in
OpenSSL versions prior to 1.0.1, so having multiple methods ensures
broader compatibility.
Testing: The patch is thoroughly tested and the tests are included in
exprs.test.
Change-Id: I3902f2b1d95da4d06995cbd687e79c48e16190c9
Reviewed-on: http://gerrit.cloudera.org:8080/20447
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Two permission issues caused this dataload step to fail:
- Lack of X permission on the home directory (seems Linux-specific).
- LOAD statement has no right to use \tmp for some reason - using
\LOAD instead solves this. I don't know what postgres/configuration
change caused this.
Testing:
- dataload and ext-data-source related tests passed on Rocky Linux 9.5
Change-Id: I3829116f4c6d6f6cba2da824cd9f31259a15ca1b
Reviewed-on: http://gerrit.cloudera.org:8080/22383
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
The REAL type was being treated as a float. An E2E test can be found in
exprs.test where there is a cast to real. Specifically, this test...
select count(*) from alltypesagg
where double_col >= 20.2
and cast(double_col as double) = cast(double_col as real)
... was casting double_col as a double and returning the wrong result
prior to this commit.
Change-Id: I5f3cc0e50a4dfc0e28f39d81b591c1b458fd59ce
Reviewed-on: http://gerrit.cloudera.org:8080/22087
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
On Rocky Linux 9.5 a few checks started to fail because error 503 was
expected to cause a non-zero return value but was not treated as an
error. The difference is likely caused by the newer curl version;
curl's documentation seems unclear about the return value for
auth-related status codes.
The fix is to check the specific HTTP status code instead of curl's
return value.
Testing:
- webserver tests passed on Rocky Linux 9.5
Change-Id: I354aa87a1b6283aa617f0298861bd5e79d03efc7
Reviewed-on: http://gerrit.cloudera.org:8080/22379
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ImpalaTestSuite.client is always initialized as a beeswax client, and
many tests use it directly rather than going through helper methods
such as execute_query().
This patch adds a default_test_protocol parameter to conftest.py. It
controls whether ImpalaTestSuite.client is initialized as
'beeswax_client', 'hs2_client', or 'hs2_http_client'. This parameter
still defaults to 'beeswax'.
This patch also adds helper methods 'default_client_protocol_dimension',
'beeswax_client_protocol_dimension' and 'hs2_client_protocol_dimension'
for convenience and traceability.
Reduced occurrences where test methods manually override
ImpalaTestSuite.client. They are replaced by a combination of
ImpalaTestSuite.create_impala_clients and
ImpalaTestSuite.close_impala_clients.
Testing:
- Pass core tests.
Change-Id: I9165ea220b2c83ca36d6e68ef3b88b128310af23
Reviewed-on: http://gerrit.cloudera.org:8080/22336
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the existing AdminFnStmt to support operations on
EventProcessor. E.g. to pause the EventProcessor:
impala-shell> :event_processor('pause');
to resume the EventProcessor:
impala-shell> :event_processor('start');
Or to resume the EventProcessor on a given event id (1000):
impala-shell> :event_processor('start', 1000);
Admin can also resume the EventProcessor at the latest event id by using
-1:
impala-shell> :event_processor('start', -1);
Supported command actions in this patch: pause, start, status.
The command output of all actions will show the latest status of
EventProcessor, including
- EventProcessor status:
PAUSED / ACTIVE / ERROR / NEEDS_INVALIDATE / STOPPED / DISABLED.
- LastSyncedEventId: The last HMS event id which we have synced to.
- LatestEventId: The event id of the latest event in HMS.
Example output:
[localhost:21050] default> :event_processor('pause');
+--------------------------------------------------------------------------------+
| summary |
+--------------------------------------------------------------------------------+
| EventProcessor status: PAUSED. LastSyncedEventId: 34489. LatestEventId: 34489. |
+--------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.01s
If authorization is enabled, only admin users that have ALL privilege on
SERVER can run this command.
Note that there is a restriction in MetastoreEventsProcessor#start(long)
that resuming the EventProcessor at a previous event id is only allowed
when it's not in the ACTIVE state. This patch aims to expose control of
the EventProcessor to users, so MetastoreEventsProcessor is not
changed. We can investigate the restriction and see if we want to relax
it.
Note that resuming the EventProcessor at a newer event id can be done
in any state. Admins can use this to manually resolve a lag in HMS
event processing, after they have made sure all (or all important)
tables have been manually invalidated/refreshed.
A new catalogd RPC, SetEventProcessorStatus, is added for coordinators
to control the status of EventProcessor.
Tests
- Added e2e tests
Change-Id: I5a19f67264cfe06a1819a22c0c4f0cf174c9b958
Reviewed-on: http://gerrit.cloudera.org:8080/22250
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.
In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.
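The per-column "most recent stats win" selection can be illustrated as
follows (hypothetical names and data shapes; Impala compares the
'impala.lastComputeStatsTime' property against Puffin snapshot
timestamps):

```python
def choose_column_stats(hms_stats, puffin_stats):
    """Pick the newest available stats per column.

    Each input maps column -> (timestamp, ndv); either side may lack a
    column entirely.
    """
    merged = {}
    for col in set(hms_stats) | set(puffin_stats):
        candidates = [s for s in (hms_stats.get(col), puffin_stats.get(col))
                      if s is not None]
        merged[col] = max(candidates)[1]  # newest timestamp wins
    return merged

hms = {"a": (100, 10), "b": (300, 20)}
puffin = {"a": (200, 12), "c": (150, 5)}
# Column 'a' takes the newer Puffin value; 'b' and 'c' each have one source.
assert choose_column_stats(hms, puffin) == {"a": 12, "b": 20, "c": 5}
```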
This commit also renames the startup flag 'disable_reading_puffin_stats'
to 'enable_reading_puffin_stats' and the table property
'impala.iceberg_disable_reading_puffin_stats' to
'impala.iceberg_read_puffin_stats' to make them more intuitive. The
default values are flipped to keep the same behaviour as before.
The documentation of Puffin reading is updated in
docs/topics/impala_iceberg.xml
Testing:
- updated existing test cases and added new ones in
test_iceberg_with_puffin.py
- reorganised the tests in TestIcebergTableWithPuffinStats in
test_iceberg_with_puffin.py: tests that modify table properties and
other state that other tests rely on are now run separately to
provide a clean environment for all tests.
Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Reviewed-on: http://gerrit.cloudera.org:8080/22177
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Enables tuple caching on aggregates directly above scan nodes. Caching
aggregates requires that their children are also eligible for caching,
so this excludes aggregates above an exchange, union, or hash join.
Testing:
- Adds Planner tests for different aggregate cases to confirm they have
stable tuple cache keys and are valid for caching.
- Adds custom cluster tests that cached aggregates are used, and can be
re-used in slightly different statements.
Change-Id: I9bd13c2813c90d23eb3a70f98068fdcdab97a885
Reviewed-on: http://gerrit.cloudera.org:8080/22322
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
To support killing queries programmatically, this patch adds a new
type of SQL statement, called the KILL QUERY statement, to cancel and
unregister a query on any coordinator in the cluster.
A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.
A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.
Implementation:
Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".
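The broadcast-and-check protocol above can be sketched as follows
(hypothetical names, not the real coordinator RPC API):

```python
class Coord:
    def __init__(self, name, owned_queries):
        self.name = name
        self.owned = set(owned_queries)

    def try_cancel_and_unregister(self, query_id):
        # Only the owning coordinator recognizes the handle.
        if query_id in self.owned:
            self.owned.discard(query_id)  # cancel + unregister
            return True
        return False  # "Invalid or unknown query handle"

def kill_query(query_id, coordinators):
    # We don't know which coordinator owns the query, so ask all of them.
    return any(c.try_cancel_and_unregister(query_id) for c in coordinators)

coords = [Coord("c1", ["1:1"]), Coord("c2", ["123:456"])]
assert kill_query("123:456", coords)      # c2 owned it and killed it
assert not kill_query("999:999", coords)  # no coordinator owns this handle
```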
Currently, a KILL QUERY statement is not interruptible. IMPALA-13663 is
created for this.
For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.
To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.
Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
statement.
- Added file tests/common/cluster_config.py and made
CustomClusterTestSuite.with_args() composable so that common cluster
configs can be reused in custom cluster tests.
Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Impala with an external frontend hit a DCHECK in
RuntimeProfile::EventSequence::Start(int64_t start_time_ns) because the
frontend reported a remote_submit_time that is more than 3ns ahead of
the Coordinator's time. This can happen if there is clock skew between
the frontend node and the Impala Coordinator node.
This patch fixes the issue by taking the minimum of the given
remote_submit_time and the Coordinator's MonotonicStopWatch::Now().
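A minimal sketch of the clamping fix (names illustrative, not Impala's
real API):

```python
def effective_start_time(remote_submit_time_ns, coordinator_now_ns):
    # A skewed frontend clock may report a submit time in the
    # coordinator's future; clamp it so the event sequence never
    # starts after "now" on the coordinator.
    return min(remote_submit_time_ns, coordinator_now_ns)

assert effective_start_time(1_000_005, 1_000_000) == 1_000_000  # skewed ahead
assert effective_start_time(999_000, 1_000_000) == 999_000      # normal case
```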
Change-Id: If6e04219c515fddff07bfbee43bb93babb3d307b
Reviewed-on: http://gerrit.cloudera.org:8080/22360
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Amend log verification in Impala to exclude partition id info.
partition_ids assigned by catalogd may not be in serial order, so it is
best to avoid checking partition ids in the logs.
Change-Id: I27cdeb2a4bed8afa30a27d05c7399c78af5bcebb
Reviewed-on: http://gerrit.cloudera.org:8080/22198
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13154 added the method getFileMetadataStats() to
HdfsPartition.java that would return the file metadata statistics. The
method requires the corresponding HdfsPartition instance to have a
non-null field of 'fileMetadataStats_'.
This patch revises two existing constructors of HdfsPartition to provide
a non-null value for 'fileMetadataStats_'. This makes it easier for a
third party extension to set up and update the field of
'fileMetadataStats_'. A third party extension has to update the field of
'fileMetadataStats_' if it would like to use this field to get the size
of the partition since all three fields in 'fileMetadataStats_' are
defaulted to 0.
A new constructor was also added for HdfsPartition that allows a third
party extension to provide their own FileMetadataStats when
instantiating an HdfsPartition. To facilitate instantiating a
FileMetadataStats, a new constructor was added for FileMetadataStats
that takes in a List of FileDescriptor's to construct a
FileMetadataStats.
Change-Id: I7e690729fcaebb1e380cc61f2b746783c86dcbf7
Reviewed-on: http://gerrit.cloudera.org:8080/22340
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch makes the inclusion of the Iceberg position fields
conditional: they are included only if the MERGE statement lists UPDATE
or DELETE merge clauses or the target table has existing delete files.
These fields can be omitted when no delete files are created at the
sink of the MERGE statement and the table has no existing delete
files.
Additionally, this change disables MERGE for Iceberg target tables
that contain equality delete files, see IMPALA-13674.
Tests:
- iceberg-merge-insert-only planner test added
Change-Id: Ib62c78dab557625fa86988559b3732591755106f
Reviewed-on: http://gerrit.cloudera.org:8080/21931
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
AggregationNode.computeStats() estimates cardinality under a
single-node assumption. This can be an underestimation in the
preaggregation node case because the same grouping key may exist in
multiple nodes during preaggregation.
This patch adjusts the cardinality estimate using the following model
for the number of distinct values in a random sample of k rows,
previously used to calculate the ProcessingCost model by IMPALA-12657
and IMPALA-13644.
Assume we are picking k rows from an infinite population with ndv
distinct values, with the values uniformly distributed. The probability
of a given value not appearing in the sample is, in that case,
((NDV - 1) / NDV) ^ k
This is because we are making k choices, and each of them has a
(ndv - 1) / ndv chance of not being our value. Therefore the
probability of a given value appearing in the sample is:
1 - ((NDV - 1) / NDV) ^ k
And the expected number of distinct values in the sample is:
(1 - ((NDV - 1) / NDV) ^ k) * NDV
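The model above can be sketched in Python (an illustration of the
formula only, not Impala code):

```python
def expected_distinct_in_sample(ndv, k):
    """Expected number of distinct values observed when drawing k rows
    from an infinite population with `ndv` uniformly distributed
    values: (1 - ((ndv - 1) / ndv) ** k) * ndv."""
    return (1.0 - ((ndv - 1.0) / ndv) ** k) * ndv

# Sampling 1000 rows from NDV=1000 sees roughly (1 - 1/e) of the values:
assert abs(expected_distinct_in_sample(1000, 1000) - 632.3) < 1.0
# A huge sample eventually sees (almost) every value:
assert expected_distinct_in_sample(10, 100_000) > 9.999
```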
The query option ESTIMATE_DUPLICATE_IN_PREAGG is added to control
whether the new estimation logic is used.
Testing:
- Pass core tests.
Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Reviewed-on: http://gerrit.cloudera.org:8080/22047
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
getPerInstanceNdvForCpuCosting() estimates the number of distinct
values of exprs per fragment instance while accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is used
exclusively by AggregationNode.
getPerInstanceNdvForCpuCosting() runs the probabilistic formula
individually for each grouping expression and then multiplies the
results together. That matches how we estimated group NDV in the past,
where we simply multiplied the NDVs of the grouping expressions.
Recently, we added tuple-based analysis to lower the cardinality
estimate for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function that returns the globalNdv estimate for a specific aggregation
class.
To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(),
which applies the probabilistic formula over this single globalNdv
number rather than the old way, which often returned an overestimate
from the NDV multiplication method. Its use is still limited to
calculating ProcessingCost. Using it for the preagg output cardinality
will be done by IMPALA-2945.
estimatePreaggCardinality() is skipped if the data partition of the
input is a subset of the grouping expressions.
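To illustrate why applying the formula once over a single globalNdv
yields a tighter estimate than per-expression multiplication, here is a
toy Python sketch with made-up numbers (not Impala code):

```python
def preagg_ndv(global_ndv, rows_per_instance):
    """Duplicate-aware per-instance NDV using the sampling model: an
    instance processing k rows is expected to see
    (1 - ((ndv - 1) / ndv) ** k) * ndv distinct groups."""
    return (1.0 - ((global_ndv - 1.0) / global_ndv)
            ** rows_per_instance) * global_ndv

# Old approach: apply the formula per grouping expr, then multiply.
# Two exprs of NDV 100 each imply 100 * 100 = 10000 possible groups:
old_estimate = preagg_ndv(100, 50_000) * preagg_ndv(100, 50_000)
# New approach: apply the formula once to the (smaller) globalNdv
# obtained from tuple-based analysis, here assumed to be 2000:
new_estimate = preagg_ndv(2000, 50_000)
assert new_estimate < old_estimate  # the single-NDV estimate is tighter
```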
Testing:
- Run and pass PlannerTest with COMPUTE_PROCESSING_COST=True.
  ProcessingCost changes, but all cardinality numbers stay the same.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
runtime profile debugging.
Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestAdmissionControllerStress::test_mem_limit is flaky again. One
fragment instance that was expected to stay alive until the query
submission loop ends actually finished early, even though clients only
fetch 1 row every 0.5 seconds. This patch attempts to address the
flakiness in two ways.
First, it lowers batch_size to 10. A lower batch size is expected to
keep all running fragment instances running until the query admission
loop finishes.
Second, it lowers num_queries from 50 to 40 if exploration_strategy is
exhaustive. This shortens the query submission loop, especially when
submission_delay_ms is high (150 seconds). This is OK because, based on
the assertions, the test framework will only retain at most 15 active
queries and 10 in-queue queries once the query submission loop ends.
This patch also refactors SubmitQueryThread. It sets
long_polling_time_ms=100 for all queries to get a faster initial
response. The lock is removed and replaced with a threading.Event to
signal the end of the test. The thread client and query_handle scope is
made local within the run() method for proper cleanup. A timeout is set
for wait_for_admission_control instead of waiting indefinitely.
impala_connection.py is refactored so that BeeswaxConnection has
logging functionality matching ImpylaHS2Connection. Changed the
ImpylaHS2Connection._collect_profile_and_log initialization for the
possibility that the experimental Calcite planner may be able to pull
the query profile and log from the Impala backend.
Testing:
- Run and pass test_mem_limit in both TestAdmissionControllerStress and
TestAdmissionControllerStressWithACService in exhaustive exploration
10 times.
- Run and pass the whole TestAdmissionControllerStress and
TestAdmissionControllerStressWithACService in exhaustive exploration.
Change-Id: I706e3dedce69e38103a524c64306f39eac82fac3
Reviewed-on: http://gerrit.cloudera.org:8080/22351
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestDecimalFuzz.test_decimal_ops and TestDecimalFuzz.test_width_bucket
each call execute_scalar 10000 times. This patch speeds them up by
breaking each into 10 parallel test runs, where each run calls
execute_scalar 1000 times.
This patch also makes execute_scalar and execute_scalar_expect_success
run the query with long_polling_time_ms=100 if no query_options are
specified, and adds an assertion in execute_scalar_expect_success that
the result is indeed only a single row.
It also slightly changes exists_func to avoid an unused argument
warning.
Testing:
- From tests/ run and pass the following command
./run-tests.py query_test/test_decimal_fuzz.py
Change-Id: Ic12b51b50739deff7792e2640764bd75e8b8922d
Reviewed-on: http://gerrit.cloudera.org:8080/22328
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently only OpenAI sites are allowed for ai_generate_text();
this patch adds support for general AI platforms to
the ai_generate_text function. It introduces a new flag,
ai_additional_platforms, allowing Impala to access additional
AI platforms. For these general AI platforms, only the OpenAI
standard is supported, and the default API credential serves as
the API token.
The ai_api_key_jceks_secret parameter has been renamed to
auth_credential to support passing both plain text and jceks
encrypted secrets.
A new impala_options parameter is added to ai_generate_text() to
enable future extensions. Adds the api_standard option to
impala_options, with "openai" as the only supported standard.
Adds the credential_type option to impala_options to allow
plain text as the token; by default it is set to jceks.
Adds the payload option to impala_options for customized
payload input. If set, the request will use the provided
customized payload directly, and the response will follow the
OpenAI standard for parsing. The customized payload size must not
exceed 5MB.
Adding the impala_options parameter to ai_generate_text() should
be fine for backward compatibility, as this is a relatively new
feature.
Example:
1. Add the site to ai_additional_platforms, like:
ai_additional_platforms='new_ai.site,new_ai.com'
2. Example SQL:
select ai_generate_text("https://new_ai.com/v1/chat/completions",
"hello", "model-name", "ai-api-token", "platform params",
'{"api_standard":"openai", "credential_type":"plain",
"payload":"payload content"}')
Tests:
Added a new test AiFunctionsTestAdditionalSites.
Manually tested the example with the Cloudera AI platform.
Passed core and asan tests.
Change-Id: I4ea2e1946089f262dda7ace73d5f7e37a5c98b14
Reviewed-on: http://gerrit.cloudera.org:8080/22130
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Client connections can drop without an explicit close. This can
happen if the client machine resets or there is a network disruption.
Some load balancers have an idle timeout that results in the
connection becoming invalid without an explicit teardown. With
short idle timeouts (e.g. the AWS LB has a timeout of 350 seconds),
this can impact many connections.
This adds startup options to enable / tune TCP keepalive settings for
client connections:
client_keepalive_probe_period_s - idle time before doing keepalive probes
If set to > 0, keepalive is enabled.
client_keepalive_retry_period_s - time between keepalive probes
client_keepalive_retry_count - number of keepalive probes
These startup options mirror the startup options for Kudu's
equivalent functionality.
Thrift has preexisting support for turning on keepalive, but that
support uses the OS defaults for keepalive settings. To add the
ability to tune the keepalive settings, this implements a wrapper
around the Thrift socket (both TLS and non-TLS) and manually sets
the keepalive options on the socket (mirroring code from Kudu's
Socket::SetTcpKeepAlive).
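The keepalive tuning can be sketched in Python using the Linux socket
option names (an illustration of the mechanism, not Impala's actual
Thrift wrapper; the function name and values are made up):

```python
import socket

def set_tcp_keepalive(sock, probe_period_s, retry_period_s, retry_count):
    """Enable TCP keepalive on a socket and tune it, mirroring what the
    new startup options control. Uses Linux option names; other
    platforms expose different constants."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle time before the first keepalive probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, probe_period_s)
    # Interval between unanswered probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, retry_period_s)
    # Number of unanswered probes before the connection is dropped.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, retry_count)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
set_tcp_keepalive(s, probe_period_s=60, retry_period_s=10, retry_count=5)
assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) == 1
s.close()
```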
This does not enable keepalive by default to make it easy to backport.
A separate patch will turn keepalive on by default.
Testing:
- Added a custom cluster test that connects with impala-shell
and verifies that the socket has the keepalive timer.
Verified that it works on Ubuntu 20, Centos 7, and Redhat 8.
- Used iptables to manually test cases where the client is unreachable
and verified that the server detects that and closes the connection.
Change-Id: I9e50f263006c456bc0797b8306aa4065e9713450
Reviewed-on: http://gerrit.cloudera.org:8080/22254
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Setting an IV with a non-default length before setting the IV length
is not correct. With newer OpenSSL (3.2) this leads to failing
AES-GCM encryption
(likely since https://github.com/openssl/openssl/pull/22590).
The fix is to call EVP_(En/De)cryptInit_ex first without the IV,
then set the IV length and call EVP_EncryptInit_ex again with the IV
(but without the mode).
Change-Id: I243f1d487d8ba5dc44b5cc361e041c83598d83c1
Reviewed-on: http://gerrit.cloudera.org:8080/22337
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
When IcebergMergeImpl created the table sink it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected
random input and kept the output writers open for every partition,
which resulted in high memory consumption and potentially a
Memory Limit Exceeded error when the number of partitions is high.
Since we actually sort the rows before the sink we can set
'inputIsClustered' to true, which means HdfsTableSink can write
files one by one, because whenever it gets a row that belongs
to a new partition it knows that it can close the current output
writer, and open a new one.
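The difference in peak open-writer counts can be sketched with a toy
Python model (not HdfsTableSink code; names are illustrative):

```python
def open_writers_needed(partitions, input_is_clustered):
    """Peak number of simultaneously open partition writers.
    With clustered input, the current writer can be closed as soon as a
    row for a different partition arrives; with random input, every
    partition's writer must stay open until the end."""
    open_partitions, peak, prev = set(), 0, None
    for part in partitions:
        if input_is_clustered and part != prev:
            open_partitions.clear()  # previous partition's writer closed
        open_partitions.add(part)
        prev = part
        peak = max(peak, len(open_partitions))
    return peak

sorted_input = ["p1", "p1", "p2", "p3"]   # rows sorted by partition
random_input = ["p1", "p2", "p1", "p3"]   # rows in random order
assert open_writers_needed(sorted_input, True) == 1
assert open_writers_needed(random_input, False) == 3
```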
Testing:
- e2e regression test
Change-Id: I7bad0310e96eb482af9d09ba0d41e44c07bf8e4d
Reviewed-on: http://gerrit.cloudera.org:8080/22332
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ImpalaTestSuite.cleanup_db(cls, db_name, sync_ddl=1) is used to drop an
entire db. Currently it uses cls.client and sets sync_ddl based on the
parameter. This clears all the query options of cls.client and makes it
always run with the same sync_ddl value unless the test code explicitly
sets query options again.
This patch changes cleanup_db() to use a dedicated client. Tested with
some tests that use this method and observed performance improvements:
TestName Before After
metadata/test_explain.py::TestExplainEmptyPartition 52s 9s
query_test/test_insert_behaviour.py::TestInsertBehaviour::test_insert_select_with_empty_resultset 62s 15s
metadata/test_metadata_query_statements.py::TestMetadataQueryStatements::test_describe_db (exhaustive) 160s 25s
Change-Id: Icb01665bc18d24e2fce4383df87c4607cf4562f1
Reviewed-on: http://gerrit.cloudera.org:8080/22286
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds OAuth support with the following functionality:
* Load and parse the OAuth JWKS from a configured JSON file or URL.
* Read the OAuth access token from the HTTP header, which has the
  same format as the JWT Authorization Bearer token.
* Verify the OAuth token's signature with the public key in the JWKS.
* Get the username out of the payload of the OAuth access token.
* If Kerberos or LDAP is enabled, then both JWT and OAuth are
  supported together; otherwise only one of JWT or OAuth is supported.
  This has been the pre-existing policy for JWT, so OAuth follows
  the same policy.
* Impala shell side changes: OAuth options -a and --oauth_cmd
Testing:
- Added 3 custom cluster be tests in test_shell_jwt_auth.py:
- test_oauth_auth_valid: authenticate with valid token.
- test_oauth_auth_expired: authentication failure with
expired token.
- test_oauth_auth_invalid_jwk: authentication failure with
valid signature but expired.
- Added 1 custom cluster fe test in JwtWebserverTest.java
- testWebserverOAuthAuth: Basic tests for OAuth
- Added 1 custom cluster fe test in LdapHS2Test.java
- testHiveserver2JwtAndOAuthAuth: tests all combinations of
jwt and oauth token verification with separate jwks keys.
- Manually tested with a valid, invalid and expired oauth
access token.
- Passed core run.
Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6
Reviewed-on: http://gerrit.cloudera.org:8080/21728
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change enables MERGE statements with source expressions containing
subqueries that require rewrite. The change adds implementations of
the reset methods for each merge case, and properly handles resets for
MergeStmt and IcebergMergeImpl.
Tests:
- Planner test added with a merge query that requires a rewrite
- Analyzer test modified
Change-Id: I26e5661274aade3f74a386802c0ed20e5cb068b5
Reviewed-on: http://gerrit.cloudera.org:8080/22039
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Pick up a new binary build of the current toolchain version for ARM.
The toolchain version is identical, the only difference is that the new
build added binaries for Rocky/RHEL 9 to the already supported OS
versions, reaching the same level of Impala build support as
Rocky/RHEL 8.
Tested by building Impala for RHEL9 for Intel and ARM both on private
infrastructure.
Change-Id: I5fd2e8c3187cb7829de55d6739cf5d68a09a2ed3
Reviewed-on: http://gerrit.cloudera.org:8080/22323
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13620 increased datanucleus.connectionPool.maxPoolSize of HMS
from 10 to 30. When running all tests on a single node, this seems to
exhaust all 100 of PostgreSQL's max_connections and interferes with
authorization/test_ranger.py and query_test/test_ext_data_sources.py.
This patch lowers datanucleus.connectionPool.maxPoolSize to 20.
Testing:
- Pass exhaustive tests in single node.
Change-Id: I98eb27cbd141d5458a26d05d1decdbc7f918abd4
Reviewed-on: http://gerrit.cloudera.org:8080/22326
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When IcebergUpdateImpl created the table sink it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected
random input and kept the output writers open for every partition,
which resulted in high memory consumption and potentially an
OOM error when the number of partitions is high.
Since we actually sort the rows before the sink we can set
'inputIsClustered' to true, which means HdfsTableSink can write
files one by one, because whenever it gets a row that belongs
to a new partition it knows that it can close the current output
writer, and open a new one.
Testing:
- e2e regression test
Change-Id: I9bad335cc946364fc612e8aaf90858eaabd7c4af
Reviewed-on: http://gerrit.cloudera.org:8080/22325
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ALTER_TABLE events
IMPALA-12487 added an optimization: if an ALTER_TABLE event has only
trivial changes in the StorageDescriptor (e.g. removing the optional
field 'storedAsSubDirectories'=false, which defaults to false), the
file metadata reload is skipped, no matter what changes are in the
table properties. This is problematic since some HMS clients (e.g.
Spark) can modify both the table properties and the StorageDescriptor.
If there are non-trivial changes in the table properties (e.g. a
'location' change), we shouldn't skip reloading the file metadata.
Testing:
- Added a unit test to verify the same
Change-Id: Ia969dd32385ac5a1a9a65890a5ccc8cd257f4b97
Reviewed-on: http://gerrit.cloudera.org:8080/21971
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>