Impala has an optimization for analytic expressions that have a rank filter on
top of the analytic expression. It can add a top-n plan node to reduce the number
of rows examined. This is exercised by TPC-DS query 67.
The optimization logic relies on an unassigned rank conjunct within the analyzer
while creating the analytic plan node.
A slight reorganization of the code was needed to implement this optimization.
The SlotRefs for the AnalyticInfo needed to be created a little earlier than
where they were created in the previous commit.
A small fix was made to normalize binary predicates, because a non-normalized
binary predicate prevents the optimization from being used.
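For illustration, a minimal Java sketch of what normalizing a binary predicate
means here, assuming the usual convention of moving the slot reference to the
left-hand side and flipping the comparison operator; the Pred/Op names are
hypothetical, not Impala's actual planner classes:
  enum Op {
    LT, LE, GT, GE, EQ, NE;
    Op flipped() {
      switch (this) {
        case LT: return GT;
        case LE: return GE;
        case GT: return LT;
        case GE: return LE;
        default: return this;  // EQ and NE are symmetric
      }
    }
  }
  final class Pred {
    final Object lhs, rhs;        // stand-ins for expression trees
    final Op op;
    final boolean lhsIsConstant;  // true for e.g. "10 >= rk"
    Pred(Object lhs, Op op, Object rhs, boolean lhsIsConstant) {
      this.lhs = lhs; this.op = op; this.rhs = rhs; this.lhsIsConstant = lhsIsConstant;
    }
    // Rewrite "<constant> <op> <slot>" as "<slot> <flipped op> <constant>" so
    // later pattern matching only needs to handle the slot-on-the-left form.
    Pred normalized() {
      return lhsIsConstant ? new Pred(rhs, op.flipped(), lhs, false) : this;
    }
  }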
A call to checkAndApplyLimitPushdown is needed for some of the optimizations
to kick in.
A new AllProjectInfo internal class was created to hold the relationships
between the Calcite RexNode objects and the Impala Analytic expressions.
Also, IMPALA-14158 is fixed by this commit. The nullsFirst value was
incorrect when the syntax was explicit in the query.
A new Calcite planner test was added to the JUnit tests to ensure the
optimization kicks in. The new test file is the
PlannerTest/calcite/limit-pushdown-analytic-calcite.test file. This is a copy
of the limit-pushdown-analytic.test file in its parent directory but with some
modified results. Most of the differences are trivial, but IMPALA-14469 has been
filed to deal with one optimization that did not get fixed: the case where
the ORDER BY clause has a constant expression.
Change-Id: Ie6fa6781db56771b13b0cf49bd236f776016bf8d
Reviewed-on: http://gerrit.cloudera.org:8080/23317
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Since the toolchain was bumped to pick up Kudu's array column
feature (KUDU-1261), Impala's TSAN builds on the master branch
consistently break during dataload with a data race detected by TSAN.
The source of the data race lies within libkudu_client.so, and it only triggers
if the Impala build machine has both IPv4 and IPv6 addresses associated with
localhost. Until the exact root cause is found and fixed, this patch works
around the TSAN issue by fixing the KUDU_MASTER_HOSTS env var to
127.0.0.1.
Testing:
Ran a TSAN build and confirmed that no data race error is emitted.
Change-Id: I511ab625d18c6007567083557fcdf98980a6ac6f
Reviewed-on: http://gerrit.cloudera.org:8080/23507
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
Emits log messages from the OpenTelemetry SDK to the Impalad DEBUG,
INFO, WARNING, and ERROR logs. Previously, these SDK log messages
were dropped.
Modifies the function of the 'otel_debug' startup flag. This flag
defaults to 'false', which causes log messages from the SDK to be
dropped. When set to 'true', log messages from the OpenTelemetry SDK
are sent to the Impala logging system. The overall glog level is
applied to all messages sent from the OpenTelemetry SDK, so DEBUG
SDK logs will not appear in the Impalad logs unless the glog level
is greater than or equal to 2.
When a trace is successfully sent to the OpenTelemetry collector,
zero log lines are generated. When a trace cannot be sent, local
testing showed 12 lines with a total size of around 3 KB written
between the impalad.ERROR and impalad.WARNING log files. The request
body is not included in these log messages unless the glog level is
greater than or equal to 2, so the log message size does not grow or
shrink based on the size of the trace(s).
This patch also removes the completely useless
'LoggingInstrumentation' class. Previously, the 'otel_debug' flag
caused this class to log messages, but those messages provided no
insightful information.
Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: I41aba21f46233e6430eede9606be1e791071717a
Reviewed-on: http://gerrit.cloudera.org:8080/23418
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a step to REST server configuration loading that
resolves environment variable references written in the ${ENV:VARIABLE_NAME}
format. If the environment variable is not set, the reference text remains
unchanged and Impala logs an error.
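As an illustration of the substitution behavior described above (the actual
implementation is in the C++ REST server config loader; this Java sketch only
mirrors the described semantics, and the class name is made up):
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;
  final class EnvRefResolver {
    // Matches references of the form ${ENV:VARIABLE_NAME}.
    private static final Pattern ENV_REF =
        Pattern.compile("\\$\\{ENV:([A-Za-z_][A-Za-z0-9_]*)\\}");
    static String resolve(String config) {
      Matcher m = ENV_REF.matcher(config);
      StringBuffer out = new StringBuffer();
      while (m.find()) {
        String value = System.getenv(m.group(1));
        if (value == null) {
          // Unset variable: keep the reference text unchanged and log an error.
          System.err.println("Environment variable not set: " + m.group(1));
          value = m.group(0);
        }
        m.appendReplacement(out, Matcher.quoteReplacement(value));
      }
      m.appendTail(out);
      return out.toString();
    }
  }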
Tests:
- unit tests added
Change-Id: I3faccc15d012c389703c58371a4d38cca82bef60
Reviewed-on: http://gerrit.cloudera.org:8080/23457
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When IMPALA-14462 added tie-breaking logic to
ScanRangeOldestToNewestComparator, it relied on the absolute path
being unset when the relative path is set. However, the code
always sets the absolute path and uses an empty string to indicate
that it is unset. This caused the tie-breaking logic to see
two unrelated scan ranges as equal, triggering a DCHECK when
running query_test/test_tuple_cache_tpc_queries.py.
The fix is to rearrange the logic to check whether the relative
path is non-empty rather than checking whether the absolute
path is set.
Testing:
- Ran query_test/test_tuple_cache_tpc_queries.py
- Ran custom_cluster/test_tuple_cache.py
Change-Id: I449308f4a0efdca7fc238e3dda24985a2931dd37
Reviewed-on: http://gerrit.cloudera.org:8080/23495
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Impala's planner generates a single-fragment, single-
threaded scan node for queries on JDBC tables because table
statistics are not properly available from the external
JDBC source. As a result, even large JDBC tables are
executed serially, causing suboptimal performance for joins,
aggregations, and scans over millions of rows.
This patch enables Impala to estimate the number of rows in a JDBC
table by issuing a COUNT(*) query at query preparation time. The
estimation is returned via TPrepareResult.setNum_rows_estimate()
and propagated into DataSourceScanNode. The scan node then uses
this cardinality to drive planner heuristics such as join order,
fragment parallelization, and scanner thread selection.
The design leverages the existing JDBC accessor layer:
- JdbcDataSource.prepare() constructs the configuration and invokes
GenericJdbcDatabaseAccessor.getTotalNumberOfRecords().
- The accessor wraps the underlying query in:
  SELECT COUNT(*) FROM (<query>) tmptable
  ensuring correctness for both direct table scans and parameterized
  query strings (a sketch of this probe follows this list).
- The result is captured as num_rows_estimate, which is then applied
during computeStats() in DataSourceScanNode.
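A hedged sketch of the row-count probe, in plain JDBC; apart from the
SELECT COUNT(*) wrapping quoted above, the class and method names are
illustrative rather than the accessor's exact API:
  import java.sql.Connection;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;
  final class RowCountProbe {
    // Wraps the pushed-down query (or a plain table scan) in a COUNT(*) so the
    // same code path works for direct table scans and parameterized query strings.
    static long estimateNumRows(Connection conn, String innerQuery) throws SQLException {
      String sql = "SELECT COUNT(*) FROM (" + innerQuery + ") tmptable";
      try (Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery(sql)) {
        return rs.next() ? rs.getLong(1) : -1;
      }
    }
  }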
With accurate (or approximate) row counts, the planner can now:
- Assign multiple scanner threads to JDBC scan nodes instead of
falling back to a single-thread plan.
- Introduce exchange nodes where beneficial, parallelizing data
fetches across multiple JDBC connections.
- Produce better join orders by comparing JDBC row cardinalities
against native Impala tables.
- Avoid the severe underestimation caused by defaulting to wrong
table statistics, which previously led to degenerate plans.
For a sample join query mentioned in the test file,
these are the improvements:
Before Optimization:
- Cardinality fixed at 1 for all JDBC scans
- Single fragment, single thread per query
- Max per-host resource reservation: ~9.7 MB, 1 thread
- No EXCHANGE or MERGING EXCHANGE operators
- No broadcast distribution; joins executed serially
- Example query runtime: ~77s
SCAN JDBC A
  \
   HASH JOIN
      \
       SCAN JDBC B
          \
           HASH JOIN
              \
               SCAN JDBC C
                  \
                   TOP-N -> ROOT
After Optimization:
- Cardinality derived from COUNT(*) (e.g. 150K, 1.5M rows)
- Multiple fragments per scan, 7 threads per query
- Max per-host resource reservation: ~123 MB, 7 threads
- Plans include EXCHANGE and MERGING EXCHANGE operators
- Broadcast joins on small sides, improving parallelism
- Example query runtime: ~38s (~2x faster)
SCAN JDBC A --> EXCHANGE(SND) --+
                                 \
SCAN JDBC B --> EXCHANGE(SND) --> EXCHANGE(RCV) -> HASH JOIN(BCAST) --+
                                                                       \
SCAN JDBC C --> EXCHANGE(SND) -------------------------> HASH JOIN(BCAST)
                                                             \
                                                              TOP-N
                                                                \
                                                                 MERGING EXCHANGE -> ROOT
Also added a new backend configuration flag
--min_jdbc_scan_cardinality (default: 10) to provide a
lower bound for scan node cardinality estimates
during planning. This flag is propagated from BE
to FE via TBackendGflags and surfaced through
BackendConfig, ensuring the planner never produces
unrealistically low cardinality values.
TODO: Add a query option for this optimization
to avoid extra JDBC round trip for smaller
queries (IMPALA-14417).
Testing: Planner test cases for this change are in
jdbc-parallel.test. Some basic metrics
are also mentioned in the commit message.
Change-Id: If47d29bdda5b17a1b369440f04d4e209d12133d9
Reviewed-on: http://gerrit.cloudera.org:8080/23112
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Prior to this commit, outer join conjuncts were not being placed into
the ValueTransfersGraph, which prevented them from being considered for
runtime filters. This caused a slowdown in some TPC-DS queries.
The conjuncts are now registered with the ImpalaJoinRel. The appropriate TableRef
objects are picked up from the underlying plan nodes.
Change-Id: I9e06d3f35a10f35ff8b57ba25dbab1bc6a35238a
Reviewed-on: http://gerrit.cloudera.org:8080/23318
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The remote admission client's retry logic for the AdmitQuery RPC did not
handle cases where the admissiond restarts with a new IP address.
The client would use the old proxy and retry against the old, stale
IP, causing queries to time out.
This change fixes the issue by adding the GetProxy() call inside the
retry loop. This forces the client to re-resolve the admissiond's
network address on each retry attempt, allowing it to discover the
new endpoint and successfully reconnect.
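The pattern, sketched in Java purely for illustration (the real client is C++
and the AdmissionProxy/Supplier names here are hypothetical); the important
part is that the proxy is obtained inside the loop so every attempt
re-resolves the address:
  interface AdmissionProxy {
    void admitQuery(Object request) throws Exception;
  }
  final class AdmitRetry {
    static void admitWithRetry(java.util.function.Supplier<AdmissionProxy> getProxy,
                               Object request, int maxAttempts) throws Exception {
      Exception last = null;
      for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
          // Re-resolve the admissiond address on every attempt instead of
          // reusing a proxy that may point at a stale IP.
          getProxy.get().admitQuery(request);
          return;
        } catch (Exception e) {
          last = e;
          Thread.sleep(1000L);  // fixed backoff, for illustration only
        }
      }
      throw last;
    }
  }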
Tests:
Passed admissiond-related exhaustive EE tests.
Since automatically changing hosts might be difficult, manually tested
by changing /etc/hosts with the following steps:
1. Start with --admission_service_host=localhost.
2. Change 'localhost' in /etc/hosts to an inaccessible IP,
like 127.0.0.2.
3. Submit a query; it blocks in the retry logic.
4. While the query is blocked, change 'localhost' in /etc/hosts
back to 127.0.0.1.
5. The query then succeeds.
Change-Id: I5857de84ce69902b902099f668e87d747f944aff
Reviewed-on: http://gerrit.cloudera.org:8080/23472
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When workload management is used for the first time, catalogd reports the
error "Table not found: sys.impala_query_log" (and likewise for
sys.impala_query_live).
This is because during InitWorkloadManagement() we issue a ResetMetadata()
request against sys.impala_query_log to retrieve its schema version. If
the request fails with TableNotFound, we create the table. In other
words, the current initialization of workload management generates error
messages even when everything is working as intended, which can confuse users.
Instead of calling ResetMetadata(), we can first test for the existence of
the workload management tables (sys.impala_query_log and
sys.impala_query_live).
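A minimal sketch of the new check, assuming a hypothetical metastore client
interface (not the actual catalogd code): the table is looked up first and only
created when it is missing, so a clean install no longer logs a spurious error.
  interface MetastoreClient {
    boolean tableExists(String db, String table) throws Exception;
    void createTable(String db, String table) throws Exception;
  }
  final class WorkloadMgmtInit {
    static void ensureTable(MetastoreClient client, String db, String table) throws Exception {
      if (!client.tableExists(db, table)) {
        client.createTable(db, table);
      }
      // An existing table falls through silently; no error is logged either way.
    }
  }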
Testing:
* Tested manually that the error logs disappear.
Change-Id: Ic7f7c92bda57d9fdc2185bf4ef8fd4f09aea0879
Reviewed-on: http://gerrit.cloudera.org:8080/23470
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is a follow-up commit to IMPALA-12709. IMPALA-12152 and
IMPALA-12785 are affected when the hierarchical metastore event
processing feature is enabled.
The following changes are incorporated in this patch:
1. Added creationTime_ and dispatchTime_ fields in the MetastoreEvent
class to store the current time in milliseconds. They are used to
calculate:
a) Event dispatch time (time between a MetastoreEvent object's
creation and when the event is moved to the inProgressLog_ of
EventExecutorService after dispatching it to a
DbEventExecutor).
b) Event schedule delays incurred at DbEventExecutors and
TableEventExecutors (time between an event being moved to
EventExecutorService's inProgressLog_ and the start of
processing the event at the appropriate DbEventExecutor and
TableEventExecutor).
c) Event process time from the EventExecutorService's point of
view (time spent in inProgressLog_ before it is moved to
processedLog_).
Logs are added to show, for each event, the dispatch time,
schedule delays, and process time from the EventExecutorService's
point of view. A log is also added to show the time
taken by an event's processIfEnabled().
2. Added an isDelimiter_ field in the MetastoreEvent class to indicate
whether an event is a delimiter event. It is set only when
hierarchical event processing is enabled. A delimiter is a kind
of metastore event that does not require event processing.
A delimiter event can be:
a) A CommitTxnEvent that does not have any write event info for
a given transaction.
b) An AbortTxnEvent that does not have write ids for a given
transaction.
c) An IgnoredEvent.
An event is determined and marked as a delimiter in
EventExecutorService#dispatch(). Delimiter events are not queued to a
DbEventExecutor for processing. They are just maintained in
the inProgressLog_ to preserve continuity and correctness in
synchronization tracking. Delimiter events are removed from
inProgressLog_ when their preceding non-delimiter metastore
event is removed from inProgressLog_.
3. The greatest synced event id is computed based on the dispatched
events (inProgressLog_) and processed events (processedLog_) tree
maps. The greatest synced event is the latest event such that all
events with an id less than or equal to it are definitely synced
(a sketch follows this list).
4. Lag is calculated as the difference between the latest event time on
HMS and the greatest synced event time. It is shown in the log.
5. The greatest synced event id is used in the IMPALA-12152 changes.
When the greatest synced event id becomes greater than or equal to
waitForEventId, all the required events are definitely synced.
6. The event processor is paused gracefully when paused with the command
from IMPALA-12785. This ensures that all the events fetched from HMS
in the current batch are processed before the event processor is fully
paused. It is necessary to process the current batch of events
because certain events, like AllocWriteIdEvent, AbortTxnEvent and
CommitTxnEvent, update table write ids in the catalog upon metastore
event object creation, and the table write ids are later applied
to the appropriate table object when the event is processed. Pausing
abruptly in the middle of a batch of event processing can leave the
write ids on table objects in an inconsistent state.
7. Added the greatest synced event id and event time to the events
processor metrics, and updated the descriptions of the lag, pending
events, last synced event id and event time metrics.
8. Atomically update the event queue and increment the outstanding event
count in the enqueue methods of both DbProcessor and TableProcessor
so that the respective process methods do not process an event until
it is added to the queue and the outstanding event count is
incremented. Otherwise, the event could be processed and the
outstanding event count decremented before it is incremented in the
enqueue method.
9. Refactored the DbEventExecutor, DbProcessor, TableEventExecutor and
TableProcessor classes to propagate the exception that occurred along
with the event during event processing. EventProcessException is a
wrapper added to hold a reference to the event being processed and
the exception that occurred.
10. Added an AcidTableWriteInfo helper class to store the table, write ids
and partitions for the transaction id received in a CommitTxnEvent.
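A sketch of the "greatest synced event id" computation from item 3, under the
assumption that inProgressLog_ and processedLog_ are tree maps keyed by event
id; this illustrates the described semantics, not the actual
EventExecutorService code:
  import java.util.TreeMap;
  final class SyncedIdSketch {
    static long greatestSyncedEventId(TreeMap<Long, Object> inProgressLog,
                                      TreeMap<Long, Object> processedLog) {
      if (inProgressLog.isEmpty()) {
        // Nothing is outstanding, so every processed event is synced.
        return processedLog.isEmpty() ? -1 : processedLog.lastKey();
      }
      // Only ids strictly below the oldest in-progress event can be guaranteed
      // synced, since everything up to the returned id must already be processed.
      Long bound = processedLog.lowerKey(inProgressLog.firstKey());
      return bound == null ? -1 : bound;
    }
  }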
Testing:
- Added new tests and executed existing end-to-end tests.
- Executed the existing tests with hierarchical event processing
enabled.
Change-Id: I26240f36aaf85125428dc39a66a2a1e4d3197e85
Reviewed-on: http://gerrit.cloudera.org:8080/22997
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Custom cluster tests like TestKuduHMSIntegration restart the Kudu
service with custom startup flags. On Redhat8 ARM64, these tests
have been failing due to Kudu being unresponsive after this
restart. Debugging showed that Kudu was stuck early in startup.
This only reproduced via the custom cluster tests and never via
regular minicluster startup.
When custom cluster tests restart Kudu, the script to restart
Kudu inherits environment variables from the test runner. It
turns out that the HEAPCHECK environment variable (even when
empty) causes Kudu to get stuck during startup on Redhat8
ARM64 after the recent toolchain update.
As a short-term fix, this unsets HEAPCHECK when restarting the
Kudu service for these tests. There will need to be further
investigation / cleanup beyond this.
Testing:
- Ran the Kudu custom cluster tests on Redhat8 ARM64 and
on Ubuntu 20 x86_64
Change-Id: I51513e194d9e605df199672231b412fae40343af
Reviewed-on: http://gerrit.cloudera.org:8080/23467
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Minidump stack resolution does not work on Redhat8 ARM64.
Redhat8 ARM64 uses 64KB pages, and the Breakpad library does
not properly handle collecting stacks for that configuration.
Breakpad rounds off the stack pointer to the nearest page
boundary below the stack pointer, then collects up to 32KB of
stack memory. With a top-down stack, this means it is collecting
some memory that is not used by the stack. With 64KB pages,
the memory it collects usually doesn't contain any stack contents.
This picks up a toolchain with Breakpad patched to fix this. The
patch stops rounding the stack pointer to the nearest page.
Instead, it adjusts the stack pointer to account for the red
zone (128 bytes on x86_64) and then rounds to the nearest 1KB
boundary below the stack pointer.
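The arithmetic of the new starting point, expressed as a small Java sketch
(the real patch is in Breakpad's C++ stack collection; the class and method
names here are made up):
  final class StackCollectionStart {
    // Back the stack pointer off by the red zone, then round down to a 1KB
    // boundary instead of a page boundary.
    static long collectionStart(long stackPointer, long redZoneBytes) {
      long adjusted = stackPointer - redZoneBytes;  // red zone is 128 bytes on x86_64
      return adjusted & ~(1024L - 1);               // nearest 1KB boundary below
    }
  }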
Testing:
- Produced and resolved minidumps on multiple build types for
x86_64 and ARM64 (release, debug, asan, ubsan)
Change-Id: I4fbd91abfbddfd8355d27ae9d9b86b70a9ce0409
Reviewed-on: http://gerrit.cloudera.org:8080/23465
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestTupleCacheFullCluster.test_scan_range_distributed is flaky on S3
builds. The addition of a single file changes scheduling significantly
even with scan ranges sorted oldest to newest. This is because modification
times on S3 have a granularity of one second. Multiple files have the
same modification time, and the fix for IMPALA-13548 did not properly
break ties for sorting.
This adds logic to break ties for files with the same modification
time. It compares the path (absolute path or relative path + partition)
as well as the offset within the file. These should be enough to break
all conceivable ties, as it is not possible to have two scan ranges with
the same file at the same offset. In debug builds, this does additional
validation to make sure that when a != b, comp(a, b) != comp(b, a).
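A Java rendering of the resulting order, for illustration only (the actual
comparator is C++ and the field names here are guesses): modification time
first, then path, then offset, matching the tie-breaking described above.
  import java.util.Comparator;
  final class ScanRangeOrderSketch {
    static final class ScanRange {
      long modificationTimeMs;
      String partitionPath;
      String relativePath;  // empty when only the absolute path is meaningful
      String absolutePath;
      long offset;
    }
    // Oldest file first; ties broken by path, then by offset within the file.
    static final Comparator<ScanRange> OLDEST_TO_NEWEST =
        Comparator.<ScanRange>comparingLong(r -> r.modificationTimeMs)
            .thenComparing(r -> r.relativePath.isEmpty()
                ? r.absolutePath
                : r.partitionPath + "/" + r.relativePath)
            .thenComparingLong(r -> r.offset);
  }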
The test requires that adding a single file to the table changes exactly
one cache key. If that final file has the same modification time as
an existing file, scheduling may still mix up the files and change more
than one cache key, even with tie-breaking. This adds a sleep just before
generating the final file to guarantee that it gets a newer modification
time.
Testing:
- Ran TestTupleCacheFullCluster.test_scan_range_distributed for 15
iterations on S3
Change-Id: I3f2e40d3f975ee370c659939da0374675a28cd38
Reviewed-on: http://gerrit.cloudera.org:8080/23458
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
TestAcidRowValidation.test_row_validation fails with tuple caching
correctness verification. The test creates a Full Hive ACID table
with a file using valid write ids, mimicking a streaming ingest.
As the valid write ids change, the scan of that file produces
different rows without the file changing. Tuple caching currently
doesn't understand valid write ids, so this produces incorrect
results.
This marks Full Hive ACID tables as ineligible for caching until
valid write ids can be supported properly. Insert-only tables are
still eligible.
Testing:
- Added test cases to TupleCacheTest
- Ran TestAcidRowValidation.test_row_validation with correctness
verification
Change-Id: Icab9613b8e2973aed1d34427c51d2fd8b37a9aba
Reviewed-on: http://gerrit.cloudera.org:8080/23454
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
IMPALA-6984 changed the behavior to cancel backends when the
query reaches the RETURNED_RESULTS state. This ran into a regression
on large clusters where a query would end up waiting 10 seconds.
IMPALA-10047 reverted the core piece of the change.
For tuple caching, we found that a scan node can get stuck waiting
for a global runtime filter. It turns out that the coordinator will
not send out global runtime filters if the query is in a terminal
state. Tuple caching was causing queries to reach the RETURNED_RESULTS
phase before the runtime filter could be sent out. Reenabling the core
part of IMPALA-6984 sends out a cancel as soon as the query transitions
to RETURNED_RESULTS and wakes up any fragment instances waiting on
runtime filters.
The underlying cause of IMPALA-10047 is a tangle of locks that causes
us to exhaust the RPC threads. The coordinator is holding a lock on the
backend state while it sends the cancel synchronously. Other backends
that complete during that time run Coordinator::BackendState::LogFirstInProgress(),
which iterates through backend states to find the first that is not done.
The check to see if a backend state is done takes a lock on the backend
state. The problem case is that the coordinator may be sending a cancel
to a backend on itself. In that case, it needs an RPC thread on the coordinator
to be available to process the cancel. If all of the RPC threads are
processing updates, they can all call LogFirstInProgress() and get stuck
on the backend state lock for the coordinator's fragment. In that case,
it becomes a temporary deadlock as the cancel can't be processed and the
coordinator won't release the lock. It only gets resolved by the RPC timing
out.
To resolve this, this change makes the Cancel() method drop the lock while
doing the CancelQueryFInstances RPC. It reacquires the lock when the RPC
finishes.
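In Java terms (purely illustrative; the real code is the coordinator's C++
BackendState), the shape of the change is: release the state lock for the
duration of the blocking cancel RPC, then reacquire it before touching shared
state again.
  import java.util.concurrent.locks.ReentrantLock;
  final class CancelSketch {
    private final ReentrantLock lock = new ReentrantLock();
    void cancel(Runnable cancelRpc) {
      lock.lock();
      try {
        // ... decide under the lock whether a cancel must be sent ...
        lock.unlock();
        try {
          cancelRpc.run();  // blocking CancelQueryFInstances RPC, lock not held
        } finally {
          lock.lock();      // reacquire before updating shared state
        }
        // ... record the RPC outcome under the lock ...
      } finally {
        lock.unlock();
      }
    }
  }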
Testing:
- Hand tested with 10 impalads and control_service_num_svc_threads=1.
Without the fix, the issue reproduces easily after reverting IMPALA-10047.
With the fix, it doesn't reproduce.
Change-Id: Ia058b03c72cc4bb83b0bd0a19ff6c8c43a647974
Reviewed-on: http://gerrit.cloudera.org:8080/23264
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Cleans up repetitive patterns in pom.xml.
Centralize plugin configuration in pluginManagement. Replace inline
maven-compiler-plugin configuration with newer maven.compiler.release
and update to latest plugin version.
Centralize common dependencies in dependencyManagement, including
exclusions when appropriate. Remove exclusions that are no longer
relevant.
Compared before and after with dependency:tree; the only difference is that
commons-cli now comes from hadoop and jersey-serv{let,er} are
effectively excluded; all versions matched. Also ensured
USE_APACHE_COMPONENTS=true compiles.
Adds com.amazonaws:aws-java-sdk-bundle to exclusion checking to ensure
it's not accidentally included alongside impala-minimal-s3a-aws-sdk.
Removes missed io.netty exclusion from IMPALA-12816.
Updates commons-dbcp2 to 2.12.0 to match Hive.
Change-Id: If96649840e23036b4a73ee23e8d12516497994f0
Reviewed-on: http://gerrit.cloudera.org:8080/23432
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Scheduling does not sort scan ranges by modification time. When a new
file is added to a table, its order in the list of scan ranges is
not based on modification time. Instead, it is based on which partition
it belongs to and what its filename is. A new file that is added early
in the list of scan ranges can cause cascading differences in scheduling.
For tuple caching, this means that multiple runtime cache keys could
change due to adding a single file.
To minimize that disruption, this adds the ability to sort the scan
ranges by modification time and schedule scan ranges oldest to newest.
This enables it for scan nodes that feed into tuple cache nodes
(similar to deterministic scan range assignment).
Testing:
- Modified TestTupleCacheFullCluster::test_scan_range_distributed
to have stricter checks about how many cache keys change after
an insert (only one should change)
- Modified TupleCacheTest#testDeterministicScheduling to verify that
oldest to newest scheduling is also enabled.
Change-Id: Ia4108c7a00c6acf8bbfc036b2b76e7c02ae44d47
Reviewed-on: http://gerrit.cloudera.org:8080/23228
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Downstream error reports pointed out that the toolchain version picked
up for IMPALA-14139 contains toolchain binaries for Red Hat 9 (and
compatibles) that require at least the 9.5 minor version because of
OpenSSL library requirements. This was caused by the toolchain binary
build process not using package repo pinning for the redhat9 build
container definition, which caused the container process to install
"latest" packages, in this case packages released in Rocky / Red Hat
9.5.
This patch bumps the toolchain ID to a version in which the redhat9
binaries were produced in a build container "moved back in time" to the
9.2 release by pinning the package repos to the Rocky Linux 9.2 state,
using the Rocky Vault.
The patch also picks up a buffer overflow mitigation for the ORC
library.
Change-Id: I5c6921afdc69a4a6644b619de6b8d4e4cc69e601
Reviewed-on: http://gerrit.cloudera.org:8080/23448
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When admission control is enabled but the max memory for a pool is not
configured, Impala skips memory-based admission completely, i.e. it
doesn't even take available host memory into account.
This behavior can lead to admitting many queries with large memory
consumption, potentially causing query failures due to memory
exhaustion.
Fixing the above behavior might cause regressions in some workloads,
so this patch just adds a new log message which makes it clear why
a query gets admitted, and also mentions possible failures.
Change-Id: Ib98482abc0fbcb53552adfd89cf6d157b17527fd
Reviewed-on: http://gerrit.cloudera.org:8080/23438
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds a new MetaProvider called MultiMetaProvider, which is
capable of handling multiple MetaProviders at once, prioritizing one
primary provider over multiple secondary providers. The primary
provider handles some methods exclusively for deterministic behavior.
In database listings, if one database name occurs multiple times, the
contained tables are merged under that database name; if two
separate databases contain a table with the same name, query
analysis fails with an error.
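A hedged sketch of that merge rule (the map-of-database-to-tables shape and
the names are assumptions, not the MetaProvider API): listings from all
providers are folded together, and any table that shows up twice under the
same database name is recorded as ambiguous so that analysis of a query
referencing it can fail with an error.
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;
  final class DbListingMergeSketch {
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> perProvider,
                                          Set<String> ambiguousOut) {
      Map<String, Set<String>> merged = new HashMap<>();
      for (Map<String, Set<String>> listing : perProvider) {
        listing.forEach((db, tables) -> {
          Set<String> existing = merged.computeIfAbsent(db, k -> new HashSet<>());
          for (String table : tables) {
            // The same table name under the same database name from two
            // providers is ambiguous; queries referencing it fail analysis.
            if (!existing.add(table)) ambiguousOut.add(db + "." + table);
          }
        });
      }
      return merged;
    }
  }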
This change also modifies the local catalog implementation's
initialization. If catalogd is deployed, it instantiates the
CatalogdMetaProvider and checks whether the catalog configuration directory
is set as a backend flag. If it is set, it tries to load every
configuration from that directory and to instantiate an
IcebergMetaProvider from those configs. If the instantiation fails, an
error is reported in the logs, but startup is not interrupted.
Tests:
- E2E tests for multi-catalog behavior
- Unit test for ConfigLoader
Change-Id: Ifbdd0f7085345e7954d9f6f264202699182dd1e1
Reviewed-on: http://gerrit.cloudera.org:8080/22878
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Removes IMPALA_JAVA_HOME_OVERRIDE and updates version selection. In
order of priority:
1. If IMPALA_JDK_VERSION is set, use the OS JDK version from a known
location. This is primarily used when also installing the JDK as part
of automated builds.
2. If JAVA_HOME is set, use it.
3. Look for the system default JDK.
The IMPALA_JDK_VERSION variable is no longer modified to avoid issues
when sourcing impala-config.sh multiple times. JAVA_HOME will be
modified if IMPALA_JDK_VERSION is set; both must be unset to restore
using the system default Java.
If switching between JDKs, now prefer setting JAVA_HOME. If relying on
system Java, unset JAVA_HOME after e.g. update-java-alternatives.
The detected Java version is set in IMPALA_JAVA_TARGET, which is used to
add Java 9+ options and configure the Java compilation target.
Eliminates IMPALA_JDK_VERSION_NUM as its value was always identical to
IMPALA_JAVA_TARGET.
Stops printing from impala-config-java.sh. It made the output from
impala-config.sh look strange, and the decisions can all be clearly
determined from impala-config.sh printed variables later or the packages
installed in bootstrap_system.sh.
Fixes JAVA_HOME in bootstrap_build.sh on ARM64 systems.
Change-Id: I68435ca69522f8310221a0f3050f13d86568b9da
Reviewed-on: http://gerrit.cloudera.org:8080/23434
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This changes the default behavior of the tuple cache to consider
cost when placing the TupleCacheNodes. It tries to pick the best
locations within a budget. First, it eliminates unprofitable locations
via a threshold. Next, it ranks the remaining locations by their
profitability. Finally, it picks the best locations in rank order
until it reaches the budget.
The threshold is based on the ratio of processing cost for regular
execution versus the processing cost for reading from the cache.
If the ratio is below the threshold, the location is eliminated.
The threshold is specified by the tuple_cache_required_cost_reduction_factor
query option. This defaults to 3.0, which means that the cost of
reading from the cache must be less than 1/3 the cost of computing
the value normally. A higher value makes this more restrictive
about caching locations, which pushes in the direction of lower
overhead.
The ranking is based on the cost reduction per byte. This is given
by the formula:
(regular processing cost - cost to read from cache) / estimated serialized size
This prefers locations with small results or high reduction in cost.
The budget is based on the estimated serialized size per node. This
limits the total caching that a query will do. A higher value allows more
caching, which can increase the overhead on the first run of a query. A lower
value is less aggressive and can limit the overhead at the expense of less
caching. This uses a per-node limit, as the limit should scale with the
size of the executor group: each executor brings extra capacity. The budget
is specified by tuple_cache_budget_bytes_per_executor.
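Put together, the selection roughly follows the sketch below (illustrative
Java, assuming per-location cost and size estimates; the class and field names
are not the planner's actual ones, and the handling of a location that doesn't
fit in the remaining budget is simplified here):
  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  final class CachePlacementSketch {
    static final class Location {
      double regularCost;    // cost of computing the subtree normally
      double cacheReadCost;  // cost of reading the result from the cache
      long estimatedBytes;   // estimated serialized result size
    }
    static List<Location> pick(List<Location> candidates,
                               double requiredCostReductionFactor,
                               long budgetBytes) {
      List<Location> ranked = new ArrayList<>();
      for (Location loc : candidates) {
        // Threshold: a factor of 3.0 means the cache read must cost less than
        // 1/3 of regular execution for the location to stay in the running.
        if (loc.regularCost >= requiredCostReductionFactor * loc.cacheReadCost) {
          ranked.add(loc);
        }
      }
      // Rank by cost reduction per serialized byte, best first.
      ranked.sort(Comparator.comparingDouble(
          (Location l) -> (l.regularCost - l.cacheReadCost) / Math.max(1, l.estimatedBytes))
          .reversed());
      List<Location> picked = new ArrayList<>();
      long remaining = budgetBytes;
      for (Location loc : ranked) {
        if (loc.estimatedBytes > remaining) break;  // budget exhausted
        picked.add(loc);
        remaining -= loc.estimatedBytes;
      }
      return picked;
    }
  }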
The old behavior to place the tuple cache at all eligible locations is
still available via the tuple_cache_placement_policy query option. The
default is the cost_based policy described above, but the old behavior
is available via the all_eligible policy. This is useful for correctness
testing (and the existing tuple cache test cases).
This changes the explain plan output:
- The hash trace is only enabled at VERBOSE level. This means that the regular
profile will not contain the hash trace, as the regular profile uses EXTENDED.
- This adds additional information at VERBOSE to display the cost information
for each plan node. This can help trace why a particular location was
not picked.
Testing:
- This adds a TPC-DS planner test with tuple caching enabled (based on the
existing TpcdsCpuCostPlannerTest)
- This modifies existing tests to adapt to changes in the explain plan output
Change-Id: Ifc6e7b95621a7937d892511dc879bf7c8da07cdc
Reviewed-on: http://gerrit.cloudera.org:8080/23219
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is a preparatory change for cost-based placement for
TupleCacheNodes. It reorders planning so that the processing cost and
filtered cardinality are calculated before running the TupleCachePlanner.
This computes the processing cost when enable_tuple_cache=true.
It also displays the cost information in the explain plan output
when enable_tuple_cache=true. This does not impact the adjustment
of fragment parallelism, which continues to be controlled by the
compute_processing_cost option.
This uses the processing cost to calculate a cumulative processing
cost in the TupleCacheInfo. This is all of the processing cost below
this point including other fragments. This is an indicator of how
much processing a cache hit could avoid. This does not accumulate the
cost when merging the TupleCacheInfo due to a runtime filter, as that
cost is not actually being avoided. This also computes the estimated
serialized size for the TupleCacheNode based on the filtered
cardinality and the row size.
Testing:
- Ran a core job
Change-Id: If78f5d002b0e079eef1eece612f0d4fefde545c7
Reviewed-on: http://gerrit.cloudera.org:8080/23164
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
In Apache Hive 3, HMS doesn't provide the API to retrieve the WriteEvents
info of a given transaction. The COMMIT_TXN event just contains a
transaction id, so Impala can't process it.
This patch ignores COMMIT_TXN events when building on Apache Hive 3.
Some tests in MetastoreEventsProcessorTest and EventExecutorServiceTest
are skipped due to this.
Tests:
- Manually tested on Apache Hive 3. Verified that EventProcessor still
works after receiving COMMIT_TXN events.
- Passed some tests in MetastoreEventsProcessorTest and
EventExecutorServiceTest that previously failed due to EventProcessor
going into ERROR state.
Change-Id: I863e39b3702028a14e83fed1fe912b441f2c28db
Reviewed-on: http://gerrit.cloudera.org:8080/23117
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The hms_event_incremental_refresh_transactional_table feature is
mature enough that it has been enabled for years. We'd like to
deprecate the ability to turn it off. However, for older Hive
versions like Apache Hive 3 that don't provide sufficient APIs for
Impala to process COMMIT_TXN events, users can still turn it off.
This patch skips
test_no_hms_event_incremental_refresh_transactional_table when running
on CDP Hive.
To run the test on Apache Hive 3, the test is adjusted to create the ACID
table using tblproperties instead of a "create transactional table" statement.
Tests:
- Ran the test on CDP Hive and Apache Hive 3.
Change-Id: I93379e5331072bec1d3a4769f7d7ab59431478ee
Reviewed-on: http://gerrit.cloudera.org:8080/23435
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, RELOAD events of a partitioned table are processed one after
the other. Processing them one by one acquires the table lock multiple
times to load individual partitions in sequence. This also keeps the
table version changing, which impacts the performance of coordinators in
local-catalog mode - query planning needs to retry to handle
InconsistentMetadataFetchException due to table version changes.
This patch adds batch processing of RELOAD events on the same
table by reusing the existing logic of BatchPartitionEvent. The
implementation adds four new methods, canBeBatched(), addToBatchEvents(),
getPartitionForBatching() and getBatchEventType() (prerequisites for
reusing the batching logic), to the RELOAD event class.
Testing:
- Added an end-to-end test to verify the batching.
Change-Id: Ie3e9a99b666a1c928ac2a136bded1e5420f77dab
Reviewed-on: http://gerrit.cloudera.org:8080/23159
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Clang static analyzer found a potential memory leak in
TmpFileMgr. In some cases we forgot to delete a
newly created TmpFileRemote object. This patch replaces
the raw pointer with a unique_ptr.
Change-Id: I5a516eab1a946e7368c6059f8d1cc430d2ee19e9
Reviewed-on: http://gerrit.cloudera.org:8080/23431
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Fixes the code that detects if the OpenTelemetry collector is using
TLS. Previously, the code only worked properly when the collector URL
was all lowercase. Also removes unnecessary checks that could cause
TLS to be enabled even when the collector URL scheme was not https.
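Conceptually the check now amounts to a case-insensitive scheme test, sketched
here in Java for illustration (the real code is the C++ OpenTelemetry exporter
setup, and the class name is made up):
  final class CollectorTls {
    // TLS is enabled only when the collector URL's scheme is https, regardless
    // of the letter case used in the configured URL.
    static boolean usesTls(String collectorUrl) {
      return collectorUrl != null
          && collectorUrl.toLowerCase(java.util.Locale.ROOT).startsWith("https://");
    }
  }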
Testing accomplished by adding new ctest tests.
Change-Id: I3bf74f1353545d280575cdb94cf135e55c580ec7
Reviewed-on: http://gerrit.cloudera.org:8080/23397
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
All functions in the SpanManager class operate under the assumption
that child_span_mu_ in the SpanManager class will be locked before
the ClientRequestState lock. However, the
ImpalaServer::ExecuteInternal function takes the ClientRequestState
lock before calling SpanManager::EndChildSpanPlanning. If another
function in the SpanManager class has already taken the
child_span_mu_ lock and is waiting for the ClientRequestState lock,
a deadlock occurs.
This issue was found by running end-to-end tests with OpenTelemetry
tracing enabled and a release build of Impala.
Testing accomplished by re-running the end-to-end tests with
OpenTelemetry tracing enabled and verifying that the deadlock no
longer occurs.
Change-Id: I7b43dba794cfe61d283bdd476e4056b9304d8947
Reviewed-on: http://gerrit.cloudera.org:8080/23422
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Contains the following improvements to the Impala queries as
OpenTelemetry traces custom cluster tests:
1. Supporting code for asserting traces was moved to
'tests/util/otel_trace.py'. The moved code was modified to remove
all references to 'self'. Since this code used
'self.assert_impalad_log_contains', it had to be modified so the
caller provides the correct log file path to search. The
'__find_span_log' function was updated to call a new generic file
grep function to run the necessary log file search regex. All
other code was moved unmodified.
2. Classes 'TestOtelTraceSelectsDMLs' and 'TestOtelTraceDDLs'
contained a total of 11 individual tests that used the
'unique_database' fixture. When this fixture is used in a test, it
results in two DDLs being run before the test to drop/create the
database and one DDL being run after the test to drop the database.
These classes now create a test database once during 'setup_class'
and drop it once during 'teardown_class' because creating a new
database for each test was unnecessary. This change dropped test
execution time from about 97 seconds to about 77 seconds.
3. Each test now has comments describing what the test is asserting.
4. The unnecessary sleep in 'test_query_exec_fail' was removed, saving
five seconds of test execution time.
5. New test 'test_dml_insert_fail' added. Previously, the situation
where an insert DML failed was not tested. The test passed without
any changes to backend code.
6. Test 'test_ddl_createtable_fail' is greatly simplified by using a
debug action to fail the query instead of multiple parallel
queries where one dropped the database the other was inserting
into. The simplified setup eliminated test flakiness caused by
timing differences and sped up test execution by about 5 seconds.
7. Fixed test flakiness caused by timing issues. Depending on
when the close process was initiated, span events are sometimes in
the QueryExecution span and sometimes in the Close span. The test
assertions cannot handle these situations. All span event
assertions for the Close span were removed. IMPALA-14334 will fix
these assertions.
8. The function 'query_id_from_ui', which retrieves the query profile
using the Impala debug UI, now makes multiple attempts to retrieve
the query. In slower test situations, such as ASAN, the query may
not yet be available when the function is called initially, which
used to cause tests to fail. This flakiness is now eliminated
by the added retries.
Testing accomplished by running tests in test_otel_trace.py both
locally and in a full Jenkins build.
Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: I0c3e0075df688c7ae601c6f2e5743f56d6db100e
Reviewed-on: http://gerrit.cloudera.org:8080/23385
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-14349 caused a regression due to a change in
FileMetadataLoader.createFd(). When the default FS is S3, files in S3
should not have any FileBlocks. However, after IMPALA-14349, a CTAS query
that scans the functional.alltypes table in S3 hit the following
Precondition in HdfsScanNode.java:
  if (!fsHasBlocks) {
    Preconditions.checkState(fileDesc.getNumFileBlocks() == 0);
  }
This is because FileMetadataLoader.createFd() skipped checking whether the
originating FileSystem supports storage IDs (supportsStorageIds()). S3
dataloading from an HDFS snapshot consistently failed due to this regression.
This patch fixes the issue by restoring FileMetadataLoader.createFd() to
its state before IMPALA-14349. It also makes
FileMetadataLoader.createFd() calls more consistent by not allowing null
parameters, except for 'absPath', which is only non-null for Iceberg data
files. The numUnknownDiskIds parameter is generalized from Reference<Long>
to AtomicLong for parallel usage.
Testing:
Passed dataloading, FE_TEST, EE_TEST, and CLUSTER_TEST in S3.
Change-Id: Ie16c5d7b020a59b5937b52dfbf66175ac94f60cd
Reviewed-on: http://gerrit.cloudera.org:8080/23423
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Update the following elements of the Impala build environment to enable
builds on Ubuntu 24.04:
- Recognize and handle (where necessary) Ubuntu 24.04 in various
bootstrap scripts (bootstrap_system.sh, bootstrap_toolchain.py, etc.)
- Bump IMPALA_TOOLCHAIN_ID to an official toolchain build that contains
Ubuntu 24.04-specific binary packages
- Bump binutils to 2.42, and
- Bump the GDB version to 12.1-p1, as required by the new toolchain
version
- Update unique_ptr usage syntax in be/src/util/webserver-test.cc to
compensate for new GLIBC function prototypes:
System headers in Ubuntu 24.04 adopted attributes on several widely
used function prototypes. Such attributes are not considered to be part
of the function's signature during template evaluation, so GCC throws a
warning when such a function is passed as a template argument, which
breaks the build, as warnings are treated as errors.
webserver-test.cc uses pclose() as the deleter for a unique_ptr in a
utility function. This patch encapsulates pclose() and its attributes in
an explicit specialization for std::default_delete<>, "hiding" the
attributes inside a functor.
The particular solution was inspired by Anton-V-K's proposal in
https://gist.github.com/t-mat/5849549
This commit builds on an earlier patch for the same purpose by Michael
Smith: https://gerrit.cloudera.org/c/23058/
Change-Id: Ia4454b0c359dbf579e6ba2f9f9c44cfa3f1de0d2
Reviewed-on: http://gerrit.cloudera.org:8080/23384
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Incorrect marker was included at the beginning (`--` instead of `---`)
which caused Apache Infrastructure to reject the file.
Tested with a YAML validator.
Change-Id: I554ba9b12655b1dbc8d38f6d533d51be92578369
Reviewed-on: http://gerrit.cloudera.org:8080/23426
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Consume various utility functions added as part of previous changes.
Testing accomplished by running exhaustive tests in
test_query_log.py, test_query_live.py, and test_otel_trace.py both
locally and in jenkins.
Change-Id: If42a8b5b6fdb43fb2bb37dd2a3be4668e8a5e283
Reviewed-on: http://gerrit.cloudera.org:8080/23234
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestImpalaShell.test_cancellation started failing when run with Python 3.9
with the following error message:
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
This patch is a quick fix that changes the stderr write from using
print() to os.write(). Note that the thread-safety issue within
_signal_handler in impala_shell.py during query cancellation still
remains.
Testing:
Ran and passed test_cancellation on RHEL9 with Python 3.9.
Change-Id: I5403c7b8126b1a35ea841496fdfb6eb93e83376e
Reviewed-on: http://gerrit.cloudera.org:8080/23416
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change, query plans and profiles reported only a single
partition even for partitioned Iceberg tables, which was misleading
for users.
Now we can display the number of scanned partitions correctly for
both partitioned and unpartitioned Iceberg tables. This is achieved by
extracting the partition values from the file descriptors and storing
them in the IcebergContentFileStore. Instead of storing this information
redundantly in all file descriptors, we store them in one place and
reference the partition metadata in the FDs with an id.
This also gives the opportunity to optimize memory consumption in the
Catalog and Coordinator as well as reduce network traffic between them
in the future.
Time travel is handled similarly to oldFileDescMap. In that case
we don't know the total number of partitions in the old snapshot,
so the output is [Num scanned partitions]/unknown.
Testing:
- Planner tests
- E2E tests
- partition transforms
- partition evolution
- DROP PARTITION
- time travel
Change-Id: Ifb2f654bc6c9bdf9cfafc27b38b5ca2f7b6b4872
Reviewed-on: http://gerrit.cloudera.org:8080/23113
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
test_load_table_with_primary_key_attr has been intermittently failing with
the following error message since the Python 3 switch in IMPALA-14333:
dbm.error: db type could not be determined
The error happens inside load_unique_col_metadata, but both
load_unique_col_metadata and persist_unique_col_metadata appear to be
unused now.
While the root cause is still unknown, commenting out
load_unique_col_metadata in db_connection.py makes the tests pass.
Tests are not affected because they do not assert against any unique
columns. This patch implements that workaround until we can revisit the
issue with the query generator infrastructure.
Also fixes a few flake8 warnings in db_connection.py.
Testing:
Looped and passed test_show_create_table 10 times using downstream Jenkins.
Change-Id: I2caf6b9cc5a4f06237323cb32bbd47afef576fa1
Reviewed-on: http://gerrit.cloudera.org:8080/23412
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>