It's not that easy to find log files of a custom-cluster test. All
custom-cluster tests use the same log dir and the test output just shows
the symlink of the log files, e.g. "Starting State Store logging to
.../logs/custom_cluster_tests/statestored.INFO".
This patch prints the actual log file names after the cluster launches.
An example output:
15:17:19 MainThread: Starting State Store logging to /tmp/statestored.INFO
15:17:19 MainThread: Starting Catalog Service logging to /tmp/catalogd.INFO
15:17:19 MainThread: Starting Impala Daemon logging to /tmp/impalad.INFO
15:17:19 MainThread: Starting Impala Daemon logging to /tmp/impalad_node1.INFO
15:17:19 MainThread: Starting Impala Daemon logging to /tmp/impalad_node2.INFO
...
15:17:24 MainThread: Total wait: 2.54s
15:17:24 MainThread: Actual log file names:
15:17:24 MainThread: statestored.INFO -> statestored.quanlong-Precision-3680.quanlong.log.INFO.20251216-151719.1094348
15:17:24 MainThread: catalogd.INFO -> catalogd.quanlong-Precision-3680.quanlong.log.INFO.20251216-151719.1094368
15:17:24 MainThread: impalad.INFO -> impalad.quanlong-Precision-3680.quanlong.log.INFO.20251216-151719.1094466
15:17:24 MainThread: impalad_node1.INFO -> impalad.quanlong-Precision-3680.quanlong.log.INFO.20251216-151719.1094468
15:17:24 MainThread: impalad_node2.INFO -> impalad.quanlong-Precision-3680.quanlong.log.INFO.20251216-151719.1094470
15:17:24 MainThread: Impala Cluster Running with 3 nodes (3 coordinators, 3 executors).
Tests
- Ran the script locally.
- Ran a failed custom-cluster test and verified the actual file names
are printed in the output.
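A minimal sketch of how the symlink-to-file mapping shown in the example
output could be produced, assuming the log directory contains symlinks
such as statestored.INFO. The helper name resolve_log_symlinks is
hypothetical and not taken from the patch:

```python
import os


def resolve_log_symlinks(log_dir, suffix=".INFO"):
    """Map each symlink ending in `suffix` to the basename of its target."""
    resolved = {}
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        # Only symlinks are of interest; regular log files are skipped.
        if name.endswith(suffix) and os.path.islink(path):
            resolved[name] = os.path.basename(os.path.realpath(path))
    return resolved
```

Each returned entry can then be logged as "<symlink> -> <actual file>".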
Change-Id: Id76c0a8bdfb221ab24ee315e2e273abca4257398
Reviewed-on: http://gerrit.cloudera.org:8080/23781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
This reverts commit 52b87fcefd.
The original commit caused an issue when Impala is deployed together
with Apache Atlas. Coordinator failed to start with error message:
java.lang.NoClassDefFoundError: org/apache/logging/log4j/core/Layout
Solved minor conflict in impala-config.sh due to IMPALA-14478 applied
after IMPALA-14454.
Change-Id: I77127db8d833c675c18c30eb3d6542ca906cd2a9
Reviewed-on: http://gerrit.cloudera.org:8080/23788
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Configure separate compile and link pools for ninja. Configures link
parallelism based on expected memory use, which can be reduced by
setting IMPALA_MINIMAL_DEBUG_INFO=true or IMPALA_SPLIT_DEBUG_INFO=true.
Adds IMPALA_MAKE_CMD to simplify using the ninja build tool for all make
operations in scripts. Install ninja on Ubuntu. Adds a '-make' option to
buildall.sh to force using 'make'.
Adds MOLD_JOBS=1 to avoid overloading the system when trying 'mold' and
linking test binaries. However 'mold' is not selected as the default
due to test failures around SASL/GSSAPI (see IMPALA-14527).
Switches bin/jenkins/all-tests.sh to use ninja and removes the guard in
bootstrap_development.sh limiting IMPALA_BUILD_THREADS as it's no longer
needed with ninja.
SKIP_BE_TEST_PATTERN in run-backend-tests is unused (only used with
TARGET_FILESYSTEM=local) so I don't attempt to make it work with ninja.
Tested with local 'IMPALA_SPLIT_DEBUG_INFO=true buildall.sh -skiptests'
with default (make) and IMPALA_MAKE_CMD=ninja.
Change-Id: I0952dc19ace5c9c42bed0d2ffb61499656c0a2db
Reviewed-on: http://gerrit.cloudera.org:8080/23572
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Pranav Lodha <pranav.lodha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch bumps the Arrow version to 15.0.0 and uses the
latest toolchain to fix the Arrow JNI loading issue on Linux
aarch64 environments.
Background:
The JNI loading issue for aarch64 environments was fixed on the
native toolchain side in IMPALA-14609. Impala also needs to bump the
Arrow version to 15.0.0 and use that toolchain to pick up the fix.
Testing:
Built a new toolchain and passed the Paimon tests in an aarch64
environment.
Change-Id: I7b8dd6ab43cf05b4339880ecec0d1f48e44ef294
Reviewed-on: http://gerrit.cloudera.org:8080/23756
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
The first IMPALA-14606 commit failed to set up Python 3 on a fresh RHEL8
machine. This was not caught earlier because I tested using downstream
Jenkins, which reuses RHEL8 machines previously set up with Python 2.
This patch fixes the issue by skipping the 'pip install argparse' step
that broke the script and running setup_python3 instead on RHEL8 machines.
Testing:
- Ran full bootstrap_system.sh and buildall.sh on a fresh RHEL8 machine.
Change-Id: I6df0a534175404fe96d32eeb1e7bf0aa9ca204cd
Reviewed-on: http://gerrit.cloudera.org:8080/23772
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
Impala allows various Java versions to be selected for its build and
runtime environment when bin/bootstrap_system.sh is used to set up the
environment. Unfortunately this setup failed to affect the current Java
JRE and compiler tools on Red Hat Linux and compatibles (e.g. Rocky
Linux), because bootstrap_system.sh failed to set up the requested
version in the "alternatives" subsystem. The same failure was not
observed on Ubuntu versions; on that platform `update_java_alternatives`
was correctly run for the same purpose.
This patch adds calls to `alternatives` to set the JRE and JDK
environments to the requested version. This benefits automated test runs
in Impala's pre- and post-commit environments as well as individual
workstation setups.
Change-Id: I8972fb35b232830c6d8cf1125a7a8223547bd206
Reviewed-on: http://gerrit.cloudera.org:8080/23741
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch mainly implements querying of Paimon data tables
through a JNI-based scanner.
Features implemented:
- support column pruning.
The partition pruning and predicate push down will be submitted
as the third part of the patch.
We implemented this by treating the Paimon table as a normal
unpartitioned table. When querying a Paimon table:
- PaimonScanNode decides which Paimon splits need to be scanned,
and then transfers the splits to the BE to do the JNI-based scan operation.
- We also collect the required columns that need to be scanned,
and pass the columns to the scanner for column pruning. This is
implemented by passing the field ids of the columns to the BE,
instead of column positions, to support schema evolution.
- In the original implementation, PaimonJniScanner directly
passed the Paimon row object to the BE, which called the corresponding
Paimon row field accessor, a Java method, to convert row fields to
Impala row batch tuples. We found it slow due to the overhead of
JVM method calls.
To minimize the overhead, we reworked the implementation:
PaimonJniScanner converts the Paimon row batches to
Arrow record batches, which store data in the off-heap region of the
Impala JVM, and passes the Arrow off-heap
record batch memory pointer to the BE.
The BE PaimonJniScanNode reads data directly from the JVM off-heap
region and converts the Arrow record batch to an Impala row batch.
The benchmark shows the latter implementation is 2.x better
than the original implementation.
The lifecycle of an Arrow record batch is as follows:
the Arrow record batch is generated in the FE and passed to the BE.
After the record batch is imported into the BE successfully,
the BE is in charge of freeing it.
There are two free paths: the normal path and the
exception path. On the normal path, when the Arrow batch
is totally consumed by the BE, the BE calls JNI to fetch the next Arrow
batch. In this case, the Arrow batch is freed automatically.
The exception path is taken when the query is cancelled or a memory
allocation fails. In these corner cases, the Arrow batch is freed in the
close method if it has not been totally consumed by the BE.
Impala data types currently supported for queries:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE
TODO:
- Patches pending submission:
- Support tpcds/tpch data-loading
for paimon data table.
- Virtual Column query support for querying
paimon data table.
- Query support with time travel.
- Query support for paimon meta tables.
- WIP:
- Snapshot incremental read.
- Complex type query support.
- Native paimon table scanner, instead of
jni based.
Testing:
- Create tests table in functional_schema_template.sql
- Add TestPaimonScannerWithLimit in test_scanners.py
- Add test_paimon_query in test_paimon.py.
- Already passed the tpcds/tpch tests for Paimon tables. The test
table data is currently generated by Spark, which Impala does not
support yet; Spark is needed because Hive doesn't support
generating Paimon tables for dynamic-partitioned tables.
We plan to submit a separate patch for tpcds/tpch data
loading and the associated tpcds/tpch query tests.
- JVM off-heap memory leak tests: ran looped tpch tests for
1 day; no obvious off-heap memory increase was observed, and
off-heap memory usage stayed within 10M.
Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384
Reviewed-on: http://gerrit.cloudera.org:8080/23613
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
When the environment variable USE_APACHE_HIVE is set to true, Impala is
built against Apache Hive 3.x. To better distinguish it
from Apache Hive 2.x later, rename USE_APACHE_HIVE to USE_APACHE_HIVE_3.
Additionally, to facilitate referencing different versions of the Hive
MetastoreShim, the major version of Hive has been added to the environment
variable IMPALA_HIVE_DIST_TYPE.
Change-Id: I11b5fe1604b6fc34469fb357c98784b7ad88574d
Reviewed-on: http://gerrit.cloudera.org:8080/21724
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds the java/impala-package Maven project to make it easier
to ship / test the Calcite planner. impala-package has a dependency
on impala-frontend and calcite-planner, so its classpath requires
no extra work when constructing the classpath.
An additional cleanup is that this no longer puts the
impala-frontend-*-tests.jar on the classpath by default. This requires
updating the query event hooks test, as it relies on that jar being
present.
This does not change the default value for the use_calcite_planner
query option, so there is no change in behavior.
Testing:
- Ran a core job
- Built docker images and OS packages locally
Change-Id: I81dec2a5b59e279229a735c8bb1a23c77111a793
Reviewed-on: http://gerrit.cloudera.org:8080/23497
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When the --use_calcite_planner=true option is set at the server level,
the queries will no longer go through CalciteJniFrontend. Instead, they
will go through the regular JniFrontend, which is the path that is used
when the query option for "use_calcite_planner" is set.
The CalciteJniFrontend will be removed in a later commit.
This commit also enables fallback to the original planner when an unsupported
feature exception is thrown. This needed to be added to allow the tests to run
properly. During initial database load, there are queries that access complex
columns, which throw the unsupported exception.
Change-Id: I732516ca8f7ea64f73484efd67071910c9b62c8f
Reviewed-on: http://gerrit.cloudera.org:8080/23523
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
Tested-by: Steve Carlin <scarlin@cloudera.com>
Redhat 9 environments recently switched to OpenSSL 3.5.1. On those
machines, the Kudu minicluster fails to start up with a CSR signature
verification error. KUDU-3716 fixed this issue.
This patch updates the toolchain and Kudu version to pick up KUDU-3716.
Testing:
Passed data loading on Redhat 9.
Change-Id: I7262267939a9f08650af85443240950afbb3323f
Reviewed-on: http://gerrit.cloudera.org:8080/23697
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This modifies bin/single_node_perf_run.py to stop using the sh
python package. It replaces sh with calls to subprocess. It
stops installing sh for both the Python 2 and 3 virtualenvs.
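As a rough illustration of the kind of replacement described above (not
the actual code in bin/single_node_perf_run.py), a call made through the
sh package can be expressed with the standard subprocess module. The
helper name run_command is hypothetical:

```python
import subprocess


def run_command(args, cwd=None):
    """Run a command, raise CalledProcessError on failure, return its output."""
    result = subprocess.run(
        args, cwd=cwd, check=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True)  # decode output to str instead of bytes
    return result.stdout
```

For example, run_command(["git", "rev-parse", "HEAD"]) would stand in
for a call like sh.git("rev-parse", "HEAD").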
Testing:
- Ran perf-AB-test job with it and examined the logs
Change-Id: Ic5f9316a5d83c5c0dc37d4a94c55b6a655765fe3
Reviewed-on: http://gerrit.cloudera.org:8080/23600
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
On python 3, when Impyla receives a result with a string that is
not valid UTF-8, it returns that as bytes. TPC-DS Q30 on scale 20
has a result that contains invalid UTF-8, so bin/run-workload.py
can fail while trying to dump this to JSON.
This modifies CustomJSONEncoder to handle serializing bytes by
converting it to a string with invalid unicode handled with
backslashes.
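A minimal sketch of the approach, assuming a json.JSONEncoder subclass.
The class name BytesTolerantEncoder is illustrative and this is not the
actual CustomJSONEncoder code:

```python
import json


class BytesTolerantEncoder(json.JSONEncoder):
    """Serialize bytes as text, escaping invalid UTF-8 with backslashes."""

    def default(self, obj):
        if isinstance(obj, bytes):
            # b"ab\xff" becomes the str 'ab\\xff' instead of raising.
            return obj.decode("utf-8", errors="backslashreplace")
        return super().default(obj)


encoded = json.dumps({"value": b"ab\xff"}, cls=BytesTolerantEncoder)
```

The "backslashreplace" error handler keeps the offending bytes visible
in the dump rather than dropping or replacing them.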
Testing:
- Ran bin/run-workload.py against TPC-DS scale 20
Change-Id: Ibe31c656de4fc65f8580c7b3b49bf655b8a5ecea
Reviewed-on: http://gerrit.cloudera.org:8080/23602
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This change adds an optional flag to coverage_helper.sh script that
accepts additional parameters for the wrapped gcovr call.
Tests:
- Manually validated that the script keeps its original behaviour if the
newly added flag is not set, and that the parameters are passed through
correctly if it is set.
Change-Id: Iea26c9967b62b06ded6a0cb4c0346f0e789beb80
Reviewed-on: http://gerrit.cloudera.org:8080/23290
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Peter Rozsa <prozsa@cloudera.com>
This patch updates CDP_BUILD_NUMBER to 71942734 in order to
upgrade Iceberg to 1.5.2.
This patch updates some tests so they pass with Iceberg 1.5.2. The
behavior changes of Iceberg 1.5.2 are (compared to 1.3.1):
* Iceberg V2 tables are created by default
* Metadata tables have different schema
* Parquet compression is explicitly set for new tables (even for ORC
tables)
* Sequence numbers are assigned a bit differently
Updated the tests where needed.
Code changes to accommodate the above behavior changes:
* SHOW CREATE TABLE adds 'format-version'='1' for Iceberg V1 tables
* CREATE TABLE statements don't throw errors when Parquet compression
is set for ORC tables
Change-Id: Ic4f9ed3f7ee9f686044023be938d6b1d18c8842e
Reviewed-on: http://gerrit.cloudera.org:8080/23670
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
An initial implementation of KUDU-1261 (array column type) was recently
merged into the upstream Apache Kudu repository. This patch adds initial
Impala support for working with Kudu tables having array type columns.
Unlike rows, the elements of a Kudu array are stored in a different
format than Impala. Instead of per-row bit flag for NULL info, values
and NULL bits are stored in separate arrays.
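A purely illustrative sketch of the layout difference, assuming a values
array plus a parallel flag array in which True marks a NULL element. The
real Kudu representation and the Impala conversion code are not shown
here, and the function name is hypothetical:

```python
def to_nullable_elements(values, is_null):
    """Combine a values array and a parallel NULL-flag array into one list."""
    if len(values) != len(is_null):
        raise ValueError("values and is_null must have the same length")
    # A slot flagged as NULL yields None regardless of its stored value.
    return [None if null else value for value, null in zip(values, is_null)]


elements = to_nullable_elements([1, 0, 3], [False, True, False])
```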
The following types of queries are not supported in this patch:
- (IMPALA-14538) Queries that reference an array column as a table, e.g.
```sql
SELECT item FROM kudu_array.array_int;
```
- (IMPALA-14539) Queries that create duplicate collection slots, e.g.
```sql
SELECT array_int FROM kudu_array AS t, t.array_int AS unnested;
```
Testing:
- Add some FE tests in AnalyzeDDLTest and AnalyzeKuduDDLTest.
- Add EE test test_kudu.py::TestKuduArray.
Since Impala does not support inserting complex types, including
arrays, the data insertion part of the test is done through
custom C++ code, kudu-array-inserter.cc, that inserts into Kudu via the
Kudu C++ client. It would be great if we could migrate it to Python so
that it can be moved to the same file as the test (IMPALA-14537).
- Pass core tests.
Co-authored-by: Riza Suminto
Change-Id: I9282aac821bd30668189f84b2ed8fff7047e7310
Reviewed-on: http://gerrit.cloudera.org:8080/23493
Reviewed-by: Alexey Serbin <alexey@apache.org>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Uses IMPALA_MINIMAL_DEBUG_INFO=true in Jenkins
build-all-flag-combinations.sh to reduce memory usage during linking and
avoid OOM kills. This script uses -skiptests to build all test binaries,
but doesn't run them, so debug info is not needed.
Change-Id: I4605b98d8d197e07c2eaac8218ff985c798875ed
Reviewed-on: http://gerrit.cloudera.org:8080/23641
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit bumps the Impala toolchain to pick up the latest Kudu version,
up to commit 60f5e5267b92c39485a66121d3ce3cc7ef57b0e0 (KUDU-1261 make
ArrayCellMetadataView::Init() more robust).
Change-Id: I68009e5fefd053882f5504cd2520bacb189a1b04
Reviewed-on: http://gerrit.cloudera.org:8080/23631
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3
This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
doesn't have a main function, it removes the hash-bang and makes
sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
(or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
replaced by the cm-client pypi package and interfaces have changed.
Rather than migrating the code (which hasn't been used in years), this
deletes the old code and stops installing cm-api into the virtualenv.
The code can be restored and revamped if there is any interest in
interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
bit-rotted. Some pieces can be run manually, but it can't be fully
verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
version that supports Python 3. The newest version of kazoo requires
upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
needing other upgrades.
The two remaining uses of impala-python are:
- bin/cmake_aux/create_virtualenv.sh
- bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.
The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python)
Testing:
- Ran core job
- Ran build + dataload on Centos 7, Redhat 8
- Manual testing of individual scripts (except some bitrotted areas like the
random query generator)
Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Since the toolchain was bumped to pick up Kudu's array column
feature (KUDU-1261), Impala's TSAN builds on the master branch
consistently break during dataload with a data race detected by TSAN.
The source of the data race lies within libkudu_client.so, and it only
triggers if the Impala build machine has both IPv4 and IPv6 addresses
associated with localhost. Until the exact root cause is found and fixed,
this patch works around the TSAN issue by fixing the KUDU_MASTER_HOSTS
env var to 127.0.0.1.
Testing:
Ran a TSAN build and confirmed that no data race error is emitted.
Change-Id: I511ab625d18c6007567083557fcdf98980a6ac6f
Reviewed-on: http://gerrit.cloudera.org:8080/23507
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
Minidump stack resolution does not work on Redhat8 ARM64.
Redhat8 ARM64 uses 64KB pages, and the Breakpad library does
not properly handle collecting stacks for that configuration.
Breakpad rounds off the stack pointer to the nearest page
boundary below the stack pointer, then collects up to 32KB of
stack memory. With a top-down stack, this means it is collecting
some memory that is not used by the stack. With 64KB pages,
the memory it collects usually doesn't contain any stack contents.
This picks up a toolchain with Breakpad patched to fix this. The
patch stops rounding the stack pointer to the nearest page.
Instead, it adjusts the stack pointer to account for the red
zone (128 bytes on x86_64) and then rounds to the nearest 1KB
boundary below the stack pointer.
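The arithmetic can be illustrated with a small sketch. This is not
Breakpad's code: the function names are hypothetical, and the sample
address is made up for the example.

```python
RED_ZONE_BYTES = 128  # x86_64 red zone size, as stated above
ALIGNMENT = 1024      # new rounding granularity (1KB)


def page_round_down(stack_pointer, page_size):
    """Old behaviour: round down to the page boundary below the pointer."""
    return stack_pointer & ~(page_size - 1)


def stack_capture_start(stack_pointer, red_zone=RED_ZONE_BYTES):
    """New behaviour: skip the red zone, then round down to a 1KB boundary."""
    return (stack_pointer - red_zone) & ~(ALIGNMENT - 1)


sp = 0x7FFD00001234
old_start = page_round_down(sp, 64 * 1024)  # 0x7FFD00000000
new_start = stack_capture_start(sp)         # 0x7FFD00001000
```

With 64KB pages the old start lands 0x1234 bytes below the stack
pointer, while the new start stays within 1KB plus the red zone.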
Testing:
- Produced and resolved minidumps on multiple build types for
x86_64 and ARM64 (release, debug, asan, ubsan)
Change-Id: I4fbd91abfbddfd8355d27ae9d9b86b70a9ce0409
Reviewed-on: http://gerrit.cloudera.org:8080/23465
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Cleans up repetitive patterns in pom.xml.
Centralize plugin configuration in pluginManagement. Replace inline
maven-compiler-plugin configuration with newer maven.compiler.release
and update to latest plugin version.
Centralize common dependencies in dependencyManagement, including
exclusions when appropriate. Remove exclusions that are no longer
relevant.
Compared before and after with dependency:tree; only difference is that
commons-cli now comes from hadoop and jersey-serv{let,er} are
effectively excluded; all versions matched. Also ensured
USE_APACHE_COMPONENTS=true compiles.
Adds com.amazonaws:aws-java-sdk-bundle to exclusion checking to ensure
it's not accidentally included alongside impala-minimal-s3a-aws-sdk.
Removes missed io.netty exclusion from IMPALA-12816.
Updates commons-dbcp2 to 2.12.0 to match Hive.
Change-Id: If96649840e23036b4a73ee23e8d12516497994f0
Reviewed-on: http://gerrit.cloudera.org:8080/23432
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Downstream error reports pointed out that the toolchain version picked
up for IMPALA-14139 contains toolchain binaries for Red Hat 9 (and
compatibles) that require at least the 9.5 minor version because of
OpenSSL library requirements. This was caused by the toolchain binary
build process not using package repo pinning for the redhat9 build
container definition, which caused the container process to install
"latest" packages, in this case packages released in Rocky / Red Hat
9.5.
This patch bumps the toolchain ID to a version in which the redhat9
binaries were produced in a build container "moved back in time" to the
9.2 release by pinning the package repos to the Rocky Linux 9.2 state,
using the Rocky Vault.
The patch also picks up a buffer overflow mitigation for the ORC
library.
Change-Id: I5c6921afdc69a4a6644b619de6b8d4e4cc69e601
Reviewed-on: http://gerrit.cloudera.org:8080/23448
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Removes IMPALA_JAVA_HOME_OVERRIDE and updates version selection. In
order of priority:
1. If IMPALA_JDK_VERSION is set, use the OS JDK version from a known
location. This is primarily used when also installing the JDK as part
of automated builds.
2. If JAVA_HOME is set, use it.
3. Look for the system default JDK.
The IMPALA_JDK_VERSION variable is no longer modified to avoid issues
when sourcing impala-config.sh multiple times. JAVA_HOME will be
modified if IMPALA_JDK_VERSION is set; both must be unset to restore
using the system default Java.
If switching between JDKs, now prefer setting JAVA_HOME. If relying on
system Java, unset JAVA_HOME after e.g. update-java-alternatives.
The detected Java version is set in IMPALA_JAVA_TARGET, which is used to
add Java 9+ options and configure the Java compilation target.
Eliminates IMPALA_JDK_VERSION_NUM as its value was always identical to
IMPALA_JAVA_TARGET.
Stops printing from impala-config-java.sh. It made the output from
impala-config.sh look strange, and the decisions can all be clearly
determined from impala-config.sh printed variables later or the packages
installed in bootstrap_system.sh.
Fixes JAVA_HOME in bootstrap_build.sh on ARM64 systems.
Change-Id: I68435ca69522f8310221a0f3050f13d86568b9da
Reviewed-on: http://gerrit.cloudera.org:8080/23434
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Update the following elements of the Impala build environment to enable
builds on Ubuntu 24.04:
- Recognize and handle (where necessary) Ubuntu 24.04 in various
bootstrap scripts (bootstrap_system.sh, bootstrap_toolchain.py, etc.)
- Bump IMPALA_TOOLCHAIN_ID to an official toolchain build that contains
Ubuntu 24.04-specific binary packages
- Bump binutils to 2.42, and
- Bump the GDB version to 12.1-p1, as required by the new toolchain
version
- Update unique_ptr usage syntax in be/src/util/webserver-test.cc to
compensate for new GLIBC function prototypes:
System headers in Ubuntu 24.04 adopted attributes on several widely
used function prototypes. Such attributes are not considered to be part
of the function's signature during template evaluation, so GCC throws a
warning when such a function is passed as a template argument, which
breaks the build, as warnings are treated as errors.
webserver-test.cc uses pclose() as the deleter for a unique_ptr in a
utility function. This patch encapsulates pclose() and its attributes in
an explicit specialization for std::default_delete<>, "hiding" the
attributes inside a functor.
The particular solution was inspired by Anton-V-K's proposal in
https://gist.github.com/t-mat/5849549
This commit builds on an earlier patch for the same purpose by Michael
Smith: https://gerrit.cloudera.org/c/23058/
Change-Id: Ia4454b0c359dbf579e6ba2f9f9c44cfa3f1de0d2
Reviewed-on: http://gerrit.cloudera.org:8080/23384
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch mainly implements creating and dropping Paimon tables
through Impala.
Supported impala data types:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE
Syntax for creating paimon table:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
[col_name data_type ,...]
[PRIMARY KEY (col1,col2)]
)
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
STORED AS PAIMON
[LOCATION 'hdfs_path']
[TBLPROPERTIES (
'primary-key'='col1,col2',
'file.format' = 'orc/parquet',
'bucket' = '2',
'bucket-key' = 'col3'
)];
Two types of paimon catalogs are supported.
(1) Create table with hive catalog:
CREATE TABLE paimon_hive_cat(userid INT,movieId INT)
STORED AS PAIMON;
(2) Create table with hadoop catalog:
CREATE [EXTERNAL] TABLE paimon_hadoop_cat
STORED AS PAIMON
TBLPROPERTIES('paimon.catalog'='hadoop',
'paimon.catalog_location'='/path/to/paimon_hadoop_catalog',
'paimon.table_identifier'='paimondb.paimontable');
SHOW TABLE STATS/SHOW COLUMN STATS/SHOW PARTITIONS/SHOW FILES
statements are also supported.
TODO:
- Patches pending submission:
- Query support for paimon data files.
- Partition pruning and predicate push down.
- Query support with time travel.
- Query support for paimon meta tables.
- WIP:
- Complex type query support.
- Virtual Column query support for querying
paimon data table.
- Native paimon table scanner, instead of
jni based.
Testing:
- Add unit test for paimon impala type conversion.
- Add unit test for ToSqlTest.java.
- Add unit test for AnalyzeDDLTest.java.
- Update default_file_format TestEnumCase in
be/src/service/query-options-test.cc.
- Update test case in
testdata/workloads/functional-query/queries/QueryTest/set.test.
- Add test cases in metadata/test_show_create_table.py.
- Add custom test test_paimon.py.
Change-Id: I57e77f28151e4a91353ef77050f9f0cd7d9d05ef
Reviewed-on: http://gerrit.cloudera.org:8080/22914
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Added '-udf_devel_package' option to buildall.sh. This generates
impala-udf-devel rpm which includes udf headers and static libraries -
ImpalaUdf-retail.a and ImpalaUdf-debug.a.
Testing:
- Tested that rpm is generated using build script:
./buildall.sh -release_and_debug -notests -udf_devel_package
- Tested that the rpm is also generated using standalone script:
./bin/make-impala-udf-devel-rpm.sh
- Generated impala-udf-devel package and tested compiling
impala_udf_samples:
https://github.com/cloudera/impala-udf-samples
Change-Id: I5b85df9c3f680a7e5551f067a97a5650daba9b50
Reviewed-on: http://gerrit.cloudera.org:8080/23060
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Runs mvn clean on all Java subprojects - instead of just ext-data-source
- to avoid build failures when files from other versions of the code and
dependencies are left behind.
Change-Id: I8cf540f90adbff327de98f900059bfa3bbc8ef22
Reviewed-on: http://gerrit.cloudera.org:8080/23374
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Running exhaustive tests with env var IMPALA_USE_PYTHON3_TESTS=true
reveals some tests that require adjustment. This patch makes those
adjustments, which mostly revolve around encoding differences and the
string vs bytes distinction in Python 3. This patch also switches the
default to run pytest with Python 3 by setting
IMPALA_USE_PYTHON3_TESTS=true. The following are the details:
Change hash() function in conftest.py to crc32() to produce
deterministic hash. Hash randomization is enabled by default since
Python 3.3 (see
https://docs.python.org/3/reference/datamodel.html#object.__hash__).
This causes test sharding (like --shard_tests=1/2) to produce an
inconsistent set of tests per shard. Always restart the minicluster during
custom cluster tests if the --shard_tests argument is set, because test
order may change and affect test correctness, depending on whether the
tests run on a fresh minicluster or not.
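A minimal sketch of deterministic shard assignment with crc32, assuming
tests are identified by a string id. The function name shard_of and the
1-based numbering are assumptions, not the actual conftest.py code:

```python
import zlib


def shard_of(test_id, num_shards):
    """Return a 1-based shard index that is stable across interpreter runs."""
    # Unlike hash() on str, zlib.crc32 does not depend on PYTHONHASHSEED.
    return zlib.crc32(test_id.encode("utf-8")) % num_shards + 1


shard = shard_of("query_test/test_scanners.py::TestParquet::test_parquet", 2)
```

Every process computing shard_of() for the same id gets the same shard,
so the shards partition the test set consistently.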
Moved one test case from delimited-latin-text.test to
test_delimited_text.py for easier binary comparison.
Add bytes_to_str() as a utility function to decode bytes in Python3.
This is often needed when inspecting the return value of
subprocess.check_output() as a string.
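A sketch of what such a helper might look like. The signature and the
UTF-8 default are assumptions, and the real bytes_to_str() may differ:

```python
import subprocess
import sys


def bytes_to_str(value, encoding="utf-8"):
    """Decode bytes to str; pass through values that are already str."""
    return value.decode(encoding) if isinstance(value, bytes) else value


# check_output() returns bytes on Python 3 unless text mode is requested.
raw = subprocess.check_output([sys.executable, "-c", "print('hi')"])
text = bytes_to_str(raw).strip()
```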
Implement DataTypeMetaclass.__lt__ to substitute
DataTypeMetaclass.__cmp__ that is ignored in Python3 (see
https://peps.python.org/pep-0207/).
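To illustrate why __lt__ is needed, the sketch below defines ordering on
a metaclass so that classes themselves can be sorted. The class names
and the ordering by class name are hypothetical, not the actual
DataTypeMetaclass logic:

```python
class OrderedTypeMeta(type):
    """Metaclass that lets classes be compared with '<' and sorted."""

    def __lt__(cls, other):
        # Python 3 ignores __cmp__; sorted() only needs __lt__.
        return cls.__name__ < other.__name__


class Int(metaclass=OrderedTypeMeta):
    pass


class Char(metaclass=OrderedTypeMeta):
    pass


sorted_types = sorted([Int, Char])
```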
Fix WEB_CERT_ERR difference in test_ipv6.py.
Fix trivial integer parsing in test_restart_services.py.
Fix various encoding issues in test_saml2_sso.py,
test_shell_commandline.py, and test_shell_interactive.py.
Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1.
Switch to binary comparison in test_iceberg.py where needed.
Specify text mode when calling tempfile.NamedTemporaryFile().
Simplify create_impala_shell_executable_dimension to skip testing dev
and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. The reason
is that several UTF-8 related tests in test_shell_commandline.py break
in the Python3 pytest + Python2 impala-shell combo. This skipping already
happens automatically on build OSes without a system Python2 available,
like RHEL9 (IMPALA_SYSTEM_PYTHON2 env var is empty).
Removed unused vector argument and fixed some trivial flake8 issues.
The logic of several tests requires modification due to intermittent
issues in Python3 pytest. These include:
Add _run_query_with_client() in test_ranger.py to allow reusing a single
Impala client for running several queries. Ensure clients are closed
when the test is done. Mark several tests in test_ranger.py with
SkipIfFS.hive because they run queries through beeline + HiveServer2,
but the Ozone and S3 build environments do not start HiveServer2 by
default.
Increase the sleep period from 0.1 to 0.5 seconds per iteration in
test_statestore.py and mark TestStatestore to execute serially. This is
because TServer appears to shut down more slowly when run concurrently
with other tests. Handle the deprecation of Thread.setDaemon() as well.
Always set force_restart=True for each test method in TestLoggingCore,
TestShellInteractiveReconnect, and TestQueryRetries to prevent them from
reusing the minicluster from a previous test method. Some of these tests
disrupt the minicluster (kill impalad) and will produce a minidump if the
metrics verifier for the next test fails to detect a healthy minicluster
state.
Testing:
Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true.
Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4
Reviewed-on: http://gerrit.cloudera.org:8080/23319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
from metastore
HIVE-27746 introduced the ALTER_PARTITIONS event type, an optimization
that collapses bulk ALTER_PARTITION events into a single event. The
components version is updated to pick up this change. Supporting this
event in Impala significantly reduces the number of events consumed by
the event processor and helps it catch up with events quickly.
This patch enables the ability to consume the ALTER_PARTITIONS event.
The downside of this patch is that there is no before_partitions object
in the event message. This can cause partitions to be refreshed even on
trivial changes to them. HIVE-29141 will address this concern.
Testing:
- Added an end-to-end test to verify consuming the ALTER_PARTITIONS
event. Longer timeouts were also added in this test because flakiness
was observed while looping it several times.
Change-Id: I009a87ef5e2c331272f9e2d7a6342cc860e64737
Reviewed-on: http://gerrit.cloudera.org:8080/22554
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
bin/bootstrap_build.sh did not distinguish between various versions of
the Ubuntu platform, and attempted to install unversioned Python
packages (python-dev and python-setuptools) even on newer versions
that no longer support Python 2 (e.g. Ubuntu 22.04 and 24.04).
On older Ubuntu versions these packages are still useful, so at this
point it is not feasible to just drop them.
This patch makes these packages optional: they are added to the list of
packages to be installed only if they actually exist for the platform.
The patch also extends the package list with some basic packages that
are needed when bin/bootstrap_build.sh is run inside an Ubuntu 22.04
Docker container.
Tests: ran a compile-only build on Ubuntu 20.04 (still has Python 2) and
on Ubuntu 22.04 (does not support Python 2 any more).
Change-Id: I94ade35395afded4e130b79eab8c27c6171b50d6
Reviewed-on: http://gerrit.cloudera.org:8080/21800
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
load-data.py is used for data loading while run-workload.py is used for
running perf-AB-test. This patch changes both scripts from using the
beeswax protocol to the HS2 protocol.
Testing:
Ran data loading and perf-AB-test-ub2004 based on this patch.
Change-Id: I1c3727871b8b2e75c3f10ceabfbe9cb96e36ead3
Reviewed-on: http://gerrit.cloudera.org:8080/23309
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds representation of Impala select queries using OpenTelemetry
traces.
Each Impala query is represented as its own individual OpenTelemetry
trace. The one exception is retried queries, which have an
individual trace for each attempt. These traces consist of a root span
and several child spans. Each child span has the root as its parent.
No child span has another child span as its parent. Each child span
represents one high-level query lifecycle stage. Each child span also
has span attributes that further describe the state of the query.
Child spans:
1. Init
2. Submitted
3. Planning
4. Admission Control
5. Query Execution
6. Close
Each child span contains a mix of universal attributes (available on
all spans) and query phase specific attributes. For example, the
"ErrorMsg" attribute, present on all child spans, is the error
message (if any) at the end of that particular query phase. One
example of a child span specific attribute is "QueryType" on the
Planning span. Since query type is first determined during query
planning, the "QueryType" attribute is present on the Planning span
and has a value of "QUERY" (since only selects are supported).
Since queries can run for lengthy periods of time, the Init span
communicates the beginning of a query along with global query
attributes. For example, span attributes include query id, session
id, sql, user, etc.
Once the query has closed, the root span is closed.
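
The span layout described above can be modeled roughly as follows. This
sketch uses plain dataclasses rather than the OpenTelemetry SDK; the
Span class, build_query_trace(), the root span name, and the "QueryId"
attribute key are all hypothetical. Only the child span names and the
"ErrorMsg" and "QueryType" attributes come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Child span names, in lifecycle order, as listed in the description.
CHILD_SPAN_NAMES = [
    "Init", "Submitted", "Planning",
    "Admission Control", "Query Execution", "Close",
]


@dataclass
class Span:
    name: str
    parent: Optional["Span"] = None
    attributes: Dict[str, str] = field(default_factory=dict)
    ended: bool = False


def build_query_trace(query_id: str) -> List[Span]:
    """Return [root, child1, ..., child6] for one query attempt."""
    root = Span(name="query-" + query_id)  # root span name is made up
    spans = [root]
    for name in CHILD_SPAN_NAMES:
        # Every child hangs directly off the root; no nesting of children.
        child = Span(name=name, parent=root)
        child.attributes["ErrorMsg"] = ""  # universal attribute
        if name == "Init":
            child.attributes["QueryId"] = query_id  # hypothetical key
        if name == "Planning":
            child.attributes["QueryType"] = "QUERY"  # only selects traced
        child.ended = True
        spans.append(child)
    # The root span is closed only after the query has closed.
    root.ended = True
    return spans
```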
Testing accomplished with new custom cluster tests.
Generated-by: Github Copilot (GPT-4.1, Claude Sonnet 3.7)
Change-Id: Ie40b5cd33274df13f3005bf7a704299ebfff8a5b
Reviewed-on: http://gerrit.cloudera.org:8080/22924
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch, USE_APACHE_COMPONENTS overrode all USE_APACHE_*
variables, but using only specific Apache components should also be
supported. After this patch, if USE_APACHE_COMPONENTS is not false, the
USE_APACHE_{HADOOP,HBASE,HIVE,TEZ,RANGER} variables are all set to true.
Otherwise, the individual values of
USE_APACHE_{HADOOP,HBASE,HIVE,TEZ,RANGER} are used.
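
The resolution rule can be sketched as below. The real logic lives in
the shell configuration; this Python model, the function name
resolve_apache_flags(), and the "false" default for unset variables are
assumptions made for illustration.

```python
COMPONENTS = ("HADOOP", "HBASE", "HIVE", "TEZ", "RANGER")


def resolve_apache_flags(env):
    """Return the effective USE_APACHE_<component> values as strings."""
    # Assumption: an unset USE_APACHE_COMPONENTS behaves like "false".
    use_all = env.get("USE_APACHE_COMPONENTS", "false") != "false"
    resolved = {}
    for component in COMPONENTS:
        key = "USE_APACHE_" + component
        # The global switch forces true; otherwise keep the per-component
        # value (assumed to default to "false" when unset).
        resolved[key] = "true" if use_all else env.get(key, "false")
    return resolved
```

For example, passing USE_APACHE_HIVE=true with
USE_APACHE_COMPONENTS=false enables only Apache Hive in this model.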
Test:
- Built and ran a test cluster with USE_APACHE_HIVE=true
and USE_APACHE_COMPONENTS=false.
Change-Id: I33791465a3b238b56f82d749e3dbad8215f3b3bc
Reviewed-on: http://gerrit.cloudera.org:8080/23211
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds verification code to ensure the IMPALA_TOOLCHAIN_COMMIT_HASH
environment variable matches the commit hash in the
IMPALA_TOOLCHAIN_BUILD_ID_AARCH64 and
IMPALA_TOOLCHAIN_BUILD_ID_X86_64 environment variables.
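
A check of this kind could look like the following sketch. It assumes
each build ID ends in "-<abbreviated commit hash>" and that the full
commit hash starts with that abbreviation; this format, the function
names, and the sample values are assumptions, not taken from the patch.

```python
BUILD_ID_VARS = (
    "IMPALA_TOOLCHAIN_BUILD_ID_AARCH64",
    "IMPALA_TOOLCHAIN_BUILD_ID_X86_64",
)


def build_id_matches_commit(build_id, commit_hash):
    """True if the hash suffix of build_id is a prefix of commit_hash."""
    # Assumed format: "<build number>-<abbreviated hash>".
    _, sep, short_hash = build_id.rpartition("-")
    if not sep or not short_hash:
        return False
    return commit_hash.startswith(short_hash)


def mismatched_build_ids(env):
    """Return the names of build ID variables that do not match."""
    commit = env["IMPALA_TOOLCHAIN_COMMIT_HASH"]
    return [name for name in BUILD_ID_VARS
            if not build_id_matches_commit(env[name], commit)]
```

An empty list from mismatched_build_ids() would mean both build IDs are
consistent with the commit hash under the assumed format.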
Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: I348698356a014413875f6b8b54a005bf89b9793a
Reviewed-on: http://gerrit.cloudera.org:8080/23243
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>