This patch mainly implements querying of paimon data tables
through a JNI-based scanner.
Features implemented:
- Support for column pruning.
Partition pruning and predicate pushdown will be submitted
as the third part of this patch series.
We implemented this by treating the paimon table as a normal
unpartitioned table. When querying a paimon table:
- PaimonScanNode decides which paimon splits need to be scanned,
and then transfers the splits to the BE for the JNI-based scan
operation.
- We also collect the required columns that need to be scanned,
and pass them to the scanner for column pruning. This is
implemented by passing the field ids of the columns to the BE,
instead of column positions, in order to support schema evolution.
- In the original implementation, PaimonJniScanner passed paimon
row objects directly to the BE and called the corresponding paimon
row field accessors, i.e. Java methods that convert row fields
into impala row batch tuples. We found this to be slow due to the
overhead of JVM method calls.
To minimize that overhead, we reworked the implementation:
PaimonJniScanner now converts the paimon row batches to Arrow
record batches, which store data in the off-heap region of the
impala JVM, and passes the memory pointer of the off-heap Arrow
record batch to the BE.
The BE PaimonJniScanNode reads data directly from the JVM off-heap
region and converts the Arrow record batch into an impala row batch.
Benchmarks show the new implementation is about 2x faster
than the original one.
The lifecycle of an Arrow record batch is as follows:
the batch is generated in the FE and passed to the BE.
Once the record batch has been imported into the BE successfully,
the BE is in charge of freeing it.
There are two free paths: the normal path and the
exceptional path. On the normal path, when the Arrow batch
has been fully consumed, the BE calls into the JVM via JNI to fetch
the next Arrow batch; in this case, the previous batch is freed
automatically. The exceptional path is taken when the query is
cancelled or a memory allocation fails. In these corner cases, the
Arrow batch is freed in the close() method if it has not been fully
consumed by the BE.
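As a rough illustration of this hand-off, the following minimal C++
sketch (hypothetical names, not Impala's actual code) shows how the
BE side can import the off-heap batch exported by the JVM through the
Arrow C Data Interface:

  #include <cstdint>
  #include <memory>
  #include "arrow/c/bridge.h"       // arrow::ImportRecordBatch
  #include "arrow/record_batch.h"
  #include "arrow/result.h"

  // Hypothetical sketch: the JNI scanner returns the addresses of the
  // ArrowArray/ArrowSchema structs the JVM side filled in off-heap.
  arrow::Result<std::shared_ptr<arrow::RecordBatch>> ImportJvmBatch(
      int64_t array_addr, int64_t schema_addr) {
    auto* c_array = reinterpret_cast<struct ArrowArray*>(array_addr);
    auto* c_schema = reinterpret_cast<struct ArrowSchema*>(schema_addr);
    // ImportRecordBatch takes over the structs' release callbacks, so
    // the off-heap memory is freed when the returned batch is destroyed
    // -- matching the "normal path" above. The exceptional path must
    // release any batch that was imported but not fully consumed.
    return arrow::ImportRecordBatch(c_array, c_schema);
  }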
Currently supported impala data types for queries include:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE
TODO:
- Patches pending submission:
- Support tpcds/tpch data-loading
for paimon data table.
- Virtual Column query support for querying
paimon data table.
- Query support with time travel.
- Query support for paimon meta tables.
- WIP:
- Snapshot incremental read.
- Complex type query support.
- Native paimon table scanner, instead of the JNI-based one.
Testing:
- Create test tables in functional_schema_template.sql
- Add TestPaimonScannerWithLimit in test_scanners.py
- Add test_paimon_query in test_paimon.py.
- Already passed the tpcds/tpch tests for paimon tables. The test
table data is currently generated by spark, because generating it
is not yet supported by impala, and hive doesn't support generating
paimon tables with dynamic partitioning. We plan to submit a
separate patch for tpcds/tpch data loading and the associated
tpcds/tpch query tests.
- JVM off-heap memory leak tests: ran looped tpch tests for
1 day; no obvious off-heap memory increase was observed, and
off-heap memory usage stayed within 10MB.
Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384
Reviewed-on: http://gerrit.cloudera.org:8080/23613
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
When the environment variable USE_APACHE_HIVE is set to true, Impala
is built against Apache Hive 3.x. To better distinguish this from
Apache Hive 2.x later, rename USE_APACHE_HIVE to USE_APACHE_HIVE_3.
Additionally, to facilitate referencing different versions of the Hive
MetastoreShim, the major version of Hive has been added to the environment
variable IMPALA_HIVE_DIST_TYPE.
Change-Id: I11b5fe1604b6fc34469fb357c98784b7ad88574d
Reviewed-on: http://gerrit.cloudera.org:8080/21724
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Redhat 9 environments recently switched to OpenSSL 3.5.1. On those
machines, the Kudu minicluster fails to start up with CSR signature
verification error. KUDU-3716 fixed this issue.
This patch updates the toolchain and Kudu versions to pick up KUDU-3716.
Testing:
Passed data loading in Redhat 9.
Change-Id: I7262267939a9f08650af85443240950afbb3323f
Reviewed-on: http://gerrit.cloudera.org:8080/23697
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch updates CDP_BUILD_NUMBER to 71942734 in order to
upgrade Iceberg to 1.5.2.
This patch updates some tests so they pass with Iceberg 1.5.2. The
behavior changes of Iceberg 1.5.2 are (compared to 1.3.1):
* Iceberg V2 tables are created by default
* Metadata tables have different schema
* Parquet compression is explicitly set for new tables (even for ORC
tables)
* Sequence numbers are assigned a bit differently
Updated the tests where needed.
Code changes to accommodate the above behavior changes:
* SHOW CREATE TABLE adds 'format-version'='1' for Iceberg V1 tables
* CREATE TABLE statements don't throw errors when Parquet compression
is set for ORC tables
Change-Id: Ic4f9ed3f7ee9f686044023be938d6b1d18c8842e
Reviewed-on: http://gerrit.cloudera.org:8080/23670
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit bumps the Impala toolchain to pick up the latest Kudu
version, up to commit 60f5e5267b92c39485a66121d3ce3cc7ef57b0e0
(KUDU-1261: make ArrayCellMetadataView::Init() more robust).
Change-Id: I68009e5fefd053882f5504cd2520bacb189a1b04
Reviewed-on: http://gerrit.cloudera.org:8080/23631
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
Since the toolchain was bumped to pick up Kudu's array column
feature (KUDU-1261), Impala's TSAN builds on the master branch
consistently break during dataload with a data race detected by TSAN.
The source of the data race lies within libkudu_client.so, and it
only triggers if the Impala build machine has both IPv4 and IPv6
addresses associated with localhost. Until the exact root cause is
found and fixed, this patch works around the TSAN issue by pinning
the KUDU_MASTER_HOSTS env var to 127.0.0.1.
Testing:
Run TSAN build and confirm no data race error is emitted.
Change-Id: I511ab625d18c6007567083557fcdf98980a6ac6f
Reviewed-on: http://gerrit.cloudera.org:8080/23507
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
Minidump stack resolution does not work on Redhat8 ARM64.
Redhat8 ARM64 uses 64KB pages, and the Breakpad library does
not properly handle collecting stacks for that configuration.
Breakpad rounds off the stack pointer to the nearest page
boundary below the stack pointer, then collects up to 32KB of
stack memory. With a top-down stack, this means it is collecting
some memory that is not used by the stack. With 64KB pages,
the memory it collects usually doesn't contain any stack contents.
This picks up a toolchain with Breakpad patched to fix this. The
patch stops rounding the stack pointer to the nearest page.
Instead, it adjusts the stack pointer to account for the red
zone (128 bytes on x86_64) and then rounds to the nearest 1KB
boundary below the stack pointer.
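To make the rounding arithmetic concrete, here is an illustrative
C++ sketch (not Breakpad's actual code) of the adjusted capture base:

  #include <cstdint>

  constexpr uintptr_t kRedZoneBytes = 128;       // x86_64 red zone
  constexpr uintptr_t kRoundingBoundary = 1024;  // 1KB, not the page size

  // Step below the red zone, then round down to the nearest 1KB
  // boundary, so only a bounded amount of non-stack memory is captured
  // even on systems with 64KB pages.
  uintptr_t StackCaptureBase(uintptr_t stack_pointer) {
    return (stack_pointer - kRedZoneBytes) & ~(kRoundingBoundary - 1);
  }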
Testing:
- Produced and resolved minidumps on multiple build types for
x86_64 and ARM64 (release, debug, asan, ubsan)
Change-Id: I4fbd91abfbddfd8355d27ae9d9b86b70a9ce0409
Reviewed-on: http://gerrit.cloudera.org:8080/23465
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Cleans up repetitive patterns in pom.xml.
Centralize plugin configuration in pluginManagement. Replace inline
maven-compiler-plugin configuration with newer maven.compiler.release
and update to latest plugin version.
Centralize common dependencies in dependencyManagement, including
exclusions when appropriate. Remove exclusions that are no longer
relevant.
Compared before and after with dependency:tree; only difference is that
commons-cli now comes from hadoop and jersey-serv{let,er} are
effectively excluded; all versions matched. Also ensured
USE_APACHE_COMPONENTS=true compiles.
Adds com.amazonaws:aws-java-sdk-bundle to exclusion checking to ensure
it's not accidentally included alongside impala-minimal-s3a-aws-sdk.
Removes missed io.netty exclusion from IMPALA-12816.
Updates commons-dbcp2 to 2.12.0 to match Hive.
Change-Id: If96649840e23036b4a73ee23e8d12516497994f0
Reviewed-on: http://gerrit.cloudera.org:8080/23432
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Downstream error reports pointed out that the toolchain version picked
up for IMPALA-14139 contains toolchain binaries for Red Hat 9 (and
compatibles) that require at least the 9.5 minor version because of
OpenSSL library requirements. This was caused by the toolchain binary
build process not using package repo pinning for the redhat9 build
container definition, which caused the container process to install
"latest" packages, in this case packages released in Rocky / Red Hat
9.5.
This patch bumps the toolchain ID to a version in which the redhat9
binaries were produced in a build container "moved back in time" to the
9.2 release by pinning the package repos to the Rocky Linux 9.2 state,
using the Rocky Vault.
The patch also picks up a buffer overflow mitigation for the ORC
library.
Change-Id: I5c6921afdc69a4a6644b619de6b8d4e4cc69e601
Reviewed-on: http://gerrit.cloudera.org:8080/23448
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Removes IMPALA_JAVA_HOME_OVERRIDE and updates version selection. In
order of priority:
1. If IMPALA_JDK_VERSION is set, use the OS JDK version from a known
location. This is primarily used when also installing the JDK as part
of automated builds.
2. If JAVA_HOME is set, use it.
3. Look for the system default JDK.
The IMPALA_JDK_VERSION variable is no longer modified to avoid issues
when sourcing impala-config.sh multiple times. JAVA_HOME will be
modified if IMPALA_JDK_VERSION is set; both must be unset to restore
using the system default Java.
If switching between JDKs, now prefer setting JAVA_HOME. If relying on
system Java, unset JAVA_HOME after e.g. update-java-alternatives.
The detected Java version is set in IMPALA_JAVA_TARGET, which is used to
add Java 9+ options and configure the Java compilation target.
Eliminates IMPALA_JDK_VERSION_NUM as its value was always identical to
IMPALA_JAVA_TARGET.
Stops printing from impala-config-java.sh. It made the output from
impala-config.sh look strange, and the decisions can all be clearly
determined from impala-config.sh printed variables later or the packages
installed in bootstrap_system.sh.
Fixes JAVA_HOME in bootstrap_build.sh on ARM64 systems.
Change-Id: I68435ca69522f8310221a0f3050f13d86568b9da
Reviewed-on: http://gerrit.cloudera.org:8080/23434
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Update the following elements of the Impala build environment to enable
builds on Ubuntu 24.04:
- Recognize and handle (where necessary) Ubuntu 24.04 in various
bootstrap scripts (bootstrap_system.sh, bootstrap_toolchain.py, etc.)
- Bump IMPALA_TOOLCHAIN_ID to an official toolchain build that contains
Ubuntu 24.04-specific binary packages
- Bump binutils to 2.42, and
- Bump the GDB version to 12.1-p1, as required by the new toolchain
version
- Update unique_ptr usage syntax in be/src/util/webserver-test.cc to
compensate for new GLIBC function prototypes:
System headers in Ubuntu 24.04 adopted attributes on several widely
used function prototypes. Such attributes are not considered to be part
of the function's signature during template evaluation, so GCC throws a
warning when such a function is passed as a template argument, which
breaks the build, as warnings are treated as errors.
webserver-test.cc uses pclose() as the deleter for a unique_ptr in a
utility function. This patch encapsulates pclose() and its attributes in
an explicit specialization for std::default_delete<>, "hiding" the
attributes inside a functor.
The particular solution was inspired by Anton-V-K's proposal in
https://gist.github.com/t-mat/5849549
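A minimal sketch of that pattern (simplified from the gist, not the
exact Impala code):

  #include <cstdio>
  #include <memory>

  namespace std {
  // Explicit specialization that hides pclose() -- and the attributes
  // on its prototype -- inside a functor, so no attributed function
  // type is ever used as a template argument.
  template <>
  struct default_delete<FILE> {
    void operator()(FILE* f) const {
      if (f != nullptr) pclose(f);
    }
  };
  }  // namespace std

  std::unique_ptr<FILE> RunCommand(const char* cmd) {
    // The deleter defaults to the specialization above; no function
    // pointer appears in the unique_ptr's type.
    return std::unique_ptr<FILE>(popen(cmd, "r"));
  }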
This commit builds on an earlier patch for the same purpose by Michael
Smith: https://gerrit.cloudera.org/c/23058/
Change-Id: Ia4454b0c359dbf579e6ba2f9f9c44cfa3f1de0d2
Reviewed-on: http://gerrit.cloudera.org:8080/23384
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch mainly implements the creation/dropping of paimon tables
through impala.
Supported impala data types:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE
Syntax for creating paimon table:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
[col_name data_type ,...]
[PRIMARY KEY (col1,col2)]
)
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
STORED AS PAIMON
[LOCATION 'hdfs_path']
[TBLPROPERTIES (
'primary-key'='col1,col2',
'file.format' = 'orc/parquet',
'bucket' = '2',
'bucket-key' = 'col3'
)];
Two types of paimon catalogs are supported.
(1) Create table with hive catalog:
CREATE TABLE paimon_hive_cat(userid INT,movieId INT)
STORED AS PAIMON;
(2) Create table with hadoop catalog:
CREATE [EXTERNAL] TABLE paimon_hadoop_cat
STORED AS PAIMON
TBLPROPERTIES('paimon.catalog'='hadoop',
'paimon.catalog_location'='/path/to/paimon_hadoop_catalog',
'paimon.table_identifier'='paimondb.paimontable');
SHOW TABLE STATS/SHOW COLUMN STATS/SHOW PARTITIONS/SHOW FILES
statements are also supported.
TODO:
- Patches pending submission:
- Query support for paimon data files.
- Partition pruning and predicate push down.
- Query support with time travel.
- Query support for paimon meta tables.
- WIP:
- Complex type query support.
- Virtual Column query support for querying
paimon data table.
- Native paimon table scanner, instead of the JNI-based one.
Testing:
- Add unit test for paimon impala type conversion.
- Add unit test for ToSqlTest.java.
- Add unit test for AnalyzeDDLTest.java.
- Update default_file_format TestEnumCase in
be/src/service/query-options-test.cc.
- Update test case in
testdata/workloads/functional-query/queries/QueryTest/set.test.
- Add test cases in metadata/test_show_create_table.py.
- Add custom test test_paimon.py.
Change-Id: I57e77f28151e4a91353ef77050f9f0cd7d9d05ef
Reviewed-on: http://gerrit.cloudera.org:8080/22914
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Running exhaustive tests with the env var IMPALA_USE_PYTHON3_TESTS=true
revealed some tests that require adjustment. This patch makes those
adjustments, which mostly revolve around encoding differences and the
string vs. bytes types in Python3. This patch also switches the default
to running pytest with Python3 by setting IMPALA_USE_PYTHON3_TESTS=true.
The details follow:
Change the hash() function in conftest.py to crc32() to produce a
deterministic hash. Hash randomization is enabled by default since
Python 3.3 (see
https://docs.python.org/3/reference/datamodel.html#object.__hash__).
This causes test sharding (like --shard_tests=1/2) to produce an
inconsistent set of tests per shard. Always restart the minicluster
during custom cluster tests if the --shard_tests argument is set,
because test order may change and affect test correctness, depending
on whether the tests run on a fresh minicluster or not.
Moved one test case from delimited-latin-text.test to
test_delimited_text.py for easier binary comparison.
Add bytes_to_str() as a utility function to decode bytes in Python3.
This is often needed when inspecting the return value of
subprocess.check_output() as a string.
Implement DataTypeMetaclass.__lt__ to substitute
DataTypeMetaclass.__cmp__, which is ignored in Python3 (see
https://peps.python.org/pep-0207/).
Fix WEB_CERT_ERR difference in test_ipv6.py.
Fix trivial integer parsing in test_restart_services.py.
Fix various encoding issues in test_saml2_sso.py,
test_shell_commandline.py, and test_shell_interactive.py.
Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1.
Switch to binary comparison in test_iceberg.py where needed.
Specify text mode when calling tempfile.NamedTemporaryFile().
Simplify create_impala_shell_executable_dimension to skip testing the
dev and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. The
reason is that several UTF-8 related tests in test_shell_commandline.py
break in the Python3 pytest + Python2 impala-shell combo. This skipping
already happens automatically on build OSes without system Python2
available, like RHEL9 (the IMPALA_SYSTEM_PYTHON2 env var is empty).
Removed unused vector argument and fixed some trivial flake8 issues.
Several tests' logic required modification due to intermittent issues
in Python3 pytest. These include:
Add _run_query_with_client() in test_ranger.py to allow reusing a single
Impala client for running several queries. Ensure clients are closed
when the test is done. Mark several tests in test_ranger.py with
SkipIfFS.hive because they run queries through beeline + HiveServer2,
but the Ozone and S3 build environments do not start HiveServer2 by
default.
Increase the sleep period from 0.1 to 0.5 seconds per iteration in
test_statestore.py and mark TestStatestore to execute serially. This is
because TServer appears to shut down more slowly when run concurrently
with other tests. Handle the deprecation of Thread.setDaemon() as well.
Always set force_restart=True for each test method in TestLoggingCore,
TestShellInteractiveReconnect, and TestQueryRetries to prevent them from
reusing the minicluster from a previous test method. Some of these tests
destroy the minicluster (kill impalad) and will produce minidumps if the
metrics verifier for the next tests fails to detect a healthy
minicluster state.
Testing:
Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true.
Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4
Reviewed-on: http://gerrit.cloudera.org:8080/23319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Consume ALTER_PARTITIONS events from metastore.
HIVE-27746 introduced the ALTER_PARTITIONS event type, which is an
optimization that reduces bulk ALTER_PARTITION events into a single
event. The components version is updated to pick up this change. It is
a good optimization to include in Impala, as the number of events
consumed by the event processor is significantly reduced, helping the
event processor catch up with events quickly.
This patch enables the ability to consume the ALTER_PARTITIONS event.
The downside of this patch is that there is no before_partitions object
in the event message, which can cause partitions to be refreshed even on
trivial changes to them. HIVE-29141 will address this concern.
Testing:
- Added an end-to-end test to verify consuming the ALTER_PARTITIONS
event. Also, bigger timeouts were added in this test, as flakiness
was observed while looping the test several times.
Change-Id: I009a87ef5e2c331272f9e2d7a6342cc860e64737
Reviewed-on: http://gerrit.cloudera.org:8080/22554
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Adds representation of Impala select queries using OpenTelemetry
traces.
Each Impala query is represented as its own individual OpenTelemetry
trace. The one exception is retried queries which will have an
individual trace for each attempt. These traces consist of a root span
and several child spans. Each child span has the root as its parent.
No child span has another child span as its parent. Each child span
represents one high-level query lifecycle stage. Each child span also
has span attributes that further describe the state of the query.
Child spans:
1. Init
2. Submitted
3. Planning
4. Admission Control
5. Query Execution
6. Close
Each child span contains a mix of universal attributes (available on
all spans) and query phase specific attributes. For example, the
"ErrorMsg" attribute, present on all child spans, is the error
message (if any) at the end of that particular query phase. One
example of a child span specific attribute is "QueryType" on the
Planning span. Since query type is first determined during query
planning, the "QueryType" attribute is present on the Planning span
and has a value of "QUERY" (since only selects are supported).
Since queries can run for lengthy periods of time, the Init span
communicates the beginning of a query along with global query
attributes. For example, span attributes include query id, session
id, sql, user, etc.
Once the query has closed, the root span is closed.
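The overall trace shape can be sketched with the public
opentelemetry-cpp API as follows (span and attribute names mirror the
description above, not necessarily Impala's actual code):

  #include <string_view>
  #include "opentelemetry/trace/provider.h"

  namespace trace = opentelemetry::trace;

  void EmitQueryLifecycleTrace() {
    auto tracer = trace::Provider::GetTracerProvider()->GetTracer("impala");
    // The root span represents the whole query and is closed last.
    auto root = tracer->StartSpan("select-query");
    trace::StartSpanOptions options;
    options.parent = root->GetContext();
    // Each lifecycle stage is a direct child of the root span; no child
    // has another child as its parent.
    for (const char* stage : {"Init", "Submitted", "Planning",
                              "Admission Control", "Query Execution",
                              "Close"}) {
      auto child = tracer->StartSpan(stage, options);
      child->SetAttribute("ErrorMsg", "");          // universal attribute
      if (std::string_view(stage) == "Planning") {
        child->SetAttribute("QueryType", "QUERY");  // phase-specific
      }
      child->End();
    }
    root->End();
  }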
Testing accomplished with new custom cluster tests.
Generated-by: Github Copilot (GPT-4.1, Claude Sonnet 3.7)
Change-Id: Ie40b5cd33274df13f3005bf7a704299ebfff8a5b
Reviewed-on: http://gerrit.cloudera.org:8080/22924
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch, USE_APACHE_COMPONENTS overwrote all USE_APACHE_*
variables, but we should support using specific Apache components.
After this patch, if USE_APACHE_COMPONENTS is not false, the USE_APACHE_
{HADOOP,HBASE,HIVE,TEZ,RANGER} variables are set to true. Otherwise,
the individual values of USE_APACHE_{HADOOP,HBASE,HIVE,TEZ,RANGER}
are used.
Test:
- Built and ran a test cluster with setting USE_APACHE_HIVE=true
and USE_APACHE_COMPONENTS=false.
Change-Id: I33791465a3b238b56f82d749e3dbad8215f3b3bc
Reviewed-on: http://gerrit.cloudera.org:8080/23211
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds verification code to ensure the IMPALA_TOOLCHAIN_COMMIT_HASH
environment variable matches the commit hash in the
IMPALA_TOOLCHAIN_BUILD_ID_AARCH64 and
IMPALA_TOOLCHAIN_BUILD_ID_X86_64 environment variables.
Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: I348698356a014413875f6b8b54a005bf89b9793a
Reviewed-on: http://gerrit.cloudera.org:8080/23243
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Consumes the new toolchain builds that compiled the OpenTelemetry-cpp
SDK libraries against the standard C++ library instead of the SDK's
nostd translation layer.
Change-Id: Icf06710d5f7987f43cb8bae5450b657f251f199b
Reviewed-on: http://gerrit.cloudera.org:8080/23192
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Jason Fehr <jfehr@cloudera.com>
Adds the OpenTelemetry C++ SDK version 1.20.0 from the toolchain into
the cmake files for consumption during builds.
Testing was accomplished by building locally and in Jenkins.
Generated-by: Github Copilot (GPT-4.1)
Change-Id: Ib30123f79270e3f11233e28a2a34725e7d455f5e
Reviewed-on: http://gerrit.cloudera.org:8080/23101
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HBase jars are added to AUX_CLASSPATH in impala-config.sh so that Hive
can write into HBase. Newer Hive versions already include the
hbase-shaded-mapreduce jar, so it is not necessary to add the unshaded
jars to AUX_CLASSPATH; doing so can lead to conflicts in downstream
builds.
Testing:
- Run and pass dataload.
- Pass custom_cluster/test_hbase_hms_column_order.py and
query_test/test_hbase_queries.py.
Change-Id: I4caf37571a8bc2543bbc58071e5cb7046f216fa9
Reviewed-on: http://gerrit.cloudera.org:8080/23022
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
This moves from zlib 1.2.13 to zlib 1.3.1 and bumps cloudflare
zlib to a newer version. This does not require any update to the
toolchain, because these newer versions were already present.
Testing:
- Ran a perf-AB-test with no major difference in performance
Change-Id: I09ec358ea49198485d53e85eae7d0b61beac3308
Reviewed-on: http://gerrit.cloudera.org:8080/22993
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Some minor changes were needed on the Impala side because of changes in
glog (for example some variables and function parameters were changed
from signed to unsigned integer types).
Testing:
- passed exhaustive DEBUG tests
- core ASAN tests
Change-Id: Ifbe341265fd7aa7be8fe304b9fda31b4470237cf
Reviewed-on: http://gerrit.cloudera.org:8080/22906
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Upstream gperftools does not allow setting tcmalloc.max_total_thread_cache_bytes
to greater than 1GB. This moves to a new toolchain that has patched
gperftools to remove this limitation and allow setting
tcmalloc.max_total_thread_cache_bytes > 1GB. This also reads back the
value from tcmalloc and prints a warning if it doesn't match what we set.
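A sketch of this set-then-verify pattern against the gperftools
MallocExtension API (illustrative, not Impala's exact code):

  #include <cstddef>
  #include <iostream>
  #include <gperftools/malloc_extension.h>

  void SetAndVerifyThreadCacheBytes(size_t requested) {
    MallocExtension* ext = MallocExtension::instance();
    ext->SetNumericProperty("tcmalloc.max_total_thread_cache_bytes",
                            requested);
    size_t actual = 0;
    ext->GetNumericProperty("tcmalloc.max_total_thread_cache_bytes",
                            &actual);
    // An unpatched gperftools silently clamps values above 1GB, so the
    // read-back value will not match the requested one.
    if (actual != requested) {
      std::cerr << "Warning: requested " << requested
                << " bytes, tcmalloc reports " << actual << std::endl;
    }
  }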
Testing:
- Set tcmalloc_max_total_thread_cache_bytes to 2GB and verified that
the warning message doesn't appear. On unpatched versions of
gperftools, the warning message does appear.
Change-Id: If78c8734c704090c12737a8c2a8456b73ea4b8e8
Reviewed-on: http://gerrit.cloudera.org:8080/22834
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Downstream system vendors, users and customers have lately expressed
interest in consuming Impala in containerized forms, taking advantage of
various specialized, hardened container base image offerings, like
container offerings based on the Wolfi project by Chainguard;
see: https://github.com/wolfi-dev.
This patch enables Impala container images to be built on top of custom
base images, and adds an implementation example that uses the publicly
available Wolfi base image.
Building a customized Docker image follows a hybrid approach. Instead of
replicating the complete Impala build process inside a Wolfi container
for a fully native binary build, it relies on an existing build platform
that is compatible with the binary packages available inside the custom
container image. For Wolfi the Impala binaries are supplied by the
Red Hat 9 build of Impala. This is made possible by the fact that major
library dependencies of Impala have the same versions on Wolfi OS and
Red Hat 9, so binaries built on Red Hat 9 can be run on Wolfi
with no changes.
The binaries produced by the regular build process are then installed
into a Docker image built on top of an explicitly specified custom base
image. The selection of a custom base image is controlled by two
environment variables:
- USE_CUSTOM_IMPALA_BASE_IMAGE (boolean):
If set to 'true', triggers the use of the custom image.
When set to 'false' or left unspecified, the Docker base image is
selected by the existing logic of matching the build platform's
operating system.
- IMPALA_CUSTOM_DOCKER_BASE (string): specifies the URI of the base image
These environment variables can be overridden from the environment,
from impala-config-branch.sh, or impala-config-local.sh.
They are reported at the end of bin/impala-config.sh where important
environment variables are listed. They are also added to the list of
variables in bin/jenkins/dockerized-impala-preserve-vars.py to ensure
that they can be used in the context of Jenkins jobs as well.
The unified script that installs Impala's required dependencies into the
container image is extended for Wolfi to handle APK packages.
A new script is added to install Bash in the Docker image if it is
missing. Impala build scripts (including the scripts used during Docker
image builds) as well as container startup scripts require Bash,
but minimal container base images usually omit it, favoring a smaller
alternative.
To improve the debugging experience for a containerized Impala
minicluster, the minicluster starter script bin/start-impala-cluster.py
is extended with the following features:
- synchronizes every launched container's timezone to the host's.
This is needed for the Iceberg time-travel tests, which create
timestamped Iceberg metadata items in the impalad context inside a
container, but check creation/modification times of the same items in
the test scripts running on the host, outside the containers. The test
scripts implicitly expect that the same local time is shared across
all these contexts, which is not necessarily true if the host where
the tests are running is set to a timezone other than UTC.
Time synchronization is achieved by injecting the TZ environment
variable into the container, holding the name of the timezone used
on the host. The timezone name is taken either from the host's TZ
variable (if set), or from the host's /etc/localtime symlink,
checking the name of the timezone file it points to.
In case /etc/localtime is not a symlink (and TZ is not set on the
host), the host's /etc/localtime file is mounted into the container.
- sets up a directory for each container to collect the Java VM error
files (hs_err_pidNNNN.log) from the containers.
- adds the --mount_sources command line parameter, which mounts the
complete $IMPALA_HOME subtree into the container at
/opt/impala/sources to make source code available inside the container
for easier debugging.
Tested by running core-mode tests in the following environments:
- Regular run (impalad running natively on the platform) on Ubuntu 20.04
- Regular run on Rocky Linux 9.2
- Dockerised run (impalad instances running in their individual
containers) using Ubuntu 20.04 containers
- Dockerised run (impalad instances running in their individual
containers) using Rocky Linux 9.2 containers
- Dockerised run (impalad instances running in their individual
containers) using Wolfi's wolfi-base containers
Change-Id: Ia5e39f399664fe66f3774caa316ed5d4df24befc
Reviewed-on: http://gerrit.cloudera.org:8080/22583
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is an enhancement request to support JDBC tables
created by the Hive JDBC storage handler. This is essentially
done by making the JDBC table properties compatible with
Impala: the properties are translated when the table is loaded,
and the translation is maintained only in the Impala cluster,
i.e. it is not written back to HMS.
Impala includes JDBC drivers for PostgreSQL and MySQL,
making 'driver.url' optional in such cases. The
Impala JDBC driver is still required for Impala-to-Impala
JDBC connections. Additionally, Hive allows adding database
driver JARs at runtime via Beeline, enabling users to
dynamically include JDBC driver JARs. However, Impala does
not support adding database driver JARs at runtime,
making the driver.url field still useful
in cases where additional drivers are needed.
The 'hive.sql.query' property is not handled in this patch.
It will be covered in a separate jira.
Testing: End-to-end tests are included in
test_ext_data_sources.py.
Change-Id: I1674b93a02f43df8c1a449cdc54053cc80d9c458
Reviewed-on: http://gerrit.cloudera.org:8080/22134
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Pick up a new binary build of the current toolchain version for ARM.
The toolchain version is identical; the only difference is that the new
build added binaries for Rocky/RHEL 9 to the already supported OS
versions, reaching the same level of Impala build support as
Rocky/RHEL 8.
Tested by building Impala for RHEL9 for Intel and ARM both on private
infrastructure.
Change-Id: I5fd2e8c3187cb7829de55d6739cf5d68a09a2ed3
Reviewed-on: http://gerrit.cloudera.org:8080/22323
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the shell tarball maintains its own packaging code
and directory layout. This is very complicated, and several Python
packages are directly checked into our repository.
To simplify it, this changes the shell tarball to be based on
pip installing the pypi package. Specifically, the new directory
structure for an unpacked shell tarball is:
impala-shell-4.5.0-SNAPSHOT/
impala-shell
install_py${PYTHON_VERSION}/
install_py${ANOTHER_PYTHON_VERSION}/
For example, install_py2.7 is the Python 2.7 pip install of impala-shell.
install_py3.8 is a Python 3.8 pip install of impala-shell. This means
that the impala-shell script simply picks the install for the
specified version of python and uses that pip install directory.
To make this more consistent across different Linux distributions, this
upgrades pip in the virtualenv to the latest.
With this, ext-py and pkg_resources.py can be removed.
This requires rearranging the shell build code. Specifically, this splits
out the code that generates impala_build_version.py so that it can run
before generating the pypi package. The shell tarball now has a dependency
on the pypi package and must run after it.
This builds on Michael Smith's work from IMPALA-11399.
Testing:
- Ran shell tests locally
- Built on Centos 7, Redhat 8 & 9, Ubuntu 20 & 22, SLES 15
Change-Id: Ifbb66ab2c5bc7180221f98d9bf5e38d62f4ac036
Reviewed-on: http://gerrit.cloudera.org:8080/20171
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This introduces the IMPALA_USE_PYTHON3_TESTS environment variable
to select whether to run tests using the toolchain Python 3.
This is an experimental option, so it defaults to false,
continuing to run tests with Python 2.
This fixes a first batch of Python 2 vs 3 issues:
- Deciding whether to open a file in bytes mode or text mode
- Adapting to APIs that operate on bytes in Python 3 (e.g. codecs)
- Eliminating 'basestring' and 'unicode' locations in tests/ by using
the recommendations from future
( https://python-future.org/compatible_idioms.html#basestring and
https://python-future.org/compatible_idioms.html#unicode )
- Uses impala-python3 for bin/start-impala-cluster.py
All fixes leave the Python 2 path working normally.
Testing:
- Ran an exhaustive run with Python 2 to verify nothing broke
- Verified that the new environment variable works and that
it uses Python 3 from the toolchain when specified
Change-Id: I177d9b8eae9b99ba536ca5c598b07208c3887f8c
Reviewed-on: http://gerrit.cloudera.org:8080/21474
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>