This patch mainly implements querying of paimon data tables
through a JNI-based scanner.
Features implemented:
- support column pruning.
The partition pruning and predicate push down will be submitted
as the third part of the patch.
We implemented this by treating the paimon table as a normal
unpartitioned table. When querying a paimon table:
- PaimonScanNode decides which paimon splits need to be scanned,
and then transfers the splits to BE to do the jni-based scan operation.
- We also collect the required columns that need to be scanned,
and pass the columns to the Scanner for column pruning. This is
implemented by passing the field ids of the columns to BE,
instead of column positions, to support schema evolution.
- In the original implementation, PaimonJniScanner passed the paimon
row object directly to BE, which called the corresponding paimon row
field accessor, a java method, to convert row fields to
impala row batch tuples. We found it slow due to the overhead of
JVM method calls.
To minimize the overhead, we reworked the implementation:
PaimonJniScanner now converts the paimon row batches to an
arrow record batch, which stores data in the offheap region of the
impala JVM, and passes the memory pointer of the offheap arrow
record batch to the BE.
BE PaimonJniScanNode reads data directly from the JVM offheap
region and converts the arrow record batch to an impala row batch.
The benchmark shows the latter implementation is 2.x better
than the original implementation.
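The field-id based column pruning described above can be illustrated with a
small sketch. This is not the patch's code: `Field`, `resolve_projection` and
the sample schemas are hypothetical, and returning None for a missing id is an
assumption. Only the idea of resolving required columns by stable field id
rather than by position comes from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int   # stable id that survives renames and reordering
    name: str


def resolve_projection(schema, required_field_ids):
    """Return the positions in `schema` of the requested field ids.

    Ids that are not present in `schema` map to None (an assumption made
    for this sketch, so the caller can decide how to handle them).
    """
    position_by_id = {f.field_id: pos for pos, f in enumerate(schema)}
    return [position_by_id.get(fid) for fid in required_field_ids]


# Hypothetical schema at planning time.
old_schema = [Field(0, "id"), Field(1, "name"), Field(2, "price")]
# Same table after a column was added in front and another was renamed.
new_schema = [Field(3, "region"), Field(0, "id"),
              Field(1, "full_name"), Field(2, "price")]

required = [0, 2]  # field ids of "id" and "price"
print(resolve_projection(old_schema, required))
print(resolve_projection(new_schema, required))
```

Because the lookup goes through the id, the same request resolves correctly
against both layouts, which a position-based request would not.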
The lifecycle of an arrow record batch is mainly like this:
the arrow record batch is generated in FE, and passed to BE.
After the record batch is imported to BE successfully,
BE is in charge of freeing it.
There are two free paths: the normal path and the
exceptional path. On the normal path, when the arrow batch
is totally consumed by BE, BE calls jni to fetch the next arrow
batch. In this case, the arrow batch is freed automatically.
The exceptional path happens when the query is cancelled or a memory
allocation fails. In these corner cases, the arrow batch is freed in the
close method if it is not totally consumed by BE.
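The two free paths can be modelled with a toy consumer. This is only a sketch
of the ownership rule described above, not the BE code: `BatchConsumer`, its
methods and the string "batches" are all hypothetical stand-ins.

```python
class BatchConsumer:
    """Toy model of the normal and exceptional free paths."""

    def __init__(self, batches):
        self._pending = iter(batches)  # stands in for the jni "fetch next batch" call
        self._current = None           # batch currently owned by the consumer
        self.freed = []                # records which batches were released

    def _release_current(self):
        if self._current is not None:
            self.freed.append(self._current)
            self._current = None

    def next_batch(self):
        # Normal path: the previous batch has been consumed, so it is
        # released before the next one is imported.
        self._release_current()
        self._current = next(self._pending, None)
        return self._current

    def close(self):
        # Exceptional path: cancellation or an allocation failure may leave
        # a partially consumed batch behind; release it here.
        self._release_current()


consumer = BatchConsumer(["batch-0", "batch-1", "batch-2"])
consumer.next_batch()
consumer.next_batch()  # releases batch-0
consumer.close()       # query cancelled while batch-1 is still in use
print(consumer.freed)
```

Calling close() a second time is a no-op in this model, so each batch is
released once regardless of which path is taken.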
Currently supported impala data types for queries include:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE
TODO:
- Patches pending submission:
- Support tpcds/tpch data-loading
for paimon data table.
- Virtual Column query support for querying
paimon data table.
- Query support with time travel.
- Query support for paimon meta tables.
- WIP:
- Snapshot incremental read.
- Complex type query support.
- Native paimon table scanner, instead of
jni based.
Testing:
- Create tests table in functional_schema_template.sql
- Add TestPaimonScannerWithLimit in test_scanners.py
- Add test_paimon_query in test_paimon.py.
- Already passed the tpcds/tpch tests for paimon tables. These tests are
not part of this patch because the test table data is currently generated
by spark, which is not supported by impala now. We have to use spark
since hive doesn't support generating paimon tables for
dynamic-partitioned tables. We plan to submit a separate patch for
tpcds/tpch data loading and associated tpcds/tpch query tests.
- JVM offheap memory leak tests: ran looped tpch tests for
1 day. No obvious offheap memory increase was observed;
offheap memory usage stayed within 10M.
Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384
Reviewed-on: http://gerrit.cloudera.org:8080/23613
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
When the environment variable USE_APACHE_HIVE is set to true, Impala is
built against Apache Hive 3.x. To better distinguish it
from Apache Hive 2.x later, rename USE_APACHE_HIVE to USE_APACHE_HIVE_3.
Additionally, to facilitate referencing different versions of the Hive
MetastoreShim, the major version of Hive has been added to the environment
variable IMPALA_HIVE_DIST_TYPE.
Change-Id: I11b5fe1604b6fc34469fb357c98784b7ad88574d
Reviewed-on: http://gerrit.cloudera.org:8080/21724
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3
This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
doesn't have a main function, it removes the hash-bang and makes
sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
(or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
replaced by the cm-client pypi package and interfaces have changed.
Rather than migrating the code (which hasn't been used in years), this
deletes the old code and stops installing cm-api into the virtualenv.
The code can be restored and revamped if there is any interest in
interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
bit-rotted. Some pieces can be run manually, but it can't be fully
verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
version that supports Python 3. The newest version of kazoo requires
upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
needing other upgrades.
The two remaining uses of impala-python are:
- bin/cmake_aux/create_virtualenv.sh
- bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.
The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).
Testing:
- Ran core job
- Ran build + dataload on Centos 7, Redhat 8
- Manual testing of individual scripts (except some bitrotted areas like the
random query generator)
Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Update the following elements of the Impala build environment to enable
builds on Ubuntu 24.04:
- Recognize and handle (where necessary) Ubuntu 24.04 in various
bootstrap scripts (bootstrap_system.sh, bootstrap_toolchain.py, etc.)
- Bump IMPALA_TOOLCHAIN_ID to an official toolchain build that contains
Ubuntu 24.04-specific binary packages
- Bump binutils to 2.42, and
- Bump the GDB version to 12.1-p1, as required by the new toolchain
version
- Update unique_ptr usage syntax in be/src/util/webserver-test.cc to
compensate for new GLIBC function prototypes:
System headers in Ubuntu 24.04 adopted attributes on several widely
used function prototypes. Such attributes are not considered to be part
of the function's signature during template evaluation, so GCC throws a
warning when such a function is passed as a template argument, which
breaks the build, as warnings are treated as errors.
webserver-test.cc uses pclose() as the deleter for a unique_ptr in a
utility function. This patch encapsulates pclose() and its attributes in
an explicit specialization for std::default_delete<>, "hiding" the
attributes inside a functor.
The particular solution was inspired by Anton-V-K's proposal in
https://gist.github.com/t-mat/5849549
This commit builds on an earlier patch for the same purpose by Michael
Smith: https://gerrit.cloudera.org/c/23058/
Change-Id: Ia4454b0c359dbf579e6ba2f9f9c44cfa3f1de0d2
Reviewed-on: http://gerrit.cloudera.org:8080/23384
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Adds the OpenTelemetry C++ SDK version 1.20.0 from the toolchain into
the cmake files for consumption during builds.
Testing was accomplished by building locally and in Jenkins.
Generated-by: Github Copilot (GPT-4.1)
Change-Id: Ib30123f79270e3f11233e28a2a34725e7d455f5e
Reviewed-on: http://gerrit.cloudera.org:8080/23101
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Jenkins jobs that run the UBSAN tests on ARM were occasionally
hanging on the disk-file-test. This commit fixes these hangs by
upgrading Google Test and implementing the Death Test handling
functionality which safely runs tests that expect the process to die.
See https://github.com/google/googletest/blob/main/docs/advanced.md#death-tests
for details on known problems with running death tests and threads at
the same time causing tests to hang.
Testing was accomplished by running the disk-file-test repeatedly in a
loop on a RHEL 8.9 ARM machine. Before this fix was implemented, this
test would run up to 70 times before it hung. After the fix was
implemented, the test ran 2,490 times and was still running when it was
stopped. These test runs had durations between 18.7 and 19.9 seconds
which means disk-file-test now takes about 15 seconds longer than its
previous duration of about 4.4 seconds.
Change-Id: Ie01f7781f24644a66e9ec52652450116f5cb4297
Reviewed-on: http://gerrit.cloudera.org:8080/21544
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The goals and non-goals of this patch could be summarized as follows.
Goals:
- Add changes to the minicluster configuration that allow a non-default
version of Ranger (possibly built locally) to run in the context of
the minicluster, and to be used as the authorization server by
Impala.
- Switch to the new constructor when instantiating
RangerAccessRequestImpl. This resolves IMPALA-12985 and also makes
Impala compatible with Apache Ranger if RangerAccessRequestImpl from
Apache Ranger is consumed.
- Prepare Ranger and Impala patches as supplemental material to verify
what authorization-related tests could be passed if Apache Ranger is
the authorization provider. Merging IMPALA-12921_addendum.diff to
the Impala repository is not in the scope of this patch in that the
diff file changes the behavior of Impala and thus more discussion is
required if we'd like to merge it in the future.
Non-goals:
- Set up any automation for building Ranger from source.
- Pass all Impala authorization-related tests with a non-default
version of Ranger.
Instructions on running Impala with locally built Ranger:
Suppose the Ranger project is under the folder $RANGER_SRC_DIR. We could
execute the following to build Apache Ranger for easy reference. By
default, the compressed tarball is produced under
$RANGER_SRC_DIR/target.
mvn clean compile -B -nsu -DskipCheck=true -Dcheckstyle.skip=true \
package install -DskipITs -DskipTests -Dmaven.javadoc.skip=true
After building Ranger, we need to build Impala's Java code so that
Impala's Java code could consume the locally produced Ranger classes. We
will need to export the following environment variables before building
Impala. This prevents bootstrap_toolchain.py from trying to download the
compressed Ranger tarball.
1. export RANGER_VERSION_OVERRIDE=\
$(mvn -f $RANGER_SRC_DIR/pom.xml -q help:evaluate \
-Dexpression=project.version -DforceStdout)
2. export RANGER_HOME_OVERRIDE=$RANGER_SRC_DIR/target/\
ranger-${RANGER_VERSION_OVERRIDE}-admin
It then suffices to execute the following to point
Impala to the locally built Ranger server before starting Impala.
1. source $IMPALA_HOME/bin/impala-config.sh
2. tar zxv -f $RANGER_SRC_DIR/target/\
ranger-${IMPALA_RANGER_VERSION}-admin.tar.gz \
-C $RANGER_SRC_DIR/target/
3. $IMPALA_HOME/bin/create-test-configuration.sh
4. $IMPALA_HOME/bin/create-test-configuration.sh \
-create_ranger_policy_db
5. $IMPALA_HOME/testdata/bin/run-ranger.sh
(run-all.sh has to be executed instead if other underlying services
have not been started)
6. $IMPALA_HOME/testdata/bin/setup-ranger.sh
Testing:
- Manually verified that we could point Impala to a locally built
Apache Ranger on the master branch (with tip being
https://github.com/apache/ranger/commit/4abb993).
- Manually verified that with RANGER-4771.diff and
IMPALA-12921_addendum.diff, only 3 authorization-related tests
failed. They failed because the resource type of 'storage-type' is
not supported in Apache Ranger yet and thus the test cases added in
IMPALA-10436 could fail.
- Manually verified that the log files of Apache and CDP Ranger's Admin
server could be created under ${RANGER_LOG_DIR} after we start the
Ranger service.
- Verified that this patch passed the core tests when CDP Ranger is
used.
Change-Id: I268d6d4d6e371da7497aac8d12f78178d57c6f27
Reviewed-on: http://gerrit.cloudera.org:8080/21160
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for using the mold linker. It changes
the existing USE_GOLD_LINKER environment variable to
IMPALA_LINKER, which accepts ld, gold, or mold as
values. It defaults to 'gold' to match current behavior.
Developers can override it in bin/impala-config-local.sh.
Clang does not implement -gz properly until version 12.
It does not enable compressed debuginfo in the final
binary. IMPALA_LINKER=mold doesn't work with
IMPALA_COMPRESSED_DEBUG_INFO=true on Clang due to this.
This detects Clang <12 and skips -gz as it is ineffective.
Mold follows similar behavior to LLD and requires
--exclude-libs to use the full library name (i.e.
liblz4.a rather than liblz4). Gold will happily
accept the full library name, so this changes to use
the full library name.
Mold is much faster for incremental builds on my system:
(e.g. touch be/src/scheduling/scheduler.cc && make -j8 impalad)
gold: 15.8s
mold: 2.6s
Testing:
- Ran builds with IMPALA_LINKER=mold on Centos 7, Redhat 8,
and Ubuntu 20.
Change-Id: Ia9e9accd06b6ecd182d200d81afaae09a885c241
Reviewed-on: http://gerrit.cloudera.org:8080/21121
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The hadoop build only produces client binaries, not a full hadoop build.
The name was therefore misleading, and could not replace the full build
of hadoop required by Impala. Impala's toolchain bootstrap process would
then fail if we tried to include two packages named "hadoop" when
overriding the download URL via IMPALA_HADOOP_URL.
Renames hadoop to hadoop-client to clarify its contents and avoid
conflicts with a full hadoop build.
Change-Id: Ia50b5151e5339b06ae2b623a4b2090ae6708491f
Reviewed-on: http://gerrit.cloudera.org:8080/20779
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Pre-built toolchains are identified by a TOOLCHAIN_BUILD_ID. This commit
adds an aarch64 (64-bit ARM) native-toolchain build, separate from the
x86_64 native-toolchain build, with its own environment variable set in
impala-config.sh. bootstrap_toolchain.py selects which version to use
based on 'uname -m'.
impala-config.sh also verifies that IMPALA_TOOLCHAIN_BUILD_ID_AARCH64
and IMPALA_TOOLCHAIN_BUILD_ID_X86_64 were produced from the same
native-toolchain ref by checking the second token of the build ID.
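A rough sketch of the selection and consistency check follows. The real logic
lives in bootstrap_toolchain.py and impala-config.sh; the function names, the
"-" separator and the sample build IDs below are assumptions made purely for
illustration.

```python
import platform


def select_build_id(build_ids, machine=None):
    """Pick the build ID for the given architecture.

    `machine` defaults to platform.machine(), which on Linux reports the
    same kind of value as 'uname -m' (e.g. 'x86_64' or 'aarch64').
    """
    machine = machine or platform.machine()
    if machine not in build_ids:
        raise ValueError("no toolchain build for architecture %r" % machine)
    return build_ids[machine]


def same_toolchain_ref(build_id_a, build_id_b, sep="-"):
    """Compare the second token of two build IDs.

    Assumes IDs look like '<build number><sep><ref>'; the separator is a guess.
    """
    return build_id_a.split(sep)[1] == build_id_b.split(sep)[1]


# Made-up build IDs, for illustration only.
build_ids = {"x86_64": "101-abc1234def", "aarch64": "17-abc1234def"}
print(select_build_id(build_ids, machine="aarch64"))
print(same_toolchain_ref(build_ids["x86_64"], build_ids["aarch64"]))
```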
Updates package version to include the architecture tag to match how
native-toolchain now names them.
Testing:
- successfully built on ARM, and tests passed (exceptions noted in
IMPALA-12490)
Change-Id: I9bfa7125dbc647b33041c5572d97b7f7ccad6258
Reviewed-on: http://gerrit.cloudera.org:8080/20519
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
If NATIVE_TOOLCHAIN_HOME is set, that will be used to provide the native
toolchain instead of the default in IMPALA_TOOLCHAIN. Overrides
IMPALA_TOOLCHAIN_PACKAGES_HOME and sets SKIP_TOOLCHAIN_BOOTSTRAP=true.
Adds IMPALA_TOOLCHAIN_REPO, IMPALA_TOOLCHAIN_BRANCH, and
IMPALA_TOOLCHAIN_COMMIT_HASH so everything is clear about what toolchain
is used for this Impala commit.
If NATIVE_TOOLCHAIN_HOME does not yet exist, buildall.sh will clone the
repo and checkout the commit hash mentioned above before building.
Also skips downloading Kudu if SKIP_TOOLCHAIN_BOOTSTRAP is true as Kudu
is built from native-toolchain. Normalizes aarch64 logic, which skipped
Kudu because it would always build native-toolchain locally.
Change-Id: I3a9e51b7f54c738d8cc01b32428ac88a344de376
Reviewed-on: http://gerrit.cloudera.org:8080/20267
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
This adds support for Redhat 9 / Ubuntu 22. It updates
to a newer toolchain that has those builds, and it adds
supporting code in bootstrap_system.sh.
Redhat 9 and Ubuntu 22 use python = python3, which requires
various changes to build scripts and tests. Ubuntu 22 uses
Python 3.10, which deprecates certain ssl.PROTOCOL_TLS* constants, so
this adapts test_client_ssl.py to that change until it
can be fully addressed in IMPALA-12219.
Various OpenSSL methods have been deprecated. As a workaround
until these can be addressed properly, this specifies
-Wno-deprecated-declarations. This can be removed once the
code is adapted to the non-deprecated APIs in IMPALA-12226.
Impala crashes with tcmalloc errors unless we update to a newer
gperftools, so this moves to gperftools 2.10. gperftools changed
the default for tcmalloc.aggressive_memory_decommit to off, so
this adapts our code to set it for backend tests. The gperftools
upgrade does not show any performance regression:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / none / none | 3.08    | -0.64%     | 2.20       | -0.37%         |
+----------+-----------------------+---------+------------+------------+----------------+
With newer Python versions, the impala-virtualenv command
fails to create a Python 3 virtualenv. This switches to
using Python 3's builtin venv command for Python >=3.6.
Kudu needed a newer version and LLVM required a couple patches.
Testing:
- Ran a core job on Ubuntu 22 and Redhat 9. The tests run
to completion without crashing. There are test failures
that will be addressed in follow-up JIRAs.
- Ran dockerised tests on Ubuntu 22.
- Ran dockerised tests on Ubuntu 20 and Rocky 8.5.
Change-Id: If1fcdb2f8c635ecd6dc7a8a1db81f5f389c78b86
Reviewed-on: http://gerrit.cloudera.org:8080/20073
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This removes the usage of lsb_release in bootstrap_toolchain.py
in favor of reading /etc/os-release. /etc/os-release is available
on all distributions that we support. This combines the ID
and the major version component of VERSION_ID to produce the
distribution identifier. A few examples:
Ubuntu 16.04: ID=ubuntu, VERSION_ID=16.04 => ubuntu16
Rocky 8.5: ID=rocky, VERSION_ID=8.5 => rocky8
SLES 15.1: ID=sles, VERSION_ID=15.1 => sles15
As cleanup, this removes old distributions that we no longer
support (e.g. Redhat 5/6, Debian, Sles 11, etc). It also removes
the unused CDH component of the OS_MAPPING.
The values used in OS_MAPPING are based on the database of
/etc/os-release files available at https://github.com/chef/os_release
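The ID plus major-version rule described above can be sketched as follows.
This is not the code from bootstrap_toolchain.py: the function names are
hypothetical, and the sketch skips the OS_MAPPING lookup that the real script
applies to the resulting identifier.

```python
def parse_os_release(text):
    """Parse KEY=VALUE lines from os-release content into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be quoted with double or single quotes.
        values[key] = value.strip().strip('"').strip("'")
    return values


def distro_identifier(text):
    """Combine ID with the major component of VERSION_ID, e.g. 'rocky8'."""
    values = parse_os_release(text)
    major = values["VERSION_ID"].split(".")[0]
    return values["ID"] + major


sample = 'NAME="Rocky Linux"\nID="rocky"\nVERSION_ID="8.5"\n'
print(distro_identifier(sample))  # rocky8
```

On a real system the text would come from reading /etc/os-release.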
Testing:
- Ran the logic against the /etc/os-release files for Redhat/Centos 7,
Redhat/Centos/Rocky/Almalinux 8, Ubuntu, and SLES 12/15.
Change-Id: Ida3ffb8525c5b750ddbf9fd3ed5d0782fac9cdd0
Reviewed-on: http://gerrit.cloudera.org:8080/20070
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Cloudflare Zlib is a fork of the Zlib codebase that
has been optimized to take advantage of CPU SIMD
instructions and other platform-specific optimizations.
It has the same license as regular Zlib. Amazon has
touted this as a major speedup over regular Zlib:
https://aws.amazon.com/blogs/opensource/improving-zlib-cloudflare-and-comparing-performance-with-other-zlib-forks/
This adds the IMPALA_USE_CLOUDFLARE_ZLIB environment
variable which allows Impala to be built against
Cloudflare Zlib. This defaults to true. If set to
any other value, it will build against regular Zlib.
Cloudflare Zlib shows a clear performance benefit
over regular Zlib on TPC-H ORC/deflate benchmark:
+----------+-------------------+---------+------------+------------+----------------+
| Workload | File Format       | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-------------------+---------+------------+------------+----------------+
| TPCH(42) | orc / def / block | 4.18    | -6.43%     | 3.29       | -6.74%         |
+----------+-------------------+---------+------------+------------+----------------+
Testing:
- Ran GVO tests and exhaustive release tests
Change-Id: I82c480890726da0fa5bdc2a646022554eec181f4
Reviewed-on: http://gerrit.cloudera.org:8080/19207
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
The original fix for IMPALA-9627 removed the local check_output()
function for bin/bootstrap_toolchain.py and replaced it with
subprocess.check_call(). The local check_output() function was
preventing output from the wget call from going to the commandline.
The subprocess.check_call() doesn't do this. Since we don't want
this output anyway, this modifies the wget call to add "-q" to
avoid the extra output.
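A minimal sketch of the idea, assuming a wget invocation of roughly this
shape; the helper names and the exact argument list are hypothetical and not
taken from bootstrap_toolchain.py:

```python
import subprocess


def wget_command(url, destination):
    """Build a wget invocation; '-q' suppresses wget's progress output."""
    return ["wget", "-q", url, "-O", destination]


def download(url, destination):
    # check_call raises CalledProcessError on a non-zero exit status and
    # lets the child write straight to the terminal, hence the need for '-q'.
    subprocess.check_call(wget_command(url, destination))


print(wget_command("https://example.com/pkg.tar.gz", "pkg.tar.gz"))
```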
Testing:
- Verified the extra output was gone
Change-Id: If5b81f502a73551d269a0379ed0106ba4d3c8363
Reviewed-on: http://gerrit.cloudera.org:8080/19814
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We're starting to see environments where the system Python ('python') is
Python 3. Updates utility and build scripts to work with Python 3, and
updates check-pylint-py3k.sh to check scripts that use system python.
Fixes other issues found during a full build and test run with Python
3.8 as the default for 'python'.
Fixes an impala-shell tip that was supposed to have been two tips (and
had no space after period when they were printed).
Removes out-of-date deploy.py and various Python 2.6 workarounds.
Testing:
- Full build with /usr/bin/python pointed to python3
- run-all-tests passed with python pointed to python3
- ran push_to_asf.py
Change-Id: Idff388aff33817b0629347f5843ec34c78f0d0cb
Reviewed-on: http://gerrit.cloudera.org:8080/19697
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This bumps the toolchain version to get zlib 1.2.13,
which contains the fix for CVE-2022-37434.
This toolchain incorporates several changes to clean
up the native-toolchain and remove unnecessary
component builds. As part of this, OpenSSL is no
longer built in the toolchain, so this stops
downloading it. This changes the build to require
OpenSSL 1.0.2 or higher. This doesn't impact anything,
because all supported platforms already used
OpenSSL 1.0.2 or higher. See IMPALA-12064.
Testing:
- perf-AB-test shows no change in performance for
ORC with deflate
- GVO passes
Change-Id: I96efc947534cda8d15d4f440cd6851d397b6562d
Reviewed-on: http://gerrit.cloudera.org:8080/19760
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds a Python 3 equivalent to the impala-python
virtualenv based on the toolchain Python 3.7.16.
This modifies bootstrap_virtualenv.py to support
the two different modes. This adds py2-requirements.txt
and py3-requirements.txt to allow some differences
between the Python 2 and Python 3 virtualenvs.
Here are some specific package changes:
- allpairs is replaced with allpairspy, as allpairs did
not support Python 3.
- requests is upgraded slightly, because otherwise it has issues
with idna==2.8.
- pylint is limited to Python 3, because we are adding it
and don't need it on both
- flake8 is limited to Python 2, because it will take
some work to switch to a version that works on Python 3
- cm_api is limited to Python 2, because it doesn't support
Python 3
- pytest-random does not support Python 3 and it is unused,
so it is removed
- Bump the version of setuptools-scm to support Python 3
This adds impala-pylint, which can be used to do further
Python 3 checks via --py3k. This also adds a bin/check-pylint-py3k.sh
script to enforce specific py3k checks. The banned py3k warnings
are specified in the bin/banned_py3k_warnings.txt. This is currently
empty, but this can ratchet up the py3k strictness over time
to avoid regressions.
This pulls in a new toolchain with the fix for IMPALA-11956
to get Python 3.7.16.
Testing:
- Hand tested that the allpairs libraries produce the
same results
- The python3 virtualenv has no influence on regular
tests yet
Change-Id: Ica4853f440c9a46a79bd5fb8e0a66730b0b4efc0
Reviewed-on: http://gerrit.cloudera.org:8080/19567
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Updates utility scripts that don't use impala-python to work with Python
3 so we can build on systems that don't include Python 2 (such as SLES
15 SP4).
Primarily adds 'universal_newlines=True' to subprocess calls so they
return text rather than binary data in Python 3 with a change that's
compatible with Python 2.
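The effect of the flag can be seen with a small, self-contained example. It
spawns a child Python process, so nothing here is specific to Impala's
scripts:

```python
import subprocess
import sys

# Run a child Python process that prints one line.
cmd = [sys.executable, "-c", "print('hello')"]

# Without the flag, Python 3 returns the captured output as bytes.
raw = subprocess.check_output(cmd)
# With the flag, the output is decoded and returned as text.
text = subprocess.check_output(cmd, universal_newlines=True)

print(type(raw).__name__)
print(type(text).__name__)
print(text.strip())
```

On Python 2 both calls return str, which is why the same line works unchanged
on either interpreter.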
Testing:
- built in SLES 15 SP4 container with Python 3
Change-Id: I7f4ce71fa1183aaeeca55d0666aeb113640c5cf2
Reviewed-on: http://gerrit.cloudera.org:8080/19559
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Updates IMPALA_TOOLCHAIN_BUILD_ID to a native-toolchain build that
includes SLES 15 support and adds OsMapping for SLES 15.
Testing:
- built with impala-toolchain-sles15 container image from
native-toolchain, which includes Python 2 and Java 8 SDK from OpenSUSE
Leap.
Change-Id: I4015b695862abc6eb901a857cc1c444aff1bbe24
Reviewed-on: http://gerrit.cloudera.org:8080/19556
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Python 3 does not support this old except syntax:
except Exception, e:
Instead, it needs to be:
except Exception as e:
This uses impala-futurize to fix all locations of
the old syntax.
Testing:
- The check-python-syntax.sh no longer shows errors
for except syntax.
Change-Id: I1737281a61fa159c8d91b7d4eea593177c0bd6c9
Reviewed-on: http://gerrit.cloudera.org:8080/19551
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Updates the test environment to default to the CDP build of Ozone, as
the latest build of CDP Hive depends on pre-release features unavailable
in Ozone 1.2.1. Apache Ozone 1.2 can still be used by setting
USE_APACHE_OZONE=true.
The latest CDP build also includes a version of Ozone based on
ozone#master with a candidate version of 1.3.0. Both Apache and CDP
therefore have builds of Ozone we can test with that use the new
artifact names introduced in Ozone 1.2, so this patch cleans up setup
that was only needed for Ozone versions prior to 1.2.
Change-Id: I1177a1b820fe21adca9f8c1cc51ff73ee001d3f2
Reviewed-on: http://gerrit.cloudera.org:8080/19247
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This adds support for Rocky Linux 8 and Alma Linux 8,
which are new Centos 8 alternatives. They use the
same toolchain as Centos 8.
Testing:
- Ran docker-based tests on Rocky Linux and Alma Linux.
The build passed and tests ran.
Change-Id: If10d71caa90d24e14d4cf6a28f5c27e03ef3c4c6
Reviewed-on: http://gerrit.cloudera.org:8080/18773
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds Ozone as an alternative to hdfs in the minicluster. Select by
setting `export TARGET_FILESYSTEM=ozone`. With that flag,
run-mini-dfs.sh will start Ozone instead of HDFS. Requires a snapshot
because Ozone does not support HBase (HDDS-3589); snapshot loading
doesn't work yet primarily due to HDDS-5502.
Uses the o3fs interface because Ozone puts specific restrictions on
bucket names (no underscores, for instance), and it was a lot easier to
use an interface where everything is written to a single bucket than to
update all Impala's use of HDFS-style paths to make `test-warehouse` a
bucket inside a volume.
Specifies reduced Ozone client retries during shutdown where Ozone may
not be available.
Passes tests with FE_TEST=false BE_TEST=false.
Change-Id: Ibf8b0f7b2d685d8b011df1926e12bf5434b5a2be
Reviewed-on: http://gerrit.cloudera.org:8080/18738
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Impala used to have one thrift compiler version to compile C++, Java,
and Python code.
Thrift serialization/deserialization is mostly compatible between
minor versions, so it is possible to use different thrift
compiler versions for different target languages. It is beneficial to do so
because it will allow Impala to upgrade separate components
independently.
This patch implements the infrastructure change required to do so. It
replaces most of the 'THRIFT_*' environment variables and CMake variables
with 'THRIFT_CPP_*', 'THRIFT_JAVA_*', and 'THRIFT_PY_*' to compile C++,
Java, and Python code respectively. All three still refer to the same
thrift version (thrift-0.11.0-p5).
Testing:
- Build Impala and pass core tests.
Change-Id: I56479dc69b79024d1a4d09211bbe88a61fa0c6a4
Reviewed-on: http://gerrit.cloudera.org:8080/18636
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Ubuntu 20 has been using the toolchain from Ubuntu 18.
Since Ubuntu 20 has been added to the toolchain, this
switches Impala to use a toolchain with Ubuntu 20 support
and uses the Ubuntu 20 bits. This is expected to help
with IMPALA-10962.
Testing:
- Ran a core build on Ubuntu 20
Change-Id: If2394b668ef3c56b1a4c0773fd5e4ff92be4a846
Reviewed-on: http://gerrit.cloudera.org:8080/18559
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There are some API changes in the new version of the protobuf library.
This patch makes the necessary changes for Impala code to compile.
Two tarballs of protobuf 3.14.0 are built in toolchain. Use
protobuf-3.14.0 for normal builds, use protobuf-3.14.0-clangcompat-p2
for Clang builds.
Bump up Kudu to the latest version 67ba3cae45.
Testing:
- Passed core DEBUG build and exhaustive release build.
- Passed core ASAN build.
Change-Id: Ia1df4faceff9fda169c9d15fe8b1e69cfabe0d43
Reviewed-on: http://gerrit.cloudera.org:8080/17948
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As part of moving to a newer protobuf, this updates the Kudu version
to get the fix for KUDU-3334. With this newer Kudu version, Clang
builds hit an error while linking:
lib/libLLVMCodeGen.a(TargetPassConfig.cpp.o):TargetPassConfig.cpp:
function llvm::TargetPassConfig::createRegAllocPass(bool):
error: relocation refers to global symbol "std::call_once<void (&)()>(std::once_flag&, void (&)())::{lambda()#2}::_FUN()",
which is defined in a discarded section
section group signature: "_ZZSt9call_onceIRFvvEJEEvRSt9once_flagOT_DpOT0_ENKUlvE0_clEv"
prevailing definition is from ../../build/debug/security/libsecurity.a(openssl_util.cc.o)
(This is from a newer binutils that will be pursued separately.)
As a hack to get around this error, this adds the calloncehack
shared library. The shared library publicly defines the symbol that
was coming from kudu_client. By linking it ahead of kudu_client, the
linker uses that rather than the one from kudu_client. This fixes
the Clang builds.
The new Kudu also requires a minor change to the flags for tserver
startup.
Testing:
- Ran debug tests and verified calloncehack is not used
- Ran ASAN tests
Change-Id: Ieccbe284f11445e1de792352ebc7c9e1fa2ca0c3
Reviewed-on: http://gerrit.cloudera.org:8080/18129
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds functionality to download JWKS from a given URL and
supports key rotation by periodically checking the JWKS URL for updates.
We use Kudu's EasyCurl wrapper to download the file from the given URL.
curl was added to native-toolchain. This patch modifies makefiles
and bootstrap_toolchain.py to integrate libcurl and libkudu_curl_util.
Added end-to-end JWT authentication test cases with JWKS specified as
an HTTP/HTTPS URL.
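The periodic check can be pictured with the following minimal sketch. It is
not the C++ code from this patch: the JwksCache class, the injected fetch
callable, and the choice to detect updates by comparing a content hash are
all assumptions made for illustration.

```python
import hashlib
import json


class JwksCache:
    """Keeps the most recent JWKS and replaces it only when the content changes."""

    def __init__(self, fetch):
        # 'fetch' is any callable returning the JWKS document as bytes,
        # for example a wrapper around an HTTP GET of the JWKS URL.
        self._fetch = fetch
        self._digest = None
        self.keys = {}

    def refresh(self):
        """Fetch the JWKS once; return True if the key set was replaced."""
        raw = self._fetch()
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._digest:
            return False  # unchanged, keep the current keys
        parsed = json.loads(raw.decode("utf-8"))
        # Index keys by their "kid" (key ID) so a verifier can look them up.
        self.keys = {k.get("kid"): k for k in parsed.get("keys", [])}
        self._digest = digest
        return True
```

A caller would invoke refresh() from a timer or background thread at the
chosen polling interval.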
Testing:
- Passed core run, including new test cases.
Change-Id: Ic6ac8cf0010c13db30219776d1d275709bf211df
Reviewed-on: http://gerrit.cloudera.org:8080/17802
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch modifies the minicluster script to optionally use Apache
Hive 3.1.2 instead of CDP Hive 3.1.3.
To make sure that existing setups don't break, this is
enabled via an environment variable override to bin/impala-config.sh.
When the environment variable USE_APACHE_HIVE is set to true, the
bootstrap_toolchain script downloads the Apache Hive 3.1.2 tarballs and
extracts them in the toolchain directory. These binaries are used to
start the Hive services (Hiveserver2 and metastore). The default is
CDP Hive 3.1.3.
Since CDP Hive 3 uses some features of Apache Hive 4, this patch uses
a different database name so that it is easy to switch between an
environment that uses the CDP Hive 3.1.3 metastore and one
that uses the Apache Hive 3.1.2 metastore.
In order to start a minicluster which uses Apache Hive 3.1.2 users
should follow the steps below:
1. Make sure that minicluster, if running, is stopped before you run
the following commands.
2. Open a new terminal and run following commands.
> export USE_APACHE_HIVE=true
> source bin/impala-config.sh
> bin/bootstrap_toolchain.py
The above command downloads the Apache Hive 3.1.2 tarballs and
extracts them in toolchain/apache_components directory.
> rm $HIVE_HOME/lib/guava-*jar
> cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-*.jar $HIVE_HOME/lib/
The above commands work around HIVE-22915.
> bin/create-test-configuration.sh -create_metastore
Pass "-create_metastore" in the above step only the first time,
so that a new metastore db is created and the Apache Hive 3.1.2 schema
is initialized.
> testdata/bin/run-all.sh
Follow-up:
- Add MetastoreShim to support Apache Hive 3.x in IMPALA-10871
Tests:
- Made sure that the cluster comes up with Apache Hive 3.1.2 when the
steps above are performed.
- Made sure that existing scripts work as they do currently when
argument is not provided.
Change-Id: I1978909589ecacb15d32d874e97f050a85adf1f6
Reviewed-on: http://gerrit.cloudera.org:8080/17793
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds JWT support with the following functionality:
* Load and parse JWKS from a pre-installed JSON file.
* Read the JWT token from the HTTP header.
* Verify the JWT's signature with the public key in the JWKS.
* Get the username out of the payload of the JWT token.
* Support the following JSON Web Algorithms (JWA):
HS256, HS384, HS512, RS256, RS384, RS512.
We use the third-party library jwt-cpp to verify JWT tokens. jwt-cpp is a
header-only C++ library. It was added to native-toolchain.
This patch modified bootstrap_toolchain.py to download jwt-cpp from
toolchain s3 bucket, and modified makefiles to add jwt-cpp/include
in the include path.
Added BE unit-tests for loading JWKS file and verifying JWT token.
Also added FE custom cluster test for JWT authentication.
Testing:
- Passed core run.
Change-Id: I6b71fa854c9ddc8ca882878853395e1eb866143c
Reviewed-on: http://gerrit.cloudera.org:8080/17435
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch Impala mainly used Thrift 0.9.3, but it was
possible to compile Impala shell with Thrift 0.11.0, so the 0.11.0
Thrift lib was already included in the toolchain.
Most of the changes are related to replacing boost:: with std::
shared_ptr-s in cpp code (this is a continuation of patch by Sahil).
The Thrift upgrade also needs an Impyla release with Thrift 0.11.0, as
Impala's test framework relies on Impyla. A thrift_sasl release is also
needed, because it currently pins Thrift version to 0.9.3 for Python 2.
The current patch uses alpha releases from Impyla and thrift_sasl that
use thrift 0.11.0.
Notable side effects:
- old logic to compile thrift for impala-shell with 0.11.0 was removed
- impala_shell's utf8 handling had to be updated as the new 0.11.0
compilation happens with no_utf8strings. This also made things a
bit faster, e.g. the following takes ~0.22s instead of ~0.25s:
shell/impala_shell.py \
-B -q "select * from functional_parquet.alltypes;" > /dev/null
- THRIFT-3921 changed the stream operators to print an enum's name
instead of its number, leading to slightly different messages
in some cases.
- "templates" was added to the thrift generator's parameters to avoid
a compilation issue (related to IMPALA-10600). I didn't notice any
change in compilation time. This option generates .tcc files with
templatized readers/writers for Thrift types. Currently we don't
use these, but they could potentially speed up (de)serialization.
Testing:
- ran Impyla's test suite with Python 2 and 3
- ran core tests
Change-Id: Idd13f177b4f7acc07872ea6399035aa180ef6ab6
Reviewed-on: http://gerrit.cloudera.org:8080/17170
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Recent Red Hat Enterprise Linux 8.x versions return a shorter release
name string from `lsb_release -si` than earlier versions. This shorter
string was not recognized by the OS mapper logic in
bin/bootstrap_toolchain.py, making it -- and the build process -- break
on Red Hat 8.2.
The patch adds the shorter signature as a point fix.
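The kind of lookup involved can be sketched as follows. The release strings
and the package label in this table are illustrative assumptions, not a copy
of the mapping in bin/bootstrap_toolchain.py.

```python
# Illustrative mapping only: the signatures and labels are assumptions.
OS_MAPPING = [
    ("redhatenterpriseserver8", "ec2-centos-8"),  # longer, older signature
    ("redhatenterprise8", "ec2-centos-8"),        # shorter, newer signature
    ("centos8", "ec2-centos-8"),
]


def get_platform_label(distro, release):
    """Map `lsb_release -si` / `-sr` style output to a toolchain package label."""
    # Only the major version takes part in the lookup.
    key = (distro + release.split(".")[0]).lower().replace(" ", "")
    for signature, label in OS_MAPPING:
        if key == signature:
            return label
    raise RuntimeError("Unsupported OS: %s %s" % (distro, release))
```

Adding one more row for the shorter string is what makes this a point fix:
the matching logic itself stays unchanged.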
Rename a local variable to fix an unrelated name conflict (shadowing)
found by flake8.
Tests: run bin/bootstrap_toolchain.py manually on Red Hat 8.2, then run
a complete build on the same OS.
Regression-tested (build and dataload only) on the following versions:
- Centos 8.2 (as opposed to Red Hat 8.2)
- Centos 7.9
- Ubuntu 18.04
Change-Id: Icb1a6c215b1b5a65691042bb7d94fb034392d135
Reviewed-on: http://gerrit.cloudera.org:8080/17292
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
- Add HIVE_VERSION_OVERRIDE, HIVE_STORAGE_API_VERSION_OVERRIDE,
HIVE_METASTORE_THRIFT_DIR_OVERRIDE, HIVE_HOME_OVERRIDE environment
variable support to impala-config.sh
- When used together with HIVE_SRC_DIR_OVERRIDE allows a user to
specify a locally compiled version of Hive for development and the
minicluster
- Hive jars are expected to have been installed into the local maven
repository
- Currently only version 3 of Hive is supported due to the absence of
API shims for Hive 4.0
Example:
~/hive $ mvn package install -Pdist -DskipTests
Example configuration:
export HIVE_VERSION_OVERRIDE=3.1.0-SNAPSHOT
export HIVE_STORAGE_API_VERSION_OVERRIDE=2.6.0
export HIVE_HOME_OVERRIDE=\
~/hive/packaging/target/apache-hive-3.1.0-SNAPSHOT-bin/apache-hive-3.1.0-SNAPSHOT-bin
export HIVE_SRC_DIR_OVERRIDE=~/hive
export HIVE_METASTORE_THRIFT_DIR_OVERRIDE=~/hive/standalone-metastore/src/main/thrift/
Change-Id: I21892c153c445e3a5d93f2bc8f5e0b799929dd34
Reviewed-on: http://gerrit.cloudera.org:8080/17094
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Ubuntu 20.04
This is a minor amendment to a previously merged change with
ChangeId I4f592f60881fd8f34e2bf393a76f5a921505010a, to address
additional review comments. In particular, the original commit
referred to Ubuntu 20.4 whereas it should have used Ubuntu 20.04.
Change-Id: I7db302b4f1d57ec9aa2100d7589d5e814db75947
Reviewed-on: http://gerrit.cloudera.org:8080/16241
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Includes the following changes:
1 build native-toolchain locally by script on the aarch64 platform
2 change the version numbers of some native-toolchain libs
3 split SKIP_TOOLCHAIN_BOOTSTRAP and DOWNLOAD_CDH_COMPONENTS into two separate settings,
because aarch64 only needs to download the cdp components
and does not need to download the toolchain.
4 download the hadoop aarch64 nativelibs, which the impala build needs.
With this commit, on ubuntu 18.04 aarch64,
you just need to run bin/bootstrap_development.sh, just like on x86.
Change-Id: I769668c834ab0dd504a822ed9153186778275d59
Reviewed-on: http://gerrit.cloudera.org:8080/16065
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If DownloadUnpackTarball::download()'s wget_and_unpack_package call
hits an exception, the exception handler cleans up any created
directories. Currently, it erroneously cleans up the directory where
the tarballs are downloaded even when it is not a temporary directory.
This would delete the entire toolchain.
This fixes the cleanup to only delete that directory if it is a
temporary directory.
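A minimal sketch of the corrected cleanup rule, assuming a hypothetical
download_and_unpack helper in which the 'unpack' callable stands in for
wget_and_unpack_package:

```python
import os
import shutil
import tempfile


def download_and_unpack(unpack, download_dir=None):
    """Run unpack(download_dir); on failure, clean up without touching a shared dir.

    'unpack' is a hypothetical callable that downloads and extracts a package
    into the directory it is given.
    """
    is_temporary = download_dir is None
    if is_temporary:
        download_dir = tempfile.mkdtemp()
    try:
        unpack(download_dir)
    except Exception:
        # Only remove the directory if this function created it. A directory
        # supplied by the caller may hold the rest of the toolchain.
        if is_temporary:
            shutil.rmtree(download_dir, ignore_errors=True)
        raise
    return download_dir
```

Tracking ownership with a flag set at creation time keeps the exception
handler from having to guess whether a path is safe to delete.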
Testing:
- Simulated exception from wget_and_unpack_package and verified
behavior.
Change-Id: Ia57f56b6717635af94247fce50b955c07a57d113
Reviewed-on: http://gerrit.cloudera.org:8080/16294
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Ubuntu 20.4
This work addresses a current limitation of the Impala development
environment: Ubuntu 20.4 is not supported. The fix modifies
bootstrap_system.sh and bootstrap_toolchain.py to specifically
allow bootstrapping the development environment on a machine
running Ubuntu 20.4. Limited use shows that the environment is useful
and stable, similar to the one running on Ubuntu 18.4.
Testing on a box running Ubuntu 20.4:
1. Successfully bootstrapped the entire Impala development environment
2. Interacted with the environment through the following tools:
gdb
jdb
clang-format
impalad GUI
vim
3. Ran all tests
Limitations found with the Ubuntu 20.4 environment:
1. gdb in the Impala toolchain is not compatible with the Impala C++ test
code ${IMPALA_HOME}/be/build/latest/service\
/unifiedbetests (invoked by ${IMPALA_HOME}/be/build/latest/\
scheduling/admission-controller-test) and reports the following
error after attaching to the test process:
BFD (GNU Binutils) 2.25.51 internal error, aborting at elf64-x86-64.c
line 5583 in elf_x86_64_get_plt_sym_val
Change-Id: I4f592f60881fd8f34e2bf393a76f5a921505010a
Reviewed-on: http://gerrit.cloudera.org:8080/16238
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
bin/bootstrap_toolchain.py failed to recognize SLES12 sp5, which broke
builds on that platform.
This patch simplifies OS version parsing and matching for SLES, omitting
the check for the OS minor version, which shows the SP level for SLES.
This is similar to how Red Hat variants are handled. It removes
the need for an update whenever a new SP level is released for
SLES12.
This is enabled by the native toolchain sharing a single set of artifacts
between all the SLES12 SP levels.
Test: ran a successful build on a SLES12sp5 box.
Change-Id: Id9ada210b915050fbceebb7364e130116e9244d0
Reviewed-on: http://gerrit.cloudera.org:8080/16102
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The locations for native-toolchain packages in IMPALA_TOOLCHAIN
currently do not include the compiler version. This means that
the toolchain can't distinguish between native-toolchain packages
built with gcc 4.9.2 versus gcc 7.5.0. The collisions can cause
issues when switching back and forth between branches.
This introduces the IMPALA_TOOLCHAIN_PACKAGES_HOME environment
variable, which is a location inside IMPALA_TOOLCHAIN that would
hold native-toolchain packages. Currently, it is set to the same
as IMPALA_TOOLCHAIN, so there is no difference in behavior.
This lays the groundwork to add the compiler version to this
path when switching to GCC7.
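The intent can be sketched as below. The helper names, the compiler_subdir
parameter, the "gcc-7.5.0" directory name, and the "<name>-<version>" package
layout are assumptions used only to show how the extra level of indirection
could later carry a compiler version.

```python
import os


def toolchain_packages_home(env, compiler_subdir=""):
    """Return the directory holding native-toolchain packages.

    With an empty 'compiler_subdir' the result equals IMPALA_TOOLCHAIN, so
    behavior is unchanged. A non-empty value (for example a hypothetical
    "gcc-7.5.0") shows how a compiler version could be added later.
    """
    explicit = env.get("IMPALA_TOOLCHAIN_PACKAGES_HOME")
    if explicit:
        return explicit
    base = env["IMPALA_TOOLCHAIN"]
    return os.path.join(base, compiler_subdir) if compiler_subdir else base


def package_dir(env, name, version):
    # Assumed layout: <packages home>/<name>-<version>
    return os.path.join(toolchain_packages_home(env), "%s-%s" % (name, version))
```

Code that resolves package paths through a single function like this does
not need to change again when the compiler version is added to the path.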
Testing:
- The only impediment to building with
IMPALA_TOOLCHAIN_PACKAGES_HOME=$IMPALA_TOOLCHAIN/test is
Impala-lzo. With a custom Impala-lzo, compilation succeeds.
Either Impala-lzo will be fixed or it will be removed.
- Core tests
Change-Id: I1ff641e503b2161baf415355452f86b6c8bfb15b
Reviewed-on: http://gerrit.cloudera.org:8080/15991
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.
Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
"ranger" requires extra parameters to be set. This changes the
default value of authorization_provider to "", which translates
internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
- authorization_policy_provider_class
- sentry_catalog_polling_frequency_s
- sentry_config
3. The authorization_factory_class may be obsolete now that
there is only one authorization policy, but this leaves it
in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
that is removed. There are still Maven dependencies coming
from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
is not removed and it is still called from testdata/bin/kill-all.sh.
Testing:
- Core job passes
Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
USE_CDP_HIVE now has no effect on the Impala environment. This also
means that bin/jenkins/build-all-flag-combinations.sh no longer
includes USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.
There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.
Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9626 broke the use case where the toolchain binaries are not
downloaded from the native-toolchain S3 bucket, because
SKIP_TOOLCHAIN_BOOTSTRAP is set to true.
Fix this use case by checking SKIP_TOOLCHAIN_BOOTSTRAP in
bin/bootstrap_environment.py:
- if true: just check if the specified version of the Python binary is
present at the expected toolchain location. If it is there, use it,
otherwise throw an exception and abort the bootstrap process.
- in any other case: proceed to download the Python binary as in
bootstrap_toolchain.py.
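The two branches above can be sketched like this. The function name, the
bin/python layout under the package directory, and the 'download' callable
are hypothetical.

```python
import os


def find_toolchain_python(env, python_dir, download):
    """Return the path of the toolchain Python binary.

    'python_dir' is the expected toolchain location and 'download' is a
    hypothetical callable that fetches the package into that directory.
    """
    binary = os.path.join(python_dir, "bin", "python")
    if env.get("SKIP_TOOLCHAIN_BOOTSTRAP") == "true":
        # Custom toolchain: never download, only verify that it is present.
        if not os.path.isfile(binary):
            raise RuntimeError("Toolchain Python not found at " + binary)
        return binary
    # In any other case, download the binary as bootstrap_toolchain.py would.
    download(python_dir)
    return binary
```

Failing fast in the first branch gives a clear error instead of a later,
harder to diagnose virtualenv failure.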
Test:
- simulate the custom toolchain setup by downloading the toolchain
binaries from the S3 bucket, copying them to a separate directory,
symlinking them into Impala/toolchain, then executing buildall.sh
with SKIP_TOOLCHAIN_BOOTSTRAP set to "true".
Change-Id: Ic51b3c327b3cebc08edff90de931d07e35e0c319
Reviewed-on: http://gerrit.cloudera.org:8080/15759
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Historically Impala used the Python2 version that was available on
the hosting platform, as long as that version was at least v2.6.
This caused a constant headache, as all Python syntax had to be kept
compatible with Python 2.6 (for Centos 6). It also caused a recent problem
on Centos 8: here the system Python version was compiled with the
system's GCC version (v8.3), which was much more recent than the Impala
standard compiler version (GCC 4.9.2). When the Impala virtualenv was
built, the system Python version supplied C compiler switches for modules
containing native code that were unknown to the Impala version of GCC,
thus breaking virtualenv installation.
This patch changes the Impala virtualenv to always use the Python2
version from the toolchain, which is built with the toolchain compiler.
This ensures that
- Impala always has a known Python 2.7 version for all its scripts,
- virtualenv modules based on native code will always be installable, as
the Python environment and the modules are built with the same compiler
version.
Additional changes:
- Add an auto-use fixture to conftest.py to check that the tests are
being run with Python 2.7.x
- Make bootstrap_toolchain.py independent from the Impala virtualenv:
remove the dependency on the "sh" library
Tests:
- Passed core-mode tests on CentOS 7.4
- Passed core-mode tests in Docker-based mode for centos:7
and ubuntu:16.04
Most content in this patch was developed but not published earlier
by Tim Armstrong.
Change-Id: Ic7b40cef89cfb3b467b61b2d54a94e708642882b
Reviewed-on: http://gerrit.cloudera.org:8080/15624
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
CentOS 8.1 is a new major version of the CentOS family.
It is now stable and popular enough to start supporting it for Impala
development.
Prepare a raw CentOS 8.1 system to support Impala development and testing.
This should work on a standalone computer, on a virtual machine,
or inside a Docker container.
Details:
- snappy-devel moved to the PowerTools repo, so it needs to be installed
from there
- CentOS 8 has no default Python version. The bootstrap script installs
(or configures) Python2 with pip2, then makes them the default via the
"alternatives" mechanism. The installer is adaptive: it performs only
the necessary steps, so it works in various environments.
The installer logic is also shared between bin/bootstrap_system.sh and
docker/entrypoint.sh.
- The toolchain package tag "ec2-centos-8" is added to
bootstrap_toolchain.py
- For some unknown reason, when the downloaded Maven tarball is extracted
in a Docker-based test, the "bin" and "boot" directories are created
with owner-only permissions. The 'impdev' user has no access to the
maven executable, which then breaks the build.
This patch forcibly restores the correct permissions on these
directories; this is a no-op when the extraction happens correctly.
- TOOLCHAIN_ID is bumped to a build that already has CentOS 8 binaries.
- Centos8-specific bootstrap code was added to the Docker-based tests.
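The permission repair described for the Maven "bin" and "boot" directories
can be pictured with this small sketch. The helper name is hypothetical, and
treating group/other read and execute as the "correct" permissions is an
assumption.

```python
import os
import stat
import tempfile


def ensure_dir_accessible(path):
    """Add read and execute permission for group and others on a directory.

    Returns the resulting permission bits. When the bits are already set,
    nothing is changed, so the call is a no-op after a correct extraction.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    wanted = mode | stat.S_IRGRP | stat.S_IXGRP | stat.S_IROTH | stat.S_IXOTH
    if wanted != mode:
        os.chmod(path, wanted)
    return wanted


if __name__ == "__main__":
    # mkdtemp creates an owner-only directory, similar to the broken state.
    demo = tempfile.mkdtemp()
    print(oct(ensure_dir_accessible(demo)))
    os.rmdir(demo)
```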
Tested:
- ran the Docker-based tests with --base-image=centos:8 to verify the following build
phases are successful:
* system prep
* build
* dataload
and that tests can start. Passing all tests was not a requirement for this step,
although plausible test results (i.e. not all of the tests fail) were.
- ran the Docker-based tests to verify nonregression with --base-image set to the
following: centos:7, ubuntu:16.04, ubuntu:18.04.
On centos:7 and ubuntu:16.04 the only failure was IMPALA-9097 (BE tests fail without
the minicluster running); ubuntu:18.04 showed the same failures as the current upstream
code.
- passed a core-mode test run on private infrastructure on Centos 7.4
- ran buildall.sh in core mode manually inside a Docker container, simulating a developer
workflow (prep-build-dataload-test). There were several observed test failures, but
the workflow itself was run to completion with no problems.
Change-Id: I3df5d48eca7a10219264e3604a4f05f072188e6e
Reviewed-on: http://gerrit.cloudera.org:8080/15623
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>