Ubuntu 20 builds have been using the toolchain built for Ubuntu 18.
Now that Ubuntu 20 has been added to the toolchain, this switches
Impala to a toolchain version with Ubuntu 20 support and uses the
Ubuntu 20 binaries. This is expected to help with IMPALA-10962.
Testing:
- Ran a core build on Ubuntu 20
Change-Id: If2394b668ef3c56b1a4c0773fd5e4ff92be4a846
Reviewed-on: http://gerrit.cloudera.org:8080/18559
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There are some API changes in the new version of the protobuf library.
This patch makes the necessary changes for the Impala code to compile.
Two tarballs of protobuf 3.14.0 are built in the toolchain:
protobuf-3.14.0 is used for normal builds and
protobuf-3.14.0-clangcompat-p2 for Clang builds.
Bump up Kudu to the latest version 67ba3cae45.
Testing:
- Passed core DEBUG build and exhaustive release build.
- Passed core ASAN build.
Change-Id: Ia1df4faceff9fda169c9d15fe8b1e69cfabe0d43
Reviewed-on: http://gerrit.cloudera.org:8080/17948
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As part of moving to a newer protobuf, this updates the Kudu version
to get the fix for KUDU-3334. With this newer Kudu version, Clang
builds hit an error while linking:
lib/libLLVMCodeGen.a(TargetPassConfig.cpp.o):TargetPassConfig.cpp:
function llvm::TargetPassConfig::createRegAllocPass(bool):
error: relocation refers to global symbol "std::call_once<void (&)()>(std::once_flag&, void (&)())::{lambda()#2}::_FUN()",
which is defined in a discarded section
section group signature: "_ZZSt9call_onceIRFvvEJEEvRSt9once_flagOT_DpOT0_ENKUlvE0_clEv"
prevailing definition is from ../../build/debug/security/libsecurity.a(openssl_util.cc.o)
(This is from a newer binutils that will be pursued separately.)
As a hack to get around this error, this adds the calloncehack
shared library. The shared library publicly defines the symbol that
was coming from kudu_client. By linking it ahead of kudu_client, the
linker uses that rather than the one from kudu_client. This fixes
the Clang builds.
The new Kudu also requires a minor change to the flags for tserver
startup.
Testing:
- Ran debug tests and verified calloncehack is not used
- Ran ASAN tests
Change-Id: Ieccbe284f11445e1de792352ebc7c9e1fa2ca0c3
Reviewed-on: http://gerrit.cloudera.org:8080/18129
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch added functionality to download a JWKS from a given URL and
support key rotation by periodically checking the JWKS URL for updates.
We use Kudu's EasyCurl wrapper to download the file from the given URL.
curl was added to the native toolchain. This patch modified makefiles
and bootstrap_toolchain.py to integrate libcurl and libkudu_curl_util.
Added end-to-end JWT authentication test cases with the JWKS specified
as an HTTP/HTTPS URL.
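The production code is C++ built on Kudu's EasyCurl, but the
download-plus-periodic-refresh idea can be sketched in Python; the class
name, interval, and URL handling below are illustrative, not Impala's
actual API:

  import json
  import threading
  import time
  import urllib.request

  class JwksRefresher(object):
      """Keeps an in-memory copy of a JWKS that is re-fetched periodically."""

      def __init__(self, jwks_url, refresh_interval_s=60):
          self.jwks_url = jwks_url
          self.refresh_interval_s = refresh_interval_s
          self._lock = threading.Lock()
          self._keys = {}

      def refresh(self):
          # Download the JWKS JSON and index the keys by key id ("kid").
          with urllib.request.urlopen(self.jwks_url) as resp:
              jwks = json.load(resp)
          keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}
          with self._lock:
              self._keys = keys

      def key_for(self, kid):
          with self._lock:
              return self._keys.get(kid)

      def start(self):
          self.refresh()  # fail fast if the URL is unreachable at startup
          def loop():
              while True:
                  time.sleep(self.refresh_interval_s)
                  try:
                      self.refresh()  # picks up rotated keys
                  except Exception:
                      pass  # keep the last good key set if a fetch fails
          threading.Thread(target=loop, daemon=True).start()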
Testing:
- Passed core run, including new test cases.
Change-Id: Ic6ac8cf0010c13db30219776d1d275709bf211df
Reviewed-on: http://gerrit.cloudera.org:8080/17802
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch modifies the minicluster script to optionally use Apache
Hive 3.1.2 instead of CDP Hive 3.1.3.
In order to make sure that existing setups don't break, this is
enabled via an environment variable override in bin/impala-config.sh.
When the environment variable USE_APACHE_HIVE is set to true, the
bootstrap_toolchain script downloads the Apache Hive 3.1.2 tarballs and
extracts them into the toolchain directory. These binaries are used to
start the Hive services (Hiveserver2 and metastore). The default is
CDP Hive 3.1.3.
Since CDP Hive 3 uses some features of Apache Hive 4, this patch uses
a different database name so that it is easy to switch from one
environment which uses the CDP Hive 3.1.3 metastore to another which
uses the Apache Hive 3.1.2 metastore.
In order to start a minicluster which uses Apache Hive 3.1.2 users
should follow the steps below:
1. Make sure that the minicluster, if running, is stopped before you
run the following commands.
2. Open a new terminal and run the following commands.
> export USE_APACHE_HIVE=true
> source bin/impala-config.sh
> bin/bootstrap_toolchain.py
The above command downloads the Apache Hive 3.1.2 tarballs and
extracts them into the toolchain/apache_components directory.
> rm $HIVE_HOME/lib/guava-*jar
> cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-*.jar $HIVE_HOME/lib/
The above commands work around HIVE-22915.
> bin/create-test-configuration.sh -create_metastore
The "-create_metastore" flag should be provided only the first time,
so that a new metastore db is created and the Apache Hive 3.1.2 schema
is initialized.
> testdata/bin/run-all.sh
Follow-up:
- Add MetastoreShim to support Apache Hive 3.x in IMPALA-10871
Tests:
- Made sure that the cluster comes up with Apache Hive 3.1.2 when the
steps above are performed.
- Made sure that existing scripts work as they do currently when
USE_APACHE_HIVE is not set.
Change-Id: I1978909589ecacb15d32d874e97f050a85adf1f6
Reviewed-on: http://gerrit.cloudera.org:8080/17793
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch added JWT support with the following functionality:
* Load and parse a JWKS from a pre-installed JSON file.
* Read the JWT token from the HTTP header.
* Verify the JWT's signature with the public key in the JWKS.
* Get the username out of the payload of the JWT token.
* Support the following JSON Web Algorithms (JWA):
  HS256, HS384, HS512, RS256, RS384, RS512.
We use the third-party library jwt-cpp to verify JWT tokens. jwt-cpp is
a header-only C++ library. It was added to the native toolchain.
This patch modified bootstrap_toolchain.py to download jwt-cpp from the
toolchain S3 bucket, and modified makefiles to add jwt-cpp/include
to the include path.
Added BE unit tests for loading the JWKS file and verifying JWT tokens.
Also added an FE custom cluster test for JWT authentication.
Testing:
- Passed core run.
Change-Id: I6b71fa854c9ddc8ca882878853395e1eb866143c
Reviewed-on: http://gerrit.cloudera.org:8080/17435
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch Impala mainly used Thrift 0.9.3, but it was
possible to compile Impala shell with Thrift 0.11.0, so the 0.11.0
Thrift lib was already included in the toolchain.
Most of the changes are related to replacing boost:: with std::
shared_ptrs in C++ code (this is a continuation of a patch by Sahil).
The Thrift upgrade also needs an Impyla release with Thrift 0.11.0, as
Impala's test framework relies on Impyla. A thrift_sasl release is also
needed, because it currently pins the Thrift version to 0.9.3 for
Python 2. The current patch uses alpha releases of Impyla and
thrift_sasl that use Thrift 0.11.0.
Notable side effects:
- old logic to compile Thrift for impala-shell with 0.11.0 was removed
- impala_shell's utf8 handling had to be updated as the new 0.11.0
compilation happens with no_utf8strings. This also made things a
bit faster, e.g. the following takes ~0.22s instead of ~0.25s:
shell/impala_shell.py \
-B -q "select * from functional_parquet.alltypes;" > /dev/null
- THRIFT-3921 changed the stream operators to print an enum's name
instead of its number, leading to slightly different messages
in some cases.
- "templates" was added to the thift generator's parameters to avoid
a compilation issue (related to IMPALA-10600). I didn't notice any
change in compilation time. This option generated .tcc files with
templetized readers/writers for Thrift types. Currently we don't
use these, but they could potentially speed up (de)serialization.
Testing:
- ran Impyla's test suite with Python 2 and 3
- ran core tests
Change-Id: Idd13f177b4f7acc07872ea6399035aa180ef6ab6
Reviewed-on: http://gerrit.cloudera.org:8080/17170
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Recent Red Hat Enterprise Linux 8.x versions return a shorter release
name string from `lsb_release -si` than earlier versions. This shorter
string was not recognized by the OS mapper logic in
bin/bootstrap_toolchain.py, making it -- and the build process -- break
on Red Hat 8.2.
The patch adds the shorter signature as a point fix.
Rename a local variable to fix an unrelated name conflict (shadowing)
found by flake8.
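A minimal sketch of the kind of OS mapping involved; the table entries
and label format are illustrative, not the actual bootstrap_toolchain.py
code:

  # Map the string from `lsb_release -si` to a toolchain package label.
  # Older RHEL releases report "RedHatEnterpriseServer"; recent 8.x
  # releases report the shorter "RedHatEnterprise", so both spellings
  # map to the same label.
  OS_MAPPING = {
      "centos": "ec2-package-centos-{major}",
      "redhatenterpriseserver": "ec2-package-centos-{major}",
      "redhatenterprise": "ec2-package-centos-{major}",
  }

  def toolchain_label(lsb_id, lsb_release):
      """lsb_id is `lsb_release -si`; lsb_release is `lsb_release -sr`, e.g. '8.2'."""
      key = lsb_id.strip().lower()
      if key not in OS_MAPPING:
          raise Exception("Unsupported OS distribution: %s" % lsb_id)
      major = lsb_release.split(".")[0]
      return OS_MAPPING[key].format(major=major)

  assert (toolchain_label("RedHatEnterprise", "8.2")
          == toolchain_label("RedHatEnterpriseServer", "8.2"))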
Tests: run bin/bootstrap_toolchain.py manually on Red Hat 8.2, then run
a complete build on the same OS.
Regression-tested (build and dataload only) on the following versions:
- Centos 8.2 (as opposed to Red Hat 8.2)
- Centos 7.9
- Ubuntu 18.04
Change-Id: Icb1a6c215b1b5a65691042bb7d94fb034392d135
Reviewed-on: http://gerrit.cloudera.org:8080/17292
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
- Add HIVE_VERSION_OVERRIDE, HIVE_STORAGE_API_VERSION_OVERRIDE,
HIVE_METASTORE_THRIFT_DIR_OVERRIDE, HIVE_HOME_OVERRIDE environment
variable support to impala-config.sh
- When used together with HIVE_SRC_DIR_OVERRIDE, these allow a user to
specify a locally compiled version of Hive for development and the
minicluster
- Hive jars are expected to have been installed into the local maven
repository
- Currently only version 3 of Hive is supported due to the absence of
API shims for Hive 4.0
Example:
~/hive $ mvn package install -Pdist -DskipTests
Example configuration:
export HIVE_VERSION_OVERRIDE=3.1.0-SNAPSHOT
export HIVE_STORAGE_API_VERSION_OVERRIDE=2.6.0
export HIVE_HOME_OVERRIDE=\
~/hive/packaging/target/apache-hive-3.1.0-SNAPSHOT-bin/apache-hive-3.1.0-SNAPSHOT-bin
export HIVE_SRC_DIR_OVERRIDE=~/hive
export HIVE_METASTORE_THRIFT_DIR_OVERRIDE=~/hive/standalone-metastore/src/main/thrift/
Change-Id: I21892c153c445e3a5d93f2bc8f5e0b799929dd34
Reviewed-on: http://gerrit.cloudera.org:8080/17094
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Ubuntu 20.04
This is a minor amendment to a previously merged change with
ChangeId I4f592f60881fd8f34e2bf393a76f5a921505010a, to address
additional review comments. In particular, the original commit
referred to Ubuntu 20.4 whereas it should have used Ubuntu 20.04.
Change-Id: I7db302b4f1d57ec9aa2100d7589d5e814db75947
Reviewed-on: http://gerrit.cloudera.org:8080/16241
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Including the following changes:
1. Build the native toolchain locally by script on the aarch64 platform.
2. Change some of the native toolchain's lib version numbers.
3. Split SKIP_TOOLCHAIN_BOOTSTRAP and DOWNLOAD_CDH_COMPONENTS into two
separate settings, because on aarch64 we only need to download the CDP
components, not the toolchain.
4. Download the Hadoop aarch64 native libs, which the Impala build needs.
With this commit, on the Ubuntu 18.04 aarch64 version, you just need to
run bin/bootstrap_development.sh, just like on x86.
Change-Id: I769668c834ab0dd504a822ed9153186778275d59
Reviewed-on: http://gerrit.cloudera.org:8080/16065
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If DownloadUnpackTarball::download()'s wget_and_unpack_package call
hits an exception, the exception handler cleans up any created
directories. Currently, it erroneously cleans up the directory where
the tarballs are downloaded even when it is not a temporary directory.
This would delete the entire toolchain.
This fixes the cleanup to only delete that directory if it is a
temporary directory.
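A minimal sketch of the corrected cleanup, with the download wrapped in
a caller-supplied function so the example stays self-contained (names
are simplified, not the actual bootstrap_toolchain.py code):

  import shutil
  import tempfile

  def download_and_unpack(fetch_fn, download_dir=None):
      """fetch_fn stands in for wget_and_unpack_package: it downloads and
      unpacks into the given directory and may raise on failure."""
      is_temp_dir = download_dir is None
      if is_temp_dir:
          download_dir = tempfile.mkdtemp()
      try:
          fetch_fn(download_dir)
      except Exception:
          # Only clean up the staging directory if this call created it;
          # removing a caller-provided directory here could wipe out the
          # whole toolchain.
          if is_temp_dir:
              shutil.rmtree(download_dir, ignore_errors=True)
          raise
      return download_dir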
Testing:
- Simulated exception from wget_and_unpack_package and verified
behavior.
Change-Id: Ia57f56b6717635af94247fce50b955c07a57d113
Reviewed-on: http://gerrit.cloudera.org:8080/16294
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Ubuntu 20.4
This work addresses the current limitation in the Impala development
environment that Ubuntu 20.4 is not supported. The fix modifies
bootstrap_system.sh and bootstrap_toolchain.py to specifically
allow bootstrapping the development environment on a machine
running Ubuntu 20.4. Limited use shows that the environment is useful
and stable, similar to the one running on Ubuntu 18.4.
Testing on a box running Ubuntu 20.4:
1. Successfully bootstrapped the entire Impala development environment
2. Interacted with the environment through the following tools:
gdb
jdb
clang-format
impalad GUI
vim
3. Ran all tests
Limitations found with the Ubuntu 20.4 environment:
1. gdb in the Impala toolchain is not compatible with the Impala C++ test
code ${IMPALA_HOME}/be/build/latest/service/unifiedbetests (invoked by
${IMPALA_HOME}/be/build/latest/scheduling/admission-controller-test)
and reports the following error after attaching to the test process:
BFD (GNU Binutils) 2.25.51 internal error, aborting at elf64-x86-64.c
line 5583 in elf_x86_64_get_plt_sym_val
Change-Id: I4f592f60881fd8f34e2bf393a76f5a921505010a
Reviewed-on: http://gerrit.cloudera.org:8080/16238
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
bin/bootstrap_toolchain.py failed to recognize SLES12 SP5, which broke
builds on that platform.
This patch simplifies OS version parsing and matching for SLES, omitting
the check for the OS minor version, which for SLES indicates the SP level.
This is similar to how Red Hat variants are handled. This gets
rid of the need for a constant update whenever a new SP level is released
for SLES12.
This is enabled by the native toolchain sharing a single set of artifacts
between all the SLES12 SP levels.
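The matching change amounts to keying SLES artifacts off the major
version only, similar to Red Hat; a simplified sketch with an
illustrative label format:

  def sles_toolchain_label(os_release):
      """Map a SLES release string such as '12.5' (SLES12 SP5) to a single
      toolchain label. Only the major version matters: all SLES12 service
      packs share one set of toolchain artifacts, so '12', '12.4' and '12.5'
      resolve to the same label and no update is needed for a new SP."""
      major = os_release.split(".")[0]
      return "ec2-package-sles-%s" % major

  assert sles_toolchain_label("12.5") == sles_toolchain_label("12.4")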
Test: ran a successful build on a SLES12sp5 box.
Change-Id: Id9ada210b915050fbceebb7364e130116e9244d0
Reviewed-on: http://gerrit.cloudera.org:8080/16102
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The locations for native-toolchain packages in IMPALA_TOOLCHAIN
currently do not include the compiler version. This means that
the toolchain can't distinguish between native-toolchain packages
built with gcc 4.9.2 versus gcc 7.5.0. The collisions can cause
issues when switching back and forth between branches.
This introduces the IMPALA_TOOLCHAIN_PACKAGES_HOME environment
variable, which is a location inside IMPALA_TOOLCHAIN that would
hold native-toolchain packages. Currently, it is set to the same
as IMPALA_TOOLCHAIN, so there is no difference in behavior.
This lays the groundwork to add the compiler version to this
path when switching to GCC7.
Testing:
- The only impediment to building with
IMPALA_TOOLCHAIN_PACKAGES_HOME=$IMPALA_TOOLCHAIN/test is
Impala-lzo. With a custom Impala-lzo, compilation succeeds.
Either Impala-lzo will be fixed or it will be removed.
- Core tests
Change-Id: I1ff641e503b2161baf415355452f86b6c8bfb15b
Reviewed-on: http://gerrit.cloudera.org:8080/15991
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.
Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
"ranger" requires extra parameters to be set. This changes the
default value of authorization_provider to "", which translates
internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
- authorization_policy_provider_class
- sentry_catalog_polling_frequency_s
- sentry_config
3. The authorization_factory_class may be obsolete now that
there is only one authorization policy, but this leaves it
in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
that is removed. There are still Maven dependencies coming
from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
is not removed and it is still called from testdata/bin/kill-all.sh.
Testing:
- Core job passes
Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
USE_CDP_HIVE now has no effect on the Impala environment. This also
means that bin/jenkins/build-all-flag-combinations.sh no longer
includes USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.
There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.
Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9626 broke the use case where the toolchain binaries are not
downloaded from the native-toolchain S3 bucket, because
SKIP_TOOLCHAIN_BOOTSTRAP is set to true.
Fix this use case by checking SKIP_TOOLCHAIN_BOOTSTRAP in
bin/bootstrap_environment.py:
- if true: just check if the specified version of the Python binary is
present at the expected toolchain location. If it is there, use it,
otherwise throw an exception and abort the bootstrap process.
- in any other case: proceed to download the Python binary as in
bootstrap_toolchain.py.
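A minimal sketch of this check, assuming simplified paths and names
rather than the actual script:

  import os

  def check_toolchain_python(toolchain_dir, python_version):
      """Returns the toolchain Python path, or None if it still needs to be
      downloaded. Raises if SKIP_TOOLCHAIN_BOOTSTRAP=true and it is missing."""
      skip = os.environ.get("SKIP_TOOLCHAIN_BOOTSTRAP", "false").lower() == "true"
      python_path = os.path.join(
          toolchain_dir, "python-%s" % python_version, "bin", "python")
      if os.path.exists(python_path):
          return python_path  # already provisioned, use it as-is
      if skip:
          # The caller promised a pre-populated toolchain; do not download.
          raise Exception("SKIP_TOOLCHAIN_BOOTSTRAP=true but %s is missing; "
                          "aborting the bootstrap process" % python_path)
      return None  # the caller proceeds to download it as usual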
Test:
- simulate the custom toolchain setup by downloading the toolchain
binaries from the S3 bucket, copying them to a separate directory,
symlinking them into Impala/toolchain, then executing buildall.sh
with SKIP_TOOLCHAIN_BOOTSTRAP set to "true".
Change-Id: Ic51b3c327b3cebc08edff90de931d07e35e0c319
Reviewed-on: http://gerrit.cloudera.org:8080/15759
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Historically Impala used the Python2 version that was available on
the hosting platform, as long as that version was at least v2.6.
This caused constant headache as all Python syntax had to be kept
compatible with Python 2.6 (for Centos 6). It also caused a recent problem
on Centos 8: here the system Python version was compiled with the
system's GCC version (v8.3), which was much more recent than the Impala
standard compiler version (GCC 4.9.2). When the Impala virtualenv was
built, the system Python version supplied C compiler switches for modules
containing native code that were unknown to the Impala version of GCC,
thus breaking virtualenv installation.
This patch changes the Impala virtualenv to always use the Python2
version from the toolchain, which is built with the toolchain compiler.
This ensures that
- Impala always has a known Python 2.7 version for all its scripts,
- virtualenv modules based on native code will always be installable, as
the Python environment and the modules are built with the same compiler
version.
Additional changes:
- Add an auto-use fixture to conftest.py to check that the tests are
being run with Python 2.7.x
- Make bootstrap_toolchain.py independent from the Impala virtualenv:
remove the dependency on the "sh" library
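The auto-use fixture mentioned above could look roughly like this
(a sketch, not the exact conftest.py code):

  import sys
  import pytest

  @pytest.fixture(autouse=True, scope="session")
  def check_python_version():
      # The test framework now assumes the toolchain Python; fail fast if the
      # tests were launched with some other interpreter version.
      assert sys.version_info[:2] == (2, 7), \
          "Tests must be run with Python 2.7.x, got %s" % sys.version.split()[0]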
Tests:
- Passed core-mode tests on CentOS 7.4
- Passed core-mode tests in Docker-based mode for centos:7
and ubuntu:16.04
Most of the content in this patch was developed earlier by Tim
Armstrong but not published.
Change-Id: Ic7b40cef89cfb3b467b61b2d54a94e708642882b
Reviewed-on: http://gerrit.cloudera.org:8080/15624
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
CentOS 8.1 is a new major version of the CentOS family.
It is now stable and popular enough to start supporting it for Impala
development.
Prepare a raw CentOS 8.1 system to support Impala development and testing.
This should work on a standalone computer, on a virtual machine,
or inside a Docker container.
Details:
- snappy-devel moved to the PowerTools repo, so it needs to be installed
from there
- CentOS 8 has no default Python version. The bootstrap script installs
(or configures) Python2 with pip2, then makes them the default via the
"alternatives" mechanism. The installer is adaptive, it performs only
the necessary steps, so it works in various environments.
The installer logic is also shared between bin/bootstrap_system.sh and
docker/entrypoint.sh
- The toolchain package tag "ec2-centos-8" is added to
bootstrap_toolchain.py
- For some unknown reason, when the downloaded Maven tarball is extracted
in a Docker-based test, the "bin" and "boot" directories are created
with owner-only permissions. The 'impdev' user has no access to the
Maven executable, which then breaks the build.
This patch forcibly restores the correct permissions on these
directories; this is a no-op when the extraction happens correctly.
- TOOLCHAIN_ID is bumped to a build that already has CentOS 8 binaries.
- Centos8-specific bootstrap code was added to the Docker-based tests.
Tested:
- ran the Docker-based tests with --base-image=centos:8 to verify the following build
phases are successful:
* system prep
* build
* dataload
and that tests can start. Passing all tests was not a requirement for this
step, although plausible test results (i.e. not all of the tests failing) were.
- ran the Docker-based tests to verify nonregression with --base-image set to the
following: centos:7, ubuntu:16.04, ubuntu:18.04.
On centos:7 and ubuntu:16.04 the only failure was IMPALA-9097 (BE tests fail without
the minicluster running); ubuntu:18.04 showed the same failures as the current upstream
code.
- passed a core-mode test run on private infrastructure on Centos 7.4
- ran buildall.sh in core mode manually inside a Docker container, simulating a developer
workflow (prep-build-dataload-test). There were several observed test failures, but
the workflow itself was run to completion with no problems.
Change-Id: I3df5d48eca7a10219264e3604a4f05f072188e6e
Reviewed-on: http://gerrit.cloudera.org:8080/15623
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change the preferred way of getting Kudu was to pull
it in from the specified CDH build (even if USE_CDP_HIVE was set
to true). Optionally by setting USE_CDH_KUDU to false, one could
force Impala to use the native toolchain Kudu. But even then, the
Kudu Java artifacts would be downloaded from CDH.
Since Kudu VARCHAR support won't be backported to CDH, this
behavior blocks the Impala side of the Kudu/Impala VARCHAR
integration.
With this change:
1. Using the native toolchain Kudu (including the Java artifacts)
is the default behavior. From now on USE_CDH_KUDU will be set
to false by default. Impala can be forced to fall back on
using the CDH Kudu by explicitly setting USE_CDH_KUDU to true.
2. Kudu version is updated to include the VARCHAR support.
Testing:
Ran exhaustive tests with USE_CDH_KUDU=true and
USE_CDH_KUDU=false.
Change-Id: Iafe56342d43cb63e35c0bbb1b4a99327dda0a44a
Reviewed-on: http://gerrit.cloudera.org:8080/15134
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The kerberized minicluster is enabled by setting
IMPALA_KERBERIZE=true in impala-config-*.sh.
After setting it you must run ./bin/create-test-configuration.sh and
then restart the minicluster.
This adds a script to partially automate setup of a local KDC,
in lieu of the unmaintained minikdc support (which has been ripped
out).
Testing:
I was able to run some queries against pre-created HDFS tables
with kerberos enabled.
Change-Id: Ib34101d132e9c9d59da14537edf7d096f25e9bee
Reviewed-on: http://gerrit.cloudera.org:8080/15159
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Bump our ORC version to include fixes for ORC-414, ORC-580, ORC-581,
ORC-586, ORC-589, ORC-590, and ORC-591. The new ORC version also
unblocks IMPALA-9226 which requires EncodedStringVectorBatch introduced
in ORC-1.6.
Due to other changes in native-toolchain, this patch also bumps versions
of LLVM and crcutil.
Tests:
- Run scanners test for orc/def/block.
Change-Id: I7eec92238b12179502d6a9001ee2ba24bfa96b77
Reviewed-on: http://gerrit.cloudera.org:8080/15089
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
bin/bootstrap_toolchain.py has accumulated complexity over time.
CDH, CDP, and the native toolchain all use different download
machinery and naming. One feature that is needed on the CDP side
is the ability to specify the download URL in an IMPALA_*_URL
environment variable.
This adds that support and refactors CDH and native toolchain
downloads to use the new system. This is essentially a rewrite
of bin/bootstrap_toolchain.py.
Currently, there are multiple phases of downloads, each with their
own download functions and peculiarities to account for package
names and destinations for downloads. This changes the logic
so that a package will generate a DownloadUnpackTarball that is
completely resolved. It contains everything about what to download
and where to put it as well as a needs_download() function and a
download() function. Once there is a list of DownloadUnpackTarball
objects, they can all be downloaded and unpacked in a single phase.
This implements different types of packages as subclasses of
DownloadUnpackTarball. Since most subclasses want to be able to
construct URLs and archive names using templates, the
TemplatedDownloadUnpackTarball takes the same arguments as
DownloadUnpackTarball along with a map of template substitutions,
which are applied to all string arguments.
Kudu requires special handling and gets its own set of subclasses
to handle various subtleties like toolchain vs CDH Kudu, the Kudu
stub, and making sure that the "kudu" package and the "kudu-java"
package don't confuse each other.
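In rough outline the structure looks like the sketch below; the
constructor arguments, template syntax, and the simplified download()
body are illustrative, not the exact bootstrap_toolchain.py code:

  import os
  import tarfile
  import tempfile
  import urllib.request

  class DownloadUnpackTarball(object):
      """A fully resolved download: what to fetch and where to unpack it."""

      def __init__(self, url, archive_name, destination_basedir, directory_name):
          self.url = url
          self.archive_name = archive_name
          self.destination_basedir = destination_basedir
          self.unpack_dir = os.path.join(destination_basedir, directory_name)

      def needs_download(self):
          # A package is skipped if its unpack directory already exists.
          return not os.path.isdir(self.unpack_dir)

      def download(self):
          # Simplified: the real code wraps wget_and_unpack_package with
          # retries, temp-dir staging and error handling.
          local_archive = os.path.join(tempfile.mkdtemp(), self.archive_name)
          urllib.request.urlretrieve(self.url, local_archive)
          with tarfile.open(local_archive) as tar:
              tar.extractall(self.destination_basedir)

  class TemplatedDownloadUnpackTarball(DownloadUnpackTarball):
      """Same as the base class, but string arguments may contain templates
      (e.g. '{version}') that are substituted from a map."""

      def __init__(self, url, archive_name, destination_basedir, directory_name,
                   template_subs):
          sub = lambda s: s.format(**template_subs)
          super(TemplatedDownloadUnpackTarball, self).__init__(
              sub(url), sub(archive_name), sub(destination_basedir),
              sub(directory_name))

  def bootstrap(packages):
      # Once every package is expressed as one of these objects, the whole
      # bootstrap collapses into a single download-and-unpack phase.
      for p in packages:
          if p.needs_download():
              p.download()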
As part of this change, USE_CDP_HIVE=true now uses the CDP version
of HBase rather than always using the CDH version.
Change-Id: I67824fd82b820e68e9f5c87939ec94ca6abadb8c
Reviewed-on: http://gerrit.cloudera.org:8080/13432
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-8503 added downloading kudu-hive.jar and adding it to
HADOOP_CLASSPATH in run-hive-server.sh to allow the Hive Metastore to
start with Kudu's HMS plugin.
There are two problems with this that are fixed by this patch:
- Previously, we fully specified the expected jar filename based on the
value of IMPALA_KUDU_JAVA_VERSION when adding it to HADOOP_CLASSPATH,
but this is overly restrictive for users who may wish to override
this value in impala-config-branch.sh to build their own branch with
a different version of the kudu-hive.jar. This patch relaxes this
restriction by adding any jar in IMPALA_KUDU_JAVA_HOME whose name
contains the string kudu-hive to HADOOP_CLASSPATH.
- In bootstrap_toolchain, we don't download a package if its directory
already exists. Since the 'kudu' and 'kudu-java' packages download
to the same directory, this led to a race condition where
'kudu-java' might not be downloaded if 'kudu' had already been
unpacked when it started. This patch fixes this by inspecting the
contents of the Kudu package directory to look for specific files
expected for each Kudu package.
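A simplified sketch of that directory-content inspection; the marker
file patterns are illustrative, not the exact ones used:

  import glob
  import os

  # Marker files that tell the two Kudu packages apart even though they
  # unpack into the same directory.
  KUDU_PACKAGE_MARKERS = {
      "kudu": ["debug/lib/libkudu_client*", "release/lib/libkudu_client*"],
      "kudu-java": ["kudu-hive*.jar"],
  }

  def kudu_package_needs_download(kudu_dir, package):
      """True unless at least one file expected for this Kudu package is
      present, even when the shared directory itself already exists."""
      patterns = KUDU_PACKAGE_MARKERS[package]
      return not any(glob.glob(os.path.join(kudu_dir, p)) for p in patterns)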
Change-Id: I4ac79c3e9b8625ba54145dba23c69fd5117f35c7
Reviewed-on: http://gerrit.cloudera.org:8080/13542
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
Reviewed-by: Hao Hao <hao.hao@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Makefile was updated to include zstd in the ${IMPALA_HOME}/toolchain
directory. Other changes were made to make zstd headers and libs
accessible.
Class ZstandardCompressor/ZstandardDecompressor was added to provide
interfaces for calling ZSTD_compress/ZSTD_decompress functions. Zstd
supports different compression levels (clevel) from 1 to
ZSTD_maxCLevel(). Zstd also supports negative clevels, but since the
negative values represent uncompressed data they won't be supported. The default
clevel is ZSTD_CLEVEL_DEFAULT.
HdfsParquetTableWriter was updated to support the ZSTD codec. The
new codec can be set using the existing query option as follows:
set COMPRESSION_CODEC=ZSTD:<clevel>;
set COMPRESSION_CODEC=ZSTD; // uses ZSTD_CLEVEL_DEFAULT
Testing:
- Added a unit test in the DecompressorTest class with the
ZSTD_CLEVEL_DEFAULT clevel and a random clevel. The test decompresses
compressed input data and validates the result. It also tests for
expected behavior when passing an over/under-sized buffer for
decompressing.
- Added unit tests for valid/invalid values for COMPRESSION_CODEC.
- Added an e2e test in test_insert_parquet.py which tests writing and
reading (null/non-null) data into/from a table (with different data type
columns) using multiple codecs. Other existing e2e tests were
updated to also use the parquet/zstd table format.
- Manual interoperability tests were run between Impala and Hive.
Change-Id: Id2c0e26e6f7fb2dc4024309d733983ba5197beb7
Reviewed-on: http://gerrit.cloudera.org:8080/13507
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch allows starting the Hive Metastore with the Kudu plugin, which
is required for enabling Kudu's integration with the HMS. The Kudu plugin
is downloaded and extracted from the native-toolchain S3 bucket.
Change-Id: I4bd1488ced51840ec986d29ed371e26168abcc76
Reviewed-on: http://gerrit.cloudera.org:8080/13319
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
This switches away from Tez local mode to tez-on-YARN. After spending a
couple of days trying to debug issues with Tez local mode, it seemed
like it was just going to be too much of a lift.
This patch switches on the starting of a Yarn RM and NM when
USE_CDP_HIVE is enabled. It also switches to a new yarn-site.xml with a
minimized set of configurations, generated by the new python templating.
In order for everything to work properly I also had to update the Hadoop
dependency to come from CDP instead of CDH when using CDP Hive.
Otherwise, the classpath of the launched Tez containers had conflicting
versions of various Hadoop classes which caused tasks to fail.
I verified that this fixes concurrent query execution by running queries
in parallel in two beeline sessions. With local mode, these queries
would periodically fail due to various races (HIVE-21682). I'm also able
to get farther along in data loading.
Change-Id: If96064f271582b2790a3cfb3d135f3834d46c41d
Reviewed-on: http://gerrit.cloudera.org:8080/13224
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
Since CDP_BUILD_NUMBER was bumped to 1056671, the name of the Hive
source tarball changed. Not only did the tarball name change, the
directory it gets extracted to is also different from the tarball name
itself. Due to this, bootstrap_toolchain.py fails to detect that the
downloaded Hive source component already exists and downloads it again
unnecessarily.
This patch improves bootstrap_toolchain.py to handle non-standard
tarfiles which extract to a different directory name compared to the
tar file.
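Conceptually, each package description now carries an explicit unpack
directory when it differs from the tarball-derived name; a simplified
sketch with illustrative class and argument names:

  import os

  class CdpComponent(object):
      """A CDP tarball whose unpack directory may differ from the tarball name."""

      def __init__(self, basedir, tarball_name, unpack_directory=None):
          self.tarball_name = tarball_name
          # By default a tarball "foo.tar.gz" is assumed to unpack to "foo";
          # non-standard tarballs pass an explicit directory instead.
          default_dir = tarball_name.replace(".tar.gz", "")
          self.unpack_dir = os.path.join(basedir, unpack_directory or default_dir)

      def already_downloaded(self):
          # Checking the real unpack directory (not the tarball-derived name)
          # is what prevents the unnecessary re-download.
          return os.path.isdir(self.unpack_dir)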
Testing done:
1. Removed the local toolchain and ran the script a couple of times to
make sure that it downloads the Hive tarball only once.
Change-Id: Ifd04a1a367a0cc4aa0a2b490a45fbc93a862c83a
Reviewed-on: http://gerrit.cloudera.org:8080/13219
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change bumps the CDP_BUILD_NUMBER to 1056671 which includes all the
Hive and Tez patches required for building against Hive 3. With this
change we get rid of the custom builds for Hive and Tez introduced in
IMPALA-8369 and switch to more official sources of builds for the
minicluster.
Notes:
1. The tarball names and the directory to which they extract changed
from the previous CDP_BUILD_NUMBER. Due to this we need to change
bootstrap_toolchain.py and impala-config.sh so that the Hive environment
variables are set correctly.
Testing Done:
1. Built against Hive-3 and Hive-2 using the flag USE_CDP_HIVE
2. Did basic testing from Impala and Beeline for testing the Tez
patch
3. Currently running the full suite of tests to make sure there are no
regressions
Change-Id: Ic758a15b33e89b6804c12356aac8e3f230e07ae0
Reviewed-on: http://gerrit.cloudera.org:8080/13213
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a compatibility shim in fe so that Impala can
interoperate with Hive 3.1.0. It moves the existing Metastoreshim class
to a compat-hive-2 directory and adds a new Metastoreshim class under
a compat-hive-3 directory. These shim classes implement methods which are
different in Hive 2 vs. Hive 3 and are used by frontend code. At
build time, based on the environment variable
IMPALA_HIVE_MAJOR_VERSION, one of the two shims is added as a source
using the fe/pom.xml build plugin.
Additionally, in order to reduce the dependency footprint of Hive in
the frontend code, this patch also introduces a new module called
shaded-deps. This module uses the shade plugin to include only the source
files from hive-exec which are needed by the fe code. For the hive-2 build
path, no changes are made with respect to hive dependencies to minimize
the risk of destabilizing the master branch on the default build option
of using Hive-2.
The different set of dependencies are activated using maven profiles.
The activation of each profile is automatic based on the
IMPALA_HIVE_MAJOR_VERSION.
Testing:
1. Code compiles and runs against both HMS-3 and HMS-2
2. Ran full-suite of tests using the private jenkins job against HMS-2
3. Running the full tests against HMS-3 will need more work, like supporting
Tez in the mini-cluster (for dataloading) and HMS transaction support,
since HMS-3 creates transactional tables by default. This will be an
ongoing effort and test failures on Hive-3 will be fixed in additional
sub-tasks.
Notes:
1. Patch uses a custom build of Hive to be deployed in mini-cluster. This
build has the fixes for HIVE-21596. This hack will be removed when the
patches are available in official CDP Hive builds.
2. Some of the existing tests rely on the fact that the UDFs implement the
UDF interface in Hive (UDFLength, UDFHour, UDFYear). These built-in Hive
functions have been moved to the GenericUDF interface in Hive 3. Impala
currently only supports UDFExecutor. In order to have full
compatibility with all the functions in Hive 2.x we should support
GenericUDFs too. That would be taken up as a separate patch.
3. Sentry dependencies bring a lot of transitive hive dependencies. The
patch excludes such dependencies since they create problems while
building against Hive-3. Since these hive-2 dependencies are
already included when building against hive-2 this should not be a problem.
Change-Id: I45a4dadbdfe30a02f722dbd917a49bc182fc6436
Reviewed-on: http://gerrit.cloudera.org:8080/13005
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 no longer supports MR execution, so this sets up the appropriate
configuration and classpath so that HS2 can run queries using Tez.
The bulk of this patch is toolchain changes to download Tez itself. The
Tez tarball is slightly odd in that it has no top-level directory, so
the patch changes around bootstrap_toolchain a bit to support creating
its own top-level directory for a component.
The remainder of the patch is some classpath setup and hive-site changes
when Hive 3 is enabled.
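The top-level-directory handling can be sketched with Python's tarfile
module; this is a simplified illustration, not the bootstrap_toolchain.py
code:

  import os
  import tarfile

  def unpack_with_top_level_dir(tarball_path, destination_basedir, component_dir):
      """Unpack a tarball that has no top-level directory (like the Tez one)
      into a directory created for it, so it ends up looking like every
      other component."""
      target = os.path.join(destination_basedir, component_dir)
      os.makedirs(target, exist_ok=True)
      with tarfile.open(tarball_path) as tar:
          # Everything in the archive sits at the root, so extracting
          # directly into the new component directory gives it a top level.
          tar.extractall(target)
      return target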
So far I tested this manually by setting up a metastore and
impala-config with USE_CDP_HIVE=true, and then connecting to HS2 using
hive beeline -u 'jdbc:hive2://localhost:11050'
I was able to insert and query data, and was able to verify that queries
like 'select count(*)' were executing via Tez local mode.
NOTE: this patch relies on a custom build of Tez, based on a private
branch. I've submitted a PR to Tez upstream, referenced in the commits
here. Will remove this hack once the PR is accepted and makes its way
into an official build.
Change-Id: I76e47fbd1d6ff5103d81a8de430d5465dba284cd
Reviewed-on: http://gerrit.cloudera.org:8080/12931
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch bumps the CDP_BUILD_NUMBER to 1013201. This patch also
refactors the bootstrap_toolchain.py to be more generic for dealing with
CDP components, e.g. Ranger and Hive 3.
The patch also fixes some TODOs to replace the rangerPlugin.init() hack
with rangerPlugin.refreshPoliciesAndTags() API available in this Ranger
build.
Testing:
- Ran core tests
- Manually verified that no regression when starting Hive 3 with
USE_CDP_HIVE=true
Change-Id: I18c7274085be4f87ecdaf0cd29a601715f594ada
Reviewed-on: http://gerrit.cloudera.org:8080/13002
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As a first step to integrate Impala with Hive 3.1.0 this patch modifies
the minicluster scripts to optionally use Hive 3.1.0 instead of
CDH Hive 2.1.1.
In order to make sure that existing setups don't break, this is
enabled via an environment variable override in bin/impala-config.sh.
When the environment variable USE_CDP_HIVE is set to true, the
bootstrap_toolchain script downloads the Hive 3.1.0 tarballs and extracts
them into the toolchain directory. These binaries are used to start the
Hive services (Hiveserver2 and metastore). The default is still CDH
Hive 2.1.1.
Also, since Hive 3.1.0 uses an upgraded metastore schema, this patch
makes use of a different database name so that it is easy to switch
from one environment which uses the Hive 2.1.1 metastore to another
which uses the Hive 3.1.0 metastore.
In order to start a minicluster which uses Hive 3.1.0 users should
follow the steps below:
1. Make sure that the minicluster, if running, is stopped
before you run the following commands.
2. Open a new terminal and run the following commands.
> export USE_CDP_HIVE=true
> source bin/impala-config.sh
> bin/bootstrap_toolchain.py
The above command downloads the Hive 3.1.0 tarballs and extracts them
in toolchain/cdp_components-${CDP_BUILD_NUMBER} directory. This is a
no-op if the CDP_BUILD_NUMBER has not changed and if the cdp_components
are already downloaded by a previous invocation of the script.
> source bin/create-test-configuration.sh -create-metastore
The above step should provide "-create-metastore" only the first time
so that a new metastore db is created and the Hive 3.1.0 schema is
initialized. For all subsequent invocations, the "-create-metastore"
argument can be skipped. We should still source this script since the
hive-site.xml of Hive 3.1.0 is different from that of Hive 2.1.1 and
needs to be regenerated.
> testdata/bin/run-all.sh
Note that the testing was performed locally by downloading the Hive 3.1
binaries into
toolchain/cdp_components-976603/apache-hive-3.1.0.6.0.99.0-9-bin. Once
the binaries are available in the S3 bucket, the bootstrap_toolchain script
should automatically do this for you.
Testing Done:
1. Made sure that the cluster comes up with Hive 3.1 when the steps
above are performed.
2. Made sure that existing scripts work as they do currently when
USE_CDP_HIVE is not set.
3. The Impala cluster comes up and connects to HMS 3.1.0 (note that Impala
still uses the Hive 2.1.1 client. Upgrading the client libraries in Impala
will be done as a separate change).
Change-Id: Icfed856c1f5429ed45fd3d9cb08a5d1bb96a9605
Reviewed-on: http://gerrit.cloudera.org:8080/12846
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch updates the build scripts to support Apache Ranger:
- Download Apache Ranger
- Setup Apache Ranger database
- Create Apache Ranger configuration files
- Start/stop Apache Ranger
Testing:
- Ran ./buildall.sh -format on a clean repository and was able to start
Ranger without any problem.
- Ran test-with-docker
Change-Id: I249cd64d74518946829e8588ed33d5ac454ffa7b
Reviewed-on: http://gerrit.cloudera.org:8080/12469
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Upgrades the version of the toolchain in order to pull in Thrift 0.11.0.
Updates the CMake build to write generated Python code using Thrift 0.11
to shell/build/thrift-11-gen/gen-py/.
The Thrift 0.11 Python deserialization code has some big performance
improvements that allow faster parsing of runtime profiles. By adding
the ability to generate the Thrift Python code using Thrift 0.11 we can
take advantage of the Python performance improvements without going
through a full Thrift upgrade from 0.9 to 0.11.
Set USE_THRIFT11_GEN_PY=true and then run bin/set-pythonpath.sh to add
the Thrift 0.11 Python generated code to the PYTHONPATH rather than the
0.9 generated code.
Testing:
- Ran core tests
Change-Id: I3432c3e29d28ec3ef6a0a22156a18910f511fed0
Reviewed-on: http://gerrit.cloudera.org:8080/12036
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows the IMPALA_KUDU_VERSION and IMPALA_KUDU_URL environment
variables to be overridden by impala-config-branch.sh.
Also adds a feature to bootstrap_toolchain.py that optionally
substitutes the CDH platform label into override values for
IMPALA_(CDH_COMPONENT)_URL, which makes it easier to override the
value of IMPALA_KUDU_URL.
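A sketch of that substitution; the placeholder token and helper below
are illustrative assumptions, not the exact bootstrap_toolchain.py
behavior:

  import os

  PLACEHOLDER = "%(platform_label)"  # illustrative token, not necessarily the real one

  def resolve_component_url(component, default_url, platform_label):
      """Return the download URL for a component, honoring an
      IMPALA_<COMPONENT>_URL override and substituting the CDH platform
      label (e.g. 'ubuntu1604') into it."""
      override = os.environ.get("IMPALA_%s_URL" % component.upper())
      if not override:
          return default_url
      return override.replace(PLACEHOLDER, platform_label)

  # Hypothetical usage:
  #   export IMPALA_KUDU_URL=https://host/kudu/kudu-%(platform_label).tar.gz
  #   resolve_component_url("kudu", default_url, "ubuntu1604")
  #     -> https://host/kudu/kudu-ubuntu1604.tar.gz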
Testing:
- Went through various combinations of a clean shell or overriding
these variables, then building and running the minicluster.
Change-Id: I36414b8772d615809463127a989e843b9d15d4a3
Reviewed-on: http://gerrit.cloudera.org:8080/11499
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch transitions from pulling in Kudu (libkudu_client.so and the
minicluster tarballs) from the toolchain to instead pulling Kudu in with
the other CDH components.
For OSes where the CDH binaries are not provided but the toolchain
binaries are (only Ubuntu 14), we set USE_CDH_KUDU to false to
continue to download the toolchain binaries. We also continue
to use the toolchain binaries to build the client stub for OSes
where KUDU_IS_SUPPORTED is false.
This patch also fixes an issue in bootstrap_toolchain.py where we were
using the wrong g++ to compile the Kudu stub.
Testing:
- Verified building and running Impala works as expected for supported
combinations of KUDU_IS_SUPPORTED/USE_CDH_KUDU
Change-Id: If6e1048438b6d09a1b38c58371d6212bb6dcc06c
Reviewed-on: http://gerrit.cloudera.org:8080/11363
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the toolchain bootstrap code with the toolchain
version of GDB (v7.9.1, built in the toolchain since its inception),
and adds it to the path. The goal is to provide a stable gdb version
for core dump analysis.
Change-Id: If4e094db93da4f5dab1e1b2da7f88a1dd06bc9e6
Reviewed-on: http://gerrit.cloudera.org:8080/11215
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Clang's run-clang-tidy.py script produces a lot of
output even when there are no warnings or errors.
None of this output is useful.
This patch has two parts:
1. Bump LLVM to 5.0.1-p1, which has patched run-clang-tidy.py
to make it reduce its own output when passed -quiet
(along with other enhancements).
2. Pass -quiet to run-clang-tidy.py and pipe the stderr output
to a temporary file. Display this output only if
run-clang-tidy.py hits an error, as this output is not
useful otherwise.
Testing with a known clang tidy issue shows that warnings
and errors are still in the output, and the output is
clean on a clean Impala checkout.
Change-Id: I63c46a7d57295eba38fac8ab49c7a15d2802df1d
Reviewed-on: http://gerrit.cloudera.org:8080/10615
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For IMPALA_MINICLUSTER_PROFILE=3 (Hadoop 3.x components), pin the
CDH dependencies by storing the CDH tarballs and Maven repository
in S3. This solves the issue of build coherency between the CDH
tarballs and Maven dependencies.
For IMPALA_MINICLUSTER_PROFILE=2 (Hadoop 2.x components), pin the
CDH dependencies by storing only the CDH tarballs in S3. The Maven
repository will still use https://repository.cloudera.com, so there
is still a possibility of a build coherency issue.
For each CDH dependency, the repository URL contains the build number
of the build that created those CDH dependencies.
This information can be useful for debugging issues related to CDH
dependencies.
This patch introduces CDH_DOWNLOAD_HOST and CDH_BUILD_NUMBER environment
variables that can be overridden, which can be useful for running an
integration job.
This patch also fixes dependency issues in Hadoop that transitively
depend on snapshot versions of dependencies that no longer exist, i.e.
- net.minidev:json-smart:2.3-SNAPSHOT (HADOOP-14903)
- org.glassfish:javax.el:3.0.1-b06-SNAPSHOT
The fix is to force the dependencies by using the released versions of
those dependencies.
Testing:
- Ran all core tests on IMPALA_MINICLUSTER_PROFILE=2 and
IMPALA_MINICLUSTER_PROFILE=3
Cherry-picks: not for 2.x
Change-Id: I66c0dcb8abdd0d187490a761f129cda3b3500990
Reviewed-on: http://gerrit.cloudera.org:8080/10748
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.
Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
changes.
- Time-zone database is not updated on a regular basis.
Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
performance degradation.
In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.
This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
specify an HDFS/S3/ADLS path to a zip archive that contains the
shared compiled IANA time-zone database. If the startup flag is set,
impalad will use the specified time-zone database. Otherwise,
impalad will use the default /usr/share/zoneinfo time-zone database.
- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
specify an HDFS/S3/ADLS path to a shared config file that contains
definitions for non-standard time-zone aliases.
- impalad reads the entire time-zone database into an in-memory
map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
query context when preparing query execution. This time-zone is used
whenever the current time-zone is referred to afterwards in an
execution node.
- Adds a new ZipUtil class to extract files from a zip archive. The
implementation is not vulnerable to Zip Slip.
Cherry-picks: not for 2.x.
Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This parallelizes downloading some Python libraries, speeding up
$IMPALA_HOME/infra/python/deps/download_requirements. I've seen this
take from 7-15 seconds before and from 2-5 seconds after.
I also checked that we always have at least Python 2.6 when
building Impala, so I was able to remove the try/except
handling in bootstrap_toolchain.
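The parallelization is the standard thread-pool pattern for I/O-bound
work; a minimal sketch with an illustrative worker count:

  from multiprocessing.pool import ThreadPool
  import urllib.request

  def download_all(url_dest_pairs, num_workers=8):
      """Download several files concurrently; the work is I/O bound, so a
      thread pool is enough to overlap the transfers."""
      def fetch(pair):
          url, dest = pair
          urllib.request.urlretrieve(url, dest)
          return dest
      pool = ThreadPool(processes=num_workers)
      try:
          return pool.map(fetch, url_dest_pairs)
      finally:
          pool.close()
          pool.join()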
Change-Id: I7cbf622adb7d037f1a53c519402dcd8ae3c0fe30
Reviewed-on: http://gerrit.cloudera.org:8080/10234
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the ORC reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing into ORC
tables is not supported yet either.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust against corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>