Impala 4 moved to using CDP versions for components, which involves
adopting Hive 3. This removes the old code supporting CDH components
and Hive 2. Specifically, it does the following:
1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true.
USE_CDP_HIVE now has no effect on the Impala environment. This also
means that bin/jenkins/build-all-flag-combinations.sh no longer
includes USE_CDP_HIVE=false as a configuration.
2. Remove USE_CDH_KUDU and default to getting Kudu from the
native toolchain.
3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including
the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml.
There is a fair amount of code that still references the Hive major
version. Upstream Hive is now working on Hive 4, so there is a high
likelihood that we'll need some code to deal with that transition.
This leaves some code (such as maven profiles) and test logic in
place.
Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608
Reviewed-on: http://gerrit.cloudera.org:8080/15869
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9626 broke the use case where the toolchain binaries are not
downloaded from the native-toolchain S3 bucket, because
SKIP_TOOLCHAIN_BOOTSTRAP is set to true.
Fix this use case by checking SKIP_TOOLCHAIN_BOOTSTRAP in
bin/bootstrap_environment.py:
- if true: just check if the specified version of the Python binary is
present at the expected toolchain location. If it is there, use it;
otherwise throw an exception and abort the bootstrap process.
- in any other case: proceed to download the Python binary as in
bootstrap_toolchain.py.
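A sketch of that branching (the function name and path layout here are illustrative assumptions, not the actual bootstrap_environment.py code):

```python
import os

def find_toolchain_python(toolchain_dir, py_version, skip_bootstrap):
    """Locate the toolchain Python, honoring SKIP_TOOLCHAIN_BOOTSTRAP."""
    # Hypothetical layout: <toolchain>/python-<version>/bin/python
    expected = os.path.join(toolchain_dir, "python-%s" % py_version,
                            "bin", "python")
    if skip_bootstrap:
        # Custom toolchain: the binary must already be in place.
        if not os.path.exists(expected):
            raise Exception(
                "SKIP_TOOLCHAIN_BOOTSTRAP=true but %s is missing" % expected)
        return expected
    # Normal case: fall through to the regular download logic.
    return download_python(toolchain_dir, py_version)

def download_python(toolchain_dir, py_version):
    # Placeholder for the bootstrap_toolchain.py download path.
    return os.path.join(toolchain_dir, "python-%s" % py_version,
                        "bin", "python")
```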
Test:
- simulate the custom toolchain setup by downloading the toolchain
binaries from the S3 bucket, copying them to a separate directory,
symlinking them into Impala/toolchain, then executing buildall.sh
with SKIP_TOOLCHAIN_BOOTSTRAP set to "true".
Change-Id: Ic51b3c327b3cebc08edff90de931d07e35e0c319
Reviewed-on: http://gerrit.cloudera.org:8080/15759
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Historically Impala used the Python2 version that was available on
the hosting platform, as long as that version was at least v2.6.
This caused constant headaches, as all Python syntax had to be kept
compatible with Python 2.6 (for CentOS 6). It also caused a recent problem
on CentOS 8: there the system Python was compiled with the
system's GCC (v8.3), which was much more recent than the Impala
standard compiler (GCC 4.9.2). When the Impala virtualenv was
built, the system Python supplied C compiler switches for modules
containing native code that were unknown to the Impala version of GCC,
thus breaking the virtualenv installation.
This patch changes the Impala virtualenv to always use the Python2
version from the toolchain, which is built with the toolchain compiler.
This ensures that
- Impala always has a known Python 2.7 version for all its scripts,
- virtualenv modules based on native code will always be installable, as
the Python environment and the modules are built with the same compiler
version.
Additional changes:
- Add an auto-use fixture to conftest.py to check that the tests are
being run with Python 2.7.x
- Make bootstrap_toolchain.py independent from the Impala virtualenv:
remove the dependency on the "sh" library
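The version guard behind that fixture could look roughly like this (a sketch; conftest.py would invoke it from an auto-use fixture):

```python
import sys

def assert_python27():
    """Guard used by an auto-use conftest.py fixture: fail fast when the
    test suite is not running under the expected Python 2.7 interpreter."""
    if sys.version_info[:2] != (2, 7):
        raise RuntimeError(
            "Tests must run under Python 2.7, found %d.%d"
            % sys.version_info[:2])
```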
Tests:
- Passed core-mode tests on CentOS 7.4
- Passed core-mode tests in Docker-based mode for centos:7
and ubuntu:16.04
Most content in this patch was developed but not published earlier
by Tim Armstrong.
Change-Id: Ic7b40cef89cfb3b467b61b2d54a94e708642882b
Reviewed-on: http://gerrit.cloudera.org:8080/15624
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
CentOS 8.1 is a new major version of the CentOS family.
It is now stable and popular enough to start supporting it for Impala
development.
Prepare a raw CentOS 8.1 system to support Impala development and testing.
This should work on a standalone computer, on a virtual machine,
or inside a Docker container.
Details:
- snappy-devel moved to the PowerTools repo, so it needs to be installed
from there
- CentOS 8 has no default Python version. The bootstrap script installs
(or configures) Python2 with pip2, then makes them the default via the
"alternatives" mechanism. The installer is adaptive, it performs only
the necessary steps, so it works in various environments.
The installer logic is also shared between bin/bootstrap_system.sh and
docker/entrypoint.sh
- The toolchain package tag "ec2-centos-8" is added to
bootstrap_toolchain.py
- For some unknown reason, when the downloaded Maven tarball is extracted
in a Docker-based test, the "bin" and "boot" directories are created
with owner-only permissions. The 'impdev' user has no access to the
Maven executable, which then breaks the build.
This patch forcibly restores the correct permissions on these
directories; this is a no-op when the extraction happens correctly.
- TOOLCHAIN_ID is bumped to a build that already has CentOS 8 binaries.
- CentOS 8-specific bootstrap code was added to the Docker-based tests.
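The adaptive Python2 setup described above can be sketched as follows (illustrative only: the real logic is shell in bin/bootstrap_system.sh and docker/entrypoint.sh, and the package/alternatives commands in the comments are assumptions):

```python
import shutil

def python2_setup_plan():
    """Return only the steps this system actually needs, so the same
    installer works on a bare host, a VM, or inside a container."""
    steps = []
    if shutil.which("python2") is None:
        steps.append("install python2")        # e.g. dnf install python2
    if shutil.which("pip2") is None:
        steps.append("install pip2")           # e.g. dnf install python2-pip
    if shutil.which("python") is None:
        # CentOS 8 ships no default /usr/bin/python; select python2 via
        # the "alternatives" mechanism.
        steps.append("alternatives --set python /usr/bin/python2")
    return steps
```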
Tested:
- ran the Docker-based tests with --base-image=centos:8 to verify the following build
phases are successful:
* system prep
* build
* dataload
and that tests can start. Passing all tests was not a requirement for this step,
although plausible test results (i.e. not all of the tests failing) were.
- ran the Docker-based tests to verify nonregression with --base-image set to the
following: centos:7, ubuntu:16.04, ubuntu:18.04.
On centos:7 and ubuntu:16.04 the only failure was IMPALA-9097 (BE tests fail without
the minicluster running); ubuntu:18.04 showed the same failures as the current upstream
code.
- passed a core-mode test run on private infrastructure on CentOS 7.4
- ran buildall.sh in core mode manually inside a Docker container, simulating a developer
workflow (prep-build-dataload-test). There were several observed test failures, but
the workflow itself was run to completion with no problems.
Change-Id: I3df5d48eca7a10219264e3604a4f05f072188e6e
Reviewed-on: http://gerrit.cloudera.org:8080/15623
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change the preferred way of getting Kudu was to pull
it in from the specified CDH build (even if USE_CDP_HIVE was set
to true). Optionally by setting USE_CDH_KUDU to false, one could
force Impala to use the native toolchain Kudu. But even then, the
Kudu Java artifacts would be downloaded from CDH.
Since Kudu VARCHAR support won't be backported to CDH, this
behavior blocks the Impala side of the Kudu/Impala VARCHAR
integration.
With this change:
1. Using the native toolchain Kudu (including the Java artifacts)
is the default behavior. From now on USE_CDH_KUDU will be set
to false by default. Impala can be forced to fall back on
using the CDH Kudu by explicitly setting USE_CDH_KUDU to true.
2. Kudu version is updated to include the VARCHAR support.
Testing:
Ran exhaustive tests with USE_CDH_KUDU=true and
USE_CDH_KUDU=false.
Change-Id: Iafe56342d43cb63e35c0bbb1b4a99327dda0a44a
Reviewed-on: http://gerrit.cloudera.org:8080/15134
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The kerberized minicluster is enabled by setting
IMPALA_KERBERIZE=true in impala-config-*.sh.
After setting it you must run ./bin/create-test-configuration.sh
and then restart the minicluster.
This adds a script to partially automate setup of a local KDC,
in lieu of the unmaintained minikdc support (which has been ripped
out).
Testing:
I was able to run some queries against pre-created HDFS tables
with kerberos enabled.
Change-Id: Ib34101d132e9c9d59da14537edf7d096f25e9bee
Reviewed-on: http://gerrit.cloudera.org:8080/15159
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Bump our ORC version to include fixes for ORC-414, ORC-580, ORC-581,
ORC-586, ORC-589, ORC-590, and ORC-591. The new ORC version also
unblocks IMPALA-9226 which requires EncodedStringVectorBatch introduced
in ORC-1.6.
Due to other changes in native-toolchain, this patch also bumps versions
of LLVM and crcutil.
Tests:
- Ran scanner tests for orc/def/block.
Change-Id: I7eec92238b12179502d6a9001ee2ba24bfa96b77
Reviewed-on: http://gerrit.cloudera.org:8080/15089
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
bin/bootstrap_toolchain.py has accumulated complexity over time.
CDH, CDP, and the native toolchain all use different download
machinery and naming. One feature that is needed on the CDP side
is the ability to specify the download URL in an IMPALA_*_URL
environment variable.
This adds that support and refactors CDH and native toolchain
downloads to use the new system. This is essentially a rewrite
of bin/bootstrap_toolchain.py.
Currently, there are multiple phases of downloads, each with their
own download functions and peculiarities to account for package
names and destinations for downloads. This changes the logic
so that a package will generate a DownloadUnpackTarball that is
completely resolved. It contains everything about what to download
and where to put it as well as a needs_download() function and a
download() function. Once there is a list of DownloadUnpackTarball
objects, they can all be downloaded and unpacked in a single phase.
This implements different types of packages as subclasses of
DownloadUnpackTarball. Since most subclasses want to be able to
construct URLs and archive names using templates, the
TemplatedDownloadUnpackTarball takes the same arguments as
DownloadUnpackTarball along with a map of template substitutions,
which are applied to all string arguments.
Kudu requires special handling and gets its own set of subclasses
to handle various subtleties like toolchain vs CDH Kudu, the Kudu
stub, and making sure that the "kudu" package and the "kudu-java"
package don't confuse each other.
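The core idea can be sketched like this (a simplified illustration, not the actual bootstrap_toolchain.py classes):

```python
import os
import string

class DownloadUnpackTarball(object):
    """Fully-resolved download unit: it knows its URL, archive name, and
    destination, plus how to decide whether work is needed."""
    def __init__(self, url, archive_name, destination_dir):
        self.url = url
        self.archive_name = archive_name
        self.destination_dir = destination_dir

    def needs_download(self):
        # Skip work if the unpacked destination already exists.
        return not os.path.isdir(self.destination_dir)

    def download(self):
        # Fetch self.url and unpack into self.destination_dir (omitted).
        pass

class TemplatedDownloadUnpackTarball(DownloadUnpackTarball):
    """Same arguments plus a substitution map applied to every string
    argument (the ${var} template syntax here is an assumption)."""
    def __init__(self, url, archive_name, destination_dir, subs):
        sub = lambda s: string.Template(s).substitute(subs)
        DownloadUnpackTarball.__init__(
            self, sub(url), sub(archive_name), sub(destination_dir))
```

Once every package is expressed as one of these objects, a single loop can download and unpack them all in one phase.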
As part of this change, USE_CDP_HIVE=true now uses the CDP version
of HBase rather than always using the CDH version.
Change-Id: I67824fd82b820e68e9f5c87939ec94ca6abadb8c
Reviewed-on: http://gerrit.cloudera.org:8080/13432
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-8503 added downloading kudu-hive.jar and adding it to
HADOOP_CLASSPATH in run-hive-server.sh to allow the Hive Metastore to
start with Kudu's HMS plugin.
There are two problems with this that are fixed by this patch:
- Previously, we fully specified the expected jar filename based on the
value of IMPALA_KUDU_JAVA_VERSION when adding it to HADOOP_CLASSPATH,
but this is overly restrictive for users who may wish to override
this value in impala-config-branch.sh to build their own branch with
a different version of the kudu-hive.jar. This patch relaxes this
restriction by adding any jar containing the string kudu-hive in
IMPALA_KUDU_JAVA_HOME to HADOOP_CLASSPATH.
- In bootstrap_toolchain, we don't download a package if its directory
already exists. Since the 'kudu' and 'kudu-java' packages download
to the same directory, this led to a race condition where
'kudu-java' might not be downloaded if 'kudu' had already been
unpacked when it started. This patch fixes this by inspecting the
contents of the Kudu package directory to look for specific files
expected for each Kudu package.
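A minimal sketch of the contents-based check (the marker file names are illustrative, not the exact ones the patch uses):

```python
import os

# Files that prove each Kudu package was actually unpacked, even though
# both packages unpack into the same directory.
KUDU_MARKERS = {
    "kudu": ["debug/lib/libkudu_client.so"],
    "kudu-java": ["kudu-hive.jar"],
}

def package_needs_download(pkg_dir, package):
    """Download unless every marker file for this package is present,
    instead of trusting the directory's mere existence."""
    return not all(os.path.exists(os.path.join(pkg_dir, f))
                   for f in KUDU_MARKERS[package])
```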
Change-Id: I4ac79c3e9b8625ba54145dba23c69fd5117f35c7
Reviewed-on: http://gerrit.cloudera.org:8080/13542
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
Reviewed-by: Hao Hao <hao.hao@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Makefile was updated to include zstd in the ${IMPALA_HOME}/toolchain
directory. Other changes were made to make zstd headers and libs
accessible.
Class ZstandardCompressor/ZstandardDecompressor was added to provide
interfaces for calling ZSTD_compress/ZSTD_decompress functions. Zstd
supports different compression levels (clevel) from 1 to
ZSTD_maxCLevel(). Zstd also supports negative clevels, but since negative
clevels represent uncompressed data they won't be supported. The default
clevel is ZSTD_CLEVEL_DEFAULT.
HdfsParquetTableWriter was updated to support ZSTD codec. The
new codecs can be set using existing query option as follows:
set COMPRESSION_CODEC=ZSTD:<clevel>;
set COMPRESSION_CODEC=ZSTD; // uses ZSTD_CLEVEL_DEFAULT
Testing:
- Added unit test in DecompressorTest class with ZSTD_CLEVEL_DEFAULT
clevel and a random clevel. The unit test decompresses compressed
input data and validates the result. It also tests for
expected behavior when passing an over- or undersized buffer for
decompression.
- Added unit tests for valid/invalid values for COMPRESSION_CODEC.
- Added e2e test in test_insert_parquet.py which tests writing/reading
(null/non-null) data into/from a table (with different data type
columns) using multiple codecs. Other existing e2e tests were
updated to also use the parquet/zstd table format.
- Manual interoperability tests were run between Impala and Hive.
Change-Id: Id2c0e26e6f7fb2dc4024309d733983ba5197beb7
Reviewed-on: http://gerrit.cloudera.org:8080/13507
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch allows starting the Hive Metastore with the Kudu plugin, which
is required for enabling Kudu's integration with the HMS. The Kudu plugin
is downloaded and extracted from the native-toolchain S3 bucket.
Change-Id: I4bd1488ced51840ec986d29ed371e26168abcc76
Reviewed-on: http://gerrit.cloudera.org:8080/13319
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
This switches away from Tez local mode to tez-on-YARN. After spending a
couple of days trying to debug issues with Tez local mode, it seemed
like it was just going to be too much of a lift.
This patch switches on the starting of a Yarn RM and NM when
USE_CDP_HIVE is enabled. It also switches to a new yarn-site.xml with a
minimized set of configurations, generated by the new python templating.
In order for everything to work properly I also had to update the Hadoop
dependency to come from CDP instead of CDH when using CDP Hive.
Otherwise, the classpath of the launched Tez containers had conflicting
versions of various Hadoop classes which caused tasks to fail.
I verified that this fixes concurrent query execution by running queries
in parallel in two beeline sessions. With local mode, these queries
would periodically fail due to various races (HIVE-21682). I'm also able
to get farther along in data loading.
Change-Id: If96064f271582b2790a3cfb3d135f3834d46c41d
Reviewed-on: http://gerrit.cloudera.org:8080/13224
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
Since CDP_BUILD_NUMBER was bumped to 1056671, the name of the Hive source
tarball changed. Not only did the tarball name change, the directory it
extracts to is also different from the tarball name. Due to
this, bootstrap_toolchain.py fails to check whether the downloaded
Hive source component already exists and downloads it again unnecessarily.
This patch improves bootstrap_toolchain.py to handle
non-standard tarfiles that extract to a different directory name
than the tarfile itself.
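The idea can be sketched as a package that carries an explicit unpacked-directory name when it differs from the tarball name (names here are illustrative):

```python
import os

class Package(object):
    """Sketch: a package whose tarball name differs from the directory it
    extracts to, e.g. apache-hive-X-source.tar.gz -> apache-hive-X."""
    def __init__(self, name, archive_basename, unpacked_dir=None):
        self.name = name
        self.archive = archive_basename + ".tar.gz"
        # Default: the tarball unpacks to a directory matching its own name.
        self.unpacked_dir = unpacked_dir or archive_basename

    def already_present(self, toolchain_dir):
        # Check the real unpacked directory, not the tarball-derived name,
        # so a prior download is detected and not repeated.
        return os.path.isdir(os.path.join(toolchain_dir, self.unpacked_dir))
```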
Testing done:
1. Removed the local toolchain and ran the script a couple of times to
make sure that it downloads the Hive tarball only once.
Change-Id: Ifd04a1a367a0cc4aa0a2b490a45fbc93a862c83a
Reviewed-on: http://gerrit.cloudera.org:8080/13219
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change bumps the CDP_BUILD_NUMBER to 1056671 which includes all the
Hive and Tez patches required for building against Hive 3. With this
change we get rid of the custom builds for Hive and Tez introduced in
IMPALA-8369 and switch to more official sources of builds for the
minicluster.
Notes:
1. The tarball names and the directories to which they extract changed
from the previous CDP_BUILD_NUMBER. Due to this we need to change the
bootstrap_toolchain and impala-config.sh so that the Hive environment
variables are set correctly.
Testing Done:
1. Built against Hive-3 and Hive-2 using the flag USE_CDP_HIVE
2. Did basic testing from Impala and Beeline for testing the Tez
patch
3. Currently running the full-suite of tests to make sure there are no
regressions
Change-Id: Ic758a15b33e89b6804c12356aac8e3f230e07ae0
Reviewed-on: http://gerrit.cloudera.org:8080/13213
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a compatibility shim in fe so that Impala can
interoperate with Hive 3.1.0. It moves the existing Metastoreshim class
to a compat-hive-2 directory and adds a new Metastoreshim class under
compat-hive-3 directory. These shim classes implement method which are
different in hive-2 v/s hive-3 and are used by front end code. At the
build time, based on the environment variable
IMPALA_HIVE_MAJOR_VERSION one of the two shims is added to as source
using the fe/pom.xml build plugin.
Additionally, in order to reduce the dependencies footprint of Hive in
the front end code, this patch also introduces a new module called
shaded-deps. This module uses the shade plugin to include only the source
files from hive-exec that are needed by the fe code. For the hive-2 build
path, no changes are made with respect to Hive dependencies, to minimize
the risk of destabilizing the master branch on the default build option
of using Hive-2.
The different set of dependencies are activated using maven profiles.
The activation of each profile is automatic based on the
IMPALA_HIVE_MAJOR_VERSION.
Testing:
1. Code compiles and runs against both HMS-3 and HMS-2
2. Ran full-suite of tests using the private jenkins job against HMS-2
3. Running full tests against HMS-3 will need more work, like supporting
Tez in the mini-cluster (for dataloading) and HMS transaction support,
since HMS-3 creates transactional tables by default. This will be an
ongoing effort, and test failures on Hive-3 will be fixed in additional
sub-tasks.
Notes:
1. Patch uses a custom build of Hive to be deployed in mini-cluster. This
build has the fixes for HIVE-21596. This hack will be removed when the
patches are available in official CDP Hive builds.
2. Some of the existing tests rely on the fact that UDFs implement the
UDF interface in Hive (UDFLength, UDFHour, UDFYear). These built-in Hive
functions have been moved to the GenericUDF interface in Hive 3. Impala
currently only supports UDFExecutor. In order to have full
compatibility with all the functions in Hive 2.x we should support
GenericUDFs too. That will be taken up as a separate patch.
3. Sentry dependencies bring in a lot of transitive Hive dependencies.
The patch excludes such dependencies, since they create problems while
building against Hive-3. Since these hive-2 dependencies are
already included when building against hive-2, this should not be a problem.
Change-Id: I45a4dadbdfe30a02f722dbd917a49bc182fc6436
Reviewed-on: http://gerrit.cloudera.org:8080/13005
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 no longer supports MR execution, so this sets up the appropriate
configuration and classpath so that HS2 can run queries using Tez.
The bulk of this patch is toolchain changes to download Tez itself. The
Tez tarball is slightly odd in that it has no top-level directory, so
the patch changes around bootstrap_toolchain a bit to support creating
its own top-level directory for a component.
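Creating a top-level directory for such a tarball might look roughly like this (a sketch, not the actual bootstrap_toolchain code):

```python
import os
import tarfile

def unpack_with_top_level(tarball_path, dest_root, component_name):
    """Some tarballs (like Tez) have no top-level directory, so unpack
    them into a directory created for the component instead of letting
    their contents spill into dest_root."""
    dest = os.path.join(dest_root, component_name)
    os.makedirs(dest)
    with tarfile.open(tarball_path) as tar:
        tar.extractall(dest)
    return dest
```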
The remainder of the patch is some classpath setup and hive-site changes
when Hive 3 is enabled.
So far I tested this manually by setting up a metastore and
impala-config with USE_CDP_HIVE=true, and then connecting to HS2 using
hive beeline -u 'jdbc:hive2://localhost:11050'
I was able to insert and query data, and was able to verify that queries
like 'select count(*)' were executing via Tez local mode.
NOTE: this patch relies on a custom build of Tez, based on a private
branch. I've submitted a PR to Tez upstream, referenced in the commits
here. Will remove this hack once the PR is accepted and makes its way
into an official build.
Change-Id: I76e47fbd1d6ff5103d81a8de430d5465dba284cd
Reviewed-on: http://gerrit.cloudera.org:8080/12931
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch bumps the CDP_BUILD_NUMBER to 1013201. This patch also
refactors the bootstrap_toolchain.py to be more generic for dealing with
CDP components, e.g. Ranger and Hive 3.
The patch also fixes some TODOs to replace the rangerPlugin.init() hack
with rangerPlugin.refreshPoliciesAndTags() API available in this Ranger
build.
Testing:
- Ran core tests
- Manually verified that no regression when starting Hive 3 with
USE_CDP_HIVE=true
Change-Id: I18c7274085be4f87ecdaf0cd29a601715f594ada
Reviewed-on: http://gerrit.cloudera.org:8080/13002
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As a first step to integrating Impala with Hive 3.1.0, this patch modifies
the minicluster scripts to optionally use Hive 3.1.0 instead of
CDH Hive 2.1.1.
In order to make sure that existing setups don't break, this is
enabled via an environment variable override in bin/impala-config.sh.
When the environment variable USE_CDP_HIVE is set to true, the
bootstrap_toolchain script downloads the Hive 3.1.0 tarballs and extracts
them in the toolchain directory. These binaries are used to start the Hive
services (HiveServer2 and the metastore). The default is still CDH Hive 2.1.1.
Also, since Hive 3.1.0 uses an upgraded metastore schema, this patch
makes use of a different database name so that it is easy to switch from
one environment which uses the Hive 2.1.1 metastore to another
which uses the Hive 3.1.0 metastore.
In order to start a minicluster which uses Hive 3.1.0 users should
follow the steps below:
1. Make sure that minicluster, if running, is stopped
before you run the following commands.
2. Open a new terminal and run following commands.
> export USE_CDP_HIVE=true
> source bin/impala-config.sh
> bin/bootstrap_toolchain.py
The above command downloads the Hive 3.1.0 tarballs and extracts them
in toolchain/cdp_components-${CDP_BUILD_NUMBER} directory. This is a
no-op if the CDP_BUILD_NUMBER has not changed and if the cdp_components
are already downloaded by a previous invocation of the script.
> source bin/create-test-configuration.sh -create-metastore
The "-create-metastore" argument should be provided only the first time,
so that a new metastore db is created and the Hive 3.1.0 schema is
initialized. For all subsequent invocations, the "-create-metastore"
argument can be skipped. We should still source this script, since the
hive-site.xml of Hive 3.1.0 is different from that of Hive 2.1.1 and
needs to be regenerated.
> testdata/bin/run-all.sh
Note that the testing was performed locally by downloading the Hive 3.1
binaries into
toolchain/cdp_components-976603/apache-hive-3.1.0.6.0.99.0-9-bin. Once
the binaries are available in the S3 bucket, the bootstrap_toolchain script
should automatically do this for you.
Testing Done:
1. Made sure that the cluster comes up with Hive 3.1 when the steps
above are performed.
2. Made sure that existing scripts work as they do currently when the
argument is not provided.
3. The Impala cluster comes up and connects to HMS 3.1.0 (note that
Impala still uses the Hive 2.1.1 client; upgrading the client libraries
in Impala will be done as a separate change)
Change-Id: Icfed856c1f5429ed45fd3d9cb08a5d1bb96a9605
Reviewed-on: http://gerrit.cloudera.org:8080/12846
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch updates the build scripts to support Apache Ranger:
- Download Apache Ranger
- Setup Apache Ranger database
- Create Apache Ranger configuration files
- Start/stop Apache Ranger
Testing:
- Ran ./buildall.sh -format on a clean repository and was able to start
Ranger without any problem.
- Ran test-with-docker
Change-Id: I249cd64d74518946829e8588ed33d5ac454ffa7b
Reviewed-on: http://gerrit.cloudera.org:8080/12469
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Upgrades the version of the toolchain in order to pull in Thrift 0.11.0.
Updates the CMake build to write generated Python code using Thrift 0.11
to shell/build/thrift-11-gen/gen-py/.
The Thrift 0.11 Python deserialization code has some big performance
improvements that allow faster parsing of runtime profiles. By adding
the ability to generate the Thrift Python code using Thrift 0.11 we can
take advantage of the Python performance improvements without going
through a full Thrift upgrade from 0.9 to 0.11.
Set USE_THRIFT11_GEN_PY=true and then run bin/set-pythonpath.sh to add
the Thrift 0.11 Python generated code to the PYTHONPATH rather than the
0.9 generated code.
Testing:
- Ran core tests
Change-Id: I3432c3e29d28ec3ef6a0a22156a18910f511fed0
Reviewed-on: http://gerrit.cloudera.org:8080/12036
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows the IMPALA_KUDU_VERSION and IMPALA_KUDU_URL environment
variables to be overridden by impala-config-branch.sh.
Also adds a feature to bootstrap_toolchain.py that optionally
substitutes the CDH platform label into override values for
IMPALA_(CDH_COMPONENT)_URL, which makes it easier to override the
value of IMPALA_KUDU_URL.
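The substitution feature might be sketched like this (the `%(platform_label)s` placeholder syntax is an illustrative assumption; the real script's convention may differ):

```python
def resolve_component_url(override_url, platform_label):
    """Fill the detected CDH platform label into an IMPALA_*_URL override,
    so one override value works across OS platforms."""
    return override_url % {"platform_label": platform_label}
```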
Testing:
- Went through various combinations of a clean shell or overriding
these variables, then building and running the minicluster.
Change-Id: I36414b8772d615809463127a989e843b9d15d4a3
Reviewed-on: http://gerrit.cloudera.org:8080/11499
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch transitions from pulling Kudu (libkudu_client.so and the
minicluster tarballs) from the toolchain to instead pulling Kudu in with
the other CDH components.
For OSes where the CDH binaries are not provided but the toolchain
binaries are (only Ubuntu 14), we set USE_CDH_KUDU to false to
continue to download the toolchain binaries. We also continue
to use the toolchain binaries to build the client stub for OSes
where KUDU_IS_SUPPORTED is false.
This patch also fixes an issue in bootstrap_toolchain.py where we were
using the wrong g++ to compile the Kudu stub.
Testing:
- Verified building and running Impala works as expected for supported
combinations of KUDU_IS_SUPPORTED/USE_CDH_KUDU
Change-Id: If6e1048438b6d09a1b38c58371d6212bb6dcc06c
Reviewed-on: http://gerrit.cloudera.org:8080/11363
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the toolchain bootstrap code with the toolchain
version of GDB (v7.9.1, built in the toolchain since its inception),
and adds it to the path. The goal is to provide a stable gdb version
for core dump analysis.
Change-Id: If4e094db93da4f5dab1e1b2da7f88a1dd06bc9e6
Reviewed-on: http://gerrit.cloudera.org:8080/11215
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Clang's run-clang-tidy.py script produces a lot of
output even when there are no warnings or errors.
None of this output is useful.
This patch has two parts:
1. Bump LLVM to 5.0.1-p1, which has patched run-clang-tidy.py
to make it reduce its own output when passed -quiet
(along with other enhancements).
2. Pass -quiet to run-clang-tidy.py and pipe the stderr output
to a temporary file. Display this output only if
run-clang-tidy.py hits an error, as this output is not
useful otherwise.
Testing with a known clang tidy issue shows that warnings
and errors are still in the output, and the output is
clean on a clean Impala checkout.
Change-Id: I63c46a7d57295eba38fac8ab49c7a15d2802df1d
Reviewed-on: http://gerrit.cloudera.org:8080/10615
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For IMPALA_MINICLUSTER_PROFILE=3 (Hadoop 3.x components), pin the
CDH dependencies by storing the CDH tarballs and Maven repository
in S3. This solves the issue of build coherency between the CDH
tarballs and Maven dependencies.
For IMPALA_MINICLUSTER_PROFILE=2 (Hadoop 2.x components), pin the
CDH dependencies by storing only the CDH tarballs in S3. The Maven
repository will still use https://repository.cloudera.com, so there
is still a possibility of a build coherency issue.
For each CDH dependency, there is a unique build number in each repository
URL to indicate the build number that created those CDH dependencies.
This information can be useful for debugging issues related to CDH
dependencies.
This patch introduces CDH_DOWNLOAD_HOST and CDH_BUILD_NUMBER environment
variables that can be overridden, which can be useful for running an
integration job.
This patch also fixes dependency issues in Hadoop that transitively
depend on snapshot versions of dependencies that no longer exist, i.e.
- net.minidev:json-smart:2.3-SNAPSHOT (HADOOP-14903)
- org.glassfish:javax.el:3.0.1-b06-SNAPSHOT
The fix is to force the dependencies by using the released versions of
those dependencies.
Testing:
- Ran all core tests on IMPALA_MINICLUSTER_PROFILE=2 and
IMPALA_MINICLUSTER_PROFILE=3
Cherry-picks: not for 2.x
Change-Id: I66c0dcb8abdd0d187490a761f129cda3b3500990
Reviewed-on: http://gerrit.cloudera.org:8080/10748
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.
Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
changes.
- Time-zone database is not updated on a regular basis.
Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
performance degradation.
In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.
This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
specify an HDFS/S3/ADLS path to a zip archive that contains the
shared compiled IANA time-zone database. If the startup flag is set,
impalad will use the specified time-zone database. Otherwise,
impalad will use the default /usr/share/zoneinfo time-zone database.
- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
specify an HDFS/S3/ADLS path to a shared config file that contains
definitions for non-standard time-zone aliases.
- impalad reads the entire time-zone database into an in-memory
map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
query context when preparing query execution. This time-zone is then
used whenever an execution node refers to the current time-zone.
- Adds a new ZipUtil class to extract files from a zip archive. The
implementation is not vulnerable to Zip Slip.
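The Zip Slip defense mentioned above boils down to one check: resolve each archive entry's target path and refuse it if it escapes the destination directory. Impala's ZipUtil is C++; this is only a minimal Python sketch of the same idea, with illustrative names:

```python
import os
import zipfile

def safe_extract(archive_path, dest_dir):
    """Extract a zip archive, rejecting entries that would escape dest_dir.

    Illustrative sketch of the Zip Slip check; Impala's actual ZipUtil
    is implemented in C++ and its API differs.
    """
    dest_dir = os.path.realpath(dest_dir)
    with zipfile.ZipFile(archive_path) as zf:
        for entry in zf.namelist():
            target = os.path.realpath(os.path.join(dest_dir, entry))
            # A malicious entry like "../../etc/passwd" resolves outside
            # dest_dir; refuse to extract it.
            if not (target == dest_dir or
                    target.startswith(dest_dir + os.sep)):
                raise ValueError("Zip Slip attempt: %s" % entry)
            zf.extract(entry, dest_dir)
```

The key detail is resolving the joined path with realpath before comparing, so `..` components and symlink tricks are normalized away first.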
Cherry-picks: not for 2.x.
Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This parallelizes downloading some Python libraries, speeding up
$IMPALA_HOME/infra/python/deps/download_requirements. I've seen it
take 7-15 seconds before and 2-5 seconds after.
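The parallelization is essentially a small worker pool mapped over the requirements list. A hedged sketch — `download_fn` and `packages` stand in for the real pip-download logic, not the script's actual API:

```python
from multiprocessing.pool import ThreadPool

def download_all(download_fn, packages, num_threads=4):
    """Run download_fn over packages in parallel.

    Illustrative sketch only: download_fn/packages stand in for the
    real download logic in download_requirements.
    """
    pool = ThreadPool(processes=num_threads)
    try:
        # map() re-raises if any download failed, so errors aren't silent.
        return pool.map(download_fn, packages)
    finally:
        pool.close()
        pool.join()
```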
I also checked that we always have at least Python 2.6 when
building Impala, so I was able to remove the try/except
handling in bootstrap_toolchain.
Change-Id: I7cbf622adb7d037f1a53c519402dcd8ae3c0fe30
Reviewed-on: http://gerrit.cloudera.org:8080/10234
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing to ORC
tables is not supported yet either.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust for corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
output from RHEL OS
The OS map that we currently use to check platform/OS release
against in bootstrap_toolchain.py does not contain key-value pairs
for Redhat platforms.
e.g.
lsb_release -irs
RedHatEnterpriseServer 6.9
This change adds RHEL5, RHEL6 and RHEL7 to the OS map. It also
relaxes the matching criteria for RHEL and CentOS to only major
version.
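The relaxed matching amounts to normalizing the `lsb_release -irs` output and, for RHEL/CentOS, keying the OS map on the major version only. A hedged sketch — the map entries and labels here are an illustrative subset, not bootstrap_toolchain.py's real table:

```python
# Map normalized (distro, version) pairs to toolchain package labels.
# Illustrative subset only; the real map in bootstrap_toolchain.py differs.
OS_MAPPING = {
    ("redhatenterpriseserver", "6"): "ec2-package-centos-6",
    ("redhatenterpriseserver", "7"): "ec2-package-centos-7",
    ("centos", "6"): "ec2-package-centos-6",
    ("centos", "7"): "ec2-package-centos-7",
    ("ubuntu", "16.04"): "ec2-package-ubuntu-16-04",
}

def lookup_os_label(distro, release):
    """Resolve lsb_release output, e.g. ('RedHatEnterpriseServer', '6.9'),
    to a toolchain label, or None if the platform is unsupported."""
    distro = distro.strip().lower()
    release = release.strip()
    # For RHEL/CentOS only the major version matters: 6.9 -> 6.
    if distro in ("redhatenterpriseserver", "centos"):
        release = release.split(".")[0]
    return OS_MAPPING.get((distro, release))
```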
Testing: I manually cloned a repo locally and called
bootstrap_toolchain.py to verify that it can detect the platform.
Testing was done against RHEL6, RHEL7, Ubuntu16.04 and Centos7.
Change-Id: I83874220bd424a452df49520b5dad7bfa2124ca6
Reviewed-on: http://gerrit.cloudera.org:8080/9310
Reviewed-by: Lars Volker <lv@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
If the environment variable $IMPALA_<NAME>_URL is configured in
impala-config-branch.sh or impala-config-local, for a thirdparty
dependency, use that to download it instead of the s3://native-toolchain
bucket. This makes testing against arbitrary versions of the
dependencies easier.
I did a little bit of refactoring while here, creating a small class for
a Package to handle reading the environment variables. I also changed
bootstrap_toolchain.py to use Python logging, which cleans up the output
during the multi-threaded downloading.
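The Package idea is to consult $IMPALA_<NAME>_URL first and fall back to the toolchain bucket. A hedged sketch — the default URL layout below is a guess for illustration, not the real native-toolchain bucket structure:

```python
import os

class Package(object):
    """Resolve the download URL for one toolchain dependency.

    Sketch only: the DEFAULT_BASE layout is hypothetical, not the
    actual native-toolchain bucket structure.
    """
    DEFAULT_BASE = "https://native-toolchain.s3.amazonaws.com/build"

    def __init__(self, name, version):
        self.name = name
        self.version = version

    def url(self):
        # An IMPALA_<NAME>_URL set in impala-config-branch.sh or
        # impala-config-local overrides the default location.
        override = os.environ.get("IMPALA_%s_URL" % self.name.upper())
        if override:
            return override
        return "%s/%s-%s.tar.gz" % (self.DEFAULT_BASE, self.name,
                                    self.version)
```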
I tested this both with customized URLs and by running the regular
build (pre-review-test, without most of the slow test suites).
Change-Id: I4628d86022d4bd8b762313f7056d76416a58b422
Reviewed-on: http://gerrit.cloudera.org:8080/8456
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
We've seen intermittent 500 errors when downloading the toolchain from
S3 over the HTTPS URLs. As a first stab, this commit retries 3 times,
with some jitter.
I also changed the threadpool introduced previously to have a limit
of 4 threads, because that's sufficient to get the speed improvement.
The 500 errors have been observed both before and after the threadpool
change.
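The retry-with-jitter logic can be sketched as follows. `fetch_fn` is a stand-in for shelling out to wget, and the jitter bounds are illustrative; the real script's parameters differ:

```python
import random
import time

def download_with_retries(fetch_fn, url, attempts=3, max_jitter_secs=3.0):
    """Call fetch_fn(url), retrying on any failure with random jitter.

    Sketch only: fetch_fn stands in for invoking wget; any exception
    counts as a failed attempt, matching the 'retry on any error from
    wget' behavior described above.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch_fn(url)
        except Exception:
            if attempt == attempts:
                raise
            # Random jitter so parallel downloaders that all hit an
            # intermittent 500 don't retry in lockstep.
            time.sleep(random.uniform(0, max_jitter_secs))
```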
For testing, I ran the straightforward case directly. I introduced
a broken version string to observe that retries would happen on
any error from wget.
Change-Id: I7669c7d41240aa0eb43c30d5bf2bd5c01b66180b
Reviewed-on: http://gerrit.cloudera.org:8080/8258
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
By downloading from the toolchain S3 buckets in parallel with
extracting them, this improves bootstrap_toolchain on my machine
from about 1m5s to about 30s.
$rm -rf toolchain; time bin/bootstrap_toolchain.py > /dev/null
real 0m29.226s
user 0m46.516s
sys 0m33.820s
On a large EC2 machine, closer to the S3 buckets, the new time is 21s.
Because multiprocessing hasn't always been available (python2.4 on RHEL5
won't have it), I fall back to a simpler implementation.
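The overlap of downloading and extracting can be sketched with an unordered pool map: archives are handed to the extractor in completion order, so extraction runs while the remaining downloads are still in flight. `download_fn` and `extract_fn` are illustrative stand-ins for the real wget and untar steps:

```python
from multiprocessing.pool import ThreadPool

def bootstrap(download_fn, extract_fn, packages, num_threads=4):
    """Download packages in parallel, extracting each archive as soon
    as its download finishes.

    Sketch only: download_fn/extract_fn stand in for the real fetch
    and untar logic in bootstrap_toolchain.py, which also falls back
    to a serial loop where multiprocessing is unavailable.
    """
    pool = ThreadPool(processes=num_threads)
    try:
        # imap_unordered yields results in completion order, so
        # extraction overlaps with the remaining downloads.
        for archive in pool.imap_unordered(download_fn, packages):
            extract_fn(archive)
    finally:
        pool.close()
        pool.join()
```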
Change-Id: I46a6088bb002402c7653dbc8257dff869afb26ec
Reviewed-on: http://gerrit.cloudera.org:8080/8237
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
To support KRPC on legacy platforms with versions of OpenSSL
older than 1.0.1, we may need to use libssl from the toolchain.
This change makes toolchain bootstrapping also download
OpenSSL 1.0.1p.
Testing: private packaging build.
Change-Id: I860b16d8606de1ee472db35a4d8d4e97b57b67ae
Reviewed-on: http://gerrit.cloudera.org:8080/7532
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Impala Public Jenkins
This takes care of the difference in outputs for SLES 12 SP1 and SP2.
For reference here's the outputs in sles12sp1 and sp2:
sles12sp1 # lsb_release -irs
SUSE LINUX 12.1
sles12sp2 # lsb_release -irs
SUSE 12.2
Testing:
Did a full build on SLES12 SP2. Before this patch, a build resulted in:
'Pre-built toolchain archives not available for your platform.'
After this patch:
Toolchain bootstrap complete.
..Followed by a full build.
Change-Id: I005e05b8b66de78e6d53a35a894eb34d89843a62
Reviewed-on: http://gerrit.cloudera.org:8080/7535
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
FlatBuffers version 1.6.0 is already included in the toolchain. This
commit adds it to the build system.
Change-Id: I2ca255ddf08ac846b454bfa1470ed67b1338d2b0
Reviewed-on: http://gerrit.cloudera.org:8080/6180
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
Add libev 4.20 to the Impala build. This is a dependency of KRPC.
FindLibEv.cmake was taken from Apache Kudu.
Change-Id: Iaf0646533592e6a8cd929a8cb015b83a7ea3008f
Reviewed-on: http://gerrit.cloudera.org:8080/5659
Tested-by: Impala Public Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This patch adds Protobuf 2.6.1 to Impala's build, and bumps the
toolchain version so that the dependency is available. Protobuf is
unused in this commit, but is required for KRPC.
FindProtobuf.cmake includes some utility CMake methods to generate
source code from Protobuf definitions. It is taken from Kudu.
Change-Id: Ic9357fe0f201cbf7df1ba19fe4773dfb6c10b4ef
Reviewed-on: http://gerrit.cloudera.org:8080/5657
Tested-by: Impala Public Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This change prevents us from depending on LLAMA to build.
Note that the LLAMA MiniKDC is left in - it is a test
utility that does not depend on LLAMA itself.
IMPALA-4292 tracks cleaning this up.
Testing:
Ran a private build to verify that all tests pass.
Change-Id: If2e5e21d8047097d56062ded11b0832a1d397fe0
Reviewed-on: http://gerrit.cloudera.org:8080/4739
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
Alas, poor Llama! I knew him, Impala: a system
of infinite jest, of most excellent fancy: we hath
borne him on our back a thousand times; and now, how
abhorred in my imagination it is!
Done:
* Removed QueryResourceMgr, ResourceBroker, CGroupsMgr
* Removed untested 'offline' mode and NM failure detection from
ImpalaServer
* Removed all Llama-related Thrift files
* Removed RM-related arguments to MemTracker constructors
* Deprecated all RM-related flags, printing a warning if enable_rm is
set
* Removed expansion logic from MemTracker
* Removed VCore logic from QuerySchedule
* Removed all reservation-related logic from Scheduler
* Removed RM metric descriptions
* Various misc. small class changes
Not done:
* Remove RM flags (--enable_rm etc.)
* Remove RM query options
* Changes to RequestPoolService (see IMPALA-4159)
* Remove estimates of VCores / memory from plan
Change-Id: Icfb14209e31f6608bb7b8a33789e00411a6447ef
Reviewed-on: http://gerrit.cloudera.org:8080/4445
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
One problem uncovered while trying to build Impala on Ubuntu16 is
that the functions 'isnan' and 'isinf' both appear in std::
(from <cmath>) and in boost::math::, but we're currently using
them without qualifiers in several places, leading to a conflict.
This patch prefixes all uses with 'std::' to disambiguate, and also
adds <cmath> includes to all files that use those functions, for
the sake of explicitness.
Another problem is that bin/make_impala.sh uses the system cmake,
which may not be compatible with the toolchain binaries. This patch
updates impala-config.sh to add the toolchain cmake to PATH, so
that we'll use it wherever we use cmake.
Change-Id: Iaa1520c1e4aa4175468ac342b14c1262fa745f7a
Reviewed-on: http://gerrit.cloudera.org:8080/3800
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins