IMPALA-13279 upgraded gcovr to 7.2 and moved it from python 2 to
python 3.8. gcovr has several dependencies that require native
compilation, and this increased the cost of initializing the
Python 3 virtualenv substantially:
Without gcovr: 1m43.279s
With gcovr and deps: 6m35.107s
This moves gcovr to its own requirements file and only installs
gcovr if this is a coverage build (detected from the
.cmake_buid_type file).
Testing:
- Verified that a coverage build does install gcovr and
produce a report
Change-Id: I1d0fd6d21273053aaf2acee39fcb83d9093d49a2
Reviewed-on: http://gerrit.cloudera.org:8080/21849
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In some environments, the code coverage report is empty even
though the tests ran successfully and gcno/gcda files are
written properly.
This upgrades to gcovr 7.2, which does not show the same
problem. gcovr 7.2 requires Python 3.8, so this switches to use
Python 3.8 from the toolchain and installs gcovr in the Python 3
virtualenv.
gcovr 7.2 outputs logging to stderr, so this also modifies
bin/coverage_helper.sh to redirect stderr to stdout.
Testing:
- Verified that this can generate a report locally and on
the affected environment
Change-Id: I5b1aaa92c65f54149a3e7230cbe56d5286f1051a
Reviewed-on: http://gerrit.cloudera.org:8080/21647
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
impala-python currently gets its Thrift from the toolchain
by adding the appropriate Thrift toolchain directories to
the PYTHONPATH. This is a problem when switching to Python 3,
because the toolchain Thrift was built with Python 2 and
this can produce complicated bugs. In general, it is also
not a good idea to get Python dependencies from the toolchain.
This switches to installing Thrift into the impala-python
virtualenv, which lets the different Python versions have
their own copy of compiled files.
Testing:
- Ran a core job
Change-Id: Ib36e8a1ce8d446b69b08e81ea458f95c158e28f5
Reviewed-on: http://gerrit.cloudera.org:8080/21046
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds 'kudu_table_reserve_seconds' query option to set reserved time
for deleted Impala managed Kudu tables. The default value is 0.
This option can prevent users from deleting important Kudu tables
by mistake.
Testing:
- Added e2e tests.
Change-Id: I3020567bb6cfe4dd48ef17906f8de674f37217e7
Reviewed-on: http://gerrit.cloudera.org:8080/20773
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
in python 3 environment when kerberos_host_fqdn option is used
In Pyhton 2, the sasl layer does not accept unicode strings,
so we have to explicitly encode the kerberos_host_fqdn string
to ascii. However, this is not the case in python 3, where
we have to omit the encode, because if we don't do this,
impala-shell wants to use the following service principal
during Kerberos auth:
my_service_name/b'my.kerberos.host.fqdn'@MY.REALM
instead of the correct one, which is:
my_service_name/my.kerberos.host.fqdn@MY.REALM
(This is because the output of the encode function
is a byte array in python 3.)
Tested with new unit tests and with a snapshot build
manually in CDP PVC DS.
Change-Id: I8b157d76824ad67faf531a529256a8afe2ab9d49
Reviewed-on: http://gerrit.cloudera.org:8080/20691
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
On some systems, we have seen the build for the impala-python
virtualenv refer to system gcc directly, even though we have
specified Impala toolchain's gcc via CC. When the system gcc
is newer than Impala's gcc, it fails to execute because it needs
symbols that are not present in Impala's libstdc++:
gcc: /home/joe/impala/toolchain/toolchain-packages-gcc10.4.0/gcc-10.4.0/lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by gcc)
This adds the toolchain gcc to the PATH when building the impala-python
virtualenv. This means that any direct reference to gcc will use our
compiler rather than system gcc. We continue to have CC pointed to
our compiler.
Testing:
- Ran a build on Redhat 9 where the issue presented
Change-Id: Ia5ddd6a88b41a3f8ba04d13538b3de2d9499cbf5
Reviewed-on: http://gerrit.cloudera.org:8080/20114
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We're starting to see environments where the system Python ('python') is
Python 3. Updates utility and build scripts to work with Python 3, and
updates check-pylint-py3k.sh to check scripts that use system python.
Fixes other issues found during a full build and test run with Python
3.8 as the default for 'python'.
Fixes a impala-shell tip that was supposed to have been two tips (and
had no space after period when they were printed).
Removes out-of-date deploy.py and various Python 2.6 workarounds.
Testing:
- Full build with /usr/bin/python pointed to python3
- run-all-tests passed with python pointed to python3
- ran push_to_asf.py
Change-Id: Idff388aff33817b0629347f5843ec34c78f0d0cb
Reviewed-on: http://gerrit.cloudera.org:8080/19697
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This adds a Python 3 equivalent to the impala-python
virtualenv base on the toolchain Python 3.7.16.
This modifies bootstrap_virtualenv.py to support
the two different modes. This adds py2-requirements.txt
and py3-requirements.txt to allow some differences
between the Python 2 and Python 3 virtualenvs.
Here are some specific package changes:
- allpairs is replaced with allpairspy, as allpairs did
not support Python 3.
- requests is upgraded slightly, because otherwise is has issues
with idna==2.8.
- pylint is limited to Python 3, because we are adding it
and don't need it on both
- flake8 is limited to Python 2, because it will take
some work to switch to a version that works on Python 3
- cm_api is limited to Python 2, because it doesn't support
Python 3
- pytest-random does not support Python 3 and it is unused,
so it is removed
- Bump the version of setuptool-scm to support Python 3
This adds impala-pylint, which can be used to do further
Python 3 checks via --py3k. This also adds a bin/check-pylint-py3k.sh
script to enforce specific py3k checks. The banned py3k warnings
are specified in the bin/banned_py3k_warnings.txt. This is currently
empty, but this can ratchet up the py3k strictness over time
to avoid regressions.
This pulls in a new toolchain with the fix for IMPALA-11956
to get Python 3.7.16.
Testing:
- Hand tested that the allpairs libraries produce the
same results
- The python3 virtualenv has no influence on regular
tests yet
Change-Id: Ica4853f440c9a46a79bd5fb8e0a66730b0b4efc0
Reviewed-on: http://gerrit.cloudera.org:8080/19567
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Updates utility scripts that don't use impala-python to work with Python
3 so we can build on systems that don't include Python 2 (such as SLES
15 SP4).
Primarily adds 'universal_newlines=True' to subprocess calls so they
return text rather than binary data in Python 3 with a change that's
compatible with Python 2.
Testing:
- built in SLES 15 SP4 container with Python 3
Change-Id: I7f4ce71fa1183aaeeca55d0666aeb113640c5cf2
Reviewed-on: http://gerrit.cloudera.org:8080/19559
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This adds the bin/check-python-syntax.sh script, which
runs "python -m compileall" for all python files in
Impala with both python2 and python3. This detects
syntax errors in the python files. This will be
incorporated into precommit once it is clean.
This also adds future to the impala-python virtualenv.
This provides the futurize script (exposed via
impala-futurize), which can be used to automatically
fix some py2/py3 issues. Future also provides the
builtins library, which can provide python 3
functionality on python 2.
Testing:
- Ran impala-futurize locally
- Ran the script repeatedly while fixing syntax errors
Change-Id: Iae2c51bc6ddc9b6a04469ee1b8284227fed3bd45
Reviewed-on: http://gerrit.cloudera.org:8080/19550
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
IMPALA_THRIFT_PY_VERSION is also bumped to 0.16.0p3.
As 0.16.0p3 Thrift does not contain Python related
patches and Impyla 0.18.0 depends on Thrift 0.16.0,
now we are consistently using Thrift 0.16.0 in all
Python code. This also bumps the Thrift in the
shell's ext-py directory to 0.16.0 (based on the
Thrift 0.16.0 pypi tarball with the egg directory
removed).
Testing:
- Ran a GVO job
Change-Id: I7265558b0e07959c606cba73cd251c3edfcb3ed5
Reviewed-on: http://gerrit.cloudera.org:8080/18456
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
IMPALA-9999 upgrades to GCC version to 10.4 which generates new gcov
format that the current gcovr version (3.4) can't parse. This patch
upgrades gcovr to the latest Python2-compatible version (4.2). Also adds
Jinja2, MarkupSafe and lxml as the required dependent packages. The
development packages of libxml2 and libxslt are also added in
bootstrap_system.sh and bootstrap_build.sh.
This patch also fixes a failure due to the gcov executable not found in
PATH.
Tests:
- Verified builds on Ubuntu 16.04 and CentOS 7.9
- Verified coverage_helper.sh work after this patch
Change-Id: I9458fa0dc97d69f88a4e8a3313dc9440215dfd52
Reviewed-on: http://gerrit.cloudera.org:8080/19226
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This upgrades GCC and libstdc++ to version 10.4. This
required patching or upgrading several dependencies
so they could compile with GCC 10. The toolchain
companion change has details on what items needed
to be upgraded and why.
The toolchain companion change switches GCC to build
with toolchain binutils rather than host binutils. This
means that the python virtualenv initialization needs
to include binutils on the path.
This disables two warnings introduced in the new GCC
versions (Wclass-memaccess and Winit-list-lifetime).
These two warnings occur in our code and also in
dependencies like LLVM and rapidjson. These are not
critical warnings, so they can be addressed
independently and reenabled later.
Binary sizes increase, particulary when including
debug symbols:
| GCC 7.5 | GCC 10.4
impalad RELEASE stripped | 83204768 | 88702824
impalad RELEASE | 707278904 | 971711456
impalad DEBUG stripped | 106677672 | 97391944
impalad DEBUG | 725864760 | 867647512
Testing:
- Multiple test jobs (core, release exhaustive, ASAN)
- Performance testing for TPC-H and TPC-DS shows
a modest improvement (2-4%).
- Code compiles without warnings on debug and release
Change-Id: Ibe6857b822925226d39fd4d6413457ef6bbaabec
Reviewed-on: http://gerrit.cloudera.org:8080/18134
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Updates flake8 to the latest Python 2-compatible version so we can use
indent-size=2. Our code uses 2-space indents and we have previously
worked around or disabled flake8 checks that rely on 4-space indenting.
Change-Id: Ia701f6e3d86be451ae86d041b799c8a10aee2d93
Reviewed-on: http://gerrit.cloudera.org:8080/18669
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch ports the implementation of GSSAPI authentication over http
transport from Impyla (https://github.com/cloudera/impyla/pull/415) to
impala-shell.
The implementation adds a new dependency on 'kerberos' python module,
which is a pip-installed module distributed under Apache License Version
2.
When using impala-shell with Kerberos over http, it is assumed that the
host has a preexisting kinit-cached Kerberos ticket that impala-shell
can pass to the server automatically without the user to reenter the
password.
Testing:
- Passed exhaustive tests.
- Tested manually on a real cluster with a full Kerberos setup.
Change-Id: Ia59ba4004490735162adbd468a00a962165c5abd
Reviewed-on: http://gerrit.cloudera.org:8080/18493
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Updates gitignore for files generated during bootstrap_development.
Fixes deleting tracked files in be/src/thirdparty. Includes ignore rules
for past versions of shell dependencies and updates ignores for current
versions.
Change-Id: I03deba5e7fb151ef8e34039becdcc3fb47684084
Reviewed-on: http://gerrit.cloudera.org:8080/18499
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
According to PEP-0503, pip repo server doesn't support unnormalized URL
access, and some package name within
'infra/python/deps/*requirements.txt' are unnormalized, e.g. 'Cython',
and pip_download.py will concat $PYPI_MIRROR and package name to get
download URL directly, which maybe unnormalized.
Fix this by normalize package name in download URL using the
recommanded method in PEP-0503.
Change-Id: I479df0ad7acf3c650b8f5317372261d5e2840864
Reviewed-on: http://gerrit.cloudera.org:8080/17987
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds parquet stats to iceberg manifest as per-datafile
metrics.
The following metrics are supported:
- column_sizes :
Map from column id to the total size on disk of all regions that
store the column. Does not include bytes necessary to read other
columns, like footers.
- null_value_counts :
Map from column id to number of null values in the column.
- lower_bounds :
Map from column id to lower bound in the column serialized as
binary. Each value must be less than or equal to all non-null,
non-NaN values in the column for the file.
- upper_bounds :
Map from column id to upper bound in the column serialized as
binary. Each value must be greater than or equal to all non-null,
non-Nan values in the column for the file.
The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).
Testing:
- New e2e test was added to verify that the metrics are written to the
Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.
Relevant Iceberg documentation:
- Manifest:
https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
serialized to binary:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
This patch upgrades impyla to the latest version 0.18a1, which supports
cookie retention for LDAP authentications. Also adds unit-test cases
for implyla's HTTP test with LDAP authentication.
Testing:
- Passed core tests.
Change-Id: I990e5cdde4e98d6ab3581fe48f53a5d0590ce492
Reviewed-on: http://gerrit.cloudera.org:8080/17795
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch Impala mainly used Thrift 0.9.3, but it was
possible to compile Impala shell with Thrift 0.11.0, so the 0.11.0
Thrift lib was already included in the toolchain.
Most of the changes are related to replacing boost:: with std::
shared_ptr-s in cpp code (this is a continuation of patch by Sahil).
The Thrift upgrade also needs an Impyla release with Thrift 0.11.0, as
Impala's test framework relies on Impyla. A thrift_sasl release is also
needed, because it currently pins Thrift version to 0.9.3 for Python 2.
The current patch uses alpha releases from Impyla and thrift_sasl that
use thrift 0.11.0.
Notable side effects:
- old logic to compile thrift for impala-shell with 0.11.0 was removed
- impala_shell's utf8 handling had to be updated as the new 0.11.0
compilation happens with no_utf8strings. This also made things a
bit faster, e.g the following is ~0.22s instead of ~0.25
shell/impala_shell.py \
-B -q "select * from functional_parquet.alltypes;" > /dev/null
- THRIFT-3921 changed the stream operators to print an enum's name
instead of its number, leading to slightly different messages
in some cases.
- "templates" was added to the thift generator's parameters to avoid
a compilation issue (related to IMPALA-10600). I didn't notice any
change in compilation time. This option generated .tcc files with
templetized readers/writers for Thrift types. Currently we don't
use these, but they could potentially speed up (de)serialization.
Testing:
- ran Impyla's test suite with Python 2 and 3
- ran core tests
Change-Id: Idd13f177b4f7acc07872ea6399035aa180ef6ab6
Reviewed-on: http://gerrit.cloudera.org:8080/17170
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When rebasing from an older commit, the version change
in virtualenv can cause there to be multiple virtualenv
tarballs of different versions in the infra/python/deps
directory. bootstrap_virtualenv.py currently doesn't
handle this gracefully, because it is looking for all
virtualenv*.tar.gz files and fails when it finds more
than one.
This changes bootstrap_virtualenv.py to get the virtualenv
version from the requirements.txt file and only look
for the tarball with that version. If it fails to get
the version, it falls back to the old method.
Testing:
- Copied virtualenv-16.7.10.tar.gz to virtualenv-16.7.9.tar.gz
and verified that bootstrap_virtualenv.py works
Change-Id: Iebfa9ba5e223d5187414e02e24f34562418fae40
Reviewed-on: http://gerrit.cloudera.org:8080/17249
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This updates kudu-python to version 1.14.0 (from 1.2.0).
As part of this, it disables ccache for bootstrap_virtualenv.py.
ccache wasn't working anyway, because pip install uses random
temporary directories. It also needs to copy a few files to
the build directory for the Kudu install. The advantage to
upgrading is that the new version no longer has a numpy dependency.
Additionally, this modifies a few minor packages:
- virtualenv moves to the latest version prior to the rewrite
that accompanied version 20 (i.e. 16.10.7).
- setuptools moves to the last version that supports python 2.7 (44.1.1)
- remove botos3, ipython, and ordereddict
These changes speed up installing the virtualenv
Before:
real 3m11.956s
user 2m49.620s
sys 0m14.266s
After:
real 1m38.798s
user 1m33.591s
sys 0m8.112s
Testing:
- Hand tests, GVO run
Change-Id: Ib47770df9e46de448fe2bffef7abe2c3aa942fb9
Reviewed-on: http://gerrit.cloudera.org:8080/17231
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Bootstrapping the impala-python virtualenv requires multiple
rounds of pip installs with different sets of requirements.
This consolidates the requirements.txt, stage2-requirements.txt,
and compiled-requirements.txt into a single requirements.txt.
This will make it easier to upgrade python packages.
This also splits out setuptools into its own
setuptools-requirements.txt. Setuptools is used during the
pip install for several of the dependencies. Recent versions
of setuptools do not support Python 2, but some of the install
tools (like easy_install) don't know how to pick a version
of setuptools that works with Python 2. Splitting it out to its
own requirements file lets us pin the version.
To make review easier, this does not change any of the versions
of the dependencies. It also leaves the stage2-requirements.txt
and compiled-requirements.txt split out in separate sections
of requirements.txt. These will later be turned into a single
alphabetical list.
Testing:
- Tested impala-python locally
- Ran GVO
Change-Id: I8e920e5a257f1e0613065685078624a50d59bf2e
Reviewed-on: http://gerrit.cloudera.org:8080/17226
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Initializing the impala-python virtualenv takes a couple minutes,
so it is useful to do that in parallel to the rest of the build.
This moves the impala-python initialization to its own step
in the CMake build. It stops using impala-python for commands
invoked from buildall.sh or the CMake build to avoid premature
or concurrent initializations of impala-python. Then, it adds
a dedicated step to initialize impala-python.
Testing:
- Ran a core job and a couple builds
- Rebuilt and verified that impala-python is not reinitialized
if it is already initialized
Change-Id: Ieff51263c55bd234028fed7101c94b4a928590f0
Reviewed-on: http://gerrit.cloudera.org:8080/16607
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When build impala in Company internal network, pip_download.py
failed to download dependency eggs from https engpoint Although
correcly set system proxy like http_proxy, https_proxy. Is is
a issue of python2's urllib. I just replace urllib with wget
which can works well with system proxy like https_proxy.
Change-Id: I146d93312701fd682420cb65cf4738bc030f3cfb
Reviewed-on: http://gerrit.cloudera.org:8080/16344
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This upgrades GCC and libstdc++ to version 7.5.0. There
have been ABI changes since 4.9.2, so this means that
the native-toolchain produced with the new compiler is
not interoperable with one produced by the old compiler.
To allow that transition, IMPALA_TOOLCHAIN_PACKAGES_HOME
is now a subdirectory of IMPALA_TOOLCHAIN
(toolchain-packages-gcc${IMPALA_GCC_VERSION}) to distinguish
it from the old packages.
Some Python packages in the impala-python virtualenv are
compiled using the toolchain GCC and now use the new ABI.
This leads to two changes:
1. When constructing the LD_LIBRARY_PATH for impala-python,
we include the GCC libstdc++ libraries. Otherwise, certain
Python packages that use C++ fail on older OSes like Centos 7.
This fixes IMPALA-9804.
2. Since developers work on various branches, this changes
the virtualenv's directory location to a directory with
the GCC version in the name. This allows the virtualenv
built with GCC 7 to coexist with the current virtualenv
built with GCC 4.9.2. The location for the old virtualenv is
${IMPALA_HOME}/infra/python/env. The new location is
${IMPALA_HOME}/infra/python/env-gcc${IMPALA_GCC_VERSION}. This
required updating several impala-python scripts.
There are various odds-and-ends related to the transition:
1. Due to the small string optimization, the size of std::string
changed, which means that various data structures also changed
in size. This required updating some static asserts.
2. There is a bug in clang-tidy that reports a use-after-free
for some code using std::shared_ptr. Clang is not modeling
the shared_ptr correctly, so it is a false-positive. As a workaround,
this disables the clang-analyzer-cplusplus.NewDelete diagnostic.
3. Various small compilation fixes (includes, etc).
Performance testing:
- Ran single-node performance tests on TPC-H for the following
configurations:
- TPC-H Parquet scale 30 with normal configurations
- TPC-H Parquet scale 30 with codegen disabled
- TPC-H Kudu scale 10
None found any significant regressions. Full results are
posted on the JIRA.
- Ran single-node performance tests on targeted-perf scale 10.
No significant regressions.
- The size of binaries (impalad, etc) is slightly smaller with the new GCC:
GCC 4.9.2 release impalad binary: 545664
GCC 7.5.0 release impalad binary: 539900
- Compilation in DEBUG mode is roughly 15-25% faster
Functional testing:
- Ran core jobs, exhaustive release jobs, UBSAN
Change-Id: Ia0beb2b618ba669c9699f8dbc0c52d1203d004e4
Reviewed-on: http://gerrit.cloudera.org:8080/16045
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The locations for native-toolchain packages in IMPALA_TOOLCHAIN
currently do not include the compiler version. This means that
the toolchain can't distinguish between native-toolchain packages
built with gcc 4.9.2 versus gcc 7.5.0. The collisions can cause
issues when switching back and forth between branches.
This introduces the IMPALA_TOOLCHAIN_PACKAGES_HOME environment
variable, which is a location inside IMPALA_TOOLCHAIN that would
hold native-toolchain packages. Currently, it is set to the same
as IMPALA_TOOLCHAIN, so there is no difference in behavior.
This lays the groundwork to add the compiler version to this
path when switching to GCC7.
Testing:
- The only impediment to building with
IMPALA_TOOLCHAIN_PACKAGES_HOME=$IMPALA_TOOLCHAIN/test is
Impala-lzo. With a custom Impala-lzo, compilation succeeds.
Either Impala-lzo will be fixed or it will be removed.
- Core tests
Change-Id: I1ff641e503b2161baf415355452f86b6c8bfb15b
Reviewed-on: http://gerrit.cloudera.org:8080/15991
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9626 broke the use case where the toolchain binaries are not
downloaded from the native-toolchain S3 bucket, because
SKIP_TOOLCHAIN_BOOTSTRAP is set to true.
Fix this use case by checking SKIP_TOOLCHAIN_BOOTSTRAP in
bin/bootstrap_environment.py:
- if true: just check if the specified version of the Python binary is
present at the expected toolchain location. If it is there, use it,
otherwise throw an exception and abort the bootstrap process.
- in any other case: proceed to download the Python binary as in
bootstrap_toolchain.py.
Test:
- simulate the custom toolchain setup by downloading the toolchain
binaries from the S3 bucket, copying them to a separate directory,
symlinking them into Impala/toolchain, then executing buildall.sh
with SKIP_BOOTSTRAP_TOOLCHAIN set to "true".
Change-Id: Ic51b3c327b3cebc08edff90de931d07e35e0c319
Reviewed-on: http://gerrit.cloudera.org:8080/15759
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Upgrades the impala-shell's bundled version of sqlparse to 0.3.1.
There were some API changes in 0.2.0+ that required a re-write of
the StripLeadingCommentFilter in impala_shell.py. A slight perf
optimization was also added to avoid using the filter altogether
if no leading comment is readily discernible.
As 0.1.19 was the last version of sqlparse to support python 2.6,
this patch also breaks Impala's compatibility with python 2.6.
No new tests were added, but all existing tests passed without
modification.
Change-Id: I77a1fd5ae311634a18ee04b8c389d8a3f3a6e001
Reviewed-on: http://gerrit.cloudera.org:8080/15642
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Historically Impala used the Python2 version that was available on
the hosting platform, as long as that version was at least v2.6.
This caused constant headache as all Python syntax had to be kept
compatible with Python 2.6 (for Centos 6). It also caused a recent problem
on Centos 8: here the system Python version was compiled with the
system's GCC version (v8.3), which was much more recent than the Impala
standard compiler version (GCC 4.9.2). When the Impala virtualenv was
built, the system Python version supplied C compiler switches for models
containing native code that were unknown for the Impala version of GCC,
thus breaking virtualenv installation.
This patch changes the Impala virtualenv to always use the Python2
version from the toolchain, which is built with the toolchain compiler.
This ensures that
- Impala always has a known Python 2.7 version for all its scripts,
- virtualenv modules based on native code will always be installable, as
the Python environment and the modules are built with the same compiler
version.
Additional changes:
- Add an auto-use fixture to conftest.py to check that the tests are
being run with Python 2.7.x
- Make bootstrap_toolchain.py independent from the Impala virtualenv:
remove the dependency on the "sh" library
Tests:
- Passed core-mode tests on CentOS 7.4
- Passed core-mode tests in Docker-based mode for centos:7
and ubuntu:16.04
Most content in this patch was developed but not published earlier
by Tim Armstrong.
Change-Id: Ic7b40cef89cfb3b467b61b2d54a94e708642882b
Reviewed-on: http://gerrit.cloudera.org:8080/15624
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A few built-ins were changed in python 3 -- e.g., xrange became range,
ConfigParser became configparser, etc. We can redefine some of those
things in a single place, and import them from there as needed. Other
items may also be added as we go along.
Change-Id: Ibd3d86df524666a98cbfa463756adac48bd1f8a3
Reviewed-on: http://gerrit.cloudera.org:8080/15514
Reviewed-by: David Knupp <dknupp@cloudera.com>
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This includes a bugfix for fetch requests returning 0 rows,
which was resulting in Impyla truncating results for
some tests.
Testing:
Ran exhaustive tests.
Change-Id: Iee3afdcc2f25f0e094c6d2531a83da79045d01be
Reviewed-on: http://gerrit.cloudera.org:8080/14733
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes two issues with --webserver_interface:
- When --webserver_interface was used start-impala-cluster.py with a
value that's different from --hostname, minicluster startup would
appear to fail as liveness is determined by checking for the webui's
availability at the address specified for --hostname.
- The value of --webserver_interface was applied correctly for the
catalogd and statestored but not for impalads, due to the way
ExecEnv constructed the Webserver.
- It is now possible to specify a hostname for webserver_interface
instead of an IP. The webserver will resolve the hostname.
This patch also upgrades our version of psutil to the latest for the
function 'net_if_addrs'. This requires a few change to our use of
psutil, mostly adding '()' to call functions that previously were
variables.
Testing:
- Added a custom cluster test that finds all available interfaces,
binds the webserver to one of them, and checks that its only
available over that interface.
Change-Id: Ic7e75908426756d73f13a0fa3cfc21fc31da164c
Reviewed-on: http://gerrit.cloudera.org:8080/14266
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HS2 is added as an option via --protocol=hs2. The user-visible
differences in behaviour are minimal. Beeswax is still the
default and can be explicitly enabled via --protocol=beeswax
but will be deprecated. The default is unchanged because
changing the default could break certain workflows, e.g.
those that explicitly specify the port with -i or deployments
that hit --fe_service_threads for HS2 and somehow rely on
impala-shell not contributing to that limit. For most
workflows the change is transparent and we should change
the default in a major version change.
This support requires Impala-specific extensions to
the HS2 interface, similar to the existing extensions
to Beeswax. Thus the HS2 shell is only
forwards-compatible with newer Impala versions.
I considered trying to gracefully degrade when the
new extensions weren't present, but it didn't seem to be
worth the ongoing testing effort.
Differences between HS2 and Beeswax are abstracted into
ImpalaClient subclasses.
Here are the changes required to make it work:
* Switch to TBinaryProtocolAccelerated to avoid perf
regression. The HS2 protocol requires decoding
more primitive values (because its not a string-per-row),
which was slow with the pure python implementation of
TBinaryProtocol.
* Added bitarray module to efficiently unpack null indicators
* Minimise invasiveness of changes by transposing and stringifying
the columnar results into rows in impala_client.py. The transposition
needs to happen before display anyway.
* Add PingImpalaHS2Service() to get back version string and webserver
address.
* Add CloseImpalaOperation() extension to return DML row counts. This
possibly addresses IMPALA-1789, although we need to confirm that
this is a sufficient solution.
* Add is_closed member to query handles to avoid shell independently
tracking whether the query handle was closed or not.
* Include query status in HS2 log to match beeswax.
* HS2 GetLog() command now includes query status error message for
consistency with beeswax.
* "set"/"set all" uses the client requests options, not the session
default. This captures the effective value of TIMEZONE, which
was previously missing. This also requires test changes where
the tests set non-default values, e.g. for ABORT_ON_ERROR.
* "set all" on the server side returns REMOVED query options - the
shell needs to know these so it can correctly ignore them.
* Clean up self.orig_cmd/self.last_leading comment argument
passing to avoid implicit parameter passing through multiple
function calls.
* Clean up argument handling in shell tests to consistently pass
around lists of arguments instead of strings that are subject
to shell tokenisation rules.
* Consistently close connections in the shell to avoid leaking
HS2 sessions. This is enforced by making ImpalaShell a context
manager and also eliminating all sys.exit() calls that would
bypass the explicit connection closing.
Testing:
* Shell tests can run with both protocols
* Add tests for formatting of all types and NULL values
* Added testing for floating point output formatting, which does
change as a result of switching to server-side vs client-side
formatting.
* Verified that newly-added tests were actually going through HS2
by disabling hs2 on the minicluster and running tests.
* Add checks to test_verify_metrics.py to ensure that no sessions
are left open at the end of tests.
Performance:
Baseline from beeswax shell for large extract is as follows:
$ time impala-shell.sh -B -q 'select * from tpch_parquet.orders' > /dev/null
real 0m6.708s
user 0m5.132s
sys 0m0.204s
After this change it is somewhat slower, but we generally don't consider
bulk extract performance through the shell to be perf-critical:
real 0m7.625s
user 0m6.436s
sys 0m0.256s
Change-Id: I6d5cc83d545aacc659523f29b1d6feed672e2a12
Reviewed-on: http://gerrit.cloudera.org:8080/12884
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tests can fail on S3 due to some operations that are eventually
consistent. S3Guard stores extra metadata in a DynamoDB to solve
several consistency issues.
This adds support for running the minicluster on S3 with S3Guard.
S3Guard is configured by the following environment variables:
S3GUARD_ENABLED: defaults to false, set to true to enable S3Guard
S3GUARD_DYNAMODB_TABLE: name of the DynamoDB table to use. This must
be exclusively owned by this minicluster. The dataload scripts
initialize this table and will purge entries if the table already
exists. The table should be in the same region as the S3_BUCKET
for the minicluster.
S3GUARD_DYNAMODB_REGION - AWS region for S3GUARD_DYNAMODB_TABLE
These environment variables only impact S3 configurations.
The support comes from three pieces:
1. Configuration changes in core-site.xml to add the appropriate
parameters.
2. Updating dataload to initialize/purge the s3guard dynamodb table
and import data appropriately.
3. Update tests to manipulate files through the HDFS command line
rather than through s3 utilities. This takes the filesystem
utility code for ABFS (which actually calls HDFS command line),
makes it generic, and uses it for S3Guard.
Testing:
- Ran multiple rounds of s3 tests
- Aborted tests in the middle and restarted the s3 tests (to test
the s3guard reinitialization code)
Change-Id: I3c748529a494bb6e70fec96dc031523ff79bf61d
Reviewed-on: http://gerrit.cloudera.org:8080/13020
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
This change updates Impyla to 0.15.0 and then uses Impyla to retrieve
thrift profiles through the HS2 api.
Unfortunately, some of the current usages of get_thrift_profile rely on
the Beeswax query states and the ImpylaHS2Connection does not have the
required functionality yet. We will have to update these in a future
change, once we unified the query states.
This change also adds a self-contained test for IMPALA-2063
Change-Id: I769a99f0843297dd2b20f2f5b1a9046c97bb131e
Reviewed-on: http://gerrit.cloudera.org:8080/12530
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Fixes in IMPALA-8317 and IMPALA-8337 introduced third-party dependencies
in Impala shell which is problematic in multi-Python environment. This
patch rewrites the fixes using an alternative solution when dealing with
duplicate options without any third-party dependencies. For example:
[impala]
keyval=msg1=hello,keyval=msg2=world
Testing:
- Ran all shell tests on Python 2.6 and 2.7.
- Ran make_shell_tarball.sh and ran Impala shell from the tarball
without any issue.
Change-Id: Ifc0bf391ba26cf5a34f622a4157d7287453cc539
Reviewed-on: http://gerrit.cloudera.org:8080/12844
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-8317 added a patch that uses OrderedDict, which is not available
on Python 2.6 or lower. The patch in IMPALA-8317 also requires a newer
version of configparser to handle duplicate keys, which is not available
in Python 2.6. This patch fixes the issue by using the OrderedDict from
ordereddict third-party library when running on Python 2.6 or lower and
using configparser from the backport.
Testing:
- Ran all E2E shell tests on Python 2.6 and 2.7
Change-Id: Iab1a33542319e9bb75d806bfd38b158995b54aa9
Reviewed-on: http://gerrit.cloudera.org:8080/12830
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>