To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3
This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
doesn't have a main function, it removes the hash-bang and makes
sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
(or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
replaced by the cm-client pypi package and interfaces have changed.
Rather than migrating the code (which hasn't been used in years), this
deletes the old code and stops installing cm-api into the virtualenv.
The code can be restored and revamped if there is any interest in
interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
bit-rotted. Some pieces can be run manually, but it can't be fully
verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
version that supports Python 3. The newest version of kazoo requires
upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
needing other upgrades.
The two remaining uses of impala-python are:
- bin/cmake_aux/create_virtualenv.sh
- bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.
The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python)
Testing:
- Ran core job
- Ran build + dataload on Centos 7, Redhat 8
- Manual testing of individual scripts (except some bitrotted areas like the
random query generator)
Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
When using bin/dump_breakpad_symbols.py to dump symbols for RPM/DEB
packages, the script extracts the packages to a temporary directory
and relies on keeping that directory around until the processing
is finished. The parallel processing added in IMPALA-11511 breaks
the logic that keeps the temporary directory around, so the script
generates errors like:
Found debugging info in /tmp/tmpqfZ9MZ/usr/lib/debug/usr/lib/impala/sbin-retail/impalad.debug
Failed to open ELF file '/tmp/tmpqfZ9MZ/usr/lib/debug/usr/lib/impala/sbin-retail/impalad.debug': No such file or directory
Failed to write symbol file.
This turns off parallelism for bin/dump_breakpad_symbols.py when
processing RPM/DEB packages (i.e. -r/--pkg). This also avoids using
a ThreadPool when num_processes <= 1.
Testing:
- Hand tested with Redhat 7 RPMs
Change-Id: If2885a9cfb36a4f616b539599e7f744bd23552c3
Reviewed-on: http://gerrit.cloudera.org:8080/20943
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Some RPMs contain binaries with debug symbols with no separate
debuginfo package needed. bin/dump_breakpad_symbols.py does not
allow this combination, as it expects a corresponding symbol
package. This adds a --no_symbol_pkg option to dump_breakpad_symbols.py
to turn off the requirement that --pkg be combined with --symbol_pkg.
Testing:
- Tested with an RPM package with an unstripped impalad binary
- Tested with the usual RPM + debuginfo RPM combination
Change-Id: I9589b0ed7855fe49c6989ec3dcc51a9e9c4f476b
Reviewed-on: http://gerrit.cloudera.org:8080/20944
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Yida Wu <wydbaggio000@gmail.com>
This modifies dump_breakpad_symbols.py to use a ThreadPool
to go parallel when there are multiple binaries or
libraries to process. This is common for Jenkins jobs that
dump symbols for all backend tests. The different binaries
write out to different directories, so the threads don't
interfere with each other.
Testing:
- Ran locally dumping the symbols for all backend tests
- Ran a Jenkins job that generates a minidump and triggers
the minidump symbol processing. It went parallel and
worked fine.
Change-Id: I93427bb07f1d9718bd6df90acfd247210b54294d
Reviewed-on: http://gerrit.cloudera.org:8080/20802
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Impala's build produces dozens of C++ binaries
that link in all Impala libraries. Each binary is
hundreds of megabytes, leading to 10s of gigabytes
of disk space. A large proportion of this (~80%) is debug
information. The debug information increases in newer
versions of GCC such as GCC 10.
This introduces two options for reducing the size
of debug information:
- IMPALA_MINIMAL_DEBUG_INFO=true builds Impala with
minimal debug information (-g1). This contains line tables
and can resolve backtraces, but it does not contain
variable information and restricts further debugging.
- IMPALA_COMPRESSED_DEBUG_INFO=true builds Impala with
compressed debug information (-gz). This does not change
the debug information included, but the compression saves
significant disk space. gdb is known to work with
compressed debug information, but other tools may not
support it. The dump_breakpad_symbols.py script has been
adjusted to handle these binaries.
These are disabled by default.
Release impalad binary sizes:
Configuration | Size (bytes) | % reduction over base
Base | 707834808 | N/A
Stripped | 83351664 | 88%
Minimal debuginfo | 215924096 | 69%
Compressed debuginfo | 301619286 | 57%
Minimal + compressed debuginfo | 120886705 | 83%
Testing:
- Generated minidumps and resolved them
- Verified this is disabled by default
Change-Id: I04a20258a86053d8f3972b9c7c81cd5bec1bbb66
Reviewed-on: http://gerrit.cloudera.org:8080/18962
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Recent testing showed that the pytests are not
respecting the log level and format set in
conftest.py's configure_logging(). It is using
the default log level of WARNING and the
default formatter.
The issue is that logging.basicConfig() is only
effective the first time it is called. The code
in lib/python/impala_py_lib/helpers.py does a
call to logging.basicConfig() at the global
level, and conftest.py imports that file. This
renders the call in configure_logging()
ineffective.
To avoid this type of confusion, logging.basicConfig()
should only be called from the main() functions for
libraries. This removes the call in lib/python/impala_py_lib
(as it is not needed for a library without a main function).
It also fixes up various other locations to move the
logging.basicConfig() call to the main() function.
Testing:
- Ran the end to end tests and custom cluster tests
- Confirmed the logging format
- Added an assert in configure_logging() to test that
the INFO log level is applied to the root logger.
Change-Id: I5d91b7f910b3606c50bcba4579179a0bc8c20588
Reviewed-on: http://gerrit.cloudera.org:8080/16679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The locations for native-toolchain packages in IMPALA_TOOLCHAIN
currently do not include the compiler version. This means that
the toolchain can't distinguish between native-toolchain packages
built with gcc 4.9.2 versus gcc 7.5.0. The collisions can cause
issues when switching back and forth between branches.
This introduces the IMPALA_TOOLCHAIN_PACKAGES_HOME environment
variable, which is a location inside IMPALA_TOOLCHAIN that would
hold native-toolchain packages. Currently, it is set to the same
as IMPALA_TOOLCHAIN, so there is no difference in behavior.
This lays the groundwork to add the compiler version to this
path when switching to GCC7.
Testing:
- The only impediment to building with
IMPALA_TOOLCHAIN_PACKAGES_HOME=$IMPALA_TOOLCHAIN/test is
Impala-lzo. With a custom Impala-lzo, compilation succeeds.
Either Impala-lzo will be fixed or it will be removed.
- Core tests
Change-Id: I1ff641e503b2161baf415355452f86b6c8bfb15b
Reviewed-on: http://gerrit.cloudera.org:8080/15991
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins