19 Commits

Author SHA1 Message Date
Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
   doesn't have a main function, it removes the hash-bang and makes
   sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.
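
Item 1 above can be sketched as a small check. This is a hypothetical
helper, not the tooling used for this change; it assumes that "has a main
function" means the file contains an if __name__ == "__main__" guard.

```python
import re

# Assumption: a file "has a main" if it contains a __main__ guard.
MAIN_GUARD = re.compile(
    r"""^if\s+__name__\s*==\s*['"]__main__['"]\s*:""", re.MULTILINE)


def should_drop_shebang(source):
    """Return True if the file starts with a hash-bang but has no __main__ guard."""
    has_shebang = source.startswith("#!")
    has_main = MAIN_GUARD.search(source) is not None
    return has_shebang and not has_main


def drop_shebang(source):
    """Remove the first line when it is a hash-bang on a non-script file."""
    if not should_drop_shebang(source):
        return source
    # Keep everything after the first newline (empty if the file is one line).
    _, _, rest = source.partition("\n")
    return rest
```

Clearing the executable bit would be a separate step (for example via
os.chmod) and is left out of the sketch.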

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).

Testing:
 - Ran core job
 - Ran build + dataload on Centos 7, Redhat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
Michael Smith
dced8ca27c IMPALA-12217: Update cgroup-util to handle cgroups v2
RedHat 9 and Ubuntu 22 switch to cgroups v2, which has a different
hierarchy than cgroups v1. Ubuntu 20 has a hybrid layout with both
cgroup and cgroup2 mounted, but the cgroup2 functionality is limited.

Updates cgroup-util to
- identify available cgroups in FindCGroupMounts. Prefers v1 if
  available, as Ubuntu 20's hybrid layout provides only limited v2
  interfaces.
- refactors file reading to follow guidelines from
  https://gehrcke.de/2011/06/reading-files-in-c-using-ifstream-dealing-correctly-with-badbit-failbit-eofbit-and-perror/
  for clearer error handling. Specifically, failbit doesn't set errno, but
  we were printing it anyway (which produced misleading errors).
- updates FindCGroupMemLimit to read memory.max for cgroups v2.
- updates DebugString to print the correct property based on cgroup
  version.
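
The version-dependent memory limit lookup can be illustrated with a short
Python sketch. It is not the C++ code in cgroup-util; the function names
and the cgroup_dir argument are hypothetical. It relies only on the kernel
interface: cgroups v1 exposes memory.limit_in_bytes, while cgroups v2
exposes memory.max, which holds the literal string "max" when unlimited.

```python
import os


def parse_mem_limit(text):
    """Parse the contents of a cgroup memory limit file.

    cgroups v2 writes the literal string "max" when no limit is set;
    return None in that case, otherwise the limit in bytes.
    """
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def mem_limit_file(cgroup_dir, version):
    """Pick the limit file for the given cgroup version (1 or 2)."""
    name = "memory.max" if version == 2 else "memory.limit_in_bytes"
    return os.path.join(cgroup_dir, name)
```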

Removes unused cgroups test library.

Testing:
- proc-info-test CGroupInfo.ErrorHandling test on RHEL 9 and Ubuntu 20.
- verified no error messages related to reading cgroup present in logs
  on RHEL 9 and Ubuntu 20.

Change-Id: I8dc499bd1b490970d30ed6dcd2d16d14ab41ee8c
Reviewed-on: http://gerrit.cloudera.org:8080/20105
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-06-23 01:07:12 +00:00
Joe McDonnell
0c7c6a335e IMPALA-11977: Fix Python 3 broken imports and object model differences
Python 3 changed some object model methods:
 - __nonzero__ was removed in favor of __bool__
 - func_dict / func_name were removed in favor of __dict__ / __name__
 - The next() method on iterators was removed in favor of __next__
   (Code locations should use next(iter) rather than iter.next())
 - metaclasses are specified a different way
 - Locations that specify __eq__ should also specify __hash__
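
A minimal illustration of three of these changes, using made-up classes
(Countdown and Version are not from the Impala code base). The Python 2
aliases are kept only to show how one class can serve both versions.

```python
class Countdown(object):
    """Iterator that defines the Python 3 methods plus Python 2 aliases."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1

    next = __next__          # Python 2 spelling

    def __bool__(self):
        return self.current > 0

    __nonzero__ = __bool__   # Python 2 spelling


class Version(object):
    """Value object: __eq__ and __hash__ are defined together."""

    def __init__(self, major, minor):
        self.key = (major, minor)

    def __eq__(self, other):
        return isinstance(other, Version) and self.key == other.key

    def __hash__(self):
        # On Python 3, defining __eq__ alone makes instances unhashable.
        return hash(self.key)
```

Callers use next(counter) rather than counter.next(), which works on both
versions.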

Python 3 also moved some packages around (urllib2, Queue, httplib,
etc), and this adapts the code to use the new locations (usually
handled on Python 2 via future). This also fixes the code to
avoid referencing exception variables outside the exception block
and variables outside of a comprehension. Several of these seem
like false positives, but it is better to avoid the warning.

This fixes these pylint warnings:
bad-python3-import
eq-without-hash
metaclass-assignment
next-method-called
nonzero-method
exception-escape
comprehension-escape

Testing:
 - Ran core tests
 - Ran release exhaustive tests

Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee
Reviewed-on: http://gerrit.cloudera.org:8080/19592
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail. e.g.

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap locations with list(). i.e.

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
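
Laziness also means the result can be consumed only once, which is a less
obvious failure mode than the equality check above. A small Python 3
illustration (variable names are arbitrary):

```python
evens = filter(lambda n: n % 2 == 0, range(6))
first_pass = list(evens)     # consumes the iterator
second_pass = list(evens)    # nothing left on Python 3

# Materialize once when the values are needed more than once,
# or when len() or indexing is required.
materialized = list(filter(lambda n: n % 2 == 0, range(6)))
count = len(materialized)
```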

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
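
A short illustration of the division difference, with made-up values. On
Python 2 the same results require "from __future__ import division" at the
top of the file; on Python 3 this is the default behavior.

```python
records = 7
batches = 2

average = records / batches        # true division: a float
per_batch = records // batches     # integer (floor) division

items = ["a", "b", "c", "d", "e"]
middle = items[len(items) // 2]    # an index must be an int, so use //
```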

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
2b550634d2 IMPALA-11952 (part 2): Fix print function syntax
Python 3 now treats print as a function and requires
parentheses in the invocation.

print "Hello World!"
is now:
print("Hello World!")

This fixes all locations to use the function
invocation. This is more complicated when the output
is being redirected to a file or when avoiding the
usual newline.

print >> sys.stderr , "Hello World!"
is now:
print("Hello World!", file=sys.stderr)

To support this properly and guarantee equivalent behavior
between python 2 and python 3, all files that use print
now add this import:
from __future__ import print_function
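
The newline case mentioned above uses the end keyword. A minimal Python 3
sketch that writes to an in-memory buffer so the result can be inspected
(the buffer and messages are illustrative only):

```python
import io

buf = io.StringIO()
# Python 2 suppressed the newline with a trailing comma:
#   print >> buf, "loading",
print("loading", end="", file=buf)
print(" done", file=buf)
output = buf.getvalue()
```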

This also fixes random flake8 issues that intersect with
the changes.

Testing:
 - check-python-syntax.sh shows no errors related to print

Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351
Reviewed-on: http://gerrit.cloudera.org:8080/19552
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-28 17:11:50 +00:00
Joe McDonnell
c71de994b0 IMPALA-11952 (part 1): Fix except syntax
Python 3 does not support this old except syntax:

except Exception, e:

Instead, it needs to be:

except Exception as e:

This uses impala-futurize to fix all locations of
the old syntax.

Testing:
 - The check-python-syntax.sh no longer shows errors
   for except syntax.

Change-Id: I1737281a61fa159c8d91b7d4eea593177c0bd6c9
Reviewed-on: http://gerrit.cloudera.org:8080/19551
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-28 17:11:50 +00:00
Joe McDonnell
90ab610d34 Convert dataload hdfs copy commands to LOAD DATA statements
The schema file allows specifying a commandline command in
several of the sections (LOAD, DEPENDENT_LOAD, etc). These
are executed by testdata/bin/generate-schema-statements.py
when it is creating the SQL files that are later executed
for dataload. A fair number of tables use this flexibility
to execute hdfs mkdir and copy commands via the command line.

Unfortunately, this is very inefficient. HDFS command line
commands require spinning up a JVM and can take over one
second per command. These commands are executed during a
serial part of dataload, and they can be executed multiple
times. In short, these commands are a significant slowdown
for loading the functional tables.

This converts the hdfs command line statements to equivalent
Hive LOAD DATA LOCAL statements. These are doing the copy
from an already running JVM, so they do not need JVM startup.
They also run in the parallel part of dataload, speeding up
the SQL generation part.
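
The shape of the replacement statement can be sketched as follows. The
helper is hypothetical and is not code from generate-schema-statements.py;
the path and table name in the usage are placeholders.

```python
def load_data_statement(local_path, table, overwrite=True):
    """Build a Hive LOAD DATA LOCAL statement for a local file or directory."""
    escaped = local_path.replace("'", "\\'")
    overwrite_clause = "OVERWRITE " if overwrite else ""
    return "LOAD DATA LOCAL INPATH '{0}' {1}INTO TABLE {2};".format(
        escaped, overwrite_clause, table)


statement = load_data_statement("/tmp/data.txt", "functional.example_table")
```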

This speeds up generate-schema-statements.py significantly.
On the functional dataset, it saves 7 minutes.
Before:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real    8m8.068s
user    10m11.218s
sys     0m44.932s

After:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real    0m35.800s
user    0m42.536s
sys     0m5.210s

This is currently a long-pole in dataload, so it translates directly to
an overall speedup of about 7 minutes.

Testing:
 - Ran debug tests

Change-Id: Icf17b85ff85618933716a80f1ccd6701b07f464c
Reviewed-on: http://gerrit.cloudera.org:8080/15228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-24 21:22:18 +00:00
Zachary Amsden
66704f915e IMPALA-6068: Scale back fixing functional-types
I re-created the original patch for IMPALA-6068, but only
performed what I believe to be the limited legal transformation
of data load: DEPENDENT_LOAD -> DEPENDENT_LOAD_HIVE.

Any place that directly uploads via hadoop or hdfs commands
was left alone as changing it can't be proven to be correct.

Change-Id: I6c242cca209a7138b10ad517076707709b5cd204
Testing: Doing a full data load.  I mistakenly changed a variable
name causing the first two dry-runs to fail.
Reviewed-on: http://gerrit.cloudera.org:8080/8690
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Zach Amsden <zamsden@cloudera.com>
2017-12-04 23:46:44 +00:00
David Knupp
d1c9510001 Revert "IMPALA-6068: Fix dataload for complextypes_fileformat"
This reverts commit e4f585240a.

Among other things, that commit replaced hdfs command line calls
with "LOAD DATA LOCAL INPATH" using Hive. However, doing so
presumes that the minicluster is the only test environment.
Sometimes, though, the data load script is run against a remote cluster,
and in those cases, the data load process is now broken.

Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2
Reviewed-on: http://gerrit.cloudera.org:8080/8653
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-28 02:57:04 +00:00
Joe McDonnell
e4f585240a IMPALA-6068: Fix dataload for complextypes_fileformat
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.

Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.

This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.

This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the LOAD DATA LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.

Any other location that could use LOAD DATA LOCAL is
also switched over to use it. This includes the
testescape* tables which now print the appropriate
LOAD DATA commands as a result of text_delims_table.py.
Any location that already uses LOAD DATA LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the DEPENDENT_LOAD_HIVE section.

Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.

Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
2017-10-25 03:43:26 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Casey Ching
d202d6a967 Use "impala-python" (virtualenv) instead of system python
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.

Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-06 02:09:09 +00:00
Matthew Jacobs
f37682a16f Fix packaging build for Python 2.4
cgroups.py was using unsupported "except <Exception> as <var>" syntax.

generate_metrics.py was using the json module, which is not available
in Python 2.4; simplejson, which provides the same functionality, is
available there.
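
One common way to handle this is an import fallback. The sketch below is an
illustration of that pattern and is not necessarily how generate_metrics.py
was changed; the sample dictionary is made up.

```python
try:
    import json
except ImportError:
    # Older interpreters without the json module: use the third-party
    # simplejson package, which exposes the same dumps/loads functions.
    import simplejson as json

encoded = json.dumps({"name": "metric", "value": 1}, sort_keys=True)
```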

Change-Id: If2c176c15a9573dd2a2acf5ee459ff24ce891ce3
Reviewed-on: http://gerrit.cloudera.org:8080/396
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
2015-05-19 17:13:33 +00:00
Matthew Jacobs
38f0c3d046 Envvar for Impala test cluster base cgroup hierarchy
Allows the base cgroup hierarchy path used by the impala
test cluster to be specified with the environment variable
IMPALA_CGROUP_BASE_PATH. This is needed to support older
kernels that do not use the proper default cgroup path
and do not even support finding the hierarchy via mount.
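
The override can be sketched as an environment lookup with a fallback. The
default path and function name below are assumptions for illustration; the
actual default is determined in cgroups.py.

```python
import os

# Hypothetical default, used only when the variable is not set.
DEFAULT_CGROUP_BASE_PATH = "/sys/fs/cgroup"


def cgroup_base_path(environ=None):
    """Return IMPALA_CGROUP_BASE_PATH if set, otherwise the default."""
    env = os.environ if environ is None else environ
    return env.get("IMPALA_CGROUP_BASE_PATH", DEFAULT_CGROUP_BASE_PATH)
```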

This will be used in jenkins test runs with RM enabled
which run on Centos6 images.

Change-Id: I30984a58fbcf990410f75f7feb5c1d549afa6ddd
Reviewed-on: http://gerrit.cloudera.org:8080/397
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-05-19 02:12:03 +00:00
Matthew Jacobs
76529cdd8c Remove usage of context manager in cgroups.py
Context managers are not supported before Python 2.7. Removes the
use of the 'with' clause in cgroups.py because this code is
executed on Centos 6 packaging boxes with an older version of
Python.
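
The usual replacement for a 'with' block is an explicit try/finally. A
minimal sketch with a hypothetical helper (not the code in cgroups.py):

```python
def read_first_line(path):
    """Read the first line of a file without using a context manager."""
    f = open(path)
    try:
        return f.readline().rstrip("\n")
    finally:
        # Runs on both the normal and the exception path, like 'with'.
        f.close()
```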

Change-Id: Ic6bcf161086f671ec2010df16f9bb23534c57697
Reviewed-on: http://gerrit.cloudera.org:8080/385
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
2015-05-15 16:49:35 +00:00
Matthew Jacobs
cf0b6bc595 Add flag to easily enable Yarn and Llama in mini cluster
Adds a flag to start-impala-cluster.py (--enable_rm) to set up the
mini Impala cluster using Yarn and Llama. This hides a number of
flags that must be set on the impalads:
  -enable_rm
  -llama_addresses: set to the local llama service
  -fair_scheduler_allocation_path: set to the path of the fair-scheduler.xml
       in each node's hadoop conf directory
  -cgroup_hierarchy_path: set to a path in the CPU cgroup hierarchy which
       has the correct permissions for Impala to manage a child cgroup. The
       path comes from cgroups.py.

The new module cgroups.py was added to contain cgroups-related
utilities. Right now it provides paths to the CPU controller
hierarchy root and a path within the hierarchy that can be used
for impalads (i.e. have the proper permissions, one for each
cluster node).

Change-Id: Ic2181ec5613c180f240958c84f885c6b136a64d4
Reviewed-on: http://gerrit.cloudera.org:8080/369
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
2015-05-14 21:15:24 +00:00
Juan Yu
e121bc9b0a IMPALA-1476: Impala incorrectly handles text data missing a newline on the last line.
I did a local benchmark and there's minimal performance impact (<1%).

Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-03-20 19:58:50 -07:00
Skye Wanderman-Milne
60db4d4d82 CDH-18416: Don't inline ReadWriteUtil::ReadZLong()
For wide Avro tables, ReadZLong() would get inlined many times into a
single function body, causing LLVM to crash. Not inlining doesn't seem
to have a performance impact on narrow tables, and helps with wide
tables.

This change also adds tests over wide (i.e. many-column) tables. The
test tables are produced by specifying shell commands to generate test
tables in functional_schema_template.sql, which are executed in
generate-schema-statements.py. In the SQL templates, sections starting
with a ` are treated as shell commands. The output of the shell
command is then used as the section text. This is only a starting
point; it isn't currently implemented for all sections, and may have
to be tweaked if we use this mechanism for all tables.
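
The mechanism described above can be sketched as follows. The function name
is hypothetical and this is not the code in generate-schema-statements.py;
it only illustrates treating a section that starts with a backtick as a
shell command whose output replaces the section text.

```python
import subprocess


def eval_section(section_text):
    """Return the section text, or the output of its shell command."""
    stripped = section_text.lstrip()
    if not stripped.startswith("`"):
        return section_text
    command = stripped[1:].strip()
    # shell=True because the template holds a complete shell command line.
    return subprocess.check_output(
        command, shell=True, universal_newlines=True)
```

Because the command runs through the shell, this is only appropriate for
trusted template files.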

Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353
Tested-by: jenkins
2014-04-28 15:58:15 -07:00