impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
Joe McDonnell	1913ab46ed	IMPALA-14501: Migrate most scripts from impala-python to impala-python3 To remove the dependency on Python 2, existing scripts need to use python3 rather than python. These commands find those locations (for impala-python and regular python): git grep impala-python \| grep -v impala-python3 \| grep -v impala-python-common \| grep -v init-impala-python git grep bin/python \| grep -v python3 This removes or switches most of these locations by various means: 1. If a python file has a #!/bin/env impala-python (or python) but doesn't have a main function, it removes the hash-bang and makes sure that the file is not executable. 2. Most scripts can simply switch from impala-python to impala-python3 (or python to python3) with minimal changes. 3. The cm-api pypi package (which doesn't support Python 3) has been replaced by the cm-client pypi package and interfaces have changed. Rather than migrating the code (which hasn't been used in years), this deletes the old code and stops installing cm-api into the virtualenv. The code can be restored and revamped if there is any interest in interacting with CM clusters. 4. This switches tests/comparison over to impala-python3, but this code has bit-rotted. Some pieces can be run manually, but it can't be fully verified with Python 3. It shouldn't hold back the migration on its own. 5. This also replaces locations of impala-python in comments / documentation / READMEs. 6. kazoo (used for interacting with HBase) needed to be upgraded to a version that supports Python 3. The newest version of kazoo requires upgrades of other component versions, so this uses kazoo 2.8.0 to avoid needing other upgrades. The two remaining uses of impala-python are: - bin/cmake_aux/create_virtualenv.sh - bin/impala-env-versioned-python These will be removed separately when we drop Python 2 support completely. In particular, these are useful for testing impala-shell with Python 2 until we stop supporting Python 2 for impala-shell. 
The docker-based tests still use /usr/bin/python, but this can be switched over independently (and doesn't impact impala-python) Testing: - Ran core job - Ran build + dataload on Centos 7, Redhat 8 - Manual testing of individual scripts (except some bitrotted areas like the random query generator) Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc Reviewed-on: http://gerrit.cloudera.org:8080/23468 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2025-10-22 16:30:17 +00:00
Michael Smith	dced8ca27c	IMPALA-12217: Update cgroup-util to handle cgroups v2 RedHat 9 and Ubuntu 22 switch to cgroups v2, which has a different hierarchy than cgroups v1. Ubuntu 20 has a hybrid layout with both cgroup and cgroup2 mounted, but the cgroup2 functionality is limited. Updates cgroup-util to - identify available cgroups in FindCGroupMounts. Prefers v1 if available, as Ubuntu 20's hybrid layout provides only limited v2 interfaces. - refactors file reading to follow guidelines from https://gehrcke.de/2011/06/reading-files-in-c-using-ifstream-dealing-correctly-with-badbit-failbit-eofbit-and-perror/ for clearer error handling. Specifically, failbit doesn't set errno, but we were printing it anyway (which produced misleading errors). - updates FindCGroupMemLimit to read memory.max for cgroups v2. - updates DebugString to print the correct property based on cgroup version. Removes unused cgroups test library. Testing: - proc-info-test CGroupInfo.ErrorHandling test on RHEL 9 and Ubuntu 20. - verified no error messages related to reading cgroup present in logs on RHEL 9 and Ubuntu 20. Change-Id: I8dc499bd1b490970d30ed6dcd2d16d14ab41ee8c Reviewed-on: http://gerrit.cloudera.org:8080/20105 Reviewed-by: Yida Wu <wydbaggio000@gmail.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-06-23 01:07:12 +00:00
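The v1/v2 difference described in this commit can be sketched in Python. The actual change lives in Impala's C++ cgroup-util; the helper name and the `None`-means-unlimited convention below are illustrative assumptions. cgroups v1 exposes the limit in `memory.limit_in_bytes`, while v2 uses `memory.max`, which can hold the literal string `max` when no limit is set.

```python
# Hypothetical helper, not Impala's implementation.
# Maps the cgroup version to the file that holds the memory limit.
MEM_LIMIT_FILE = {1: "memory.limit_in_bytes", 2: "memory.max"}


def parse_cgroup_mem_limit(contents, cgroup_version):
    """Parse the text of a memory-limit file; None means "no limit"."""
    text = contents.strip()
    # cgroups v2 writes the word "max" instead of a number when unlimited.
    if cgroup_version == 2 and text == "max":
        return None
    return int(text)
```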
Joe McDonnell	0c7c6a335e	IMPALA-11977: Fix Python 3 broken imports and object model differences Python 3 changed some object model methods: - __nonzero__ was removed in favor of __bool__ - func_dict / func_name were removed in favor of __dict__ / __name__ - The next() function was deprecated in favor of __next__ (Code locations should use next(iter) rather than iter.next()) - metaclasses are specified a different way - Locations that specify __eq__ should also specify __hash__ Python 3 also moved some packages around (urllib2, Queue, httplib, etc), and this adapts the code to use the new locations (usually handled on Python 2 via future). This also fixes the code to avoid referencing exception variables outside the exception block and variables outside of a comprehension. Several of these seem like false positives, but it is better to avoid the warning. This fixes these pylint warnings: bad-python3-import eq-without-hash metaclass-assignment next-method-called nonzero-method exception-escape comprehension-escape Testing: - Ran core tests - Ran release exhaustive tests Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee Reviewed-on: http://gerrit.cloudera.org:8080/19592 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-03-09 17:17:57 +00:00
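A minimal sketch of the object-model changes listed in this commit, using a toy class (`Batch` is a hypothetical name, not code from the repository). It defines `__bool__` with a Python 2 alias, pairs `__eq__` with `__hash__`, and uses the `next()` builtin instead of calling `.next()` on the iterator.

```python
class Batch(object):
    """Toy container illustrating the Python 3 method names."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __bool__(self):
        # Python 3 truthiness hook.
        return len(self.rows) > 0

    # Python 2 looks for __nonzero__, so alias it to keep both working.
    __nonzero__ = __bool__

    def __eq__(self, other):
        return isinstance(other, Batch) and self.rows == other.rows

    def __hash__(self):
        # Defining __eq__ without __hash__ makes the class unhashable on Python 3.
        return hash(tuple(self.rows))


it = iter([1, 2, 3])
first = next(it)  # portable; it.next() only exists on Python 2
```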
Joe McDonnell	eb66d00f9f	IMPALA-11974: Fix lazy list operators for Python 3 compatibility Python 3 changes list operators such as range, map, and filter to be lazy. Some code that expects the list operators to happen immediately will fail. e.g. Python 2: range(0,5) == [0,1,2,3,4] True Python 3: range(0,5) == [0,1,2,3,4] False The fix is to wrap locations with list(). i.e. Python 3: list(range(0,5)) == [0,1,2,3,4] True Since the base operators are now lazy, Python 3 also removes the old lazy versions (e.g. xrange, ifilter, izip, etc). This uses future's builtins package to convert the code to the Python 3 behavior (i.e. xrange -> future's builtins.range). Most of the changes were done via these futurize fixes: - libfuturize.fixes.fix_xrange_with_import - lib2to3.fixes.fix_map - lib2to3.fixes.fix_filter This eliminates the pylint warnings: - xrange-builtin - range-builtin-not-iterating - map-builtin-not-iterating - zip-builtin-not-iterating - filter-builtin-not-iterating - reduce-builtin - deprecated-itertools-function Testing: - Ran core job Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f Reviewed-on: http://gerrit.cloudera.org:8080/19589 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
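The lazy-operator behavior described in this commit can be reproduced with a short sketch, assuming it runs on Python 3. The variable names are illustrative.

```python
# On Python 3, range() and map() return lazy objects rather than lists.
lazy_range_equals_list = (range(0, 5) == [0, 1, 2, 3, 4])        # False
wrapped_range_equals_list = (list(range(0, 5)) == [0, 1, 2, 3, 4])  # True

squares = map(lambda x: x * x, range(0, 5))
as_list = list(squares)  # consumes the iterator
again = list(squares)    # the iterator is already exhausted
```

The second `list(squares)` call shows why code that iterates a result twice needs the `list()` wrapper.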
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
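The division change in this commit can be illustrated with a small sketch. It assumes Python 3 semantics; on Python 2 the same results require `from __future__ import division` as the first statement of the module. The variable names are hypothetical.

```python
# Python 2 needs "from __future__ import division" at the top of the file
# to get the behavior shown here; Python 3 behaves this way by default.
total_rows = 7
num_files = 2

average = total_rows / num_files         # true division, yields a float
rows_per_file = total_rows // num_files  # integer division for counts and indices
```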
Joe McDonnell	2b550634d2	IMPALA-11952 (part 2): Fix print function syntax Python 3 now treats print as a function and requires the parentheses in invocation. print "Hello World!" is now: print("Hello World!") This fixes all locations to use the function invocation. This is more complicated when the output is being redirected to a file or when avoiding the usual newline. print >> sys.stderr , "Hello World!" is now: print("Hello World!", file=sys.stderr) To support this properly and guarantee equivalent behavior between Python 2 and Python 3, all files that use print now add this import: from __future__ import print_function This also fixes random flake8 issues that intersect with the changes. Testing: - check-python-syntax.sh shows no errors related to print Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351 Reviewed-on: http://gerrit.cloudera.org:8080/19552 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-02-28 17:11:50 +00:00
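A short sketch of the two harder cases this commit mentions, redirecting output and suppressing the newline. It assumes Python 3 (on Python 2, `from __future__ import print_function` would be needed at the top of the file); the helper names are hypothetical.

```python
import io
import sys


def log_error(message, stream=sys.stderr):
    # Replaces the Python 2 form: print >> sys.stderr, message
    print(message, file=stream)


def show_progress(step, stream=sys.stdout):
    # Replaces the Python 2 trailing-comma form that suppressed the newline.
    print(step, end="", file=stream)


# Capture the output in memory to show exactly what gets written.
buf = io.StringIO()
log_error("Hello World!", stream=buf)
show_progress("...", stream=buf)
captured = buf.getvalue()
```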
Joe McDonnell	c71de994b0	IMPALA-11952 (part 1): Fix except syntax Python 3 does not support this old except syntax: except Exception, e: Instead, it needs to be: except Exception as e: This uses impala-futurize to fix all locations of the old syntax. Testing: - The check-python-syntax.sh no longer shows errors for except syntax. Change-Id: I1737281a61fa159c8d91b7d4eea593177c0bd6c9 Reviewed-on: http://gerrit.cloudera.org:8080/19551 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-02-28 17:11:50 +00:00
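For reference, a minimal sketch of the `except ... as ...` form this commit switches to; `describe_failure` is a hypothetical function used only for illustration.

```python
def describe_failure(text):
    """Return "ok" if text parses as an integer, else a short error string."""
    try:
        int(text)
        return "ok"
    except ValueError as e:  # the old "except ValueError, e:" is a syntax error on Python 3
        # Use the exception inside the block; Python 3 unbinds "e" afterwards.
        return "bad value: {0}".format(e)
```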
Joe McDonnell	90ab610d34	Convert dataload hdfs copy commands to LOAD DATA statements The schema file allows specifying a commandline command in several of the sections (LOAD, DEPENDENT_LOAD, etc). These are executed by testdata/bin/generate-schema-statements.py when it is creating the SQL files that are later executed for dataload. A fair number of tables use this flexibility to execute hdfs mkdir and copy commands via the command line. Unfortunately, this is very inefficient. HDFS command line commands require spinning up a JVM and can take over one second per command. These commands are executed during a serial part of dataload, and they can be executed multiple times. In short, these commands are a significant slowdown for loading the functional tables. This converts the hdfs command line statements to equivalent Hive LOAD DATA LOCAL statements. These are doing the copy from an already running JVM, so they do not need JVM startup. They also run in the parallel part of dataload, speeding up the SQL generation part. This speeds up generate-schema-statements.py significantly. On the functional dataset, it saves 7 minutes. Before: time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f real 8m8.068s user 10m11.218s sys 0m44.932s After: time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f real 0m35.800s user 0m42.536s sys 0m5.210s This is currently a long pole in dataload, so it translates directly to an overall speedup of about 7 minutes. Testing: - Ran debug tests Change-Id: Icf17b85ff85618933716a80f1ccd6701b07f464c Reviewed-on: http://gerrit.cloudera.org:8080/15228 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-02-24 21:22:18 +00:00
Zachary Amsden	66704f915e	IMPALA-6068: Scale back fixing functional-types I re-created the original patch for IMPALA-6068, but only performed what I believe to be the limited legal transformation of data load: DEPENDENT_LOAD -> DEPENDENT_LOAD_HIVE. Any place that directly uploads via hadoop or hdfs commands was left alone as changing it can't be proven to be correct. Change-Id: I6c242cca209a7138b10ad517076707709b5cd204 Testing: Doing a full data load. I mistakenly changed a variable name causing the first two dry-runs to fail. Reviewed-on: http://gerrit.cloudera.org:8080/8690 Reviewed-by: Zach Amsden <zamsden@cloudera.com> Tested-by: Zach Amsden <zamsden@cloudera.com>	2017-12-04 23:46:44 +00:00
David Knupp	d1c9510001	Revert "IMPALA-6068: Fix dataload for complextypes_fileformat" This reverts commit `e4f585240a`. Among other things, that commit replaced hdfs command line calls with "LOAD DATA LOCAL INPATH" using Hive. However, doing so presumes that the minicluster is the only test environment. Sometimes, though, the data load script is run against a remote cluster, and in those cases the data load process is now broken. Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2 Reviewed-on: http://gerrit.cloudera.org:8080/8653 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Reviewed-by: Zach Amsden <zamsden@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-28 02:57:04 +00:00
Joe McDonnell	e4f585240a	IMPALA-6068: Fix dataload for complextypes_fileformat Dataload typically follows a pattern of loading data into a text version of a table, and then using an insert overwrite from the text table to populate the table for other file formats. This insert is always done in Impala for Parquet and Kudu. Otherwise it runs in Hive. Since Impala doesn't support writing nested data, the population of complextypes_fileformat tries to hack the insert to run in Hive by including it in the ALTER part of the table definition. ALTER runs immediately after CREATE and always runs in Hive. The problem is that ALTER also runs before the base table (functional.complextypes_fileformat) is populated. The insert succeeds, but it is inserting zero rows. This code change introduces a way to force the Parquet load to run using Hive. This lets complextypes_fileformat specify that the insert should happen in Hive and fixes the ordering so that the table is populated correctly. This is also useful for loading custom Parquet files into Parquet tables. Hive supports the LOAD DATA LOCAL syntax, which can read a file from the local filesystem. This means that several locations that currently use the hdfs commandline can be modified to use this SQL. This change speeds up dataload by a few minutes, as it avoids the overhead of the hdfs commandline. Any other location that could use LOAD DATA LOCAL is also switched over to use it. This includes the testescape* tables which now print the appropriate LOAD DATA commands as a result of text_delims_table.py. Any location that already uses LOAD DATA LOCAL is also switched to indicate that it must run in Hive. Any location that was doing an HDFS command in the LOAD section is moved to the DEPENDENT_LOAD_HIVE section. Testing: Ran dataload and core tests. Also verified that functional_parquet.complextypes_fileformat has rows. 
Change-Id: I7152306b2907198204a6d8d282a0bad561129b82 Reviewed-on: http://gerrit.cloudera.org:8080/8350 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2017-10-25 03:43:26 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt \| xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files_txt \| xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Casey Ching	d202d6a967	Use "impala-python" (virtualenv) instead of system python Python tests and infra scripts will now use "python" from the virtualenv via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now that python 2.6 and a dependable set of third-party libraries are available but that is not done as part of this commit. Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f Reviewed-on: http://gerrit.cloudera.org:8080/603 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-08-06 02:09:09 +00:00
Matthew Jacobs	f37682a16f	Fix packaging build for Python 2.4 cgroups.py was using unsupported "except <Exception> as <var>" syntax. generate_metrics.py was using the json module, which is not available in Python 2.4, but the environment contains simplejson, which provides the same functionality. Change-Id: If2c176c15a9573dd2a2acf5ee459ff24ce891ce3 Reviewed-on: http://gerrit.cloudera.org:8080/396 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Matthew Jacobs <mj@cloudera.com>	2015-05-19 17:13:33 +00:00
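One common way to handle the json/simplejson situation described in this commit is an import fallback. This is a sketch of that pattern, not necessarily what generate_metrics.py does; it assumes simplejson is installed wherever the standard json module is missing, and the metric name is made up.

```python
try:
    import json  # standard library on Python 2.6 and later
except ImportError:
    # Assumption: simplejson is available on the older interpreter.
    import simplejson as json

# Both modules expose the same dumps/loads interface.
encoded = json.dumps({"metric": "num-queries", "value": 3}, sort_keys=True)
decoded = json.loads(encoded)
```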
Matthew Jacobs	38f0c3d046	Envvar for Impala test cluster base cgroup hierarchy Allows the base cgroup hierarchy path used by the impala test cluster to be specified with the environment variable IMPALA_CGROUP_BASE_PATH. This is needed to support older kernels that do not use the proper default cgroup path and do not even support finding the hierarchy via mount. This will be used in jenkins test runs with RM enabled which run on Centos6 images. Change-Id: I30984a58fbcf990410f75f7feb5c1d549afa6ddd Reviewed-on: http://gerrit.cloudera.org:8080/397 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2015-05-19 02:12:03 +00:00
Matthew Jacobs	76529cdd8c	Remove usage of context manager in cgroups.py Context managers are not supported before Python 2.7. Removes the use of the 'with' clause in cgroups.py because this code is executed on Centos 6 packaging boxes with an older version of Python. Change-Id: Ic6bcf161086f671ec2010df16f9bb23534c57697 Reviewed-on: http://gerrit.cloudera.org:8080/385 Reviewed-by: Martin Grund <mgrund@cloudera.com> Tested-by: Matthew Jacobs <mj@cloudera.com>	2015-05-15 16:49:35 +00:00
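The replacement for a `with` block on an interpreter that lacks it is an explicit `try`/`finally`. The sketch below shows that pattern with a hypothetical helper; it is not the code from cgroups.py.

```python
import os
import tempfile


def read_first_line(path):
    """Read the first line of a file without using the 'with' statement."""
    f = open(path)
    try:
        return f.readline().rstrip("\n")
    finally:
        # Runs whether or not readline() raised, like the exit of a 'with' block.
        f.close()


# Usage: write a small temporary file and read it back.
fd, tmp_path = tempfile.mkstemp()
os.write(fd, b"cpu\nmemory\n")
os.close(fd)
first_line = read_first_line(tmp_path)
os.remove(tmp_path)
```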
Matthew Jacobs	cf0b6bc595	Add flag to easily enable Yarn and Llama in mini cluster Adds a flag to start-impala-cluster.py (--enable_rm) to set up the mini Impala cluster using Yarn and Llama. This hides a number of flags that must be set on the impalads: -enable_rm -llama_addressess: set to the local llama service -fair_scheduler_allocation_path: set to the path of the fair-scheduler.xml in each node's hadoop conf directory -cgroup_hierarchy_path: set to a path in the CPU cgroup hierarchy which has the correct permissions for Impala to manage a child cgroup. The path comes from cgroups.py. The new module cgroups.py was added to contain cgroups-related utilities. Right now it provides paths to the CPU controller hierarchy root and a path within the hierarchy that can be used for impalads (i.e. have the proper permissions, one for each cluster node). Change-Id: Ic2181ec5613c180f240958c84f885c6b136a64d4 Reviewed-on: http://gerrit.cloudera.org:8080/369 Tested-by: Internal Jenkins Reviewed-by: Matthew Jacobs <mj@cloudera.com>	2015-05-14 21:15:24 +00:00
Juan Yu	e121bc9b0a	IMPALA-1476: Impala incorrectly handles text data missing a newline on the last line. I did a local benchmark and there's minimal performance impact (<1%) Change-Id: I8d84a145acad886c52587258b27d33cff96ea399 (cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0) Reviewed-on: http://gerrit.cloudera.org:8080/189 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-03-20 19:58:50 -07:00
Skye Wanderman-Milne	60db4d4d82	CDH-18416: Don't inline ReadWriteUtil::ReadZLong() For wide Avro tables, ReadZLong() would get inlined many times into a single function body, causing LLVM to crash. Not inlining doesn't seem to have a performance impact on narrow tables, and helps with wide tables. This change also adds tests over wide (i.e. many-column) tables. The test tables are produced by specifying shell commands to generate test tables in functional_schema_template.sql, which are executed in generate-schema-statements.py. In the SQL templates, sections starting with a ` are treated as shell commands. The output of the shell command is then used as the section text. This is only a starting point; it isn't currently implemented for all sections, and may have to be tweaked if we use this mechanism for all tables. Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com> (cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353 Tested-by: jenkins	2014-04-28 15:58:15 -07:00
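The backtick mechanism described in this commit can be sketched roughly as follows. `resolve_section` is a hypothetical name, and the real logic in generate-schema-statements.py may differ in details such as quoting or error handling; this only shows the idea of substituting a shell command's output for the section text.

```python
import subprocess


def resolve_section(section_text):
    """Return the section text, running it as a shell command if it starts with `."""
    if section_text.startswith("`"):
        command = section_text[1:]
        # The command's standard output becomes the section text.
        return subprocess.check_output(
            command, shell=True, universal_newlines=True)
    return section_text
```

For example, a section such as "`echo a,b" would resolve to the text produced by `echo`, while a plain SQL section passes through unchanged.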

19 Commits