impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
Joe McDonnell	1913ab46ed	IMPALA-14501: Migrate most scripts from impala-python to impala-python3 To remove the dependency on Python 2, existing scripts need to use python3 rather than python. These commands find those locations (for impala-python and regular python): git grep impala-python \| grep -v impala-python3 \| grep -v impala-python-common \| grep -v init-impala-python git grep bin/python \| grep -v python3 This removes or switches most of these locations by various means: 1. If a python file has a #!/bin/env impala-python (or python) but doesn't have a main function, it removes the hash-bang and makes sure that the file is not executable. 2. Most scripts can simply switch from impala-python to impala-python3 (or python to python3) with minimal changes. 3. The cm-api pypi package (which doesn't support Python 3) has been replaced by the cm-client pypi package and interfaces have changed. Rather than migrating the code (which hasn't been used in years), this deletes the old code and stops installing cm-api into the virtualenv. The code can be restored and revamped if there is any interest in interacting with CM clusters. 4. This switches tests/comparison over to impala-python3, but this code has bit-rotted. Some pieces can be run manually, but it can't be fully verified with Python 3. It shouldn't hold back the migration on its own. 5. This also replaces locations of impala-python in comments / documentation / READMEs. 6. kazoo (used for interacting with HBase) needed to be upgraded to a version that supports Python 3. The newest version of kazoo requires upgrades of other component versions, so this uses kazoo 2.8.0 to avoid needing other upgrades. The two remaining uses of impala-python are: - bin/cmake_aux/create_virtualenv.sh - bin/impala-env-versioned-python These will be removed separately when we drop Python 2 support completely. In particular, these are useful for testing impala-shell with Python 2 until we stop supporting Python 2 for impala-shell. The docker-based tests still use /usr/bin/python, but this can be switched over independently (and doesn't impact impala-python) Testing: - Ran core job - Ran build + dataload on Centos 7, Redhat 8 - Manual testing of individual scripts (except some bitrotted areas like the random query generator) Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc Reviewed-on: http://gerrit.cloudera.org:8080/23468 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2025-10-22 16:30:17 +00:00
Riza Suminto	9fc941b611	IMPALA-14327: Update load-data.py and run-workload.py to use HS2 load-data.py is used for dataloading while run-workload.py is used for running perf-AB-test. This patch changes the scripts from using the beeswax protocol to the HS2 protocol. Testing: Ran data loading and perf-AB-test-ub2004 based on this patch. Change-Id: I1c3727871b8b2e75c3f10ceabfbe9cb96e36ead3 Reviewed-on: http://gerrit.cloudera.org:8080/23309 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-08-20 07:20:29 +00:00
Riza Suminto	f533225915	IMPALA-13543: single_node_perf_run.py must accept tpcds_partitioned The tpcds_partitioned dataset is a fully partitioned version of the tpcds dataset (the latter only partitions the store_sales table). It does not have a default text-format database like the tpcds dataset. Instead, it relies on the pre-existence of a text-format tpcds database, from which data is loaded into the equivalent tpcds_partitioned database via INSERT OVERWRITE. It does not have its own query set, but is instead symlinked to share testdata/workloads/tpcds/queries. It also has a slightly different schema from the tpcds dataset, namely column "c_last_review_date" in the tpcds dataset is "c_last_review_date_sk" in tpcds_partitioned (TPC-DS v2.11.0, section 2.4.7). These reasons made tpcds_partitioned ineligible for perf-AB-test (single_node_perf_run.py). This patch updates single_node_perf_run.py and related scripts to make tpcds_partitioned eligible as a benchmark dataset. It adds an initial step to load the text database from the tpcds dataset with the selected scale before running the load script for the tpcds_partitioned dataset. The compute stats step is also limited to run one at a time so as not to overadmit the cluster with concurrent compute stats queries. Created the helper function build_replacement_params() inside generate-schema-statements.py for common functionality. Testing - Ran perf-AB-test-ub2004 with this commit included and confirmed that the benchmark works with the tpcds_partitioned dataset. - Ran normal data loading. Passed FE tests and query_test/test_tpcds_queries.py. Change-Id: I4b6f435705dcf873696ffd151052ebeab35d9898 Reviewed-on: http://gerrit.cloudera.org:8080/22061 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-11-19 07:11:42 +00:00
Riza Suminto	cc63757c10	IMPALA-12838: Adds exec_options parameter to single_node_perf_run.py This patch adds an exec_options parameter to single_node_perf_run.py to allow running the single node benchmark with custom query options for the entire workload. The option is passed from single_node_perf_run.py to run-workload.py. Some cleanup was also done to fix existing flake8 issues. Testing: Ran single_node_perf_run.py on my local machine as follows: ./bin/single_node_perf_run.py --num_impalads=1 --scale=10 \ --exec_options=num_nodes:1 --workloads=tpcds --iterations=9 \ --table_formats=parquet/none/none,orc/def \ --query_names=TPCDS-Q_COUNT_OPTIMIZED \ asf-master IMPALA-11123 Change-Id: I243b6c474eed84d6d66ae35917bdc80fc8c8d7a4 Reviewed-on: http://gerrit.cloudera.org:8080/21054 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-02-29 21:00:38 +00:00
Riza Suminto	667403b2cb	IMPALA-12090: Split runtime profiles made by single_node_perf_run.py single_node_perf_run.py produces a single text file containing all runtime profiles from the perf run for one git hash. This is handy, but the resulting text file can be very long, which makes it difficult to analyze an individual profile. This patch adds the --split_profiles and --no_split_profiles options to single_node_perf_run.py. If --split_profiles is specified, it will extract runtime profiles into individual files instead of a single long text file. Specifying --no_split_profiles will retain the old behavior of putting runtime profiles into a single combined text file. The default is to split profiles if neither is specified. Files in the profile directory will look like this with --split_profiles: $ ls -1 perf_results/latest/2267d9d104cc3fb0740cba09acb369b4d7ae4f52_profiles/ TPCDS-Q14-1_iter001.txt TPCDS-Q14-1_iter002.txt TPCDS-Q14-1_iter003.txt TPCDS-Q14-2_iter001.txt TPCDS-Q14-2_iter002.txt TPCDS-Q14-2_iter003.txt TPCDS-Q23-1_iter001.txt TPCDS-Q23-1_iter002.txt TPCDS-Q23-1_iter003.txt TPCDS-Q23-2_iter001.txt TPCDS-Q23-2_iter002.txt TPCDS-Q23-2_iter003.txt Testing: - Manually ran the script with selected queries from the tpcds workload with either --split_profiles or --no_split_profiles. Change-Id: Ibc2d3cefd7ad61b76cbef74c734543ef9ca51795 Reviewed-on: http://gerrit.cloudera.org:8080/19796 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-04-28 09:51:48 +00:00
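The per-profile file names listed in IMPALA-12090 appear to follow a `<query name>_iter<three-digit iteration>.txt` pattern. A minimal sketch of that naming scheme follows, assuming a hypothetical helper `profile_filename`; it is not taken from single_node_perf_run.py and only mirrors the listing in the commit message.

```python
def profile_filename(query_name, iteration):
    """Build a per-profile file name such as TPCDS-Q14-1_iter001.txt.

    Hypothetical helper: the pattern is inferred from the directory
    listing in the commit message, not from the script itself.
    """
    return "{0}_iter{1:03d}.txt".format(query_name, iteration)


# Three iterations of one query, zero-padded to three digits.
names = [profile_filename("TPCDS-Q14-1", i) for i in range(1, 4)]
```

Zero-padding the iteration number keeps the files in run order under a plain `ls -1`.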
Joe McDonnell	eb66d00f9f	IMPALA-11974: Fix lazy list operators for Python 3 compatibility Python 3 changes list operators such as range, map, and filter to be lazy. Some code that expects the list operators to happen immediately will fail. e.g. Python 2: range(0,5) == [0,1,2,3,4] True Python 3: range(0,5) == [0,1,2,3,4] False The fix is to wrap locations with list(). i.e. Python 3: list(range(0,5)) == [0,1,2,3,4] True Since the base operators are now lazy, Python 3 also removes the old lazy versions (e.g. xrange, ifilter, izip, etc). This uses future's builtins package to convert the code to the Python 3 behavior (i.e. xrange -> future's builtins.range). Most of the changes were done via these futurize fixes: - libfuturize.fixes.fix_xrange_with_import - lib2to3.fixes.fix_map - lib2to3.fixes.fix_filter This eliminates the pylint warnings: - xrange-builtin - range-builtin-not-iterating - map-builtin-not-iterating - zip-builtin-not-iterating - filter-builtin-not-iterating - reduce-builtin - deprecated-itertools-function Testing: - Ran core job Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f Reviewed-on: http://gerrit.cloudera.org:8080/19589 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
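The lazy-operator behavior described in IMPALA-11974 can be reproduced in a few lines of Python 3. This is an illustrative sketch, not code from the patch; the variable names are made up for the example.

```python
# Python 3: range, map, and filter return lazy objects, not lists.
lazy_range = range(0, 5)
range_equals_list = (lazy_range == [0, 1, 2, 3, 4])          # a range never equals a list
wrapped_equals_list = (list(lazy_range) == [0, 1, 2, 3, 4])  # list() materializes it

# map() returns a one-shot iterator, so a second pass sees nothing.
squares = map(lambda x: x * x, [1, 2, 3])
first_pass = list(squares)
second_pass = list(squares)

# filter() is lazy as well; wrap it when a real list is needed.
evens = list(filter(lambda x: x % 2 == 0, range(6)))
```

The exhausted-iterator case is the subtle one: code that iterates a `map` or `filter` result twice fails silently rather than raising an error, which is why wrapping with `list()` is the safe default.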
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
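The division change described in IMPALA-11973 can be illustrated as follows. The snippet shows Python 3 semantics; in Python 2 the same behavior for `/` requires `from __future__ import division` as the first statement of the file. The data is made up for the example.

```python
# Python 3 semantics. Python 2 behaves the same way for '/' once
# "from __future__ import division" is the first statement of the file.
records = ["a", "b", "c", "d", "e", "f", "g"]

true_quotient = len(records) / 2   # '/' is true division and yields a float
middle_index = len(records) // 2   # '//' floors, keeping an int usable as an index
middle_record = records[middle_index]
```

Indices and record counts are the places where `//` is needed, since a float cannot be used to index a list.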
Joe McDonnell	2b550634d2	IMPALA-11952 (part 2): Fix print function syntax Python 3 now treats print as a function and requires the parenthesis in invocation. print "Hello World!" is now: print("Hello World!") This fixes all locations to use the function invocation. This is more complicated when the output is being redirected to a file or when avoiding the usual newline. print >> sys.stderr , "Hello World!" is now: print("Hello World!", file=sys.stderr) To support this properly and guarantee equivalent behavior between python 2 and python 3, all files that use print now add this import: from __future__ import print_function This also fixes random flake8 issues that intersect with the changes. Testing: - check-python-syntax.sh shows no errors related to print Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351 Reviewed-on: http://gerrit.cloudera.org:8080/19552 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-02-28 17:11:50 +00:00
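The two harder cases named in IMPALA-11952, redirecting output to a file object and suppressing the trailing newline, both map onto keyword arguments of the print function. A small sketch under Python 3 follows; `report` is a hypothetical helper, and in Python 2 the file would also need `from __future__ import print_function` as its first statement.

```python
import io
import sys


def report(message, stream=sys.stdout, newline=True):
    """Write message to stream, optionally without the trailing newline.

    Hypothetical helper for illustration only.
    """
    # file= replaces the old "print >> stream, ..." form;
    # end="" replaces the old trailing-comma trick.
    print(message, file=stream, end="\n" if newline else "")


buffer = io.StringIO()
report("Hello World!", stream=buffer)
report("no newline", stream=buffer, newline=False)
captured = buffer.getvalue()
```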
Joe McDonnell	7a26ff4b97	IMPALA-11379: Remove kerberos.egg-info directory This directory is currently checked in, but it is overwritten when building the shell. On some Linux distributions, the output is different from what is checked in. This causes problems for perf-AB-test (based on bin/single_node_perf_run.py), which relies on a build not causing any modifications. This removes the kerberos.egg-info directory, which does not need to be checked in. This also adds checks to the GVO Jenkins jobs to verify that the source tree is unmodified after bootstrap_build.sh and bootstrap_development.sh. These checks are not included in those scripts directly, because developers can run those scripts in their development environments, which may have modifications. Tests: - Uploaded a change without removing the kerberos.egg-info directory and verified that the new checks fail - Verified that perf-AB-test gets past the current issue Change-Id: I90b486bb6c1644fc18b56779d6c54e1e1b3c9aaa Reviewed-on: http://gerrit.cloudera.org:8080/18650 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-06-22 23:58:44 +00:00
Gergely Fürnstáhl	182617ee87	IMPALA-11113 and IMPALA-11114: fixed single_node_perf_run.py for TPCDS Fixed the UTF-8 UnicodeDecodeError which was thrown while dumping and loading the json file. Now the script ignores non-decodable characters. Fixed the ZeroDivisionError coming from the t-test when the standard deviations were 0. "(N/A) Invalid t-test type" is shown for significant changes and a hint at the end if any invalid t-test was detected. Change-Id: I094763188a1f3ddf40b7140c65acf95918a6597f Reviewed-on: http://gerrit.cloudera.org:8080/18215 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>	2022-02-15 15:20:56 +00:00
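The division-by-zero case in IMPALA-11114 arises because a t statistic divides by a term built from the sample standard deviations. A hedged sketch of one way to guard against it follows, using Welch's t statistic as a stand-in; the function name and the choice of Welch's formula are assumptions, and the actual benchmark report code may compute its test differently.

```python
import math


def welch_t_statistic(mean_a, stddev_a, n_a, mean_b, stddev_b, n_b):
    """Return Welch's t statistic, or None when it is undefined.

    Hypothetical helper: when both standard deviations are 0 the
    denominator is 0, so the caller gets None instead of an exception
    and can report the comparison as not applicable.
    """
    variance_term = (stddev_a ** 2) / n_a + (stddev_b ** 2) / n_b
    if variance_term == 0:
        return None
    return (mean_a - mean_b) / math.sqrt(variance_term)


undefined = welch_t_statistic(10.0, 0.0, 5, 12.0, 0.0, 5)
defined = welch_t_statistic(10.0, 2.0, 4, 8.0, 2.0, 4)
```

Returning a sentinel rather than raising lets the report print an "N/A" entry for that query and continue with the rest of the results.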
Sahil Takiar	d3a2d73fda	IMPALA-9439: Make --scale a mandatory option in single_node_perf_run.py This makes the --scale option mandatory when running ./bin/single_node_perf_run.py. If the option is not set, the script attempts to run the workloads against the database '[workload-name]None_[file-format]', which is typically not what the user wants. Makes some minor documentation improvements to the script. Testing: * Confirmed that running without the --scale option set causes the script to error out with a help message Change-Id: I9ad13580f8f74388981a37d6960087d95cde574b Reviewed-on: http://gerrit.cloudera.org:8080/15335 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-03-02 22:35:31 +00:00
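The odd database name mentioned in IMPALA-9439 is what string formatting produces when the scale is left as `None`. A minimal sketch follows, assuming a hypothetical `database_name` helper and using `argparse` for illustration; the real script's option parsing may differ.

```python
import argparse


def database_name(workload, scale, file_format):
    """Compose '[workload-name][scale]_[file-format]' (hypothetical helper)."""
    return "{0}{1}_{2}".format(workload, scale, file_format)


# With no scale set, None is formatted straight into the name.
unset = database_name("tpch", None, "parquet")

# Marking the option as required makes the parser reject a missing --scale.
parser = argparse.ArgumentParser()
parser.add_argument("--scale", type=int, required=True)
args = parser.parse_args(["--scale", "10"])
name = database_name("tpcds", args.scale, "parquet")
```

With `required=True`, calling `parser.parse_args([])` prints a usage message and exits instead of silently building a name containing `None`.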
Tim Armstrong	112953c63b	Add --impalad_args to single_node_perf_run.py This is useful for benchmarking non-standard configurations, e.g. with mt_dop enabled. Testing: Ran the script, confirmed manually that the arguments took effect. single_node_perf_run.py <other args> \ --impalad_args=--default_query_options=mt_dop=4 \ --impalad_args=--unlock_mt_dop=true Change-Id: Ib903f0eabb06a7e8981c874c8fe1cec0936b1a64 Reviewed-on: http://gerrit.cloudera.org:8080/14923 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Jim Apple <jbapple@apache.org>	2019-12-22 08:57:32 +00:00
Tim Armstrong	23731ba90c	Fix single_node_perf_run default num_impalads The documentation claims that the default is 1, but it was actually 3. Change-Id: Ia295ce0b0040e02b4fa8faafc0ac749e35b46c19 Reviewed-on: http://gerrit.cloudera.org:8080/14383 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-10-08 09:26:14 +00:00
Jim Apple	fa672909c8	IMPALA-8062: Call impala-config in single_node_perf_run This wraps most shell calls in single_node_perf_run.py with a bash shell that first sources impala-config.sh, to make sure environment variables are set properly. Change-Id: Ic7c1b77906a975c37f3b51a0f900ed3536b398ba Reviewed-on: http://gerrit.cloudera.org:8080/12277 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-01-27 03:04:25 +00:00
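The wrapping described in IMPALA-8062 can be sketched as building a `bash -c` command that sources the config script before running the real command. The helper name and the `bin/impala-config.sh` location under the checkout root are assumptions for illustration; this is not the code from the patch.

```python
import shlex


def wrap_with_impala_config(command, impala_home):
    """Return an argv list that sources impala-config.sh, then runs command.

    Hypothetical helper. `command` is passed through as a shell string,
    so it must already be safe to hand to bash.
    """
    config = shlex.quote(impala_home + "/bin/impala-config.sh")
    return ["bash", "-c", "source {0} && {1}".format(config, command)]


argv = wrap_with_impala_config("echo hi", "/opt/impala")
```

The resulting list can be passed to `subprocess.check_call(argv)`, so that every shell call sees the environment variables the config script exports.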
njanarthanan	dcbff6bbbd	IMPALA-7228: Add tpcds-unmodified to single-node-perf-run Description: tpcds-unmodified workload was added as a part of IMPALA-6819. This change allows tpcds-unmodified workload to be available for the single node perf run. Testing: Ran single node perf run using the following parameters and the test run was successful --iterations 2 --scale 2 --table_formats "parquet/none" \ --num_impalads 1 --workload "tpcds-unmodified" \ --load --query_names "TPCDS-Q17.*" --start_minicluster Change-Id: I511661c586cd55e3240ccbea9c499b9c3fc98440 Reviewed-on: http://gerrit.cloudera.org:8080/10931 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-07-13 20:29:48 +00:00
Jim Apple	216642e28d	IMPALA-6105: Clarify argument order in single_node_perf_run single_node_perf_run.py uses git_hash_A vs. git_hash_B, distinguishing them by their position in the command-line arguments. single_node_perf_run.py calls report_benchmark_results.py, which uses "reference vs. input", distinguished by their command-line flags. The output of report_benchmark_results.py uses "{empty string} vs Base". In the long run, I think it would be better to fix all three to use the same terminology, but this comment hopefully adds clarity. Change-Id: Ib236ce7e83dc193ef1382f6304444ce58759a639 Reviewed-on: http://gerrit.cloudera.org:8080/8470 Tested-by: Impala Public Jenkins Reviewed-by: Jim Apple <jbapple-impala@apache.org>	2017-11-07 16:16:09 +00:00
Jim Apple	01b5973c40	single_node_perf_run.py: clean up newly-added testdata In single_node_perf_run.py, restore_workloads() can make the tree "dirty", and when a tree is dirty, git won't let you switch branches in a way that clobbers the dirty file contents: $ cd $(mktemp -d) $ git init . Initialized empty Git repository in /tmp/tmp.H0NxzTXLUj/.git/ $ touch foo && git add foo && git commit -a -m "foo" [master (root-commit) 3776149] foo 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 foo $ git checkout -b ok_foo && echo "ok" >> foo && git commit -a -m "foo is ok" Switched to a new branch 'ok_foo' [ok_foo 9fd5bde] foo is ok 1 file changed, 1 insertion(+) $ git checkout master && echo "not ok" >> foo Switched to branch 'master' $ git checkout ok_foo error: Your local changes to the following files would be overwritten by checkout: foo Please, commit your changes or stash them before you can switch branches. Aborting Discovered when testing single_node_perf_run with https://gerrit.cloudera.org/#/c/7153/; after this commit, that patch works with single_node_perf_run.py Change-Id: Id0220f3cd7a26d2627e40cd432c23815a6d65ea4 Reviewed-on: http://gerrit.cloudera.org:8080/7291 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-07-11 00:12:24 +00:00
Jim Apple	de9f5230eb	IMPALA-5482: fix git checkout when workloads are modified When git checkout would overwrite changes, it fails and alerts the user to do something with the changes. This patch removes any changes to files induced by the workload copy-and-paste. Testing: using a patch provided by Lars Volker that touched testdata/workloads/ (https://gerrit.cloudera.org/#/c/7073/), I was able to reproduce the problem he saw and see that this patch fixed it. Change-Id: I9a0d004c353eb4b547aeaf3c56289594326653d7 Reviewed-on: http://gerrit.cloudera.org:8080/7145 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-11 18:20:22 +00:00
Jim Apple	07a7138817	Add a script to test performance on a developer machine This is a migration from an old and broken script from another repository. Example use: bin/single_node_perf_run.py --ninja --workloads targeted-perf \ --load --scale 4 --iterations 20 --num_impalads 3 \ --start_minicluster --query_names PERF_AGG-Q3 \ $(git rev-parse HEAD~1) $(git rev-parse HEAD) The script can load data, run benchmarks, and compare the statistics of those runs for significant differences in performance. It glues together buildall.sh, bin/load-data.py, bin/run-workload.py, and tests/benchmark/report_benchmark_results.py. Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9 Reviewed-on: http://gerrit.cloudera.org:8080/6818 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-05-31 08:10:48 +00:00

19 Commits