19 Commits

Author SHA1 Message Date
Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
   doesn't have a main function, it removes the hash-bang and makes
   sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.
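
As an illustration of item 1, here is a minimal sketch of how one might
detect a file that carries a Python hash-bang but has no entry point. It
assumes "main function" means a top-level `if __name__ == "__main__":`
guard. The helper names are hypothetical and are not part of this change;
ast.parse() also only accepts source that is already valid Python 3.

```python
import ast


def has_python_shebang(first_line):
    """True for lines such as '#!/usr/bin/env impala-python'."""
    return first_line.startswith("#!") and "python" in first_line


def has_main_guard(source):
    """True if the module has a top-level `if __name__ == ...:` block."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.If) and "__name__" in ast.dump(node.test):
            return True
    return False


def shebang_is_unneeded(source):
    """Hypothetical check: a hash-bang is present but nothing can be run."""
    lines = source.splitlines()
    return bool(lines) and has_python_shebang(lines[0]) \
        and not has_main_guard(source)
```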

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).

Testing:
 - Ran core job
 - Ran build + dataload on CentOS 7, Red Hat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
Riza Suminto
9fc941b611 IMPALA-14327: Update load-data.py and run-workload.py to use HS2
load-data.py is used for data loading, while run-workload.py is used for
running perf-AB-test. This patch changes both scripts from the beeswax
protocol to the HS2 protocol.

Testing:
Ran data loading and perf-AB-test-ub2004 based on this patch.

Change-Id: I1c3727871b8b2e75c3f10ceabfbe9cb96e36ead3
Reviewed-on: http://gerrit.cloudera.org:8080/23309
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-20 07:20:29 +00:00
Riza Suminto
f533225915 IMPALA-13543: single_node_perf_run.py must accept tpcds_partitioned
The tpcds_partitioned dataset is a fully partitioned version of the tpcds
dataset (the latter only partitions the store_sales table). It does not
have a default text-format database like the tpcds dataset. Instead, it
relies on a pre-existing text-format tpcds database, which is then loaded
via INSERT OVERWRITE into the equivalent tpcds_partitioned database. It
does not have its own query set, but instead symlinks to
testdata/workloads/tpcds/queries. It also has a slightly different schema
from the tpcds dataset: the column "c_last_review_date" in tpcds is
"c_last_review_date_sk" in tpcds_partitioned (TPC-DS v2.11.0, section
2.4.7). These differences made tpcds_partitioned ineligible for
perf-AB-test (single_node_perf_run.py).

This patch updates single_node_perf_run.py and related scripts to make
tpcds_partitioned eligible as a benchmark dataset. It adds an initial
step that loads the text database from the tpcds dataset at the selected
scale before running the load script for the tpcds_partitioned dataset.
The compute stats step is also limited to run one query at a time, so
that concurrent compute stats queries do not over-admit the cluster.

Created the helper function build_replacement_params() inside
generate-schema-statements.py to hold the shared logic.

Testing:
- Ran perf-AB-test-ub2004 with this commit included and confirmed that
  the benchmark works with the tpcds_partitioned dataset.
- Ran normal data loading. Passed FE tests and
  query_test/test_tpcds_queries.py.

Change-Id: I4b6f435705dcf873696ffd151052ebeab35d9898
Reviewed-on: http://gerrit.cloudera.org:8080/22061
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-19 07:11:42 +00:00
Riza Suminto
cc63757c10 IMPALA-12838: Adds exec_options parameter to single_node_perf_run.py
This patch adds an exec_options parameter to single_node_perf_run.py to
allow running a single-node benchmark with custom query options for the
entire workload. The option is passed from single_node_perf_run.py to
run-workload.py. Some cleanup was also done to fix existing flake8 issues.

Testing:
Ran single_node_perf_run.py on my local machine as follows:

./bin/single_node_perf_run.py --num_impalads=1 --scale=10 \
  --exec_options=num_nodes:1 --workloads=tpcds --iterations=9 \
  --table_formats=parquet/none/none,orc/def \
  --query_names=TPCDS-Q_COUNT_OPTIMIZED \
  asf-master IMPALA-11123

Change-Id: I243b6c474eed84d6d66ae35917bdc80fc8c8d7a4
Reviewed-on: http://gerrit.cloudera.org:8080/21054
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-02-29 21:00:38 +00:00
Riza Suminto
667403b2cb IMPALA-12090: Split runtime profiles made by single_node_perf_run.py
single_node_perf_run.py produces a single text file containing all
runtime profiles from the perf run of one git hash. This is handy, but
the resulting text file can be very long, which makes it difficult to
analyze an individual profile.

This patch adds the --split_profiles and --no_split_profiles options to
single_node_perf_run.py. If --split_profiles is specified, the script
extracts each runtime profile into an individual file instead of a single
long text file. Specifying --no_split_profiles retains the old behavior
of putting all runtime profiles into a single combined text file. The
default is to split profiles if neither is specified. Files in the
profile directory look like this with --split_profiles:

$ ls -1 perf_results/latest/2267d9d104cc3fb0740cba09acb369b4d7ae4f52_profiles/
TPCDS-Q14-1_iter001.txt
TPCDS-Q14-1_iter002.txt
TPCDS-Q14-1_iter003.txt
TPCDS-Q14-2_iter001.txt
TPCDS-Q14-2_iter002.txt
TPCDS-Q14-2_iter003.txt
TPCDS-Q23-1_iter001.txt
TPCDS-Q23-1_iter002.txt
TPCDS-Q23-1_iter003.txt
TPCDS-Q23-2_iter001.txt
TPCDS-Q23-2_iter002.txt
TPCDS-Q23-2_iter003.txt
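
The naming pattern in the listing above (query name, then a zero-padded
iteration counter) can be sketched as follows. The helper name is
hypothetical and only reproduces the visible pattern; it is not taken
from the script itself.

```python
def profile_file_name(query_name, iteration):
    """Build a name such as 'TPCDS-Q14-1_iter001.txt'."""
    # Zero-pad the iteration to three digits so the files sort correctly.
    return "{0}_iter{1:03d}.txt".format(query_name, iteration)


for i in range(1, 4):
    print(profile_file_name("TPCDS-Q14-1", i))
```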

Testing:
- Manually ran the script with selected queries from the tpcds
  workload with either --split_profiles or --no_split_profiles.

Change-Id: Ibc2d3cefd7ad61b76cbef74c734543ef9ca51795
Reviewed-on: http://gerrit.cloudera.org:8080/19796
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-04-28 09:51:48 +00:00
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail. e.g.

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap locations with list(). i.e.

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
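
A related pitfall, shown here as a small Python 3 sketch rather than code
from this change: a lazy iterator can only be consumed once, so code that
walks the result twice or indexes into it also needs list().

```python
pairs = zip([1, 2, 3], "abc")
first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # the iterator is already exhausted
print(first_pass)   # [(1, 'a'), (2, 'b'), (3, 'c')]
print(second_pass)  # []

# Indexing needs a real list; a bare map object does not support it.
squares = list(map(lambda x: x * x, range(4)))
print(squares[2])   # 4
```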

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
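
To illustrate the division difference, here is a short sketch with
made-up variable names. It is written for Python 3; under Python 2 the
same results require "from __future__ import division" at the top of
the file.

```python
records = 7
batch_size = 2

# True division always produces a float.
print(records / batch_size)   # 3.5

# Integer (floor) division is needed for counts and indices.
print(records // batch_size)  # 3

items = list(range(records))
middle = items[len(items) // 2]  # a float index would raise TypeError
print(middle)                 # 3
```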

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
2b550634d2 IMPALA-11952 (part 2): Fix print function syntax
Python 3 now treats print as a function and requires
parentheses in the invocation.

print "Hello World!"
is now:
print("Hello World!")

This fixes all locations to use the function
invocation. This is more complicated when the output
is being redirected to a file or when avoiding the
usual newline.

print >> sys.stderr , "Hello World!"
is now:
print("Hello World!", file=sys.stderr)

To support this properly and guarantee equivalent behavior
between python 2 and python 3, all files that use print
now add this import:
from __future__ import print_function

This also fixes random flake8 issues that intersect with
the changes.
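
The "avoiding the usual newline" case is not shown above. As a sketch: in
Python 2 this was written with a trailing comma (print "Hello",), and the
function form uses the end keyword instead. The example below is
illustrative only and writes to an in-memory buffer so the result can be
inspected. It runs as-is on Python 3; Python 2 would need the
print_function import shown above.

```python
import io

buf = io.StringIO()
print("Loading", end="", file=buf)  # no newline after this call
print("...", end="", file=buf)      # still on the same line
print(" done", file=buf)            # default end="\n" finishes the line

print(repr(buf.getvalue()))         # 'Loading... done\n'
```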

Testing:
 - check-python-syntax.sh shows no errors related to print

Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351
Reviewed-on: http://gerrit.cloudera.org:8080/19552
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-28 17:11:50 +00:00
Joe McDonnell
7a26ff4b97 IMPALA-11379: Remove kerberos.egg-info directory
This directory is currently checked in, but it is
overwritten when building the shell. On some Linux
distributions, the output is different from what
is checked in. This causes problems for perf-AB-test
(based on bin/single_node_perf_run.py), which relies on
a build not causing any modifications.

This removes the kerberos.egg-info directory,
which does not need to be checked in.

This also adds checks to the GVO Jenkins jobs
to verify that the source tree is unmodified after
bootstrap_build.sh and bootstrap_development.sh.
These checks are not included in those scripts
directly, because developers can run those scripts
in their development environments, which may have
modifications.
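
One way to express an "unmodified source tree" check is sketched below.
This is an assumption about how such a check could look, not the actual
Jenkins job logic; it relies only on "git status --porcelain", which
prints one line per modified or untracked path and nothing for a clean
tree.

```python
import subprocess


def modified_paths(porcelain_output):
    """Extract paths from `git status --porcelain` (v1) output."""
    # Each line is two status characters, a space, then the path.
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def tree_is_clean(repo_dir):
    """Return True if git reports no modified or untracked files."""
    output = subprocess.check_output(
        ["git", "status", "--porcelain"], cwd=repo_dir,
        universal_newlines=True)
    return not modified_paths(output)
```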

Tests:
 - Uploaded a change without removing the kerberos.egg-info
   directory and verified that the new checks fail
 - Verified that perf-AB-test gets past the current issue

Change-Id: I90b486bb6c1644fc18b56779d6c54e1e1b3c9aaa
Reviewed-on: http://gerrit.cloudera.org:8080/18650
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-22 23:58:44 +00:00
Gergely Fürnstáhl
182617ee87 IMPALA-11113 and IMPALA-11114: fixed single_node_perf_run.py for TPCDS
Fixed the UTF-8 UnicodeDecodeError which was thrown while dumping and
loading the json file. Now the script ignores non-decodable characters.
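
A minimal sketch of the "ignore non-decodable characters" approach,
assuming the data arrives as raw bytes. The function name is
hypothetical and the script's actual code may differ.

```python
import json


def load_json_ignoring_bad_bytes(raw_bytes):
    """Decode as UTF-8, silently dropping invalid byte sequences."""
    text = raw_bytes.decode("utf-8", errors="ignore")
    return json.loads(text)


# \xff is never valid in UTF-8; \xc3\xa9 is a valid two-byte sequence.
raw = b'{"query": "caf\xff\xc3\xa9"}'
print(load_json_ignoring_bad_bytes(raw))
```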

Fixed the ZeroDivisionError raised by the t-test when the standard
deviations were 0. "(N/A) Invalid t-test type" is shown for significant
changes and a hint at the end if any invalid t-test was detected.
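
The failure mode can be sketched with Welch's t statistic, whose
denominator is zero when both standard deviations are zero. Which t-test
variant the report script uses is not stated here, so the formula and
the function name below are assumptions for illustration.

```python
import math


def welch_t_statistic(mean_a, stddev_a, n_a, mean_b, stddev_b, n_b):
    """Return Welch's t statistic, or None when it is undefined."""
    variance_of_diff = stddev_a ** 2 / n_a + stddev_b ** 2 / n_b
    if variance_of_diff == 0:
        # Both samples have zero spread: dividing would raise
        # ZeroDivisionError, so report the test as invalid instead.
        return None
    return (mean_a - mean_b) / math.sqrt(variance_of_diff)


print(welch_t_statistic(10.0, 0.0, 5, 12.0, 0.0, 5))  # None
```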

Change-Id: I094763188a1f3ddf40b7140c65acf95918a6597f
Reviewed-on: http://gerrit.cloudera.org:8080/18215
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2022-02-15 15:20:56 +00:00
Sahil Takiar
d3a2d73fda IMPALA-9439: Make --scale a mandatory option in single_node_perf_run.py
This makes the --scale option mandatory when running
./bin/single_node_perf_run.py. If the option is not set, the script
attempts to run the workloads against the database
'[workload-name]None_[file-format]', which is typically not what the
user wants.
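
The sketch below shows how an unset option can end up in a database name
as the literal text "None", and how marking the option as required
avoids it. It uses argparse and a hypothetical database_name() helper
for illustration; the script's own option parsing may be implemented
differently.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # required=True makes the parser exit with a usage message if absent.
    parser.add_argument("--scale", type=int, required=True,
                        help="scale factor for the workloads")
    return parser


def database_name(workload, scale, file_format):
    return "{0}{1}_{2}".format(workload, scale, file_format)


# With no value, None is formatted straight into the name.
print(database_name("tpcds", None, "parquet"))  # tpcdsNone_parquet

args = build_parser().parse_args(["--scale", "10"])
print(database_name("tpcds", args.scale, "parquet"))  # tpcds10_parquet
```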

Makes some minor documentation improvements to the script.

Testing:
* Confirmed that running without the --scale option set causes the
  script to error out with a help message

Change-Id: I9ad13580f8f74388981a37d6960087d95cde574b
Reviewed-on: http://gerrit.cloudera.org:8080/15335
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-02 22:35:31 +00:00
Tim Armstrong
112953c63b Add --impalad_args to single_node_perf_run.py
This is useful for benchmarking non-standard configurations,
e.g. with mt_dop enabled.

Testing:
Ran the script, confirmed manually that the arguments took effect.

  single_node_perf_run.py <other args> \
      --impalad_args=--default_query_options=mt_dop=4 \
      --impalad_args=--unlock_mt_dop=true
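
A repeatable flag like the one above can be collected with an "append"
action. This is a sketch using argparse, not the script's actual parsing
code; the --flag=value form is what lets a value that itself starts with
"--" be passed through.

```python
import argparse

parser = argparse.ArgumentParser()
# Each occurrence of --impalad_args adds one entry to the list.
parser.add_argument("--impalad_args", action="append", default=[])

args = parser.parse_args([
    "--impalad_args=--default_query_options=mt_dop=4",
    "--impalad_args=--unlock_mt_dop=true",
])

# Join the collected entries into one argument string.
combined = " ".join(args.impalad_args)
print(combined)
```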

Change-Id: Ib903f0eabb06a7e8981c874c8fe1cec0936b1a64
Reviewed-on: http://gerrit.cloudera.org:8080/14923
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jim Apple <jbapple@apache.org>
2019-12-22 08:57:32 +00:00
Tim Armstrong
23731ba90c Fix single_node_perf_run default num_impalads
The documentation claims that the default is 1, but it was actually
3.

Change-Id: Ia295ce0b0040e02b4fa8faafc0ac749e35b46c19
Reviewed-on: http://gerrit.cloudera.org:8080/14383
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-10-08 09:26:14 +00:00
Jim Apple
fa672909c8 IMPALA-8062: Call impala-config in single_node_perf_run
This wraps most shell calls in single_node_perf_run.py with a bash
shell that first sources impala-config.sh, to make sure environment
variables are set properly.
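
The wrapping idea can be sketched as follows. The function name, the
impala_home parameter, and the bin/impala-config.sh location are
assumptions for illustration; the resulting list is meant to be handed
to something like subprocess.check_call().

```python
import shlex


def wrap_with_impala_config(command, impala_home):
    """Build an argv that sources impala-config.sh before the command."""
    config = shlex.quote(impala_home + "/bin/impala-config.sh")
    # "&&" ensures the command only runs if sourcing succeeded.
    script = "source {0} && {1}".format(config, command)
    return ["bash", "-c", script]


print(wrap_with_impala_config("echo $IMPALA_HOME", "/opt/impala"))
```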

Change-Id: Ic7c1b77906a975c37f3b51a0f900ed3536b398ba
Reviewed-on: http://gerrit.cloudera.org:8080/12277
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-27 03:04:25 +00:00
njanarthanan
dcbff6bbbd IMPALA-7228: Add tpcds-unmodified to single-node-perf-run
Description:
tpcds-unmodified workload was added as a part of IMPALA-6819.
This change allows tpcds-unmodified workload to be available
for the single node perf run.

Testing:
Ran single node perf run using the following parameters and the
test run was successful

--iterations 2 --scale 2 --table_formats "parquet/none" \
--num_impalads 1 --workload "tpcds-unmodified" \
--load --query_names "TPCDS-Q17.*" --start_minicluster

Change-Id: I511661c586cd55e3240ccbea9c499b9c3fc98440
Reviewed-on: http://gerrit.cloudera.org:8080/10931
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-13 20:29:48 +00:00
Jim Apple
216642e28d IMPALA-6105: Clarify argument order in single_node_perf_run
single_node_perf_run.py uses git_hash_A vs. git_hash_B, distinguished
by their position in the command-line arguments.
single_node_perf_run.py calls report_benchmark_results.py, which uses
"reference vs. input", distinguished by their command-line flags. The
output of report_benchmark_results.py uses "{empty string} vs Base".

In the long run, I think it would be better to fix all three to use
the same terminology, but this comment hopefully adds clarity.

Change-Id: Ib236ce7e83dc193ef1382f6304444ce58759a639
Reviewed-on: http://gerrit.cloudera.org:8080/8470
Tested-by: Impala Public Jenkins
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
2017-11-07 16:16:09 +00:00
Jim Apple
01b5973c40 single_node_perf_run.py: clean up newly-added testdata
In single_node_perf_run.py, restore_workloads() can make the tree
"dirty", and when a tree is dirty, git won't let you switch branches
in a way that clobbers the dirty file contents:

    $ cd $(mktemp -d)
    $ git init .
    Initialized empty Git repository in /tmp/tmp.H0NxzTXLUj/.git/
    $ touch foo && git add foo && git commit -a -m "foo"
    [master (root-commit) 3776149] foo
     1 file changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 foo
    $ git checkout -b ok_foo && echo "ok" >> foo && git commit -a -m "foo is ok"
    Switched to a new branch 'ok_foo'
    [ok_foo 9fd5bde] foo is ok
     1 file changed, 1 insertion(+)
    $ git checkout master && echo "not ok" >> foo
    Switched to branch 'master'
    $ git checkout ok_foo
    error: Your local changes to the following files would be overwritten by checkout:
            foo
    Please, commit your changes or stash them before you can switch branches.
    Aborting

Discovered when testing single_node_perf_run with
https://gerrit.cloudera.org/#/c/7153/; after this commit, that patch
works with single_node_perf_run.py

Change-Id: Id0220f3cd7a26d2627e40cd432c23815a6d65ea4
Reviewed-on: http://gerrit.cloudera.org:8080/7291
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-07-11 00:12:24 +00:00
Jim Apple
de9f5230eb IMPALA-5482: fix git checkout when workloads are modified
When git checkout would overwrite changes, it fails and alerts the
user to do something with the changes. This patch removes any changes
to files induced by the workload copy-and-paste.

Testing: using a patch provided by Lars Volker that touched
testdata/workloads/ (https://gerrit.cloudera.org/#/c/7073/), I was
able to reproduce the problem he saw and see that this patch fixed it.

Change-Id: I9a0d004c353eb4b547aeaf3c56289594326653d7
Reviewed-on: http://gerrit.cloudera.org:8080/7145
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-11 18:20:22 +00:00
Jim Apple
07a7138817 Add a script to test performance on a developer machine
This is a migration from an old and broken script from another
repository. Example use:

    bin/single_node_perf_run.py --ninja --workloads targeted-perf \
      --load --scale 4 --iterations 20 --num_impalads 3 \
      --start_minicluster --query_names PERF_AGG-Q3 \
      $(git rev-parse HEAD~1) $(git rev-parse HEAD)

The script can load data, run benchmarks, and compare the statistics
of those runs for significant differences in performance. It glues
together buildall.sh, bin/load-data.py, bin/run-workload.py, and
tests/benchmark/report_benchmark_results.py.

Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9
Reviewed-on: http://gerrit.cloudera.org:8080/6818
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-05-31 08:10:48 +00:00