Commit Graph

29 Commits

Author SHA1 Message Date
ishaan
3bed0be1df Refactor the performance framework and change its execution strategy.
This patch introduces new abstractions and changes the way queries are run via the
workload runner. A new class 'Workload' is introduced, which represents the notion of a
workload in the performance framework (i.e., a set of query names mapped to query
strings).

The new workflow is:
 - run-workload acts as a driver. It accepts user parameters for which queries to
   run and their execution strategy. It generates workload objects and passes them to the
   workload-runner.
 - The workload runner takes a workload, its execution parameters and generates a set of
   test vectors over which the workload is run iteratively.
 - A workload is executed by initializing a QueryExecutor for each query being run in a
   test vector. The workload executor is then responsible for execution and gathering
   results.
 - The execution details of every query being executed are stored and returned to the
   driver (run-workload).
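
The workflow above can be sketched roughly as follows. This is a minimal illustration,
not the code from the patch: only the names Workload and QueryExecutor come from the
commit message. The fields, the run_workload function, and the exec_fn callback are
assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Workload:
    """A set of query names mapped to query strings (per the commit message)."""
    name: str
    queries: Dict[str, str] = field(default_factory=dict)


@dataclass
class QueryExecutor:
    """Hypothetical executor: runs one query and records its timings."""
    query_name: str
    query_string: str
    exec_fn: Callable[[str], object]  # assumed callback that actually runs the query

    def run(self, iterations: int) -> dict:
        timings = []
        for _ in range(iterations):
            start = time.perf_counter()
            self.exec_fn(self.query_string)
            timings.append(time.perf_counter() - start)
        return {"query": self.query_name, "timings": timings}


def run_workload(workload, test_vectors, exec_fn, iterations=1):
    """Hypothetical workload runner: one QueryExecutor per query per test vector."""
    results = []
    for vector in test_vectors:
        for name, sql in workload.queries.items():
            result = QueryExecutor(name, sql, exec_fn).run(iterations)
            result["vector"] = vector
            results.append(result)
    return results  # handed back to the driver (run-workload)
```

In this sketch the driver would build Workload objects from the user's parameters and
pass each one to run_workload together with the test vectors to iterate over.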

Change-Id: Ia16360140d65e6733e534e823bc5d5614622ab5f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3616
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
2014-07-25 18:17:11 -07:00
ishaan
7e520f8f23 Make workload runner logging more concise and readable.
This patch makes the workload runner's logging concise and more informative. Specifically,
it
 - logs the time taken for each iteration of a query.
 - changes the default log level to INFO.
 - makes the output less verbose.
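
A minimal sketch of per-iteration timing at the INFO level, using Python's standard
logging module. The logger name, the message format, and the run_iterations helper are
assumptions for illustration and are not taken from the patch.

```python
import logging
import time

# INFO as the default level, as described in the commit message.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")
LOG = logging.getLogger("workload_runner")  # logger name is an assumption


def run_iterations(query_name, exec_fn, iterations):
    """Run exec_fn `iterations` times and log the time taken by each run."""
    timings = []
    for i in range(1, iterations + 1):
        start = time.perf_counter()
        exec_fn()
        elapsed = time.perf_counter() - start
        timings.append(elapsed)
        LOG.info("%s: iteration %d of %d took %.3fs", query_name, i, iterations, elapsed)
    return timings
```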

Change-Id: I5f964cf76269fd64ce127b9e4c51fe1deafd1d1b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1076
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:54:35 -08:00
ishaan
0cb16863ee run-workload should log a warning to console and not fail if abort_on_query_error is False and the
query fails.

This change also disables printing the runtime_profile to the console.
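
The behaviour described above might look roughly like this. Only the
abort_on_query_error name comes from the commit message; the execute_queries helper,
the exec_fn callback, and the logger name are hypothetical.

```python
import logging

LOG = logging.getLogger("run_workload")  # logger name is an assumption


def execute_queries(queries, exec_fn, abort_on_query_error=True):
    """Run each query; on failure either re-raise or log a warning and continue."""
    results = {}
    for name, sql in queries.items():
        try:
            results[name] = exec_fn(sql)
        except Exception as e:
            if abort_on_query_error:
                raise
            # Warn on the console and keep going instead of failing the whole run.
            LOG.warning("Query %s failed, continuing: %s", name, e)
            results[name] = None
    return results
```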

Change-Id: Ic7bc3406d6eddb67a514ecfb4a27add8c40a8604
Reviewed-on: http://gerrit.ent.cloudera.com:8080/687
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:25 -08:00
ishaan
aa530ce11d Change the order of fields stored in the benchmark results to fix performance comparisons.
Change-Id: I7b7ebd711adfe9a44cba92b55d35ef8dd97eba60
Reviewed-on: http://gerrit.ent.cloudera.com:8080/584
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:12 -08:00
ishaan
565d15579c Add the ability to use a workload as the unit of execution in the Impala benchmark runner.
At the moment, a query is the default unit of execution and parallelism in the Impala
performance suite. With this change, we now have the ability to treat a workload as the
unit of execution. A workload is defined as a unique combination of the dataset, scale
factor, a subset (or all) of the queries in the dataset, and a table format (file format,
compression codec and compression scheme).

It introduces two new command line options in bin/run-workload.py:
  * --execution_scope
    The default scope is 'query', and it maintains previous semantics. The
    new scope is 'workload', which toggles the unit of execution to a workload.
  * --shuffle_query_exec_order
    Shuffles the order in which queries are executed (only applicable when the
    execution_scope is workload), defaults to False.
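
A sketch of how the two options could interact. The option names and defaults come from
the commit message; argparse is used here only for illustration (the commit does not
say which parser bin/run-workload.py uses), and the query_order helper is hypothetical.

```python
import argparse
import random


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Illustrative option parsing only")
    parser.add_argument("--execution_scope", choices=["query", "workload"], default="query")
    parser.add_argument("--shuffle_query_exec_order", action="store_true", default=False)
    return parser.parse_args(argv)


def query_order(query_names, options, rng=random):
    """Return the execution order; shuffling only applies to the 'workload' scope."""
    order = list(query_names)
    if options.execution_scope == "workload" and options.shuffle_query_exec_order:
        rng.shuffle(order)
    return order
```

With the default 'query' scope the shuffle flag has no effect, which matches the note
that it is only applicable when the execution scope is workload.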

Change-Id: I790d75f0896210cda8eb999015b0be04246e4c45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/503
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:07 -08:00
ishaan
a6cb5f70a4 Introduce the notion of scope to the plugin framework.
Change-Id: I2cf39c38e7e0a359950d9e05e2daed433fc0c38f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/144
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:05 -08:00
ishaan
6f9569ea6f Add a plugin framework for run-workload 2014-01-08 10:51:39 -08:00
Lenni Kuff
b877240ffd Increase the 'field_size_limit' for csv reader/writer used in benchmark result processing 2014-01-08 10:49:04 -08:00
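
For reference, Python's csv module rejects fields longer than its field size limit
(131072 characters by default), and csv.field_size_limit raises that limit. The sketch
below is a generic illustration, not the code from this commit, and the sample data is
made up.

```python
import csv
import io
import sys

# Raise the per-field limit so very large fields can be read.
# sys.maxsize can overflow the underlying C long on some platforms,
# so halve the value until it is accepted.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit //= 2

big_field = "x" * 200_000  # larger than the default limit
rows = list(csv.reader(io.StringIO("q1," + big_field + "\n")))
```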
Lenni Kuff
129c6473b9 Update run-workload to gather, display, and archive runtime profiles for each workload query 2014-01-08 10:48:46 -08:00
Lenni Kuff
d3b9de2222 Add support for detecting significant performance changes (regression and improvements) 2014-01-08 10:48:19 -08:00
Lenni Kuff
4d876e01d0 Add support for storing details on number of concurrent clients in backend perf db 2014-01-08 10:47:55 -08:00
Lenni Kuff
1a2695781d Add support for targeting JDBC via run-workload and add Impala Jdbc Client tool 2014-01-08 10:47:29 -08:00
Lenni Kuff
d5177c3c30 Update run-workload to support specifying table format test vectors from command line 2014-01-08 10:47:20 -08:00
Lenni Kuff
9953224f68 Cleanup IMPALA_HOME/bin directory
Deleted some old files and moved some files out of the bin directory into better locations
2014-01-08 10:46:55 -08:00
Lenni Kuff
97380676d8 Fixed case sensitivity issue with exec options in the test Impala Beeswax client code
Also added support for executing run-workload in a mode that continues after query errors
2014-01-08 10:46:50 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if you load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).

You will see that now each combination of table format + query exec options is
treated like an individual test case. This will make it much easier to debug
exactly where something failed.
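
As a rough sketch of the idea, test vectors can be generated as the cross product of
table formats and exec options, with a readable id per combination. The values and
helper names below are illustrative assumptions, not the framework's actual code. Lists
like these could be passed to pytest.mark.parametrize (with ids) so that each
combination is reported as its own test case.

```python
import itertools

TABLE_FORMATS = ["text/none", "seq/snap"]               # illustrative values
EXEC_OPTIONS = [{"batch_size": 0}, {"batch_size": 16}]  # illustrative values


def generate_test_vectors(table_formats, exec_options):
    """Return one (table_format, exec_option) pair per combination."""
    return list(itertools.product(table_formats, exec_options))


def vector_id(vector):
    """Build a readable test-case id so a failure names the exact combination."""
    table_format, options = vector
    opts = ",".join("%s=%s" % kv for kv in sorted(options.items()))
    return "%s[%s]" % (table_format, opts)
```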

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
ishaan
c5ddba4296 run-workload + Kerberos 2014-01-08 10:46:29 -08:00
Lenni Kuff
9f91081183 Modify TPCH tests to always insert into text table so workload can run on all file formats 2014-01-08 10:46:21 -08:00
Lenni Kuff
5e91fc8ff8 Fix $TABLE table suffix replacement for workloads that don't have tables in a database 2014-01-08 10:46:17 -08:00
Lenni Kuff
b3fce13b1d Initial Impala failure testing library + modularize run-workload
This adds initial changes for the Impala failure testing library. It also refactors
run-workload into its own module so it can be used in other tests.

The failure testing has two main components - the first is an object model on top
of Impala services in a cluster. This allows for enumerating the services in the cluster
and executing commands on remote machines. This initial cut is built on top of the
CM service to help with starting/stopping services. The long term goal is to let this run
on both a CM cluster and non-CM cluster as well as locally.

The other part of the failure injection change is the failure_injector module, which uses the
Impala service abstraction to select and inject failures into random impala services.

This failure testing framework hasn't been completely validated because the product code
is not yet ready, but it is important to get this checked in so all new changes to
run-workload are based off this refactor.

Change-Id: I73bf44f0ac881ec17bea7cb05d850b45e2ea5be5
2014-01-08 10:46:16 -08:00
Lenni Kuff
25edaae9d7 Enable running specific query name(s) + log exec results before completion 2014-01-08 10:46:16 -08:00
Lenni Kuff
231b66f37f A few small fixes
Queries now return rows on both our small (query test) data set as well as the 10TB
data set. This change also fixes a problem with python not being set properly and
adds support for reporting query results using the geometric mean
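
For reference, the geometric mean of n positive values is the n-th root of their
product. A minimal sketch follows; the function name is an assumption and this is not
the reporting code from the commit.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logs to avoid overflow."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean requires a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

Compared with the arithmetic mean, the geometric mean is less dominated by a few very
slow queries, which is a common reason for using it in benchmark summaries.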

Change-Id: Ia432148d96645ecda3f63900b3bfbd29c706d886
2014-01-08 10:46:15 -08:00
ishaan
3ec95e3226 Enable run-workload to run with beeswax. 2014-01-08 10:46:14 -08:00
Lenni Kuff
5f72b34faa Additional changes to run-workload for flexible query execution, filtering of file formats
This change cleans up run-workload to push more query execution logic into query_executor.
It also adds a new feature to run-workload to support filtering of the file format / compression
to run on.
2014-01-08 10:45:08 -08:00
Lenni Kuff
1b4f318bf2 Update run-workload to facilitate beeswax execution and support saving of partial results
This change updates run-workload to provide a more generic interface for query
execution. Now the query executor just takes an execution function and a new
QueryExecOptions object that defines the values to use for execution.
I also made a change to store partial result sets so we can salvage some work if
a run fails.
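
A sketch of the interface described above, under stated assumptions: QueryExecOptions
is named in the commit message, but its iterations field, the run_queries helper, and
the save_row callback are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryExecOptions:
    """Execution settings; the single field here is an illustrative assumption."""
    iterations: int = 1


def run_queries(queries, exec_fn, options, save_row):
    """Run each query through exec_fn and hand every result to save_row immediately."""
    for name, sql in queries.items():
        for i in range(options.iterations):
            elapsed = exec_fn(sql, options)  # assumed to return elapsed seconds
            # Persisting right away means completed rows survive a later failure.
            save_row([name, i, elapsed])
```

Because the executor only sees an execution function, the same loop can drive different
backends. In practice save_row could be the writerow method of a csv.writer opened on
the results file.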
2014-01-08 10:45:06 -08:00
Lenni Kuff
7d595ba740 Update run-workload result reporting to make reference result comparison more flexible
Now we save Hive results into a separate file (previously everything was stored
in the same file). Also added the ability to do a run-benchmark that skips
Impala, which will help generate Hive reference results.

Updated the reporting script to reflect this change.
2014-01-08 10:44:50 -08:00
ishaan
4c84cdae51 Handle queries with '%', python does not parse it properly. 2014-01-08 10:44:44 -08:00
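
The commit does not say how the fix works. As general background, a literal '%' in a
string that later goes through Python's %-formatting is read as the start of a
conversion specifier, and doubling it is the usual escape. The helper and the query
below are illustrative assumptions only.

```python
def escape_percent(query):
    """Double every '%' so it survives a later %-formatting pass as a literal."""
    return query.replace("%", "%%")


raw = "select count(*) from t where name like '%abc%'"
# Without escaping, formatting this template would fail on the '%' in the LIKE pattern.
template = "use %s; " + escape_percent(raw)
statement = template % "tpch"
```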
Lenni Kuff
58001240d5 Improve performance reporting and add support for running multiple workloads with different scale factors
This improves the summary reporting for perf results, fixes a problem with how the short query names were being
stored, and also adds support for running multiple workloads of different scale factors.
2014-01-08 10:44:41 -08:00
Lenni Kuff
aa60a59188 Add support for executing multiple workload queries in parallel
This change adds a -num_clients flag that specifies the number of clients
(threads) to use when executing each query in a workload. This is used to
validate Impala concurrency/stress. The logging was getting messed up with
multiple threads so I also updated this to use the logger module.

Currently we only capture and save the results of the first thread that
executes. In the future we might want to update this to capture results from all
the threads.
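
A rough sketch of the approach, assuming one thread per client and the standard logging
module (which serializes writes from multiple threads). The function name, logger name,
and the way the first result is kept are assumptions, not the code from this change.

```python
import logging
import threading

logging.basicConfig(level=logging.INFO, format="%(threadName)s %(levelname)s: %(message)s")
LOG = logging.getLogger("run_workload")  # logger name is an assumption


def run_with_clients(exec_fn, sql, num_clients):
    """Run the same query from num_clients threads; keep only the first recorded result."""
    first_result = []
    lock = threading.Lock()

    def client():
        result = exec_fn(sql)
        LOG.info("finished query")
        with lock:
            if not first_result:  # later threads' results are discarded
                first_result.append(result)

    threads = [threading.Thread(target=client, name="client-%d" % i)
               for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return first_result[0]
```

The sketch assumes num_clients is at least 1. Capturing results from all threads, as
the message suggests for the future, would mean appending every result instead of only
the first.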
2014-01-08 10:44:40 -08:00