This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.
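For anyone unfamiliar with py.test: test discovery is name-based and a bare
assert is all you need. A minimal test module (illustrative only, not part of
this change) looks like:

    # test_example.py -- py.test collects any function named test_*
    def test_addition():
        # a failing assert is reported with full context
        assert 1 + 1 == 2

Run it with: py.test test_example.py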
As part of this change, we now use our "test vectors" for query test execution.
This is very nice because it means that if you load the "core" dataset, you
know you will be able to run the "core" query tests (selected via
--exploration_strategy when running the tests).
You will see that now each combination of table format + query exec options is
treated as an individual test case. This will make it much easier to debug
exactly where something failed.
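As a sketch of how py.test gets us there (parameter names and values below are
made up for illustration), stacked parametrize decorators yield one test case
per combination:

    import pytest

    # py.test generates one test per (table_format, batch_size) pair, so a
    # failure report names the exact combination that broke
    @pytest.mark.parametrize("table_format", ["text", "seq", "rc"])
    @pytest.mark.parametrize("batch_size", [0, 16])
    def test_query_vector(table_format, batch_size):
        # a real test would execute a query against Impala here
        assert table_format in ("text", "seq", "rc")
        assert batch_size >= 0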
These new tests can be run using the script at tests/run-tests.sh
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of unneeded sections from schema template definitions (e.g. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Rewritten compute stats script
ScanNode.keyRanges is an ArrayList that can contain null entries. The existing
HBase scan node did not check for that.
keyRanges will contain a null entry if
1. the row key is a string type and is referenced in the query, and
2. there is no predicate on the row key.
Fixes a bug in Planner.createHashJoinFragment(), which didn't set the left
child of the hash join node to the output of the left child fragment.
Also: the row descriptor was set incorrectly (too wide; it included tuples that
weren't materialized) for the roots of plan trees of non-root fragments when
those fragments materialized an aggregate.
- created new class PlanFragment, which encapsulates everything having to do with a single
plan fragment, including its partition, output exprs, destination node, etc.
- created new class DataPartition
- explicit classes for fragment and plan node ids, to avoid getting them mixed up, which is easy to do with ints
- added an IdGenerator class
- moved PlanNode.ExplainPlanLevel to Types.thrift, so it can also be used for
PlanFragment.getExplainString()
- Changed planner interface to return scan ranges with a complete list of server locations,
instead of making a server assignment.
Also included: cleaned up AggregateInfo:
- the 2nd phase of a DISTINCT aggregation is now captured separately from a merge aggregation.
- moved analysis functionality into AggregateInfo
Removing broken test cases from the functional-planner workload (they're handled correctly in functional-newplanner).
- verify that the exception message contains the correct error
- verify that the expected exception is thrown
- verify that no exception is thrown when abort_on_error is set to false
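In py.test form, those three checks look roughly like this (the error class
and query helper are hypothetical stand-ins for the real harness):

    import pytest

    class QueryExecutionError(Exception):
        pass

    def run_query(sql, abort_on_error):
        # stub: a real harness would execute sql against Impala
        if abort_on_error:
            raise QueryExecutionError("Could not parse row")

    def test_abort_on_error():
        # the expected exception is thrown, and its message contains the error
        with pytest.raises(QueryExecutionError) as exc_info:
            run_query("select bad_row", abort_on_error=True)
        assert "Could not parse row" in str(exc_info.value)
        # no exception when abort_on_error is set to false
        run_query("select bad_row", abort_on_error=False)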
- executor takes a report callback, passed in by ImpalaServer::FragmentExecState
- the PlanFragmentExecutor invokes the profile reporting callback in a background thread
- RuntimeProfile is now thread-safe and has a RuntimeProfile::Update()
Also included:
- a number of bug fixes related to async cancellation of queries
and propagation of errors through PlanFragmentExecutor/Coordinator/ImpalaServer.
- changed COUNTER_SCOPED_TIMER to SCOPED_TIMER
- derived counters: RuntimeProfile now lets you add counters that return a
value via a function call, which is useful for reporting something like
normalized ScanNode throughput; retrofitted to ScanNode and all subclasses
(see the sketch after this list)
- changed coordinator to make cancellation atomic wrt recognition of an error status
for the overall query.
- Removed InProcessQueryExecutor from data-stream-test.
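The real counter code is C++, but the derived-counter idea is easy to sketch
(this Python sketch is only a concept illustration, not the RuntimeProfile
API): the counter's value is computed on read by a supplied function instead
of being stored.

    import time

    class DerivedCounter:
        def __init__(self, fn):
            self._fn = fn  # value is computed on demand from this function

        def value(self):
            return self._fn()

    bytes_read = 0
    start = time.time()
    # e.g. normalized scan throughput: bytes read per elapsed second
    throughput = DerivedCounter(
        lambda: bytes_read / max(time.time() - start, 1e-9))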
Added aggregate throughput counters to coordinator:
- all throughput counters are grouped in a sub-profile "AggregateThroughput"
- each scan node gets its own counter
- the value is aggregated across all registered backends which contain that node in
their plan fragments
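Reusing the DerivedCounter sketch above (backend names made up), an aggregate
throughput counter is just a derived counter whose function sums the
per-backend values for one scan node:

    # one scan node's throughput, summed across all backends running it
    per_backend_bytes_per_sec = {"host-1": 120e6, "host-2": 95e6}
    aggregate_throughput = DerivedCounter(
        lambda: sum(per_backend_bytes_per_sec.values()))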
"distinctpc" and "distinctpcsa".
We've gathered statistics on an internal dataset (all columns) which is
part of our regression data. It's roughly 400mb, ~100 columns,
int/bigint/string type.
On Hive, it took roughly 64sec.
On this Impala implementation, it took 35sec. By adding inline to hash-util.h (which we don't),
we can achieve 24~26sec.
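For background (an inference from the names, not spelled out in this change):
distinctpc and distinctpcsa correspond to Flajolet-Martin probabilistic
counting (PC) and its stochastic-averaging variant (PCSA). A toy Python sketch
of the basic PC idea, not the backend implementation:

    import hashlib

    def fm_estimate(values):
        # set the bit at the position of each hash's lowest set bit
        bitmap = 0
        for v in values:
            h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
            if h == 0:
                continue  # vanishingly unlikely, but avoids a bad index
            r = (h & -h).bit_length() - 1  # position of lowest set bit
            bitmap |= 1 << r
        # the first unset position R yields the estimate 2^R / 0.77351
        R = 0
        while bitmap & (1 << R):
            R += 1
        return 2 ** R / 0.77351

    print(fm_estimate(range(10000)))  # rough, high-variance estimate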
Change-Id: Ibcba3c9512b49e8b9eb0c2fec59dfd27f14f84c3
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.
To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files; I've just moved
them. The one new file is 'hive-benchmark.test', which contains the Hive
benchmark queries.
Also added support for generating schemas for different scale factors, as well
as executing against these scale factors. For example, let's say we have a
dataset with a scale factor called "SF3". We would first generate the schema
using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names that are unique from the other scale
factors.
Run the generated .sql file to load the data. Alternatively, the data can be
loaded by running a new Python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3
Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6