Commit Graph

297 Commits

Author SHA1 Message Date
Lenni Kuff
5f9cd044ee Add scanner test suite that runs across all file format/compression permutations 2014-01-08 10:48:25 -08:00
ishaan
5138a720bb IMP-768: Enable the python test framework to check for insert results. 2014-01-08 10:48:22 -08:00
Henry Robinson
222d15c6ca IMPALA-72: String partition keys should be URL encoded 2014-01-08 10:48:20 -08:00
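
IMPALA-72 above is about percent-encoding partition key values so that characters such as '/' cannot be confused with path separators in the partition directory name. A minimal Python sketch of the idea, not Impala's actual code (the encode_partition_value helper is hypothetical):

    from urllib.parse import quote, unquote

    def encode_partition_value(value):
        # Hypothetical helper, for illustration only: percent-encode everything
        # except unreserved characters so '/' or '=' in the value cannot be
        # mistaken for path or key/value separators.
        return quote(value, safe='')

    print(encode_partition_value("2014/01/08"))  # 2014%2F01%2F08
    print(unquote("2014%2F01%2F08"))             # 2014/01/08
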
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
d2e4776731 Support passing snapshot file to buildall, add script to run all tests, remove old tests 2014-01-08 10:47:59 -08:00
Lenni Kuff
1896701399 IMPALA-44: Database names are case sensitive 2014-01-08 10:47:34 -08:00
Lenni Kuff
9d981984e7 Update expected results of the 'show table/database' test to remove trevni tables 2014-01-08 10:47:10 -08:00
Lenni Kuff
12d18631e3 Test enhancements: dynamic table format data loading, per-workload exploration strategies 2014-01-08 10:47:07 -08:00
Lenni Kuff
c806738af2 Add scan range length tests to Python test framework 2014-01-08 10:47:06 -08:00
Lenni Kuff
30dbf59ef2 Final changes to enable Python test infrastructure and tests
With this change the Python tests are now run as part of buildall and
the corresponding Java tests have been disabled. The new tests can also be
invoked by calling ./tests/run-tests.sh directly.

This includes a fix from Nong for a bug that caused wrong results for LIMIT on
non-IO-manager formats.
2014-01-08 10:46:57 -08:00
Nong Li
fbfef4e22e Fix crash in TopN node with null tuples. 2014-01-08 10:46:54 -08:00
Lenni Kuff
837f35eab3 Updated results for more query tests to reflect proper ordering + improved result updating 2014-01-08 10:46:53 -08:00
Lenni Kuff
bed633c1ae Extract config/metastore creation from buildall + script for loading warehouse snapshot 2014-01-08 10:46:53 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means that if you load the "core" dataset, you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).

You will see that now each combination of table format + query exec options is
treated as an individual test case. This will make it much easier to debug
exactly where something failed.

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
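
A rough sketch of the parametrization idea described above, using py.test. The table formats, exec options, and fixture-free test body here are illustrative placeholders, not the framework's actual API:

    import itertools
    import pytest

    TABLE_FORMATS = ["text/none", "seq/snap", "rc/gzip"]    # placeholder values
    EXEC_OPTIONS = [{"batch_size": 0}, {"batch_size": 1}]   # placeholder values

    # Each (table format, exec options) combination becomes its own test case,
    # which makes it easy to see exactly which combination failed.
    VECTORS = list(itertools.product(TABLE_FORMATS, EXEC_OPTIONS))

    @pytest.mark.parametrize("table_format,exec_options", VECTORS)
    def test_simple_scan(table_format, exec_options):
        # The real framework runs the queries from a .test file against Impala;
        # here we only show the shape of the parametrization.
        assert table_format and isinstance(exec_options, dict)
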
Lenni Kuff
1e25c98fb4 Test data loading framework improvements
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of unneeded sections from schema template definitions (e.g. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script
2014-01-08 10:46:49 -08:00
Nong Li
b4dc3eeb35 Fix IMP-575 2014-01-08 10:46:45 -08:00
Nong Li
34879a4ddc Fix IMP-297 2014-01-08 10:46:44 -08:00
Nong Li
b22b565a92 Fix codegen for min/max of bool col. 2014-01-08 10:46:43 -08:00
Alan Choi
a5a9ccf8c2 IMP-550 short-circuit queries with limit 0
The Impala server examines the plan. If the first fragment's top plan node has a "limit 0",
the query is set to EOS immediately.
2014-01-08 10:46:41 -08:00
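
The limit-0 short circuit described above can be summarized in a few lines; this is a conceptual Python sketch with stand-in classes, not the actual C++ server code:

    class PlanNode:
        def __init__(self, limit=None, children=()):
            self.limit = limit
            self.children = list(children)

    def should_short_circuit(first_fragment_root):
        # If the top plan node of the first fragment carries "limit 0", the
        # query can be marked EOS without starting any fragments.
        return first_fragment_root.limit == 0

    root = PlanNode(limit=0)
    if should_short_circuit(root):
        print("query reaches EOS immediately")
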
Alan Choi
dfe7690add IMP-522 Fix null pointer exception in HBase query
The ScanNode.keyRanges is an array list that can contain null. The existing HBase scan node
did not check for that.

keyRanges contains null if
1. the row key is a string type and is referenced in the query, and
2. there is no predicate on the row key.
2014-01-08 10:46:36 -08:00
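
The fix amounts to tolerating null entries in keyRanges when building HBase scans. A conceptual Python sketch with placeholder names, not how the real scan node is written:

    # A None entry means "no bound on the row key", i.e. no row-key predicate.
    def build_scans(key_ranges):
        scans = []
        for key_range in key_ranges:
            if key_range is None:
                scans.append(("", ""))  # empty start/stop row => full key space
            else:
                scans.append((key_range.start_key, key_range.stop_key))
        return scans

    print(build_scans([None]))  # [('', '')] instead of a null pointer exception
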
Marcel Kornacker
2fda5d9b99 IMP-491
Fixes bug in Planner.createHashJoinFragment(), which didn't set the left child of the
hj node to the output of the left child fragment.

Also: the row descriptor was set incorrectly (too wide; it included tuples that weren't
materialized) for the roots of plan trees of non-root fragments if those fragments
materialized an aggregate.
2014-01-08 10:46:33 -08:00
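
Schematically, the IMP-491 fix is about wiring the hash join's left input to whatever node is now the root (output) of the left child fragment rather than to the original child node. A toy Python sketch with hypothetical classes, not the planner's real Java code:

    class PlanFragment:
        def __init__(self, root):
            self.root = root  # the plan node producing this fragment's output

    def create_hash_join_fragment(hj_node, left_fragment, right_fragment):
        # The fix: the join's left child must be the output of the left child
        # fragment, not the plan node it originally pointed at.
        hj_node.children[0] = left_fragment.root
        hj_node.children[1] = right_fragment.root
        return PlanFragment(root=hj_node)
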
Michael Ubell
0750384b41 IMP-497 Insert with limit, remove extra files from test. 2014-01-08 10:46:33 -08:00
Michael Ubell
116241f1d1 IMP-497 Insert with limit. 2014-01-08 10:46:33 -08:00
Michael Ubell
7536510b69 IMP-258 Test writing nulls. 2014-01-08 10:46:31 -08:00
Alan Choi
595edaa9d1 Disable all string to numeric and boolean implicit cast 2014-01-08 10:46:24 -08:00
Marcel Kornacker
ea050a43ad Switching over backend runtime structures to new planner.
Added container-util.h
2014-01-08 10:46:20 -08:00
Michael Ubell
477422beda IMP-380 handle '\r' at end of row. 2014-01-08 10:46:14 -08:00
Henry Robinson
3519701529 Support backtick quoting for identifiers 2014-01-08 10:46:00 -08:00
Henry Robinson
91c3b979ca IMP-370: SHOW TABLES IN support and IMP-363: SHOW DATABASES
Change-Id: Ic41c4b0767a0480f0a18e1e985f25de3bc2ca947
2014-01-08 10:45:59 -08:00
Henry Robinson
540673763f Add session key handling to ThriftServer, and session support to the frontend 2014-01-08 10:45:59 -08:00
Marcel Kornacker
5984c0be52 First cut of partitioned plan generation:
- created new class PlanFragment, which encapsulates everything having to do with a single
  plan fragment, including its partition, output exprs, destination node, etc.
- created new class DataPartition
- explicit classes for fragment and plan node ids, to avoid getting them mixed up, which is easy to do with ints
- added an IdGenerator class.
- moved PlanNode.ExplainPlanLevel to Types.thrift, so it can also be used for
  PlanFragment.getExplainString()
- Changed planner interface to return scan ranges with a complete list of server locations,
  instead of making a server assignment.

Also included: cleaned up AggregateInfo:
- the 2nd phase of a DISTINCT aggregation is now captured separately from a merge aggregation.
- moved analysis functionality into AggregateInfo

Removing broken test cases from workload functional-planner (they're being handled correctly in functional-newplanner).
2014-01-08 10:45:56 -08:00
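
The point of the explicit id classes above is that plan-node ids and fragment ids stop being interchangeable bare ints. A small Python illustration of the pattern (the real IdGenerator and id types are Java frontend classes):

    import itertools

    class PlanNodeId(int):
        pass

    class PlanFragmentId(int):
        pass

    class IdGenerator:
        def __init__(self, id_type):
            self._id_type = id_type
            self._counter = itertools.count()

        def next_id(self):
            return self._id_type(next(self._counter))

    node_ids = IdGenerator(PlanNodeId)
    fragment_ids = IdGenerator(PlanFragmentId)
    nid, fid = node_ids.next_id(), fragment_ids.next_id()
    # Code that expects a PlanNodeId can now assert on the type instead of
    # trusting that a bare int was generated by the right counter.
    assert isinstance(nid, PlanNodeId) and not isinstance(nid, PlanFragmentId)
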
Michael Ubell
5f951ffc4a Handle missing columns at the end of a row 2014-01-08 10:45:11 -08:00
Henry Robinson
e7348a209b IMP-232: Parallel INSERT OVERWRITE 2014-01-08 10:45:04 -08:00
Henry Robinson
e3e6ba984b Show / describe 2014-01-08 10:44:49 -08:00
Alan Choi
22765fc33a IMP-251: re-enable DataErrorTest
verify that the exception message contains the correct error;
verify that the expected exception is thrown;
verify that no exception is thrown when abort_on_error is set to false
2014-01-08 10:44:45 -08:00
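
Expressed as a py.test-style sketch of the three checks above (DataErrorTest itself is a JUnit test; the impala fixture, query text, and error message below are hypothetical placeholders):

    import pytest

    def test_bad_data_aborts_with_useful_message(impala):
        with pytest.raises(Exception) as exc_info:
            impala.run_query("select * from bad_data_table", abort_on_error=True)
        # The exception message must contain the correct error text.
        assert "Error converting column" in str(exc_info.value)

    def test_bad_data_is_skipped_when_not_aborting(impala):
        # With abort_on_error=false the query must complete without raising.
        impala.run_query("select * from bad_data_table", abort_on_error=False)
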
Marcel Kornacker
7725f25ff5 This combines changes related to periodic reporting of plan fragment exec profiles:
- the executor takes a report callback, passed in by ImpalaServer::FragmentExecState
- the PlanFragmentExecutor invokes the profile reporting callback in a background thread.
- RuntimeProfile is now thread-safe and has a RuntimeProfile::Update()

Also included:
- a number of bug fixes related to async cancellation of a query
  and propagation of errors through PlanFragmentExecutor/Coordinator/ImpalaServer.
- changing COUNTER_SCOPED_TIMER to SCOPED_TIMER
- derived counters: RuntimeProfile now lets you add counters that return a
  value via a function call, which is useful for reporting something like normalized
  ScanNode throughput; retrofitted to ScanNode and all subclasses
- changed coordinator to make cancellation atomic wrt recognition of an error status
  for the overall query.
- Removed InProcessQueryExecutor from data-stream-test.

Added aggregate throughput counters to coordinator:
- all throughput counters are grouped in a sub-profile "AggregateThroughput"
- each scan node gets its own counter
- the value is aggregated across all registered backends which contain that node in
  their plan fragments
2014-01-08 10:44:42 -08:00
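
A derived counter is simply a counter whose value is computed by a function each time it is read. A Python sketch of the idea (the real RuntimeProfile is C++ and also handles thread safety, which is omitted here; the counter names are illustrative):

    class RuntimeProfileSketch:
        def __init__(self):
            self._counters = {}   # name -> stored value
            self._derived = {}    # name -> zero-argument callable

        def add_counter(self, name, value=0):
            self._counters[name] = value

        def add_derived_counter(self, name, fn):
            # The counter's value is produced by fn() at read time.
            self._derived[name] = fn

        def value(self, name):
            if name in self._derived:
                return self._derived[name]()
            return self._counters[name]

    profile = RuntimeProfileSketch()
    profile.add_counter("BytesRead", 10 * 1024 * 1024)
    profile.add_counter("ScannerWallClockNanos", 2 * 10**9)
    # Normalized throughput (bytes/sec) reported as a derived counter.
    profile.add_derived_counter(
        "ReadThroughput",
        lambda: profile.value("BytesRead") / (profile.value("ScannerWallClockNanos") / 1e9))
    print(profile.value("ReadThroughput"))  # 5242880.0
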
Nong Li
4c9c82910a Text parser fix for columns off end. 2014-01-08 10:44:40 -08:00
Nong Li
4d0319d32b Fix null string parsing. 2014-01-08 10:44:40 -08:00
Alan Choi
dd1537d116 IMP-132: collect unique agg expr 2014-01-08 10:44:39 -08:00
Nong Li
81bba16dac Parallel scanners. 2014-01-08 10:44:38 -08:00
Alan Choi
9ac664f1f7 Fix IMP-239: text_converter_->WriteSlot returns true when it's ok
QueryTest and HBaseQueryTest set AbortOnError to false except for the expected error case
2014-01-08 10:44:37 -08:00
Henry Robinson
c472213eeb Parallel INSERT, sink-per-scan-node plan 2014-01-08 10:44:35 -08:00
Alexander Behm
ee705e3083 Added timestamp arithmetic expressions. 2014-01-08 10:44:31 -08:00
Alan Choi
f15ef994fb "mvn test" now uses impalad and beeswax api to submit query and fetch, including
insert query.

review issue: 260
2014-01-08 10:44:30 -08:00
Alan Choi
88101bc90e This patch implements the probabilistic counting algorithm as the aggregates
"distinctpc" and "distinctpcsa".

We've gathered statistics on an internal dataset (all columns) which is
part of our regression data. It's roughly 400MB, ~100 columns,
int/bigint/string types.

On Hive, it took roughly 64sec.
On this Impala implementation, it took 35sec. By marking the functions in hash-util.h
inline (which we currently do not), we can achieve 24-26sec.

Change-Id: Ibcba3c9512b49e8b9eb0c2fec59dfd27f14f84c3
2014-01-08 10:44:27 -08:00
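
For context, distinctpc follows the Flajolet-Martin probabilistic-counting family. A toy single-hash Python sketch of the core idea (no stochastic averaging as in the PCSA variant, and nothing like Impala's actual C++ implementation):

    import hashlib

    def _rho(x):
        # 1-based position of the least-significant set bit of x.
        return (x & -x).bit_length()

    def estimate_distinct(values):
        bitmap = 0
        for v in values:
            h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "little")
            bitmap |= 1 << (_rho(h) - 1)
        r = 0                      # index of the lowest unset bit in the bitmap
        while bitmap & (1 << r):
            r += 1
        return 2 ** r / 0.77351    # standard Flajolet-Martin correction factor

    print(round(estimate_distinct(range(10000))))  # rough, high-variance estimate
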
Alan Choi
cbadb4eac4 When a scan range begins at the starting point of a tuple, we would miss that tuple. This patch fixes
this problem.

review: 162
2014-01-08 10:44:24 -08:00
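
The underlying convention for splitting delimited text is that a scan range skips forward to the first tuple boundary after its start and finishes the last tuple it begins, so a tuple starting exactly at the range start must not be skipped. A toy Python illustration for newline-delimited data (the real scanners are C++ and handle far more cases):

    def tuples_for_range(data, start, end):
        # Yield the tuples (lines) that the byte range [start, end) is responsible for.
        if start == 0 or data[start - 1:start] == b"\n":
            pos = start                      # range begins exactly at a tuple start
        else:
            nl = data.find(b"\n", start)     # otherwise skip the partial tuple
            pos = nl + 1 if nl != -1 else len(data)
        while pos < end and pos < len(data):
            nl = data.find(b"\n", pos)
            nxt = len(data) if nl == -1 else nl + 1
            yield data[pos:nxt].rstrip(b"\n")
            pos = nxt

    data = b"aa\nbb\ncc\n"
    print(list(tuples_for_range(data, 0, 3)))  # [b'aa']
    print(list(tuples_for_range(data, 3, 6)))  # [b'bb'] -- starts exactly at a tuple
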
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is 'hive-benchmark.test', which contains the Hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names distinct from those of the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new Python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: load-data.py -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00