Commit Graph

33 Commits

Author SHA1 Message Date
ishaan
2b5df0c6ff [CDH5] Convert tpch schemas to decimal and change the queries where possible.
I used the following document for reference: http://www.tpc.org/tpch/spec/tpch2.1.0.pdf

Change-Id: Ic84db0628323c90e89552707f214bbb9fa2f2ae0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3132
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-07-08 14:51:43 -07:00
Victor Bittorf
808f9a661a IMPALA-939: Regex should match anywhere in string.
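As a generic illustration of the semantic difference in that title (plain Python, not the code this commit actually touches): an anchored match only succeeds at the start of the string, while "match anywhere" semantics correspond to a search.

import re

pattern = re.compile(r"q[0-9]+")

print(pattern.match("tpch-q18"))    # None: anchored at the start of the string
print(pattern.search("tpch-q18"))   # <re.Match object ...>: found anywhere in the string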
Change-Id: I8dcd337c3b06b632017270670a4f199ec7ada648
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2296
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c97f82eaaf0efe9bd4c3da3d005464f425696a62)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2371
2014-04-25 16:16:15 -07:00
Nong Li
d401f746d4 IMPALA-692: Fix data corruption with dictionary encoded values.
We weren't clearing the state in the dictionary when rolling over to a new
page. The memory for the dictionary (built from the first file) was cleared
but the dictionary entries were not.

This also had a minor side effect that unused dictionary entries from the first
page were still being written out for subsequent pages, although in practice,
this is unlikely to affect the file size much.
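A toy Python sketch of the rollover rule described above (hypothetical names, not Impala's C++ writer): the per-page dictionary state has to be reset when a page is finished, otherwise codes and entries from the previous page leak into the next one.

class ToyDictEncoder:
    def __init__(self):
        self.codes = {}        # value -> code for the current page
        self.dictionary = []   # values, in code order, for the current page

    def encode(self, value):
        code = self.codes.get(value)
        if code is None:
            code = len(self.dictionary)
            self.codes[value] = code
            self.dictionary.append(value)
        return code

    def finish_page(self):
        # Emit this page's dictionary, then clear ALL state. Clearing only the
        # buffered memory but not the entries is the kind of stale-state bug
        # this commit fixes.
        page_dict = list(self.dictionary)
        self.codes.clear()
        self.dictionary.clear()
        return page_dict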

Change-Id: I8e11fc4723dc23d21c5de8a42def13d8238c137b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1072
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:24 -08:00
Lenni Kuff
a2cbd2820e Add Catalog Service and support for automatic metadata refresh
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface that allows impalads to
connect directly and execute their DDL operations.
The CatalogService has two main components: a C++ server that implements StateStore
integration, the Thrift service implementation, and the debug webpage/metrics export.
The other main component is the Java Catalog, which manages caching and updating of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.

Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this, all catalog objects (Tables/Views,
Databases, UDFs) have a thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been separated into two separate sub-classes: an
ImpaladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.

What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
  contains the change. An impalad will wait for the statestore heartbeat that contains this
  version before returning from the DDL command (see the sketch after the TODO list below).
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing

Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
  same JAR.
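A minimal Python sketch of that version handshake (hypothetical class and method names; the real code is split across the C++ server and the Java catalog): a DDL returns the catalog version containing its change, and the impalad blocks until a statestore heartbeat has advanced its cached catalog to at least that version.

import threading

class CachedCatalogVersion:
    def __init__(self):
        self._version = 0
        self._cond = threading.Condition()

    def apply_statestore_update(self, delta_version):
        # Called for each statestore heartbeat that carries a catalog delta.
        with self._cond:
            if delta_version > self._version:
                self._version = delta_version
                self._cond.notify_all()

    def wait_for_version(self, target_version, timeout_s=60.0):
        # Called after a DDL: block until the cached catalog has caught up to
        # the version returned by the Catalog Service for that DDL.
        with self._cond:
            return self._cond.wait_for(lambda: self._version >= target_version,
                                       timeout=timeout_s)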

Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:11 -08:00
ishaan
2f7d24b35b Fix tpch-q18 to not use qualified table names. 2014-01-08 10:51:49 -08:00
ishaan
ece902d953 Fix tpch-q18 to insert into the database associated with its scale-factor. 2014-01-08 10:51:45 -08:00
Nong Li
58631d9ce0 Fix parquet insert .test files. 2014-01-08 10:49:46 -08:00
Skye Wanderman-Milne
a7e15b1417 Update Parquet scanner to only scan a file if assigned the first split.
Also re-enable Parquet tests.
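A hedged sketch of the rule in that title, with invented helper names (the real logic lives in the C++ Parquet scanner): a node scans the whole file only when one of its assigned splits starts at offset 0, so each file is read exactly once.

def ranges_to_scan(file_len, assigned_splits):
    # assigned_splits: list of (offset, length) byte ranges assigned to this node.
    owns_first_split = any(offset == 0 for offset, _ in assigned_splits)
    if owns_first_split:
        return [(0, file_len)]   # this node reads the entire file
    return []                    # another node owns the first split and will read it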
2014-01-08 10:49:25 -08:00
Nong Li
329763e5ab Disable parquet tests. 2014-01-08 10:49:20 -08:00
Nong Li
20fc700002 Fix precision issue in text table writer. 2014-01-08 10:49:19 -08:00
Lenni Kuff
5f81becd84 Create tables used by insert tests in a supported insert format 2014-01-08 10:49:00 -08:00
Nong Li
0df9476be1 Parquet data loading. 2014-01-08 10:48:48 -08:00
Skye Wanderman-Milne
461a48df2b Refactor testing framework to generate Avro tables. 2014-01-08 10:48:45 -08:00
Nong Li
6e293090e6 Parquet writer.
Change-Id: I7117b545e3d3a7803a219234ad992040a6c7c4ec
2014-01-08 10:48:44 -08:00
Lenni Kuff
328ceed4e7 Add support for generating lzo compressed text files and running tests against lzo 2014-01-08 10:48:38 -08:00
ishaan
5138a720bb IMP-768: Enable the python test framework to check for insert results. 2014-01-08 10:48:22 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Nong Li
a0229cd12e Update tpch schema to use bigint for keys. 2014-01-08 10:47:54 -08:00
Nong Li
02c329b97a Update RC files to use io mgr and remove scanner support for non-io mgr. 2014-01-08 10:47:11 -08:00
Nong Li
f46c654e01 Enable tpch-q21 and tpch-q22 in tests. 2014-01-08 10:47:03 -08:00
Lenni Kuff
837f35eab3 Updated results for more query tests to reflect proper ordering + improved result updating 2014-01-08 10:46:53 -08:00
Lenni Kuff
a035cf4e73 Update results of a few TPC-H queries to reflect proper ordering
Change-Id: I41156b506155c846220cfb097f5e8120503f8da8
2014-01-08 10:46:52 -08:00
Marcel Kornacker
f6af9316d9 Fix for IMP-137: incorrect predicate placement for outer joins
Fixing predicate assignment for outer joins:
- On clause predicates for outer joins are now assigned to the join node
- the exceptions are On clause predicates that can be directly evaluated
  by the outer-joined tables themselves; those are "pushed down"
- Where clause predicates for outer-joined tables are assigned to the join node
  that materializes the outer join
2014-01-08 10:46:50 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if you load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).

You will see that now each combination of table format + query exec options is
treated like an individual test case. This will make it much easier to debug
exactly where something failed.

These new tests can be run using the script at tests/run-tests.sh
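A hedged py.test sketch of that idea, with made-up dimensions and a stubbed run_query() helper (the real framework builds its test vectors from the workload definitions and --exploration_strategy): parametrization turns every table format / exec option combination into its own test case.

import itertools
import pytest

TABLE_FORMATS = ["text/none", "seq/snappy", "rc/gzip"]    # made-up subset of a vector
EXEC_OPTIONS = [{"num_nodes": 0}, {"num_nodes": 1}]        # made-up subset of a vector

VECTORS = list(itertools.product(TABLE_FORMATS, EXEC_OPTIONS))

def run_query(name, table_format, exec_options):
    # Stand-in for the framework's query executor.
    class Result:
        success = True
    return Result()

@pytest.mark.parametrize("table_format,exec_options", VECTORS)
def test_tpch_q1(table_format, exec_options):
    # Each (table format, exec options) pair shows up as a separate test case,
    # so a failure pinpoints the exact combination that broke.
    result = run_query("tpch-q1", table_format, exec_options)
    assert result.success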
2014-01-08 10:46:50 -08:00
Lenni Kuff
1451650055 Bring online all TPCH planner tests (updated for new planner) and supported query tests 2014-01-08 10:46:21 -08:00
Lenni Kuff
9f91081183 Modify TPCH tests to always insert into text table so workload can run on all file formats 2014-01-08 10:46:21 -08:00
ishaan
42231b7d86 Annotate queries for better benchmark reporting. 2014-01-08 10:45:05 -08:00
Henry Robinson
e7348a209b IMP-232: Parallel INSERT OVERWRITE 2014-01-08 10:45:04 -08:00
Nong Li
4fd7bd9606 Updated tpch core workload to include seq/snappy and seq/gzip.
Change-Id: Ifb01ee95542fced2ae8cfa4928ffbc7e357df3a8
2014-01-08 10:44:34 -08:00
Alan Choi
cbadb4eac4 When a scan range begins at the starting point of a tuple, we'd miss that tuple. This patch fixes
this problem.
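A toy Python illustration of the off-by-one being fixed (not the actual C++ scanner code): when searching for the first tuple boundary in a scan range, starting the search at range_start - 1 rather than range_start keeps a tuple that begins exactly at the range start.

def first_tuple_offset(data, range_start, delim=b"\n"):
    if range_start == 0:
        return 0
    # Search from range_start - 1: if the previous byte is the delimiter, a
    # tuple starts exactly at range_start and belongs to this range.
    # Searching from range_start instead would skip over it.
    nxt = data.find(delim, range_start - 1)
    return nxt + 1 if nxt != -1 else len(data)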

review: 162
2014-01-08 10:44:24 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names that are distinct from those of the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00
Michael Ubell
02d63d8dc3 Trevni file support 2014-01-08 10:44:19 -08:00
Lenni Kuff
bf27a31f98 Move functional data loading to new framework + initial changes for workload directory structure
This change moves (almost) all the functional data loading to the new data
loading framework. This removes the need for the create.sql, load.sql, and
load-raw-data.sql files. Instead we just have the single schema template file:
testdata/datasets/functional/functional_schema_template.sql

This template can be used to generate the schema for all file formats and
compression variations. It should also help make loading data easier. Now you
can run:

bin/load-impala-data.sh "query-test" "exhaustive"

And get all data needed for running the query tests.
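A hypothetical sketch of the idea behind the single template (the real template syntax and dimensions live under testdata/ and differ in detail): one definition is expanded once per file format / compression combination, giving each variant its own table name.

DIMENSIONS = [("text", "none"), ("seq", "snappy"), ("seq", "gzip"), ("rc", "none")]

def expand(template, base_name):
    # Expand one schema definition into one statement per dimension combination.
    statements = []
    for file_format, compression in DIMENSIONS:
        table_name = f"{base_name}_{file_format}_{compression}"
        statements.append(template.format(table_name=table_name))
    return statements

ddl = expand("CREATE TABLE {table_name} (id INT, name STRING);", "alltypes")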

This change also includes the initial changes for new dataset/workload directory
structure. The new structure looks like:

testdata/workload  <- Will contain query files and test vectors/dimensions

testdata/datasets <- Will contain the data files and schema templates

Note: This is the first part of the change to this directory structure - it's
not yet complete.
2014-01-08 10:44:18 -08:00