Commit Graph

9 Commits

Author SHA1 Message Date
Matthew Jacobs
25428fdb21 Add support for streaming decompression of gzip text
Compressed text formats currently require entire compressed
files be read into memory to be decompressed in a single call
to the decompression codec. This changes the HdfsTextScanner
to drive gzip in a streaming mode, i.e. produce partial output
as input is consumed.

Change-Id: Id5c0805e18cf6b606bcf27a5df4b5f58895809fd
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5233
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 05c3cc55e7a601d97adc4eebe03f878c68a33e56)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5385
2014-11-23 01:55:55 -08:00
Skye Wanderman-Milne
a7e15b1417 Update Parquet scanner to only scan a file if assigned the first split.
Also re-enable Parquet tests.
2014-01-08 10:49:25 -08:00
Nong Li
329763e5ab Disable parquet tests. 2014-01-08 10:49:20 -08:00
Nong Li
0df9476be1 Parquet data loading. 2014-01-08 10:48:48 -08:00
Skye Wanderman-Milne
461a48df2b Refactor testing framework to generate Avro tables. 2014-01-08 10:48:45 -08:00
Nong Li
02c329b97a Update RC files to use io mgr and remove scanner support for non-io mgr. 2014-01-08 10:47:11 -08:00
Nong Li
4fd7bd9606 Updated tpch core workload to include seq/snappy and seq/gzip.
Change-Id: Ifb01ee95542fced2ae8cfa4928ffbc7e357df3a8
2014-01-08 10:44:34 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We lookup the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/* to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF1". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with a unique names from the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: load-data.sh -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00
Lenni Kuff
bf27a31f98 Move functional data loading to new framework + initial changes for workload directory structure
This change moves (almost) all the functional data loading to the new data
loading framework. This removes the need for the create.sql, load.sql, and
load-raw-data.sql file. Instead we just have the single schema template file:
testdata/datasets/functional/functional_schema_template.sql

This template can be used to generate the schema for all file formats and
compression variations. It also should help make loading data easier. Now you
can run:

bin/load-impala-data.sh "query-test" "exhaustive"

And get all data needed for running the query tests.

This change also includes the initial changes for new dataset/workload directory
structure. The new structure looks like:

testdata/workload  <- Will contain query files and test vectors/dimensions

testdata/datasets <- WIll contain the data files and schema templates

Note: This is the first part of the change to this directory structure - it's
not yet complete. # Please enter the commit message for your changes. Lines starting
2014-01-08 10:44:18 -08:00