We weren't clearing the state in the dictionary when rolling over to a new
page. The memory for the dictionary (built from the first file) was cleared
but the dictionary entires were not.
This also had a minor side effect that unused dictionary entries from the first
page were still being written out for subsequent pages, although in practice,
this is unlikely to affect the file size much.
Change-Id: I8e11fc4723dc23d21c5de8a42def13d8238c137b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1072
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata updates request from
impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to
directly connect execute their DDL operations.
The CatalogService has two main components - a C++ server that implements StateStore
integration, Thrift service implementiation, and exporting of the debug webpage/metrics.
The other main component is the Java Catalog that manages caching and updating of of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.
Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views,
Databases, UDFs) have thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been seperated into two seperate sub-classes. An
ImpladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.
What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
contains the change. An impalad will wait for the statestore heartbeat that contains this
version before returning from the DDL comment.
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing
Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
same JAR.
Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Fixing predicate assignment for outer joins:
- On clause predicates for outer joins are now assigned to the join node
- the exception are On clause predicates that can be directly evaluated
by the outer-joined tables themselves; those are "pushed down"
- Where clause predicates for outer-joined tables are assigned to the join node
that materializes the outer join
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
option, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.
As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).
You will see that now each combination of table format + query exec options is
treated like an individual test case. this will make it much easier to debug
exactly where something failed.
These new tests can be run using the script at tests/run-tests.sh
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We lookup the workload in the workloads directory, then read the associated
query .test files and start executing them.
To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/* to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.
Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF1". We would first generate the schema using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with a unique names from the other scale factors.
Run the generated .sql file to load the data. Alternatively, the data can loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: load-data.sh -w tpch -e core -s SF3
Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
This change moves (almost) all the functional data loading to the new data
loading framework. This removes the need for the create.sql, load.sql, and
load-raw-data.sql file. Instead we just have the single schema template file:
testdata/datasets/functional/functional_schema_template.sql
This template can be used to generate the schema for all file formats and
compression variations. It also should help make loading data easier. Now you
can run:
bin/load-impala-data.sh "query-test" "exhaustive"
And get all data needed for running the query tests.
This change also includes the initial changes for new dataset/workload directory
structure. The new structure looks like:
testdata/workload <- Will contain query files and test vectors/dimensions
testdata/datasets <- WIll contain the data files and schema templates
Note: This is the first part of the change to this directory structure - it's
not yet complete. # Please enter the commit message for your changes. Lines starting