Commit Graph

21 Commits

Author SHA1 Message Date
Lenni Kuff
a2cbd2820e Add Catalog Service and support for automatic metadata refresh
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface that allows impalads to
connect directly and execute their DDL operations.
The CatalogService has two main components: a C++ server that implements the StateStore
integration, the Thrift service implementation, and the export of the debug webpage/metrics;
and the Java Catalog that manages caching and updating of all the metadata. For each
StateStore heartbeat, a delta of all metadata updates is broadcast to the rest of the
cluster.
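
In rough terms, each catalog object carries a monotonically increasing catalog
version, and each heartbeat ships only the objects that changed since the last
broadcast. A minimal Python sketch of that idea (all names here are hypothetical;
the real implementation is the C++/Java CatalogService sending Thrift structs):

from dataclasses import dataclass

# Hypothetical sketch of the per-heartbeat delta computation; the real payload
# is Thrift structs produced by the CatalogService, not Python dataclasses.
@dataclass
class CatalogObject:
    name: str                 # e.g. a table, database, or UDF
    catalog_version: int      # bumped on every metadata change
    deleted: bool = False     # tombstone so drops propagate too

def compute_delta(objects, last_sent_version):
    """Return only the catalog objects that changed since the last heartbeat."""
    return [o for o in objects if o.catalog_version > last_sent_version]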

Some Notes On the Changes
---
* The metadata is all sent as Thrift structs. To do this, all catalog objects (Tables/Views,
Databases, UDFs) have a Thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been separated into two subclasses: an
ImpaladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.

What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
  contains the change. An impalad will wait for the statestore heartbeat that contains this
  version before returning from the DDL command (see the sketch after this list).
* All table types (HBase, HDFS, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing
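
The DDL wait noted above is essentially a version barrier: the impalad blocks
until its local catalog version catches up to the version returned by the
Catalog Service. A minimal Python sketch with hypothetical names (the real
implementation lives in the impalad/CatalogService code, not here):

import threading

# CatalogVersionBarrier and its methods are invented names for illustration.
class CatalogVersionBarrier:
    def __init__(self):
        self._version = 0
        self._cond = threading.Condition()

    def on_heartbeat(self, new_version):
        # Applied when a statestore delta update lands locally.
        with self._cond:
            self._version = max(self._version, new_version)
            self._cond.notify_all()

    def wait_for(self, version):
        # Called after the DDL RPC returns the catalog version holding its change.
        with self._cond:
            self._cond.wait_for(lambda: self._version >= version)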

Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
  same JAR.

Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:11 -08:00
Nong Li
a3bc1ce133 Some parquet encoder/decoder refactoring. Added dictionary encoding to other types.
Split out the encoder/type for parquet reader/writer. I think this puts us
in a better place to support future encodings.

On the tpch lineitem table, the results are:
Before:
  BytesWritten: 236.45 MB
  Per Column Sizes:
    l_comment: 75.71 MB
    l_commitdate: 8.64 MB
    l_discount: 11.19 MB
    l_extendedprice: 33.02 MB
    l_linenumber: 4.56 MB
    l_linestatus: 869.98 KB
    l_orderkey: 8.99 MB
    l_partkey: 27.02 MB
    l_quantity: 11.58 MB
    l_receiptdate: 8.65 MB
    l_returnflag: 1.40 MB
    l_shipdate: 8.65 MB
    l_shipinstruct: 1.45 MB
    l_shipmode: 2.17 MB
    l_suppkey: 21.91 MB
    l_tax: 10.68 MB
After:
  BytesWritten: 198.63 MB           (84%)
  Per Column Sizes:
    l_comment: 75.71 MB             (100%)
    l_commitdate: 8.64 MB           (100%)
    l_discount: 2.89 MB             (25.8%)
    l_extendedprice: 33.13 MB       (100.33%)
    l_linenumber: 1.50 MB           (32.89%)
    l_linestatus: 870.26 KB         (100.032%)
    l_orderkey: 9.18 MB             (102.11%)
    l_partkey: 27.10 MB             (100.29%)
    l_quantity: 4.32 MB             (37.31%)
    l_receiptdate: 8.65 MB          (100%)
    l_returnflag: 1.40 MB           (100%)
    l_shipdate: 8.65 MB             (100%)
    l_shipinstruct: 1.45 MB         (100%)
    l_shipmode: 2.17 MB             (100%)
    l_suppkey: 10.11 MB             (46.14%)
    l_tax: 2.89 MB                  (27.06%)

The table is overall 84% as big (i.e. 16% smaller). A few columns got marginally
bigger. If the file filled the full 1 GB, I'd expect the relative overhead to
decrease even more.

The restructuring to use a virtual call doesn't seem to change things much and
will go away when we codegen the scanner.
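
For intuition: dictionary encoding stores each distinct value once and replaces
rows with small indices, which is why low-cardinality columns (l_discount, l_tax,
l_linenumber) shrink dramatically while near-unique columns (l_orderkey) pay a
small dictionary overhead. A toy Python sketch, not Impala's actual encoder:

def dictionary_encode(values):
    """Toy dictionary encoding: distinct values plus per-row indices."""
    dictionary = {}
    indices = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    return list(dictionary), indices

# l_discount has only a handful of distinct values, so each row costs a few
# bits; l_orderkey is nearly unique, so the dictionary is as big as the column.
dict_vals, idx = dictionary_encode([0.04, 0.09, 0.04, 0.00, 0.09])
assert dict_vals == [0.04, 0.09, 0.00] and idx == [0, 1, 0, 2, 1]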

Here's what the query times look like with this patch (note this is on the before
data files, so only string cols are dictionary encoded).

Before query times:
  Insert Time: 8.5 sec
  select *: 2.3 sec
  select avg(l_orderkey): .33 sec

After query times:
  Insert Time: 9.5 sec                  <-- Longer due to doing dictionary encoding
  select *: 2.4 sec                     <-- kind of noisy, possibly a slight slowdown
  select avg(l_orderkey): .33 sec

Change-Id: I213fdca1bb972cc200dc0cd9fb14b77a8d36d9e6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/238
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:16 -08:00
Alex Behm
045038e479 IMPALA-374: Added WITH clause without recursion. 2014-01-08 10:51:00 -08:00
Alex Behm
937a44f9f8 IMPALA-68: Support Values() statement. 2014-01-08 10:50:31 -08:00
Alex Behm
c7819f4db7 IMPALA-87: Support INSERT from SELECT without FROM. 2014-01-08 10:50:30 -08:00
Lenni Kuff
627e74a068 Fix insert test failure by cleaning up table before executing query 2014-01-08 10:50:27 -08:00
Henry Robinson
ead69d377f IMPALA-249, IMPALA-252: Fixes for static partition keys. 2014-01-08 10:50:14 -08:00
Henry Robinson
7d2c47ad72 IMPALA-258: Make partition key string encoding Hive-compatible 2014-01-08 10:49:54 -08:00
Alex Behm
5db3f2cdf5 IMPALA-227: SELECT * on partitioned table returns columns in different order than Hive. 2014-01-08 10:49:48 -08:00
Lenni Kuff
c74b7e41dd Enable insert tests to run against parquet 2014-01-08 10:49:47 -08:00
Lenni Kuff
5f81becd84 Create tables used by insert tests in a supported insert format 2014-01-08 10:49:00 -08:00
ishaan
5138a720bb IMP-768: Enable the python test framework to check for insert results. 2014-01-08 10:48:22 -08:00
Henry Robinson
222d15c6ca IMPALA-72: String partition keys should be URL encoded 2014-01-08 10:48:20 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
d2e4776731 Support passing snapshot file to buildall, add script to run all tests, remove old tests 2014-01-08 10:47:59 -08:00
Lenni Kuff
bed633c1ae Extract config/metastore creation from buildall + script for loading warehouse snapshot 2014-01-08 10:46:53 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means that if you load the "core" dataset, you
know you will be able to run the "core" query tests (specified by
--exploration_strategy when running the tests).

You will see that each combination of table format + query exec options is now
treated as an individual test case. This will make it much easier to debug
exactly where something failed.
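
The expansion of test vectors into individual cases is similar in spirit to
pytest parametrization. A hedged Python sketch (the dimension values below are
hypothetical; the real framework derives them from the workload's vector files):

import pytest

# Stand-in dimensions, not the framework's actual vectors.
TABLE_FORMATS = ["text", "seq", "rc"]
BATCH_SIZES = [0, 1, 16]

@pytest.mark.parametrize("table_format", TABLE_FORMATS)
@pytest.mark.parametrize("batch_size", BATCH_SIZES)
def test_query(table_format, batch_size):
    # The real tests would run a .test file's queries against impalad with these
    # options and diff the results; the point is that each combination becomes
    # its own test case with its own pass/fail.
    assert table_format in TABLE_FORMATS and batch_size in BATCH_SIZES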

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
Henry Robinson
e7348a209b IMP-232: Parallel INSERT OVERWRITE 2014-01-08 10:45:04 -08:00
Henry Robinson
c472213eeb Parallel INSERT, sink-per-scan-node plan 2014-01-08 10:44:35 -08:00
Alan Choi
f15ef994fb "mvn test" now uses impalad and beeswax api to submit query and fetch, including
insert query.

review issue: 260
2014-01-08 10:44:30 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is 'hive-benchmark.test', which contains the hive
benchmark queries.

Also added support for generating schemas for different scale factors, as well as
executing against those scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"

This will create tables with names that are distinct from the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new python script:

./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]

For example: ./bin/load-data.py -w tpch -e core -s SF3

Then run the benchmark against this scale factor:

./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00