Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'".
Supports all options that CREATE TABLE does. Currently only PARQUET is supported.
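For example, to create a table whose schema is inferred from an existing
Parquet file (the table name is illustrative):

    CREATE TABLE new_tbl LIKE PARQUET '/path/to/file' STORED AS PARQUET;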
Run testdata/bin/create-load-data.sh after pulling this patch.
Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158
This patch checks the test-warehouse's stored githash (if it exists) to determine if the
current patch has changed the schema of a table. If a change is detected, we force-load
all the data.
Change-Id: I314f9f3364d3e6b2d66de38a9e6d9f57c4e279a7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3049
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This change adds DDL support for HDFS caching. The DDL allows the user to indicate that
a table or partition should be cached and which pool to cache the data in:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED
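For example (table, column, and pool names are illustrative):

    CREATE TABLE t (i INT) PARTITIONED BY (j INT) CACHED IN 'poolName';
    ALTER TABLE t PARTITION (j=0) SET CACHED IN 'poolName';
    ALTER TABLE t SET UNCACHED;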
When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition, and the ID of that request
is stored in the table metadata (in the table properties) as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.
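Because the directive ID is kept as an ordinary table property, it can be
inspected like any other property, e.g. with DESCRIBE FORMATTED (table name
illustrative):

    DESCRIBE FORMATTED t;  -- table parameters include 'cache_directive_id'='<requestId>'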
When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.
It is desirable to know which cache pools exist early on (during analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence without going to HDFS. It would be easy to use this to add basic cache pool
management in the future (ADD/DROP/SHOW CACHE POOL).
Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user's access during this period, we wait for the cache requests to
complete in the background; once they have finished, the table metadata is
automatically refreshed.
Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
When updating partition metadata as part of COMPUTE STATS we would previously
attempt to update all partitions at once. This could lead to HMS socket timeouts
and could also run into issues if there were > 32K partitions.
In this change we now update the partitions in batches, with a max size of 500
partitions per batch. We also compare whether the row count has changed and only
update partitions that have been modified.
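The batching is internal; the user-facing statement is unchanged (names are
illustrative):

    COMPUTE STATS mydb.sales;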
Change-Id: If7bfcc30f86fc2fdd79855b981067ac29a47b5e1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1913
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1918
This updates how Impala fetches partition metadata from the Hive Metastore to fetch
partitions in batches, rather than all at once. This helps reduce the load on the
HMS and also lets Impala scale to above 32K partitions. The downside is that it
may require additional RPCs to get all the partitions.
This is done by first querying the metastore to get all the partition names that
exist, then splitting the list of names into separate batches to get the actual
partition metadata.
Impala uses a default size of 1000 partitions per batch, but it can be configured
by setting the 'hive.metastore.batch.retrieve.table.partition.max' parameter
in the hive-site.xml config file.
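For example, to change the batch size, hive-site.xml would contain an entry
along these lines (the value shown is the default):

    <property>
      <name>hive.metastore.batch.retrieve.table.partition.max</name>
      <value>1000</value>
    </property>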
Change-Id: Ide0ec30ef8a9e00f79c26551aa8e5e7814c73034
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1662
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1698
Impala reserves resources from YARN via Llama and handles resource
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLLama.
Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().
Also adds test cases for IMPALA-720.
Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).
Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
The FE was creating class loaders with the HDFS locations of Hive UDF
libs, rather than the local locations created by the BE. Our tests
still passed since we only used UDFs already on the classpath
(e.g. Hive builtins).
Change-Id: Idbe9c98ad6adb84b70cb44efbf9ad0afc53366ca
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1081
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.
Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
With this change we now detect if a table is read-only and disable INSERT/LOAD operations
on these tables. A table is read-only if Impala does not have write permission on the HDFS
base directory of the table or any one of the partition directories (if
the table is partitioned).
Change-Id: I25515b2d0ffb7fe297359437fd937a3d6e0406a0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/713
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).
This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted, but if the
generated symbol is wrong, the user can always specify the full symbol.
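For example, a function can now be created using its plain, unmangled symbol
(function name, signature, and library path are illustrative):

    CREATE FUNCTION my_add(INT, INT) RETURNS INT
    LOCATION '/path/to/libudf.so' SYMBOL='MyAdd';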
Some other minor cleanup in:
- JNI from FE to BE
- UDFs/UDAs that are loaded as test data
Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
This patch also adds a number of improvements to NativeUdfExpr. Highlights include:
* Correctly handling the lowering of AnyVal struct types (required for ABI compatibility)
* A rudimentary library cache for reusing handles produced by dlopen
* More complicated test cases
Change-Id: Iab9acdd7d7c4308e5d7ee3210f21b033fda5a195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/540
Tested-by: jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
This change adds support for SQL statement authorization in Impala. The authorization
works by updating the Catalog API to require a User + Privilege when getting Table/Db
objects (and in the future can be extended to cover columns as well).
If the user doesn't have permission to access the object, an AuthorizationException is
thrown. The authorization checks are done during analysis as new Catalog objects are
encountered.
These changes build on top of the Hive Access code, which handles the actual
processing of authorization requests. The authorization is currently based
on a "policy file" which will be stored in HDFS. This policy file is read once
on startup and then reloaded every 5 minutes. It can also be reloaded on a
specific impalad by executing a "refresh" command.
Authorization is enabled by setting:
--server_name='server1'
and then pointing the impalad to the policy file using the flag:
--authorization_policy_file=/path/to/policy/file
Any authorization configuration problems will result in impalad failing to
start.
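Putting the two flags together, an authorization-enabled impalad is started
along these lines:

    impalad --server_name=server1 \
            --authorization_policy_file=/path/to/policy/file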
Always reload region server info.
Clear keyRange.start/stopkey before setting it in setKeyRangeStart/End.
Split HBase tables into multiple regions.
I had to disable the HBase scanrangelocations planner test because region assignment
is non-deterministic. A follow-up patch will address that.
This change adds support for auxiliary workloads, tests, and datasets. This is useful
to augment the regular test runs with some additional tests that do not belong in the
main Impala repo.
This works around a problem with computing table stats via the Hive Metastore client
API. When executing these statements via the MetaStoreClient, all tables were getting a
num_rows=0 value returned from the ANALYZE TABLE query.
With this change the Python tests will now be called as part of buildall and
the corresponding Java tests have been disabled. The new tests can also be
invoked by calling ./tests/run-tests.sh directly.
This includes a fix from Nong for an issue that caused wrong results with
LIMIT on non-IO-manager formats.
This is the first set of changes required to start moving our functional test
infrastructure from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.
As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means that if you load the "core" dataset you
know you will be able to run the "core" query tests (specified by
--exploration_strategy when running the tests).
You will see that each combination of table format + query exec options is now
treated as an individual test case. This will make it much easier to debug
exactly where something failed.
These new tests can be run using the script at tests/run-tests.sh
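For instance, assuming the exploration flag is passed straight through to the
runner, a "core" run would look like:

    ./tests/run-tests.sh --exploration_strategy=core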
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of unneeded sections from schema template definitions (e.g. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script