Commit Graph

61 Commits

Author SHA1 Message Date
Victor Bittorf
2d7f2e19b2 IMPALA 938: Infer schema from Parquet file
Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'".
Supports all options that CREATE TABLE does. Currently only PARQUET is supported.
Run testdata/bin/create-load-data.sh after pulling this patch.

Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158
2014-06-20 17:38:01 -07:00
ishaan
99602fb8c2 Force load data if the current HEAD has a schema change.
This patch checks the test-warehouse's stored githash (if it exists) to determine if the
current patch has changed the schema if a table. If a change is detected, we force load
all the data.

Change-Id: I314f9f3364d3e6b2d66de38a9e6d9f57c4e279a7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3049
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-06-19 02:25:50 -07:00
Nong Li
5d80942d42 [CDH5] IMPALA-1019: Fix cancellation path in io mgr for cached reads.
Change-Id: I11efd65d1efa900f79afe88b781262a44ac5006a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2703
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-05-30 19:14:39 -07:00
Lenni Kuff
c45e9a70d9 [CDH5] Add DDL support for HDFS caching
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED

When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored with in the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.

When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.

It is desirable to know which cache pools exists early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).

Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from access the time during this period we will wait for the cache
requests to complete in the background and once they have finished the table metadata
will be automatically refreshed.

Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-05-27 16:47:15 -07:00
Matthew Jacobs
ebc6c5894e External Data Source: Frontend and catalog changes
Initial frontend and catalog changes for external data sources.

Change-Id: Ia0e61ef97cfd7a4e138ef555c17f2e45bbf08c18
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2224
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit dfa14c828957f751db9c89bae0bdc040ce6f648c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2485
2014-05-08 14:56:19 -07:00
Nong Li
03e5665e56 Decimal: Read/Write to parquet.
This adds support for the FIXED_LENGTH_BYTE_ARRAY parquet type and
encoding for decimals.

Change-Id: I9d5780feb4530989b568ec8d168cbdc32b7039bd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1727
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2432
2014-05-02 16:38:35 -07:00
Nong Li
87295a4e06 Decimal implementation.
This patch implements decimal support for text based formats.

Change-Id: I8e2c9e512ed149fe965216a72cb21fffd4f18e75
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1669
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2238
Tested-by: jenkins
2014-04-14 21:07:32 -07:00
Lenni Kuff
aa0b7a35f5 IMPALA-880: COMPUTE STATS should update partitions in batches
When updating partition metadata as part of COMPUTE STATS we would previously
attempt to update all partitions at once. This could lead to HMS socket timeouts
and also could run into issues if there were > 32K partitions.

In this change we now update the partitions in batches, with a max size of 500
partitions per batch. We also compare whether the row count has changed and only
update partitions that have been modified.

Change-Id: If7bfcc30f86fc2fdd79855b981067ac29a47b5e1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1913
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1918
2014-03-14 19:20:12 -07:00
Lenni Kuff
bf16b5cd0d IMPALA-749: Fetch partitions in batches, rather than all at once.
This updates how Impala fetches partition metadata from the Hive Metastore to fetch
partitions in batches, rather than all at once. This helps reduce the load on the
HMS and also lets Impala scale to above 32K partitions. The downside is that it
may require additional RPCs to get all the partitions.

This is done by first querying the metastore to get all the partition names that
exist, then splitting the list of names into seperate batches to get the actual
partition metadata.

Impala uses a default size of 1000 partitions per batch, but it can be configured
by setting the 'hive.metastore.batch.retrieve.table.partition.max' parameter
in the hive-site.xml config file.

Change-Id: Ide0ec30ef8a9e00f79c26551aa8e5e7814c73034
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1662
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1698
2014-02-28 22:30:45 -08:00
Nong Li
04b501d3a1 [CDH5] Collect metadata for cached blocks.
Change-Id: I81026de2f9a08553dc15e07090b8297120aa7462
(cherry picked from commit 69414f67b20016e49b739a46d6e2b4b57e1d1a3c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1252
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-15 15:12:20 -08:00
Alex Behm
dc7b398bd3 Impala reserves resources from YARN via LLama.
Impala reserves resources from YARN via Llama and handles resources
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLLama.

Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1
2014-01-15 15:12:04 -08:00
Skye Wanderman-Milne
561da008c7 IMPALA-729: fix resource management in Parquet scanner for multiple row groups
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().

Also adds test cases for IMPALA-720.

Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb
2014-01-08 10:56:26 -08:00
Skye Wanderman-Milne
9e17042185 Allow zero bit width dict/RLE decoders.
This allows us to read single-value dictionary-encoded columns
generated by parquet-mr.

Change-Id: I80903d910d0cc3a3e4ebf02e34212d868e94feb4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1098
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
de531e15bd IMPALA-694: Allow Impala to read files produced by parquet-mr version <= 1.2.8
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).

Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
acdc792355 IMPALA-695: Use the local path of Hive UDF jars in the FE.
The FE was creating class loaders with the HDFS locations of Hive UDF
libs, rather than the local locations created by the BE. Our tests
still passed since we only used UDFs already on the classpath
(e.g. Hive builtins).

Change-Id: Idbe9c98ad6adb84b70cb44efbf9ad0afc53366ca
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1081
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:25 -08:00
Lenni Kuff
0bae3978c9 Update compute-stats.py to execute using Impala
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.

Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:18 -08:00
Lenni Kuff
8b2acf5c22 IMPALA-425: Detect read-only tables and disable INSERT/LOAD operations on these tables
With this change we now detect if a table is read-only and disable INSERT/LOAD operations
on these tables. A table is read-only if Impala does not have write permission on the HDFS
base directory of the table or any one of the partition directories (if
the table is partitioned).

Change-Id: I25515b2d0ffb7fe297359437fd937a3d6e0406a0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/713
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:37 -08:00
Nong Li
4800995d44 Add execution for Hive UDFs.
Change-Id: I6a5ad96fed77e2b8a2701f21a917a8eb7a11d500
Reviewed-on: http://gerrit.ent.cloudera.com:8080/458
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:25 -08:00
Nong Li
6b9a7de02e Add symbol resolution during analysis for create function stmts.
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).

This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted but if it is
messed up, the user can always specify the full symbol.

Some other minor cleanup in:
  - JNI from FE to BE
  - UDFs/UDAs that are loaded as test data

Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:20 -08:00
Skye Wanderman-Milne
b7f83bcd73 Add support for LLVM IR UDFs.
This patch also adds a number of improvements to NativeUdfExpr. Highlights include:

* Correctly handling the lowering of AnyVal struct types (required for ABI compatibility)
* A rudimentary library cache for reusing handles produced by dlopen
* More complicated test cases

Change-Id: Iab9acdd7d7c4308e5d7ee3210f21b033fda5a195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/540
Tested-by: jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:53:03 -08:00
Lenni Kuff
79cdeac3d6 Consolidate test cluster under IMPALA_HOME/cluster_logs + store logs during data loading
Change-Id: I8f6239e4ccb0515c85bf80193a475788fb18dedb
Reviewed-on: http://gerrit.ent.cloudera.com:8080/518
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:56 -08:00
Skye Wanderman-Milne
fd99db0300 First pass at UdfExpr.
Change-Id: I517bf56541749b5c2459554821c7bf838239fdf0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/439
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:50 -08:00
ishaan
6735e3983f Fix build failure because of hbase data loading.
Change-Id: I796656332c3733a1ffdc338d206009efa6c451ac
Reviewed-on: http://gerrit.ent.cloudera.com:8080/360
Tested-by: jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:37 -08:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Skye Wanderman-Milne
6e7406df8b IMPALA-502: Impala does not return NULL for case where table has extra string column and data does not (it returns an empty string)
Change-Id: I0cfe5ce5fc279d46610a3cc191a501ccbc335296
Reviewed-on: http://gerrit.ent.cloudera.com:8080/127
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:02 -08:00
Skye Wanderman-Milne
3fecdeb793 IMPALA-441: support default values for Avro tables 2014-01-08 10:51:39 -08:00
Skye Wanderman-Milne
c8a8308ece Avro schema resolution (minus default values) 2014-01-08 10:51:26 -08:00
Lenni Kuff
7ac88e1fa9 IMPALA-400: Add support for SQL statement authorization
This changes adds support for SQL statement authorization in Impala. The authorization
works by updating the Catalog API to require a User + Privilege when getting Table/Db
objects (and in the future can be extended to cover columns as well).
If the user doesn't have permission to access the object, an AuthorizationException is
thrown. The authorization checks are done during analysis as new Catalog objects are
encountered.

These changes build on top of the Hive Access code which handles the actually
processing of authorization requests.  The authorization is currently based
on a "policy file" which will be stored in HDFS. This policy file is read once
on startup and then reloaded every 5 minutes. It can also be reloaded on a
specific impalad by executing a "refresh" command.

Authorization is enabled by setting:
--server_name='server1'
and then pointing the impalad to the policy file using the flag:
--authorization_policy_file=/path/to/policy/file

any authorization configuration problems will result in impalad failing to
start.
2014-01-08 10:50:56 -08:00
Skye Wanderman-Milne
1ab189c789 Fix build 2014-01-08 10:50:52 -08:00
Skye Wanderman-Milne
c8fd4f8016 IMPALA-362: impalad hangs when read sequence file without contents 2014-01-08 10:50:49 -08:00
Alan Choi
bd59bbb07a IMPALA-300/356
Always reload region server info.
Clear keyRange.start/stopkey before setting it in setKeyRangeStart/End.
Split HBase tables into multiple regions.
I've to disable HBase scanrangelocations planner test because region assigment
is non-deterministic. I'll have a follow up patch to address that.
2014-01-08 10:50:48 -08:00
Lenni Kuff
2f7198292a Add support for auxiliary workloads, tests, and datasets
This change adds support for auxiliary worksloads, tests, and datasets. This is useful
to augment the regular test runs with some additional tests that do not belong in the
main Impala repo.
2014-01-08 10:50:32 -08:00
Skye Wanderman-Milne
223b1a8e47 IMPALA-293: Impala is unable to query RCFile tables which describe less columns than the file's header. 2014-01-08 10:50:17 -08:00
Skye Wanderman-Milne
cc6007cf9e IMPALA-262: Querying text/lzo table that is not indexed causes an impalad segfault 2014-01-08 10:49:52 -08:00
Lenni Kuff
36e9fe1c1a Run compute table stats statements using Hive CLI
This works around a problem with computing table stats via the Hive Meta Store client
API. When executing these stements via the MetaStoreClient, all tables were getting a
num_rows=0 value returned from the ANALYZE TABLE query.
2014-01-08 10:49:19 -08:00
Lenni Kuff
831ee529be Fixed data loading bugs, moved most tables out of load-dependent-tables 2014-01-08 10:48:56 -08:00
Lenni Kuff
e0a7b7cb55 Compute column stats on tables used by Planner tests 2014-01-08 10:48:48 -08:00
Lenni Kuff
328ceed4e7 Add support for generating lzo compressed text files and running tests against lzo 2014-01-08 10:48:38 -08:00
Lenni Kuff
6e1f8d178a Update utility script to compute column and table stats for given table(s) 2014-01-08 10:48:23 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
99bb22dcac Add db name filter to compute stats, run compute stats on functional/text tables 2014-01-08 10:48:08 -08:00
Alan Choi
73c8ee3d96 IMPALA-18 Ignore hidden file prefixed with . or _ 2014-01-08 10:48:00 -08:00
Lenni Kuff
d2e4776731 Support passing snapshot file to buildall, add script to run all tests, remove old tests 2014-01-08 10:47:59 -08:00
Lenni Kuff
d5177c3c30 Update run-workload to support specifying table format test vectors from command line 2014-01-08 10:47:20 -08:00
Lenni Kuff
deea7a86b9 Remove Trevni from mixed format table and fix data loading bug 2014-01-08 10:47:10 -08:00
Lenni Kuff
30dbf59ef2 Final changes to enable Python test infrastructure and tests
With this change the Python tests will now be called as part of buildall and
the corresponding Java tests have been disabled. The new tests can also be
invoked calling ./tests/run-tests.sh directly.

This includes a fix from Nong that caused wrong results for limit on non-io
manager formats.
2014-01-08 10:46:57 -08:00
Lenni Kuff
bed633c1ae Extract config/metastore creation from buildall + script for loading warehouse snapshot 2014-01-08 10:46:53 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
option, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).

You will see that now each combination of table format + query exec options is
treated like an individual test case. this will make it much easier to debug
exactly where something failed.

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
Lenni Kuff
1e25c98fb4 Test data loading framework improvements
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of uneeded sections from schema template definitions (ex. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script
2014-01-08 10:46:49 -08:00
Michael Ubell
8a5297a526 Add HdfsLzoTextScanner 2014-01-08 10:46:35 -08:00