Commit Graph

55 Commits

Author SHA1 Message Date
Nong Li
87295a4e06 Decimal implementation.
This patch implements decimal support for text based formats.

Change-Id: I8e2c9e512ed149fe965216a72cb21fffd4f18e75
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1669
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2238
Tested-by: jenkins
2014-04-14 21:07:32 -07:00
Lenni Kuff
aa0b7a35f5 IMPALA-880: COMPUTE STATS should update partitions in batches
When updating partition metadata as part of COMPUTE STATS we would previously
attempt to update all partitions at once. This could lead to HMS socket timeouts
and also could run into issues if there were > 32K partitions.

In this change we now update the partitions in batches, with a max size of 500
partitions per batch. We also compare whether the row count has changed and only
update partitions that have been modified.

Change-Id: If7bfcc30f86fc2fdd79855b981067ac29a47b5e1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1913
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1918
2014-03-14 19:20:12 -07:00
Lenni Kuff
bf16b5cd0d IMPALA-749: Fetch partitions in batches, rather than all at once.
This updates how Impala fetches partition metadata from the Hive Metastore to fetch
partitions in batches, rather than all at once. This helps reduce the load on the
HMS and also lets Impala scale to above 32K partitions. The downside is that it
may require additional RPCs to get all the partitions.

This is done by first querying the metastore to get all the partition names that
exist, then splitting the list of names into seperate batches to get the actual
partition metadata.

Impala uses a default size of 1000 partitions per batch, but it can be configured
by setting the 'hive.metastore.batch.retrieve.table.partition.max' parameter
in the hive-site.xml config file.

Change-Id: Ide0ec30ef8a9e00f79c26551aa8e5e7814c73034
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1662
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1698
2014-02-28 22:30:45 -08:00
Nong Li
04b501d3a1 [CDH5] Collect metadata for cached blocks.
Change-Id: I81026de2f9a08553dc15e07090b8297120aa7462
(cherry picked from commit 69414f67b20016e49b739a46d6e2b4b57e1d1a3c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1252
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-15 15:12:20 -08:00
Alex Behm
dc7b398bd3 Impala reserves resources from YARN via LLama.
Impala reserves resources from YARN via Llama and handles resources
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLLama.

Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1
2014-01-15 15:12:04 -08:00
Skye Wanderman-Milne
561da008c7 IMPALA-729: fix resource management in Parquet scanner for multiple row groups
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().

Also adds test cases for IMPALA-720.

Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb
2014-01-08 10:56:26 -08:00
Skye Wanderman-Milne
9e17042185 Allow zero bit width dict/RLE decoders.
This allows us to read single-value dictionary-encoded columns
generated by parquet-mr.

Change-Id: I80903d910d0cc3a3e4ebf02e34212d868e94feb4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1098
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
de531e15bd IMPALA-694: Allow Impala to read files produced by parquet-mr version <= 1.2.8
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).

Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
acdc792355 IMPALA-695: Use the local path of Hive UDF jars in the FE.
The FE was creating class loaders with the HDFS locations of Hive UDF
libs, rather than the local locations created by the BE. Our tests
still passed since we only used UDFs already on the classpath
(e.g. Hive builtins).

Change-Id: Idbe9c98ad6adb84b70cb44efbf9ad0afc53366ca
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1081
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:25 -08:00
Lenni Kuff
0bae3978c9 Update compute-stats.py to execute using Impala
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.

Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:18 -08:00
Lenni Kuff
8b2acf5c22 IMPALA-425: Detect read-only tables and disable INSERT/LOAD operations on these tables
With this change we now detect if a table is read-only and disable INSERT/LOAD operations
on these tables. A table is read-only if Impala does not have write permission on the HDFS
base directory of the table or any one of the partition directories (if
the table is partitioned).

Change-Id: I25515b2d0ffb7fe297359437fd937a3d6e0406a0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/713
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:37 -08:00
Nong Li
4800995d44 Add execution for Hive UDFs.
Change-Id: I6a5ad96fed77e2b8a2701f21a917a8eb7a11d500
Reviewed-on: http://gerrit.ent.cloudera.com:8080/458
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:25 -08:00
Nong Li
6b9a7de02e Add symbol resolution during analysis for create function stmts.
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).

This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted but if it is
messed up, the user can always specify the full symbol.

Some other minor cleanup in:
  - JNI from FE to BE
  - UDFs/UDAs that are loaded as test data

Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:20 -08:00
Skye Wanderman-Milne
b7f83bcd73 Add support for LLVM IR UDFs.
This patch also adds a number of improvements to NativeUdfExpr. Highlights include:

* Correctly handling the lowering of AnyVal struct types (required for ABI compatibility)
* A rudimentary library cache for reusing handles produced by dlopen
* More complicated test cases

Change-Id: Iab9acdd7d7c4308e5d7ee3210f21b033fda5a195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/540
Tested-by: jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:53:03 -08:00
Lenni Kuff
79cdeac3d6 Consolidate test cluster under IMPALA_HOME/cluster_logs + store logs during data loading
Change-Id: I8f6239e4ccb0515c85bf80193a475788fb18dedb
Reviewed-on: http://gerrit.ent.cloudera.com:8080/518
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:56 -08:00
Skye Wanderman-Milne
fd99db0300 First pass at UdfExpr.
Change-Id: I517bf56541749b5c2459554821c7bf838239fdf0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/439
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:50 -08:00
ishaan
6735e3983f Fix build failure because of hbase data loading.
Change-Id: I796656332c3733a1ffdc338d206009efa6c451ac
Reviewed-on: http://gerrit.ent.cloudera.com:8080/360
Tested-by: jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:37 -08:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Skye Wanderman-Milne
6e7406df8b IMPALA-502: Impala does not return NULL for case where table has extra string column and data does not (it returns an empty string)
Change-Id: I0cfe5ce5fc279d46610a3cc191a501ccbc335296
Reviewed-on: http://gerrit.ent.cloudera.com:8080/127
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:02 -08:00
Skye Wanderman-Milne
3fecdeb793 IMPALA-441: support default values for Avro tables 2014-01-08 10:51:39 -08:00
Skye Wanderman-Milne
c8a8308ece Avro schema resolution (minus default values) 2014-01-08 10:51:26 -08:00
Lenni Kuff
7ac88e1fa9 IMPALA-400: Add support for SQL statement authorization
This changes adds support for SQL statement authorization in Impala. The authorization
works by updating the Catalog API to require a User + Privilege when getting Table/Db
objects (and in the future can be extended to cover columns as well).
If the user doesn't have permission to access the object, an AuthorizationException is
thrown. The authorization checks are done during analysis as new Catalog objects are
encountered.

These changes build on top of the Hive Access code which handles the actually
processing of authorization requests.  The authorization is currently based
on a "policy file" which will be stored in HDFS. This policy file is read once
on startup and then reloaded every 5 minutes. It can also be reloaded on a
specific impalad by executing a "refresh" command.

Authorization is enabled by setting:
--server_name='server1'
and then pointing the impalad to the policy file using the flag:
--authorization_policy_file=/path/to/policy/file

any authorization configuration problems will result in impalad failing to
start.
2014-01-08 10:50:56 -08:00
Skye Wanderman-Milne
1ab189c789 Fix build 2014-01-08 10:50:52 -08:00
Skye Wanderman-Milne
c8fd4f8016 IMPALA-362: impalad hangs when read sequence file without contents 2014-01-08 10:50:49 -08:00
Alan Choi
bd59bbb07a IMPALA-300/356
Always reload region server info.
Clear keyRange.start/stopkey before setting it in setKeyRangeStart/End.
Split HBase tables into multiple regions.
I've to disable HBase scanrangelocations planner test because region assigment
is non-deterministic. I'll have a follow up patch to address that.
2014-01-08 10:50:48 -08:00
Lenni Kuff
2f7198292a Add support for auxiliary workloads, tests, and datasets
This change adds support for auxiliary worksloads, tests, and datasets. This is useful
to augment the regular test runs with some additional tests that do not belong in the
main Impala repo.
2014-01-08 10:50:32 -08:00
Skye Wanderman-Milne
223b1a8e47 IMPALA-293: Impala is unable to query RCFile tables which describe less columns than the file's header. 2014-01-08 10:50:17 -08:00
Skye Wanderman-Milne
cc6007cf9e IMPALA-262: Querying text/lzo table that is not indexed causes an impalad segfault 2014-01-08 10:49:52 -08:00
Lenni Kuff
36e9fe1c1a Run compute table stats statements using Hive CLI
This works around a problem with computing table stats via the Hive Meta Store client
API. When executing these stements via the MetaStoreClient, all tables were getting a
num_rows=0 value returned from the ANALYZE TABLE query.
2014-01-08 10:49:19 -08:00
Lenni Kuff
831ee529be Fixed data loading bugs, moved most tables out of load-dependent-tables 2014-01-08 10:48:56 -08:00
Lenni Kuff
e0a7b7cb55 Compute column stats on tables used by Planner tests 2014-01-08 10:48:48 -08:00
Lenni Kuff
328ceed4e7 Add support for generating lzo compressed text files and running tests against lzo 2014-01-08 10:48:38 -08:00
Lenni Kuff
6e1f8d178a Update utility script to compute column and table stats for given table(s) 2014-01-08 10:48:23 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
99bb22dcac Add db name filter to compute stats, run compute stats on functional/text tables 2014-01-08 10:48:08 -08:00
Alan Choi
73c8ee3d96 IMPALA-18 Ignore hidden file prefixed with . or _ 2014-01-08 10:48:00 -08:00
Lenni Kuff
d2e4776731 Support passing snapshot file to buildall, add script to run all tests, remove old tests 2014-01-08 10:47:59 -08:00
Lenni Kuff
d5177c3c30 Update run-workload to support specifying table format test vectors from command line 2014-01-08 10:47:20 -08:00
Lenni Kuff
deea7a86b9 Remove Trevni from mixed format table and fix data loading bug 2014-01-08 10:47:10 -08:00
Lenni Kuff
30dbf59ef2 Final changes to enable Python test infrastructure and tests
With this change the Python tests will now be called as part of buildall and
the corresponding Java tests have been disabled. The new tests can also be
invoked calling ./tests/run-tests.sh directly.

This includes a fix from Nong that caused wrong results for limit on non-io
manager formats.
2014-01-08 10:46:57 -08:00
Lenni Kuff
bed633c1ae Extract config/metastore creation from buildall + script for loading warehouse snapshot 2014-01-08 10:46:53 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
option, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).

You will see that now each combination of table format + query exec options is
treated like an individual test case. this will make it much easier to debug
exactly where something failed.

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
Lenni Kuff
1e25c98fb4 Test data loading framework improvements
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of uneeded sections from schema template definitions (ex. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script
2014-01-08 10:46:49 -08:00
Michael Ubell
8a5297a526 Add HdfsLzoTextScanner 2014-01-08 10:46:35 -08:00
Michael Ubell
85807f6169 Start a single impalad to avoid data load race 2014-01-08 10:46:18 -08:00
Michael Ubell
37aaf06f79 IMP-390 Get rid of test dependencies on InProcessQE and Runquery 2014-01-08 10:46:18 -08:00
Michael Ubell
0c4f025a5e Fix loading of nulltable data, remove loading functional-planner data 2014-01-08 10:45:58 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We lookup the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/* to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF1". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with a unique names from the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: load-data.sh -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00
Michael Ubell
02d63d8dc3 Trevni file support 2014-01-08 10:44:19 -08:00
Lenni Kuff
84d91fca4f Fix sequence file data loading for the alltypesmixedformat table
Moved this out of the data loading framework because it is kind of a special
case. I will consider how we can update the framework to address mixed format
tables.
2014-01-08 10:44:18 -08:00