Commit Graph

142 Commits

Author SHA1 Message Date
Dimitris Tsirogiannis
af67b2fef4 IMPALA-4514: Fix broken exhaustive builds caused by non-nullable columns
This commit fixes the broken exhaustive Impala builds. The issue was
caused by Kudu table that didn't properly specify nullability
constraints. Hence, some rows were rejected during data loading causing
some tests to fail.

Change-Id: Ib6f4b4c88ef18b1731b7c9789aad602880e18035
Reviewed-on: http://gerrit.cloudera.org:8080/5157
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-21 22:24:25 +00:00
Dimitris Tsirogiannis
3db5ced4ce IMPALA-3726: Add support for Kudu-specific column options
This commit adds support for Kudu-specific column options in CREATE
TABLE statements. The syntax is:
CREATE TABLE tbl_name ([col_name type [PRIMARY KEY] [option [...]]] [, ....])
where option is:
| NULL
| NOT NULL
| ENCODING encoding_val
| COMPRESSION compression_algorithm
| DEFAULT expr
| BLOCK_SIZE num

The output of the SHOW CREATE TABLE statement was altered to include all the specified
column options for Kudu tables.

Change-Id: I727b9ae1b7b2387db752b58081398dd3f3449c02
Reviewed-on: http://gerrit.cloudera.org:8080/5026
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-18 11:41:01 +00:00
Taras Bobrovytsky
eb8120d218 IMPALA-3812: Fix error message for unsupported types
Before this patch an unclear error message was returned if DATE or
DATETIME appeared in the select list after a star expansion. This was
because DATE and DATETIME PrimitiveType was serialized as INVALID_TYPE.
This is fixed by serializing correctly.

Change-Id: I9019b4bfd219f94e554c795befd3ff5e39706ea9
Reviewed-on: http://gerrit.cloudera.org:8080/4859
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 05:31:34 +00:00
David Knupp
b14f319708 IMPALA-4461: Make sure data gets loaded for wide hbase tables.
Ths patch reverts a change that broke the exhaustive suite of Impala
tests. The change was introduced here:

ce4c5f6743

The orginal problem was that data load was failing when run against a
remote cluster, due to a 4000 byte max for SERDEPROPERTIES.PARAM_VALUE,
a limitation that is well described in HIVE-1364. Locally, when we load
data, we work around the issue here:

https://github.com/apache/incubator-impala/blob/master/bin/create-test-configuration.sh#L99

When testing on CDH remote cluster however, this "fix" never gets applied.
(It also assumes the database will always by postgres.)

I made this change without realizing its full effect, or appreciating
exactly how exhaustive our exhaustive test suite really is. Another
solution will need to be found for the case of remote cluster testing,
but this should unblock the local build for now.

As far as testing, I ran the full suite of tests in query_test/
test_scanners.py, and they all pass after removing these lines.

Change-Id: If2148d6546789c6c53c8e045717081b24ce76689
Reviewed-on: http://gerrit.cloudera.org:8080/5033
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-11-11 00:37:59 +00:00
Martin Grund
ce4c5f6743 IMPALA-4365: Enabling end-to-end tests on a remote cluster
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:

  - Managed by Cloudera Manager (CM)
  - GPL Extras need to be installed
  - KMS and KeyTrustee installed and available as a service
  - SERDEPROPERTIES in the Hive DB modified to accept wide tables
  - Hive warehouse dir points to /test-warehouse

The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This insures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)

It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.

Usage: remote_data_load.py [options] <cm_host address>

Options:
  -h, --help            show this help message and exit
  --snapshot-file=SNAPSHOT_FILE
                        Path to the test-warehouse archive
  --cm-user=CM_USER     Cloudera Manager admin user
  --cm-pass=CM_PASS     Cloudera Manager admin user password
  --gateway=GATEWAY     Gateway host to upload the data from. If not
                        set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
                        passwordless SSH configured.
  --no-load             Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --test                Run end-to-end tests against cluster

Testing:

This patch is being submitted with the understanding that there are
still clean up issues that need to be addressed in the remote data
load script, for which JIRA's have been filed.

However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 10:16:55 +00:00
Dimitris Tsirogiannis
d802f321b2 IMPALA-3724: Support Kudu non-covering range partitions
This commit adds support for non-covering range partitions in Kudu
tables. The SPLIT ROWS clause is now deprecated and no longer supported.
The following new syntax provides more flexibility in creating range
partitions and it supports bounded and unbounded ranges as well as single value
partitions; multi-column range partitions are supported as well.

The new syntax is:
DISTRIBUTE BY RANGE (col_list)
(
 PARTITION lower_1 <[=] VALUES <[=] upper_1,
 PARTITION lower_2 <[=] VALUES <[=] upper_2,
             ....
 PARTITION lower_n <[=] VALUES <[=] upper_n,
 PARTITION VALUE = val_1,
             ....
 PARTITION VALUE = val_n
)

Multi-column range partitions are specified as follows:
DISTRIBUTE BY RANGE (col1, col2,..., coln)
(
 PARTITION VALUE = (col1_val, col2_val, ..., coln_val),
                     ....
 PARTITION VALUE = (col1_val, col2_val, ..., coln_val)
)

Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06
Reviewed-on: http://gerrit.cloudera.org:8080/4856
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-04 22:02:22 +00:00
Jim Apple
f397d75600 IMPALA-3853: More RAT cleaning.
Apache RAT is a tool to audit code repositories for the ASF copyright
rules. Our wrapper script around it found a few more things; this
patch fixes those things.

Change-Id: I01367ea26feaf6a3e2cf4ac04f1c6a63f6e66195
Reviewed-on: http://gerrit.cloudera.org:8080/4904
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-03 22:53:04 +00:00
Taras Bobrovytsky
40bce41765 Fix TPCH and TPCDS Kudu loading templates
The templates (used by the stress test) for loading the TCPH and TPCDS
data into Kudu had a missing "stored as kudu" statement.

Change-Id: Ibe84e1831cc0722bd0381ec76f385ae2a02a6841
Reviewed-on: http://gerrit.cloudera.org:8080/4939
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
2016-11-03 21:28:43 +00:00
Dimitris Tsirogiannis
2990696e08 IMPALA-4374: Use new syntax for creating TPC-DS/H tables in Kudu
stress test

This commit modifies the DDL statements for creating TPC-DS/H tables in
Kudu. The DDL statements now use the new syntax for creating Kudu tables
(see IMPALA-3719).

Change-Id: I2d501fb9c3cba00b1fb0f7b5941db49cbbda5a53
Reviewed-on: http://gerrit.cloudera.org:8080/4860
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-02 23:34:27 +00:00
Tim Armstrong
6587c08f70 IMPALA-4387: validate decimal type in Avro file schema
This patch prevents an invalid decimal type in an Avro file schema from
crashing Impala. Most invalid Avro schemas are caught by the frontend,
but file schemas still need to be validated by the backend.

After this patch files with bad schemas are skipped.

Testing:
This was hit very rarely by the scanner fuzzing. Added a regression test that
scans a file with a bad schema.

Change-Id: I25a326ee2220bc14d3b5f887dc288b4adf859cfc
Reviewed-on: http://gerrit.cloudera.org:8080/4876
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-10-30 00:12:58 +00:00
Dimitris Tsirogiannis
8a49ceaae5 IMPALA-3739: Enable stress tests on Kudu
This commit modifies the stress test framework to run TPC-H and TPC-DS
workloads against Kudu. The follwing changes are included in this
commit:
1. Created template files with DDL and DML statements for loading TPC-H and
   TPC-DS data in Kudu
2. Created a script (load-tpc-kudu.py) to load data in Kudu. The
   script is invoked by the stress test runner to load test data in an
   existing Impala/Kudu cluster (both local and CM-managed clusters are
   supported).
3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL
   files with TPC-H queries for Kudu were added in a previous patch.
4. Modified the stress test runner to take additional parameters
   specific to Kudu (e.g. kudu master addr)

The stress test runner for Kudu was tested on EC2 clusters for both TPC-H
and TPC-DS workloads.

Missing functionality:
* No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu.
* Not all supported TPC-DS queries are included. Currently, only the
  TPC-DS queries from the testdata/workloads/tpcds/queries directory
  were modified to run against Kudu.

Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34
Reviewed-on: http://gerrit.cloudera.org:8080/4327
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 11:01:37 +00:00
Dimitris Tsirogiannis
041fa6d946 IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.

Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU

Changes:
1) Remove the requirement to specify table properties such as key
   columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
   schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
   time of creation in Impala.
4) Disallow table properties that could conflict with an existing
   table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
   addresses. The flag is used as the default value for the table
   property kudu_master_addresses but it can still be overriden
   using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
   wasn't implemented for Kudu tables and silently ignored. The Kudu
   tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
   Kudu) the existence of the other delegate and the use of delegates in
   general has led to confusion. The Kudu delegate only exists to provide
   functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
   standard. When used at the column level, only one column can be
   marked as a key. When used at the table level, multiple columns can
   be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
   The old "kudu.key_columns" table property is no longer accepted
   though it is still used internally. "PRIMARY" is now a keyword.
   The ident style declaration is used for "KEY" because it is also used
   for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
   The table property "kudu.table_name" is optional for managed tables
   and is required for external tables. If for a managed table a Kudu
   table name is not provided, a table name will be generated based
   on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
   of HMS when a table is loaded or refreshed. Table/column metadata
   are cached in the catalog and are stored in HMS in order to be
   able to use table and column statistics.

Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 10:52:25 +00:00
Lars Volker
ef4c9958d0 IMPALA-4047: Remove occurrences of 'CDH'/'cdh' from repo
This change removes some of the occurrences of the strings 'CDH'/'cdh'
from the Impala repository. References to Cloudera-internal Jiras have
been replaced with upstream Jira issues on issues.cloudera.org.

For several categories of occurrences (e.g. pom.xml files,
DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to
remove the occurrences left after this change.

Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8
Reviewed-on: http://gerrit.cloudera.org:8080/4187
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-13 00:40:41 +00:00
Matthew Jacobs
c7fa03286b IMPALA-3718: Support subset of functional-query for Kudu
Adds initial support for the functional-query test workload
for Kudu tables.

There are a few issues that make loading the functional
schema difficult on Kudu:
 1) Kudu tables must have one or more columns that together
    constitute a unique primary key.
   a) Primary key columns must currently be the first columns
      in the table definition (KUDU-1271).
   b) Primary key columns cannot be nullable (KUDU-1570).
 2) Kudu tables must be specified with distribution
    parameters.

(1) limits the tables that can be loaded without ugly
workarounds. This patch only includes important tables that
are used for relevant tests, most notably the alltypes*
family. In particular, alltypesagg is important but it does
not have a set of columns that are non-nullable and form a unique
primary key. As a result, that table is created in Kudu with
a different name and an additional BIGINT column for a PK
that is a unique index and is generated at data loading time
using the ROW_NUMBER analytic function. A view is then
wrapped around the underlying table that matches the
alltypesagg schema exactly. When KUDU-1570 is resolved, this
can be simplified.

(2) requires some additional considerations and custom
syntax. As a result, the DDL to create the tables is
explicitly specified in CREATE_KUDU sections in the
functional_schema_constraints.csv, and an additional
DEPENDENT_LOAD_KUDU section was added to specify custom data
loading DML that differs from the existing DEPENDENT_LOAD.

TODO: IMPALA-4005: generate_schema_statements.py needs refactoring

Tests that are not relevant or not yet supported have been
marked with xfail and a skip where appropriate.

TODO: Support remaining functional tables/tests when possible.

Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc
Reviewed-on: http://gerrit.cloudera.org:8080/4175
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:11:04 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Matthew Jacobs
b9f6392e84 IMPALA-3939: Data loading may fail on tpch kudu
Change-Id: I7f229a9f74fa4dceb14335914b0dde9bf607264e
Reviewed-on: http://gerrit.cloudera.org:8080/3818
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-30 03:59:29 +00:00
Alex Behm
c77fb628f7 IMPALA-3915: Register privilege and audit requests when analyzing resolved table refs.
The bug: We used to register privilege requests for table refs in TableRef.analyze()
which only got called for unresolved TableRefs. As a result, a reference to a view that
contains a subquery did not get properly authorized, explained as follows.
1. In the first analysis pass the view is replaced by a an InlineViewRef and we
   correctly register an authorizarion request.
2. We rewrite the subquery via the StmtRewriter and wipe the analysis state, but
   preserve the InlineViewRef that replaces the view reference.
3. The rewritten statement is analyzed again, but since an InlineViewRef is
   considered to be resolved, we never call TableRef.analyze(), and hence
   never register an authorization event for the view.

The fix: We now register authorization and auditing events when calling analyze() on a
resolved TableRef (BaseTableRef, InlineViewRef, CollectionTableRef).

Change-Id: I18fa8af9a94ce190c5a3c29c3221c659a2ace659
Reviewed-on: http://gerrit.cloudera.org:8080/3783
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 05:16:23 +00:00
Tim Armstrong
c1d70f814e IMPALA-3227: generate test TPC data sets during data load
The generated data is identical to the pregenerated tpch.tar.gz
and tpcds.tar.gz data that was used previously and were not
publically accessible.

This adds a "preload" hook to bin/load-data.py that can execute custom
logic for each data set. This is used to call the TPC-H and TPC-DS data
generation utilities that are already available in the Impala toolchain.

Testing:
Ran private test job with loading from snapshot disabled and without
the tpch/tpcds tarballs available.

Change-Id: Ieccfbd7d8d4a91bffddbe35abb7f5572e71a71cf
Reviewed-on: http://gerrit.cloudera.org:8080/3761
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:56:57 +00:00
Dimitris Tsirogiannis
6fbd35fa87 Enable TPC-H workload for Kudu tables
With this commit we enable loading of TPC-H data in Kudu tables and
running the 22 TPC-H queries against Kudu. Since Kudu doesn't support
the decimal data type, we had to modify the queries by using round()
function and update the test results.

Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:35:11 +00:00
Tim Armstrong
547be27e77 IMPALA-3745: parquet invalid data handling
Added checks/error handling:
* Negative string lengths while decoding dictionary or data page.
* Buffer overruns while decoding dictionary or data page.
* Some metadata FILECHECKs were converted to statuses.

Testing:
Unit tests for:
* decoding of strings with negative lengths
* truncation of all parquet types
* dictionary creation correctly handling error returns from Decode().

End-to-end tests for handling of negative string lengths in
dictionary- and plain-encoded data in corrupt files, and for
handling of buffer overruns for string data. The corrupted
parquet files were generated by hacking Impala's parquet
writer to write invalid lengths, and by hacking it to
write plain-encoded data instead of dictionary-encoded
data by default.

Performance:
set num_nodes=1;
set num_scanner_threads=1;
select * from biglineitem where l_orderkey = -1;

I inspected MaterializeTupleTime. Before the average was 8.24s and after
was 8.36s (a 1.4% slowdown, within the standard deviation of 1.8%).

Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece
Reviewed-on: http://gerrit.cloudera.org:8080/3387
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-06-15 21:33:39 -07:00
Skye Wanderman-Milne
01287a3ba9 IMPALA-3441, IMPALA-3659: check for malformed Avro data
This patch adds error checking to the Avro scanner (both the codegen'd
and interepted paths), including out-of-bounds checks and data
validity checks.

I ran a local benchmark using the following queries:
  set num_scanner_threads=1;
  select count(i) from default.avro_bigints_big; # file contains only longs
  select max(l_orderkey) from biglineitem_avro; # file has tpch.lineitem schema

Both benchmark queries see negligable or no performance impact.

This patch adds a new Avro scanner unit test and an end-to-end test
that queries several corrupted files, as well as updates the zig-zag
varlen int unit test.

Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Reviewed-on: http://gerrit.cloudera.org:8080/3072
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-06-13 18:32:32 -07:00
Matthew Jacobs
f413e236a8 IMPALA-3579: Strict handling of numeric overflow in text parsing
Adds a query option 'strict_mode' which treats integer and
floating pt overflows as parse errors. In the past,
overflows were ignored and the max value was returned. When
this query option is set, overflowing values are treated as if
they were completely invalid data, i.e. NULL is returned.
When abort_on_error is enabled, this means the query is
aborted.

Notes:
* DECIMAL overflow/underflow is already treated as an error.
* The handling in text-converter treats underflows the same
  as overflows, so they would result in the same behavior.
  However, floating point parsing never returns an underflow
  today.
* We may also want to handle numeric values that are truncated
  when parsing to integer types, e.g. 10.5 -> 10.

Change-Id: I7409c31ec0cb6fe0b2d9842b9f58fe1670914836
Reviewed-on: http://gerrit.cloudera.org:8080/3150
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-05-23 08:40:20 -07:00
Lars Volker
a09c80a33e IMPALA-3458: Fix table creation to test insert with header lines
For IMPALA-1740 we added a test to insert.test, which creates a table and
inserts data. The table was created on HDFS by default and thus inserts with
compression enabled did not work. This change adds the required table to the
functional schema in the same way we do it for the other insert tests.

Change-Id: Ie68e7067b7a16218d27935820d5d1ce7035d2e6c
Reviewed-on: http://gerrit.cloudera.org:8080/2919
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:47 -07:00
Lars Volker
b5570da405 IMPALA-1740: Add support for skip.header.line.count.
HIVE-5795 introduced a parameter skip.header.line.count to skip header
lines from input files. This change introduces the capability to skip
an arbitrary number of header lines from csv input files on hdfs. The
size of the total file header must be smaller than
max_scan_range_length, otherwise an error will be reported. This is
necessary because scan ranges are not read in disk order, so there is
no way of identifying header lines except by counting from the start
of the first scan range.

[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='1');
Query: alter table t1 set tblproperties('skip.header.line.count'='1')
[localhost:21000] > select * from t1;
Query: select * from t1
+----+----+
| c1 | c2 |
+----+----+
| 1  | 1  |
| 2  | 2  |
| 3  | 3  |
+----+----+
Fetched 3 row(s) in 0.32s
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='0');
Query: alter table t1 set tblproperties('skip.header.line.count'='0')
[localhost:21000] > select * from t1;
Query: select * from t1
+------+------+
| c1   | c2   |
+------+------+
| NULL | NULL |
| 1    | 1    |
| 2    | 2    |
| 3    | 3    |
+------+------+
WARNINGS: Error converting column: 0 TO INT (Data is: num1)
Error converting column: 1 TO DOUBLE (Data is: num2)
file: hdfs://localhost:20500/test-warehouse/t1/test.txt
record: num1,num2

Fetched 4 row(s) in 0.41s

Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543
Reviewed-on: http://gerrit.cloudera.org:8080/2110
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:46 -07:00
Skye Wanderman-Milne
2cbd327d41 Regenerate complextypestbl files to include nested_struct.g field
This field was included in the schema and data files, but the
checked-in generated parquet files didn't include it. It's not
referenced in any tests so we didn't catch it.

Change-Id: I5d394f074e7082fa12fafb7e57a144a83b3099a6
Reviewed-on: http://gerrit.cloudera.org:8080/2562
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-04-01 05:06:38 +00:00
Sailesh Mukil
76b674850f IMPALA-2466: Add more tests for the HDFS parquet scanner.
These tests functionally test whether the following type of files
are able to be scanned properly:

1) Add a parquet file with multiple blocks such that each node has to
   scan multiple blocks.
2) Add a parquet file with multiple blocks but only one row group
   that spans the entire file. Only one scan range should do any work
   in this case.

Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368
Reviewed-on: http://gerrit.cloudera.org:8080/1500
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-03-25 13:10:15 +00:00
David Alves
7381304a23 Merge branch 'feature/kudu' into cdh5-trunk
This is the final merge commit that merges the 'feature/kudu' branch
into cdh5-trunk.

Change-Id: Ib3dfb4fc7a69c5cb1c5789422ee52fa192ed677a
2016-03-13 19:28:43 -07:00
David Alves
82222abaf5 Merge branch 'feature/kudu' into cdh5-trunk
This merges the 'feature/kudu' branch with cdh5-trunk as of commit:
055500cc753f87f6d1c70627321fcc825044e183

This patch is not a pure merge patch in the sense that goes beyond conflict
resolution to also address reviews to the 'feature/kudu' branch as a whole.

The review items and their resolution can be inspected at:
http://gerrit.cloudera.org:8080/#/c/1403/

Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92
2016-03-11 11:37:58 -08:00
Juan Yu
c9b33ddf63 IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.
Fix a bug in which Impala only reads the first stream
of a multi-stream bz2/gzip file.
Changes the bz2 decoder to read the file in a streaming
fashion rather than reading the entire file into memory
before it can be decompressed.

Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2016-02-28 21:31:37 -08:00
Alex Behm
c6fd5a0fe4 IMPALA-2844: Allow count(*) on RC files with complex types.
This patch also fixes the incorrect error message reported
in the JIRA.

Change-Id: I2c7b732767d154c36bc7189df5177d27a35d0d7b
Reviewed-on: http://gerrit.cloudera.org:8080/2267
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-22 20:16:24 -08:00
Alex Behm
db90f9954e Fix OOM when data loading with the new Hive SNAPSHOT.
This patch is required for upgrading thirdparty.

After upgrading thirdparty, Hive OOMs in data loading TPCDS.
This patch fixes the OOM, but I have no idea why exactly it works.
I had tried several other approaches like increasing the
mapred and JVM heap memory to no avail.

Change-Id: I3c69e4c3bf0f24e0c3b6272c946f71c17e312e01
Reviewed-on: http://gerrit.cloudera.org:8080/2145
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-14 07:08:22 +00:00
Alex Behm
1a4a830a6d IMPALA-2776: Remove escapechartesttable and associated tests.
The original purpose of the escapechartesttable was to test
Impala's behavior on text tables that have the same character
as line terminator and escape character. Recent changes in
Hive have made creating such a table impossible because
1) Only newline is allowed as the line terminator
2) Newline is forbidden as the escape character

See HIVE-11785 for details on the Hive changes.

This commit removes escapechartesttable and all associated
tests, but does not add the same enforcement rules as Hive.
These enforcement rules should be added in a follow-on change.

Change-Id: I2bd9755f4c2cc3d7dfd8d67c3759885951550f08
Reviewed-on: http://gerrit.cloudera.org:8080/1690
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-01-05 06:04:41 +00:00
Skye Wanderman-Milne
dd2eb951d7 IMPALA-2558: DCHECK in parquet scanner after block read error
There was an incorrect DCHECK in the parquet scanner. If abort_on_error
is false, the intended behaviour is to skip to the next row group, but
the DCHECK assumed that execution should have aborted if a parse error
was encountered.

This also:
- Fixes a DCHECK after an empty row group. InitColumns() would try to
  create empty scan ranges for the column readers.
- Uses metadata_range_->file() instead of stream_->filename() in the
  scanner. InitColumns() was using stream_->filename() in error
  messages, which used to work but now stream_ is set to NULL before
  calling InitColumns().

Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701
Reviewed-on: http://gerrit.cloudera.org:8080/1257
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-30 22:35:57 +00:00
Matthew Jacobs
1cd95f79f0 Add partitions to table_no_newline_part IF NOT EXISTS
Loading functional data may fail if table_no_newline_part
partitions already exist. We should only ADD with IF NOT
EXISTS to handle this case as we do with other tables.

Change-Id: I5fe5c318d2cbbd5b5419394212b94a2fe7d386ce
Reviewed-on: http://gerrit.cloudera.org:8080/1261
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-17 00:52:53 +00:00
Sailesh Mukil
277a92a14a IMPALA-2479: Failure in TestParquet.test_verify_runtime_profile
The test_verify_runtime_profile test failed during C5.5 builds and
GVMs because this test relies on the table lineitem_multiblock to
have 3 blocks. However, due to the rules to load the data not being
followed in the functional_schema_template.sql file, the table ended
up being stored with only one block.

This change moves the data load to the end of create-load-data.sh
file which would load the data even for snapshots.

Change-Id: I78030dd390d2453230c4b7b581ae33004dbf71be
Reviewed-on: http://gerrit.cloudera.org:8080/1153
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-10-08 15:16:35 -07:00
Sailesh Mukil
7778b3ded5 IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.
Impala supports reading Parquet files with multiple row groups but
with possible performance degradation due to remote reads. This patch
maximizes scan locality by allowing multiple impalads to scan the
rowgroups in their local splits. Each impalad starts a new scan range
for each split local to it if that split contains row group(s) that
need to be scanned.

Change-Id: Iaecc5fb8e89364780bc59dbfa9ae51d0d124d16e
Reviewed-on: http://gerrit.cloudera.org:8080/908
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:39 -07:00
Martin Grund
4b102fefd2 KUDU-1115: Deprecate 'kudu.split_keys' for range partitioning
For range partitioning use DISTRIBUTE BY RANGE instead.

Change-Id: I055b605312a6be76a5439526f9c10ab4c4b432ce
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/8063
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-09-16 16:43:47 -07:00
ishaan
3278ab9ef7 IMPALA-2292: Change the type of timestamp_col to string in the table no_avro_schema.
Hive does not allow DDL when an avro table has a timestamp column. This breaks data
loading. This change simply uses a string type to denote the previous timestamp column.

Change-Id: Id4eb42da3ac8a805d12b1c3ff35b200298269084
Reviewed-on: http://gerrit.cloudera.org:8080/745
Tested-by: Internal Jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2015-09-04 06:46:35 +00:00
Skye Wanderman-Milne
bcc73a36da Nested types: read and materialize nested types in Parquet scanner
This patch modifies the Parquet scanner to resolve nested schemas, and
read and materialize collection types. The high-level modification is
to create a CollectionColumnReader that recursively materializes map-
and array-type slots.

This patch also adds many tests, most of which query a new table
called complextypestbl. This table contains hand-generated data that
is meant to expose edge cases in the scanner. The tests mostly test
the scanner, with a few tests of other functionality (e.g. array
serialization).

I ran a local benchmark comparing this scanner code to the original
scanner code on an expanded version of tpch_parquet.lineitem with
48009720 rows. My benchmark involved selecting different numbers of
columns with a single scanner thread, and I looked at the HDFS scan
node time in the query profiles. This code introduces a 10%-20%
regression in single-threaded scan time.

Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a
Reviewed-on: http://gerrit.cloudera.org:8080/576
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-02 19:23:54 +00:00
Martin Grund
f927a285c6 IMPALA-1136, IMPALA-2161: Skip \u0000 characters when dealing Avro schemas
The limitation of the Avro JSON library not to handle \u0000 characters
is to avoid problems with builtin functions like strlen() that would
report wrong length when encountering such a character. Now, in the
case if Impala, for now, we don't support any Unicode characters.
This allows us to actually skip the \u0000 character instead of
interpreting it.

It is important to say that even the most recent versions of Avro do
not support parsing \u0000 characters.

Change-Id: I56dfa7f0f12979fe9705c51c751513aebce4beca
Reviewed-on: http://gerrit.cloudera.org:8080/712
Tested-by: Internal Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2015-09-02 00:37:28 +00:00
Alex Behm
9d46853fbc Nested Types: Check un/supported file formats for complex types.
Before this patch, we used to accept any query referencing complex
types, regardless of the table/partition's file format being scanned.
We would ultimately hit a DCHECK in the BE when attempting to scan
complex types of a table/partition with an unsupported format.

This patch makes queries fail gracefully during planning if a scan
would access a table/partition in a format for which we do not
support complex types.

For mixed-format partitioned Hdfs tables we perform this check
at the partition granularity, so such a table can be scanned as
long as only partitions with supported formats are accessed.

HBase tables with complex-typed columns can be scanned as long as
no complex-typed columns are accessed in the query.

Change-Id: I2fd2e386c9755faf2cfe326541698a7094fa0ffc
Reviewed-on: http://gerrit.cloudera.org:8080/705
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-01 03:26:53 +00:00
Alex Behm
6f0b255c5a Address several shortcomings with respect to the usability of Avro tables.
Addressed JIRAs: IMPALA-1947 and IMPALA-1813

New Feature:
Adds support for creating an Avro table without an explicit
Avro schema with the following syntax.

CREATE TABLE <table_name> column_defs STORED AS AVRO

Fixes and Improvements:
This patch fixes and unifies the logic for reconciling differences between
an Avro table's Avro Schema and its column definitions. This reconciliation
logic is executed during Impala's CREATE TABLE and when loading a table's
metadata. Impala generally performs the schema reconciliation during table
creation, but Hive does not. In many cases, Hive's CREATE TABLE stores the
original column definitions in the HMS (in the StorageDescriptor) instead
of the reconciled column definitions.

The reconciliation logic considers the field/column names and follows this
conflict resolution policy which is similar to Hive's:

Mismatched number of columns -> Prefer Avro columns.
Mismatched name/type -> Prefer Avro column, except:
  A CHAR/VARCHAR column definition maps to an Avro STRING, and is preserved
  as a CHAR/VARCHAR in the reconciled schema.

Behavior for TIMESTAMP:
A TIMESTAMP column definition maps to an Avro STRING and is presented as a STRING
in the reconciled schema, because Avro has no binary TIMESTAMP representation.
As a result, no Avro table may have a TIMESTAMP column (existing behavior).

Change-Id: I8457354568b6049b2dd2794b65fadc06e619d648
Reviewed-on: http://gerrit.cloudera.org:8080/550
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-25 09:52:18 +00:00
Alex Behm
c908ba1b7e IMPALA-1136: Support loading Avro tables without an explicit Avro schema
Hive allows creating Avro tables without an explicit Avro schema since 0.14.0.
For such tables, the Avro schema is inferred from the column definitions,
and not stored in the metadata at all (no Avro schema literal or Avro schema file).

This patch adds support for loading the metadata of such tables, although Impala
currently cannot create such tables (expect a follow-on patch).

Change-Id: I9e66921ffbeff7ce6db9619bcfb30278b571cd95
Reviewed-on: http://gerrit.cloudera.org:8080/538
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-07-31 12:13:37 +00:00
Ippokratis Pandis
00db434cd4 Mistake in schema_constraints by the IMPALA-2130 patch (7c7e69b)
The patch that addressed IMPALA-2130 (7c7e69b) creates a new table intended
only to be used in a test that uses the functional_parquet database.
This patch has a mistake though in schema_constraints which essentially
allows the creation of this table for all types and not only for
parquet/none/none.

Change-Id: I1d72b30557cb9d8f47fe27170808fec75af3bb1d
Reviewed-on: http://gerrit.cloudera.org:8080/524
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-07-23 20:39:17 +00:00
Ippokratis Pandis
e99c68fe52 IMPALA-2130: Wrong verification of Parquet file version
This patch corrects a mistake in the Parquet magic file number verification
and adds a test about it. Note that with this patch Impala may fail to read
Parquet files with wrong magic number that it used to read before.

Change-Id: Iff31accda1e1d541946ef1f750e38886ce4cb8d5
Reviewed-on: http://gerrit.cloudera.org:8080/515
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-07-14 02:52:02 +00:00
David Alves
cc685fb893 Avoid using 'liketbl' in tests
This table's test data contains rows whose the first column
is null. As that column is select as primary key in Kudu this
makes it so we try to insert a row with a NULL key, when loading
test data, which makes data loading fail.

This patch just removes this table from the set of test tables
used in Kudu and makes it so KuduPlannerTest uses another table.

Change-Id: I50664d29509716c8ac995fe8fbcee46233124c78
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6989
Tested-by: David Alves <david.alves@cloudera.com>
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
2015-06-19 13:42:56 -07:00
David Alves
8a91b6b60f Allow to specify split keys in json format in Kudu's CREATE TABLE statement
This allows to specify split keys in Kudu's CREATE TABLE statement as part
of the key/value pairs in TBLPROPERTIES.

Splits are expected to be specified in as json arrays of arrays: [[key1], [key2], ...]
'key1', 'key2 might be single values or comma separated lists of values, depending
on whether the table has a simple of compound primary key.

This also adds a series of test tables to be created for the kutu table format
when load-data.py is executed.

Change-Id: I1824199fda14abb2d7352800789f2b9c2f2124ae
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6974
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-06-19 10:40:16 -07:00
David Alves
cf2618df1e Allow to specify a TABLE_PROPERTIES section in functional_schema_template.sql
This adds a new TABLE_PROPERTIES section that will be included in the
CREATE TABLE statement as TBLPROPERTIES. Each line in this new section
is expected to be in the form:
<file_format>:<key>=<value>

Properties are only added to create table statements of the file format
they specify.

Change-Id: I89ef7ced3351ecf2c727050ca426f6616f3e5bcd
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6945
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-06-18 14:23:28 -07:00
Martin Grund
6b945cf257 Starting Kudu as part of the run-all.sh command / data loading
This starts a kudu mini-cluster with a master and three tablet servers
on a single host. This requires to have a checkout of the kudu-bin
project accessible. By default the location of the checkout is expected
to be $IMPALA_HOME/../kudu-bin.

In addition, this patch enables loading data to kudu via the
load-data.py command. Currently only the "liketbl" is created for Kudu,
but not laoded with data. This has to be done manually from the kudu-bin
repo for now.

Change-Id: Ia7981b023f119759e5e13e78322a6c89f82bd085
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6499
Tested-by: jenkins
Reviewed-by: David Alves <david.alves@cloudera.com>
2015-06-01 15:53:34 -07:00
Juan Yu
934b28fe5e IMPALA-1381: Expand set of supported timezones.
The hardcoded timezone information is from Java version 1.7.0_76.

Change-Id: I32c40d0036473079e5bfd4d0252a648cbb0e7c23
Reviewed-on: http://gerrit.cloudera.org:8080/393
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-05-22 01:32:54 +00:00