Commit Graph

16 Commits

Author SHA1 Message Date
Thomas Tauber-Marshall
ba84ad03cb IMPALA-6954: Fix problems with CTAS into Kudu with an expr rewrite
This patch fixes two problems:
- Previously a CTAS into a Kudu table where an expr rewrite occurred
  would create an unpartitioned table, due to the partition info being
  reset in TableDataLayout and then never reconstructed. Since the
  Kudu partition info is set by the parser and never changes, the
  solution is to not reset it.
- Previously a CTAS into a Kudu table with a range partition where an
  expr rewrite occurred would fail with an analysis exception due to
  a Precondition check in RangePartition.analyze that checked that
  the RangePartition wasn't already analyzed, as the analysis can't
  be done twice. Since the state in RangePartition never changes, it
  doesn't need to be reanalyzed and we can just return instead of
  failing on the check.

Testing:
- Added an e2e test that creates a partitioned Kudu table with a CTAS
  with a rewrite, and checks that the expected partitions are created.

Change-Id: I731743bd84cc695119e99342e1b155096147f0ed
Reviewed-on: http://gerrit.cloudera.org:8080/10251
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-02 02:54:23 +00:00
Zoltan Borok-Nagy
2ee914d5b3 IMPALA-5903: Inconsistent specification of result set and result set metadata
Before this commit it was quite random which DDL oprations
returned a result set and which didn't.

With this commit, every DDL operations return a summary of
its execution. They declare their result set schema in
Frontend.java, and provide the summary in CalatogOpExecutor.java.

Updated the tests according to the new behavior.

Change-Id: Ic542fb8e49e850052416ac663ee329ee3974e3b9
Reviewed-on: http://gerrit.cloudera.org:8080/9090
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 02:21:48 +00:00
Thomas Tauber-Marshall
832974383c IMPALA-6445: Test for kudu master address with whitespace
A concern was brought up that Impala might not handle kudu master
addresses containing whitespace correctly. Turns out that the Kudu
client takes care of stripping whitespace, so it works, but it would
be good to have a test to ensure it continues to work.

Change-Id: I1857b8dbcb5af66d69f7620368cd3b9b85ae7576
Reviewed-on: http://gerrit.cloudera.org:8080/9876
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 20:29:51 +00:00
Grant Henke
0c8eba076c IMPALA-5752: Add support for DECIMAL on Kudu tables
Adds support for the Kudu DECIMAL type introduced in Kudu 1.7.0.

Note: Adding support for Kudu decimal min/max filters is
tracked in IMPALA-6533.

Tests:
* Added Kudu create with decimal test to AnalyzeDDLTest.java
* Added Kudu table_format to test_decimal_queries.py
** Both decimal.test and decimal-exprs.test workloads
* Added decimal queries to the following Kudu workloads:
** kudu_create.test
** kudu_delete.test
** kudu_insert.test
** kudu_update.test
** kudu_upsert.test

Change-Id: I3a9fe5acadc53ec198585d765a8cfb0abe56e199
Reviewed-on: http://gerrit.cloudera.org:8080/9368
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-23 00:03:54 +00:00
Thomas Tauber-Marshall
b881fba763 IMPALA-5546: Allow creating unpartitioned Kudu tables
This patch makes it possible to create unpartitioned, managed Kudu
tables from Impala, by making the 'PARTITION BY' clause of 'CREATE
TABLE... STORED AS KUDU' optional:

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  (col_name data_type
    [kudu_column_attribute ...]
    [COMMENT 'col_comment']
    [, ...]
    [PRIMARY KEY (col_name[, ...])]
  )
  [PARTITION BY kudu_partition_clause]
  [COMMENT 'table_comment']
  STORED AS KUDU
  [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]

Kudu represents this as a table that is range partitioned on no
columns.

Because unpartitioned Kudu tables are inefficient for large data
sizes, and because the syntax doesn't make it explicit that the table
will be unpartitioned, there is a warning issued to encourage users
to created partitioned tables.

This patch also converts the tpch_kudu.nation and tpch_kudu.region
tables to be unpartitioned, as they are very small.

Testing:
- Updated analysis tests.
- Added e2e test that creates unpartitioned table and inserts into it.

Change-Id: I281f173dbec1484eb13434d53ea581a0f245358a
Reviewed-on: http://gerrit.cloudera.org:8080/7446
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-07 19:53:59 +00:00
Matthew Jacobs
2dcbefc652 IMPALA-5338: Fix Kudu timestamp column default values
While support for TIMESTAMP columns in Kudu tables has been
committed (IMPALA-5137), it does not support TIMESTAMP
column default values.

This supports CREATE TABLE syntax to specify the default
values, but more importantly this fixes the loading of Kudu
tables that may have had default values set on
UNIXTIME_MICROS columns, e.g. if the table was created via
the python client. This involves fixing KuduColumn to hide
the LiteralExpr representing the default value because it
will be a BIGINT if the column type is TIMESTAMP. It is only
needed to call toSql() and toStringValue(), so helper
functions are added to KuduColumn to encapsulate special
logic for TIMESTAMP.

TODO: Add support and tests for ALTER setting the default
value (when IMPALA-4622 is committed).

Change-Id: I655910fb4805bb204a999627fa9f68e43ea8aaf2
Reviewed-on: http://gerrit.cloudera.org:8080/6936
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-02 01:47:48 +00:00
Matthew Jacobs
6226e59702 IMPALA-5137: Support TIMESTAMPs in Kudu range predicate DDL
Adds support in DDL for timestamps in Kudu range partition syntax.

For convenience, strings can be specified with or without
explicit casts to TIMESTAMP.

E.g.
create table ts_ranges (ts timestamp primary key, i int)
partition by range (
  partition '2009-01-02 00:00:00' <= VALUES < '2009-01-03 00:00:00'
) stored as kudu

Range bounds are converted to Kudu UNIXTIME_MICROS during
analysis.

Testing: Adds FE and EE tests.

Change-Id: Iae409b6106c073b038940f0413ed9d5859daaeff
Reviewed-on: http://gerrit.cloudera.org:8080/6849
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-19 00:41:46 +00:00
Matthew Jacobs
878fcf5a74 IMPALA-5111: Fix check when creating NOT NULL PK col in Kudu
The fix for IMPALA-4616 broke the ability to create a PK key
col in a Kudu table as explicitly 'NOT NULL'. While this is
the default, it should be possible to specify.

The precondition that was failing was fixed, and some tests
were added/modified.

Change-Id: I557eea7cd994d6a2ed38893d283d08107e78f789
Reviewed-on: http://gerrit.cloudera.org:8080/6465
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-24 21:22:50 +00:00
Dimitris Tsirogiannis
5ea1798661 IMPALA-4619: Allow NULL as default value in Kudu tables
This commit fixes an issue where an error is thrown if the default value
for a Kudu column is set to NULL.

Change-Id: Ida27ce56f1dd7603485a69c680db3bcea6702aff
Reviewed-on: http://gerrit.cloudera.org:8080/5405
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-12-08 04:53:38 +00:00
Dan Burkert
f83652c1da Replace INTO N BUCKETS with PARTITIONS N in CREATE TABLE
This commit also removes the now unused `DISTRIBUTE`, `SPLIT`, and
`BUCKETS` keywords that were going to be newly released in Impala 2.6,
but are now unused. Additionally, a few remaining uses of the
`DISTRIBUTE BY` syntax has been switched to `PARTITION BY`.

Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-07 07:31:16 +00:00
Matthew Jacobs
5188f879a7 IMPALA-4477: Bump Kudu version to latest master (60aa54e)
Bumps the toolchain version to get a newer Kudu build.

Also fixes test failures resulting from changes in Kudu.
Notably error strings have changed (IMPALA-4590) and the
number of replicas must be odd (IMPALA-4589).

Note: The toolchain binaries starting with this build are
now using the toolchain binutils rather than the system
binutils.

Testing: private exhaustive build.

Change-Id: If1912f058c240fbe82b06f77e31add7755289be1
Reviewed-on: http://gerrit.cloudera.org:8080/5369
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-07 05:11:13 +00:00
Dimitris Tsirogiannis
cba93f1ac3 IMPALA-4561: Replace DISTRIBUTE BY with PARTITION BY in CREATE TABLE
Change-Id: I0e07c41eabb4c8cb95754cf04293cbd9e03d6ab2
Reviewed-on: http://gerrit.cloudera.org:8080/5317
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-12-06 10:41:53 +00:00
Thomas Tauber-Marshall
3833707dbd IMPALA-4466: Improve Kudu CRUD test coverage
The results in the test files were verified by hand.

This patch also introduces a new test section 'DML_RESULTS', which
takes the name of a table as a comment and the contents of the
table as its body and then verifies that the body matches the
actual contents of the table. This makes it easy to check that a
DML operation has the desired effect on the contents of a table,
rather than always having to add another test case that runs a
select on the table. For now, this section cannot be used in a
test along with the RESULTS or ERRORS sections.

TODO: Refactor the DML test case handling (IMPALA-4471)

Change-Id: Ib9e7afbef60186edb00a9d11fbe5a8c64931add6
Reviewed-on: http://gerrit.cloudera.org:8080/4953
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 02:54:30 +00:00
Dimitris Tsirogiannis
d802f321b2 IMPALA-3724: Support Kudu non-covering range partitions
This commit adds support for non-covering range partitions in Kudu
tables. The SPLIT ROWS clause is now deprecated and no longer supported.
The following new syntax provides more flexibility in creating range
partitions and it supports bounded and unbounded ranges as well as single value
partitions; multi-column range partitions are supported as well.

The new syntax is:
DISTRIBUTE BY RANGE (col_list)
(
 PARTITION lower_1 <[=] VALUES <[=] upper_1,
 PARTITION lower_2 <[=] VALUES <[=] upper_2,
             ....
 PARTITION lower_n <[=] VALUES <[=] upper_n,
 PARTITION VALUE = val_1,
             ....
 PARTITION VALUE = val_n
)

Multi-column range partitions are specified as follows:
DISTRIBUTE BY RANGE (col1, col2,..., coln)
(
 PARTITION VALUE = (col1_val, col2_val, ..., coln_val),
                     ....
 PARTITION VALUE = (col1_val, col2_val, ..., coln_val)
)

Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06
Reviewed-on: http://gerrit.cloudera.org:8080/4856
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-04 22:02:22 +00:00
Matthew Jacobs
9b507b6ed6 IMPALA-4379: Fix and test Kudu table type checking
Creating Kudu tables shouldn't allow types not supported by
Kudu (e.g. VARCHAR/CHAR, DECIMAL, TIMESTAMP, collection types).
The behavior is inconsistent: for some types it throws in
the catalog, for VARCHAR/CHAR these become strings. This changes
behavior so that all fail during analysis. Analysis tests
were added.

Similarly, external tables cannot contain Kudu types that
Impala doesn't support (e.g. UNIXTIME_MICROS, BINARY). Tests
were added to validate this behavior. Note that this
required upgrading the python Kudu client.

This also fixes a small corner case with ALTER TABLE:
ALTER TABLE shouldn't allow Kudu tables to change the
storage descriptor tblproperty, otherwise the table metadata
gets in an inconsistent state.

Tests were added for all of the above.

Change-Id: I475273cbbf4110db8d0f78ddf9a56abfc6221e3e
Reviewed-on: http://gerrit.cloudera.org:8080/4857
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-10-31 16:03:54 +00:00
Dimitris Tsirogiannis
041fa6d946 IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.

Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU

Changes:
1) Remove the requirement to specify table properties such as key
   columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
   schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
   time of creation in Impala.
4) Disallow table properties that could conflict with an existing
   table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
   addresses. The flag is used as the default value for the table
   property kudu_master_addresses but it can still be overriden
   using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
   wasn't implemented for Kudu tables and silently ignored. The Kudu
   tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
   Kudu) the existence of the other delegate and the use of delegates in
   general has led to confusion. The Kudu delegate only exists to provide
   functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
   standard. When used at the column level, only one column can be
   marked as a key. When used at the table level, multiple columns can
   be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
   The old "kudu.key_columns" table property is no longer accepted
   though it is still used internally. "PRIMARY" is now a keyword.
   The ident style declaration is used for "KEY" because it is also used
   for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
   The table property "kudu.table_name" is optional for managed tables
   and is required for external tables. If for a managed table a Kudu
   table name is not provided, a table name will be generated based
   on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
   of HMS when a table is loaded or refreshed. Table/column metadata
   are cached in the catalog and are stored in HMS in order to be
   able to use table and column statistics.

Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 10:52:25 +00:00