Commit Graph

20 Commits

Author SHA1 Message Date
Attila Bukor
2576952655 IMPALA-5092 Add support for VARCHAR in Kudu tables
KUDU-1938 added VARCHAR column type support to Kudu.
This commit adds support for Kudu's VARCHAR type to Impala.

The length of a Kudu varchar is applied as a character length as opposed
to a byte length like Impala currently uses.

When writing data to Kudu, the VARCHAR length is not an issue because
Impala only officially supports ASCII characters and those characters are
the same size in bytes and characters. Additionally, extra bytes would be
truncated by the Kudu client if somehow a value was too long.

When reading data from Kudu, it is possible that the value written by
some other application is wider in bytes than Impala expects and can
handle. This can happen due to multi-byte UTF-8 characters. In that
case, we adjust the length in Impala to truncate the extra bytes of the
value. This isn’t a great solution, but it is one that other integrations
have taken as well, given that Impala doesn’t support UTF-8 values.

IMPALA-5675 tracks adding UTF-8 character length support to VARCHAR
columns, and the truncation code is marked with a TODO that references
that Jira.
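
The gap between character length and byte length described above can be
illustrated with a minimal Python sketch. It is not Impala's truncation
code: the helper name truncate_to_byte_length is hypothetical, and the cut
is a plain byte cut that may split a multi-byte character.

```python
def truncate_to_byte_length(value: bytes, max_bytes: int) -> bytes:
    """Keep at most max_bytes bytes of an already encoded string value.

    Hypothetical helper for illustration only. This is a plain byte cut,
    so it can split a multi-byte UTF-8 character in the middle.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    return value[:max_bytes]


# "héllo" has 5 characters but 6 bytes in UTF-8, so it fits a VARCHAR(5)
# measured in characters while exceeding a 5-byte limit.
text = "héllo"
encoded = text.encode("utf-8")
truncated = truncate_to_byte_length(encoded, 5)
```

Here the trailing "o" is dropped even though the value is only five
characters long, which is the data loss the commit message calls out.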

Testing:
* Performed manual testing of standard DDL and DML interaction
* Manually reproduced a check failure due to multi-byte characters
  and tested that length truncation resolves that issue.
* Added/adjusted the following automated tests:
** AnalyzeDDLTest: CTAS into Kudu with varchar type
** AnalyzeKuduDDLTest: CREATE TABLE in Kudu with VARCHAR type
** kudu_create.test: Create table with VARCHAR column, key, hash
   partition, and range partition
** kudu_describe.test: Describe table with VARCHAR column and key
** kudu_insert.test: Insert with VARCHAR columns including null and
   non-null defaults
** kudu_update.test: Updates with VARCHAR column
** kudu_upsert.test: Upserts with VARCHAR column
** kudu_delete.test: Deletes with VARCHAR columns
** kudu-scan-node.test: Tests basic predicates with VARCHAR columns

Follow on work:
- IMPALA-9580: Add min-max runtime filter support/tests
- IMPALA-9581: Pushdown string predicates
- IMPALA-9583: Automated multibyte truncation tests

Change-Id: I0d4959410fdd882bfa980cb55e8a7837c7823da8
Reviewed-on: http://gerrit.cloudera.org:8080/14197
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
2020-04-01 15:48:36 +00:00
Volodymyr Verovkin
6fdc644fed IMPALA-8800: Added support of Kudu DATE type to Impala
This patch supports reading and writing DATE values
to Kudu tables. It does not add min-max runtime filter
support; that is tracked in the follow-up JIRA IMPALA-9294.
The corresponding Kudu JIRA is KUDU-2632.

Change-Id: I91656749a58ac769b54c2a63bdd4f85c89520b32
Reviewed-on: http://gerrit.cloudera.org:8080/14705
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-12 14:26:13 +00:00
Tim Armstrong
2ca7f8e7c0 IMPALA-7995: part 1: fixes for e2e dockerised impala tests
This fixes all core e2e tests running on my local dockerised
minicluster build. I do not yet have a CI job or script running
but I wanted to get feedback on these changes sooner. The second
part of the change will include the CI script and any follow-on
fixes required for the exhaustive tests.

The following fixes were required:
* Detect docker_network from TEST_START_CLUSTER_ARGS
* get_webserver_port() does not depend on the caller passing in
  the default webserver port. It failed previously because it
  relied on start-impala-cluster.py setting -webserver_port
  for *all* processes.
* Add SkipIf markers for tests that don't make sense or are
  non-trivial to fix for containerised Impala.
* Support loading Impala-lzo plugin from host for tests that depend on
  it.
* Fix some tests that had 'localhost' hardcoded - instead it should
  be $INTERNAL_LISTEN_HOST, which defaults to localhost.
* Fix bug with sorting impala daemons by backend port, which is
  the same for all dockerised impalads.

Testing:
I ran tests locally as follows after having set up a docker network and
starting other services:

  ./buildall.sh -noclean -notests -ninja
  ninja -j $IMPALA_BUILD_THREADS docker_images
  export TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster"
  export FE_TEST=false
  export BE_TEST=false
  export JDBC_TEST=false
  export CLUSTER_TEST=false
  ./bin/run-all-tests.sh

Change-Id: Iee86cbd2c4631a014af1e8cef8e1cd523a812755
Reviewed-on: http://gerrit.cloudera.org:8080/12639
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-13 02:42:32 +00:00
Todd Lipcon
c5918b44b0 IMPALA-8283. Order of Kudu PRIMARY KEYs can be silently ignored
When creating a Kudu table, we would use the 'PRIMARY KEY (...)' clause
to determine which columns made up the primary key, but the order of
those columns would be ignored. Thus a statement like:

CREATE TABLE t (x int, y int, PRIMARY KEY (y, x)) STORED AS KUDU;

would silently create a table with an (x,y) primary key instead of a
(y,x) key. This can have substantial performance implications.

This fixes the frontend to correctly throw an error in this case.

This might be incompatible if someone was previously relying on the bug,
but I think it's worth fixing because it was clearly doing the wrong
thing.
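
A minimal Python sketch of this kind of check follows. The commit message
only says that the frontend now throws an error; the exact rule used here
(the PRIMARY KEY clause must list its columns in declaration order) and the
function name check_primary_key_order are assumptions for illustration.

```python
from typing import Sequence


def check_primary_key_order(columns: Sequence[str],
                            primary_key: Sequence[str]) -> None:
    """Raise ValueError unless primary_key follows the declaration order.

    Hypothetical check, not Impala's frontend code.
    """
    unknown = [c for c in primary_key if c not in columns]
    if unknown:
        raise ValueError(f"unknown primary key columns: {unknown}")
    key_set = set(primary_key)
    # Key columns in the order in which they were declared in the table.
    declared_order = [c for c in columns if c in key_set]
    if list(primary_key) != declared_order:
        raise ValueError(
            f"primary key {list(primary_key)} does not match "
            f"declaration order {declared_order}")


def is_rejected(columns: Sequence[str], primary_key: Sequence[str]) -> bool:
    """Return True if the check raises for this table definition."""
    try:
        check_primary_key_order(columns, primary_key)
    except ValueError:
        return True
    return False
```

With columns (x, y), a key of (x, y) passes and a key of (y, x) is rejected
instead of being silently reordered.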

Change-Id: I0499cee7c532db19cddac3906198d965b27ea604
Reviewed-on: http://gerrit.cloudera.org:8080/12694
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
2019-03-18 23:54:00 +00:00
Thomas Tauber-Marshall
ba84ad03cb IMPALA-6954: Fix problems with CTAS into Kudu with an expr rewrite
This patch fixes two problems:
- Previously a CTAS into a Kudu table where an expr rewrite occurred
  would create an unpartitioned table, due to the partition info being
  reset in TableDataLayout and then never reconstructed. Since the
  Kudu partition info is set by the parser and never changes, the
  solution is to not reset it.
- Previously a CTAS into a Kudu table with a range partition where an
  expr rewrite occurred would fail with an analysis exception due to
  a Precondition check in RangePartition.analyze that checked that
  the RangePartition wasn't already analyzed, as the analysis can't
  be done twice. Since the state in RangePartition never changes, it
  doesn't need to be reanalyzed and we can just return instead of
  failing on the check.
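
The second fix amounts to making analysis idempotent. A minimal Python
sketch of that pattern, assuming a hypothetical RangePartitionSketch class
rather than Impala's actual RangePartition:

```python
class RangePartitionSketch:
    """Hypothetical stand-in for a range partition with one-time analysis."""

    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper
        self.analyzed = False
        self.analysis_runs = 0  # counts how often the real work happened

    def analyze(self) -> None:
        # Returning early replaces a precondition that failed on re-analysis.
        # This is safe only because the state never changes after analysis.
        if self.analyzed:
            return
        self.analysis_runs += 1  # placeholder for the actual analysis work
        self.analyzed = True


partition = RangePartitionSketch(lower=0, upper=10)
partition.analyze()
partition.analyze()  # second call, e.g. after an expr rewrite, is a no-op
```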

Testing:
- Added an e2e test that creates a partitioned Kudu table with a CTAS
  with a rewrite, and checks that the expected partitions are created.

Change-Id: I731743bd84cc695119e99342e1b155096147f0ed
Reviewed-on: http://gerrit.cloudera.org:8080/10251
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-02 02:54:23 +00:00
Zoltan Borok-Nagy
2ee914d5b3 IMPALA-5903: Inconsistent specification of result set and result set metadata
Before this commit it was quite random which DDL operations
returned a result set and which didn't.

With this commit, every DDL operation returns a summary of
its execution. They declare their result set schema in
Frontend.java, and provide the summary in CatalogOpExecutor.java.

Updated the tests according to the new behavior.

Change-Id: Ic542fb8e49e850052416ac663ee329ee3974e3b9
Reviewed-on: http://gerrit.cloudera.org:8080/9090
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 02:21:48 +00:00
Thomas Tauber-Marshall
832974383c IMPALA-6445: Test for kudu master address with whitespace
A concern was brought up that Impala might not handle kudu master
addresses containing whitespace correctly. Turns out that the Kudu
client takes care of stripping whitespace, so it works, but it would
be good to have a test to ensure it continues to work.

Change-Id: I1857b8dbcb5af66d69f7620368cd3b9b85ae7576
Reviewed-on: http://gerrit.cloudera.org:8080/9876
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 20:29:51 +00:00
Grant Henke
0c8eba076c IMPALA-5752: Add support for DECIMAL on Kudu tables
Adds support for the Kudu DECIMAL type introduced in Kudu 1.7.0.

Note: Adding support for Kudu decimal min/max filters is
tracked in IMPALA-6533.

Tests:
* Added Kudu create with decimal test to AnalyzeDDLTest.java
* Added Kudu table_format to test_decimal_queries.py
** Both decimal.test and decimal-exprs.test workloads
* Added decimal queries to the following Kudu workloads:
** kudu_create.test
** kudu_delete.test
** kudu_insert.test
** kudu_update.test
** kudu_upsert.test

Change-Id: I3a9fe5acadc53ec198585d765a8cfb0abe56e199
Reviewed-on: http://gerrit.cloudera.org:8080/9368
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-23 00:03:54 +00:00
Thomas Tauber-Marshall
b881fba763 IMPALA-5546: Allow creating unpartitioned Kudu tables
This patch makes it possible to create unpartitioned, managed Kudu
tables from Impala, by making the 'PARTITION BY' clause of 'CREATE
TABLE... STORED AS KUDU' optional:

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  (col_name data_type
    [kudu_column_attribute ...]
    [COMMENT 'col_comment']
    [, ...]
    [PRIMARY KEY (col_name[, ...])]
  )
  [PARTITION BY kudu_partition_clause]
  [COMMENT 'table_comment']
  STORED AS KUDU
  [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]

Kudu represents this as a table that is range partitioned on no
columns.

Because unpartitioned Kudu tables are inefficient for large data
sizes, and because the syntax doesn't make it explicit that the table
will be unpartitioned, there is a warning issued to encourage users
to create partitioned tables.

This patch also converts the tpch_kudu.nation and tpch_kudu.region
tables to be unpartitioned, as they are very small.

Testing:
- Updated analysis tests.
- Added e2e test that creates unpartitioned table and inserts into it.

Change-Id: I281f173dbec1484eb13434d53ea581a0f245358a
Reviewed-on: http://gerrit.cloudera.org:8080/7446
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-07 19:53:59 +00:00
Matthew Jacobs
2dcbefc652 IMPALA-5338: Fix Kudu timestamp column default values
While support for TIMESTAMP columns in Kudu tables has been
committed (IMPALA-5137), it does not support TIMESTAMP
column default values.

This supports CREATE TABLE syntax to specify the default
values, but more importantly this fixes the loading of Kudu
tables that may have had default values set on
UNIXTIME_MICROS columns, e.g. if the table was created via
the python client. This involves fixing KuduColumn to hide
the LiteralExpr representing the default value because it
will be a BIGINT if the column type is TIMESTAMP. It is only
needed to call toSql() and toStringValue(), so helper
functions are added to KuduColumn to encapsulate special
logic for TIMESTAMP.

TODO: Add support and tests for ALTER setting the default
value (when IMPALA-4622 is committed).

Change-Id: I655910fb4805bb204a999627fa9f68e43ea8aaf2
Reviewed-on: http://gerrit.cloudera.org:8080/6936
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-02 01:47:48 +00:00
Matthew Jacobs
6226e59702 IMPALA-5137: Support TIMESTAMPs in Kudu range predicate DDL
Adds support in DDL for timestamps in Kudu range partition syntax.

For convenience, strings can be specified with or without
explicit casts to TIMESTAMP.

E.g.
create table ts_ranges (ts timestamp primary key, i int)
partition by range (
  partition '2009-01-02 00:00:00' <= VALUES < '2009-01-03 00:00:00'
) stored as kudu

Range bounds are converted to Kudu UNIXTIME_MICROS during
analysis.
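
The conversion can be sketched in Python as follows. This is an
illustration, not Impala's analysis code: the function name
to_unixtime_micros is hypothetical, and treating the literal as UTC is an
assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unixtime_micros(literal: str) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS' to microseconds since the Unix epoch.

    Assumes the literal is in UTC (an assumption made for this sketch).
    """
    parsed = datetime.strptime(literal, "%Y-%m-%d %H:%M:%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating point rounding.
    return (parsed - EPOCH) // timedelta(microseconds=1)


# The bounds from the example partition above.
lower = to_unixtime_micros("2009-01-02 00:00:00")
upper = to_unixtime_micros("2009-01-03 00:00:00")
```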

Testing: Adds FE and EE tests.

Change-Id: Iae409b6106c073b038940f0413ed9d5859daaeff
Reviewed-on: http://gerrit.cloudera.org:8080/6849
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-19 00:41:46 +00:00
Matthew Jacobs
878fcf5a74 IMPALA-5111: Fix check when creating NOT NULL PK col in Kudu
The fix for IMPALA-4616 broke the ability to explicitly declare
a PK col in a Kudu table as 'NOT NULL'. While this is
the default, it should be possible to specify.

The precondition that was failing was fixed, and some tests
were added/modified.

Change-Id: I557eea7cd994d6a2ed38893d283d08107e78f789
Reviewed-on: http://gerrit.cloudera.org:8080/6465
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-24 21:22:50 +00:00
Dimitris Tsirogiannis
5ea1798661 IMPALA-4619: Allow NULL as default value in Kudu tables
This commit fixes an issue where an error is thrown if the default value
for a Kudu column is set to NULL.

Change-Id: Ida27ce56f1dd7603485a69c680db3bcea6702aff
Reviewed-on: http://gerrit.cloudera.org:8080/5405
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-12-08 04:53:38 +00:00
Dan Burkert
f83652c1da Replace INTO N BUCKETS with PARTITIONS N in CREATE TABLE
This commit also removes the `DISTRIBUTE`, `SPLIT`, and `BUCKETS`
keywords, which were going to be newly released in Impala 2.6 but are
now unused. Additionally, a few remaining uses of the
`DISTRIBUTE BY` syntax have been switched to `PARTITION BY`.

Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-07 07:31:16 +00:00
Matthew Jacobs
5188f879a7 IMPALA-4477: Bump Kudu version to latest master (60aa54e)
Bumps the toolchain version to get a newer Kudu build.

Also fixes test failures resulting from changes in Kudu.
Notably error strings have changed (IMPALA-4590) and the
number of replicas must be odd (IMPALA-4589).

Note: The toolchain binaries starting with this build are
now using the toolchain binutils rather than the system
binutils.

Testing: private exhaustive build.

Change-Id: If1912f058c240fbe82b06f77e31add7755289be1
Reviewed-on: http://gerrit.cloudera.org:8080/5369
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-07 05:11:13 +00:00
Dimitris Tsirogiannis
cba93f1ac3 IMPALA-4561: Replace DISTRIBUTE BY with PARTITION BY in CREATE TABLE
Change-Id: I0e07c41eabb4c8cb95754cf04293cbd9e03d6ab2
Reviewed-on: http://gerrit.cloudera.org:8080/5317
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-12-06 10:41:53 +00:00
Thomas Tauber-Marshall
3833707dbd IMPALA-4466: Improve Kudu CRUD test coverage
The results in the test files were verified by hand.

This patch also introduces a new test section 'DML_RESULTS', which
takes the name of a table as a comment and the contents of the
table as its body and then verifies that the body matches the
actual contents of the table. This makes it easy to check that a
DML operation has the desired effect on the contents of a table,
rather than always having to add another test case that runs a
select on the table. For now, this section cannot be used in a
test along with the RESULTS or ERRORS sections.
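
The idea behind such a section can be sketched in Python. This is not the
test framework's implementation: the comma-separated row format, the
order-insensitive comparison, and both function names are assumptions.

```python
from typing import Iterable, List, Tuple


def parse_expected_rows(body: str) -> List[Tuple[str, ...]]:
    """Parse a section body with one comma-separated row per line."""
    rows = []
    for line in body.splitlines():
        line = line.strip()
        if line:
            rows.append(tuple(field.strip() for field in line.split(",")))
    return rows


def table_matches(body: str, actual_rows: Iterable[Iterable[object]]) -> bool:
    """Compare the expected body with the table contents, ignoring order."""
    expected = sorted(parse_expected_rows(body))
    actual = sorted(tuple(str(v) for v in row) for row in actual_rows)
    return expected == actual
```

A test would run its DML statement, read back the named table, and pass the
rows to table_matches together with the section body.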

TODO: Refactor the DML test case handling (IMPALA-4471)

Change-Id: Ib9e7afbef60186edb00a9d11fbe5a8c64931add6
Reviewed-on: http://gerrit.cloudera.org:8080/4953
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 02:54:30 +00:00
Dimitris Tsirogiannis
d802f321b2 IMPALA-3724: Support Kudu non-covering range partitions
This commit adds support for non-covering range partitions in Kudu
tables. The SPLIT ROWS clause is now deprecated and no longer supported.
The following new syntax provides more flexibility in creating range
partitions and it supports bounded and unbounded ranges as well as single value
partitions; multi-column range partitions are supported as well.

The new syntax is:
DISTRIBUTE BY RANGE (col_list)
(
 PARTITION lower_1 <[=] VALUES <[=] upper_1,
 PARTITION lower_2 <[=] VALUES <[=] upper_2,
             ....
 PARTITION lower_n <[=] VALUES <[=] upper_n,
 PARTITION VALUE = val_1,
             ....
 PARTITION VALUE = val_n
)

Multi-column range partitions are specified as follows:
DISTRIBUTE BY RANGE (col1, col2,..., coln)
(
 PARTITION VALUE = (col1_val, col2_val, ..., coln_val),
                     ....
 PARTITION VALUE = (col1_val, col2_val, ..., coln_val)
)
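
To make the semantics of bounded, unbounded, and single value partitions
concrete, here is a minimal single-column sketch in Python. The names
RangeSketch, single_value, and find_partition are hypothetical, and the
code models only the bound logic, not Kudu's partitioning.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RangeSketch:
    """One range partition; a bound of None means unbounded on that side."""
    lower: Optional[int] = None
    upper: Optional[int] = None
    lower_inclusive: bool = True
    upper_inclusive: bool = False

    def contains(self, value: int) -> bool:
        if self.lower is not None:
            if value < self.lower:
                return False
            if value == self.lower and not self.lower_inclusive:
                return False
        if self.upper is not None:
            if value > self.upper:
                return False
            if value == self.upper and not self.upper_inclusive:
                return False
        return True


def single_value(value: int) -> RangeSketch:
    """PARTITION VALUE = value, modeled as a closed range [value, value]."""
    return RangeSketch(value, value, True, True)


def find_partition(partitions: List[RangeSketch], value: int) -> Optional[int]:
    """Return the index of the covering partition, or None if uncovered."""
    for index, partition in enumerate(partitions):
        if partition.contains(value):
            return index
    return None


partitions = [
    RangeSketch(lower=0, upper=10),  # PARTITION 0 <= VALUES < 10
    RangeSketch(lower=20, upper=30, lower_inclusive=False,
                upper_inclusive=True),  # PARTITION 20 < VALUES <= 30
    single_value(50),  # PARTITION VALUE = 50
]
```

Values such as 10, 15, or 20 fall into no partition here, which is what
"non-covering" means: the ranges need not span the whole key space.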

Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06
Reviewed-on: http://gerrit.cloudera.org:8080/4856
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-04 22:02:22 +00:00
Matthew Jacobs
9b507b6ed6 IMPALA-4379: Fix and test Kudu table type checking
Creating Kudu tables shouldn't allow types not supported by
Kudu (e.g. VARCHAR/CHAR, DECIMAL, TIMESTAMP, collection types).
The behavior was inconsistent: some types threw in the catalog,
while VARCHAR/CHAR silently became strings. This changes the
behavior so that all of them fail during analysis. Analysis tests
were added.
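
A minimal Python sketch of failing uniformly at analysis time is shown
below. The type list mirrors the one in this commit message (later commits
in this log lift several of these restrictions); the names AnalysisError
and check_kudu_column_types are hypothetical, and ARRAY and MAP stand in
for "collection types".

```python
from typing import Iterable, Tuple

# Types rejected for Kudu tables as of this commit, per the message above.
UNSUPPORTED_KUDU_TYPES = {
    "VARCHAR", "CHAR", "DECIMAL", "TIMESTAMP", "ARRAY", "MAP",
}


class AnalysisError(Exception):
    """Hypothetical error type raised during statement analysis."""


def check_kudu_column_types(columns: Iterable[Tuple[str, str]]) -> None:
    """Raise AnalysisError for the first column with an unsupported type."""
    for name, type_name in columns:
        # Strip parameters such as VARCHAR(10) or ARRAY<INT>.
        base = type_name.split("(")[0].split("<")[0].strip().upper()
        if base in UNSUPPORTED_KUDU_TYPES:
            raise AnalysisError(
                f"column '{name}': type {type_name} is not supported")


def fails_analysis(columns: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the column list is rejected."""
    try:
        check_kudu_column_types(columns)
    except AnalysisError:
        return True
    return False
```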

Similarly, external tables cannot contain Kudu types that
Impala doesn't support (e.g. UNIXTIME_MICROS, BINARY). Tests
were added to validate this behavior. Note that this
required upgrading the python Kudu client.

This also fixes a small corner case with ALTER TABLE:
ALTER TABLE shouldn't allow Kudu tables to change the
storage descriptor tblproperty, otherwise the table metadata
gets in an inconsistent state.

Tests were added for all of the above.

Change-Id: I475273cbbf4110db8d0f78ddf9a56abfc6221e3e
Reviewed-on: http://gerrit.cloudera.org:8080/4857
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-10-31 16:03:54 +00:00
Dimitris Tsirogiannis
041fa6d946 IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.

Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU

Changes:
1) Remove the requirement to specify table properties such as key
   columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
   schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
   time of creation in Impala.
4) Disallow table properties that could conflict with an existing
   table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
   addresses. The flag is used as the default value for the table
   property kudu_master_addresses but it can still be overridden
   using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
   wasn't implemented for Kudu tables and was silently ignored, so the
   Kudu tables weren't removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
   Kudu); the existence of the other delegate and the use of delegates in
   general have led to confusion. The Kudu delegate only exists to provide
   functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
   standard. When used at the column level, only one column can be
   marked as a key. When used at the table level, multiple columns can
   be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
   The old "kudu.key_columns" table property is no longer accepted
   though it is still used internally. "PRIMARY" is now a keyword.
   The ident style declaration is used for "KEY" because it is also used
   for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
   The table property "kudu.table_name" is optional for managed tables
   and is required for external tables. If for a managed table a Kudu
   table name is not provided, a table name will be generated based
   on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
   of HMS when a table is loaded or refreshed. Table/column metadata
   are cached in the catalog and are stored in HMS in order to be
   able to use table and column statistics.
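
Items 6 and 10 describe defaulting rules that can be sketched in Python.
This is an illustration only: the function name resolve_kudu_properties is
hypothetical, and the "impala::db.table" format of the generated name is an
assumption, since the message only says the name is derived from the HMS
database and table name. Property names are taken from the text above.

```python
from typing import Dict


def resolve_kudu_properties(db: str,
                            table: str,
                            tblproperties: Dict[str, str],
                            default_master_addresses: str,
                            is_external: bool) -> Dict[str, str]:
    """Fill in defaults for Kudu table properties (hypothetical sketch)."""
    resolved = dict(tblproperties)
    if "kudu.table_name" not in resolved:
        if is_external:
            # External tables must name an existing Kudu table.
            raise ValueError("kudu.table_name is required for external tables")
        # Assumed naming scheme for managed tables.
        resolved["kudu.table_name"] = f"impala::{db}.{table}"
    # The startup flag is only a default; TBLPROPERTIES takes precedence.
    resolved.setdefault("kudu_master_addresses", default_master_addresses)
    return resolved


managed = resolve_kudu_properties(
    "db1", "t1", {}, "master-1:7051", is_external=False)
overridden = resolve_kudu_properties(
    "db1", "t1", {"kudu_master_addresses": "other:7051"},
    "master-1:7051", is_external=False)
```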

Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 10:52:25 +00:00