Commit Graph

14 Commits

Author SHA1 Message Date
Matthew Jacobs
24c77f194b IMPALA-5137: Support pushing TIMESTAMP predicates to Kudu
This change builds on the support for reading and writing
TIMESTAMP columns to Kudu tables (see [1]), adding support
for pushing TIMESTAMP predicates to Kudu for scans.

Binary predicates and IN list predicates are supported.

Testing: Added some planner and EE tests to validate the
behavior.

[1]: https://gerrit.cloudera.org/#/c/6526/

Change-Id: I08b6c8354a408e7beb94c1a135c23722977246ea
Reviewed-on: http://gerrit.cloudera.org:8080/6789
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-18 21:09:51 +00:00
Joe McDonnell
077c07eec7 IMPALA-4859: Push down IS NULL / IS NOT NULL to Kudu
This detects IS NULL / IS NOT NULL and creates a Kudu
predicate to push this to Kudu.

For testing, there are planner tests to verify that the
predicate is pushed to Kudu. There are also end-to-end
tests for correctness.

Change-Id: I9c96fec8d41f77222879c0ffdd6940b168e47e65
Reviewed-on: http://gerrit.cloudera.org:8080/5958
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-25 04:51:36 +00:00
Dan Burkert
f83652c1da Replace INTO N BUCKETS with PARTITIONS N in CREATE TABLE
This commit also removes the `DISTRIBUTE`, `SPLIT`, and `BUCKETS`
keywords, which were going to be newly released in Impala 2.6 but are
now unused. Additionally, the few remaining uses of the
`DISTRIBUTE BY` syntax have been switched to `PARTITION BY`.

Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-07 07:31:16 +00:00
Dimitris Tsirogiannis
cba93f1ac3 IMPALA-4561: Replace DISTRIBUTE BY with PARTITION BY in CREATE TABLE
Change-Id: I0e07c41eabb4c8cb95754cf04293cbd9e03d6ab2
Reviewed-on: http://gerrit.cloudera.org:8080/5317
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-12-06 10:41:53 +00:00
Alex Behm
4918b20ac0 IMPALA-4408: Omit null bytes for Kudu scans with no nullable slots.
Kudu does not allocate null bytes if all projected columns are
non-nullable. Otherwise, Kudu allocates a null bit for all columns,
even the non-nullable ones. The bug was that Impala's memory layout
did not match the first requirement.
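
The layout rule above can be sketched as follows. This is a minimal
illustration in Python, not Impala or Kudu code: the function name and the
(name, nullable) column representation are hypothetical, and rounding the
null bits up to whole bytes is an assumption.

```python
def num_null_bytes(projected_columns):
    """Return the number of null-indicator bytes for a projected row.

    projected_columns: list of (name, nullable) pairs (hypothetical shape).
    """
    # No nullable column projected: no null bytes are allocated at all.
    if not any(nullable for _, nullable in projected_columns):
        return 0
    # Otherwise one null bit per projected column, nullable or not,
    # rounded up to whole bytes (the rounding is an assumption).
    return (len(projected_columns) + 7) // 8
```

With two non-nullable columns this yields 0; as soon as one of them is
nullable, both columns get a bit and one byte is needed.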

Change-Id: I762ad9d5cc4198922ea4b5218c504fde355c49a5
Reviewed-on: http://gerrit.cloudera.org:8080/4892
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-01 01:47:30 +00:00
Dimitris Tsirogiannis
041fa6d946 IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.

Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU

Changes:
1) Remove the requirement to specify table properties such as key
   columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
   schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
   time of creation in Impala.
4) Disallow table properties that could conflict with an existing
   table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
   addresses. The flag is used as the default value for the table
   property kudu_master_addresses but it can still be overridden
   using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
   wasn't implemented for Kudu tables and silently ignored. The Kudu
   tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
   Kudu); the existence of the other delegate and the use of delegates in
   general have led to confusion. The Kudu delegate only exists to provide
   functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
   standard. When used at the column level, only one column can be
   marked as a key. When used at the table level, multiple columns can
   be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
   The old "kudu.key_columns" table property is no longer accepted
   though it is still used internally. "PRIMARY" is now a keyword.
   The ident style declaration is used for "KEY" because it is also used
   for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
   The table property "kudu.table_name" is optional for managed tables
   and is required for external tables. If for a managed table a Kudu
   table name is not provided, a table name will be generated based
   on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
   of HMS when a table is loaded or refreshed. Table/column metadata
   are cached in the catalog and are stored in HMS in order to be
   able to use table and column statistics.
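
The table-name rule in item 10 could look roughly like the sketch below,
written in Python for illustration. Both function names are hypothetical,
and the `impala::db.table` shape of the generated name is an assumption;
the message above only says the name is based on the HMS database and
table name.

```python
def default_kudu_table_name(hms_db, hms_table, prefix="impala::"):
    """Build a Kudu table name from the HMS database and table name.

    The 'prefix + db.table' shape is an illustrative assumption.
    """
    return f"{prefix}{hms_db}.{hms_table}"


def resolve_kudu_table_name(tbl_properties, hms_db, hms_table, is_external):
    """Return the Kudu table name to use for a CREATE TABLE statement."""
    name = tbl_properties.get("kudu.table_name")
    if name:
        # An explicitly provided name always wins.
        return name
    if is_external:
        # External tables must point at an existing Kudu table by name.
        raise ValueError("kudu.table_name is required for external tables")
    # Managed table without an explicit name: generate one.
    return default_kudu_table_name(hms_db, hms_table)
```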

Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 10:52:25 +00:00
Matthew Jacobs
d113205cee IMPALA-3650: DISTRIBUTE BY required for managed Kudu tables
As of Kudu 0.9, DISTRIBUTE BY is now required when creating
a new Kudu table. Create table analysis, data loading, and
tests are updated to reflect this.

This also bumps the Kudu version to 0.10.0.

Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e
Reviewed-on: http://gerrit.cloudera.org:8080/3987
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 02:14:39 +00:00
casey
8c224398bb IMPALA-2635: Kudu scanner hangs on UNION
A UNION is special because it may cause a scan node to be started
without any scan ranges. The Kudu scanner didn't expect that scenario
and would hang waiting for data from scanner threads that would never be
started. The fix is to exit early when there are no scan ranges.
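
A minimal sketch of this failure mode and the early exit, in Python for
illustration. The `scan` function, the one-thread-per-range model, and the
sentinel protocol are all hypothetical and are not the actual scanner code.

```python
import queue
import threading


def scan(scan_ranges, read_range):
    """Collect rows produced by one scanner thread per scan range."""
    # Early exit: with no scan ranges no thread is ever started, so nobody
    # would post the end-of-data sentinel and the loop below would block
    # forever.
    if not scan_ranges:
        return []

    batches = queue.Queue()
    remaining = [len(scan_ranges)]
    lock = threading.Lock()

    def worker(scan_range):
        batches.put(read_range(scan_range))
        with lock:
            remaining[0] -= 1
            last = remaining[0] == 0
        if last:
            batches.put(None)  # sentinel: every thread has finished

    threads = [threading.Thread(target=worker, args=(r,)) for r in scan_ranges]
    for t in threads:
        t.start()

    rows = []
    while True:
        batch = batches.get()  # blocks until a thread delivers something
        if batch is None:
            break
        rows.extend(batch)
    for t in threads:
        t.join()
    return rows
```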

Change-Id: Id53fb880ba23ee9bbcf3169598f97fa1a3285dd9
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/10044
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
2016-01-28 21:49:39 -08:00
casey
84ce1e22af IMPALA-2740: Kudu scanner - Reset tuple's null bits after filtering row
The problem was, if a tuple was filtered, the bits indicating values are
NULL were not reset and the tuple's memory was reused. So NULLs from
consecutively filtered rows would accumulate in the tuple. The fix is to
always reset the NULL bits (as it doesn't matter whether the row was
filtered).
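
The accumulation can be reproduced with a small model of a reused tuple.
This is a Python sketch for illustration; the `materialize` function and
its `always_reset` switch are hypothetical stand-ins for the scanner logic,
with `always_reset=False` modelling the old behavior.

```python
def materialize(rows, keep, always_reset=True):
    """Decode rows into one reused tuple and return copies of the kept rows.

    rows: list of lists, where None stands for a NULL value.
    keep: predicate deciding whether a row passes the filter.
    """
    num_cols = len(rows[0]) if rows else 0
    null_bits = [False] * num_cols  # reused across rows
    values = [None] * num_cols      # reused across rows
    out = []
    for row in rows:
        for i, v in enumerate(row):
            if v is None:
                null_bits[i] = True
            else:
                values[i] = v
        kept = keep(row)
        if kept:
            out.append([None if null_bits[i] else values[i]
                        for i in range(num_cols)])
        # Old behavior: reset only after a kept row. Fix: reset always.
        if kept or always_reset:
            for i in range(num_cols):
                null_bits[i] = False
    return out
```

For the rows `[1, None]` and `[2, "b"]` with a filter that keeps only the
second row, the fixed version returns `[2, "b"]`, while the old behavior
leaks the NULL from the filtered row and returns `[2, None]`.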

Change-Id: Ib4d980980e02bf2c82dc229a8ed1ada16bb8174f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/9958
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-12-28 11:34:22 -08:00
Martin Grund
23ca2f01ad Execution of Update
This patch adds the backend implementation of the update operation. It reuses the
Kudu table sink and simply changes the KuduWriteOperation type to
Update.

Change-Id: I31e524210b9401d4619ab0f892d9fb044b6dfdea
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6999
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
2015-08-08 07:02:37 -07:00
David Alves
ee8d830d7a Frontend part of Kudu predicate pushdown
This adds the frontend part of Kudu predicate pushdown. Namely it goes
through all the predicates that are assigned to the KuduScanNode and selects
those that are pushable to Kudu (binary predicates: <=, >= and = that have
a constant on one side and a slot ref on the other). Pushable predicates
are then set on TKuduScanNode for the backend to transform into range predicates.

Partition pruning is not handled at the moment due to limitations/bugs in the Kudu
Java API.

This adds a test that makes sure that predicates are pushed down when they
match the pushable rules and are not when they don't.
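
The selection rule can be sketched as below. This is a Python illustration
of the rule as stated in this message, not the frontend code: the
`SlotRef`, `Constant`, and `BinaryPredicate` classes and both functions are
hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SlotRef:
    column: str


@dataclass(frozen=True)
class Constant:
    value: Any


@dataclass(frozen=True)
class BinaryPredicate:
    op: str
    left: Any
    right: Any


PUSHABLE_OPS = {"<=", ">=", "="}


def is_pushable(pred):
    """True for <=, >= or = with a slot ref on one side and a constant on the other."""
    if not isinstance(pred, BinaryPredicate) or pred.op not in PUSHABLE_OPS:
        return False
    operands = (pred.left, pred.right)
    return (any(isinstance(o, SlotRef) for o in operands)
            and any(isinstance(o, Constant) for o in operands))


def split_predicates(preds):
    """Split predicates into (pushable, not pushable), preserving order."""
    pushable = [p for p in preds if is_pushable(p)]
    rest = [p for p in preds if not is_pushable(p)]
    return pushable, rest
```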

Change-Id: I8f86bb8b5f6667422df7080315045d69b61dba92
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7042
Tested-by: jenkins
Reviewed-by: David Alves <david.alves@cloudera.com>
2015-08-04 23:49:14 -07:00
David Alves
af1e1bea15 On Kudu scans, always build a schema with 0 key columns.
We currently have a bug where SELECT queries with named columns
only work if the key columns are declared first.

This is because, on scans, we're passing a number of key columns equal
to the number of key columns referred to by slot descriptors. The
problem is that Kudu expects key columns to come first in the schema
if the number of key columns is > 0, and we build a schema that matches
the column order in the SlotDescriptors vector, which might not have
key columns first. However, Kudu doesn't actually care about
key column ordering on scans _if_ the number of key columns is set
to 0 (which is weird behavior; filed KUDU-852 for this).

This patch just changes the built Kudu schema so that we always pass
0 key columns. It also adds an end-to-end test that makes sure a
previously failing projection now works.

Change-Id: I0826dabd87493a684cfc18058a4b5aa02f7f6cdc
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7130
Tested-by: jenkins
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
2015-07-13 14:56:03 -07:00
David Alves
84b2d60123 Enforce any limit set on backend KuduScanNode
We were previously only incrementing the rows returned counter after
a full batch was processed, missing the 'num_rows_returned_' on the
scanner, which is actually used in ReachedLimit(). This caused us
to return more rows than needed with a single-node plan.

This patch fixes this and adds an update to 'rows_read_counter_'.
Moreover this patch adds a test that makes sure the limit is enforced.
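
A rough sketch of the idea, in Python for illustration. The
`scan_with_limit` function is hypothetical; it only shows why the returned
row count has to be updated per row, so that the limit check never works
from a stale count in the middle of a batch.

```python
def scan_with_limit(batches, limit):
    """Return at most `limit` rows from an iterable of row batches."""
    rows_returned = 0
    out = []
    for batch in batches:
        for row in batch:
            # Check the limit against a count that is current for every
            # row, not one that is only updated once per full batch.
            if rows_returned >= limit:
                return out
            out.append(row)
            rows_returned += 1
    return out
```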

Change-Id: I31c76e67fd1acb7b2bb6d31de8904954e01f9da3
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7046
Tested-by: jenkins
Reviewed-by: David Alves <david.alves@cloudera.com>
2015-07-13 09:25:35 -07:00
David Alves
47bb63d249 If Kudu returns an empty string don't try to allocate buffer space
In KuduScanner, when an empty string was returned we would try
to allocate an empty buffer, getting, correctly, a NULL buffer back.
However, we would interpret the NULL buffer as an inability to allocate
memory, returning a MEM_LIMIT_EXCEEDED error.

This patch special-cases empty strings so that we just accept
the NULL buffer and don't return an error. Specifically, the following
sequence of operations:

INSERT INTO testtbl (id, name) VALUES (10, "");
SELECT * FROM testtbl;

would fail with the aforementioned error and, with this patch, correctly
returns:

+----+------+------+
| id | name | zip  |
+----+------+------+
...
| 10 |      | NULL |
+----+------+------+
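
The special case can be sketched as follows, in Python for illustration.
The `allocate` helper, the `MemLimitExceeded` exception, and `copy_string`
are hypothetical models of the behavior described above, not the
KuduScanner code.

```python
class MemLimitExceeded(Exception):
    """Stand-in for the MEM_LIMIT_EXCEEDED error."""


def allocate(size, available):
    """Hypothetical allocator: None for a zero-size request or when out of memory."""
    if size == 0 or size > available:
        return None
    return bytearray(size)


def copy_string(value, available):
    """Copy a string value returned by a scan into newly allocated memory."""
    # Special case: an empty string needs no buffer, so a missing buffer
    # here is expected and must not be reported as an allocation failure.
    if len(value) == 0:
        return b""
    buf = allocate(len(value), available)
    if buf is None:
        raise MemLimitExceeded("could not allocate %d bytes" % len(value))
    buf[:] = value
    return bytes(buf)
```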

Change-Id: I5eeee4b57ed3163b9c9888d694eba5dd4dd45bb5
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7053
Tested-by: jenkins
Reviewed-by: David Alves <david.alves@cloudera.com>
2015-07-13 09:25:08 -07:00