This change builds on the support for reading and writing
TIMESTAMP columns to Kudu tables (see [1]), adding support
for pushing TIMESTAMP predicates to Kudu for scans.
Binary predicates and IN list predicates are supported.
Testing: Added some planner and EE tests to validate the
behavior.
[1] https://gerrit.cloudera.org/#/c/6526/
Change-Id: I08b6c8354a408e7beb94c1a135c23722977246ea
Reviewed-on: http://gerrit.cloudera.org:8080/6789
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This detects IS NULL / IS NOT NULL and creates a Kudu
predicate to push this to Kudu.
For testing, there are planner tests to verify that the
predicate is pushed to Kudu. There are also end-to-end
tests for correctness.
Change-Id: I9c96fec8d41f77222879c0ffdd6940b168e47e65
Reviewed-on: http://gerrit.cloudera.org:8080/5958
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
This commit also removes the `DISTRIBUTE`, `SPLIT`, and
`BUCKETS` keywords, which were going to be newly released in Impala 2.6
but are now unused. Additionally, the few remaining uses of the
`DISTRIBUTE BY` syntax have been switched to `PARTITION BY`.
Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Kudu does not allocate null bytes if all projected columns are
non-nullable. Otherwise, Kudu allocates a null bit for all columns,
even the non-nullable ones. The bug was that Impala's memory layout
did not match the first requirement.
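The layout rule above can be sketched as follows. This is only an
illustration in Python, not Impala's or Kudu's actual code: the helper
name `null_bitmap_bytes`, the list-of-booleans input, and the rounding
of the bit count up to whole bytes are all assumptions made for the
example.

```python
def null_bitmap_bytes(projected_nullable):
    """Return the number of null-indicator bytes for a projected row.

    projected_nullable: one bool per projected column (True = nullable).
    Hypothetical helper; rounding up to whole bytes is an assumption.
    """
    # No null bytes at all when every projected column is non-nullable.
    if not any(projected_nullable):
        return 0
    # Otherwise one null bit per projected column, nullable or not.
    num_bits = len(projected_nullable)
    return (num_bits + 7) // 8
```

Under these assumptions, a projection of three non-nullable columns
needs no null bytes, while adding a single nullable column makes every
projected column contribute one bit.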
Change-Id: I762ad9d5cc4198922ea4b5218c504fde355c49a5
Reviewed-on: http://gerrit.cloudera.org:8080/4892
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.
Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU
Changes:
1) Remove the requirement to specify table properties such as key
columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
time of creation in Impala.
4) Disallow table properties that could conflict with an existing
table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
addresses. The flag is used as the default value for the table
property kudu_master_addresses but it can still be overridden
using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
wasn't implemented for Kudu tables and silently ignored. The Kudu
tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
Kudu); the existence of the other delegate and the use of delegates in
general have led to confusion. The Kudu delegate only exists to provide
functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
standard. When used at the column level, only one column can be
marked as a key. When used at the table level, multiple columns can
be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
The old "kudu.key_columns" table property is no longer accepted
though it is still used internally. "PRIMARY" is now a keyword.
The ident style declaration is used for "KEY" because it is also used
for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
The table property "kudu.table_name" is optional for managed tables
and is required for external tables. If for a managed table a Kudu
table name is not provided, a table name will be generated based
on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
of HMS when a table is loaded or refreshed. Table/column metadata
are cached in the catalog and are stored in HMS in order to be
able to use table and column statistics.
Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
As of Kudu 0.9, DISTRIBUTE BY is now required when creating
a new Kudu table. Create table analysis, data loading, and
tests are updated to reflect this.
This also bumps the Kudu version to 0.10.0.
Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e
Reviewed-on: http://gerrit.cloudera.org:8080/3987
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
A UNION is special because it may cause a scan node to be started
without any scan ranges. The Kudu scanner didn't expect that scenario
and would hang waiting for data from scanner threads that would never be
started. The fix is to exit early when there are no scan ranges.
Change-Id: Id53fb880ba23ee9bbcf3169598f97fa1a3285dd9
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/10044
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
The problem was, if a tuple was filtered, the bits indicating values are
NULL were not reset and the tuple's memory was reused. So NULLs from
consecutively filtered rows would accumulate in the tuple. The fix is to
always reset the NULL bits (as it doesn't matter whether the row was
filtered).
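The failure mode and the fix can be illustrated with a toy model. This
is a sketch in Python, not the scanner's real code: the `Tuple` class,
`materialize`, `scan`, and the `reset_null_bits` switch are hypothetical
names used only to show how stale NULL bits leak through a reused tuple.

```python
class Tuple:
    """Toy stand-in for a reused tuple buffer (hypothetical)."""

    def __init__(self, num_slots):
        self.null_bits = [False] * num_slots
        self.values = [None] * num_slots


def materialize(tuple_, row, reset_null_bits=True):
    """Copy one source row into the reused tuple.

    With reset_null_bits=False this reproduces the bug: a NULL bit set
    by an earlier (filtered) row is never cleared.
    """
    if reset_null_bits:
        tuple_.null_bits = [False] * len(tuple_.null_bits)
    for i, value in enumerate(row):
        if value is None:
            tuple_.null_bits[i] = True
        else:
            tuple_.values[i] = value


def scan(rows, keep, reset_null_bits=True):
    """Return the rows that pass `keep`, as seen through the reused tuple."""
    if not rows:
        return []
    t = Tuple(len(rows[0]))
    out = []
    for row in rows:
        materialize(t, row, reset_null_bits)
        if keep(row):
            out.append([None if t.null_bits[i] else t.values[i]
                        for i in range(len(row))])
    return out
```

In this model, scanning `[(1, None), (2, "b")]` while keeping only the
row with id 2 yields `[[2, None]]` without the reset and `[[2, "b"]]`
with it.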
Change-Id: Ib4d980980e02bf2c82dc229a8ed1ada16bb8174f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/9958
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
This patch adds the backend implementation to the update. It reuses the
Kudu table sink and simply changes the KuduWriteOperation type to
Update.
Change-Id: I31e524210b9401d4619ab0f892d9fb044b6dfdea
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6999
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
This adds the frontend part of Kudu predicate pushdown. Namely it goes
through all the predicates that are assigned to the KuduScanNode and selects
those that are pushable to Kudu (binary predicates: <=, >= and = that have
a constant on one side and a slot ref on the other). Pushable predicates
are then set on TKuduScanNode for the backend to transform into range predicates.
Partition pruning is not handled at the moment due to limitations/bugs in the Kudu
Java API.
This adds a test that makes sure that predicates are pushed down when they
match the pushable rules and are not when they don't.
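The selection rule described above can be sketched as follows. This is
an illustrative Python model rather than the planner's Java code; the
`SlotRef`, `Constant`, and `BinaryPredicate` classes and the
`is_pushable` / `split_predicates` helpers are hypothetical stand-ins
for the frontend's expression types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotRef:
    column: str


@dataclass(frozen=True)
class Constant:
    value: object


@dataclass(frozen=True)
class BinaryPredicate:
    op: str
    left: object
    right: object


# Operators the commit message lists as pushable.
PUSHABLE_OPS = {"<=", ">=", "="}


def is_pushable(pred):
    """True if pred is <=, >= or = between a slot ref and a constant."""
    if not isinstance(pred, BinaryPredicate) or pred.op not in PUSHABLE_OPS:
        return False
    sides = (pred.left, pred.right)
    has_slot = any(isinstance(s, SlotRef) for s in sides)
    has_const = any(isinstance(s, Constant) for s in sides)
    return has_slot and has_const


def split_predicates(preds):
    """Partition predicates into (pushed to Kudu, evaluated by the scan node)."""
    pushed = [p for p in preds if is_pushable(p)]
    remaining = [p for p in preds if not is_pushable(p)]
    return pushed, remaining
```

The constant may sit on either side, so both `id = 5` and `5 >= id`
qualify in this sketch, while `id < 5` and `id = other_id` stay with
the scan node.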
Change-Id: I8f86bb8b5f6667422df7080315045d69b61dba92
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7042
Tested-by: jenkins
Reviewed-by: David Alves <david.alves@cloudera.com>
We currently have a bug where SELECT queries with named columns
only work if the key columns are declared first.
This is because, on scans, we're passing a number of key columns equal
to the number of key columns referred to by slot descriptors. The
problem is that Kudu expects key columns to come first in the schema
if the number of key columns is > 0 and we build a schema that matches
the column order in the SlotDescriptors vector, which might not have
key columns first. However, Kudu scans don't actually care about
key column ordering _if_ the number of key columns is set
to 0 (which is weird behavior; filed KUDU-852 for this).
This patch just changes the built Kudu schema so that we always pass
0 key columns. It also adds an end-to-end test that makes sure a
previously failing projection now works.
Change-Id: I0826dabd87493a684cfc18058a4b5aa02f7f6cdc
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7130
Tested-by: jenkins
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
We were previously only incrementing the rows returned counter after
a full batch was processed, missing the 'num_rows_returned_' on the
scanner, which is actually used in ReachedLimit(). This caused us
to return more rows than needed with a single-node plan.
This patch fixes this and adds an update to 'rows_read_counter_'.
Moreover this patch adds a test that makes sure the limit is enforced.
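The idea of counting per row rather than per batch can be sketched as
below. This is a simplified Python illustration, not the scanner's C++
code; `scan_with_limit` and its local counter are hypothetical and only
mirror the role that 'num_rows_returned_' plays in ReachedLimit().

```python
def scan_with_limit(batches, limit):
    """Collect rows from batches, checking the limit before every row.

    limit=None means no limit. Hypothetical sketch: the counter is
    updated per row, so the limit cannot be overshot by a partial batch.
    """
    num_rows_returned = 0
    out = []
    for batch in batches:
        for row in batch:
            if limit is not None and num_rows_returned >= limit:
                return out
            out.append(row)
            num_rows_returned += 1
    return out
```

If the counter were only updated once a whole batch had been consumed,
a limit that falls in the middle of a batch would let the rest of that
batch through.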
Change-Id: I31c76e67fd1acb7b2bb6d31de8904954e01f9da3
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7046
Tested-by: jenkins
Reviewed-by: David Alves <david.alves@cloudera.com>
In KuduScanner, when an empty string was returned we would try
to allocate an empty buffer and, correctly, get a NULL buffer back.
However, we would interpret the NULL buffer as an inability to allocate
memory, returning a MEM_LIMIT_EXCEEDED error.
This patch special cases handling empty strings so that we just accept
the NULL buffer and don't return an error. Specifically the following
sequence of operations:
INSERT INTO TABLE testtbl (id, name) VALUES (10, "");
SELECT * FROM testtbl;
Would fail with the aforementioned error and with this patch returns,
correctly:
+----+------+------+
| id | name | zip  |
+----+------+------+
...
| 10 |      | NULL |
+----+------+------+
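
The special case can be sketched as follows. This is a toy Python
model, not KuduScanner itself: the `allocate` function, its `available`
parameter, and the `MemLimitExceeded` exception are hypothetical
stand-ins for the memory pool and the MEM_LIMIT_EXCEEDED status.

```python
class MemLimitExceeded(Exception):
    """Stand-in for the MEM_LIMIT_EXCEEDED error (hypothetical)."""


def allocate(num_bytes, available):
    """Toy allocator: returns None for a zero-byte request or when
    the request does not fit in the available memory."""
    if num_bytes == 0 or num_bytes > available:
        return None
    return bytearray(num_bytes)


def copy_string(src, available):
    """Copy src into freshly allocated memory.

    A None buffer is only treated as an error when the string is
    non-empty; an empty string legitimately needs no buffer.
    """
    buf = allocate(len(src), available)
    if buf is None:
        if len(src) == 0:
            return b""
        raise MemLimitExceeded("could not allocate %d bytes" % len(src))
    buf[:] = src
    return bytes(buf)
```

Before the fix, the equivalent of the `len(src) == 0` check was missing,
so the empty-string case took the error path.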
Change-Id: I5eeee4b57ed3163b9c9888d694eba5dd4dd45bb5
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7053
Tested-by: jenkins
Reviewed-by: David Alves <david.alves@cloudera.com>