Iceberg recently added a new partition transform called 'void':
https://iceberg.apache.org/spec/#partition-transforms
This patch adds support for this transform.
When the user wants to drop a column from the partition spec,
the VOID transform should be used instead of just omitting
the column. Simply omitting the column might cause problems when
the metadata tables are queried (which is currently only supported
by other engines).
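As an illustrative sketch (table, column, and transform arguments are
hypothetical), assuming the ALTER TABLE ... SET PARTITION SPEC
statement, the dropped partition column stays in the spec as a VOID
field instead of being omitted:
ALTER TABLE ice_t SET PARTITION SPEC (VOID(p), BUCKET(5, i));
-- SHOW CREATE TABLE ice_t then lists VOID(p) in the partition spec
-- rather than silently dropping the field.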
Testing:
* added SHOW CREATE TABLE test
* added e2e test
Change-Id: Icbe11d56cdeb82aaadedfdb3ad61dd7cc4c2f4d0
Reviewed-on: http://gerrit.cloudera.org:8080/18102
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch Impala inconsistently scheduled scan ranges for
Iceberg tables on HDFS in local catalog mode. It did so because
LocalIcebergTable reloaded all the file descriptors, and the HDFS
block locations were not consistent across the reloads. Impala's
scheduler uses the block location list for scan range assignment,
hence the assignments were inconsistent between queries. This has
a negative effect on caching and hence hurt performance quite badly.
It is redundant and expensive to reload file descriptors for each
query in local catalog mode. This patch extends the GetPartialInfo()
RPC with Iceberg-specific snapshot information, which means the
coordinator is now able to fetch Iceberg data file descriptors from
the CatalogD. This way scan range assignment becomes consistent
because we reuse the same file descriptors with the same block
location information.
Fixing the above revealed another bug: before this patch we didn't
handle self-events of Iceberg tables. When an Iceberg table is stored
in the HiveCatalog, Iceberg updates the HMS table on modifications
because it needs to update the table property 'metadata_location'
(which points to the new snapshot file). Catalogd then processes
these modifications again when they arrive via the event notification
mechanism. I fixed this by creating Iceberg transactions in which I
set the catalog service ID and the new catalog version for the
Iceberg table. Since we are using transactions now, Iceberg embeds
all table modifications in a single ALTER TABLE request to HMS, and
Catalogd can detect the corresponding alter event later via the
aforementioned catalog service ID and version.
Testing:
* added e2e test for the scan range assignment
* added e2e test for detecting self-events
Change-Id: Ibb8216b37d350469b573dad7fcefdd0ee0599ed5
Reviewed-on: http://gerrit.cloudera.org:8080/17857
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>
This change will allow usage of commands that do not require reading
the JSON files, like:
- CREATE TABLE <table> STORED AS JSONFILE
- SHOW CREATE TABLE <table>
- DESCRIBE <table>
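For example, the following statements (table name hypothetical) now
work without scanning any JSON data:
CREATE TABLE json_tbl (id INT, payload STRING) STORED AS JSONFILE;
SHOW CREATE TABLE json_tbl;
DESCRIBE json_tbl;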
Changes:
- Added JSON as a FileFormat to thrift and HdfsFileFormat.
- Allowed the SQL keyword 'jsonfile' and mapped it to the JSON format.
- Added the JSON SerDe.
- JSON files have the same input format as TextFile, so the SerDe
library in use must be consulted to differentiate between the two
formats. Overloaded the functions that determine the file format
based on input format to also consider the SerDe library.
- Added tests for the 'Create Table' and 'Show Create Table' commands
Pending Changes:
- test for Describe command - to be added with backend changes.
Change-Id: I5b8cb2f59df3af09902b49d3bdac16c19954b305
Reviewed-on: http://gerrit.cloudera.org:8080/17727
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive relies on the engine.hive.enabled=true table property being set
for Iceberg tables. Without it, Hive overwrites the table metadata
with a different storage handler and SerDe/InputFormat/OutputFormat
when it writes the table, making the table unusable.
With this patch Impala sets this table property during table creation.
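As a sketch of the effect (table name hypothetical), creating a table
now also records the property:
CREATE TABLE ice_t (i INT) STORED AS ICEBERG;
-- SHOW CREATE TABLE ice_t now includes
-- TBLPROPERTIES ('engine.hive.enabled'='true', ...)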
Testing:
* updated show-create-table.test
* tested Impala/Hive interop manually
Change-Id: I6aa0240829697a27f48d0defcce48920a5d6f49b
Reviewed-on: http://gerrit.cloudera.org:8080/17750
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for the following standard Iceberg properties:
write.parquet.compression-codec:
Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
(default value), LZ4, ZSTD. The table property will be ignored if
COMPRESSION_CODEC query option is set.
write.parquet.compression-level:
Parquet compression level. Used with ZSTD compression only.
Supported range is [1, 22]. Default value is 3. The table property
will be ignored if COMPRESSION_CODEC query option is set.
write.parquet.row-group-size-bytes:
Parquet row group size in bytes. Supported range is [8388608,
2146435072] (8MB - 2047MB). The table property will be ignored if
PARQUET_FILE_SIZE query option is set.
If neither the table property nor the PARQUET_FILE_SIZE query option
is set, the way Impala calculates row group size will remain
unchanged.
write.parquet.page-size-bytes:
Parquet page size in bytes. Used for PLAIN encoding. Supported range
is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates page size
will remain unchanged.
write.parquet.dict-size-bytes:
Parquet dictionary page size in bytes. Used for dictionary encoding.
Supported range is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates dictionary
page size will remain unchanged.
This patch also renames 'iceberg.file_format' table property to
'write.format.default' which is the standard Iceberg name for the
table property.
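For illustration (table name and values hypothetical), these
properties can be set at table creation:
CREATE TABLE ice_t (i INT) STORED AS ICEBERG
TBLPROPERTIES (
  'write.format.default'='parquet',
  'write.parquet.compression-codec'='zstd',
  'write.parquet.compression-level'='12',
  'write.parquet.row-group-size-bytes'='134217728'
);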
Change-Id: I3b8aa9a52c13c41b48310d2f7c9c7426e1ff5f23
Reviewed-on: http://gerrit.cloudera.org:8080/17654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Two Iceberg commits got into the master branch in parallel. One of
them modified the DDL syntax, the other one added some tests.
They were correct on their own, but mixing the two caused
test failures.
The affected tests have been updated.
Change-Id: Id3cf6ff04b8da5782df2b84a580cdbd4a4a16d06
Reviewed-on: http://gerrit.cloudera.org:8080/17689
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg recently switched to using its Catalogs class to define
catalog and table properties. Catalog information is stored in a
configuration file such as hive-site.xml, and the table properties
contain information about which catalog is being used and what the
Iceberg table identifier is.
E.g. in the Hive conf we can have the following properties to define
catalogs:
iceberg.catalog.<catalog_name>.type = hadoop
iceberg.catalog.<catalog_name>.warehouse = somelocation
or
iceberg.catalog.<catalog_name>.type = hive
And at the table level we can have the following:
iceberg.catalog = <catalog_name>
name = <table_identifier>
Table property 'iceberg.catalog' refers to a catalog defined in the
configuration file. This contradicts Impala's current behavior,
where we are already using 'iceberg.catalog', and it can have the
following values:
* hive.catalog for HiveCatalog
* hadoop.catalog for HadoopCatalog
* hadoop.tables for HadoopTables
To be backward-compatible and also support the new Catalogs
properties, Impala still recognizes the above special values. But
from now on, Impala doesn't define 'iceberg.catalog' by default.
'iceberg.catalog' being NULL means HiveCatalog for both Impala and
Iceberg's Catalogs API, hence for Hive and Spark as well.
If 'iceberg.catalog' has a different value than the special values it
indicates that Iceberg's Catalogs API is being used, so Impala will
try to look up the catalog configuration from the Hive config file.
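For illustration (catalog and table names hypothetical), assuming
'iceberg.catalog.my_catalog.*' entries exist in hive-site.xml as
shown above, a table can be bound to that catalog like this:
CREATE TABLE ice_t (i INT)
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='my_catalog');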
Testing:
* added SHOW CREATE TABLE tests
* added e2e tests that create/insert/drop Iceberg tables with Catalogs
* manually tested interop behavior with Hive
Change-Id: I5dfa150986117fc55b28034c4eda38a736460ead
Reviewed-on: http://gerrit.cloudera.org:8080/17466
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently we have a DDL syntax for defining Iceberg partitions that
differs from SparkSQL:
https://iceberg.apache.org/spark-ddl/#partitioned-by
E.g. Impala is using the following syntax:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
STORED AS ICEBERG;
The same in Spark is:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
USING ICEBERG
PARTITIONED BY (bucket(5, i), months(ts), years(d))
HIVE-25179 added the following syntax for Hive:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED BY ICEBERG;
I.e. the same syntax as Spark, but adding the keyword "SPEC".
This patch makes Impala use Hive's syntax, i.e. we will also
use the PARTITIONED BY SPEC clause + the unified partition
transform syntax.
Testing:
* existing tests have been rewritten with the new syntax
Change-Id: Ib72ae445fd68fb0ab75d87b34779dbab922bbc62
Reviewed-on: http://gerrit.cloudera.org:8080/17575
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for CREATE TABLE AS SELECT statements
for Iceberg tables.
CTAS statements work like the following in Impala:
1. Analysis of the whole CTAS statement
2. Divide CTAS to CREATE stmt and INSERT stmt
3. Create temporary in-memory target table from the CREATE stmt
4. Analyse the INSERT statement by using the temporary target table
5. If everything is OK so far, create the target table
6. Execute the INSERT query
For Iceberg tables the non-trivial part was to create the temporary
target table without actually creating it via the Iceberg API. I've
created a new class 'IcebergCtasTarget' that mimics an FeIcebergTable.
It can be used with both catalog V1 and V2.
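A minimal sketch of a now-supported statement (names hypothetical):
CREATE TABLE ice_ctas
STORED AS ICEBERG
AS SELECT id, name FROM src_tbl WHERE id > 0;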
Testing:
* e2e CTAS tests in iceberg-ctas.test
* SHOW CREATE TABLE stmts in show-create-table.test
Change-Id: I81d2084e401b9fa74d5ad161b51fd3e2aa3fcc67
Reviewed-on: http://gerrit.cloudera.org:8080/17130
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For convenience this patch adds support for the old-style
CREATE TABLE ... PARTITIONED BY ... syntax for Iceberg tables.
So users should be able to write the following:
CREATE TABLE ice_t (i int)
PARTITIONED BY (p int)
STORED AS ICEBERG;
Which should be equivalent to this:
CREATE TABLE ice_t (i int, p int)
PARTITION BY SPEC (p IDENTITY)
STORED AS ICEBERG;
Please note that the old-style CREATE TABLE statement creates
IDENTITY-partitioned tables. For other partition transforms the
users must use the new, more generic syntax.
Hive also supports the old PARTITIONED BY syntax with the same
behavior.
Testing:
* added e2e tests
Change-Id: I789876c161bc0987820955aa9ae01414e0dcb45d
Reviewed-on: http://gerrit.cloudera.org:8080/16979
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for creating required/optional fields for
Iceberg tables. If we set the 'NOT NULL' property on an Iceberg table
column in SQL, Impala creates a required field via the Iceberg API;
'NULL' or the default creates an optional field.
Besides, 'DESCRIBE XXX' for an Iceberg table will display the
nullability like this:
+------+--------+---------+----------+
| name | type | comment | nullable |
+------+--------+---------+----------+
| id | int | | false |
| name | string | | true |
| age | int | | true |
+------+--------+---------+----------+
And 'SHOW CREATE TABLE XXX' will also display the 'NULL'/'NOT NULL'
property for Iceberg table columns.
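A sketch of a table definition exercising both kinds of fields
(matching the DESCRIBE output above):
CREATE TABLE ice_t (
  id INT NOT NULL,
  name STRING NULL,
  age INT
) STORED AS ICEBERG;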
Tests:
* added new test in iceberg-create.test
* added new test in iceberg-negative.test
* added new test in show-create-table.test
* modified 'DESCRIBE XXX' results in iceberg-create.test
* modified 'DESCRIBE XXX' results in iceberg-alter.test
* modified create table results in show-create-table.test
Change-Id: I70b8014ba99f43df1b05149ff7a15cf06b6cd8d3
Reviewed-on: http://gerrit.cloudera.org:8080/16904
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HiveCatalog is one of Iceberg's catalog implementations. It uses
the Hive metastore and it is the recommended catalog implementation
when the table data is stored in object stores like S3.
This commit updates the Iceberg version to a newer one, and it also
retrieves Iceberg from the CDP distribution because that version of
Iceberg is built against Hive 3 (Impala is only compatible with
Hive 3).
This commit makes HiveCatalog the default Iceberg catalog in Impala
because it can be used in more environments (e.g. cloud stores),
and it is more featureful. Also, other engines that store their
table metadata in HMS will probably use HiveCatalog as well.
Tables stored in HiveCatalog are similar to Kudu tables with HMS
integration, i.e. modifying an Iceberg table via the Iceberg APIs
also modifies the HMS table. So in CatalogOpExecutor we handle
such Iceberg tables similarly to integrated Kudu tables.
Testing:
* Added e2e tests for creating, writing, and altering Iceberg
tables
* Added SHOW CREATE TABLE tests
Change-Id: Ie574589a1751aaa9ccbd34a89c6819714d103197
Reviewed-on: http://gerrit.cloudera.org:8080/16721
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for querying Iceberg tables with the ORC
file format. We can use the following SQL to create a table with the
ORC file format:
CREATE TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg.file_format'='orc', 'iceberg.catalog'='hadoop.tables');
Note that there are still some problems when scanning ORC files with
TIMESTAMP columns; for more details please refer to IMPALA-9967.
We may add new tests with the TIMESTAMP type after this JIRA is fixed.
Testing:
- Create table tests in functional_schema_template.sql
- Iceberg table create test in test_iceberg.py
- Iceberg table query test in test_scanners.py
Change-Id: Ib579461aa57348c9893a6d26a003a0d812346c4d
Reviewed-on: http://gerrit.cloudera.org:8080/16568
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for Iceberg Bucket and Truncate partition
transforms. Both accept a parameter: number of buckets and width
respectively.
Usage:
CREATE TABLE tbl_name (i int, p1 int, p2 timestamp)
PARTITION BY SPEC (
p1 BUCKET 10,
p1 TRUNCATE 5
) STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.tables');
Testing:
- Extended AnalyzeStmtsTest to cover creating partitioned Iceberg
tables with the new partition transforms.
- Extended ParserTest.
- Extended iceberg-create.test to create Iceberg tables with the new
partition transforms.
- Extended show-create-table.test to check that the new partition
transforms are displayed with their parameters in the SHOW CREATE
TABLE output.
Change-Id: Idc75cd23045b274885607c45886319f4f6da19de
Reviewed-on: http://gerrit.cloudera.org:8080/16551
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We provided several new table properties in IMPALA-10164, such as
'iceberg.catalog'. To keep these properties consistent, we rename
'iceberg_file_format' to 'iceberg.file_format'. When creating an
Iceberg table, we should now use SQL like this:
CREATE TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.file_format'='parquet',
               'iceberg.catalog'='hadoop.tables');
Change-Id: I722303fb765aca0f97a79bd6e4504765d355a623
Reviewed-on: http://gerrit.cloudera.org:8080/16550
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for creating Iceberg tables with
HadoopCatalog. We only supported the HadoopTables API before this
patch, but now we can also use HadoopCatalog to create Iceberg
tables. When creating a managed table, we can use SQL like this:
CREATE TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
               'iceberg.catalog_location'='hdfs://test-warehouse/iceberg_test');
We now support two values ('hadoop.catalog', 'hadoop.tables') for
'iceberg.catalog'. If you don't specify this property in your SQL,
the default catalog type is 'hadoop.catalog'.
As for external Iceberg tables, you can use SQL like this:
CREATE EXTERNAL TABLE default.iceberg_test_external
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
'iceberg.catalog_location'='hdfs://test-warehouse/iceberg_test',
'iceberg.table_identifier'='default.iceberg_test');
We cannot set the table location for either managed or external
Iceberg tables with 'hadoop.catalog', and 'SHOW CREATE TABLE' will
not display the table location yet. We need to use 'DESCRIBE
FORMATTED/EXTENDED' to get this location info.
'iceberg.catalog_location' is necessary for 'hadoop.catalog' tables;
it is the location that stores the Iceberg table metadata and data,
and we use this location to load the table metadata from Iceberg.
'iceberg.table_identifier' is used for the Iceberg TableIdentifier.
If this property is not specified in SQL, Impala will use the
database and table name to load the Iceberg table, which is
'default.iceberg_test_external' in the SQL above. The property value
is split by '.', so you can also set a value like 'org.my_db.my_tbl'.
This property is valid for both managed and external tables.
Testing:
- Create table tests in functional_schema_template.sql
- Iceberg table create test in test_iceberg.py
- Iceberg table query test in test_scanners.py
- Iceberg table show create table test in test_show_create_table.py
Change-Id: Ic1893c50a633ca22d4bca6726c9937b026f5d5ef
Reviewed-on: http://gerrit.cloudera.org:8080/16446
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for querying Iceberg tables through Impala.
We can use the following SQL to create an external Iceberg table:
CREATE EXTERNAL TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
Or with just the table name and location, like this:
CREATE EXTERNAL TABLE default.iceberg_test
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
'iceberg_file_format' is the file format used in Iceberg. Currently
only PARQUET is supported; other formats will be supported in the
future. If you don't specify this property in your SQL, the default
file format is PARQUET.
We achieve this by treating the Iceberg table as a normal
unpartitioned HDFS table. When querying an Iceberg table, we push
down partition column predicates to Iceberg to decide which data
files need to be scanned, and then transfer this information to the
BE to do the real scan operation.
Testing:
- Unit test for Iceberg in FileMetadataLoaderTest
- Create table tests in functional_schema_template.sql
- Iceberg table query test in test_scanners.py
Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
Reviewed-on: http://gerrit.cloudera.org:8080/16143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for creating Iceberg tables through Impala.
We can use the following SQL to create a new Iceberg table:
create table iceberg_test(
level string,
event_time timestamp,
message string,
register_time date,
telephone array <string>
)
partition by spec(
level identity,
event_time identity,
event_time hour,
register_time day
)
stored as iceberg;
'identity' is one of Iceberg's partition transforms. 'identity' means
that the source data values are used to create partitions; other
partition transforms, such as BUCKET/TRUNCATE, will be supported in
the future. We can also use 'show create table iceberg_test' to
display the table schema, and 'show partitions iceberg_test' to
display partition column info. Note that a partition column must be
a source column of the table.
Testing:
- Add test cases in metadata/test_show_create_table.py.
- Add custom cluster test test_iceberg.py.
Change-Id: I8d85db4c904a8c758c4cfb4f19cfbdab7e6ea284
Reviewed-on: http://gerrit.cloudera.org:8080/15797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.
This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.
The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.
Testing:
- Dryrun of GVO
- Modified TestPartitionMetadataUncompressedTextOnly's
test_unsupported_text_compression() to add LZO case
Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
HMS seems to be returning SQLPrimaryKeys in inconsistent orders.
This makes some of the primary keys tests flaky. This change sorts
the list of primary keys and stores them in canonical order within
Impala.
Testing:
- Modified the tests that were relying on HMS to return same order
every time.
- Ran parametrized job.
Change-Id: I0f798d7a2659c6cd061002db151f3fa787eb6370
Reviewed-on: http://gerrit.cloudera.org:8080/15106
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
in LocalCatalog Mode.
This change adds a new method 'loadConstraints()' to the MetaProvider
interface.
1. In the CatalogdMetaProvider implementation, we fetch the primary
key (PK) and foreign key (FK) information via the
GetPartialCatalogObject() RPC to the catalogd, which is modified to
include PK/FK information. This is because on the catalog side we
eagerly load PK/FK information, which can be sent over to the local
catalog in a single RPC. This information is then stored in the
TableMetaRef object for future consumers.
2. In the DirectMetaProvider implementation, we make two RPCs to HMS
to directly get PK/FK information.
Loading constraints can be extended to include other constraints
later (e.g. unique constraints).
Testing:
- Added tests in LocalCatalogTest, CatalogTest and PartialCatalogInfoTest
- This change also modifies ToSqlUtils for show create table
statements. Added a test for the same.
Change-Id: I7ea7e1bacf6eb502c67caf310a847b32687e0d58
Reviewed-on: http://gerrit.cloudera.org:8080/14731
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In HMS-3 the translation layer converts a managed kudu table into an
external kudu table and adds additional table property
'external.table.purge' to 'true'. This means any installation which
is using HMS-3 (or a Hive version which has HIVE-22158) will always
create Kudu tables as external tables. This is problematic since the
output of show create table will now be different and may confuse
the users.
In order to improve the user experience of such synchronized tables
(external tables with external.table.purge property set to true),
this patch adds support in Impala for creating external Kudu tables.
Previous versions of Impala disallowed creating an external Kudu
table if the Kudu table did not exist. After this patch, Impala
checks whether the Kudu table exists and, if it does not, creates a
Kudu table based on the schema provided in the create table
statement. The command errors out if the Kudu table already exists.
However, this applies only to synchronized tables. The previous way
to create a pure external table behaves the same.
The following syntax for creating a synchronized table is now allowed:
CREATE EXTERNAL TABLE foo (
id int PRIMARY KEY,
name string)
PARTITION BY HASH PARTITIONS 8
STORED AS KUDU
TBLPROPERTIES ('external.table.purge'='true')
The syntax is very similar to creating a managed table, except for
the EXTERNAL keyword and additional table property. A synchronized
table will behave similar to managed Kudu tables (drops and renames
are allowed). The output of show create table on a synchronized
table will display the full column and partition spec similar to the
managed tables.
Testing:
1. After the CDP version bump all of the existing Kudu tables now
create synchronized tables so there is good coverage there.
2. Added additional tests which create synchronized tables and
compares the show create table output.
3. Ran exhaustive tests with both CDP and CDH builds.
Change-Id: I76f81d41db0cf2269ee1b365857164a43677e14d
Reviewed-on: http://gerrit.cloudera.org:8080/14750
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Extended the SQL grammar with an optional and a default flag for
SORT BY, namely ZORDER and LEXICAL. If ZORDER is set, the new
'sort.algorithm' table property is set to ZORDER and the information
is passed down to the backend. The default order is indicated by
LEXICAL and can be omitted. Examples are:
CREATE TABLE t (a INT, b INT) PARTITIONED BY (c INT)
SORT BY ZORDER (a, b);
CREATE TABLE t SORT BY ZORDER (int_col,id) LIKE u;
CREATE TABLE t LIKE PARQUET '/foo' SORT BY ZORDER (id,zip);
ALTER TABLE t SORT BY ZORDER (int_col,id);
The following two are the same statements:
CREATE TABLE t (a INT, b INT) SORT BY (a, b);
CREATE TABLE t (a INT, b INT) SORT BY LEXICAL (a, b);
For strings, varchars, floats and doubles Z-ordering is currently
not supported. It's not suitable for strings and varchars, but
support can be added for floats and doubles later. The supported
types are: boolean, int types, decimals, date, timestamp, and char.
Currently ZORDER has the same functionality as a simple SORT BY
clause, therefore it is hidden behind a feature flag: unlock_zorder.
The custom sorting with Z-ordering will come in a later commit.
Testing:
* Added tests for the ZORDER option for every SORT BY test.
* Modified some tests by adding the LEXICAL option.
* The .test workloads are temporarily put in separate test files
in order to set up the feature flag. These tests are run from
tests/custom_cluster/test_zorder.py which is a duplication of
the relevant tests, but with CustomClusterTestSuite decorator.
Change-Id: Ie122002ca8f52ca2c1e1ec8ff1d476ae1f4f875d
Reviewed-on: http://gerrit.cloudera.org:8080/13955
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes the NullPointerException in SHOW CREATE TABLE for HBase
tables.
Testing:
- Moved the content of hbase-show-create-table.test back to
show-create-table.test
- Ran show-create-table end-to-end tests
Change-Id: Ibe018313168fac5dcbd80be9a8f28b71a2c0389b
Reviewed-on: http://gerrit.cloudera.org:8080/9884
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds support for adding SORT BY (...) clauses to CREATE
TABLE and ALTER TABLE statements. Examples are:
CREATE TABLE t (i INT, j INT, k INT) PARTITIONED BY (l INT) SORT BY (i, j);
CREATE TABLE t SORT BY (int_col,id) LIKE u;
CREATE TABLE t LIKE PARQUET '/foo' SORT BY (id,zip);
ALTER TABLE t SORT BY (int_col,id);
ALTER TABLE t SORT BY ();
Sort columns can only be specified for Hdfs tables and effectiveness may
vary based on storage type; for example TEXT tables will not see
improved compression. The SORT BY clause must not contain clustering
columns. The columns in the SORT BY clause are stored in the
'sort.columns' table property and will result in an additional SORT node
being added to the plan before the final table sink. Specifying sort
columns also enables clustering during inserts, so the SORT node will
contain all partitioning columns first, followed by the sort columns. We
do this because sort columns add a SORT node to the plan and adding the
clustering columns to the SORT node is cheap.
Sort columns supersede the sortby() hint, which we will remove in a
subsequent change (IMPALA-5144). Until then, it is possible to specify
sort columns using both ways at the same time and the column lists
will be concatenated.
Change-Id: I08834f38a941786ab45a4381c2732d929a934f75
Reviewed-on: http://gerrit.cloudera.org:8080/6495
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
For a table that has both a table comment and a partition spec,
"show create table" incorrectly outputs the comment before the
partition clause. This is not the correct order, and it results in
invalid SQL. This change fixes the ordering (the partition clause
comes before the comment) and adds tests for this case.
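A sketch of the corrected output order (names hypothetical):
CREATE TABLE t (i INT)
PARTITIONED BY (p INT)
COMMENT 'table comment'
STORED AS TEXTFILE;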
Change-Id: I29a33cfd142b473997fdc3acfe3f0966bc7ed784
Reviewed-on: http://gerrit.cloudera.org:8080/5648
Tested-by: Impala Public Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This commit fixes an issue where a SHOW CREATE VIEW statement throws an
analysis error if the view contains a subquery.
Change-Id: I4a89e46a022f0ccec198b6e3e2b30230103831ce
Reviewed-on: http://gerrit.cloudera.org:8080/5333
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
SHOW CREATE TABLE already outputs information for views. As a
convenience, this patch adds SHOW CREATE VIEW as an alias for SHOW
CREATE TABLE.
Switched some SHOW CREATE TABLE tests on views to use SHOW CREATE
VIEW and added an additional test for SHOW CREATE VIEW on a table so
that the expected behaviour is tested.
Change-Id: I9925e0789573e9b097a2ef52b5023964dcf8f32c
Reviewed-on: http://gerrit.cloudera.org:8080/1661
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This change implements support for PARTITIONED BY clauses in CTAS
statements. The syntax and semantics follow the PARTITION feature of
insert from select statements: inside the PARTITIONED BY (...) column
list the user must specify names of the columns to partition by. These
column names must appear in that particular order at the end of the
select statement. A remapping between columns of the source and
destination tables is not possible, because the destination table does
not yet exist. Specifying static values for the partition columns is
also not possible, as their type needs to be deduced from columns in the
select statement. Example:
CREATE TABLE t (a DOUBLE, b INT);
INSERT INTO t VALUES (1.5, 3);
CREATE TABLE p PARTITIONED BY (b) AS SELECT a, b FROM t;
This change also contains a fix for setting the PYTHONPATH environment
variable correctly, so you can run single python tests from the command
line.
Change-Id: I5f61854d36d1ee30cfcd1c6b2b3eb971f6cf4b2f
Reviewed-on: http://gerrit.cloudera.org:8080/1740
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
SHOW CREATE TABLE now supports views. It returns a CREATE VIEW statement
with column names and the original sql statement.
Authorization allows SHOW CREATE TABLE to be run on a view if the
user has the VIEW_METADATA privilege on the view and the SELECT
privilege on all underlying views and tables.
E.g. "SHOW CREATE TABLE some_view" returns output of form:
CREATE VIEW a_database.some_view (id, bool_col, tinyint_col) AS
SELECT id, bool_col, tinyint_col FROM functional.alltypes
Change-Id: Id633af2f5c1f5b0e01c13ed85c4bf9c045dc0666
Reviewed-on: http://gerrit.cloudera.org:8080/713
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Additionally, this patch also disabled the hbase/none test dimension
if the TARGET_FILESYSTEM environment variable is set to either s3 or
isilon.
Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This change updates our DDL syntax support to allow using 'STORED AS
PARQUET' as well as 'STORED AS PARQUETFILE'. Moving forward we should
prefer the new syntax, but continue to support the old. I made the
same change for 'AVROFILE', but since we have not yet documented the
'AVROFILE' syntax I left out support for the old syntax.
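For illustration (table names hypothetical), both spellings now
parse, with the first preferred going forward:
CREATE TABLE parquet_tbl (id INT) STORED AS PARQUET;
CREATE TABLE parquet_tbl2 (id INT) STORED AS PARQUETFILE;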
Change-Id: I10c73a71a94ee488c9ae205485777b58ab8957c9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1053
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Adds support for "show create table", a DDL statement that outputs a DDL statement that
creates the specified table.
In general, the output DDL works in Impala, so a user can copy the output and execute it
to create the same table. However, there are a few special cases that output Hive DDL
because we do not support creating some tables in Impala: HBase tables and tables with
LZO compressed text. When we do support creating these tables in Impala, users should
be able to execute the DDL in Impala as well.
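A minimal usage sketch (table name hypothetical); the output is
itself executable DDL:
SHOW CREATE TABLE my_db.t1;
-- Prints a CREATE TABLE statement for my_db.t1 that can be copied
-- and re-executed to recreate the table.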
Change-Id: I8c130297a657810dea5b994bf99d72b0e61b847b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/842
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>