SHOW FILES on Iceberg tables lists all files in the table directory,
even deleted files and metadata files. We should only show the current
data files.
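For example (table name hypothetical):
SHOW FILES IN iceberg_tbl;
After this patch the statement lists only the current data files instead
of every file under the table directory.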
Testing:
- existing tests
Change-Id: If07c2fd6e05e494f7240ccc147b8776a8f217179
Reviewed-on: http://gerrit.cloudera.org:8080/18455
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IcebergScanNode interprets the timestamp literals as UTC timestamps
during predicate pushdown to Iceberg. It causes problems when the
Iceberg table uses TIMESTAMPTZ (which corresponds to TIMESTAMP WITH
LOCAL TIME ZONE in SQL) because in the scanners we assume that the
timestamp literals in a query are in local timezone.
Hence, if the Iceberg table is partitioned by HOUR(ts), and Impala is
running in a different timezone than UTC, then the following query
doesn't return any rows:
SELECT * FROM t
WHERE ts = <some ts>;
This is because during predicate pushdown the timestamp is interpreted
as a UTC timestamp (no conversion from local to UTC), while during query
execution the timestamp data in the files is converted to the local
timezone and then compared to <some ts>. I.e., the scanner assumes that
<some ts> is in the local timezone.
On the other hand, when the Iceberg type TIMESTAMP (which corresponds
to TIMESTAMP WITHOUT TIME ZONE in SQL) is used, we should just push down
the timestamp values without any conversion. In this case there is no
conversion in the scanners either.
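A hypothetical repro, assuming table 't' is partitioned by HOUR(ts) and
'ts' is TIMESTAMPTZ (timezone and literal are only illustrative):
SET TIMEZONE='Europe/Budapest';
-- Before this fix the literal below was pushed down to Iceberg as-is
-- (treated as UTC), while the scanner compared file data converted to
-- local time against it, so the query returned no rows:
SELECT * FROM t WHERE ts = '2022-03-08 10:00:00';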
Testing:
* added e2e test with TIMESTAMPTZ
* added e2e test with TIMESTAMP
Change-Id: I181be5d2fa004f69b457f69ff82dc2f9877f46fa
Reviewed-on: http://gerrit.cloudera.org:8080/18399
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
When Impala or Hive creates a table, it lowercases the schema elements.
When Spark creates an Iceberg table, it doesn't lowercase the column
names in the Iceberg metadata. This triggers a precondition check in
Impala that makes such Iceberg tables unloadable.
This patch converts column names to lowercase when converting Iceberg
schemas to Hive/Impala schemas.
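A hypothetical illustration: if Spark stored a column as 'EventTime' in
the Iceberg metadata, the table is now loadable and the column is
exposed in lowercase:
DESCRIBE iceberg_spark_tbl;   -- shows the column as 'eventtime'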
Testing:
* added e2e test
Change-Id: Iffd910f76844fbf34db805dda6c3053c5ad1cf79
Reviewed-on: http://gerrit.cloudera.org:8080/18368
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In iceberg-query.test we create an external Iceberg table and
set the table property 'iceberg.file_format' to check
backward-compatibility with earlier versions. At the end we
delete the table. The table deletion makes the test fail
sporadically during GVO.
The bug seems to be caused by the parallel execution of this test. The
test didn't use a unique database, so dropping the table could affect
other executions of the same test. This patch moves the relevant queries
into their own .test file that uses a unique database.
Change-Id: I16e558ae5add48d8a39bd89277a0256f534ba65f
Reviewed-on: http://gerrit.cloudera.org:8080/17929
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
With IMPALA-10627 we switched to use standard Iceberg table
properties: https://iceberg.apache.org/configuration/
E.g. we switched from 'iceberg.file_format' to 'write.format.default'.
For backward compatibility we also support 'iceberg.file_format', though
the support is not perfect, as it causes a crash in some cases.
Impala crashes when the following conditions are met:
* local catalog mode is being used
* Iceberg table is being queried
* the data file format is ORC
* 'iceberg.file_format' is set instead of 'write.format.default' table
property
* the query is "select count(*) from t;"
Impala wrongly assumes that PARQUET is being used and tries to apply the
count star optimization. The optimization is not implemented for the ORC
scanner, which causes the crash.
This patch fixes the wrong assumption. It also fixes the HdfsOrcScanner
so that in release mode it raises an error instead of crashing.
This patch also enables UNSETting the file format table property for
Iceberg tables. Modifying this table property (changing the value via
SET TBLPROPERTIES) was already allowed.
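A minimal repro sketch under the conditions above (table name
hypothetical):
ALTER TABLE iceberg_orc_tbl SET TBLPROPERTIES ('iceberg.file_format'='orc');
-- In local catalog mode this query used to crash the impalad, because
-- the planner assumed PARQUET and applied the count star optimization
-- to ORC files:
select count(*) from iceberg_orc_tbl;
-- The property can now also be removed:
ALTER TABLE iceberg_orc_tbl UNSET TBLPROPERTIES ('iceberg.file_format');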
Testing:
* added e2e test for the above conditions
Change-Id: Iafd9baef1c124d7356a14ba24c571567629a5e50
Reviewed-on: http://gerrit.cloudera.org:8080/17877
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for the following standard Iceberg properties:
write.parquet.compression-codec:
Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
(default value), LZ4, ZSTD. The table property will be ignored if
COMPRESSION_CODEC query option is set.
write.parquet.compression-level:
Parquet compression level. Used with ZSTD compression only.
Supported range is [1, 22]. Default value is 3. The table property
will be ignored if COMPRESSION_CODEC query option is set.
write.parquet.row-group-size-bytes:
Parquet row group size in bytes. Supported range is [8388608,
2146435072] (8MB - 2047MB). The table property will be ignored if
PARQUET_FILE_SIZE query option is set.
If neither the table property nor the PARQUET_FILE_SIZE query option
is set, the way Impala calculates row group size will remain
unchanged.
write.parquet.page-size-bytes:
Parquet page size in bytes. Used for PLAIN encoding. Supported range
is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates page size
will remain unchanged.
write.parquet.dict-size-bytes:
Parquet dictionary page size in bytes. Used for dictionary encoding.
Supported range is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates dictionary
page size will remain unchanged.
This patch also renames 'iceberg.file_format' table property to
'write.format.default' which is the standard Iceberg name for the
table property.
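A hypothetical example that sets several of these properties at table
creation (table name and values are only illustrative):
CREATE TABLE iceberg_parquet_tuned (i int)
STORED AS ICEBERG
TBLPROPERTIES ('write.format.default'='parquet',
               'write.parquet.compression-codec'='zstd',
               'write.parquet.compression-level'='12',
               'write.parquet.row-group-size-bytes'='134217728');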
Change-Id: I3b8aa9a52c13c41b48310d2f7c9c7426e1ff5f23
Reviewed-on: http://gerrit.cloudera.org:8080/17654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for resolving columns by field id for Iceberg
tables. From now on we use field ids to resolve columns for Iceberg
tables, which means 'PARQUET_FALLBACK_SCHEMA_RESOLUTION' has no effect
on Iceberg tables.
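An illustrative sketch (table name hypothetical):
SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=position;
-- The option above is ignored for Iceberg tables: columns are matched
-- to the Parquet files by Iceberg field id, so renamed or reordered
-- columns still resolve correctly.
SELECT * FROM iceberg_tbl;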
Change-Id: I057bdc6ab2859cc4d40de5ed428d0c20028b8435
Reviewed-on: http://gerrit.cloudera.org:8080/16788
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
This patch adds support for querying Iceberg tables that use the ORC
file format. We can use the following SQL to create a table with the
ORC file format:
CREATE TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg.file_format'='orc', 'iceberg.catalog'='hadoop.tables');
Note that there are still some problems when scanning ORC files with
Timestamp columns; for more details please refer to IMPALA-9967. We may
add new tests with the Timestamp type after that JIRA is fixed.
Testing:
- Create table tests in functional_schema_template.sql
- Iceberg table create test in test_iceberg.py
- Iceberg table query test in test_scanners.py
Change-Id: Ib579461aa57348c9893a6d26a003a0d812346c4d
Reviewed-on: http://gerrit.cloudera.org:8080/16568
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-10164 introduced several new table properties, such as
'iceberg.catalog'. To keep these properties consistent, this patch
renames 'iceberg_file_format' to 'iceberg.file_format'. When creating an
Iceberg table, we should now use SQL like this:
CREATE TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.file_format'='parquet',
               'iceberg.catalog'='hadoop.tables');
Change-Id: I722303fb765aca0f97a79bd6e4504765d355a623
Reviewed-on: http://gerrit.cloudera.org:8080/16550
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for creating Iceberg tables via HadoopCatalog.
Before this patch only the HadoopTables API was supported, but now we
can also use HadoopCatalog to create Iceberg tables. A managed table can
be created with SQL like this:
CREATE TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
               'iceberg.catalog_location'='hdfs://test-warehouse/iceberg_test');
Two values are now supported for 'iceberg.catalog': 'hadoop.catalog' and
'hadoop.tables'. If this property is not specified in the SQL, the
default catalog type is 'hadoop.catalog'.
An external Iceberg table can be created with SQL like this:
CREATE EXTERNAL TABLE default.iceberg_test_external
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
'iceberg.catalog_location'='hdfs://test-warehouse/iceberg_test',
'iceberg.table_identifier'='default.iceberg_test');
The table location cannot be set for either managed or external Iceberg
tables that use 'hadoop.catalog', and 'SHOW CREATE TABLE' does not
display the table location yet. We need to use 'DESCRIBE
FORMATTED/EXTENDED' to get this location info.
'iceberg.catalog_location' is required for 'hadoop.catalog' tables. It
is the location where the Iceberg table metadata and data are stored,
and Impala uses it to load the table metadata from Iceberg.
'iceberg.table_identifier' is used for the Iceberg TableIdentifier. If
this property is not specified in the SQL, Impala uses the database and
table name to load the Iceberg table, which is
'default.iceberg_test_external' in the SQL above. The property value is
split by '.', so it can also be set like this: 'org.my_db.my_tbl'. This
property is valid for both managed and external tables.
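For example, to look up the location of the managed table created above:
DESCRIBE FORMATTED default.iceberg_test;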
Testing:
- Create table tests in functional_schema_template.sql
- Iceberg table create test in test_iceberg.py
- Iceberg table query test in test_scanners.py
- Iceberg table show create table test in test_show_create_table.py
Change-Id: Ic1893c50a633ca22d4bca6726c9937b026f5d5ef
Reviewed-on: http://gerrit.cloudera.org:8080/16446
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We found that test_iceberg_query and test_iceberg_profile fail after the
patch for IMPALA-9741 was merged, because the default timezone of Impala
is not UTC. This patch fixes the issue by adding "SET TIMEZONE=UTC;"
before those test queries are run.
Testing:
- Verified in a local development environment that the tests of
test_iceberg_query and test_iceberg_profile could pass after applying
this patch.
Change-Id: Ie985519e8ded04f90465e141488bd2dda78af6c3
Reviewed-on: http://gerrit.cloudera.org:8080/16425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for querying Iceberg tables through Impala.
We can use the following SQL to create an external Iceberg table:
CREATE EXTERNAL TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
Or with just the table name and location, like this:
CREATE EXTERNAL TABLE default.iceberg_test
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
'iceberg_file_format' is the file format used by Iceberg. Currently only
PARQUET is supported; other formats will be supported in the future. If
this property is not specified in the SQL, the default file format is
PARQUET.
We achieved this by treating the Iceberg table as a normal unpartitioned
HDFS table. When querying an Iceberg table, we push down partition
column predicates to Iceberg to decide which data files need to be
scanned, and then transfer this information to the BE to do the real
scan operation.
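A hypothetical example of the pushdown, assuming the table is
partitioned by a value derived from 'event_time': the predicate below is
handed to Iceberg, which returns only the data files whose partitions
can match, and the BE scans just those files:
SELECT count(*) FROM default.iceberg_test
WHERE event_time >= '2020-01-01 00:00:00';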
Testing:
- Unit test for Iceberg in FileMetadataLoaderTest
- Create table tests in functional_schema_template.sql
- Iceberg table query test in test_scanners.py
Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
Reviewed-on: http://gerrit.cloudera.org:8080/16143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>