SHOW FILES on Iceberg tables lists all files in the table directory,
including deleted files and metadata files. It should only show the
current data files.
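Conceptually, the fix means enumerating only the data files of the
current snapshot through the Iceberg API instead of listing the table
directory; a minimal, illustrative sketch (not the exact Impala code):

  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.io.CloseableIterable;

  // Illustrative sketch: list only the data files referenced by the
  // current snapshot instead of everything under the table directory.
  static void printCurrentDataFiles(Table table) throws Exception {
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        System.out.println(file.path() + " " + file.fileSizeInBytes());
      }
    }
  }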
Testing:
- existing tests
Change-Id: If07c2fd6e05e494f7240ccc147b8776a8f217179
Reviewed-on: http://gerrit.cloudera.org:8080/18455
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg stores the table metadata next to the data files; when it is
accessed through the Iceberg API, filesystem calls are executed (HDFS,
S3, ADLS). These calls were made in various places during query
processing, and this patch unifies the Iceberg metadata requests in the
CatalogD and ImpalaD:
- CatalogD loads and caches the org.apache.iceberg.Table object.
- When ImpalaDs request the Table metadata, the current catalog
  snapshot id is sent over and the ImpalaD loads and caches the
  org.apache.iceberg.Table object through the Iceberg API as well.
This approach (loading the Iceberg table twice) was chosen because the
org.apache.iceberg.Table object could not be meaningfully serialized
and deserialized: serializing a Table yields a lightweight
SerializableTable object, which lives in the Iceberg core package.
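As a rough, hypothetical sketch (not the actual Impala code), each
daemon ends up with its own Table handle loaded through the catalog and
can compare its current snapshot id with the one sent by the CatalogD:

  import org.apache.iceberg.Table;
  import org.apache.iceberg.catalog.Catalog;
  import org.apache.iceberg.catalog.TableIdentifier;

  // Hypothetical sketch: the ImpalaD reloads the table through the Iceberg
  // API and compares its snapshot id with the one sent by the CatalogD.
  static Table loadAndCheck(Catalog catalog, String db, String tbl,
      long catalogSnapshotId) {
    Table table = catalog.loadTable(TableIdentifier.of(db, tbl));
    long localSnapshotId = table.currentSnapshot() == null
        ? -1 : table.currentSnapshot().snapshotId();
    if (localSnapshotId != catalogSnapshotId) {
      // The locally loaded metadata differs from the catalog's view; the
      // Iceberg metadata would need to be refreshed in this case.
    }
    return table;
  }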
As a result, REFRESH/INVALIDATE METADATA is required to pick up any
Iceberg metadata changes, and the metadata load time is improved.
The improvement is more significant for smaller queries, where the
metadata request has a larger impact on the query execution time.
Additionally, the dependency on the Iceberg core package has been
reduced: usages of the TableMetadata/BaseTable classes have been
replaced with the Table class from the Iceberg API package in most
places.
Testing:
- Passed Iceberg E2E tests.
Change-Id: I5492e0cdb31602f0276029c2645d14ff5cb2f672
Reviewed-on: http://gerrit.cloudera.org:8080/18353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This reverts commit cd10acdbb1.
The commit is reverted because it blocks upgrading the Iceberg version
to 0.13.
In the newer Iceberg version the BaseTable serialization has changed:
a BaseTable now serializes to its sibling class SerializableTable.
This is a lightweight Table class that does not carry the metadata
needed for caching and reuse by the ImpalaDs.
How to make use of SerializableTable needs further consideration.
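The change in behavior can be illustrated with a plain Java
serialization round-trip (hedged sketch; it only assumes the Iceberg
0.13 serialization behavior described above, and is not Impala code):

  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutputStream;
  import org.apache.iceberg.Table;

  // Hedged sketch: in Iceberg 0.13 a serialization round-trip of a
  // catalog-loaded table no longer yields a BaseTable but a lightweight
  // SerializableTable, so the cached object cannot be shipped to ImpalaDs.
  static Table roundTrip(Table table) throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(table);
    }
    try (ObjectInputStream ois =
        new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
      return (Table) ois.readObject();  // e.g. org.apache.iceberg.SerializableTable
    }
  }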
Change-Id: I21e65cb3ab38d9e683223fb100d7ced90caa6edd
Reviewed-on: http://gerrit.cloudera.org:8080/18305
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg stores the table metadata next to the data files; when it is
accessed through the Iceberg API, filesystem calls are executed (HDFS,
S3, ADLS). These calls were made in various places during query
processing, and this patch unifies the Iceberg metadata requests in the
CatalogD, similar to other metadata requests:
- CatalogD loads and caches the org.apache.iceberg.BaseTable object.
- ImpalaDs request the org.apache.iceberg.BaseTable from the
  CatalogD and cache it as well.
As a result, REFRESH/INVALIDATE METADATA is required to pick up any
Iceberg metadata changes, and the metadata load time is improved.
The improvement is more significant for smaller queries, where the
metadata request has a larger impact on the query execution time.
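The caching can be pictured as a simple map from the fully qualified
table name to the loaded metadata object, invalidated only by
REFRESH/INVALIDATE METADATA (hypothetical sketch, not the actual
CatalogD code):

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.function.Function;
  import org.apache.iceberg.BaseTable;

  // Hypothetical sketch of the caching behavior: the loaded BaseTable is
  // kept until the table is explicitly refreshed or invalidated.
  class IcebergMetadataCache {
    private final ConcurrentHashMap<String, BaseTable> cache =
        new ConcurrentHashMap<>();

    BaseTable getOrLoad(String fqTableName, Function<String, BaseTable> loader) {
      return cache.computeIfAbsent(fqTableName, loader);
    }

    // Called on REFRESH / INVALIDATE METADATA so that the next access
    // reloads the Iceberg metadata from the filesystem.
    void invalidate(String fqTableName) {
      cache.remove(fqTableName);
    }
  }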
Testing:
- Passed Iceberg E2E tests.
Change-Id: I9e62a1fb9753ea1b022c7763047d9ccfd1d27d62
Reviewed-on: http://gerrit.cloudera.org:8080/18226
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
This patch adds support for the following standard Iceberg properties:
write.parquet.compression-codec:
Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
(default value), LZ4, ZSTD. The table property will be ignored if
COMPRESSION_CODEC query option is set.
write.parquet.compression-level:
Parquet compression level. Used with ZSTD compression only.
Supported range is [1, 22]. Default value is 3. The table property
will be ignored if COMPRESSION_CODEC query option is set.
write.parquet.row-group-size-bytes:
Parquet row group size in bytes. Supported range is [8388608,
2146435072] (8MB - 2047MB). The table property will be ignored if
PARQUET_FILE_SIZE query option is set.
If neither the table property nor the PARQUET_FILE_SIZE query option
is set, the way Impala calculates row group size will remain
unchanged.
write.parquet.page-size-bytes:
Parquet page size in bytes. Used for PLAIN encoding. Supported range
is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates page size
will remain unchanged.
write.parquet.dict-size-bytes:
Parquet dictionary page size in bytes. Used for dictionary encoding.
Supported range is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates dictionary
page size will remain unchanged.
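For illustration, resolving one of these properties from the Iceberg
table could look roughly like the hypothetical helper below (the actual
precedence and validation live in Impala's planner and writer code):

  import java.util.Map;
  import org.apache.iceberg.Table;

  // Hypothetical sketch: the PARQUET_FILE_SIZE query option takes
  // precedence, then the table property, then Impala's existing behavior.
  static long resolveRowGroupSize(Table table, Long queryOptionFileSize,
      long impalaDefault) {
    if (queryOptionFileSize != null) return queryOptionFileSize;
    Map<String, String> props = table.properties();
    String value = props.get("write.parquet.row-group-size-bytes");
    if (value == null) return impalaDefault;
    long size = Long.parseLong(value);
    // Supported range: [8388608, 2146435072] (8MB - 2047MB).
    if (size < 8388608L || size > 2146435072L) {
      throw new IllegalArgumentException("Row group size out of range: " + size);
    }
    return size;
  }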
This patch also renames the 'iceberg.file_format' table property to
'write.format.default', which is the standard Iceberg name for this
property.
Change-Id: I3b8aa9a52c13c41b48310d2f7c9c7426e1ff5f23
Reviewed-on: http://gerrit.cloudera.org:8080/17654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Iceberg format spec defines what types to use for different file
formats, e.g.: https://iceberg.apache.org/spec/#parquet
Impala should follow the specification, so this patch
* annotates strings with UTF8 in Parquet metadata
* removes fixed(L) <-> CHAR(L) mapping
* forbids INSERTs when the Iceberg schema has a TIMESTAMPTZ column
This patch also refactors the type/schema conversions as
Impala => Iceberg conversions were duplicated in
IcebergCatalogOpExecutor and IcebergUtil. I introduced the class
'IcebergSchemaConverter' to contain the code for conversions.
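A simplified, hypothetical sketch of the primitive-type part of this
conversion (only a few types shown; the real logic lives in
IcebergSchemaConverter):

  import org.apache.iceberg.types.Type;
  import org.apache.iceberg.types.Types;

  // Simplified sketch: STRING maps to the Iceberg string type (written as
  // UTF8-annotated data in Parquet), TIMESTAMP maps to timestamp without
  // time zone, and CHAR/VARCHAR are rejected.
  static Type toIcebergType(String impalaType) {
    switch (impalaType) {
      case "INT": return Types.IntegerType.get();
      case "BIGINT": return Types.LongType.get();
      case "STRING": return Types.StringType.get();
      case "TIMESTAMP": return Types.TimestampType.withoutZone();
      case "CHAR":
      case "VARCHAR":
        throw new UnsupportedOperationException(impalaType + " is not supported");
      default:
        throw new UnsupportedOperationException("Unknown type: " + impalaType);
    }
  }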
Testing:
* added test to check CHAR and VARCHAR types are not allowed
* test that INSERTs are not allowed when the table has TIMESTAMPTZ
* added test to check that strings are annotated with UTF8
Change-Id: I652565f82708824f5cf7497139153b06f116ccd3
Reviewed-on: http://gerrit.cloudera.org:8080/16851
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HiveCatalog is one of Iceberg's catalog implementations. It uses
the Hive metastore and it is the recommended catalog implementation
when the table data is stored in object stores like S3.
This commit updates the Iceberg version to a newer one, and it also
retrieves Iceberg from the CDP distribution because that version of
Iceberg is built against Hive 3 (Impala is only compatible with
Hive 3).
This commit makes HiveCatalog the default Iceberg catalog in Impala
because it can be used in more environments (e.g. cloud stores),
and it is more featureful. Also, other engines that store their
table metadata in HMS will probably use HiveCatalog as well.
Tables stored in HiveCatalog are similar to Kudu tables with HMS
integration, i.e. modifying an Iceberg table via the Iceberg APIs
also modifies the HMS table. So in CatalogOpExecutor we handle
such Iceberg tables similarly to integrated Kudu tables.
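In terms of the Iceberg API, using HiveCatalog means the table's
current metadata pointer is resolved through HMS rather than a fixed
filesystem location; loading a table looks roughly like this (hedged
sketch, the configuration values are placeholders):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.catalog.TableIdentifier;
  import org.apache.iceberg.hive.HiveCatalog;

  // Hedged sketch: load an Iceberg table through HiveCatalog, i.e. the
  // current metadata location comes from the Hive metastore.
  static Table loadViaHiveCatalog(Configuration conf, String db, String tbl) {
    HiveCatalog catalog = new HiveCatalog();
    catalog.setConf(conf);
    Map<String, String> props = new HashMap<>();
    props.put("uri", "thrift://metastore-host:9083");  // placeholder HMS URI
    catalog.initialize("hive", props);
    return catalog.loadTable(TableIdentifier.of(db, tbl));
  }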
Testing:
* Added e2e tests for creating, writing, and altering Iceberg
tables
* Added SHOW CREATE TABLE tests
Change-Id: Ie574589a1751aaa9ccbd34a89c6819714d103197
Reviewed-on: http://gerrit.cloudera.org:8080/16721
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds support for INSERT INTO statements against Iceberg
tables when the table is non-partitioned and the underlying file format
is Parquet.
We still use Impala's HdfsParquetTableWriter to write the data files,
though they needed some modifications to conform to the Iceberg spec,
namely:
* write Iceberg/Parquet 'field_id' for the columns
* TIMESTAMPs are encoded as INT64 micros (without time zone)
We use DmlExecState to transfer information from the table sink
operators to the coordinator, then updateCatalog() invokes the
AppendFiles API to add the files atomically. DmlExecState is encoded in
protobuf, while communication with the Frontend uses Thrift; therefore,
to avoid defining the Iceberg DataFile structure multiple times, it is
stored in FlatBuffers.
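On the commit path this boils down to an atomic append through the
Iceberg API, roughly as in the illustrative sketch below (the real code
builds the DataFile descriptors from the FlatBuffers-encoded sink
output; the sizes and counts here are placeholders):

  import org.apache.iceberg.AppendFiles;
  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.DataFiles;
  import org.apache.iceberg.FileFormat;
  import org.apache.iceberg.Table;

  // Illustrative sketch: register the Parquet files written by the table
  // sinks in a single atomic Iceberg append.
  static void appendWrittenFiles(Table table, Iterable<String> paths) {
    AppendFiles append = table.newAppend();
    for (String path : paths) {
      DataFile dataFile = DataFiles.builder(table.spec())
          .withPath(path)
          .withFormat(FileFormat.PARQUET)
          .withFileSizeInBytes(1024L)  // placeholder; real size from the sink
          .withRecordCount(100L)       // placeholder; real count from the sink
          .build();
      append.appendFile(dataFile);
    }
    append.commit();  // all files become visible atomically in a new snapshot
  }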
The commit also does some corrections on Impala type <-> Iceberg type
mapping:
* Impala TIMESTAMP is Iceberg TIMESTAMP (without time zone)
* Impala CHAR is Iceberg FIXED
Testing:
* Added INSERT tests to iceberg-insert.test
* Added negative tests to iceberg-negative.test
* I also did some manual testing with Spark. Spark is able to read
  Iceberg tables written by Impala unless we use TIMESTAMPs; in that
  case Spark rejects the data files because it only accepts TIMESTAMPs
  with time zone.
* Added concurrent INSERT tests to test_insert_stress.py
Change-Id: I5690fb6c2cc51f0033fa26caf8597c80a11bcd8e
Reviewed-on: http://gerrit.cloudera.org:8080/16545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>