Commit Graph

8 Commits

LPL
78609dca32 IMPALA-11256: Fix SHOW FILES on Iceberg tables lists all files
SHOW FILES on Iceberg tables lists all files in the table directory,
even deleted files and metadata files. We should only show the
current data files.
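
After this fix, SHOW FILES should list only the data files of the
current snapshot; a minimal illustration (table name hypothetical):

  SHOW FILES IN ice_tbl;
  -- expected: current data files only, no metadata or deleted files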

Testing:
  - existing tests

Change-Id: If07c2fd6e05e494f7240ccc147b8776a8f217179
Reviewed-on: http://gerrit.cloudera.org:8080/18455
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-04-28 17:48:23 +00:00
Tamas Mate
efba58f5f0 IMPALA-10737: Optimize the number of Iceberg API Metadata requests
Iceberg stores the table metadata next to the data files; when it is
accessed through the Iceberg API, a filesystem call is executed
(HDFS, S3, ADLS). These calls were issued in various places during
query processing, and this patch unifies the Iceberg metadata
requests in the CatalogD and ImpalaD:
 - CatalogD loads and caches the org.apache.iceberg.Table object.
 - When ImpalaDs request the Table metadata, the current catalog
   snapshot id is sent over and the ImpalaD loads and caches the
   org.apache.iceberg.Table object through the Iceberg API as well.

This approach (loading the Iceberg table twice) was chosen because
the org.apache.iceberg.Table object could not be meaningfully
serialized and deserialized. Serializing a Table yields a lightweight
SerializableTable object, which lives in the Iceberg core package.

As a result, REFRESH/INVALIDATE METADATA is required to reload any
Iceberg metadata changes, and the metadata load time is improved.
This improvement is more significant for smaller queries, where the
metadata request has a larger impact on the query execution time.
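
For example, after another engine commits a new snapshot, the cached
metadata can be reloaded like this (names hypothetical):

  REFRESH ice_db.ice_tbl;
  -- or, to discard and reload all cached metadata for the table:
  INVALIDATE METADATA ice_db.ice_tbl;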

Additionally, the dependency on the Iceberg core package has been
reduced: the TableMetadata/BaseTable usages have been replaced with
the Table class from the Iceberg API package in most places.

Testing:
 - Passed Iceberg E2E tests.

Change-Id: I5492e0cdb31602f0276029c2645d14ff5cb2f672
Reviewed-on: http://gerrit.cloudera.org:8080/18353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-04-04 23:52:33 +00:00
Tamas Mate
aef30d0442 Revert "IMPALA-10737: Optimize the number of Iceberg API Metadata requests"
This reverts commit cd10acdbb1.

This commit has been reverted because it blocks upgrading the
Iceberg version to 0.13.

In the newer Iceberg version the BaseTable serialization has changed:
a BaseTable is now serialized to a SerializableTable sibling class.
This is a lightweight Table class which does not have the necessary
metadata that could be cached and reused by the ImpalaDs.
SerializableTable utilization has to be considered further.

Change-Id: I21e65cb3ab38d9e683223fb100d7ced90caa6edd
Reviewed-on: http://gerrit.cloudera.org:8080/18305
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-03-10 15:30:59 +00:00
Tamas Mate
cd10acdbb1 IMPALA-10737: Optimize the number of Iceberg API Metadata requests
Iceberg stores the table metadata next to the data files; when it is
accessed through the Iceberg API, a filesystem call is executed
(HDFS, S3, ADLS). These calls were issued in various places during
query processing, and this patch unifies the Iceberg metadata
requests in the CatalogD, similar to other metadata requests:
 - CatalogD loads and caches the org.apache.iceberg.BaseTable object.
 - ImpalaDs request the org.apache.iceberg.BaseTable from the
   CatalogD and cache it as well.

As a result, REFRESH/INVALIDATE METADATA is required to reload any
Iceberg metadata changes, and the metadata load time is improved.
This improvement is more significant for smaller queries, where the
metadata request has a larger impact on the query execution time.

Testing:
 - Passed Iceberg E2E tests.

Change-Id: I9e62a1fb9753ea1b022c7763047d9ccfd1d27d62
Reviewed-on: http://gerrit.cloudera.org:8080/18226
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2022-03-04 15:24:23 +00:00
Attila Jeges
fabe994d1f IMPALA-10627: Use standard parquet-related Iceberg table properties
This patch adds support for the following standard Iceberg properties:

write.parquet.compression-codec:
  Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
  (default value), LZ4, ZSTD. The table property will be ignored if
  COMPRESSION_CODEC query option is set.

write.parquet.compression-level:
  Parquet compression level. Used with ZSTD compression only.
  Supported range is [1, 22]. Default value is 3. The table property
  will be ignored if COMPRESSION_CODEC query option is set.

write.parquet.row-group-size-bytes:
  Parquet row group size in bytes. Supported range is [8388608,
  2146435072] (8MB - 2047MB). The table property will be ignored if
  PARQUET_FILE_SIZE query option is set.
  If neither the table property nor the PARQUET_FILE_SIZE query option
  is set, the way Impala calculates row group size will remain
  unchanged.

write.parquet.page-size-bytes:
  Parquet page size in bytes. Used for PLAIN encoding. Supported range
  is [65536, 1073741824] (64KB - 1GB).
  If the table property is unset, the way Impala calculates page size
  will remain unchanged.

write.parquet.dict-size-bytes:
  Parquet dictionary page size in bytes. Used for dictionary encoding.
  Supported range is [65536, 1073741824] (64KB - 1GB).
  If the table property is unset, the way Impala calculates dictionary
  page size will remain unchanged.

This patch also renames the 'iceberg.file_format' table property to
'write.format.default', which is the standard Iceberg name for this
property.
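
As an illustration (table name hypothetical, values arbitrary), the
standard properties can be set like any other table property:

  ALTER TABLE ice_tbl SET TBLPROPERTIES (
    'write.format.default'='parquet',
    'write.parquet.compression-codec'='zstd',
    'write.parquet.compression-level'='12',
    'write.parquet.row-group-size-bytes'='134217728');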

Change-Id: I3b8aa9a52c13c41b48310d2f7c9c7426e1ff5f23
Reviewed-on: http://gerrit.cloudera.org:8080/17654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-20 23:58:06 +00:00
Zoltan Borok-Nagy
a7e71b4523 IMPALA-10358: Correct Iceberg type mappings
The Iceberg format spec defines what types to use for different file
formats, e.g.: https://iceberg.apache.org/spec/#parquet

Impala should follow the specification, so this patch
 * annotates strings with UTF8 in Parquet metadata
 * removes fixed(L) <-> CHAR(L) mapping
 * forbids INSERTs when the Iceberg schema has a TIMESTAMPTZ column
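
A sketch of the resulting behavior (table names hypothetical, exact
error handling may differ):

  -- rejected: the fixed(L) <-> CHAR(L) mapping was removed
  CREATE TABLE ice_chars (c CHAR(5)) STORED AS ICEBERG;

  -- rejected when ice_tz_tbl has a TIMESTAMPTZ column
  INSERT INTO ice_tz_tbl SELECT * FROM src_tbl;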

This patch also refactors the type/schema conversions as
Impala => Iceberg conversions were duplicated in
IcebergCatalogOpExecutor and IcebergUtil. I introduced the class
'IcebergSchemaConverter' to contain the code for conversions.

Testing:
 * added test to check CHAR and VARCHAR types are not allowed
 * test that INSERTs are not allowed when the table has TIMESTAMPTZ
 * added test to check that strings are annotated with UTF8

Change-Id: I652565f82708824f5cf7497139153b06f116ccd3
Reviewed-on: http://gerrit.cloudera.org:8080/16851
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-15 19:17:51 +00:00
Zoltan Borok-Nagy
4448b8755b IMPALA-10152: Add support for Iceberg HiveCatalog
HiveCatalog is one of Iceberg's catalog implementations. It uses
the Hive metastore and it is the recommended catalog implementation
when the table data is stored in object stores like S3.

This commit updates the Iceberg version to a newer one, and it also
retrieves Iceberg from the CDP distribution because that version of
Iceberg is built against Hive 3 (Impala is only compatible with
Hive 3).

This commit makes HiveCatalog the default Iceberg catalog in Impala
because it can be used in more environments (e.g. cloud stores) and
offers more features. Also, other engines that store their table
metadata in HMS will probably use HiveCatalog as well.
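
For illustration (table name hypothetical; the explicit property is
assumed equivalent to the new default):

  CREATE TABLE ice_hive_tbl (i INT) STORED AS ICEBERG;
  -- assumed equivalent to adding:
  --   TBLPROPERTIES ('iceberg.catalog'='hive.catalog')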

Tables stored in HiveCatalog are similar to Kudu tables with HMS
integration, i.e. modifying an Iceberg table via the Iceberg APIs
also modifies the HMS table. So in CatalogOpExecutor we handle
such Iceberg tables similarly to integrated Kudu tables.

Testing:
 * Added e2e tests for creating, writing, and altering Iceberg
   tables
 * Added SHOW CREATE TABLE tests

Change-Id: Ie574589a1751aaa9ccbd34a89c6819714d103197
Reviewed-on: http://gerrit.cloudera.org:8080/16721
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-20 21:40:28 +00:00
Zoltan Borok-Nagy
981ef10465 IMPALA-10215: Implement INSERT INTO for non-partitioned Iceberg tables (Parquet)
This commit adds support for INSERT INTO statements against Iceberg
tables when the table is non-partitioned and the underlying file format
is Parquet.
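
A minimal sketch of the new capability (table names hypothetical):

  CREATE TABLE ice_tgt (id INT, s STRING) STORED AS ICEBERG;
  INSERT INTO ice_tgt VALUES (1, 'a'), (2, 'b');
  INSERT INTO ice_tgt SELECT id, s FROM src_tbl;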

We still use Impala's HdfsParquetTableWriter to write the data
files, though it needed some modifications to conform to the Iceberg
spec, namely:
 * write the Iceberg/Parquet 'field_id' for the columns
 * TIMESTAMPs are encoded as INT64 micros (without time zone)

We use DmlExecState to transfer information from the table sink
operators to the coordinator, then updateCatalog() invokes the
AppendFiles API to add the files atomically. DmlExecState is encoded
in protobuf, while communication with the Frontend uses Thrift.
Therefore, to avoid defining the Iceberg DataFile structure multiple
times, the data file descriptors are stored in FlatBuffers.

The commit also does some corrections on Impala type <-> Iceberg type
mapping:
 * Impala TIMESTAMP is Iceberg TIMESTAMP (without time zone)
 * Impala CHAR is Iceberg FIXED

Testing:
 * Added INSERT tests to iceberg-insert.test
 * Added negative tests to iceberg-negative.test
 * I also did some manual testing with Spark. Spark is able to read
   Iceberg tables written by Impala unless we use TIMESTAMPs. In that
   case Spark rejects the data files because it only accepts
   TIMESTAMPs with time zone.
 * Added concurrent INSERT tests to test_insert_stress.py

Change-Id: I5690fb6c2cc51f0033fa26caf8597c80a11bcd8e
Reviewed-on: http://gerrit.cloudera.org:8080/16545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-26 20:01:09 +00:00