SHOW FILES on Iceberg tables lists all files in the table directory,
including deleted files and metadata files. It should only show the
current data files.
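Conceptually, the fix means enumerating only the data files of the
current snapshot through the Iceberg API instead of listing the table
directory; a minimal, illustrative sketch (not the exact Impala code):

  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.io.CloseableIterable;

  // Illustrative sketch: list only the data files referenced by the
  // current snapshot instead of everything under the table directory.
  static void printCurrentDataFiles(Table table) throws Exception {
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        System.out.println(file.path() + " " + file.fileSizeInBytes());
      }
    }
  }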
Testing:
- existing tests
Change-Id: If07c2fd6e05e494f7240ccc147b8776a8f217179
Reviewed-on: http://gerrit.cloudera.org:8080/18455
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg stores the table metadata next to the data files; when it is
accessed through the Iceberg API, filesystem calls are executed (HDFS,
S3, ADLS). These calls were made in various places during query
processing, and this patch unifies the Iceberg metadata requests in the
CatalogD and ImpalaD:
- CatalogD loads and caches the org.apache.iceberg.Table object.
- When ImpalaDs request the Table metadata, the current catalog
  snapshot id is sent over and the ImpalaD loads and caches the
  org.apache.iceberg.Table object through the Iceberg API as well.
This approach (loading the Iceberg table twice) was chosen because the
org.apache.iceberg.Table object could not be meaningfully serialized
and deserialized: serializing a Table yields a lightweight
SerializableTable object, which lives in the Iceberg core package.
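As a rough, hypothetical sketch (not the actual Impala code), each
daemon ends up with its own Table handle loaded through the catalog and
can compare its current snapshot id with the one sent by the CatalogD:

  import org.apache.iceberg.Table;
  import org.apache.iceberg.catalog.Catalog;
  import org.apache.iceberg.catalog.TableIdentifier;

  // Hypothetical sketch: the ImpalaD reloads the table through the Iceberg
  // API and compares its snapshot id with the one sent by the CatalogD.
  static Table loadAndCheck(Catalog catalog, String db, String tbl,
      long catalogSnapshotId) {
    Table table = catalog.loadTable(TableIdentifier.of(db, tbl));
    long localSnapshotId = table.currentSnapshot() == null
        ? -1 : table.currentSnapshot().snapshotId();
    if (localSnapshotId != catalogSnapshotId) {
      // The locally loaded metadata differs from the catalog's view; the
      // Iceberg metadata would need to be refreshed in this case.
    }
    return table;
  }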
As a result, REFRESH/INVALIDATE METADATA is required to pick up any
Iceberg metadata changes, and the metadata load time is improved.
The improvement is more significant for smaller queries, where the
metadata request has a larger impact on the query execution time.
Additionally, the dependency on the Iceberg core package has been
reduced: usages of the TableMetadata/BaseTable classes have been
replaced with the Table class from the Iceberg API package in most
places.
Testing:
- Passed Iceberg E2E tests.
Change-Id: I5492e0cdb31602f0276029c2645d14ff5cb2f672
Reviewed-on: http://gerrit.cloudera.org:8080/18353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This reverts commit cd10acdbb1.
The commit is reverted because it blocks upgrading the Iceberg version
to 0.13.
In the newer Iceberg version the BaseTable serialization has changed:
a BaseTable now serializes to its sibling class SerializableTable.
This is a lightweight Table class that does not carry the metadata
needed for caching and reuse by the ImpalaDs.
How to make use of SerializableTable needs further consideration.
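The change in behavior can be illustrated with a plain Java
serialization round-trip (hedged sketch; it only assumes the Iceberg
0.13 serialization behavior described above, and is not Impala code):

  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutputStream;
  import org.apache.iceberg.Table;

  // Hedged sketch: in Iceberg 0.13 a serialization round-trip of a
  // catalog-loaded table no longer yields a BaseTable but a lightweight
  // SerializableTable, so the cached object cannot be shipped to ImpalaDs.
  static Table roundTrip(Table table) throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(table);
    }
    try (ObjectInputStream ois =
        new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
      return (Table) ois.readObject();  // e.g. org.apache.iceberg.SerializableTable
    }
  }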
Change-Id: I21e65cb3ab38d9e683223fb100d7ced90caa6edd
Reviewed-on: http://gerrit.cloudera.org:8080/18305
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg stores the table metadata next to the data files; when it is
accessed through the Iceberg API, filesystem calls are executed (HDFS,
S3, ADLS). These calls were made in various places during query
processing, and this patch unifies the Iceberg metadata requests in the
CatalogD, similar to other metadata requests:
- CatalogD loads and caches the org.apache.iceberg.BaseTable object.
- ImpalaDs request the org.apache.iceberg.BaseTable from the
  CatalogD and cache it as well.
As a result, REFRESH/INVALIDATE METADATA is required to pick up any
Iceberg metadata changes, and the metadata load time is improved.
The improvement is more significant for smaller queries, where the
metadata request has a larger impact on the query execution time.
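The caching can be pictured as a simple map from the fully qualified
table name to the loaded metadata object, invalidated only by
REFRESH/INVALIDATE METADATA (hypothetical sketch, not the actual
CatalogD code):

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.function.Function;
  import org.apache.iceberg.BaseTable;

  // Hypothetical sketch of the caching behavior: the loaded BaseTable is
  // kept until the table is explicitly refreshed or invalidated.
  class IcebergMetadataCache {
    private final ConcurrentHashMap<String, BaseTable> cache =
        new ConcurrentHashMap<>();

    BaseTable getOrLoad(String fqTableName, Function<String, BaseTable> loader) {
      return cache.computeIfAbsent(fqTableName, loader);
    }

    // Called on REFRESH / INVALIDATE METADATA so that the next access
    // reloads the Iceberg metadata from the filesystem.
    void invalidate(String fqTableName) {
      cache.remove(fqTableName);
    }
  }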
Testing:
- Passed Iceberg E2E tests.
Change-Id: I9e62a1fb9753ea1b022c7763047d9ccfd1d27d62
Reviewed-on: http://gerrit.cloudera.org:8080/18226
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
This patch adds support for the following standard Iceberg properties:
write.parquet.compression-codec:
Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
(default value), LZ4, ZSTD. The table property will be ignored if
COMPRESSION_CODEC query option is set.
write.parquet.compression-level:
Parquet compression level. Used with ZSTD compression only.
Supported range is [1, 22]. Default value is 3. The table property
will be ignored if COMPRESSION_CODEC query option is set.
write.parquet.row-group-size-bytes:
Parquet row group size in bytes. Supported range is [8388608,
2146435072] (8MB - 2047MB). The table property will be ignored if
PARQUET_FILE_SIZE query option is set.
If neither the table property nor the PARQUET_FILE_SIZE query option
is set, the way Impala calculates row group size will remain
unchanged.
write.parquet.page-size-bytes:
Parquet page size in bytes. Used for PLAIN encoding. Supported range
is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates page size
will remain unchanged.
write.parquet.dict-size-bytes:
Parquet dictionary page size in bytes. Used for dictionary encoding.
Supported range is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates dictionary
page size will remain unchanged.
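For illustration, resolving one of these properties from the Iceberg
table could look roughly like the hypothetical helper below (the actual
precedence and validation live in Impala's planner and writer code):

  import java.util.Map;
  import org.apache.iceberg.Table;

  // Hypothetical sketch: the PARQUET_FILE_SIZE query option takes
  // precedence, then the table property, then Impala's existing behavior.
  static long resolveRowGroupSize(Table table, Long queryOptionFileSize,
      long impalaDefault) {
    if (queryOptionFileSize != null) return queryOptionFileSize;
    Map<String, String> props = table.properties();
    String value = props.get("write.parquet.row-group-size-bytes");
    if (value == null) return impalaDefault;
    long size = Long.parseLong(value);
    // Supported range: [8388608, 2146435072] (8MB - 2047MB).
    if (size < 8388608L || size > 2146435072L) {
      throw new IllegalArgumentException("Row group size out of range: " + size);
    }
    return size;
  }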
This patch also renames the 'iceberg.file_format' table property to
'write.format.default', which is the standard Iceberg name for this
property.
Change-Id: I3b8aa9a52c13c41b48310d2f7c9c7426e1ff5f23
Reviewed-on: http://gerrit.cloudera.org:8080/17654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Iceberg format spec defines what types to use for different file
formats, e.g.: https://iceberg.apache.org/spec/#parquet
Impala should follow the specification, so this patch
* annotates strings with UTF8 in Parquet metadata
* removes fixed(L) <-> CHAR(L) mapping
* forbids INSERTs when the Iceberg schema has a TIMESTAMPTZ column
This patch also refactors the type/schema conversions as
Impala => Iceberg conversions were duplicated in
IcebergCatalogOpExecutor and IcebergUtil. I introduced the class
'IcebergSchemaConverter' to contain the code for conversions.
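A simplified, hypothetical sketch of the primitive-type part of this
conversion (only a few types shown; the real logic lives in
IcebergSchemaConverter):

  import org.apache.iceberg.types.Type;
  import org.apache.iceberg.types.Types;

  // Simplified sketch: STRING maps to the Iceberg string type (written as
  // UTF8-annotated data in Parquet), TIMESTAMP maps to timestamp without
  // time zone, and CHAR/VARCHAR are rejected.
  static Type toIcebergType(String impalaType) {
    switch (impalaType) {
      case "INT": return Types.IntegerType.get();
      case "BIGINT": return Types.LongType.get();
      case "STRING": return Types.StringType.get();
      case "TIMESTAMP": return Types.TimestampType.withoutZone();
      case "CHAR":
      case "VARCHAR":
        throw new UnsupportedOperationException(impalaType + " is not supported");
      default:
        throw new UnsupportedOperationException("Unknown type: " + impalaType);
    }
  }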
Testing:
* added test to check CHAR and VARCHAR types are not allowed
* test that INSERTs are not allowed when the table has TIMESTAMPTZ
* added test to check that strings are annotated with UTF8
Change-Id: I652565f82708824f5cf7497139153b06f116ccd3
Reviewed-on: http://gerrit.cloudera.org:8080/16851
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HiveCatalog is one of Iceberg's catalog implementations. It uses
the Hive metastore and it is the recommended catalog implementation
when the table data is stored in object stores like S3.
This commit updates the Iceberg version to a newer one, and it also
retrieves Iceberg from the CDP distribution because that version of
Iceberg is built against Hive 3 (Impala is only compatible with
Hive 3).
This commit makes HiveCatalog the default Iceberg catalog in Impala
because it can be used in more environments (e.g. cloud stores),
and it is more featureful. Also, other engines that store their
table metadata in HMS will probably use HiveCatalog as well.
Tables stored in HiveCatalog are similar to Kudu tables with HMS
integration, i.e. modifying an Iceberg table via the Iceberg APIs
also modifies the HMS table. So in CatalogOpExecutor we handle
such Iceberg tables similarly to integrated Kudu tables.
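In terms of the Iceberg API, using HiveCatalog means the table's
current metadata pointer is resolved through HMS rather than a fixed
filesystem location; loading a table looks roughly like this (hedged
sketch, the configuration values are placeholders):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.catalog.TableIdentifier;
  import org.apache.iceberg.hive.HiveCatalog;

  // Hedged sketch: load an Iceberg table through HiveCatalog, i.e. the
  // current metadata location comes from the Hive metastore.
  static Table loadViaHiveCatalog(Configuration conf, String db, String tbl) {
    HiveCatalog catalog = new HiveCatalog();
    catalog.setConf(conf);
    Map<String, String> props = new HashMap<>();
    props.put("uri", "thrift://metastore-host:9083");  // placeholder HMS URI
    catalog.initialize("hive", props);
    return catalog.loadTable(TableIdentifier.of(db, tbl));
  }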
Testing:
* Added e2e tests for creating, writing, and altering Iceberg
tables
* Added SHOW CREATE TABLE tests
Change-Id: Ie574589a1751aaa9ccbd34a89c6819714d103197
Reviewed-on: http://gerrit.cloudera.org:8080/16721
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds support for INSERT INTO statements against Iceberg
tables when the table is non-partitioned and the underlying file format
is Parquet.
We still use Impala's HdfsParquetTableWriter to write the data files,
though they needed some modifications to conform to the Iceberg spec,
namely:
* write Iceberg/Parquet 'field_id' for the columns
* TIMESTAMPs are encoded as INT64 micros (without time zone)
We use DmlExecState to transfer information from the table sink
operators to the coordinator, then updateCatalog() invokes the
AppendFiles API to add the files atomically. DmlExecState is encoded in
protobuf, while communication with the Frontend uses Thrift; therefore,
to avoid defining the Iceberg DataFile structure multiple times, it is
stored in FlatBuffers.
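On the commit path this boils down to an atomic append through the
Iceberg API, roughly as in the illustrative sketch below (the real code
builds the DataFile descriptors from the FlatBuffers-encoded sink
output; the sizes and counts here are placeholders):

  import org.apache.iceberg.AppendFiles;
  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.DataFiles;
  import org.apache.iceberg.FileFormat;
  import org.apache.iceberg.Table;

  // Illustrative sketch: register the Parquet files written by the table
  // sinks in a single atomic Iceberg append.
  static void appendWrittenFiles(Table table, Iterable<String> paths) {
    AppendFiles append = table.newAppend();
    for (String path : paths) {
      DataFile dataFile = DataFiles.builder(table.spec())
          .withPath(path)
          .withFormat(FileFormat.PARQUET)
          .withFileSizeInBytes(1024L)  // placeholder; real size from the sink
          .withRecordCount(100L)       // placeholder; real count from the sink
          .build();
      append.appendFile(dataFile);
    }
    append.commit();  // all files become visible atomically in a new snapshot
  }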
The commit also does some corrections on Impala type <-> Iceberg type
mapping:
* Impala TIMESTAMP is Iceberg TIMESTAMP (without time zone)
* Impala CHAR is Iceberg FIXED
Testing:
* Added INSERT tests to iceberg-insert.test
* Added negative tests to iceberg-negative.test
* I also did some manual testing with Spark. Spark is able to read
  Iceberg tables written by Impala unless we use TIMESTAMPs; in that
  case Spark rejects the data files because it only accepts TIMESTAMPs
  with time zone.
* Added concurrent INSERT tests to test_insert_stress.py
Change-Id: I5690fb6c2cc51f0033fa26caf8597c80a11bcd8e
Reviewed-on: http://gerrit.cloudera.org:8080/16545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>