Impala writes only 'null_value_counts' to the Iceberg metadata, not
'value_counts'. Because of this, push-down of NOT_NULL predicates does
not work on data written by Impala. This patch makes Impala write
'value_counts' to the Iceberg metadata as well.
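Iceberg can only prove that a data file contains nothing but NULLs in a
column (and skip the file) when both counts are present. A hypothetical
example that benefits from this (table and column names are made up):
SELECT count(*) FROM ice_t WHERE s IS NOT NULL;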
Testing:
- existing tests
- tested manually on a real cluster
Change-Id: I6b7afab8be197118e573fda1a381fa08e4c8c9c0
Reviewed-on: http://gerrit.cloudera.org:8080/18513
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
SHOW FILES on Iceberg tables lists all files in the table directory,
even deleted files and metadata files. It should only show the current
data files.
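E.g. after this patch the following statement (table name is made up)
lists only the current data files:
SHOW FILES IN ice_t;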
Testing:
- existing tests
Change-Id: If07c2fd6e05e494f7240ccc147b8776a8f217179
Reviewed-on: http://gerrit.cloudera.org:8080/18455
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Impala created the metadata objects for the Iceberg data files,
it tried to convert the partition values to strings. But partition
values can be NULL as well. The code didn't expect this, so we got a
NullPointerException.
With this patch we pass the table's null partition key value in case
of NULLs.
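A sketch of a statement that hit the NullPointerException before this
patch, assuming 'ice_t' is an Iceberg table partitioned by its second
column:
INSERT INTO ice_t VALUES (1, NULL);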
Testing:
* added e2e tests
Change-Id: I88c4f7a2c2db4f6390c8ee5c08baddc96b04602e
Reviewed-on: http://gerrit.cloudera.org:8080/18307
Reviewed-by: Tamas Mate <tmater@apache.org>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds Parquet stats to the Iceberg manifest as per-datafile
metrics.
The following metrics are supported:
- column_sizes :
Map from column id to the total size on disk of all regions that
store the column. Does not include bytes necessary to read other
columns, like footers.
- null_value_counts :
Map from column id to number of null values in the column.
- lower_bounds :
Map from column id to lower bound in the column serialized as
binary. Each value must be less than or equal to all non-null,
non-NaN values in the column for the file.
- upper_bounds :
Map from column id to upper bound in the column serialized as
binary. Each value must be greater than or equal to all non-null,
non-NaN values in the column for the file.
The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).
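As an example of how the bounds are used (table and column names are
made up), a data file can be skipped when its lower bound for 'i' is
not below 10 for a predicate like:
SELECT * FROM ice_t WHERE i < 10;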
Testing:
- New e2e test was added to verify that the metrics are written to the
Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.
Relevant Iceberg documentation:
- Manifest:
https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
serialized to binary:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Currently Impala has a DDL syntax for defining Iceberg partitions that
differs from Spark SQL's:
https://iceberg.apache.org/spark-ddl/#partitioned-by
E.g. Impala uses the following syntax:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
STORED AS ICEBERG;
The same in Spark is:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
USING ICEBERG
PARTITIONED BY (bucket(5, i), months(ts), years(d))
HIVE-25179 added the following syntax for Hive:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED BY ICEBERG;
I.e. the same transform syntax as Spark, but with the added keyword
"SPEC".
This patch makes Impala use Hive's syntax, i.e. Impala will also
use the PARTITIONED BY SPEC clause together with the unified partition
transform syntax.
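I.e. the Impala example above becomes:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED AS ICEBERG;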
Testing:
* existing tests have been rewritten with the new syntax
Change-Id: Ib72ae445fd68fb0ab75d87b34779dbab922bbc62
Reviewed-on: http://gerrit.cloudera.org:8080/17575
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Because of an Iceberg bug, Impala didn't push predicates to
Iceberg for dates/timestamps when the predicate referred to a
value before the UNIX epoch.
https://github.com/apache/iceberg/pull/1981 fixed the Iceberg
bug, and Impala has since switched to an Iceberg version that
contains the fix, therefore this patch enables predicate pushdown
for all timestamp/date values.
The above Iceberg patch maintains backward compatibility with the
old, wrong behavior, therefore we sometimes need to read one more
Iceberg partition than necessary.
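E.g. a predicate on a pre-epoch value, like the one in this made-up
query, is now pushed down to Iceberg:
SELECT * FROM ice_t WHERE ts < '1969-07-20 00:00:00';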
Testing:
* Updated current e2e tests
Change-Id: Ie67f41a53f21c7bdb8449ca0d27746158be7675a
Reviewed-on: http://gerrit.cloudera.org:8080/17417
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements INSERT INTO for Iceberg tables that use partition
transforms. Partition transforms are functions that calculate partition
data from row data. Iceberg defines the following partition transforms:
https://iceberg.apache.org/spec/#partition-transforms
* IDENTITY
* BUCKET
* TRUNCATE
* YEAR
* MONTH
* DAY
* HOUR
INSERT INTO identity-partitioned Iceberg tables is already supported.
This patch adds support for the rest of the transforms.
We create the partitioning expressions in InsertStmt. Based on these
expressions the data is automatically shuffled and sorted by the backend
executors before the rows are handed to the table sink operators. The
table sink operator writes the partitions one-by-one and creates a
human-readable partition path for them.
In the end, we convert the partition path back to partition data and
create Iceberg DataFiles with information about the written files.
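A minimal sketch using the DDL syntax Impala had at the time (table
name and values are made up):
CREATE TABLE ice_t (i int, ts timestamp)
PARTITION BY SPEC (ts MONTH)
STORED AS ICEBERG;
INSERT INTO ice_t VALUES (1, '2021-01-15 10:00:00');
The row is written under the partition derived from the month of 'ts'.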
Testing:
* added planner test
* added e2e tests
Change-Id: I3edf02048cea78703837b248c55219c22d512b78
Reviewed-on: http://gerrit.cloudera.org:8080/16939
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>