Commit Graph

7 Commits

Author SHA1 Message Date
LPL
e711461a5c IMPALA-11286: Writes value_counts to Iceberg metadata
Impala writes only 'null_value_counts' to Iceberg metadata, not
'value_counts'. As a result, NOT_NULL predicate push-down does not work
on data written by Impala. This patch makes Impala write 'value_counts'
to Iceberg metadata as well.
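
To illustrate why both counts are needed, here is a minimal sketch of a
NOT_NULL check over per-file metrics. The function name and the dict layout
are hypothetical, not Impala's or Iceberg's actual API. The idea is that a
file can be skipped only when the metrics prove every value is NULL, which
requires knowing both the total and the null count.

```python
# Hypothetical per-file metrics, keyed by Iceberg column id.
def may_contain_non_null(metrics, column_id):
    """Return False only when the metrics prove every value is NULL."""
    value_counts = metrics.get("value_counts")
    null_counts = metrics.get("null_value_counts")
    # Without both maps nothing can be proven, so the file must be read.
    if value_counts is None or null_counts is None:
        return True
    if column_id not in value_counts or column_id not in null_counts:
        return True
    return value_counts[column_id] - null_counts[column_id] > 0


# Metrics as written before this patch: only null counts are present.
file_without_value_counts = {"null_value_counts": {1: 100}}
# Metrics with both maps present.
file_all_nulls = {"value_counts": {1: 100}, "null_value_counts": {1: 100}}
file_some_values = {"value_counts": {1: 100}, "null_value_counts": {1: 40}}
```

With only 'null_value_counts' available, the check has to keep the file
even if all of its values are NULL.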

Testing:
 - existing tests
 - tested manually on a real cluster

Change-Id: I6b7afab8be197118e573fda1a381fa08e4c8c9c0
Reviewed-on: http://gerrit.cloudera.org:8080/18513
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-05-12 16:23:17 +00:00
LPL
78609dca32 IMPALA-11256: Fix SHOW FILES on Iceberg tables lists all files
SHOW FILES on Iceberg tables lists all files in the table directory,
including deleted files and metadata files. It should only show the
current data files.

Testing:
  - existing tests

Change-Id: If07c2fd6e05e494f7240ccc147b8776a8f217179
Reviewed-on: http://gerrit.cloudera.org:8080/18455
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-04-28 17:48:23 +00:00
Zoltan Borok-Nagy
2fffac3bad IMPALA-11175: Iceberg table cannot be loaded when partition value is NULL
When Impala created the metadata objects about the Iceberg data files it
tried to convert the partition values to strings. But the partition
values can be NULLs as well. The code didn't expect this, so we got a
NullPointerException.

With this patch we pass the table's null partition key value in case
of NULLs.
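
A minimal sketch of the idea, not the actual Impala code. The helper name is
hypothetical, and the default null partition key value shown here is an
assumption; the real value comes from the table.

```python
# Assumed default; the real value is taken from the table.
NULL_PARTITION_KEY_VALUE = "__HIVE_DEFAULT_PARTITION__"


def partition_values_to_strings(values, null_key=NULL_PARTITION_KEY_VALUE):
    """Convert partition values to strings, substituting null_key for NULLs."""
    # Calling a method on None is what would fail; handle it explicitly.
    return [null_key if v is None else str(v) for v in values]
```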

Testing:
 * added e2e tests

Change-Id: I88c4f7a2c2db4f6390c8ee5c08baddc96b04602e
Reviewed-on: http://gerrit.cloudera.org:8080/18307
Reviewed-by: Tamas Mate <tmater@apache.org>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-03-11 13:23:30 +00:00
Attila Jeges
c8aa5796d9 IMPALA-10879: Add parquet stats to iceberg manifest
This patch adds Parquet stats to the Iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-NaN values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).
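
As a rough illustration of how a reader can use lower_bounds and
upper_bounds, the sketch below prunes data files for an equality predicate.
The function names and the dict layout are hypothetical, and the bounds are
shown already decoded from their binary form.

```python
def file_may_match_eq(lower_bounds, upper_bounds, column_id, literal):
    """Return False only when the bounds prove `col = literal` cannot match."""
    lower = lower_bounds.get(column_id)
    upper = upper_bounds.get(column_id)
    if lower is not None and literal < lower:
        return False
    if upper is not None and literal > upper:
        return False
    # Missing bounds prove nothing, so the file must be scanned.
    return True


def files_to_scan(data_files, column_id, literal):
    """Keep only the files whose bounds do not rule out the literal."""
    return [
        f["path"]
        for f in data_files
        if file_may_match_eq(f["lower_bounds"], f["upper_bounds"], column_id, literal)
    ]


data_files = [
    {"path": "a.parq", "lower_bounds": {1: 0}, "upper_bounds": {1: 9}},
    {"path": "b.parq", "lower_bounds": {1: 10}, "upper_bounds": {1: 19}},
    {"path": "c.parq", "lower_bounds": {}, "upper_bounds": {}},
]
```

For a predicate on column id 1 with literal 12, only a.parq can be skipped;
c.parq has no bounds and is kept.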

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
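
A minimal sketch of single-value serialization for a few primitive types,
following the little-endian layout described in the Iceberg spec appendix
linked above. The function name and the type strings are hypothetical; only
a subset of types is covered, and this is not the BE code.

```python
import struct


def serialize_single_value(iceberg_type, value):
    """Serialize one value to the binary form used in the bounds maps."""
    if iceberg_type == "boolean":
        return b"\x01" if value else b"\x00"
    if iceberg_type in ("int", "date"):
        # 'date' is stored as days since 1970-01-01 in a 4-byte int.
        return struct.pack("<i", value)
    if iceberg_type == "long":
        return struct.pack("<q", value)
    if iceberg_type == "float":
        return struct.pack("<f", value)
    if iceberg_type == "double":
        return struct.pack("<d", value)
    if iceberg_type == "string":
        # UTF-8 bytes without a length prefix.
        return value.encode("utf-8")
    raise ValueError(f"type not covered by this sketch: {iceberg_type}")
```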

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
2021-09-02 21:34:41 +00:00
Zoltan Borok-Nagy
d0749d59de IMPALA-10732: Use consistent DDL for specifying Iceberg partitions
Currently we have a DDL syntax for defining Iceberg partitions that
differs from SparkSQL:
https://iceberg.apache.org/spark-ddl/#partitioned-by

E.g. Impala is using the following syntax:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
STORED AS ICEBERG;

The same in Spark is:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
USING ICEBERG
PARTITIONED BY (bucket(5, i), months(ts), years(d))

HIVE-25179 added the following syntax for Hive:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED BY ICEBERG;

I.e. the same syntax as Spark, but adding the keyword "SPEC".

This patch makes Impala use Hive's syntax, i.e. we will also
use the PARTITIONED BY SPEC clause + the unified partition
transform syntax.
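
The mapping between the old and the new syntax can be sketched as follows.
This is only an illustration with hypothetical helper names. The BUCKET,
MONTH and YEAR mappings come from the examples above; the DAY and HOUR
mappings are an assumption that follows the same naming pattern.

```python
# MONTH and YEAR are taken from the examples; DAY and HOUR are assumed.
_TIME_TRANSFORMS = {"YEAR": "years", "MONTH": "months", "DAY": "days", "HOUR": "hours"}


def to_unified_transform(column, transform, param=None):
    """Rewrite an old-style field such as `i BUCKET 5` as `bucket(5, i)`."""
    name = transform.upper()
    if name == "BUCKET":
        if param is None:
            raise ValueError("BUCKET needs a bucket count")
        return f"bucket({param}, {column})"
    if name in _TIME_TRANSFORMS:
        return f"{_TIME_TRANSFORMS[name]}({column})"
    raise ValueError(f"transform not covered by this sketch: {transform}")


def partitioned_by_spec(fields):
    """Build a PARTITIONED BY SPEC clause from old-style field tuples."""
    return "PARTITIONED BY SPEC (" + ", ".join(to_unified_transform(*f) for f in fields) + ")"
```

For example, partitioned_by_spec([("i", "BUCKET", 5), ("ts", "MONTH"),
("d", "YEAR")]) produces the clause used in the Hive example above.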

Testing:
 * existing tests have been rewritten with the new syntax

Change-Id: Ib72ae445fd68fb0ab75d87b34779dbab922bbc62
Reviewed-on: http://gerrit.cloudera.org:8080/17575
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-15 15:15:07 +00:00
Zoltan Borok-Nagy
824b39e829 IMPALA-10433: Use Iceberg's fixed partition transforms
Because of an Iceberg bug Impala didn't push predicates to
Iceberg for dates/timestamps when the predicate referred to a
value before the UNIX epoch.

https://github.com/apache/iceberg/pull/1981 fixed the Iceberg
bug, and Impala recently switched to an Iceberg version that has
the fix. Therefore this patch enables predicate pushdown for all
timestamp/date values.

The above Iceberg patch maintains backward compatibility with the
old, wrong behavior. Therefore we sometimes need to read one more
Iceberg partition than necessary.
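
The extra partition can be illustrated with an HOUR transform. The sketch
below assumes that the old, wrong behavior amounted to rounding toward zero
instead of flooring for values before the epoch; this is an assumption made
for illustration, and all names are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def hour_truncating(seconds_since_epoch):
    """Assumed old behavior: integer division that rounds toward zero."""
    q = abs(seconds_since_epoch) // SECONDS_PER_HOUR
    return q if seconds_since_epoch >= 0 else -q


def hour_floor(seconds_since_epoch):
    """Corrected behavior: floor division, also for negative values."""
    return seconds_since_epoch // SECONDS_PER_HOUR


def partitions_to_read(seconds_since_epoch):
    """Partitions that may hold the value, written by old or new code."""
    correct = hour_floor(seconds_since_epoch)
    if seconds_since_epoch >= 0:
        return {correct}
    # Before the epoch the old result is either `correct` or `correct + 1`,
    # so both partitions have to be read.
    return {correct, correct + 1}
```

One second before the epoch belongs to hour -1, while the truncating version
yields 0, so both partitions are candidates under this assumption.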

Testing:
 * Updated current e2e tests

Change-Id: Ie67f41a53f21c7bdb8449ca0d27746158be7675a
Reviewed-on: http://gerrit.cloudera.org:8080/17417
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-19 00:59:35 +00:00
Zoltan Borok-Nagy
90f3b2f491 IMPALA-10432: INSERT INTO Iceberg tables with partition transforms
This patch adds support for INSERT INTO Iceberg tables that use
partition transforms. Partition transforms are functions that calculate
partition data from row data.

There are the following partition transforms in Iceberg:
https://iceberg.apache.org/spec/#partition-transforms

 * IDENTITY
 * BUCKET
 * TRUNCATE
 * YEAR
 * MONTH
 * DAY
 * HOUR

INSERT INTO identity-partitioned Iceberg tables is already supported.
This patch adds support for the rest of the transforms.

We create the partitioning expressions in InsertStmt. Based on these
expressions data are automatically shuffled and sorted by the backend
executors before rows are given to the table sink operators. The table
sink operator writes the partitions one-by-one and creates a
human-readable partition path for them.

In the end, we will convert the partition path to partition data and
create Iceberg DataFiles with information about the files written.
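
For illustration, some of the transforms can be sketched as plain functions
from row data to partition data, following the definitions in the Iceberg
spec linked above. The function names are hypothetical and this is not the
code in InsertStmt. BUCKET is omitted because it depends on a specific hash
function.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)


def year_transform(d):
    """Years since 1970."""
    return d.year - 1970


def month_transform(d):
    """Months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def day_transform(d):
    """Days since 1970-01-01."""
    return (d - EPOCH).days


def truncate_int(value, width):
    """Round an integer down to a multiple of width."""
    # Python's % already returns a non-negative remainder for positive width.
    return value - (value % width)


def truncate_str(value, length):
    """Keep at most the first `length` characters."""
    return value[:length]
```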

Testing:
 * added planner test
 * added e2e tests

Change-Id: I3edf02048cea78703837b248c55219c22d512b78
Reviewed-on: http://gerrit.cloudera.org:8080/16939
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-18 18:46:42 +00:00