Impala writes only 'null_value_counts' to the Iceberg metadata, not
'value_counts'. Because of this, push-down of NOT_NULL predicates does
not work on data written by Impala. This patch makes Impala write
'value_counts' to the Iceberg metadata as well.
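Iceberg can only prove that a data file contains nothing but NULLs in a
column (and skip the file) when both counts are present. A hypothetical
example that benefits from this (table and column names are made up):
SELECT count(*) FROM ice_t WHERE s IS NOT NULL;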
Testing:
- existing tests
- tested manually on a real cluster
Change-Id: I6b7afab8be197118e573fda1a381fa08e4c8c9c0
Reviewed-on: http://gerrit.cloudera.org:8080/18513
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
SHOW FILES on Iceberg tables lists all files in the table directory,
even deleted files and metadata files. It should only show the current
data files.
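E.g. after this patch the following statement (table name is made up)
lists only the current data files:
SHOW FILES IN ice_t;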
Testing:
- existing tests
Change-Id: If07c2fd6e05e494f7240ccc147b8776a8f217179
Reviewed-on: http://gerrit.cloudera.org:8080/18455
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Impala created the metadata objects for the Iceberg data files,
it tried to convert the partition values to strings. But partition
values can be NULL as well. The code didn't expect this, so we got a
NullPointerException.
With this patch we pass the table's null partition key value in case
of NULLs.
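A sketch of a statement that hit the NullPointerException before this
patch, assuming 'ice_t' is an Iceberg table partitioned by its second
column:
INSERT INTO ice_t VALUES (1, NULL);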
Testing:
* added e2e tests
Change-Id: I88c4f7a2c2db4f6390c8ee5c08baddc96b04602e
Reviewed-on: http://gerrit.cloudera.org:8080/18307
Reviewed-by: Tamas Mate <tmater@apache.org>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds Parquet stats to the Iceberg manifest as per-datafile
metrics.
The following metrics are supported:
- column_sizes :
Map from column id to the total size on disk of all regions that
store the column. Does not include bytes necessary to read other
columns, like footers.
- null_value_counts :
Map from column id to number of null values in the column.
- lower_bounds :
Map from column id to lower bound in the column serialized as
binary. Each value must be less than or equal to all non-null,
non-NaN values in the column for the file.
- upper_bounds :
Map from column id to upper bound in the column serialized as
binary. Each value must be greater than or equal to all non-null,
non-NaN values in the column for the file.
The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).
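As an example of how the bounds are used (table and column names are
made up), a data file can be skipped when its lower bound for 'i' is
not below 10 for a predicate like:
SELECT * FROM ice_t WHERE i < 10;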
Testing:
- New e2e test was added to verify that the metrics are written to the
Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.
Relevant Iceberg documentation:
- Manifest:
https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
serialized to binary:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Currently Impala has a DDL syntax for defining Iceberg partitions that
differs from Spark SQL's:
https://iceberg.apache.org/spark-ddl/#partitioned-by
E.g. Impala uses the following syntax:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
STORED AS ICEBERG;
The same in Spark is:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
USING ICEBERG
PARTITIONED BY (bucket(5, i), months(ts), years(d))
HIVE-25179 added the following syntax for Hive:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED BY ICEBERG;
I.e. the same transform syntax as Spark, but with the added keyword
"SPEC".
This patch makes Impala use Hive's syntax, i.e. Impala will also
use the PARTITIONED BY SPEC clause together with the unified partition
transform syntax.
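I.e. the Impala example above becomes:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED AS ICEBERG;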
Testing:
* existing tests have been rewritten with the new syntax
Change-Id: Ib72ae445fd68fb0ab75d87b34779dbab922bbc62
Reviewed-on: http://gerrit.cloudera.org:8080/17575
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Because of an Iceberg bug, Impala didn't push predicates to
Iceberg for dates/timestamps when the predicate referred to a
value before the UNIX epoch.
https://github.com/apache/iceberg/pull/1981 fixed the Iceberg
bug, and Impala has since switched to an Iceberg version that
contains the fix, therefore this patch enables predicate pushdown
for all timestamp/date values.
The above Iceberg patch maintains backward compatibility with the
old, wrong behavior, therefore we sometimes need to read one more
Iceberg partition than necessary.
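E.g. a predicate on a pre-epoch value, like the one in this made-up
query, is now pushed down to Iceberg:
SELECT * FROM ice_t WHERE ts < '1969-07-20 00:00:00';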
Testing:
* Updated current e2e tests
Change-Id: Ie67f41a53f21c7bdb8449ca0d27746158be7675a
Reviewed-on: http://gerrit.cloudera.org:8080/17417
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements INSERT INTO for Iceberg tables that use partition
transforms. Partition transforms are functions that calculate partition
data from row data. Iceberg defines the following partition transforms:
https://iceberg.apache.org/spec/#partition-transforms
* IDENTITY
* BUCKET
* TRUNCATE
* YEAR
* MONTH
* DAY
* HOUR
INSERT INTO identity-partitioned Iceberg tables is already supported.
This patch adds support for the rest of the transforms.
We create the partitioning expressions in InsertStmt. Based on these
expressions the data is automatically shuffled and sorted by the backend
executors before the rows are handed to the table sink operators. The
table sink operator writes the partitions one-by-one and creates a
human-readable partition path for them.
In the end, we convert the partition path back to partition data and
create Iceberg DataFiles with information about the written files.
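A minimal sketch using the DDL syntax Impala had at the time (table
name and values are made up):
CREATE TABLE ice_t (i int, ts timestamp)
PARTITION BY SPEC (ts MONTH)
STORED AS ICEBERG;
INSERT INTO ice_t VALUES (1, '2021-01-15 10:00:00');
The row is written under the partition derived from the month of 'ts'.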
Testing:
* added planner test
* added e2e tests
Change-Id: I3edf02048cea78703837b248c55219c22d512b78
Reviewed-on: http://gerrit.cloudera.org:8080/16939
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>