Commit Graph

13 Commits

Author SHA1 Message Date
Attila Jeges
fabe994d1f IMPALA-10627: Use standard parquet-related Iceberg table properties
This patch adds support for the following standard Iceberg properties:

write.parquet.compression-codec:
  Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
  (default value), LZ4, ZSTD. The table property will be ignored if
  the COMPRESSION_CODEC query option is set.

write.parquet.compression-level:
  Parquet compression level. Used with ZSTD compression only.
  Supported range is [1, 22]. Default value is 3. The table property
  will be ignored if the COMPRESSION_CODEC query option is set.

write.parquet.row-group-size-bytes:
  Parquet row group size in bytes. Supported range is [8388608,
  2146435072] (8MB - 2047MB). The table property will be ignored if
  the PARQUET_FILE_SIZE query option is set.
  If neither the table property nor the PARQUET_FILE_SIZE query option
  is set, the way Impala calculates row group size will remain
  unchanged.

write.parquet.page-size-bytes:
  Parquet page size in bytes. Used for PLAIN encoding. Supported range
  is [65536, 1073741824] (64KB - 1GB).
  If the table property is unset, the way Impala calculates page size
  will remain unchanged.

write.parquet.dict-size-bytes:
  Parquet dictionary page size in bytes. Used for dictionary encoding.
  Supported range is [65536, 1073741824] (64KB - 1GB).
  If the table property is unset, the way Impala calculates dictionary
  page size will remain unchanged.

This patch also renames the 'iceberg.file_format' table property to
'write.format.default', which is the standard Iceberg name for this
table property.
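
As an illustrative sketch (table name and property values are made up),
a table using these standard properties could be created like this:

CREATE TABLE ice_parq_tbl (i int, s string)
STORED AS ICEBERG
TBLPROPERTIES ('write.format.default'='parquet',
  'write.parquet.compression-codec'='ZSTD',
  'write.parquet.compression-level'='5',
  'write.parquet.row-group-size-bytes'='134217728');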

Change-Id: I3b8aa9a52c13c41b48310d2f7c9c7426e1ff5f23
Reviewed-on: http://gerrit.cloudera.org:8080/17654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-20 23:58:06 +00:00
Zoltan Borok-Nagy
d0749d59de IMPALA-10732: Use consistent DDL for specifying Iceberg partitions
Currently we have a DDL syntax for defining Iceberg partitions that
differs from SparkSQL:
https://iceberg.apache.org/spark-ddl/#partitioned-by

E.g. Impala is using the following syntax:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
STORED AS ICEBERG;

The same in Spark is:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
USING ICEBERG
PARTITIONED BY (bucket(5, i), months(ts), years(d))

HIVE-25179 added the following syntax for Hive:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED BY ICEBERG;

I.e. the same syntax as Spark, but adding the keyword "SPEC".

This patch makes Impala use Hive's syntax, i.e. we will also use the
PARTITIONED BY SPEC clause together with the unified partition
transform syntax.
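
With this patch the earlier Impala example would be written roughly as
follows (a sketch based on the syntax described above):

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED AS ICEBERG;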

Testing:
 * existing tests have been rewritten with the new syntax

Change-Id: Ib72ae445fd68fb0ab75d87b34779dbab922bbc62
Reviewed-on: http://gerrit.cloudera.org:8080/17575
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-15 15:15:07 +00:00
Zoltan Borok-Nagy
08367e91f0 IMPALA-10452: CREATE Iceberg tables with old PARTITIONED BY syntax
For convenience, this patch adds support for the old-style
CREATE TABLE ... PARTITIONED BY ...; syntax for Iceberg tables.

So users should be able to write the following:

CREATE TABLE ice_t (i int)
PARTITIONED BY (p int)
STORED AS ICEBERG;

Which should be equivalent to this:

CREATE TABLE ice_t (i int, p int)
PARTITION BY SPEC (p IDENTITY)
STORED AS ICEBERG;

Please note that the old-style CREATE TABLE statement creates
IDENTITY-partitioned tables. For other partition transforms, users
must use the new, more generic syntax.

Hive also supports the old PARTITIONED BY syntax with the same
behavior.

Testing:
 * added e2e tests

Change-Id: I789876c161bc0987820955aa9ae01414e0dcb45d
Reviewed-on: http://gerrit.cloudera.org:8080/16979
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 22:12:25 +00:00
skyyws
1093a563e6 IMPALA-10368: Support required/optional property when creating Iceberg table
This patch adds support for creating required/optional fields for
Iceberg tables. If we set the 'NOT NULL' property for an Iceberg table
column in SQL, Impala will create a required field via the Iceberg API;
'NULL' or the default will create an optional field.
Besides, 'DESCRIBE XXX' for an Iceberg table will display the
'nullable' property like this:
+------+--------+---------+----------+
| name | type   | comment | nullable |
+------+--------+---------+----------+
| id   | int    |         | false    |
| name | string |         | true     |
| age  | int    |         | true     |
+------+--------+---------+----------+
And 'SHOW CREATE TABLE XXX' will also display the 'NULL'/'NOT NULL'
property for Iceberg tables.
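
For instance (an illustrative sketch; the table name is made up), the
table described above could be created with a statement like this:

CREATE TABLE ice_nullability_demo (
  id int NOT NULL,
  name string NULL,
  age int
)
STORED AS ICEBERG;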

Tests:
 * added new test in iceberg-create.test
 * added new test in iceberg-negative.test
 * added new test in show-create-table.test
 * modified 'DESCRIBE XXX' result in iceberg-create.test
 * modified 'DESCRIBE XXX' result in iceberg-alter.test
 * modified create table result in show-create-table.test

Change-Id: I70b8014ba99f43df1b05149ff7a15cf06b6cd8d3
Reviewed-on: http://gerrit.cloudera.org:8080/16904
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-11 17:08:21 +00:00
Zoltan Borok-Nagy
579f5c67e0 IMPALA-10364: Set the real location for external Iceberg tables stored in HadoopCatalog
Impala tries to come up with the table location of external Iceberg
tables stored in HadoopCatalog. The current method is not correct for
tables that are nested under multiple namespaces.

With this patch Impala loads the Iceberg table and retrieves the
location from it.
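
For example (a sketch with made-up names, using the table properties
introduced in IMPALA-10164), an external table nested under multiple
namespaces could look like this, and Impala now takes its location
from the loaded Iceberg table:

CREATE EXTERNAL TABLE nested_ns_tbl
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
  'iceberg.catalog_location'='hdfs://test-warehouse/hadoop_catalog_test',
  'iceberg.table_identifier'='org.my_db.my_tbl');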

Testing:
 * added e2e test in iceberg-create.test

Change-Id: I04b75d219e095ce00b4c48f40b8dee872ba57b78
Reviewed-on: http://gerrit.cloudera.org:8080/16795
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-02 22:42:12 +00:00
Zoltan Borok-Nagy
4448b8755b IMPALA-10152: Add support for Iceberg HiveCatalog
HiveCatalog is one of Iceberg's catalog implementations. It uses
the Hive metastore and it is the recommended catalog implementation
when the table data is stored in object stores like S3.

This commit updates the Iceberg version to a newer one, and it also
retrieves Iceberg from the CDP distribution because that version of
Iceberg is built against Hive 3 (Impala is only compatible with
Hive 3).

This commit makes HiveCatalog the default Iceberg catalog in Impala
because it can be used in more environments (e.g. cloud stores),
and it is more featureful. Also, other engines that store their
table metadata in HMS will probably use HiveCatalog as well.

Tables stored in HiveCatalog are similar to Kudu tables with HMS
integration, i.e. modifying an Iceberg table via the Iceberg APIs
also modifies the HMS table. So in CatalogOpExecutor we handle
such Iceberg tables similarly to integrated Kudu tables.
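
As a minimal sketch (table name is made up), with HiveCatalog as the
default an Iceberg table can now be created without specifying any
catalog-related properties:

CREATE TABLE hive_catalog_tbl (i int, s string)
STORED AS ICEBERG;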

Testing:
 * Added e2e tests for creating, writing, and altering Iceberg
   tables
 * Added SHOW CREATE TABLE tests

Change-Id: Ie574589a1751aaa9ccbd34a89c6819714d103197
Reviewed-on: http://gerrit.cloudera.org:8080/16721
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-20 21:40:28 +00:00
skyyws
5c91ff2737 IMPALA-10346: Rename Iceberg test tables' name with specific cases
We used some nondescript table names in Iceberg-related test cases,
such as iceberg_test1/iceberg_test2 and so on, which resulted in poor
readability. This patch renames these Iceberg test tables to names
specific to each test case.

Testing:
  - Renamed tables in iceberg-create.test
  - Renamed tables in iceberg-alter.test

Change-Id: Ifdaeaaeed69753222668342dcac852677fdd9ae5
Reviewed-on: http://gerrit.cloudera.org:8080/16753
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-20 16:24:56 +00:00
Zoltan Borok-Nagy
26fc6795ec IMPALA-10318: default_transactional_type shouldn't affect Iceberg tables
Query option 'default_transactional_type' shouldn't affect Iceberg
tables. Also, Iceberg tables shouldn't allow setting transactional
properties.
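
A minimal sketch of the intended behavior (table names are made up):
even with a transactional default set, the Iceberg table is created as
non-transactional, while setting transactional properties explicitly
is rejected:

SET default_transactional_type=insert_only;
CREATE TABLE ice_nontx (i int) STORED AS ICEBERG;
-- Expected to be rejected after this patch:
CREATE TABLE ice_tx (i int) STORED AS ICEBERG
TBLPROPERTIES ('transactional'='true');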

Testing:
 * Added e2e tests

Change-Id: I86d1ac82ecd01a7455a0881a9e84aeb193dd5385
Reviewed-on: http://gerrit.cloudera.org:8080/16742
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-18 22:24:08 +00:00
skyyws
0c0985a825 IMPALA-10159: Supporting ORC file format for Iceberg table
This patch mainly adds support for querying Iceberg tables with the
ORC file format. We can use the following SQL to create a table with
the ORC file format:
  CREATE TABLE default.iceberg_test (
    level string,
    event_time timestamp,
    message string
  )
  STORED AS ICEBERG
  LOCATION 'hdfs://xxx'
  TBLPROPERTIES ('iceberg.file_format'='orc', 'iceberg.catalog'='hadoop.tables');
Note that there are still some problems when scanning ORC files with
Timestamp columns; for more details please refer to IMPALA-9967. We
may add new tests with the Timestamp type after that JIRA is fixed.

Testing:
- Create table tests in functional_schema_template.sql
- Iceberg table create test in test_iceberg.py
- Iceberg table query test in test_scanners.py

Change-Id: Ib579461aa57348c9893a6d26a003a0d812346c4d
Reviewed-on: http://gerrit.cloudera.org:8080/16568
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-14 19:19:19 +00:00
Gabor Kaszab
13a78fc1b0 IMPALA-10165: Implement Bucket and Truncate partition transforms for Iceberg tables
This patch adds support for Iceberg Bucket and Truncate partition
transforms. Both accept a parameter: number of buckets and width
respectively.

Usage:
CREATE TABLE tbl_name (i int, p1 int, p2 timestamp)
PARTITION BY SPEC (
  p1 BUCKET 10,
  p1 TRUNCATE 5
) STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.tables');

Testing:
  - Extended AnalyzerStmtsTest to cover creating partitioned Iceberg
    tables with the new partition transforms.
  - Extended ParserTest.
  - Extended iceberg-create.test to create Iceberg tables with the new
    partition transforms.
  - Extended show-create-table.test to check that the new partition
    transforms are displayed with their parameters in the SHOW CREATE
    TABLE output.

Change-Id: Idc75cd23045b274885607c45886319f4f6da19de
Reviewed-on: http://gerrit.cloudera.org:8080/16551
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-14 19:07:06 +00:00
skyyws
5912c47617 IMPALA-10221: Rename 'iceberg_file_format' to 'iceberg.file_format' as Iceberg table property
We provided several new table properties in IMPALA-10164, such as
'iceberg.catalog'. To keep these properties consistent, we rename
'iceberg_file_format' to 'iceberg.file_format'. When creating an
Iceberg table, we should now use SQL like this:
  CREATE TABLE default.iceberg_test (
    level string,
    event_time timestamp,
    message string
  )
  STORED AS ICEBERG
  TBLPROPERTIES ('iceberg.file_format'='parquet',
    'iceberg.catalog'='hadoop.tables')

Change-Id: I722303fb765aca0f97a79bd6e4504765d355a623
Reviewed-on: http://gerrit.cloudera.org:8080/16550
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-06 16:58:04 +00:00
skyyws
5b720a4d18 IMPALA-10164: Supporting HadoopCatalog for Iceberg table
This patch mainly adds support for creating Iceberg tables via
HadoopCatalog. We only supported the HadoopTables API before this
patch, but now we can also use HadoopCatalog to create Iceberg tables.
When creating a managed table, we can use SQL like this:
  CREATE TABLE default.iceberg_test (
    level string,
    event_time timestamp,
    message string
  )
  STORED AS ICEBERG
  TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
    'iceberg.catalog_location'='hdfs://test-warehouse/iceberg_test');
We now support two values ('hadoop.catalog', 'hadoop.tables') for
'iceberg.catalog'. If you don't specify this property in your SQL, the
default catalog type is 'hadoop.catalog'.
As for external Iceberg tables, you can use SQL like this:
  CREATE EXTERNAL TABLE default.iceberg_test_external
  STORED AS ICEBERG
  TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
    'iceberg.catalog_location'='hdfs://test-warehouse/iceberg_test',
    'iceberg.table_identifier'='default.iceberg_test');
We cannot set the table location for either managed or external
Iceberg tables with 'hadoop.catalog', and 'SHOW CREATE TABLE' will not
display the table location yet. We need to use 'DESCRIBE
FORMATTED/EXTENDED' to get this location info.
'iceberg.catalog_location' is required for 'hadoop.catalog' tables. It
is the location that holds the Iceberg table metadata and data, and we
use this location to load the table metadata from Iceberg.
'iceberg.table_identifier' is used for the Iceberg TableIdentifier. If
this property is not specified in the SQL, Impala will use the database
and table name to load the Iceberg table, which is
'default.iceberg_test_external' in the SQL above. The property value is
split by '.', so you can also set a value like 'org.my_db.my_tbl'. This
property is valid for both managed and external tables.

Testing:
- Create table tests in functional_schema_template.sql
- Iceberg table create test in test_iceberg.py
- Iceberg table query test in test_scanners.py
- Iceberg table show create table test in test_show_create_table.py

Change-Id: Ic1893c50a633ca22d4bca6726c9937b026f5d5ef
Reviewed-on: http://gerrit.cloudera.org:8080/16446
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-01 13:54:48 +00:00
skyyws
fb6d96e001 IMPALA-9741: Support querying Iceberg table by impala
This patch mainly adds support for querying Iceberg tables through
Impala. We can use the following SQL to create an external Iceberg
table:
    CREATE EXTERNAL TABLE default.iceberg_test (
        level string,
        event_time timestamp,
        message string
    )
    STORED AS ICEBERG
    LOCATION 'hdfs://xxx'
    TBLPROPERTIES ('iceberg_file_format'='parquet');
Or with just the table name and location, like this:
    CREATE EXTERNAL TABLE default.iceberg_test
    STORED AS ICEBERG
    LOCATION 'hdfs://xxx'
    TBLPROPERTIES ('iceberg_file_format'='parquet');
'iceberg_file_format' is the file format used by Iceberg; currently
only PARQUET is supported, and other formats will be supported in the
future. If you don't specify this property in your SQL, the default
file format is PARQUET.

We achieved this by treating the Iceberg table as a normal
unpartitioned HDFS table. When querying an Iceberg table, we push down
partition column predicates to Iceberg to decide which data files need
to be scanned, and then transfer this information to the BE to do the
real scan operation.
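
As an illustrative sketch of the pushdown (assuming event_time is used
as a partition column of the table above; names are made up), a
predicate on it limits the data files that get scanned:

    SELECT count(*) FROM default.iceberg_test
    WHERE event_time >= '2020-01-01';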

Testing:
- Unit test for Iceberg in FileMetadataLoaderTest
- Create table tests in functional_schema_template.sql
- Iceberg table query test in test_scanners.py

Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
Reviewed-on: http://gerrit.cloudera.org:8080/16143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-06 02:12:07 +00:00