The Iceberg table iceberg_partitioned_orc has incorrect metadata:
the 'file_size_in_bytes' field is wrong for its data files.
This causes issues on object stores, where we rely more heavily on
information coming from the Iceberg metadata.
This commit updates the manifest and manifest list files to reflect
correct information.
Change-Id: Iae860f401947092d9fdca802f41dd6de79e0638d
Reviewed-on: http://gerrit.cloudera.org:8080/19476
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Some manifest files were manually edited to support multiple
environments: the hdfs://localhost:20500 prefix was stripped from every
path. The snapshot files contain the lengths of these manifest files,
which were not adjusted.
Testing:
- Ran test_iceberg.py locally
- Verified lengths manually
Change-Id: I258055998a1d41b7f6047b6879e919834ed2c247
Reviewed-on: http://gerrit.cloudera.org:8080/19409
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends support for Iceberg tables containing multiple
file formats. AVRO data files can now also be read in a mixed table,
besides Parquet and ORC.
Impala uses its Avro scanner to read AVRO files, therefore all the
Avro-related limitations apply here as well: writes/metadata
changes are not supported.
Testing:
- E2E testing: extending 'iceberg-mixed-file-format.test' to include
AVRO files as well, in order to test reading all three currently
supported file formats: avro+orc+parquet
Change-Id: I941adfb659218283eb5fec1b394bb3003f8072a6
Reviewed-on: http://gerrit.cloudera.org:8080/19353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala generated wrong values for the FILE__POSITION column when the
Parquet file contained multiple row groups and page filtering was
used as well.
We are using the value of 'current_row_' in the Parquet column readers
to populate the file position slot. The problem is that 'current_row_'
denotes the index of the row within the row group and not within the
file. We cannot change 'current_row_' because page filtering depends on
its value: the page index also uses the row-group-based indexes of the
rows, not file-based indexes.
In the meantime it turned out FILE__POSITION was also not set correctly
in the Parquet late materialization code, as
BaseScalarColumnReader::SkipRowsInternal() didn't update 'current_row_'
in some code paths.
The value of FILE__POSITION is critical for Iceberg V2 tables as
position delete files store file positions of the deleted rows.
Testing:
* added e2e tests
* the tests are now running w/o PARQUET_READ_STATISTICS to exercise
more code paths
Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
Reviewed-on: http://gerrit.cloudera.org:8080/19328
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg tables containing only AVRO files or no AVRO files at all
can now be read by Impala. Mixed file format tables with AVRO are
currently unsupported.
Impala uses its Avro scanner to read AVRO files, therefore all the
Avro-related limitations apply here as well: writes/metadata
changes are not supported.
Testing:
- created test tables: 'iceberg_avro_only' contains only AVRO files;
'iceberg_avro_mixed' contains all file formats: avro+orc+parquet
- added E2E test that reads Avro-only table
- added test case to iceberg-negative.test that tries to read
mixed file format table
Change-Id: I827e5707e54bebabc614e127daa48255f86f4c4f
Reviewed-on: http://gerrit.cloudera.org:8080/19084
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala already supports said statement; this patch adds query tests for
it.
Testing:
- Generated a Parquet table with a complex type using Hive
- Created a Hive and an Iceberg table from said file; both provide the
same output for the DESCRIBE statement
Change-Id: Ia363b913e101fd49b62a280721680f0eb88761c0
Reviewed-on: http://gerrit.cloudera.org:8080/18969
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Added support for multiple file formats. Previously Impala created a
scanner class based on the partition's file format; now, in the case of
an Iceberg table, it reads the file format from the file-level
metadata instead.
IcebergScanNode aggregates file formats as well instead of relying
on partitions, so it can be used for planning.
Testing:
Created a mixed file format table with hive and added a test for it.
Change-Id: Ifc816595724e8fd2c885c6664f790af61ddf5c07
Reviewed-on: http://gerrit.cloudera.org:8080/18935
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For Iceberg tables, when one of the following properties is used, the
table is considered to possibly have data outside the table
location directory:
- 'write.object-storage.enabled' is true
- 'write.data.path' is not empty
- 'write.location-provider.impl' is configured
- 'write.object-storage.path'(Deprecated) is not empty
- 'write.folder-storage.path'(Deprecated) is not empty
We should tolerate the situation where the relative path of the data
files cannot be derived from the table location path, and use the
absolute path in that case. E.g. an ETL program may write the table so
that the metadata of the Iceberg table is placed in
'hdfs://nameservice_meta/warehouse/hadoop_catalog/ice_tbl/metadata',
the recent data files in
'hdfs://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data', and the
data files from half a year ago in
's3a://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data'; such a
table should still be queryable by Impala.
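For illustration, such a layout could be configured via a table
property like the following (hypothetical path; the property may also be
set by the engine that writes the data):
  ALTER TABLE ice_tbl SET TBLPROPERTIES(
    'write.data.path'='s3a://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data');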
Testing:
- added e2e tests
Change-Id: I666bed21d20d5895f4332e92eb30a94fa24250be
Reviewed-on: http://gerrit.cloudera.org:8080/18894
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for reading Iceberg V2 tables that use position
deletes. Equality deletes are still not supported. Position delete
files store the file path and file position of the deleted rows.
When an Iceberg table has position delete files we need to do an
ANTI JOIN between data files and delete files. From the data files
we need to query the virtual columns INPUT__FILE__NAME and
FILE__POSITION, while from the delete files we need the data columns
'file_path' and 'pos'. The latter data columns are not part of the
table schema, so we create a virtual table instance of
'IcebergPositionDeleteTable' that has a table schema corresponding
to the delete files ('file_path', 'pos').
This patch introduces a new class 'IcebergScanPlanner' which has
the responsibility of planning Iceberg table scans. It creates
the aforementioned ANTI JOIN. Also, if there are data files without
corresponding delete files, we can have a separate SCAN node whose
results are UNIONed with the rows coming from the ANTI JOIN:
          UNION
         /     \
  SCAN data   ANTI JOIN
               /      \
        SCAN data   SCAN deletes
Some refactorings in the context of this CR:
Predicate pushdown and time travel logic is transferred from
IcebergScanNode to IcebergScanPlanner. Iceberg snapshot summary
retrieval is moved from FeFsTable to FeIcebergTable.
Testing:
* added planner test
* added e2e tests
TODO in follow-up Jiras:
* better cardinality estimates (IMPALA-11516)
* support unrelative collection columns (select item from t.int_array)
(IMPALA-11517)
Currently such queries return error during analysis
Change-Id: I672cfee18d8e131772d90378d5b12ad4d0f7dd48
Reviewed-on: http://gerrit.cloudera.org:8080/18847
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.
In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.
As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows adding special
handling for BINARY.
Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as the BINARY
type simply has to be "connected" to the already existing STRING
function's signature. Functions whose result depends on utf8_mode
need to ensure that with BINARY they always work as if utf8_mode=0 (for
example length() is mapped to bytes(), as length() counts UTF-8 chars if
utf8_mode=1).
All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in the case of legacy Hive UDFs it is only supported if the
argument and return types are set explicitly to ensure backward
compatibility.
See IMPALA-11340 for details.
The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example of the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.
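For illustration (hypothetical table and column names):
  CREATE TABLE binary_tbl (i INT, b BINARY) STORED AS PARQUET;
  -- In Impala the string literal must be cast explicitly:
  INSERT INTO binary_tbl VALUES (1, CAST('abc' AS BINARY));
  -- STRING builtins beyond the supported ones require a cast back:
  SELECT upper(CAST(b AS STRING)) FROM binary_tbl;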
Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
to test scanning.
- Removed functional.unsupported_types and related tests, as now
Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When external tables are converted to Iceberg, the data files remain
intact and thus lack field IDs. Previously, Impala used name-based
column resolution in this case.
Added a feature to traverse the data files before column
resolution and assign field IDs the same way as Iceberg would, to be
able to use field-ID-based column resolution.
Testing:
The default resolution method was changed to field id for migrated
tables; existing tests use that from now on.
Added new tests to cover edge cases with complex types and schema
evolution.
Change-Id: I77570bbfc2fcc60c2756812d7210110e8cc11ccc
Reviewed-on: http://gerrit.cloudera.org:8080/18639
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IcebergScanNode interprets the timestamp literals as UTC timestamps
during predicate pushdown to Iceberg. This causes problems when the
Iceberg table uses TIMESTAMPTZ (which corresponds to TIMESTAMP WITH
LOCAL TIME ZONE in SQL) because in the scanners we assume that the
timestamp literals in a query are in local timezone.
Hence, if the Iceberg table is partitioned by HOUR(ts), and Impala is
running in a different timezone than UTC, then the following query
doesn't return any rows:
SELECT * from t
WHERE ts = <some ts>;
This is because during predicate pushdown the timestamp is interpreted as a
UTC timestamp (no conversion from local to UTC), but during query
execution the timestamp data in the files are converted to local
timezone, then compared to <some ts>. I.e. in the scanner the
assumption is that <some ts> is in local timezone.
On the other hand, when the Iceberg type TIMESTAMP (which corresponds
to TIMESTAMP WITHOUT TIME ZONE in SQL) is used, we should just
push down the timestamp values without any conversion. In this case
there is no conversion in the scanners either.
Testing:
* added e2e test with TIMESTAMPTZ
* added e2e test with TIMESTAMP
Change-Id: I181be5d2fa004f69b457f69ff82dc2f9877f46fa
Reviewed-on: http://gerrit.cloudera.org:8080/18399
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Iceberg V2 DeleteFiles are skipped during scans and the whole content of
the DataFiles is returned. This commit adds an extra check to prevent
scanning tables that have delete files, to avoid unexpected results
until merge-on-read is supported. Metadata operations are allowed on
tables with delete files.
Testing:
- Added e2e test.
Change-Id: I6e9cbf2424b27157883d551f73e728ab4ec6d21e
Reviewed-on: http://gerrit.cloudera.org:8080/18383
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Impala/Hive creates a table, it lowercases the schema elements.
When Spark creates an Iceberg table it doesn't lowercase the names
of the columns in the Iceberg metadata. This triggers a precondition
check in Impala which makes such Iceberg tables unloadable.
This patch converts column names to lowercase when converting Iceberg
schemas to Hive/Impala schemas.
Testing:
* added e2e test
Change-Id: Iffd910f76844fbf34db805dda6c3053c5ad1cf79
Reviewed-on: http://gerrit.cloudera.org:8080/18368
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Hive (and probably other engines as well) converts a legacy Hive
table to Iceberg it doesn't rewrite the data files. This means that the
data files have neither write ids nor partition column data. Currently
Impala expects the partition columns to be present in the data files,
so it is not able to read converted partitioned tables.
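For context, such a conversion is typically done in Hive with a
statement like the following (table name is illustrative):
  ALTER TABLE legacy_tbl SET TBLPROPERTIES(
    'storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');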
With this patch Impala loads partition values from the Iceberg metadata.
The extra metadata information is attached to the file descriptor
objects and propagated to the scanners. This metadata contains the
Iceberg data file format (later it could be used to handle mixed-format
tables) and the partition data.
We use the partition data in the HdfsScanner to create the template
tuple that contains the partition values of identity-partitioned
columns. This is not only true for migrated tables, but for all Iceberg
tables with identity partitions, which means we also save some IO
and CPU time for such columns. The partition information could also
be used for Dynamic Partition Pruning later.
We use the (human-readable) string representation of the partition data
when storing it in the flat buffers. This helps debugging, and it also
provides the needed flexibility when the partition columns
evolve (e.g. INT -> BIGINT, DECIMAL(4,2) -> DECIMAL(6,2)).
Testing
* e2e test for all data types that can be used to partition a table
* e2e test for migrated partitioned table + schema evolution (without
renaming columns)
* e2e for table where all columns are used as identity-partitions
Change-Id: Iac11a02de709d43532056f71359c49d20c1be2b8
Reviewed-on: http://gerrit.cloudera.org:8080/18240
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Added a default of 0 for scale if it is not set, to comply with the
Parquet spec.
Wrapped reading scale and precision in a function to support reading
LogicalType.DecimalType if it is set, falling back to the old fields if
it is not, for backward compatibility.
Regenerated the bad_parquet_decimals table with DecimalType filled in,
and moved the missing-scale test, as it is no longer a bad table.
Added a no_scale.parquet table to test reading a table without a set
scale. Checked it with parquet-tools:
message schema {
optional fixed_len_byte_array(2) d1 (DECIMAL(4,0));
}
Change-Id: I003220b6e2ef39d25d1c33df62c8432803fdc6eb
Reviewed-on: http://gerrit.cloudera.org:8080/18224
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala crashes on a Parquet file that contains the partition columns.
Data files usually don't contain the partition columns, so Impala
doesn't expect to find such columns in the data files. Unfortunately
min/max filtering generates a SEGFAULT when the partition column is
present in the data files.
It happens when FindSkipRangesForPagesWithMinMaxFilters() tries to
retrieve the Parquet schema element for a given slot descriptor. When
the slot descriptor refers to a partition column, we usually don't find
a schema element so we don't try to skip pages.
But when the partition column is present in the data file, the code
tries to calculate the filtered pages for the column. It uses the column
reader object corresponding to the column, but this is NULL for
partition columns, hence we get a SEGFAULT.
The code shouldn't do anything at the page level for partition columns,
as the data in such columns is the same for the whole file and is
already filtered at a higher level.
Testing:
* added e2e test
Change-Id: I17eff4467da3fd67a21353ba2d52d3bec405acd2
Reviewed-on: http://gerrit.cloudera.org:8080/18265
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala returns "Couldn't skip rows in file" error for old Parquet
file written by an old Impala (e.g. Impala 2.5, 2.6) In DEBUG build
Impala crashes by a DCHECK:
Check failed: num_buffered_values_ > 0 (-1 vs. 0)
The problem is that in some old Parquet files there can be a mismatch
between 'num_values' in a page and the encoded def/rep levels.
There is usually one more def/rep levels encoded in these files.
In SkipTopLevelRows() we skipped values based on how many def levels are
92ce6fe48e/be/src/exec/parquet/parquet-column-readers.cc (L1308-L1314)
Since there are more def levels than values in some old files,
num_buferred_values_ could become negative.
This patch also takes the value of num_buferred_values_ into account
when calculating 'read_count', so we can deal with such files. With
this patch we also include the column name in the "Couldn't skip rows"
error message, so in the future it'll be easier to identify the
problematic columns.
Testing:
* added Parquet file written by Impala 2.5 and e2e test for it
Change-Id: I568fe59df720ea040be4926812412ba4c1510a26
Reviewed-on: http://gerrit.cloudera.org:8080/18257
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds an end-to-end test to validate and characterize HMS'
behavior with respect to external table creation after HIVE-25569 via
which a user is allowed to create an external table associated with a
single file.
Change-Id: Ia4f57f07a9f543c660b102ebf307a6cf590a6784
Reviewed-on: http://gerrit.cloudera.org:8080/18033
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
This patch provides an unnest implementation for arrays where unnesting
multiple arrays in one query results in the items of the arrays being
zipped together instead of joined. There are two different syntaxes
introduced for this purpose:
1: ISO SQL:2016 compliant syntax:
SELECT a1.item, a2.item
FROM complextypes_arrays t, UNNEST(t.arr1, t.arr2) AS (a1, a2);
2: Postgres compatible syntax:
SELECT UNNEST(arr1), UNNEST(arr2) FROM complextypes_arrays;
Let me show the expected behaviour through the following example:
Inputs: arr1: {1,2,3}, arr2: {11, 12}
After running any of the above queries we expect the following output:
===============
| arr1 | arr2 |
===============
| 1    | 11   |
| 2    | 12   |
| 3    | NULL |
===============
Expected behaviour:
- When unnesting multiple arrays with zipping unnest then the 'i'th
item of one array will be put next to the 'i'th item of the other
arrays in the results.
- In case the size of the arrays is not the same then the shorter
arrays will be filled with NULL values up to the size of the longest
array.
On a sidenote, UNNEST is added to Impala's SQL language as a new
keyword. This might interfere with use cases where a resource (db,
table, column, etc.) is named "UNNEST".
Restrictions:
- It is not allowed to have WHERE filters on an unnested item of an
array in the same SELECT query. E.g. this is not allowed:
SELECT arr1.item
FROM complextypes_arrays t, UNNEST(t.arr1) WHERE arr1.item < 5;
Note that it is allowed to have an outer SELECT around the one
doing unnests and have a filter there on the unnested items.
- If there is an outer SELECT filtering on the unnested array's items
from the inner SELECT then these predicates won't be pushed down to
the SCAN node. They are rather evaluated in the UNNEST node to
guarantee result correctness after unnesting.
Note that this restriction is only active when there are multiple arrays
being unnested, or in other words when zipping unnest logic is
required to produce results.
- It's not allowed to do a zipping and a (traditional) joining unnest
together in one SELECT query.
- It's not allowed to perform zipping unnests on arrays from different
tables.
Testing:
- Added a bunch of E2E tests to the test suite to cover both syntaxes.
- Did a manual test run on a table with 1000 rows, 3 array columns
with size of around 5000 items in each array. I did an unnest on all
three arrays in one query to see if there are any crashes or
suspicious slowness when running on this scale.
Change-Id: Ic58ff6579ecff03962e7a8698edfbe0684ce6cf7
Reviewed-on: http://gerrit.cloudera.org:8080/17983
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When a regular Parquet/ORC table is converted to Iceberg via Hive,
only the Iceberg metadata files need to be created. The data files
can stay in place.
This causes problems when the data files don't have field ids for
the schema elements. Currently Impala resolves columns in data
files based on Iceberg field ids, but since they are missing,
Impala raises an error or returns NULLs.
With this patch Impala falls back to the default column resolution
strategy when the data files lack field ids.
Testing:
* added e2e tests both for Parquet and ORC
Change-Id: I85881b09891c7bd101e7a96e92561b70bbe5af41
Reviewed-on: http://gerrit.cloudera.org:8080/17953
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.
Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/
A new utility class, ParquetDataConverter, has been introduced which
does the data conversion. It also helps to decide whether data
conversion is needed or not.
NULL values are returned in case of overflows. This behavior is
consistent with Hive.
Parquet column stats reader is also updated to convert the decimal
values. The stats reader is used to evaluate min/max conjuncts. It
works well because later we also evaluate the conjuncts on the
converted values anyway.
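For illustration (hypothetical table and column names), a min/max
conjunct such as the one in the query below is evaluated on the
converted values, both by the stats reader and later in the scanner:
  SELECT count(*) FROM tbl WHERE d > 123.45;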
The status of different filterings:
* dictionary filtering: disabled for columns that need conversion
* runtime bloom filters: work on the converted values
* runtime min/max filters: work on the converted values
This patch also enables schema evolution of decimal columns of Iceberg
tables.
Testing:
* added e2e tests
Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Reviewed-on: http://gerrit.cloudera.org:8080/17678
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may differ
between Parquet and Impala, which results in truncation or padding;
that changes the hash, which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:
------------------------------------------
| Parquet type | Impala type             |
|----------------------------------------|
| INT32        | TINYINT, SMALLINT, INT  |
| INT64        | BIGINT                  |
| FLOAT        | FLOAT                   |
| DOUBLE       | DOUBLE                  |
| BYTE_ARRAY   | STRING                  |
------------------------------------------
The following types are not supported for the given reasons:
-------------------------------------------------------------------
| Impala type | Problem                                           |
|-----------------------------------------------------------------|
| VARCHAR(N)  | truncation can change hash                        |
| CHAR(N)     | padding / truncation can change hash              |
| DECIMAL     | multiple encodings supported                      |
| TIMESTAMP   | multiple encodings supported, timezone conversion |
| DATE        | not considered yet                                |
-------------------------------------------------------------------
Support may be added for these types later, see IMPALA-10641.
If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.
Testing:
- Added tests/query_test/test_parquet_bloom_filter.py that tests
whether Parquet Bloom filtering works for the supported types and
that we do not incorrectly discard row groups for the unsupported
type VARCHAR. The Parquet file used in the test was generated with
an external tool.
- Added unit tests for ParquetBloomFilter in file
be/src/util/parquet-bloom-filter-test.cc
- A minor, unrelated change was done in
be/src/util/bloom-filter-test.cc: the MakeRandom() function had
return type uint64_t, the documentation claimed it returned a 64 bit
random number, but the actual number of random bits is 32, which is
what is intended in the tests. The return type and documentation
have been corrected to use 32 bits.
Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Reviewed-on: http://gerrit.cloudera.org:8080/17026
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
Currently the ORC scanner only supports position-based column
resolution. This patch adds Iceberg field-id based column resolution
which will be the default for Iceberg tables. It is needed to support
schema evolution in the future, i.e. ALTER TABLE DROP/RENAME COLUMNS.
(The Parquet scanner already supports Iceberg field-id based column
resolution)
Testing
* added e2e test 'iceberg-orc-field-id.test' by copying the contents of
nested-types-scanner-basic,
nested-types-scanner-array-materialization,
nested-types-scanner-position,
nested-types-scanner-maps,
and executing the queries on an Iceberg table with ORC data files
Change-Id: Ia2b1abcc25ad2268aa96dff032328e8951dbfb9d
Reviewed-on: http://gerrit.cloudera.org:8080/17398
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function receives a set of serialized Apache DataSketches CPC
sketches produced by ds_cpc_sketch() and merges them into a single
sketch.
An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, depending on which
partitions the user is interested in, the relevant sketches can be
unioned together to get an estimate. E.g.:
SELECT
ds_cpc_estimate(ds_cpc_union(sketch_col))
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Testing:
- Apart from the automated tests I added to this patch I also
tested ds_cpc_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches with grouping
by l_shipdate and called ds_cpc_union() on those sketches
Change-Id: Ib94b45ae79efcc11adc077dd9df9b9868ae82cb6
Reviewed-on: http://gerrit.cloudera.org:8080/17372
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
These functions can be used to get cardinality estimates of data
using the CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.
Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.
For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
For a Figures-of-Merit comparison of the HLL and CPC sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html
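For example, a sketch can be created and its estimate retrieved in a
single query (hypothetical table and column names):
  SELECT ds_cpc_estimate(ds_cpc_sketch(col)) FROM tbl;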
Testing:
- Added some tests running estimates for small datasets where the
amount of data is small enough to get the correct results.
- Ran manual tests on tpch_parquet.lineitem to compare performance
with ndv(). Depending on data characteristics ndv() appears 2x-3x
faster. CPC gives closer estimates than the current ndv(). CPC is more
accurate than HLL in some cases.
Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Reviewed-on: http://gerrit.cloudera.org:8080/16656
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ORC-189 and ORC-666 added support for a new timestamp type,
'TIMESTAMP WITH LOCAL TIMEZONE', to the ORC library.
This patch adds support for reading such timestamps with Impala.
These are UTC-normalized timestamps, therefore we convert them
to local timezone during scanning.
Testing:
* added test for CREATE TABLE LIKE ORC
* added scanner tests to test_scanners.py
Change-Id: Icb0c6a43ebea21f1cba5b8f304db7c4bd43967d9
Reviewed-on: http://gerrit.cloudera.org:8080/17347
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function receives a set of serialized Apache DataSketches Theta
sketches produced by ds_theta_sketch() and merges them into a single
sketch.
An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, depending on which
partitions the user is interested in, the relevant sketches can be
unioned together to get an estimate. E.g.:
SELECT
ds_theta_estimate(ds_theta_union(sketch_col))
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Testing:
- Apart from the automated tests I added to this patch I also
tested ds_theta_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches with grouping
by l_shipdate and called ds_theta_union() on those sketches
Change-Id: I91baf58c76eb43748acd5245047edac8c66761b2
Reviewed-on: http://gerrit.cloudera.org:8080/17048
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
These functions can be used to get cardinality estimates of data
using the Theta algorithm from Apache DataSketches. ds_theta_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized Theta sketch in string format. This can be written to a
table or be fed directly to ds_theta_estimate() that returns the
cardinality estimate for that sketch.
Similar to the HLL sketch, the primary use-case for the Theta sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.
For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html
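For example, a sketch can be created and its estimate retrieved in a
single query (hypothetical table and column names):
  SELECT ds_theta_estimate(ds_theta_sketch(col)) FROM tbl;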
Testing:
- Added some tests running estimates for small datasets where the
amount of data is small enough to get the correct results.
- Ran manual tests on tpch25_parquet.lineitem to compare performance
with ds_hll_*. ds_theta_* is faster than ds_hll_* on the original
data, the difference is around 1%-10%. ds_hll_estimate() is faster
than ds_theta_estimate() on an existing sketch. HLL and Theta give
close estimates, except for strings; see IMPALA-10464.
Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Reviewed-on: http://gerrit.cloudera.org:8080/17008
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for resolving columns by field id for Iceberg
tables. Currently, we use field ids to resolve columns for Iceberg
tables, which means 'PARQUET_FALLBACK_SCHEMA_RESOLUTION' has no effect
for Iceberg tables.
Change-Id: I057bdc6ab2859cc4d40de5ed428d0c20028b8435
Reviewed-on: http://gerrit.cloudera.org:8080/16788
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
The DESCRIBE HISTORY statement works for Iceberg tables and displays
the snapshot history of the table.
An example output:
DESCRIBE HISTORY iceberg_multi_snapshots;
+----------------------------+---------------------+---------------------+---------------------+
| creation_time | snapshot_id | parent_id | is_current_ancestor |
+----------------------------+---------------------+---------------------+---------------------+
| 2020-10-13 14:01:07.234000 | 4400379706200951771 | NULL | TRUE |
| 2020-10-13 14:01:19.307000 | 4221472712544505868 | 4400379706200951771 | TRUE |
+----------------------------+---------------------+---------------------+---------------------+
The purpose here was to have this new feature produce output similar to
what Spark SQL returns for "SELECT * FROM tablename.history".
See the "History" section of
https://iceberg.apache.org/spark/#inspecting-tables
Testing:
- iceberg-negative.test was extended to check that DESCRIBE HISTORY
is not applicable for non-Iceberg tables.
- iceberg-table-history.test: Covers basic usage of DESCRIBE
HISTORY. Tests on tables created with Impala and also with Spark.
Change-Id: I56a4b92c27e8e4a79109696cbae62735a00750e5
Reviewed-on: http://gerrit.cloudera.org:8080/16599
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In practice we recommend that the HDFS block size should align with the
Parquet row group size. But some compute engines, like Spark, use a
default Parquet row group size of 128MB, and if the ETL user doesn't
change the default property, Spark will generate row groups smaller
than the HDFS block size. The result is that a single HDFS block may
contain multiple Parquet row groups.
In the planning stage, the length of an Impala-generated scan range may
be bigger than the row group size, thus a single scan range may contain
multiple row groups. In the current Parquet scanner, when moving to the
next row group, some of the internal state in the Parquet column readers
needs to be reset, e.g. num_buffered_values_, the column chunk metadata,
and the internal state of the column chunk readers. But the
current_row_range_ offset is not reset currently, which causes
"Couldn't skip rows in file hdfs://xxx" errors, as IMPALA-10310 points
out.
This patch simply resets current_row_range_ to 0 when moving to the next
row group in the Parquet column readers, fixing IMPALA-10310.
Testing:
* Add e2e test for parquet multi blocks per file and multi pages
per block
* Ran all core tests offline.
* Manually tested all cases encountered in my production environment.
Change-Id: I964695cd53f5d5fdb6485a85cd82e7a72ca6092c
Reviewed-on: http://gerrit.cloudera.org:8080/16697
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for querying Iceberg tables stored in the ORC
file format. The following SQL can be used to create a table with the
ORC file format:
CREATE TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg.file_format'='orc', 'iceberg.catalog'='hadoop.tables');
Note that there are still some problems when scanning ORC files
with Timestamp columns; for more details please refer to IMPALA-9967.
We may add new tests with the Timestamp type after that JIRA is fixed.
Testing:
- Create table tests in functional_schema_template.sql
- Iceberg table create test in test_iceberg.py
- Iceberg table query test in test_scanners.py
Change-Id: Ib579461aa57348c9893a6d26a003a0d812346c4d
Reviewed-on: http://gerrit.cloudera.org:8080/16568
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As IMPALA-4371 and IMPALA-10186 point out, Impala might write
empty data pages. It usually does that when it has to write a bigger
page than the current page size. Whether we really need to write empty
data pages is a different question, but we need to handle them correctly
as there are already such files out there.
The Parquet offset index entries corresponding to empty data pages
are invalid PageLocation objects with 'compressed_page_size' = 0.
Before this commit Impala didn't ignore the empty page locations, but
generated a warning. Since an invalid page index doesn't fail a scan
by default, Impala continued scanning the file with semi-initialized
page filtering. This resulted in a 'Top level rows aren't in sync'
error, or a crash in DEBUG builds.
With this commit Impala ignores empty data pages and is still able to
filter the rest of the pages. Also, if the page index is corrupt
for some other reason, Impala correctly resets the page filtering
logic and falls back to regular scanning.
Testing:
* Added unit test for empty data pages
* Added e2e test for empty data pages
* Added e2e test for invalid page index
Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Reviewed-on: http://gerrit.cloudera.org:8080/16503
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for creating Iceberg tables with HadoopCatalog.
We only supported the HadoopTables API before this patch, but now we can
use HadoopCatalog to create Iceberg tables. When creating a managed
table, we can use SQL like this:
CREATE TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
  'iceberg.catalog_location'='hdfs://test-warehouse/iceberg_test');
Two values ('hadoop.catalog', 'hadoop.tables') are now supported for
'iceberg.catalog'. If you don't specify this property in your SQL, the
default catalog type is 'hadoop.catalog'.
As for external Iceberg tables, you can use SQL like this:
CREATE EXTERNAL TABLE default.iceberg_test_external
STORED AS ICEBERG
TBLPROPERTIES ('iceberg.catalog'='hadoop.catalog',
  'iceberg.catalog_location'='hdfs://test-warehouse/iceberg_test',
  'iceberg.table_identifier'='default.iceberg_test');
The table location cannot be set for either managed or external Iceberg
tables with 'hadoop.catalog', and 'SHOW CREATE TABLE' will not display
the table location yet. We need to use 'DESCRIBE FORMATTED/EXTENDED' to
get this location info.
'iceberg.catalog_location' is required for 'hadoop.catalog' tables; it
is the location that holds the Iceberg table metadata and data, and we
use this location to load table metadata from Iceberg.
'iceberg.table_identifier' is used for the Iceberg TableIdentifier. If
this property is not specified in the SQL, Impala will use the database
and table name to load the Iceberg table, which is
'default.iceberg_test_external' in the SQL above. The property value is
split by '.', so you can also set a value like 'org.my_db.my_tbl'. This
property is valid for both managed and external tables.
Testing:
- Create table tests in functional_schema_template.sql
- Iceberg table create test in test_iceberg.py
- Iceberg table query test in test_scanners.py
- Iceberg table show create table test in test_show_create_table.py
Change-Id: Ic1893c50a633ca22d4bca6726c9937b026f5d5ef
Reviewed-on: http://gerrit.cloudera.org:8080/16446
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tests (e.g. AnalyzeDDLTest.TestCreateTableLikeFileOrc) depend on
hard-coded file paths of managed tables, assuming that there is always a
file named 'base_0000001/bucket_00000_0' under the table dir. However,
the file name is in the form of bucket_${bucket-id}_${attempt-id}. The
last part of the file name is not guaranteed to be 0. If the first
attempt fails and the second attempt succeeds, the file name will be
bucket_00000_1.
This patch replaces these hard-coded file paths with corresponding files
that are uploaded to HDFS by commands. For tests that do need to use the
file paths of managed table files, we do a listing on the table dir to
get the file names, instead of hard-coding the file paths.
Updated chars-formats.orc to contain column names in the file so it can
be used in more tests. The original one only had names like col0, col1,
col2.
Tests:
- Run CORE tests
Change-Id: Ie3136ee90e2444c4a12f0f2e1470fca1d5deaba0
Reviewed-on: http://gerrit.cloudera.org:8080/16441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for querying Iceberg tables through Impala.
We can use the following SQL to create an external Iceberg table:
CREATE EXTERNAL TABLE default.iceberg_test (
  level string,
  event_time timestamp,
  message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
Or just including the table name and location like this:
CREATE EXTERNAL TABLE default.iceberg_test
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
'iceberg_file_format' is the file format in Iceberg; currently only
PARQUET is supported, other formats will be supported in the future. If
you don't specify this property in your SQL, the default file format
is PARQUET.
This is achieved by treating the Iceberg table as a normal
unpartitioned HDFS table. When querying an Iceberg table, we push down
partition column predicates to Iceberg to decide which data files
need to be scanned, and then transfer this information to the BE to
do the real scan operation.
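For illustration (assuming the table is partitioned by a transform of
event_time), a predicate like the one below is pushed down to Iceberg so
that only the matching data files are scanned:
  SELECT count(*) FROM default.iceberg_test
  WHERE event_time >= '2020-01-01';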
Testing:
- Unit test for Iceberg in FileMetadataLoaderTest
- Create table tests in functional_schema_template.sql
- Iceberg table query test in test_scanners.py
Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
Reviewed-on: http://gerrit.cloudera.org:8080/16143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently Impala checks the file metadata field 'hive.acid.version' to
decide whether a file has the full ACID schema. There are cases when
Hive forgets to set this value for full ACID files, e.g. query-based
compactions.
So it's more robust to check the schema elements instead of the metadata
field. Also, sometimes Hive writes the schema with different character
cases, e.g. originalTransaction vs originaltransaction, so we should
rather compare the column names in a case-insensitive way.
Testing:
* added test for full ACID compaction
* added test_full_acid_schema_without_file_metadata_tag to test full
ACID file without metadata 'hive.acid.version'
Change-Id: I52642c1755599efd28fa2c90f13396cfe0f5fa14
Reviewed-on: http://gerrit.cloudera.org:8080/16383
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function receives a set of serialized Apache DataSketches KLL
sketches produced by ds_kll_sketch() and merges them into a single
sketch.
An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, depending on which
partitions the user is interested in, the relevant sketches can be
unioned together to get an estimate. E.g.:
SELECT
ds_kll_quantile(ds_kll_union(sketch_col), 0.5)
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Testing:
- Apart from the automated tests I added to this patch I also
tested ds_kll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches with grouping
by l_shipdate and called ds_kll_union() on those sketches.
Change-Id: I020aea28d36f9b6ef9fb57c08411f2170f5c24bf
Reviewed-on: http://gerrit.cloudera.org:8080/16267
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ds_kll_sketch() is an aggregate function that receives a float
parameter (e.g. a float column of a table) and returns a serialized
Apache DataSketches KLL sketch of the input data set wrapped into
STRING type. This sketch can be saved into a table or view and later
used for quantile approximations. ds_kll_quantile() receives two
parameters: a STRING parameter that contains a serialized KLL sketch
and a DOUBLE that represents the rank of the quantile in the range of
[0,1]. E.g. rank=0.1 means the approximate value in the sketch where
10% of the sketched items are less than or equal to this value.
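For example, an approximate median can be computed directly from a
float column (hypothetical table and column names):
  SELECT ds_kll_quantile(ds_kll_sketch(float_col), 0.5) FROM tbl;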
Testing:
- Added automated tests on small data sets to check the basic
functionality of sketching and getting a quantile approximate.
- Tested on TPCH25_parquet.lineitem to check that sketching and
approximating works on bigger scale as well where serialize/merge
phases are also required. On this scale the error range of the
quantile approximation is within 1-1.5%
Change-Id: I11de5fe10bb5d0dd42fb4ee45c4f21cb31963e52
Reviewed-on: http://gerrit.cloudera.org:8080/16235
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.
An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, depending on which
partitions the user is interested in, the relevant sketches can be
unioned together to get an estimate. E.g.:
SELECT
ds_hll_estimate(ds_hll_union(sketch_col))
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Note, currently there is a known limitation when unioning sketches of
string types where some input sketches come from Impala and some from
Hive. In this case, if there is an overlap in the input data used by
Impala and by Hive, the overlapping data is still counted twice due to a
string representation difference between Impala and Hive.
For more details see:
https://issues.apache.org/jira/browse/IMPALA-9939
Testing:
- Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches with grouping
by l_shipdate and called ds_hll_union() on those sketches.
Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Reviewed-on: http://gerrit.cloudera.org:8080/16095
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Removed the support for dateless timestamps.
During dateless timestamp casts, if the format doesn't contain a
date part we get an error during tokenization of the format.
If the input string doesn't contain a date part then we get a NULL
result.
Examples:
select cast('01:02:59' as timestamp);
This will come back as a NULL value.
select to_timestamp('01:01:01', 'HH:mm:ss');
select cast('01:02:59' as timestamp format 'HH12:MI:SS');
select cast('12 AM' as timestamp FORMAT 'AM.HH12');
These will come back with parsing errors.
Casting from a table will generate similar results.
Testing:
Modified the previous tests related to dateless timestamps.
Added a test to read from tables which still contain dateless
timestamps, and covered the timestamp-to-string path when no date tokens
are requested in the output string.
Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Reviewed-on: http://gerrit.cloudera.org:8080/15866
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
These functions can be used to get cardinality estimates of data
using the HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.
Compared to ndv() these functions bring more flexibility, as once we
have fed data to the sketch it can be written to a table, and next time
we can skip scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as performance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand, if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive to the order of the data fed to the
sketch, and as a result running these algorithms in Impala gives
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time within 2%
of the correct result, but there are occasional spikes where the
difference is bigger; it never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized,
this implementation currently hard-codes these parameters and uses
HLL_4 and lg_k=12.
For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html
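For example, a sketch can be created and its estimate retrieved in a
single query (hypothetical table and column names):
  SELECT ds_hll_estimate(ds_hll_sketch(col)) FROM tbl;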
Testing:
- Added some tests running estimates for small datasets where the
amount of data is small enough to get the correct results.
- Ran manual tests on TPCH25.lineitem to compare performance with
ndv(). Depending on data characteristics ndv() appears 2x-3.5x
faster. The lower the cardinality of the dataset, the bigger the
difference between the 2 algorithms is.
- Ran manual tests on TPCH25.lineitem and
functional_parquet.alltypes to compare correctness with ndv(). See
results above.
Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Reviewed-on: http://gerrit.cloudera.org:8080/16000
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
"Original files" are files that don't have full ACID schema. We can see
such files if we upgrade a non-ACID table to full ACID. Also, the LOAD
DATA statement can load non-ACID files into full ACID tables. So such
files don't store special ACID columns, that means we need
to auto-generate their values. These are (operation,
originalTransaction, bucket, rowid, and currentTransaction).
With the exception of 'rowid', all of them can be calculated based on
the file path, so I add their values to the scanner's template tuple.
'rowid' is the ordinal number of the row inside a bucket inside a
directory. For now Impala only allows one file per bucket per
directory. Therefore we can generate row ids for each file
independently.
Multiple files in a single bucket in a directory can only be present if
the table was non-transactional earlier and we upgraded it to full ACID
table. After the first compaction we should only see one original file
per bucket per directory.
In HdfsOrcScanner we calculate the first row id for our split then
the OrcStructReader fills the rowid slot with the proper values.
Testing:
* added e2e tests to check if the generated values are correct
* added e2e test to reject tables that have multiple files per bucket
* added unit tests to the new auxiliary functions
Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953
Reviewed-on: http://gerrit.cloudera.org:8080/16001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for reading Parquet files where the
DECIMAL is encoded as a FIXED_LEN_BYTE_ARRAY field
with extra padding. This requires loosening file
validation and fixing up the decoding so that it
no longer assumes that the in-memory value is at
least as large as the encoded representation.
The decimal decoding logic was reworked so that we
could add the extra condition handling without regressing
performance of the decoding logic in the common case. In
the end I was able to significantly speed up the decoding
logic. The bottleneck, revealed by perf record while running
the benchmark below, was CPU stalls: the bitshift instruction used for
sign extension was waiting on the load of the result of
ByteSwap(). I worked around this by doing the sign extension
before the ByteSwap().
Perf:
Ran a microbenchmark to check that scanning perf didn't regress
as a result of the change. The query scans a DECIMAL column
that is mostly plain-encoded, so as to maximally stress the
FIXED_LEN_BYTE_ARRAY decoding performance.
set mt_dop=1;
set num_nodes=1;
select min(l_extendedprice)
from tpch_parquet.lineitem
The SCAN time in the summary averaged out to 94ms before
the change and is reduced to 74ms after the change. The
actual speedup of the DECIMAL decoding is greater - it
went from ~20% of the time to ~6% of the time as measured by
perf.
Testing:
Added a couple of parquet files that were generated with a
hacked version of Impala to have extra padding. Sanity-checked
that hacked tables returned the same results on Hive. The
tests failed before this code change.
Ran exhaustive tests with the hacked version of Impala
(so that all decimal tables got extra padding).
Change-Id: I2700652eab8ba7f23ffa75800a1712d310d4e1ec
Reviewed-on: http://gerrit.cloudera.org:8080/16090
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Minor compactions can compact several delta directories into a single
delta directory. The current directory filtering algorithm had to be
modified to handle minor compacted directories and prefer those over
plain delta directories. This happens in the Frontend, mostly in
AcidUtils.java.
Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.
E.g. we can have the following delta directory:
full_acid/delta_0000001_0000010/0000 # minWriteId: 1
# maxWriteId: 10
This delta dir contains rows with write ids between 1 and 10. But maybe
we are only allowed to see write ids less than 5. Therefore we need to
check the ACID write id column (named originalTransaction) to determine
which rows are valid.
Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:
full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
# every row is valid
If there's no visibilityTxnId then it was created via Hive Streaming,
therefore we need to validate rows. Fortunately Hive Streaming writes
rows with different write ids into different ORC stripes, therefore we
don't need to validate the write id per row. If we had statistics,
we could validate per stripe, but since Hive Streaming doesn't write
statistics we validate the write id per ORC row batch (an alternative
could be to do a 2-pass read, first we'd read a single value from each
stripe's 'currentTransaction' field, then we'd read the stripe if the
write id is valid).
Testing
* the frontend logic is tested in AcidUtilsTest
* the backend row validation is tested in test_acid_row_validation
Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In this patch, we add support for reading zstd-encoded text files.
This includes:
1. support for reading zstd files written by Hive, which uses streaming
compression.
2. support for reading zstd files compressed by the standard zstd
library, which uses block compression.
To support decompressing both formats, a function ProcessBlockStreaming
is added to the zstd decompressor.
Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end-to-end tests:
1. Hive and Impala integration: for four compression codecs, write in
Hive and read from Impala.
2. zstd library and Impala integration: copy a file compressed with the
zstd library to HDFS, and read it from Impala.
Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Reviewed-on: http://gerrit.cloudera.org:8080/15023
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A Hudi Read Optimized Table contains multiple versions of Parquet
files. In order to load the table correctly, Impala needs to recognize
a Hudi Read Optimized Table as an HdfsTable and load the latest version
of the files using HoodieROTablePathFilter.
Tests
- Unit test for Hudi in FileMetadataLoader
- Create table tests in functional_schema_template.sql
- Query tests in hudi-parquet.test
Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Implements the read path for the date type in the ORC scanner. The
internal representation of a date is an int32 representing the number of
days since the Unix epoch in the proleptic Gregorian calendar.
Similarly to the Parquet implementation (IMPALA-7370) this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.
Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Reviewed-on: http://gerrit.cloudera.org:8080/14982
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>