During serialization of a row batch header, a tuple_data_ is created
that holds the compressed tuple data for an outbound row batch. We
would like this tuple data to be trackable, as it is responsible for
a significant portion of the untracked memory of the KRPC data stream
sender. By using MemTrackerAllocator, we can allocate the tuple data
and compression scratch and account for them in the memory tracker of
the KrpcDataStreamSender. This solution changes the type of the tuple
data and compression scratch from std::string to TrackedString, an
std::basic_string with MemTrackerAllocator as the custom allocator.
This patch adds memory estimation in DataStreamSink.java to account
for OutboundRowBatch memory allocation. This patch also removes the
thrift-based serialization because the thrift RPC was removed in the
prior commit.
Testing:
- Passed core tests.
- Ran a single node benchmark which shows no regression.
- Updated row-batch-serialize-test and row-batch-serialize-benchmark
  to test the row-batch serialization used by KRPC.
- Manually collected query profiles, heap growth, and memory usage logs
  showing that untracked memory decreased by half.
- Added test_datastream_sender.py to verify the peak memory of the
  EXCHANGE SENDER node.
- Raised mem_limit in two of the test_spilling_large_rows test cases.
- Added printing of the test line number in PlannerTestBase.java.
New row-batch serialization benchmark:
Machine Info: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
serialize:             10%     50%     90%  10%(rel)  50%(rel)  90%(rel)
------------------------------------------------------------------------
ser_no_dups_base      18.6    18.8    18.9        1X        1X        1X
ser_no_dups           18.5    18.5    18.8    0.998X    0.988X    0.991X
ser_no_dups_full      14.7    14.8    14.8    0.793X     0.79X    0.783X
ser_adj_dups_base     28.2    28.4    28.8        1X        1X        1X
ser_adj_dups          68.9    69.1    69.8     2.44X     2.43X     2.43X
ser_adj_dups_full     56.2    56.7    57.1     1.99X        2X     1.99X
ser_dups_base         20.7    20.9    20.9        1X        1X        1X
ser_dups              20.6    20.8    20.9    0.994X    0.995X        1X
ser_dups_full         39.8    40      40.5     1.93X     1.92X     1.94X

deserialize:           10%     50%     90%  10%(rel)  50%(rel)  90%(rel)
------------------------------------------------------------------------
deser_no_dups_base    75.9    76.6    77          1X        1X        1X
deser_no_dups         74.9    75.6    76      0.987X    0.987X    0.987X
deser_adj_dups_base    127     128     129        1X        1X        1X
deser_adj_dups         179     193     195     1.41X     1.51X     1.51X
deser_dups_base        128     128     129        1X        1X        1X
deser_dups             165     190     193     1.29X     1.48X     1.49X
Change-Id: I2ba2b907ce4f275a7a1fb8cf75453c7003eb4b82
Reviewed-on: http://gerrit.cloudera.org:8080/18798
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Similarly to HIVE-26596, Impala should set the merge-on-read write
mode for V2 tables, unless otherwise specified:
* during table creation with 'format-version'='2'
* during ALTER TABLE SET TBLPROPERTIES 'format-version'='2'
We do so because in the foreseeable future Impala will only support
merge-on-read on the write side (on the read side copy-on-write is
also supported). Also, currently Hive only supports merge-on-read.
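For illustration, upgrading a table to V2 as below also selects the
merge-on-read write mode (the table name is hypothetical, and the
exact set of 'write.*.mode' properties listed in the comment is an
assumption, not confirmed by this commit message):
  ALTER TABLE ice_tbl SET TBLPROPERTIES ('format-version'='2');
  -- presumably equivalent to also setting, unless the user specified
  -- otherwise:
  --   'write.delete.mode'='merge-on-read'
  --   'write.update.mode'='merge-on-read'
  --   'write.merge.mode'='merge-on-read'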
Testing:
* e2e tests added
Change-Id: Icaa32472cde98e21fb23f5461175db1bf401db3d
Reviewed-on: http://gerrit.cloudera.org:8080/19138
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In IMPALA-11492, ExprTest.Utf8MaskTest was failing on some
configurations because the en_US.UTF-8 was missing. Since the
Docker images don't contain en_US.UTF-8, they are subject
to the same bug. This was confirmed by adding test cases
to the test_utf8_strings.py end-to-end test and running it
in the dockerized tests.
This adds the appropriate language pack to the list of packages
installed for the Docker build.
Testing:
- This adds end-to-end tests to test_utf8_strings.py covering the
same cases that were failing in ExprTest.Utf8MaskTest. They
failed without the added language packs, and now succeed.
Change-Id: I353f257b3cb6d45f7d0a28f7d5319fdb457e6e3d
Reviewed-on: http://gerrit.cloudera.org:8080/19080
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Reverts support for o3fs as a default filesystem added in IMPALA-9442.
Updates test setup to use ofs instead.
Munges absolute paths in Iceberg metadata to match the new location
required for ofs. Ozone has strict requirements on volume and bucket
names, so all tables must be created within a bucket (e.g. inside
/impala/test-warehouse/).
Change-Id: I45e90d30b2e68876dec0db3c43ac15ee510b17bd
Reviewed-on: http://gerrit.cloudera.org:8080/19001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A query that returns at most one row can run more efficiently without
result spooling. If result spooling is enabled, it sets the minimum
memory reservation in PlanRootSink, e.g. for 'select 1' the minimum
memory reservation is 4MB.
This optimization can reduce the statement's resource reservation and
prevent the 'Failed to get minimum memory reservation' exception when
the required memory is not available on the host.
Testing:
- Add tests in result-spooling.test
Change-Id: Icd4d73c21106048df68a270cf03d4abd56bd3aac
Reviewed-on: http://gerrit.cloudera.org:8080/18711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Add TBLPROPERTIES support to views; here are some examples:
CREATE VIEW [IF NOT EXISTS] [database_name.]view_name
[(column_name [COMMENT 'column_comment'][, ...])]
[COMMENT 'view_comment']
[TBLPROPERTIES (property_name = property_value, ...)]
AS select_statement;
ALTER VIEW [database_name.]view_name SET TBLPROPERTIES
(property_name = property_value, ...);
ALTER VIEW [database_name.]view_name UNSET TBLPROPERTIES
(property_name, ...);
Change-Id: I8d05bb4ec1f70f5387bb21fbe23f62c05941af18
Reviewed-on: http://gerrit.cloudera.org:8080/18940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds table sampling functionality for Iceberg tables.
From now on it is possible to execute SELECT and COMPUTE STATS
statements with table sampling.
Predicates in the WHERE clause affect the results of table sampling
similarly to how legacy tables work (sampling is applied after static
partition and file pruning).
Sampling is repeatable via the REPEATABLE clause.
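For example, sampling roughly 10% of the table (the table name, the
percentage and the seed are illustrative; the TABLESAMPLE syntax is
assumed to follow the one used for legacy tables):
  SELECT count(*) FROM ice_tbl TABLESAMPLE SYSTEM(10) REPEATABLE(1234);
  COMPUTE STATS ice_tbl TABLESAMPLE SYSTEM(10) REPEATABLE(1234);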
Testing
* planner tests
* e2e tests for V1 and V2 tables
Change-Id: I5de151747c0e9d9379a4051252175fccf42efd7d
Reviewed-on: http://gerrit.cloudera.org:8080/18989
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch we used HMS API alter_table() to update an Iceberg
table's statistics. 'alter_table()' API calls are unsafe for Iceberg
tables as they overwrite the whole HMS table, including the table
property 'metadata_location' which must always point to the latest
snapshot. Hence concurrent modification to the same table could be
reverted by COMPUTE STATS.
In this patch we use the Iceberg API to update Iceberg tables.
Also, table-level stats (e.g. numRows, totalSize, totalFiles) are not
set as Iceberg keeps them up-to-date.
COMPUTE INCREMENTAL STATS without partition clause is the same as
plain COMPUTE STATS for Iceberg tables. This behavior is aligned
with current behavior on non-partitioned tables:
https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html
COMPUTE INCREMENTAL STATS .. PARTITION raises an error.
DROP STATS has also been modified to not drop table-level stats for
HMS-integrated Iceberg tables.
Testing:
* added e2e tests for COMPUTE STATS
* added e2e tests for DROP STATS
* manually tested concurrent Hive INSERT and Impala COMPUTE STATS
using latest Hive
* opened IMPALA-11590 to add automated interop tests with Hive
Change-Id: I46b6e0a5a65e18e5aaf2a007ec0242b28e0fed92
Reviewed-on: http://gerrit.cloudera.org:8080/18995
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds the EXPAND_COMPLEX_TYPES query option to support the
display of complex types in SELECT statements where a star (*)
expression is in the select list. By default, the query option is
disabled. When it's enabled, it changes the behaviour of star expansion
to list all top-level column fields including ones with complex types,
instead of listing only the scalar column fields. Nested complex type
expansion is also supported, e.g. struct.* will enumerate the members
of the struct. Array, map and struct types are supported.
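For example, assuming a hypothetical table tbl with a scalar column id
and a struct column s:
  SET EXPAND_COMPLEX_TYPES=1;
  SELECT * FROM tbl;    -- now also lists the complex-typed column s
  SELECT s.* FROM tbl;  -- enumerates the members of the struct s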
Testing:
- Analyzer tests check select statements when the query option is
enabled or disabled.
- EE tests check the proper complex type deserialization when the query
option is enabled, and the original behaviour when the option is
disabled.
Change-Id: I84b5e5703f9e0ce0f4f8bff83941677dd7489974
Reviewed-on: http://gerrit.cloudera.org:8080/18863
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Non-matching rows from the left side will null out all slots from the
right side in left outer joins. If the right side is a subquery, it
is possible that some returned expressions will be non-NULL even if all
slots are NULL (e.g. constants) - these expressions are wrapped as
IF(TupleIsNull(tids), NULL, expr) to null them in the non-matching
case.
The logic above used to hit a Precondition failure for complex types.
We can safely ignore complex types for now, as currently the only
possible expression that returns a complex type is SlotRef, which
doesn't need to be wrapped. We will have to revisit this once functions
that return complex types are added.
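A minimal sketch of the kind of query involved (table and column names
are hypothetical):
  SELECT t1.id, v.c, v.arr
  FROM t1 LEFT OUTER JOIN
       (SELECT id, 1 AS c, arr FROM t2) v ON t1.id = v.id;
  -- The constant 1 AS c is wrapped as IF(TupleIsNull(...), NULL, 1) so
  -- it becomes NULL for non-matching rows; the complex-typed v.arr is a
  -- SlotRef and needs no wrapping.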
Testing:
- added a regression test and ran it
Change-Id: Iaa8991cd4448d5c7ef7f44f73ee07e2a2b6f37ce
Reviewed-on: http://gerrit.cloudera.org:8080/18954
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Added support for multiple file formats. Previously Impala created a
Scanner class based on the partition's file format; now, in the case of
an Iceberg table, it reads the file format from the file-level
metadata instead.
IcebergScanNode will aggregate file formats as well instead of relying
on partitions, so it can be used for planning.
Testing:
Created a mixed file format table with Hive and added a test for it.
Change-Id: Ifc816595724e8fd2c885c6664f790af61ddf5c07
Reviewed-on: http://gerrit.cloudera.org:8080/18935
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adding support for MAP types in the select list.
An example of how maps are printed:
{"k1":2,"k2":null}
Nested collection types (maps and arrays) are supported in any
combination. However, structs in collections and collections in structs
are not supported.
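For example, a query like the following returns maps in the format
shown above (the table and column names are illustrative):
  SELECT id, int_map FROM complextypestbl;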
Limitations (other than map support) as described in the commit for
IMPALA-9498 still apply, the following are to be implemented later:
- Unify HS2 / Beeswax logic with the way STRUCTs are handled.
This could be done in a "final" logic that can handle
STRUCTs/ARRAYs nested into each other.
- Implement "deep copy" and "deep serialize" for collections in BE.
This would enable all operators, e.g. ORDER BY and UNION.
Testing:
- modified the FE tests that checked that maps were not allowed in the
  select list - now the tests expect maps to be allowed there
- added FE and EE tests involving maps based on the array tests
Change-Id: I921c647f1779add36e7f5df4ce6ca237dcfaf001
Reviewed-on: http://gerrit.cloudera.org:8080/18736
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For Iceberg tables, when one of the following properties is used, the
table is considered to possibly have data outside the table location
directory:
- 'write.object-storage.enabled' is true
- 'write.data.path' is not empty
- 'write.location-provider.impl' is configured
- 'write.object-storage.path'(Deprecated) is not empty
- 'write.folder-storage.path'(Deprecated) is not empty
We should tolerate the situation where the relative path of a data file
cannot be derived from the table location path, and use the absolute
path in that case. E.g. an ETL program may write a table whose Iceberg
metadata is placed in
'hdfs://nameservice_meta/warehouse/hadoop_catalog/ice_tbl/metadata',
whose recent data files are in
'hdfs://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data', and
whose data files from half a year ago are in
's3a://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data'; such a
table should still be queryable by Impala.
Testing:
- added e2e tests
Change-Id: I666bed21d20d5895f4332e92eb30a94fa24250be
Reviewed-on: http://gerrit.cloudera.org:8080/18894
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for reading Iceberg V2 tables that use position
deletes. Equality deletes are still not supported. Position delete
files store the file path and file position of the deleted rows.
When an Iceberg table has position delete files we need to do an
ANTI JOIN between data files and delete files. From the data files
we need to query the virtual columns INPUT__FILE__NAME and
FILE__POSITION, while from the delete files we need the data columns
'file_path' and 'pos'. The latter data columns are not part of the
table schema, so we create a virtual table instance of
'IcebergPositionDeleteTable' that has a table schema corresponding
to the delete files ('file_path', 'pos').
This patch introduces a new class 'IcebergScanPlanner' which is
responsible for planning Iceberg table scans. It creates
the aforementioned ANTI JOIN. Also, if there are data files without
corresponding delete files, we can have a separate SCAN node and its
results would be UNIONed to the rows coming from the ANTI JOIN:
        UNION
       /     \
  SCAN data   ANTI JOIN
             /         \
       SCAN data    SCAN deletes
Some refactorings in the context of this CR:
Predicate pushdown and time travel logic is transferred from
IcebergScanNode to IcebergScanPlanner. Iceberg snapshot summary
retrieval is moved from FeFsTable to FeIcebergTable.
Testing:
* added planner test
* added e2e tests
TODO in follow-up Jiras:
* better cardinality estimates (IMPALA-11516)
* support unrelative collection columns (select item from t.int_array)
(IMPALA-11517)
Currently such queries return an error during analysis
Change-Id: I672cfee18d8e131772d90378d5b12ad4d0f7dd48
Reviewed-on: http://gerrit.cloudera.org:8080/18847
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-11350 implemented the FILE__POSITION virtual column for Parquet
files. This ticket does the same but for ORC files. Note that for full
ACID ORC tables there was already an implementation of row__id that
could simply be reused for this ticket.
Testing:
- TestScannersVirtualColumns.test_virtual_column_file_position_generic
  is changed to run on ORC as well. I don't think further testing
  is required, as this functionality was already there for row__id;
  we just reused it for FILE__POSITION.
Change-Id: Ie8e951f73ceb910d64cd149192853a4a2131f79b
Reviewed-on: http://gerrit.cloudera.org:8080/18909
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9482 added support for the remaining Hive types and removed the
functional.unsupported_types table. There was a remaining reference in
a misc test. test_misc is not marked as exhaustive, but it only runs in
exhaustive builds.
Change-Id: I65b6ea5ac742fbcc427ad41741d347558cb7d110
Reviewed-on: http://gerrit.cloudera.org:8080/18896
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Fix an impalad crash in the method ParquetBoolDecoder::SkipValues when
the parameter 'num_values' is 0. The function should tolerate
'num_values' being 0.
Testing:
- Add e2e tests
Change-Id: I8c4c5a4dff9e9e75913c7b524b4ae70967febb37
Reviewed-on: http://gerrit.cloudera.org:8080/18854
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.
In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.
As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to the column metadata, which allows adding
special handling for BINARY.
Only a few built-ins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as the BINARY
type simply has to be "connected" to the already existing STRING
function's signature. Functions whose result depends on utf8_mode need
to ensure that with BINARY they always work as if utf8_mode=0 (for
example, length() is mapped to bytes(), as length() counts UTF-8
characters if utf8_mode=1).
All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in the case of legacy Hive UDFs it is only supported if the
argument and return types are set explicitly to ensure backward
compatibility.
See IMPALA-11340 for details.
The original plan was to behave as closely to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
leads to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example of the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.
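For illustration (the table and column names are hypothetical):
  INSERT INTO binary_tbl VALUES (1, CAST('abc' AS BINARY));
  -- length() is directly supported for BINARY; other STRING functions
  -- need an explicit cast to STRING first:
  SELECT length(bin_col), upper(CAST(bin_col AS STRING))
  FROM binary_tbl;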
Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
to test scanning.
- Removed functional.unsupported_types and related tests, as now
Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The virtual column FILE__POSITION returns the ordinal position of the
row in the data file. It will be useful for adding support for
Iceberg's position-based delete files.
This patch only adds FILE__POSITION to Parquet tables. It works
similarly to the handling of collection position slots. I.e. we
add the responsibility of dealing with the file position slot to
an existing column reader. Because of page-filtering and late
materialization we already tracked the file position in member
'current_row_' during scanning.
Querying the FILE__POSITION in other file formats raises an error.
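For example (the table name is illustrative):
  SELECT file__position, id FROM parquet_tbl;
  -- returns, for each row, its ordinal position within its data file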
Testing:
* added e2e tests
Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7
Reviewed-on: http://gerrit.cloudera.org:8080/18704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In the case of INSERT INTO iceberg_tbl (col_a, col_b, ...), if the
partition columns of the Iceberg table are not in the column
permutation, we fill the missing partition columns with NullLiteral so
that the data is written to the default partition
'__HIVE_DEFAULT_PARTITION__'.
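For example (the table is hypothetical, partitioned on column p):
  INSERT INTO ice_tbl (id) VALUES (1);
  -- p is not in the column permutation, so it is filled with NULL and
  -- the row is written to '__HIVE_DEFAULT_PARTITION__'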
Testing:
- add e2e tests
Change-Id: I40c733755d65e5c81a12ffe09b6d16ed5d115368
Reviewed-on: http://gerrit.cloudera.org:8080/18790
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch bumps up CDP_BUILD_NUMBER to pick up Hive version
3.1.3000.7.2.16.0-127 which contains:
- thrift-0.16.0 upgrade from HIVE-25635.
- Backport of ORC-517.
This patch also contains a fix for IMPALA-11466 by adding jetty-server
as an allowed dependency.
Testing:
- Built locally and confirmed that the CDP components are downloaded.
Change-Id: Iff5297a48865fb2444e8ef7b9881536dc1bbf63c
Reviewed-on: http://gerrit.cloudera.org:8080/18803
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
Iceberg tables store partition information in manifest files and not in
the file path. This metadata has already been pushed down to the
scanners and this commit uses this metadata to evaluate runtime filters
on Iceberg files.
Performance measurement:
Used TPC-DS Q10 [1] at scale 10 to measure the query performance.
Min/max filters were disabled and the wait time for runtime filters
was increased to 5 seconds. After pre-warming the Catalog I executed
Q10 5 times on my local machine. The fastest execution times were:
Baseline Parquet tables: 1.08s
Baseline Iceberg tables without this patch: 1.43s
Iceberg tables with this patch: 1.09s
Testing:
* Added e2e tests.
* Initial performance test with TPC-DS Q10.
Ref:
[1] TPC-DS Q10:
select cd_gender, cd_marital_status, cd_education_status, count(*) cnt1,
cd_purchase_estimate, count(*) cnt2, cd_credit_rating, count(*) cnt3,
cd_dep_count, count(*) cnt4, cd_dep_employed_count, count(*) cnt5,
cd_dep_college_count, count(*) cnt6
from customer c, customer_address ca, customer_demographics
where c.c_current_addr_sk = ca.ca_address_sk and
ca_county in ('Walker County','Richland County','Gaines County',
'Douglas County','Dona Ana County') and
cd_demo_sk = c.c_current_cdemo_sk and
exists (select *
from store_sales, date_dim
where c.c_customer_sk = ss_customer_sk and
ss_sold_date_sk = d_date_sk and
d_year = 2002 and
d_moy between 4 and 4+3) and
exists (select *
from (select ws_bill_customer_sk as customer_sk, d_year,d_moy
from web_sales, date_dim where ws_sold_date_sk = d_date_sk
and d_year = 2002 and
d_moy between 4 and 4+3
union all
select cs_ship_customer_sk as customer_sk, d_year, d_moy
from catalog_sales, date_dim
where cs_sold_date_sk = d_date_sk and d_year = 2002 and
d_moy between 4 and 4+3
) x
where c.c_customer_sk = customer_sk)
group by cd_gender, cd_marital_status, cd_education_status,
cd_purchase_estimate, cd_credit_rating, cd_dep_count,
cd_dep_employed_count, cd_dep_college_count
order by cd_gender, cd_marital_status, cd_education_status,
cd_purchase_estimate, cd_credit_rating, cd_dep_count,
cd_dep_employed_count, cd_dep_college_count
limit 100;
Change-Id: I7762e1238bdf236b85d2728881a402a2bb41f36a
Reviewed-on: http://gerrit.cloudera.org:8080/18531
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change has been considered only for Iceberg tables mainly for table
maintenance reasons. Iceberg table writes create new snapshots and these
can accumulate over time. This commit allows a simple form of compaction
of these snapshots.
INSERT OVERWRITEs are blocked when partition evolution is in
place, because it would be possible to overwrite a data file with a
newer schema that has fewer columns. This could cause unexpected data
loss.
For bucketed tables, the following syntax is allowed to be executed:
INSERT OVERWRITE ice_tbl SELECT * FROM ice_tbl;
The source and target tables have to be the same and explicitly
specified; only SELECT '*' queries are allowed. These requirements are
also in place to avoid unexpected data loss.
- VALUES clauses are not allowed, because inserting a single record
  could overwrite a whole file in a bucket.
- Only a source table is allowed, because at the time of the insert it
  is unknown which files will be modified, similar to VALUES.
Testing:
- Added e2e tests.
Change-Id: Ibd1bc19d839297246eadeb754cdeeec1e306098a
Reviewed-on: http://gerrit.cloudera.org:8080/18649
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Extended the parser to support the 'STORED BY' keyword for storage
engines, namely Kudu and Iceberg at the moment, to match Hive's
syntax. Furthermore, this lays the groundwork for separating
storage engines from file formats, to be able to specify both with
"STORED BY ... STORED AS ...".
Testing:
Added front-end Parser and Analyzer tests and query tests for table
creation.
Change-Id: Ib677bea8e582bbc01c5fb8c81df57eb60b0ed961
Reviewed-on: http://gerrit.cloudera.org:8080/18743
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Because the column value bounds in the Iceberg metadata are not
necessarily actual min or max values, NOT_IN cannot be answered using
them: for a column with bounds (X, Y), NOT_IN(col, {X, ...}) doesn't
guarantee that X is a value present in col. But the push-down still
works when the column is a partition column, so it's still very
helpful.
Testing:
- add e2e tests
Change-Id: Ib8bdaf6f31a4438e11c4eb27485bb413fe6df9a3
Reviewed-on: http://gerrit.cloudera.org:8080/18760
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Arrays with more than one dimension in the select list tried to
register a CollectionTableRef with the name "item" for the inner
arrays, leading to a name collision if there was more than one such
array.
The logic is changed to always use the full path as the implicit alias
of the CollectionTableRefs backing arrays in the select list.
As a side effect this leads to using the fully qualified names
in expressions in the explain plans of queries that use arrays
from views. This is not an intended change, but I don't consider
it to be critical. Created IMPALA-11452 to deal with more
sophisticated alias handling in collections.
Testing:
- added a new table to testdata and a regression test
Change-Id: I6f2b6cad51fa25a6f6932420eccf1b0a964d5e4e
Reviewed-on: http://gerrit.cloudera.org:8080/18734
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When adding a partition whose location is on a file system different
from the file system of the table location, Impala accepts it. But
when inserting values into the table, catalogd throws an exception.
This patch fixes the issue by using the right FileSystem object.
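For example (table, partition and path are illustrative; the table
location is assumed to be on HDFS):
  ALTER TABLE tbl ADD PARTITION (p=1) LOCATION 's3a://bucket/tbl/p1';
  INSERT INTO tbl PARTITION (p=1) VALUES (1);
  -- the INSERT used to make catalogd throw an exception because the
  -- wrong FileSystem object was used for the partition location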
Testing:
- Added new test case with partitions on different file systems.
Ran the test on S3.
- Did manual tests in cluster with partitions on HDFS and Ozone.
- Passed core test.
Change-Id: I0491ee1bf40c3d5240f9124cef3f3169c44a8267
Reviewed-on: http://gerrit.cloudera.org:8080/18759
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch catalogd always ordered HBase columns
lexicographically by family/qualifier. This is incompatible with other
table formats and the way Hive handles HBase tables, where the order
comes from HMS as defined during CREATE TABLE.
I don't know of any valid reason behind this old behavior; it probably
just made the implementation a bit easier by doing the ordering in the
FE instead of the BE. The BE actually needs this ordering during
scanning, as the HBase API returns results in this order, but this
should have no effect on other parts of Impala.
Added flag use_hms_column_order_for_hbase_tables (used by catalogd)
to decide whether to do this reordering:
- true: keep HMS order
- false: reorder by family/qualifier [default]
The old way is kept as default to avoid breaking existing workloads,
but it would make sense to change it in the next major release.
Note that a query option would be more convenient to use, but it
would be much harder to implement it as the order is decided during
loading in catalogd.
Testing:
- added custom cluster test for
use_hms_column_order_for_hbase_tables = true
Change-Id: Ibc5df8b803f2ae3b93951765326cdaea706e3563
Reviewed-on: http://gerrit.cloudera.org:8080/18635
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The DESCRIBE FORMATTED output shows this even for bucketed Iceberg
tables:
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
We should remove them, and the user should rely on the information in
the '# Partition Transform Information' block instead.
Testing:
- add e2e tests
- tested in a real cluster
Change-Id: Idc156c932780f0f12c935a1a60ff6606d59bb1da
Reviewed-on: http://gerrit.cloudera.org:8080/18735
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
KUDU-2671 added support for custom hash partition specification at the
range partition level. This patch adds CREATE TABLE and ALTER TABLE
syntax to allow Kudu custom hash schema to be specified through Impala.
In addition, a new SHOW HASH SCHEMA statement has been added to allow
display of the hash schema information for each partition.
HASH syntax within a partition is similar to the table-level syntax
except that HASH clauses must follow the PARTITION clause and commas are
not allowed within a partition. These differences were required to keep
the grammar unambiguous and due to limitations of the Java Cup Parser.
To make the grammar more consistent, commas in the table-level
partition spec and between PARTITION clauses are now optional, but
still allowed for backward compatibility.
Example:
CREATE TABLE t1 (id int, c2 int, PRIMARY KEY(id, c2))
PARTITION BY HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4
RANGE (c2)
(
PARTITION 0 <= VALUES < 10
PARTITION 10 <= VALUES < 20
HASH(id) PARTITIONS 2 HASH(c2) PARTITIONS 3
PARTITION 20 <= VALUES < 30
)
STORED AS KUDU;
ALTER TABLE t1 ADD RANGE PARTITION 30 <= VALUES < 40
HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4;
This bumps the toolchain to Kudu githash 43ee785b2d to get
the needed Kudu-side changes.
Testing:
Tests added to kudu_partition_ddl.test
Change-Id: I981056e0827f4957580706d6e73742e4e6743c1c
Reviewed-on: http://gerrit.cloudera.org:8080/18676
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
When external tables are converted to Iceberg, the data files remain
intact and thus miss field IDs. Previously, Impala used name-based
column resolution in this case.
Added a feature to traverse the data files before column
resolution and assign field IDs the same way as Iceberg would, to be
able to use field-ID-based column resolution.
Testing:
The default resolution method was changed to field ID for migrated
tables; existing tests use that from now on.
Added new tests to cover edge cases with complex types and schema
evolution.
Change-Id: I77570bbfc2fcc60c2756812d7210110e8cc11ccc
Reviewed-on: http://gerrit.cloudera.org:8080/18639
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
With PARQUET_LATE_MATERIALIZATION we can set the minimum number of
consecutive rows that, if filtered out, let us avoid materializing the
corresponding rows of the other columns in Parquet.
E.g. if PARQUET_LATE_MATERIALIZATION is 10, and in a filtered column we
find at least 10 consecutive rows that don't pass the predicates we
avoid materializing the corresponding rows in the other columns.
But due to an off-by-one error we actually only needed
(PARQUET_LATE_MATERIALIZATION - 1) consecutive elements. This means
that if we set PARQUET_LATE_MATERIALIZATION to one, then we need zero
consecutive filtered-out elements, which leads to a crash/DCHECK. The
bug is in the GetMicroBatches() algorithm, where we produce the micro
batches based on the selected rows.
Setting PARQUET_LATE_MATERIALIZATION to 0 doesn't make sense so it
shouldn't be allowed.
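A sketch of the intended behavior after this patch (option values are
illustrative):
  SET PARQUET_LATE_MATERIALIZATION=1;  -- used to hit the DCHECK, now works
  SET PARQUET_LATE_MATERIALIZATION=0;  -- no longer allowed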
Testing
* e2e test with PARQUET_LATE_MATERIALIZATION=1
* e2e test for checking SET PARQUET_LATE_MATERIALIZATION=N
Change-Id: I38f95ad48c4ac8c1e06651565ab5c496283b29fa
Reviewed-on: http://gerrit.cloudera.org:8080/18700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit implements cloning between Iceberg tables. Cloning an
Iceberg table from other types of tables is not implemented, because
the data types of Iceberg and Impala do not correspond one to one.
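A minimal sketch, assuming the cloning is expressed with
CREATE TABLE ... LIKE (table names are hypothetical):
  CREATE TABLE new_ice LIKE old_ice;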
Testing:
- e2e tests
Change-Id: I1284b926f51158e221277b18b2e73707e29f86ac
Reviewed-on: http://gerrit.cloudera.org:8080/18658
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Currently, SHOW PARTITIONS on Iceberg tables only outputs the partition
spec, which is not too useful.
Instead it should output the concrete partitions, the number of files,
and the number of rows in each partition. E.g.:
SHOW PARTITIONS ice_ctas_hadoop_tables_part;
'{"d_month":"613"}',4,2
'{"d_month":"614"}',3,1
'{"d_month":"615"}',2,1
Testing:
- Added end-to-end test
Change-Id: I3b4399ae924dadb89875735b12a2f92453b6754c
Reviewed-on: http://gerrit.cloudera.org:8080/18641
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In the FOR SYSTEM_TIME AS OF clause we expect timestamps in the local
timezone, while the error message shows the timestamp in the UTC
timezone. The error message should show the timestamp in the local
timezone.
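For example, a time-travel query of this shape (table name and
timestamp are illustrative):
  SELECT * FROM ice_tbl FOR SYSTEM_TIME AS OF '2022-01-01 10:00:00';
  -- if no snapshot exists at or before this (local) time, the error
  -- message now reports the timestamp in the local timezone as well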
Testing:
- Add e2e test
Change-Id: Iba5d5eb65133f11cc4eb2fc15a19f7b25c14cc46
Reviewed-on: http://gerrit.cloudera.org:8080/18675
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit optimizes plain count(*) queries for Iceberg tables.
When the `org.apache.iceberg.SnapshotSummary#TOTAL_RECORDS_PROP` can be
retrieved from the current `org.apache.iceberg.BaseSnapshot#summary` of
the Iceberg table, this kind of query can be very fast. If this
property cannot be retrieved, the query will aggregate the `num_rows`
of the Parquet `file_metadata_` as usual.
Queries that can be optimized need to meet the following requirements:
- SelectStmt does not have WHERE clause
- SelectStmt does not have GROUP BY clause
- SelectStmt does not have HAVING clause
- The TableRefs of FROM clause contains only one BaseTableRef
- Only for the Iceberg table
- SelectList must contain 'count(*)' or 'count(constant)'
- SelectList can contain other agg functions, e.g. min, sum, etc
- SelectList can contain constant
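For example, a query of this shape meets the requirements above and can
be answered from the snapshot summary (the table name is illustrative):
  SELECT count(*) FROM ice_tbl;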
Testing:
- Added end-to-end test
- Existing tests
- Test it in a real cluster
Change-Id: I8e9c48bbba7ab2320fa80915e7001ce54f1ef6d9
Reviewed-on: http://gerrit.cloudera.org:8080/18574
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the support of struct columns in the select
list to Parquet files as well.
There are some limitations with this patch:
- Dictionary filtering could work when we have conjuncts on a member
  of a struct; however, if this struct is given in the select list
  then dictionary filtering is disabled. The reason is that in
  this case there would be a mismatch between the slot/tuple IDs in
  the conjunct and the ones in the select list, due to the expr
  substitution logic applied when a struct is in the select list.
  Solving this puzzle would be a nice future performance enhancement.
  See IMPALA-11361.
- When structs are read in a batched manner, the struct reader
  delegates the actual reading of the data to the column readers of its
  children; however, it uses the simple ReadValue() on these readers
  instead of the batched version. The reason is that calling the
  batched reader on the member column readers would in fact read in
  batches, but it wouldn't handle the case when the parent struct is
  NULL: it would set only itself to NULL but not the parent struct.
  This might also be a future performance enhancement. See
  IMPALA-11363.
- If there is a struct in the select list then late materialization
  is turned off. The reason is that LM expects the column readers to
  be used through the batched reading interface; however, as said in
  the bullet point above, struct column readers currently use the
  non-batched reading interface of their children. As a result, after
  reading, the column readers are not in the state that SkipRows() of
  LM expects, which results in a query failure because it's not able
  to skip the rows for non-filter readers.
  Once IMPALA-11363 is implemented and the struct also uses the
  ReadValueBatch() interface of its children, late
  materialization could be turned on even if structs are in the
  select list. See IMPALA-11364.
Testing:
- There were a lot of tests already exercising this functionality,
  but they were only run on ORC tables. I changed these to cover
  Parquet tables too.
Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Reviewed-on: http://gerrit.cloudera.org:8080/18596
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Fix some formatting bugs in the DESCRIBE FORMATTED output of Iceberg
tables:
- 'LINE_DELIM' is missing on '# Partition Transform Information'
- The partition transform columns header should be
'col_name,transform_type,NULL'
Testing:
- Existing tests
Change-Id: I991644cefb34decc843a5542b47eaec11d7b6e42
Reviewed-on: http://gerrit.cloudera.org:8080/18634
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive changed behavior in HIVE-24920 to allow external SFS tables to
point to locations in managed tables. test_sfs.py had a test case for
the previous failure, so it broke with this change.
This changes test_sfs.py to verify the new behavior where an external
SFS table can point to a managed location.
Testing:
- Ran test_sfs.py locally with recent GBN
Change-Id: I426aacdba7afba3f3a747a4e02632d15e38c63c9
Reviewed-on: http://gerrit.cloudera.org:8080/18618
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
and COVAR_POP()
CORR() function takes two numeric type columns as arguments and returns
the Pearson's correlation coefficient between them.
COVAR_SAMP() function takes two numeric type columns and returns sample
covariance between them.
COVAR_POP() function takes two numeric type columns and returns
population covariance between them.
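For example (table and column names are illustrative):
  SELECT corr(price, demand),
         covar_samp(price, demand),
         covar_pop(price, demand)
  FROM sales;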
These UDAFs are tested with a few query tests written in aggregation.test.
Change-Id: I32ad627c953ba24d9cde2d5549bdd0d27a9c0d06
Reviewed-on: http://gerrit.cloudera.org:8080/18413
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The expectation for predicates on unnested arrays is that they are
either picked up by the SCAN node or the UNNEST node for evaluation. If
there is only one array being unnested then the SCAN node is
responsible for the evaluation, otherwise the UNNEST node is. However,
if there is a JOIN node involved where the JOIN construction happens
before creating the UNNEST node then the JOIN node incorrectly picks
up the predicates for the unnested arrays as well. This patch is to fix
this behaviour.
Tests:
- Added E2E tests to cover result correctness.
- Added planner tests to verify that the desired node picks up the
predicates for unnested arrays.
Change-Id: I89fed4eef220ca513b259f0e2649cdfbe43c797a
Reviewed-on: http://gerrit.cloudera.org:8080/18614
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-11302: Improve error message for CREATE EXTERNAL TABLE iceberg
command
The error message complained about failed table loading. The new
error message is more precise, and also notifies the user to use
plain 'CREATE TABLE' for creating new Iceberg tables.
IMPALA-11303: Exception is not raised for Iceberg DDL that misses
LOCATION clause
We returned the error in the result summary. Instead of that, we now
raise an error, and also notify the user about how to create
new Iceberg tables.
Testing:
* e2e tests
Change-Id: I659115cc97a1a00e1ddf3fbb7dbe1f286bf1edcf
Reviewed-on: http://gerrit.cloudera.org:8080/18563
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Identity-partitioned columns are not necessarily stored in the data
files. E.g. when we migrate a legacy partitioned table to Iceberg
without rewriting the data files, the partition columns won't be
present in the files.
The Parquet scanner does a few optimizations to eliminate row groups,
i.e. filtering based on stats, bloom filters, etc. When a column that
has a predicate on it is not present in the data file, it is
assumed that the whole row group doesn't pass the filtering criteria.
But for Iceberg some files might contain partition columns while
other files don't, so we need to prepare the scanners to handle
such cases.
The ORC scanner doesn't have that many optimizations, so it didn't
run into this issue.
Testing:
* e2e tests
Change-Id: Ie706317888981f634d792fb570f3eab1ec11a4f4
Reviewed-on: http://gerrit.cloudera.org:8080/18605
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Tamas Mate <tmater@apache.org>
Reviewed-by: <lipenglin@sensorsdata.cn>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive has the virtual column INPUT__FILE__NAME, which returns the name
of the data file that stores the actual row. It can be used in several
ways, see the above two Jira tickets for examples. This virtual column
is also needed to support position-based delete files in Iceberg V2
tables.
This patch also adds the foundations to support further table-level
virtual columns later. Virtual columns are stored at the table level
in a separate list from the table schema. During path resolution
in Path.resolve() we also try to resolve virtual columns. Slot
descriptors also store the information whether they refer to a virtual
column.
Currently we only add the INPUT__FILE__NAME virtual column. The value
of this column can be set in the template tuple of the scanners.
All kinds of operations are possible on this virtual column, users
can invoke additional functions on it, can filter rows, can group by,
etc.
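For example (the table name is illustrative):
  SELECT input__file__name, count(*)
  FROM tbl
  GROUP BY input__file__name;
  -- counts the rows stored in each data file of the table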
Special care is needed for virtual columns when column masking/row
filtering is applicable to them. They are added as "hidden" select
list items to the table masking views, which means they are not
expanded by * expressions. They still need to be included in *
expressions, though, when they come from user-written views.
Testing:
* analyzer tests
* added e2e tests
Change-Id: I498591f1db08a91a5c846df59086d2291df4ff61
Reviewed-on: http://gerrit.cloudera.org:8080/18514
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-10182 fixed the problem of creating inferred predicates when
both sides of an equality predicate came from the same slot.
Inferred predicates also should not be created when both sides
of an equality predicate are constant values which do not have
scan slots.
Change-Id: If1cd4559dda406d2d38703ed594b70b41ed336fd
Reviewed-on: http://gerrit.cloudera.org:8080/18579
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Added a query option and implementation to be able to resolve columns
by name.
Changed the secondary resolution strategy for Iceberg ORC tables to
name-based resolution.
Testing:
Added a new test dimension for ORC tests; added results to the
now-working Iceberg migrated table test.
Change-Id: I29562a059160c19eb58ccea76aa959d2e408f8de
Reviewed-on: http://gerrit.cloudera.org:8080/18397
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
CollectionItemsRead in the runtime profile counts the total number of
nested collection items read by the scan node. Only created for scans
that support nested types, e.g. Parquet or ORC.
Each scanner thread maintains its local counter and merges it into
HdfsScanNode counter for each row batch. However, the local counter in
orc-scanner is uninitialized, leading to weird values. This patch simply
initializes it to 0 and adds test coverage.
Tests:
Add profile verification for this counter on some existing query tests.
Note that there are some implementation differences between the Parquet
and ORC scanners (e.g. in predicate pushdown), so we will see different
counter results in some queries. I just picked some queries that have
consistent counters.
Change-Id: Id7783d1460ac9b98e94d3a31028b43f5a9884f99
Reviewed-on: http://gerrit.cloudera.org:8080/18528
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>