Commit Graph

1630 Commits

Author SHA1 Message Date
Daniel Becker
87e0077255 IMPALA-11734: TestIcebergTable.test_compute_stats fails in RELEASE builds
If the Impala version is set to a release build as described in point 8
in the "How to Release" document
(https://cwiki.apache.org/confluence/display/IMPALA/How+to+Release#HowtoRelease-HowtoVoteonaReleaseCandidate),
TestIcebergTable.test_compute_stats fails:

Stacktrace
query_test/test_iceberg.py:852: in test_compute_stats
    self.run_test_case('QueryTest/iceberg-compute-stats', vector, unique_database)
common/impala_test_suite.py:742: in run_test_case
    self.__verify_results_and_errors(vector, test_section, result, use_db)
common/impala_test_suite.py:578: in __verify_results_and_errors
    replace_filenames_with_placeholder)
common/test_result_verifier.py:469: in verify_raw_results
    VERIFIER_MAP[verifier](expected, actual)
common/test_result_verifier.py:278: in verify_query_result_is_equal
    assert expected_results == actual_results
E   assert Comparing QueryTestResults (expected vs actual):
E   2,1,'2.33KB','NOT CACHED','NOT CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/test_compute_stats_74dbc105.db/ice_alltypes'
E   != 2,1,'2.32KB','NOT CACHED','NOT CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/test_compute_stats_74dbc105.db/ice_alltypes'

The problem is the file size, which is 2.32KB instead of 2.33KB. This is
because the version is written into the file, and "x.y.z-RELEASE" is one
byte shorter than "x.y.z-SNAPSHOT". The size of the file in this test is
on the boundary between 2.32KB and 2.33KB, so this one byte can change
the value.

This change fixes the problem by using a regex to accept both values so
it works for both snapshot and release versions.
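
A minimal sketch of the fix in the .test file, assuming the row_regex
matcher of Impala's test framework (row contents abbreviated for
illustration):

row_regex: 2,1,'2.3[23]KB','NOT CACHED','NOT CACHED','PARQUET','false','$NAMENODE/.*'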

Change-Id: Ia1fa12eebf936ec2f4cc1d5f68ece2c96d1256fb
Reviewed-on: http://gerrit.cloudera.org:8080/19260
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-21 16:15:00 +00:00
Daniel Becker
03c465dac1 IMPALA-11719: Inconsistency in printing NULL values
NULL values are printed as "NULL" if they are top level or in
collections, but as "null" in structs. We should print collections and
structs in JSON form, so it should be "null" in collections, too. Hive
also follows the latter (correct) approach.

This commit changes the printing of NULL values to "null" in
collections.
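
A sketch of the behaviour change (table and column are hypothetical):

select int_array from tbl;
-- before: [1,NULL,3]
-- after:  [1,null,3]   (a top-level NULL is still printed as "NULL")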

Testing:
 - Modified the tests to expect "null" instead of "NULL" in collections.

Change-Id: Ie5e7f98df4014ea417ddf73ac0fb8ec01ef655ba
Reviewed-on: http://gerrit.cloudera.org:8080/19236
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
2022-11-15 09:24:03 +00:00
Peter Rozsa
97a506c656 IMPALA-11537: Query option validation for numeric types
This change adds a more generic approach to validate numeric
query options and report parse and validation errors.
Supported types: integers, floats, memory specifications.
Range and bound validator helper functions are added to unify
validation at call sites.
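
For illustration, invalid settings like these are now rejected with the
unified parse/range errors (a sketch; exact messages may differ):

SET MEM_LIMIT='10potatoes';  -- parse error: not a valid memory specification
SET BATCH_SIZE=-1;           -- validation error: value out of range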

Testing:
 - Error messages became more generic, so the existing tests
   around query options were updated to match them

Change-Id: Ia7757b52393c094d2c661918d73cbfad7214f855
Reviewed-on: http://gerrit.cloudera.org:8080/19096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-11 22:02:16 +00:00
Michael Smith
15b07ff1fb IMPALA-11704: (Addendum) fix crash on open for HDFS cache
When trying to read from HDFS cache, ReadFromCache calls
FileReader::Open(false) to force the file to open. The prior commit for
IMPALA-11704 didn't allow for that case when using a data cache, as the
data cache check would always happen. This resulted in a crash calling
CachedFile as exclusive_hdfs_fh_ was nullptr. Tests only catch this when
reading from HDFS cache with data cache enabled.

Replaces explicit arguments to override FileReader behavior with a flag
to communicate whether FileReader supports delayed open. Then the caller
can choose whether to call Open before read. Also simplifies calls to
ReadFromPos as it already has a pointer to ScanRange and can check
whether file handle caching is enabled directly. The Open call in
DoInternalRead casts a slightly wider net by only checking UseDataCache;
if the data cache is unavailable or the lookup is a miss, the file will
then be opened.

Adds a select from tpch.nation to the query for test_data_cache.py as
something that triggers checking the HDFS cache.

Change-Id: I741488d6195e586917de220a39090895886a2dc5
Reviewed-on: http://gerrit.cloudera.org:8080/19228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-11 17:53:58 +00:00
LPL
f617e36487 IMPALA-11711: Virtual columns should be skipped in 'FileMetadataUtils::AddIcebergColumns'
In the 'FileMetadataUtils::AddIcebergColumns' method, virtual columns
should be skipped directly. Otherwise, when we query an Iceberg v2
table whose first column is a partition column of BOOLEAN type, a wrong
position-delete result may be given.
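
A sketch of the affected table shape (names are hypothetical, and the
identity-partitioning syntax below is an assumption):

CREATE TABLE ice_v2 (b BOOLEAN, i INT)
PARTITIONED BY SPEC (b)
STORED AS ICEBERG
TBLPROPERTIES ('format-version'='2');
-- Position-delete results against such a table could be wrong before
-- the fix.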

Testing:
 - Add e2e tests
 - Locally tested the results on position-delete Iceberg tables

Change-Id: I58faf3df6ae8a5bcabb1d2ac9f11a6fbcd74bc24
Reviewed-on: http://gerrit.cloudera.org:8080/19223
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-10 00:51:50 +00:00
Fang-Yu Rao
4e6692b024 IMPALA-11686: Fix test_corrupt_stat after IMPALA-11666
IMPALA-11666 revised the message in the query plans when there are
potentially corrupt statistics, which broke test_corrupt_stat, an E2E
test only run in the exhaustive tests. This patch fixes the test file
accordingly.

Testing:
 - Verified locally that the patch passes test_corrupt_stat.

Change-Id: I817c7807a07bb89b93d795bce958b9872eff2eef
Reviewed-on: http://gerrit.cloudera.org:8080/19224
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-09 05:15:14 +00:00
Daniel Becker
77dc20264c IMPALA-11687: Select * with EXPAND_COMPLEX_TYPES=1 and explicit complex types fails
If EXPAND_COMPLEX_TYPES is set to true, some queries that combine star
expressions and explicitly given complex columns fail:

select outer_struct, * from
functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: Illegal reference to non-materialized
slot: tid=1 sid=1

select *, outer_struct.str from
functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: null

Having two stars in a query on a table with complex columns also fails.

select *, * from functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: Illegal reference to non-materialized
slot: tid=6 sid=13

The error is because of this line in 'SelectStmt.addStarResultExpr()':
8e350d0a8a/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java (L811)
What we want to do is create 'SlotRef's for the struct children
(recursively) but 'reExpandStruct()' also creates new 'SlotDescriptor's
for the children. The new 'SlotDescriptor's are redundant and are not
inserted into the tree which can leave them unmaterialised or without a
correct memory layout.

The solution is to only create the 'SlotRef's for the struct children
without creating new 'SlotDescriptor's. This leads us to another
problem:
 - for structs, it is 'SlotRef.analyzeImpl()' that creates the child
   'SlotRef's
 - the constructor 'SlotRef(SlotDescriptor desc)' sets 'isAnalyzed_' to
   true.
Before structs were allowed, this was correct but now struct-typed
'SlotRef's created with the above constructor are counted as analysed
but lack child expressions, which would have been added if 'analyze()'
had been called on them. This essentially violates the contract of this
constructor.

This commit modifies 'SlotRef(SlotDescriptor desc)' so that child
expressions are generated for structs, restoring the correct semantics
of this constructor. After this, it is no longer necessary to call
'reExpandStruct()' in 'SelectStmt.addStarResultExpr()'.

Testing:
 - Added the failing test cases and a few variations of them to
   nested-types-star-expansion.test

Change-Id: Ia8cf53b0a7409faca668713228bfef275f3833f9
Reviewed-on: http://gerrit.cloudera.org:8080/19171
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-04 04:33:02 +00:00
Fang-Yu Rao
20c0de1017 IMPALA-10436: Support storage handler privileges for external Kudu table creation
This patch lowers the privilege requirement for external Kudu table
creation. Before this patch, a user was required to have the ALL
privilege on SERVER if the user wanted to create an external Kudu table.
In this patch we introduce a new type of resources called storage
handler URI and a new access type called RWSTORAGE that will be
supported by Apache Ranger once RANGER-3281 is resolved, which in turn
depends on the release of Apache Hive 4.0 that consists of HIVE-24705.

Specifically, after this patch, a user will be allowed to create an
external Kudu table as long as the user is granted the RWSTORAGE
privilege on the resource specified by a storage handler URI that points
to an existing Kudu table.

For instance, in order for a user 'non_owner' to create an external Kudu
table based on an existing Kudu table 'impala::tpch_kudu.nation', it
suffices to execute the following command as an administrator to grant
the necessary privilege to the requesting user, where "localhost" is the
default address of the Kudu master host, assuming there is only a single
master host in this example.

GRANT RWSTORAGE ON
STORAGEHANDLER_URI 'kudu://localhost/impala::tpch_kudu.nation'
TO USER non_owner

One may wonder why we do not simply drop the privilege check that
required the ALL privilege on SERVER for external Kudu table creation.
One scenario in which such relaxation is not secure is when
the owner or the creator of the existing Kudu table is different from
the requesting user who wants to create an external Kudu table in
Impala. Not requiring any additional privilege check would allow a user
without any privilege to retrieve the contents of the existing Kudu
table.

On the other hand, after this patch we still require a user to have the
ALL privilege on SERVER when the table property of
'kudu.master_addresses' is specified in a query that tries to create a
Kudu table, whether or not the table is external. To be more specific,
the user 'non_owner' would be able to create an external Kudu table
using the following statement once granted the RWSTORAGE privilege on
the storage handler URI specified above.

CREATE EXTERNAL TABLE default.kudu_tbl STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='impala::tpch_kudu.nation')

However, the following query submitted by the same user would be
rejected due to the user 'non_owner' not being granted the ALL privilege
on SERVER.

CREATE EXTERNAL TABLE default.kudu_tbl STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='impala::tpch_kudu.nation',
'kudu.master_addresses'='localhost')

We do not relax this requirement because specifying the addresses of
the Kudu master hosts to connect to should still be considered an
administrative operation.

Testing:
 - Added various FE and E2E tests to verify Impala's behavior after this
   patch with respect to external Kudu table creation.
 - Verified that this patch passes the core tests in the DEBUG build.

Change-Id: I7936e1d8c48696169f7ad7ad92abe44a26eea3c4
Reviewed-on: http://gerrit.cloudera.org:8080/17640
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-03 05:48:17 +00:00
Csaba Ringhofer
a983a347a7 IMPALA-11682: Add tests for minor compacted insert only ACID tables
Only test changes. Minor compacted delta dirs have been supported in
Impala since IMPALA-9512, but at that time Hive supported minor
compaction only on full ACID tables. Since then, Hive has added
support for minor compaction of insert-only/MM tables (HIVE-22610).
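
For illustration, the new tests cover tables compacted on the Hive side
roughly like this (table name hypothetical):

-- In Hive, against an insert-only (MM) transactional table:
ALTER TABLE mm_tbl COMPACT 'minor';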

Change-Id: I7159283f3658f2119d38bd3393729535edd0a76f
Reviewed-on: http://gerrit.cloudera.org:8080/19164
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-03 00:52:08 +00:00
Zoltan Borok-Nagy
301c3cebad IMPALA-11591: Avoid calling planFiles() on Iceberg tables
Iceberg's planFiles() API is very expensive as it needs to read all
the relevant manifest files. It's especially expensive on object
stores like S3.

When there are no predicates on the table and we are not doing
time travel, it's possible to avoid calling planFiles() and do the
scan planning from cached metadata. When none of the predicates are
on partition columns there's little benefit in pushing down predicates
to Iceberg. So with this patch we only push down predicates (and
hence invoke planFiles()) when at least one of the predicates is
on a partition column.
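
For illustration (table and column names are hypothetical):

-- No predicate on a partition column: planned from cached metadata,
-- planFiles() can be skipped:
SELECT count(*) FROM ice_tbl WHERE col = 10;
-- Predicate on a partition column: pushed down to Iceberg, which
-- invokes planFiles():
SELECT count(*) FROM ice_tbl WHERE part_col = 'a';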

This patch introduces a new class to store content files:
IcebergContentFileStore. It separates data, delete, and "old" content
files. "Old" content files are the ones that are not part of the current
snapshot. We add such data files during time travel. Storing "old"
content files in a separate concurrent hash map also fixes a concurrency
bug in the current code.

Testing:
 * executed current e2e tests
 * updated predicate push down tests

Change-Id: Iadb883a28602bb68cf4f61e57cdd691605045ac5
Reviewed-on: http://gerrit.cloudera.org:8080/19043
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-02 22:14:47 +00:00
LPL
3973fc6d09 IMPALA-11608: Fix SHOW TABLE STATS iceberg_tbl shows wrong number of files
Impala's SHOW TABLE STATS outputs a wrong value for the number of files
for Iceberg tables. It should only count the data files and delete
files, but it counts all files under the table directory, including
metadata files, orphaned files, and old data files not belonging to the
current snapshot.
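
For illustration (table name hypothetical):

SHOW TABLE STATS ice_tbl;
-- The #Files column now counts only the data and delete files of the
-- current snapshot.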

Testing:
 - add e2e tests

Change-Id: I110e5e13cec3aa898f115e1ed795ce98e68ef06c
Reviewed-on: http://gerrit.cloudera.org:8080/19150
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-21 22:06:07 +00:00
TheOmid
e327a28757 IMPALA-6684: Fix untracked memory in KRPC
During serialization of a row batch header, a tuple_data_ is created
which holds the compressed tuple data for an outbound row batch.
We would like this tuple data to be trackable, as it is responsible for
a significant portion of the untracked memory from the KRPC data stream
sender. By using MemTrackerAllocator, we can allocate tuple data and
compression scratch and account for them in the memory tracker of the
KrpcDataStreamSender. This solution changes the type of tuple data
and compression scratch from std::string to TrackedString, an
std::basic_string with MemTrackerAllocator as the custom allocator.

This patch adds memory estimation in DataStreamSink.java to account
for OutboundRowBatch memory allocation. This patch also removes the
thrift-based serialization because the thrift RPC has been removed
in the prior commit.

Testing:
 - Passed core tests.
 - Ran a single node benchmark which shows no regression.
 - Updated row-batch-serialize-test and row-batch-serialize-benchmark
   to test the row-batch serialization used by KRPC.
 - Manually collected query profiles, heap growth, and memory usage logs
   showing untracked memory decreased by half.
 - Added test_datastream_sender.py to verify the peak memory of the
   EXCHANGE SENDER node.
 - Raised mem_limit in two of the test_spilling_large_rows test cases.
 - Printed the test line number in PlannerTestBase.java.

New row-batch serialization benchmark:

Machine Info: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
serialize:            10%   50%   90%     10%     50%     90%
                                        (rel)   (rel)   (rel)
-------------------------------------------------------------
   ser_no_dups_base  18.6  18.8  18.9      1X      1X      1X
        ser_no_dups  18.5  18.5  18.8  0.998X  0.988X  0.991X
   ser_no_dups_full  14.7  14.8  14.8  0.793X   0.79X  0.783X

  ser_adj_dups_base  28.2  28.4  28.8      1X      1X      1X
       ser_adj_dups  68.9  69.1  69.8   2.44X   2.43X   2.43X
  ser_adj_dups_full  56.2  56.7  57.1   1.99X      2X   1.99X

      ser_dups_base  20.7  20.9  20.9      1X      1X      1X
           ser_dups  20.6  20.8  20.9  0.994X  0.995X      1X
      ser_dups_full  39.8    40  40.5   1.93X   1.92X   1.94X

deserialize:          10%   50%   90%     10%     50%     90%
                                        (rel)   (rel)   (rel)
-------------------------------------------------------------
 deser_no_dups_base  75.9  76.6    77      1X      1X      1X
      deser_no_dups  74.9  75.6    76  0.987X  0.987X  0.987X

deser_adj_dups_base   127   128   129      1X      1X      1X
     deser_adj_dups   179   193   195   1.41X   1.51X   1.51X

    deser_dups_base   128   128   129      1X      1X      1X
         deser_dups   165   190   193   1.29X   1.48X   1.49X

Change-Id: I2ba2b907ce4f275a7a1fb8cf75453c7003eb4b82
Reviewed-on: http://gerrit.cloudera.org:8080/18798
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-17 21:59:57 +00:00
Zoltan Borok-Nagy
dc912f016e IMPALA-11655: Impala should set write mode "merge-on-read" by default
Similarly to HIVE-26596 Impala should set merge-on-read write mode
for V2 tables, unless otherwise specified:

* during table creation with 'format-version'='2'
* during alter table set tblproperties 'format-version'='2'

We do so because in the foreseeable future Impala will only support
merge-on-read on the write side (on the read side, copy-on-write is
also supported). Also, Hive currently only supports merge-on-read.
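
A sketch (table name hypothetical; the exact property names set are an
assumption based on the standard Iceberg write modes):

CREATE TABLE ice_v2 (i INT)
STORED AS ICEBERG
TBLPROPERTIES ('format-version'='2');
-- 'write.delete.mode'='merge-on-read' is now set implicitly unless the
-- user specifies otherwise.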

Testing:
 * e2e tests added

Change-Id: Icaa32472cde98e21fb23f5461175db1bf401db3d
Reviewed-on: http://gerrit.cloudera.org:8080/19138
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-17 19:37:15 +00:00
Joe McDonnell
11e66523d6 IMPALA-11526: Install en_US.UTF-8 locale into docker images
In IMPALA-11492, ExprTest.Utf8MaskTest was failing on some
configurations because the en_US.UTF-8 locale was missing. Since the
Docker images don't contain en_US.UTF-8, they are subject
to the same bug. This was confirmed by adding test cases
to the test_utf8_strings.py end-to-end test and running it
in the dockerized tests.

This adds the appropriate language pack to the list of packages
installed for the Docker build.

Testing:
 - This adds end-to-end tests to test_utf8_strings.py covering the
   same cases that were failing in ExprTest.Utf8MaskTest. They
   failed without the added language packs, and now succeed.

Change-Id: I353f257b3cb6d45f7d0a28f7d5319fdb457e6e3d
Reviewed-on: http://gerrit.cloudera.org:8080/19080
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
2022-10-11 20:30:50 +00:00
Michael Smith
3577030df6 IMPALA-11562: Revert support for o3fs as default filesystem
Reverts support for o3fs as a default filesystem added in IMPALA-9442.
Updates test setup to use ofs instead.

Munges absolute paths in Iceberg metadata to match the new location
required for ofs. Ozone has strict requirements on volume and bucket
names, so all tables must be created within a bucket (e.g. inside
/impala/test-warehouse/).

Change-Id: I45e90d30b2e68876dec0db3c43ac15ee510b17bd
Reviewed-on: http://gerrit.cloudera.org:8080/19001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-28 22:35:48 +00:00
xqhe
d47d305bf4 IMPALA-11418: A statement that returns at most one row need not spool results
A query that returns at most one row can run more efficiently without
result spooling. If result spooling is enabled, it sets a minimum
memory reservation in PlanRootSink, e.g. for 'select 1' the minimum
memory reservation is 4MB.

This optimization can reduce the statement's resource reservation and
prevent the exception 'Failed to get minimum memory reservation' when
that much memory is not available on the host.
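
Statements like these are now planned without result spooling (table
name hypothetical):

SELECT 1;
SELECT count(*) FROM tbl;  -- ungrouped aggregate: at most one row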

Testing:
- Add tests in result-spooling.test

Change-Id: Icd4d73c21106048df68a270cf03d4abd56bd3aac
Reviewed-on: http://gerrit.cloudera.org:8080/18711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-28 05:46:03 +00:00
xiabaike
a377662e94 IMPALA-11420: Support CREATE/ALTER VIEW SET/UNSET TBLPROPERTIES
Add TBLPROPERTIES support to views; here are some examples:
  CREATE VIEW [IF NOT EXISTS] [database_name.]view_name
    [(column_name [COMMENT 'column_comment'][, ...])]
    [COMMENT 'view_comment']
    [TBLPROPERTIES (property_name = property_value, ...)]
    AS select_statement;

  ALTER VIEW [database_name.]view_name SET TBLPROPERTIES
    (property_name = property_value, ...);

  ALTER VIEW [database_name.]view_name UNSET TBLPROPERTIES
    (property_name, ...);

Change-Id: I8d05bb4ec1f70f5387bb21fbe23f62c05941af18
Reviewed-on: http://gerrit.cloudera.org:8080/18940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-27 13:27:41 +00:00
Zoltan Borok-Nagy
b91aa06537 IMPALA-11582: Implement table sampling for Iceberg tables
This patch adds table sampling functionality for Iceberg tables.
From now on it's possible to execute SELECT and COMPUTE STATS
statements with table sampling.

Predicates in the WHERE clause affect the results of table sampling
similarly to how legacy tables work (sampling is applied after static
partition and file pruning).

Sampling is repeatable via the REPEATABLE clause.
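
For illustration (table name hypothetical):

SELECT count(*) FROM ice_tbl TABLESAMPLE SYSTEM(10) REPEATABLE(1234);
COMPUTE STATS ice_tbl TABLESAMPLE SYSTEM(20) REPEATABLE(1234);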

Testing
 * planner tests
 * e2e tests for V1 and V2 tables

Change-Id: I5de151747c0e9d9379a4051252175fccf42efd7d
Reviewed-on: http://gerrit.cloudera.org:8080/18989
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-26 15:49:22 +00:00
Zoltan Borok-Nagy
3f382b7ebb IMPALA-11583: Use Iceberg API to update stats
Before this patch we used HMS API alter_table() to update an Iceberg
table's statistics. 'alter_table()' API calls are unsafe for Iceberg
tables as they overwrite the whole HMS table, including the table
property 'metadata_location' which must always point to the latest
snapshot. Hence concurrent modification to the same table could be
reverted by COMPUTE STATS.

In this patch we use the Iceberg API to update Iceberg tables.
Also, table-level stats (e.g. numRows, totalSize, totalFiles) are not
set, as Iceberg keeps them up-to-date.

COMPUTE INCREMENTAL STATS without partition clause is the same as
plain COMPUTE STATS for Iceberg tables. This behavior is aligned
with current behavior on non-partitioned tables:
https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html

COMPUTE INCREMENTAL STATS .. PARTITION raises an error.

DROP STATS has also been modified to not drop table-level stats for
HMS-integrated Iceberg tables.

Testing:
 * added e2e tests for COMPUTE STATS
 * added e2e tests for DROP STATS
 * manually tested concurrent Hive INSERT and Impala COMPUTE STATS
   using latest Hive
 * opened IMPALA-11590 to add automated interop tests with Hive

Change-Id: I46b6e0a5a65e18e5aaf2a007ec0242b28e0fed92
Reviewed-on: http://gerrit.cloudera.org:8080/18995
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-22 18:33:25 +00:00
Peter Rozsa
04b5319e6e IMPALA-9499: Display support for all complex types in a SELECT * query
This change adds the EXPAND_COMPLEX_TYPES query option to support the
display of complex types in SELECT statements where a star (*)
expression is in the select list. By default, the query option is
disabled. When it's enabled, it changes the behaviour of star expansion
to list all top-level columns including ones with complex types,
instead of listing the scalar columns only. Nested complex type
expansion is also supported, e.g. struct.* will enumerate the members
of the struct. Array, map and struct types are supported.
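
For illustration:

SET EXPAND_COMPLEX_TYPES=1;
-- Star expansion now also lists struct, array and map columns:
SELECT * FROM functional_orc_def.complextypes_nested_structs;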

Testing:
 - Analyzer tests check select statements when the query option is
   enabled or disabled.
 - EE tests check the proper complex type deserialization when the query
   option is enabled, and the original behaviour when the option is
   disabled.

Change-Id: I84b5e5703f9e0ce0f4f8bff83941677dd7489974
Reviewed-on: http://gerrit.cloudera.org:8080/18863
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-09 22:43:33 +00:00
Csaba Ringhofer
2e1ce445b2 IMPALA-11567: Fix left outer join if the right side is subquery with complex type
Non-matching rows from the left side will null out all slots from the
right side in left outer joins. If the right side is a subquery, it
is possible that some returned expressions will be non-NULL even if all
slots are NULL (e.g. constants) - these expressions are wrapped as
IF(TupleIsNull(tids), NULL, expr) to null them in the non-matching
case.

The logic above used to hit a precondition for complex types. We can
safely ignore complex types for now, as currently the only possible
expression that returns a complex type is SlotRef, which doesn't
need to be wrapped. We will have to revisit this once functions are
added that return complex types.
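
For illustration, a query shape that used to hit the precondition
(names are hypothetical):

SELECT t.id, v.int_arr
FROM small_tbl t
LEFT OUTER JOIN (SELECT id, int_array_col AS int_arr FROM big_tbl) v
  ON t.id = v.id;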

Testing:
- added a regression test and ran it

Change-Id: Iaa8991cd4448d5c7ef7f44f73ee07e2a2b6f37ce
Reviewed-on: http://gerrit.cloudera.org:8080/18954
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-09 14:42:41 +00:00
Gergely Fürnstáhl
f598b2ad68 IMPALA-10610: Support multiple file formats in a single Iceberg Table
Added support for multiple file formats. Previously Impala created a
Scanner class based on the partition's file format; now, in the case
of an Iceberg table, it reads the file format from the file-level
metadata instead.

IcebergScanNode aggregates file formats as well, instead of relying
on partitions, so it can be used for planning.

Testing:

Created a mixed file format table with Hive and added a test for it.

Change-Id: Ifc816595724e8fd2c885c6664f790af61ddf5c07
Reviewed-on: http://gerrit.cloudera.org:8080/18935
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-08 18:03:37 +00:00
Daniel Becker
37f44a58f3 IMPALA-10918: Allow map type in SELECT list
Adding support for MAP types in the select list.
An example of how maps are printed:
{"k1":2,"k2":null}

Nested collection types (maps and arrays) are supported in any
combination. However, structs in collections and collections in structs
are not supported.

Limitations (other than map support) as described in the commit for
IMPALA-9498 still apply, the following are to be implemented later:
- Unify HS2 / Beeswax logic with the way STRUCTs are handled.
  This could be done in a "final" logic that can handle
  STRUCTS/ARRAYS nested to each other
- Implement "deep copy" and "deep serialize" for collections in BE.
  This would enable all operators, e.g. ORDER BY and UNION.

Testing:
 - modified the FE tests that checked that maps were not allowed in the
   select list - now the tests expect maps to be allowed there
 - added FE and EE tests involving maps based on the array tests

Change-Id: I921c647f1779add36e7f5df4ce6ca237dcfaf001
Reviewed-on: http://gerrit.cloudera.org:8080/18736
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-07 19:55:43 +00:00
LPL
cc26f345a4 IMPALA-11507: Use absolute_path when Iceberg data files are outside of the table location
For Iceberg tables, when one of the following properties is used, the
table is considered as possibly having data outside of the table
location directory:
- 'write.object-storage.enabled' is true
- 'write.data.path' is not empty
- 'write.location-provider.impl' is configured
- 'write.object-storage.path'(Deprecated) is not empty
- 'write.folder-storage.path'(Deprecated) is not empty

We should tolerate the situation where the relative path of a data
file cannot be derived from the table location path, and use the
absolute path in that case. E.g. an ETL program may place the metadata
of an Iceberg table in
'hdfs://nameservice_meta/warehouse/hadoop_catalog/ice_tbl/metadata',
the recent data files in
'hdfs://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data', and
the data files from half a year ago in
's3a://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data'; such a
table should still be queryable normally by Impala.

Testing:
 - added e2e tests

Change-Id: I666bed21d20d5895f4332e92eb30a94fa24250be
Reviewed-on: http://gerrit.cloudera.org:8080/18894
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-06 18:35:30 +00:00
Zoltan Borok-Nagy
73da4d7ddf IMPALA-11484: Create SCAN plan for Iceberg V2 position delete tables
This patch adds support for reading Iceberg V2 tables that use
position deletes. Equality deletes are still not supported. Position
delete files store the file path and file position of the deleted rows.

When an Iceberg table has position delete files we need to do an
ANTI JOIN between data files and delete files. From the data files
we need to query the virtual columns INPUT__FILE__NAME and
FILE__POSITION, while from the delete files we need the data columns
'file_path' and 'pos'. The latter data columns are not part of the
table schema, so we create a virtual table instance of
'IcebergPositionDeleteTable' that has a table schema corresponding
to the delete files ('file_path', 'pos').

This patch introduces a new class 'IcebergScanPlanner' which has
the responsibility of doing a plan for Iceberg table scans. It creates
the aforementioned ANTI JOIN. Also, if there are data files without
corresponding delete files, we can have a separate SCAN node and its
results would be UNIONed to the rows coming from the ANTI JOIN:

              UNION
             /     \
    SCAN data       ANTI JOIN
                     /      \
              SCAN data    SCAN deletes

Some refactorings in the context of this CR:
Predicate pushdown and time travel logic is transferred from
IcebergScanNode to IcebergScanPlanner. Iceberg snapshot summary
retrieval is moved from FeFsTable to FeIcebergTable.

Testing:
 * added planner test
 * added e2e tests

TODO in follow-up Jiras:
 * better cardinality estimates (IMPALA-11516)
 * support unrelative collection columns (select item from t.int_array)
   (IMPALA-11517)
   Currently such queries return error during analysis

Change-Id: I672cfee18d8e131772d90378d5b12ad4d0f7dd48
Reviewed-on: http://gerrit.cloudera.org:8080/18847
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-01 16:51:17 +00:00
Gabor Kaszab
fec7a79c50 IMPALA-11529: FILE__POSITION virtual column for ORC tables
IMPALA-11350 implemented the FILE__POSITION virtual column for Parquet
files. This ticket does the same but for ORC files. Note that for full
ACID ORC tables there has already been an implementation of row__id
that could simply be re-used for this ticket.
Testing:
 - TestScannersVirtualColumns.test_virtual_column_file_position_generic
   is changed to now run on ORC as well. I don't think further testing
   is required, as this functionality has already been there for
   row__id; we just re-used it for FILE__POSITION.

Change-Id: Ie8e951f73ceb910d64cd149192853a4a2131f79b
Reviewed-on: http://gerrit.cloudera.org:8080/18909
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-30 16:45:24 +00:00
Tamas Mate
423b087762 IMPALA-11520: Remove functional.unsupported_types misc test
IMPALA-9482 added support for the remaining Hive types and removed the
functional.unsupported_types table. There was a reference remaining in a
misc test. test_misc is not marked as exhaustive, but it only runs in
exhaustive builds.

Change-Id: I65b6ea5ac742fbcc427ad41741d347558cb7d110
Reviewed-on: http://gerrit.cloudera.org:8080/18896
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-25 16:24:41 +00:00
Tamas Mate
40b9b9cd75 IMPALA-11521: Fix test_binary_type
Fix a test_binary_type typo; it should not reference an HBase table.

Change-Id: Id41049094b632af6326f6ee9f3886577d1fc5ee6
Reviewed-on: http://gerrit.cloudera.org:8080/18895
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-25 15:12:38 +00:00
LPL
d7ecc11149 IMPALA-11500: Fix Impalad crash in ParquetBoolDecoder::SkipValues when num_values is 0
Fix an Impalad crash in the method ParquetBoolDecoder::SkipValues when
the parameter 'num_values' is 0. The function should tolerate
'num_values' being 0.

Testing:
 - Add e2e tests

Change-Id: I8c4c5a4dff9e9e75913c7b524b4ae70967febb37
Reviewed-on: http://gerrit.cloudera.org:8080/18854
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-19 19:11:58 +00:00
Csaba Ringhofer
7ca11dfc7f IMPALA-9482: Support for BINARY columns
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows adding special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can only be used by casting to STRING first.
Adding support for more of these functions is very easy: the BINARY
type simply has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY they always work as if utf8_mode=0 (for
example, length() is mapped to bytes(), as length() counts UTF-8
characters if utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in the case of legacy Hive UDFs it is only supported if the
argument and return types are set explicitly, to ensure backward
compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.
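
For illustration (table and column names are hypothetical):

-- Explicit casts are required between STRING and BINARY:
INSERT INTO binary_tbl VALUES (CAST('abc' AS BINARY));
SELECT length(bin_col), CAST(bin_col AS STRING) FROM binary_tbl;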

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added for the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-19 13:55:42 +00:00
Zoltan Borok-Nagy
522ee1fcc0 IMPALA-11350: Add virtual column FILE__POSITION for Parquet tables
Virtual column FILE__POSITION returns the ordinal position of the row
in the data file. It will be useful to add support for Iceberg's
position-based delete files.

This patch only adds FILE__POSITION to Parquet tables. It works
similarly to the handling of collection position slots. I.e. we
add the responsibility of dealing with the file position slot to
an existing column reader. Because of page-filtering and late
materialization we already tracked the file position in member
'current_row_' during scanning.

Querying the FILE__POSITION in other file formats raises an error.
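
For illustration (table name hypothetical):

SELECT file__position, id FROM parquet_tbl;
-- file__position is the row's ordinal position within its data file.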

Testing:
 * added e2e tests

Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7
Reviewed-on: http://gerrit.cloudera.org:8080/18704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-12 19:21:55 +00:00
LPL
0e3e4b57a1 IMPALA-11408: Fill missing partition columns when INSERT INTO iceberg_tbl (col_list)
In the case of INSERT INTO iceberg_tbl (col_a, col_b, ...), if the
partition columns of the Iceberg table are not in the column
permutation, we fill the missing partition columns with NullLiteral so
that the data is written to the default partition
'__HIVE_DEFAULT_PARTITION__'.
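
For illustration (schema hypothetical; 'p' is a partition column
omitted from the column list):

INSERT INTO ice_tbl (i, s) VALUES (1, 'x');
-- 'p' is filled with NULL; the row lands in __HIVE_DEFAULT_PARTITION__.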

Testing:
 - add e2e tests

Change-Id: I40c733755d65e5c81a12ffe09b6d16ed5d115368
Reviewed-on: http://gerrit.cloudera.org:8080/18790
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-10 14:34:04 +00:00
Riza Suminto
05c3a8e09c IMPALA-11465, IMPALA-11466: Bump CDP_BUILD_NUMBER to 30010248
This patch bumps up CDP_BUILD_NUMBER to pick up Hive version
3.1.3000.7.2.16.0-127, which contains:
- the thrift-0.16.0 upgrade from HIVE-25635.
- a backport of ORC-517.
This patch also contains fix for IMPALA-11466 by adding jetty-server as
an allowed dependency.

Testing:
- Built locally and confirmed that the CDP components are downloaded.

Change-Id: Iff5297a48865fb2444e8ef7b9881536dc1bbf63c
Reviewed-on: http://gerrit.cloudera.org:8080/18803
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2022-08-09 06:45:52 +00:00
Tamas Mate
bb5ec85134 IMPALA-10453: Support file pruning via runtime filters on Iceberg
Iceberg tables store partition information in manifest files and not in
the file path. This metadata has already been pushed down to the
scanners and this commit uses this metadata to evaluate runtime filters
on Iceberg files.

Performance measurement:
Used TPC-DS Q10 [1] with a scale of 10 to measure the query performance.
Min/Max filters were disabled and the wait time for runtime filters was
increased to 5 seconds. After pre-warming the Catalog I executed Q10 5
times on my local machine. The fastest execution times were:
Baseline Parquet tables: 1.08s
Baseline Iceberg tables without this patch: 1.43s
Iceberg tables with this patch: 1.09s

Testing:
  * Added e2e tests.
  * Initial performance test with TPC-DS Q10.

Ref:
[1] TPC-DS Q10:
select cd_gender, cd_marital_status, cd_education_status, count(*) cnt1,
  cd_purchase_estimate, count(*) cnt2, cd_credit_rating, count(*) cnt3,
  cd_dep_count, count(*) cnt4, cd_dep_employed_count, count(*) cnt5,
  cd_dep_college_count, count(*) cnt6
 from customer c, customer_address ca, customer_demographics
 where c.c_current_addr_sk = ca.ca_address_sk and
  ca_county in ('Walker County','Richland County','Gaines County',
  'Douglas County','Dona Ana County') and
  cd_demo_sk = c.c_current_cdemo_sk and
  exists (select *
          from store_sales, date_dim
          where c.c_customer_sk = ss_customer_sk and
                ss_sold_date_sk = d_date_sk and
                d_year = 2002 and
                d_moy between 4 and 4+3) and
   exists (select *
          from (select ws_bill_customer_sk as customer_sk, d_year,d_moy
             from web_sales, date_dim where ws_sold_date_sk = d_date_sk
              and d_year = 2002 and
                  d_moy between 4 and 4+3
             union all
             select cs_ship_customer_sk as customer_sk, d_year, d_moy
             from catalog_sales, date_dim
             where cs_sold_date_sk = d_date_sk and d_year = 2002 and
                  d_moy between 4 and 4+3
          ) x
            where c.c_customer_sk = customer_sk)
 group by cd_gender, cd_marital_status, cd_education_status,
  cd_purchase_estimate, cd_credit_rating, cd_dep_count,
  cd_dep_employed_count, cd_dep_college_count
 order by cd_gender, cd_marital_status, cd_education_status,
  cd_purchase_estimate, cd_credit_rating, cd_dep_count,
  cd_dep_employed_count, cd_dep_college_count
limit 100;

Change-Id: I7762e1238bdf236b85d2728881a402a2bb41f36a
Reviewed-on: http://gerrit.cloudera.org:8080/18531
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-04 20:18:04 +00:00
Tamas Mate
c0b0875bda IMPALA-11378: Allow INSERT OVERWRITE for bucket tranforms in some cases
This change has been considered only for Iceberg tables mainly for table
maintenance reasons. Iceberg table writes create new snapshots and these
can accumulate over time. This commit allows a simple form of compaction
of these snapshots.

INSERT OVERWRITE has been blocked when partition evolution is in
place, because it would be possible to overwrite a data file with a
newer schema that has fewer columns. This could cause unexpected data
loss.

For bucketed tables, the following syntax is allowed to be executed:
  INSERT OVERWRITE ice_tbl SELECT * FROM ice_tbl;
The source and target tables have to be the same and specified, and
only SELECT * queries are allowed. These requirements are also in place
to avoid unexpected data loss.
 - VALUES clauses are not allowed, because inserting a single record
   could overwrite a whole file in a bucket.
 - Only a source table is allowed, because at the time of the insert it
   is unknown which files will be modified, similar to VALUES.

Testing:
 - Added e2e tests.

Change-Id: Ibd1bc19d839297246eadeb754cdeeec1e306098a
Reviewed-on: http://gerrit.cloudera.org:8080/18649
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-01 13:36:51 +00:00
Gergely Fürnstáhl
cbd3fab8c4 IMPALA-11268: Allow STORED BY and STORED AS as well
Extended the parser to support the 'STORED BY' keyword for storage
engines, namely Kudu and Iceberg at the moment, to match Hive's
syntax. Furthermore, this lays the groundwork for separating
storage engines from file formats, to be able to specify both with
"STORED BY ... STORED AS ...".

Testing:

Added front-end Parser and Analyzer tests and query tests for table
creation.

Change-Id: Ib677bea8e582bbc01c5fb8c81df57eb60b0ed961
Reviewed-on: http://gerrit.cloudera.org:8080/18743
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-26 18:21:48 +00:00
LPL
614dc54dcc IMPALA-11446: Push down NOT_IN predicate to Iceberg
Because the column value bounds in the Iceberg metadata are not
necessarily exact min or max values, NOT_IN generally cannot be
answered using them: for NOT_IN(col, {X, ...}), bounds of (X, Y) don't
guarantee that X is an actual value in col. But the push-down does work
when the column is a partition column, where it's still very helpful.
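
For illustration (names are hypothetical):

-- part_col is a partition column, so the NOT IN predicate can be
-- pushed down to Iceberg to prune whole partitions:
SELECT * FROM ice_tbl WHERE part_col NOT IN ('a', 'b');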

Testing:
  - add e2e tests

Change-Id: Ib8bdaf6f31a4438e11c4eb27485bb413fe6df9a3
Reviewed-on: http://gerrit.cloudera.org:8080/18760
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-26 18:11:25 +00:00
Csaba Ringhofer
efc303b71a IMPALA-11434: Fix analysis of multiple more than 1d arrays in select list
Arrays of more than one dimension in the select list tried to register
a CollectionTableRef with the name "item" for the inner arrays, leading
to a name collision if there was more than one such array.

The logic is changed to always use the full path as implicit alias
in CollectionTableRefs backing arrays in select list.
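
For illustration, a query shape that used to fail (table name
hypothetical):

-- Both 2d arrays used to register their inner array as "item",
-- colliding with each other:
SELECT arr_a, arr_b FROM tbl_with_2d_arrays;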

As a side effect this leads to using the fully qualified names
in expressions in the explain plans of queries that use arrays
from views. This is not an intended change, but I don't consider
it to be critical. Created IMPALA-11452 to deal with more
sophisticated alias handling in collections.

Testing:
- added a new table to testdata and a regression test

Change-Id: I6f2b6cad51fa25a6f6932420eccf1b0a964d5e4e
Reviewed-on: http://gerrit.cloudera.org:8080/18734
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-22 22:59:19 +00:00
wzhou-code
02043744be IMPALA-11445: Fix bug in firing insert event of partitions located in different FS
When adding a partition whose location is in a file system different
from the file system of the table location, Impala accepts it. But
when inserting values into the table, catalogd threw an exception.

This patch fixes the issue by using the right FileSystem object.
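
For illustration (paths and schema are hypothetical):

-- The table lives on HDFS; the partition is added on S3:
ALTER TABLE tbl ADD PARTITION (p=1) LOCATION 's3a://bucket/tbl/p=1';
-- This used to make catalogd throw when firing the insert event:
INSERT INTO tbl PARTITION (p=1) VALUES (42);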

Testing:
 - Added new test case with partitions on different file systems.
   Ran the test on S3.
 - Did manual tests in cluster with partitions on HDFS and Ozone.
 - Passed core test.

Change-Id: I0491ee1bf40c3d5240f9124cef3f3169c44a8267
Reviewed-on: http://gerrit.cloudera.org:8080/18759
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-22 04:23:30 +00:00
Csaba Ringhofer
06e8e7bba7 IMPALA-886: Support displaying HBase cols in the order from HMS
Before this patch catalogd always ordered HBase columns
lexicographically by family/qualifier. This is incompatible with other
table formats and the way Hive handles HBase tables, where the order
comes from HMS as defined during CREATE TABLE.

I don't know of any valid reason behind this old behavior; it probably
just made the implementation a bit easier by doing the ordering in the
FE instead of the BE. The BE actually needs this ordering during
scanning, as the HBase API returns results in this order, but this
should have no effect on other parts of Impala.

Added flag use_hms_column_order_for_hbase_tables (used by catalogd)
to decide whether to do this reordering:
- true: keep HMS order
- false: reorder by family/qualifier [default]

The old way is kept as default to avoid breaking existing workloads,
but it would make sense to change it in the next major release.

Note that a query option would be more convenient to use, but it
would be much harder to implement it as the order is decided during
loading in catalogd.

Testing:
- added custom cluster test for
  use_hms_column_order_for_hbase_tables = true

Change-Id: Ibc5df8b803f2ae3b93951765326cdaea706e3563
Reviewed-on: http://gerrit.cloudera.org:8080/18635
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-18 19:02:34 +00:00
LPL
e1fea843ea IMPALA-11433: Remove misleading bucketing info from DESCRIBE FORMATTED output for Iceberg tables
The DESCRIBE FORMATTED output shows this even for bucketed Iceberg
tables:

| Num Buckets:	    | 0	    | NULL |
| Bucket Columns:   | []    | NULL |

We should remove them, and the user should rely on the information in
the '# Partition Transform Information' block instead.

Testing:
 - add e2e tests
 - tested in a real cluster

Change-Id: Idc156c932780f0f12c935a1a60ff6606d59bb1da
Reviewed-on: http://gerrit.cloudera.org:8080/18735
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-16 17:11:24 +00:00
Kurt Deschler
7764830216 IMPALA-11430: Support custom hash schema for Kudu range tables
KUDU-2671 added support for custom hash partition specification at the
range partition level. This patch adds CREATE TABLE and ALTER TABLE
syntax to allow Kudu custom hash schema to be specified through Impala.
In addition, a new SHOW HASH SCHEMA statement has been added to allow
display of the hash schema information for each partition.

HASH syntax within a partition is similar to the table-level syntax
except that HASH clauses must follow the PARTITION clause and commas are
not allowed within a partition. These differences were required to keep
the grammar unambiguous and due to limitations of the Java Cup Parser.
To make the grammar more consistent, commas in the table-level
partition spec and between PARTITION clauses are now optional but still
allowed for backward compatibility.

Example:

CREATE TABLE t1 (id int, c2 int, PRIMARY KEY(id, c2))
PARTITION BY HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4
RANGE (c2)
(
  PARTITION 0 <= VALUES < 10
  PARTITION 10 <= VALUES < 20
      HASH(id) PARTITIONS 2 HASH(c2) PARTITIONS 3
  PARTITION 20 <= VALUES < 30
)
STORED AS KUDU;
ALTER TABLE t1 ADD RANGE PARTITION 30 <= VALUES < 40
      HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4;

This bumps the toolchain to Kudu githash 43ee785b2d to get
the needed Kudu-side changes.

Testing:
Tests added to kudu_partition_ddl.test

Change-Id: I981056e0827f4957580706d6e73742e4e6743c1c
Reviewed-on: http://gerrit.cloudera.org:8080/18676
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-07-16 06:06:46 +00:00
Gergely Fürnstáhl
8d034a2f7c IMPALA-11034: Resolve schema of old data files in migrated Iceberg tables
When external tables are converted to Iceberg, the data files remain
intact, thus missing field IDs. Previously, Impala used name based
column resolution in this case.

Added a feature to traverse the data files before column resolution
and assign field IDs the same way as Iceberg would, to be able to use
field-ID-based column resolution.

Testing:

The default resolution method was changed to field ID for migrated
tables; existing tests use that from now on.

Added new tests to cover edge cases with complex types and schema
evolution.

Change-Id: I77570bbfc2fcc60c2756812d7210110e8cc11ccc
Reviewed-on: http://gerrit.cloudera.org:8080/18639
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-14 13:06:04 +00:00
Zoltan Borok-Nagy
26438d8e3e IMPALA-11414: Off-by-one error in Parquet late materialization
With PARQUET_LATE_MATERIALIZATION we can set the minimum number of
consecutive rows that, if filtered out, let us avoid materializing the
corresponding rows in the other Parquet columns.

E.g. if PARQUET_LATE_MATERIALIZATION is 10, and in a filtered column we
find at least 10 consecutive rows that don't pass the predicates, we
avoid materializing the corresponding rows in the other columns.

But due to an off-by-one error we actually only needed
(PARQUET_LATE_MATERIALIZATION - 1) consecutive elements. This means
that if we set PARQUET_LATE_MATERIALIZATION to one, we need zero
consecutive filtered-out elements, which leads to a crash/DCHECK. The
bug is in the GetMicroBatches() algorithm, where we produce the micro
batches based on the selected rows.

Setting PARQUET_LATE_MATERIALIZATION to 0 doesn't make sense so it
shouldn't be allowed.
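
For illustration:

SET PARQUET_LATE_MATERIALIZATION=1;  -- valid now; used to hit the DCHECK
SET PARQUET_LATE_MATERIALIZATION=0;  -- rejected as invalid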

Testing
 * e2e test with PARQUET_LATE_MATERIALIZATION=1
 * e2e test for checking SET PARQUET_LATE_MATERIALIZATION=N

Change-Id: I38f95ad48c4ac8c1e06651565ab5c496283b29fa
Reviewed-on: http://gerrit.cloudera.org:8080/18700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-11 19:25:02 +00:00
LPL
78e45a7cea IMPALA-11287 (part 2): Implement CREATE TABLE LIKE for Iceberg tables
This commit implements cloning between Iceberg tables. Cloning Iceberg
tables from other types of tables is not implemented, because the data
types of Iceberg and Impala do not correspond one-to-one.
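
For illustration (table names hypothetical):

CREATE TABLE ice_clone LIKE ice_src;
-- Both tables are Iceberg; cloning from non-Iceberg tables is rejected.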

Testing:
 - e2e tests

Change-Id: I1284b926f51158e221277b18b2e73707e29f86ac
Reviewed-on: http://gerrit.cloudera.org:8080/18658
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2022-07-08 13:52:26 +00:00
gaoxq
4174ff5ea6 IMPALA-11320: SHOW PARTITIONS on Iceberg table doesn't list the partitions
Currently, SHOW PARTITIONS on Iceberg tables only outputs the partition
spec, which is not too useful.

Instead it should output the concrete partitions and the number of
files and number of rows in each partition. E.g.:

SHOW PARTITIONS ice_ctas_hadoop_tables_part;

'{"d_month":"613"}',4,2
'{"d_month":"614"}',3,1
'{"d_month":"615"}',2,1

Testing:
 - Added end-to-end test

Change-Id: I3b4399ae924dadb89875735b12a2f92453b6754c
Reviewed-on: http://gerrit.cloudera.org:8080/18641
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-05 20:32:17 +00:00
LPL
9c96855146 IMPALA-11368: Iceberg time-travel error message should show timestamp in local timezone
In the FOR SYSTEM_TIME AS OF clause we expect timestamps in the local
timezone, while the error message shows the timestamp in the UTC
timezone. The error message should show the timestamp in the local
timezone.
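
For illustration (table name and timestamp are hypothetical):

SELECT * FROM ice_tbl FOR SYSTEM_TIME AS OF '2022-01-01 10:00:00';
-- If no snapshot exists before this time, the error now reports the
-- timestamp in the local timezone.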

Testing:
 - Add e2e test

Change-Id: Iba5d5eb65133f11cc4eb2fc15a19f7b25c14cc46
Reviewed-on: http://gerrit.cloudera.org:8080/18675
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-30 18:31:18 +00:00
LPL
f38c53235f IMPALA-11279: Optimize plain count(*) queries for Iceberg tables
This commit optimizes plain count(*) queries for Iceberg tables.
When the `org.apache.iceberg.SnapshotSummary#TOTAL_RECORDS_PROP` can be
retrieved from the current `org.apache.iceberg.BaseSnapshot#summary` of
the Iceberg table, this kind of query can be very fast. If this
property cannot be retrieved, the query aggregates the `num_rows` of
the Parquet `file_metadata_` as usual.

Queries that can be optimized need to meet the following requirements:
 - SelectStmt does not have WHERE clause
 - SelectStmt does not have GROUP BY clause
 - SelectStmt does not have HAVING clause
 - The TableRefs of FROM clause contains only one BaseTableRef
 - Only for the Iceberg table
 - SelectList must contain 'count(*)' or 'count(constant)'
 - SelectList can contain other agg functions, e.g. min, sum, etc
 - SelectList can contain constant
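
For illustration, queries of this shape can be answered from the
snapshot summary (table name hypothetical):

SELECT count(*) FROM ice_tbl;
SELECT count(*), min(i), 42 FROM ice_tbl;  -- other aggs and constants allowed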

Testing:
 - Added end-to-end test
 - Existing tests
 - Test it in a real cluster

Change-Id: I8e9c48bbba7ab2320fa80915e7001ce54f1ef6d9
Reviewed-on: http://gerrit.cloudera.org:8080/18574
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-27 15:29:11 +00:00
Gabor Kaszab
5d021ce5a7 IMPALA-9496: Allow struct type in the select list for Parquet tables
This patch extends the support of Struct columns in the select list to
Parquet files as well.

There are some limitations with this patch:
  - Dictionary filtering could work when we have conjuncts on a member
    of a struct, however, if this struct is given in the select list
    then the dictionary filtering is disabled. The reason is that in
    this case there would be a mismatch between the slot/tuple IDs in
    the conjunct between the ones in the select list due to expr
    substitution logic when a struct is in the select list. Solving
    this puzzle would be a nice future performance enhancement. See
    IMPALA-11361.
  - When structs are read in a batched manner it delegates the actual
    reading of the data to the column readers of its children, however,
    would use the simple ReadValue() on these readers instead of the
    batched version. The reason is that calling the batched reader in
    the member column readers would in fact read in batches, but it
    won't handle the case when the parent struct is NULL and would set
    only itself to NULL but not the parent struct. This might also be a
    future performance enhancement. See IMPALA-11363.
  - If there is a struct in the select list then late materialization
    is turned off. The reason is that LM expects the column readers to
    be used through the batched reading interface, however, as said in
    the above bulletpoint currently struct column readers use the
    non-batched reading interface of its children. As a result after
    reading the column readers are not in a state as SkipRows() of LM
    expects and then results in a query failure because it's not able
    to skip the rows for non-filter readers.
    Once IMPALA-11363 is implemented and the struct will also use the
    ReadValueBatch() interface of its children then late
    materialization could be turned on even if structs are in the
    select list. See IMPALA-11364.

Testing:
  - There were already a lot of tests exercising this functionality,
    but they were only run on ORC tables. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Reviewed-on: http://gerrit.cloudera.org:8080/18596
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-22 17:55:07 +00:00
LPL
75ccdc6aec IMPALA-11367: Fix some formatting bugs in DESCRIBE FORMATTED for Iceberg tables
Fix some formatting bugs in the DESCRIBE FORMATTED output for Iceberg
tables:
 - 'LINE_DELIM' is missing from '# Partition Transform Information'
 - The partition transform columns header should be
   'col_name,transform_type,NULL'

Testing:
 - Existing tests

Change-Id: I991644cefb34decc843a5542b47eaec11d7b6e42
Reviewed-on: http://gerrit.cloudera.org:8080/18634
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-20 14:06:56 +00:00