impala

mirror of https://github.com/apache/impala.git synced 2026-02-01 21:00:29 -05:00

Author	SHA1	Message	Date
Steve Carlin	13bbff4e4e	IMPALA-11323: Don't evaluate constants-only inferred predicates IMPALA-10182 fixed the problem of creating inferred predicates when both sides of an equality predicate came from the same slot. Inferred predicates also should not be created when both sides of an equality predicate are constant values which do not have scan slots. Change-Id: If1cd4559dda406d2d38703ed594b70b41ed336fd Reviewed-on: http://gerrit.cloudera.org:8080/18579 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Aman Sinha <amsinha@cloudera.com>	2022-06-04 03:36:11 +00:00
Gergely Fürnstáhl	decb46aa0d	IMPALA-9410: Support resolving ORC file columns by names Added query option and implementation to be able to resolve columns by names. Changed secondary resolution strategy for iceberg orc tables to name based resolution. Testing: Added new test dimension for orc tests, added results to now working iceberg migrated table test Change-Id: I29562a059160c19eb58ccea76aa959d2e408f8de Reviewed-on: http://gerrit.cloudera.org:8080/18397 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-06-02 17:06:05 +00:00
Michael Smith	2243f331cb	IMPALA-11274: CNF Rewrite causes a regress in join node performance This patch defines a subset of all predicates that are common and relatively inexpensive to compute. Such predicates must involve columns, constants, simple math or cast functions only. Examples of the subset of the predicates allowed: 1. (a = 1 AND cast(b as int) = 2) OR (c = d AND e = f) 2. a in ('1', '2', '3') OR ((b = 'abc') AND (c = d)) 3. (a between 1 and 100) OR ((b is null) AND (c = d)) Examples of the predicates not allowed: 1. ((upper(a) != 'Y') AND b = 2) OR (c = d AND e = f) 2. ((coalesce(CAST(a AS string), '') = '') AND b = 2) OR (c = d AND e = f) This patch further restricts the predicates to be converted to conjunctive normal form (CNF) to be such a subset, with the aim of reducing the run-time evaluation overhead of CNFs in which some of the predicates can be duplicated. Uses a cache in branching expressions to avoid visiting the entire subtree on each call to applyRuleBottomUp. Skips cache complexity on casts as they don't branch and are unlikely to be deeply nested. Testing: - New expression writer tests - New planner tests Change-Id: I326406c6b004fe31ec0e2a2f390a3845b8925aa9 Reviewed-on: http://gerrit.cloudera.org:8080/18458 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-25 05:37:17 +00:00
Riza Suminto	ad915ca58e	IMPALA-11306: Create symlink for dataset of scale factor 1 single_node_perf_run.py and load-data.py can fail if the user sets the scale factor argument to 1. This is because generate-schema-statements.py inserts the scale factor into the database name (i.e., "tpch1"), but the preload script omits the scale factor when creating the dataset directory (i.e., "tpch"). This patch fixes the issue by additionally creating a symlink for scale factor 1. Testing: - Manual test by running the following script: ./bin/load-data.py --scale_factor=1 --workloads=targeted-perf --table_formats=text/none/none Change-Id: I76c9c90b243df6213626e11652cfed59643aed2c Reviewed-on: http://gerrit.cloudera.org:8080/18545 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-24 14:24:17 +00:00
LPL	b58966b983	IMPALA-11289: Push-down compound predicates to iceberg This patch implements pushing compound predicates down to Iceberg. The compound predicates include NOT, AND, and OR. Testing: - Added end-to-end test Change-Id: I27bc67b71033900c466183da5b1907ac90844177 Reviewed-on: http://gerrit.cloudera.org:8080/18535 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-24 13:16:21 +00:00
Joe McDonnell	09a297a270	IMPALA-10503: Use larger yarn containers for dataload This increases the 'yarn.app.mapreduce.am.resource.mb' parameter to 2GB in yarn-site.xml. This reduces the frequency of dataload hitting the container size limit on the docker-based tests and seems likely to address other problems related to the container size. Testing: - Ran docker-based tests - Ran GVO successfully - Ran a debug core job on a different machine configuration Change-Id: I06567ffc44fa378be7c8cf4008f138b47b68d931 Reviewed-on: http://gerrit.cloudera.org:8080/17201 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>	2022-05-22 19:58:15 +00:00
stiga-huang	6ea15409b8	IMPALA-11208: Fix uninitialized counter of CollectionItemsRead in orc-scanner CollectionItemsRead in the runtime profile counts the total number of nested collection items read by the scan node. Only created for scans that support nested types, e.g. Parquet or ORC. Each scanner thread maintains its local counter and merges it into HdfsScanNode counter for each row batch. However, the local counter in orc-scanner is uninitialized, leading to weird values. This patch simply initializes it to 0 and adds test coverage. Tests: Add profile verification for this counter on some existing query tests. Note that there are some implementation difference between Parquet and ORC scanners (e.g. in predicate pushdown). So we will see different counter results in some query. I just pick some queries that have consistent counters. Change-Id: Id7783d1460ac9b98e94d3a31028b43f5a9884f99 Reviewed-on: http://gerrit.cloudera.org:8080/18528 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-18 23:59:58 +00:00
Michael Smith	d4cb3afe69	[tools] fix buildall.sh -testdata with prior data The help output for buildall.sh notes running `buildall.sh -testdata` as an option to incrementally load test data without formatting the mini-cluster. However trying to do that with existing data loaded results in an error when running `hadoop fs -mkdir /test-warehouse`. Add `-p` so this step is idempotent, allowing the example to work as documented. Change-Id: Icc4ec4bb746abf53f6787fce4db493919806aaa9 Reviewed-on: http://gerrit.cloudera.org:8080/18522 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-16 08:19:33 +00:00
LPL	7a7934ffba	IMPALA-11283: Push-down IS_NULL and NOT_NULL predicates to iceberg This patch implements pushing the IS_NULL and NOT_NULL predicates down to Iceberg. Testing: - Added end-to-end test Change-Id: I9c3608af67b552bebc55dcc5526f61f5439967bf Reviewed-on: http://gerrit.cloudera.org:8080/18504 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-13 23:10:17 +00:00
LPL	e711461a5c	IMPALA-11286: Writes value_counts to Iceberg metadata Impala does not write 'value_counts' to Iceberg metadata, just 'null_value_counts'. Pushing down the NOT_NULL predicate does not work when the data is written by Impala, so this patch makes Impala write 'value_counts' to Iceberg metadata as well. Testing: - existing tests - tested manually on a real cluster Change-Id: I6b7afab8be197118e573fda1a381fa08e4c8c9c0 Reviewed-on: http://gerrit.cloudera.org:8080/18513 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-12 16:23:17 +00:00
Zoltan Borok-Nagy	7c40b95a04	IMPALA-11287 (part 1): Disable CREATE TABLE LIKE statements for Iceberg tables We currently don't implement correct behavior for CREATE TABLE LIKE statements for Iceberg tables. Neither on the source, nor on the target table side. This patch forbids such statements until they are correctly implemented. Testing * added e2e test Change-Id: I9cee6fc82547dabf63937cc541163c1ee59a4013 Reviewed-on: http://gerrit.cloudera.org:8080/18517 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-12 12:37:23 +00:00
Steve Carlin	ca5ea4aeab	IMPALA-11162: Support GenericUDFs for Hive Hive has 2 types of UDFs. This commit contains limited support for the second generation UDFs called GenericUDFs. The main limitations are as follows: Decimal types are not supported. The Impala framework determines the precision and scale of the decimal return type. However, Hive GenericUDFs can choose their own return type based on the parameters. Until this can be resolved, it is safer to forbid decimals from being used. Note that this limitation currently exists in the first generation of Hive Java UDFs. Complex types are not supported. Functions are not extracted from the jar file. The first generation of Hive UDFs allowed this because the method prototypes are explicitly defined and can be determined at function creation time. For GenericUDFs, the return types are determined based on the parameters passed in when running a query. For the same reason as above, GenericUDFs cannot be made permanent. They will need to be recreated every time the server is restarted. This is a severe limitation and will be resolved in the near future. Change-Id: Ie6fd09120db413fade94410c83ebe8ff104013cd Reviewed-on: http://gerrit.cloudera.org:8080/18295 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2022-05-11 15:10:28 +00:00
Michael Smith	e6ed98c22b	IMPALA-11201: update gitignore files Updates gitignore for files generated during bootstrap_development. Fixes deleting tracked files in be/src/thirdparty. Includes ignore rules for past versions of shell dependencies and updates ignores for current versions. Change-Id: I03deba5e7fb151ef8e34039becdcc3fb47684084 Reviewed-on: http://gerrit.cloudera.org:8080/18499 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-10 03:06:59 +00:00
LPL	93fe446b85	IMPALA-11277: Push-down IN predicate to iceberg Iceberg provides a rich API to push predicates down. Currently we only push BinaryPredicate such as EQ, LT, GT, etc. This commit implements IN predicate push down. Testing: - Added end-to-end testing for pushing down IN predicate to iceberg Change-Id: Id4be9aa31a6353021b0eabc4485306c0b0e8bb07 Reviewed-on: http://gerrit.cloudera.org:8080/18463 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-04 16:50:05 +00:00
LPL	5a73401f3e	IMPALA-11276: Fix TestIcebergTable.test_partitioned_insert became flaky TestIcebergTable.Test_partitioned_insert test is not stable because SHOW FILES on Iceberg table will sort the list of FILES. So restore the original VERIFY_IS_SUBSET for some flaky cases. Change-Id: Ic38b399ab51903edb59b3f2d1066cd5f5cbff4d4 Reviewed-on: http://gerrit.cloudera.org:8080/18465 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-03 12:13:32 +00:00
Daniel Becker	9baf790606	IMPALA-10838: Fix substitution and improve unification of struct slots The following query fails: ''' with sub as ( select id, outer_struct from functional_orc_def.complextypes_nested_structs) select sub.id, sub.outer_struct.inner_struct2 from sub; ''' with the following error: ''' ERROR: IllegalStateException: Illegal reference to non-materialized tuple: debugname=InlineViewRef sub alias=sub tid=6 ''' while if 'outer_struct.inner_struct2' is added to the select list of the inline view, the query works as expected. This change fixes the problem by two modifications: - if a field of a struct needs to be materialised, also materialise all of its enclosing structs (ancestors) - in InlineViewRef, struct fields are inserted into the 'smap' and 'baseTableSmap' with the appropriate inline view prefix This change also changes the way struct fields are materialised: until now, if a member of a struct was needed to be materialised, the whole struct, including other members of the struct were materialised. This behaviour can lead to using significantly more memory than necessary if we for example query a single member of a large struct. This change modifies this behaviour so that we only materialise the struct members that are actually needed. Tests: - added queries that are fixed by this change (including the one above) in nested-struct-in-select-list.test - added a planner test in fe/src/test/java/org/apache/impala/planner/PlannerTest.java that asserts that only the required parts of structs are materialised Change-Id: Iadb9233677355b85d424cc3f22b00b5a3bf61c57 Reviewed-on: http://gerrit.cloudera.org:8080/17847 Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Daniel Becker <daniel.becker@cloudera.com>	2022-05-02 07:21:37 +00:00
LPL	78609dca32	IMPALA-11256: Fix SHOW FILES on Iceberg tables lists all files SHOW FILES on Iceberg tables lists all files in the table directory, even deleted files and metadata files. We should only show the current data files. Testing: - existing tests Change-Id: If07c2fd6e05e494f7240ccc147b8776a8f217179 Reviewed-on: http://gerrit.cloudera.org:8080/18455 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-28 17:48:23 +00:00
stiga-huang	6380a3187c	IMPALA-11141: Use exact data types in IN-list filter Currently, we use a std::unordered_set<int64_t> for all numeric types (including DATE type). It's a waste of space for small data types like tinyint, smallint, int, etc. This patch extends the base InListFilter class with native implementations for different data types. For string type in-list filters, this patch uses impala::StringValue instead of std::string. This simplifies the Insert() method, which improves the codegen time. To use impala::StringValue, this patch switches the set implementation to boost::unordered_set. Same as what we use in InPredicate. Another improvement of using impala::StringValue is that we can easily maintain the strings in MemPool. When inserting a new batch of values, the new values are inserted into a temp set. String pointers still reference to the original tuple values. At the end of processing each batch, MaterializeValues() is invoked to copy the strings into the filter's own mem pool. This is more memory-friendly than the original approach since we can allocate the string batch at once. Tests: - Add unit tests for different types of in-list filters Change-Id: Id434a542b2ced64efa3bfc974cb565b94a4193e9 Reviewed-on: http://gerrit.cloudera.org:8080/18433 Reviewed-by: Qifan Chen <qchen@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-27 03:30:41 +00:00
Daniel Becker	c802be42b6	IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC) When querying a non-toplevel nested struct from an ORC file, the NULL values are displayed at an incorrect level. E.g.: select id, outer_struct.inner_struct3 from functional_orc_def.complextypes_nested_structs where id >= 4; +----+----------------------------+ \| id \| outer_struct.inner_struct3 \| +----+----------------------------+ \| 4 \| {"s":{"i":null,"s":null}} \| \| 5 \| {"s":null} \| +----+----------------------------+ However, in the first row it is expected that 's' should be null and not its members; in the second row the result should be 'NULL', i.e. 'outer_struct.inner_struct3' is null. For reference see what is returned when querying 'outer_struct' instead of 'outer_struct.inner_struct3': +----+-------------------------------------------------------------------------------------------------------------------------------+ \| 4 \| {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} \| \| 5 \| {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null} \| +----+-------------------------------------------------------------------------------------------------------------------------------+ The problem comes from the incorrect handling of the different depths of the following trees: - the ORC type hierarchy (schema) - the tuple descriptor / slot descriptor hierarchy as the ORC type hierarchy contains a node for every level in the schema but the tuple/slot descriptor hierarchy omits the levels of structs that are not in the select list (but an ancestor of theirs is), as these structs are not materialised. 
In the case of the example query, the two hierarchies are the following: ORC: root --> outer_struct -> inner_struct3 -> s --> i \| \-> s \-> id Tuple/slot descriptors: main_tuple --> inner_struct3 -> s --> i \| \-> s \-> id We create 'OrcColumnReader's for each node in the ORC type tree. Each OrcColumnReader is assigned an ORC type node and a slot descriptor. The incorrect behaviour comes from the incorrect pairing of ORC type nodes with slot descriptors. The old behaviour is described below: Starting from the root, going along a path in both trees (for example the path leading to outer_struct.inner_struct3.s.i), for each step we consume a level in both trees until no more nodes remain in the tuple/slot desc tree, and then we pair the last element from that tree with the remaining ORC type node(s). In the example, we get the following pairs: (root, main_tuple) -> (outer_struct, inner_struct3) -> (inner_struct3, s) -> (s, i) -> (i, i) When we run out of structs in the tuple/slot desc tree, we still create OrcStructReaders (because the ORC type is still a struct, but the slot descriptor now refers to an INT), but we mark them incorrectly as non-materialised. Also, the OrcStructReaders for non-materialised structs do not need to check for null-ness as they are not present in the select list, only their descendants, and the ORC batch object stores null information also for the descendants of null values. Let's look at the row with id 4 in the example: Because of the bug, the non-materialising OrcStructReader appears at the level of the (s, i) pair, so the 's' struct is not checked for null-ness, although it is actually null. One level lower, for 'i' (and the inner 's' string field), the ORC batch object tells us that the values are null (because their parent is). Therefore the nulls appear one level lower than they should. 
The correct behaviour is that ORC type nodes are paired with slot descriptors if either - the ORC type node matches the slot descriptor (they refer to the same node in the schema) or - the slot descriptor is a descendant of the schema node that the ORC type node refers to. This patch fixes the incorrect pairing of ORC types and slot descriptors, so we have the following pairs: (root, main_tuple) -> (outer_struct, main_tuple) -> (inner_struct3, inner_struct3) -> (s, s) -> (i, i) In this case the OrcStructReader for the pair (outer_struct, main_tuple) becomes non-materialising and the one for (s, s) will be materialising, so the 's' struct will also be null-checked, recognising null-ness at the correct level. This commit also fixes some comments in be/src/exec/orc-column-readers.h and be/src/exec/hdfs-orc-scanner.h mentioning the field HdfsOrcScanner::col_id_path_map_, which has been removed by "IMPALA-10485: part(1): make ORC column reader creation independent of schema resolution". Testing: - added tests to testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test that query various levels of the struct 'outer_struct' to check that NULLs are at the correct level. Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040 Reviewed-on: http://gerrit.cloudera.org:8080/18403 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-21 13:59:17 +00:00
Zoltan Borok-Nagy	e91c7810f0	IMPALA-10850: Interpret timestamp predicates in local timezone in IcebergScanNode IcebergScanNode interprets the timestamp literals as UTC timestamps during predicate pushdown to Iceberg. It causes problems when the Iceberg table uses TIMESTAMPTZ (which corresponds to TIMESTAMP WITH LOCAL TIME ZONE in SQL) because in the scanners we assume that the timestamp literals in a query are in local timezone. Hence, if the Iceberg table is partitioned by HOUR(ts), and Impala is running in a different timezone than UTC, then the following query doesn't return any rows: SELECT * from t WHERE ts = <some ts>; This is because during predicate pushdown the timestamp is interpreted as a UTC timestamp (no conversion from local to UTC), but during query execution the timestamp data in the files are converted to local timezone, then compared to <some ts>. I.e. in the scanner the assumption is that <some ts> is in local timezone. On the other hand, when Iceberg type TIMESTAMP (which corresponds to TIMESTAMP WITHOUT TIME ZONE in SQL) is used, then we should just push down the timestamp values without any conversion. In this case there is no conversion in the scanners either. Testing: * added e2e test with TIMESTAMPTZ * added e2e test with TIMESTAMP Change-Id: I181be5d2fa004f69b457f69ff82dc2f9877f46fa Reviewed-on: http://gerrit.cloudera.org:8080/18399 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2022-04-21 12:49:31 +00:00
Aman Sinha	8645ac6db3	IMPALA-11247: Test script changes for materialized views IMPALA-10723 added support for treating materialized views as tables. In certain test configurations, the rebuild of the materialized views (which is done via Hive) was not populating the data in the MV. In this patch, I have changed the source tables of materialized views to be full-acid instead of insert-only transactional tables. This enables the tests to succeed. Insert-only source tables are also meant to work for the MV rebuild but that is a Hive issue that will be investigated separately. Change-Id: I349faa0ad36ec8ca6f574f7f92d9a32fb7d0d344 Reviewed-on: http://gerrit.cloudera.org:8080/18421 Reviewed-by: Aman Sinha <amsinha@cloudera.com> Tested-by: Aman Sinha <amsinha@cloudera.com>	2022-04-17 22:36:57 +00:00
Aman Sinha	e644c99724	IMPALA-10723: Treat materialized view as a table instead of a view The existing behavior is that materialized views are treated as views and therefore expanded similar to a view when one queries the MV directly (SELECT * FROM materialized_view). This is incorrect since an MV is a regular table with physical properties such as partitioning, clustering etc. and should be treated as such even though it has a view definition associated with it. This patch focuses on the use case where MVs are created as HDFS tables and makes the MVs a derived class of HdfsTable, therefore making it a Table object. It adds support for collecting and displaying statistics on materialized views and these statistics could be leveraged by an external frontend that supports MV based query rewrites (note that such a rewrite is not supported by Impala with or without this patch). Note that we are not introducing new syntax for MVs since DDL, DML operations on MVs are only supported through Hive. Directly querying a MV is permitted but inserts into MVs is not since MVs are supposed to be only modified through an external refresh when the source tables have modifications. If the source tables associated with a materialized view have column masking or row-filtering Ranger policies, querying the MV will throw an error. This behavior is consistent with that of Hive. Testing: - Added transactional tables for alltypes, jointbl and used them as source tables to create materialized view. - Added tests for compute stats, drop stats, show stats and simple select query on a materialized view. - Added test for select on a materialized view when the source table has a column mask. - Modified analyzer tests related to alter, insert, drop of materialized view. 
Change-Id: If3108996124c6544a97fb0c34b6aff5e324a6cff Reviewed-on: http://gerrit.cloudera.org:8080/17595 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>	2022-04-14 11:56:20 +00:00
Fang-Yu Rao	c190966db9	IMPALA-11232: Do not add some jars to HADOOP_CLASSPATH when starting HMS This patch changes the line that added to HADOOP_CLASSPATH all the jar files in the folder ${RANGER_HOME}/ews/webapp/WEB-INF/lib to a line that only includes those jar files with names starting with "ranger-" since almost all other jar files do not seem to be necessary to run the E2E test of test_hive_with_ranger_setup. This way we also avoid adding too many paths to HADOOP_CLASSPATH, which in turn could result in Hadoop not being able to return its version to the script that starts HMS due to the error of "Argument list too long". Testing: - Verified after this patch, test_hive_with_ranger_setup still succeeds. - Verified in a local development environment that the length of Hadoop's environment variable 'CLASSPATH' logged in hive-metastore.out decreases from 100,876 characters to 62,634 characters when executing run-hive-server.sh with the flag '-with_ranger' if $HADOOP_SHELL_SCRIPT_DEBUG is "true" and $IMPALA_HOME is "/home/fangyurao/Impala_for_FE". Change-Id: Ifd66fd99a346835b9f81f95b5f046273fcce2590 Reviewed-on: http://gerrit.cloudera.org:8080/18398 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-13 20:37:45 +00:00
Csaba Ringhofer	3843f7ff46	IMPALA-11200: Avoid redundant "Codegen enabled" messages in profile Before this patch the message was added to the profile in Open(), which can be called multiple times in subplans. Moved it to Close(), which is only called once in the lifetime of a Node/Aggregator. A drawback of this is that this info won't be visible when the Node is still active, but I don't think that it is a very useful info in a still running query. Also added a new feature to test_result_verifier.py: Inside RUNTIME_PROFILE section row_regex can be negated with !, so !row_regex [regex] means that regex is not matched by any line in the profile. Testing: - added a regression test Change-Id: Iad2e31900ee6d29385cc8adc6bbf067d91f6450f Reviewed-on: http://gerrit.cloudera.org:8080/18385 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-13 12:31:36 +00:00
Riza Suminto	953705b8d2	IMPALA-11239: Fix failure in test_parquet_count_star_optimization IMPALA-11123 added an assertion to verify NumFileMetadataRead in parquet-stats-agg.test. In the multiblock test, the value of NumFileMetadataRead can differ in an erasure coding configuration. This patch removes that assertion in the multiblock test. The rest of the assertions, including the count results, remain the same. Testing: - Pass e2e tests in erasure coding setup. Change-Id: I6fe3f6e97358b619838b48eddb22192b39d29cc6 Reviewed-on: http://gerrit.cloudera.org:8080/18407 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-13 06:22:14 +00:00
Tamas Mate	9cd4823aa9	IMPALA-11023: Raise error when delete file is found in an Iceberg table Iceberg V2 DeleteFiles are skipped during scans and the whole content of the DataFiles are returned. This commit adds an extra check to prevent scanning tables that have delete files to avoid unexpected results till merge on read is supported. Metadata operations are allowed on tables with delete files. Testing: - Added e2e test. Change-Id: I6e9cbf2424b27157883d551f73e728ab4ec6d21e Reviewed-on: http://gerrit.cloudera.org:8080/18383 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-11 19:37:04 +00:00
Joe McDonnell	7b235eebd5	IMPALA-11230: Add test for crash in partitioned Top-N codegen code User workloads hit a crash for certain queries that use partitioned Top-N operators. The crash occurred only when codegen is enabled. After investigation, the crash was due to a nullptr being passed into TupleRowComparator::Compare(). The issue was fixed as part of IMPALA-10961. This adds a test case with a SQL statement that triggers a crash if IMPALA-10961 is not present. Change-Id: I6909ef660b01ad3d301273deb8a8c31120445f79 Reviewed-on: http://gerrit.cloudera.org:8080/18389 Reviewed-by: Aman Sinha <amsinha@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-07 16:07:56 +00:00
Riza Suminto	f932d78ad0	IMPALA-11123: Optimize count(star) for ORC scans This patch provides count(star) optimization for ORC scans, similar to the work done in IMPALA-5036 for Parquet scans. We use the stripes' number-of-rows statistics when computing the count star instead of materializing empty rows. The aggregate function changed from a count to a special sum function initialized to 0. This count(star) optimization is disabled for full ACID tables because the scanner might need to read and validate the 'currentTransaction' column in the table's special schema. This patch drops 'parquet' from names related to the count star optimization. It also improves the count(star) operation in general by serving the result just from the file's footer stats for both Parquet and ORC. We unify the optimized count star and zero slot scan functions into HdfsColumnarScanner. The following table shows a performance comparison before and after the patch. The primitive_count_star query targets the tpch10_parquet.lineitem table (10GB scale TPC-H). Meanwhile, the count_star_parq and count_star_orc queries are modified primitive_count_star queries that target the tpch_parquet.lineitem and tpch_orc_def.lineitem tables, respectively. 
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+ \| Workload \| Query \| File Format \| Avg(s) \| Base Avg(s) \| Delta(Avg) \| StdDev(%) \| Base StdDev(%) \| Iters \| Median Diff(%) \| MW Zval \| Tval \| +-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+ \| tpch_parquet \| count_star_parq \| parquet / none / none \| 0.06 \| 0.07 \| -10.45% \| 2.87% \| * 25.51% * \| 9 \| -1.47% \| -1.26 \| -1.22 \| \| tpch_orc_def \| count_star_orc \| orc / def / none \| 0.06 \| 0.08 \| -22.37% \| 6.22% \| * 30.95% * \| 9 \| -1.85% \| -1.16 \| -2.14 \| \| TARGETED-PERF(10) \| primitive_count_star \| parquet / none / none \| 0.06 \| 0.08 \| I -30.40% \| 2.68% \| * 29.63% * \| 9 \| I -7.20% \| -2.42 \| -3.07 \| +-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+ Testing: - Add PlannerTest.testOrcStatsAgg - Add TestAggregationQueries::test_orc_count_star_optimization - Exercise count(star) in TestOrc::test_misaligned_orc_stripes - Pass core tests Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091 Reviewed-on: http://gerrit.cloudera.org:8080/18327 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-05 13:27:10 +00:00
Gabor Kaszab	32e2ace38d	IMPALA-11038: Zipping unnest from view IMPALA-10920 introduced zipping unnest functionality for arrays that are in a table. This patch improves that support further by accepting inputs from views as well. Testing: - Added planner tests to verify which execution node handles the predicates on unnested items. - E2E tests for both unnesting syntaxes (select list and from clause) to cover when the source of the unnested arrays is not a table but a view. Also tested multi-level views and filtering the unnested items on different levels. Change-Id: I68f649dda9e41f257e7f6596193d07b24049f92a Reviewed-on: http://gerrit.cloudera.org:8080/18094 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>	2022-04-05 07:56:39 +00:00
Tamas Mate	efba58f5f0	IMPALA-10737: Optimize the number of Iceberg API Metadata requests Iceberg stores the table metadata next to the data files; when this is accessed through the Iceberg API a filesystem call is executed (HDFS, S3, ADLS). These calls were used in various places during query processing and this patch unifies the Iceberg metadata request in the CatalogD and ImpalaD: - CatalogD loads and caches the org.apache.iceberg.Table object. - When ImpalaDs request the Table metadata, the current catalog snapshot id is sent over and the ImpalaD loads and caches the org.apache.iceberg.Table object through the Iceberg API as well. This approach (loading the Iceberg table twice) was chosen because the org.apache.iceberg.Table could not be meaningfully serialized and deserialized. The result of a serialized Table is a lightweight SerializableTable object which is in the Iceberg core package. As a result REFRESH/INVALIDATE METADATA is required to reload any Iceberg metadata changes and the metadata load time is improved. This improvement is more significant for smaller queries, where the metadata request has larger impact on the query execution time. Additionally, the dependency on the Iceberg core package has been reduced and the TableMetadata/BaseTable class uses have been replaced with the Table class from the Iceberg api package in most places. Testing: - Passed Iceberg E2E tests. Change-Id: I5492e0cdb31602f0276029c2645d14ff5cb2f672 Reviewed-on: http://gerrit.cloudera.org:8080/18353 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-04 23:52:33 +00:00
Csaba Ringhofer	6bf56c95c7	IMPALA-11115: Fix hitting DCHECK for brotli and deflate compressions The DCHECK was hit when an unsupported compression was included in enum THdfsCompression but not in COMPRESSION_MAP. Removed COMPRESSION_MAP as we can get the names from enum THdfsCompression directly. In release builds this didn't cause a crash, only a weird error message ("INVALID" instead of the compression name). Testing: - added ee tests that try to insert with brotli and deflate Change-Id: Ic38294b108ff3c4aa0b49117df95c5a1b8c60a4b Reviewed-on: http://gerrit.cloudera.org:8080/18242 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-04 20:28:07 +00:00
xqhe	abfa5a72b6	IMPALA-11008: fix incorrect to propagate inferred predicates It is incorrect to propagate predicates inferred from equi-join conjuncts into a plan subtree that is on the nullable side of an outer join if the predicate is not null-filtering for the nullable side. For example: SELECT * FROM ( SELECT id IS NOT NULL AND col IS NULL AS a FROM ( SELECT A.id, B.col FROM A LEFT JOIN B ON A.id = B.id ) t ) t WHERE a = 1 Before this patch the inferred predicate '(B.id is not null and B.col is null) = 1' is evaluated at the scanner of B. This is incorrect since the predicate '(A.id is not null and B.col is null) = 1' is not null-filtering for B. To generate the inferred predicate we substitute the non-outer-join slots first and use 'isNullableConjunct' to do a stricter check on the conjunct before the final substitution. Tests: - Add plan tests in predicate-propagation.test - Add new query tests to verify the correctness of inferred predicates propagation - Ran the full set of verifications in Impala Public Jenkins Change-Id: I9e64230f6d0c2b9ef1560186ceba349a5920ccdf Reviewed-on: http://gerrit.cloudera.org:8080/18234 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-04 05:54:58 +00:00
stiga-huang	0fb14962d7	IMPALA-11039: Fix incorrect page jumping in late materialization of Parquet The current calculation of LastRowIdxInCurrentPage() is incorrect. It uses the first row index of the next candidate page instead of the next valid page. The next candidate page could be far away from the current page. Thus giving a number larger than the current page size. Skipping rows in the current page could overflow the boundary due to this. This patch fixes LastRowIdxInCurrentPage() to use the next valid page. When skip_row_id is set (>0), the current approach of SkipRowsInternal<false>() expects jumping to a page containing this row and then skipping rows in that page. However, the expected row might not be in the candidate pages. When we jump to the next candidate page, the target row could already be skipped. In this case, we don't need to skip rows in the current page. Tests: - Add a test on alltypes_empty_pages to reveal the bug. - Add more batch_size values in test_page_index. - Pass tests/query_test/test_parquet_stats.py locally. Change-Id: I3a783115ba8faf1a276e51087f3a70f79402c21d Reviewed-on: http://gerrit.cloudera.org:8080/18372 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-02 03:26:28 +00:00
Zoltan Borok-Nagy	952f2af0ca	IMPALA-11210: Impala can only handle lowercase schema elements of Iceberg table When Impala/Hive creates a table they lowercase the schema elements. When Spark creates an Iceberg table it doesn't lowercase the names of the columns in the Iceberg metadata. This triggers a precondition check in Impala which makes such Iceberg tables unloadable. This patch converts column names to lowercase when converting Iceberg schemas to Hive/Impala schemas. Testing: * added e2e test Change-Id: Iffd910f76844fbf34db805dda6c3053c5ad1cf79 Reviewed-on: http://gerrit.cloudera.org:8080/18368 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-31 11:53:13 +00:00
Yu-Wen Lai	ca48b940ec	Bump up CDP_BUILD_NUMBER to 23144489 This patch is to include HIVE-25753, which is needed to improve the performance of retrieving the latest committed compaction for a table. Change-Id: Ifd4ae0cba48217483a40a51f97156fabfb00cf27 Reviewed-on: http://gerrit.cloudera.org:8080/18296 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Aman Sinha <amsinha@cloudera.com>	2022-03-14 19:47:19 +00:00
Zoltan Borok-Nagy	2fffac3bad	IMPALA-11175: Iceberg table cannot be loaded when partition value is NULL When Impala created the metadata objects about the Iceberg data files it tried to convert the partition values to strings. But the partition values can be NULLs as well. The code didn't expect this, so we got a NullPointerException. With this patch we pass the table's null partition key value in case of NULLs. Testing: * added e2e tests Change-Id: I88c4f7a2c2db4f6390c8ee5c08baddc96b04602e Reviewed-on: http://gerrit.cloudera.org:8080/18307 Reviewed-by: Tamas Mate <tmater@apache.org> Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-11 13:23:30 +00:00
Tamas Mate	aef30d0442	Revert "IMPALA-10737: Optimize the number of Iceberg API Metadata requests" This reverts commit `cd10acdbb1`. This commit has been reverted because it blocks upgrading the Iceberg version to 0.13. In the newer Iceberg version the BaseTable serialization has been changed: it serializes the BaseTable to a SerializableTable sibling class. This is a lightweight Table class which does not have the necessary metadata that could be cached and reused by the ImpalaDs. SerializableTable utilization has to be further considered. Change-Id: I21e65cb3ab38d9e683223fb100d7ced90caa6edd Reviewed-on: http://gerrit.cloudera.org:8080/18305 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-10 15:30:59 +00:00
Zoltan Borok-Nagy	7f1ce039be	IMPALA-11154: Idle Kudu daemons consume too much CPU Due to KUDU-1973 kudu-tservers produce high CPU consumption (see also KUDU-3134) when there is a high number of table replicas. This means that in the Impala dev environment the CPU consumption can be around 15-20% per kudu-tserver (there are 3 kudu-tservers) when all the Kudu tables are loaded. Setting the value to 3 seconds lowers CPU usage to ~5% per kudu-tserver. Testing: * ran exhaustive tests Change-Id: Ieb4de56540f5a7dc860bf6e27d9a5c0e4f4b3d26 Reviewed-on: http://gerrit.cloudera.org:8080/18290 Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-08 18:35:04 +00:00
Zoltan Borok-Nagy	c10e951bcb	IMPALA-11053: Impala should be able to read migrated partitioned Iceberg tables When Hive (and probably other engines as well) converts a legacy Hive table to Iceberg it doesn't rewrite the data files. It means that the data files have neither write ids nor partition column data. Currently Impala expects the partition columns to be present in the data files, so it is not able to read converted partitioned tables. With this patch Impala loads partition values from the Iceberg metadata. The extra metadata information is attached to the file descriptor objects and propagated to the scanners. This metadata contains the Iceberg data file format (later it could be used to handle mixed-format tables), and partition data. We use the partition data in the HdfsScanner to create the template tuple that contains the partition values of identity-partitioned columns. This applies not only to migrated tables, but to all Iceberg tables with identity partitions, which means we also save some IO and CPU time for such columns. The partition information could also be used for Dynamic Partition Pruning later. We use the (human-readable) string representation of the partition data when storing them in the flat buffers. This helps debugging, and it also provides the needed flexibility when the partition columns evolve (e.g. INT -> BIGINT, DECIMAL(4,2) -> DECIMAL(6,2)). Testing * e2e test for all data types that can be used to partition a table * e2e test for migrated partitioned table + schema evolution (without renaming columns) * e2e for table where all columns are used as identity-partitions Change-Id: Iac11a02de709d43532056f71359c49d20c1be2b8 Reviewed-on: http://gerrit.cloudera.org:8080/18240 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-07 20:00:42 +00:00
Gergely Fürnstáhl	71c904e5c2	IMPALA-10948: Default scale and DecimalType Added default 0 for scale if it is not set to comply with parquet spec. Wrapped reading scale and precision in a function to support reading LogicalType.DecimalType if it is set, falling back to old ones if it is not, for backward compatibility. Regenerated bad_parquet_decimals table with filled DecimalType, moved missing scale test, as it is no longer a bad table. Added no_scale.parquet table to test reading table without set scale. Checked it with parquet-tools: message schema { optional fixed_len_byte_array(2) d1 (DECIMAL(4,0)); } Change-Id: I003220b6e2ef39d25d1c33df62c8432803fdc6eb Reviewed-on: http://gerrit.cloudera.org:8080/18224 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-04 16:49:22 +00:00
Tamas Mate	cd10acdbb1	IMPALA-10737: Optimize the number of Iceberg API Metadata requests Iceberg stores the table metadata next to the data files, when this is accessed through the Iceberg API a filesystem call is executed (HDFS, S3, ADLS). These calls were used in various places during query processing and this patch unifies the Iceberg metadata request in the CatalogD similar to other metadata requests: - CatalogD loads and caches the org.apache.iceberg.BaseTable object. - ImpalaDs requests the org.apache.iceberg.BaseTable from the CatalogD and caches it as well. As a result REFRESH/INVALIDATE METADATA is required to reload any Iceberg metadata changes and the metadata load time is improved. This improvement is more significant for smaller queries, where the metadata request has larger impact on the query execution time. Testing: - Passed Iceberg E2E tests. Change-Id: I9e62a1fb9753ea1b022c7763047d9ccfd1d27d62 Reviewed-on: http://gerrit.cloudera.org:8080/18226 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>	2022-03-04 15:24:23 +00:00
stiga-huang	374783c55e	IMPALA-10898: Add runtime IN-list filters for ORC tables ORC files have optional bloom filter indexes for each column. Since ORC-1.7.0, the C++ reader supports pushing down predicates to skip unrelated RowGroups. The pushed down predicates will be evaluated on file indexes (i.e. statistics and bloom filter indexes). Note that only EQUALS and IN-list predicates can leverage bloom filter indexes. Currently Impala has two kinds of runtime filters: bloom filter and min-max filter. Unfortunately they can't be converted into EQUALS or IN-list predicates. So they can't leverage the file level bloom filter indexes. This patch adds runtime IN-list filters for this purpose. Currently they are generated for the build side of a broadcast join. They will only be applied on ORC tables and be pushed down to the ORC reader (i.e. ORC lib). To avoid exploding the IN-list, if # of distinct values of the build side exceeds a threshold (default to 1024), we set the filter to ALWAYS_TRUE and clear its entry. The threshold can be configured by a new query option, RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT. Evaluating runtime IN-list filters is much slower than evaluating runtime bloom filters due to the current simple implementation (i.e. std::unordered_set) and the lack of codegen. So we disable it at row level. For visibility, this patch adds two counters in the HdfsScanNode: - NumPushedDownPredicates - NumPushedDownRuntimeFilters They reflect the predicates and runtime filters that are pushed down to the ORC reader. Currently, runtime IN-list filters are disabled by default. This patch extends the query option, ENABLED_RUNTIME_FILTER_TYPES, to support a comma separated list of filter types. It defaults to "BLOOM,MIN_MAX". Add "IN_LIST" in it to enable runtime IN-list filters. Ran perf tests on a 3-instance cluster on my desktop using TPC-DS with scale factor 20. It shows significant improvements in some queries: +-----------+-------------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+ \| Workload \| Query \| File Format \| Avg(s) \| Base Avg(s) \| Delta(Avg) \| StdDev(%) \| Base StdDev(%) \| Iters \| Median Diff(%) \| MW Zval \| Tval \| +-----------+-------------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+ \| TPCDS(20) \| TPCDS-Q67A \| orc / snap / block \| 35.07 \| 44.01 \| I -20.32% \| 0.38% \| 1.38% \| 10 \| I -25.69% \| -3.58 \| -45.33 \| \| TPCDS(20) \| TPCDS-Q37 \| orc / snap / block \| 1.08 \| 1.45 \| I -25.23% \| 7.14% \| 3.09% \| 10 \| I -34.09% \| -3.58 \| -12.94 \| \| TPCDS(20) \| TPCDS-Q70A \| orc / snap / block \| 6.30 \| 8.60 \| I -26.81% \| 5.24% \| 4.21% \| 10 \| I -36.67% \| -3.58 \| -14.88 \| \| TPCDS(20) \| TPCDS-Q16 \| orc / snap / block \| 1.33 \| 1.85 \| I -28.28% \| 4.98% \| 5.92% \| 10 \| I -39.38% \| -3.58 \| -12.93 \| \| TPCDS(20) \| TPCDS-Q18A \| orc / snap / block \| 5.70 \| 8.06 \| I -29.25% \| 3.00% \| 4.12% \| 10 \| I -40.30% \| -3.58 \| -19.95 \| \| TPCDS(20) \| TPCDS-Q22A \| orc / snap / block \| 2.01 \| 2.97 \| I -32.21% \| 6.12% \| 5.94% \| 10 \| I -47.68% \| -3.58 \| -14.05 \| \| TPCDS(20) \| TPCDS-Q77A \| orc / snap / block \| 8.49 \| 12.44 \| I -31.75% \| 6.44% \| 3.96% \| 10 \| I -49.71% \| -3.58 \| -16.97 \| \| TPCDS(20) \| TPCDS-Q75 \| orc / snap / block \| 7.76 \| 12.27 \| I -36.76% \| 5.01% \| 3.87% \| 10 \| I -59.56% \| -3.58 \| -23.26 \| \| TPCDS(20) \| TPCDS-Q21 \| orc / snap / block \| 0.71 \| 1.27 \| I -44.26% \| 4.56% \| 4.24% \| 10 \| I -77.31% \| -3.58 \| -28.31 \| \| TPCDS(20) \| TPCDS-Q80A \| orc / snap / block \| 9.24 \| 20.42 \| I -54.77% \| 4.03% \| 3.82% \| 10 \| I -123.12% \| -3.58 \| -40.90 \| \| TPCDS(20) \| TPCDS-Q39-1 \| orc / snap / block \| 1.07 \| 2.26 \| I -52.74% \| * 23.83% * \| 2.60% \| 10 \| I -149.68% \| -3.58 \| -14.43 \| \| TPCDS(20) \| TPCDS-Q39-2 \| orc / snap / block \| 1.00 \| 2.33 \| I -56.95% \| * 19.53% * \| 2.07% \| 10 \| I -151.89% \| -3.58 \| -20.81 \| +-----------+-------------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+ "Base Avg" is the avg of the original time. "Avg" is the current time. However, we also see some regressions due to the suboptimal implementation. The follow-up JIRAs will focus on improvements: - IMPALA-11140: Codegen InListFilter::Insert() and InListFilter::Find() - IMPALA-11141: Use exact data types in IN-list filters instead of casting data to a set of int64_t or a set of string. - IMPALA-11142: Consider IN-list filters in partitioned joins. Tests: - Test IN-list filter on string, date and all integer types - Test IN-list filter with NULL - Test IN-list filter on complex exprs targets Change-Id: I25080628233799aa0b6be18d5a832f1385414501 Reviewed-on: http://gerrit.cloudera.org:8080/18141 Reviewed-by: Qifan Chen <qchen@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-03 00:21:06 +00:00
stiga-huang	b2e4b29f06	IMPALA-11120: Fix codec not set in generating ORC tables We use 'mapred.output.compression.codec' to set the compression codec in generating test files by Hive. However, it doesn't affect ORC files. Instead, we need to set 'orc.compress' in tblproperties for each ORC table. The default value of 'orc.compress' is ZLIB which corresponds to our 'def' codec. We only need to set it for non-def codecs. This patch also fixes a bug in build_compression_codec_statement() that would raise KeyError when loading lz4 non-avro tables. Tests - Loaded tpch data in orc/none/none, orc/def/block, orc/snap/block, orc/lz4/block and verified their compression codecs. Change-Id: I02bd5d9400864145133ff019a3d076a6cab36fcc Reviewed-on: http://gerrit.cloudera.org:8080/18228 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-01 13:09:59 +00:00
Fucun Chu	4186727fe6	IMPALA-10871: Add MetastoreShim to support Apache Hive 3.1.2 Like IMPALA-8369, this patch adds a compatibility shim in fe so that Impala can interoperate with Hive 3.1.2. We need to add a new MetastoreShim class under the compat-apache-hive-3 directory. These shim classes implement methods which differ between cdp-hive-3 and apache-hive-3 and are used by front end code. At build time, based on the environment variable IMPALA_HIVE_DIST_TYPE, one of the two shims is added as source using the fe/pom.xml build plugin. Some code that directly uses Hive 4 APIs needs to be ignored in compilation, e.g. fe/src/main/java/org/apache/impala/catalog/metastore/. A Maven profile is used to ignore such code; the profile is automatically activated based on IMPALA_HIVE_DIST_TYPE. Testing: 1. Code compiles and runs against both HMS-3 and ASF-HMS-3 2. Ran full-suite of tests against HMS-3 3. Running full-tests against ASF-HMS-3 will need more work supporting Tez in the mini-cluster (for dataloading) and HMS transaction support. This will be an on-going effort and test failures on ASF-Hive-3 will be fixed in additional sub-tasks. Notes: 1. Patch uses a custom build of Apache Hive to be deployed in mini-cluster. This build has the fixes for HIVE-21569, HIVE-20038. This hack will be added to the build script in additional sub-tasks. Change-Id: I9f08db5f6da735ac431819063060941f0941f606 Reviewed-on: http://gerrit.cloudera.org:8080/17774 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-27 06:36:19 +00:00
Riza Suminto	873fe2e524	IMPALA-11135: Deflake LEFT ANTI JOIN test case in test_spilling.py TestSpillingDebugActionDimensions.test_spilling has been flaky because a test case from IMPALA-9725 sometimes does not spill its hash join partition. This patch lowers the buffer_pool_limit of this test from 110MB to 105MB, just slightly above its Max Per-Host Resource Reservation (104.61MB), to ensure consistent spilling behavior. Testing: After lowering the buffer pool limit, I loop the test 1000 times, and all spill consistently in fragment "HASH_JOIN_NODE (id=14)". To be specific, these are the num of SpilledPartitions of the first instance (ending with "000d") of "Hash Join Builder (join_node_id=14)" fragment across 1000 query runs: +--------------------+----------+ \| #SpilledPartitions \| #Queries \| +--------------------+----------+ \| 2 \| 30 \| \| 3 \| 96 \| \| 4 \| 674 \| \| 5 \| 52 \| \| 6 \| 146 \| \| 7 \| 2 \| +--------------------+----------+ Change-Id: Idad9fc6ec6a0ba7fc70e0701e567da7165e40e83 Reviewed-on: http://gerrit.cloudera.org:8080/18261 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-24 07:06:50 +00:00
Zoltan Borok-Nagy	8217c230ac	IMPALA-11147: Min/max filtering crashes on Parquet file that contains partition columns Impala crashes on a Parquet file that contains the partition columns. Data files usually don't contain the partition columns, so Impala doesn't expect to find such columns in the data files. Unfortunately min/max filtering generates a SEGFAULT when the partition column is present in the data files. It happens when FindSkipRangesForPagesWithMinMaxFilters() tries to retrieve the Parquet schema element for a given slot descriptor. When the slot descriptor refers to a partition column, we usually don't find a schema element so we don't try to skip pages. But when the partition column is present in the data file, the code tries to calculate the filtered pages for the column. It uses the column reader object corresponding to the column, but this is NULL for partition columns, hence we get a SEGFAULT. The code shouldn't do anything at the page-level for partition columns, as the data in such columns are the same for the whole file and it is already filtered at a higher level. Testing: * added e2e test Change-Id: I17eff4467da3fd67a21353ba2d52d3bec405acd2 Reviewed-on: http://gerrit.cloudera.org:8080/18265 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-24 04:24:08 +00:00
Csaba Ringhofer	7cfc73c2fd	IMPALA-11150: Remove resource-requirements tests on functional_parquet.alltypes These tests became flaky after IMPALA-10961 as it led to smaller and varying size for the table. This is a short term solution to make builds green as fixing the tests properly may take some time. Change-Id: I5bf0f963d3e053345aec27e834974eeead4190ac Reviewed-on: http://gerrit.cloudera.org:8080/18267 Reviewed-by: Fang-Yu Rao <fangyu.rao@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>	2022-02-23 22:54:50 +00:00
stiga-huang	331ff4647d	IMPALA-11137: Enable proleptic Gregorian Calendar for Hive Since HIVE-22589, Hive still uses Julian Calendar for writing dates before 1582-10-15, whereas Impala uses proleptic Gregorian Calendar. This affects the results Impala gets when querying tables written by Hive. Currently, the Avro and ORC formats of date_tbl are suffering this issue. This patch enables proleptic Gregorian Calendar for Hive by default. It also reverts the two commits of IMPALA-9555 which modifies the tests to satisfy the inconsistent results. Tests: - Ran CORE tests Change-Id: I6be9c9720dd352d6821cdaa6c64d35ba20473bc0 Reviewed-on: http://gerrit.cloudera.org:8080/18262 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-23 20:09:10 +00:00
Zoltan Borok-Nagy	b60ccabd5b	IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file Impala returns "Couldn't skip rows in file" error for old Parquet file written by an old Impala (e.g. Impala 2.5, 2.6). In DEBUG build Impala crashes on a DCHECK: Check failed: num_buffered_values_ > 0 (-1 vs. 0) The problem is that in some old Parquet files there can be a mismatch between 'num_values' in a page and the encoded def/rep levels. There is usually one more def/rep level encoded in these files. In SkipTopLevelRows() we skipped values based on how many def levels are `92ce6fe48e/be/src/exec/parquet/parquet-column-readers.cc (L1308-L1314)` Since there are more def levels than values in some old files, num_buffered_values_ could become negative. This patch also takes the value of num_buffered_values_ into account when calculating 'read_count', so we can deal with such files. With this patch we also include the column name in the "Couldn't skip rows" error message, so in the future it'll be easier to identify the problematic columns. Testing: * added Parquet file written by Impala 2.5 and e2e test for it Change-Id: I568fe59df720ea040be4926812412ba4c1510a26 Reviewed-on: http://gerrit.cloudera.org:8080/18257 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-23 03:45:22 +00:00
stiga-huang	1697af02d6	IMPALA-11124: Reuse local TPCH/TPCDS data in testdata loading When loading testdata for TPC-H/TPC-DS, we first run a preload script to generate local data, and then upload them to HDFS to be used by Hive. The preload script currently always generates the data, which is time-consuming in large scale factors. This patch modifies the preload scripts to check if the last run succeeded, and reuse the data if it did. Otherwise, generate the data and leave a success marker in the data directory. Tests: - Verified the scripts locally. Change-Id: Ied40e599cda009ae0ad88ad13385e7bb86428bb4 Reviewed-on: http://gerrit.cloudera.org:8080/18233 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-17 20:28:51 +00:00
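
The success-marker check described in IMPALA-11124 above can be illustrated roughly as follows. This is a minimal Python sketch of the general pattern only, not the actual preload scripts: the marker file name `_SUCCESS`, the function `ensure_local_data`, and the `generate` callback are all hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical marker name; the real preload scripts may use a different one.
SUCCESS_MARKER = "_SUCCESS"


def ensure_local_data(data_dir, generate):
    """Reuse data in data_dir if a previous run left a success marker.

    Returns True when existing data was reused, False when generate()
    had to be called. The marker is written only after generate()
    returns, so a failed or interrupted run is regenerated next time.
    """
    data_dir = Path(data_dir)
    marker = data_dir / SUCCESS_MARKER
    if marker.exists():
        return True
    data_dir.mkdir(parents=True, exist_ok=True)
    generate(data_dir)  # may raise; in that case no marker is written
    marker.touch()
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "tpch"

        def fake_generate(path):
            # Stand-in for the expensive data generation step.
            (path / "lineitem.tbl").write_text("1|2|3\n")

        print(ensure_local_data(target, fake_generate))  # first run generates
        print(ensure_local_data(target, fake_generate))  # second run reuses
```

Writing the marker as the last step means a partially generated directory is never mistaken for a complete one.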

2562 Commits