Commit Graph

1468 Commits

Author SHA1 Message Date
Csaba Ringhofer
4697db0214 IMPALA-5121: Fix AVG() on timestamp col with use_local_tz_for_unix_timestamp_conversions
AVG used to perform a back-and-forth timezone conversion when
use_local_tz_for_unix_timestamp_conversions is true. This could
affect the results if the values fell under different DST rules.

Note that AVG on timestamps has other issues besides this, see
IMPALA-7472 for details.
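
For illustration, a query of the affected shape (table and data are
hypothetical; use_local_tz_for_unix_timestamp_conversions is set):
  -- tz_tbl(ts TIMESTAMP) holds values on both sides of a DST change
  SELECT AVG(ts) FROM tz_tbl;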

Testing:
- added a regression test

Change-Id: I999099de8e07269b96b75d473f5753be4479cecd
Reviewed-on: http://gerrit.cloudera.org:8080/17412
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-14 13:58:04 +00:00
Fucun Chu
b1326f7eff IMPALA-10687: Implement ds_cpc_union() function
This function receives a set of serialized Apache DataSketches CPC
sketches produced by ds_cpc_sketch() and merges them into a single
sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, based on which
partitions the user is interested in, union the relevant sketches
together to get an estimate. E.g.:
  SELECT
      ds_cpc_estimate(ds_cpc_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Testing:
  - Apart from the automated tests I added to this patch I also
    tested ds_cpc_union() on a bigger dataset to check that the
    serialization, deserialization and merging steps work well. I
    took TPCH25.lineitem, created a number of sketches grouped
    by l_shipdate, and called ds_cpc_union() on those sketches.

Change-Id: Ib94b45ae79efcc11adc077dd9df9b9868ae82cb6
Reviewed-on: http://gerrit.cloudera.org:8080/17372
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-13 12:18:55 +00:00
Fucun Chu
e39c30b3cd IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
These functions can be used to get cardinality estimates of data
using the CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.
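
A minimal usage sketch (table and column are illustrative):
  SELECT ds_cpc_estimate(ds_cpc_sketch(l_orderkey))
  FROM tpch_parquet.lineitem;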

Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
For a figures-of-merit comparison of the HLL and CPC sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch_parquet.lineitem to compare performance
   with ndv(). Depending on data characteristics ndv() appears 2x-3x
   faster. CPC gives a closer estimate than the current ndv(), and CPC
   is more accurate than HLL in some cases.

Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Reviewed-on: http://gerrit.cloudera.org:8080/16656
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-11 18:07:40 +00:00
Zoltan Borok-Nagy
e26543426c IMPALA-9967: Add support for reading ORC's TIMESTAMP WITH LOCAL TIMEZONE
ORC-189 and ORC-666 added support for a new timestamp type
'TIMESTAMP WITH LOCAL TIMEZONE' to the ORC library.

This patch adds support for reading such timestamps with Impala.
These are UTC-normalized timestamps, therefore we convert them
to the local timezone during scanning.

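An illustrative flow (the file path and column name are hypothetical):
  CREATE TABLE orc_ltz LIKE ORC '/warehouse/ltz_data/000000_0'
  STORED AS ORC;
  -- ts_col is a TIMESTAMP WITH LOCAL TIMEZONE column in the ORC file;
  -- its UTC-normalized values are converted to the local timezone
  SELECT ts_col FROM orc_ltz;
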
Testing:
 * added test for CREATE TABLE LIKE ORC
 * added scanner tests to test_scanners.py

Change-Id: Icb0c6a43ebea21f1cba5b8f304db7c4bd43967d9
Reviewed-on: http://gerrit.cloudera.org:8080/17347
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-07 14:11:08 +00:00
Zoltan Borok-Nagy
f0f083e45e IMPALA-10482, IMPALA-10493: Fix bugs in full ACID collection query rewrites
IMPALA-10482: SELECT * query on unrelative collection column of
transactional ORC table will hit IllegalStateException.

The AcidRewriter will rewrite queries like
"select item from my_complex_orc.int_array" to
"select item from my_complex_orc t, t.int_array"

This causes trouble in star expansion, because the original query
"select * from my_complex_orc.int_array" is analyzed as
"select item from my_complex_orc.int_array",

but the rewritten query "select * from my_complex_orc t, t.int_array" is
analyzed as "select id, item from my_complex_orc t, t.int_array".

Hidden table refs can also cause issues during regular column
resolution. E.g. when the table has top-level 'pos'/'item'/'key'/'value'
columns.

The workaround is to keep track of the automatically added table refs
during query rewrite, so when we analyze the rewritten query we can
ignore these auxiliary table refs.

IMPALA-10493: Using JOIN ON syntax to join two full ACID collections
produces wrong results.

When AcidRewriter.splitCollectionRef() creates a new collection ref
it doesn't copy all the information needed to correctly execute the
query. E.g. it dropped the ON clause, turning INNER joins into CROSS
joins.

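An illustrative query of the affected shape (names hypothetical):
  SELECT a1.item, a2.item
  FROM full_acid_orc t, t.int_array a1 JOIN t.int_array a2
    ON a1.item = a2.item;
Before the fix the rewrite could drop the ON clause here, silently
turning the INNER join into a CROSS join.
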
Testing:
 * added e2e tests

Change-Id: I8fc758d3c1e75c7066936d590aec8bff8d2b00b0
Reviewed-on: http://gerrit.cloudera.org:8080/17038
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-03 20:42:30 +00:00
Csaba Ringhofer
676f79aa81 IMPALA-10691: Fix multithreading related crash in CAST FORMAT
The issue occurs when a CastFormatExpr is shared among threads and
multiple threads call its OpenEvaluator(). Later calls delete the
DateTimeFormatContext created by earlier calls, which makes
fn_ctx->GetFunctionState() a dangling pointer.

This only happens when CastFormatExpr is shared among
FragmentInstances - in case of scanner threads OpenEvaluator() is
called with THREAD_LOCAL and returns early without modifying anything.
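
The affected expression shape, for illustration (names hypothetical):
  SELECT CAST(s AS TIMESTAMP FORMAT 'YYYY-MM-DD HH24:MI:SS')
  FROM wide_tbl;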

Testing:
- added a regression test

Change-Id: I501c8a184591b1c836b2ca4cada1f2117f9f5c99
Reviewed-on: http://gerrit.cloudera.org:8080/17374
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-30 23:36:31 +00:00
Amogh Margoor
af6adf7618 IMPALA-10654: Fix precision loss in DecimalValue to double conversion.
The original approach to convert a DecimalValue (the internal
representation of decimals) to double was not accurate.
It was:
           static_cast<double>(value_) / pow(10.0, scale).
However, only integers from −2^53 to 2^53 can be represented
accurately by double precision without any loss.
Hence, it would not work for numbers like -0.43149576573887316.
For a DecimalValue representing -0.43149576573887316, value_ would be
-43149576573887316 and scale would be 17. As value_ < -2^53, the
result would not be accurate. The new approach uses the third-party
library https://github.com/lemire/fast_double_parser, which
handles the above scenario in a performant manner.

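The problematic value above can be reproduced directly (illustrative):
  -- value_ = -43149576573887316 lies outside [-2^53, 2^53], so the
  -- old division-based conversion lost precision here
  SELECT CAST(CAST(-0.43149576573887316 AS DECIMAL(17,17)) AS DOUBLE);
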
Testing:
1. Added end-to-end tests covering the following scenarios:
    a. Test to show the precision limitation of 16 in the write path
    b. DecimalValue's value_ between -2^53 and 2^53
    c. value_ outside the above range but abs(value_) < UINT64_MAX
    d. abs(value_) > UINT64_MAX - covers DecimalValue<__int128_t>
2. Ran existing backend and end-to-end tests completely

Change-Id: I56f0652cb8f81a491b87d9b108a94c00ae6c99a1
Reviewed-on: http://gerrit.cloudera.org:8080/17303
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-27 14:52:43 +00:00
Csaba Ringhofer
9355b25e11 IMPALA-10662: Change EE tests to return the same results for HS2 as Beeswax
In EE tests HS2 returned results with smaller precision than Beeswax for
FLOAT/DOUBLE/TIMESTAMP types. These differences are not inherent to the
HS2 protocol - the results are returned with full precision in Thrift
and lose precision during conversion in client code.

This patch changes the conversion in HS2 to match Beeswax and removes
the test section DBAPI_RESULTS that was used to handle the differences:
- float/double: the print method is changed from str() to "{:.16}".format()
- timestamp: impyla's cursor is created with convert_types=False to
             avoid conversion to datetime.datetime (which has only
             microsecond precision)

Note that FLOAT/DOUBLE are still different in impala-shell, this change
only deals with EE tests.

Testing:
- ran the changed tests

Change-Id: If69ae90c6333ff245c2b951af5689e3071f85cb2
Reviewed-on: http://gerrit.cloudera.org:8080/17325
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-20 22:21:32 +00:00
Qifan Chen
a985e1134e IMPALA-10647: Improve always-true min/max filter handling in coordinator
The change improves how a coordinator behaves when a newly
arrived min/max filter is always true. A new member,
'always_true_filter_received_', is introduced to record such a
fact. Similarly, the new member always_false_flipped_to_false_
is added to indicate that the always-false flag is flipped from
'true' to 'false'. These two members only influence how the min
and max columns in the "Filter routing table" and "Final filter
table" in the profile are displayed, as follows:

  1. 'PartialUpdates' - The min and the max are partially updated;
  2. 'AlwaysTrue'     - One received filter is AlwaysTrue;
  3. 'AlwaysFalse'    - No filter is received or all received
                        filters are empty;
  4. 'Real values'    - The final accumulated min/max from all
                        received filters.

A second change is to record, in the scan node, the
arrival time of min/max filters (as a timestamp since the system
was rebooted, obtained by calling MonotonicMillis()). A timestamp
of a similar nature is recorded for HDFS Parquet scanners when a
row group is processed. By comparing these two timestamps, one
can easily diagnose issues related to the late arrival of min/max
filters.

This change also addresses a flaw where rows were unexpectedly
filtered out because the always_true_ flag in
a min/max filter, when set, was ignored in the eval code path
in RuntimeFilter::Eval().

Testing:
  1. Added three new tests in overlap_min_max_filters.test to
     verify that the min/max are displayed correctly when the
     min/max filter in hash join builder is set to always true,
     always false, or a pair of meaningful min and max values.
  2. Ran unit tests;
  3. Ran runtime-filter-test;
  4. Ran core tests successfully.

Change-Id: I326317833979efcbe02ce6c95ad80133dd5c7964
Reviewed-on: http://gerrit.cloudera.org:8080/17252
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-20 00:44:48 +00:00
Qifan Chen
1231208da7 IMPALA-10494: Making use of the min/max column stats to improve min/max filters
This patch adds the functionality to compute the minimum and maximum
value for column types of integer, float/double, date, or decimal for
Parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage is too close to the actual range defined by the column min
and max.

The computation and display of the new column min/max stats can be
controlled by two new Boolean query options (defaulting to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in the coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Reviewed-on: http://gerrit.cloudera.org:8080/17075
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-02 21:50:17 +00:00
stiga-huang
2baed42736 IMPALA-10554: Block updates when row-filter/column-mask is enabled for the user
Per RANGER-1087 and RANGER-1100, table updates (e.g. insert, delete,
truncate, upsert, alter, etc.) should be blocked when a row-filtering or
column-masking policy is enabled for the user.

This patch adds a check for any row-filtering or column-masking policy
on the table and rejects the update operation if any of them exists.

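For illustration, assuming a row-filtering or column-masking policy is
in effect for the current user on masked_tbl (hypothetical):
  INSERT INTO masked_tbl VALUES (1);  -- now rejected
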
Tests:
 - Add FE unit tests
 - Add audit tests
 - Add e2e tests

Change-Id: I1c899f2ec24b895867cbf2cf9ed23bc7b5a77326
Reviewed-on: http://gerrit.cloudera.org:8080/17230
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-31 05:38:50 +00:00
Zoltan Borok-Nagy
dbc2fc14d8 IMPALA-10597: Enable setting 'iceberg.file_format'
Currently we prohibit setting the following properties:

* iceberg.catalog
* iceberg.catalog_location
* iceberg.file_format
* iceberg.table_identifier

This patch enables setting 'iceberg.file_format', therefore if
a table was created by another engine, but using HiveCatalog,
we'll be able to set the data file format to the proper value
and make the table readable by Impala. Setting the other
properties is not needed for HiveCatalog tables.

If the table wasn't created by HiveCatalog, then we cannot load the
table, therefore we cannot invoke any ALTER TABLE statement at all.
In that case we need to create an external table.

If the table already contains data files, then Impala checks if
all of them have the proper file format. If not, the ALTER TABLE
statement fails.

Before this patch a CREATE TABLE statement accepted any string
for 'iceberg.file_format', and in case of invalid file formats the
frontend silently used Parquet. This patch also adds a check to only
allow valid file formats.
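
An illustrative statement that this patch enables (table name
hypothetical):
  ALTER TABLE ice_tbl SET TBLPROPERTIES ('iceberg.file_format'='orc');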

Testing:
 * added e2e test

Change-Id: I4b3506be4562a1ace3e6435867aadb3bdde7a8e2
Reviewed-on: http://gerrit.cloudera.org:8080/17207
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-29 18:32:31 +00:00
Fucun Chu
77d6acd032 IMPALA-10581: Implement ds_theta_intersect_f() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches, computes the intersection of the two
sketches (of the same or different columns), and returns the resulting
sketch of the intersection.

Example:
select ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2))
from sketch_tbl;
+-----------------------------------------------------------+
| ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) |
+-----------------------------------------------------------+
| 5                                                         |
+-----------------------------------------------------------+

Change-Id: I335eada00730036d5433775cfe673e0e4babaa01
Reviewed-on: http://gerrit.cloudera.org:8080/17186
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-29 15:59:49 +00:00
Fucun Chu
622e3c95ad IMPALA-10580: Implement ds_theta_union_f() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches, unions the two sketches, and returns the
resulting sketch of the union.

Example:
select ds_theta_estimate(ds_theta_union_f(sketch1, sketch2))
from sketch_tbl;
+-------------------------------------------------------+
| ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) |
+-------------------------------------------------------+
| 15                                                    |
+-------------------------------------------------------+

Change-Id: I8329979b81ceeaad739a43fab79768ca9c2916fa
Reviewed-on: http://gerrit.cloudera.org:8080/17179
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-24 15:16:07 +00:00
wzhou-code
410c3e79e4 IMPALA-10564: Return error when inserting an invalid decimal value
When using CTAS statements or INSERT-SELECT statements to insert rows into
a table with decimal columns, Impala inserted NULL for overflowed decimal
values instead of returning an error. This issue happens when the data
expression for the decimal column in the SELECT sub-query contains at
least one alias.
This issue is similar to IMPALA-6340, but IMPALA-6340 only fixed the
issue for the cases where the data expression for the decimal columns is
a constant.

This patch fixes the issue by calling RuntimeState::CheckQueryState()
at the end of HdfsTableWriter::AppendRows() and KuduTableSink::Send().
If there is an invalid decimal error, the query now fails without
inserting NULL for the decimal column.
We did not change the behaviour for decimal_v1: NULL is still inserted
into the table for invalid decimal values, with a warning message.

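An illustrative INSERT-SELECT of the affected shape, where the
overflowing value reaches the decimal column through an alias (names
hypothetical):
  INSERT INTO dec_tbl
  SELECT CAST(v AS DECIMAL(4,2)) FROM (SELECT 123456.789 AS v) t;
Under decimal_v2 this now fails instead of silently inserting NULL.
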
Tests:
 - Added unit-tests for INSERT-SELECT and CTAS statements with
   overflowed decimal values to be inserted into tables. The
   overflowed decimal values are expressed as a constant expression,
   or as an expression with aliases.
   Also added cases to verify that the behaviour of decimal_v1 is unchanged.
 - Passed exhaustive tests.

Change-Id: I64ce4ed194af81ef06401ffc1124e12f05b8da98
Reviewed-on: http://gerrit.cloudera.org:8080/17168
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-23 22:52:38 +00:00
stiga-huang
a0f77680c5 IMPALA-10483: Support subqueries in Ranger masking policies
This patch adds support for using subqueries in Ranger masking policies,
i.e. column-masking/row-filtering policies. The subquery can reference
either the current table or other tables. However, masking policies on
these tables won't be applied recursively. This is consistent with Hive.
One motivation is to avoid infinite masking if the subquery references
the same table. Another motivation, I think, is to simplify the masking
behavior: when the admin is setting a masking expression, it can be
considered as running from the admin's perspective (i.e. no masking).

Implementation
Before analyzing the query, the coordinator loads the metadata of all
possibly used tables into the query's StmtTableCache. Table masking
takes place after the analyzing phase. If the subquery filter introduces
any new tables, the analyzer will fail to resolve them since their
metadata is not loaded in the StmtTableCache. This patch modifies the
StmtMetadataLoader to also load the tables introduced by masking
policies, so they can be resolved correctly.

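A row-filter expression of the kind that now works, as it would be set
in the Ranger policy (names hypothetical):
  id IN (SELECT id FROM default.visible_ids)
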
Tests
 - Add more complex tests in test_row_filtering

Change-Id: I254df9f684c95c660f402abd99ca12dded7e764f
Reviewed-on: http://gerrit.cloudera.org:8080/17185
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-22 15:52:03 +00:00
stiga-huang
c9d7bcb4a1 IMPALA-9661: Avoid introducing unused columns in table masking view
Previously, if a table has column masking policies, we replace its
unanalyzed TableRef with an analyzed InlineViewRef (table masking view)
in FromClause.analyze(). However, we can't detect which columns are
actually used in the original query at this point. In fact, analyze()
for SelectList, WhereClause, GroupByClause and other clauses containing
SlotRefs happens after FromClause.analyze(). After the whole query block
is analyzed, we can get the exact set of required columns.

This patch refactors the code to do table masking after analyze() to
avoid introducing unused columns. Referenced columns of a TableRef are
registered in analyze(), which helps to figure out what columns are
actually needed.

With this, we don't need to revert table masking in FromClause.reset().
The doTableMasking flag in the AST is also removed since the table mask
is now resolved once after analyze().

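An illustrative rewrite, where mask(...) stands for the configured
masking expression (names hypothetical):
  SELECT c1 FROM masked_tbl;
  -- is now rewritten roughly as
  SELECT c1 FROM (SELECT mask(c1) AS c1 FROM masked_tbl) masked_tbl;
  -- instead of wrapping every column of masked_tbl in the inner view
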
Tests:
 - Add more e2e tests in test_ranger.py
 - Run CORE tests

Change-Id: Ib015a8ab528065907b27fbdceb8e2818deb814e1
Reviewed-on: http://gerrit.cloudera.org:8080/17199
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-22 08:41:00 +00:00
stiga-huang
98de1c5436 IMPALA-9234: Support Ranger row filtering policies
Ranger row filtering policies provide customized expressions to filter
out rows for specific users when reading from a table. This patch adds
support for this feature. A new feature flag, enable_row_filtering, is
added to disable this experimental feature. It defaults to true, so
the feature is enabled by default. Enabling row-filtering requires
--enable_column_masking=true since it depends on the column masking
implementation.

Note that row filtering policies take effect prior to any column
masking policies, because column masking policies apply to result data.

Implementation:
The existing table masking view infrastructure can be extended to
support row filtering. Currently when analyzing a table with column
masking policies, we replace the TableRef with an InlineViewRef which
contains a SelectStmt wrapping the columns with masking expressions.
This patch adds the row filtering expressions to the WhereClause of the
SelectStmt.
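
Roughly, for a hypothetical row-filter expression region = 'EU', a scan
of the table is rewritten as:
  SELECT * FROM t;
  -- becomes
  SELECT * FROM (SELECT c1, c2 FROM t WHERE region = 'EU') t;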

Limitations:
 - Expressions using subqueries are not supported (IMPALA-10483).
 - Row filtering policies on nested tables will not be applied when
   nested collection columns are used directly in the FROM clause. This
   would leak data, so we forbid such kinds of queries until IMPALA-10484
   is resolved.

Tests:
 - Add FE test for error message when disabling row filtering.
 - Add e2e test with row filtering policies.
 - Add e2e test with column masking and row filtering policies both
   taking effect.
 - Verified audits in a CDP cluster with Ranger and Solr set up.

Change-Id: I580517be241225ca15e45686381b78890178d7cc
Reviewed-on: http://gerrit.cloudera.org:8080/16976
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-18 21:08:14 +00:00
Zoltan Borok-Nagy
6162343842 IMPALA-10512: ALTER TABLE ADD PARTITION should bump the write id for ACID tables
ALTER TABLE ADD PARTITION should bump the write id for ACID tables.
Both for INSERT-only and full ACID tables.

For transactional tables we add partitions in an ACID
transaction in the following sequence:

1. open transaction
2. allocate write id for table
3. add partitions to HMS table
4. commit transaction

However, please note that table metadata modifications are
independent of ACID transactions. I.e. if adding the partitions succeeds,
but we cannot commit the transaction, then the newly added
partitions won't get removed.

So why are we opening a txn then? We are doing it in order to bump
the write id in a best-effort way. This aids table metadata caching,
so by looking at the table write id we can determine if the cached
table metadata is up-to-date.
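
For illustration (table name is hypothetical):
  -- now also allocates a write id for acid_tbl
  ALTER TABLE acid_tbl ADD PARTITION (p=1);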

Testing:
 * added e2e test

Change-Id: Iad247008b7c206db00516326c1447bd00a9b34bd
Reviewed-on: http://gerrit.cloudera.org:8080/17081
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-18 19:35:58 +00:00
Fucun Chu
3e82501531 IMPALA-10558: Implement ds_theta_exclude() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches and computes the a-not-b set operation
given two sketches of the same or different columns.

Example:
select ds_theta_estimate(ds_theta_exclude(sketch1, sketch2))
from sketch_tbl;
+-------------------------------------------------------+
| ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) |
+-------------------------------------------------------+
| 5                                                     |
+-------------------------------------------------------+

Change-Id: I05119fd8c652c07ff248a99e44b0da3541e46ca3
Reviewed-on: http://gerrit.cloudera.org:8080/17153
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-17 22:14:44 +00:00
Riza Suminto
47219ec366 IMPALA-10565: Adjust result spooling memory based on scratch_limit
IMPALA-9856 enables result spooling by default. Result spooling depends
on the ability to spill its entire BufferedTupleStream to disk once it
hits maximum memory reservation. However, if the query option
scratch_limit is set lower than max_spilled_result_spooling_mem, the
query might fail in the middle of execution due to insufficient scratch
space. This patch adds a planner change to consider the scratch_limit and
scratch_dirs query options when computing the resources used by result
spooling. The algorithm is as follows:

* If scratch_dirs is empty or scratch_limit < minMemReservationBytes
  required to use BufferedPlanRootSink, we set spool_query_results to
  false and fall back to using BlockingPlanRootSink.

* If scratch_limit > minMemReservationBytes but still fairly low, we
  lower the max_result_spooling_mem (default is 100MB) and
  max_spilled_result_spooling_mem (default is 1GB) to fit scratch_limit.

* If scratch_limit > max_spilled_result_spooling_mem, do nothing.

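A hypothetical session illustrating the first case:
  SET SCRATCH_LIMIT=0;    -- no usable scratch space
  SELECT * FROM big_tbl;  -- planner falls back to BlockingPlanRootSink
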
Testing:
- Add TestScratchLimit::test_result_spooling_and_varying_scratch_limit
- Verify that spool_query_results query option is disabled in
  TestScratchDir::test_no_dirs
- Pass exhaustive tests.

Change-Id: I541f46e6911694e14c0fc25be1a6982fd929d3a9
Reviewed-on: http://gerrit.cloudera.org:8080/17166
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Aman Sinha <amsinha@cloudera.com>
2021-03-14 03:35:40 +00:00
Zoltan Borok-Nagy
6c6b0ee869 IMPALA-10222: CREATE TABLE AS SELECT for Iceberg tables
This patch adds support for CREATE TABLE AS SELECT statements
for Iceberg tables.

CTAS statements work like the following in Impala:

1. Analysis of the whole CTAS statement
2. Divide CTAS to CREATE stmt and INSERT stmt
3. Create temporary in-memory target table from the CREATE stmt
4. Analyse the INSERT statement by using the temporary target table
5. If everything is OK so far, create the target table
6. Execute the INSERT query

For Iceberg tables the non-trivial part was creating the temporary
target table without actually creating it via the Iceberg API. I've
created a new class, 'IcebergCtasTarget', that mimics an FeIceberg
table. It can be used with catalog V1 and V2 as well.

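An illustrative CTAS (names hypothetical):
  CREATE TABLE ice_ctas STORED AS ICEBERG
  AS SELECT id, name FROM src_tbl;
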
Testing
 * e2e CTAS tests in iceberg-ctas.test
 * SHOW CREATE TABLE stmts in show-create-table.test

Change-Id: I81d2084e401b9fa74d5ad161b51fd3e2aa3fcc67
Reviewed-on: http://gerrit.cloudera.org:8080/17130
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 19:28:19 +00:00
Fucun Chu
0d22e89df4 IMPALA-10520: Implement ds_theta_intersect() function
This function receives a set of serialized Apache DataSketches Theta
sketches produced by ds_theta_sketch() and intersects them into a
single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then intersect the
relevant sketches to get estimates based on the partitions the user is
interested in. E.g.:
  SELECT
      ds_theta_estimate(ds_theta_intersect(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Testing:
  - Apart from the automated tests I added to this patch I also
    tested ds_theta_intersect() on a bigger dataset to check that the
    serialization, deserialization and merging steps work well. I
    took TPCH25.lineitem, created a number of sketches grouped
    by l_shipdate, and called ds_theta_intersect() on those sketches.

Change-Id: I80e68c2151c4604f0386d3dfb004c82b10293f97
Reviewed-on: http://gerrit.cloudera.org:8080/17088
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 16:13:48 +00:00
liuyao
1a01bfe831 IMPALA-10377: Improve the accuracy of resource estimation
PlanNode does not consider some factors when estimating memory,
which can cause a large error rate.

AggregationNode
1.MemoryEstimate = Ndv * (AvgRowSize + SizeOfBucket)
2.When estimating the Ndv of merge aggregation, Ndv should be
  divided only once.
3.If there are no grouping exprs, MemoryEstimate =
  MIN_PLAIN_AGG_MEM

SortNode
1.MemoryEstimate = Cardinality * AvgRowSize, i.e. the memory used
  when there is enough memory

HashJoinNode
1.MemoryEstimate= DataRows + Buckets + DuplicateNodes,
  DataRows = RightTableCardinality * AvgRowSize,
  Buckets= roundUpToPowerOf2(RightTableCardinality) *
           SizeOfBucket,
  DuplicateNodes = (RightTableCardinality - RightNdv) *
                    SizeOfDuplicateNode

KuduScanNode
1.MemoryEstimate = Columns * BytesPerColumn * MaxScannerThreads,
  where Columns counts only the columns scanned in the query, not
  all the columns of the table

UnitTest
1.CardinalityTest adds test cases to test memory estimation.
  Modified existing test cases related to memory estimation

Change-Id: Ic01db168ff2c6d6de33ee553a8175599f035d7a1
Reviewed-on: http://gerrit.cloudera.org:8080/16842
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 14:23:04 +00:00
Fang-Yu Rao
2039746ebe IMPALA-10576: Add refresh authorization to make a test case less flaky
We found that a test case in test_grant_revoke_with_role() that is
used to verify that a requesting user does not possess the necessary
privilege to perform the GRANT operation could fail since the expected
AuthorizationException is not returned after the query. Since the
GRANT privilege was revoked immediately before this test case, we
suspect the authorization-related metadata had not been updated. To make
this test case less flaky, in this patch we add a REFRESH AUTHORIZATION
after the query that revoked the GRANT privilege from the requesting
user.

Testing:
 - Verified that this patch passes the core tests in an ASAN build.

Change-Id: I7407bac0407e162ab5ba623505bd7ee49bdf3abf
Reviewed-on: http://gerrit.cloudera.org:8080/17165
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-12 00:00:03 +00:00
Riza Suminto
49ac55fb69 IMPALA-9856: Enable result spooling by default.
Result spooling has been relatively stable since it was introduced, and
it has several benefits described in IMPALA-8656. This patch enables the
result spooling (SPOOL_QUERY_RESULTS) query option by default.

Furthermore, some tests needed to be adjusted to account for result
spooling being on by default. The following are the adjustment categories
and the lists of tests that fall under each category.

Change in assertions:
PlannerTest#testAcidTableScans
PlannerTest#testBloomFilterAssignment
PlannerTest#testConstantFolding
PlannerTest#testFkPkJoinDetection
PlannerTest#testFkPkJoinDetectionWithHDFSNumRowsEstDisabled
PlannerTest#testKuduSelectivity
PlannerTest#testMaxRowSize
PlannerTest#testMinMaxRuntimeFilters
PlannerTest#testMinMaxRuntimeFiltersWithHDFSNumRowsEstDisabled
PlannerTest#testMtDopValidation
PlannerTest#testParquetFiltering
PlannerTest#testParquetFilteringDisabled
PlannerTest#testPartitionPruning
PlannerTest#testPreaggBytesLimit
PlannerTest#testResourceRequirements
PlannerTest#testRuntimeFilterQueryOptions
PlannerTest#testSortExprMaterialization
PlannerTest#testSpillableBufferSizing
PlannerTest#testTableSample
PlannerTest#testTpch
PlannerTest#testKuduTpch
PlannerTest#testTpchNested
PlannerTest#testUnion
TpcdsPlannerTest
custom_cluster/test_admission_controller.py::TestAdmissionController::test_dedicated_coordinator_planner_estimates
custom_cluster/test_admission_controller.py::TestAdmissionController::test_memory_rejection
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_mem_limit_configs
metadata/test_explain.py::TestExplain::test_explain_level2
metadata/test_explain.py::TestExplain::test_explain_level3
metadata/test_stats_extrapolation.py::TestStatsExtrapolation::test_stats_extrapolation

Increase BUFFER_POOL_LIMIT:
query_test/test_queries.py::TestQueries::test_analytic_fns
query_test/test_runtime_filters.py::TestRuntimeRowFilters::test_row_filter_reservation
query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output
query_test/test_spilling.py::TestSpillingBroadcastJoins::test_spilling_broadcast_joins
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_aggs
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_regression_exhaustive
query_test/test_udfs.py::TestUdfExecution::test_mem_limits

Increase MEM_LIMIT:
query_test/test_mem_usage_scaling.py::TestExchangeMemUsage::test_exchange_mem_usage_scaling
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling

Increase MAX_ROW_SIZE:
custom_cluster/test_parquet_max_page_header.py::TestParquetMaxPageHeader::test_large_page_header_config
query_test/test_insert.py::TestInsertQueries::test_insert_large_string
query_test/test_query_mem_limit.py::TestQueryMemLimit::test_mem_limit
query_test/test_scanners.py::TestTextSplitDelimiters::test_text_split_across_buffers_delimiter
query_test/test_scanners.py::TestWideRow::test_wide_row

Disable result spooling to maintain assertion:
custom_cluster/test_admission_controller.py::TestAdmissionController::test_set_request_pool
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_host_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_pool_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_queue_reasons_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_config_change_while_queued
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_fetched_rows
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_finished_query
custom_cluster/test_scratch_disk.py::TestScratchDir::test_no_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_existing_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_writable_dirs
query_test/test_insert.py::TestInsertQueries::test_insert_large_string (the last query only)
query_test/test_kudu.py::TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_kudu_scan_mem_usage
query_test/test_queries.py::TestQueriesParquetTables::test_very_large_strings
query_test/test_query_mem_limit.py::TestCodegenMemLimit::test_codegen_mem_limit
shell/test_shell_client.py::TestShellClient::test_fetch_size

Testing:
- Pass exhaustive tests.

Change-Id: I9e360c1428676d8f3fab5d95efee18aca085eba4
Reviewed-on: http://gerrit.cloudera.org:8080/16755
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-02 04:58:51 +00:00
Qifan Chen
16493c0416 IMPALA-10532: TestOverlapMinMaxFilters.test_overlap_min_max_filters seems flaky
This patch addresses the flakiness seen with a particular test within
overlap_min_max_filters by allowing the sum of NumRuntimeFilteredPages
to be greater than an expected value. Previously, such a sum could only
be equal to the expected value, which is not sufficient for various test
conditions in which the scan of the Parquet data files can start
before the arrival of a runtime filter.

The extension in test_result_verifier.py allows '>' and '<' conditions
to be expressed for aggregation(SUM, <counter>), such as
aggregation(SUM, NumRuntimeFilteredPages) > 80.

Testing:
 - Ran TestOverlapMinMaxFilters.

Change-Id: I93940a104bfb2d68cb1d41d7e303348190fd5972
Reviewed-on: http://gerrit.cloudera.org:8080/17111
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-26 22:52:16 +00:00
Fucun Chu
7f60990028 IMPALA-10467: Implement ds_theta_union() function
This function receives a set of serialized Apache DataSketches Theta
sketches produced by ds_theta_sketch() and merges them into a single
sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, based on which
partitions the user is interested in, union the relevant sketches
together to get an estimate. E.g.:
  SELECT
      ds_theta_estimate(ds_theta_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Testing:
  - Apart from the automated tests I added to this patch I also
    tested ds_theta_union() on a bigger dataset to check that the
    serialization, deserialization and merging steps work well. I
    took TPCH25.lineitem, created a number of sketches grouped
    by l_shipdate, and called ds_theta_union() on those sketches.

Change-Id: I91baf58c76eb43748acd5245047edac8c66761b2
Reviewed-on: http://gerrit.cloudera.org:8080/17048
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-19 13:32:09 +00:00
Qifan Chen
ebb2e06639 IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
This patch adds a new class of predicates called overlap predicates
to aid in the acceptance or rejection of a row group, a page, or a
row in a Parquet table, utilizing the minimum and maximum values
gathered from an equi hash join and the Parquet column index stats.
When a row group or page is rejected, all rows contained within are
rejected together.

For example in the following query, the min and max in the overlap
predicate are computed from the join column from table 'b', and
are compared against the min/max of each row group or page at the
scan node for 'a'.

  select straight_join count(*)
  from lineitem a join [SHUFFLE] lineitem b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";

An overlap predicate associated with the column type B in the hash
table and scan column type A will be formed when both A and B are,
or can be converted to, one of the following:
  1. booleans;
  2. integers (tinyint, smallint, int, or bigint);
  3. approximate numeric (float or double);
  4. decimals with the same precision and scale;
  5. strings;
  6. date; or
  7. timestamps.

The overlap predicate is implemented as a min/max filter and can be
observed in the explain output of a query.

A new query option 'minmax_filter_threshold' is provided to control
the new feature. Setting it to 0.0 disables the feature. Setting it
to a value > 0.0 but less than 1.0 provides a threshold. An overlap
predicate will be evaluated against a row group and possibly the
contained pages/rows, as long as its overlap ratio is less than the
threshold. The overlap ratio is the common area of the row group
and the filter, divided by the area of the row group.

A second query option, minmax_filtering_level, is provided to
specify the filtering scope:
  1. ROW_GROUP: the overlap is only tested for row groups;
  2. PAGE: the overlap is tested for both row groups and pages;
  3. ROW: the overlap is for row groups, pages and rows.

Two new run-time profile counters are added to report the number of
row groups or pages filtered out via the overlap predicates
respectively:
  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Two new column "Min value" and "Max value" are added to the
"Filter routing table" and "Final filter table" in profile to
display the min and the max values for a min/max filter.

Testing:
1. Unit tested on various column types with TPCH and TPCDS tables.
   Benefits were significant when the join column on the outer table
   is sorted and there exist many row groups or pages not overlapping
   with the min/max filters;
2. Added following new tests:
    a) In overlap_min_max_filters.test to demonstrate the number of
       filtered out pages and row groups with the two new profile
       counters;
    b) In runtime-filter-propagation.test to demonstrate that the
       overlap predicates work with different column types;
3. Core testing;
4. Performance measurement: the overall improvement with 3TB scale
   TPCDS is 1.45% with the filter threshold at 0.5 and the filtering
   level at ROW_GROUP. Good improvements (over 10%) are seen with
   query 16, 25, 62, 83, 94 and 99, due to the join column
   ship_date_sk being strongly correlated to the partition column
   sold_date_sk.

To do in follow-up JIRAs:
1. Improve filtering efficiency;
2. Apply the overlap predicate on partition columns;
3. IR code-gen for various MinMaxFilter::EvalOverlap methods.
4. Address the current limitation that the "Min value" and
   "Max value" columns may be empty for LOCAL filters.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Reviewed-on: http://gerrit.cloudera.org:8080/16720
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-18 14:31:51 +00:00
Fucun Chu
65c6a81ed9 IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions
These functions can be used to get cardinality estimates of data
using Theta algorithm from Apache DataSketches. ds_theta_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized Theta sketch in string format. This can be written to a
table or be fed directly to ds_theta_estimate() that returns the
cardinality estimate for that sketch.
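
A minimal usage sketch (table and column are illustrative):
  SELECT ds_theta_estimate(ds_theta_sketch(l_orderkey))
  FROM tpch_parquet.lineitem;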

Similar to the HLL sketch, the primary use-case for the Theta sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch25_parquet.lineitem to compare performance
   with ds_hll_*. ds_theta_* is faster than ds_hll_* on the original
   data; the difference is around 1%-10%. ds_hll_estimate() is faster
   than ds_theta_estimate() on an existing sketch. HLL and Theta give
   close estimates, except for strings; see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Reviewed-on: http://gerrit.cloudera.org:8080/17008
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-17 17:09:48 +00:00
Tamas Mate
dc133d9513 IMPALA-10499: Fix failing test_misc
This change modifies the result type of the misc test which was failing.

Testing:
 - executed the misc tests with exhaustive exploration strategy

Change-Id: Ibe95f4bc3521f49d19e6da53deb904a25ac30982
Reviewed-on: http://gerrit.cloudera.org:8080/17066
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-15 22:25:41 +00:00
Aman Sinha
baf81dea6d IMPALA-9745: Propagate source type when doing constant propagation
When doing constant propagation, the source type was not being
propagated to the target expression, leading to an analysis failure.
The behavior is most easily reproducible with STRING to TIMESTAMP
conversion in the presence of other predicates.

This patch fixes this by adding an implicit cast if needed for such
cases.

Testing:
 - Added planner test and ran other planner tests
 - Added end-to-end test

Change-Id: Ic3853478945229440f733c256ea225639f9178ff
Reviewed-on: http://gerrit.cloudera.org:8080/17064
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Aman Sinha <amsinha@cloudera.com>
2021-02-13 23:02:54 +00:00
Tim Armstrong
b42c64993d IMPALA-9979: part 2: partitioned top-n
Planner changes:
---------------
The planner now identifies predicates that can be converted into
limits in a partitioned or unpartitioned top-n with the following
method:
* Push down predicates that reference analytic tuple into inline view.
  These will be evaluated after the analytic plan for the inline
  SelectStmt is generated.
* Identify predicates that reference the analytic tuple and could
  be converted to limits.
* If they can be applied to the last sort group of the analytic
  plan, and the windows are all compatible, then the lowest
  limit gets converted into a limit in the top N.
* Otherwise generate a select node with the conjuncts. We add
  logic to merge SELECT nodes to avoid generating duplicates
  from inside and outside the inline view.
* The pushed predicate is still added to the SELECT node
  because it is necessary for correctness for predicates
  like '=' to filter additional rows and also the limit
  pushdown optimization looks for analytic predicates
  there, so retaining all predicates simplifies that.
  The selectivity of the predicate is adjusted so that
  cardinality estimates remain accurate.
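
For illustration, a query shape that this method converts into a
partitioned top-n (names hypothetical):
  SELECT * FROM (
    SELECT c1, c2, row_number() OVER (PARTITION BY c1 ORDER BY c2) rn
    FROM t) v
  WHERE rn <= 10;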

The optimization can be disabled by setting
ANALYTIC_RANK_PUSHDOWN_THRESHOLD=0. By default it is
only enabled for limits of 1000 or less, because the
in-memory Top-N may perform significantly worse than
a full sort for large heaps (since updating the heap
for every input row ends up being more expensive than
doing a traditional sort). We could probably optimize
this more with better tuning so that it can gracefully
fall back to doing the full sort at runtime.

rank() and row_number() are handled. rank() needs support in
the TopN node to include ties for the last place, which is
also added in this patch.

If predicates are trivially false, we generate empty nodes.

This interacts with the limit pushdown optimization. The limit
pushdown optimization is applied after the partitioned top-n
is generated, and can sometimes result in more optimal plans,
so it is generalized to handle pushing into partitioned top-n
nodes.

Backend changes:
---------------
The top-n node in the backend is augmented to handle
the partitioned case, for which we use a std::map and a
comparator based on the partition exprs. The partitioned
top-n node has a soft limit of 64MB on the size of the
in-memory heaps and can spill with use of an embedded Sorter.
The current implementation tries to evict heaps that are
less effective at filtering rows.

Limitations:
-----------
There are several possible extensions to this that we did not do:
* dense_rank() is not supported because it would require additional
  backend support - IMPALA-10014.
* ntile() is not supported because it would require additional
  backend support - IMPALA-10174.
* Only one predicate per analytic is pushed.
* Redundant rank()/row_number() predicates are not merged,
  only the lowest is chosen.
* Lower bounds are not converted into OFFSET.
* The analytic operator cannot be eliminated even if the analytic
  expression was only used in the predicate.
* This doesn't push predicates into UNION - IMPALA-10013
* Always false predicates don't result in empty plan - IMPALA-10015

Tests:
-----
* Planner tests - added tests that exercise the interesting code
  paths added in planning.
  - Predicate ordering in SELECT nodes changed in a couple of cases
    because some predicates were pushed into the inline views.
* Modified SORT targeted perf tests to avoid conversion to Top-N
* Added targeted perf test for partitioned top-n.
* End-to-end tests
 - Unpartitioned Top-N end-to-end tests
 - Basic partitioning and duplicate handling tests on functional
 - Similar basic tests on larger inputs from TPC-DS and with
   larger partition counts.
 - I inspected the results and also ran the same tests with
   analytic_rank_pushdown_threshold=0 to confirm that the
   results were the same as with the full sort.
 - Fallback to spilling sort.

Perf:
-----
Added a targeted benchmark that goes from ~2s to ~1s with
mt_dop=8 on TPC-H 30 on my desktop.

Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5
Reviewed-on: http://gerrit.cloudera.org:8080/16242
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-10 23:52:28 +00:00
Tamas Mate
701714b10a IMPALA-10379: Add missing HiveLexer classes to shared-deps
HIVE-19064 introduced additional lexer classes that are required at
runtime. This commit adds the missing HiveLexer classes to the
shared-deps. Without these classes, queries such as 'select 1 as "``"'
would fail with 'NoClassDefFoundError'.

Testing:
 - added a misc.test to verify that the classes are available and that
IMPALA-9641 is fixed by HIVE-19064

Change-Id: I6e3a00335983f26498c1130ab9f109f6e67256f5
Reviewed-on: http://gerrit.cloudera.org:8080/17019
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-07 05:20:48 +00:00
Zoltan Borok-Nagy
a3f441193d IMPALA-10223: Implement INSERT OVERWRITE for Iceberg tables
This patch adds support for INSERT OVERWRITE statements for
Iceberg tables. We use Iceberg's ReplacePartitions interface
for this. This interface provides consistent behavior with
INSERT OVERWRITEs against regular tables. It's also consistent
with other engines' dynamic overwrites, e.g. Spark's.

INSERT OVERWRITE for partitioned tables replaces the partitions
affected by the INSERT, while keeping the other partitions
untouched.
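
An illustrative statement (names hypothetical):
  -- replaces only the partitions that receive new rows
  INSERT OVERWRITE ice_part_tbl SELECT * FROM staging_tbl;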

INSERT OVERWRITE is prohibited for tables that use the BUCKET
partition transform because it would randomly overwrite table
data.

Testing
 * added e2e test

Change-Id: Idf4acfb54cf62a3f3b2e8db9d04044580151299c
Reviewed-on: http://gerrit.cloudera.org:8080/17012
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-05 14:46:08 +00:00
stiga-huang
0473e1b973 IMPALA-10473: Fix wrong analytic results on constant partition/order by exprs
When the Partition-by and Order-by expressions of an analytic are all
constants, it should be evaluated in a single unpartitioned fragment
(same as analytics that have no Partition-by/Order-by exprs). Currently,
it's placed within the same fragment as the child node, which causes
it to be computed locally and produce incorrect results when the fragment
is partitioned.

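An illustrative query of the affected shape (names hypothetical):
  SELECT count(*) OVER (PARTITION BY 'x') FROM t;
The analytic is now evaluated in a single unpartitioned fragment.
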
Tests:
 - Added planner tests
 - Added e2e tests

Change-Id: Ibc88a410dab984ff37e27dc635bee5f289003a2a
Reviewed-on: http://gerrit.cloudera.org:8080/17023
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-05 11:54:00 +00:00
Zoltan Borok-Nagy
646b0e011c IMPALA-10456: Implement TRUNCATE for Iceberg tables
This patch adds support for the TRUNCATE statement for
Iceberg tables.

The TRUNCATE operation creates a new snapshot for the target
table that doesn't have any data files. Table and column stats
are also cleared. This patch also fixes a bug that caused
table/column stats not to be propagated.

Testing
 * added e2e tests for both partitioned and unpartitioned tables

Change-Id: I6116c7c36aba871c0be79f499e0ac618072ca7b8
Reviewed-on: http://gerrit.cloudera.org:8080/16987
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: wangsheng <skyyws@163.com>
2021-02-01 11:14:01 +00:00
xqhe
4ae847bf94 IMPALA-10382: fix invalid outer join simplification
When ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION is set to true, the planner
will simplify outer joins even if the predicate contains a case expr or a
conditional function on both sides of the outer join.
However, such a predicate may not be null-rejecting; if we simplify the
outer join anyway, the result is incorrect. E.g. t1.b > coalesce(t1.c, t2.c)
can return true if t2.c is null, so it is not a null-rejecting predicate
for t2.

The fix is to only support the case where the predicate has two
operands, the operator is one of (=, !=, >, <, >=, <=), and either
1. one of the operands, or
2. if an operand is an arithmetic expression, one of its children,
does not contain a conditional builtin function or case expr and has a
tuple id in the outer-joined tuples.
E.g. t1.b > coalesce(t2.c, t1.c) or t1.b + coalesce(t2.c, t1.c) >
coalesce(t2.c, t1.c) is a null-rejecting predicate for t1.

Testing:
* Add new plan tests in outer-to-inner-joins.test
* Add new query tests to verify the correctness on transformation

Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b
Reviewed-on: http://gerrit.cloudera.org:8080/16845
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2021-01-27 17:30:37 +00:00
Zoltan Borok-Nagy
08367e91f0 IMPALA-10452: CREATE Iceberg tables with old PARTITIONED BY syntax
For convenience this patch adds support with the old-style
CREATE TABLE ... PARTITIONED BY ...; syntax for Iceberg tables.

So users should be able to write the following:

CREATE TABLE ice_t (i int)
PARTITIONED BY (p int)
STORED AS ICEBERG;

Which should be equivalent to this:

CREATE TABLE ice_t (i int, p int)
PARTITION BY SPEC (p IDENTITY)
STORED AS ICEBERG;

Please note that the old-style CREATE TABLE statement creates
IDENTITY-partitioned tables. For other partition transforms the
users must use the new, more generic syntax.

Hive also supports the old PARTITIONED BY syntax with the same
behavior.

Testing:
 * added e2e tests

Change-Id: I789876c161bc0987820955aa9ae01414e0dcb45d
Reviewed-on: http://gerrit.cloudera.org:8080/16979
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 22:12:25 +00:00
stiga-huang
e8720b40f1 IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
A unicode character can be encoded into 1-4 bytes in UTF-8. String
functions will return undesired results when the input contains unicode
characters, because we deal with a string as a byte array. For instance,
length() returns the length in bytes, not in unicode characters.

UTF-8 is the dominant unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support in some string functions so they can have
UTF-8 aware behavior. For compatibility with the old versions, a new
query option, UTF8_MODE, is added for turning on/off the UTF-8 aware
behavior. Currently, only length(), substring() and reverse() support
it. Support for other functions will be added in later patches.

String functions will check the query option and switch to use the
desired implementation. It's similar to how we use the decimal_v2 query
option in builtin functions.

For easy testing, the UTF-8 aware versions of the string functions
are also exposed as builtin functions (named utf8_*, e.g.
utf8_length).
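
A small illustration ('你好' is two code points encoded in six bytes):
  SET UTF8_MODE=true;
  SELECT length('你好');   -- returns 2, counting characters
  SET UTF8_MODE=false;
  SELECT length('你好');   -- returns 6, counting bytes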

Tests:
 - Add BE tests for utf8 functions.
 - Add e2e tests for the UTF8_MODE query option.

Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Reviewed-on: http://gerrit.cloudera.org:8080/16908
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 00:43:39 +00:00
liuyao
18acca92ee IMPALA-10435: Extend 'compute incremental stats' syntax
to support a list of columns

Modified the parser to support COMPUTE INCREMENTAL STATS with a list
of columns. No changes are needed in other modules because they
already support this.
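
A sketch of the extended syntax (table and column names are
hypothetical):
  COMPUTE INCREMENTAL STATS t (col1, col2);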

Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Reviewed-on: http://gerrit.cloudera.org:8080/16947
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2021-01-21 19:35:26 +00:00
Zoltan Borok-Nagy
90f3b2f491 IMPALA-10432: INSERT INTO Iceberg tables with partition transforms
This patch implements INSERT INTO for Iceberg tables that use
partition transforms. Partition transforms are functions that
calculate partition data from row data.

There are the following partition transforms in Iceberg:
https://iceberg.apache.org/spec/#partition-transforms

 * IDENTITY
 * BUCKET
 * TRUNCATE
 * YEAR
 * MONTH
 * DAY
 * HOUR

INSERT INTO identity-partitioned Iceberg tables is already supported.
This patch adds support for the rest of the transforms.

We create the partitioning expressions in InsertStmt. Based on these
expressions, data is automatically shuffled and sorted by the backend
executors before rows are given to the table sink operators. The table
sink operator writes the partitions one-by-one and creates a
human-readable partition path for each of them.

In the end, we will convert the partition path to partition data and
create Iceberg DataFiles with information about the files written.
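
A rough sketch (ice_t is hypothetical; the spec syntax follows the
IDENTITY form used elsewhere in this history):
  CREATE TABLE ice_t (i INT, ts TIMESTAMP)
  PARTITION BY SPEC (ts MONTH)
  STORED AS ICEBERG;
  INSERT INTO ice_t VALUES (1, '2021-01-18 10:00:00');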

Testing:
 * added planner test
 * added e2e tests

Change-Id: I3edf02048cea78703837b248c55219c22d512b78
Reviewed-on: http://gerrit.cloudera.org:8080/16939
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-18 18:46:42 +00:00
stiga-huang
9bb7157bf0 IMPALA-10387: Add missing overloads of mask functions used in Ranger default masking policies
The mask functions in Hive are implemented through GenericUDFs, which
can accept an unlimited number of function signatures. Impala
currently doesn't support GenericUDFs, so we provide builtin mask
functions with a limited set of overloads.

This patch adds some missing overloads that could be used by Ranger
default masking policies, e.g. MASK_HASH, MASK_SHOW_LAST_4,
MASK_DATE_SHOW_YEAR, etc.
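
A hypothetical call (the exact overload and default mask characters
are assumptions; with Hive's defaults, digits are masked to 'n'):
  SELECT mask_show_last_n('123-45-6789', 4);  -- 'nnn-nn-6789'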

Tests:
 - Add test coverage on all default masking policies applied on all
   supported types.

Change-Id: Icf3e70fd7aa9f3b6d6b508b776696e61ec1fcc2e
Reviewed-on: http://gerrit.cloudera.org:8080/16930
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-15 13:01:53 +00:00
Zoltan Borok-Nagy
696dafed66 IMPALA-10426: Fix crash when inserting invalid timestamps
Insertion of invalid timestamps causes Impala to crash when it uses
the INT64 Parquet timestamp types.

This patch fixes the error by checking for null values in
Int64TimestampColumnWriterBase::ConvertValue().
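
One plausible repro shape (the option value, table, and timestamp are
assumptions, not the exact case from the JIRA):
  SET PARQUET_TIMESTAMP_TYPE=INT64_NANOS;
  CREATE TABLE t (ts TIMESTAMP) STORED AS PARQUET;
  INSERT INTO t VALUES (cast('1400-01-01' as timestamp));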

Testing:
 * added e2e tests

Change-Id: I74fb754580663c99e1d8c3b73f8d62ea3305ac93
Reviewed-on: http://gerrit.cloudera.org:8080/16951
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-14 19:34:38 +00:00
skyyws
1093a563e6 IMPALA-10368: Support required/optional property when creating Iceberg table
This patch adds support for creating required/optional fields for
Iceberg tables. If the 'NOT NULL' property is set on an Iceberg table
column in SQL, Impala creates a required field through the Iceberg
API; 'NULL' or the default creates an optional field.
Besides, 'DESCRIBE XXX' for an Iceberg table now displays the
nullability like this:
+------+--------+---------+----------+
| name | type   | comment | nullable |
+------+--------+---------+----------+
| id   | int    |         | false    |
| name | string |         | true     |
| age  | int    |         | true     |
+------+--------+---------+----------+
And 'SHOW CREATE TABLE XXX' also displays the 'NULL'/'NOT NULL'
property for Iceberg tables.
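
For example, DDL of this shape (hypothetical, matching the DESCRIBE
output above):
  CREATE TABLE ice_t (
    id INT NOT NULL,
    name STRING,
    age INT
  ) STORED AS ICEBERG;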

Tests:
 * added new test in iceberg-create.test
 * added new test in iceberg-negative.test
 * added new test in show-create-table.test
 * modify 'DESCRIBE XXX' result in iceberg-create.test
 * modify 'DESCRIBE XXX' result in iceberg-alter.test
 * modify create table result in show-create-table.test

Change-Id: I70b8014ba99f43df1b05149ff7a15cf06b6cd8d3
Reviewed-on: http://gerrit.cloudera.org:8080/16904
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-11 17:08:21 +00:00
stiga-huang
e7839c4530 IMPALA-10416: Add raw string mode for testfiles to verify non-ascii results
Currently, the result section of the testfile is required to use
escaped strings. Take the following result section as an example:
  --- RESULTS
  'Alice\nBob'
  'Alice\\nBob'
The first line is a string with a newline character. The second line is
a string with a '\' and an 'n' character. When comparing with the actual
query results, we need to escape the special characters in the actual
results, e.g. replace newline characters with '\n'. This is done by
invoking encode('unicode_escape') on the actual result strings. However,
the input type of this method is unicode instead of str. When calling it
on str vars, Python will implicitly convert the input vars to unicode
type. The default encoding, ascii, is used. This causes
UnicodeDecodeError when the str contains non-ascii bytes. To fix this,
this patch explicitly decodes the input str using 'utf-8' encoding.

After fixing the logic of escaping the actual result strings, the next
problem is that it's painful to write unicode-escaped expected results.
Here is an example:
  ---- QUERY
  select "你好\n你好"
  ---- RESULTS
  '\u4f60\u597d\n\u4f60\u597d'
  ---- TYPES
  STRING
It's painful to manually translate the unicode characters.

This patch adds a new comment, RAW_STRING, for the result section to use
raw strings instead of unicode-escaped strings. Here is an example:
  ---- QUERY
  select "你好"
  ---- RESULTS: RAW_STRING
  '你好'
  ---- TYPES
  STRING
If the result contains special characters, it's recommended to use
the default string mode. If the only special characters are newlines,
we can use RAW_STRING and the existing MULTI_LINE comment together.

This patch also fixes the issue that pytest fails to report assertion
failures if any of the compared str values contain non-ascii bytes
(IMPALA-10419). However, pytest works if the compared values are both
in unicode type. So we explicitly convert the actual and expected str
values to unicode type.

Test:
 - Add tests in special-strings.test for raw string mode and the escaped
   string mode (default).
 - Run test_exprs.py::TestExprs::test_special_strings locally.

Change-Id: I7cc2ea3e5849bd3d973f0cb91322633bcc0ffa4b
Reviewed-on: http://gerrit.cloudera.org:8080/16919
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-06 04:39:56 +00:00
Tim Armstrong
1d5fe2771f IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
The encoding is identical to the already-supported PLAIN_DICTIONARY
encoding but the PLAIN enum value is used for the dictionary pages
and the RLE_DICTIONARY enum value is used for the data pages.

A hidden option -write_new_parquet_dictionary_encodings is
added to turn on writing too, for test purposes only.

Testing:
* Added an automated test using a pregenerated test file.
* Ran core tests.
* Manually tested by writing out TPC-H lineitem with the new encoding
  and reading back in Impala and Hive.

Parquet-tools output for the generated test file:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/att/824de2afebad009f-6f460ade00000003_643159826_data.0.parq
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:            hdfs://localhost:20500/test-warehouse/att/824de2afebad009f-6f460ade00000003_643159826_data.0.parq
creator:         impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema:     schema
--------------------------------------------------------------------------------
id:              OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bool_col:        OPTIONAL BOOLEAN R:0 D:1
tinyint_col:     OPTIONAL INT32 L:INTEGER(8,true) R:0 D:1
smallint_col:    OPTIONAL INT32 L:INTEGER(16,true) R:0 D:1
int_col:         OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bigint_col:      OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
float_col:       OPTIONAL FLOAT R:0 D:1
double_col:      OPTIONAL DOUBLE R:0 D:1
date_string_col: OPTIONAL BINARY R:0 D:1
string_col:      OPTIONAL BINARY R:0 D:1
timestamp_col:   OPTIONAL INT96 R:0 D:1
year:            OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
month:           OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1

row group 1:     RC:8 TS:754 OFFSET:4
--------------------------------------------------------------------------------
id:               INT32 SNAPPY DO:4 FPO:48 SZ:74/73/0.99 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 7, num_nulls: 0]
bool_col:         BOOLEAN SNAPPY DO:0 FPO:141 SZ:26/24/0.92 VC:8 ENC:RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
tinyint_col:      INT32 SNAPPY DO:220 FPO:243 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
smallint_col:     INT32 SNAPPY DO:343 FPO:366 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
int_col:          INT32 SNAPPY DO:467 FPO:490 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
bigint_col:       INT64 SNAPPY DO:586 FPO:617 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 10, num_nulls: 0]
float_col:        FLOAT SNAPPY DO:724 FPO:747 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 1.1, num_nulls: 0]
double_col:       DOUBLE SNAPPY DO:845 FPO:876 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 10.1, num_nulls: 0]
date_string_col:  BINARY SNAPPY DO:983 FPO:1028 SZ:74/88/1.19 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30312F30312F3039, max: 0x30342F30312F3039, num_nulls: 0]
string_col:       BINARY SNAPPY DO:1143 FPO:1168 SZ:53/49/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30, max: 0x31, num_nulls: 0]
timestamp_col:    INT96 SNAPPY DO:1261 FPO:1329 SZ:98/138/1.41 VC:8 ENC:RLE,RLE_DICTIONARY ST:[num_nulls: 0, min/max not defined]
year:             INT32 SNAPPY DO:1451 FPO:1470 SZ:47/43/0.91 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 2009, max: 2009, num_nulls: 0]
month:            INT32 SNAPPY DO:1563 FPO:1594 SZ:60/56/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 4, num_nulls: 0]

Parquet-tools output for one of the lineitem files:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/li2/4b4d9143c575dd71-3f69d3cf00000001_1879643220_data.0.parq
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:            hdfs://localhost:20500/test-warehouse/li2/4b4d9143c575dd71-3f69d3cf00000001_1879643220_data.0.parq
creator:         impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema:     schema
--------------------------------------------------------------------------------
l_orderkey:      OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_partkey:       OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_suppkey:       OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_linenumber:    OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
l_quantity:      OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(12,2) R:0 D:1
l_extendedprice: OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(12,2) R:0 D:1
l_discount:      OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(12,2) R:0 D:1
l_tax:           OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(12,2) R:0 D:1
l_returnflag:    OPTIONAL BINARY R:0 D:1
l_linestatus:    OPTIONAL BINARY R:0 D:1
l_shipdate:      OPTIONAL BINARY R:0 D:1
l_commitdate:    OPTIONAL BINARY R:0 D:1
l_receiptdate:   OPTIONAL BINARY R:0 D:1
l_shipinstruct:  OPTIONAL BINARY R:0 D:1
l_shipmode:      OPTIONAL BINARY R:0 D:1
l_comment:       OPTIONAL BINARY R:0 D:1

row group 1:     RC:1724693 TS:58432195 OFFSET:4
--------------------------------------------------------------------------------
l_orderkey:       INT64 SNAPPY DO:4 FPO:159797 SZ:2839537/13147604/4.63 VC:1724693 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: 2142211, max: 6000000, num_nulls: 0]
l_partkey:        INT64 SNAPPY DO:2839640 FPO:3028619 SZ:8179566/13852808/1.69 VC:1724693 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: 1, max: 200000, num_nulls: 0]
l_suppkey:        INT64 SNAPPY DO:11019308 FPO:11059413 SZ:3063563/3103196/1.01 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 10000, num_nulls: 0]
l_linenumber:     INT32 SNAPPY DO:14082964 FPO:14083007 SZ:412884/650550/1.58 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 7, num_nulls: 0]
l_quantity:       FIXED_LEN_BYTE_ARRAY SNAPPY DO:14495934 FPO:14496204 SZ:1298038/1297963/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 1.00, max: 50.00, num_nulls: 0]
l_extendedprice:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:15794062 FPO:16003224 SZ:9087746/10429259/1.15 VC:1724693 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: 904.00, max: 104949.50, num_nulls: 0]
l_discount:       FIXED_LEN_BYTE_ARRAY SNAPPY DO:24881912 FPO:24881976 SZ:866406/866338/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0.00, max: 0.10, num_nulls: 0]
l_tax:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:25748406 FPO:25748463 SZ:866399/866325/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0.00, max: 0.08, num_nulls: 0]
l_returnflag:     BINARY SNAPPY DO:26614888 FPO:26614918 SZ:421113/421069/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x41, max: 0x52, num_nulls: 0]
l_linestatus:     BINARY SNAPPY DO:27036081 FPO:27036106 SZ:262209/270332/1.03 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x46, max: 0x4F, num_nulls: 0]
l_shipdate:       BINARY SNAPPY DO:27298370 FPO:27309301 SZ:2602937/2627148/1.01 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x313939322D30312D3032, max: 0x313939382D31322D3031, num_nulls: 0]
l_commitdate:     BINARY SNAPPY DO:29901405 FPO:29912079 SZ:2602680/2626308/1.01 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x313939322D30312D3331, max: 0x313939382D31302D3331, num_nulls: 0]
l_receiptdate:    BINARY SNAPPY DO:32504185 FPO:32515219 SZ:2603040/2627498/1.01 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x313939322D30312D3036, max: 0x313939382D31322D3330, num_nulls: 0]
l_shipinstruct:   BINARY SNAPPY DO:35107326 FPO:35107408 SZ:434968/434917/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x434F4C4C45435420434F44, max: 0x54414B45204241434B2052455455524E, num_nulls: 0]
l_shipmode:       BINARY SNAPPY DO:35542401 FPO:35542471 SZ:650639/650580/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x414952, max: 0x545255434B, num_nulls: 0]
l_comment:        BINARY SNAPPY DO:36193124 FPO:36711343 SZ:22240470/52696671/2.37 VC:1724693 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: 0x20546972657369617320, max: 0x7A7A6C653F20626C697468656C792069726F6E69, num_nulls: 0]

Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Reviewed-on: http://gerrit.cloudera.org:8080/16893
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-05 23:30:35 +00:00
Aman Sinha
49680559b0 IMPALA-10182: Don't add inferred identity predicates to SELECT node
For an inferred equality predicate of the form c1 = c2, if both sides
refer to the same underlying tuple and slot, it is an identity
predicate that should not be evaluated by the SELECT node, since that
would incorrectly eliminate NULL rows. This patch fixes the behavior.
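
A hypothetical shape that can trigger this (v exposes the same column
twice, so the inferred v.c1 = v.c2 collapses to an identity predicate
on a single slot):
  CREATE VIEW v AS SELECT id AS c1, id AS c2 FROM t2;
  SELECT * FROM t1 LEFT OUTER JOIN v ON t1.id = v.c1 AND t1.id = v.c2;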

Testing:
 - Added planner tests with base table and with outer join
 - Added runtime tests with base table and with outer join
 - Added planner test for IMPALA-9694 (same root cause)
 - Ran PlannerTest .. no other plans changed

Change-Id: I924044f582652dbc50085851cc639f3dee1cd1f4
Reviewed-on: http://gerrit.cloudera.org:8080/16917
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-05 23:04:25 +00:00
Zoltan Borok-Nagy
03af0b2c8c IMPALA-10422: EXPLAIN statements leak ACID transactions and locks
Currently EXPLAIN statements might open ACID transactions and
create locks on ACID tables.

This is not necessary since we won't modify the table. But the
real problem is that these transactions and locks are leaked and
remain open forever. They even keep getting heartbeated while the
coordinator is still running.

The solution is to not consume any ACID resources for EXPLAIN
statements.

Testing:
* Added EXPLAIN INSERT OVERWRITE in front of an actual INSERT OVERWRITE
  in an e2e test

Change-Id: I05113b1fd9a3eb2d0dd6cf723df916457f3fbf39
Reviewed-on: http://gerrit.cloudera.org:8080/16923
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-05 21:31:05 +00:00
Fucun Chu
4099a60689 IMPALA-10317: Add query option that limits huge joins at runtime
This patch adds support for limiting the rows produced by a join node
such that runaway join queries can be prevented.

The limit is specified by a query option. Queries exceeding that limit
get terminated. The checking runs periodically, so the actual rows
produced may go somewhat over the limit.

JOIN_ROWS_PRODUCED_LIMIT is exposed as an advanced query option.

The query profile is updated to include query-wide and per-backend
metrics for rows produced (RowsReturned). Example from "
set JOIN_ROWS_PRODUCED_LIMIT = 10000000;
select count(*) from tpch_parquet.lineitem l1 cross join
(select * from tpch_parquet.lineitem l2 limit 5) l3;":

NESTED_LOOP_JOIN_NODE (id=2):
   - InactiveTotalTime: 107.534ms
   - PeakMemoryUsage: 16.00 KB (16384)
   - ProbeRows: 1.02K (1024)
   - ProbeTime: 0.000ns
   - RowsReturned: 10.00M (10002025)
   - RowsReturnedRate: 749.58 K/sec
   - TotalTime: 13s337ms

Testing:
 Added tests for JOIN_ROWS_PRODUCED_LIMIT

Change-Id: Idbca7e053b61b4e31b066edcfb3b0398fa859d02
Reviewed-on: http://gerrit.cloudera.org:8080/16706
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-22 06:10:39 +00:00