We found that test_iceberg_query and test_iceberg_profile fail after
the patch for IMPALA-9741 was merged, and that this is due to the
default timezone of Impala not being UTC. This patch fixes the issue
by adding "SET TIMEZONE=UTC;" before those test queries are run.
Testing:
- Verified in a local development environment that test_iceberg_query
and test_iceberg_profile pass after applying this patch.
Change-Id: Ie985519e8ded04f90465e141488bd2dda78af6c3
Reviewed-on: http://gerrit.cloudera.org:8080/16425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for querying Iceberg tables through Impala.
We can use the following SQL to create an external Iceberg table:
CREATE EXTERNAL TABLE default.iceberg_test (
level string,
event_time timestamp,
message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
Or with just the table name and location, like this:
CREATE EXTERNAL TABLE default.iceberg_test
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
'iceberg_file_format' is the file format of the Iceberg table;
currently only PARQUET is supported, and other formats will be
supported in the future. If you don't specify this property in your
SQL, the default file format is PARQUET.
We implement this by treating the Iceberg table as a normal
unpartitioned HDFS table. When querying an Iceberg table, we push
down partition column predicates to Iceberg to decide which data
files need to be scanned, and then transfer this information to the
BE to do the real scan operation.
Testing:
- Unit test for Iceberg in FileMetadataLoaderTest
- Create table tests in functional_schema_template.sql
- Iceberg table query test in test_scanners.py
Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
Reviewed-on: http://gerrit.cloudera.org:8080/16143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Implementing codegen for HiveUdfCall.
Testing:
Verified that java udf tests pass locally.
Benchmarks:
Used a UDF from TestUdf.java that adds three integers:
create function tpch15_parquet.sum3(int, int, int) returns int
location '/test-warehouse/impala-hive-udfs.jar'
symbol='org.apache.impala.TestUdf';
Used the following query on the master branch and the change's branch:
set num_nodes=1; set mt_dop=1;
select min(tpch15_parquet.sum3(cast(l_orderkey as int),
cast(l_partkey as int), cast(l_suppkey as int)))
from tpch15_parquet.lineitem;
Results averaged over 100 runs after warmup:
Master: 20.6346s, stddev: 0.3132411856765332
This change: 19.0256s, stddev: 0.42039019873436
This is a ~7.8% improvement.
Change-Id: I2f994dac550f297ed3c88491816403f237d4d747
Reviewed-on: http://gerrit.cloudera.org:8080/16314
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for constant propagation of range predicates
involving date and timestamp constants. Previously, only equality
predicates were considered for propagation. The new type of propagation
is shown by the following example:
Before constant propagation:
WHERE date_col = CAST(timestamp_col as DATE)
AND timestamp_col BETWEEN '2019-01-01' AND '2020-01-01'
After constant propagation:
WHERE date_col >= '2019-01-01' AND date_col <= '2020-01-01'
AND timestamp_col >= '2019-01-01' AND timestamp_col <= '2020-01-01'
AND date_col = CAST(timestamp_col as DATE)
As a consequence, since Impala supports table partitioning by date
columns but not timestamp columns, the above propagation enables
partition pruning based on timestamp ranges.
Existing code for equality based constant propagation was refactored
and consolidated into a new class which handles both equality and
range based constant propagation. Range based propagation is only
applied to date and timestamp columns.
Testing:
- Added new range constant propagation tests to PlannerTest.
- Added e2e test for range constant propagation based on a newly
added date partitioned table.
- Ran precommit tests.
Change-Id: I811a1f8d605c27c7704d7fc759a91510c6db3c2b
Reviewed-on: http://gerrit.cloudera.org:8080/16346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Impala TRUNCATEs an ACID table, it creates a new base directory
with the hidden file "_empty" in it. Newer Hive versions ignore files
starting with an underscore, and therefore they ignore the whole base
directory.
To resolve this issue we can simply rename the empty file to "empty".
Testing:
* update acid-truncate.test accordingly
Change-Id: Ia0557b9944624bc123c540752bbe3877312a7ac9
Reviewed-on: http://gerrit.cloudera.org:8080/16396
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently Impala checks the file metadata field 'hive.acid.version' to
decide on the full ACID schema. There are cases when Hive forgets to
set this value for full ACID files, e.g. query-based compactions.
So it's more robust to check the schema elements instead of the metadata
field. Also, sometimes Hive writes the schema with different character
cases, e.g. originalTransaction vs originaltransaction, so we should
rather compare the column names in a case-insensitive way.
Testing:
* added test for full ACID compaction
* added test_full_acid_schema_without_file_metadata_tag to test full
ACID file without metadata 'hive.acid.version'
Change-Id: I52642c1755599efd28fa2c90f13396cfe0f5fa14
Reviewed-on: http://gerrit.cloudera.org:8080/16383
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fix handles the case where a column's cardinality is zero but the
column is nullable and we have null stats indicating there are null
values; in that case we adjust the cardinality from 0 to 1.
The cardinality of zero was especially problematic when calculating
cardinalities for multiple predicates with multiplication. The 0 would
propagate up the plan tree and result in poor plan choices such as
always using broadcast joins where shuffle would've been more optimal.
Testing:
* 26 Node TPC-DS 30TB run had better plans for Q4 and Q11
- Q4 172s -> 80s
- Q11 103s -> 77s
* CardinalityTest
* TpcdsPlannerTest
Change-Id: Iec967053b4991f8c67cde62adf003cbd3f429032
Reviewed-on: http://gerrit.cloudera.org:8080/16349
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
INTERSECT/EXCEPT are not duplicate-preserving operations. The distinct
aggregation can happen in each operand, in the leftmost operand only,
or after all the operands in a separate aggregation step. Except for a
couple of special cases, we previously used the last strategy most often.
This change pushes the distinct aggregation down to the leftmost operand
in cases where there are no analytic functions, or when a distinct or
grouping operation already eliminates duplicates.
In general, DISTINCT placement such as in this case should be done
throughout the entire plan tree in a cost-based manner, as described in
IMPALA-5260.
Testing:
* TpcdsPlannerTest
* PlannerTest
* TPC-DS 30TB Perf run for any affected queries
- Q14-1 180s -> 150s
- Q14-2 109s -> 90s
- Q8 no significant change
* SetOperation Planner Tests
* Analyzer tests
* Tpcds Functional Workload
Change-Id: Ia248f1595df2ab48fbe70c778c7c32bde5c518a5
Reviewed-on: http://gerrit.cloudera.org:8080/16350
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This adds a BLOOM_FILTER_ERROR_RATE option that takes a
value between 0 and 1 (exclusive) and can override
the default target false positive probability (fpp)
value of 0.75 used for selecting the filter size.
It does not affect whether filters are disabled
at runtime.
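For illustration, the option could be set like this before a query that
builds a bloom filter (the value 0.1 is an arbitrary example, not a
recommendation):
set bloom_filter_error_rate=0.1;
select count(*) from customer join nation on n_nationkey = c_nationkey;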
Adds estimated FPP and bloom size to the routing
table so we have some observability. Here is an
example:
tpch_kudu> select count(*) from customer join nation on n_nationkey = c_nationkey;
ID Src. Node Tgt. Node(s) Target type Partition filter Pending (Expected) First arrived Completed Enabled Bloom Size Est fpp
-----------------------------------------------------------------------------------------------------------------------------------------
1 2 0 LOCAL false 0 (3) N/A N/A true MIN_MAX
0 2 0 LOCAL false 0 (3) N/A N/A true 1.00 MB 1.04e-37
Testing:
Added a test that shows the query option affecting filter size.
Ran core tests.
Change-Id: Ifb123a0ea1e0e95d95df9837c1f0222fd60361f3
Reviewed-on: http://gerrit.cloudera.org:8080/16377
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Added TpcdsPlannerTest to include each TPC-DS query as a separate plan
test file. Removed the previous tpcds-all test file.
This means that when running only PlannerTest no TPC-DS plans are
checked; however, as part of a full frontend test run TpcdsPlannerTest
will be included.
It runs with cardinality and resource checks, and uses Parquet tables
so that predicate pushdowns are included.
Change-Id: Ibaf40d8b783be1dc7b62ba3269feb034cb8047da
Reviewed-on: http://gerrit.cloudera.org:8080/16345
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This is the support for Cumulative Distribution Function (CDF) from
Apache DataSketches KLL algorithm collection. It receives a serialized
KLL sketch and one or more float values to represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), (-inf, 5), (-inf, 10), (-inf, +inf)
Returns a comma separated string where each value in the string is a
number in the range of [0,1] and shows what portion of the data is in
the particular range.
Note, ds_kll_cdf() should return an Array of doubles as the result, but
for that we have to wait for complex type support. Until then, we
provide ds_kll_cdf_as_string(), which can be deprecated once we
have array support. The tracking Jira for returning complex types from
functions is IMPALA-9520.
Example:
select ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.2,0.401644,1,1 |
+----------------------------------------------------------+
Change-Id: I77e6afc4556ad05a295b89f6d06c2e4a6bb2cf82
Reviewed-on: http://gerrit.cloudera.org:8080/16359
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the support for Probabilistic Mass Function (PMF) from Apache
DataSketches KLL algorithm collection. It receives a serialized KLL
sketch and one or more float values to represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), [1, 5), [5, 10), [10, +inf)
Returns a comma separated string where each value in the string is a
number in the range of [0,1] and shows what portion of the data is in
the particular range.
Note, ds_kll_pmf() should return an Array of doubles as the result, but
for that we have to wait for complex type support. Until then, we
provide ds_kll_pmf_as_string(), which can be deprecated once we
have array support. The tracking Jira for returning complex types from
functions is IMPALA-9520.
Example:
select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.202192,0.199452,0.598356,0 |
+----------------------------------------------------------+
Change-Id: I222402f2dce2f49ab2b3f6e81a709da5539293ba
Reviewed-on: http://gerrit.cloudera.org:8080/16336
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The bug was that the statement rewriter converted NOT IN <subquery>
predicates to != <subquery> predicates when the subquery could
be an empty set. This was invalid, because NOT IN (<empty set>)
is true, but != (<empty set>) is false.
Testing:
Added targeted planner and end-to-end tests.
Ran exhaustive tests.
Change-Id: I66c726f0f66ce2f609e6ba44057191f5929a67fc
Reviewed-on: http://gerrit.cloudera.org:8080/16338
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function is very similar to ds_kll_quantile() but this one can
receive any number of rank parameters and returns a comma separated
string that holds the results for all of the given ranks.
For more details about ds_kll_quantile() see IMPALA-9959.
Note, ds_kll_quantiles() should return an Array of floats as the result,
but for that we have to wait for complex type support. Until then, we
provide ds_kll_quantiles_as_string(), which can be deprecated once we
have array support. The tracking Jira for returning complex types from
functions is IMPALA-9520.
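A hypothetical usage, following the pattern of the related ds_kll
functions (the rank values 0.25, 0.5 and 0.75 are arbitrary):
select ds_kll_quantiles_as_string(ds_kll_sketch(float_col), 0.25, 0.5, 0.75)
from alltypes;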
Change-Id: I76f6039977f4e14ded89a3ee4bc4e6ff855f5e7f
Reviewed-on: http://gerrit.cloudera.org:8080/16324
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The minimum requirement for a spillable operator is ((min_buffers - 2) *
default_buffer_size) + 2 * max_row_size. In the min reservation, we only
reserve space for two large pages, one for reading, the other for
writing. However, to make the non-streaming GroupingAggregator work
correctly, we have to manage these extra reservations carefully, so it
won't run out of the min reservation when it actually needs to spill a
large page, or when it actually needs to read a large page.
To be specific, how we manage the large write page reservation depends
on whether needs_serialize is true or false:
- If the aggregator needs to serialize the intermediate results when
spilling a partition, we have to save a large page worth of
reservation for the serialize stream, in case it needs to write large
rows. This space can be restored when all the partitions are spilled
so the serialize stream is not needed until we build/repartition a
spilled partition and thus have pinned partitions again. If the large
write page reservation is used, we save it back whenever possible
after we spill or close a partition.
- If the aggregator doesn't need the serialize stream at all, we can
restore the large write page reservation whenever we fail to add a
large row, before spilling any partitions. Reclaim it whenever
possible after we spill or close a partition.
A special case is when we are processing a large row that is the last
row in building/repartitioning a spilled partition: the large write page
reservation can be restored for it no matter whether we need the
serialize stream, because the partitions will be read out after this,
so there is no need for spilling.
For the large read page reservation, it's transferred to the spilled
BufferedTupleStream that we are reading in building/repartitioning a
spilled partition. The stream will restore some of it when reading a
large page, and reclaim it when the output row batch is reset. Note that
because the stream is read in attach_on_read mode, the large page will
be attached to the row batch's buffers and only gets freed when the row
batch is reset.
Tests:
- Add tests in test_spilling_large_rows (test_spilling.py) with
different row sizes to reproduce the issue.
- One test in test_spilling_no_debug_action becomes flaky after this
patch. Revise the query to make the udf allocate larger strings so it
can consistently pass.
- Run CORE tests.
Change-Id: I3d9c3a2e7f0da60071b920dec979729e86459775
Reviewed-on: http://gerrit.cloudera.org:8080/16240
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This fix addresses the current limitation in that an ill-formatted
Parquet version string is not properly formatted before appearing
in an error message or impalad.INFO. With the fix, any such string is
converted to a hex string first. The hex string is a sequence of
four hex digit groups separated by spaces and each group is one or
two hex digits, such as "6c 65 2e a".
Testing:
Ran "core" tests successfully.
Change-Id: I281d6fa7cb2f88f04588110943e3e768678b9cf1
Reviewed-on: http://gerrit.cloudera.org:8080/16331
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Include remaining TPC-DS queries to the testdata workload definition.
Q8 and Q38 were using non-standard variants; those have been
replaced by the official query versions. Q35 is using an official
variant. Had to escape a table alias in Q90 as we treat 'AT' as a
reserved keyword.
Change-Id: Id5436689390f149694f14e6da1df624de4f5f7ad
Reviewed-on: http://gerrit.cloudera.org:8080/16280
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
In this test there is no need to check for "Error parsing row"
since "Error converting column" is enough to be sure we are
no longer able to read dateless timestamps.
Change-Id: Ia97490288dae81561969d260739a07ec42571f48
Reviewed-on: http://gerrit.cloudera.org:8080/16334
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This work addresses the current limitation in computing the total row
count for a Hive table in a scan. The row count can be incorrectly
computed as 0, even though there exists data in the Hive table. This
is stats corruption at the table level. Similar stats corruption
exists for a partition. The row count of a table or a partition can
sometimes also be -1, which indicates a missing stats situation.
In the fix, as long as no partition in a Hive table exhibits any
missing or corrupt stats, the total row count for the table is computed
from the row counts in all partitions. Otherwise, Impala looks at
the table level stats particularly the table row count.
In addition, if the table stats are missing or corrupt, Impala
estimates a row count for the table, if feasible. This row count is
the sum of the row counts from the partitions with good stats and
an estimate of the number of rows in the partitions with missing or
corrupt stats. Such estimation also applies when some partition
has corrupt stats.
One way to observe the fix is through the explain of queries scanning
Hive tables with missing or corrupted stats. The cardinality for any
full scan should be a positive value (i.e. the estimated row count),
instead of 'unavailable'. At the beginning of the explain output,
that table is still listed in the WARNING section for potentially
corrupt table statistics.
Testing:
1. Ran unit tests with queries documented in the case against Hive
tables with the following configurations:
a. No stats corruption in any partitions
b. Stats corruption in some partitions
c. Stats corruption in all partitions
2. Added two new tests in test_compute_stats.py:
a. test_corrupted_stats_in_partitioned_Hive_tables
b. test_corrupted_stats_in_unpartitioned_Hive_tables
3. Fixed failures in corrupt-stats.test
4. Ran "core" test
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Reviewed-on: http://gerrit.cloudera.org:8080/16098
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For example, base64decode('YWxwaGE%') will return
'alpha\377' on a newer OS which has a newer SASL library.
I tested it on the Ubuntu 18.04 aarch64 version.
Change-Id: Ib9bd9e03d5f744c18c957cdaf2064fa918086004
Reviewed-on: http://gerrit.cloudera.org:8080/16175
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This implements scanning of full ACID tables that contain complex
types. The same technique that we use for primitive types works here
as well, i.e. we add a LEFT ANTI JOIN on top of the HDFS scan node in
order to subtract the deleted rows from the inserted rows.
However, there were some types of queries where we couldn't do that.
These are the queries that scan the nested collection items directly.
E.g.: SELECT item FROM complextypestbl.int_array;
The above query only creates a single tuple descriptor that holds the
collection items. Since this tuple descriptor is not at the table level,
we cannot add slot references to the hidden ACID columns, which are at
the top level of the table schema.
To resolve this I added a statement rewriter that rewrites the above
statement to the following:
SELECT item FROM complextypestbl $a$1, $a$1.int_array;
Now in this example we'll have two tuple descriptors, one for the
table-level, and one for the collection item. So we can add the ACID
slot refs to the table-level tuple descriptor. The rewrite is
implemented by the new AcidRewriter class.
Performance
I executed the following query with num_nodes=1 on a non-transactional
table (without the rewrite), and on an ACID table (with the rewrite):
select count(*) from customer_nested.c_orders.o_lineitems;
Without the rewrite:
Fetched 1 row(s) in 0.41s
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
| Operator | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
| F00:ROOT | 1 | 1 | 13.61us | 13.61us | | | 0 B | 0 B | |
| 01:AGGREGATE | 1 | 1 | 3.68ms | 3.68ms | 1 | 1 | 16.00 KB | 10.00 MB | FINALIZE |
| 00:SCAN HDFS | 1 | 1 | 280.47ms | 280.47ms | 6.00M | 15.00M | 56.98 MB | 8.00 MB | tpch_nested_orc_def.customer.c_orders.o_lineitems |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
With the rewrite:
Fetched 1 row(s) in 0.42s
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
| Operator | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
| F00:ROOT | 1 | 1 | 25.16us | 25.16us | | | 0 B | 0 B | |
| 05:AGGREGATE | 1 | 1 | 3.44ms | 3.44ms | 1 | 1 | 63.00 KB | 10.00 MB | FINALIZE |
| 01:SUBPLAN | 1 | 1 | 16.52ms | 16.52ms | 6.00M | 125.92M | 47.00 KB | 0 B | |
| |--04:NESTED LOOP JOIN | 1 | 1 | 188.47ms | 188.47ms | 0 | 10 | 24.00 KB | 12 B | CROSS JOIN |
| | |--02:SINGULAR ROW SRC | 1 | 1 | 0ns | 0ns | 0 | 1 | 0 B | 0 B | |
| | 03:UNNEST | 1 | 1 | 25.37ms | 25.37ms | 0 | 10 | 0 B | 0 B | $a$1.c_orders.o_lineitems o_lineitems |
| 00:SCAN HDFS | 1 | 1 | 96.26ms | 96.26ms | 100.00K | 12.59M | 38.19 MB | 72.00 MB | default.customer_nested $a$1 |
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
So the overhead is very small.
Testing
* Added planner tests to PlannerTest/acid-scans.test
* E2E query tests to QueryTest/full-acid-complex-type-scans.test
* E2E tests for rowid-generation: QueryTest/full-acid-rowid.test
Change-Id: I8b2c6cd3d87c452c5b96a913b14c90ada78d4c6f
Reviewed-on: http://gerrit.cloudera.org:8080/16228
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This function receives a set of serialized Apache DataSketches KLL
sketches produced by ds_kll_sketch() and merges them into a single
sketch.
An example usage is to create a sketch for each partition of a table
and write these sketches to a separate table; then, depending on which
partitions the user is interested in, the relevant sketches can be
unioned together to get an estimate. E.g.:
SELECT
ds_kll_quantile(ds_kll_union(sketch_col), 0.5)
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Testing:
- Apart from the automated tests I added to this patch I also
tested ds_kll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches with grouping
by l_shipdate and called ds_kll_union() on those sketches.
Change-Id: I020aea28d36f9b6ef9fb57c08411f2170f5c24bf
Reviewed-on: http://gerrit.cloudera.org:8080/16267
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds a marker to runtime profiles and explain plans indicating if custom
(i.e. non-built-in) user-defined functions are being used. For explain
plans, a SQL-style comment is added after any function call. For runtime
profiles, a new Frontend entry called "User Defined Functions (UDFs)"
lists out all UDFs analyzed during planning.
Take the following example:
create function hive_lower(string) returns string location
'/test-warehouse/hive-exec.jar'
symbol='org.apache.hadoop.hive.ql.udf.UDFLower';
set explain_level=3;
explain select * from functional.alltypes order by hive_lower(string_col);
...
01:SORT
order by: default.hive_lower(string_col) /* JAVA UDF */ ASC
materialized: default.hive_lower(string_col) /* JAVA UDF */
...
This shows up in the runtime profile as well.
When the above query is actually run, the runtime profile includes the
following entry:
Frontend
User Defined Functions (UDFs): default.hive_lower
Error messages will also include SQL-style comments about any UDFs used.
For example:
select aggfn(int_col) over (partition by int_col) from
functional.alltypesagg
Throws:
Aggregate function 'default.aggfn(int_col) /* NATIVE UDF */' not
supported with OVER clause.
Testing:
* Added tests to test_udfs.py
* Ran core tests
Change-Id: I79122e6cc74fd5a62c76962289a1615fbac2f345
Reviewed-on: http://gerrit.cloudera.org:8080/16188
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ds_kll_rank() receives two parameters: a STRING that represents a
serialized DataSketches KLL sketch and a float to provide a probing
value in the sketch.
Returns a DOUBLE that is the rank of the given probing value in the
range of [0,1]. E.g. a return value of 0.2 means that the probing value
given as parameter is greater than 20% of all the values in the
sketch. Note, this is an approximate calculation.
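A hypothetical call, in the same style as the other ds_kll examples in
this series (the probing value 4 and the table are only illustrative):
select ds_kll_rank(ds_kll_sketch(float_col), 4)
from alltypes;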
Change-Id: I95857886dfbb8c84aeeaf718c0e610012fda4be0
Reviewed-on: http://gerrit.cloudera.org:8080/16283
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds a new query option MAX_FS_WRITERS that limits the
number of HDFS writer instances.
Highlights:
- Depending on the plan, it either restricts the num of instances of
the root fragment or adds an exchange and then limits the num of
instances of that.
- Assigns instances evenly across available backends.
- "no-shuffle" query hint is ignored when using query option.
- Change in behavior of plans is only when this query option is used.
- The only exception to the previous point is that the optimization
logic that decides to add an exchange now looks at the num of
instances instead of the number of nodes.
Limitation:
A mismatch of cluster state between query planning and scheduling can
result in more or fewer fragment instances being scheduled than
expected. E.g. if max_fs_writers is 2 and the planner sees only 2
executors, then it might not add an exchange between a scan node and
the table sink, but during scheduling, if there are 3 nodes, that
scan+tablesink instance will be scheduled on 3 backends.
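For illustration, a hypothetical insert that uses the option (the table
names are made up):
set max_fs_writers=2;
insert into store_sales_copy select * from store_sales;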
Testing:
- Added planner tests to cover all cases where this enforcement kicks
in and to highlight the behavior.
- Added e2e tests to confirm that the scheduler is enforcing the limit
and distributing the instance evenly across backends for different
plan shapes.
Change-Id: I17c8e61b9a32d908eec82c83618ff9caa41078a5
Reviewed-on: http://gerrit.cloudera.org:8080/16204
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch pushes the LIMIT from a top level Sort down to
the Sort below an Analytic operator when it is safe to do
so. There are several qualifying checks that are done. The
optimization is done at the time of creating the top level
Sort in the single node planner. When the pushdown is
applicable, the analytic sort is converted to a TopN sort.
Further, this is split into a bottom TopN and an upper
TopN separated by a hash partition exchange. This
ensures that the limit is applied as early as possible
before hash partitioning.
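As a hypothetical example, a TPC-DS-flavored query shape that this
optimization targets, subject to the qualifying checks (table and
columns are illustrative only):
select * from (
  select ss_item_sk, ss_store_sk,
         rank() over (partition by ss_store_sk order by ss_sales_price desc) rk
  from store_sales) v
order by rk
limit 100;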
Fixed a couple of additional related issues uncovered as a
result of the limit pushdown:
- Changed the analytic sort's partition-by expr sort
semantic from NULLS FIRST to NULLS LAST to ensure
correctness in the presence of a limit.
- The LIMIT on the analytic sort node was causing it to
be treated as a merging point in the distributed planner.
Fixed it by introducing an API allowPartitioned() in
PlanNode.
Testing:
- Ran PlannerTest and updated several EXPLAIN plans.
- Added Planner tests for both positive and negative cases of
limit pushdown.
- Ran end-to-end TPC-DS queries. Specifically tested
TPC-DS q67 for limit pushdown and result correctness.
- Added targeted end-to-end tests using TPC-H dataset.
Change-Id: Ib39f46a7bb75a34466eef7f91ddc25b6e6c99284
Reviewed-on: http://gerrit.cloudera.org:8080/16219
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
INTERSECT and EXCEPT set operations are implemented as rewrites to
joins. Currently only the DISTINCT qualified operators are implemented,
not ALL qualified. The operator MINUS is supported as an alias for
EXCEPT.
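For illustration, hypothetical queries using the new operators (table
names are made up):
select id from t1 intersect select id from t2;
select id from t1 except select id from t2;
select id from t1 minus select id from t2;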
We mimic Oracle and Hive's non-standard implementation which treats all
operators with the same precedence, as opposed to the SQL Standard of
giving INTERSECT higher precedence.
A new class SetOperationStmt was created to encompass the previous
UnionStmt behavior. UnionStmt is preserved as a special case of union
only operands to ensure compatibility with previous union planning
behavior.
Tests:
* Added parser and analyzer tests.
* Ensured no test failures or plan changes for union tests.
* Added TPC-DS queries 14,38,87 to functional and planner tests.
* Added functional tests test_intersect test_except
* New planner testSetOperationStmt
Change-Id: I5be46f824217218146ad48b30767af0fc7edbc0f
Reviewed-on: http://gerrit.cloudera.org:8080/16123
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
ds_kll_sketch() is an aggregate function that receives a float
parameter (e.g. a float column of a table) and returns a serialized
Apache DataSketches KLL sketch of the input data set wrapped into
STRING type. This sketch can be saved into a table or view and later
used for quantile approximations. ds_kll_quantile() receives two
parameters: a STRING parameter that contains a serialized KLL sketch
and a DOUBLE that represents the rank of the quantile in the range of
[0,1]. E.g. rank=0.1 means the approximate value in the sketch where
10% of the sketched items are less than or equal to this value.
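A hypothetical end-to-end usage combining the two functions (column and
table names are examples only):
select ds_kll_quantile(ds_kll_sketch(float_col), 0.5) from alltypes;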
Testing:
- Added automated tests on small data sets to check the basic
functionality of sketching and getting a quantile approximate.
- Tested on TPCH25_parquet.lineitem to check that sketching and
approximating works on bigger scale as well where serialize/merge
phases are also required. On this scale the error range of the
quantile approximation is within 1-1.5%
Change-Id: I11de5fe10bb5d0dd42fb4ee45c4f21cb31963e52
Reviewed-on: http://gerrit.cloudera.org:8080/16235
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In Hive, empty strings don't count as separate values when querying
count(distinct) estimates using the Apache DataSketches HLL algorithm
on strings and varchars.
For compatibility's sake, Impala should not count them either.
Tests:
- added extra tests for HLL with empty strings
Change-Id: Ie7648217bbe2f66b817788f131c062f349b1e9ad
Reviewed-on: http://gerrit.cloudera.org:8080/16226
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.
An example usage is to create a sketch for each partition of a table
and write these sketches to a separate table; then, depending on which
partitions the user is interested in, the relevant sketches can be
unioned together to get an estimate. E.g.:
SELECT
ds_hll_estimate(ds_hll_union(sketch_col))
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Note, currently there is a known limitation of unioning string types
where some input sketches come from Impala and some from Hive. In
this case, if there is an overlap in the input data used by Impala and
by Hive, this overlapping data is still counted twice due to a string
representation difference between Impala and Hive.
For more details see:
https://issues.apache.org/jira/browse/IMPALA-9939
Testing:
- Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches with grouping
by l_shipdate and called ds_hll_union() on those sketches.
Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Reviewed-on: http://gerrit.cloudera.org:8080/16095
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change bumps up the CDP_BUILD_NUMBER to 4493826. This is needed
to fix a failing test.
Hive started to assign bucket ids to files differently. Because of
that I had to modify the test_full_acid_rowid test that had an
assumption about how bucket ids are assigned to files.
If you have problems restarting the Hive Metastore, try the following:
buildall.sh <your usual flags> -upgrade_metastore_db
If you have problems restarting Kudu, try the following:
Unset LD_LIBRARY_PATH in your shell, and stop setting it in
impala-config-local.sh
Change-Id: Ia4635feef146c945624135e0715495bb01ea4699
Reviewed-on: http://gerrit.cloudera.org:8080/16195
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
When generating plans with left semi/anti joins (typically
resulting from subquery rewrites), the planner now
considers inserting a distinct aggregation on the inner
side of the join. The decision is based on whether that
aggregation would reduce the number of rows by more than
75%. This is fairly conservative and the optimization
might be beneficial for smaller reductions, but the
conservative threshold is chosen to reduce the number
of potential plan regressions.
The aggregation can both reduce the # of rows and the
width of the rows, by projecting out unneeded slots.
ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION query option is
added to allow toggling the optimization.
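For illustration, the new option could be toggled like this (the query
below is a made-up example of a subquery that rewrites to a semi join):
set enable_distinct_semi_join_optimization=false;
select * from t1 where id in (select id from t2);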
Tests:
* Add positive and negative planner tests for various
cases - including semi/anti joins, missing stats,
broadcast/shuffle, different numbers of join predicates.
* Add some end-to-end tests to verify plans execute correctly.
Change-Id: Icbb955e805d9e764edf11c57b98f341b88a37fcc
Reviewed-on: http://gerrit.cloudera.org:8080/16180
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves it via assigning a unique row-id for each row, and
maintaining two sets of files in a table. The first set is in the
base/delta directories, they contain the INSERTed rows. The second set
of files are in the delete-delta directories, they contain the DELETEd
rows.
(UPDATE operations are implemented via DELETE+INSERT.)
In the filesystem it looks like e.g.:
* full_acid/delta_0000001_0000001_0000/0000_0
* full_acid/delta_0000002_0000002_0000/0000_0
* full_acid/delete_delta_0000003_0000003_0000/0000_0
During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans a full ACID table (that also has deleted rows) is converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.
Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.
The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.
Testing:
* added planner test
* added e2e tests
Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Reviewed-on: http://gerrit.cloudera.org:8080/16082
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Implements the grouping() and grouping_id() builtins.
grouping_id() has both a no-arg version, which returns
a bit vector of all grouping exprs, and a varargs version,
which returns a bit vector of the provided arguments.
GROUPING is a keyword, so it needs special handling in the
parser to be accepted as a function name.
These functions are implemented in the transpose agg
with a CASE expression similar to other aggregate functions,
but returning the grouping() or grouping_id() value for that
aggregation class instead of an aggregated value.
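A hypothetical query shape using the new builtins (table and columns
are examples only):
select int_col, string_col, count(*), grouping(int_col), grouping_id()
from alltypes
group by rollup(int_col, string_col);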
Testing:
* Added parser test for grouping keyword.
* Added analysis tests for the functions.
* Added basic planner test to show expressions generated
* Added some TPC-DS queries that use grouping() - queries
80, 70 and 86 using reference .test files from Fang-Yu
Rao. 27 and 36 were added with reference results from
https://github.com/cwida/tpcds-result-reproduction
* Add targeted end-to-end tests.
* Added view compatibility test with Hive.
Change-Id: If0b1640d606256c0fe9204d2a21a8f6d06abcdb6
Reviewed-on: http://gerrit.cloudera.org:8080/16140
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Integrates the parsing and analysis with plan generation.
Testing:
* Add analysis test to make sure we reject unsupported queries.
* Added targeted planner tests to ensure we generate the correct
aggregation classes for a variety of cases.
* Add targeted end-to-end functional tests.
Added five TPC-DS queries that use ROLLUP, building on some work done
by Fang-Yu Rao. Some tweaks were required for these tests.
* Add an extra ORDER BY clause to q77 to make fully deterministic.
* Add backticks around `returns` to avoid reserved word.
* Add INTERVAL keyword to date/timestamp arithmetic.
We can run q80, too, but I haven't added or verified results yet -
that can be done in a follow-up.
Change-Id: Ie454c5bf7aee266321dee615548d7f2b71380197
Reviewed-on: http://gerrit.cloudera.org:8080/16128
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
The patch for IMPALA-8954 failed to account for subqueries
that could produce < 1 row. SelectStmt.returnsSingleRow()
is confusing because it actually returns true if it
returns *at most* one row.
As a fix I split it into returnsExactlyOneRow() and
returnsAtMostOneRow(), then used returnsExactlyOneRow()
to determine if the subquery should instead be rewritten
into a LEFT OUTER JOIN, which produces the correct result.
CROSS JOIN is still preferred because it can be more freely
reordered during planning.
Testing:
* Added planner tests for a range of scenarios where it can
be rewritten as a CROSS JOIN and where it needs to be a LEFT
OUTER JOIN for correctness.
* Added some targeted end-to-end tests where the results were
previously incorrect. Checked the behaviour against Hive and
postgres.
Ran exhaustive tests.
Change-Id: I6034aedac776783bdc8cdb3a2df344e2b3662da6
Reviewed-on: http://gerrit.cloudera.org:8080/16171
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch supports a subset of cases of subqueries
inside OR inside WHERE and HAVING clauses.
The approach used is to rewrite the subquery into
a many-to-one LEFT OUTER JOIN with the subquery and
then replace the subquery in the expression with a
reference to the single select list expressions of
the subquery. This works because:
* A many-to-one LEFT OUTER JOIN returns one output row
for each left input row, meaning that for every row
in the original query before the rewrite, we get
the same row plus a single matched row from the subquery
* Expressions can be rewritten to refer to a slotref from
the right side of the LEFT OUTER JOIN without affecting
semantics. E.g. an IN subquery becomes <slot> IS NOT NULL
or <operator> (<subquery>) becomes <operator> <slot>.
This does not affect SELECT list subqueries, which are
rewritten using a different mechanism that can already
support some subqueries in disjuncts.
Correlated and uncorrelated subqueries are both supported, but
various limitations are present.
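For illustration, a hypothetical query of the kind this patch enables
(table names are made up):
select * from t1
where c1 < 10 or c2 in (select c2 from t2);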
Limitations:
* Only one subquery per predicate is supported. The rewriting approach
should generalize to multiple subqueries but other code needs
refactoring to handle this case.
* EXISTS and NOT EXISTS subqueries are not supported. The rewriting
approach can generalise to that, but we need to add or pick a
select list item from the subquery to check for NULL/IS NOT NULL
and a little more work is required to do that correctly.
* NOT IN is not supported because of the special NULL semantics.
* Subqueries with aggregates + grouping by are not supported because
we rely on adding distinct to select list and we don't
support distinct + aggregations because of IMPALA-5098.
Tests:
* Positive analysis tests for IN and binary predicate operators.
* Negative analysis tests for unsupported subquery operators.
* Negative analysis tests for multiple subqueries.
* Negative analysis tests for runtime scalar subqueries.
* Positive and negative analysis tests for aggregations in subquery.
* TPC-DS Query 45 planner and query tests
* Targeted planner tests for various supported queries.
* Targeted functional tests to confirm plans are executable and
return correct result. These exercise a mix of the supported
features - correlated/uncorrelated, aggregate functions,
EXISTS/comparator, etc.
* Tests for BETWEEN predicate, which is supported as a side-effect
of being rewritten during analysis.
Change-Id: I64588992901afd7cd885419a0b7f949b0b174976
Reviewed-on: http://gerrit.cloudera.org:8080/16152
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Removed the support for dateless timestamps.
During dateless timestamp casts, if the format doesn't contain a
date part we get an error during tokenization of the format.
If the input string doesn't contain a date part then we get a NULL
result.
Examples:
select cast('01:02:59' as timestamp);
This will come back as NULL value.
select to_timestamp('01:01:01', 'HH:mm:ss');
select cast('01:02:59' as timestamp format 'HH12:MI:SS');
select cast('12 AM' as timestamp FORMAT 'AM.HH12');
These will come back with parsing errors.
Casting from a table will generate similar results.
Testing:
Modified the previous tests related to dateless timestamps.
Added tests to read from tables which still contain dateless
timestamps and covered the timestamp-to-string path when no date
tokens are requested in the output string.
Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Reviewed-on: http://gerrit.cloudera.org:8080/15866
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.
Compared to ndv() these functions bring more flexibility, as once we
have fed data into the sketch it can be written to a table, and next
time we can save scanning through the dataset and simply return the
estimate using the sketch. This doesn't come for free, however, as
performance measurements show that ndv() is 2x-3.5x faster than
sketching. On the other hand, if we query the estimate from an
existing sketch then the runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive to the order of the data fed to the
sketch, and as a result running these algorithms in Impala gives
non-deterministic results within the error bounds of the algorithm.
In terms of correctness, DataSketches HLL is most of the time within
a 2% range of the correct result, but there are occasional spikes
where the difference is bigger; it never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized,
this implementation currently hard-codes these parameters and uses
HLL_4 and lg_k=12.
For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html
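A hypothetical usage of the two functions together (the table and
column are only examples):
select ds_hll_estimate(ds_hll_sketch(l_orderkey))
from tpch_parquet.lineitem;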
Testing:
- Added some tests running estimates for small datasets where the
amount of data is small enough to get the correct results.
- Ran manual tests on TPCH25.lineitem to compare performance with
ndv(). Depending on data characteristics ndv() appears 2x-3.5x
faster. The lower the cardinality of the dataset the bigger the
difference between the 2 algorithms is.
- Ran manual tests on TPCH25.lineitem and
functional_parquet.alltypes to compare correctness with ndv(). See
results above.
Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Reviewed-on: http://gerrit.cloudera.org:8080/16000
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Support rewriting subqueries in the HAVING clause by nesting the
aggregation query and pulling up the subquery predicates into the outer
WHERE clause.
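A hypothetical query of the kind this enables (tables are made up):
select c_nationkey, count(*)
from customer
group by c_nationkey
having count(*) > (select min(n_nationkey) from nation);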
Testing:
* New analyzer tests
* New functional subquery tests
* Added Q23, Q24 and Q44 to the tpcds workload
* Ran subquery rewrite tests
Change-Id: I124a58a09a1a47e1222a22d84b54fe7d07844461
Reviewed-on: http://gerrit.cloudera.org:8080/16052
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Extend StmtRewriter with the ability to rewrite scalar subqueries in the
select list into cross joins. Currently the subquery must pass plan-time
checks to determine that it returns a single row, which may miss cases
that would be valid at runtime or with more complex evaluation of the
predicate expressions in the planner. Support for correlated subqueries
will be a follow-on change.
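A hypothetical uncorrelated example of the kind that can now be
rewritten (tables are made up):
select l_orderkey,
       (select max(o_totalprice) from orders) max_price
from lineitem;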
Testing:
* Added new analyzer tests, updated previous subquery tests
* test_queries.py::TestQueries::test_subquery
* Added test_tpcds_q9 to e2e and planner tests
Change-Id: Ibcf55d26889aa01d69bb85f18c9241dda095fb66
Reviewed-on: http://gerrit.cloudera.org:8080/16007
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
When the null count is 0, the IsNullPredicate's selectivity
was not being computed since the code did not distinguish
between a -1 (no stats) and a 0 null count. This caused
a default selectivity estimate to be applied. This
patch fixes it by explicitly checking whether the null
count stat is present and, if so, using it regardless of
whether it is 0 or more.
Testing:
- Added cardinality tests for IS NULL and IS NOT NULL.
- Ran PlannerTest and updated baseline plans.
- Updated expected selectivity for null predicate tests
in ExprCardinalityTest.
- Ran precommit tests through gerrit-verify-dryrun
Change-Id: I46c084be780b8f5aead9e2b9656fbab6cc8c8874
Reviewed-on: http://gerrit.cloudera.org:8080/16131
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the following 12 TPCDS queries to the class of
TestTpcdsDecimalV2Query: Q26, Q30, Q31, Q47, Q48, Q57, Q58, Q59, Q63,
Q83, Q85, and Q89. All the queries except for Q31 are added to the class
of TestTpcdsQuery as well because Impala returns one fewer row than
expected for TestTpcdsQuery::test_tpcds_q31(), which requires further
investigation.
To verify whether or not the returned result set from Impala for a given
query is correct, we compare the result set with that produced by the
HiveServer2 (HS2) in Impala's mini-cluster. We could execute SQL
statements in HS2 via Beeline, HS2's command line shell, which could be
launched by the following command.
beeline -u "jdbc:hive2://localhost:11050/default"
We note that among these 12 queries, the execution of Q31, Q58, and Q83
results in the error "Counters limit exceeded" from TEZ. To work around
this problem, for these 3 queries we have to execute the following
statement before running them to increase the default number of
counters, which is set to 120.
set tez.counters.max=1200
On the other hand, the 'reason' table is referenced by Q85. This
table was not referenced by any TPCDS query before this patch and thus
was not created. Therefore, in this patch we also modify
tpcds_schema_template.sql to create this additional table along with its
data.
Testing:
- Verified that this patch passes the exhaustive tests in the DEBUG
build.
Change-Id: Ib5f260e75a3803aabe9ccef271ba94036f96e5cf
Reviewed-on: http://gerrit.cloudera.org:8080/16119
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
"Original files" are files that don't have full ACID schema. We can see
such files if we upgrade a non-ACID table to full ACID. Also, the LOAD
DATA statement can load non-ACID files into full ACID tables. So such
files don't store special ACID columns, that means we need
to auto-generate their values. These are (operation,
originalTransaction, bucket, rowid, and currentTransaction).
With the exception of 'rowid', all of them can be calculated based on
the file path, so I add their values to the scanner's template tuple.
'rowid' is the ordinal number of the row inside a bucket inside a
directory. For now Impala only allows one file per bucket per
directory. Therefore we can generate row ids for each file
independently.
Multiple files in a single bucket in a directory can only be present if
the table was non-transactional earlier and we upgraded it to full ACID
table. After the first compaction we should only see one original file
per bucket per directory.
In HdfsOrcScanner we calculate the first row id for our split then
the OrcStructReader fills the rowid slot with the proper values.
Testing:
* added e2e tests to check if the generated values are correct
* added e2e test to reject tables that have multiple files per bucket
* added unit tests to the new auxiliary functions
Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953
Reviewed-on: http://gerrit.cloudera.org:8080/16001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>