We found that the tests test_iceberg_query and test_iceberg_profile
fail after the patch for IMPALA-9741 was merged, and that this is
because Impala's default timezone is not UTC. This patch fixes the
issue by adding "SET TIMEZONE=UTC;" before those test queries are run.
Testing:
- Verified in a local development environment that test_iceberg_query
and test_iceberg_profile pass after applying this patch.
Change-Id: Ie985519e8ded04f90465e141488bd2dda78af6c3
Reviewed-on: http://gerrit.cloudera.org:8080/16425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for querying Iceberg tables through Impala.
We can use the following SQL to create an external Iceberg table:
CREATE EXTERNAL TABLE default.iceberg_test (
level string,
event_time timestamp,
message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
Or create it with just a table name and location, like this:
CREATE EXTERNAL TABLE default.iceberg_test
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
'iceberg_file_format' is the file format used by Iceberg. Currently only
PARQUET is supported; other formats will be supported in the future. If
you don't specify this property in your SQL, the default file format is
PARQUET.
We implement this by treating the Iceberg table as a normal
unpartitioned HDFS table. When querying an Iceberg table, we push down
partition column predicates to Iceberg to decide which data files need
to be scanned, and then pass this information to the BE to do the
actual scan.
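As a hypothetical illustration (assuming 'level' is an Iceberg partition
column of the example table above), the predicate below would be pushed
to Iceberg so that only the matching data files are scanned:
SELECT count(*) FROM default.iceberg_test WHERE level = 'ERROR';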
Testing:
- Unit test for Iceberg in FileMetadataLoaderTest
- Create table tests in functional_schema_template.sql
- Iceberg table query test in test_scanners.py
Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
Reviewed-on: http://gerrit.cloudera.org:8080/16143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Implementing codegen for HiveUdfCall.
Testing:
Verified that java udf tests pass locally.
Benchmarks:
Used a UDF from TestUdf.java that adds three integers:
create function tpch15_parquet.sum3(int, int, int) returns int
location '/test-warehouse/impala-hive-udfs.jar'
symbol='org.apache.impala.TestUdf';
Used the following query on the master branch and the change's branch:
set num_nodes=1; set mt_dop=1;
select min(tpch15_parquet.sum3(cast(l_orderkey as int),
cast(l_partkey as int), cast(l_suppkey as int)))
from tpch15_parquet.lineitem;
Results averaged over 100 runs after warmup:
Master: 20.6346s, stddev: 0.3132411856765332
This change: 19.0256s, stddev: 0.42039019873436
This is a ~7.8% improvement.
Change-Id: I2f994dac550f297ed3c88491816403f237d4d747
Reviewed-on: http://gerrit.cloudera.org:8080/16314
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for constant propagation of range predicates
involving date and timestamp constants. Previously, only equality
predicates were considered for propagation. The new type of propagation
is shown by the following example:
Before constant propagation:
WHERE date_col = CAST(timestamp_col as DATE)
AND timestamp_col BETWEEN '2019-01-01' AND '2020-01-01'
After constant propagation:
WHERE date_col >= '2019-01-01' AND date_col <= '2020-01-01'
AND timestamp_col >= '2019-01-01' AND timestamp_col <= '2020-01-01'
AND date_col = CAST(timestamp_col as DATE)
As a consequence, since Impala supports table partitioning by date
columns but not timestamp columns, the above propagation enables
partition pruning based on timestamp ranges.
Existing code for equality based constant propagation was refactored
and consolidated into a new class which handles both equality and
range based constant propagation. Range based propagation is only
applied to date and timestamp columns.
Testing:
- Added new range constant propagation tests to PlannerTest.
- Added e2e test for range constant propagation based on a newly
added date partitioned table.
- Ran precommit tests.
Change-Id: I811a1f8d605c27c7704d7fc759a91510c6db3c2b
Reviewed-on: http://gerrit.cloudera.org:8080/16346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Impala TRUNCATEs an ACID table, it creates a new base directory
with the hidden file "_empty" in it. Newer Hive versions ignore files
starting with underscore, therefore they ignore the whole base
directory.
To resolve this issue we can simply rename the empty file to "empty".
Testing:
* update acid-truncate.test accordingly
Change-Id: Ia0557b9944624bc123c540752bbe3877312a7ac9
Reviewed-on: http://gerrit.cloudera.org:8080/16396
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently Impala checks the file metadata field 'hive.acid.version' to
decide whether a file has the full ACID schema. There are cases when
Hive forgets to set this value for full ACID files, e.g. query-based
compactions. So it's more robust to check the schema elements instead
of the metadata field. Also, Hive sometimes writes the schema with
different character cases, e.g. originalTransaction vs
originaltransaction, so we should compare the column names in a
case-insensitive way.
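For reference, a sketch of the full ACID schema shape whose elements are
checked (column names may appear with different character cases, and the
user columns are nested under 'row'):
struct<operation:int, originalTransaction:bigint, bucket:int,
       rowId:bigint, currentTransaction:bigint,
       row:struct<...user columns...>>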
Testing:
* added test for full ACID compaction
* added test_full_acid_schema_without_file_metadata_tag to test full
ACID file without metadata 'hive.acid.version'
Change-Id: I52642c1755599efd28fa2c90f13396cfe0f5fa14
Reviewed-on: http://gerrit.cloudera.org:8080/16383
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds a BLOOM_FILTER_ERROR_RATE option, taking a value between 0 and
1 (exclusive), that can override the default target false positive
probability (fpp) of 0.75 used for selecting the filter size.
It does not affect whether filters are disabled at runtime.
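A minimal usage sketch (the 0.1 value is only illustrative):
set BLOOM_FILTER_ERROR_RATE=0.1;
select count(*) from customer join nation on n_nationkey = c_nationkey;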
Adds estimated FPP and bloom size to the routing
table so we have some observability. Here is an
example:
tpch_kudu> select count(*) from customer join nation on n_nationkey = c_nationkey;
ID Src. Node Tgt. Node(s) Target type Partition filter Pending (Expected) First arrived Completed Enabled Bloom Size Est fpp
-----------------------------------------------------------------------------------------------------------------------------------------
1 2 0 LOCAL false 0 (3) N/A N/A true MIN_MAX
0 2 0 LOCAL false 0 (3) N/A N/A true 1.00 MB 1.04e-37
Testing:
Added a test that shows the query option affecting filter size.
Ran core tests.
Change-Id: Ifb123a0ea1e0e95d95df9837c1f0222fd60361f3
Reviewed-on: http://gerrit.cloudera.org:8080/16377
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for the Cumulative Distribution Function (CDF) from the
Apache DataSketches KLL algorithm collection. It receives a serialized
KLL sketch and one or more float values to represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), (-inf, 5), (-inf, 10), (-inf, +inf)
Returns a comma-separated string where each value in the string is a
number in the range of [0,1] that shows what fraction of the data is
in the particular range.
Note, ds_kll_cdf() should return an Array of doubles as the result, but
for that we have to wait for complex type support. Until then, we
provide ds_kll_cdf_as_string(), which can be deprecated once we have
array support. The tracking Jira for returning complex types from
functions is IMPALA-9520.
Example:
select ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.2,0.401644,1,1 |
+----------------------------------------------------------+
Change-Id: I77e6afc4556ad05a295b89f6d06c2e4a6bb2cf82
Reviewed-on: http://gerrit.cloudera.org:8080/16359
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for the Probability Mass Function (PMF) from the Apache
DataSketches KLL algorithm collection. It receives a serialized KLL
sketch and one or more float values to represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), [1, 5), [5, 10), [10, +inf)
Returns a comma-separated string where each value in the string is a
number in the range of [0,1] that shows what fraction of the data is
in the particular range.
Note, ds_kll_pmf() should return an Array of doubles as the result, but
for that we have to wait for complex type support. Until then, we
provide ds_kll_pmf_as_string(), which can be deprecated once we have
array support. The tracking Jira for returning complex types from
functions is IMPALA-9520.
Example:
select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.202192,0.199452,0.598356,0 |
+----------------------------------------------------------+
Change-Id: I222402f2dce2f49ab2b3f6e81a709da5539293ba
Reviewed-on: http://gerrit.cloudera.org:8080/16336
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The bug was that the statement rewriter converted NOT IN <subquery>
predicates to != <subquery> predicates when the subquery could
be an empty set. This was invalid, because NOT IN (<empty set>)
is true, but != (<empty set>) is false.
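A minimal example of the semantics (hypothetical tables t and u):
-- The subquery below returns no rows, so NOT IN must evaluate to TRUE for
-- every row of t, while rewriting it to != (<empty set>) would wrongly
-- return no rows.
SELECT * FROM t WHERE t.x NOT IN (SELECT u.y FROM u WHERE false);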
Testing:
Added targeted planner and end-to-end tests.
Ran exhaustive tests.
Change-Id: I66c726f0f66ce2f609e6ba44057191f5929a67fc
Reviewed-on: http://gerrit.cloudera.org:8080/16338
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function is very similar to ds_kll_quantile() but this one can
receive any number of rank parameters and returns a comma separated
string that holds the results for all of the given ranks.
For more details about ds_kll_quantile() see IMPALA-9959.
Note, ds_kll_quantiles() should return an Array of floats as the result,
but for that we have to wait for complex type support. Until then, we
provide ds_kll_quantiles_as_string(), which can be deprecated once we
have array support. The tracking Jira for returning complex types from
functions is IMPALA-9520.
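A hypothetical call, following the pattern of the other ds_kll functions:
select ds_kll_quantiles_as_string(ds_kll_sketch(float_col), 0.25, 0.5, 0.75)
from alltypes;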
Change-Id: I76f6039977f4e14ded89a3ee4bc4e6ff855f5e7f
Reviewed-on: http://gerrit.cloudera.org:8080/16324
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The minimum requirement for a spillable operator is ((min_buffers - 2) *
default_buffer_size) + 2 * max_row_size. In the min reservation, we only
reserve space for two large pages: one for reading, the other for
writing. However, to make the non-streaming GroupingAggregator work
correctly, we have to manage these extra reservations carefully, so it
won't run out of the min reservation when it actually needs to spill a
large page, or when it actually needs to read a large page.
To be specific, the large write page reservation is managed as follows,
depending on whether needs_serialize is true or false:
- If the aggregator needs to serialize the intermediate results when
spilling a partition, we have to save a large page worth of
reservation for the serialize stream, in case it needs to write large
rows. This space can be restored when all the partitions are spilled
so the serialize stream is not needed until we build/repartition a
spilled partition and thus have pinned partitions again. If the large
write page reservation is used, we save it back whenever possible
after we spill or close a partition.
- If the aggregator doesn't need the serialize stream at all, we can
restore the large write page reservation whenever we fail to add a
large row, before spilling any partitions. Reclaim it whenever
possible after we spill or close a partition.
A special case is when we are processing a large row that is the last
row in building/repartitioning a spilled partition: the large write page
reservation can be restored for it regardless of whether we need the
serialize stream, because partitions will be read out after this point,
so there is no need for further spilling.
The large read page reservation is transferred to the spilled
BufferedTupleStream that we are reading when building/repartitioning a
spilled partition. The stream will restore some of it when reading a
large page, and reclaim it when the output row batch is reset. Note that
since the stream is read in attach_on_read mode, the large page will be
attached to the row batch's buffers and only gets freed when the row
batch is reset.
Tests:
- Add tests in test_spilling_large_rows (test_spilling.py) with
different row sizes to reproduce the issue.
- One test in test_spilling_no_debug_action becomes flaky after this
patch. Revise the query to make the udf allocate larger strings so it
can consistently pass.
- Run CORE tests.
Change-Id: I3d9c3a2e7f0da60071b920dec979729e86459775
Reviewed-on: http://gerrit.cloudera.org:8080/16240
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This fix addresses the current limitation that an ill-formatted
Parquet version string is not properly formatted before appearing
in an error message or impalad.INFO. With the fix, any such string is
converted to a hex string first. The hex string is a sequence of
hex-digit groups separated by spaces, where each group is one or two
hex digits, for example "6c 65 2e a" (four groups).
Testing:
Ran "core" tests successfully.
Change-Id: I281d6fa7cb2f88f04588110943e3e768678b9cf1
Reviewed-on: http://gerrit.cloudera.org:8080/16331
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
In this test there is no need to check for "Error parsing row",
since "Error converting column" is enough to make sure we are
no longer able to read dateless timestamps.
Change-Id: Ia97490288dae81561969d260739a07ec42571f48
Reviewed-on: http://gerrit.cloudera.org:8080/16334
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This work addresses a current limitation in computing the total row
count for a Hive table in a scan. The row count can be incorrectly
computed as 0 even though there is data in the Hive table; this is
stats corruption at the table level. Similar stats corruption exists
for a partition. The row count of a table or a partition can also
sometimes be -1, which indicates missing stats.
With the fix, as long as no partition in a Hive table exhibits any
missing or corrupt stats, the total row count for the table is computed
from the row counts of all partitions. Otherwise, Impala looks at the
table-level stats, particularly the table row count.
In addition, if the table stats are missing or corrupt, Impala
estimates a row count for the table, if feasible. This row count is
the sum of the row counts from the partitions with good stats and an
estimate of the number of rows in the partitions with missing or
corrupt stats. Such estimation also applies when some partitions have
corrupt stats.
One way to observe the fix is through the explain of queries scanning
Hive tables with missing or corrupted stats. The cardinality for any
full scan should be a positive value (i.e. the estimated row count),
instead of 'unavailable'. At the beginning of the explain output,
that table is still listed in the WARNING section for potentially
corrupt table statistics.
Testing:
1. Ran unit tests with queries documented in the case against Hive
tables with the following configurations:
a. No stats corruption in any partitions
b. Stats corruption in some partitions
c. Stats corruption in all partitions
2. Added two new tests in test_compute_stats.py:
a. test_corrupted_stats_in_partitioned_Hive_tables
b. test_corrupted_stats_in_unpartitioned_Hive_tables
3. Fixed failures in corrupt-stats.test
4. Ran "core" test
Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Reviewed-on: http://gerrit.cloudera.org:8080/16098
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For example, base64decode('YWxwaGE%') will return
'alpha\377' on newer OSes that have a newer SASL library.
I tested it on the Ubuntu 18.04 aarch64 version.
Change-Id: Ib9bd9e03d5f744c18c957cdaf2064fa918086004
Reviewed-on: http://gerrit.cloudera.org:8080/16175
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This implements scanning full ACID tables that contain complex types.
The same technique that we use for primitive types works here, i.e. we
add a LEFT ANTI JOIN on top of the HDFS scan node in order to subtract
the deleted rows from the inserted rows.
However, there were some types of queries where we couldn't do that.
These are the queries that scan the nested collection items directly.
E.g.: SELECT item FROM complextypestbl.int_array;
The above query only creates a single tuple descriptor that holds the
collection items. Since this tuple descriptor is not at the table-level,
we cannot add slot references to the hidden ACID columns, which are at
the top level of the table schema.
To resolve this I added a statement rewriter that rewrites the above
statement to the following:
SELECT item FROM complextypestbl $a$1, $a$1.int_array;
Now in this example we'll have two tuple descriptors, one for the
table-level, and one for the collection item. So we can add the ACID
slot refs to the table-level tuple descriptor. The rewrite is
implemented by the new AcidRewriter class.
Performance
I executed the following query with num_nodes=1 on a non-transactional
table (without the rewrite), and on an ACID table (with the rewrite):
select count(*) from customer_nested.c_orders.o_lineitems;
Without the rewrite:
Fetched 1 row(s) in 0.41s
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
| Operator | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
| F00:ROOT | 1 | 1 | 13.61us | 13.61us | | | 0 B | 0 B | |
| 01:AGGREGATE | 1 | 1 | 3.68ms | 3.68ms | 1 | 1 | 16.00 KB | 10.00 MB | FINALIZE |
| 00:SCAN HDFS | 1 | 1 | 280.47ms | 280.47ms | 6.00M | 15.00M | 56.98 MB | 8.00 MB | tpch_nested_orc_def.customer.c_orders.o_lineitems |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
With the rewrite:
Fetched 1 row(s) in 0.42s
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
| Operator | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
| F00:ROOT | 1 | 1 | 25.16us | 25.16us | | | 0 B | 0 B | |
| 05:AGGREGATE | 1 | 1 | 3.44ms | 3.44ms | 1 | 1 | 63.00 KB | 10.00 MB | FINALIZE |
| 01:SUBPLAN | 1 | 1 | 16.52ms | 16.52ms | 6.00M | 125.92M | 47.00 KB | 0 B | |
| |--04:NESTED LOOP JOIN | 1 | 1 | 188.47ms | 188.47ms | 0 | 10 | 24.00 KB | 12 B | CROSS JOIN |
| | |--02:SINGULAR ROW SRC | 1 | 1 | 0ns | 0ns | 0 | 1 | 0 B | 0 B | |
| | 03:UNNEST | 1 | 1 | 25.37ms | 25.37ms | 0 | 10 | 0 B | 0 B | $a$1.c_orders.o_lineitems o_lineitems |
| 00:SCAN HDFS | 1 | 1 | 96.26ms | 96.26ms | 100.00K | 12.59M | 38.19 MB | 72.00 MB | default.customer_nested $a$1 |
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
So the overhead is very small.
Testing
* Added planner tests to PlannerTest/acid-scans.test
* E2E query tests to QueryTest/full-acid-complex-type-scans.test
* E2E tests for rowid-generation: QueryTest/full-acid-rowid.test
Change-Id: I8b2c6cd3d87c452c5b96a913b14c90ada78d4c6f
Reviewed-on: http://gerrit.cloudera.org:8080/16228
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This function receives a set of serialized Apache DataSketches KLL
sketches produced by ds_kll_sketch() and merges them into a single
sketch.
An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, depending on which
partitions the user is interested in, union the relevant sketches
together to get an estimate. E.g.:
SELECT
ds_kll_quantile(ds_kll_union(sketch_col), 0.5)
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Testing:
- Apart from the automated tests I added to this patch I also
tested ds_kll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches grouped by
l_shipdate and called ds_kll_union() on those sketches.
Change-Id: I020aea28d36f9b6ef9fb57c08411f2170f5c24bf
Reviewed-on: http://gerrit.cloudera.org:8080/16267
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ds_kll_rank() receives two parameters: a STRING that represents a
serialized DataSketches KLL sketch and a float to provide a probing
value in the sketch.
Returns a DOUBLE in the range of [0,1] that is the rank of the given
probing value. E.g. a return value of 0.2 means that the probing value
given as a parameter is greater than 20% of all the values in the
sketch. Note, this is an approximate calculation.
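A hypothetical usage, following the pattern of the other ds_kll functions:
select ds_kll_rank(ds_kll_sketch(float_col), 4) from alltypes;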
Change-Id: I95857886dfbb8c84aeeaf718c0e610012fda4be0
Reviewed-on: http://gerrit.cloudera.org:8080/16283
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
INTERSECT and EXCEPT set operations are implemented as rewrites to
joins. Currently only the DISTINCT qualified operators are implemented,
not ALL qualified. The operator MINUS is supported as an alias for
EXCEPT.
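For example (hypothetical tables t1 and t2):
SELECT id FROM t1 INTERSECT SELECT id FROM t2;
SELECT id FROM t1 EXCEPT SELECT id FROM t2;  -- MINUS can be used as an alias for EXCEPT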
We mimic Oracle and Hive's non-standard implementation which treats all
operators with the same precedence, as opposed to the SQL Standard of
giving INTERSECT higher precedence.
A new class SetOperationStmt was created to encompass the previous
UnionStmt behavior. UnionStmt is preserved as a special case of union
only operands to ensure compatibility with previous union planning
behavior.
Tests:
* Added parser and analyzer tests.
* Ensured no test failures or plan changes for union tests.
* Added TPC-DS queries 14,38,87 to functional and planner tests.
* Added functional tests test_intersect test_except
* New planner testSetOperationStmt
Change-Id: I5be46f824217218146ad48b30767af0fc7edbc0f
Reviewed-on: http://gerrit.cloudera.org:8080/16123
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
ds_kll_sketch() is an aggregate function that receives a float
parameter (e.g. a float column of a table) and returns a serialized
Apache DataSketches KLL sketch of the input data set wrapped into
STRING type. This sketch can be saved into a table or view and later
used for quantile approximations. ds_kll_quantile() receives two
parameters: a STRING parameter that contains a serialized KLL sketch
and a DOUBLE that represents the rank of the quantile in the range of
[0,1]. E.g. rank=0.1 means the approximate value in the sketch where
10% of the sketched items are less than or equal to this value.
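A hypothetical end-to-end usage:
select ds_kll_quantile(ds_kll_sketch(float_col), 0.5) from alltypes;  -- approximate median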
Testing:
- Added automated tests on small data sets to check the basic
functionality of sketching and getting a quantile approximate.
- Tested on TPCH25_parquet.lineitem to check that sketching and
approximating work at bigger scale as well, where serialize/merge
phases are also required. At this scale the error of the quantile
approximation is within 1-1.5%.
Change-Id: I11de5fe10bb5d0dd42fb4ee45c4f21cb31963e52
Reviewed-on: http://gerrit.cloudera.org:8080/16235
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In Hive, empty strings don't count as separate values when querying
count(distinct) estimates using the Apache DataSketches HLL algorithm
on strings and varchars.
For compatibility's sake Impala should not count them either.
Tests:
- Added extra tests for HLL with empty strings
Change-Id: Ie7648217bbe2f66b817788f131c062f349b1e9ad
Reviewed-on: http://gerrit.cloudera.org:8080/16226
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.
An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, depending on which
partitions the user is interested in, union the relevant sketches
together to get an estimate. E.g.:
SELECT
ds_hll_estimate(ds_hll_union(sketch_col))
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Note, there is currently a known limitation when unioning string types
where some input sketches come from Impala and some from Hive: if there
is an overlap in the input data used by Impala and by Hive, the
overlapping data is still counted twice due to a string representation
difference between Impala and Hive.
For more details see:
https://issues.apache.org/jira/browse/IMPALA-9939
Testing:
- Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches grouped by
l_shipdate and called ds_hll_union() on those sketches.
Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Reviewed-on: http://gerrit.cloudera.org:8080/16095
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change bumps up the CDP_BUILD_NUMBER to 4493826. This is needed
to fix a failing test.
Hive started to assign bucket ids to files differently. Because of
that I had to modify the test_full_acid_rowid test that had an
assumption about how bucket ids are assigned to files.
If you have problems restarting the Hive Metastore, try the following:
buildall.sh <your usual flags> -upgrade_metastore_db
If you have problems restarting Kudu, try the following:
Unset LD_LIBRARY_PATH in your shell, and stop setting it in
impala-config-local.sh
Change-Id: Ia4635feef146c945624135e0715495bb01ea4699
Reviewed-on: http://gerrit.cloudera.org:8080/16195
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
When generating plans with left semi/anti joins (typically
resulting from subquery rewrites), the planner now
considers inserting a distinct aggregation on the inner
side of the join. The decision is based on whether that
aggregation would reduce the number of rows by more than
75%. This is fairly conservative and the optimization
might be beneficial for smaller reductions, but the
conservative threshold is chosen to reduce the number
of potential plan regressions.
The aggregation can both reduce the # of rows and the
width of the rows, by projecting out unneeded slots.
ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION query option is
added to allow toggling the optimization.
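As a sketch, a query of the following shape (hypothetical tables) can
benefit: the IN subquery is rewritten to a left semi join, and the planner
may now place a distinct aggregation on the dim_table side if it reduces
its rows by more than 75%:
set ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION=true;  -- the query option added by this patch
select count(*) from big_table where key in (select key from dim_table);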
Tests:
* Add positive and negative planner tests for various
cases - including semi/anti joins, missing stats,
broadcast/shuffle, different numbers of join predicates.
* Add some end-to-end tests to verify plans execute correctly.
Change-Id: Icbb955e805d9e764edf11c57b98f341b88a37fcc
Reviewed-on: http://gerrit.cloudera.org:8080/16180
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves this by assigning a unique row-id to each row and
maintaining two sets of files in a table. The first set is in the
base/delta directories; they contain the INSERTed rows. The second set
of files is in the delete-delta directories; they contain the DELETEd
rows.
(UPDATE operations are implemented via DELETE+INSERT.)
In the filesystem it looks like e.g.:
* full_acid/delta_0000001_0000001_0000/0000_0
* full_acid/delta_0000002_0000002_0000/0000_0
* full_acid/delete_delta_0000003_0000003_0000/0000_0
During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans full ACID tables (that also have deleted rows) is converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.
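Conceptually, the generated plan corresponds to a sketch like the
following, where insert_deltas and delete_deltas are hypothetical names
standing for the two HDFS scans, and the join keys are the ACID row-id
columns:
SELECT ins.*
FROM insert_deltas ins
LEFT ANTI JOIN delete_deltas del
  ON ins.originalTransaction = del.originalTransaction
  AND ins.bucket = del.bucket
  AND ins.rowId = del.rowId;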
Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.
The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.
Testing:
* added planner test
* added e2e tests
Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Reviewed-on: http://gerrit.cloudera.org:8080/16082
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Implements grouping() and grouping_id() builtins.
grouping_id() has both a no-arg version, which returns
a bit vector of all grouping exprs and a varargs version,
which returns a bit vector of the provided arguments.
Grouping is a keyword, so it needs special handling in the
parser to be accepted as a function name.
These functions are implemented in the transpose agg
with a CASE expression similar to other aggregate functions,
but returning the grouping() or grouping_id() value for that
aggregation class instead of an aggregated value.
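A hypothetical example (assuming a table t with columns a and b):
SELECT a, b, grouping(a), grouping(b), grouping_id(), count(*)
FROM t
GROUP BY ROLLUP(a, b);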
Testing:
* Added parser test for grouping keyword.
* Added analysis tests for the functions.
* Added basic planner test to show expressions generated
* Added some TPC-DS queries that use grouping() - queries
80, 70 and 86 using reference .test files from Fang-Yu
Rao. 27 and 36 were added with reference results from
https://github.com/cwida/tpcds-result-reproduction
* Add targeted end-to-end tests.
* Added view compatibility test with Hive.
Change-Id: If0b1640d606256c0fe9204d2a21a8f6d06abcdb6
Reviewed-on: http://gerrit.cloudera.org:8080/16140
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Integrates the parsing and analysis with plan generation.
Testing:
* Add analysis test to make sure we reject unsupported queries.
* Added targeted planner tests to ensure we generate the correct
aggregation classes for a variety of cases.
* Add targeted end-to-end functional tests.
Added five TPC-DS queries that use ROLLUP, building on some work done
by Fang-Yu Rao. Some tweaks were required for these tests.
* Add an extra ORDER BY clause to q77 to make fully deterministic.
* Add backticks around `returns` to avoid reserved word.
* Add INTERVAL keyword to date/timestamp arithmetic.
We can run q80, too, but I haven't added or verified results yet -
that can be done in a follow-up.
Change-Id: Ie454c5bf7aee266321dee615548d7f2b71380197
Reviewed-on: http://gerrit.cloudera.org:8080/16128
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
The patch for IMPALA-8954 failed to account for subqueries
that could produce < 1 row. SelectStmt.returnsSingleRow()
is confusing because it actually returns true if it
returns *at most* one row.
As a fix I split it into returnsExactlyOneRow() and
returnsAtMostOneRow(), then used returnsExactlyOneRow()
to determine if the subquery should instead be rewritten
into a LEFT OUTER JOIN, which produces the correct result.
CROSS JOIN is still preferred because it can be more freely
reordered during planning.
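A minimal example of the problematic case (hypothetical tables t and u):
-- The uncorrelated scalar subquery returns at most one row, possibly zero.
-- A CROSS JOIN would wrongly drop every row of t when the subquery is empty;
-- a LEFT OUTER JOIN returns each row of t with NULL for the subquery value.
SELECT t.id, (SELECT u.v FROM u WHERE u.id = -1 LIMIT 1) FROM t;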
Testing:
* Added planner tests for a range of scenarios where it can
be rewritten as a CROSS JOIN and where it needs to be a LEFT
OUTER JOIN for correctness.
* Added some targeted end-to-end tests where the results were
previously incorrect. Checked the behaviour against Hive and
postgres.
Ran exhaustive tests.
Change-Id: I6034aedac776783bdc8cdb3a2df344e2b3662da6
Reviewed-on: http://gerrit.cloudera.org:8080/16171
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch supports a subset of cases of subqueries
inside OR inside WHERE and HAVING clauses.
The approach used is to rewrite the subquery into
a many-to-one LEFT OUTER JOIN with the subquery and
then replace the subquery in the expression with a
reference to the single select list expression of
the subquery. This works because:
* A many-to-one LEFT OUTER JOIN returns one output row
for each left input row, meaning that for every row
in the original query before the rewrite, we get
the same row plus a single matched row from the subquery
* Expressions can be rewritten to refer to a slotref from
the right side of the LEFT OUTER JOIN without affecting
semantics. E.g. an IN subquery becomes <slot> IS NOT NULL
or <operator> (<subquery>) becomes <operator> <slot>.
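A sketch of a now-supported query shape (hypothetical tables t and u):
-- The correlated IN subquery inside the OR is rewritten into a many-to-one
-- LEFT OUTER JOIN on u, and the IN predicate becomes an IS NOT NULL check
-- on the joined slot.
SELECT * FROM t
WHERE t.a > 10 OR t.b IN (SELECT u.c FROM u WHERE u.d = t.d);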
This does not affect SELECT list subqueries, which are
rewritten using a different mechanism that can already
support some subqueries in disjuncts.
Correlated and uncorrelated subqueries are both supported, but
various limitations are present.
Limitations:
* Only one subquery per predicate is supported. The rewriting approach
should generalize to multiple subqueries but other code needs
refactoring to handle this case.
* EXISTS and NOT EXISTS subqueries are not supported. The rewriting
approach can generalise to that, but we need to add or pick a
select list item from the subquery to check for NULL/IS NOT NULL
and a little more work is required to do that correctly.
* NOT IN is not supported because of the special NULL semantics.
* Subqueries with aggregates + grouping by are not supported because
we rely on adding distinct to select list and we don't
support distinct + aggregations because of IMPALA-5098.
Tests:
* Positive analysis tests for IN and binary predicate operators.
* Negative analysis tests for unsupported subquery operators.
* Negative analysis tests for multiple subqueries.
* Negative analysis tests for runtime scalar subqueries.
* Positive and negative analysis tests for aggregations in subquery.
* TPC-DS Query 45 planner and query tests
* Targeted planner tests for various supported queries.
* Targeted functional tests to confirm plans are executable and
return correct result. These exercise a mix of the supported
features - correlated/correlated, aggregate functions,
EXISTS/comparator, etc.
* Tests for BETWEEN predicate, which is supported as a side-effect
of being rewritten during analysis.
Change-Id: I64588992901afd7cd885419a0b7f949b0b174976
Reviewed-on: http://gerrit.cloudera.org:8080/16152
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Removed the support for dateless timestamps.
During dateless timestamp casts, if the format doesn't contain a
date part we get an error during tokenization of the format.
If the input string doesn't contain a date part then we get a NULL result.
Examples:
select cast('01:02:59' as timestamp);
This will come back as a NULL value.
select to_timestamp('01:01:01', 'HH:mm:ss');
select cast('01:02:59' as timestamp format 'HH12:MI:SS');
select cast('12 AM' as timestamp FORMAT 'AM.HH12');
These will come back with parsing errors.
Casting from a table will generate similar results.
Testing:
Modified the previous tests related to dateless timestamps.
Added a test to read from tables which still contain dateless
timestamps, and covered the timestamp-to-string path when no date
tokens are requested in the output string.
Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Reviewed-on: http://gerrit.cloudera.org:8080/15866
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
These functions can be used to get cardinality estimates of data
using the HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or fed directly to ds_hll_estimate(), which returns the
cardinality estimate for that sketch.
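A hypothetical end-to-end usage:
select ds_hll_estimate(ds_hll_sketch(string_col)) from alltypes;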
Compared to ndv(), these functions bring more flexibility: once we
have fed data into the sketch, it can be written to a table, and next
time we can skip scanning through the dataset and simply return the
estimate using the sketch. This doesn't come for free, however, as
performance measurements show that ndv() is 2x-3.5x faster than
sketching. On the other hand, if we query the estimate from an
existing sketch, the runtime is negligible.
Another benefit of these sketches is that they can be merged
together, so e.g. if we have saved a sketch for each partition of a
table, they can be combined with each other based on the query
without touching the actual data.
DataSketches HLL is sensitive to the order of the data fed to the
sketch, and as a result running these algorithms in Impala gives
non-deterministic results within the error bounds of the algorithm.
In terms of correctness, DataSketches HLL is most of the time within
2% of the correct result, but there are occasional spikes where the
difference is bigger; it never goes beyond 5%.
Even though the DataSketches HLL algorithm could be parameterized,
this implementation currently hard-codes these parameters and uses
HLL_4 and lg_k=12.
For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html
Testing:
- Added some tests running estimates for small datasets where the
amount of data is small enough to get the correct results.
- Ran manual tests on TPCH25.lineitem to compare performance with
ndv(). Depending on data characteristics, ndv() appears 2x-3.5x
faster. The lower the cardinality of the dataset, the bigger the
difference between the two algorithms.
- Ran manual tests on TPCH25.lineitem and
functional_parquet.alltypes to compare correctness with ndv(). See
results above.
Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Reviewed-on: http://gerrit.cloudera.org:8080/16000
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Support rewriting subqueries in the HAVING clause by nesting the
aggregation query and pulling up the subquery predicates into the outer
WHERE clause.
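A hypothetical example of a query shape that is now supported:
select c_nationkey, count(*)
from customer
group by c_nationkey
having count(*) > (select count(*) / 25 from customer);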
Testing:
* New analyzer tests
* New functional subquery tests
* Added Q23, Q24 and Q44 to the tpcds workload
* Ran subquery rewrite tests
Change-Id: I124a58a09a1a47e1222a22d84b54fe7d07844461
Reviewed-on: http://gerrit.cloudera.org:8080/16052
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Extend StmtRewriter with the ability to rewrite scalar subqueries in the
select list into cross joins. Currently the subquery must pass plan-time
checks to determine that it returns a single row, which may miss cases
that would be valid at runtime or with more complex evaluation of the
predicate expressions in the planner. Support for correlated subqueries
will be a follow-on change.
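A hypothetical example of the supported shape (an uncorrelated subquery
that provably returns a single row):
select o_orderkey, (select max(l_shipdate) from lineitem) as max_shipdate
from orders;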
Testing:
* Added new analyzer tests, updated previous subquery tests
* test_queries.py::TestQueries::test_subquery
* Added test_tpcds_q9 to e2e and planner tests
Change-Id: Ibcf55d26889aa01d69bb85f18c9241dda095fb66
Reviewed-on: http://gerrit.cloudera.org:8080/16007
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
"Original files" are files that don't have full ACID schema. We can see
such files if we upgrade a non-ACID table to full ACID. Also, the LOAD
DATA statement can load non-ACID files into full ACID tables. So such
files don't store the special ACID columns, which means we need
to auto-generate their values. These are (operation,
originalTransaction, bucket, rowid, and currentTransaction).
With the exception of 'rowid', all of them can be calculated based on
the file path, so I add their values to the scanner's template tuple.
'rowid' is the ordinal number of the row inside a bucket inside a
directory. For now Impala only allows one file per bucket per
directory. Therefore we can generate row ids for each file
independently.
Multiple files in a single bucket in a directory can only be present if
the table was non-transactional earlier and we upgraded it to full ACID
table. After the first compaction we should only see one original file
per bucket per directory.
In HdfsOrcScanner we calculate the first row id for our split, and then
the OrcStructReader fills the rowid slot with the proper values.
Testing:
* added e2e tests to check if the generated values are correct
* added e2e test to reject tables that have multiple files per bucket
* added unit tests to the new auxiliary functions
Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953
Reviewed-on: http://gerrit.cloudera.org:8080/16001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala saves timestamps as the 12-byte TimestampValue structure with
nanosecond precision. Kudu stores timestamps as 8-byte Unix time in
microseconds. To avoid a data truncation issue in the bloom filter, add
a FunctionCallExpr with 'utc_to_unix_micros' as the root of the bloom
filter's source expression to convert timestamp values to microseconds
when building a timestamp bloom filter for Kudu.
Generated the functional date_tbl table in Kudu format for unit tests.
Added new test cases for Kudu Timestamp and Date bloom filters.
Testing:
Passed all core tests.
Change-Id: I3c1e9bcc9fd6d79a39f25eaa3396188fc0a52a48
Reviewed-on: http://gerrit.cloudera.org:8080/16094
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently if the materialization of any column cannot be codegen'd
because its type is unsupported (e.g. CHAR(N)), the whole codegen is
cancelled for the text scanner.
This commit adds the function TextConverter::SupportsCodegenWriteSlot
that returns whether the given ColumnType is supported. If the type is
not supported, HdfsScanner codegens code that calls the interpreted
version instead of failing codegen. For other columns codegen is used as
usual.
Benchmarks:
Copied and modified a TPCH table with scale factor 5 to add a CHAR
column to it:
USE tpch5;
CREATE TABLE IF NOT EXISTS lineitem_char AS
SELECT *, CAST(l_shipdate AS CHAR(10)) l_shipdate_char
FROM lineitem;
Run the following query 100 times after one warm-up run with and
without this change:
SELECT *
FROM tpch5.lineitem_char
WHERE
l_partkey BETWEEN 500 AND 500000 AND
l_linestatus = 'F' AND
l_quantity < 35 AND
l_extendedprice BETWEEN 2000 AND 8000 AND
l_discount > 0 AND
l_tax BETWEEN 0.04 AND 0.06 AND
l_returnflag IN ('A', 'N') AND
l_shipdate_char < '1996-06-20'
ORDER BY l_shipdate_char
LIMIT 10;
Without this commit: mean: 2.92, standard deviation: 0.13.
With this commit: mean: 2.21, standard deviation: 0.072.
Testing:
The interesting cases regarding char are covered in
0167c5b424/testdata/workloads/functional-query/queries/QueryTest/chars.test
Change-Id: Id370193af578ecf23ed3c6bfcc65fec448156fa3
Reviewed-on: http://gerrit.cloudera.org:8080/16059
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Z-ordering has been around for a while behind a feature flag
(unlock_zorder_sort). It's about time to turn this flag on by default.
Besides setting the flag to true, this commit merges the Z-order tests
from the custom cluster tests into the normal test files.
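A hypothetical example of DDL that no longer requires the flag, assuming
the SORT BY ZORDER clause form:
CREATE TABLE zorder_demo (a INT, b INT, c STRING)
SORT BY ZORDER (a, b)
STORED AS PARQUET;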
Tests:
- Run all related tests.
Change-Id: I653e0e2db8f7bc2dd077943b3acf667514d45811
Reviewed-on: http://gerrit.cloudera.org:8080/16003
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Separated "||" and "OR" into different tokens.
OR (KW_OR) remains the same. (it creates CompoundPredicate
and expects two BOOLEAN operands)
|| (KW_LOGICAL_OR) creates CompoundVerticalBarExpr
which expects two BOOLEAN operands or two STRING operands
CompoundVerticalBarExpr creates either a CompoundPredicate
or a FunctionCallExpr member variable based on the type
of the left operand during analyze.
Similarly to BetweenPredicate it cannot be executed directly
so its needs to be replaced by its member variable
by ExtractCompoundVerticalBarExprRule.
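For example:
SELECT TRUE || FALSE;    -- BOOLEAN operands: behaves like OR
SELECT 'foo' || 'bar';   -- STRING operands: resolved to a string function call (concatenation)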
Change-Id: Ie3f990d56ecb1e18d1b2737e8c5eab0d524edfaf
Reviewed-on: http://gerrit.cloudera.org:8080/15877
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This patch adds support for creating Iceberg tables through Impala.
We can use the following SQL to create a new Iceberg table:
create table iceberg_test(
level string,
event_time timestamp,
message string,
register_time date,
telephone array <string>
)
partition by spec(
level identity,
event_time identity,
event_time hour,
register_time day
)
stored as iceberg;
'identity' is one of Iceberg's partition transforms. 'identity' means that
the source data values are used to create partitions; other partition
transforms, such as BUCKET/TRUNCATE, will be supported in the future. We
can also use 'show create table iceberg_test' to display the table schema,
and use 'show partitions iceberg_test' to display partition column info.
Note that a partition column must be a source column of the table.
Testing:
- Add test cases in metadata/test_show_create_table.py.
- Add custom cluster test test_iceberg.py.
Change-Id: I8d85db4c904a8c758c4cfb4f19cfbdab7e6ea284
Reviewed-on: http://gerrit.cloudera.org:8080/15797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for reading Parquet files where the
DECIMAL is encoded as a FIXED_LEN_BYTE_ARRAY field
with extra padding. This requires loosening file
validation and fixing up the decoding so that it
no longer assumes that the in-memory value is at
least as large as the encoded representation.
The decimal decoding logic was reworked so that we
could add the extra condition handling without regressing
performance of the decoding logic in the common case. In
the end I was able to significantly speed up the decoding
logic. The bottleneck, revealed by perf record while running
the below benchmark, was CPU stalls on the bitshift used for
sign extension instruction waiting on loading the result of
ByteSwap(). I worked around this by doing the sign-extension
before the ByteSwap().
Perf:
Ran a microbenchmark to check that scanning perf didn't regress
as a result of the change. The query scans a DECIMAL column
that is mostly plain-encoded, so as to maximally stress the
FIXED_LEN_BYTE_ARRAY decoding performance.
set mt_dop=1;
set num_nodes=1;
select min(l_extendedprice)
from tpch_parquet.lineitem
The SCAN time in the summary averaged out to 94ms before
the change and is reduced to 74ms after the change. The
actual speedup of the DECIMAL decoding is greater - it
went from ~20% of the time to ~6% of the time as measured by
perf.
Testing:
Added a couple of parquet files that were generated with a
hacked version of Impala to have extra padding. Sanity-checked
that hacked tables returned the same results on Hive. The
tests failed before this code change.
Ran exhaustive tests with the hacked version of Impala
(so that all decimal tables got extra padding).
Change-Id: I2700652eab8ba7f23ffa75800a1712d310d4e1ec
Reviewed-on: http://gerrit.cloudera.org:8080/16090
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.
This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.
The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.
Testing:
- Dryrun of GVO
- Modified TestPartitionMetadataUncompressedTextOnly's
test_unsupported_text_compression() to add LZO case
Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Removed the 3 scalable configs added in IMPALA-8536:
- Max Memory Multiple
- Max Running Queries Multiple
- Max Queued Queries Multiple
This patch removes the functionality related to those configs but
retains the additional test coverage and cleanup added in
IMPALA-8536. This removal is to make it easier to enhance
Admission Control using Executor Groups which has turned out to
be a useful building block.
Testing:
Ran core tests.
Change-Id: Ib9bd63f03758a6c4eebb99c64ee67e60cb56b5ac
Reviewed-on: http://gerrit.cloudera.org:8080/16039
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This introduces the max-mt-dop setting for admission
control. If a statement runs with an MT_DOP setting that
exceeds the max-mt-dop, then the MT_DOP setting is
downgraded to the max-mt-dop value. If max-mt-dop is set
to a negative value, no limit is applied. max-mt-dop is
set via the llama-site.xml and can be set at the daemon
level or at the resource pool level. When there is no
max-mt-dop setting, it defaults to -1, so no limit is
applied. The max-mt-dop is evaluated once prior to query
planning. The MT_DOP settings for queries past planning
are not reevaluated if the policy changes.
If a statement is downgraded, its runtime profile contains
a message explaining the downgrade:
MT_DOP limited by admission control: Requested MT_DOP=9 reduced to MT_DOP=4.
Testing:
- Added custom cluster test with various max-mt-dop settings
- Ran core tests
Change-Id: I3affb127a5dca517591323f2b1c880aa4b38badd
Reviewed-on: http://gerrit.cloudera.org:8080/16020
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Updated CDP build to 7.2.1.0-57 to include new Hive features such as
HIVE-22995.
In the minicluster, the default values of hive.create.as.acid and
hive.create.as.insert.only are false, so by default Hive creates
external-type tables located in the external warehouse directory.
Due to HIVE-22995, desc db returns the external warehouse directory.
For the above reasons, we need to use the external warehouse dir in some
tests. Also add a new test for "CREATE DATABASE ... LOCATION".
Tested:
Re-ran the failed test in the minicluster.
Ran exhaustive tests.
Change-Id: I57926babf4caebfd365e6be65a399f12ea68687f
Reviewed-on: http://gerrit.cloudera.org:8080/15990
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>