Commit Graph

1379 Commits

Tim Armstrong
63f5e8ec00 IMPALA-1270: add distinct aggregation to semi joins
When generating plans with left semi/anti joins (typically
resulting from subquery rewrites), the planner now
considers inserting a distinct aggregation on the inner
side of the join. The decision is based on whether that
aggregation would reduce the number of rows by more than
75%. This is fairly conservative and the optimization
might be beneficial for smaller reductions, but the
conservative threshold is chosen to reduce the number
of potential plan regressions.

The aggregation can both reduce the # of rows and the
width of the rows, by projecting out unneeded slots.

ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION query option is
added to allow toggling the optimization.
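
A hedged sketch of toggling the option (the query shape and table
names are illustrative; only the option name comes from this patch):

  set ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION=true;
  -- The IN subquery is typically rewritten to a LEFT SEMI JOIN; with
  -- the option on, the planner may add a distinct aggregation on the
  -- inner side if stats suggest a large enough row reduction.
  select * from orders
  where o_custkey in (select c_custkey from customer);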

Tests:
* Add positive and negative planner tests for various
  cases - including semi/anti joins, missing stats,
  broadcast/shuffle, different numbers of join predicates.
* Add some end-to-end tests to verify plans execute correctly.

Change-Id: Icbb955e805d9e764edf11c57b98f341b88a37fcc
Reviewed-on: http://gerrit.cloudera.org:8080/16180
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-15 17:10:50 +00:00
Zoltan Borok-Nagy
f602c3f80f IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves this by assigning a unique row-id to each row and by
maintaining two sets of files in a table. The first set lives in the
base/delta directories and contains the INSERTed rows. The second set
lives in the delete-delta directories and contains the DELETEd rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like, e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans a full ACID table (one that also has deleted rows) is
converted into two HDFS SCANs, one for the INSERT deltas and one for
the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST
distribution mode is created above them.
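
A hedged illustration with a hypothetical table (the rewrite is
internal to the planner; the DELETE runs in Hive, since Impala does
not write full ACID tables):

  -- In Hive:
  delete from full_acid where i = 1;
  -- In Impala, the scan below is planned as
  -- SCAN(deltas) LEFT ANTI JOIN SCAN(delete-deltas):
  select * from full_acid;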

Later we can add support for other distribution modes if performance
requires it. E.g. if we have too many deleted rows then we are
probably better off with PARTITIONED distribution mode. We could
estimate the number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Reviewed-on: http://gerrit.cloudera.org:8080/16082
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-14 12:53:51 +00:00
Tim Armstrong
574fef2a76 IMPALA-9917: grouping() and grouping_id() support
Implements grouping() and grouping_id() builtins.
grouping_id() has both a no-arg version, which returns
a bit vector of all grouping exprs, and a varargs version,
which returns a bit vector of the provided arguments.

Grouping is a keyword, so it needs special handling in the
parser to be accepted as a function name.

These functions are implemented in the transpose agg
with a CASE expression similar to other aggregate functions,
but returning the grouping() or grouping_id() value for that
aggregation class instead of an aggregated value.
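
A hedged usage sketch (table and column names are illustrative):

  select l_returnflag, l_linestatus, count(*),
         grouping(l_linestatus), grouping_id()
  from tpch.lineitem
  group by rollup(l_returnflag, l_linestatus);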

Testing:
* Added parser test for grouping keyword.
* Added analysis tests for the functions.
* Added basic planner test to show expressions generated
* Added some TPC-DS queries that use grouping() - queries
  80, 70 and 86 using reference .test files from Fang-Yu
  Rao. 27 and 36 were added with reference results from
  https://github.com/cwida/tpcds-result-reproduction
* Add targeted end-to-end tests.
* Added view compatibility test with Hive.

Change-Id: If0b1640d606256c0fe9204d2a21a8f6d06abcdb6
Reviewed-on: http://gerrit.cloudera.org:8080/16140
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-14 03:13:18 +00:00
Tim Armstrong
3e1e7da229 IMPALA-9898: generate grouping set plans
Integrates the parsing and analysis with plan generation.

Testing:
* Add analysis test to make sure we reject unsupported queries.
* Added targeted planner tests to ensure we generate the correct
  aggregation classes for a variety of cases.
* Add targeted end-to-end functional tests.

Added five TPC-DS queries that use ROLLUP, building on some work done
by Fang-Yu Rao. Some tweaks were required for these tests.
* Add an extra ORDER BY clause to q77 to make it fully deterministic.
* Add backticks around `returns` to avoid reserved word.
* Add INTERVAL keyword to date/timestamp arithmetic.
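
A hedged sketch of the last two tweaks (minimal hypothetical
fragments, not the actual TPC-DS query text):

  select `returns` from my_returns_view;          -- backticks around reserved word
  select d_date + interval 30 days from date_dim; -- explicit INTERVAL keyword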

We can run q80, too, but I haven't added or verified results yet -
that can be done in a follow-up.

Change-Id: Ie454c5bf7aee266321dee615548d7f2b71380197
Reviewed-on: http://gerrit.cloudera.org:8080/16128
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-14 03:13:18 +00:00
Tim Armstrong
4e2498da6f IMPALA-9949: fix SELECT list subqueries with HAVING/LIMIT
The patch for IMPALA-8954 failed to account for subqueries
that could produce < 1 row. SelectStmt.returnsSingleRow()
is confusing because it actually returns true if it
returns *at most* one row.

As a fix I split it into returnsExactlyOneRow() and
returnsAtMostOneRow(), then used returnsExactlyOneRow()
to determine if the subquery should instead be rewritten
into a LEFT OUTER JOIN, which produces the correct result.

CROSS JOIN is still preferred because it can be more freely
reordered during planning.
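
A hedged illustration (hypothetical tables): the subquery below returns
at most one row, but possibly zero, so it must be rewritten as a LEFT
OUTER JOIN rather than a CROSS JOIN:

  select t.id,
         (select max(v) from u having max(v) > 100 limit 1) as m
  from t;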

Testing:
* Added planner tests for a range of scenarios where it can
  be rewritten as a CROSS JOIN and where it needs to be a LEFT
  OUTER JOIN for correctness.
* Added some targeted end-to-end tests where the results were
  previously incorrect. Checked the behaviour against Hive and
  postgres.

Ran exhaustive tests.

Change-Id: I6034aedac776783bdc8cdb3a2df344e2b3662da6
Reviewed-on: http://gerrit.cloudera.org:8080/16171
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-13 22:38:36 +00:00
Tim Armstrong
fea5dffec5 IMPALA-9924: handle single subquery in or predicate
This patch supports a subset of cases of subqueries
inside OR inside WHERE and HAVING clauses.

The approach used is to rewrite the subquery into
a many-to-one LEFT OUTER JOIN with the subquery and
then replace the subquery in the expression with a
reference to the single select list expressions of
the subquery. This works because:
* A many-to-one LEFT OUTER JOIN returns one output row
  for each left input row, meaning that for every row
  in the original query before the rewrite, we get
  the same row plus a single matched row from the subquery
* Expressions can be rewritten to refer to a slotref from
  the right side of the LEFT OUTER JOIN without affecting
  semantics. E.g. an IN subquery becomes <slot> IS NOT NULL
  or <operator> (<subquery>) becomes <operator> <slot>.
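
A hedged example of a now-supported query shape (hypothetical tables);
after the rewrite, the IN predicate becomes an IS NOT NULL check on a
slot from the LEFT OUTER JOIN's right side:

  select *
  from t
  where t.x > 10
     or t.id in (select id from u where u.y = t.y);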

This does not affect SELECT list subqueries, which are
rewritten using a different mechanism that can already
support some subqueries in disjuncts.

Correlated and uncorrelated subqueries are both supported, but
various limitations are present:
* Only one subquery per predicate is supported. The rewriting approach
  should generalize to multiple subqueries but other code needs
  refactoring to handle this case.
* EXISTS and NOT EXISTS subqueries are not supported. The rewriting
  approach can generalise to that, but we need to add or pick a
  select list item from the subquery to check for NULL/IS NOT NULL
  and a little more work is required to do that correctly.
* NOT IN is not supported because of the special NULL semantics.
* Subqueries with aggregates + GROUP BY are not supported because
  we rely on adding DISTINCT to the select list, and we don't
  support DISTINCT + aggregations because of IMPALA-5098.

Tests:
* Positive analysis tests for IN and binary predicate operators.
* Negative analysis tests for unsupported subquery operators.
* Negative analysis tests for multiple subqueries.
* Negative analysis tests for runtime scalar subqueries.
* Positive and negative analysis tests for aggregations in subquery.
* TPC-DS Query 45 planner and query tests
* Targeted planner tests for various supported queries.
* Targeted functional tests to confirm plans are executable and
  return correct result. These exercise a mix of the supported
  features - correlated/uncorrelated, aggregate functions,
  EXISTS/comparator, etc.
* Tests for BETWEEN predicate, which is supported as a side-effect
  of being rewritten during analysis.

Change-Id: I64588992901afd7cd885419a0b7f949b0b174976
Reviewed-on: http://gerrit.cloudera.org:8080/16152
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2020-07-13 16:02:27 +00:00
wzhou-code
55099517b0 IMPALA-9294: Support DATE for min-max runtime filter
Implemented the DATE min-max filter and applied it to Kudu like the
other min-max runtime filters.
Added new test cases for Date min-max filters.

Testing:
Passed all core tests.

Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c
Reviewed-on: http://gerrit.cloudera.org:8080/16103
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-08 22:59:57 +00:00
Adam Tamas
1bafb7bd29 IMPALA-9531: Dropped support for dateless timestamps
Removed the support for dateless timestamps.
During dateless timestamp casts, if the format doesn't contain a
date part, we get an error during tokenization of the format.
If the input string doesn't contain a date part, we get a NULL result.

Examples:
select cast('01:02:59' as timestamp);
This will come back as NULL value.

select to_timestamp('01:01:01', 'HH:mm:ss');
select cast('01:02:59' as timestamp format 'HH12:MI:SS');
select cast('12 AM' as timestamp FORMAT 'AM.HH12');
These will come back with parsing errors.

Casting from a table will generate similar results.

Testing:
Modified the previous tests related to dateless timestamps.
Added tests to read from tables that still contain dateless
timestamps, and covered the timestamp-to-string path when no date
tokens are requested in the output string.

Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Reviewed-on: http://gerrit.cloudera.org:8080/15866
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2020-07-08 19:32:15 +00:00
Gabor Kaszab
7e456dfa9d IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
These functions can be used to get cardinality estimates of data
using the HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.
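
A hedged usage sketch (table and column names are illustrative):

  -- Build and persist a sketch once:
  create table sketches as
    select ds_hll_sketch(l_orderkey) as sk from tpch.lineitem;
  -- Later, estimate the cardinality without rescanning the data:
  select ds_hll_estimate(sk) from sketches;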

Compared to ndv() these functions bring more flexibility: once we
feed data to the sketch it can be written to a table, and next time we
can skip scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as performance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand, if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive to the order of the data fed to the
sketch, and as a result running these algorithms in Impala yields
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time within
2% of the correct result; there are occasional spikes where the
difference is bigger, but it never goes beyond 5%.
Even though the DataSketches HLL algorithm could be parameterized,
the current implementation hard-codes these parameters and uses
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare performance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Reviewed-on: http://gerrit.cloudera.org:8080/16000
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2020-07-07 14:11:21 +00:00
Shant Hovsepian
2dca55695e IMPALA-9784, IMPALA-9905: Uncorrelated subqueries in HAVING.
Support rewriting subqueries in the HAVING clause by nesting the
aggregation query and pulling up the subquery predicates into the outer
WHERE clause.

Testing:
  * New analyzer tests
  * New functional subquery tests
  * Added Q23, Q24 and Q44 to the tpcds workload
  * Ran subquery rewrite tests

Change-Id: I124a58a09a1a47e1222a22d84b54fe7d07844461
Reviewed-on: http://gerrit.cloudera.org:8080/16052
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-05 22:03:42 +00:00
Shant Hovsepian
388ad555d7 IMPALA-8954: Uncorrelated scalar subqueries in the select list
Extend StmtRewriter with the ability to rewrite scalar subqueries in the
select list into cross joins. Currently the subquery must pass plan-time
checks to determine that it returns a single row, which may miss cases
that would be valid at runtime or with more complex evaluation of the
predicate expressions in the planner. Support for correlated subqueries
will be a follow-on change.
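
A hedged illustration (hypothetical tables); the scalar subquery
becomes a CROSS JOIN against a single-row inline view:

  select t.id, (select count(*) from u) as cnt from t;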

Testing:
  * Added new analyzer tests, updated previous subquery tests
  * test_queries.py::TestQueries::test_subquery
  * Added test_tpcds_q9 to e2e and planner tests

Change-Id: Ibcf55d26889aa01d69bb85f18c9241dda095fb66
Reviewed-on: http://gerrit.cloudera.org:8080/16007
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-05 22:03:42 +00:00
Zoltan Borok-Nagy
930264afbd IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
"Original files" are files that don't have full ACID schema. We can see
such files if we upgrade a non-ACID table to full ACID. Also, the LOAD
DATA statement can load non-ACID files into full ACID tables. So such
files don't store special ACID columns, that means we need
to auto-generate their values. These are (operation,
originalTransaction, bucket, rowid, and currentTransaction).

With the exception of 'rowid', all of them can be calculated based on
the file path, so I add their values to the scanner's template tuple.

'rowid' is the ordinal number of the row inside a bucket inside a
directory. For now Impala only allows one file per bucket per
directory. Therefore we can generate row ids for each file
independently.

Multiple files in a single bucket in a directory can only be present if
the table was non-transactional earlier and we upgraded it to full ACID
table. After the first compaction we should only see one original file
per bucket per directory.

In HdfsOrcScanner we calculate the first row id for our split then
the OrcStructReader fills the rowid slot with the proper values.

Testing:
 * added e2e tests to check if the generated values are correct
 * added e2e test to reject tables that have multiple files per bucket
 * added unit tests to the new auxiliary functions

Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953
Reviewed-on: http://gerrit.cloudera.org:8080/16001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-29 21:00:05 +00:00
wzhou-code
c7ce4fa109 IMPALA-9691: Support Kudu Timestamp and Date bloom filter
Impala saves timestamps as the 12-byte TimestampValue structure with
time in nanoseconds. Kudu stores timestamps as 8-byte Unix time in
microseconds. To avoid the data truncation issue in the bloom filter,
add a FunctionCallExpr with 'utc_to_unix_micros' as the root of the
bloom filter's source expression to convert timestamp values to
microseconds when building a timestamp bloom filter for Kudu.
Generated the functional date_tbl table in Kudu format for unit tests.
Added new test cases for Kudu Timestamp and Date bloom filters.

Testing:
Passed all core tests.

Change-Id: I3c1e9bcc9fd6d79a39f25eaa3396188fc0a52a48
Reviewed-on: http://gerrit.cloudera.org:8080/16094
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-26 06:56:16 +00:00
Daniel Becker
e6c930a38f IMPALA-9747: More fine-grained codegen for text file scanners
Currently if the materialization of any column cannot be codegen'd
because its type is unsupported (e.g. CHAR(N)), the whole codegen is
cancelled for the text scanner.

This commit adds the function TextConverter::SupportsCodegenWriteSlot
that returns whether the given ColumnType is supported. If the type is
not supported, HdfsScanner codegens code that calls the interpreted
version instead of failing codegen. For other columns codegen is used
as usual.

Benchmarks:
  Copied and modified a TPCH table with scale factor 5 to add a CHAR
  column to it:

    USE tpch5;
    CREATE TABLE IF NOT EXISTS lineitem_char AS
    SELECT *, CAST(l_shipdate AS CHAR(10)) l_shipdate_char
    FROM lineitem;

  Run the following query 100 times after one warm-up run with and
  without this change:

    SELECT *
    FROM tpch5.lineitem_char
    WHERE
      l_partkey BETWEEN 500 AND 500000 AND
      l_linestatus = 'F' AND
      l_quantity < 35 AND
      l_extendedprice BETWEEN 2000 AND 8000 AND
      l_discount > 0 AND
      l_tax BETWEEN 0.04 AND 0.06 AND
      l_returnflag IN ('A', 'N') AND
      l_shipdate_char < '1996-06-20'
    ORDER BY l_shipdate_char
    LIMIT 10;

  Without this commit: mean: 2.92, standard deviation: 0.13.
  With this commit:    mean: 2.21, standard deviation: 0.072.

Testing:
  The interesting cases regarding char are covered in
  0167c5b424/testdata/workloads/functional-query/queries/QueryTest/chars.test

Change-Id: Id370193af578ecf23ed3c6bfcc65fec448156fa3
Reviewed-on: http://gerrit.cloudera.org:8080/16059
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-24 16:51:49 +00:00
norbert.luksa
71a64591e3 IMPALA-8755: Unlock Z-ordering by default
Z-ordering has been around for a while behind a feature flag
(unlock_zorder_sort). It's about time to turn this flag on by default.
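
For reference, a hedged reminder of how Z-ordering is requested at
table creation time (illustrative table):

  create table t (a int, b int, c string)
  sort by zorder (a, b)
  stored as parquet;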

Besides setting the flag to true, this commit merges the Z-order tests
from custom cluster tests into the normal test files.

Tests:
  - Run all related tests.

Change-Id: I653e0e2db8f7bc2dd077943b3acf667514d45811
Reviewed-on: http://gerrit.cloudera.org:8080/16003
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-24 03:47:18 +00:00
Martin Zink
d82cefc6a4 IMPALA-452: Add support for string concatenation operator using ||
Separated "||" and "OR" into different tokens.
OR (KW_OR) remains the same. (it creates CompoundPredicate
and expects two BOOLEAN operands)
|| (KW_LOGICAL_OR) creates CompoundVerticalBarExpr
which expects two BOOLEAN operands or two STRING operands

CompoundVerticalBarExpr creates either a CompoundPredicate
or a FunctionCallExpr member variable based on the type
of the left operand during analyze.

Similarly to BetweenPredicate it cannot be executed directly
so its needs to be replaced by its member variable
by ExtractCompoundVerticalBarExprRule.
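
A hedged illustration of the two behaviors:

  select 'foo' || 'bar'; -- STRING operands: concatenation, returns 'foobar'
  select true || false;  -- BOOLEAN operands: logical OR, returns true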

Change-Id: Ie3f990d56ecb1e18d1b2737e8c5eab0d524edfaf
Reviewed-on: http://gerrit.cloudera.org:8080/15877
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-06-22 21:36:09 +00:00
skyyws
8fcad905a1 IMPALA-9688: Support create iceberg table by impala
This patch implements creating Iceberg tables through Impala.
We can use the following SQL to create a new Iceberg table:
    create table iceberg_test(
        level string,
        event_time timestamp,
        message string,
        register_time date,
        telephone array <string>
    )
    partition by spec(
        level identity,
        event_time identity,
        event_time hour,
        register_time day
    )
    stored as iceberg;
'identity' is one of Iceberg's partition transforms. 'identity' means
that the source data values are used to create partitions; other
partition transforms, such as BUCKET/TRUNCATE, would be supported in the
future. We can also use 'show create table iceberg_test' to display the
table schema, and 'show partitions iceberg_test' to display partition
column info. Note that a partition column must be a source column.

Testing:
- Add test cases in metadata/test_show_create_table.py.
- Add custom cluster test test_iceberg.py.

Change-Id: I8d85db4c904a8c758c4cfb4f19cfbdab7e6ea284
Reviewed-on: http://gerrit.cloudera.org:8080/15797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-18 21:56:32 +00:00
Tim Armstrong
2c76ff5e6e IMPALA-2515: support parquet decimal with extra padding
This adds support for reading Parquet files where the
DECIMAL is encoded as a FIXED_LEN_BYTE_ARRAY field
with extra padding. This requires loosening file
validation and fixing up the decoding so that it
no longer assumes that the in-memory value is at
least as large as the encoded representation.

The decimal decoding logic was reworked so that we
could add the extra condition handling without regressing
performance of the decoding logic in the common case. In
the end I was able to significantly speed up the decoding
logic. The bottleneck, revealed by perf record while running
the below benchmark, was CPU stalls on the bit-shift instruction used
for sign extension, which waited on loading the result of
ByteSwap(). I worked around this by doing the sign extension
before the ByteSwap().

Perf:
Ran a microbenchmark to check that scanning perf didn't regress
as a result of the change. The query scans a DECIMAL column
that is mostly plain-encoded, so to maximally stress the
FIXED_LEN_BYTE_ARRAY decoding performance.

  set mt_dop=1;
  set num_nodes=1;
  select min(l_extendedprice)
  from tpch_parquet.lineitem

The SCAN time in the summary averaged out to 94ms before
the change and is reduced to 74ms after the change. The
actual speedup of the DECIMAL decoding is greater - it
went from ~20% of time to ~6% of time as measured by
perf.

Testing:
Added a couple of parquet files that were generated with a
hacked version of Impala to have extra padding. Sanity-checked
that hacked tables returned the same results on Hive. The
tests failed before this code change.

Ran exhaustive tests with the hacked version of Impala
(so that all decimal tables got extra padding).

Change-Id: I2700652eab8ba7f23ffa75800a1712d310d4e1ec
Reviewed-on: http://gerrit.cloudera.org:8080/16090
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-18 21:00:27 +00:00
Joe McDonnell
f15a311065 IMPALA-9709: Remove Impala-lzo from the development environment
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.

This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.

The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.

Testing:
 - Dryrun of GVO
 - Modified TestPartitionMetadataUncompressedTextOnly's
   test_unsupported_text_compression() to add LZO case

Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2020-06-15 23:42:12 +00:00
Bikramjeet Vig
f9cb0a65fe IMPALA-9077: Remove scalable admission control configs
Removed the 3 scalable configs added in IMPALA-8536:
- Max Memory Multiple
- Max Running Queries Multiple
- Max Queued Queries Multiple

This patch removes the functionality related to those configs but
retains the additional test coverage and cleanup added in
IMPALA-8536. This removal is to make it easier to enhance
Admission Control using Executor Groups which has turned out to
be a useful building block.

Testing:
Ran core tests.

Change-Id: Ib9bd63f03758a6c4eebb99c64ee67e60cb56b5ac
Reviewed-on: http://gerrit.cloudera.org:8080/16039
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-10 04:02:16 +00:00
Joe McDonnell
9125de7ae3 IMPALA-9318: Add admission control setting to cap MT_DOP
This introduces the max-mt-dop setting for admission
control. If a statement runs with an MT_DOP setting that
exceeds the max-mt-dop, then the MT_DOP setting is
downgraded to the max-mt-dop value. If max-mt-dop is set
to a negative value, no limit is applied. max-mt-dop is
set via the llama-site.xml and can be set at the daemon
level or at the resource pool level. When there is no
max-mt-dop setting, it defaults to -1, so no limit is
applied. The max-mt-dop is evaluated once prior to query
planning. The MT_DOP settings for queries past planning
are not reevaluated if the policy changes.

If a statement is downgraded, its runtime profile contains
a message explaining the downgrade:
MT_DOP limited by admission control: Requested MT_DOP=9 reduced to MT_DOP=4.

Testing:
 - Added custom cluster test with various max-mt-dop settings
 - Ran core tests

Change-Id: I3affb127a5dca517591323f2b1c880aa4b38badd
Reviewed-on: http://gerrit.cloudera.org:8080/16020
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-09 16:26:23 +00:00
xiaomeng
d45e3a50b0 IMPALA-9673: Add external warehouse dir variable in E2E test
Updated CDP build to 7.2.1.0-57 to include new Hive features such as
HIVE-22995.
In the minicluster, hive.create.as.acid and hive.create.as.insert.only
default to false, so by default Hive creates external-type tables
located in the external warehouse directory.
Due to HIVE-22995, desc db returns the external warehouse directory.

For the above reasons, we need to use the external warehouse dir in
some tests. Also added a new test for "CREATE DATABASE ... LOCATION".

Tested:
Re-run failed test in minicluster.
Run exhaustive tests.

Change-Id: I57926babf4caebfd365e6be65a399f12ea68687f
Reviewed-on: http://gerrit.cloudera.org:8080/15990
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-05 23:48:53 +00:00
wzhou-code
c62a6808fc IMPALA-3741 [part 2]: Push runtime bloom filter to Kudu
Defined the BloomFilter class as a wrapper of kudu::BlockBloomFilter.
impala::BloomFilter builds runtime bloom filters on the
kudu::BlockBloomFilter APIs with FastHash as the default hash algorithm.
Removed the duplicated functions from the impala::BloomFilter class.
Pushed down bloom filters to Kudu through the Kudu client API.

Added a new query option ENABLED_RUNTIME_FILTER_TYPES to set the
enabled runtime filter types, which only affects the Kudu scan node
for now. By default, the bloom filter is not enabled; only the min-max
filter will be enabled for Kudu. With this option, the user can enable
bloom filters, min-max filters, or both.
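
A hedged usage sketch (the exact set of accepted values follows the
option's definition in this patch; the spellings below are assumptions):

  set ENABLED_RUNTIME_FILTER_TYPES=BLOOM;   -- bloom filters only
  set ENABLED_RUNTIME_FILTER_TYPES=ALL;     -- bloom and min-max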

Added new test cases in PlannerTest and the end-to-end runtime_filters
test for pushing down bloom filters to Kudu.
Added test cases to compare the number of rows returned from the Kudu
scan when applying different types of runtime filter on the same
queries.
Updated bloom-filter-benchmark due to the bloom-filter implementation
change.

Bump Kudu version to d652cab17.

Testing:
 - Passed all exhaustive tests.

Performance benchmark:
 - Ran single_node_perf_run.py on TPC-H at scale 30 for Parquet
   and Kudu. Verified that the new hash function and bloom-filter
   implementation don't cause regressions for HDFS bloom filters.
   For Kudu, there is one regression for query TPCH-Q9 and there
   are improvements for about 8 queries when applying both bloom and
   min-max filters. The bloom filter reduces the number of rows
   returned from the Kudu scan, hence reduces the cost of aggregation
   and hash join. But bloom filter evaluation adds extra cost to the
   Kudu scan, which offsets the gain on aggregation and join. The
   Kudu scan needs to be optimized for bloom filters in follow-up
   tasks.
 - Ran bloom-filter microbenchmarks and verified that there is no
   regression for the Insert/Find/Union functions with or without AVX2
   due to the bloom-filter implementation changes. There is a small
   performance degradation for the Init function, but this function is
   not in the hot path.

Change-Id: I9100076f68ea299ddb6ec8bc027cac7a47f5d754
Reviewed-on: http://gerrit.cloudera.org:8080/15683
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-05 17:43:32 +00:00
Tim Armstrong
6a1c448cf7 IMPALA-9782: fix Kudu DML with mt_dop
In general, ScalarExpr subclasses should not have
mutable state, i.e. state that is modified during
query execution, and definitely need to be thread-safe.
KuduPartitionExpr violated that, since KuduPartitioner
and KuduPartialRow are both mutated as part of expr
evaluation.

This patch moves those objects into thread-local state
for KuduPartitionExpr.

Testing:
* Add a regression test that does a large insert. Before this
  patch it reliably crashed Impala with mt_dop=4.
* Run more Kudu DML with mt_dop to improve test coverage.
* Make test_kudu.py use the correct test dimension (fixing a
  cosmetic issue).

Perf:
This changes adds some indirection to the expression evaluation,
so I did some manual benchmarking with this query, which
should somewhat stress the partitioning:

set mt_dop=1;
insert into orders_key_only
select o_orderkey from tpch_kudu.orders where o_orderkey % 10 = 0

The timing was in the same range before and after - between 6 and
8 seconds - but the results were very unstable so inconclusive.
The Kudu tservers were using an order-of-magnitude more CPU than
the impalads, so it seems safe to conclude that these partitioning
exprs are *not* a bottleneck for DML workloads.

Perf record showed impala::KuduPartitionExpr::GetIntValInterpreted()
taking up 0.09% of the impalad CPU, providing additional evidence that
this doesn't make a difference.

Change-Id: Ie7e86894da9dbcb3b69f7db236c841ecc08dbf1a
Reviewed-on: http://gerrit.cloudera.org:8080/16029
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-05 00:40:33 +00:00
Csaba Ringhofer
fb28011b57 IMPALA-9707: fix Parquet stat filtering when min/max values are cast to NULL
The min/max stat predicate is allowed when the left side is not a slot
but an implicit cast of a slot. This could lead to incorrectly dropping
a row group or page when the min/max values were not castable to the
type, e.g. the value is a string with a pre-1400 date and we want to
cast it to a timestamp.
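
A hedged illustration of the failing cast (Impala's TIMESTAMP range
starts at year 1400):

  select cast('1399-01-01' as timestamp);  -- not castable, returns NULL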

The change should only affect timestamps, as dates return an error
on failed cast from a string, and numeric types won't be cast
implicitly from string.

The fix is simply to accept NULL result for the min/max predicate in
the backend. Note that the alternative solution of casting the right
(const) side of the predicate instead of the left side would be tricky,
as more than one string can mean the same timestamp, e.g.
"1970-01-01" and "1970-01-01 00:00:00".

Testing:
- added an EE regression test and ran it

Change-Id: I35f66e1dfc4523624c249073004f9d5eddd07bb6
Reviewed-on: http://gerrit.cloudera.org:8080/15959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-21 18:40:20 +00:00
Tim Armstrong
e5777f0eb8 IMPALA-8834: Short-circuit partition key scan
This adds a new version of the pre-existing partition
key scan optimization that always returns correct
results, even when files have zero rows. This new
version is always enabled by default. The old
existing optimization, which does a metadata-only
query, is still enabled behind the
OPTIMIZE_PARTITION_KEY_SCANS query option.

The new version of the optimization must scan the files
to see if they are non-empty. Instead of using metadata
only, the planner instructs the backend to short-circuit HDFS
scans after a single row has been returned from each
file. This gives results equivalent to returning all
the rows from each file, because all rows in the file
belong to the same partition and therefore have identical
values for any columns that are partition key values.
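
A hedged example of a query shape that benefits (functional.alltypes
is partitioned by year and month):

  select distinct year, month from functional.alltypes;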

Planner cardinality estimates are adjusted accordingly
to enable potentially better plans and other optimisations
like disabling codegen.

We make some effort to avoid generating extra scan ranges
for remote scans by only generating one range per remote
file.

The backend optimisation is implemented by constructing a
row batch with capacity for a single row only and then
terminating each scan range once a single row has been
produced.  Both Parquet and ORC have optimized code paths
for zero slot table scans that mean this will only result
in a footer read. (Other file formats still need to read
some portion of the file, but can terminate early once
one row has been produced.)

This should be quite efficient in practice with file handle
caching and data caching enabled, because it then only
requires reading the footer from the cache for each file.

The partition key scan optimization is also slightly
generalised to apply to scans of unpartitioned tables
where no slots are materialized.

A limitation of the optimization where it did not apply
to multiple grouping classes was also fixed.

Limitations:
* This still scans every file in the partition. I.e. there is
  no short-circuiting if a row has already been found in the
  partition by the current scan node.
* Resource reservations and estimates for the scan node do
  not all take into account this optimisation, so are
  conservative - they assume the whole file is scanned.

Testing:
* Added end-to-end tests that execute the query on all
  HDFS file formats and verify that the correct number of rows
  flow through the plan.
* Added planner test based on the existing test partition key
  scan test.
* Added test to make sure single node optimisation kicks in
  when expected.
* Add test for cardinality estimates with and without stats
* Added test for unpartitioned tables.
* Added planner test that checks that optimisation is enabled
  for multiple aggregation classes.
* Added a targeted perf test.

Change-Id: I26c87525a4f75ffeb654267b89948653b2e1ff8c
Reviewed-on: http://gerrit.cloudera.org:8080/13993
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 23:03:23 +00:00
Zoltan Borok-Nagy
f8015ff68d IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list
Minor compactions can compact several delta directories into a single
delta directory. The current directory filtering algorithm had to be
modified to handle minor compacted directories and prefer those over
plain delta directories. This happens in the Frontend, mostly in
AcidUtils.java.

Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.

E.g. we can have the following delta directory:

full_acid/delta_0000001_0000010/0000 # minWriteId: 1
                                     # maxWriteId: 10

This delta dir contains rows with write ids between 1 and 10. But maybe
we are only allowed to see write ids less than 5. Therefore we need to
check the ACID write id column (named originalTransaction) to determine
which rows are valid.

Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:

full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
                                       # every row is valid

If there's no visibilityTxnId then it was created via Hive Streaming,
therefore we need to validate rows. Fortunately Hive Streaming writes
rows with different write ids into different ORC stripes, therefore we
don't need to validate the write id per row. If we had statistics,
we could validate per stripe, but since Hive Streaming doesn't write
statistics we validate the write id per ORC row batch (an alternative
could be to do a 2-pass read, first we'd read a single value from each
stripe's 'currentTransaction' field, then we'd read the stripe if the
write id is valid).

Testing
 * the frontend logic is tested in AcidUtilsTest
 * the backend row validation is tested in test_acid_row_validation

Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 21:00:44 +00:00
Joe McDonnell
3e76da9f51 IMPALA-9708: Remove Sentry support
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.

Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
   "ranger" requires extra parameters to be set. This changes the
   default value of authorization_provider to "", which translates
   internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
 - authorization_policy_provider_class
 - sentry_catalog_polling_frequency_s
 - sentry_config
3. The authorization_factory_class may be obsolete now that
   there is only one authorization policy, but this leaves it
   in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
   that is removed. There are still Maven dependencies coming
   from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
   is not removed and it is still called from testdata/bin/kill-all.sh.

Testing:
 - Core job passes

Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 17:43:40 +00:00
Csaba Ringhofer
7b8d2e1f78 IMPALA-9753: Fix TRUNCATE of ACID tables on S3
The use of HDFS API was incorrect when creating an empty file
in the new base dir during truncate. Simply calling Create(Path)
does create the file in HDFS, but it is only created on S3 when
the returned stream is closed.

Testing:
- Acid truncate tests are not running on S3 as they need a running
  Hive server. Added a regression test that will run on S3 too. It
  would be nice to run all tests on S3, but this is out of the scope
  of this change.

Change-Id: I96d315638b669c5c7198a8e47939cb2b236e35bb
Reviewed-on: http://gerrit.cloudera.org:8080/15940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-18 16:51:42 +00:00
Zoltan Borok-Nagy
f2f4c9891a IMPALA-9091: Fix flaky query_test.test_scanners.TestScannerReservation.test_scanners
query_test.test_scanners.TestScannerReservation.test_scanners was flaky.
It checks the average value of ParquetRowGroupIdealReservation which is
almost always 3.50 MB. However, very rarely it's a bit different than
that, e.g. 3.88 MB, or 4.12 MB, and so on. I wasn't able to reproduce
this problem; it is probably due to some randomness during data loading.

I modified the test to accept any value between 3.0 and 4.99. I think
values in this range are acceptable for this test.

Change-Id: I668d0ccd77a62059284e76fee51efb08bef580eb
Reviewed-on: http://gerrit.cloudera.org:8080/15923
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-15 21:46:38 +00:00
Tim Armstrong
fcf08d1822 IMPALA-9725: incorrect spilling join results for wide keys
The control flow was broken if the join operator hit
the end of the expression values cache before the end
of the probe batch, immediately after processing a row
for a spilled partition. In NextProbeRow(), the problematic
code path was:
* The last row in the expression values cache was for a
  spilled partition, so skip_row=true and it falls out
  of the loop with 'current_probe_row_' pointing to that
  row.
* probe_batch_iterator->AtEnd() is false, because
  the expression value cache is smaller than the probe batch,
  so 'current_probe_row_' is not nulled out.

Thus we end up in a state where 'current_probe_row_' is
set, but 'hash_table_iterator_' is unset.

In the case of a left anti join, this was interpreted by
ProcessProbeRowLeftSemiJoins() as meaning that there was
no hash table match for 'current_probe_row_', and it
therefore returned the row.

This bug could only occur under specific circumstances:
* The join key takes up > 256 bytes in the expression values
  cache (assuming the default batch size of 1024).
* The join spilled.
* The join operator returns rows that were unmatched in
  the right input, i.e. LEFT OUTER JOIN, LEFT ANTI JOIN,
  FULL OUTER JOIN.

The core of the fix is to null out 'current_probe_row_' when
falling out the bottom of the loop in NextProbeRow(). Related
DCHECKS were fixed and some control flow was slightly
simplified.

Testing:
Added a test query on TPC-H that reproduces the problem reliably.

Ran exhaustive tests.

Change-Id: I9d7e5871c35a90e8cf24b8dded04775ee1eae9d8
Reviewed-on: http://gerrit.cloudera.org:8080/15904
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-15 16:49:32 +00:00
Tim Armstrong
506a303c6d IMPALA-6984: coordinator cancels backends on EOS
Before this patch, when the coordinator returned the last row,
it waited for backends to finish of their own accord, which
could happen indirectly as exchanges got closed.

The idea of this change is to send out cancellation RPCs to
expedite cancellation, then wait for the final exec status
reports to come in. Those reports will be included in the
final profile because the backend is *not* marked as done
when sending out the cancellation RPCs.

The bulk of this change is modifying the cancellation code
path to allow sending the cancel RPCs but *not* consider
the backend done until it gets back the final status report.
The old "fire and forget" mode of cancellation is still used
for explicit cancellation and errors.

Testing:
Ran exhaustive tests.

Ran cancellation tests under TSAN, checked for errors.

Manually inspected logs of some queries with limit,
saw that it sent cancellation then waited for backends
as expected.

Added a functional perf test that goes from ~5s down to < ~1s
on my system.

Change-Id: I966eceaafdc18a019708b780aee4ee9d70fd3a47
Reviewed-on: http://gerrit.cloudera.org:8080/15840
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-14 02:52:16 +00:00
Akos Kovacs
5e72ca546e IMPALA-7833: Audit and fix string builtins for long string handling
Some string built-in functions could crash impalad
if the result was longer than the 1 GB max size.
Added some overflow checks.
Overflow error messages were modified not to hard-code the max size.

Testing:
* Added some backend tests to cover overflow check
* Ran core tests

Change-Id: I93a53845f04e61ff446b363c78db1e49cbd5dc49
Reviewed-on: http://gerrit.cloudera.org:8080/15864
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-13 22:01:34 +00:00
Chang Wu
a93f2c2675 IMPALA-8205: Support number of true and false statistics for boolean column
This change computes the real number-of-true and number-of-false
statistics for boolean columns. Before this, Impala used to set
numTrues and numFalses to a hardcoded -1 to indicate that the
statistics are missing.

Test Done:
Appended the numTrues and numFalses checks to all the statistics-related
test cases, including the non-incremental, incremental and other test
cases.

Change-Id: I991bee8e7fdc644d908289f5fe2ee8032cc2c431
Reviewed-on: http://gerrit.cloudera.org:8080/14666
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-12 23:29:04 +00:00
Adam Tamas
7295edcc26 IMPALA-9680: Fixed compressed inserts failing
Modified the insert test files to determine dynamically which
database they need to use for 'CREATE TABLE LIKE'.

Tests:
Did targeted exhaustive testruns in test_insert.py and
test_mt_dop.py and did a full exhaustive testrun.

Change-Id: Ib3c7ba02190f57a7ed40311c95a3dd9eca9b474d
Reviewed-on: http://gerrit.cloudera.org:8080/15816
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2020-05-11 19:32:08 +00:00
Tim Armstrong
b2d9901fb8 IMPALA-9176: shared null-aware anti-join build
This switches null-aware anti-join (NAAJ) to use shared
join builds with mt_dop > 0. To support this, we
make all access to the join build data structures
from the probe read-only. NAAJ requires iterating
over rows from build partitions at various steps
in the algorithm and before this patch this was not
thread-safe. We avoided that problem by having a
separate builder for each join node and duplicating
the data.

The main challenge was iteration over
null_aware_partition()->build_rows() from the probe
side, because it uses an embedded iterator in the
stream so was not thread-safe (since each thread
would be trying to use the same iterator).

The solution is to extend BufferedTupleStream to
allow multiple read iterators into a pinned,
read-only, stream. Each probe thread can then
iterate over the stream independently with no
thread safety issues.

With BufferedTupleStream changes, I partially abstracted
ReadIterator more from the rest of BufferedTupleStream,
but decided not to completely refactor so that this patchset
didn't cause excessive churn. I.e. much BufferedTupleStream
code still accesses internal fields of ReadIterator.

Fix a pre-existing bug in grouping-aggregator where
Spill() hit a DCHECK because the hash table was
destroyed unnecessarily when it hit an OOM. This was
flushed out by the parameter change in test_spilling.

Testing:
Add test to buffered-tuple-stream-test for multiple readers
to BTS.

Tweaked test_spilling_naaj_no_deny_reservation to have
a smaller minimum reservation, required to keep the
test passing with the new, lower, memory requirement.

Updated a TPC-H planner test where resource requirements
slightly decreased for the NAAJ.

Ran the naaj tests in test_spilling.py with TSAN enabled,
confirmed no data races.

Ran exhaustive tests, which passed after fixing IMPALA-9611.

Ran core tests with ASAN.

Ran backend tests with TSAN.

Perf:
I ran this query that exercises EvaluateNullProbe() heavily.

  select l_orderkey, l_partkey, l_suppkey, l_linenumber
  from tpch30_parquet.lineitem
  where l_suppkey = 4162 and l_shipmode = 'AIR'
        and l_returnflag = 'A' and l_shipdate > '1993-01-01'
        and if(l_orderkey > 5500000, NULL, l_orderkey) not in (
          select if(o_orderkey % 2 = 0, NULL, o_orderkey + 1)
          from orders
          where l_orderkey = o_orderkey)
  order by 1,2,3,4;

It went from ~13s to ~11s running on a single impalad with
this change, because of the inlining of CreateOutputRow() and
EvalConjuncts().

I also ran TPC-H SF 30 on Parquet with mt_dop=4, and there was
no change in performance.

Change-Id: I95ead761430b0aa59a4fb2e7848e47d1bf73c1c9
Reviewed-on: http://gerrit.cloudera.org:8080/15612
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-24 20:56:58 +00:00
Tim Armstrong
dc410a2cf4 IMPALA-9596: deflake test_tpch_mem_limit_single_node
This changes the test to use a debug action instead of
trying to hit the memory limit in the right spot, which
has tended to be flaky. This still exercises the error
handling code in the scanner, which was the original
point of the test (see IMPALA-2376).

This revealed an actual bug in the ORC scanner, where
it was not returning the error directly from
AssembleCollection(). Before I fixed that, the scanner
got stuck in an infinite loop when running the test.

Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21
Reviewed-on: http://gerrit.cloudera.org:8080/15700
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2020-04-16 15:45:49 +00:00
Tim Armstrong
76e4a17fb3 IMPALA-9643: fix runtime filter race for mt_dop
This patch avoids the race with registration of a
consumer filter by registering all filters upfront
when the filter bank is constructed. Then registration
of producers and consumers hands out references to the
pre-constructed filters.

A nice bonus of this change is that RegisterConsumer()
and RegisterProducer() don't mutate anything and
we can avoid lock acquisitions.

Also adds test infrastructure and fixes TestRuntimeRowFilters to
work with mt_dop=4 (it was accidentally not enabled before). That
mostly involved modifying the tests to use aggregates of counters
instead of picking out lines with regexes.

Testing:
Added a regression test that reliably failed before this
fix. This relies on extending debug actions to allow longer
delays, plus a minor extension to the RUNTIME_PROFILE .test
file parser to handle spaces in counter names.

Ran exhaustive tests.

Change-Id: I194c0d2515b6a0e5474e1c0c8647f0e54dc94397
Reviewed-on: http://gerrit.cloudera.org:8080/15715
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-14 15:42:13 +00:00
Adam Tamas
c32849a391 IMPALA-8980: Remove functional*.alltypesinsert from EE tests
-Modified 'test_insert.py' so the tests can run in parallel.
  -Every test will create its own temporary tables for insert testing.
-Swapped out the SETUP tags for TRUNCATE TABLE query statements.
  -Because the SETUP tag is not used anymore, the corresponding
  code was removed.
-A test query in 'insert.test' was incorrect, so it was modified
to test for the right behavior.

Testing:
-tests/run-tests.py query_test/test_insert.py
-impala-py.test tests/query_test/test_insert.py
-the same for test_insert_permutation.py and test_load.py

Change-Id: I257e936868917a2fcc6c030f6c855b247e8a0eea
Reviewed-on: http://gerrit.cloudera.org:8080/15529
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-14 12:18:21 +00:00
Zoltan Borok-Nagy
b770d2d378 Put transactional tables into 'managed' directory
HIVE-22794 disallows ACID tables outside of the 'managed' warehouse
directory. This change updates data loading to make it conform to
the new rules.

The following tests had to be modified to use the new paths:
* AnalyzeDDLTest.TestCreateTableLikeFileOrc()
* create-table-like-file-orc.test

Change-Id: Id3b65f56bf7f225b1d29aa397f987fdd7eb7176c
Reviewed-on: http://gerrit.cloudera.org:8080/15708
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-11 00:36:56 +00:00
stiga-huang
5c2cae89f2 IMPALA-9529: Fix multi-tuple predicates not assigned in column masking
Column masking is implemented by replacing the masked table with a table
masking view which has masked expressions in its SelectList. However,
nested columns can't be exposed in the SelectList, so we expose them
in the output field of the view in IMPALA-9330. As a result, predicates
that reference both primitive and nested columns of the masked table
become multi-tuple predicates (referencing tuples of the view and the
masked table). Such predicates are not assigned since they are no
longer bound to the view's tuple or the masked table's tuple.

We need to pick up the masked table's tuple id when getting unassigned
predicates for the table masking view. Also need to do this for
assigning predicates to the JoinNode which is the only place that
introduces multi-tuple predicates.

Tests:
 - Add tests with multi-tuple predicates referencing nested columns.
 - Run CORE tests.

Change-Id: I12f1b59733db5a88324bb0c16085f565edc306b3
Reviewed-on: http://gerrit.cloudera.org:8080/15654
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-07 15:10:27 +00:00
Zoltan Borok-Nagy
8aa0652871 IMPALA-9484: Full ACID Milestone 1: properly scan files that have full ACID schema
Full ACID row format looks like this:

{
  "operation": 0,
  "originalTransaction": 1,
  "bucket": 536870912,
  "rowId": 0,
  "currentTransaction": 1,
  "row": {"i": 1}
}

User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema but that would introduce several
complexities and corner cases in column resolution. Also in query
results the heading of the above user column would be "row.i". Star
expansion should also be modified, etc.

Because of that in the Catalog I create the exact opposite of the above
schema:

{
  "row__id":
  {
    "operation": 0,
    "originalTransaction": 1,
    "bucket": 536870912,
    "rowId": 0,
    "currentTransaction": 1
  },
  "i": 1
}

This way very little modification is needed in the frontend. And the
hidden columns can be easily retrieved via 'SELECT row__id.*' when we
need those for debugging/testing.

We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.

Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.

Newly added Tests:
 * specific queries about hidden columns (full-acid-rowid.test)
 * SHOW CREATE TABLE (show-create-table-full-acid.test)
 * DESCRIBE [FORMATTED] TABLE (describe-path.test)
 * INSERT should be forbidden (acid-negative.test)
 * added tests for column masking (
   ranger_column_masking_complex_types.test)

Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-02 12:01:41 +00:00
Attila Bukor
2576952655 IMPALA-5092 Add support for VARCHAR in Kudu tables
KUDU-1938 added VARCHAR column type support to Kudu.
This commit adds support for Kudu's VARCHAR type to Impala.

The length of a Kudu VARCHAR is applied as a character length, as
opposed to the byte length that Impala currently uses.

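For illustration, a Kudu table with a VARCHAR column could be
declared like this (hypothetical table and column names):

  CREATE TABLE varchar_demo (
    id BIGINT,
    v VARCHAR(10),  -- the limit of 10 is a character count in Kudu
    PRIMARY KEY (id)
  )
  PARTITION BY HASH (id) PARTITIONS 3
  STORED AS KUDU;
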
When writing data to Kudu, the VARCHAR length is not an issue because
Impala officially supports only ASCII characters, which are the same
size in bytes and characters. Additionally, extra bytes would be
truncated by the Kudu client if a value somehow was too long.

When reading data from Kudu, it is possible that a value written by
some other application is wider in bytes than Impala expects and can
handle. This can happen due to multi-byte UTF-8 characters. In that
case, we adjust the length in Impala to truncate the extra bytes of the
value. This isn't a great solution, but it is one that other
integrations have taken as well, given that Impala doesn't support
UTF-8 values.

IMPALA-5675 tracks adding UTF-8 character length support to VARCHAR
columns; the truncation code is marked with a TODO that references
that Jira.

Testing:
* Performed manual testing of standard DDL and DML interaction
* Manually reproduced a check failure due to multi-byte characters
  and tested that length truncation resolves that issue.
* Added/adjusted the following automated tests:
** AnalyzeDDLTest: CTAS into Kudu with varchar type
** AnalyzeKuduDDLTest: CREATE TABLE in Kudu with VARCHAR type
** kudu_create.test: Create table with VARCHAR column, key, hash
   partition, and range partition
** kudu_describe.test: Describe table with VARCHAR column and key
** kudu_insert.test: Insert with VARCHAR columns including null and
   non-null defaults
** kudu_update.test: Updates with VARCHAR column
** kudu_upsert.test: Upserts with VARCHAR column
** kudu_delete.test: Deletes with VARCHAR columns
** kudu-scan-node.test: Tests basic predicates with VARCHAR columns

Follow on work:
- IMPALA-9580: Add min-max runtime filter support/tests
- IMPALA-9581: Pushdown string predicates
- IMPALA-9583: Automated multibyte truncation tests

Change-Id: I0d4959410fdd882bfa980cb55e8a7837c7823da8
Reviewed-on: http://gerrit.cloudera.org:8080/14197
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
2020-04-01 15:48:36 +00:00
Csaba Ringhofer
a08cd7f49b IMPALA-9584: remove flaky avg(TIMESTAMP) aggregates from test_analytic_fns
AVG(TIMESTAMP) is not deterministic because it uses a double to sum
the timestamps, and adding doubles in a different order can lead to
different results. This does not cause problems for DOUBLE columns,
because the test framework does not require an exact match if the
result is a double. As AVG is the only TIMESTAMP function with this
problem, reducing the precision of all timestamp checks seemed like
overkill.

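The order dependence is ordinary floating-point behaviour; a minimal
sketch of the same effect using plain DOUBLE values:

  -- left-to-right summation vs. regrouped summation of the same values
  SELECT cast(0.1 AS DOUBLE) + cast(0.2 AS DOUBLE) + cast(0.3 AS DOUBLE),   -- 0.6000000000000001
         cast(0.1 AS DOUBLE) + (cast(0.2 AS DOUBLE) + cast(0.3 AS DOUBLE)); -- 0.6
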
As a short-term solution I removed the problematic aggregates from
the tests.

Testing:
- ran only the related tests

Change-Id: I10e0027a64a4e430b7db3ed7c8d0cc8cdcb202e0
Reviewed-on: http://gerrit.cloudera.org:8080/15621
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 22:57:00 +00:00
Tim Armstrong
ab7e209d1b IMPALA-9099: allow mt_dop for joins without feature flag
This allows running *any* read-only query with mt_dop > 0.
Before this patch, no joins were allowed with mt_dop > 0.

Previous patches, particularly IMPALA-9156, added significantly
more code coverage for multithreading + joins, so it should be safe
to allow enabling this on a query-by-query basis. Many improvements
are still planned (see IMPALA-3902), so the behaviour and performance
characteristics of mt_dop > 0 with more complex plans and joins will
continue to change.

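For example (hypothetical tables), a join query can now opt in per
query:

  -- previously, join queries were rejected with mt_dop > 0 unless a
  -- feature flag was set
  SET MT_DOP=4;
  SELECT count(*) FROM orders o JOIN customers c ON o.cust_id = c.id;
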
Testing:
Updated the mt_dop validation tests and removed a redundant planner
test that didn't provide much additional coverage of the validation
support.

Ran exhaustive tests.

Change-Id: I9c6566abb239db0e775f2beaa25a62c36313cd6f
Reviewed-on: http://gerrit.cloudera.org:8080/15545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 20:45:59 +00:00
Attila Jeges
1cfc31c84f IMPALA-9555 part 2: [Hive3] Fix test failure introduced by HIVE-22589
This patch is a continuation of IMPALA-9555. It makes Avro DATE
tests more resilient by using regexes for expected error messages
instead of hard-coded error messages.

Change-Id: I36340be70a37b75997cf49625a173ec2690ed9b8
Reviewed-on: http://gerrit.cloudera.org:8080/15618
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 19:11:20 +00:00
Csaba Ringhofer
e8f604a213 IMPALA-9572: Fix DCHECK in nested Parquet scanning
The issue occurred when there were skipped pages and a column inside
a collection was scanned, but its position was not needed. The
repetition level still needs to be read in this case: the skipped
ranges are set in terms of top-level rows, so collection items need
to know which top-level row they belong to.

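A query of roughly this shape (hypothetical table 'tbl' with an array
column 'arr') can hit this path: pages may be skipped based on the
predicate on 't.id', while 'a.item' is read without needing its
position:

  SELECT a.item
  FROM tbl t, t.arr a
  WHERE t.id BETWEEN 100 AND 200;
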
A DCHECK in StrideWriter's constructor was hit; otherwise the code
ran correctly in release mode. The DCHECK is moved to the functions
where the condition would actually cause problems.

Testing:
- added and ran a regression test

Change-Id: I5e8ef514ead71f732c73f910af7fd1aecd37bb81
Reviewed-on: http://gerrit.cloudera.org:8080/15598
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 18:43:09 +00:00
Joe McDonnell
e9dd5d3f8c IMPALA-9560: Fix TestStatsExtrapolation for release versions
When changing the Impala version from 3.4.0-SNAPSHOT to 3.4.0-RELEASE,
TestStatsExtrapolation::test_stats_extrapolation started failing due
to a difference in the expected cardinality (expected: 17.91K,
actual 17.90K). This is because the Impala version gets embedded into
Parquet files, and this causes a slight difference in file size, which
translates into a slight difference in expected cardinality.

This modifies TestStatsExtrapolation::test_stats_extrapolation to
allow any 17.9*K cardinality.

Testing:
 - Tested on master and on branch-3.4.0

Change-Id: Iebe538936f23c095ef58c808e425cfb7b31edd94
Reviewed-on: http://gerrit.cloudera.org:8080/15569
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-27 23:27:12 +00:00
Attila Jeges
2cd7a2b77a IMPALA-9555: [Hive3] Fix test failure introduced by HIVE-22589
With HIVE-22589, Hive3 switched back to using the Julian calendar for
historical dates by default, which caused an Impala test failure
around Avro DATE values.

Change-Id: I51dd933867ea7877235e7f6e1f2b56711dca107e
Reviewed-on: http://gerrit.cloudera.org:8080/15564
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-27 17:16:00 +00:00
Tamas Mate
7dd13f7278 IMPALA-5308: Resolve confusing Kudu SHOW TABLE STATS output
This change modifies the output of SHOW TABLE STATS and SHOW
PARTITIONS for Kudu tables.
 - PARTITIONS: the #Rows column has been removed
 - TABLE STATS: instead of showing partition information, it returns a
 result set similar to HDFS table stats: #Rows, #Partitions, Size,
 Format and Location

Example outputs can be seen in the doc changes.

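Concretely (hypothetical table name), the affected statements are:

  SHOW PARTITIONS kudu_tbl;   -- no longer includes a #Rows column
  SHOW TABLE STATS kudu_tbl;  -- returns #Rows, #Partitions, Size,
                              -- Format and Location
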
Testing:
* kudu_stats.test is modified to verify the new result set
* kudu_partition_ddl.test is modified to verify the new partition style
* Updated a unit test with the new error message

Change-Id: Ice4b8df65f0a53fe14b8fbe35d82c9887ab9a041
Reviewed-on: http://gerrit.cloudera.org:8080/15199
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-18 18:05:34 +00:00