Fill the LogicalType field in Parquet schemas for columns
that have an associated logical type. ConvertedType still
has to be filled to remain compatible with older readers.
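For context, the pairing can be sketched as a small lookup (illustrative Python, not Impala's actual schema-writing code; the pairs follow the Parquet format spec):

```python
# Each logical type that has a ConvertedType equivalent is written alongside
# it, so that older readers that only understand ConvertedType still work.
LOGICAL_TO_CONVERTED = {
    "StringType": "UTF8",
    "DecimalType": "DECIMAL",
    "DateType": "DATE",
    "TimestampType(MILLIS)": "TIMESTAMP_MILLIS",
    "TimestampType(MICROS)": "TIMESTAMP_MICROS",
}

def column_annotations(logical_type):
    """Return the (LogicalType, ConvertedType) pair to fill for a column."""
    return logical_type, LOGICAL_TO_CONVERTED.get(logical_type)
```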
Testing:
- added new tests to check both logical and converted types
to test_insert_parquet.py
Change-Id: I6f377950845683ab9c6dea79f4c54db0359d0b91
Reviewed-on: http://gerrit.cloudera.org:8080/12004
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
test_min_max_filters and test_decimal_min_max_filters record the aggregated probe rows to
check whether the min-max filter was exercised. In ASAN builds, the probe side
started processing before the filters reached it, because ASAN builds are a
little slower. The resolution is to increase RUNTIME_FILTER_WAIT_TIME_MS to accommodate ASAN.
This issue was also seen earlier in runtime filter tests and was fixed through IMPALA-6201.
This fix mimics that one, by setting RUNTIME_FILTER_WAIT_TIME_MS to $RUNTIME_FILTER_WAIT_TIME_MS.
Change-Id: I111ed15947bd2812753ae68d3bbb8a9871e25b08
Reviewed-on: http://gerrit.cloudera.org:8080/12224
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Cardinality is vital to understanding why a plan has the form it does,
yet the planner normally emits cardinality information only for the
detailed levels. Unfortunately, most query profiles we see are at the
standard level without this information (except in the summary table),
making it hard to understand what happened.
This patch adds cardinality to the standard EXPLAIN output. It also
changes the displayed cardinality value to be in abbreviated "metric"
form: 1.23K instead of 1234, etc.
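The abbreviated form can be sketched as follows (a hedged illustration of the display format described above, not Impala's exact formatting code; the "B" suffix for billions is an assumption):

```python
def abbreviate_cardinality(n):
    """Render a row count in abbreviated "metric" form, e.g. 1234 -> '1.23K'."""
    for threshold, suffix in ((10**9, "B"), (10**6, "M"), (10**3, "K")):
        if n >= threshold:
            return f"{n / threshold:.2f}{suffix}"
    return str(n)

print(abbreviate_cardinality(1234))  # 1.23K
```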
Changing the EXPLAIN output has a huge impact on PlannerTest: all the
"golden" test files must change. To avoid doing this twice, this patch
also includes:
IMPALA-7919: Add predicates line in plan output for partition key
predicates
This is also a good time to include:
IMPALA-8022: Add cardinality checks to PlannerTest
The comparison code was changed to allow a set of validators, one of
which compares cardinality to ensure it is within 5% of the expected
value. This should ensure we don't change estimates unintentionally.
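The tolerance check can be sketched as (a hypothetical helper, not the actual PlannerTest validator code):

```python
def cardinality_ok(expected, actual, tolerance=0.05):
    """True iff the planner's cardinality estimate is within 5% of expected."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance
```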
While many planner tests are concerned with cardinality, many others are
not. Testing showed that cardinality is actually unstable within some
tests. For such tests, filters were added to ignore cardinality. The filter
is enabled by default (for backward compatibility) but disabled (to
allow cardinality verification) for the critical tests.
Rebasing the tests was complicated by a bug in the error-matching code,
so this patch also fixes:
IMPALA-8023: Fix PlannerTest to handle error lines consistently
Now, the error output written to the output "save results" file matches
that expected in the "golden" file -- no more handling these specially.
Testing:
* Added cardinality verification.
* Reran all FE tests.
* Rebased all PlannerTest .test files.
* Adjusted the metadata/test_explain.py test to handle the changed
EXPLAIN output.
Change-Id: Ie9aa2d715b04cbb279aaffec8c5692686562d986
Reviewed-on: http://gerrit.cloudera.org:8080/12136
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The code mimics the code written for other min-max filters. Decimal data
can be stored using 4, 8, or 16 bytes, and the code handles all three
storage configurations. The column definition states the precision, and
the precision determines the storage size.
The minimum and maximum values are stored in a union. The precision of
the column comes in as an input; based on the precision the size is
found, and depending on the size the appropriate union member is used.
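The precision-to-size rule can be sketched as follows (this mirrors the fixed decimal storage tiers described above; the helper name is illustrative):

```python
def decimal_storage_bytes(precision):
    """Storage size in bytes for a decimal of the given precision:
    up to 9 digits fit in 4 bytes, up to 18 in 8 bytes, up to 38 in 16 bytes."""
    if not 1 <= precision <= 38:
        raise ValueError("decimal precision must be in [1, 38]")
    if precision <= 9:
        return 4
    if precision <= 18:
        return 8
    return 16
```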
The code in min-max-filter* follows the general convention of the file, hence
uses macros.
The test includes 24 decimal columns (as listed below) with the following joins:
1. Inner Join with broadcast (2 tables)
1a. 1 predicate
1b. 4 predicates - all result in a decimal min-max filter
1c. 4 predicates - 3 result in a decimal min-max filter; 1 doesn't
2. Inner Join with Shuffle (3 tables)
3. Right outer join (2 tables)
4. Left Semi join (2 tables)
5. Right Semi join (2 tables)
Decimal Columns:
4bytes:
(5,0), (5,1), (5,3), (5,5)
(9,0), (9,1), (9,5), (9,9)
8 bytes:
(14,0), (14,1), (14,7), (14,14)
(18,0), (18,1), (18,9), (18,18)
16 bytes:
(28,0), (28,1), (28,14), (28,28)
(38,0), (38,1), (38,19), (38,38)
The test aggregates the count of probe rows. This shows that the min-max filter
is exercised, because the number of probe rows is less than the total number
of rows in the probe-side table. The count of probe rows is considered to be
deterministic, but it will be beneficial to look out for changes in Kudu that
can change the way data is partitioned. Such a change could alter the probe
row count, in which case the test will have to be updated.
impala_test_suite.py and test_result_verifier.py are enhanced to support saving
of aggregation using update_results.
Change-Id: Ib7e7278e902160d7060f8097290bc172d9031f94
Reviewed-on: http://gerrit.cloudera.org:8080/12113
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
GC is performed when:
* The amount of memory allocated from the system for the buffer pool
exceeds the reservation (i.e. free buffers and clean pages are not
offset by unused reservation).
* The soft or hard process memory limit would otherwise cause an
allocation to fail.
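The two triggers can be sketched as (illustrative names, not the buffer pool's actual API):

```python
def should_gc(system_allocated, total_reservation, consumed, alloc_size, mem_limit):
    """GC free buffers and clean pages when either trigger described above fires:
    the system allocation exceeds the reservation, or the pending allocation
    would push process memory over the limit."""
    exceeds_reservation = system_allocated > total_reservation
    would_fail_allocation = consumed + alloc_size > mem_limit
    return exceeds_reservation or would_fail_allocation
```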
Testing:
Looped the old version of the semi_joins_exhaustive test, which
reliably reproduced the issue. I confirmed that the buffer pool GC was
running and that it prevented the query failures.
Added a backend test that reproed the issue. A large chunk of the code
change is to add infrastructure to use TCMalloc memory metrics
for the process memory tracker in backend tests.
Ran exhaustive tests.
Change-Id: I81e8e29f1ba319f1b499032f9518d32c511b4b21
Reviewed-on: http://gerrit.cloudera.org:8080/12133
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The ClassLoader from which a UDF was loaded needs to be kept open for
executions of the UDF, so that the UDF can load other classes from the
same jar. (A typical scenario might be a utility class.) This was
broken by the fix to IMPALA-7668.
This commit moves closing the ClassLoader to the close() function.
A test for a UDF that imports a static method from another file has been
added. Doing so failed without this change.
Change-Id: Ic02e42fb25a2754ede21fe00312a60f07e0ba8a2
Reviewed-on: http://gerrit.cloudera.org:8080/12125
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Each AST statement node has a toSql() implementation. The code for
ScalarFunction and ToSqlUtils has a number of issues:
* If Location or Symbol are not set, they are shown as 'null'. Better to
omit these clauses if the items are not available. This is mostly an
issue during testing.
* The generated SQL does not follow the CREATE TABLE syntax. For
example, the signature and return value are provided for Java
functions, but should not be.
* Unlike other statements, this one is generated with a trailing newline.
* ToSql.getCreateFunctionSql() fails to separate functions with the
semi-colon statement separator.
These are all minor issues, but we might as well fix the code to work as
intended.
Testing:
* Added new unit tests to verify the behavior (no tests existed
previously).
* Re-ran all FE tests.
Change-Id: Id34d6df97760a11c299092dff8edbdb7033bce1c
Reviewed-on: http://gerrit.cloudera.org:8080/12014
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
PARQUET-1387 added int64 timestamps with nanosecond precision that
store timestamps as nanoseconds since the Unix epoch.
As 64 bits are not enough to represent the whole 1400..9999 range
of Impala timestamps, this new type works with a limited range:
1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807 UTC
The benefit of the reduced range is that no validation is necessary
during scanning, as every possible 64 bit value represents a valid
timestamp in Impala. This means it has the potential to be the
fastest way to store timestamps in Impala + Parquet.
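The quoted range falls directly out of the int64 value range; a quick check (Python's datetime only has microsecond resolution, so the nanosecond tail is truncated):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# int64 nanoseconds since the epoch, truncated to whole microseconds:
lo = EPOCH + timedelta(microseconds=(-2**63) // 1000)
hi = EPOCH + timedelta(microseconds=(2**63 - 1) // 1000)
print(lo, hi)  # 1677-09-21 00:12:43.145224+00:00  2262-04-11 23:47:16.854775+00:00
```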
Another way NANO differs from MICRO and MILLI is that NANO can
only be described with the new logical types in Parquet; it has no
converted type equivalent. This made implementing CREATE TABLE
LIKE PARQUET less trivial than it was for MICRO/MILLI: the type
conversion logic in ParquetHelper.java had to be rewritten to
use LogicalTypeAnnotation instead of ConvertedType.
The changes on Java side also made bumping CDH_BUILD_NUMBER
necessary.
Testing:
- added a new testfile with int64 nano timestamps
- ran core tests
Change-Id: I932396d8646f43c0b9ca4a6359f164c4d8349d8f
Reviewed-on: http://gerrit.cloudera.org:8080/11984
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
It was disabled for performance reasons (IMPALA-1003) and this patch
re-enables it since a lot of codegen improvements have happened since
then.
This patch switches the aggregation to use the CASE conditional instead
of IF since the former has proper codegen support (IMPALA-7655).
Tests:
=====
- Updated the affected tests to include the null counts.
- Added unit tests that verify IS [NOT] NULL predicates' cardinality
estimation.
Perf note:
=========
I reran the compute stats child query with null counts included on the
store_sales table from 1000 SF (1TB) tpcds dataset. The table had 22
non-partitioned columns (on which null counts were computed) and ~2.8B
rows. This experiment showed around 7-8% perf drop compared to the same
child query without null counts for these columns.
Change-Id: Ic68f8b4c3756eb1980ce299a602a7d56db1e507a
Reviewed-on: http://gerrit.cloudera.org:8080/11565
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Move parquet classes into exec/parquet.
Move CollectionColumnReader and ParquetLevelDecoder into separate files.
Remove unnecessary 'encoding_' field from ParquetLevelDecoder.
Switch BOOLEAN decoding to use composition instead of inheritance. This
lets the boolean decoding use the faster batched implementations in
ScalarColumnReader and avoids some confusing aspects of the class
hierarchy, like the ReadValueBatch() implementation on the base class
that was shared between BoolColumnReader and CollectionColumnReader.
Improve compile times by instantiating BitPacking templates in a
separate file (this looks to give a 30s+ speedup for
compiling parquet-column-readers.cc).
Testing:
Ran exhaustive tests.
Change-Id: I0efd5c50b781fe9e3c022b33c66c06cfb529c0b8
Reviewed-on: http://gerrit.cloudera.org:8080/11949
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-7367 reduced the memory requirement of the query tested in
test_exchange_mem_usage_scaling. This change reduces the mem limit
to ensure that the query runs out of memory as expected.
Testing:
Ran the test 100 times in a loop without any failures.
Change-Id: Ib2f063fb88ebf0c7f994b55ecfc860d81726fdd8
Reviewed-on: http://gerrit.cloudera.org:8080/11965
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As shown in IMPALA-7828, there is some non-determinism in whether the errors
detected in FragmentInstanceState::Close() will show up in the final profile
sent to the coordinator. The reason is that the current code marks a fragment
instance as "done" after ExecInternal() completes but before Close() is called.
There is a window between when the final status report is sent and when Close()
finishes.
This change fixes the problem by not sending the final report until Close()
is called. This has no implication on the first row available time for normal
queries. It may slightly lengthen the first row available time for DML queries.
Testing done: Updated udf-no-expr-rewrite.test to exercise this case
Perf run on an 8-node cluster didn't show any regression:
TPCH-300
+------------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+------------+-----------------------+---------+------------+------------+----------------+
| TPCH(_300) | parquet / none / none | 23.94 | -2.05% | 12.55 | -2.62% |
+------------+-----------------------+---------+------------+------------+----------------+
Small concurrency
+-------------------------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------------+-----------------------+---------+------------+------------+----------------+
| TPCDS-UNMODIFIED(_1000) | parquet / none / none | 6.89 | -0.66% | 6.62 | +0.41% |
+-------------------------+-----------------------+---------+------------+------------+----------------+
Medium concurrency
+-------------------------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------------+-----------------------+---------+------------+------------+----------------+
| TPCDS-UNMODIFIED(_1000) | parquet / none / none | 55.57 | -1.04% | 55.27 | -0.98% |
+-------------------------+-----------------------+---------+------------+------------+----------------+
Change-Id: I61618854ae3f4e7ef20028dcb0ff5cbcfa8adb01
Reviewed-on: http://gerrit.cloudera.org:8080/11939
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change packs StringValue and CollectionValue slots so that
they occupy 12 bytes instead of 16 bytes. This reduces the
memory requirements and improves performance. Since Kudu
tuples are populated using a memcpy, 4 bytes of padding were
added to string slots in Kudu tables.
Testing:
Ran core tests.
Added static asserts to ensure the value sizes are as expected.
Performance tests on TPCH-40 produced 3.96% improvement.
Change-Id: I32f3b06622c087e4aa288e8db1bf4581b10d386a
Reviewed-on: http://gerrit.cloudera.org:8080/11599
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.
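The fast path can be sketched as follows (a simplified model: the real code also falls back to value-at-a-time decoding when the next levels are not in a repeated run; `decode_values` stands in for a batched kernel like UnpackAndDecode32Values()):

```python
def decode_with_def_levels(rle_runs, decode_values, max_def_level=1):
    """'rle_runs' is a list of (def_level, run_length) repeated runs from the
    RLE decoder. For each repeated run we decide NULL-ness once, instead of
    checking the def level for every single value."""
    out = []
    for def_level, run_length in rle_runs:
        if def_level == max_def_level:
            # Long non-NULL run: decode a whole batch of values at once.
            out.extend(decode_values(run_length))
        else:
            # Long NULL run: emit NULLs without touching the data pages.
            out.extend([None] * run_length)
    return out

# Hypothetical batched value decoder for illustration:
vals = iter(range(100))
decoded = decode_with_def_levels([(1, 3), (0, 2), (1, 1)],
                                 lambda n: [next(vals) for _ in range(n)])
print(decoded)  # [0, 1, 2, None, None, 3]
```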
These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.
Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.
Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.
Microbenchmarks show that scans doing only dictionary decoding
on uncompressed Parquet are ~75% faster:
set mt_dop=1;
select min(l_returnflag) from lineitem;
Testing:
We have alltypesagg with a mix of null and non-null values.
Many tables have long runs of non-null values.
Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
shapes - NULL collections, empty collections, etc, particularly runs
of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.
Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Reviewed-on: http://gerrit.cloudera.org:8080/8319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If explain_level is at 'extended' level or higher, then enhance the
output from the explain command. (1) Show the analyzed sql in the
explain header, this is the rewritten sql, which includes implicit
casts, and literals are printed with a cast so that their type is
visible. (2) When predicates are shown in the plan these are shown in
the same format.
The toSql() method can be called on a ParseNode tree to return
the SQL corresponding to the tree. In the past toSql() was
enhanced to print rewritten SQL by partially overloading toSql() [with
toSql(boolean)]. The current change requires changing toSql() in
many places, as a NumericLiteral can appear at different points in a
parse tree. To avoid many new fragile overloads of toSql(), I added
toSql(ToSqlOptions), where ToSqlOptions is an enum which controls the
form of the SQL that is returned. This changes many files but is safer
and means that any future options to toSql() can be added painlessly.
If SHOW_IMPLICIT_CASTS is passed to toSql() then
- in CastExpr print the implicit cast
- in NumericLiteral print the literal with a cast to show the type
Add a PlannerTestOption directive that will force the query text showing
implicit casts to be included in the PLAN section of a .test file.
The analyzed query text is wrapped at 80 characters. Note that the
analyzed query cannot always be executed as queries rewritten to use
LEFT SEMI JOIN are not legal sql. In addition some space characters may
be removed from the query for prettier display.
Documentation of this change will be done as IMPALA-7718
EXAMPLE OUTPUT:
[localhost:21000] default> set explain_level=2;
EXPLAIN_LEVEL set to 2
[localhost:21000] default> explain select * from functional_kudu.alltypestiny where bigint_col < 1000 / 100;
Query: explain select * from functional_kudu.alltypestiny where bigint_col < 1000 / 100
Max Per-Host Resource Reservation: Memory=0B Threads=2
Per-Host Resource Estimates: Memory=10MB
Codegen disabled by planner
Analyzed query: SELECT * FROM functional_kudu.alltypestiny WHERE CAST(bigint_col
AS DOUBLE) < CAST(10 AS DOUBLE)
""
F00:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
| Per-Host Resources: mem-estimate=4.88MB mem-reservation=0B thread-reservation=2
PLAN-ROOT SINK
| mem-estimate=0B mem-reservation=0B thread-reservation=0
|
00:SCAN KUDU [functional_kudu.alltypestiny]
predicates: CAST(bigint_col AS DOUBLE) < CAST(10 AS DOUBLE)
mem-estimate=4.88MB mem-reservation=0B thread-reservation=1
tuple-ids=0 row-size=97B cardinality=1
in pipelines: 00(GETNEXT)
Fetched 16 row(s) in 0.03s
TESTING:
All end-to-end tests pass.
Added a new test in ExprRewriterTest which prints sql with implicit casts
for some interesting queries.
Add a unit test for the code which wraps text at 60 characters.
The output of some Planner Tests in .test files has been updated to
include the Analyzed sql that is printed when explain_level is
at at least 'extended' level.
Change-Id: I55c3bdacc295137f66b2316a912fc347da30d6b0
Reviewed-on: http://gerrit.cloudera.org:8080/11719
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
test_resource_limits was failing in release build because the queries
used were finishing earlier than expected. This resulted in fragment
instances not being able to send enough updates to the coordinator in
order to hit the limits used for the tests. This patch adds a
deterministic sleep to the queries, which gives the coordinator
enough time to catch up on reports.
Testing:
Checked that tests passed on release builds.
Change-Id: I4a47391e52f3974db554dfc0d38139d3ee18a1b4
Reviewed-on: http://gerrit.cloudera.org:8080/11933
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Changes:
- parquet.thrift is updated to a newer version which contains the
timestamp logical type.
- INT64 columns with converted types TIMESTAMP_MILLIS and
TIMESTAMP_MICROS can be read as TIMESTAMP.
- If the logical type is timestamp, then the type will contain the
information whether the UTC->local conversion is necessary. This
feature is only supported for the new timestamp types, so INT96
timestamps must still use flag
convert_legacy_hive_parquet_utc_timestamps.
- Min/max stat filtering is enabled again for columns that need
UTC->local conversion. This was disabled in IMPALA-7559 because
it could incorrectly drop column chunks.
- CREATE TABLE LIKE PARQUET converts these columns to
TIMESTAMP - before the change, an error was returned instead.
- Bulk of the Parquet column stat logic was moved to a new class
called "ColumnStatsReader".
Testing:
- Added unit tests for timezone conversion (this needed a new public
function in timezone_db.h and adding CET to tzdb_tiny).
- Added parquet files (created with parquet-mr) with int64 timestamp
columns.
Change-Id: I4c7c01fffa31b3d2ca3480adf6ff851137dadac3
Reviewed-on: http://gerrit.cloudera.org:8080/11057
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before IMPALA-4063, the error message detected during
FragmentInstanceState::Close() was always lost. After
IMPALA-4063, we may sometimes get the error message in
FragmentInstanceState::Close(). It's non-deterministic
as the fragment instance thread may race with the query
state thread which reports the final status. The UDF test
currently tries to handle this non-determinism by using
"row_regex:.*" in the ERRORS section but it doesn't
always seem to work.
This change works around the issue by commenting out the
ERRORS section in udf-no-expr-rewrite.test for now.
The actual fix will be done in IMPALA-7829.
Change-Id: I6a55d5ad1a5a7278e7390f60854a8df28c1b9f28
Reviewed-on: http://gerrit.cloudera.org:8080/11900
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala assumes that when Reset() is called on an ExecNode, all of the
memory returned from that node by GetNext() has been attached to the
output RowBatch. In a query with a LIMIT on the subplan, such that
some nodes don't reach 'eos', this may not be the case.
The solution is to have Reset() take a RowBatch that any such memory
can be attached to. I examined all ExecNodes for resources being
transferred on 'eos' and added transferring of those resources in
Reset().
Testing:
- Added e2e tests that repro the issue for hash and nested loop joins.
Change-Id: I3968a379fcbb5d30fcec304995d3e44933dbbc77
Reviewed-on: http://gerrit.cloudera.org:8080/11852
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We need to carefully check that the intermediate value fits in an
int64_t and the final size fits in an int. If they don't, we
raise an error and fail the query.
Testing:
Added a couple of backend tests to exercise the
overflow check code paths.
Change-Id: I872ce77bc2cb29116881c27ca2a5216f722cdb2a
Reviewed-on: http://gerrit.cloudera.org:8080/11889
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.
Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.
The default limit is set to 512MB and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).
The default is set to an intentionally high value in order to avoid
performance regressions.
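The planner decision can be sketched as (hypothetical names; Impala's actual estimate comes from the planner's row-size and cardinality machinery):

```python
TOPN_BYTES_LIMIT = 512 * 1024 * 1024  # default: intentionally high

def pick_operator(limit, offset, est_row_size_bytes, topn_bytes_limit=TOPN_BYTES_LIMIT):
    """TopN must buffer limit+offset rows in memory; fall back to the
    spillable SORT operator when that estimate exceeds the limit."""
    estimated_bytes = (limit + offset) * est_row_size_bytes
    return "TOPN" if estimated_bytes <= topn_bytes_limit else "SORT"
```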
Testing:
* Added a new planner test to functional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.
Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Reviewed-on: http://gerrit.cloudera.org:8080/11698
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
The constraint imposed by IMPALA-1354 was artificial.
If there are constant "partition by" expressions, simply drop them,
they are no-ops.
Constant "order by" expressions can be ignored as well, though in effect
they should be accounted for as null expressions in the backend, with the
effect of combining all rows in the same window (i.e. no window breaks).
Change-Id: Idf129026c45120e9470df601268863634037908c
Reviewed-on: http://gerrit.cloudera.org:8080/11556
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
Previously, each fragment instance executing on an executor would
independently report its status to the coordinator periodically.
This creates a huge amount of RPCs to the coordinator under highly
concurrent workloads, causing lock contention in the coordinator's
backend states when multiple fragment instances send them at the
same time. In addition, due to the lack of coordination between query
fragment instances, a query may end without collecting the profiles
from all fragment instances when one of them hits an error before
another fragment instance manages to finish Prepare(), leading to
missing profiles for certain fragment instances.
This change fixes the problem above by making a thread per QueryState
(started by QueryExecMgr) to be responsible for periodically reporting
the status and profiles of all fragment instances of a query running
on a backend. As part of this refactoring, each query fragment instance
no longer reports its errors individually. Instead, there is a cumulative
status maintained per QueryState. It's set to the error status of the first
fragment instance which hits an error or any general error (e.g. failure
to start a thread) when starting fragment instances. With this change,
the status reporting threads are also removed.
Testing done: exhaustive tests
This patch is based on a patch by Sailesh Mukil
Change-Id: I5f95e026ba05631f33f48ce32da6db39c6f421fa
Reviewed-on: http://gerrit.cloudera.org:8080/11615
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Fixes an arithmetic overflow in ExchangeNode::GetNextMerging. Prior to
this patch, the code read:
int rows_to_keep = num_rows_skipped_ - offset_;
where num_rows_skipped_ and offset_ were of type int64_t. The result was
cast to an int which can lead to an overflow if the result exceeds the
value of 2^31. The value of rows_to_keep would be passed into
row-batch.h::CopyRows which would crash due to a DCHECK_LE error.
This crash arises when the value of the OFFSET is a large number. For
example, the query:
select int_col from functional.alltypes order by 1 limit
1 offset 9223372036854775800;
would crash the impalad executing it.
The fix is to change rows_to_keep to an int64_t to avoid the overflow,
which prevents the DCHECK_LE from failing.
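The narrowing bug can be reproduced outside C++ by simulating the int64-to-int32 conversion (two's complement wraparound):

```python
def as_int32(v):
    """Truncate a 64-bit value to a signed 32-bit int, as the C++ cast did."""
    v &= 0xFFFFFFFF
    return v - 2**32 if v >= 2**31 else v

# A difference just past INT_MAX wraps to a large negative row count,
# which is what tripped the DCHECK_LE in row-batch.h::CopyRows:
print(as_int32(2**31))      # -2147483648
print(as_int32(2**31 - 1))  # 2147483647 (still fine)
```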
Change-Id: I8bb8064aae6ad25c8a19f6a8869086be7e70400a
Reviewed-on: http://gerrit.cloudera.org:8080/11844
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Similar to the treatment of NULLs, we want to consider NaN values
as equal when grouping.
- When detecting a NaN in a set of row values, the NaN value must
be converted to a canonical value - so that all NaN values have
the same bit-pattern for hashing purposes.
- When doing equality evaluation, floating point types must have
additional logic to consider NaN values as equal.
- Existing logic for handling NULLs in this way is appropriate for
triggering this behavior for NaN values.
- Relabel "force null equality" as "inclusive equality" to expand
the scope of the concept to a more generic form that includes NaN.
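The canonicalization idea can be sketched as (illustrative Python, not the backend's hashing code):

```python
import math
import struct

CANONICAL_NAN_BITS = struct.pack("<d", float("nan"))

def hash_bits(value):
    """Bit pattern used for hashing a double: every NaN, regardless of its
    payload bits, maps to one canonical pattern so all NaNs group together
    (otherwise NaN != NaN and each NaN would land in its own group)."""
    if math.isnan(value):
        return CANONICAL_NAN_BITS
    return struct.pack("<d", value)

# A NaN with a non-canonical payload still hashes with the canonical NaN:
odd_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]
assert hash_bits(odd_nan) == hash_bits(float("nan"))
```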
Change-Id: I996c4a2e1934fd887046ed0c55457b7285375086
Reviewed-on: http://gerrit.cloudera.org:8080/11535
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
This fixes a class of bugs where the planner incorrectly uses the raw
string from the parser instead of the unescaped string. This occurs in
several places that push predicates down to the storage layer:
* Kudu scans
* HBase scans
* Data source scans
There are some more complex issues with escapes and the LIKE predicate
that are tracked separately by IMPALA-2422.
This also uncovered a different issue with RCFiles that is tracked by
IMPALA-7778 and is worked around by the tests added.
In order to make bugs like this more obvious in future, I renamed
getValue() to getValueWithOriginalEscapes().
Testing:
Added regression test that tests handling of backslash escapes on all
file formats. I did not add a regression test for the data source bug
since it seems to require some major modification of the data source
test infrastructure.
Change-Id: I53d6e20dd48ab6837ddd325db8a9d49ee04fed28
Reviewed-on: http://gerrit.cloudera.org:8080/11814
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The previous approach could lead to hangs or cryptic error messages
because it removed the ORC data type from a lookup table.
Instead check explicitly in the planner for ORC scans and throw a
more helpful error message.
Testing:
Added custom cluster test to exercise code and check error message.
Change-Id: I209e79b18745c48d0182800a916d6566083f4609
Reviewed-on: http://gerrit.cloudera.org:8080/11835
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The .test file parser implemented an unconventional method for parsing
single-quoted strings in comma-separated value format. This didn't handle
trailing commas in the string correctly.
This commit switches to using a conventional method for parsing
comma-separated value format:
* Commas enclosed by single quotes are not treated as field separators
* Single quotes can be escaped within a string by doubling them.
I looked into using Python's .csv module for this, but it wouldn't
work without modifying the test file format more because it
automatically discards the quotes during parsing, which are actually
semantically important in .test files. E.g. without the quotes we can't
distinguish between the literal string 'regex:...' and the regex
regex:....
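The two rules can be sketched as a small field splitter (illustrative; the real parser lives in the test framework, and the quotes are deliberately kept in the output because they are semantically important in .test files):

```python
def split_fields(line):
    """Split a comma-separated line: commas inside single quotes are not
    separators, and '' inside a quoted string is a literal escaped quote."""
    fields, buf, in_quotes, i = [], "", False, 0
    while i < len(line):
        c = line[i]
        if c == "'":
            if in_quotes and line[i + 1:i + 2] == "'":
                buf += "''"  # escaped quote: keep both characters
                i += 2
                continue
            in_quotes = not in_quotes
            buf += c
        elif c == "," and not in_quotes:
            fields.append(buf)
            buf = ""
        else:
            buf += c
        i += 1
    fields.append(buf)
    return fields
```

For example, `split_fields("1,'a,b','it''s'")` keeps the quoted comma and the doubled quote intact.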
Testing:
Ran exhaustive tests and fixed .test files that required modifications.
Will rerun before merging.
Added a couple of tests to exercise edge cases in the test file parser.
Change-Id: I18ddcb0440490ddf8184be66d3681038a1615dd9
Reviewed-on: http://gerrit.cloudera.org:8080/11800
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This is very similar to IMPALA-7335, except it happens
when 'progress_' is incremented in the call chain
HdfsScanNode::ProcessSplit
-> HdfsScanNodeBase::CreateAndOpenScanner()
-> HdfsScanner::Close()
The fix restructures the code so that SetDoneInternal() is
called with the error *before* HdfsScanner::Close(). This
required refactoring because HdfsScanNodeBase doesn't
actually know about SetDoneInternal().
My fix is to put the common logic between HdfsScanNode and
HdfsScanNodeMt into a helper in HdfsScanNodeBase, then in
HdfsScanNode, make sure to call SetDoneInternal() before
closing the scanner.
I also reworked HdfsScanNode::ProcessSplit() to handle error propagation
internally. I think the joint responsibility between ProcessSplit() and
its caller for handling errors made things harder than necessary.
Testing:
Added a debug action and test that reproduced the race before the fix.
Change-Id: I45a61210ca7d057b048c77d9f2f2695ec450f19b
Reviewed-on: http://gerrit.cloudera.org:8080/11596
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
After inserting all of its input into its Aggregators,
StreamingAggregationNode performs some cleanup, such as calling
InputDone() on each Aggregator.
Previously, StreamingAggregationNode only checked that all of the
child's batches had been fetched before doing this cleanup, which
caused problems if the final child batch wasn't processed fully in a
single GetNext() call. In this case, multiple calls to InputDone()
led to a DCHECK failure.
The solution is to only perform the cleanup once the final child batch
has been fully processed.
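A toy model of the corrected condition (names are illustrative, not Impala's actual members): cleanup runs only once the last child batch is fully processed, not merely fetched.

```python
class StreamingAggSketch:
    """Toy model of the fix, not the real StreamingAggregationNode."""
    def __init__(self, child_batches):
        self.pending = list(child_batches)  # unfetched child batches
        self.current = []                   # rows left in the fetched batch
        self.input_done_calls = 0

    def get_next(self, max_rows):
        if not self.current and self.pending:
            self.current = list(self.pending.pop(0))
        out, self.current = self.current[:max_rows], self.current[max_rows:]
        # Fixed condition: require the final batch to be fully *processed*,
        # not just fetched, before running InputDone() on each Aggregator.
        if not self.pending and not self.current and self.input_done_calls == 0:
            self.input_done_calls += 1
        return out
```

Checking only "all batches fetched" would have triggered the cleanup on every GetNext() call after the last fetch, which is the double-InputDone() bug described above.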
Testing:
- Added an e2e test with a query that hits this condition.
Change-Id: I851007a60472d0e53081c076c863c866c516677c
Reviewed-on: http://gerrit.cloudera.org:8080/11626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes one of the tests in test_resource_limits that expects a
query to run for more than 2 seconds but currently fails because it
sometimes completes earlier than that.
Change-Id: I2ba7080f62f0af3e16ef6c304463ebf78dec1b0c
Reviewed-on: http://gerrit.cloudera.org:8080/11741
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When run on a test build with HDFS erasure coding turned on, a
spilling test hits an out-of-memory error on the HDFS scan node. This
happens because the test is tuned for a regular 3-node minicluster
without HDFS erasure coding. The fix is to simply increase the memory
limit on the test to accommodate this difference, while keeping it
small enough to achieve the desired spilling on the hash join node.
Testing:
Ran it on an EC enabled minicluster to make sure it works
Change-Id: I207569822ba7388e78936d25e2311fa09c7a1b9a
Reviewed-on: http://gerrit.cloudera.org:8080/11740
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
StringMinMaxFilter uses a MemPool to allocate space for StringBuffers.
Previously, the MemPool was created by RuntimeFilterBank and passed to
each StringMinMaxFilter. In queries with multiple StringMinMaxFilters
being generated by the same fragment instance, this leads to
overlapping use of the MemPool by different threads, which is
incorrect as MemPools are not thread-safe.
The solution is to have each StringMinMaxFilter create its own
MemPool.
This patch also documents MemPool as not thread-safe and introduces a
DFAKE_MUTEX to help enforce correct usage. Doing this requires
modifying our CMakeLists.txt to pass '-DNDEBUG' to clang only in
RELEASE builds, so that the DFAKE_MUTEX will be present in the
compiled IR for DEBUG builds.
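A toy model of the fix and of the DFAKE_MUTEX-style debug check follows. The names are illustrative; Impala's MemPool and DFAKE_MUTEX are C++, and this sketch only captures the single-owner idea:

```python
import threading

class MemPoolSketch:
    # Non-thread-safe pool with a debug-only single-owner check, in the
    # spirit of DFAKE_MUTEX: use from a second thread trips the assert.
    def __init__(self):
        self._owner = None

    def allocate(self, n):
        me = threading.get_ident()
        if self._owner is None:
            self._owner = me
        assert self._owner == me, "MemPool used by more than one thread"
        return bytearray(n)

class StringMinMaxFilterSketch:
    # The fix: each filter owns its own pool, instead of sharing one
    # created by RuntimeFilterBank across all filters in the fragment.
    def __init__(self):
        self.pool = MemPoolSketch()
```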
Testing:
- I have been unable to repro the actual crash despite trying a large
variety of different things. However, with additional logging added,
it's clear that the MemPool is being used concurrently, which is
incorrect.
- Added an e2e test that covers the potential issue. It hits the
DFAKE_MUTEX with a sleep added to MemPool::Allocate.
- Ran a full exhaustive build in both DEBUG and RELEASE.
Change-Id: I751cad7e6b75c9d95d7ad029bbd1e52fe09e8a29
Reviewed-on: http://gerrit.cloudera.org:8080/11650
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon, which toggles Impala's ability to write the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
I removed 'num_data_pages_' from BaseColumnWriter since it was rather
confusing and didn't provide any measurable performance improvement.
This commit fixes the ASAN error produced by the first IMPALA-7644
commit which was reverted later.
Change-Id: Ib4a9098a2085a385351477c715ae245d83bf1c72
Reviewed-on: http://gerrit.cloudera.org:8080/11694
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I used some ideas from Alex Leblang's abandoned patch:
https://gerrit.cloudera.org/#/c/137/ in order to run .test files through
HS2. The advantage of using Impyla is that much of the code will be
reusable for any Python client implementing the standard Python DB-API,
and does not require us to implement yet another Thrift client.
This gives us better coverage of non-trivial result sets from HS2,
including handling of NULLs, error logs and more interesting result
sets than the basic HS2 tests.
I added HS2 coverage to TestQueries, which has a reasonable variety of
queries and covers the data types in alltypes. I also added
TestDecimalQueries, TestStringQuery and TestCharFormats to get coverage
of DECIMAL, CHAR and VARCHAR that aren't in alltypes. Coverage of
result sets with NULLs was limited, so I added a couple of queries.
Places where results differ from Beeswax:
* Impyla is a Python DB-API client, so it must convert timestamps into
Python datetime objects, which only have microsecond precision.
Therefore result timestamps with nanosecond precision are truncated.
* The HS2 interface reports the NULL type as BOOLEAN as a workaround for
IMPALA-914.
* The Beeswax interface reported VARCHAR as STRING, but HS2 reports
VARCHAR.
I dealt with different results by adding additional result sections so
that the expected differences between the clients/protocols were
explicit.
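The timestamp truncation can be seen directly in Python: datetime stores at most microseconds, so any DB-API client has to drop the nanosecond digits. The values below are illustrative, not Impyla's actual conversion code:

```python
from datetime import datetime

# Sub-second part of a hypothetical server-side timestamp, in nanoseconds.
nanos = 123456789

# datetime can only hold microseconds, so the last three digits are lost.
dt = datetime(2018, 1, 1, 0, 0, 0, nanos // 1000)
```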
Limitations:
* Not all of the same methods are implemented as for beeswax, so some
tests that have more complicated interactions with the client will not
work with HS2 yet.
* We don't have a way to get the affected row count for inserts.
I also simplified the ImpalaConnection API by removing some unnecessary
methods and moved some generic methods to the base class.
Testing:
* Confirmed that it detected IMPALA-7588 by re-applying the buggy patch.
* Ran exhaustive and CentOS6 tests.
Change-Id: I9908ccc4d3df50365be8043b883cacafca52661e
Reviewed-on: http://gerrit.cloudera.org:8080/11546
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon, which toggles Impala's ability to write the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
Change-Id: If9994882aa59cbaf3ae464100caa8211598287bc
Reviewed-on: http://gerrit.cloudera.org:8080/11563
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes the broken SHOW GRANT USER ON <object> statement,
which always showed an empty result due to an incorrect comparison
between the filter's TPrivilege and the actual privilege's TPrivilege;
the comparison should not consider the "grantoption" field.
Testing:
- Added new E2E tests
- Ran all FE tests
- Ran all authorization E2E tests
Change-Id: I7adc403caddd18e5a954cf6affd5d1d555b9f5f0
Reviewed-on: http://gerrit.cloudera.org:8080/11598
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
per host memory limit for a query
With this patch the per host memory limit of a query is automatically
set using the mem_limit set in the query options and the mem_estimate
calculated by the planner based on the following pseudo code:
if mem_limit is set in query options:
    use that, and if 'clamp-mem-limit-query-option' is true:
        enforce the min/max query mem limits defined in the pool config.
else:
    mem_limit = max(mem_estimate,
        min_mem_limit_required_to_accommodate_largest_initial_reservation)
    finally, enforce the min/max query mem limits defined in the pool
    config on this value.
This calculated mem limit will also be used for admission accounting
and consequently for admission control. Moreover, three new pool
configuration options have been added to enable this behaviour:
"min-query-mem-limit" & "max-query-mem-limit" => help
clamp the per host memory limit for a query. If both these limits
are not configured, then the estimates from planning are not used
as a memory limit and only used for making admission decisions.
Moreover the estimates will no longer have a lower bound based
on the largest initial reservation.
"clamp-mem-limit-query-option" => if false, the mem_limit defined in
the query options is used directly and the max/min query mem limits
are not enforced on it.
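The pseudo code above can be restated as a pure function. The names below are hypothetical; the real logic lives in the admission controller:

```python
def effective_mem_limit(mem_limit_option, mem_estimate,
                        largest_initial_reservation,
                        min_query_mem_limit, max_query_mem_limit,
                        clamp_mem_limit_query_option):
    # Sketch of the per-host mem limit computation described above.
    def clamp(v):
        # Enforce the pool's min/max query mem limits, when configured.
        if min_query_mem_limit is not None:
            v = max(v, min_query_mem_limit)
        if max_query_mem_limit is not None:
            v = min(v, max_query_mem_limit)
        return v

    if mem_limit_option is not None:
        # A mem_limit from query options wins; clamping it is optional.
        return clamp(mem_limit_option) if clamp_mem_limit_query_option \
            else mem_limit_option
    # Otherwise use the planner estimate, bounded below by the minimum
    # needed for the largest initial reservation, then clamped.
    return clamp(max(mem_estimate, largest_initial_reservation))
```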
Testing:
Added e2e test cases.
Added frontend tests for changes to RequestPoolService.
Successfully passed exhaustive tests.
Change-Id: Ifec00141651982f5975803c2165b7d7a10ebeaa6
Reviewed-on: http://gerrit.cloudera.org:8080/11157
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds test coverage for some cases where queries are currently
expected to fail with out-of-memory:
* memory limit exceeded in exchange node
* aggregation with large var-len intermediate values
* top N with large limit
* hash join with many duplicates on right side
* analytic with a large window that needs to be buffered in-memory
Note that it's not always totally deterministic where the query hits
'memory limit exceeded' so we don't include the node ID or name in the
expected error message.
Testing:
* ran exhaustive tests
* looped modified tests locally overnight
Change-Id: Icd1a7eb97837b742a967c260cafb5a7f4f45412e
Reviewed-on: http://gerrit.cloudera.org:8080/11564
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as-is into slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00
Different parts of Impala treat these timestamps differently:
- string conversion leads to invalid representation that cannot be
converted back to timestamp
- timezone conversions handle the overflowing time part and give
a valid timestamp result (at least since CCTZ, I did not check
older versions of Impala)
- Parquet writing inserts these timestamps as they are, so the
resulting Parquet file will also contain corrupt timestamps
The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamps outside the [1400..10000)
range. A new error code is added for this case. If both the date
and the time part are corrupt, then the error about the corrupt time
is returned.
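The range check can be sketched as follows, assuming the time part is stored as nanoseconds since midnight (illustrative, not the actual backend code):

```python
NANOS_PER_DAY = 24 * 60 * 60 * 10**9

def time_part_is_valid(nanos):
    # A timestamp's time part must lie in [0, 24 hours); anything else
    # is treated as corrupt and the whole value becomes NULL.
    return 0 <= nanos < NANOS_PER_DAY
```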
Testing:
- added a new scanner test that reads a corrupted Parquet file
with edge values
Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Reviewed-on: http://gerrit.cloudera.org:8080/11521
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.
get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consisting of {a:1},{a:2},{a:3} and perform the '.a'
selector on all values in the queue. The final results in the queue are
1, 2, 3. As there are multiple results, they are encapsulated into
an array, so the output is the string '[1,2,3]'.
More examples can be found in expr-test.cc.
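A minimal Python sketch of the BFS evaluation described above, supporting only the '$', '.key' and '[*]' path steps. The real implementation is C++ using RapidJSON with a FunctionContext-backed allocator, and handles the full selector syntax:

```python
import json
import re
from collections import deque

def get_json_object(json_str, selector):
    # Simplified sketch: tokenize the path into '.key' and '[*]' steps.
    steps = re.findall(r"\.(\w+)|\[(\*)\]", selector)
    queue = deque([json.loads(json_str)])
    for key, star in steps:          # BFS: apply each path step to the queue
        nxt = deque()
        for value in queue:
            if star and isinstance(value, list):
                nxt.extend(value)    # '[*]' expands every array element
            elif key and isinstance(value, dict) and key in value:
                nxt.append(value[key])
        queue = nxt
    if not queue:
        return None
    results = list(queue)
    # Multiple matches are wrapped in a JSON array; one match is returned bare.
    out = results if len(results) > 1 else results[0]
    return json.dumps(out, separators=(",", ":"))
```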
Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory
Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Reviewed-on: http://gerrit.cloudera.org:8080/10950
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes the SHOW GRANT USER statement to show all privileges
granted to a user, either directly via object ownership, or granted
through a role via a group the user belongs to. The output for SHOW
GRANT USER will have two additional columns for privilege name and
privilege type so the user can know where the privilege comes from.
Truncated sample showing two columns that are different from role:
+----------------+----------------+--------+----------+-...
| principal_type | principal_name | scope | database | ...
+----------------+----------------+--------+----------+-...
| USER | foo | table | foo_db | ...
| ROLE | foo_role | server | | ...
+----------------+----------------+--------+----------+-...
Testing:
- Create new custom cluster test with custom group mapping.
- Ran FE and custom cluster tests.
Change-Id: Ie9f6c88f5569e1c414ceb8a86e7b013eaa3ecde1
Reviewed-on: http://gerrit.cloudera.org:8080/11531
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently in DECIMAL V2 mode, typeof(9.9 % 3) is DECIMAL(2,1) and typeof(mod(9.9, 3)) is
DECIMAL(4,1), while both are expected to be DECIMAL(2,1). This JIRA fixes V2 mode by
replacing "mod" with "%" at the parser stage so that they share the same code path afterwards.
Testing:
Added unit tests and tested on a real cluster.
Change-Id: Ib0067da04083859ffbf662a29007154461bea2ba
Reviewed-on: http://gerrit.cloudera.org:8080/11443
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
The estimate for memory consumption for this scan is 9 columns * 384KB
per column = 3.375MB. So if we set the mem_limit to 6.5MB, we should
still not get more than one scanner thread, but we can avoid hitting
out-of-memory.
The issue in the JIRA was queued row batches. With this change, and
num_scanner_threads=2, there should be at most 12 row batches
(10 in the queue, 2 in the scanner threads about to be enqueued),
and based on the column stats I'd estimate that each row batch is
around 200KB, so this change should provide significantly more headroom.
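The arithmetic behind the headroom claim, restated as a quick check (the 200KB batch size is the estimate from the text, not a measured value):

```python
KB, MB = 1024, 1024 * 1024

scan_estimate = 9 * 384 * KB       # planner estimate: 9 columns * 384KB each
mem_limit = 6.5 * MB               # proposed mem_limit for the test
max_row_batches = 10 + 2           # queue capacity plus one per scanner thread
queued_bytes = max_row_batches * 200 * KB  # worst-case queued batches, ~2.34MB
headroom = mem_limit - scan_estimate       # ~3.125MB left over the estimate
```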
Change-Id: I6d992cc076bc8678089f765bdffe92e877e9d229
Reviewed-on: http://gerrit.cloudera.org:8080/11513
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>