impala

mirror of https://github.com/apache/impala.git synced 2026-01-10 09:00:16 -05:00

Author	SHA1	Message	Date
Zoltan Borok-Nagy	0eaab69fff	IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows Tuple pointers in the generated row batches may not be initialized if a tuple has byte size 0. There are some codes which compare these uninitialized pointers against nullptr so having them uninitialized may return wrong (and non-deterministic) results, e.g.: impala::TupleIsNullPredicate::GetBooleanVal() The following query produces non-deterministic results currently: SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN ( SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col) WHERE t3.bool_col = true; The alltypestiny table has 8 records, 4 records of them has the true boolean value for bool_col. Therefore, the above query is a fancy way of saying "8 * 8 * 4", i.e. it should return 256. The solution is that scanners initialize tuple pointers to a non-null value when there are no materialized slots. This non-null value is provided by the new static member Tuple::POISON. I extended QueryTest/scanners.test with the above query. This test executes the query against all table formats. This change has the biggest performance impact on count() queries on large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem and doubled its data. The resulting table has 12,002,430 rows. Without this patch 'select count() from biglineitem' runs for ~0.12s. With the patch applied, the overhead is around a dozens of ms. I measured the query on my desktop PC using a relase build of Impala. On debug builds, the execution time of the patched version is around 160% of the original version. Without this patch: +--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+ \| Operator \| #Hosts \| Avg Time \| Max Time \| #Rows \| Est. #Rows \| Peak Mem \| Est. Peak Mem \| Detail \| +--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+ \| 03:AGGREGATE \| 1 \| 127.50us \| 127.50us \| 1 \| 1 \| 28.00 KB \| 10.00 MB \| FINALIZE \| \| 02:EXCHANGE \| 1 \| 22.32ms \| 22.32ms \| 3 \| 1 \| 0 B \| 0 B \| UNPARTITIONED \| \| 01:AGGREGATE \| 3 \| 1.78ms \| 1.89ms \| 3 \| 1 \| 16.00 KB \| 10.00 MB \| \| \| 00:SCAN KUDU \| 3 \| 8.00ms \| 8.28ms \| 12.00M \| -1 \| 512.00 KB \| 0 B \| default.biglineitem \| +--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+ With this patch: +--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+ \| Operator \| #Hosts \| Avg Time \| Max Time \| #Rows \| Est. #Rows \| Peak Mem \| Est. Peak Mem \| Detail \| +--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+ \| 03:AGGREGATE \| 1 \| 129.01us \| 129.01us \| 1 \| 1 \| 28.00 KB \| 10.00 MB \| FINALIZE \| \| 02:EXCHANGE \| 1 \| 33.00ms \| 33.00ms \| 3 \| 1 \| 0 B \| 0 B \| UNPARTITIONED \| \| 01:AGGREGATE \| 3 \| 1.99ms \| 2.13ms \| 3 \| 1 \| 16.00 KB \| 10.00 MB \| \| \| 00:SCAN KUDU \| 3 \| 13.13ms \| 13.97ms \| 12.00M \| -1 \| 512.00 KB \| 0 B \| default.biglineitem \| +--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+ Change-Id: I298122aaaa7e62eb5971508e0698e189519755de Reviewed-on: http://gerrit.cloudera.org:8080/9239 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-16 20:23:21 +00:00
Michael Ho	9d37021927	IMPALA-6512: Fix test_exchange_delays for KRPC The sender timed out error message diverges between Thrift and KRPC slightly due to the source address being not readily available with Thrift RPC implementation. This leads to failure in test_exchange_delays when KRPC is enabled. This change fixes the problem by shortening the error message string to match against. Testing done: Tested with KRPC enabled in the code and verified the tests passed. Change-Id: Idd9410381dbb931231c92f084917265e5067b4c9 Reviewed-on: http://gerrit.cloudera.org:8080/9331 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-15 10:55:55 +00:00
Bikramjeet Vig	8fc1eccce4	IMPALA-5519: Allocate fragment's runtime filter memory from Buffer pool This patch adds changes to the planner to account for memory used by bloom filters at the fragment instance level. Also adds changes to allocate memory for those bloom filters from the buffer pool. Testing: - Modified Planner Tests and end to end tests to account for memory reservation for the runtime filters. - Modified backend tests and benchmarks to use the bufferpool for bloom filter allocation. - Add an end to end test. - Ran rest of the core tests. Change-Id: Iea2759665fb2e8bef9433014a8d42a7ebf99ce1f Reviewed-on: http://gerrit.cloudera.org:8080/8971 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-13 08:29:03 +00:00
Tim Armstrong	95f1666309	IMPALA-6077: remove Parquet BIT_PACKED def level support The encoding was added in an early version of the Parquet spec and deprecated even in the Parquet 1.0 spec. Parquet-MR switched to generating RLE at the same time as the spec changed in mid-2013. Impala always wrote RLE: see commit `6e293090e6`. The Impala implementation of BIT_PACKED was never correct because it implemented little endian bit unpacking instead of the big endian unpacking required by the spec for levels. Testing: Updated tests to reflect expected behaviour for supported and unsupported def level encodings. Cherry-picks: not for 2.x. Change-Id: I12c75b7f162dd7de8e26cf31be142b692e3624ae Reviewed-on: http://gerrit.cloudera.org:8080/9241 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-12 21:59:37 +00:00
Alex Behm	1a1927b07d	IMPALA-6228: Control stats extrapolation via tbl prop. Introduces a new TBLPROPERTY for controlling stats extrapolation on a per-table basis: impala.enable.stats.extrapolation=true/false The property key was chosen to be consistent with the impalad startup flag --enable_stats_extrapolation and to indicate that the property was set and is used by Impala. Behavior: - If the property is not set, then the extrapolation behavior is determined by the impalad startup flag. - If the property is set, it overrides the impalad startup flag, i.e., extrapolation can be explicitly enabled or disabled regardless of the startup flag. Testing: - added new unit tests - code/hdfs run passed Change-Id: Ie49597bf1b93b7572106abc620d91f199cba0cfd Reviewed-on: http://gerrit.cloudera.org:8080/9139 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 22:56:13 +00:00
Lars Volker	fc529b7f9f	IMPALA-5293: Turn insert clustering on by default This change enables clustering by default. IMPALA-2521 introduced the 'clustered' hint which inserts a local sort by the partitioning columns to a query plan. The hint is only effective for HDFS and Kudu tables. Like before, the 'noclustered' hint prevents clustering. If a table has ordering columns defined, the 'noclustered' hint is ignored and we issue a warning. This change removes some tests that were added specifically to test that clustering can be enabled using the 'clustered' hint. It changes some tests to use the 'noclustered' hint to make sure that clustering can be disabled. It also adds tests to make sure that we cover the 'noclustered' case properly. Cherry-picks: not for 2.x. Change-Id: Idbf2368cf4415e6ecfa65058daf6ff87ef62f9d9 Reviewed-on: http://gerrit.cloudera.org:8080/9153 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 05:58:50 +00:00
Gabor Kaszab	097f2f3f3b	IMPALA-6113: Skip row groups with predicates on NULL columns Based on the existing Parquet column chunk level statistics null_count, Impala's Parquet scanner is enhanced to skip an entire row group if the null_count statistics indicate that all the values under the predicated column are NULL as we wouldn't get any result rows from that row group anyway. Change-Id: I141317af0e0df30da8f220b29b0bfba364f40ddf Reviewed-on: http://gerrit.cloudera.org:8080/9140 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 03:24:37 +00:00
Tianyi Wang	c2184e56ae	IMPALA-5990: End-to-end compression of metadata Currently the catalog data is compressed in the statestore, but uncompressed when passed between FE and BE. It results in a ~2GB limit on the metadata. IMPALA-3499 introduced a workaround in the impalad but there isn't one in the catalogd. This patch aims to increase the size limit for statestore updates, reduce the copying of the metadata and reduce the memory footprint. With this patch, the catalog objects are passed and (de)compressed between FE and BE one at a time. The new limits are: - A single catalog object cannot be larger than ~2GB. - A statestore catalog update cannot be larger than ~4GB. It is compressed size if FLAGS_compact_catalog_topic is true. The behavior of the catalog op executer is not changed. The data is not compressed and the size limit is still 2GB. Testing: Ran existing tests. A test for compressing and decompressing catalog objects is added. Manually tested with a 1.95GB catalog object and a 3.90 GB uncompressed statestore update. Change-Id: I3a8819cad734b3a416eef6c954e55b73cc6023ae Reviewed-on: http://gerrit.cloudera.org:8080/8825 Reviewed-by: Tianyi Wang <twang@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-02 23:17:47 +00:00
Tianyi Wang	f0b3d9d122	IMPALA-3916: Reserve SQL:2016 reserved words This patch reserves SQL:2016 reserved words, excluding: 1. Impala builtin function names. 2. Time unit words(year, month, etc.). 3. An exception list based on a discussion. Some test cases are modified to avoid these words. A impalad and catalogd startup option reserved_words_version is added. The words are reserved if the option is set to "3.0.0". Change-Id: If1b295e6a77e840cf1b794c2eb73e1b9d2b8ddd6 Reviewed-on: http://gerrit.cloudera.org:8080/9096 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-02 01:13:08 +00:00
Vuk Ercegovac	08ca346f2e	IMPALA-3562: support column restriction for compute stats The 'compute stats' statement currently computes column-level statistics for all columns of a table. This adds potentially unneeded work for columns whose stats are not needed by queries. It can be especially costly for very wide tables and unneeded large string fields. This change modifies the 'compute stats' (non-incremental only) to support a user-specified list of columns for which stats should be computed. An example with the extension is as follows: compute stats my_db.my_table(column_a, column_b); While the phrase "for columns ..." is commonly used, since 'compute stats' seems fairly unique (vs. 'analyze table ...'), this change favors brevity with the parenthesized column list. Whereas currently 'compute stats' is applied to the columns that can be analyzed, the 'compute stats' in this change results in an error when a column is specified that cannot be analyzed (e.g., column does not exist, column is of an unsupported type, column is a partitioning column). Moreover, an empty column list can be supplied which means that no columns will be analyzed. Testing: - analyzing a subset of columns is already supported (e.g., not all columns can be analyzed), so the focus with testing is to check that the user-specified columns are handled as expected. - tests include: parser tests, ddl analysis, end-to-end tests. Change-Id: If8b25dd248e578dc7ddd35468125cca12d1b9f27 Reviewed-on: http://gerrit.cloudera.org:8080/9133 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-01 20:27:14 +00:00
Tim Armstrong	acfd169c8e	IMPALA-4319: remove some deprecated query options Adds a concept of a "removed" query option that has no effect but does not return an error when a user attempts to set it. These options are not returned by "set" or "set all" commands that are executed in impala-shell or server-side. These query options have been deprecated for several releases: DEFAULT_ORDER_BY_LIMIT, ABORT_ON_DEFAULT_LIMIT_EXCEEDED, V_CPU_CORES, RESERVATION_REQUEST_TIMEOUT, RM_INITIAL_MEM, SCAN_NODE_CODEGEN_THRESHOLD, MAX_IO_BUFFERS RM_INITIAL_MEM did still have an effect, but it was undocumented and MEM_LIMIT should be used in preference. DISABLE_CACHED_READS also had an effect but it was documented as deprecated. Otherwise the options had no effect at all. Testing: Ran exhaustive build. Updated query option tests to reflect the new behaviour. Cherry-picks: not for 2.x. Change-Id: I9e742e9b0eca0e5c81fd71db3122fef31522fcad Reviewed-on: http://gerrit.cloudera.org:8080/9118 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-01 08:26:26 +00:00
Jinchul	1b1087eb05	IMPALA-3282: Adds regexp_escape built-in function Escapes the following special characters in RE2 library: .\+*?[^]$(){}=!<>\|:- Testing: Add some unit tests into ExprTest.StringRegexpFunctions Add some E2E tests into exprs.test Change-Id: I84c3e0ded26f6eb20794c38b75be9b25cd111e4b Reviewed-on: http://gerrit.cloudera.org:8080/8900 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-01 05:14:14 +00:00
Taras Bobrovytsky	0a1d586d2a	IMPALA-4924: Enable Decimal V2 by default In this commit we enable Decimal_V2 by default. We also update the expected results in many of our tests. Testing: Ran an exhaustive test which almost passed. Updated the few failed tests in it. Cherry-pick: not for 2.x Change-Id: Ibbdd05bf986b7947f106b396017faa3a0bd87fd7 Reviewed-on: http://gerrit.cloudera.org:8080/9062 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-25 04:33:11 +00:00
Zoltan Borok-Nagy	545e60f832	IMPALA-5191: Standardize column alias behavior We should not perform alias substitution in the subexpressions of GROUP BY, HAVING, and ORDER BY to be more standard conformant. === Allowed === SELECT int_col / 2 AS x FROM functional.alltypes GROUP BY x; SELECT int_col / 2 AS x FROM functional.alltypes ORDER BY x; SELECT NOT bool_col AS nb FROM functional.alltypes GROUP BY nb HAVING nb; === Not allowed === SELECT int_col / 2 AS x FROM functional.alltypes GROUP BY x / 2; SELECT int_col / 2 AS x FROM functional.alltypes ORDER BY -x; SELECT int_col / 2 AS x FROM functional.alltypes GROUP BY x HAVING x > 3; Some extra checks were added to AnalyzeExprsTest.java. I had to update other tests to make them pass since the new behavior is more restrictive. I added alias.test to the end-to-end tests. Cherry-picks: not for 2.x. Change-Id: I0f82483b486acf6953876cfa672b0d034f3709a8 Reviewed-on: http://gerrit.cloudera.org:8080/8801 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-24 22:47:18 +00:00
aphadke	11d1784c0a	IMPALA-6435: Disable codegen for CHAR literals. Currently we do not codegen CHAR types. This change checks for CHAR literals in a expr and disables codegen. Change-Id: I7e4e27350c53bc69ce412a004e392e7480214f73 Reviewed-on: http://gerrit.cloudera.org:8080/9102 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-24 05:00:33 +00:00
Gabor Kaszab	7f652ce659	IMPALA-5654: Disallow setting Kudu table name in CREATE TABLE This change disallows explicitly setting the Kudu table name property for managed Kudu tables in a CREATE TABLE statement. The Kudu table name property gets a generated value as the following: 'impala::db_name.table_name' where table_name is the one given in the CREATE TABLE statement. Providing the Kudu table name property when creating a managed Kudu table results in an error without creating the table. E.g.: CREATE TABLE t (i INT) STORED AS KUDU TBLPROPERTIES('kudu.table_name'='some_name'); Alongside the CREATE TABLE statement also the ALTER TABLE statement is changed not to allow the modification of Kudu.table_name of managed Kudu tables. Change-Id: Ieca037498abf8f5fde67b77e824b720482cdbe6f Reviewed-on: http://gerrit.cloudera.org:8080/8820 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-22 22:34:30 +00:00
Vuk Ercegovac	db98dc6504	IMPALA-4993: extend dictionary filtering to collections Currently, top-level scalar columns in parquet files can be used at runtime to prune row-groups by evaluating certain conjuncts over the column's dictionary (if available). This change extends such pruning to scalar values that are stored in collection type columns. Currently, dictionary pruning works by finding eligible conjuncts for top-level slots. Since only top-level slots are supported, the slots are implicitly part of the scan node's tuple descriptor. With this change, we track eligible conjuncts by slot as well as the tuple that contains the slot (either top-level or nested collection). Since collection conjuncts are already managed by a map that associates tuple descriptors to a list of their conjuncts, this extension follows the existing representation. The frontend builds the mapping of SlotId to conjuncts that are dictionary filterable. This mapping now includes SlotId's that reference nested tuples. The backend is adjusted to use the same representation. In addition, collection readers are decomposed into scalar filterable columns and other, non-dictionary filterable readers. When filtering a row group using a conjunct associated to a (possibly) nested collection type, an additional tuple buffer is allocated per tuple descriptor. Testing: - e2e test extended to illustrate row-groups that are pruned by nested collection dictionary filters. Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd Reviewed-on: http://gerrit.cloudera.org:8080/8775 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-19 20:37:25 +00:00
Tim Armstrong	579e33207b	IMPALA-6368: make test_chars parallel Previously it had to be executed serially because it modified tables in the functional database. This change separates out tests that use temporary tables and runs those in a unique_database. Testing: Ran locally in a loop with parallelism of 4 for a while. Change-Id: I2f62ede90f619b8cebbb1276bab903e7555d9744 Reviewed-on: http://gerrit.cloudera.org:8080/9022 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-19 09:55:52 +00:00
Bikramjeet Vig	028a83e654	IMPALA-6382: Cap spillable buffer size and max row size query options Currently the default and min spillable buffer size and max row size query options accept any valid int64 value. Since the planner depends on these values for memory estimations, if a very large value close to the limits of int64 is set, the variables representing or relying on these estimates can overflow during different phases of query execution. This patch puts a reasonable upper limit of 1TB to these query options to prevent such a situation. Testing: Added backend query option tests. Change-Id: I36d3915f7019b13c3eb06f08bfdb38c71ec864f1 Reviewed-on: http://gerrit.cloudera.org:8080/9023 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-18 23:08:26 +00:00
Tianyi Wang	6cc76d7201	IMPALA-6353: Fix crash in snappy decompressor SnappyDecompressor::MaxOutputLen assumes the input pointer to be non-null. It's not true when the parquet file is corrupted and the compressed_page_size field in a page header is 0. This patch handles this error instead of failing a DCHECK. Testing: A bad parquet file with 0 compressed_page_size is added. It crashes impala without this patch. Change-Id: I0d42937aab92a74f8e104d2f7fcd64dc24f6a500 Reviewed-on: http://gerrit.cloudera.org:8080/8977 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-17 04:18:24 +00:00
Adam Holley	4c43cace87	IMPALA-4323: "SET ROW FORMAT" option added to "ALTER TABLE" command Examples of new command: ALTER TABLE t1 SET ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'; ALTER TABLE t1 SET ROW FORMAT DELIMITED LINES TERMINATED BY '\001'; Testing: Added parser tests and unit tests for alter statements including partition options. Change-Id: I96e347463504915a6f33932552e4d1f61e9b1154 Reviewed-on: http://gerrit.cloudera.org:8080/8928 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-16 23:58:24 +00:00
Jinchul	6041865031	IMPALA-3651: Adds murmur_hash() built-in function murmur_hash relys on HashUtil::MurmurHash2_64 which MurmurHash2 64-bit version. Testing: Add unit tests for primitive types: ExprTest.MurmurHashFunction Add E2E tests into exprs.test Change-Id: I14d56ffb8fab256f3f66a2669271fd4b3c50cc29 Reviewed-on: http://gerrit.cloudera.org:8080/8893 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 20:17:26 +00:00
Taras Bobrovytsky	c86b0a9736	IMPALA-5014: Part 2: Round when casting decimal to timestamp When there are too many digits to the right of the dot in a decimal, we would always truncate when casting to timestamp. In this patch we change the behavior to round instead of truncating when decimal_v2 is enabled. Testing: - Added some EE tests, ran BE tests on my machine. Change-Id: I8fb3a7d976ab980b8572d7e9524850572bad57da Reviewed-on: http://gerrit.cloudera.org:8080/8969 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 05:47:23 +00:00
Tim Armstrong	d3ff67b8b3	IMPALA-6370: fix partitioned parquet tables with nested types When materialising a nested collection, has_template_tuple() should use the template tuple for the collection, not the top-level tuple. Testing: Added tests based on nested-types-basic.test that operate on a simple partitioned table. The tests reliably crashed Impala before the fix. Change-Id: Ic808b824ce3b31af0539036d8ca23d17b18deab4 Reviewed-on: http://gerrit.cloudera.org:8080/8947 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-05 20:44:21 +00:00
Thomas Tauber-Marshall	96b976aff3	IMPALA-6295: Fix mix/max handling of 'nan' and 'inf' This patch fixes several issues related to the min/max aggregate functions and their handling of 'nan' and 'inf': - Previously, if 'inf' or '-inf' was the only value for the min/max and codegen was being used, the result would be incorrect. This occurred, for example in the case of 'inf' and 'min', because we set an initial value of numeric_limits::max, which is less than 'inf', so the returned min was numeric_limits::max when it should be 'inf'. The fix is to set the initial value to numeric_limits::infinity. - Previously, if one of the values was 'nan', the result of min/max was non-deterministic depending on the order the values were evaluated in. This occurs because 'nan' < or > 'any value' is always false, so if the first value added was 'nan', all other comparisons would be false and 'nan' would be returned, whereas if the first value wasn't 'nan' then the 'nan' wouldn't be returned. The fix is to treat 'nan' specially and to always return 'nan' if there is a single 'nan' value. Testing: - Added e2e tests for both scenarios, as well as adding a little extra nan/inf coverage for other aggregate functions. Change-Id: Ia1e206105937ce5afc75ca5044597d39b3dc6a81 Reviewed-on: http://gerrit.cloudera.org:8080/8854 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-04 01:23:43 +00:00
Bikramjeet Vig	545163bb0a	IMPALA-5929: Remove redundant explicit casts to string This patch adds a query rewriter to remove redundant explicit casts to a string type (string, char, varchar) from binary predicates of the form "cast(<non-const expr> to <string type>) <eq/ne op> <string constant>". The cast is redundant if the predicate evaluation is the same even if the cast is removed and the constant is converted to the original type of the expression. For example: cast(int_col as string) = '123456' -> int_col = 123456 Performance: For the following query on a table having 6001215 records - select * from tpch.lineitem where cast(l_linenumber as string) = '0' +-----------------+-----------+--------+ \| \| Scan Time \| +-----------------+-----------+--------+ \| \| Avg \| St dev \| \| Without rewrite \| 1s406ms \| 44ms \| \| With rewrite \| 1s099ms \| 28ms \| +-----------------+-----------+--------+ Testing: - Added unit tests to ExprRewriteRulesTest - Added functional test to expr.test - Current FE planner tests and BE expr-test run successfully with this change. Change-Id: I91b7c6452d0693115f9b9ed9ba09f3ffe0f36b2b Reviewed-on: http://gerrit.cloudera.org:8080/8660 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-03 01:15:42 +00:00
Alex Behm	1f7b3b00e9	IMPALA-5310: Part 3: Use SAMPLED_NDV() in COMPUTE STATS. Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV() function. Testing: - modified/improved existing functional tests - core/hdfs run passed Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b Reviewed-on: http://gerrit.cloudera.org:8080/8840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 04:58:59 +00:00
Taras Bobrovytsky	7256fcefb4	IMPALA-6284: Mark the intermediate decimal avg struct as packed We saw some failures on the exhaustive release build because the compiler assumed that the pointer to the intermediate struct that is used for computing decimal average was aligned. To fix the problem, we mark the struct with a "packed" attribute so that the compiler does not expect it to be aligned. Testing: - Ran the failing test locally on an release build and it passed. Change-Id: Id25ec6e20dde3f50fb37a22135b355ad251809e0 Reviewed-on: http://gerrit.cloudera.org:8080/8836 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 03:26:43 +00:00
stiga-huang	5c593be59c	IMPALA-6301: Fix test failures when username or group name contains dots Some tests use the local user's group name to construct SQLs, which may lead to syntax errors when group name contains dots. We need to quote the group names in SQL to avoid this error. Besides, a test in test_admission_controller uses '\w+' to match the local user name. This expression cannot match usernames with dots, which causes test failure as well. Instead, we should use '\S+'. Change-Id: Ib8ae15bb6a929dc48d3ad2176c8b3fafff87f32b Reviewed-on: http://gerrit.cloudera.org:8080/8807 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-13 23:06:45 +00:00
Jinchul	4feb4f3a54	IMPALA-5754: Improve randomness of rand()/random() Currently implementation of rand/random built-in functions use rand_r of C library. We recognized its randomness was poor. pcg32 of third party library shows better randomness than rand_r. Testing: Revise unit test in expr-test Add E2E test to random.test Change-Id: Idafdd5fe7502ff242c76a91a815c565146108684 Reviewed-on: http://gerrit.cloudera.org:8080/8355 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-12-13 10:04:40 +00:00
Zach Amsden	245df3c69a	IMPALA-6245: Tolerate column indenting from Hive The fix for HIVE-3140 started indenting multi-line comments, which breaks Impala testing when run against Hive 2.1.1. To test this using the pure test runner proved difficult since it would require extensive changes to support both row_regexes (since the columns changed order) and subset support (since the number of rows changed). Instead, we manually verify the hints are present in the output in the python test. The fact that the hints have been reformatted leaves us in an uncertain state as to whether they actually get applied, so a new test case has been added to run EXPLAIN SELECT on the view and verify the joins happen exactly as we expect. Testing: Ran the views-ddl test against Impala mini-cluster setups using both Hive 2.1.1 and Hive 1.1.0 Change-Id: I49e53b1230520ca6e850af28078526e6627d69de Reviewed-on: http://gerrit.cloudera.org:8080/8719 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-12 00:17:56 +00:00
Thomas Tauber-Marshall	2e83ba5796	IMPALA-6280: Materialize TupleIsNullPredicate for insert sorts When a sort is inserted into a plan for an INSERT due to either the target table being a Kudu table or the use of the 'clustered' hint, and a TupleIsNullPredicate is present in the output of the sort, the TupleIsNullPredicate may reference an incorrect tuple (i.e. not the materialized sort tuple), leading to errors. The solution is to materialize the TupleIsNullPredicate into the sort tuple and then perform the appropriate expr substitutions, as is already done for the case of analytic sorts. Testing: - Added an e2e test with a query that would previously fail. Change-Id: I6c0ca717aa4321a5cc84edd1d5857912f8c85583 Reviewed-on: http://gerrit.cloudera.org:8080/8791 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-08 23:58:18 +00:00
Thomas Tauber-Marshall	e94c60833a	IMPALA-6069: Fix CodegenAnyVal's handling of 'nan' Previously, CodegenAnyVal used an LLVM function for floating point comparison that considered 'nan' = 'nan' to be true. This is inconsistent with the way we handle 'nan' in the non-codegen path, where we consider 'nan' = 'nan' to be false, leading to inconsisent results. This patch fixes CodegenAnyVal to use an LLVM function for floating point comparison that considers 'nan' = 'nan' to be false. Testing: - Added e2e tests for the two scenarios affected by this: CASE and joins. Change-Id: I1bb8e5074b3c939927dedc46bc9db63ca24486a1 Reviewed-on: http://gerrit.cloudera.org:8080/8790 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-08 22:42:03 +00:00
Michael Ho	ed72910e96	IMPALA-6262: Always initialize runtime profile for DataSink This change moves the creation of the runtime profile from DataSink::Prepare() to the ctor of DataSink derived classes. This makes sure that DataSink::Close() and other functions can access the profile even if the DataSink fails to initialize. Testing done: Added a test case which triggers failure in the initialization of output expressions in a HdfsTableSink. Impalad crashed consistently without the fix. Change-Id: I2a683000ef180027b929dbebe78bc2a530a4767e Reviewed-on: http://gerrit.cloudera.org:8080/8770 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-07 09:47:09 +00:00
Vuk Ercegovac	16c5f514e0	IMPALA-6273: fixes subquery tests for functional_hbase IMPALA-1422 introduced tests that do not work with the testing setup for hbase. Namely, tinyinttable is not defined in the functional_hbase database, but is defined in the functional database. Exhaustive tests uncovered the issue. This change makes two changes so that tests work with functional_hbase: 1) use a table that is present in both functional and functional_hbase. the tests needed a subquery result with a single int column. tinyinttable is replaced with an inline view that provides this single int column in a portable manner. 2) nulls are handled differently with hbase (see IMPALA-728) so the nulltable used in the tests is set to functional.nulltable to avoid inconsistent results across input formats. Testing: - ran e2e tests with exhaustive exploration strategy for the broken test. Change-Id: Ibaa3a3df7362ac6d3ed07aff133dc4b3520bb4e0 Reviewed-on: http://gerrit.cloudera.org:8080/8765 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-06 00:41:11 +00:00
Vuk Ercegovac	633dbff71d	IMPALA-1422: support a constant on LHS of IN predicates. Currently, constant expressions for the LHS of the IN predicate are not supported. This patch adds this support as a rewrite in StmtRewriter (where subqueries are rewritten to joins). Since there is a nested-loop variant of left semijoin, support for IN is handled by not erring out. NOT IN is handled by a rewrite to corresponding NOT EXISTS predicate. Support for NOT IN with a correlated subquery is not included in this change. Re-organized the frontend subquery analysis tests to expand coverage. Testing: - added frontend subquery analysis tests - added e2e tests Change-Id: I0d69889a3c72e90be9d4ccf47d2816819ae32acb Reviewed-on: http://gerrit.cloudera.org:8080/8322 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-02 04:09:05 +00:00
Taras Bobrovytsky	575b5a20e6	IMPALA-5017: Error on decimal overflow Before this patch, decimal operations would either silently overflow (in the case of sum() and avg()), or produce a warning. In this patch, the behaviour is changed so that an error is produced in the case of overflow when DECIMAL_v2 is enabled. Decimal v1 behaviour is unchanged. We introduce overflow checks when computing sum() and avg(). This results in a ~30% performance regression when we are in decimal v2 mode compared to decimal v1. Benchmarks: Query: select sum(dec_38_19) from decimal_tbl Decimal v1: 11.57s Decimal v2: 16.58s Query: select avg(dec_38_19) from decimal_tbl Decimal v1: 12.08s Decimal v2: 17.08s The performance regression is not as bad if we are computing the sum or average of decimal column with a lower precision: Query: select sum(dec_9_5) from decimal_tbl Decimal v1: 11.06s Decimal v2: 13.08s Query: select avg(dec_9_5) from decimal_tbl Decimal v1: 11.56s Decimal v2: 13.57s Testing: - Added several end to end tests. - Updated Expr tests to check for error in case of overflow. Change-Id: Id98a92c9a9469ec8cf14e518c741a2dab7053019 Reviewed-on: http://gerrit.cloudera.org:8080/8404 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-01 23:23:01 +00:00
Zach Amsden	24c2ba0cc5	IMPALA-6244: Fix test failures with Hadoop 3.0 The metadata query test fails when run against Hadoop 3.0 due to some defaults changing for sequence files. Testing: Compared the output of hadoop fs -text /test-warehouse/alltypesmixedformat/year=2009/month=2/000023_0 and verified it is the same after a data load on Hadoop 2.6 and Hadoop 3.0; ran the metadata query test and verified it now passes in both cases. Change-Id: I1ccffdb0f712da1feb55f839e8d87a30f15e4fb6 Reviewed-on: http://gerrit.cloudera.org:8080/8656 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-30 07:33:16 +00:00
Alex Behm	b3d8a507cb	IMPALA-5310: Add COMPUTE STATS TABLESAMPLE. Adds the TABLESAMPLE clause for COMPUTE STATS. Syntax: COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)] Computes and replaces the table-level row count and total file size, as well as all table-level column statistics. Existing partition-level row counts are not modified. The TABLESAMPLE clause can be used to limit the scanned data volume to a desired percentage. When sampling, the unmodified results of the COMPUTE STATS queries are sent to the CatalogServer. There, the stats are extrapolated before storing them into the HMS so as not to confuse other engines like Hive/SparkSQL which may rely on the shared HMS fields being accurate. Limitations - Only works for HDFS tables - TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS - TABLESAMPLE requires --enable_stats_extrapolation=true Changes to EXPLAIN The stored statistics from the HMS are more clearly displayed under a 'stored statistics' section. Example: 00:SCAN HDFS [functional.alltypes, RANDOM] partitions=24/24 files=24 size=478.45KB stored statistics: table: rows=7300 size=478.45KB partitions: 24/24 rows=7300 columns: all Testing: - added new functional tests - core/hdfs run passed Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7 Reviewed-on: http://gerrit.cloudera.org:8080/8136 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-29 22:37:01 +00:00
Michael Ho	63f17e9cea	IMPALA-6187: Fix missing conjuncts evaluation with empty projection Previously, scanners will assume that there are no conjuncts associated with a scan node for queries with no materialized slots (e.g. count()). This is not necessarily the case as one can write queries such as select count() from tpch.lineitem where rand() * 10 < 0; or select count() from tpch.lineitem where rand() > <a partition column>. In which case, the conjuncts should still be evaluated once per row. This change fixes the problem in the short-circuit handling logic for count() to evaluate the conjuncts once per row and only commits a row to the output row batch if the conjuncts evaluate to true. Testing done: Added the example above to the scanner test Change-Id: Ib530f1fdcd2c6de699977db163b3f6eb38481517 Reviewed-on: http://gerrit.cloudera.org:8080/8623 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-29 05:53:15 +00:00
Jinchul	2fba80ee5e	IMPALA-5146: Fix inconsitent results at FROM_UNIXTIME() The FROM_UNIXTIME(epoch) and FROM_UNIXTIME(epoch, format) produce different results when epoch is out of range of TimestampValue. The former produces an empty string, while the latter gives NULL. The fix is to harmonize the results to NULL. Testing: Add unit tests to ExprTest.TimestampFunctions. Change-Id: Ie3a5e9a9cb39d32993fa2c7f725be44d8b9ce9f2 Reviewed-on: http://gerrit.cloudera.org:8080/8629 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-29 05:22:32 +00:00
Zoltan Borok-Nagy	4f11bed407	IMPALA-5936: operator '%' overflows on large decimals Suppose we have a large decimal number, which is greater than INT_MAX. We want to calculate the modulo of this number by 3: BIG_DECIMAL % 3 The result of this calculation can be 0, 1, or 2. This can fit into a decimal with precision 1. The in-memory representation of such small decimals are stored in int32_t in the backend. Let's call this int32_t the result type. The backend had the invalid assumption that it can do the calculation as well using the result type. This assumption is true for multiplying or adding numbers, but not for modulo. Now the backend selects the biggest type of ['return type', '1st operand type', '2nd operand type'] to do the calculation. Change-Id: I2b06c8acd5aa490943e84013faf2eaac7c26ceb4 Reviewed-on: http://gerrit.cloudera.org:8080/8574 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-28 21:11:45 +00:00
Thomas Tauber-Marshall	7dd28ff431	IMPALA-6201: Fix test_basic_filters on ASAN TestRuntimeFilters.test_basic_filters is flaky on ASAN as sometimes the runtime filters aren't recieved within the specified RUNTIME_FILTER_WAIT_TIME_MS. This patch increases the timeout for ASAN builds. Change-Id: I8c20cbb75a9b6da73137f220657aa75dea9dfdce Reviewed-on: http://gerrit.cloudera.org:8080/8646 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-28 03:01:39 +00:00
Gabor Kaszab	88cb68cfbe	IMPALA-2181: Add query option levels for display Four display levels are introduced for each query option: REGULAR, ADVANCED, DEVELOPMENT and DEPRECATED. When the query options are displayed in Impala shell using SET then only the REGULAR and ADVANCED options are shown. A new command called SET ALL shows all the options grouped by their option levels. When the query options are displayed through the SET SQL statement then the result set would contain an extra column indicating the level of each option. Similarly to Impala shell here the SET command only diplays the REGULAR and ADVANCED options while SET ALL shows them all. If the Impala shell connects to an Impala daemon that predates this change then all the options would be displayed in the REGULAR group. Change-Id: I75720d0d454527e1a0ed19bb43cf9e4f018ce1d1 Reviewed-on: http://gerrit.cloudera.org:8080/8447 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-28 00:31:15 +00:00
Vuk Ercegovac	21a96ed2e3	IMPALA-4985: use parquet stats of nested types for dynamic pruning Currently, parquet row-groups can be pruned at run-time using min/max stats when predicates (in, binary) are specified for column scalar types. This patch extends pruning to nested types for the same class of predicates. A nested value is an instance of a nested type (struct, array, map). A nested value consists of other nested and scalar values (as declared by its type). Predicates that can be used for row-group pruning must be applied to nested scalar values. In addition, the parent of the nested scalar must also be required, that is, not empty. The latter requirement is conservative: some filters that could be used for pruning are not used for correctness reasons. Testing: - extended nested-types-parquet-stats e2e test cases. Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe Reviewed-on: http://gerrit.cloudera.org:8080/8480 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-22 22:00:16 +00:00
Thomas Tauber-Marshall	2510fe0aa0	IMPALA-4252: Min-max runtime filters for Kudu This patch implements min-max filters for runtime filters. Each runtime filter generates a bloom filter or a min-max filter, depending on if it has HDFS or Kudu targets, respectively. In RuntimeFilterGenerator in the planner, each hash join node generates a bloom and min-max filter for each equi-join predicate, but only those filters that end up being assigned to a target make it into the final plan. Min-max filters are only assigned to Kudu scans if the target expr is a column, as Kudu doesn't support bounds on general exprs, and only if the join op is '=' and not 'is distinct from', as Kudu doesn't support returning NULLs if a bound is set. Min-max filters are inserted into by the PartitionedHashJoinBuilder. Codegen is used to eliminate branching on the type of filter. String min-max filters truncate their bounds at 1024 chars, so that the max amount of memory used by min-max filters is negligible. For now, min-max filters are only applied at the KuduScanner, which passes them into the Kudu client. Future work will address applying min-max filters at HDFS scan nodes and applying bloom filters at Kudu scan nodes. Functional Testing: - Added new planner tests and updated the old ones. (in old tests, a lot of runtime filters are renumbered as we always generate min-max filters even if they don't end up getting assigned and they take up some of the RF ids). - Updated existing runtime filter tests to work with Kudu. - Added e2e tests for min-max filter specific functionality. Perf Testing: - All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu, timings are averages of 3 runs. - Ran a contrived query with a filter that does not eliminate any rows (full self join of lineitem). The difference in running time was negligible - 24.46s with filters on, 24.15s with filters off for a ~1% slowdown. - Ran a contrived query with a filter that elimiates all rows (self join on lineitem with a join condition that never matches). The filters resulted in a significant speedup - 0.26s with filters on, 1.46s with filters off for a ~5.6x speedup. This query is added to targeted-perf. Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c Reviewed-on: http://gerrit.cloudera.org:8080/7793 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-17 21:33:51 +00:00
Michael Ho	3ddafcd295	IMPALA-6184: Clean up after ScalarExprEvaluator::Clone() fails When ScalarExprEvaluator::Clone() fails, the newly created evaluator was not added to the output vector. This makes it impossible for callers to close and clean up the evaluators afterwards. This change fixes this by always adding the newly created evaluator to the output vector before checking for the error status. This path is only exercised in the scanner code. Two new tests are added to exercise the failure paths. Testing done: newly added tests in udf-errors.test Change-Id: I45ffd722d0a69ad05ae3c748cf504c7f1a959a1d Reviewed-on: http://gerrit.cloudera.org:8080/8572 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-17 02:24:48 +00:00
Tim Armstrong	ae116b5bf7	IMPALA-4177,IMPALA-6039: batched bit reading and rle decoding Switch the decoders to using more batch-oriented interfaces. As an intermediate step this doesn't make the interfaces of LevelDecoder or DictDecoder batch-oriented, only the lower-level utility classes. The next step would be to change those interfaces to be batch-oriented and make according optimisations in parquet. This could deliver much larger perf improvements than the current patch. The high-level changes are. * BitReader -> BatchedBitReader, which is built to unpack runs of 32 bit-packed values efficiently. * RleDecoder -> RleBatchDecoder, which exposes the repeated and literal runs to the caller and uses BatchedBitReader to unpack literal runs efficiently. * Dict decoding uses RleBatchDecoder to decode repeated runs efficiently and uses the BitPacking utilities to unpack and encode in a single step. Also removes an older benchmark that isn't too interesting (since the batch-oriented approach to encoding and decoding is so much faster than the value-by-value approach). Testing: * Ran core tests. * Updated unit tests to exercise new code. * Added test coverage for the deprecated bit-packed level encoding to that it still works (there was no coverage previously). Perf: Single-node benchmarks showed a few % performance gain. 16 node cluster benchmarks only showed a gain for TPC-H nested. Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6 Reviewed-on: http://gerrit.cloudera.org:8080/8267 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-16 21:23:09 +00:00
Taras Bobrovytsky	d1b92c8b52	IMPALA-6183: Fix Decimal to Double conversion When converting a decimal to a double, we incorrectly used the powf() function in the backend, which returns a float instead of a double. This caused us to lose precision. We fix the problem by replacing the powf() function with a pow() function, which returns a double. Testing: - Added an EE test. Change-Id: I9bf81d039e5037f22c64a32b328832235aafe9e3 Reviewed-on: http://gerrit.cloudera.org:8080/8547 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-15 02:54:53 +00:00
Tianyi Wang	73d1bc3ad4	IMPALA-2281: Replace FNV with FastHash in exchange nodes FNV is not a good enough hash function. This patch introduces FastHash into the codebase and uses it in exchange nodes. Testing: Two test cases involving arbitrary ordering are changed. Single node performance benchmark shows no performance difference. Change-Id: I778317d982dcdb94173a369a65b39f32b4f7ded2 Reviewed-on: http://gerrit.cloudera.org:8080/8417 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-11-08 07:39:02 +00:00

1 2 3 4 5 ...

1039 Commits