impala

mirror of https://github.com/apache/impala.git synced 2026-01-08 03:02:48 -05:00

Author	SHA1	Message	Date
Grant Henke	0c8eba076c	IMPALA-5752: Add support for DECIMAL on Kudu tables Adds support for the Kudu DECIMAL type introduced in Kudu 1.7.0. Note: Adding support for Kudu decimal min/max filters is tracked in IMPALA-6533. Tests: * Added Kudu create with decimal test to AnalyzeDDLTest.java * Added Kudu table_format to test_decimal_queries.py ** Both decimal.test and decimal-exprs.test workloads * Added decimal queries to the following Kudu workloads: ** kudu_create.test ** kudu_delete.test ** kudu_insert.test ** kudu_update.test ** kudu_upsert.test Change-Id: I3a9fe5acadc53ec198585d765a8cfb0abe56e199 Reviewed-on: http://gerrit.cloudera.org:8080/9368 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-23 00:03:54 +00:00
Csaba Ringhofer	5a1f432e81	IMPALA-4167: Support insert plan hints for CREATE TABLE AS SELECT This change adds support for "clustered", "noclustered", "shuffle" and "noshuffle" hints in CTAS statements. Example: create /*+ clustered,noshuffle */ table t partitioned by (year, month) as select * from functional.alltypes The effect of these hints is the same as in insert statements: clustered: Sort locally by partition expression before insert to ensure that only one partition is written at a time. The goal is to reduce the number of files kept open / buffers kept in memory simultaneously. noclustered: Do not sort by primary key before insert to Kudu table. No effect on HDFS tables currently, as this is their default behavior. shuffle: Forces the planner to add an exchange node that repartitions by the partition expression of the output table. This means that a partition will be written only by a single node, which minimizes the global number of simultaneous writes. If only one partition is written (because all partitioning columns are constant or the target table is not partitioned), then the shuffle hint leads to a plan where all rows are merged at the coordinator where the table sink is executed. noshuffle: Do not add exchange node before insert to partitioned tables. The parser needed some modifications to be able to differentiate between CREATE statements that allow hints (CTAS), and CREATE statements that do not (every other type of CREATE statement). As a result, KW_CREATE was moved from tbl_def_without_col_defs to statement rules. Testing: The parser tests mirror the tests of INSERT, while analysis and planner tests are minimal, as the underlying logic is the same as for INSERT. Query tests are not created, as the hints have no effect on the DDL part of CTAS, and the actual query run is the same as in the insert case. 
Change-Id: I8d74bca999da8ae1bb89427c70841f33e3c56ab0 Reviewed-on: http://gerrit.cloudera.org:8080/8400 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-22 20:43:44 +00:00
Zoltan Borok-Nagy	881e00a8bf	IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN If the first number in a row group written by Impala is NaN, then Impala writes incorrect statistics in the metadata. This will result in incorrect results when filtering the data. This commit fixes the read path when encountering NaNs in Parquet min/max statistics. If min and max are both NaN, we can't use the statistics at all. If only one of them is NaN, the other can still be used. I added some tests to QueryTest/parquet-stats.test Change-Id: If3897fc1426541239223670812f59e2bed32f455 Reviewed-on: http://gerrit.cloudera.org:8080/9358 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-22 00:57:46 +00:00
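The NaN rule described in IMPALA-6538 can be illustrated with a minimal Python sketch. This is not Impala's C++ read path; the function names `usable_stats_bounds` and `can_skip_for_less_than` are hypothetical and exist only to show the decision logic under the assumption that a row group carries one float min and one float max.

```python
import math
from typing import Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]


def usable_stats_bounds(min_val: float, max_val: float) -> Optional[Bounds]:
    """Return the (min, max) bounds that are safe to filter on.

    Returns None when both values are NaN (statistics unusable). A NaN on
    only one side is replaced by None so the other side can still be used.
    """
    min_ok = not math.isnan(min_val)
    max_ok = not math.isnan(max_val)
    if not min_ok and not max_ok:
        return None
    return (min_val if min_ok else None, max_val if max_ok else None)


def can_skip_for_less_than(min_val: float, max_val: float, constant: float) -> bool:
    """Hypothetical check for a predicate `col < constant`.

    The row group can be skipped only if a valid min exists and no value
    can be smaller than the constant.
    """
    bounds = usable_stats_bounds(min_val, max_val)
    if bounds is None or bounds[0] is None:
        return False  # no trustworthy lower bound: must read the row group
    return bounds[0] >= constant
```

With a NaN min the sketch always reads the row group, which is the conservative choice the commit message implies.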
Joe McDonnell	baec8cae34	IMPALA-4874: Increase maximum KRPC message size The default value for rpc_max_message_size is 50MB. Impala currently requires support for messages of up to 2GB. This changes the value of rpc_max_message_size to INT_MAX for Impala. Testing: - Added a test to test_very_large_strings that generates a row with multiple large strings. This row requires that the RPC framework successfully transmit over 400MB. This works for both KRPC and Thrift. This query operates under the same amount of memory as other queries in large_strings.test. - Tested separately that larger row sizes also work, including tests up to almost 2GB. Change-Id: I876bba0536e1d85e41eacd9c0aeccfe5c2126e58 Reviewed-on: http://gerrit.cloudera.org:8080/9337 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-21 03:17:57 +00:00
Michael Ho	62d8462e13	IMPALA-5518: Allocate KrpcDataStreamRecvr RowBatch tuples from BufferPool Previously, tuple pointers of a row batch were allocated from the heap via malloc() and tuple data was allocated from the MemPool associated with the RowBatch. This change converts the allocations of tuple pointers and tuple data to using BufferPool for row batches allocated from KrpcDataStreamRecvr. The primary motivation for this change is to take advantage of the fact that buffers allocated from BufferPool always go back to the per-core arena they came from when they are freed. This alleviates the TCMalloc imbalance between the RPC service threads and the fragment execution threads. As described in IMPALA-5518, row batches are always allocated from the service threads' TCMalloc cache and placed into the fragment execution threads' TCMalloc cache when they're freed. This leads to underflow and overflow in those threads' caches and high contention for the spinlock of the central free list. With BufferPool, the memory always goes back to its originating arena so this kind of imbalance is less likely to occur. This also dovetails with the long term plan to put most allocations under BufferPool and have each operator in the plan reserve an appropriate amount of memory before execution. Note that the proper reservation mechanism of the exchange node hasn't yet been implemented in this change so the buffer pool client handle used for allocating buffers has an ad-hoc set-up of no reservation limit and using root reservation tracker as parent. This needs to be fixed as part of IMPALA-6524. The default buffer pool limit is also bumped to 85% to account for the extra usage from the exchange nodes. The minimum buffer size is also lowered to 8KB to reduce the amount of memory wastage as a row batch's tuple pointers / tuple data can sometimes be much smaller than 64KB. Testing done: Debug core build. 
Change-Id: If4b1a45f68b9df0d3b539511e15aff15700246f2 Reviewed-on: http://gerrit.cloudera.org:8080/9344 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-20 04:08:11 +00:00
Bikramjeet Vig	61d941fad3	IMPALA-6526: Fix spilling test for running on local FS One of the spilling tests was failing because its minimum buffer pool memory requirement was higher when run on the local FS than when run on HDFS. The fix is to increase the bufferpool limit to a value just above the min limit so that it still forces spill to disk on both filesystems. Testing: Ran core tests with local FS as target file system. Made sure the failing test passed. Change-Id: I50648d7936007a26891cf64d6343c47d9d646596 Reviewed-on: http://gerrit.cloudera.org:8080/9354 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-17 01:40:08 +00:00
Zoltan Borok-Nagy	0eaab69fff	IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows Tuple pointers in the generated row batches may not be initialized if a tuple has byte size 0. Some code compares these uninitialized pointers against nullptr, so leaving them uninitialized may return wrong (and non-deterministic) results, e.g.: impala::TupleIsNullPredicate::GetBooleanVal() The following query produces non-deterministic results currently: SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN ( SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col) WHERE t3.bool_col = true; The alltypestiny table has 8 records, 4 of which have the value true for bool_col. Therefore, the above query is a fancy way of saying "8 * 8 * 4", i.e. it should return 256. The solution is that scanners initialize tuple pointers to a non-null value when there are no materialized slots. This non-null value is provided by the new static member Tuple::POISON. I extended QueryTest/scanners.test with the above query. This test executes the query against all table formats. This change has the biggest performance impact on count(*) queries on large Kudu tables. For my quick benchmark I copied tpch_kudu.lineitem and doubled its data. The resulting table has 12,002,430 rows. Without this patch 'select count(*) from biglineitem' runs for ~0.12s. With the patch applied, the overhead is around a dozen ms. I measured the query on my desktop PC using a release build of Impala. On debug builds, the execution time of the patched version is around 160% of the original version.
Without this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1      | 127.50us | 127.50us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE  | 1      | 22.32ms  | 22.32ms  | 3      | 1          | 0 B       | 0 B           | UNPARTITIONED       |
| 01:AGGREGATE | 3      | 1.78ms   | 1.89ms   | 3      | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN KUDU | 3      | 8.00ms   | 8.28ms   | 12.00M | -1         | 512.00 KB | 0 B           | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
With this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1      | 129.01us | 129.01us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE  | 1      | 33.00ms  | 33.00ms  | 3      | 1          | 0 B       | 0 B           | UNPARTITIONED       |
| 01:AGGREGATE | 3      | 1.99ms   | 2.13ms   | 3      | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN KUDU | 3      | 13.13ms  | 13.97ms  | 12.00M | -1         | 512.00 KB | 0 B           | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
Change-Id: I298122aaaa7e62eb5971508e0698e189519755de Reviewed-on: http://gerrit.cloudera.org:8080/9239 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-16 20:23:21 +00:00
Michael Ho	9d37021927	IMPALA-6512: Fix test_exchange_delays for KRPC The sender timed out error message diverges between Thrift and KRPC slightly due to the source address being not readily available with Thrift RPC implementation. This leads to failure in test_exchange_delays when KRPC is enabled. This change fixes the problem by shortening the error message string to match against. Testing done: Tested with KRPC enabled in the code and verified the tests passed. Change-Id: Idd9410381dbb931231c92f084917265e5067b4c9 Reviewed-on: http://gerrit.cloudera.org:8080/9331 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-15 10:55:55 +00:00
Alex Behm	f5986bebcb	Use unqualified table refs in TPCH planner tests. There were a few places where we accidentally used fully-qualified table references. As a result, the testTpchViews() test did not exactly cover what was intended. Change-Id: I886c451ab61a1739af96eeb765821dfd8e951b07 Reviewed-on: http://gerrit.cloudera.org:8080/9270 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-14 05:54:25 +00:00
Fredy Wijaya	887fb14438	IMPALA-6392: Consistent explain format for parquet predicate statistics In EXPLAIN_LEVEL=2+, change the explain format for parquet predicate statistics to output each tuple descriptor per line. This change is to make it consistent with the output of other predicates. Before: parquet statistics predicates: c_custkey < 10, o_orderkey < 5, l_linenumber < 3 After: parquet statistics predicates: c_custkey < 10 parquet statistics predicates on o: o_orderkey < 5 parquet statistics predicates on o_lineitems: l_linenumber < 3 Testing: - Ran existing planner tests and updated the ones that are affected by this change. - Ran end-to-end tests in query_test Change-Id: Ia3d55ab6a1ae551867a9f68b3622844102cc854e Reviewed-on: http://gerrit.cloudera.org:8080/9223 Tested-by: Impala Public Jenkins Reviewed-by: Alex Behm <alex.behm@cloudera.com>	2018-02-13 21:02:17 +00:00
Bikramjeet Vig	8fc1eccce4	IMPALA-5519: Allocate fragment's runtime filter memory from Buffer pool This patch adds changes to the planner to account for memory used by bloom filters at the fragment instance level. Also adds changes to allocate memory for those bloom filters from the buffer pool. Testing: - Modified Planner Tests and end to end tests to account for memory reservation for the runtime filters. - Modified backend tests and benchmarks to use the bufferpool for bloom filter allocation. - Add an end to end test. - Ran rest of the core tests. Change-Id: Iea2759665fb2e8bef9433014a8d42a7ebf99ce1f Reviewed-on: http://gerrit.cloudera.org:8080/8971 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-13 08:29:03 +00:00
Tim Armstrong	95f1666309	IMPALA-6077: remove Parquet BIT_PACKED def level support The encoding was added in an early version of the Parquet spec and deprecated even in the Parquet 1.0 spec. Parquet-MR switched to generating RLE at the same time as the spec changed in mid-2013. Impala always wrote RLE: see commit `6e293090e6`. The Impala implementation of BIT_PACKED was never correct because it implemented little endian bit unpacking instead of the big endian unpacking required by the spec for levels. Testing: Updated tests to reflect expected behaviour for supported and unsupported def level encodings. Cherry-picks: not for 2.x. Change-Id: I12c75b7f162dd7de8e26cf31be142b692e3624ae Reviewed-on: http://gerrit.cloudera.org:8080/9241 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-12 21:59:37 +00:00
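The endianness mismatch mentioned in IMPALA-6077 can be seen in a small Python sketch. It is not the Impala decoder: `unpack_lsb_first` and `unpack_msb_first` are hypothetical names, and the sketch assumes whole values packed back to back with no padding between them. The first function uses the little-endian bit order of the RLE/bit-packing hybrid encoding, the second the big-endian order that the deprecated BIT_PACKED encoding requires for levels.

```python
from typing import List


def unpack_lsb_first(data: bytes, bit_width: int, count: int) -> List[int]:
    """Unpack values stored starting from the least significant bit of each byte."""
    bits = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(bits >> (i * bit_width)) & mask for i in range(count)]


def unpack_msb_first(data: bytes, bit_width: int, count: int) -> List[int]:
    """Unpack values stored starting from the most significant bit of each byte."""
    bits = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << bit_width) - 1
    return [(bits >> (total_bits - (i + 1) * bit_width)) & mask for i in range(count)]


# The values 0..7 at bit width 3, packed in each bit order.
LSB_PACKED = bytes([0x88, 0xC6, 0xFA])
MSB_PACKED = bytes([0x05, 0x39, 0x77])
```

Decoding `MSB_PACKED` with `unpack_lsb_first` does not yield 0..7, which is the kind of incorrect result a little-endian unpacker produces on big-endian level data.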
Tim Armstrong	3151421902	IMPALA-6495: fix targeted-perf for new column alias syntax I was able to run bin/single_node_perf_run.py after these changes. Change-Id: Ib25139873b1c67f8039ac6c85e936135e008101b Reviewed-on: http://gerrit.cloudera.org:8080/9268 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2018-02-09 16:49:54 +00:00
Thomas Tauber-Marshall	32bf54dfd0	IMPALA-6473: Fix analytic fn that partitions and orders on same expr Previously, an analytic function that used the same expr in both the 'partition by' and 'order by' clauses, and where the expr meets the criteria for being materialized before the sort, would hit an IllegalStateException due to the expr being inserted into the same ExprSubstitutionMap twice. If the values have already been partitioned on the expr, then all of the values for it in each partition will be the same and also ordering on the expr doesn't change the results. So, the fix is to simply exclude the duplicate expr from the 'order by' while still partitioning on it. Testing: - Added a regression test to PlannerTest. Change-Id: Id5f1d5fbc6f69df5850f96afed345ce27668c30b Reviewed-on: http://gerrit.cloudera.org:8080/9218 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-06 03:25:02 +00:00
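The fix described in IMPALA-6473 amounts to dropping order-by expressions that already appear in the partition-by list. A minimal Python sketch of that idea follows; it represents expressions as plain strings and uses a hypothetical helper name, `dedup_order_by`, whereas the real planner works on Expr objects and only applies this to exprs that qualify for materialization before the sort.

```python
from typing import List, Sequence


def dedup_order_by(partition_exprs: Sequence[str], order_by_exprs: Sequence[str]) -> List[str]:
    """Return the order-by exprs that are not already partition-by exprs.

    Within one partition every row has the same value for a partition expr,
    so ordering on it cannot change the result and it can be dropped.
    """
    partition_set = set(partition_exprs)
    return [expr for expr in order_by_exprs if expr not in partition_set]
```

For example, `dedup_order_by(["abs(x)"], ["abs(x)", "y"])` keeps only `"y"` in the ordering while the partitioning on `abs(x)` is unchanged.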
Taras Bobrovytsky	c856b30e36	IMPALA-6475: Enable running TPCH on Kudu Change-Id: I88b66f5db105694b3bcf33360887265996f9059c Reviewed-on: http://gerrit.cloudera.org:8080/9206 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 23:06:13 +00:00
Alex Behm	1a1927b07d	IMPALA-6228: Control stats extrapolation via tbl prop. Introduces a new TBLPROPERTY for controlling stats extrapolation on a per-table basis: impala.enable.stats.extrapolation=true/false The property key was chosen to be consistent with the impalad startup flag --enable_stats_extrapolation and to indicate that the property was set and is used by Impala. Behavior: - If the property is not set, then the extrapolation behavior is determined by the impalad startup flag. - If the property is set, it overrides the impalad startup flag, i.e., extrapolation can be explicitly enabled or disabled regardless of the startup flag. Testing: - added new unit tests - code/hdfs run passed Change-Id: Ie49597bf1b93b7572106abc620d91f199cba0cfd Reviewed-on: http://gerrit.cloudera.org:8080/9139 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 22:56:13 +00:00
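The override rule in IMPALA-6228 can be summarized in a few lines of Python. This is a sketch of the described behavior, not the frontend code; the function name is hypothetical, and treating the property value case-insensitively is an assumption that the commit message does not state.

```python
from typing import Mapping

# Property key taken from the commit message.
TBL_PROP_KEY = "impala.enable.stats.extrapolation"


def stats_extrapolation_enabled(tbl_properties: Mapping[str, str], startup_flag: bool) -> bool:
    """Resolve whether stats extrapolation applies to one table.

    If the table property is unset, the impalad startup flag decides.
    If it is set, it overrides the flag in either direction.
    """
    value = tbl_properties.get(TBL_PROP_KEY)
    if value is None:
        return startup_flag
    # Assumption: any casing of "true" enables, everything else disables.
    return value.strip().lower() == "true"
```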
Lars Volker	fc529b7f9f	IMPALA-5293: Turn insert clustering on by default This change enables clustering by default. IMPALA-2521 introduced the 'clustered' hint which inserts a local sort by the partitioning columns to a query plan. The hint is only effective for HDFS and Kudu tables. Like before, the 'noclustered' hint prevents clustering. If a table has ordering columns defined, the 'noclustered' hint is ignored and we issue a warning. This change removes some tests that were added specifically to test that clustering can be enabled using the 'clustered' hint. It changes some tests to use the 'noclustered' hint to make sure that clustering can be disabled. It also adds tests to make sure that we cover the 'noclustered' case properly. Cherry-picks: not for 2.x. Change-Id: Idbf2368cf4415e6ecfa65058daf6ff87ef62f9d9 Reviewed-on: http://gerrit.cloudera.org:8080/9153 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 05:58:50 +00:00
Taras Bobrovytsky	a493a01645	IMPALA-4924 addendum: Change result type to decimal in a TPCH query Change the expected result type of Kudu TPCH Q17 to Decimal because DECIMAL_V2 is now enabled by default. This was not done earlier because we were not running TPCH on Kudu regularly. Cherry-picks: not for 2.x Change-Id: I46fc038d40969547622707ce77a037494f0ed0a9 Reviewed-on: http://gerrit.cloudera.org:8080/9208 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 05:22:24 +00:00
Gabor Kaszab	097f2f3f3b	IMPALA-6113: Skip row groups with predicates on NULL columns Based on the existing Parquet column chunk level statistics null_count, Impala's Parquet scanner is enhanced to skip an entire row group if the null_count statistics indicate that all the values under the predicated column are NULL as we wouldn't get any result rows from that row group anyway. Change-Id: I141317af0e0df30da8f220b29b0bfba364f40ddf Reviewed-on: http://gerrit.cloudera.org:8080/9140 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-03 03:24:37 +00:00
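The row-group check described in IMPALA-6113 can be sketched as follows. This is an illustration in Python rather than the scanner's C++; the function name and the use of `None` for a missing statistic are assumptions, and the sketch presumes a predicate that can never be satisfied by a NULL value.

```python
from typing import Optional


def can_skip_row_group(null_count: Optional[int], num_values: int) -> bool:
    """Decide whether a row group can be skipped for a predicate on one column.

    The row group is skipped only when the null_count statistic is present
    and says that every value in the column chunk is NULL.
    """
    if null_count is None:
        return False  # statistic missing: nothing can be concluded
    return null_count == num_values
```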
Tianyi Wang	c2184e56ae	IMPALA-5990: End-to-end compression of metadata Currently the catalog data is compressed in the statestore, but uncompressed when passed between FE and BE. It results in a ~2GB limit on the metadata. IMPALA-3499 introduced a workaround in the impalad but there isn't one in the catalogd. This patch aims to increase the size limit for statestore updates, reduce the copying of the metadata and reduce the memory footprint. With this patch, the catalog objects are passed and (de)compressed between FE and BE one at a time. The new limits are: - A single catalog object cannot be larger than ~2GB. - A statestore catalog update cannot be larger than ~4GB. It is compressed size if FLAGS_compact_catalog_topic is true. The behavior of the catalog op executer is not changed. The data is not compressed and the size limit is still 2GB. Testing: Ran existing tests. A test for compressing and decompressing catalog objects is added. Manually tested with a 1.95GB catalog object and a 3.90 GB uncompressed statestore update. Change-Id: I3a8819cad734b3a416eef6c954e55b73cc6023ae Reviewed-on: http://gerrit.cloudera.org:8080/8825 Reviewed-by: Tianyi Wang <twang@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-02 23:17:47 +00:00
Tianyi Wang	f0b3d9d122	IMPALA-3916: Reserve SQL:2016 reserved words This patch reserves SQL:2016 reserved words, excluding: 1. Impala builtin function names. 2. Time unit words (year, month, etc.). 3. An exception list based on a discussion. Some test cases are modified to avoid these words. An impalad and catalogd startup option reserved_words_version is added. The words are reserved if the option is set to "3.0.0". Change-Id: If1b295e6a77e840cf1b794c2eb73e1b9d2b8ddd6 Reviewed-on: http://gerrit.cloudera.org:8080/9096 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-02 01:13:08 +00:00
Vuk Ercegovac	08ca346f2e	IMPALA-3562: support column restriction for compute stats The 'compute stats' statement currently computes column-level statistics for all columns of a table. This adds potentially unneeded work for columns whose stats are not needed by queries. It can be especially costly for very wide tables and unneeded large string fields. This change modifies the 'compute stats' (non-incremental only) to support a user-specified list of columns for which stats should be computed. An example with the extension is as follows: compute stats my_db.my_table(column_a, column_b); While the phrase "for columns ..." is commonly used, since 'compute stats' seems fairly unique (vs. 'analyze table ...'), this change favors brevity with the parenthesized column list. Whereas currently 'compute stats' is applied to the columns that can be analyzed, the 'compute stats' in this change results in an error when a column is specified that cannot be analyzed (e.g., column does not exist, column is of an unsupported type, column is a partitioning column). Moreover, an empty column list can be supplied which means that no columns will be analyzed. Testing: - analyzing a subset of columns is already supported (e.g., not all columns can be analyzed), so the focus with testing is to check that the user-specified columns are handled as expected. - tests include: parser tests, ddl analysis, end-to-end tests. Change-Id: If8b25dd248e578dc7ddd35468125cca12d1b9f27 Reviewed-on: http://gerrit.cloudera.org:8080/9133 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-01 20:27:14 +00:00
Tim Armstrong	acfd169c8e	IMPALA-4319: remove some deprecated query options Adds a concept of a "removed" query option that has no effect but does not return an error when a user attempts to set it. These options are not returned by "set" or "set all" commands that are executed in impala-shell or server-side. These query options have been deprecated for several releases: DEFAULT_ORDER_BY_LIMIT, ABORT_ON_DEFAULT_LIMIT_EXCEEDED, V_CPU_CORES, RESERVATION_REQUEST_TIMEOUT, RM_INITIAL_MEM, SCAN_NODE_CODEGEN_THRESHOLD, MAX_IO_BUFFERS RM_INITIAL_MEM did still have an effect, but it was undocumented and MEM_LIMIT should be used in preference. DISABLE_CACHED_READS also had an effect but it was documented as deprecated. Otherwise the options had no effect at all. Testing: Ran exhaustive build. Updated query option tests to reflect the new behaviour. Cherry-picks: not for 2.x. Change-Id: I9e742e9b0eca0e5c81fd71db3122fef31522fcad Reviewed-on: http://gerrit.cloudera.org:8080/9118 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-01 08:26:26 +00:00
Jinchul	1b1087eb05	IMPALA-3282: Adds regexp_escape built-in function Escapes the following special characters in RE2 library: .\+*?[^]$(){}=!<>|:- Testing: Add some unit tests into ExprTest.StringRegexpFunctions Add some E2E tests into exprs.test Change-Id: I84c3e0ded26f6eb20794c38b75be9b25cd111e4b Reviewed-on: http://gerrit.cloudera.org:8080/8900 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-01 05:14:14 +00:00
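To illustrate what escaping the listed characters means, here is a minimal Python sketch. It is not the built-in's implementation: it assumes each special character is simply prefixed with a backslash, and the Python function name merely mirrors the SQL function for readability.

```python
# Special characters listed in the commit message.
SPECIAL_CHARS = frozenset(r".\+*?[^]$(){}=!<>|:-")


def regexp_escape(text: str) -> str:
    """Prefix every listed special character with a backslash (assumed behavior)."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)
```

Under that assumption, `regexp_escape("a.b")` returns the four-character string `a\.b`, so the dot is matched literally when the result is used as a pattern.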
Taras Bobrovytsky	0a1d586d2a	IMPALA-4924: Enable Decimal V2 by default In this commit we enable Decimal_V2 by default. We also update the expected results in many of our tests. Testing: Ran an exhaustive test which almost passed. Updated the few failed tests in it. Cherry-pick: not for 2.x Change-Id: Ibbdd05bf986b7947f106b396017faa3a0bd87fd7 Reviewed-on: http://gerrit.cloudera.org:8080/9062 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-25 04:33:11 +00:00
Zoltan Borok-Nagy	545e60f832	IMPALA-5191: Standardize column alias behavior We should not perform alias substitution in the subexpressions of GROUP BY, HAVING, and ORDER BY to be more standard conformant. === Allowed === SELECT int_col / 2 AS x FROM functional.alltypes GROUP BY x; SELECT int_col / 2 AS x FROM functional.alltypes ORDER BY x; SELECT NOT bool_col AS nb FROM functional.alltypes GROUP BY nb HAVING nb; === Not allowed === SELECT int_col / 2 AS x FROM functional.alltypes GROUP BY x / 2; SELECT int_col / 2 AS x FROM functional.alltypes ORDER BY -x; SELECT int_col / 2 AS x FROM functional.alltypes GROUP BY x HAVING x > 3; Some extra checks were added to AnalyzeExprsTest.java. I had to update other tests to make them pass since the new behavior is more restrictive. I added alias.test to the end-to-end tests. Cherry-picks: not for 2.x. Change-Id: I0f82483b486acf6953876cfa672b0d034f3709a8 Reviewed-on: http://gerrit.cloudera.org:8080/8801 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-24 22:47:18 +00:00
aphadke	11d1784c0a	IMPALA-6435: Disable codegen for CHAR literals. Currently we do not codegen CHAR types. This change checks for CHAR literals in an expr and disables codegen. Change-Id: I7e4e27350c53bc69ce412a004e392e7480214f73 Reviewed-on: http://gerrit.cloudera.org:8080/9102 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-24 05:00:33 +00:00
Gabor Kaszab	7f652ce659	IMPALA-5654: Disallow setting Kudu table name in CREATE TABLE This change disallows explicitly setting the Kudu table name property for managed Kudu tables in a CREATE TABLE statement. The Kudu table name property gets a generated value of the form 'impala::db_name.table_name', where table_name is the one given in the CREATE TABLE statement. Providing the Kudu table name property when creating a managed Kudu table results in an error without creating the table. E.g.: CREATE TABLE t (i INT) STORED AS KUDU TBLPROPERTIES('kudu.table_name'='some_name'); Alongside the CREATE TABLE statement, the ALTER TABLE statement is also changed to disallow modifying kudu.table_name of managed Kudu tables. Change-Id: Ieca037498abf8f5fde67b77e824b720482cdbe6f Reviewed-on: http://gerrit.cloudera.org:8080/8820 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-22 22:34:30 +00:00
Vuk Ercegovac	db98dc6504	IMPALA-4993: extend dictionary filtering to collections Currently, top-level scalar columns in parquet files can be used at runtime to prune row-groups by evaluating certain conjuncts over the column's dictionary (if available). This change extends such pruning to scalar values that are stored in collection type columns. Currently, dictionary pruning works by finding eligible conjuncts for top-level slots. Since only top-level slots are supported, the slots are implicitly part of the scan node's tuple descriptor. With this change, we track eligible conjuncts by slot as well as the tuple that contains the slot (either top-level or nested collection). Since collection conjuncts are already managed by a map that associates tuple descriptors to a list of their conjuncts, this extension follows the existing representation. The frontend builds the mapping of SlotId to conjuncts that are dictionary filterable. This mapping now includes SlotId's that reference nested tuples. The backend is adjusted to use the same representation. In addition, collection readers are decomposed into scalar filterable columns and other, non-dictionary filterable readers. When filtering a row group using a conjunct associated to a (possibly) nested collection type, an additional tuple buffer is allocated per tuple descriptor. Testing: - e2e test extended to illustrate row-groups that are pruned by nested collection dictionary filters. Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd Reviewed-on: http://gerrit.cloudera.org:8080/8775 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-19 20:37:25 +00:00
Tim Armstrong	579e33207b	IMPALA-6368: make test_chars parallel Previously it had to be executed serially because it modified tables in the functional database. This change separates out tests that use temporary tables and runs those in a unique_database. Testing: Ran locally in a loop with parallelism of 4 for a while. Change-Id: I2f62ede90f619b8cebbb1276bab903e7555d9744 Reviewed-on: http://gerrit.cloudera.org:8080/9022 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-19 09:55:52 +00:00
Bikramjeet Vig	028a83e654	IMPALA-6382: Cap spillable buffer size and max row size query options Currently the default and min spillable buffer size and max row size query options accept any valid int64 value. Since the planner depends on these values for memory estimations, if a very large value close to the limits of int64 is set, the variables representing or relying on these estimates can overflow during different phases of query execution. This patch puts a reasonable upper limit of 1TB to these query options to prevent such a situation. Testing: Added backend query option tests. Change-Id: I36d3915f7019b13c3eb06f08bfdb38c71ec864f1 Reviewed-on: http://gerrit.cloudera.org:8080/9023 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-18 23:08:26 +00:00
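The cap introduced by IMPALA-6382 can be pictured as a simple range check. The Python sketch below is an illustration only: the function name is hypothetical, 1TB is assumed to mean 2^40 bytes, and the upper bound is assumed to be inclusive, none of which the commit message spells out.

```python
# Assumption: "1TB" is interpreted as 2**40 bytes.
MAX_OPTION_BYTES = 1 << 40


def validate_size_option(name: str, value: int) -> int:
    """Reject values outside [0, 1TB] so later memory estimates cannot overflow int64."""
    if value < 0 or value > MAX_OPTION_BYTES:
        raise ValueError(
            f"{name} must be between 0 and {MAX_OPTION_BYTES} bytes, got {value}"
        )
    return value
```

A value such as `2**63 - 1` is a valid int64 but is rejected here, which is the situation the commit describes.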
Taras Bobrovytsky	35a3e186d6	IMPALA-5478: Run TPCDS queries with decimal_v2 enabled We add new TPCDS .test files that are expected to be run with decimal_v2 enabled. The new expected results were generated using Impala and I inspected them manually. Change-Id: Ib867c51a521ec4a087bc127d99aee4b95ba97733 Reviewed-on: http://gerrit.cloudera.org:8080/8985 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-18 03:28:51 +00:00
Joe McDonnell	d9b6fd0730	IMPALA-6386: Invalidate metadata at table level for dataload Dataload currently executes bin/load-data.py for TPC-H, TPC-DS, and functional-query concurrently. One of the final steps for bin/load-data.py is to run a global "invalidate metadata". Global "invalidate metadata" commands are known to cause problems on concurrent systems. See IMPALA-5087. For dataload, if TPC-H executes "invalidate metadata" while TPC-DS is still creating tables and adding partitions, the TPC-DS executor might erroneously believe that a table does not exist. This changes dataload to invalidate metadata at an individual table level rather than globally. This prevents the concurrency issue. This also changes the names of some of the intermediate SQL files generated by generate-schema-statements.py and consumed by load-data.py to make them less confusing. Change-Id: Ibc3a6d8a674a0bf6b02069bfe8a5e12034335b1f Reviewed-on: http://gerrit.cloudera.org:8080/9009 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-17 22:52:58 +00:00
Tianyi Wang	6cc76d7201	IMPALA-6353: Fix crash in snappy decompressor SnappyDecompressor::MaxOutputLen assumes the input pointer to be non-null. It's not true when the parquet file is corrupted and the compressed_page_size field in a page header is 0. This patch handles this error instead of failing a DCHECK. Testing: A bad parquet file with 0 compressed_page_size is added. It crashes impala without this patch. Change-Id: I0d42937aab92a74f8e104d2f7fcd64dc24f6a500 Reviewed-on: http://gerrit.cloudera.org:8080/8977 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-17 04:18:24 +00:00
Taras Bobrovytsky	f8b406222d	IMPALA-6388: Fix the Union node number of hosts estimation Before this patch, we would estimate the number of hosts for the union node by looking only at the first union operand. This is obviously incorrect and led us to underestimate the value. We fix the problem by setting the estimate to be the maximum of its children. Testing: - Added a planner test that reproduces the issue Change-Id: I51e1ecca8dbc84b2b5a72708667b2799d00279f0 Reviewed-on: http://gerrit.cloudera.org:8080/9017 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-17 01:41:16 +00:00
Adam Holley	4c43cace87	IMPALA-4323: "SET ROW FORMAT" option added to "ALTER TABLE" command Examples of new command: ALTER TABLE t1 SET ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'; ALTER TABLE t1 SET ROW FORMAT DELIMITED LINES TERMINATED BY '\001'; Testing: Added parser tests and unit tests for alter statements including partition options. Change-Id: I96e347463504915a6f33932552e4d1f61e9b1154 Reviewed-on: http://gerrit.cloudera.org:8080/8928 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-16 23:58:24 +00:00
Bharath Vissapragada	20daa4d516	IMPALA-6384: RequestPoolService should honor custom group mapping config Due to the way in which we instantiate the fair scheduler allocation loader, we do not read the config overrides from the HDFS config files. This is unexpected behavior from users' POV since we typically support overrides like custom user -> group mapping via HDFS config (for ex: LDAPGroupsMapping) that eventually affects the query -> pool assignment. Fix: This patch loads the hadoop default configuration so that the underlying QueuePlacementPolicy is based on user specified overrides. Testing (manual): Changed the core-site.xml to use LDAPGroupsMapping instead of the default ShellBasedUnixGroupsMapping and confirmed that the correct group mapping plugin is loaded, by adding additional logging. Also, modified TestRequestPoolService to assert that the core-site xml overrides are loaded. Change-Id: Ibb93870c0cc37e2432a643a274931f1d3d13fb96 Reviewed-on: http://gerrit.cloudera.org:8080/9000 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-11 22:52:29 +00:00
Jinchul	6041865031	IMPALA-3651: Adds murmur_hash() built-in function murmur_hash relies on HashUtil::MurmurHash2_64, which is the 64-bit version of MurmurHash2. Testing: Add unit tests for primitive types: ExprTest.MurmurHashFunction Add E2E tests into exprs.test Change-Id: I14d56ffb8fab256f3f66a2669271fd4b3c50cc29 Reviewed-on: http://gerrit.cloudera.org:8080/8893 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 20:17:26 +00:00
Taras Bobrovytsky	c86b0a9736	IMPALA-5014: Part 2: Round when casting decimal to timestamp When there are too many digits to the right of the dot in a decimal, we would always truncate when casting to timestamp. In this patch we change the behavior to round instead of truncating when decimal_v2 is enabled. Testing: - Added some EE tests, ran BE tests on my machine. Change-Id: I8fb3a7d976ab980b8572d7e9524850572bad57da Reviewed-on: http://gerrit.cloudera.org:8080/8969 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 05:47:23 +00:00
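The difference between truncating and rounding described in IMPALA-5014 can be illustrated with a minimal Python sketch. It assumes the timestamp keeps nine fractional digits (nanoseconds) and that rounding is half-up; both are assumptions made for illustration, and `to_fractional_seconds` is a hypothetical helper, not an Impala function.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

# Assumed target precision: nine fractional digits (nanoseconds).
NANOS = Decimal("1e-9")


def to_fractional_seconds(value, decimal_v2):
    """Hypothetical helper: reduce a decimal number of seconds to nine
    fractional digits, truncating in the old mode and rounding (assumed
    half-up) when decimal_v2 is enabled."""
    mode = ROUND_HALF_UP if decimal_v2 else ROUND_DOWN
    return value.quantize(NANOS, rounding=mode)


# Ten fractional digits: one more than the assumed precision can hold.
seconds = Decimal("1.0000000005")
truncated = to_fractional_seconds(seconds, decimal_v2=False)
rounded = to_fractional_seconds(seconds, decimal_v2=True)
```

Under these assumptions the two modes differ by exactly one unit in the last retained digit whenever the first dropped digit is 5 or greater.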
Jinchul	99962d2e81	IMPALA-4168: Adds Oracle-style hint placement for INSERT/UPSERT Allow specifying Oracle-style hints on INSERT/UPSERT statements. For example, - insert /* +noshuffle */ into table functional.alltypes partition(year, month) select * from functional.alltypes; - upsert /* +noshuffle */ into functional_kudu.alltypes select * from functional.alltypes; Testing: Add unit tests to ParserTest#TestPlanHints Add plan check tests to PlannerTest#testInsert, PlannerTest#testKuduUpsert Add tests to ToSqlTest#planHintsTest Change-Id: Ied7629d70197a0270cdc0853e00cc021fdb4dc20 Reviewed-on: http://gerrit.cloudera.org:8080/8676 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 03:03:49 +00:00
aphadke	38461c524f	IMPALA-5052: Read and write signed integer logical types in Parquet This patch maps a signed integer logical type in parquet to a supported Impala column type. This change introduces the following mapping - INT_8 -> TINYINT INT_16 -> SMALLINT INT_32 -> INT INT_64 -> BIGINT Also, added a parquet file with the following schema for testing - schema { optional int32 id; optional int32 tinyint_col (INT_8); optional int32 smallint_col (INT_16); optional int32 int_col; optional int64 bigint_col; } Change-Id: I47a8371858c9597c6a440808cf6f933532468927 Reviewed-on: http://gerrit.cloudera.org:8080/8548 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Tianyi Wang <twang@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-09 04:55:59 +00:00
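The mapping listed in IMPALA-5052 can be written down as a small lookup table. The sketch below is only an illustration in Python: `impala_type_for` is a hypothetical helper, and the fallback for unannotated columns (int32 -> INT, int64 -> BIGINT) is an assumption read off the column names in the test schema, not taken from Impala's code.

```python
# Logical type annotation -> (Parquet physical type, Impala column type),
# following the mapping stated in the commit message.
SIGNED_INT_MAPPING = {
    "INT_8": ("int32", "TINYINT"),
    "INT_16": ("int32", "SMALLINT"),
    "INT_32": ("int32", "INT"),
    "INT_64": ("int64", "BIGINT"),
}

# Assumed fallback when a column carries no logical type annotation.
UNANNOTATED = {"int32": "INT", "int64": "BIGINT"}


def impala_type_for(physical_type, logical_type=None):
    """Hypothetical helper: resolve the Impala column type of a Parquet column."""
    if logical_type is None:
        return UNANNOTATED[physical_type]
    expected_physical, impala_type = SIGNED_INT_MAPPING[logical_type]
    if physical_type != expected_physical:
        raise ValueError(
            f"{logical_type} annotates {expected_physical}, not {physical_type}"
        )
    return impala_type
```

For the test schema in the commit message, `impala_type_for("int32", "INT_8")` resolves `tinyint_col` and `impala_type_for("int64")` resolves the unannotated `bigint_col`.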
Tianyi Wang	c4d950b9e9	IMPALA-3887: Wait for HDFS replication in data loading When the data loading finishes, it is possible for some HDFS blocks to be under replicated. If impala gets the metadata before the replication is done, some tests may fail. This patch adds a replication waiting step in the data loading script. Resubmitted with filesystem type check. Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8 Reviewed-on: http://gerrit.cloudera.org:8080/8916 Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-09 03:24:36 +00:00
Tim Armstrong	d3ff67b8b3	IMPALA-6370: fix partitioned parquet tables with nested types When materialising a nested collection, has_template_tuple() should use the template tuple for the collection, not the top-level tuple. Testing: Added tests based on nested-types-basic.test that operate on a simple partitioned table. The tests reliably crashed Impala before the fix. Change-Id: Ic808b824ce3b31af0539036d8ca23d17b18deab4 Reviewed-on: http://gerrit.cloudera.org:8080/8947 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-05 20:44:21 +00:00
Thomas Tauber-Marshall	96b976aff3	IMPALA-6295: Fix min/max handling of 'nan' and 'inf' This patch fixes several issues related to the min/max aggregate functions and their handling of 'nan' and 'inf': - Previously, if 'inf' or '-inf' was the only value for the min/max and codegen was being used, the result would be incorrect. This occurred, for example in the case of 'inf' and 'min', because we set an initial value of numeric_limits::max, which is less than 'inf', so the returned min was numeric_limits::max when it should be 'inf'. The fix is to set the initial value to numeric_limits::infinity. - Previously, if one of the values was 'nan', the result of min/max was non-deterministic depending on the order the values were evaluated in. This occurs because 'nan' < or > 'any value' is always false, so if the first value added was 'nan', all other comparisons would be false and 'nan' would be returned, whereas if the first value wasn't 'nan' then the 'nan' wouldn't be returned. The fix is to treat 'nan' specially and to always return 'nan' if there is a single 'nan' value. Testing: - Added e2e tests for both scenarios, as well as adding a little extra nan/inf coverage for other aggregate functions. Change-Id: Ia1e206105937ce5afc75ca5044597d39b3dc6a81 Reviewed-on: http://gerrit.cloudera.org:8080/8854 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-04 01:23:43 +00:00
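Both problems described in IMPALA-6295 are easy to reproduce outside Impala. The following Python sketch is an illustration only: the three functions are hypothetical stand-ins for the aggregate's update logic, `sys.float_info.max` plays the role of `numeric_limits::max`, and empty inputs (which a real aggregate would turn into NULL) are ignored.

```python
import math
import sys


def min_with_finite_init(values):
    """Old behavior, first bug: the initial value is the largest finite
    double, so an input consisting only of 'inf' never replaces it."""
    result = sys.float_info.max
    for v in values:
        if v < result:
            result = v
    return result


def min_by_comparison_only(values):
    """Old behavior, second bug: every comparison with 'nan' is false, so
    the result depends on whether 'nan' happens to come first."""
    result = values[0]
    for v in values[1:]:
        if v < result:
            result = v
    return result


def min_fixed(values):
    """Fixed behavior as described: start from infinity and return 'nan'
    as soon as any input is 'nan'."""
    result = math.inf
    for v in values:
        if math.isnan(v):
            return math.nan
        if v < result:
            result = v
    return result
```

With these definitions, `min_with_finite_init([math.inf])` returns the largest finite double instead of `inf`, and `min_by_comparison_only` gives `nan` for `[nan, 1.0]` but `1.0` for `[1.0, nan]`, while `min_fixed` returns `inf` and `nan` respectively regardless of order.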
Bikramjeet Vig	545163bb0a	IMPALA-5929: Remove redundant explicit casts to string This patch adds a query rewriter to remove redundant explicit casts to a string type (string, char, varchar) from binary predicates of the form "cast(<non-const expr> to <string type>) <eq/ne op> <string constant>". The cast is redundant if the predicate evaluation is the same even if the cast is removed and the constant is converted to the original type of the expression. For example: cast(int_col as string) = '123456' -> int_col = 123456 Performance: For the following query on a table having 6001215 records - select * from tpch.lineitem where cast(l_linenumber as string) = '0' +-----------------+-----------+--------+ \| \| Scan Time \| +-----------------+-----------+--------+ \| \| Avg \| St dev \| \| Without rewrite \| 1s406ms \| 44ms \| \| With rewrite \| 1s099ms \| 28ms \| +-----------------+-----------+--------+ Testing: - Added unit tests to ExprRewriteRulesTest - Added functional test to expr.test - Current FE planner tests and BE expr-test run successfully with this change. Change-Id: I91b7c6452d0693115f9b9ed9ba09f3ffe0f36b2b Reviewed-on: http://gerrit.cloudera.org:8080/8660 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-03 01:15:42 +00:00
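The redundancy condition in IMPALA-5929 ("the predicate evaluation is the same even if the cast is removed") can be illustrated for integer columns with a round-trip check. This Python sketch is not Impala's rewrite rule; `try_remove_cast` is a hypothetical helper, and the round-trip test is an assumed, conservative way to decide when dropping the cast is safe.

```python
def try_remove_cast(constant):
    """Hypothetical helper for predicates of the form
    cast(int_col as string) = '<constant>'.

    Returns the integer literal to compare int_col against when the cast
    can be dropped, or None when the cast must stay."""
    try:
        value = int(constant)
    except ValueError:
        # Not an integer at all, e.g. 'abc': keep the cast.
        return None
    # Only safe if the integer prints back to exactly the same string;
    # '0123' fails because cast(123 as string) is '123', not '0123'.
    return value if str(value) == constant else None
```

For the example in the commit message, `try_remove_cast('123456')` yields `123456`, so the predicate becomes `int_col = 123456`. A constant such as `'0123'` yields `None`: no integer casts to that string, so comparing against `123` would change the result.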
Philip Zeyliger	f755910e97	Remove unused deps, centralize some pom versions, upgrade SLF4J and commons-io. As a follow-on to centralizing into one parent pom, we can now manage thirdparty dependency versions in Java a little bit more clearly. Upgrades SLF4J, commons.io: slf4j: 1.7.5 -> 1.7.25 commons.io: 2.4 -> 2.6 The SLF4J upgrade is nice to be able to run under Java9. The release notes at https://www.slf4j.org/news.html are uneventful. Commons IO 2.6 supports Java 9 and is source and binary compatible, per https://commons.apache.org/proper/commons-io/upgradeto2_6.html and https://commons.apache.org/proper/commons-io/upgradeto2_5.html. Removes the following dependencies: htrace-core hadoop-mapreduce-client-core hive-shims com.stumbleupon:async commons-dbcp jdo-api I ran "mvn dependency:analyze" and these were some (but not all) of the "Unused declared dependencies found." Spelunking in git logs, these dependencies are from 2013 and possibly from an effort to run with dependencies from the filesystem. They don't seem to be required anymore. Stops pulling in an old version of hadoop-client and kite-data-core in testdata/TableFlattener by using the same versions as the Hadoop we use. Doing so was unnecessarily causing us to download extra, old Hadoop jars, and the new Hadoop jars seem to work just as well. This is the kind of divergence that centralizing the versions into variables will help with. Creates variables for: junit.version slf4j.version hadoop.version commons-io.version httpcomponents.core.version thrift.version kite.version (controlled via $IMPALA_KITE_VERSION in impala-config.sh) Cleans up unused IMPALA_PARQUET_URL variables in impala-config.sh. We only download Parquet via Maven, rather than downloading it in the toolchain, so this variable wasn't doing anything. I ran the core tests with this change. Change-Id: I717e0625dfe0fdbf7e9161312e9e80f405a359c5 Reviewed-on: http://gerrit.cloudera.org:8080/8853 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-20 22:04:18 +00:00
David Knupp	2fb11fb732	Revert "IMPALA-3887: Wait for HDFS replication in data loading" Using fsck breaks non-HDFS builds: local, S3, and Isilon. This reverts commit `5a7c10ec3d`. Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc Reviewed-on: http://gerrit.cloudera.org:8080/8880 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-19 23:26:29 +00:00
Alex Behm	1f7b3b00e9	IMPALA-5310: Part 3: Use SAMPLED_NDV() in COMPUTE STATS. Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV() function. Testing: - modified/improved existing functional tests - core/hdfs run passed Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b Reviewed-on: http://gerrit.cloudera.org:8080/8840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 04:58:59 +00:00
Tianyi Wang	5a7c10ec3d	IMPALA-3887: Wait for HDFS replication in data loading When the data loading finishes, it is possible for some HDFS blocks to be under replicated. If impala gets the metadata before the replication is done, some tests may fail. This patch adds a replication waiting step in the data loading script. Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322 Reviewed-on: http://gerrit.cloudera.org:8080/8846 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 04:53:56 +00:00
Taras Bobrovytsky	7256fcefb4	IMPALA-6284: Mark the intermediate decimal avg struct as packed We saw some failures on the exhaustive release build because the compiler assumed that the pointer to the intermediate struct that is used for computing decimal average was aligned. To fix the problem, we mark the struct with a "packed" attribute so that the compiler does not expect it to be aligned. Testing: - Ran the failing test locally on a release build and it passed. Change-Id: Id25ec6e20dde3f50fb37a22135b355ad251809e0 Reviewed-on: http://gerrit.cloudera.org:8080/8836 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 03:26:43 +00:00

1779 Commits