Commit Graph

1321 Commits

Author SHA1 Message Date
Csaba Ringhofer
fb28011b57 IMPALA-9707: fix Parquet stat filtering when min/max values are cast to NULL
The min/max stat predicate is allowed when the left side is not a slot
but an implicit cast of a slot. This could lead to incorrectly dropping
a row group or page when the min/max values were not castable to the
target type, e.g. a string containing a pre-1400 date that we want to
cast to a timestamp.

The change should only affect timestamps, as dates return an error
on failed cast from a string, and numeric types won't be cast
implicitly from string.

The fix is simply to accept NULL result for the min/max predicate in
the backend. Note that the alternative solution of casting the right
(const) side of the predicate instead of the left side would be tricky,
as more than one string can mean the same timestamp, e.g.
"1970-01-01" and "1970-01-01 00:00:00".

Testing:
- added an EE regression test and ran it

Change-Id: I35f66e1dfc4523624c249073004f9d5eddd07bb6
Reviewed-on: http://gerrit.cloudera.org:8080/15959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-21 18:40:20 +00:00
Tim Armstrong
e5777f0eb8 IMPALA-8834: Short-circuit partition key scan
This adds a new version of the pre-existing partition
key scan optimization that always returns correct
results, even when files have zero rows. The new
version is enabled by default. The old optimization,
which does a metadata-only query, remains available
behind the OPTIMIZE_PARTITION_KEY_SCANS query option.

The new version of the optimization must scan the files
to see if they are non-empty. Instead of using metadata
only, the planner instructs the backend to short-circuit HDFS
scans after a single row has been returned from each
file. This gives results equivalent to returning all
the rows from each file, because all rows in the file
belong to the same partition and therefore have identical
values for any columns that are partition key values.

Planner cardinality estimates are adjusted accordingly
to enable potentially better plans and other optimisations
like disabling codegen.

We make some effort to avoid generating extra scan ranges
for remote scans by only generating one range per remote
file.

The backend optimisation is implemented by constructing a
row batch with capacity for a single row only and then
terminating each scan range once a single row has been
produced.  Both Parquet and ORC have optimized code paths
for zero-slot table scans, which means this will only result
in a footer read. (Other file formats still need to read
some portion of the file, but can terminate early once
one row has been produced.)
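
A condensed, hypothetical sketch of the short-circuit (the real scanner
logic is considerably more involved):

  // Heavily simplified: the output batch is created with capacity for
  // one row, and the scan range stops as soon as that row is produced.
  struct RowBatch {
    explicit RowBatch(int capacity) : capacity_(capacity) {}
    void AddRow() { ++num_rows_; }
    bool AtCapacity() const { return num_rows_ >= capacity_; }
    int capacity_;
    int num_rows_ = 0;
  };

  void ScanRangeForPartitionKeys(int rows_in_file) {
    RowBatch batch(/*capacity=*/1);
    for (int i = 0; i < rows_in_file; ++i) {
      batch.AddRow();                  // partition key slots only
      if (batch.AtCapacity()) return;  // one row proves non-emptiness
    }
    // Falling through means the file had zero rows: no row is emitted,
    // so the result is correct, unlike the metadata-only optimization.
  }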

This should be quite efficient in practice with file handle
caching and data caching enabled, because it then only
requires reading the footer from the cache for each file.

The partition key scan optimization is also slightly
generalised to apply to scans of unpartitioned tables
where no slots are materialized.

Also fixed a limitation where the optimization did not apply
to multiple grouping classes.

Limitations:
* This still scans every file in the partition. I.e. there is
  no short-circuiting if a row has already been found in the
  partition by the current scan node.
* Resource reservations and estimates for the scan node do
  not all take into account this optimisation, so are
  conservative - they assume the whole file is scanned.

Testing:
* Added end-to-end tests that execute the query on all
  HDFS file formats and verify that the correct number of rows
  flow through the plan.
* Added planner test based on the existing test partition key
  scan test.
* Added test to make sure single node optimisation kicks in
  when expected.
* Add test for cardinality estimates with and without stats
* Added test for unpartitioned tables.
* Added planner test that checks that optimisation is enabled
  for multiple aggregation classes.
* Added a targeted perf test.

Change-Id: I26c87525a4f75ffeb654267b89948653b2e1ff8c
Reviewed-on: http://gerrit.cloudera.org:8080/13993
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 23:03:23 +00:00
Zoltan Borok-Nagy
f8015ff68d IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list
Minor compactions can compact several delta directories into a single
delta directory. The current directory filtering algorithm had to be
modified to handle minor compacted directories and prefer those over
plain delta directories. This happens in the Frontend, mostly in
AcidUtils.java.

Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.

E.g. we can have the following delta directory:

full_acid/delta_0000001_0000010/0000 # minWriteId: 1
                                     # maxWriteId: 10

This delta dir contains rows with write ids between 1 and 10. But maybe
we are only allowed to see write ids less than 5. Therefore we need to
check the ACID write id column (named originalTransaction) to determine
which rows are valid.

Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:

full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
                                       # every row is valid

If there's no visibilityTxnId, then the directory was created via Hive
Streaming, so we need to validate rows. Fortunately, Hive Streaming
writes rows with different write ids into different ORC stripes, so we
don't need to validate the write id per row. If we had statistics we
could validate per stripe, but since Hive Streaming doesn't write
statistics, we validate the write id per ORC row batch. (An alternative
would be a 2-pass read: first read a single value from each stripe's
'currentTransaction' field, then read the stripe only if its write id
is valid.)
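
A sketch of the per-batch check; the types and names below are
stand-ins, not Impala's actual classes:

  #include <cstdint>
  #include <vector>

  // Stand-in for the valid write id list obtained from the metastore.
  struct ValidWriteIdList {
    int64_t high_watermark;
    std::vector<int64_t> invalid_write_ids;  // open/aborted below HWM
    bool IsValid(int64_t write_id) const {
      if (write_id > high_watermark) return false;
      for (int64_t inv : invalid_write_ids) {
        if (inv == write_id) return false;
      }
      return true;
    }
  };

  // Because Hive Streaming separates write ids by stripe, one
  // 'currentTransaction' value can stand in for a whole ORC row batch.
  bool RowBatchIsVisible(const ValidWriteIdList& valid_ids,
                         int64_t current_transaction_of_batch) {
    return valid_ids.IsValid(current_transaction_of_batch);
  }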

Testing
 * the frontend logic is tested in AcidUtilsTest
 * the backend row validation is tested in test_acid_row_validation

Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 21:00:44 +00:00
Joe McDonnell
3e76da9f51 IMPALA-9708: Remove Sentry support
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.

Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
   "ranger" requires extra parameters to be set. This changes the
   default value of authorization_provider to "", which translates
   internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
 - authorization_policy_provider_class
 - sentry_catalog_polling_frequency_s
 - sentry_config
3. The authorization_factory_class may be obsolete now that
   there is only one authorization policy, but this leaves it
   in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
   that is removed. There are still Maven dependencies coming
   from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
   is not removed and it is still called from testdata/bin/kill-all.sh.

Testing:
 - Core job passes

Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 17:43:40 +00:00
Csaba Ringhofer
7b8d2e1f78 IMPALA-9753: Fix TRUNCATE of ACID tables on S3
The use of the HDFS API was incorrect when creating an empty file
in the new base dir during truncate. Simply calling Create(Path)
does create the file in HDFS, but on S3 it is only created when
the returned stream is closed.
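
In libhdfs terms, a sketch of the corrected pattern (not the exact
Impala call site, which goes through its own filesystem wrapper):

  #include <fcntl.h>
  #include <hdfs.h>  // libhdfs; header location varies by distribution

  // Opening the file for write creates it immediately on HDFS, but on
  // S3 the object only appears once the stream is closed, so the close
  // is what actually creates the empty file in the new base directory.
  int CreateEmptyFile(hdfsFS fs, const char* path) {
    hdfsFile file = hdfsOpenFile(fs, path, O_WRONLY, 0, 0, 0);
    if (file == nullptr) return -1;
    return hdfsCloseFile(fs, file);  // must not be skipped on S3
  }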

Testing:
- Acid truncate tests do not run on S3 as they need a running
  Hive server. Added a regression test that will run on S3 too. It
  would be nice to run all tests on S3, but this is out of the scope
  of this change.

Change-Id: I96d315638b669c5c7198a8e47939cb2b236e35bb
Reviewed-on: http://gerrit.cloudera.org:8080/15940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-18 16:51:42 +00:00
Zoltan Borok-Nagy
f2f4c9891a IMPALA-9091: Fix flaky query_test.test_scanners.TestScannerReservation.test_scanners
query_test.test_scanners.TestScannerReservation.test_scanners was flaky.
It checks the average value of ParquetRowGroupIdealReservation, which is
almost always 3.50 MB. However, very rarely it is a bit different, e.g.
3.88 MB or 4.12 MB. I wasn't able to reproduce this problem; it is
probably due to some randomness during data loading.

I modified the test to accept any value between 3.0 and 4.99 MB. I think
values in this range are acceptable for this test.

Change-Id: I668d0ccd77a62059284e76fee51efb08bef580eb
Reviewed-on: http://gerrit.cloudera.org:8080/15923
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-15 21:46:38 +00:00
Tim Armstrong
fcf08d1822 IMPALA-9725: incorrect spilling join results for wide keys
The control flow was broken if the join operator hit
the end of the expression values cache before the end
of the probe batch, immediately after processing a row
for a spilled partition. In NextProbeRow(), the problematic
code path was:
* The last row in the expression values cache was for a
  spilled partition, so skip_row=true and it falls out
  of the loop with 'current_probe_row_' pointing to that
  row.
* probe_batch_iterator->AtEnd() is false, because
  the expression value cache is smaller than the probe batch,
  so 'current_probe_row_' is not nulled out.

Thus we end up in a state where 'current_probe_row_' is
set, but 'hash_table_iterator_' is unset.

In the case of a left anti join, this was interpreted by
ProcessProbeRowLeftSemiJoins() as meaning that there was
no hash table match for 'current_probe_row_', and it
therefore returned the row.

This bug could only occur under specific circumstances:
* The join key takes up > 256 bytes in the expression values
  cache (assuming the default batch size of 1024).
* The join spilled.
* The join operator returns rows that were unmatched in
  the right input, i.e. LEFT OUTER JOIN, LEFT ANTI JOIN,
  FULL OUTER JOIN.

The core of the fix is to null out 'current_probe_row_' when
falling out the bottom of the loop in NextProbeRow(). Related
DCHECKS were fixed and some control flow was slightly
simplified.
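
A stripped-down, hypothetical sketch of the corrected loop exit (all
names are stand-ins for the real hash join code):

  struct Row {};
  struct ExprValuesCache {
    bool AtEnd() const { return pos == size; }
    const Row* Current() const { return &rows[pos]; }
    void Advance() { ++pos; }
    const Row* rows = nullptr;
    int pos = 0;
    int size = 0;
  };

  // Stand-in for the spilled-partition check on the probe row.
  bool BelongsToSpilledPartition(const Row* row) { return false; }

  // If the cache ends right after a row for a spilled partition, we
  // must return nullptr rather than leave a stale probe row that a
  // LEFT ANTI JOIN would misread as "probe row with no match".
  const Row* NextProbeRow(ExprValuesCache* cache) {
    while (!cache->AtEnd()) {
      const Row* row = cache->Current();
      cache->Advance();
      if (!BelongsToSpilledPartition(row)) return row;  // process row
    }
    return nullptr;  // the fix: null the probe row on every exit path
  }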

Testing:
Added a test query on TPC-H that reproduces the problem reliably.

Ran exhaustive tests.

Change-Id: I9d7e5871c35a90e8cf24b8dded04775ee1eae9d8
Reviewed-on: http://gerrit.cloudera.org:8080/15904
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-15 16:49:32 +00:00
Tim Armstrong
506a303c6d IMPALA-6984: coordinator cancels backends on EOS
Before this patch, when the coordinator returned the last row,
it waited for backends to finish of their own accord, which
could happen indirectly as exchanges got closed.

The idea of this change is to send out cancellation RPCs to
expedite cancellation, then wait for the final exec status
reports to come in. Those reports will be included in the
final profile because the backend is *not* marked as done
when sending out the cancellation RPCs.

The bulk of this change is modifying the cancellation code
path to allow sending the cancel RPCs but *not* consider
the backend done until it gets back the final status report.
The old "fire and forget" mode of cancellation is still used
for explicit cancellation and errors.
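
A hypothetical sketch of the new EOS path (structure assumed, not the
actual coordinator classes): send cancel RPCs to expedite teardown, but
only consider a backend done once its final report arrives.

  #include <condition_variable>
  #include <cstddef>
  #include <mutex>
  #include <vector>

  class BackendStates {
   public:
    explicit BackendStates(size_t num_backends)
      : done_(num_backends, false) {}

    void CancelBackendsOnEos() {
      for (size_t i = 0; i < done_.size(); ++i) SendCancelRpc(i);
      // Backends are not marked done here; wait for final reports so
      // their counters still make it into the final profile.
      std::unique_lock<std::mutex> l(lock_);
      all_done_.wait(l, [this] {
        for (bool d : done_) {
          if (!d) return false;
        }
        return true;
      });
    }

    // Invoked by the RPC handler for each backend's final report.
    void OnFinalStatusReport(size_t backend_idx) {
      std::lock_guard<std::mutex> l(lock_);
      done_[backend_idx] = true;
      all_done_.notify_all();
    }

   private:
    void SendCancelRpc(size_t backend_idx) { /* async cancel RPC */ }

    std::mutex lock_;
    std::condition_variable all_done_;
    std::vector<bool> done_;
  };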

Testing:
Ran exhaustive tests.

Ran cancellation tests under TSAN, checked for errors.

Manually inspected logs of some queries with limit,
saw that it sent cancellation then waited for backends
as expected.

Added a functional perf test that goes from ~5s down to under ~1s
on my system.

Change-Id: I966eceaafdc18a019708b780aee4ee9d70fd3a47
Reviewed-on: http://gerrit.cloudera.org:8080/15840
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-14 02:52:16 +00:00
Akos Kovacs
5e72ca546e IMPALA-7833 Audit and fix string builtins for long string handling
Some string built-in functions could crash impalad
if the result was longer than the 1 GB maximum string size.
Added some overflow checks.
Overflow error messages were modified to not hard-code the max size.
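
A sketch of the kind of check added, using repeat() as an example (the
function and constant names here are illustrative, not Impala's):

  #include <cstdint>

  // Impala's maximum string size (1 GB), no longer hard-coded in the
  // error messages themselves.
  constexpr int64_t kMaxStringSize = 1LL << 30;

  // Compute the result length of repeat(str, n) in 64 bits and fail
  // before allocating anything. Assumes non-negative inputs.
  bool CheckedRepeatLength(int64_t input_len, int64_t num_repeats,
                           int64_t* result_len) {
    if (input_len > 0 && num_repeats > kMaxStringSize / input_len) {
      return false;  // would exceed the 1 GB limit; raise an error
    }
    *result_len = input_len * num_repeats;
    return true;
  }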

Testing:
* Added some backend tests to cover overflow check
* Ran core tests

Change-Id: I93a53845f04e61ff446b363c78db1e49cbd5dc49
Reviewed-on: http://gerrit.cloudera.org:8080/15864
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-13 22:01:34 +00:00
Chang Wu
a93f2c2675 IMPALA-8205: Support number of true and false statistics for boolean column
This change compute the real number of true and false statistics
information for boolean columns. Before this, impala used to set
numTrues and numFalses to hardcoded -1 to indicate that its
statistics is missing.

Test Done:
Append the numTrue and numFalse test for all the statistics-related
test cases including the non-incremental, incremental and other test
cases.

Change-Id: I991bee8e7fdc644d908289f5fe2ee8032cc2c431
Reviewed-on: http://gerrit.cloudera.org:8080/14666
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-12 23:29:04 +00:00
Adam Tamas
7295edcc26 IMPALA-9680: Fixed compressed inserts failing
Modified the insert testfiles to get which database they need to
use for 'CREATE TABLE LIKE' dynamically.

Tests:
Did targeted exhaustive testruns in test_insert.py and
test_mt_dop.py and did a full exhaustive testrun.

Change-Id: Ib3c7ba02190f57a7ed40311c95a3dd9eca9b474d
Reviewed-on: http://gerrit.cloudera.org:8080/15816
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2020-05-11 19:32:08 +00:00
Tim Armstrong
b2d9901fb8 IMPALA-9176: shared null-aware anti-join build
This switches null-aware anti-join (NAAJ) to use shared
join builds with mt_dop > 0. To support this, we
make all access to the join build data structures
from the probe read-only. NAAJ requires iterating
over rows from build partitions at various steps
in the algorithm and before this patch this was not
thread-safe. We avoided that problem by having a
separate builder for each join node and duplicating
the data.

The main challenge was iteration over
null_aware_partition()->build_rows() from the probe
side, because it uses an embedded iterator in the
stream so was not thread-safe (since each thread
would be trying to use the same iterator).

The solution is to extend BufferedTupleStream to
allow multiple read iterators into a pinned,
read-only, stream. Each probe thread can then
iterate over the stream independently with no
thread safety issues.
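
A simplified sketch of the idea (not the real BufferedTupleStream API):
a pinned, read-only stream hands out independent read iterators, so
each probe thread walks the build rows with its own cursor.

  #include <cstddef>
  #include <vector>

  class PinnedReadOnlyStream {
   public:
    class ReadIterator {
     public:
      explicit ReadIterator(const PinnedReadOnlyStream* stream)
        : stream_(stream) {}
      bool Next(int* row) {
        if (pos_ >= stream_->rows_.size()) return false;
        *row = stream_->rows_[pos_++];  // per-thread state only
        return true;
      }
     private:
      const PinnedReadOnlyStream* stream_;
      size_t pos_ = 0;
    };

    ReadIterator NewReadIterator() const { return ReadIterator(this); }

    std::vector<int> rows_;  // stand-in for the pinned tuple data
  };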

As part of the BufferedTupleStream changes, I partially abstracted
ReadIterator from the rest of BufferedTupleStream, but decided not
to refactor completely so that this patchset didn't cause excessive
churn. I.e. much BufferedTupleStream code still accesses internal
fields of ReadIterator.

Fix a pre-existing bug in grouping-aggregator where
Spill() hit a DCHECK because the hash table was
destroyed unnecessarily when it hit an OOM. This was
flushed out by the parameter change in test_spilling.

Testing:
Add test to buffered-tuple-stream-test for multiple readers
to BTS.

Tweaked test_spilling_naaj_no_deny_reservation to have
a smaller minimum reservation, required to keep the
test passing with the new, lower, memory requirement.

Updated a TPC-H planner test where resource requirements
slightly decreased for the NAAJ.

Ran the naaj tests in test_spilling.py with TSAN enabled,
confirmed no data races.

Ran exhaustive tests, which passed after fixing IMPALA-9611.

Ran core tests with ASAN.

Ran backend tests with TSAN.

Perf:
I ran this query that exercises EvaluateNullProbe() heavily.

  select l_orderkey, l_partkey, l_suppkey, l_linenumber
  from tpch30_parquet.lineitem
  where l_suppkey = 4162 and l_shipmode = 'AIR'
        and l_returnflag = 'A' and l_shipdate > '1993-01-01'
        and if(l_orderkey > 5500000, NULL, l_orderkey) not in (
          select if(o_orderkey % 2 = 0, NULL, o_orderkey + 1)
          from orders
          where l_orderkey = o_orderkey)
  order by 1,2,3,4;

It went from ~13s to ~11s running on a single impalad with
this change, because of the inlining of CreateOutputRow() and
EvalConjuncts().

I also ran TPC-H SF 30 on Parquet with mt_dop=4, and there was
no change in performance.

Change-Id: I95ead761430b0aa59a4fb2e7848e47d1bf73c1c9
Reviewed-on: http://gerrit.cloudera.org:8080/15612
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-24 20:56:58 +00:00
Tim Armstrong
dc410a2cf4 IMPALA-9596: deflake test_tpch_mem_limit_single_node
This changes the test to use a debug action instead of
trying to hit the memory limit in the right spot, which
has tended to be flaky. This still exercises the error
handling code in the scanner, which was the original
point of the test (see IMPALA-2376).

This revealed an actual bug in the ORC scanner, where
it was not returning the error directly from
AssembleCollection(). Before I fixed that, the scanner
got stuck in an infinite loop when running the test.

Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21
Reviewed-on: http://gerrit.cloudera.org:8080/15700
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2020-04-16 15:45:49 +00:00
Tim Armstrong
76e4a17fb3 IMPALA-9643: fix runtime filter race for mt_dop
This patch avoids the race with registration of a
consumer filter by registering all filters upfront
when the filter bank is constructed. Then registration
of producers and consumers hands out references to the
pre-constructed filters.

A nice bonus of this change is that RegisterConsumer()
and RegisterProducer() don't mutate anything and
we can avoid lock acquisitions.
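
A sketch of that design (names hypothetical; the real filter bank holds
richer filter state):

  #include <memory>
  #include <unordered_map>
  #include <vector>

  struct RuntimeFilter { int filter_id; /* bloom/min-max payload */ };

  // All filters are constructed when the bank is created, so the
  // Register* calls just hand out pointers to immutable map entries:
  // no mutation, no race, no lock needed.
  class RuntimeFilterBank {
   public:
    explicit RuntimeFilterBank(const std::vector<int>& filter_ids) {
      for (int id : filter_ids) {
        filters_.emplace(id, std::make_unique<RuntimeFilter>(
            RuntimeFilter{id}));
      }
    }
    RuntimeFilter* RegisterConsumer(int id) { return Lookup(id); }
    RuntimeFilter* RegisterProducer(int id) { return Lookup(id); }
   private:
    RuntimeFilter* Lookup(int id) {
      auto it = filters_.find(id);
      return it == filters_.end() ? nullptr : it->second.get();
    }
    // Populated once in the constructor, read-only afterwards.
    std::unordered_map<int, std::unique_ptr<RuntimeFilter>> filters_;
  };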

Also adds test infrastructure and fixes TestRuntimeRowFilters to
work with mt_dop=4 (it was accidentally not enabled before). That
mostly involved modifying the tests to use aggregates of counters
instead of picking out lines with regexes.

Testing:
Added a regression test that reliably failed before this
fix. This relies on extending debug actions to allow longer
delays, plus a minor extension to the RUNTIME_PROFILE .test
file parser to handle spaces in counter names.

Ran exhaustive tests.

Change-Id: I194c0d2515b6a0e5474e1c0c8647f0e54dc94397
Reviewed-on: http://gerrit.cloudera.org:8080/15715
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-14 15:42:13 +00:00
Adam Tamas
c32849a391 IMPALA-8980: Remove functional*.alltypesinsert from EE tests
- Modified test_insert.py so the tests can run in parallel.
  - Every test creates its own temporary tables for insert testing.
- Swapped out the SETUP tags for TRUNCATE TABLE query statements.
  - Because the SETUP tag is not used anymore, the corresponding
    code was removed.
- Fixed a test query in insert.test that was incorrect; modified it
  to test for the right behavior.

Testing:
-tests/run-tests.py query_test/test_insert.py
-impala-py.test tests/query_test/test_insert.py
-the same for test_insert_permutation.py and test_load.py

Change-Id: I257e936868917a2fcc6c030f6c855b247e8a0eea
Reviewed-on: http://gerrit.cloudera.org:8080/15529
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-14 12:18:21 +00:00
Zoltan Borok-Nagy
b770d2d378 Put transactional tables into 'managed' directory
HIVE-22794 disallows ACID tables outside of the 'managed' warehouse
directory. This change updates data loading to make it conform to
the new rules.

The following tests had to be modified to use the new paths:
* AnalyzeDDLTest.TestCreateTableLikeFileOrc()
* create-table-like-file-orc.test

Change-Id: Id3b65f56bf7f225b1d29aa397f987fdd7eb7176c
Reviewed-on: http://gerrit.cloudera.org:8080/15708
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-11 00:36:56 +00:00
stiga-huang
5c2cae89f2 IMPALA-9529: Fix multi-tuple predicates not assigned in column masking
Column masking is implemented by replacing the masked table with a table
masking view which has masked expressions in its SelectList. However,
nested columns can't be exposed in the SelectList, so we expose them
in the output field of the view in IMPALA-9330. As a result, predicates
that reference both primitive and nested columns of the masked table
become multi-tuple predicates (referencing tuples of the view and the
masked table). Such kinds of predicates are not assigned since they no
longer bound to the view's tuple or the masked table's tuple.

We need to pick up the masked table's tuple id when getting unassigned
predicates for the table masking view. We also need to do this when
assigning predicates to the JoinNode, which is the only place that
introduces multi-tuple predicates.

Tests:
 - Add tests with multi-tuple predicates referencing nested columns.
 - Run CORE tests.

Change-Id: I12f1b59733db5a88324bb0c16085f565edc306b3
Reviewed-on: http://gerrit.cloudera.org:8080/15654
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-07 15:10:27 +00:00
Zoltan Borok-Nagy
8aa0652871 IMPALA-9484: Full ACID Milestone 1: properly scan files that have full ACID schema
Full ACID row format looks like this:

{
  "operation": 0,
  "originalTransaction": 1,
  "bucket": 536870912,
  "rowId": 0,
  "currentTransaction": 1,
  "row": {"i": 1}
}

User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema but that would introduce several
complexities and corner cases in column resolution. Also in query
results the heading of the above user column would be "row.i". Star
expansion should also be modified, etc.

Because of that in the Catalog I create the exact opposite of the above
schema:

{
  "row__id":
  {
    "operation": 0,
    "originalTransaction": 1,
    "bucket": 536870912,
    "rowId": 0,
    "currentTransaction": 1
  },
  "i": 1
}

This way very little modification is needed in the frontend. And the
hidden columns can be easily retrieved via 'SELECT row__id.*' when we
need those for debugging/testing.

We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.
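
A hypothetical illustration of that juggling (the real resolver is more
general; the column indexes here just follow the two schemas shown
above): the file schema is {operation, originalTransaction, bucket,
rowId, currentTransaction, row{...}}, while the table schema is
{row__id{...}, user columns...}.

  #include <vector>

  // Map a file schema path back to the corresponding table schema path.
  std::vector<int> FileToTablePath(const std::vector<int>& file_path) {
    std::vector<int> table_path;
    if (file_path.empty()) return table_path;
    if (file_path[0] < 5) {
      // An ACID column: in the table schema it is a child of row__id,
      // which is top-level column #0.
      table_path.push_back(0);
      table_path.insert(table_path.end(), file_path.begin(),
                        file_path.end());
    } else if (file_path.size() >= 2) {
      // A user column under the file's "row" struct (#5): top-level in
      // the table schema, shifted by one for the synthetic row__id.
      table_path.push_back(file_path[1] + 1);
      table_path.insert(table_path.end(), file_path.begin() + 2,
                        file_path.end());
    }
    return table_path;
  }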

Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.

Newly added Tests:
 * specific queries about hidden columns (full-acid-rowid.test)
 * SHOW CREATE TABLE (show-create-table-full-acid.test)
 * DESCRIBE [FORMATTED] TABLE (describe-path.test)
 * INSERT should be forbidden (acid-negative.test)
 * added tests for column masking (
   ranger_column_masking_complex_types.test)

Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-02 12:01:41 +00:00
Attila Bukor
2576952655 IMPALA-5092 Add support for VARCHAR in Kudu tables
KUDU-1938 added VARCHAR column type support to Kudu.
This commit adds support for Kudu's VARCHAR type to Impala.

The length of a Kudu VARCHAR is applied as a character length, as
opposed to the byte length that Impala currently uses.

When writing data to Kudu, the VARCHAR length is not an issue because
Impala only officially supports ASCII characters, which are the same
size in bytes and characters. Additionally, extra bytes would be
truncated by the Kudu client if somehow a value was too long.

When reading data from Kudu, it is possible that a value written by
some other application is wider in bytes than Impala expects and can
handle. This can happen due to multi-byte UTF-8 characters. In that
case, we adjust the length in Impala to truncate the extra bytes of the
value. This isn't a great solution, but it is one that other
integrations have taken as well, given that Impala doesn't support
UTF-8 values.

IMPALA-5675 tracks adding UTF-8 Character length support to VARCHAR
columns and marked the truncation code with a TODO that references
that Jira.
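
A simplified sketch of the read-side adjustment (function name and
shape assumed, not the actual scanner code):

  #include <algorithm>
  #include <cstddef>
  #include <string>

  // Kudu enforces VARCHAR length in characters, Impala in bytes, so a
  // multi-byte UTF-8 value can be wider in bytes than VARCHAR(n)
  // allows. Clamp the byte length on read; IMPALA-5675 tracks real
  // character-length support.
  std::string ReadKuduVarchar(const char* data, size_t byte_len,
                              size_t impala_max_bytes) {
    return std::string(data, std::min(byte_len, impala_max_bytes));
  }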

Testing:
* Performed manual testing of standard DDL and DML interaction
* Manually reproduced a check failure due to multi-byte characters
  and tested that length truncation resolve that issue.
* Added/adjusted the following automated tests:
** AnalyzeDDLTest: CTAS into Kudu with varchar type
** AnalyzeKuduDDLTest: CREATE TABLE in Kudu with VARCHAR type
** kudu_create.test: Create table with VARCHAR column, key, hash
   partition, and range partition
** kudu_describe.test: Describe table with VARCHAR column and key
** kudu_insert.test: Insert with VARCHAR columns including null and
   non-null defaults
** kudu_update.test: Updates with VARCHAR column
** kudu_upsert.test: Upserts with VARCHAR column
** kudu_delete.test: Deletes with VARCHAR columns
** kudu-scan-node.test: Tests basic predicates with VARCHAR columns

Follow on work:
- IMPALA-9580: Add min-max runtime filter support/tests
- IMPALA-9581: Pushdown string predicates
- IMPALA-9583: Automated multibyte truncation tests

Change-Id: I0d4959410fdd882bfa980cb55e8a7837c7823da8
Reviewed-on: http://gerrit.cloudera.org:8080/14197
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
2020-04-01 15:48:36 +00:00
Csaba Ringhofer
a08cd7f49b IMPALA-9584: remove flaky avg(TIMESTAMP) aggregates from test_analytic_fns
AVG(TIMESTAMP) is not deterministic, because it uses a double to sum
the timestamps, and adding doubles in different order can lead to
different results. This does not cause problems for DOUBLE columns,
because the test framework does not require an exact match if the
result is a double. As AVG is the only function for TIMESTAMP with this
problem, reducing the precision of all timestamp checks seemed like
overkill.
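
A standalone demonstration of why summing as double is order-sensitive
(double addition is not associative, so a different row order can give
a different sum, and therefore a different AVG):

  #include <cstdio>

  int main() {
    double a = 1e16, b = -1e16, c = 1.0;
    // prints "1 vs 0": c is absorbed when added to a huge magnitude
    std::printf("%.17g vs %.17g\n", (a + b) + c, a + (b + c));
    return 0;
  }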

As a short term solution I removed the problematic aggregates from the
tests.

Testing:
- ran only the related tests

Change-Id: I10e0027a64a4e430b7db3ed7c8d0cc8cdcb202e0
Reviewed-on: http://gerrit.cloudera.org:8080/15621
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 22:57:00 +00:00
Tim Armstrong
ab7e209d1b IMPALA-9099: allow mt_dop for joins without feature flag
This allows running *any* read-only query with mt_dop > 0.
Before this patch, no joins were allowed with mt_dop > 0.

Previous patches, particularly IMPALA-9156, added significantly
more code coverage for multithreading+joins. It should be safe to
allow enabling on a query-by-query basis. Many improvements are
still planned - see IMPALA-3902. So behaviour and performance
characteristics of mt_dop > 0 with more complex plans and joins
will continue to change.

Testing:
Updated the mt_dop validation tests and removed a redundant planner
test that didn't provide much additional coverage of the validation
support.

Ran exhaustive tests.

Change-Id: I9c6566abb239db0e775f2beaa25a62c36313cd6f
Reviewed-on: http://gerrit.cloudera.org:8080/15545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 20:45:59 +00:00
Attila Jeges
1cfc31c84f IMPALA-9555 part 2: [Hive3] Fix test failure introduced by HIVE-22589
This patch is a continuation of IMPALA-9555. It makes Avro DATE
tests more resilient by using regex for expected error messages
instead of using concrete error messages.

Change-Id: I36340be70a37b75997cf49625a173ec2690ed9b8
Reviewed-on: http://gerrit.cloudera.org:8080/15618
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 19:11:20 +00:00
Csaba Ringhofer
e8f604a213 IMPALA-9572: Fix DCHECK in nested Parquet scanning
The issue occurred when there were skipped pages and a column
inside a collection was scanned, but its position was not needed.
The repetition level still needs to be read in this case: the
skipped ranges are set in terms of top-level rows, so collection
items need to know which top-level row they belong to.

A DCHECK in StrideWriter's constructor was hit, otherwise the
code ran correctly in release mode. The DCHECK is moved to
functions where the condition would actually cause problems.

Testing:
- added and ran a regression test

Change-Id: I5e8ef514ead71f732c73f910af7fd1aecd37bb81
Reviewed-on: http://gerrit.cloudera.org:8080/15598
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 18:43:09 +00:00
Joe McDonnell
e9dd5d3f8c IMPALA-9560: Fix TestStatsExtrapolation for release versions
When changing the Impala version from 3.4.0-SNAPSHOT to 3.4.0-RELEASE,
TestStatsExtrapolation::test_stats_extrapolation started failing due
to a difference in the expected cardinality (expected: 17.91K,
actual 17.90K). This is because the Impala version gets embedded into
parquet files, and this causes a slight difference in file size, which
translates into a slight difference in expected cardinality.

This modifies TestStatsExtrapolation::test_stats_extrapolation to
allow any 17.9*K cardinality.

Testing:
 - Tested on master and on branch-3.4.0

Change-Id: Iebe538936f23c095ef58c808e425cfb7b31edd94
Reviewed-on: http://gerrit.cloudera.org:8080/15569
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-27 23:27:12 +00:00
Attila Jeges
2cd7a2b77a IMPALA-9555: [Hive3] Fix test failure introduced by HIVE-22589
With HIVE-22589 Hive3 switched back to using Julian Calendar for
historical dates by default which caused an Impala test failure
around Avro DATE values.

Change-Id: I51dd933867ea7877235e7f6e1f2b56711dca107e
Reviewed-on: http://gerrit.cloudera.org:8080/15564
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-27 17:16:00 +00:00
Tamas Mate
7dd13f7278 IMPALA-5308: Resolve confusing Kudu SHOW TABLE STATS output
This change modifies the output of the SHOW TABLE STATS and SHOW
PARTITIONS for Kudu tables.
 - PARTITIONS: the #Rows column has been removed
 - TABLE STATS: instead of showing partition information, it returns a
 result set similar to HDFS table stats: #Rows, #Partitions, Size,
 Format and Location

Example outputs can be seen in the doc changes.

Testing:
* kudu_stats.test is modified to verify the new result set
* kudu_partition_ddl.test is modified to verify the new partitions style
* Updated unit test with the new error message

Change-Id: Ice4b8df65f0a53fe14b8fbe35d82c9887ab9a041
Reviewed-on: http://gerrit.cloudera.org:8080/15199
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-18 18:05:34 +00:00
Tim Armstrong
08acccf9eb IMPALA-9156: share broadcast join builds
The scheduler will only create one join build finstance per
backend in cases where this is supported.

The builder is aware of the number of finstances executing the
probe and hands off the build data structures to those finstances.

Nested loop join requires minimal modifications because the
build data structures are read-only after initial construction.
The only significant change is that memory can't be transferred
to the multiple consumers, so MarkNeedsDeepCopy() needs to be
used instead.

Hash join requires additional synchronisation because the
spilling algorithm mutates build-side data structures. This
patch adds synchronisation so that rebuilding spilled
partitions is done in a thread-safe manner, using a single
thread. This uses the CyclicBarrier added in an earlier patch.

Threads blocked on CyclicBarrier need to be cancellable,
which is handled by cancelling the barrier when cancelling
fragments on the backend.
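
A sketch of such a barrier (simplified; Impala's actual CyclicBarrier
lives in the util code and differs in detail): all probe threads
arrive, the last arrival runs the mutating rebuild alone, then everyone
proceeds, and cancellation unblocks waiters.

  #include <condition_variable>
  #include <cstdint>
  #include <functional>
  #include <mutex>

  class CyclicBarrier {
   public:
    explicit CyclicBarrier(int num_threads)
      : num_threads_(num_threads) {}

    // Returns true iff this caller was the one that ran 'fn'.
    bool Wait(const std::function<void()>& fn) {
      std::unique_lock<std::mutex> l(lock_);
      int64_t cycle = cycle_;
      if (++num_waiting_ == num_threads_) {
        fn();        // e.g. rebuild spilled build partitions
        num_waiting_ = 0;
        ++cycle_;    // release this cycle's waiters
        cv_.notify_all();
        return true;
      }
      cv_.wait(l, [&] { return cycle_ != cycle || cancelled_; });
      return false;
    }

    void Cancel() {  // called when fragments are cancelled
      std::lock_guard<std::mutex> l(lock_);
      cancelled_ = true;
      cv_.notify_all();
    }

   private:
    std::mutex lock_;
    std::condition_variable cv_;
    const int num_threads_;
    int num_waiting_ = 0;
    int64_t cycle_ = 0;
    bool cancelled_ = false;
  };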

BufferPool now correctly handles multiple threads calling
CleanPages() concurrently, which makes other methods thread-safe.

Update planner to cost broadcast join and estimate memory
consumption based on a single instance per node.

Planner estimates of number of instances are improved. Instead of
assuming mt_dop instances per node, use the total number of input
splits (also called scan ranges in places) as an upper bound on
the number of instances generated by scans. These instance
estimates from the scan nodes are then propagated up the
plan tree in the same way as the numNodes estimates. The instance
estimate for the join build fragment is fixed to be based on
the destination fragment.

The profile now correctly accounts for time waiting for the
builder, counting it in inactive time and showing it in the
node timeline. Additional improvements/cleanup to the time
accounting are deferred until IMPALA-9422.

Testing:
* Updated planner tests
* Ran a single node stress test with TPC-H and TPC-DS
* Add a targeted test for spilling broadcast joins, both repartitioning
  and not repartitioning.
* Add a targeted test for a spilling broadcast join with empty probe
* Add a targeted test for spilling broadcast join with empty build
  partitions.
* Add a broadcast join to test_cancellation and test_failpoints.

Perf:

I did a single node run on my desktop:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 6.26    | -15.70%    | 4.63       | -16.16%        |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval    |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+
| TPCH(30) | TPCH-Q21 | parquet / none / none | 24.97  | 23.25       | R +7.38%   |   0.51%   |   0.22%        | 5     | R +6.95%       | 2.31    | 27.93   |
| TPCH(30) | TPCH-Q4  | parquet / none / none | 2.83   | 2.79        |   +1.31%   |   1.86%   |   0.36%        | 5     |   +1.88%       | 1.15    | 1.53    |
| TPCH(30) | TPCH-Q6  | parquet / none / none | 1.28   | 1.28        |   -0.01%   |   1.64%   |   1.63%        | 5     |   -0.11%       | -0.58   | -0.01   |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 2.65   | 2.68        |   -0.94%   |   0.84%   |   1.46%        | 5     |   -0.21%       | -0.87   | -1.25   |
| TPCH(30) | TPCH-Q1  | parquet / none / none | 4.69   | 4.72        |   -0.56%   |   1.29%   |   0.52%        | 5     |   -1.04%       | -1.15   | -0.89   |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 10.64  | 10.80       |   -1.48%   |   0.61%   |   0.60%        | 5     |   -1.39%       | -1.73   | -3.91   |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 4.11   | 4.32        |   -4.92%   |   0.05%   |   0.40%        | 5     |   -4.93%       | -2.31   | -27.46  |
| TPCH(30) | TPCH-Q20 | parquet / none / none | 3.47   | 3.67        | I -5.41%   |   0.81%   |   0.03%        | 5     | I -5.70%       | -2.31   | -15.75  |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 7.58   | 8.14        | I -6.93%   |   3.13%   |   2.62%        | 5     | I -9.31%       | -2.02   | -3.96   |
| TPCH(30) | TPCH-Q9  | parquet / none / none | 15.59  | 17.02       | I -8.38%   |   0.95%   |   0.43%        | 5     | I -8.92%       | -2.31   | -19.37  |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 2.90   | 3.25        | I -10.93%  |   1.42%   |   4.41%        | 5     | I -10.28%      | -2.31   | -5.33   |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.69   | 3.13        | I -14.31%  |   4.50%   |   1.40%        | 5     | I -17.79%      | -2.31   | -7.80   |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.50   | 3.03        | I -17.54%  |   0.10%   |   0.79%        | 5     | I -20.55%      | -2.31   | -49.31  |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 4.76   | 5.92        | I -19.52%  |   0.78%   |   0.33%        | 5     | I -24.31%      | -2.31   | -61.63  |
| TPCH(30) | TPCH-Q2  | parquet / none / none | 2.56   | 3.33        | I -23.18%  |   2.13%   |   0.85%        | 5     | I -30.39%      | -2.31   | -28.14  |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 12.59  | 16.41       | I -23.26%  |   1.73%   |   0.90%        | 5     | I -30.43%      | -2.31   | -32.36  |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.83   | 2.41        | I -24.04%  |   1.83%   |   2.22%        | 5     | I -30.48%      | -2.31   | -20.54  |
| TPCH(30) | TPCH-Q8  | parquet / none / none | 4.43   | 5.94        | I -25.33%  |   0.96%   |   0.54%        | 5     | I -34.54%      | -2.31   | -63.01  |
| TPCH(30) | TPCH-Q5  | parquet / none / none | 3.81   | 5.37        | I -29.08%  |   1.43%   |   0.69%        | 5     | I -41.47%      | -2.31   | -53.11  |
| TPCH(30) | TPCH-Q7  | parquet / none / none | 13.34  | 21.49       | I -37.92%  |   0.46%   |   0.30%        | 5     | I -60.69%      | -2.31   | -203.08 |
| TPCH(30) | TPCH-Q3  | parquet / none / none | 4.73   | 7.73        | I -38.81%  |   4.90%   |   1.35%        | 5     | I -61.68%      | -2.31   | -26.40  |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 3.71   | 6.61        | I -43.83%  |   1.63%   |   0.09%        | 5     | I -77.12%      | -2.31   | -106.61 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+

Change-Id: I4c67e4b2c87ed0fba648f1e1710addb885d66dc7
Reviewed-on: http://gerrit.cloudera.org:8080/15096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-17 23:29:45 +00:00
Volodymyr Verovkin
6fdc644fed IMPALA-8800: Added support of Kudu DATE type to Impala
This patch supports reading and writing DATE values
to Kudu tables. It does not add min-max runtime filter
support; IMPALA-9294 is the follow-up JIRA for that.
The corresponding Kudu JIRA is KUDU-2632.

Change-Id: I91656749a58ac769b54c2a63bdd4f85c89520b32
Reviewed-on: http://gerrit.cloudera.org:8080/14705
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-12 14:26:13 +00:00
stiga-huang
9672d94596 IMPALA-7784: Use unescaped strings in partition pruning + fix duplicated unescaping of strings
String values from external systems (HDFS, Hive, Kudu, etc.) are already
unescaped, the same as string values in Thrift objects deserialized in
coordinators. We should mark needsUnescaping_ as false when creating
StringLiterals for these values (in LiteralExpr#create()).

When comparing StringLiterals in partition pruning, we should also use
the unescaped values if needsUnescaping_ is true.

Tests:
 - Add tests for partition pruning on unescaped strings.
 - Add test coverage for all existing code paths using
   LiteralExpr#create().
 - Run core tests

Change-Id: Iea8070f16a74f9aeade294504f2834abb8b3b38f
Reviewed-on: http://gerrit.cloudera.org:8080/15278
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-09 06:29:35 +00:00
Riza Suminto
a007e3caf8 IMPALA-8674: fix bug where REMOTE runtime filter always marked disabled
When a runtime filter has a remote target, the coordinator disables the
FilterState upon arrival of the last filter update, to prevent further
updates to that filter. As a consequence, such a runtime filter was
always displayed as disabled in the runtime profile (the Enabled column
was false in the final filter table), even when in reality the runtime
filter had heard back from all pending backends and was complete. The
Enabled column should correctly distinguish a failed runtime filter
from a complete one. To do so, we add an all_updates_received_ flag to
the FilterState class and set it to true once the filter has received
enough updates from the pending backends to proceed. If
all_updates_received_ is true, the runtime filter is considered
enabled.
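
A simplified sketch of the distinction (the real FilterState tracks far
more; method names here are assumptions):

  class FilterState {
   public:
    // Called per filter update; 'remaining' counts pending backends.
    void ApplyUpdate(int remaining_backends) {
      if (remaining_backends == 0) {
        all_updates_received_ = true;  // completed, not failed
        disabled_ = true;              // reject any further updates
      }
    }
    // The profile's Enabled column now reports completion, so a filter
    // disabled merely because it finished is not shown as failed.
    bool enabled() const { return all_updates_received_; }
   private:
    bool disabled_ = false;
    bool all_updates_received_ = false;
  };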

Testing:
- Add row regex in runtime_filters.test, query 6, to verify REMOTE
  runtime filter is marked as enabled in final filter table
- Run and pass test_runtime_filters.py
- Run and pass core tests

Change-Id: I82a5a776103abd0a6d73336bebc65e22b4e13fef
Reviewed-on: http://gerrit.cloudera.org:8080/15308
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-07 00:00:22 +00:00
Tim Armstrong
1ea013cac8 IMPALA-9452: slightly reduce reservation for test_spilling_aggs
As far as I can tell, the query failed to spill because the
pre-agg was able to release reservation before the post-agg
needed it. Probably there is some variance because of buffering
in the exchange.

This change slightly reduces the reservation to minimise the
chance of this recurring.

Also remove a duplicated instance of this test.

Change-Id: Ifb8376e2e12d3f73d6c0e27c697be4fc86f9c755
Reviewed-on: http://gerrit.cloudera.org:8080/15339
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-03 06:12:24 +00:00
Csaba Ringhofer
2c54dbe225 IMPALA-9385: Unix time conversion cleanup + ORC fix
The ORC scanner uses TimestampValue::FromUnixTimeNanos() to convert
the sec + nano representation to Impala's TimestampValue (day + nano).
FromUnixTimeNanos was affected by the flag
use_local_tz_for_unix_timestamp_conversions, while that global option
should not affect ORC. By default there was no conversion, but if the
flag was 1, timestamps were interpreted as UTC and converted to
local time.

This could be solved by creating a UTC version of FromUnixTimeNanos,
but I decided to change the interface in the hope of making To/From
timestamp functions less confusing.

Changes:
- Fixed the bug by passing UTC as timezone in the ORC scanner.
- Changed the interface of these TimestampValue functions to expect
  a timezone pointer, interpret null as UTC and skip conversion. It
  would be also possible to pass the actual UTC timezone and check
  for this in the functions, but I guess it is easier to optimize
  the inlined functions this way.
- Moved the checking of use_local_tz_for_unix_timestamp_conversions to
  RuntimeState and added property time_zone_for_unix_time_conversions()
  to return the timezone to use in Unix time conversions. This made
  TimestampValue's interface clearer and makes it easy to replace the
  flag with a query option if we want to.
- Changed RuntimeState and the Parquet scanner to skip timezone
  conversion if convert_legacy_hive_parquet_utc_timestamps=1 but the
  timezone is UTC. This allows users to avoid the performance penalty
  of this flag by setting query option timezone to UTC in their
  session (IMPALA-7557). CCTZ is not good at this; conversions are
  actually slower with fixed-offset timezones (including UTC) than
  with timezones that have DST/historical rule changes.
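
A sketch of the changed interface described in the list above
(simplified; the real TimestampValue holds boost date/time types, not
raw day counts, and handles pre-epoch values):

  #include <cstdint>

  struct Timezone;  // stand-in for the CCTZ time zone type

  struct TimestampValue {
    int64_t days = 0;
    int64_t nanos = 0;

    // The changed contract: a null 'local_tz' means UTC and skips
    // conversion entirely, keeping the inlined fast path cheap.
    static TimestampValue FromUnixTimeNanos(int64_t seconds,
                                            int64_t nanos,
                                            const Timezone* local_tz) {
      TimestampValue result;
      result.days = seconds / 86400;
      result.nanos = (seconds % 86400) * 1000000000LL + nanos;
      if (local_tz != nullptr) {
        // Slow path: convert the UTC value into *local_tz (via CCTZ).
      }
      return result;
    }
  };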

Postponed changes:
- Didn't remove the UTC versions of the functions yet, as that would
  require changing (and possibly rethinking) several BE tests and
  benchmarks (IMPALA-9409).

Tests:
- Added regression test for Orc and other file formats to
  check that they are not affected by this flag.
- Extended test_hive_parquet_timestamp_conversion.py to cover the case
  when convert_legacy_hive_parquet_utc_timestamps=1 and timezone=UTC.
  Also did some cleanup there to use query option timezone instead of
  env var TZ.

Change-Id: I14e2a7e512ccd013d5d9fe480a5467ed4c46b76e
Reviewed-on: http://gerrit.cloudera.org:8080/15222
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-22 02:02:56 +00:00
Tim Armstrong
0bb056e525 IMPALA-4224: execute separate join build fragments
This enables parallel plans with the join build in a
separate fragment and fixes all of the ensuing fallout.
After this change, mt_dop plans with joins have separate
build fragments. There is still a 1:1 relationship between
join nodes and builders, so the builders are only accessed
by the join node's thread after it is handed off. This lets
us defer the work required to make PhjBuilder and NljBuilder
safe to be shared between nodes.

Planner changes:
* Combined the parallel and distributed planning code paths.
* Misc fixes to generate reasonable thrift structures in the
  query exec requests, i.e. containing the right nodes.
* Fixes to resource calculations for the separate build plans.
** Calculate separate join/build resource consumption.
** Simplified the resource estimation by calculating resource
   consumption for each fragment separately, and assuming that
   all fragments hit their peak resource consumption at the
   same time. IMPALA-9255 is the follow-on to make the resource
   estimation more accurate.

Scheduler changes:
* Various fixes to handle multiple TPlanExecInfos correctly,
  which are generated by the planner for the different cohorts.
* Add logic to colocate build fragments with parent fragments.

Runtime filter changes:
* Build sinks now produce runtime filters, which required
  planner and coordinator fixes to handle.

DataSink changes:
* Close the input plan tree before calling FlushFinal() to release
  resources. This depends on Send() not holding onto references
  to input batches, which was true except for NljBuilder. This
  invariant is documented.

Join builder changes:
* Add a common base class for PhjBuilder and NljBuilder with
  functions to handle synchronisation with the join node.
* Close plan tree earlier in FragmentInstanceState::Exec()
  so that peak resource requirements are lower.
* The NLJ always copies input batches, so that it can close
  its input tree.

JoinNode changes:
* Join node blocks waiting for the build side to be ready,
  then eventually signals that it's done, allowing the builder
  to be cleaned up (a sketch of this handshake follows the
  change list below).
* NLJ and PHJ nodes handle both the integrated builder and
  the external builder. There is a 1:1 relationship between
  the node and the builder, so we don't deal with thread safety
  yet.
* Buffer reservations are transferred between the builder and join
  node when running with the separate builder. This is not really
  necessary right now, since it is all single-threaded, but will
  be important for the shared broadcast.
  - The builder transfers memory for probe buffers to the join node
    at the end of each build phase.
  - At end of each probe phase, reservation needs to be handed back
    to builder (or released).

ExecSummary changes:
* The summary logic was modified to handle connecting fragments
  via join builds. The logic is an extension of what was used
  for exchanges.
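
A minimal sketch of the builder/join-node handshake mentioned under
"JoinNode changes" above (names and structure hypothetical; the real
base class also transfers reservations and runtime filters):

  #include <condition_variable>
  #include <mutex>

  class JoinBuilderBase {
   public:
    void NotifyBuildReady() {
      std::lock_guard<std::mutex> l(lock_);
      build_ready_ = true;
      cv_.notify_all();
    }
    void WaitForBuild() {  // join node blocks here before probing
      std::unique_lock<std::mutex> l(lock_);
      cv_.wait(l, [this] { return build_ready_; });
    }
    void NotifyProbeDone() {  // join node signals it is finished
      std::lock_guard<std::mutex> l(lock_);
      probe_done_ = true;
      cv_.notify_all();
    }
    void WaitForProbeDone() {  // build finstance blocks before cleanup
      std::unique_lock<std::mutex> l(lock_);
      cv_.wait(l, [this] { return probe_done_; });
    }
   private:
    std::mutex lock_;
    std::condition_variable cv_;
    bool build_ready_ = false;
    bool probe_done_ = false;
  };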

Testing:
* Enable --unlock_mt_dop for end-to-end tests
* Migrate some tests to run as part of end-to-end tests instead of
  custom cluster.
* Add mt_dop dimension to various end-to-end tests to provide
  coverage of join queries, spill-to-disk and cancellation.
* Ran a single node TPC-H and TPC-DS stress test with mt_dop=0
  and mt_dop=4.

Perf:
* Ran TPC-H scale factor 30 locally with mt_dop=0. No significant
  change.

Change-Id: I4403c8e62d9c13854e7830602ee613f8efc80c58
Reviewed-on: http://gerrit.cloudera.org:8080/14859
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-20 01:51:54 +00:00
Csaba Ringhofer
796f5cc43d IMPALA-8909: Fix incorrectly omitting implicit cast in pre-insert sort
Inserts can add a sort node that orders the rows by partitioning
and Kudu primary key columns (aka. clustered insert). The issue
occurred when the target column was a timestamp and the source
was an expression that returned a string (e.g. concat()). Impala
adds an implicit cast to convert the strings to timestamps before
sorting, but this cast was incorrectly removed later during expression
substitution.

This led to hitting a DCHECK in debug builds and a (not too
informative) error message in release mode.

Note that the cast in question is not visible in EXPLAIN outputs.
EXPLAIN should contain implicit casts at explain_level=2 since
https://gerrit.cloudera.org/#/c/11719/ , but they are still not shown
in some expressions. I consider this to be a separate issue.

Testing:
- added an EE test that used to crash
- ran planner / sort / kudu_insert tests

Change-Id: Icca8ab1456a3b840a47833119c9d4fd31a1fff90
Reviewed-on: http://gerrit.cloudera.org:8080/15217
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-17 21:20:26 +00:00
stiga-huang
cad156181b IMPALA-9304: Support starting Hive with Ranger in minicluster
Add a new flag -with_ranger in testdata/bin/run-hive-server.sh to start
Hive with Ranger integration. The related configuration files are
generated in bin/create-test-configuration.sh using a new variant
ranger_auth in hive-site.xml.py. Only Hive3 is supported.

Current limitation:
Can't use a different username in Beeline via the -n option. "select
current_user()" keeps returning my username, while "select
logged_in_user()" can return the username given by the -n option, but
it's not used in authorization.

Tests:
 - Ran bin/create-test-configuration.sh and verified the generated
   hive-site_ranger_auth.xml contains Ranger configurations.
 - Ran testdata/bin/run-hive-server.sh -with_ranger. Verified column
   masking and row filtering policies took effect in Beeline.
 - Added test in test_ranger.py for this mode.

Change-Id: I01e3a195b00a98388244a922a1a79e65146cec42
Reviewed-on: http://gerrit.cloudera.org:8080/15189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-14 04:26:16 +00:00
Yanjia Li
ea0e1def61 IMPALA-8778: Support Apache Hudi Read Optimized Table
A Hudi Read Optimized Table contains multiple versions of parquet
files. In order to load the table correctly, Impala needs to recognize
a Hudi Read Optimized Table as an HdfsTable and load the latest version
of each file using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Create table tests in functional_schema_template.sql
 - Query tests in hudi-parquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-11 15:08:39 +00:00
wzhou-code
ebc2c366f5 IMPALA-8110: Fix Parquet min/max filters for narrowed integer types
This patch adds validation for the paired stats values of the tinyint
and smallint column types when reading min/max column stats
values from Parquet files.
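
A sketch of the idea behind the validation (helper name assumed): when
the column type was narrowed after the file was written, a stat value
outside the narrow type's range means the stats can't be trusted.

  #include <cstdint>
  #include <limits>

  template <typename NarrowType>
  bool StatFitsInType(int64_t stat_value) {
    return stat_value >= std::numeric_limits<NarrowType>::min() &&
           stat_value <= std::numeric_limits<NarrowType>::max();
  }

  // e.g. StatFitsInType<int8_t>(300) == false -> skip min/max pruning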

Testing:
 - Added automatic test cases in parquet-stats.test for column data
   type been changed from int to tinyint, from smallint to tinyint
   and from int to smallint.
 - Passed EE tests.
 - Passed all core tests.

Change-Id: Id8bdaf4c4b2d0c6ea26d6e9bf013afca647e53a1
Reviewed-on: http://gerrit.cloudera.org:8080/15087
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-09 23:18:20 +00:00
Tim Armstrong
5b233150d2 IMPALA-9309: fix uninitialised pointer in NestedLoopJoinNode
This was reproducible with a failpoint, but test_failpoints.py
does not run with the right set of parameters.

Testing:
Add regression test that reproduces the issue.

Change-Id: I6324d27e93b710e806d8599f775d0f7d575186cb
Reviewed-on: http://gerrit.cloudera.org:8080/15147
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-05 22:00:48 +00:00
stiga-huang
a17bea916f IMPALA-9330: Fix column masking in nested tables + enable column masking by default
Column masking policies on primitive columns of a table which contains
nested types (though they won't be masked) will cause query failures.
To be specific, if tableA(id int, int_array array<int>) has a masking
policy on column "id", all queries on "tableA" will fail, e.g.
  select id from tableA;
  select t.id, a.item from tableA t, t.int_array a;

Column masking is implemented by wrapping the underlying table/view with
a table masking view. However, as we don't support nested types in
SelectList, the table masking view can't expose nested columns of the
masked table, which causes collection refs not being resolved correctly.

This patch fixes the issue by 2 steps:
1) Expose nested columns of the underlying table in the output Type of
   the table masking view (see InlineViewRef#createTupleDescriptor()).
   So nested Paths in the original query block can be resolved.
2) For such kinds of Paths, resolve them again inside the table masking
   view, so they point to the underlying table as intended (see
   Analyzer#resolvePathWithMasking()). The TupleDescriptor of such a
   table masking view won't be materialized, since the view is simple
   enough that its query plan is just a ScanNode of the underlying
   table. The whole query plan can be stitched as if the table is not
   masked.
Note that one day, when we support nested columns in the SelectList, we
may not need these 2 hacks.

This patch also adds some TRACE level loggings to improve debuggability,
and enables column masking by default.

Test changes in TestRanger.test_column_masking:
 - Add column masking policy on a table containing nested types.
 - Add queries on the masked tables. Some queries are borrowed from
   existing tests for nested types.

Tests:
 - Run CORE tests.

Change-Id: I1cc5565c64c1a4a56445b8edde59b1168f387791
Reviewed-on: http://gerrit.cloudera.org:8080/15108
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-05 04:18:28 +00:00
Tim Armstrong
7b280e5841 IMPALA-9349: free output_unmatched_batch_ buffers promptly in PHJ
This fixes a subtle memory management issue where the freeing of a
buffer is delayed longer than it should be. This means that
the full buffer pool reservation is not available for
repartitioning, which can lead to crashes or hangs for
very specific queries.

The fix is to transfer resources from output_unmatched_batch_
as soon as the last row from the batch is appended to the
output batch.
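
A simplified sketch of that transfer (RowBatch::TransferResourceOwnership
is the real Impala method name; the surrounding function is assumed):

  struct RowBatch {
    void TransferResourceOwnership(RowBatch* dest) { /* move buffers */ }
  };

  void AppendUnmatchedBuildRow(RowBatch* output_unmatched_batch,
                               RowBatch* out_batch,
                               bool last_row_in_batch) {
    /* ... append the current row to out_batch ... */
    if (last_row_in_batch) {
      // Previously this happened later, so the full reservation was
      // not yet available when repartitioning started.
      output_unmatched_batch->TransferResourceOwnership(out_batch);
    }
  }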

This bug would only be triggered by join modes that output
unmatched rows from the right side (RIGHT OUTER JOIN,
FULL OUTER JOIN, RIGHT ANTI JOIN) *and* have an empty
probe side (otherwise unmatched rows are output by
iterating over the hash table).
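
A minimal query shape that could trigger the bug (hypothetical
table names; the actual regression test may differ) is a right
outer join whose probe (left) side is empty:
  select b.id
  from empty_probe p
  right outer join build_side b on p.id = b.id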

Testing:
Added DCHECKs to check that all resources are available
before repartitioning.

Added a regression test that triggered the bug.

Change-Id: Ie13b51d4d909afb0fe2e7b7dc00b085c51058fed
Reviewed-on: http://gerrit.cloudera.org:8080/15142
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-01 02:25:13 +00:00
Tim Armstrong
f38da0df8e IMPALA-4400: aggregate runtime filters locally
Move RuntimeFilterBank to QueryState. Implement fine-grained
locking for each filter to mitigate any increased lock
contention from the change.

Make RuntimeFilterBank handle multiple producers of the
same filter, e.g. multiple instances of a partitioned
join. It computes the expected number of filters upfront,
then sends the filter to the coordinator once all the
local instances have been merged together. The merging
can be done in parallel locally to improve latency of
filter propagation.

Add Or() methods to MinMaxFilter and BloomFilter, since
we now need to merge those, not just the thrift versions.

Update coordinator filter routing to expect only one
instance of a filter from each producer backend and
to only send one instance to each consumer backend
(instead of sending one per fragment).

Update memory reservations and estimates to be lower
to account for sharing of filters between fragment
instances. mt_dop plans are modified to show these
shared and non-shared resources separately.

Enable waiting for runtime filters for the Kudu scanner
with mt_dop.

Made min/max filters const-correct.

Testing:
* Added unit tests for Or() methods.
* Added some additional e2e test coverage for mt_dop queries.
* Updated planner tests with new estimates and reservation.
* Ran a single node 3-impalad stress test with TPC-H kudu and
  TPC-DS parquet.
* Ran exhaustive tests.
* Ran core tests with ASAN.

Perf:
* Did a single-node perf run on TPC-H with default settings. No perf change.
* Single-node perf run with mt_dop=8 showed significant speedups:

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 10.14   | -7.29%     | 5.05       | -11.68%        |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval    |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+
| TPCH(30) | TPCH-Q7  | parquet / none / none | 38.87  | 38.44       |   +1.13%   |   7.17%   | * 10.92% *     | 20    |   +0.72%       | 0.72    | 0.39    |
| TPCH(30) | TPCH-Q1  | parquet / none / none | 4.28   | 4.26        |   +0.50%   |   1.92%   |   1.09%        | 20    |   +0.03%       | 0.31    | 1.01    |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 2.32   | 2.32        |   +0.05%   |   2.01%   |   1.89%        | 20    |   -0.03%       | -0.36   | 0.08    |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 3.73   | 3.75        |   -0.42%   |   0.84%   |   1.05%        | 20    |   -0.25%       | -0.77   | -1.40   |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 9.80   | 9.83        |   -0.38%   |   0.51%   |   0.80%        | 20    |   -0.32%       | -1.30   | -1.81   |
| TPCH(30) | TPCH-Q2  | parquet / none / none | 1.98   | 2.00        |   -1.32%   |   1.74%   |   2.81%        | 20    |   -0.64%       | -1.71   | -1.79   |
| TPCH(30) | TPCH-Q6  | parquet / none / none | 1.22   | 1.25        |   -2.14%   |   2.66%   |   4.15%        | 20    |   -0.96%       | -2.00   | -1.95   |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 5.13   | 5.22        |   -1.65%   |   1.20%   |   1.40%        | 20    |   -1.76%       | -3.34   | -4.02   |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.46   | 2.56        |   -4.13%   |   2.49%   |   1.99%        | 20    |   -4.31%       | -4.04   | -5.94   |
| TPCH(30) | TPCH-Q9  | parquet / none / none | 81.63  | 85.07       |   -4.05%   |   4.94%   |   3.06%        | 20    |   -5.46%       | -3.28   | -3.21   |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 5.07   | 5.50        | I -7.92%   |   0.96%   |   1.33%        | 20    | I -8.51%       | -5.27   | -22.14  |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 24.00  | 26.24       | I -8.57%   |   0.46%   |   0.38%        | 20    | I -9.34%       | -5.27   | -67.47  |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 8.66   | 9.50        | I -8.86%   |   0.62%   |   0.44%        | 20    | I -9.75%       | -5.27   | -55.17  |
| TPCH(30) | TPCH-Q3  | parquet / none / none | 6.01   | 6.70        | I -10.19%  |   1.01%   |   0.90%        | 20    | I -11.25%      | -5.27   | -35.76  |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.98   | 3.39        | I -12.23%  |   1.48%   |   1.48%        | 20    | I -13.56%      | -5.27   | -27.75  |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.69   | 2.00        | I -15.55%  |   1.63%   |   1.47%        | 20    | I -18.09%      | -5.27   | -34.60  |
| TPCH(30) | TPCH-Q4  | parquet / none / none | 2.42   | 2.87        | I -15.69%  |   1.48%   |   1.26%        | 20    | I -18.61%      | -5.27   | -39.50  |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 4.64   | 6.27        | I -26.02%  |   1.35%   |   0.73%        | 20    | I -35.37%      | -5.27   | -94.07  |
| TPCH(30) | TPCH-Q20 | parquet / none / none | 3.19   | 4.37        | I -27.01%  |   1.54%   |   0.99%        | 20    | I -36.85%      | -5.27   | -80.74  |
| TPCH(30) | TPCH-Q5  | parquet / none / none | 4.57   | 6.39        | I -28.36%  |   1.04%   |   0.75%        | 20    | I -39.56%      | -5.27   | -120.02 |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 3.15   | 4.71        | I -33.06%  |   1.59%   |   1.31%        | 20    | I -49.43%      | -5.27   | -87.64  |
| TPCH(30) | TPCH-Q8  | parquet / none / none | 5.25   | 7.95        | I -33.95%  |   0.95%   |   0.53%        | 20    | I -51.11%      | -5.27   | -185.02 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+

Change-Id: Iabeeab5eec869ff2197250ad41c1eb5551704acc
Reviewed-on: http://gerrit.cloudera.org:8080/14538
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-29 00:58:24 +00:00
Anurag Mantripragada
567b3cd04c IMPALA-9311: Store SQLPrimaryKeys in canonical order.
HMS seems to return SQLPrimaryKeys in an inconsistent order.
This makes some of the primary key tests flaky. This change sorts
the list of primary keys and stores them in canonical order within
Impala.

Testing:
- Modified the tests that were relying on HMS to return the same
  order every time.
- Ran parametrized job.

Change-Id: I0f798d7a2659c6cd061002db151f3fa787eb6370
Reviewed-on: http://gerrit.cloudera.org:8080/15106
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-01-27 21:48:23 +00:00
Joe McDonnell
0163a10332 IMPALA-9068: Use different directories for external vs managed warehouse
Hive 3 changed the typical storage model for tables to split them
between two directories:
 - hive.metastore.warehouse.dir stores managed tables (which is now
   defined to be only transactional tables)
 - hive.metastore.warehouse.external.dir stores external tables
   (everything that is not a transactional table)
In more recent commits of Hive, there is now validation that
external tables cannot be stored in the managed directory. In order
to adopt these newer versions of Hive, we need to use separate
directories for external vs managed warehouses.

Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
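
Concretely, the resulting test warehouse layout is:
  /test-warehouse          <- hive.metastore.warehouse.external.dir
  /test-warehouse/managed  <- hive.metastore.warehouse.dir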

The Hive 2 configuration doesn't change as it does not have this concept.

Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload will use a separate
location for data as compared to USE_CDP_HIVE=false. That should reduce
conflicts between the two configurations.

Testing:
 - Ran exhaustive tests with USE_CDP_HIVE=false
 - Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
 - Verified that dataload succeeds and tests are able to run with a newer
   Hive version.

Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-24 17:29:15 +00:00
Anurag Mantripragada
cfe60858da IMPALA-9158: Support loading primary key/foreign key constraints
in LocalCatalog Mode.

This change adds a new method 'loadConstraints()' to the MetaProvider
interface.

1. In the CatalogdMetaProvider implementation, we fetch the primary
  key (PK) and foreign key (FK) information via the
  GetPartialCatalogObject() RPC to the catalogd, which is modified to
  include PK/FK information. This works because the catalog side
  eagerly loads PK/FK information, so it can be sent to the local
  catalog in a single RPC. This information is then stored in the
  TableMetaRef object for future consumers.
2. In the DirectMetaProvider implementation, we make two RPCs to HMS
  to directly get PK/FK information.

loadConstraints() can be extended later to include other constraint
types (e.g. unique constraints).

Testing:
- Added tests in LocalCatalogTest, CatalogTest and PartialCatalogInfoTest
- This change also modifies toSqlUtil for show create table
  statements, and adds a test for the same (see the sketch below).
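
A hedged sketch (hypothetical table; the exact rendering of the
constraint clause is determined by toSqlUtil):
  show create table parent;
  -- expected to now include the PK spec, along the lines of:
  --   PRIMARY KEY (id) DISABLE NOVALIDATE RELY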

Change-Id: I7ea7e1bacf6eb502c67caf310a847b32687e0d58
Reviewed-on: http://gerrit.cloudera.org:8080/14731
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-18 03:36:37 +00:00
Gabor Kaszab
63f52518ab IMPALA-8801: Date type support for ORC scanner
Implements the read path for the date type in the ORC scanner. The
internal representation of a date is an int32 holding the number of
days since the Unix epoch in the proleptic Gregorian calendar.
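
For example (the int32 values in the comments below are the logical
encodings, not something visible in SQL):
  select date '1970-01-01';  -- encoded as 0
  select date '1970-02-01';  -- encoded as 31
  select date '1969-12-31';  -- encoded as -1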

Similarly to the Parquet implementation (IMPALA-7370), this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.

Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Reviewed-on: http://gerrit.cloudera.org:8080/14982
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-17 18:54:33 +00:00
stiga-huang
d66610837e IMPALA-9009: Core support for Ranger column masking
Ranger provides column masking policies that specify how masked
values are shown to specific users when reading specific columns.
This patch adds support for rewriting the query AST based on column
masking policies.
We apply the column masking policies by replacing the TableRef with a
subquery that does the masking. For instance, the following query
  select c_id, c_name from customer c join orders on c_id = o_cid
will be transformed into
  select c_id, c_name from (
    select mask1(c_id) as c_id, mask2(c_name) as c_name from customer
  ) c
  join orders
  on c_id = o_cid

The transformation is done during AST resolution. Just like view
resolution, if the table needs masking we replace it with a
subquery (InlineViewRef) containing the masking expressions.

This patch only adds support for mask types that don't require
built-in mask functions, so the currently supported masking types are
MASK_NULL and CUSTOM.
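
For instance, with a MASK_NULL policy on customer.c_name for the
current user, the example above conceptually becomes:
  select c_id, c_name from (
    select c_id, cast(null as string) as c_name from customer
  ) c
  join orders
  on c_id = o_cid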

Current Limitations:
 - Users are required to have privileges on all columns of a masked
   table (IMPALA-9223), since the table masking subquery contains all
   of the columns.

Tests:
 - Add e2e tests for masked results
 - Run core tests

Change-Id: I4cad60e0e69ea573b7ecfc011b142c46ef52ed61
Reviewed-on: http://gerrit.cloudera.org:8080/14894
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-15 05:59:11 +00:00
norbert.luksa
0511b44f92 IMPALA-8046: Support CREATE TABLE from an ORC file
Impala supports creating a table using the schema of a file.
However, only Parquet is currently supported. This commit adds
support for creating tables from ORC files.

The change relies on version 1.5 or greater of the ORC Java API,
because of a bug in earlier versions. Therefore, ORC is listed as an
external dependency instead of relying on Hive's ORC version (from
Hive 3, Hive also lists it as a dependency).

Also, the commit performs a little clean-up on the ParquetHelper
class, renaming it to ParquetSchemaExtractor and removing outdated
comments.

To create a table from an ORC file, run:
CREATE TABLE tablename LIKE ORC '/path/to/file'

Tests:
 * Added analysis tests for primitive and complex types.
 * Added e2e tests for creating tables from ORC files.

Change-Id: I77cd84cda2ed86516937a67eb320fd41e3f1cf2d
Reviewed-on: http://gerrit.cloudera.org:8080/14811
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-14 17:15:17 +00:00
Vihang Karajgaonkar
6ebea33a9d IMPALA-9092: Add support for creating external Kudu table
In HMS-3 the translation layer converts a managed Kudu table into an
external Kudu table and sets the additional table property
'external.table.purge' to 'true'. This means any installation which
is using HMS-3 (or a Hive version which has HIVE-22158) will always
create Kudu tables as external tables. This is problematic since the
output of show create table will now be different and may confuse
users.

In order to improve the user experience of such synchronized tables
(external tables with external.table.purge property set to true),
this patch adds support in Impala to create
external Kudu tables. Previous versions of Impala disallowed
creating an external Kudu table if the Kudu table did not exist.
After this patch, Impala will check whether the Kudu table exists
and, if it does not, create a Kudu table based on the schema provided
in the create table statement. The command will error out if the Kudu
table already exists. However, this applies only to synchronized
tables; the previous way of creating a pure external table behaves
the same.

The following syntax for creating a synchronized table is now allowed:

CREATE EXTERNAL TABLE foo (
  id int PRIMARY KEY,
  name string)
PARTITION BY HASH PARTITIONS 8
STORED AS KUDU
TBLPROPERTIES ('external.table.purge'='true')

The syntax is very similar to creating a managed table, except for
the EXTERNAL keyword and the additional table property. A
synchronized table behaves similarly to a managed Kudu table (drops
and renames are allowed). The output of show create table on a
synchronized table will display the full column and partition spec,
similar to managed tables.
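
For contrast, a pure external table (no 'external.table.purge')
still simply points at an existing Kudu table, along these lines
(hypothetical names):

CREATE EXTERNAL TABLE foo_ext
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='existing_kudu_table')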

Testing:
1. After the CDP version bump, all of the existing Kudu tables are
now created as synchronized tables, so there is good coverage there.
2. Added additional tests which create synchronized tables and
compares the show create table output.
3. Ran exhaustive tests with both CDP and CDH builds.

Change-Id: I76f81d41db0cf2269ee1b365857164a43677e14d
Reviewed-on: http://gerrit.cloudera.org:8080/14750
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-12-13 23:02:13 +00:00
norbert.luksa
bf031a2142 IMPALA-6660: Change -0/+0 floating point to compare as equal in hash table
Currently -0.0/+0.0 values are hashed to different values due to
their different binary representations, even though -0.0 == +0.0 is
true in C++. This caused them to be treated as distinct values in
hash tables despite being treated as equal in comparisons.

This commit fixes the hashing of -0.0/+0.0, thus changing the
behaviour of hash joins and aggregations (since aggregations follow
the behaviour of the join). As a result, the canonical form for
-0.0/+0.0 becomes +0.0.
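
A minimal illustration (casting from strings to preserve the sign
bit of -0.0):
  select val, count(*) from (
    select cast('0.0' as double) as val
    union all
    select cast('-0.0' as double) as val
  ) t group by val
After this fix the query returns a single group with count 2,
rather than two groups of one row each.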

Tests:
 - Added e2e tests for aggregation (group by and distinct) and
   join queries with -0.0 and +0.0 present.

Change-Id: I6bb1a817c81c452d041238c19cb6c9f602a5d565
Reviewed-on: http://gerrit.cloudera.org:8080/14588
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-26 19:14:24 +00:00
norbert.luksa
2114fc6155 IMPALA-4618: Fixing #Hosts and adding #Instances in exec summary
When mt_dop > 0, the summary reports the number of fragment
instances instead of the number of hosts, contrary to what the
header implies.

This commit fixes the issue so that the number of hosts is shown
under the #Hosts column. The commit also adds an #Inst column where
the number of instances is shown (the current behaviour).
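
For example, with mt_dop=4 on a 3-node cluster, a scan operator in
the summary might now read (abbreviated, hypothetical numbers):
  Operator      #Hosts  #Inst  Avg Time ...
  00:SCAN HDFS       3     12    1.20s  ...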

Tests:
 * Changed profile tests with mt_dop > 0.
 * Updated benchmark tests and shell tests accordingly.

Change-Id: I3bdf9a06d9bd842b2397cd16c28294b6bec7af69
Reviewed-on: http://gerrit.cloudera.org:8080/14715
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-26 07:28:23 +00:00