The min/max stat predicate is allowed when the left side is not a slot
but an implicit cast of a slot. This could lead to incorrectly dropping
a row group or page when min/max values were not castable to the type,
e.g. it is string with a pre 1400 date and we want to cast it to a
timestamp.
The change should only affect timestamps, as dates return an error
on failed cast from a string, and numeric types won't be cast
implicitly from string.
The fix is simply to accept NULL result for the min/max predicate in
the backend. Note that the alternative solution of casting the right
(const) side of the predicate instead of the left side would be tricky,
as more than one string can mean the same timestamp, e.g.
"1970-01-01" and "1970-01-01 00:00:00".
Testing:
- added an EE regression test and ran it
Change-Id: I35f66e1dfc4523624c249073004f9d5eddd07bb6
Reviewed-on: http://gerrit.cloudera.org:8080/15959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds a new version of the pre-existing partition
key scan optimization that always returns correct
results, even when files have zero rows. This new
version is always enabled by default. The old
existing optimization, which does a metadata-only
query, is still enabled behind the
OPTIMIZE_PARTITION_KEY_SCANS query option.
The new version of the optimization must scan the files
to see if they are non-empty. Instead of using metadata
only, the planner instructs the backend to short-circuit HDFS
scans after a single row has been returned from each
file. This gives results equivalent to returning all
the rows from each file, because all rows in the file
belong to the same partition and therefore have identical
values for any columns that are partition key values.
Planner cardinality estimates are adjusted accordingly
to enable potentially better plans and other optimisations
like disabling codegen.
We make some effort to avoid generated extra scan ranges
for remote scans by only generating one range per remote
file.
The backend optimisation is implemented by constructing a
row batch with capacity for a single row only and then
terminating each scan range once a single row has been
produced. Both Parquet and ORC have optimized code paths
for zero slot table scans that mean this will only result
in a footer read. (Other file formats still need to read
some portion of the file, but can terminate early once
one row has been produced.)
This should be quite efficient in practice with file handle
caching and data caching enabled, because it then only
requires reading the footer from the cache for each file.
The partition key scan optimization is also slightly
generalised to apply to scans of unpartitioned tables
where no slots are materialized.
A limitation of the optimization where it did not apply
to multiple grouping classes was also fixed.
Limitations:
* This still scans every file in the partition. I.e. there is
no short-circuiting if a row has already been found in the
partition by the current scan node.
* Resource reservations and estimates for the scan node do
not all take into account this optimisation, so are
conservative - they assume the whole file is scanned.
Testing:
* Added end-to-end tests that execute the query on all
HDFS file formats and verify that the correct number of rows
flow through the plan.
* Added planner test based on the existing test partition key
scan test.
* Added test to make sure single node optimisation kicks in
when expected.
* Add test for cardinality estimates with and without stats
* Added test for unpartitioned tables.
* Added planner test that checks that optimisation is enabled
for multiple aggregation classes.
* Added a targeted perf test.
Change-Id: I26c87525a4f75ffeb654267b89948653b2e1ff8c
Reviewed-on: http://gerrit.cloudera.org:8080/13993
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Minor compactions can compact several delta directories into a single
delta directory. The current directory filtering algorithm had to be
modified to handle minor compacted directories and prefer those over
plain delta directories. This happens in the Frontend, mostly in
AcidUtils.java.
Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.
E.g. we can have the following delta directory:
full_acid/delta_0000001_0000010/0000 # minWriteId: 1
# maxWriteId: 10
This delta dir contains rows with write ids between 1 and 10. But maybe
we are only allowed to see write ids less than 5. Therefore we need to
check the ACID write id column (named originalTransaction) to determine
which rows are valid.
Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:
full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
# every row is valid
If there's no visibilityTxnId then it was created via Hive Streaming,
therefore we need to validate rows. Fortunately Hive Streaming writes
rows with different write ids into different ORC stripes, therefore we
don't need to validate the write id per row. If we had statistics,
we could validate per stripe, but since Hive Streaming doesn't write
statistics we validate the write id per ORC row batch (an alternative
could be to do a 2-pass read, first we'd read a single value from each
stripe's 'currentTransaction' field, then we'd read the stripe if the
write id is valid).
Testing
* the frontend logic is tested in AcidUtilsTest
* the backend row validation is tested in test_acid_row_validation
Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.
Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
"ranger" requires extra parameters to be set. This changes the
default value of authorization_provider to "", which translates
internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
- authorization_policy_provider_class
- sentry_catalog_polling_frequency_s
- sentry_config
3. The authorization_factory_class may be obsolete now that
there is only one authorization policy, but this leaves it
in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
that is removed. There are still Maven dependencies coming
from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
is not removed and it is still called from testdata/bin/kill-all.sh.
Testing:
- Core job passes
Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The use of HDFS API was incorrect when creating an empty file
in the new base dir during truncate. Simply calling Create(Path)
does create the file in HDFS, but it is only created on S3 when
the returned stream is closed.
Testing:
- Acid truncate tests are not running on S3 as they need a running
Hive server. Aded a regression test that will run on S3 too. It
would be nice to run all tests on S3, but this is out of the scope
of this change.
Change-Id: I96d315638b669c5c7198a8e47939cb2b236e35bb
Reviewed-on: http://gerrit.cloudera.org:8080/15940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
query_test.test_scanners.TestScannerReservation.test_scanners was flaky.
It checks the average value of ParquetRowGroupIdealReservation which is
almost always 3.50 MB. However, very rarely it's a bit different than
that, e.g. 3.88 MB, or 4.12 MB, and so on. I wasn't able to reproduce
this problem, probably it is due to some randomness during data loading.
I modified the test to accept any value between 3.0 and 4.99. I think
values in this range are acceptable for this test.
Change-Id: I668d0ccd77a62059284e76fee51efb08bef580eb
Reviewed-on: http://gerrit.cloudera.org:8080/15923
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The control flow was broken if the join operator hit
the end of the expression values cache before the end
of the probe batch, immediately after processing a row
for a spilled partition. In NextProbeRow(), the problematic
code path was:
* The last row in the expression values cache was for a
spilled partition, so skip_row=true and it falls out
of the loop with 'current_probe_row_' pointing to that
row.
* probe_batch_iterator->AtEnd() is false, because
the expression value cache is smaller than the probe batch,
so 'current_probe_row_' is not nulled out.
Thus we end up in a state where 'current_probe_row_' is
set, but 'hash_table_iterator_' is unset.
In the case of a left anti join, this was interpreted by
ProcessProbeRowLeftSemiJoins() as meaning that there was
no hash table match for 'current_probe_row_', and it
therefore returned the row.
This bug could only occur under specific circumstances:
* The join key takes up > 256 bytes in the expression values
cache (assuming the default batch size of 1024).
* The join spilled.
* The join operator returns rows that were unmatched in
the right input, i.e. LEFT OUTER JOIN, LEFT ANTI JOIN,
FULL OUTER JOIN.
The core of the fix is to null out 'current_probe_row_' when
falling out the bottom of the loop in NextProbeRow(). Related
DCHECKS were fixed and some control flow was slightly
simplified.
Testing:
Added a test query on TPC-H that reproduces the problem reliably.
Ran exhaustive tests.
Change-Id: I9d7e5871c35a90e8cf24b8dded04775ee1eae9d8
Reviewed-on: http://gerrit.cloudera.org:8080/15904
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch, when the coordinator returned the last row,
it waited for backends to finish of their own accord, which
could happen indirectly as exchanges got closed.
The idea of this change is to send out cancellation RPCs to
expedite cancellation, then wait for the final exec status
reports to come in. Those reports will be included in the
final profile because the backend is *not* marked as done
when sending out the cancellation RPCs.
The bulk of this change is modifying the cancellation code
path to allow sending the cancel RPCs but *not* consider
the backend done until it gets back the final status report.
The old "fire and forget" mode of cancellation is still used
for explicit cancellation and errors.
Testing:
Ran exhaustive tests.
Ran cancellation tests under TSAN, checked for errors.
Manually inspected logs of some queries with limit,
saw that it sent cancellation then waited for backends
as expected.
Added a functional perf test that goes from ~5s down to < ~1s
on my system.
Change-Id: I966eceaafdc18a019708b780aee4ee9d70fd3a47
Reviewed-on: http://gerrit.cloudera.org:8080/15840
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some string built-in functions could crash impalad,
in case the result was longer than 1 gig max size.
Added some overflow checks.
Overflow error messages modified not to hard code max size.
Testing:
* Added some backend tests to cover overflow check
* Ran core tests
Change-Id: I93a53845f04e61ff446b363c78db1e49cbd5dc49
Reviewed-on: http://gerrit.cloudera.org:8080/15864
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change compute the real number of true and false statistics
information for boolean columns. Before this, impala used to set
numTrues and numFalses to hardcoded -1 to indicate that its
statistics is missing.
Test Done:
Append the numTrue and numFalse test for all the statistics-related
test cases including the non-incremental, incremental and other test
cases.
Change-Id: I991bee8e7fdc644d908289f5fe2ee8032cc2c431
Reviewed-on: http://gerrit.cloudera.org:8080/14666
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Modified the insert testfiles to get which database they need to
use for 'CREATE TABLE LIKE' dynamically.
Tests:
Did targeted exhaustive testruns in test_insert.py and
test_mt_dop.py and did a full exhaustive testrun.
Change-Id: Ib3c7ba02190f57a7ed40311c95a3dd9eca9b474d
Reviewed-on: http://gerrit.cloudera.org:8080/15816
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
This switches null-aware anti-join (NAAJ) to use shared
join builds with mt_dop > 0. To support this, we
make all access to the join build data structures
from the probe read-only. NAAJ requires iterating
over rows from build partitions at various steps
in the algorithm and before this patch this was not
thread-safe. We avoided that problem by having a
separate builder for each join node and duplicating
the data.
The main challenge was iteration over
null_aware_partition()->build_rows() from the probe
side, because it uses an embedded iterator in the
stream so was not thread-safe (since each thread
would be trying to use the same iterator).
The solution is to extend BufferedTupleStream to
allow multiple read iterators into a pinned,
read-only, stream. Each probe thread can then
iterate over the stream independently with no
thread safety issues.
With BufferedTupleStream changes, I partially abstracted
ReadIterator more from the rest of BufferedTupleStream,
but decided not to completely refactor so that this patchset
didn't cause excessive churn. I.e. much BufferedTupleStream
code still accesses internal fields of ReadIterator.
Fix a pre-existing bug in grouping-aggregator where
Spill() hit a DCHECK because the hash table was
destroyed unnecessarily when it hit an OOM. This was
flushed out by the parameter change in test_spilling.
Testing:
Add test to buffered-tuple-stream-test for multiple readers
to BTS.
Tweaked test_spilling_naaj_no_deny_reservation to have
a smaller minimum reservation, required to keep the
test passing with the new, lower, memory requirement.
Updated a TPC-H planner test where resource requirements
slightly decreased for the NAAJ.
Ran the naaj tests in test_spilling.py with TSAN enabled,
confirmed no data races.
Ran exhaustive tests, which passed after fixing IMPALA-9611.
Ran core tests with ASAN.
Ran backend tests with TSAN.
Perf:
I ran this query that exercises EvaluateNullProbe() heavily.
select l_orderkey, l_partkey, l_suppkey, l_linenumber
from tpch30_parquet.lineitem
where l_suppkey = 4162 and l_shipmode = 'AIR'
and l_returnflag = 'A' and l_shipdate > '1993-01-01'
and if(l_orderkey > 5500000, NULL, l_orderkey) not in (
select if(o_orderkey % 2 = 0, NULL, o_orderkey + 1)
from orders
where l_orderkey = o_orderkey)
order by 1,2,3,4;
It went from ~13s to ~11s running on a single impalad with
this change, because of the inlining of CreateOutputRow() and
EvalConjuncts().
I also ran TPC-H SF 30 on Parquet with mt_dop=4, and there was
no change in performance.
Change-Id: I95ead761430b0aa59a4fb2e7848e47d1bf73c1c9
Reviewed-on: http://gerrit.cloudera.org:8080/15612
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This changes the test to use a debug action instead of
trying to hit the memory limit in the right spot, which
has tended to be flaky. This still exercises the error
handling code in the scanner, which was the original
point of the test (see IMPALA-2376).
This revealed an actual bug in the ORC scanner, where
it was not returning the error directly from
AssembleCollection(). Before I fixed that, the scanner
got stuck in an infinite loop when running the test.
Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21
Reviewed-on: http://gerrit.cloudera.org:8080/15700
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
This patch avoids the race with registration of a
consumer filter by registering all filters upfront
when the filter bank is constructed. Then registration
of producers and consumers hands out references to the
pre-constructed filters.
A nice bonus of this change is that RegisterConsumer()
and RegisterProducer() don't mutate anything and
we can avoid lock acquisitions.
Also adds test infrastructure and fixes TestRuntimeRowFilters to
work with mt_dop=4 (it was accidentally not enabled before). That
mostly involved modifying the tests to use aggregates of counters
instead of picking out lines with regexes.
Testing:
Added a regression test that reliably failed before this
fix. This relies on extending debug actions to allow longer
delays, plus a minor extension to the RUNTIME_PROFILE .test
file parser to handle spaces in counter names.
Ran exhaustive tests.
Change-Id: I194c0d2515b6a0e5474e1c0c8647f0e54dc94397
Reviewed-on: http://gerrit.cloudera.org:8080/15715
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
-Modified the ‘test_insert.py’ so the tests can run parallel.
-Every test will create its own temporary tables for insert testing.
-Swapped out the SETUP tags to Truncate table QUERY statement.
-Becouse the SETUP tag is not used anymore, the correspondig
code was removed.
-A test query in ‘insert.test’. The test was incorrect so modified
to test for the right behavior.
Testing:
-tests/run-tests.py query_test/test_insert.py
-impala-py.test tests/query_test/test_insert.py
-the same for test_insert_permutation.py and test_load.py
Change-Id: I257e936868917a2fcc6c030f6c855b247e8a0eea
Reviewed-on: http://gerrit.cloudera.org:8080/15529
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HIVE-22794 disallows ACID tables outside of the 'managed' warehouse
directory. This change updates data loading to make it conform to
the new rules.
The following tests had to be modified to use the new paths:
* AnalyzeDDLTest.TestCreateTableLikeFileOrc()
* create-table-like-file-orc.test
Change-Id: Id3b65f56bf7f225b1d29aa397f987fdd7eb7176c
Reviewed-on: http://gerrit.cloudera.org:8080/15708
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Column masking is implemented by replacing the masked table with a table
masking view which has masked expressions in its SelectList. However,
nested columns can't be exposed in the SelectList, so we expose them
in the output field of the view in IMPALA-9330. As a result, predicates
that reference both primitive and nested columns of the masked table
become multi-tuple predicates (referencing tuples of the view and the
masked table). Such kinds of predicates are not assigned since they no
longer bound to the view's tuple or the masked table's tuple.
We need to pick up the masked table's tuple id when getting unassigned
predicates for the table masking view. Also need to do this for
assigning predicates to the JoinNode which is the only place that
introduces multi-tuple predicates.
Tests:
- Add tests with multi-tuple predicates referencing nested columns.
- Run CORE tests.
Change-Id: I12f1b59733db5a88324bb0c16085f565edc306b3
Reviewed-on: http://gerrit.cloudera.org:8080/15654
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Full ACID row format looks like this:
{
"operation": 0,
"originalTransaction": 1,
"bucket": 536870912,
"rowId": 0,
"currentTransaction": 1,
"row": {"i": 1}
}
User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema but that would introduce several
complexities and corner cases in column resolution. Also in query
results the heading of the above user column would be "row.i". Star
expansion should also be modified, etc.
Because of that in the Catalog I create the exact opposite of the above
schema:
{
"row__id":
{
"operation": 0,
"originalTransaction": 1,
"bucket": 536870912,
"rowId": 0,
"currentTransaction": 1
}
"i": 1
}
This way very little modification is needed in the frontend. And the
hidden columns can be easily retrieved via 'SELECT row__id.*' when we
need those for debugging/testing.
We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.
Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.
Newly added Tests:
* specific queries about hidden columns (full-acid-rowid.test)
* SHOW CREATE TABLE (show-create-table-full-acid.test)
* DESCRIBE [FORMATTED] TABLE (describe-path.test)
* INSERT should be forbidden (acid-negative.test)
* added tests for column masking (
ranger_column_masking_complex_types.test)
Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
KUDU-1938 added VARCHAR column type support to Kudu.
This commit adds support for Kudu's VARCHAR type to Impala.
The length of a Kudu varchar is applied as a character length as opposed
to a byte length like Impala currently uses.
When writing data to Kudu, the VARCHAR length is not an issue because
Impala only officially supports ASCII characters and those characters are
the same size in bytes and characters. Additionally, extra bytes would be
truncated by the Kudu client if somehow a value was too long.
When reading data from Kudu, it is possible that the value written by
some other application is wider in bytes than Impala expects and can
handle. This can happen due to multi-byte UTF-8 characters. In that
case, we adjust the length in Impala to truncate the extra bytes of the
value. This isn’t a great solution, but one other integrations have taken
as well given Impala doesn’t support UTF-8 values.
IMPALA-5675 tracks adding UTF-8 Character length support to VARCHAR
columns and marked the truncation code with a TODO that references
that Jira.
Testing:
* Performed manual testing of standard DDL and DML interaction
* Manually reproduced a check failure due to multi-byte characters
and tested that length truncation resolve that issue.
* Added/adjusted the following automated tests:
** AnalyzeDDLTest: CTAS into Kudu with varchar type
** AnalyzeKuduDDLTest: CREATE TABLE in Kudu with VARCHAR type
** kudu_create.test: Create table with VARCHAR column, key, hash
partition, and range partition
** kudu_describe.test: Describe table with VARCHAR column and key
** kudu_insert.test: Insert with VARCHAR columns including null and
non-null defaults
** kudu_update.test: Updates with VARCHAR column
** kudu_upsert.test: Upserts with VARCHAR column
** kudu_delete.test Deletes with VARCHAR columns
** kudu-scan-node.test Tests basic predicates with VARCHAR columns
Follow on work:
- IMPALA-9580: Add min-max runtime filter support/tests
- IMPALA-9581: Pushdown string predicates
- IMPALA-9583: Automated multibyte truncation tests
Change-Id: I0d4959410fdd882bfa980cb55e8a7837c7823da8
Reviewed-on: http://gerrit.cloudera.org:8080/14197
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
AVG(TIMESTAMP) is not deterministic, because it uses a double to sum
the timestamps, and adding doubles in different order can lead to
different results. This does not cause problems for DOUBLE columns,
because the test framework does not require exact match if the result
is double. As AVG is the only function for TIMESTAMP with this problem,
reducing the precision of all timestamps checks seemed like an
overkill.
As a short term solution I removed the problematic aggregates from the
tests.
Testing:
- ran only the related tests
Change-Id: I10e0027a64a4e430b7db3ed7c8d0cc8cdcb202e0
Reviewed-on: http://gerrit.cloudera.org:8080/15621
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This allows running *any* read-only query with mt_dop > 0.
Before this patch, no joins were allowed with mt_dop > 0.
Previous patches, particularly IMPALA-9156, added significantly
more code coverage for multithreading+joins. It should be safe to
allow enabling on a query-by-query basis. Many improvements are
still planned - see IMPALA-3902. So behaviour and performance
characteristics of mt_dop > 0 with more complex plans and joins
will continue to change.
Testing:
Updated the mt_dop validation tests and remove redundant planner test
that doesn't provide much additional coverage of the validation
support.
Ran exhaustive tests.
Change-Id: I9c6566abb239db0e775f2beaa25a62c36313cd6f
Reviewed-on: http://gerrit.cloudera.org:8080/15545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The issue occurred when there were skipped pages and a column
inside a collection was scanned, but its position was not needed.
The repetition level still needs to be read in this case, as the
skipped ranges are set in top level rows, so collection items
need to know which top level row do they belong to.
A DCHECK in StrideWriter's constructor was hit, otherwise the
code ran correctly in release mode. The DCHECK is moved to
functions where the condition would actually cause problems.
Testing:
- added and ran a regression test
Change-Id: I5e8ef514ead71f732c73f910af7fd1aecd37bb81
Reviewed-on: http://gerrit.cloudera.org:8080/15598
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When changing the Impala version from 3.4.0-SNAPSHOT to 3.4.0-RELEASE,
TestStatsExtrapolation::test_stats_extrapolation started failing due
to a difference in the expected cardinality (expected: 17.91K,
actual 17.90K). This is because the Impala version gets embedded into
parquet files, and this causes a slight difference in file size, which
translates into a slight difference in expected cardinality.
This modifies TestStatsExtrapolation::test_stats_extrapolation to
allow any 17.9*K cardinality.
Testing:
- Tested on master and on branch-3.4.0
Change-Id: Iebe538936f23c095ef58c808e425cfb7b31edd94
Reviewed-on: http://gerrit.cloudera.org:8080/15569
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change modifies the output of the SHOW TABLE STATS and SHOW
PARTITIONS for Kudu tables.
- PARTITIONS: the #Row column has been removed
- TABLE STATS: instead of showing partition informations it returns a
resultset similar to HDFS table stats, #Rows, #Partitions, Size, Format
and Location
Example outputs can be seen in the doc changes.
Testing:
* kudu_stats.test is modified to verify the new result set
* kudu_partition_ddl.test is modified to verify the new partitions style
* Updated unit test with the new error message
Change-Id: Ice4b8df65f0a53fe14b8fbe35d82c9887ab9a041
Reviewed-on: http://gerrit.cloudera.org:8080/15199
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch supports reading and writing DATE values
to Kudu tables. It does not add min-max filter runtime
support, but there is followup JIRA IMPALA-9294.
Corresponding Kudu JIRA is KUDU-2632.
Change-Id: I91656749a58ac769b54c2a63bdd4f85c89520b32
Reviewed-on: http://gerrit.cloudera.org:8080/14705
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
String values from external systems (HDFS, Hive, Kudu, etc.) are already
unescaped, the same as string values in Thrift objects deserialized in
coordinators. We should mark needsUnescaping_ as false in creating
StringLiterals for these values (in LiteralExpr#create()).
When comparing StringLiterals in partition pruning, we should also use
the unescaped values if needsUnescaping_ is true.
Tests:
- Add tests for partition pruning on unescaped strings.
- Add test coverage for all existing code paths using
LiteralExpr#create().
- Run core tests
Change-Id: Iea8070f16a74f9aeade294504f2834abb8b3b38f
Reviewed-on: http://gerrit.cloudera.org:8080/15278
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When a runtime filter has remote target, coordinator will Disable the
FilterState upon arrival of the last filter update to prevent another
update towards that filter. As consequence, such runtime filter will
always be displayed as disabled in runtime profile (Enabled column is
equal to false in Final filter table), when in reality the runtime
filter has heard back from all pending backends and complete. The
Enabled column should correctly distinguish between failed runtime
filter vs complete runtime filter. To do so, we add
all_updates_received_ flag in FilterState class and set it to true
after filter received enough filter update from pending backends to
proceed. If all_updates_received_ is true, then that runtime filter is
considered as enabled.
Testing:
- Add row regex in runtime_filters.test, query 6, to verify REMOTE
runtime filter is marked as enabled in final filter table
- Run and pass test_runtime_filters.py
- Run and pass core tests
Change-Id: I82a5a776103abd0a6d73336bebc65e22b4e13fef
Reviewed-on: http://gerrit.cloudera.org:8080/15308
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As far as I can tell, the query failed to spill because the
pre-agg was able to release reservation before the post-agg
needed it. Probably there is some variance because of buffering
in the exchange.
This change slightly reduces the reservation to minimise the
chance of this recurring.
Also remove a duplicated instance of this test.
Change-Id: Ifb8376e2e12d3f73d6c0e27c697be4fc86f9c755
Reviewed-on: http://gerrit.cloudera.org:8080/15339
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ORC scanner uses TimestampValue::FromUnixTimeNanos() to convert
sec + nano representation to Impala's TimestampValue (day + nano).
FromUnixTimeNanos was affected by flag
use_local_tz_for_unix_timestamp_conversions, while that global option
should not affect ORC. By default there was no conversion, but if the
flag is 1, then timestamps were interpreted as UTC and converted to
local time.
This could be solved by creating a UTC version of FromUnixTimeNanos,
but I decided to change the interface in the hope of making To/From
timestamp functions less confusing.
Changes:
- Fixed the bug by passing UTC as timezone in the ORC scanner.
- Changed the interface of these TimestampValue functions to expect
a timezone pointer, interpret null as UTC and skip conversion. It
would be also possible to pass the actual UTC timezone and check
for this in the functions, but I guess it is easier to optimize
the inlined functions this way.
- Moved the checking of use_local_tz_for_unix_timestamp_conversions to
RuntimeState and added property time_zone_for_unix_time_conversions()
to return the timezone to use in Unix time conversions. This made
TimestampValue's interface clearer and makes it easy to replace the
flag with a query option if we want to.
- Changed RuntimeState and the Parquet scanner to skip timezone
conversion if convert_legacy_hive_parquet_utc_timestamps=1 but the
timezone is UTC. This allows users to avoid the performance penalty
of this flag by setting query option timezone to UTC in their
session (IMPALA-7557). CCTZ is not good at this, actually
conversions are slower with fixed offset timezones (including UTC)
than with timezones that have DST/historical rule changes.
Postponed changes:
- Didn't remove the UTC versions of the functions yet, as that would
require changing (and possibly rethinking) several BE tests and
benchmarks (IMPALA-9409).
Tests:
- Added regression test for Orc and other file formats to
check that they are not affected by this flag.
- Extended test_hive_parquet_timestamp_conversion.py to cover the case
when convert_legacy_hive_parquet_utc_timestamps=1 and timezone=UTC.
Also did some cleanup there to use query option timezone instead of
env var TZ.
Change-Id: I14e2a7e512ccd013d5d9fe480a5467ed4c46b76e
Reviewed-on: http://gerrit.cloudera.org:8080/15222
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This enables parallel plans with the join build in a
separate fragment and fixes all of the ensuing fallout.
After this change, mt_dop plans with joins have separate
build fragments. There is still a 1:1 relationship between
join nodes and builders, so the builders are only accessed
by the join node's thread after it is handed off. This lets
us defer the work required to make PhjBuilder and NljBuilder
safe to be shared between nodes.
Planner changes:
* Combined the parallel and distributed planning code paths.
* Misc fixes to generate reasonable thrift structures in the
query exec requests, i.e. containing the right nodes.
* Fixes to resource calculations for the separate build plans.
** Calculate separate join/build resource consumption.
** Simplified the resource estimation by calculating resource
consumption for each fragment separately, and assuming that
all fragments hit their peak resource consumption at the
same time. IMPALA-9255 is the follow-on to make the resource
estimation more accurate.
Scheduler changes:
* Various fixes to handle multiple TPlanExecInfos correctly,
which are generated by the planner for the different cohorts.
* Add logic to colocate build fragments with parent fragments.
Runtime filter changes:
* Build sinks now produce runtime filters, which required
planner and coordinator fixes to handle.
DataSink changes:
* Close the input plan tree before calling FlushFinal() to release
resources. This depends on Send() not holding onto references
to input batches, which was true except for NljBuilder. This
invariant is documented.
Join builder changes:
* Add a common base class for PhjBuilder and NljBuilder with
functions to handle synchronisation with the join node.
* Close plan tree earlier in FragmentInstanceState::Exec()
so that peak resource requirements are lower.
* The NLJ always copies input batches, so that it can close
its input tree.
JoinNode changes:
* Join node blocks waiting for build-side to be ready,
then eventually signals that it's done, allowing the builder
to be cleaned up.
* NLJ and PHJ nodes handle both the integrated builder and
the external builder. There is a 1:1 relationship between
the node and the builder, so we don't deal with thread safety
yet.
* Buffer reservations are transferred between the builder and join
node when running with the separate builder. This is not really
necessary right now, since it is all single-threaded, but will
be important for the shared broadcast.
- The builder transfers memory for probe buffers to the join node
at the end of each build phase.
- At end of each probe phase, reservation needs to be handed back
to builder (or released).
ExecSummary changes:
* The summary logic was modified to handle connecting fragments
via join builds. The logic is an extension of what was used
for exchanges.
Testing:
* Enable --unlock_mt_dop for end-to-end tests
* Migrate some tests to run as part of end-to-end tests instead of
custom cluster.
* Add mt_dop dimension to various end-to-end tests to provide
coverage of join queries, spill-to-disk and cancellation.
* Ran a single node TPC-H and TPC-DS stress test with mt_dop=0
and mt_dop=4.
Perf:
* Ran TPC-H scale factor 30 locally with mt_dop=0. No significant
change.
Change-Id: I4403c8e62d9c13854e7830602ee613f8efc80c58
Reviewed-on: http://gerrit.cloudera.org:8080/14859
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Inserts can add a sort node that orders the rows by partitioning
and Kudu primary key columns (aka. clustered insert). The issue
occurred when the target column was a timestamp and the source
was an expression that returned a string (e.g. concat()). Impala
adds an implicit cast to convert the strings to timestamps before
sorting, but this cast was incorrectly removed later during expression
substitution.
This led to hitting a DCHECK in debug builds and a (not too
informative) error message in release mode.
Note that the cast in question is not visible in EXPLAIN outputs.
Explain should contain implicit casts from explain_level=2 since
https://gerrit.cloudera.org/#/c/11719/ , but it is still not shown
in some expressions. I consider this to be a separate issue.
Testing:
- added an EE test that used to crash
- ran planner / sort / kudu_insert tests
Change-Id: Icca8ab1456a3b840a47833119c9d4fd31a1fff90
Reviewed-on: http://gerrit.cloudera.org:8080/15217
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Add a new flag -with_ranger in testdata/bin/run-hive-server.sh to start
Hive with Ranger integration. The relative configuration files are
generated in bin/create-test-configuration.sh using a new varient
ranger_auth in hive-site.xml.py. Only Hive3 is supported.
Current limitation:
Can't use different username in Beeline by the -n option. "select
current_user()" keeps returning my username, while "select
logged_in_user()" can return the username given by -n option but it's
not used in authorization.
Tests:
- Ran bin/create-test-configuration.sh and verified the generated
hive-site_ranger_auth.xml contains Ranger configurations.
- Ran testdata/bin/run-hive-server.sh -with_ranger. Verified column
masking and row filtering policies took effect in Beeline.
- Added test in test_ranger.py for this mode.
Change-Id: I01e3a195b00a98388244a922a1a79e65146cec42
Reviewed-on: http://gerrit.cloudera.org:8080/15189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hudi Read Optimized Table contains multiple versions of parquet files,
in order to load the table correctly, Impala needs to recognize Hudi Read
Optimized Table as a HdfsTable and load the latest version of the file
using HoodieROTablePathFilter.
Tests
- Unit test for Hudi in FileMetadataLoader
- Create table tests in functional_schema_template.sql
- Query tests in hudi-parquet.test
Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds validation for the paired stats values of tinyint
and smallint column data type when reading min/max column stats
value from Parquet file.
Testing:
- Added automatic test cases in parquet-stats.test for column data
type been changed from int to tinyint, from smallint to tinyint
and from int to smallint.
- Passed EE tests.
- Passed all core tests.
Change-Id: Id8bdaf4c4b2d0c6ea26d6e9bf013afca647e53a1
Reviewed-on: http://gerrit.cloudera.org:8080/15087
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Column masking policies on primitive columns of a table which contains
nested types (though they won't be masked) will cause query failures.
To be specifit, if tableA(id int, int_array array<int>) has a masking
policy on column "id", all queries on "tableA" will fail, e.g.
select id from tableA;
select t.id, a.item from tableA t, t.int_array a;
Column masking is implemented by wrapping the underlying table/view with
a table masking view. However, as we don't support nested types in
SelectList, the table masking view can't expose nested columns of the
masked table, which causes collection refs not being resolved correctly.
This patch fixes the issue by 2 steps:
1) Expose nested columns of the underlying table in the output Type of
the table masking view (see InlineViewRef#createTupleDescriptor()).
So nested Paths in the original query block can be resolved.
2) For such kind of Paths, resolved them again inside the table masking
view. So they can point to the underlying table as what they mean
(see Analyzer#resolvePathWithMasking()). TupleDescriptor of such kind
of table masking view won't be materialized since the view is simple
enough that its query plan is just a ScanNode of the underlying
table. The whole query plan can be stitched as if the table is not
masked.
Note that one day when we support nested columns in SelectList, we may
don't need these 2 hacks.
This patch also adds some TRACE level loggings to improve debuggability,
and enables column masking by default.
Test changes in TestRanger.test_column_masking:
- Add column masking policy on a table containing nested types.
- Add queries on the masked tables. Some queries are borrowed from
existing tests for nested types.
Tests:
- Run CORE tests.
Change-Id: I1cc5565c64c1a4a56445b8edde59b1168f387791
Reviewed-on: http://gerrit.cloudera.org:8080/15108
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fixes a subtle memory managment issue where freeing of a
buffer is delayed longer than it should be. This means that
the full buffer pool reservation is not available for
repartitioning, which can lead to crashes or hang for
very specific queries.
The fix is to transfer resources from output_unmatched_batch_
as soon as the last row from the batch is appended to the
output batch.
This bug would only be triggered by join modes that output
unmatched rows from the right side (RIGHT OUTER JOIN,
FULL OUTER JOIN, RIGHT ANTI JOIN) *and* have an empty
probe side (otherwise unmatched rows are output by
iterating over the hash table).
Testing:
Added DCHECKs to check that all resources are available
before repartitioning.
Added a regression test that triggered the bug.
Change-Id: Ie13b51d4d909afb0fe2e7b7dc00b085c51058fed
Reviewed-on: http://gerrit.cloudera.org:8080/15142
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HMS seems to be returning SQLPrimaryKeys in inconsistent orders.
This makes some of the primary keys tests flaky. This change sorts
the list of primary keys and stores them in canonical order within
Impala.
Testing:
- Modified the tests that were relying on HMS to return same order
every time.
- Ran parametrized job.
Change-Id: I0f798d7a2659c6cd061002db151f3fa787eb6370
Reviewed-on: http://gerrit.cloudera.org:8080/15106
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Hive 3 changed the typical storage model for tables to split them
between two directories:
- hive.metastore.warehouse.dir stores managed tables (which is now
defined to be only transactional tables)
- hive.metastore.warehouse.external.dir stores external tables
(everything that is not a transactional table)
In more recent commits of Hive, there is now validation that the
external tables cannot be stored in the managed directory. In order
to adopt these newer versions of Hive, we need to use separate
directories for external vs managed warehouses.
Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
The Hive 2 configuration doesn't change as it does not have this concept.
Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload will uses a separate
location for data as compared to USE_CDP_HIVE=false. That should reduce
conflicts between the two configurations.
Testing:
- Ran exhaustive tests with USE_CDP_HIVE=false
- Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
- Verified that dataload succeeds and tests are able to run with a newer
Hive version.
Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
in LocalCatalog Mode.
This change add a new method 'loadConstraints()' to the MetaProvider
interface.
1. In CatalogdMetaProvider implementation, we fetch the primary key
(PK) and foreign key(FK) information via the GetPartialCatalogObject()
RPC to the catalogd. This is modified to include PK/FK information.
This is because, on catalog side we eagerly load PK/FK information
which can be sent over to local catalog in a single RPC to Catalog.
This information is then stored in TableMetaRef object for future
consumers.
2. In the DirectMetaProvider implementation, we make two RPCs to HMS
to directly get PK/FK information.
Load constraints can be extended to include other constraints later
(for ex: unique constraints.)
Testing:
- Added tests in LocalCatalogTest, CatalogTest and PartialCatalogInfoTest
- This change also modifies the toSqlUtil for show create table
statements. Added a test for the same.
Change-Id: I7ea7e1bacf6eb502c67caf310a847b32687e0d58
Reviewed-on: http://gerrit.cloudera.org:8080/14731
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Implements read path for the date type in ORC scanner. The internal
representation of a date is an int32 meaning the number of days since
Unix epoch using proleptic Gregorian calendar.
Similarly to the Parquet implementation (IMPALA-7370) this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.
Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Reviewed-on: http://gerrit.cloudera.org:8080/14982
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Ranger provides column masking policies about how to show masked values
to specific users when reading specific columns. This patch adds support
to rewrite the query AST based on column masking policies.
We perform the column masking policies by replacing the TableRef with a
subquery doing the masking. For instance, the following query
select c_id, c_name from customer c join orders on c_id = o_cid
will be transfomed into
select c_id, c_name from (
select mask1(c_id) as c_id, mask2(c_name) as c_name from customer
) c
join orders
on c_id = o_cid
The transfomation is done in AST resolution. Just like view resolution,
if the table needs masking we replace it with a subquery(InlineViewRef)
containing the masking expressions.
This patch only adds support for mask types that don't require builtin
mask functions. So currently supported masking types are MASK_NULL and
CUSTOM.
Current Limitations:
- Users are required to have privileges on all columns of a masked
table(IMPALA-9223), since the table mask subquery contains all the
columns.
Tests:
- Add e2e tests for masked results
- Run core tests
Change-Id: I4cad60e0e69ea573b7ecfc011b142c46ef52ed61
Reviewed-on: http://gerrit.cloudera.org:8080/14894
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala supports creating a table using the schema of a file.
However, only Parquet is supported currently. This commit adds
support for creating tables from ORC files
The change relies on the ORC Java API with version 1.5 or
greater, because of a bug in earlier versions. Therefore, ORC is
listed as an external dependency, instead of relying on Hive's
ORC version (from Hive3, Hive also lists it as a dependency).
Also, the commit performs a little clean-up on the ParquetHelper
class, renaming it to ParquetSchemaExtractor and removing outdated
comments.
To create a table from an ORC file, run:
CREATE TABLE tablename LIKE ORC '/path/to/file'
Tests:
* Added analysis tests for primitive and complex types.
* Added e2e tests for creating tables from ORC files.
Change-Id: I77cd84cda2ed86516937a67eb320fd41e3f1cf2d
Reviewed-on: http://gerrit.cloudera.org:8080/14811
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In HMS-3 the translation layer converts a managed kudu table into an
external kudu table and adds additional table property
'external.table.purge' to 'true'. This means any installation which
is using HMS-3 (or a Hive version which has HIVE-22158) will always
create Kudu tables as external tables. This is problematic since the
output of show create table will now be different and may confuse
the users.
In order to improve the user experience of such synchronized tables
(external tables with external.table.purge property set to true),
this patch adds support in Impala to create
external Kudu tables. Previous versions of Impala disallowed
creating a external Kudu table if the Kudu table did not exist.
After this patch, Impala will check if the Kudu table exists and if
it does not it will create a Kudu table based on the schema provided
in the create table statement. The command will error out if the Kudu
table already exists. However, this applies to only the synchronized
tables. Previous way to create a pure external table behaves the
same.
Following syntax of creating a synchronized table is now allowed:
CREATE EXTERNAL TABLE foo (
id int PRIMARY KEY,
name string)
PARTITION BY HASH PARTITIONS 8
STORED AS KUDU
TBLPROPERTIES ('external.table.purge'='true')
The syntax is very similar to creating a managed table, except for
the EXTERNAL keyword and additional table property. A synchronized
table will behave similar to managed Kudu tables (drops and renames
are allowed). The output of show create table on a synchronized
table will display the full column and partition spec similar to the
managed tables.
Testing:
1. After the CDP version bump all of the existing Kudu tables now
create synchronized tables so there is good coverage there.
2. Added additional tests which create synchronized tables and
compares the show create table output.
3. Ran exhaustive tests with both CDP and CDH builds.
Change-Id: I76f81d41db0cf2269ee1b365857164a43677e14d
Reviewed-on: http://gerrit.cloudera.org:8080/14750
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently -0.0/+0.0 values are hashed to different values due to
their different binary representation, while -0.0==+0.0 is true in
C++. This caused them to be distinct values in hash maps despite
being treated as equal in comparisons.
This commit fixes the hashing of -0.0/+0.0, thus changing the
behaviour of hash joins and aggregations (since aggregations
follow the behaviour of the join). That way, the canonical form for
-0/+0 is changed to +0.
Tests:
- Added e2e tests for aggregation (group by and distinct) and
join queries with -0.0 and +0.0 present.
Change-Id: I6bb1a817c81c452d041238c19cb6c9f602a5d565
Reviewed-on: http://gerrit.cloudera.org:8080/14588
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When mt_dop > 0, the summary is reporting the number of fragment
instances, instead of the number of hosts as the header would
imply.
This commit fixes the issue so the number of hosts will be shown
under the #Hosts column. The commit also adds an #Inst column
where the number of instances are shown (current behaviour).
Tests:
* Changed profile tests with mt_dop > 0.
* Updated benchmark tests and shell tests accordingly.
Change-Id: I3bdf9a06d9bd842b2397cd16c28294b6bec7af69
Reviewed-on: http://gerrit.cloudera.org:8080/14715
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>