9647 Commits

Author SHA1 Message Date
Joe McDonnell
60f8f87b09 IMPALA-10274: Initialize impala-python as part of the CMake build
Initializing the impala-python virtualenv takes a couple minutes,
so it is useful to do that in parallel to the rest of the build.
This moves the impala-python initialization to its own step
in the CMake build. It stops using impala-python for commands
invoked from buildall.sh or the CMake build to avoid premature
or concurrent initializations of impala-python. Then, it adds
a dedicated step to initialize impala-python.

Testing:
 - Ran a core job and a couple builds
 - Rebuilt and verified that impala-python is not reinitialized
   if it is already initialized

Change-Id: Ieff51263c55bd234028fed7101c94b4a928590f0
Reviewed-on: http://gerrit.cloudera.org:8080/16607
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-04 17:03:57 +00:00
Alexey Serbin
f8ed3f6722 IMPALA-10472 flag for Kudu connection negotiation timeout
This patch adds --kudu_client_connection_negotiation_timeout_ms flag
to control client-side connection negotiation timeout in the Kudu
client working as part of Impala's BE.  Since [1] has been
addressed for Kudu C++ client, it makes sense to provide a control knob
to customize the timeout.  That should help to address cases where very
busy cluster nodes hosting Kudu tablet servers aren't fast enough to
negotiate a new connection within the default timeout interval (3 sec),
as mentioned in the description of [1].

[1] https://issues.apache.org/jira/browse/KUDU-2966

Change-Id: I1223187318691da47082608356547f6d78144466
Reviewed-on: http://gerrit.cloudera.org:8080/16705
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-04 05:42:26 +00:00
wzhou-code
b5e2a0ce2e IMPALA-9224: Blacklist nodes with faulty disk for spilling
This patch extends the blacklist functionality by adding an executor node
to the blacklist if a query fails because of a disk failure during
spill-to-disk. It also classifies disk error codes and defines a set of
blacklistable errors for non-transient disk errors. The coordinator
blacklists an executor only if the executor hit a blacklistable error
during spill-to-disk.

Adds a new debug action to simulate disk write error during spill-to-
disk. To use, specify in query options as:
  'debug_action': 'IMPALA_TMP_FILE_WRITE:<hostname>:<port>:<action>'

  where <hostname> and <port> identify the impalad which executes the
  fragment instances; <port> is the BE krpc port (default 27000).
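
A minimal sketch of how the query option above could be assembled, assuming
a Python test harness. The helper name is hypothetical and the action value
is left as a caller-supplied parameter, since this message does not list the
valid actions:

```python
def spill_write_debug_action(hostname, port, action):
    """Build the query-option dict for the IMPALA_TMP_FILE_WRITE debug action.

    hostname/port identify the impalad (port is the BE krpc port, 27000 by
    default). 'action' is passed through unchanged; its valid values are not
    specified here, so callers must supply one.
    """
    value = "IMPALA_TMP_FILE_WRITE:{0}:{1}:{2}".format(hostname, port, action)
    return {"debug_action": value}
```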

Adds new test cases for blacklist and query-retry to cover the code
changes.

Testing:
 - Passed new test cases.
 - Passed exhaustive test.
 - Manually simulated disk failures in scratch directories on nodes
   of a cluster, verified that the nodes were blacklisted as
   expected.

Change-Id: I04bfcb7f2e0b1ef24a5b4350f270feecd8c47437
Reviewed-on: http://gerrit.cloudera.org:8080/16949
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-04 05:12:42 +00:00
Alexey Serbin
91fd8fd130 [config] bump toolchain build id
The motivation for this version patch is two-fold:

  * Update the version of Kudu client to reflect the recently
    released Kudu 1.14 (see https://kudu.apache.org/releases/1.14.0/)

  * Pick up https://gerrit.cloudera.org/#/c/16705 change to control
    Kudu client connection negotiation timeout in impalad

Change-Id: I20fd6b092ce6a04465624914f6116a33622e977e
Reviewed-on: http://gerrit.cloudera.org:8080/17018
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-03 23:01:27 +00:00
Gabor Kaszab
1a38924d20 IMPALA-9588: Add extra logging to cancel tests
There have been some cancel tests that are flaky and their logs didn't
reveal the root cause of the failures. Adding some extra logging so
that we can see a bit more of the nature of the failure.
The extra log message contains:
  - Query SQL
  - Message of the exception thrown during fetching the results
  - Query Status line from the query profile

Change-Id: Ied7100a9ea2e2f0611cf8e328e589b4c8e5d5100
Reviewed-on: http://gerrit.cloudera.org:8080/16985
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-03 21:01:24 +00:00
Zoltan Borok-Nagy
a81c6a7829 IMPALA-10460: Impala should write normalized paths in Iceberg manifests
Currently Impala writes double slashes in the paths of datafiles
for non-partitioned Iceberg tables. Unnormalized paths can cause
problems later.

This patch removes the redundant slashes.
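
For illustration only, a Python sketch of this kind of normalization. It is
not the code in this patch; the function name and the rule of leaving a
leading "scheme://" untouched are assumptions:

```python
import re

def normalize_path(path):
    """Collapse runs of '/' into a single '/', keeping a leading 'scheme://'."""
    match = re.match(r"^([A-Za-z][A-Za-z0-9+.\-]*://)(.*)$", path, flags=re.DOTALL)
    if match:
        scheme, rest = match.group(1), match.group(2)
    else:
        scheme, rest = "", path
    # Only the part after the scheme is normalized.
    return scheme + re.sub(r"/{2,}", "/", rest)
```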

Testing:
 * Tested manually by inspecting the manifest files of the
   Iceberg tables. Used both non-partitioned and partitioned tables.

Change-Id: If5ecac78102ed35710dd70a18edc71f6e891e748
Reviewed-on: http://gerrit.cloudera.org:8080/16993
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-01 16:55:53 +00:00
Zoltan Borok-Nagy
646b0e011c IMPALA-10456: Implement TRUNCATE for Iceberg tables
This patch adds support for the TRUNCATE statement for
Iceberg tables.

The TRUNCATE operation creates a new snapshot for the target
table that doesn't have any data files. Table and column stats
are also cleared. This patch also fixes a bug that prevented
table/column stats from being propagated.

Testing
 * added e2e tests for both partitioned and unpartitioned tables

Change-Id: I6116c7c36aba871c0be79f499e0ac618072ca7b8
Reviewed-on: http://gerrit.cloudera.org:8080/16987
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: wangsheng <skyyws@163.com>
2021-02-01 11:14:01 +00:00
Joe McDonnell
f086d5c6f9 IMPALA-10462: Include org/apache/hive/hadoop/common/type/* in impala-minimal-hive-exec
With newer versions of Iceberg, TestIcebergTable::test_create_iceberg_tables
fails with ClassNotFoundException for org.apache.hive.hadoop.common.type.Date.
This adds that missing location to the impala-minimal-hive-exec.

Testing:
 - Ran TestIcebergTable::test_create_iceberg_tables with newer Iceberg

Change-Id: I3fc33ff17489c2bd54d2ec8798ec7a3e5cfb051c
Reviewed-on: http://gerrit.cloudera.org:8080/17005
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2021-01-29 00:18:33 +00:00
Thomas Tauber-Marshall
39c424d7c8 IMPALA-10454: Bump --ssl_minimum_version to tls1.2
TLS versions < 1.2 are now considered insecure. This patch improves
Impala's default security.

This is made possible now in part because Impala 4.0 dropped support
for Python versions < 2.7.9 (or 2.7.5 on certain distributions where
it has been patched) as lower Python versions do not support tls1.2.
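
As a client-side illustration (not part of this patch), a Python 3.7+
client can enforce the same floor with the standard ssl module. The
function name is hypothetical:

```python
import ssl

def make_client_context():
    """Return a client SSL context that refuses TLS versions below 1.2."""
    ctx = ssl.create_default_context()
    # Mirrors the server-side default described above.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```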

Testing:
- Existing SSL tests are updated to reflect the new default.

Change-Id: Ifed66646b041a061f9db92744710aef7453f39e4
Reviewed-on: http://gerrit.cloudera.org:8080/16988
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-28 04:19:39 +00:00
Tim Armstrong
4c828b65ab IMPALA-8306: clarify wording on /sessions UI
Change-Id: I01578feb0f2bccd2605bbe6aa2e9eca382260f2e
Reviewed-on: http://gerrit.cloudera.org:8080/16981
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Vincent Tran <vttran@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-27 23:33:14 +00:00
xqhe
4ae847bf94 IMPALA-10382: fix invalid outer join simplification
When ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION is set to true, the planner
will simplify outer joins if the predicate has a case expr or conditional
function on both sides of the outer join.
However, such a predicate may not be null-rejecting, and simplifying the
outer join then gives an incorrect result. E.g. t1.b > coalesce(t1.c, t2.c)
can return true if t2.c is null, so it is not a null-rejecting predicate
for t2.
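
The example can be checked with a small Python model of SQL's three-valued
logic, where None stands for NULL. This is only a sketch of the semantics,
not planner code, and the helper names are made up for illustration:

```python
def coalesce(*values):
    """Return the first non-NULL (non-None) argument, or None."""
    for v in values:
        if v is not None:
            return v
    return None

def sql_gt(left, right):
    """SQL '>' comparison: the result is NULL (None) if either side is NULL."""
    if left is None or right is None:
        return None
    return left > right

def predicate(t1_b, t1_c, t2_c):
    """Models t1.b > coalesce(t1.c, t2.c)."""
    return sql_gt(t1_b, coalesce(t1_c, t2_c))
```

With t2.c set to NULL the predicate can still evaluate to true, whereas a
plain t1.b > t2.c evaluates to NULL and would reject the row.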

The fix is to support only the case where the predicate has two
operands, the operator is one of (=, !=, >, <, >=, <=), and
1. one of the operands, or
2. if the operand is an arithmetic expression, one of its children
does not contain a conditional builtin function or case expr and has
a tuple id in the outer-joined tuples.
E.g. t1.b > coalesce(t2.c, t1.c) and t1.b + coalesce(t2.c, t1.c) >
coalesce(t2.c, t1.c) are null-rejecting predicates for t1.

Testing:
* Add new plan tests in outer-to-inner-joins.test
* Add new query tests to verify the correctness on transformation

Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b
Reviewed-on: http://gerrit.cloudera.org:8080/16845
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2021-01-27 17:30:37 +00:00
Zoltan Borok-Nagy
08367e91f0 IMPALA-10452: CREATE Iceberg tables with old PARTITIONED BY syntax
For convenience this patch adds support for the old-style
CREATE TABLE ... PARTITIONED BY ...; syntax for Iceberg tables.

So users should be able to write the following:

CREATE TABLE ice_t (i int)
PARTITIONED BY (p int)
STORED AS ICEBERG;

Which should be equivalent to this:

CREATE TABLE ice_t (i int, p int)
PARTITION BY SPEC (p IDENTITY)
STORED AS ICEBERG;

Please note that the old-style CREATE TABLE statement creates
IDENTITY-partitioned tables. For other partition transforms the
users must use the new, more generic syntax.

Hive also supports the old PARTITIONED BY syntax with the same
behavior.

Testing:
 * added e2e tests

Change-Id: I789876c161bc0987820955aa9ae01414e0dcb45d
Reviewed-on: http://gerrit.cloudera.org:8080/16979
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 22:12:25 +00:00
Tim Armstrong
f4584dd276 IMPALA-10404: Update docs to reflect RLE_DICTIONARY support
Fix references to PLAIN_DICTIONARY to reflect that
RLE_DICTIONARY is supported too.

Change-Id: Iee98abfd760396cf43302c9077c6165eb3623335
Reviewed-on: http://gerrit.cloudera.org:8080/16982
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 16:47:27 +00:00
Tim Armstrong
eb85c6eeca IMPALA-9793: Impala quickstart cluster with docker-compose
What works:
* A single node cluster can be started up with docker-compose
* HMS data is stored in Derby database in a docker volume
* Filesystem data is stored in a shared docker volume, using the
  localfs support in the Hadoop client.
* A Kudu cluster with a single master can be optionally added on
  to the Impala cluster.
* TPC-DS data can be loaded automatically by a data loading container.

We need to set up a docker network called quickstart-network,
purely because docker-compose insists on generating network names
with underscores, which are part of the FQDN and end up causing
problems with Java's URL parsing, which rejects these technically
invalid domain names.

How to run:

Instructions for running the quickstart cluster are in
docker/README.md.

How to build containers:

  ./buildall.sh -release -noclean -notests -ninja
  ninja quickstart_hms_image quickstart_client_image docker_images

How to upload containers to dockerhub:

  IMPALA_QUICKSTART_IMAGE_PREFIX=timgarmstrong/
  for i in impalad_coord_exec impalad_coordinator statestored \
           impalad_executor catalogd impala_quickstart_client \
           impala_quickstart_hms
  do
    docker tag $i ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i
    docker push ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i
  done

I pushed containers built from commit f260cce22, which
was branched from 6cb7cecacf on master.

Misc other stuff:
* Added more metadata to all images.

TODO:
* Test and instructions to run against Kudu quickstart
* Upload latest version of containers before merging.

Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c
Reviewed-on: http://gerrit.cloudera.org:8080/15966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 11:22:08 +00:00
Andrew Sherman
3b763b5c32 IMPALA-10447: Add a newline when exporting shell output to a file.
Impala shell outputs a batch of rows using OutputStream. Inside
OutputStream, output to a file is handled slightly differently from
output that is written to stdout. When writing to stdout we use print()
(which appends a newline) while when writing to a file we use write()
(which adds nothing). This difference was introduced in IMPALA-3343 so
this bug may be a regression introduced then. To ensure that output is
the same in either case we need to add a newline after writing each
batch of rows to a file.
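
A simplified sketch of the behavior described above. It is not the shell's
actual OutputStream code; the function names and the add_newline switch
exist only to show the difference:

```python
import io

def write_batches(batches, out, add_newline=True):
    """Write each batch of rows to 'out', one row per line."""
    for rows in batches:
        out.write("\n".join(rows))
        if add_newline:
            # Without this, the last row of one batch and the first row of
            # the next batch end up on the same line.
            out.write("\n")

def render(batches, add_newline=True):
    """Helper that returns what would be written to the file."""
    buf = io.StringIO()
    write_batches(batches, buf, add_newline)
    return buf.getvalue()
```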

TESTING:
    Added a new test for this case.

Change-Id: I078a06c54e0834bc1f898626afbfff4ded579fa9
Reviewed-on: http://gerrit.cloudera.org:8080/16966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 08:32:29 +00:00
stiga-huang
e8720b40f1 IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
A unicode character can be encoded into 1-4 bytes in UTF-8. String
functions will return undesired results when the input contains unicode
characters, because we deal with a string as a byte array. For instance,
length() returns the length in bytes, not in unicode characters.
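
The difference is easy to see in Python. The helpers below are illustrative
only and are not the backend implementation; they assume valid UTF-8 input:

```python
def byte_length(text):
    """Length in bytes of the UTF-8 encoding (the byte-array view)."""
    return len(text.encode("utf-8"))

def utf8_char_count(data):
    """Count characters by skipping UTF-8 continuation bytes (10xxxxxx)."""
    return sum(1 for b in bytearray(data) if (b & 0xC0) != 0x80)

def utf8_reverse_chars(data):
    """Reverse by character rather than by byte."""
    return data.decode("utf-8")[::-1].encode("utf-8")
```

Reversing byte by byte would split multi-byte characters and produce
invalid UTF-8, which is why the character-aware variant decodes first.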

UTF-8 is the dominant unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support in some string functions so they can have
UTF-8 aware behavior. For compatibility with the old versions, a new
query option, UTF8_MODE, is added for turning on/off the UTF-8 aware
behavior. Currently, only length(), substring() and reverse() support
it. Other function supports will be added in later patches.

String functions will check the query option and switch to use the
desired implementation. It's similar to how we use the decimal_v2 query
option in builtin functions.

For easy testing, the UTF-8 aware version of string functions are
also exposed as builtin functions (named by utf8_*, e.g. utf8_length).

Tests:
 - Add BE tests for utf8 functions.
 - Add e2e tests for the UTF8_MODE query option.

Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Reviewed-on: http://gerrit.cloudera.org:8080/16908
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 00:43:39 +00:00
Riza Suminto
2644203d1c IMPALA-10147: Avoid getting a file handle for data cache hits
When reading from the data cache, the disk IO thread first gets a file
handle, then it checks the data cache for a hit. The file handle is only
used if there is a data cache miss. On a data cache hit it is not used
and is pure overhead. This patch moves the file handle retrieval later,
to the point where a data cache miss happens.
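
The reordering can be sketched as follows. This is a toy Python model, not
the backend code; the class, the dict-based cache and the open_handle
callback are assumptions made for the example:

```python
import io

class CacheFirstReader(object):
    """Checks the cache first and opens a file handle only on a miss."""

    def __init__(self, cache, open_handle):
        self.cache = cache              # dict: (path, offset, length) -> bytes
        self.open_handle = open_handle  # callable: path -> file-like object
        self.handles_opened = 0

    def read(self, path, offset, length):
        key = (path, offset, length)
        cached = self.cache.get(key)
        if cached is not None:
            return cached               # hit: no file handle needed
        self.handles_opened += 1        # miss: only now get a handle
        handle = self.open_handle(path)
        try:
            handle.seek(offset)
            data = handle.read(length)
        finally:
            handle.close()
        self.cache[key] = data
        return data

def demo():
    reader = CacheFirstReader({}, lambda path: io.BytesIO(b"0123456789"))
    first = reader.read("f", 2, 3)
    second = reader.read("f", 2, 3)     # served from the cache
    return first, second, reader.handles_opened
```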

Testing:
- Add custom cluster test test_no_fd_caching_on_cached_data.
- Pass core tests.

Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
Reviewed-on: http://gerrit.cloudera.org:8080/16963
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-25 16:05:00 +00:00
Laszlo Gaal
6d4756da01 IMPALA-10448: Build impala-profile-tool early for Docker-based tests
impala-profile-tool is a new dependency for end-to-end tests.
The tool is built together with all the other backend tests
(so the buildall.sh flag '-notests' can turn off building it), but it is
actually used in the parallel phase of end-to-end tests.

This is a problem for Docker-based builds for the following reasons:
- Docker-based tests run BE, FE and various phases of the EE test in
  separate Docker containers for parallel executions
- Test binaries are only built inside the container running BE tests to
  cut down on the build time and the size of the Docker image that all test
  containers are based on.
- This means that the EE_TEST_PARALLEL container will miss the tool
  required for running the tests designed to test it.

The solution is to build the tool early, at the end of the build phase
running in the build container. There is already another such tool built
there (parquet-reader) for a similar reason, so just add
impala-profile-tool to the same 'make' command there.

Tested by running BE_TEST and EE_TEST_PARALLEL phases in a Docker-based
build.

Change-Id: I60e78ea883f3057c59a345feca38ef08a7f6a0b8
Reviewed-on: http://gerrit.cloudera.org:8080/16965
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-22 20:39:59 +00:00
Tim Armstrong
1ada739e81 IMPALA-10296: Fix analytic limit pushdown when predicates are present
This fixes the analytic push down optimization for the case where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >=
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first <limit> rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it is always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitrarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for the
  predicates <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The elapsed time for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Reviewed-on: http://gerrit.cloudera.org:8080/16942
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-22 05:31:37 +00:00
liuyao
18acca92ee IMPALA-10435: Extend 'compute incremental stats' syntax
to support a list of columns

Modified the parser to support a column list in compute incremental
stats. No other modules need to be modified because they already
support it.

Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Reviewed-on: http://gerrit.cloudera.org:8080/16947
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2021-01-21 19:35:26 +00:00
Fucun Chu
5852a0028b IMPALA-10440: Import Theta functionality from DataSketches
This patch imports the functionality needed for Theta approximate
algorithm from Apache DataSketches.

First, I updated our existing snapshot of DataSketches to the
following commit: b2f749ed5ce6ba650f4259602b133c310c3a5ee4
"Merge pull request #182 from chufucun/include_type"
This affects files originated from hll/, kll/ and theta/ directories
of the DataSketches repo.

Then I copied all the files needed for Theta into our snapshot
directory.

Browse the source files here:
b2f749ed5c

Change-Id: I8485d6829f50b130c84ec8bef0a4b5895255ba6c
Reviewed-on: http://gerrit.cloudera.org:8080/16959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-21 14:48:10 +00:00
Fucun Chu
ac7f605711 IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT query option
- Minor edit

Change-Id: I3d422889c433062456748a953b33e3d43799be14
Reviewed-on: http://gerrit.cloudera.org:8080/16922
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
2021-01-21 07:16:42 +00:00
Laszlo Gaal
cdf5108aa9 IMPALA-10441: Skip test_bytes_read_per_column if not on local minicluster
IMPALA-9865 part 2 made some expected outcomes of the above test steps
stricter. Unfortunately these stricter results are only valid when the
tests are run on an HDFS file system in the context of a local
minicluster, breaking the same test on S3 and EC storage.

The patch disables the test step when run outside the context of a local
minicluster HDFS.

Change-Id: If8a179937c9c7c690dd2630549464dbe6aa1b834
Reviewed-on: http://gerrit.cloudera.org:8080/16964
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-21 02:10:14 +00:00
stiga-huang
4c6cf4b2ef IMPALA-10434: Fix impala-shell's unicode regressions on Python2
To make impala-shell compatible with Python3, we explicitly distinguish
bytes and text in Python2 by decoding the bytes for all inputs.

Regression 1: multiple queries in one line with unicode chars will break

In precmd() of impala-shell, if there are multiple queries present in
one input line, we split it into individual queries (by
sqlparse.split()) and append them back to the 'cmdqueue'. They will be
passed to precmd() again. In our Python2 implementation, precmd()
expects them to be str type, and will decode them into unicode type.
However, the output type of sqlparse.split() is unicode which doesn't
have a decode() method. Calling decode() on a unicode var makes
Python2 implicitly encode it to str. This may cause UnicodeEncodeError
since the implicit encoding uses 'ascii'.
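
One way to avoid the implicit round trip is to decode only values that are
actually bytes. The helper below is a hypothetical sketch, not the code in
this patch:

```python
def to_text(value, encoding="utf-8"):
    """Return 'value' as text, decoding only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text (unicode in Python2, str in Python3): leave untouched,
    # so no implicit 'ascii' encode is triggered.
    return value
```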

Regression 2: multi-line query with unicode chars will break when
command history is enabled

In _check_for_command_completion(), when calling
readline.replace_history_item in Python2, we encode the completed_cmd
into bytes. However, we shouldn't replace it since the return type is
expected to be unicode.

Tests:
 - Add tests for these two regressions in Python2.

Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4
Reviewed-on: http://gerrit.cloudera.org:8080/16960
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-20 10:20:02 +00:00
wzhou-code
8ecb61e4bd IMPALA-10259: Fixed DCHECK error for backend in terminal state
This issue happened for the core ASAN build.
According to the log messages, one backend sent a status report with
instance_exec_status as done for all assigned instances without
error, then it sent a last status report with an error. The coordinator
treated the backend state as done after it processed the status report
with instance_exec_status as done, but did not apply the last status
report with the error to the overall backend state.
This caused the backend to receive a response with status OK for the
last status report, hence it hit the DCHECK error.

This patch fixes the race between updating the 'Query State' and updating
the fragment instance state when hitting an error during execution of
a fragment instance. The backends will not send a status report with
fragment instance state as "completed" without error after hitting
an error.

Testing:
 - Manual tests
   I could only reproduce the situation by adding some artificial
   delays in the beginning of QueryState::ErrorDuringExecute()
   when repeatedly running test case test_spilling.py::
   TestSpillingDebugActionDimensions::test_spilling_naaj for
   Impala ASAN build.
   Verified that the issue did not happen after applying this
   patch.
 - Passed exhaustive test.

Change-Id: Ic12a80e20ddc11e32349edfec2bd16338c24b841
Reviewed-on: http://gerrit.cloudera.org:8080/16900
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-20 02:19:35 +00:00
Zoltan Borok-Nagy
90f3b2f491 IMPALA-10432: INSERT INTO Iceberg tables with partition transforms
INSERT INTO Iceberg tables that use partition transforms. Partition
transforms are functions that calculate partition data from row data.

There are the following partition transforms in Iceberg:
https://iceberg.apache.org/spec/#partition-transforms

 * IDENTITY
 * BUCKET
 * TRUNCATE
 * YEAR
 * MONTH
 * DAY
 * HOUR
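
As a rough illustration of what some of these transforms compute, following
the definitions in the Iceberg spec linked above (a Python sketch, not the
Impala implementation; the function names are made up). BUCKET is omitted
because it relies on a 32-bit Murmur3 hash that is not reproduced here:

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)

def identity(value):
    return value

def truncate_int(value, width):
    # Rounds down to a multiple of width, also for negative values.
    return value - (value % width)

def truncate_str(value, length):
    return value[:length]

def year(d):
    """Years since 1970 for a date."""
    return d.year - 1970

def month(d):
    """Months since 1970-01 for a date."""
    return (d.year - 1970) * 12 + (d.month - 1)

def day(d):
    """Days since 1970-01-01 for a date."""
    return (d - EPOCH_DATE).days

def hour(ts):
    """Hours since 1970-01-01 00:00:00 UTC for a timezone-aware timestamp."""
    return int((ts - EPOCH_TS).total_seconds() // 3600)
```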

INSERT INTO identity-partitioned Iceberg tables is already supported.
This patch adds support for the rest of the transforms.

We create the partitioning expressions in InsertStmt. Based on these
expressions data are automatically shuffled and sorted by the backend
executors before rows are given to the table sink operators. The table
sink operator writes the partitions one-by-one and creates a
human-readable partition path for them.

In the end, we will convert the partition path to partition data and
create Iceberg DataFiles with information about the files written.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I3edf02048cea78703837b248c55219c22d512b78
Reviewed-on: http://gerrit.cloudera.org:8080/16939
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-18 18:46:42 +00:00
stiga-huang
9bb7157bf0 IMPALA-10387: Add missing overloads of mask functions used in Ranger default masking policies
The mask functions in Hive are implemented through GenericUDFs which can
accept an infinite number of function signatures. Impala currently doesn't
support GenericUDFs, so we provide builtin mask functions with limited
overloads.

This patch adds some missing overloads that could be used by Ranger
default masking policies, e.g. MASK_HASH, MASK_SHOW_LAST_4,
MASK_DATE_SHOW_YEAR, etc.

Tests:
 - Add test coverage on all default masking policies applied on all
   supported types.

Change-Id: Icf3e70fd7aa9f3b6d6b508b776696e61ec1fcc2e
Reviewed-on: http://gerrit.cloudera.org:8080/16930
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-15 13:01:53 +00:00
Tim Armstrong
5fddbb569a IMPALA-9865: part 2/2: add verbosity to profile tool
Adds a --profile_verbosity option for impala-profile-tool with
the following levels:
* 0: minimal
* 1: legacy - matches old output, this is the default still
* 2: default - basic descriptive stats, used for V2 profile.
* 3: extended
* 4: full

This will help with transition to the V2 profile because we
can have a nice, high-level, readable text profile by default
with the option to produce more detailed profiles and alternate
views of the profile from the thrift profile.

Use the profile version in impala-profile-tool to dump the
more verbose output for the V2 profile while preserving the
same output for the legacy profile.

Reduce verbosity of v2 profile output - only include mean/min/max
by default. I intend to refine the output at the different
verbosity levels for the v2 profiles further as part of IMPALA-9382,
it is still fairly noisy.

Fix output with/without gen_experimental_profile - there
was a small difference in that the summary stats were not
output in the averaged profile.

Testing:
* Add an end-to-end test that generates output for a small
  profile log and compares against expected files.
* Tweak other profile tests to reflect changes to output.

Change-Id: I82618a813e29af7996dfaed78873b2a73bc0231d
Reviewed-on: http://gerrit.cloudera.org:8080/16881
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-15 00:50:39 +00:00
Zoltan Borok-Nagy
696dafed66 IMPALA-10426: Fix crash when inserting invalid timestamps
Insertion of invalid timestamps causes Impala to crash when it uses
the INT64 Parquet timestamp types.

This patch fixes the error by checking for null values in
Int64TimestampColumnWriterBase::ConvertValue().

Testing:
 * added e2e tests

Change-Id: I74fb754580663c99e1d8c3b73f8d62ea3305ac93
Reviewed-on: http://gerrit.cloudera.org:8080/16951
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-14 19:34:38 +00:00
Thomas Tauber-Marshall
91adb33b22 IMPALA-9975 (part 2): Introduce new admission control daemon
A recent patch (IMPALA-9930) introduces a new admission control rpc
service, which can be configured to perform admission control for
coordinators. In that patch, the admission service runs in an impalad.

This patch separates the service out to run in a new daemon, called
the admissiond. It also integrates this new daemon with the build
infrastructure around Docker.

Some notable changes:
- Adds a new class, AdmissiondEnv, which performs the same function
  for the admissiond as ExecEnv does for impalads.
- The '/admission' http endpoint is exposed on the admissiond's webui
  if the admission control service is in use, otherwise it is exposed
  on coordinator impalad's webuis.
- start-impala-cluster.py takes a new flag --enable_admission_service
  which configures the minicluster to have an admissiond with all
  coordinators using it for admission control.
- Coordinators are now configured to use the admission service by
  specifying the startup flag --admission_service_host. This is
  intended to mirror the configuration of the statestored/catalogd
  location.

Testing:
- Existing tests for the admission control service are modified to run
  with an admissiond.
- Manually ran start-impala-cluster.py with --enable_admission_service
  and --docker_network to verify Docker integration.

Change-Id: Id677814b31e9193035e8cf0d08aba0ce388a0ad9
Reviewed-on: http://gerrit.cloudera.org:8080/16891
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-13 06:03:37 +00:00
skyyws
1093a563e6 IMPALA-10368: Support required/optional property when creating Iceberg table
This patch adds support for creating required/optional fields for
Iceberg tables. If a column of an Iceberg table is declared 'NOT NULL'
in SQL, Impala creates a required field through the Iceberg API; 'NULL'
or the default creates an optional field.
In addition, 'DESCRIBE XXX' for an Iceberg table displays the 'optional'
property like this:
+------+--------+---------+----------+
| name | type   | comment | nullable |
+------+--------+---------+----------+
| id   | int    |         | false    |
| name | string |         | true     |
| age  | int    |         | true     |
+------+--------+---------+----------+
And 'SHOW CREATE TABLE XXX' will also display 'NULL'/'NOT NULL'
property for Iceberg table.

Tests:
 * added new test in iceberg-create.test
 * added new test in iceberg-negative.test
 * added new test in show-create-table.test
 * modify 'DESCRIBE XXX' result in iceberg-create.test
 * modify 'DESCRIBE XXX' result in iceberg-alter.test
 * modify create table result in show-create-table.test

Change-Id: I70b8014ba99f43df1b05149ff7a15cf06b6cd8d3
Reviewed-on: http://gerrit.cloudera.org:8080/16904
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-11 17:08:21 +00:00
Thomas Tauber-Marshall
aa48651cb5 IMPALA-9975 (part 1): Various refactors for admission control daemon
This patch contains a variety of small refactors needed to enable the
new admission control daemon, to separate them out from the main patch
for ease of reviewing.

The following changes are made:
- A new class is introduced, DaemonEnv, which contains singleton
  objects common to all Impala daemons, currently just a MetricGroup
  and Webservers. The purpose is to reduce code duplication when the
  new admissiond daemon is added. This is analogous to how ExecEnv is
  used for impalad-specific singletons already.

  This patch modifies the catalogd and statestored to use DaemonEnv.
  impalads could also use DaemonEnv, but it's tricky due to
  dependencies in the order of creation and initialization for objects
  such as the ReservationTracker and BufferPool relative to the
  MetricGroup and Webserver, so this is left for followup work.

- Direct use of ExecEnv in ImpalaServicePool and AdmissionController
  is removed, as the admissiond will also need to use these classes
  and it will not have an ExecEnv.

Testing:
- Passed a run of existing core tests.

Change-Id: I2e097e20458354f78bfc3477cac6fb3a2835f094
Reviewed-on: http://gerrit.cloudera.org:8080/16890
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-09 04:14:43 +00:00
Tim Armstrong
ab6b7960db IMPALA-10027: configurable default anonymous user
A username can be determined for a session via two mechanisms:
* In a secure env, the user is authenticated by LDAP or Kerberos
* In an unsecure env, the client specifies the user name, either
  as a parameter to the OpenSession API (HS2) or as a parameter
  to the first query run (beeswax)

This patch affects what happens if neither of the above mechanisms
is used. Previously we would end up with the username being an
empty string, but this makes Ranger unhappy. Hive uses the name
"anonymous" in this situation, so we change Impala's behaviour too.

This is configurable by -anonymous_user_name. -anonymous_user_name=
reverts to the old behaviour.
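
The fallback order described above can be sketched roughly as follows.
This is only an illustration in Python, not Impala's implementation: the
function name effective_user and its parameters are hypothetical, and
only the "anonymous" default and the empty-string override come from the
description above.

```python
def effective_user(authenticated_user, client_supplied_user,
                   anonymous_user_name="anonymous"):
    """Pick the session user name (hypothetical helper, for illustration)."""
    # Secure environment: LDAP or Kerberos already established the user.
    if authenticated_user:
        return authenticated_user
    # Unsecure environment: the client named the user itself.
    if client_supplied_user:
        return client_supplied_user
    # Neither mechanism was used: fall back to the configured name.
    # Passing an empty string here models the old behaviour.
    return anonymous_user_name
```

With the default, effective_user(None, None) yields "anonymous"; with
anonymous_user_name="" it yields the empty string, as before the patch.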

Testing:
* Add an end-to-end test that exercises this via impala-shell for
  HS2, HS2-HTTP and beeswax protocols.
* Tweak a couple of existing tests that depended on the previous
  behavior.

Change-Id: I6db491231fa22484aed476062b8fe4c8f69130b0
Reviewed-on: http://gerrit.cloudera.org:8080/16902
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-09 00:15:25 +00:00
Akos Kovacs
425e424b37 IMPALA-9687 Improve estimates for number of hosts in Kudu plans
In some cases Kudu plans could contain more hosts than the actual number of executors.
This commit fixes it by capping the number of hosts at the number of executors,
and determining which executors have local scan ranges.

Testing:
 - Ran core tests

Updated Kudu planner tests where the memory estimates changed.

Change-Id: I72e341597e980fb6a7e3792905b942ddf5797d03
Reviewed-on: http://gerrit.cloudera.org:8080/16880
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-08 04:41:43 +00:00
Joe McDonnell
35bae939ab IMPALA-10427: Remove SkipIfS3.eventually_consistent pytest marker
These tests were disabled due to S3's eventually consistent
behavior. Now that S3 is strongly consistent, these tests do
not need to be disabled.

Testing:
 - Ran s3 core job

Change-Id: Ie9041f530bf3a818f8954b31a3d01d9f6753d7d4
Reviewed-on: http://gerrit.cloudera.org:8080/16931
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-07 23:53:56 +00:00
Shajini Thayasingh
44bade8e7f IMPALA-10091: [DOCS] add REFRESH_UPDATED_HMS_PARTITIONS query option
remove trailing spaces
added this new query option for Impala 4.0

Change-Id: I95b31b33f99073c57752e66eaf0f34facf511fc6
Reviewed-on: http://gerrit.cloudera.org:8080/16925
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-07 20:05:50 +00:00
Thomas Tauber-Marshall
799bc22d70 IMPALA-10424: Fix race on not_admitted_reason in AdmissionController
QueueNode::not_admitted_reason can be accessed concurrently by the
coordinator thread that calls SubmitForAdmission and the admission
control dequeue loop thread.

This patch fixes this by ensuring that not_admitted_reason is only
accessed if 'admission_ctrl_lock_' is held.
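
The locking discipline can be pictured with a small Python sketch. The
real code is C++; the class and method names below are hypothetical
stand-ins, and only the idea of touching not_admitted_reason exclusively
while the admission control lock is held comes from the description
above.

```python
import threading


class QueueNodeSketch:
    """Hypothetical stand-in for a queue node with a shared field."""

    def __init__(self):
        self.not_admitted_reason = ""


class AdmissionControllerSketch:
    """Hypothetical stand-in; every access to the field takes the lock."""

    def __init__(self):
        self._admission_ctrl_lock = threading.Lock()

    def set_reason(self, node, reason):
        # Writer side, e.g. the dequeue loop thread.
        with self._admission_ctrl_lock:
            node.not_admitted_reason = reason

    def get_reason(self, node):
        # Reader side, e.g. the thread that submitted the query.
        with self._admission_ctrl_lock:
            return node.not_admitted_reason
```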

Change-Id: Iacb3f37d8e1797c2b1d7bc32ba6368419e9ae444
Reviewed-on: http://gerrit.cloudera.org:8080/16926
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-07 00:22:46 +00:00
Shajini Thayasingh
0c7e5a4a7c IMPALA-10388: [DOCS] add limitations on mask functions
incorporated comments, removed the para as per the feedback
listed all the overloads that are introduced
stated that Impala does not yet support new Hive UDFs
called out how mask functions were introduced through overloads

Change-Id: I37f0bcf4cf586cc5cfd03e4df68443967b6bb88f
Reviewed-on: http://gerrit.cloudera.org:8080/16861
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2021-01-06 23:22:37 +00:00
gaoxiaoqing
200d3664f0 IMPALA-10412: ConvertToCNFRule can be applied to view table
For OR predicates that reference a view, currently the ConvertToCNFRule
does not get applied since it is considered a single table predicate
even if the predicate might reference columns from different tables
within the view. This patch enables the application of this rule for
such predicates by checking the expanded view and if it satisfies the
criterion then the rule can be applied and the predicate can be pushed
eventually to the scan.
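
As a reminder of what a CNF conversion does to an OR predicate, here is
a minimal Python sketch. It is not the ConvertToCNFRule code: predicates
are modelled as nested tuples and the column names are made up. It only
shows the distribution step (a AND b) OR c => (a OR c) AND (b OR c),
after which each conjunct can be considered for pushdown separately.

```python
def to_cnf(expr):
    """Distribute OR over AND in a tiny tuple-based expression tree.

    Leaves are strings; inner nodes are ("and" | "or", left, right).
    """
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    left, right = to_cnf(left), to_cnf(right)
    if op == "and":
        return ("and", left, right)
    # op == "or": pull an AND from either side up above the OR.
    if isinstance(left, tuple) and left[0] == "and":
        return ("and",
                to_cnf(("or", left[1], right)),
                to_cnf(("or", left[2], right)))
    if isinstance(right, tuple) and right[0] == "and":
        return ("and",
                to_cnf(("or", left, right[1])),
                to_cnf(("or", left, right[2])))
    return ("or", left, right)


# (t1.a AND t2.b) OR t1.c, with hypothetical column names
predicate = ("or", ("and", "t1.a", "t2.b"), "t1.c")
converted = to_cnf(predicate)
```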

Testing:
Added planner test in inline-view.test

Change-Id: Ie7a9a215d6b92aec07153e643268370f34186c88
Reviewed-on: http://gerrit.cloudera.org:8080/16912
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-06 22:36:10 +00:00
stiga-huang
e7839c4530 IMPALA-10416: Add raw string mode for testfiles to verify non-ascii results
Currently, the result section of the testfile is required to use
escaped strings. Take the following result section as an example:
  --- RESULTS
  'Alice\nBob'
  'Alice\\nBob'
The first line is a string with a newline character. The second line is
a string with a '\' and an 'n' character. When comparing with the actual
query results, we need to escape the special characters in the actual
results, e.g. replace newline characters with '\n'. This is done by
invoking encode('unicode_escape') on the actual result strings. However,
the input type of this method is unicode instead of str. When calling it
on str vars, Python will implicitly convert the input vars to unicode
type. The default encoding, ascii, is used. This causes
UnicodeDecodeError when the str contains non-ascii bytes. To fix this,
this patch explicitly decodes the input str using 'utf-8' encoding.
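
The decode-then-escape step can be illustrated with a short sketch. It
is written for Python 3, where the str/unicode pair discussed above
corresponds to bytes/str, so it is an approximation of the idea rather
than the test framework's code; the helper name escape_result is
hypothetical.

```python
def escape_result(raw_bytes):
    """Escape special characters in a result the way the text describes.

    The bytes are decoded explicitly as UTF-8 first; relying on an
    ASCII decode would fail on non-ASCII input.
    """
    text = raw_bytes.decode("utf-8")
    # unicode_escape turns a newline into the two characters '\' 'n'
    # and non-ASCII characters into \uXXXX sequences.
    return text.encode("unicode_escape").decode("ascii")


newline_case = escape_result(b"Alice\nBob")
non_ascii_case = escape_result("你好\n你好".encode("utf-8"))
```

Here newline_case holds the backslash-n form shown in the first result
line above, and non_ascii_case holds the \uXXXX form that the default
string mode expects.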

After fixing the logic of escaping the actual result strings, the next
problem is that it's painful to write unicode-escaped expected results.
Here is an example:
  ---- QUERY
  select "你好\n你好"
  ---- RESULTS
  '\u4f60\u597d\n\u4f60\u597d'
  ---- TYPES
  STRING
It's painful to manually translate the unicode characters.

This patch adds a new comment, RAW_STRING, for the result section to use
raw strings instead of unicode-escaped strings. Here is an example:
  ---- QUERY
  select "你好"
  ---- RESULTS: RAW_STRING
  '你好'
  ---- TYPES
  STRING
If the result contains special characters, it's recommended to use the
default string mode. If the special characters only contain newline
characters, we can use RAW_STRING and the existing MULTI_LINE comment
together.

This patch also fixes the issue that pytest fails to report assertion
failures if any of the compared str values contain non-ascii bytes
(IMPALA-10419). However, pytest works if the compared values are both
in unicode type. So we explicitly convert the actual and expected str
values to unicode type.

Test:
 - Add tests in special-strings.test for raw string mode and the escaped
   string mode (default).
 - Run test_exprs.py::TestExprs::test_special_strings locally.

Change-Id: I7cc2ea3e5849bd3d973f0cb91322633bcc0ffa4b
Reviewed-on: http://gerrit.cloudera.org:8080/16919
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-06 04:39:56 +00:00
Tim Armstrong
868a01dca9 IMPALA-6101: call DataStreamMgr::Cancel() once per query
This is a bit of cleanup left over from the KRPC work that could avoid
some lock contention for queries with large numbers of fragments.

The change is just to do cancellation of receivers once per query
instead of once per fragment.

Change-Id: I7677d21f0aaddc3d4b56f72c0470ea850e34611e
Reviewed-on: http://gerrit.cloudera.org:8080/16901
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-05 23:55:51 +00:00
Tim Armstrong
1d5fe2771f IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
The encoding is identical to the already-supported PLAIN_DICTIONARY
encoding but the PLAIN enum value is used for the dictionary pages
and the RLE_DICTIONARY enum value is used for the data pages.

A hidden option -write_new_parquet_dictionary_encodings is
added to turn on writing too, for test purposes only.

Testing:
* Added an automated test using a pregenerated test file.
* Ran core tests.
* Manually tested by writing out TPC-H lineitem with the new encoding
  and reading back in Impala and Hive.

Parquet-tools output for the generated test file:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/att/824de2afebad009f-6f460ade00000003_643159826_data.0.parq
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:            hdfs://localhost:20500/test-warehouse/att/824de2afebad009f-6f460ade00000003_643159826_data.0.parq
creator:         impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema:     schema
--------------------------------------------------------------------------------
id:              OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bool_col:        OPTIONAL BOOLEAN R:0 D:1
tinyint_col:     OPTIONAL INT32 L:INTEGER(8,true) R:0 D:1
smallint_col:    OPTIONAL INT32 L:INTEGER(16,true) R:0 D:1
int_col:         OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bigint_col:      OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
float_col:       OPTIONAL FLOAT R:0 D:1
double_col:      OPTIONAL DOUBLE R:0 D:1
date_string_col: OPTIONAL BINARY R:0 D:1
string_col:      OPTIONAL BINARY R:0 D:1
timestamp_col:   OPTIONAL INT96 R:0 D:1
year:            OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
month:           OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1

row group 1:     RC:8 TS:754 OFFSET:4
--------------------------------------------------------------------------------
id:               INT32 SNAPPY DO:4 FPO:48 SZ:74/73/0.99 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 7, num_nulls: 0]
bool_col:         BOOLEAN SNAPPY DO:0 FPO:141 SZ:26/24/0.92 VC:8 ENC:RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
tinyint_col:      INT32 SNAPPY DO:220 FPO:243 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
smallint_col:     INT32 SNAPPY DO:343 FPO:366 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
int_col:          INT32 SNAPPY DO:467 FPO:490 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
bigint_col:       INT64 SNAPPY DO:586 FPO:617 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 10, num_nulls: 0]
float_col:        FLOAT SNAPPY DO:724 FPO:747 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 1.1, num_nulls: 0]
double_col:       DOUBLE SNAPPY DO:845 FPO:876 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 10.1, num_nulls: 0]
date_string_col:  BINARY SNAPPY DO:983 FPO:1028 SZ:74/88/1.19 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30312F30312F3039, max: 0x30342F30312F3039, num_nulls: 0]
string_col:       BINARY SNAPPY DO:1143 FPO:1168 SZ:53/49/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30, max: 0x31, num_nulls: 0]
timestamp_col:    INT96 SNAPPY DO:1261 FPO:1329 SZ:98/138/1.41 VC:8 ENC:RLE,RLE_DICTIONARY ST:[num_nulls: 0, min/max not defined]
year:             INT32 SNAPPY DO:1451 FPO:1470 SZ:47/43/0.91 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 2009, max: 2009, num_nulls: 0]
month:            INT32 SNAPPY DO:1563 FPO:1594 SZ:60/56/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 4, num_nulls: 0]

Parquet-tools output for one of the lineitem files:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/li2/4b4d9143c575dd71-3f69d3cf00000001_1879643220_data.0.parq
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:            hdfs://localhost:20500/test-warehouse/li2/4b4d9143c575dd71-3f69d3cf00000001_1879643220_data.0.parq
creator:         impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema:     schema
--------------------------------------------------------------------------------
l_orderkey:      OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_partkey:       OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_suppkey:       OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_linenumber:    OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
l_quantity:      OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(12,2) R:0 D:1
l_extendedprice: OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(12,2) R:0 D:1
l_discount:      OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(12,2) R:0 D:1
l_tax:           OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(12,2) R:0 D:1
l_returnflag:    OPTIONAL BINARY R:0 D:1
l_linestatus:    OPTIONAL BINARY R:0 D:1
l_shipdate:      OPTIONAL BINARY R:0 D:1
l_commitdate:    OPTIONAL BINARY R:0 D:1
l_receiptdate:   OPTIONAL BINARY R:0 D:1
l_shipinstruct:  OPTIONAL BINARY R:0 D:1
l_shipmode:      OPTIONAL BINARY R:0 D:1
l_comment:       OPTIONAL BINARY R:0 D:1

row group 1:     RC:1724693 TS:58432195 OFFSET:4
--------------------------------------------------------------------------------
l_orderkey:       INT64 SNAPPY DO:4 FPO:159797 SZ:2839537/13147604/4.63 VC:1724693 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: 2142211, max: 6000000, num_nulls: 0]
l_partkey:        INT64 SNAPPY DO:2839640 FPO:3028619 SZ:8179566/13852808/1.69 VC:1724693 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: 1, max: 200000, num_nulls: 0]
l_suppkey:        INT64 SNAPPY DO:11019308 FPO:11059413 SZ:3063563/3103196/1.01 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 10000, num_nulls: 0]
l_linenumber:     INT32 SNAPPY DO:14082964 FPO:14083007 SZ:412884/650550/1.58 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 7, num_nulls: 0]
l_quantity:       FIXED_LEN_BYTE_ARRAY SNAPPY DO:14495934 FPO:14496204 SZ:1298038/1297963/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 1.00, max: 50.00, num_nulls: 0]
l_extendedprice:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:15794062 FPO:16003224 SZ:9087746/10429259/1.15 VC:1724693 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: 904.00, max: 104949.50, num_nulls: 0]
l_discount:       FIXED_LEN_BYTE_ARRAY SNAPPY DO:24881912 FPO:24881976 SZ:866406/866338/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0.00, max: 0.10, num_nulls: 0]
l_tax:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:25748406 FPO:25748463 SZ:866399/866325/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0.00, max: 0.08, num_nulls: 0]
l_returnflag:     BINARY SNAPPY DO:26614888 FPO:26614918 SZ:421113/421069/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x41, max: 0x52, num_nulls: 0]
l_linestatus:     BINARY SNAPPY DO:27036081 FPO:27036106 SZ:262209/270332/1.03 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x46, max: 0x4F, num_nulls: 0]
l_shipdate:       BINARY SNAPPY DO:27298370 FPO:27309301 SZ:2602937/2627148/1.01 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x313939322D30312D3032, max: 0x313939382D31322D3031, num_nulls: 0]
l_commitdate:     BINARY SNAPPY DO:29901405 FPO:29912079 SZ:2602680/2626308/1.01 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x313939322D30312D3331, max: 0x313939382D31302D3331, num_nulls: 0]
l_receiptdate:    BINARY SNAPPY DO:32504185 FPO:32515219 SZ:2603040/2627498/1.01 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x313939322D30312D3036, max: 0x313939382D31322D3330, num_nulls: 0]
l_shipinstruct:   BINARY SNAPPY DO:35107326 FPO:35107408 SZ:434968/434917/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x434F4C4C45435420434F44, max: 0x54414B45204241434B2052455455524E, num_nulls: 0]
l_shipmode:       BINARY SNAPPY DO:35542401 FPO:35542471 SZ:650639/650580/1.00 VC:1724693 ENC:RLE,RLE_DICTIONARY ST:[min: 0x414952, max: 0x545255434B, num_nulls: 0]
l_comment:        BINARY SNAPPY DO:36193124 FPO:36711343 SZ:22240470/52696671/2.37 VC:1724693 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: 0x20546972657369617320, max: 0x7A7A6C653F20626C697468656C792069726F6E69, num_nulls: 0]

Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Reviewed-on: http://gerrit.cloudera.org:8080/16893
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-05 23:30:35 +00:00
Aman Sinha
49680559b0 IMPALA-10182: Don't add inferred identity predicates to SELECT node
For an inferred equality predicate of type c1 = c2, if both sides
are referring to the same underlying tuple and slot, it is an identity
predicate which should not be evaluated by the SELECT node since it
will incorrectly eliminate NULL rows. This patch fixes the behavior.
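
Why evaluating such a predicate drops NULL rows can be seen in a small
Python sketch of SQL's three-valued equality. None stands in for NULL;
the helper name and the sample rows are made up for illustration.

```python
def sql_equals(a, b):
    """SQL-style '=': the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


rows = [{"c1": 1}, {"c1": None}, {"c1": 3}]

# A filter keeps a row only when the predicate is TRUE, so the identity
# predicate c1 = c1 silently removes the row where c1 is NULL.
kept = [row for row in rows if sql_equals(row["c1"], row["c1"]) is True]
```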

Testing:
 - Added planner tests with base table and with outer join
 - Added runtime tests with base table and with outer join
 - Added planner test for IMPALA-9694 (same root cause)
 - Ran PlannerTest .. no other plans changed

Change-Id: I924044f582652dbc50085851cc639f3dee1cd1f4
Reviewed-on: http://gerrit.cloudera.org:8080/16917
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-05 23:04:25 +00:00
Zoltan Borok-Nagy
03af0b2c8c IMPALA-10422: EXPLAIN statements leak ACID transactions and locks
Currently EXPLAIN statements might open ACID transactions and
create locks on ACID tables.

This is not necessary since we won't modify the table. But the
real problem is that these transactions and locks are leaked and
open forever. They are even getting heartbeated while the
coordinator is still running.

The solution is to not consume any ACID resources for EXPLAIN
statements.

Testing:
* Added EXPLAIN INSERT OVERWRITE in front of an actual INSERT OVERWRITE
  in an e2e test

Change-Id: I05113b1fd9a3eb2d0dd6cf723df916457f3fbf39
Reviewed-on: http://gerrit.cloudera.org:8080/16923
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-05 21:31:05 +00:00
Tim Armstrong
a5f6c26044 IMPALA-2536: Make ColumnType constructor explicit
This avoids accidental implicit type conversions from
the PrimitiveType enum, or worse, integer values via
the enum.

Testing:
Ran core tests.

Change-Id: I2fe1d5da051c10904605328607bea78565356ef3
Reviewed-on: http://gerrit.cloudera.org:8080/16906
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-31 06:19:20 +00:00
xqhe
5baadd1da7 IMPALA-10406: Query with analytic functions doesn't need to materialize the predicates bounded to kudu
Previously, a query with analytic functions would materialize the
unassigned conjuncts.
However, predicates that can be evaluated by Kudu do not need to be
materialized.

This optimization can reduce the amount of data to exchange and sort.

Testing:
 - Add planner test in analytic-fns.test

Change-Id: Iba8371eff6ae1bcffd51b44843175c52f2127e46
Reviewed-on: http://gerrit.cloudera.org:8080/16905
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-12-28 05:16:21 +00:00
Tim Armstrong
43b6093dc0 IMPALA-10117: Skip calls to FsPermissionCache for blob stores
This avoids calling precacheChildrenOf() in cases when the
cached values will never be used. This change simply skips
calling precacheChildrenOf() in the cases when getPermissions()
is never called.

There is some opportunity to clean up this permissions
checking further, but I decided to keep this fix limited
in scope.

Change-Id: I2034695a956307309f656d56aa57aa07ae5163d8
Reviewed-on: http://gerrit.cloudera.org:8080/16898
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-25 07:06:29 +00:00
Riza Suminto
a6a2440995 IMPALA-10374: Limit iteration at BufferedTupleStream::DebugString
BufferedTupleStream::DebugString() iterates over a std::list<Page> that
can potentially grow very large. As a consequence, the returned string
can grow large as well and cause problems, as previously happened in
IMPALA-9851. With this patch, BufferedTupleStream::DebugString() only
includes at most the first 100 pages of the page list.
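
The capping idea looks roughly like this in Python. The actual change
is in C++; the function below is a hypothetical sketch, and the
"more pages omitted" suffix is an assumption added for illustration,
not something the patch is stated to print.

```python
from itertools import islice

MAX_PAGES_IN_DEBUG_STRING = 100  # the limit named in the description


def debug_string(pages, max_pages=MAX_PAGES_IN_DEBUG_STRING):
    """Render at most max_pages entries of a page list."""
    parts = [repr(page) for page in islice(pages, max_pages)]
    omitted = len(pages) - len(parts)
    # Assumed suffix so the reader can tell the output was truncated.
    suffix = " ... ({} more pages omitted)".format(omitted) if omitted > 0 else ""
    return "pages=[" + ", ".join(parts) + "]" + suffix
```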

Testing:
- Add new be test SimpleTupleStreamTest.ShortDebugString in
  buffered-tuple-stream-test.cc
- Pass core tests

Change-Id: I6626c8d54f35f303c01f85be1dd9aa54c8ad9a2d
Reviewed-on: http://gerrit.cloudera.org:8080/16884
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2020-12-24 18:55:24 +00:00
Riza Suminto
29069b9499 IMPALA-9550: Fix flakiness in TestResultSpoolingFetchSize.test_fetch
TestResultSpoolingFetchSize.test_fetch has been flaky in
ubuntu-16.04-dockerised environment for not reaching finished state
within 10 seconds. This patch increases the timeout of the test to 30
seconds.

Testing:
- Looped the test locally.

Change-Id: Id2e8a9db904da5f1e4acc9e18b3987b8a4ec24e5
Reviewed-on: http://gerrit.cloudera.org:8080/16895
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-23 00:54:24 +00:00
Fucun Chu
4099a60689 IMPALA-10317: Add query option that limits huge joins at runtime
This patch adds support for limiting the rows produced by a join node
such that runaway join queries can be prevented.

The limit is specified by a query option. Queries exceeding that limit
get terminated. The checking runs periodically, so the actual rows
produced may go somewhat over the limit.
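
A toy model of why the count can overshoot: if the limit is only
compared against the running total at intervals, the total may already
be past the limit when the check fires. The sketch below checks once
per batch, which is an assumption made for illustration; the names are
hypothetical and this is not the join node's code.

```python
class RowLimitExceeded(Exception):
    """Raised when the produced row count passes the limit."""

    def __init__(self, produced, limit):
        super().__init__("produced {} rows, limit {}".format(produced, limit))
        self.produced = produced
        self.limit = limit


def run_join(batches, rows_limit):
    """Count rows batch by batch, checking the limit after each batch."""
    produced = 0
    for batch in batches:
        produced += len(batch)
        # The check is not per row, so 'produced' can exceed the limit
        # by up to one batch before the query is terminated.
        if produced > rows_limit:
            raise RowLimitExceeded(produced, rows_limit)
    return produced
```

For example, with batches of 1024 rows and a limit of 2000, the error is
raised after the second batch, when 2048 rows have been produced.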

JOIN_ROWS_PRODUCED_LIMIT is exposed as an advanced query option.

The query profile is updated to include query-wide and per-backend
metrics for RowsReturned. Example from "
set JOIN_ROWS_PRODUCED_LIMIT = 10000000;
select count(*) from tpch_parquet.lineitem l1 cross join
(select * from tpch_parquet.lineitem l2 limit 5) l3;":

NESTED_LOOP_JOIN_NODE (id=2):
   - InactiveTotalTime: 107.534ms
   - PeakMemoryUsage: 16.00 KB (16384)
   - ProbeRows: 1.02K (1024)
   - ProbeTime: 0.000ns
   - RowsReturned: 10.00M (10002025)
   - RowsReturnedRate: 749.58 K/sec
   - TotalTime: 13s337ms

Testing:
 Added tests for JOIN_ROWS_PRODUCED_LIMIT

Change-Id: Idbca7e053b61b4e31b066edcfb3b0398fa859d02
Reviewed-on: http://gerrit.cloudera.org:8080/16706
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-22 06:10:39 +00:00