Initializing the impala-python virtualenv takes a couple of minutes,
so it is useful to do that in parallel with the rest of the build.
This moves the impala-python initialization to its own step
in the CMake build. It stops using impala-python for commands
invoked from buildall.sh or the CMake build to avoid premature
or concurrent initializations of impala-python. Then, it adds
a dedicated step to initialize impala-python.
Testing:
- Ran a core job and a couple of builds
- Rebuilt and verified that impala-python is not reinitialized
if it is already initialized
Change-Id: Ieff51263c55bd234028fed7101c94b4a928590f0
Reviewed-on: http://gerrit.cloudera.org:8080/16607
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the --kudu_client_connection_negotiation_timeout_ms
flag to control the client-side connection negotiation timeout in the
Kudu client working as part of Impala's BE. Since [1] has been
addressed for the Kudu C++ client, it makes sense to provide a knob
to customize the timeout. That should help to address cases where very
busy cluster nodes hosting Kudu tablet servers aren't fast enough to
negotiate a new connection within the default timeout interval (3 sec),
as mentioned in the description of [1].
[1] https://issues.apache.org/jira/browse/KUDU-2966
Change-Id: I1223187318691da47082608356547f6d78144466
Reviewed-on: http://gerrit.cloudera.org:8080/16705
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the blacklist functionality by adding an executor
node to the blacklist if a query fails due to a disk failure during
spill-to-disk. It also classifies disk error codes and defines a
blacklistable error set for non-transient disk errors. The coordinator
blacklists an executor only if the executor hit a blacklistable error
during spill-to-disk.
Adds a new debug action to simulate a disk write error during
spill-to-disk. To use it, specify in the query options:
'debug_action': 'IMPALA_TMP_FILE_WRITE:<hostname>:<port>:<action>'
where <hostname> and <port> identify the impalad that executes the
fragment instances; <port> is the BE KRPC port (default 27000).
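For illustration, a minimal sketch of triggering the debug action from
Python with the impyla client (hostname, port, and the 'FAIL' action
token are placeholder assumptions):

from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cur = conn.cursor()
# Inject a write error on the impalad at impalad-1:27000 when it spills.
cur.execute("SET DEBUG_ACTION='IMPALA_TMP_FILE_WRITE:impalad-1:27000:FAIL'")
cur.execute('select count(*) from tpch.lineitem a join tpch.lineitem b '
            'on a.l_orderkey = b.l_orderkey')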
Adds new test cases for blacklist and query-retry to cover the code
changes.
Testing:
- Passed new test cases.
- Passed exhaustive test.
- Manually simulated disk failures in scratch directories on nodes
of a cluster, verified that the nodes were blacklisted as
expected.
Change-Id: I04bfcb7f2e0b1ef24a5b4350f270feecd8c47437
Reviewed-on: http://gerrit.cloudera.org:8080/16949
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some cancel tests have been flaky, and their logs didn't reveal the
root cause of the failures. This adds some extra logging so that we
can see a bit more of the nature of the failure.
The extra log message contains:
- Query SQL
- Message of the exception thrown during fetching the results
- Query Status line from the query profile
Change-Id: Ied7100a9ea2e2f0611cf8e328e589b4c8e5d5100
Reviewed-on: http://gerrit.cloudera.org:8080/16985
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently Impala writes double slashes in the paths of data files
for non-partitioned Iceberg tables. Unnormalized paths can cause
problems later.
This patch removes the redundant slashes.
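The idea amounts to collapsing repeated separators while leaving the
URI scheme intact; a rough Python sketch of the normalization (purely
illustrative, not the actual fix):

import re

def normalize_path(path):
    # Collapse runs of '/' that are not part of a scheme like 'hdfs://'.
    return re.sub(r'(?<!:)/{2,}', '/', path)

print(normalize_path('hdfs://nn:8020/warehouse//ice_tbl//data/f.parq'))
# -> hdfs://nn:8020/warehouse/ice_tbl/data/f.parq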
Testing:
* Tested manually by inspecting the manifest files of the
Iceberg tables. Used both non-partitioned and partitioned tables.
Change-Id: If5ecac78102ed35710dd70a18edc71f6e891e748
Reviewed-on: http://gerrit.cloudera.org:8080/16993
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for the TRUNCATE statement for
Iceberg tables.
The TRUNCATE operation creates a new snapshot for the target
table that doesn't have any data files. Table and column stats
are also cleared. This patch also fixes a bug that caused
table/column stats not to be propagated.
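For example (a sketch using the impyla client; the table name is
hypothetical):

from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cur = conn.cursor()
cur.execute('TRUNCATE TABLE ice_t')   # new snapshot with no data files
cur.execute('SELECT count(*) FROM ice_t')
print(cur.fetchall())                 # [(0,)]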
Testing:
* added e2e tests for both partitioned and unpartitioned tables
Change-Id: I6116c7c36aba871c0be79f499e0ac618072ca7b8
Reviewed-on: http://gerrit.cloudera.org:8080/16987
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: wangsheng <skyyws@163.com>
With newer versions of Iceberg, TestIcebergTable::test_create_iceberg_tables
fails with a ClassNotFoundException for
org.apache.hadoop.hive.common.type.Date.
This adds that missing location to impala-minimal-hive-exec.
Testing:
- Ran TestIcebergTable::test_create_iceberg_tables with newer Iceberg
Change-Id: I3fc33ff17489c2bd54d2ec8798ec7a3e5cfb051c
Reviewed-on: http://gerrit.cloudera.org:8080/17005
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
TLS versions < 1.2 are now considered insecure. This patch improves
Impala's default security.
This is made possible now in part because Impala 4.0 dropped support
for Python versions < 2.7.9 (or 2.7.5 on certain distributions where
it has been patched), as lower Python versions do not support TLS 1.2.
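For reference, a minimal sketch of the Python-side capability this
relies on (ssl.PROTOCOL_TLSv1_2 only exists on Python >= 2.7.9 / 3.4):

import ssl

# Fails with AttributeError on Pythons too old to speak TLS 1.2.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
print(ssl.OPENSSL_VERSION)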
Testing:
- Existing SSL tests are updated to reflect the new default.
Change-Id: Ifed66646b041a061f9db92744710aef7453f39e4
Reviewed-on: http://gerrit.cloudera.org:8080/16988
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION is set to true, the
planner will simplify outer joins even if the predicate contains a
case expr or a conditional function referencing both sides of the
outer join.
However, such a predicate may not be null-rejecting, and if the outer
join is simplified anyway, the result is incorrect. E.g.
t1.b > coalesce(t1.c, t2.c) can return true if t2.c is null, so it is
not a null-rejecting predicate for t2.
The fix supports the case where the predicate has two operands and the
operator is one of (=, !=, >, <, >=, <=), and either
1. one of the operands, or
2. if the operand is an arithmetic expression, one of its children,
does not contain a conditional builtin function or case expr and has a
tuple id among the outer-joined tuples.
E.g. t1.b > coalesce(t2.c, t1.c) and t1.b + coalesce(t2.c, t1.c) >
coalesce(t2.c, t1.c) are null-rejecting predicates for t1.
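For illustration, the transformation is gated on the query option (a
sketch via the impyla client; table names are hypothetical):

from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cur = conn.cursor()
cur.execute('SET ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION=true')
# Not null-rejecting for t2: the planner must NOT simplify this join.
cur.execute('select * from t1 left outer join t2 on t1.id = t2.id '
            'where t1.b > coalesce(t1.c, t2.c)')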
Testing:
* Add new plan tests in outer-to-inner-joins.test
* Add new query tests to verify the correctness on transformation
Change-Id: I84a3812f4212fa823f3d1ced6e12f2df05aedb2b
Reviewed-on: http://gerrit.cloudera.org:8080/16845
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
For convenience, this patch adds support for the old-style
CREATE TABLE ... PARTITIONED BY ... syntax for Iceberg tables.
Users should be able to write the following:
CREATE TABLE ice_t (i int)
PARTITIONED BY (p int)
STORED AS ICEBERG;
This should be equivalent to:
CREATE TABLE ice_t (i int, p int)
PARTITION BY SPEC (p IDENTITY)
STORED AS ICEBERG;
Please note that the old-style CREATE TABLE statement creates
IDENTITY-partitioned tables. For other partition transforms the
users must use the new, more generic syntax.
Hive also supports the old PARTITIONED BY syntax with the same
behavior.
Testing:
* added e2e tests
Change-Id: I789876c161bc0987820955aa9ae01414e0dcb45d
Reviewed-on: http://gerrit.cloudera.org:8080/16979
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
What works:
* A single node cluster can be started up with docker-compose
* HMS data is stored in Derby database in a docker volume
* Filesystem data is stored in a shared docker volume, using the
localfs support in the Hadoop client.
* A Kudu cluster with a single master can be optionally added on
to the Impala cluster.
* TPC-DS data can be loaded automatically by a data loading container.
We need to set up a docker network called quickstart-network,
purely because docker-compose insists on generating network names
with underscores, which are part of the FQDN and end up causing
problems with Java's URL parsing, which rejects these technically
invalid domain names.
How to run:
Instructions for running the quickstart cluster are in
docker/README.md.
How to build containers:
./buildall.sh -release -noclean -notests -ninja
ninja quickstart_hms_image quickstart_client_image docker_images
How to upload containers to dockerhub:
IMPALA_QUICKSTART_IMAGE_PREFIX=timgarmstrong/
for i in impalad_coord_exec impalad_coordinator statestored \
impalad_executor catalogd impala_quickstart_client \
impala_quickstart_hms
do
docker tag $i ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i
docker push ${IMPALA_QUICKSTART_IMAGE_PREFIX}$i
done
I pushed containers built from commit f260cce22, which
was branched from 6cb7cecacf on master.
Misc other stuff:
* Added more metadata to all images.
TODO:
* Test and instructions to run against Kudu quickstart
* Upload latest version of containers before merging.
Change-Id: Ifc0b862af40a368381ada7ec2a355fe4b0aa778c
Reviewed-on: http://gerrit.cloudera.org:8080/15966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala shell outputs a batch of rows using OutputStream. Inside
OutputStream, output to a file is handled slightly differently from
output that is written to stdout. When writing to stdout we use print()
(which appends a newline) while when writing to a file we use write()
(which adds nothing). This difference was introduced in IMPALA-3343 so
this bug may be a regression introduced then. To ensure that output is
the same in either case we need to add a newline after writing each
batch of rows to a file.
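A rough Python sketch of the asymmetry and the fix (names are
hypothetical, not the actual impala-shell code):

def write_batch(rows, out_file=None):
    data = '\n'.join(rows)
    if out_file:
        out_file.write(data)
        out_file.write('\n')  # write() adds nothing, so terminate the batch
    else:
        print(data)           # print() already appends a newline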
TESTING:
Added a new test for this case.
Change-Id: I078a06c54e0834bc1f898626afbfff4ded579fa9
Reviewed-on: http://gerrit.cloudera.org:8080/16966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A unicode character can be encoded into 1-4 bytes in UTF-8. String
functions will return undesired results when the input contains unicode
characters, because we deal with a string as a byte array. For instance,
length() returns the length in bytes, not in unicode characters.
UTF-8 is the dominant unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support in some string functions so they can have
UTF-8 aware behavior. For compatibility with the old versions, a new
query option, UTF8_MODE, is added for turning on/off the UTF-8 aware
behavior. Currently, only length(), substring() and reverse() support
it. Support for other functions will be added in later patches.
String functions will check the query option and switch to use the
desired implementation. It's similar to how we use the decimal_v2 query
option in builtin functions.
For easy testing, the UTF-8 aware versions of the string functions are
also exposed as builtin functions (named utf8_*, e.g. utf8_length).
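For example (a sketch via the impyla client; the counts follow from
'你好' being 2 code points and 6 bytes in UTF-8):

from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cur = conn.cursor()
cur.execute("select length('你好')")
print(cur.fetchall())  # [(6,)] -- length in bytes
cur.execute('SET UTF8_MODE=true')
cur.execute("select length('你好')")
print(cur.fetchall())  # [(2,)] -- length in unicode characters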
Tests:
- Add BE tests for utf8 functions.
- Add e2e tests for the UTF8_MODE query option.
Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Reviewed-on: http://gerrit.cloudera.org:8080/16908
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When reading from the data cache, the disk IO thread first gets a file
handle, then it checks the data cache for a hit. The file handle is only
used if there is a data cache miss. It is not used when data cache hit
and in turns becomes an overhead. This patch move the file handle
retrieval later when data cache miss hapens.
Testing:
- Add custom cluster test test_no_fd_caching_on_cached_data.
- Pass core tests.
Change-Id: Icc68f233518f862454e87bcbbef14d65fcdb7c91
Reviewed-on: http://gerrit.cloudera.org:8080/16963
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
impala-profile-tool is a new dependency for end-to-end tests.
The tool is built together with all the other backend tests
(so the buildall.sh flag '-notests' turns off building it), but it is
actually used in the parallel phase of the end-to-end tests.
This poses a problem for Docker-based builds for the following reasons:
- Docker-based tests run BE, FE and various phases of the EE test in
separate Docker containers for parallel executions
- Test binaries are only built inside the container running BE tests to
cut down on the build time and the size of the Docker image that all test
containers are based on.
- This means that the EE_TEST_PARALLEL container will miss the tool
required for running the tests designed to exercise it.
The solution is to build the tool early, at the end of the build phase
running in the build container. There is already another such tool
built there (parquet-reader) for a similar reason, so just add
impala-profile-tool to the same 'make' command there.
Tested by running BE_TEST and EE_TEST_PARALLEL phases in a Docker-based
build.
Change-Id: I60e78ea883f3057c59a345feca38ef08a7f6a0b8
Reviewed-on: http://gerrit.cloudera.org:8080/16965
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fixes the analytic push down optimization for the case where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.
In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100 and we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.
The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.
We can avoid the problem in the case where the partition limit >=
order by limit, however.
In this case the relevant set of partitions is the set of partitions
that include the first <limit> rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.
The solution implemented in this patch is to increase the pushed
down limit so that it is always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.
This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).
This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitrarily large value.
Testing:
* Add new planner test with negative case where it's rejected
because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
+ tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for the
predicates <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
tie-handling logic in the execnode more.
Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.
Ran TPC-H scale 30 on a single node, no significant perf change.
Ran a targeted query to check for regressions in the top-n node.
The elapsed time for this targeted query did not change:
use tpch30_parquet;
set mt_dop=1;
select l_extendedprice from lineitem
order by 1 limit 100
Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Reviewed-on: http://gerrit.cloudera.org:8080/16942
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends COMPUTE INCREMENTAL STATS to support a list of
columns. Only the parser needed modification; no changes are needed
in other modules because they already support column lists.
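For example (a sketch via the impyla client; the column-list form is
assumed to mirror the existing COMPUTE STATS syntax, and the table and
column names are hypothetical):

from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cur = conn.cursor()
# Compute incremental stats only for the two listed columns.
cur.execute('COMPUTE INCREMENTAL STATS sales (store_id, amount)')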
Change-Id: I4dcc2d4458679c39581446f6d87bb7903803f09b
Reviewed-on: http://gerrit.cloudera.org:8080/16947
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This patch imports the functionality needed for Theta approximate
algorithm from Apache DataSketches.
First, I updated our existing snapshot of DataSketches to the
following commit: b2f749ed5ce6ba650f4259602b133c310c3a5ee4
("Merge pull request #182 from chufucun/include_type").
This affects files originated from hll/, kll/ and theta/ directories
of the DataSketches repo.
Then I copied all the files needed for Theta into our snapshot
directory.
Browse the source files here:
b2f749ed5c
Change-Id: I8485d6829f50b130c84ec8bef0a4b5895255ba6c
Reviewed-on: http://gerrit.cloudera.org:8080/16959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9865 part 2 made some expected outcomes of the above test steps
stricter. Unfortunately these stricter results are only valid when the
tests are run on an HDFS file system in the context of a local
minicluster, breaking the same test on S3 and EC storage.
The patch disables the test step when run outside the context of a local
minicluster HDFS.
Change-Id: If8a179937c9c7c690dd2630549464dbe6aa1b834
Reviewed-on: http://gerrit.cloudera.org:8080/16964
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
To make impala-shell compatible with Python3, we explicitly
distinguish bytes and text in Python2 by decoding the bytes for all
inputs.
Regression 1: multiple queries in one line with unicode chars will break
In precmd() of impala-shell, if there are multiple queries present in
one input line, we split it into individual queries (by
sqlparse.split()) and append them back to the 'cmdqueue'. They will be
passed to precmd() again. In our Python2 implementation, precmd()
expects them to be of str type, and will decode them into unicode.
However, sqlparse.split() returns unicode. Calling decode() on a
unicode var makes Python2 implicitly encode it to str first, which
may raise UnicodeEncodeError since the implicit encoding uses 'ascii'.
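A minimal Python2 reproduction of this failure mode:

# Python2 only: decode() on a unicode value implicitly *encodes* it to
# str first, using the default 'ascii' codec.
s = u'\u4f60\u597d'   # unicode, as returned by sqlparse.split()
s.decode('utf-8')     # raises UnicodeEncodeError: 'ascii' codec ...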
Regression 2: multi-line query with unicode chars will break when
command history is enabled
In _check_for_command_completion(), when calling
readline.replace_history_item() in Python2, we encode completed_cmd
into bytes. However, we shouldn't replace completed_cmd with the
encoded value, since the expected return type is unicode.
Tests:
- Add tests for these two regressions in Python2.
Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4
Reviewed-on: http://gerrit.cloudera.org:8080/16960
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This issue happened in a core ASAN build.
According to the log messages, one backend sent a status report with
instance_exec_status as done for all assigned instances without error,
then it sent a last status report with an error. The coordinator
treated the backend state as done after it processed the status report
with instance_exec_status as done, but did not apply the last status
report with the error to the overall backend state.
This caused the backend to receive a response with status OK for the
last status report, and hence hit a DCHECK error.
This patch fixes the race between updating the 'Query State' and
updating the fragment instance state when hitting an error during the
execution of a fragment instance. The backends will not send a status
report with the fragment instance state as "completed" without error
after hitting an error.
Testing:
- Manual tests
I could only reproduce the situation by adding some artificial
delays in the beginning of QueryState::ErrorDuringExecute()
when repeatedly running test case test_spilling.py::
TestSpillingDebugActionDimensions::test_spilling_naaj for
Impala ASAN build.
Verified that the issue did not happen after applying this
patch.
- Passed exhaustive test.
Change-Id: Ic12a80e20ddc11e32349edfec2bd16338c24b841
Reviewed-on: http://gerrit.cloudera.org:8080/16900
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for INSERT INTO Iceberg tables that use
partition transforms. Partition transforms are functions that
calculate partition data from row data.
There are the following partition transforms in Iceberg:
https://iceberg.apache.org/spec/#partition-transforms
* IDENTITY
* BUCKET
* TRUNCATE
* YEAR
* MONTH
* DAY
* HOUR
INSERT INTO identity-partitioned Iceberg tables is already supported.
This patch adds support for the rest of the transforms.
We create the partitioning expressions in InsertStmt. Based on these
expressions data are automatically shuffled and sorted by the backend
executors before rows are given to the table sink operators. The table
sink operator writes the partitions one-by-one and creates a
human-readable partition path for them.
In the end, we will convert the partition path to partition data and
create Iceberg DataFiles with information about the files written.
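A sketch of the end-to-end flow (via the impyla client; the postfix
transform syntax mirrors the 'p IDENTITY' example earlier in this log
and should be treated as illustrative, not exact grammar):

from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cur = conn.cursor()
cur.execute('CREATE TABLE ice_events (id bigint, ts timestamp) '
            'PARTITION BY SPEC (ts DAY) STORED AS ICEBERG')
# Rows are shuffled/sorted on the partition expressions before they
# reach the table sink, which writes one partition at a time.
cur.execute('INSERT INTO ice_events SELECT id, ts FROM staged_events')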
Testing:
* added planner test
* added e2e tests
Change-Id: I3edf02048cea78703837b248c55219c22d512b78
Reviewed-on: http://gerrit.cloudera.org:8080/16939
Reviewed-by: wangsheng <skyyws@163.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The mask functions in Hive are implemented through GenericUDFs, which
can accept an arbitrary number of function signatures. Impala
currently doesn't support GenericUDFs, so we provide builtin mask
functions with limited overloads.
This patch adds some missing overloads that could be used by Ranger
default masking policies, e.g. MASK_HASH, MASK_SHOW_LAST_4,
MASK_DATE_SHOW_YEAR, etc.
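For example (a sketch via the impyla client; treat the exact overload
names and signatures as illustrative, following Hive's mask UDFs):

from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cur = conn.cursor()
# mask() applies the default masking; mask_hash() backs MASK_HASH.
cur.execute("select mask('TestString-123'), mask_hash('secret')")
print(cur.fetchall())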
Tests:
- Add test coverage on all default masking policies applied on all
supported types.
Change-Id: Icf3e70fd7aa9f3b6d6b508b776696e61ec1fcc2e
Reviewed-on: http://gerrit.cloudera.org:8080/16930
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds a --profile_verbosity option for impala-profile-tool with
the following levels:
* 0: minimal
* 1: legacy - matches the old output; this is still the default
* 2: default - basic descriptive stats, used for V2 profile.
* 3: extended
* 4: full
This will help with transition to the V2 profile because we
can have a nice, high-level, readable text profile by default
with the option to produce more detailed profiles and alternate
views of the profile from the thrift profile.
Use the profile version in impala-profile-tool to dump the
more verbose output for the V2 profile while preserving the
same output for the legacy profile.
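A sketch of invoking the tool at the new verbosity (assuming the
profile log is supplied on stdin; paths are hypothetical):

import subprocess

with open('impala_profile_log_v2', 'rb') as log:
    out = subprocess.run(['impala-profile-tool', '--profile_verbosity=2'],
                         stdin=log, capture_output=True, check=True)
print(out.stdout.decode())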
Reduce verbosity of v2 profile output - only include mean/min/max
by default. I intend to refine the output at the different
verbosity levels for the v2 profiles further as part of IMPALA-9382;
it is still fairly noisy.
Fix output with/without gen_experimental_profile - there
was a small difference in that the summary stats were not
output in the averaged profile.
Testing:
* Add an end-to-end test that generates output for a small
profile log and compares against expected files.
* Tweak other profile tests to reflect changes to output.
Change-Id: I82618a813e29af7996dfaed78873b2a73bc0231d
Reviewed-on: http://gerrit.cloudera.org:8080/16881
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Insertion of invalid timestamps causes Impala to crash when it uses
the INT64 Parquet timestamp types.
This patch fixes the error by checking for null values in
Int64TimestampColumnWriterBase::ConvertValue().
Testing:
* added e2e tests
Change-Id: I74fb754580663c99e1d8c3b73f8d62ea3305ac93
Reviewed-on: http://gerrit.cloudera.org:8080/16951
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A recent patch (IMPALA-9930) introduces a new admission control rpc
service, which can be configured to perform admission control for
coordinators. In that patch, the admission service runs in an impalad.
This patch separates the service out to run in a new daemon, called
the admissiond. It also integrates this new daemon with the build
infrastructure around Docker.
Some notable changes:
- Adds a new class, AdmissiondEnv, which performs the same function
for the admissiond as ExecEnv does for impalads.
- The '/admission' http endpoint is exposed on the admissiond's webui
if the admission control service is in use, otherwise it is exposed
on coordinator impalad's webuis.
- start-impala-cluster.py takes a new flag --enable_admission_service
which configures the minicluster to have an admissiond with all
coordinators using it for admission control (see the sketch below).
- Coordinators are now configured to use the admission service by
specifying the startup flag --admission_service_host. This is
intended to mirror the configuration of the statestored/catalogd
location.
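For illustration, a minimal sketch of bringing up a minicluster with
the new daemon (run from the Impala repo root):

import subprocess

# --enable_admission_service starts an admissiond and points all
# coordinators at it for admission control.
subprocess.run(['bin/start-impala-cluster.py', '--enable_admission_service'],
               check=True)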
Testing:
- Existing tests for the admission control service are modified to run
with an admissiond.
- Manually ran start-impala-cluster.py with --enable_admission_service
and --docker_network to verify Docker integration.
Change-Id: Id677814b31e9193035e8cf0d08aba0ce388a0ad9
Reviewed-on: http://gerrit.cloudera.org:8080/16891
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch supports creating required/optional fields for Iceberg
tables. If we set the 'NOT NULL' property for an Iceberg table column
in SQL, Impala will create a required field through the Iceberg API;
'NULL' or the default will create an optional field.
Besides, 'DESCRIBE XXX' for an Iceberg table will display the
nullability like this:
+------+--------+---------+----------+
| name | type | comment | nullable |
+------+--------+---------+----------+
| id | int | | false |
| name | string | | true |
| age | int | | true |
+------+--------+---------+----------+
And 'SHOW CREATE TABLE XXX' will also display the 'NULL'/'NOT NULL'
property for Iceberg tables.
Tests:
* added new test in iceberg-create.test
* added new test in iceberg-negative.test
* added new test in show-create-table.test
* modified 'DESCRIBE XXX' results in iceberg-create.test
* modified 'DESCRIBE XXX' results in iceberg-alter.test
* modified create table results in show-create-table.test
Change-Id: I70b8014ba99f43df1b05149ff7a15cf06b6cd8d3
Reviewed-on: http://gerrit.cloudera.org:8080/16904
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch contains a variety of small refactors needed to enable the
new admission control daemon, to separate them out from the main patch
for ease of reviewing.
The following changes are made:
- A new class is introduced, DaemonEnv, which contains singleton
objects common to all Impala daemons, currently just a MetricGroup
and Webservers. The purpose is to reduce code duplication when the
new admissiond daemon is added. This is analogous to how ExecEnv is
used for impalad-specific singletons already.
This patch modifies the catalogd and statestored to use DaemonEnv.
impalads could also use DaemonEnv, but its tricky due to
dependencies in the order of creation and initialization for objects
such as the ReservationTracker and BufferPool relative to the
MetricGroup and Webserver, so this is left for followup work.
- Direct use of ExecEnv in ImpalaServicePool and AdmissionController
is removed, as the admissiond will also need to use these classes
and it will not have an ExecEnv.
Testing:
- Passed a run of existing core tests.
Change-Id: I2e097e20458354f78bfc3477cac6fb3a2835f094
Reviewed-on: http://gerrit.cloudera.org:8080/16890
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A username can be determined for a session via two mechanisms:
* In a secure env, the user is authenticated by LDAP or Kerberos
* In an unsecure env, the client specifies the user name, either
as a parameter to the OpenSession API (HS2) or as a parameter
to the first query run (beeswax)
This patch affects what happens if neither of the above mechanisms
is used. Previously we would end up with the username being an
empty string, but this makes Ranger unhappy. Hive uses the name
"anonymous" in this situation, so we change Impala's behaviour too.
This is configurable by -anonymous_user_name. -anonymous_user_name=
reverts to the old behaviour.
Testing:
* Add an end-to-end test that exercises this via impala-shell for
HS2, HS2-HTTP and beeswax protocols.
* Tweak a couple of existing tests that depended on the previous
behavior.
Change-Id: I6db491231fa22484aed476062b8fe4c8f69130b0
Reviewed-on: http://gerrit.cloudera.org:8080/16902
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In some cases Kudu plans could contain more hosts than the actual number of executors.
This commit fixes it by capping the number of hosts at the number of executors,
and determining which executors have local scan ranges.
Testing:
- Ran core tests
- Updated Kudu planner tests where the memory estimates changed.
Change-Id: I72e341597e980fb6a7e3792905b942ddf5797d03
Reviewed-on: http://gerrit.cloudera.org:8080/16880
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
These tests were disabled due to S3's eventually consistent
behavior. Now that S3 is strongly consistent, these tests do
not need to be disabled.
Testing:
- Ran s3 core job
Change-Id: Ie9041f530bf3a818f8954b31a3d01d9f6753d7d4
Reviewed-on: http://gerrit.cloudera.org:8080/16931
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
QueueNode::not_admitted_reason can be accessed concurrently by the
coordinator thread that calls SubmitForAdmission and the admission
control dequeue loop thread.
This patch fixes this by ensuring that not_admitted_reason is only
accessed if 'admission_ctrl_lock_' is held.
Change-Id: Iacb3f37d8e1797c2b1d7bc32ba6368419e9ae444
Reviewed-on: http://gerrit.cloudera.org:8080/16926
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Incorporated review comments:
- removed the para as per the feedback
- listed all the overloads that are introduced
- stated that Impala does not yet support new Hive UDFs
- called out how mask functions were introduced through overloads
Change-Id: I37f0bcf4cf586cc5cfd03e4df68443967b6bb88f
Reviewed-on: http://gerrit.cloudera.org:8080/16861
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
For OR predicates that reference a view, currently the
ConvertToCNFRule does not get applied, since such a predicate is
considered a single table predicate even if it might reference columns
from different tables within the view. This patch enables the
application of this rule for such predicates by checking the expanded
view; if it satisfies the criterion, the rule can be applied and the
predicate can eventually be pushed to the scan.
Testing:
Added planner test in inline-view.test
Change-Id: Ie7a9a215d6b92aec07153e643268370f34186c88
Reviewed-on: http://gerrit.cloudera.org:8080/16912
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the result section of the testfile is required to use
escaped strings. Take the following result section as an example:
--- RESULTS
'Alice\nBob'
'Alice\\nBob'
The first line is a string with a newline character. The second line is
a string with a '\' and an 'n' character. When comparing with the actual
query results, we need to escape the special characters in the actual
results, e.g. replace newline characters with '\n'. This is done by
invoking encode('unicode_escape') on the actual result strings. However,
the input type of this method is unicode instead of str. When calling it
on str vars, Python will implicitly convert the input vars to unicode
type. The default encoding, ascii, is used. This causes
UnicodeDecodeError when the str contains non-ascii bytes. To fix this,
this patch explicitly decodes the input str using 'utf-8' encoding.
After fixing the logic of escaping the actual result strings, the next
problem is that it's painful to write unicode-escaped expected results.
Here is an example:
---- QUERY
select "你好\n你好"
---- RESULTS
'\u4f60\u597d\n\u4f60\u597d'
---- TYPES
STRING
It's painful to manually translate the unicode characters.
This patch adds a new comment, RAW_STRING, for the result section to use
raw strings instead of unicode-escaped strings. Here is an example:
---- QUERY
select "你好"
---- RESULTS: RAW_STRING
'你好'
---- TYPES
STRING
If the result contains special characters, it's recommended to use the
default string mode. If the special characters only contain newline
characters, we can use RAW_STRING and the existing MULTI_LINE comment
together.
This patch also fixes the issue that pytest fails to report assertion
failures if any of the compared str values contain non-ascii bytes
(IMPALA-10419). However, pytest works if the compared values are both
in unicode type. So we explicitly convert the actual and expected str
values to unicode type.
Test:
- Add tests in special-strings.test for raw string mode and the escaped
string mode (default).
- Run test_exprs.py::TestExprs::test_special_strings locally.
Change-Id: I7cc2ea3e5849bd3d973f0cb91322633bcc0ffa4b
Reviewed-on: http://gerrit.cloudera.org:8080/16919
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is a bit of cleanup left over from the KRPC work that could avoid
some lock contention for queries with large numbers of fragments.
The change is just to do cancellation of receivers once per query
instead of once per fragment.
Change-Id: I7677d21f0aaddc3d4b56f72c0470ea850e34611e
Reviewed-on: http://gerrit.cloudera.org:8080/16901
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For an inferred equality predicate of type c1 = c2, if both sides
refer to the same underlying tuple and slot, it is an identity
predicate which should not be evaluated by the SELECT node, since
doing so will incorrectly eliminate NULL rows. This patch fixes the
behavior.
Testing:
- Added planner tests with base table and with outer join
- Added runtime tests with base table and with outer join
- Added planner test for IMPALA-9694 (same root cause)
- Ran PlannerTest .. no other plans changed
Change-Id: I924044f582652dbc50085851cc639f3dee1cd1f4
Reviewed-on: http://gerrit.cloudera.org:8080/16917
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently EXPLAIN statements might open ACID transactions and
create locks on ACID tables.
This is not necessary since we won't modify the table. But the
real problem is that these transactions and locks are leaked and
open forever. They are even getting heartbeated while the
coordinator is still running.
The solution is to not consume any ACID resources for EXPLAIN
statements.
Testing:
* Added EXPLAIN INSERT OVERWRITE in front of an actual INSERT OVERWRITE
in an e2e test
Change-Id: I05113b1fd9a3eb2d0dd6cf723df916457f3fbf39
Reviewed-on: http://gerrit.cloudera.org:8080/16923
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, queries with analytic functions would materialize the
unassigned conjuncts. However, predicates that can be evaluated by
Kudu don't need to be materialized.
This optimization can reduce the amount of data to exchange and sort.
Testing:
- Add planner test in analytic-fns.test
Change-Id: Iba8371eff6ae1bcffd51b44843175c52f2127e46
Reviewed-on: http://gerrit.cloudera.org:8080/16905
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This avoids calling precacheChildrenOf() in cases when the
cached values will never be used. This change simply skips
calling precacheChildrenOf() in the cases when getPermissions()
is never called.
There is some opportunity to clean up this permissions
checking further, but I decided to keep this fix limited
in scope.
Change-Id: I2034695a956307309f656d56aa57aa07ae5163d8
Reviewed-on: http://gerrit.cloudera.org:8080/16898
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
BufferedTupleStream::DebugString() iterates over a std::list<Page>
that can potentially grow very large. As a consequence, the returned
string can grow large as well and cause problems, as previously
happened in IMPALA-9851. With this patch,
BufferedTupleStream::DebugString() only includes at most the first 100
pages of the page list.
Testing:
- Add new be test SimpleTupleStreamTest.ShortDebugString in
buffered-tuple-stream-test.cc
- Pass core tests
Change-Id: I6626c8d54f35f303c01f85be1dd9aa54c8ad9a2d
Reviewed-on: http://gerrit.cloudera.org:8080/16884
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
TestResultSpoolingFetchSize.test_fetch has been flaky in the
ubuntu-16.04-dockerised environment, not reaching the finished state
within 10 seconds. This patch increases the timeout of the test to 30
seconds.
Testing:
- Looped the test locally.
Change-Id: Id2e8a9db904da5f1e4acc9e18b3987b8a4ec24e5
Reviewed-on: http://gerrit.cloudera.org:8080/16895
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for limiting the rows produced by a join node
such that runaway join queries can be prevented.
The limit is specified by a query option. Queries exceeding that limit
get terminated. The checking runs periodically, so the actual rows
produced may go somewhat over the limit.
JOIN_ROWS_PRODUCED_LIMIT is exposed as an advanced query option.
The query profile is updated to include query-wide and per-backend
metrics for rows produced (RowsReturned). Example from:
set JOIN_ROWS_PRODUCED_LIMIT = 10000000;
select count(*) from tpch_parquet.lineitem l1 cross join
(select * from tpch_parquet.lineitem l2 limit 5) l3;
NESTED_LOOP_JOIN_NODE (id=2):
- InactiveTotalTime: 107.534ms
- PeakMemoryUsage: 16.00 KB (16384)
- ProbeRows: 1.02K (1024)
- ProbeTime: 0.000ns
- RowsReturned: 10.00M (10002025)
- RowsReturnedRate: 749.58 K/sec
- TotalTime: 13s337ms
Testing:
Added tests for JOIN_ROWS_PRODUCED_LIMIT
Change-Id: Idbca7e053b61b4e31b066edcfb3b0398fa859d02
Reviewed-on: http://gerrit.cloudera.org:8080/16706
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>