The tests test_iceberg_query and test_iceberg_profile fail after the
patch for IMPALA-9741 was merged, because the default timezone of Impala
is not UTC. This patch fixes the issue by adding "SET TIMEZONE=UTC;"
before those test queries are run.
Testing:
- Verified in a local development environment that test_iceberg_query
and test_iceberg_profile pass after applying this patch.
Change-Id: Ie985519e8ded04f90465e141488bd2dda78af6c3
Reviewed-on: http://gerrit.cloudera.org:8080/16425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-7961 handled the case of the query "create table if not exists"
with sync_ddl set to true. Customers reported a similar issue for the
query "create database if not exists" with sync_ddl set to true.
This patch applies a fix similar to the one for IMPALA-7961 to the
function CatalogOpExecutor.createDatabase().
Testing:
- Manual tests
Since this is a racy bug, I could only reproduce it by forcing
frequent topicUpdateLog GCs along with a specific sequence of
actions, like: run some DDLs and REFRESHs to trigger a GC in
topicUpdateLog, then run query "create database if not exists" with
sync_ddl as true. Verified that the issue couldn't be reproduced
after applying this patch.
- Passed exhaustive test.
Change-Id: Id623118f8938f416414c45d93404fb70d036a9df
Reviewed-on: http://gerrit.cloudera.org:8080/16421
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This work addresses a data race condition in admission controller by
providing the initializing values for two data members (
is_query_mem_tracker_ and query_id_) in a constructor for the MemTracker
class. Without doing so, the two data members are set, without lock
protection, after the object is constructed, which allows other threads
to modify either of them at the same time.
Testing:
1. Ran the python admission controller test successfully with a tsan
build. Data race was not observed with the enhancement. Data race
was observed without the enhancement.
2. Ran the core test.
Change-Id: I9c4ffe8064d3e099a525cc48c218ef73112fb67b
Reviewed-on: http://gerrit.cloudera.org:8080/16408
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change exposes the daemon health of statestored and catalogd via
an HTTP endpoint '/healthz'. If the server is healthy, this endpoint
will return HTTP code 200 (OK). If it is unhealthy, it will return
503 (Service Unavailable). This is consistent with the endpoint added
for impalads in IMPALA-8895.
Testing:
- Extended test in test_web_pages.py
Change-Id: I7714734df8e50dabbbebcb77a86a5a00bd13bf7c
Reviewed-on: http://gerrit.cloudera.org:8080/16295
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This work addresses a failure by disabling undefined behavior sanitizer
testing for the AdmissionControllerTest.TopNQueryCheck test. In the test,
std::regex_match() is used to verify the appearance of certain strings
and can produce a core with a very long stack trace failing in
std::vector::operator[]().
Testing:
1. Ran the test both in regular mode and with the undefined behavior
sanitizer check disabled. No core was seen.
Change-Id: I16d6cff8fad8d0e93a24ec3fefa9cc1f8c471aad
Reviewed-on: http://gerrit.cloudera.org:8080/16404
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch mainly implements querying of Iceberg tables through Impala.
The following SQL creates an external Iceberg table:
CREATE EXTERNAL TABLE default.iceberg_test (
level string,
event_time timestamp,
message string
)
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
Or include just the table name and location, like this:
CREATE EXTERNAL TABLE default.iceberg_test
STORED AS ICEBERG
LOCATION 'hdfs://xxx'
TBLPROPERTIES ('iceberg_file_format'='parquet');
'iceberg_file_format' is the file format in Iceberg. Currently only
PARQUET is supported; other formats will be supported in the future. If
you don't specify this property in your SQL, the default file format
is PARQUET.
We achieved this by treating the Iceberg table as a normal
unpartitioned HDFS table. When querying an Iceberg table, we push down
partition column predicates to Iceberg to decide which data files
need to be scanned, and then pass this information to the BE to
do the real scan operation.
Testing:
- Unit test for Iceberg in FileMetadataLoaderTest
- Create table tests in functional_schema_template.sql
- Iceberg table query test in test_scanners.py
Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
Reviewed-on: http://gerrit.cloudera.org:8080/16143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
While ds_hll_sketch() generates a string value as output, the data
is not ASCII-encoded text but a bitsketch. Because of this, the shell
disconnects when it tries to decode the data.
The issue can be reproduced simply, for example by calling unhex
with an input that yields non-UTF-8 bytes.
Example: SELECT unhex("aa");
This patch contains a solution where we replace any characters that
are not UTF-8 decodable if we run into a UnicodeDecodeError after
fetching the data.
This solution works with the Thrift 0.9.3 autogenerated gen-py
but still fails with Thrift 0.11.0.
For Thrift 0.11.0 the error is caught and an error message is sent
(this does not work with the beeswax protocol, because it generates a
different error (TypeError) which can come for other reasons too).
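The replacement approach described above can be sketched as follows. This
is a minimal Python 3 illustration, not the shell's actual code; the
function name decode_for_display is hypothetical.

```python
def decode_for_display(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing undecodable bytes instead of failing.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" substitutes U+FFFD for each undecodable byte.
        return raw.decode("utf-8", errors="replace")


# 0xAA is what unhex("aa") produces; it is not valid UTF-8 on its own.
print(repr(decode_for_display(bytes.fromhex("aa"))))
print(decode_for_display("plain text".encode("utf-8")))
```

Replacing instead of raising keeps the session alive at the cost of
losing the original bytes in the displayed value.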
Testing:
- Manual testing with these protocols: 'hs2-http', 'hs2', 'beeswax'
Change-Id: I0c5f1290356e21aed8ca7f896f953541942aed05
Reviewed-on: http://gerrit.cloudera.org:8080/16418
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
While ds_hll_sketch() generates a string value as output, the data
is not ASCII-encoded text but a bitsketch. Because of this, the shell
disconnects when it tries to decode the data.
The issue can be reproduced simply, for example by calling unhex
with an input that yields non-UTF-8 bytes.
Example: SELECT unhex("aa");
This patch contains a solution where we replace any characters that
are not UTF-8 decodable if we run into a UnicodeDecodeError after
fetching the data.
This solution works with the Thrift 0.9.3 autogenerated gen-py
but still fails with Thrift 0.11.0.
For Thrift 0.11.0 the error is caught and an error message is sent
(this does not work with the beeswax protocol, because it generates a
different error (TypeError) which can come for other reasons too).
Testing:
- Manual testing with these protocols: 'hs2-http', 'hs2', 'beeswax'
Change-Id: Ic5cfb907871ca83e5f04a39ca9d7a8e138d711a8
Reviewed-on: http://gerrit.cloudera.org:8080/16305
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
Implementing codegen for HiveUdfCall.
Testing:
Verified that java udf tests pass locally.
Benchmarks:
Used a UDF from TestUdf.java that adds three integers:
create function tpch15_parquet.sum3(int, int, int) returns int
location '/test-warehouse/impala-hive-udfs.jar'
symbol='org.apache.impala.TestUdf';
Used the following query on the master branch and the change's branch:
set num_nodes=1; set mt_dop=1;
select min(tpch15_parquet.sum3(cast(l_orderkey as int),
cast(l_partkey as int), cast(l_suppkey as int)))
from tpch15_parquet.lineitem;
Results averaged over 100 runs after warmup:
Master: 20.6346s, stddev: 0.3132411856765332
This change: 19.0256s, stddev: 0.42039019873436
This is a ~7.8% improvement.
Change-Id: I2f994dac550f297ed3c88491816403f237d4d747
Reviewed-on: http://gerrit.cloudera.org:8080/16314
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The flaky test was
TestImpalaShellInteractive.test_history_does_not_duplicate_on_interrupt.
The test failed with a timeout error when the interrupt signal arrived
after the next test query had already started. The impala-shell output
was ^C instead of the expected query result.
This change adds an additional blocking expect call to wait for the
interrupt signal to arrive before sending the next query.
Change-Id: I242eb47cc8093c4566de206f46b75b3feab1183c
Reviewed-on: http://gerrit.cloudera.org:8080/16391
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This patch adds support for constant propagation of range predicates
involving date and timestamp constants. Previously, only equality
predicates were considered for propagation. The new type of propagation
is shown by the following example:
Before constant propagation:
WHERE date_col = CAST(timestamp_col as DATE)
AND timestamp_col BETWEEN '2019-01-01' AND '2020-01-01'
After constant propagation:
WHERE date_col >= '2019-01-01' AND date_col <= '2020-01-01'
AND timestamp_col >= '2019-01-01' AND timestamp_col <= '2020-01-01'
AND date_col = CAST(timestamp_col as DATE)
As a consequence, since Impala supports table partitioning by date
columns but not timestamp columns, the above propagation enables
partition pruning based on timestamp ranges.
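The derivation above can be sketched as follows. This is a simplified
Python illustration and not the planner's implementation; the names
Range, DateRange and propagate_to_date are hypothetical. It relies on the
fact that casting a timestamp to a date is monotonic, so a bound on the
timestamp implies a (non-strict) bound on the date.

```python
from datetime import date, datetime
from typing import List, NamedTuple


class Range(NamedTuple):
    column: str
    op: str            # one of "<", "<=", ">", ">="
    constant: datetime


class DateRange(NamedTuple):
    column: str
    op: str
    constant: date


# A strict bound on the timestamp only implies a non-strict bound on the
# date, because many timestamps map to the same date.
_LOOSENED = {"<": "<=", "<=": "<=", ">": ">=", ">=": ">="}


def propagate_to_date(date_col: str, ts_col: str,
                      ranges: List[Range]) -> List[DateRange]:
    """Given date_col = CAST(ts_col AS DATE), derive ranges on date_col."""
    return [
        DateRange(date_col, _LOOSENED[r.op], r.constant.date())
        for r in ranges
        if r.column == ts_col
    ]


ranges = [
    Range("timestamp_col", ">=", datetime(2019, 1, 1)),
    Range("timestamp_col", "<=", datetime(2020, 1, 1)),
]
for p in propagate_to_date("date_col", "timestamp_col", ranges):
    print(p.column, p.op, p.constant.isoformat())
```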
Existing code for equality based constant propagation was refactored
and consolidated into a new class which handles both equality and
range based constant propagation. Range based propagation is only
applied to date and timestamp columns.
Testing:
- Added new range constant propagation tests to PlannerTest.
- Added e2e test for range constant propagation based on a newly
added date partitioned table.
- Ran precommit tests.
Change-Id: I811a1f8d605c27c7704d7fc759a91510c6db3c2b
Reviewed-on: http://gerrit.cloudera.org:8080/16346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Impala TRUNCATEs an ACID table, it creates a new base directory
with the hidden file "_empty" in it. Newer Hive versions ignore files
starting with underscore, therefore they ignore the whole base
directory.
To resolve this issue we can simply rename the empty file to "empty".
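The naming rule behind this can be sketched as follows. This is a
hypothetical Python helper for illustration; treating names that start
with "." as hidden as well is an assumption based on the common Hadoop
convention, not something stated in this patch.

```python
def is_hidden(file_name: str) -> bool:
    """Return True if a reader following the convention would skip the file.

    Assumption: names starting with "_" or "." are treated as hidden.
    """
    return file_name.startswith(("_", "."))


print(is_hidden("_empty"))  # the old marker file name
print(is_hidden("empty"))   # the new marker file name
```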
Testing:
* update acid-truncate.test accordingly
Change-Id: Ia0557b9944624bc123c540752bbe3877312a7ac9
Reviewed-on: http://gerrit.cloudera.org:8080/16396
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit includes the following changes:
1. Build native-toolchain locally by script on the aarch64 platform.
2. Change the version numbers of some native-toolchain libraries.
3. Split SKIP_TOOLCHAIN_BOOTSTRAP and DOWNLOAD_CDH_COMPONENTS into two
separate settings, because on aarch64 only the CDP components need to
be downloaded, not the toolchain.
4. Download the Hadoop aarch64 native libraries, which the Impala build
needs.
With this commit, on Ubuntu 18.04 aarch64 you only need to run
bin/bootstrap_development.sh, just as on x86.
Change-Id: I769668c834ab0dd504a822ed9153186778275d59
Reviewed-on: http://gerrit.cloudera.org:8080/16065
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test test_refresh_updated_partitions runs some commands using Hive, which
causes it to fail on S3-specific jobs since we don't run HiveServer2 in those
environments. This patch skips the test on non-HDFS environments.
Change-Id: I0d27dd76e772e396a07419a58821ba899ac74188
Reviewed-on: http://gerrit.cloudera.org:8080/16399
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds logic in bin/jenkins/finalize.sh to check the ERROR
log for TSAN messages (i.e. WARNING: ThreadSanitizer: ...)
and generate a JUnitXML with the message. This happens when
TSAN aborts Impala.
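The idea can be sketched in Python as follows. The actual logic lives in
the shell script bin/jenkins/finalize.sh; the function name and the
JUnit suite and test case names below are hypothetical.

```python
import xml.etree.ElementTree as ET

TSAN_MARKER = "WARNING: ThreadSanitizer:"


def tsan_junit_xml(log_text: str):
    """Return a JUnit XML string if the log has TSAN warnings, else None."""
    messages = [line.strip() for line in log_text.splitlines()
                if TSAN_MARKER in line]
    if not messages:
        return None
    # Suite and test case names are placeholders chosen for this sketch.
    suite = ET.Element("testsuite", name="tsan", tests="1", failures="1")
    case = ET.SubElement(suite, "testcase", classname="tsan",
                         name="thread_sanitizer_messages")
    failure = ET.SubElement(case, "failure", message=messages[0])
    failure.text = "\n".join(messages)
    return ET.tostring(suite, encoding="unicode")


sample_log = "starting up\nWARNING: ThreadSanitizer: data race (pid=1)\n"
print(tsan_junit_xml(sample_log))
```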
Testing:
- Ran TSAN build (which is currently failing)
Change-Id: I44ea33a78482499decae0ec4c7c44513094b2f44
Reviewed-on: http://gerrit.cloudera.org:8080/16397
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently Impala checks the file metadata 'hive.acid.version' to decide
whether a file has the full ACID schema. There are cases when Hive
forgets to set this value for full ACID files, e.g. query-based
compactions.
So it's more robust to check the schema elements instead of the metadata
field. Also, Hive sometimes writes the schema with different character
cases, e.g. originalTransaction vs originaltransaction, so we should
compare the column names in a case-insensitive way.
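A case-insensitive schema check could look like the sketch below. It is
a Python illustration rather than the actual scanner code; the function
name is hypothetical and the column list is an assumption based on the
usual Hive full ACID file layout.

```python
# Assumed top-level columns of a Hive full ACID file (illustrative).
ACID_COLUMNS = ("operation", "originalTransaction", "bucket",
                "rowId", "currentTransaction", "row")


def has_full_acid_schema(top_level_columns) -> bool:
    """Compare column names against the ACID layout, ignoring case."""
    expected = [c.lower() for c in ACID_COLUMNS]
    actual = [c.lower() for c in top_level_columns]
    return actual == expected


print(has_full_acid_schema(
    ["operation", "originaltransaction", "bucket",
     "rowid", "currenttransaction", "row"]))
print(has_full_acid_schema(["id", "name"]))
```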
Testing:
* added test for full ACID compaction
* added test_full_acid_schema_without_file_metadata_tag to test full
ACID file without metadata 'hive.acid.version'
Change-Id: I52642c1755599efd28fa2c90f13396cfe0f5fa14
Reviewed-on: http://gerrit.cloudera.org:8080/16383
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds file type support for alluxio.
Alluxio URLs have a different prefix, such as:
alluxio://zk@zk-1:2181,zk-2:2181,zk-3:2181/path/
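A scheme-based check of this kind can be sketched in Python as follows.
The helper name is hypothetical and this is not the code added by the
change.

```python
from urllib.parse import urlparse


def is_alluxio_path(path: str) -> bool:
    """Return True if the path uses the alluxio:// scheme."""
    return urlparse(path).scheme == "alluxio"


print(is_alluxio_path("alluxio://zk@zk-1:2181,zk-2:2181,zk-3:2181/path/"))
print(is_alluxio_path("hdfs://namenode:8020/path/"))
```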
Testing:
Add unit test for alluxio file system type checks.
Change-Id: Id92ec9cb0ee241a039fe4a96e1bc2ab3eaaf8f77
Reviewed-on: http://gerrit.cloudera.org:8080/16379
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fix handles the case where a column's cardinality is zero but the
column is nullable and the null stats indicate that there are null
values. In that case we adjust the cardinality from 0 to 1.
The cardinality of zero was especially problematic when calculating
cardinalities for multiple predicates with multiplication. The 0 would
propagate up the plan tree and result in poor plan choices such as
always using broadcast joins where shuffle would've been more optimal.
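The effect of the adjustment can be sketched as follows. This is a
simplified Python illustration; the function names are hypothetical and
the plain product is only a stand-in for the planner's real estimate.

```python
def adjusted_ndv(ndv: int, num_nulls: int) -> int:
    """Treat NULL as one distinct value when stats report zero values."""
    if ndv == 0 and num_nulls > 0:
        return 1
    return ndv


def combined_cardinality(ndvs) -> int:
    """Multiply per-column cardinalities (simplified stand-in)."""
    result = 1
    for n in ndvs:
        result *= n
    return result


# Without the fix, a single zero wipes out the whole estimate.
print(combined_cardinality([0, 40, 25]))
# With the fix, the all-NULL column contributes 1 instead of 0.
print(combined_cardinality([adjusted_ndv(0, num_nulls=1000), 40, 25]))
```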
Testing:
* 26 Node TPC-DS 30TB run had better plans for Q4 and Q11
- Q4 172s -> 80s
- Q11 103s -> 77s
* CardinalityTest
* TpcdsPlannerTest
Change-Id: Iec967053b4991f8c67cde62adf003cbd3f429032
Reviewed-on: http://gerrit.cloudera.org:8080/16349
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Remove the dependency on hadoop-hdfs. This jar file contains the core
code for implementing HDFS, and thus pulls in a bunch of unnecessary
transitive dependencies. Impala currently only requires this jar for
some configuration key names. Most of these configuration key names have
been moved to the appropriate HDFS client jars, and some others are
deprecated altogether. Removing this jar required making a few code
changes to move the location of the referenced configuration keys.
Removes all transitive Kafka dependencies from the Apache Ranger
dependency. Previously, Impala only excluded Kafka jars with binary
version kafka_2.11; however, it seems that Ranger recently upgraded the
dependency version to kafka_2.12. Now all Kafka dependencies are
excluded, regardless of artifact name.
Removes all transitive dependencies from the Apache Ozone dependency.
Impala has a dependency on the Ozone client shaded-jar, which already
includes all required transitive dependencies. For some reason, Ozone
still pulls in some transitive dependencies even though they are not
needed.
Made some other minor cleanup / improvements in the fe/pom.xml file.
This saves about 70 MB of space in the Docker images.
Testing:
* Ran exhaustive tests
* Ran on-prem cluster E2E tests
Change-Id: Iadbb6142466f73f067dd7cf9d401ff81145c74cc
Reviewed-on: http://gerrit.cloudera.org:8080/16311
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In HIVE-19064 the class of GenericHiveLexer was introduced as an
intermediate class between the classes of HiveLexer and Lexer. In order
for ToSqlUtils.java to be compiled once we bump up CDP_BUILD_NUMBER that
includes this change on the Hive side, this patch updates
shaded-deps/hive-exec/pom.xml to include the jar of GenericHiveLexer so
that Impala could be successfully built.
Testing:
- Verified that Impala could compile in a local development
environment after applying this patch.
Change-Id: I27db1cb8de36dd86bae08b7177ae3f1c156d73bc
Reviewed-on: http://gerrit.cloudera.org:8080/16390
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
INTERSECT/EXCEPT are not duplicate preserving operations. The distinct
aggregations can happen in each operand, the leftmost operand only, or
after all the operands in a separate aggregation step. Except for a
couple special cases we would use the last strategy most often.
This change pushes the distinct aggregation down to the leftmost operand
in cases where there are no analytic functions, or when a distinct or
grouping operation already eliminates duplicates.
In general DISTINCT placement such as in this case should be done
throughout the entire plan tree in a cost based manner as described in
IMPALA-5260
Testing:
* TpcdsPlannerTest
* PlannerTest
* TPC-DS 30TB Perf run for any affected queries
- Q14-1 180s -> 150s
- Q14-2 109s -> 90s
- Q8 no significant change
* SetOperation Planner Tests
* Analyzer tests
* Tpcds Functional Workload
Change-Id: Ia248f1595df2ab48fbe70c778c7c32bde5c518a5
Reviewed-on: http://gerrit.cloudera.org:8080/16350
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
A query will come into the FINISHED state when some rows are available,
even when some fragment instances are still executing. When a retryable
query comes into the FINISHED state and the client hasn't fetched any
results, we are still able to retry it for any retryable failures. This
patch fixes a DCHECK when retrying a FINISHED state query.
Tests:
- Add a test in test_query_retries.py for retrying a query in FINISHED
state.
Change-Id: I11d82bf80640760a47325833463def8a3791bdda
Reviewed-on: http://gerrit.cloudera.org:8080/16351
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds a BLOOM_FILTER_ERROR_RATE option that takes a
value between 0 and 1 (exclusive) that can override
the default target false positive probability (fpp)
value of 0.75 for selecting the filter size.
It does not affect whether filters are disabled
at runtime.
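To illustrate how a target fpp drives filter size, the sketch below uses
the textbook Bloom filter sizing formula. This is an assumption for
illustration only: Impala's filter implementation has its own sizing
logic, and the function name is hypothetical.

```python
import math


def classic_bloom_bits(ndv: int, fpp: float) -> int:
    """Textbook estimate: bits = -ndv * ln(fpp) / (ln 2)^2."""
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be between 0 and 1 (exclusive)")
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


# A lower target fpp needs more bits for the same number of values.
for fpp in (0.75, 0.1, 0.01):
    print(fpp, classic_bloom_bits(1_000_000, fpp))
```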
Adds estimated FPP and bloom size to the routing
table so we have some observability. Here is an
example:
tpch_kudu> select count(*) from customer join nation on n_nationkey = c_nationkey;
ID  Src. Node  Tgt. Node(s)  Target type  Partition filter  Pending (Expected)  First arrived  Completed  Enabled  Bloom Size  Est fpp
--------------------------------------------------------------------------------------------------------------------------------------
1   2          0             LOCAL        false             0 (3)               N/A            N/A        true     MIN_MAX
0   2          0             LOCAL        false             0 (3)               N/A            N/A        true     1.00 MB     1.04e-37
Testing:
Added a test that shows the query option affecting filter size.
Ran core tests.
Change-Id: Ifb123a0ea1e0e95d95df9837c1f0222fd60361f3
Reviewed-on: http://gerrit.cloudera.org:8080/16377
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The aws-java-sdk-bundle is one of the largest dependencies in the Impala
Docker images and continues to grow. The jar includes SDKs for
every single AWS service.
This patch removes most of the unnecessary SDKs from the
aws-java-sdk-bundle, thus drastically decreasing the size of the
dependency. The Maven shade plugin is used to do this, and the
implementation is similar to what is currently done for the hive-exec
jar.
This patch takes a conservative approach to removing packages from the
aws-java-sdk-bundle jar, and I ensured no direct dependencies of the S3
SDK were removed. The idea is to only remove dependencies that S3A would
never conceivably need. Given the huge number of AWS services, I only
focused on removing the largest SDKs (the size of each SDK is estimated
by the number of classes in the SDK).
This decreases the size of the Docker images by about 100 MB.
Testing:
* Ran core tests against S3
Change-Id: I0939f73be986f83cc1fd07921563b4d9201780f2
Reviewed-on: http://gerrit.cloudera.org:8080/16342
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This work addresses the current limitation in admission controller by
appending the last known memory consumption statistics about the set of
queries running or waiting on a host or in a pool to the existing memory
exhaustion message. The statistics are logged in impalad.INFO when a
query is queued, or queued and then timed out, due to memory pressure in
the pool or on the host. The statistics can also be part of the query
profile.
The new memory consumption statistics can be either stats on host or
aggregated pool stats. The stats on host describe memory consumption
for every pool on a host. The aggregated pool stats describe the
aggregated memory consumption on all hosts for a pool. For each stats
type, information such as query IDs and memory consumption of up to the
top 5 queries is provided, in addition to the min, the max, the average
and the total memory consumption for the query set.
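The shape of such a summary can be sketched as follows. This is a Python
illustration of the statistics listed above, not the C++ implementation;
the function name and the dictionary layout are hypothetical.

```python
import heapq
from statistics import mean


def summarize_memory(consumption_by_query, top_n=5):
    """Summarize per-query memory consumption (bytes) for a query set."""
    if not consumption_by_query:
        return None
    values = list(consumption_by_query.values())
    # Keep only the top_n largest consumers, largest first.
    top = heapq.nlargest(top_n, consumption_by_query.items(),
                         key=lambda kv: kv[1])
    return {
        "top": top,
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
        "total": sum(values),
    }


usage = {"q1": 10, "q2": 20, "q3": 30, "q4": 40, "q5": 50, "q6": 60}
print(summarize_memory(usage))
```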
When a query request is queued due to memory exhaustion, the above
new consumption statistics are logged when the BE logging level is set
at 2.
When a query request is timed out due to memory exhaustion, the above
new consumption statistics are logged when the BE logging level is set
at 1.
Testing:
1. Added a new test TopNQueryCheck in admission-controller-test.cc to
verify that the topN query memory consumption details are reported
correctly.
2. Added two new tests in test_admission_controller.py to simulate
queries being queued and then timed out due to pool or host memory
pressure.
3. Added a new test TopN in mem-tracker-test.cc to
verify that the topN query memory consumption details are computed
correctly from a mem tracker hierarchy.
4. Ran Core tests successfully.
Change-Id: Id995a9d044082c3b8f044e1ec25bb4c64347f781
Reviewed-on: http://gerrit.cloudera.org:8080/16220
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The recent patch for IMPALA-6788 makes the coordinator cancel inflight
query fragment instances when it receives a failure report from one
backend. It's possible that BackendState::Cancel() is called for
a fragment instance before the first execution status report
from its backend is received and processed by the coordinator.
Since the status of BackendState is set to Cancelled after Cancel()
is called, the execution of the fragment instance is treated as
Done in such a case, so the status report will NOT be processed.
Hence the backend receives an OK response from the coordinator even
though it sent a report with an execution error. This makes the backend
hit a DCHECK error if it is in the terminal state with an error.
This patch fixes the issue by making the coordinator send a CANCELLED
status in the response to the status report if the backend status is not
OK and the execution status report is not applied.
Testing:
- The issue could be reproduced by running test_failpoints for about
20 iterations. Verified the fix by running test_failpoints over
200 iterations without DCHECK failure.
- Passed TestProcessFailures::test_kill_coordinator.
- Passed TestRPCException::test_state_report_error.
- Passed exhaustive tests.
Change-Id: Iba6a72f98c0f9299c22c58830ec5a643335b966a
Reviewed-on: http://gerrit.cloudera.org:8080/16303
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We found that the following 4 tests do not run even if we remove all the
decorators like "@SkipIfKudu.no_hybrid_clock" or
"@SkipIfHive3.kudu_hms_notifications_not_supported" that skip the tests.
This is because the 3 classes containing them inherit from
CustomClusterTestSuite, which adds a constraint that only allows test
vectors with 'file_format' and 'compression_codec' being "text" and
"none", respectively, to be run.
1. TestKuduOperations::test_local_tz_conversion_ops
2. TestKuduClientTimeout::test_impalad_timeout
3. TestKuduHMSIntegration::test_create_managed_kudu_tables
4. TestKuduHMSIntegration::test_kudu_alter_table
To address this issue, in this patch we create a parent class for those
3 classes above and override the method of
add_custom_cluster_constraints() for this newly created parent class so
that we do not skip test vectors with 'file_format' and
'compression_codec' being "kudu" and "none", respectively.
On the other hand, this patch also removes a redundant method call to
super(CustomClusterTestSuite, cls).add_test_dimensions() in
CustomClusterTestSuite.add_custom_cluster_constraints() since
super(CustomClusterTestSuite, cls).add_test_dimensions() had
already been called immediately before the call to
add_custom_cluster_constraints() in
CustomClusterTestSuite.add_test_dimensions().
Testing:
- Manually verified that after removing the decorators to skip those
tests, those tests could be run.
Change-Id: I60a4bd4ac5a9026629fb840ab9cc7b5f9948290c
Reviewed-on: http://gerrit.cloudera.org:8080/16348
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In global invalidate metadata, we always load HDFS cache pools using the
CachePoolReader. However, this only works for HDFS file systems, not for
other systems like S3 or the local file system. We already handle this in
CatalogServiceCatalog#CatalogServiceCatalog(). This patch adds a check
in CatalogServiceCatalog#reset() to skip loading cache pools if it's not
a true HDFS file system.
Tests:
- Ran tests on S3. Verified that the IllegalStateException no longer
occurs.
Change-Id: Ib243d349177e1b982b313dd6e87ecc2ef4dfc3d8
Reviewed-on: http://gerrit.cloudera.org:8080/16335
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If we have returned any results to the client in the original query,
query retry will be skipped to avoid incorrect results. This patch adds
a query option, spool_all_results_for_retries, for retryable queries to
spool all results before returning any to the client. It defaults to
true. If all query results cannot be contained in the allocated result
spooling space, we'll return results and thus disable query retry on
the query.
Setting spool_all_results_for_retries to false will fall back to the
original behavior: the client can fetch results when any of them are
ready. So we explicitly set it to false in the retried query since it
won't be retried. For non-retryable queries or queries that don't enable
results spooling, the spool_all_results_for_retries option has no effect.
To implement this, this patch defers the time when results are ready to
be fetched. By default, the "rows available" event happens when any
results are ready. For a retryable query, when spool_query_results and
spool_all_results_for_retries are both true, the "rows available" event
happens after all results are spooled or when an error stops us from
doing so, e.g. the batch queue is full, cancellation or failures. After
the root fragment instance's Open() finishes, the coordinator will
wait until the results of BufferedPlanRootSink are ready.
BufferedPlanRootSink sets the results ready signal in its Send(),
Close(), Cancel(), FlushFinal() methods.
Tests:
- Add a test to verify that a retryable query will spool all its results
when results spooling and spool_all_results_for_retries are enabled.
- Add a test to verify that query retry succeeds when a retryable query
is still spooling its results (spool_all_results_for_retries=true).
- Add a test to verify that the retried query won't spool all results
even when results spooling and spool_all_results_for_retries are
enabled in the original query.
- Add a test to verify that the original query can be canceled
correctly. We need this because the added logic for
spool_all_results_for_retries is related to the cancellation code
path.
- Add a test to verify results will be returned when all of them can't
fit into the result spooling space, and query retry will be skipped.
Change-Id: I462dbfef9ddab9060b30a6937fca9122484a24a5
Reviewed-on: http://gerrit.cloudera.org:8080/16323
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Added TpcdsPlannerTest to include each TPC-DS query as a separate plan
test file. Removed the previous tpcds-all test file.
This means when running only PlannerTest no TPC-DS plans are checked,
however as part of a full frontend test run the TpcdsPlannerTest will be
included.
Runs with cardinality and resource checks, as well as using parquet
tables to include predicate pushdowns.
Change-Id: Ibaf40d8b783be1dc7b62ba3269feb034cb8047da
Reviewed-on: http://gerrit.cloudera.org:8080/16345
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
The query option REFRESH_UPDATED_HMS_PARTITIONS was introduced
earlier in IMPALA-4364 to detect changes in the partition
objects in HMS when a refresh table command is issued. Originally,
it relied on the StorageDescriptor#equals() method to
determine whether the partition in catalogd is the same as the
partition in HMS while executing the refresh statement.
However, using StorageDescriptor#equals() is dependent on the HMS
version and may introduce inconsistent behaviors after upgrades.
For example, when we backported the original patch to an older
distribution which uses Hive 2, the SkewedInfo field of
StorageDescriptor is not null. This field causes the comparison
logic to fail, since catalogd doesn't store the SkewedInfo
field in the cached StorageDescriptor to optimize memory usage.
This patch modifies the comparison logic to use an explicit
implementation in the HdfsPartition class which compares only
the fields that are cached in the HdfsPartition object.
Testing:
1. Added a new test for the comparison method.
2. Modified existing test for the query option.
Change-Id: I90c797060265f8f508d0b150e15da3d0f9961b9b
Reviewed-on: http://gerrit.cloudera.org:8080/16363
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
This adds support for the Cumulative Distribution Function (CDF) from
the Apache DataSketches KLL algorithm collection. It receives a
serialized KLL sketch and one or more float values to represent ranges
in the sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), (-inf, 5), (-inf, 10), (-inf, +inf)
Returns a comma-separated string where each value in the string is a
number in the range of [0,1] and shows what fraction of the
data is in the particular range.
Note, ds_kll_cdf() should return an Array of doubles as the result, but
for that we have to wait for complex type support. Until then, we
provide ds_kll_cdf_as_string(), which can be deprecated once we
have array support. The tracking Jira for returning complex types from
functions is IMPALA-9520.
Example:
select ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.2,0.401644,1,1 |
+----------------------------------------------------------+
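To make the range semantics concrete, here is a small Python sketch
that computes the same kind of output exactly from raw values. It does
not use a KLL sketch, so it is an illustration of the result format
only; the function name cdf_as_string is hypothetical, and the split
points are assumed to be sorted in ascending order.

```python
from bisect import bisect_left


def cdf_as_string(values, split_points):
    """Exact CDF over raw values (no sketch involved).

    For each split point s, report the fraction of values strictly below s,
    then append 1 for the implicit (-inf, +inf) range.
    """
    data = sorted(values)
    n = len(data)  # assumed to be non-zero
    fractions = [bisect_left(data, s) / n for s in split_points]
    fractions.append(1.0)
    return ",".join(f"{x:g}" for x in fractions)


# Ten values 0..9 with split points 2, 4 and 10.
print(cdf_as_string(range(10), [2, 4, 10]))  # prints 0.2,0.4,1,1
```

N split points always yield N + 1 values, and the last one is 1
because the final range covers all the data.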
Change-Id: I77e6afc4556ad05a295b89f6d06c2e4a6bb2cf82
Reviewed-on: http://gerrit.cloudera.org:8080/16359
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for the Probability Mass Function (PMF) from the
Apache DataSketches KLL algorithm collection. It receives a serialized
KLL sketch and one or more float values that represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), [1, 5), [5, 10), [10, +inf)
Returns a comma-separated string where each value in the string is a
number in the range of [0,1] and shows what fraction of the data
falls in the corresponding range.
Note, ds_kll_pmf() should return an Array of doubles as the result but
for that we have to wait for complex type support. Until then, we
provide ds_kll_pmf_as_string(), which can be deprecated once we
have array support. The tracking Jira for returning complex types
from functions is IMPALA-9520.
Example:
select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.202192,0.199452,0.598356,0 |
+----------------------------------------------------------+
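The half-open ranges can be illustrated with an exact computation over
raw values in Python. This is a sketch of the output format only, not
of the KLL algorithm; pmf_as_string is a hypothetical name and the
split points are assumed to be sorted in ascending order.

```python
from bisect import bisect_left


def pmf_as_string(values, split_points):
    """Exact PMF over raw values (no sketch involved).

    Split points [a, b, c] define the ranges
    (-inf, a), [a, b), [b, c), [c, +inf).
    """
    data = sorted(values)
    n = len(data)  # assumed to be non-zero
    # Number of values strictly below each split point, bracketed by 0 and n.
    cuts = [0] + [bisect_left(data, s) for s in split_points] + [n]
    fractions = [(hi - lo) / n for lo, hi in zip(cuts, cuts[1:])]
    return ",".join(f"{x:g}" for x in fractions)


# Ten values 0..9 with split points 2, 4 and 10.
print(pmf_as_string(range(10), [2, 4, 10]))  # prints 0.2,0.2,0.6,0
```

Unlike the cumulative variant, the values here sum to 1, and the last
entry is the share of data at or above the largest split point.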
Change-Id: I222402f2dce2f49ab2b3f6e81a709da5539293ba
Reviewed-on: http://gerrit.cloudera.org:8080/16336
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The bug was that the statement rewriter converted NOT IN <subquery>
predicates to != <subquery> predicates when the subquery could
be an empty set. This was invalid, because NOT IN (<empty set>)
is true, but != (<empty set>) is false.
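The difference between the two predicates can be observed in any SQL
engine that follows standard semantics. The sketch below uses SQLite
through Python's sqlite3 module rather than Impala, and the table
names outer_t and inner_t are made up for the example. In SQLite the
empty scalar subquery evaluates to NULL, so the != comparison is not
true for any row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outer_t (x INTEGER)")
conn.execute("CREATE TABLE inner_t (y INTEGER)")  # left empty on purpose
conn.executemany("INSERT INTO outer_t VALUES (?)", [(1,), (2,)])

# NOT IN over an empty set is true for every row.
not_in_rows = conn.execute(
    "SELECT x FROM outer_t WHERE x NOT IN (SELECT y FROM inner_t) ORDER BY x"
).fetchall()

# != against an empty scalar subquery is never true, so no row qualifies.
not_equal_rows = conn.execute(
    "SELECT x FROM outer_t WHERE x != (SELECT y FROM inner_t) ORDER BY x"
).fetchall()

print(not_in_rows)     # prints [(1,), (2,)]
print(not_equal_rows)  # prints []
```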
Testing:
Added targeted planner and end-to-end tests.
Ran exhaustive tests.
Change-Id: I66c726f0f66ce2f609e6ba44057191f5929a67fc
Reviewed-on: http://gerrit.cloudera.org:8080/16338
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows numeric keys for JSON objects in get_json_object. This patch
makes Impala consistent with Hive and Postgres behavior for
get_json_object.
Queries such as "select get_json_object('{"1": 5}', '$.1');"
would fail before this patch. Now the query will return '5'.
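As an illustration of the expected behavior, the following Python
sketch resolves a path whose key is numeric. It is a hypothetical,
heavily simplified lookup that only understands '$' followed by
'.key' steps; it is not Impala's implementation of get_json_object.

```python
import json


def get_json_object(json_str, path):
    """Simplified lookup: '$' followed by zero or more '.key' steps."""
    if not path.startswith("$"):
        return None
    value = json.loads(json_str)
    for key in filter(None, path[1:].split(".")):
        # Keys are always treated as object member names, even if numeric.
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    # Return non-string results in their JSON text form, e.g. 5 -> '5'.
    return value if isinstance(value, str) else json.dumps(value)


print(get_json_object('{"1": 5}', "$.1"))  # prints 5
```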
Testing:
* Added tests to expr-test
Change-Id: I7df037ccf2c79da0ba86a46df1dd28ab0e9a45f4
Reviewed-on: http://gerrit.cloudera.org:8080/14905
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function is very similar to ds_kll_quantile() but this one can
receive any number of rank parameters and returns a comma separated
string that holds the results for all of the given ranks.
For more details about ds_kll_quantile() see IMPALA-9959.
Note, ds_kll_quantiles() should return an Array of floats as the result
but for that we have to wait for complex type support. Until then, we
provide ds_kll_quantiles_as_string(), which can be deprecated once we
have array support. The tracking Jira for returning complex types
from functions is IMPALA-9520.
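A rough Python sketch of the "many ranks in, one comma-separated
string out" shape follows. It works on raw values instead of a KLL
sketch, the function name quantiles_as_string is hypothetical, and the
mapping from a rank r to the element at index floor(r * n) is an
assumption made for illustration, not necessarily the definition used
by DataSketches.

```python
def quantiles_as_string(values, ranks):
    """Exact quantiles over raw values for any number of ranks in [0, 1]."""
    data = sorted(values)
    n = len(data)  # assumed to be non-zero
    # Assumed rank-to-index mapping: floor(r * n), clamped to the last element.
    picks = [data[min(int(r * n), n - 1)] for r in ranks]
    return ",".join(f"{x:g}" for x in picks)


# Values 1..10 queried at ranks 0, 0.5 and 1.
print(quantiles_as_string(range(1, 11), [0, 0.5, 1]))  # prints 1,6,10
```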
Change-Id: I76f6039977f4e14ded89a3ee4bc4e6ff855f5e7f
Reviewed-on: http://gerrit.cloudera.org:8080/16324
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The minimum requirement for a spillable operator is ((min_buffers - 2) *
default_buffer_size) + 2 * max_row_size. In the min reservation, we only
reserve space for two large pages, one for reading, the other for
writing. However, to make the non-streaming GroupingAggregator work
correctly, we have to manage these extra reservations carefully, so
that it won't run out of the min reservation when it actually needs to
spill a large page, or when it actually needs to read a large page.
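The formula can be checked with a few lines of Python. The input
values in the example (16 buffers, 2 MiB default buffers, 8 MiB max
row size) are hypothetical and chosen only to make the arithmetic easy
to follow.

```python
MIB = 1024 * 1024


def min_reservation(min_buffers, default_buffer_size, max_row_size):
    """((min_buffers - 2) * default_buffer_size) + 2 * max_row_size."""
    return (min_buffers - 2) * default_buffer_size + 2 * max_row_size


# Hypothetical inputs: 14 default-sized buffers plus two large pages.
total = min_reservation(min_buffers=16,
                        default_buffer_size=2 * MIB,
                        max_row_size=8 * MIB)
print(total // MIB)  # prints 44
```

The two max_row_size terms are the two large pages mentioned above,
one for reading and one for writing.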
Specifically, how the large write page reservation is managed depends
on whether needs_serialize is true or false:
- If the aggregator needs to serialize the intermediate results when
spilling a partition, we have to save a large page worth of
reservation for the serialize stream, in case it needs to write large
rows. This space can be restored when all the partitions are spilled
so the serialize stream is not needed until we build/repartition a
spilled partition and thus have pinned partitions again. If the large
write page reservation is used, we save it back whenever possible
after we spill or close a partition.
- If the aggregator doesn't need the serialize stream at all, we can
restore the large write page reservation whenever we fail to add a
large row, before spilling any partitions. Reclaim it whenever
possible after we spill or close a partition.
A special case is when we are processing a large row and it's the last
row in building/repartitioning a spilled partition: the large write page
reservation can be restored for it regardless of whether we need the
serialize stream, because the partitions will be read out after this, so
there is no need for spilling.
For the large read page reservation, it's transferred to the spilled
BufferedTupleStream that we are reading in building/repartitioning a
spilled partition. The stream will restore some of it when reading a
large page, and reclaim it when the output row batch is reset. Note that
because the stream is read in attach_on_read mode, the large page will
be attached to the row batch's buffers and only gets freed when the row
batch is reset.
Tests:
- Add tests in test_spilling_large_rows (test_spilling.py) with
different row sizes to reproduce the issue.
- One test in test_spilling_no_debug_action becomes flaky after this
patch. Revise the query to make the udf allocate larger strings so it
can consistently pass.
- Run CORE tests.
Change-Id: I3d9c3a2e7f0da60071b920dec979729e86459775
Reviewed-on: http://gerrit.cloudera.org:8080/16240
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This fix addresses a limitation where an ill-formed Parquet version
string is not sanitized before appearing in an error message or
impalad.INFO. With the fix, any such string is converted to a hex
string first. The hex string is a sequence of four hex digit groups
separated by spaces, where each group is one or two hex digits, such
as "6c 65 2e a".
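The format can be reproduced with a short Python sketch. The function
name version_to_hex is hypothetical, and the input bytes were chosen
so that the output matches the example string quoted above.

```python
def version_to_hex(raw: bytes) -> str:
    """Render each byte as one or two lowercase hex digits, space separated."""
    return " ".join(format(b, "x") for b in raw)


# 'l' = 0x6c, 'e' = 0x65, '.' = 0x2e, newline = 0x0a (printed as "a").
print(version_to_hex(b"le.\n"))  # prints 6c 65 2e a
```

Because no zero padding is applied, bytes below 0x10 appear as a
single digit, which is why the last group in the example is "a".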
Testing:
Ran "core" tests successfully.
Change-Id: I281d6fa7cb2f88f04588110943e3e768678b9cf1
Reviewed-on: http://gerrit.cloudera.org:8080/16331
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Include the remaining TPC-DS queries in the testdata workload
definition. Q8 and Q38 were using non-standard variants; those have
been replaced by the official query versions. Q35 is using an official
variant. Had to escape a table alias in Q90 as we treat 'AT' as a
reserved keyword.
Change-Id: Id5436689390f149694f14e6da1df624de4f5f7ad
Reviewed-on: http://gerrit.cloudera.org:8080/16280
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
We didn't get to a clear root cause for this, so I'm going
to try two things.
First, under the theory that the problem is somehow the
destruction of the strings, convert them to const char*
which does not require destruction on process teardown.
Second, add some logging if the map lookup fails so
we can better understand what may have happened.
Change-Id: Id4363a93addb8a808d292906cac44ebd25c16889
Reviewed-on: http://gerrit.cloudera.org:8080/16341
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch introduces a new boolean query option
REFRESH_UPDATED_HMS_PARTITIONS. When this query option is set
the refresh table command reloads the partitions which have been
modified in HMS in addition to adding [removing] the new [removed]
partitions.
In order to do this, the refresh table command needs to fetch all
the partitions instead of just the partition names, which can
cause the performance of refresh table to degrade when the query
option is set. However, for certain use cases there is currently
no way to detect changed partitions using the refresh table command.
For instance, if certain partition locations have been changed,
a refresh table will not update those partitions.
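The difference between the two modes can be sketched in Python as a
diff between the cached partitions and the ones in HMS. This is an
illustration under simplifying assumptions, not catalogd code: the
function plan_refresh is hypothetical, and each partition is reduced
to a name mapped to a location string.

```python
def plan_refresh(cached, hms, refresh_updated=False):
    """Return (to_add, to_remove, to_reload) partition name lists.

    cached / hms: dict mapping partition name -> descriptor (here a location).
    refresh_updated: stands in for the REFRESH_UPDATED_HMS_PARTITIONS option.
    """
    to_add = sorted(hms.keys() - cached.keys())
    to_remove = sorted(cached.keys() - hms.keys())
    to_reload = []
    if refresh_updated:
        # Only possible when full partition objects were fetched, not just names.
        to_reload = sorted(
            name for name in cached.keys() & hms.keys()
            if cached[name] != hms[name]
        )
    return to_add, to_remove, to_reload


cached = {"p=1": "/a", "p=2": "/b"}
hms = {"p=2": "/b_moved", "p=3": "/c"}

print(plan_refresh(cached, hms))
# prints (['p=3'], ['p=1'], [])
print(plan_refresh(cached, hms, refresh_updated=True))
# prints (['p=3'], ['p=1'], ['p=2'])
```

Comparing names alone is enough to find added and removed partitions,
but detecting the moved location of p=2 requires the full descriptors,
which is where the extra cost comes from.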
Testing:
1. Added a new test which sets the query option and makes sure
that the updated partitions from Hive are reloaded after the refresh
table command.
2. Ran exhaustive tests with the patch.
Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Reviewed-on: http://gerrit.cloudera.org:8080/16308
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The root cause of the crash is that QueryState::Cancel() was called
before the thread-unsafe function QueryState::Init() had completed.
This patch fixes the race condition between QueryState::Cancel()
and QueryState::Init(). QueryState::Init() is safe to be called
at any time.
Testing:
- The issue could be reproduced by running expr-test for 10-20
iterations. Verified the fix by running expr-test for over 1000
iterations without a crash.
- Passed TestProcessFailures::test_kill_coordinator.
- Passed core tests.
Change-Id: Ib0d3b9c59924a25b70fa20afeb6e8ca93016eca9
Reviewed-on: http://gerrit.cloudera.org:8080/16313
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>