Local catalog mode has been the default in downstream Impala for over
5 years and works well there. This patch turns on local catalog mode by
default (--catalog_topic_mode=minimal and --use_local_catalog=true) as
the preferred mode going forward.
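As an illustration of how the two flags pair up, here is a minimal sketch.
catalog_mode_flags() is a hypothetical helper, not part of Impala or its
test framework, and the legacy values (--use_local_catalog=false and
--catalog_topic_mode=full) are assumed to be the settings a test would use
to stay pinned to legacy catalog mode.

```python
def catalog_mode_flags(local_catalog=True):
    """Return (impalad_args, catalogd_args) for the requested catalog mode.

    Hypothetical helper for illustration only.
    """
    if local_catalog:
        # The new defaults named in this commit message.
        return ("--use_local_catalog=true", "--catalog_topic_mode=minimal")
    # Assumed legacy catalog mode settings.
    return ("--use_local_catalog=false", "--catalog_topic_mode=full")


impalad_args, catalogd_args = catalog_mode_flags(local_catalog=False)
print(impalad_args, catalogd_args)
```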
Implemented LocalCatalog.setIsReady() to facilitate using local catalog
mode for FE tests. Some FE tests fail due to behavior differences in
local catalog mode, such as IMPALA-7539. This is probably OK since Impala
now largely hands over FileSystem permission checks to Apache Ranger.
The following custom cluster tests are pinned to evaluate under legacy
catalog mode because their behavior changed in local catalog mode:
TestCalcitePlanner.test_calcite_frontend
TestCoordinators.test_executor_only_lib_cache
TestMetadataReplicas
TestTupleCacheCluster
TestWorkloadManagementSQLDetailsCalcite.test_tpcds_8_decimal
In TestHBaseHmsColumnOrder.test_hbase_hms_column_order, set the
--use_hms_column_order_for_hbase_tables=true flag for both impalad and
catalogd to get a consistent column order in either local or legacy
catalog mode.
Changed TestCatalogRpcErrors.test_register_subscriber_rpc_error
assertions to be more fine-grained by matching individual query IDs.
Moved most test methods from TestRangerLegacyCatalog to
TestRangerLocalCatalog, except for some that do need to run in legacy
catalog mode. Also renamed TestRangerLocalCatalog to
TestRangerDefaultCatalog. The table ownership issue in local catalog
mode remains unresolved (see IMPALA-8937).
Testing:
Passed exhaustive tests.
Change-Id: Ie303e294972d12b98f8354bf6bbc6d0cb920060f
Reviewed-on: http://gerrit.cloudera.org:8080/23080
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 for why the workload affects
exploration_strategy(). An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and exhaustive
runs before this change.
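The effect can be sketched as follows. This is a simplified model, not the
real ImpalaTestSuite code: the class names and the workload_strategies
mapping are assumptions made for illustration, standing in for a test run
where only 'functional-query' is configured as 'exhaustive'.

```python
class SuiteSketch(object):
    # Assumed shape: per-workload strategy chosen for this test run.
    workload_strategies = {'functional-query': 'exhaustive'}

    @classmethod
    def get_workload(cls):
        # New base-class default added by this change.
        return 'functional-query'

    @classmethod
    def exploration_strategy(cls):
        # Workloads without an explicit entry fall back to 'core'.
        return cls.workload_strategies.get(cls.get_workload(), 'core')


class OldCustomClusterSketch(SuiteSketch):
    # Before this change: custom cluster suites returned 'tpch'.
    @classmethod
    def get_workload(cls):
        return 'tpch'


class NewCustomClusterSketch(SuiteSketch):
    # After this change: the base-class workload is inherited.
    pass


print(OldCustomClusterSketch.exploration_strategy())
print(NewCustomClusterSketch.exploration_strategy())
```

Under these assumptions the old suite stays on 'core' even in an
exhaustive run, so a test guarded by an "exhaustive only" check would never
execute, while the new suite picks up 'exhaustive'.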
get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope for this patch.
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We have plenty of custom_cluster tests that use assert_log_contains()
and its wrappers to assert against the content of Impala daemon log
files while the process is still running. The method specifically
mentions disabling glog buffering ('-logbuflevel=-1'), but not all
custom_cluster tests do that. This often results in flaky tests that
are hard to triage and are often neglected if they do not run
frequently in core exploration.
This patch adds a boolean param 'disable_log_buffering' to
CustomClusterTestSuite.with_args for a test to declare its intention to
inspect log files in a live minicluster. If it is True, the minicluster
is started with '-logbuflevel=-1' for all daemons. If it is False, a
WARNING is logged on any call to assert_log_contains().
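The intended behavior can be sketched roughly as below. Both functions are
hypothetical stand-ins written for this illustration; they are not the
actual CustomClusterTestSuite implementation, and the warning text is made
up.

```python
import logging

LOG = logging.getLogger("custom_cluster_sketch")


def build_start_args(start_args="", disable_log_buffering=False):
    """Append '-logbuflevel=-1' when the test intends to read live logs."""
    if disable_log_buffering:
        return (start_args + " -logbuflevel=-1").strip()
    return start_args


def log_assertion_is_safe(disable_log_buffering):
    """Return True if log assertions are reliable; otherwise log a WARNING."""
    if not disable_log_buffering:
        LOG.warning("assert_log_contains() called without "
                    "disable_log_buffering=True; log lines may still be buffered")
        return False
    return True


print(build_start_args("--foo=1", disable_log_buffering=True))
```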
There are several complex custom_cluster tests that are left unchanged
and print such WARNING logs, such as:
- TestQueryLive
- TestQueryLogTableBeeswax
- TestQueryLogOtherTable
- TestQueryLogTableHS2
- TestQueryLogTableAll
- TestQueryLogTableBufferPool
- TestStatestoreRpcErrors
- TestWorkloadManagementInitWait
- TestWorkloadManagementSQLDetails
This patch also fixes some small flake8 issues in the modified tests.
test_query_live.py shows signs of flakiness: a test query submitted to
the coordinator fails because the sys.impala_query_live table does not
exist yet from the coordinator's perspective. This patch modifies
test_query_live.py to wait a few seconds until sys.impala_query_live
is queryable.
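A wait of this kind is typically a poll with a deadline. The sketch below
shows the general pattern only; wait_until() and its parameters are
hypothetical and are not the helper used in test_query_live.py.

```python
import time


def wait_until(predicate, timeout_s=10.0, interval_s=0.5):
    """Poll predicate() until it returns True or timeout_s has elapsed.

    Returns True on success and False on timeout. The predicate is always
    evaluated at least once.
    """
    deadline = time.time() + timeout_s
    while True:
        if predicate():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval_s)
```

In a test, the predicate would be a small function that runs a query
against sys.impala_query_live and returns False if the query fails.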
Testing:
- Passed custom_cluster tests in exhaustive exploration.
Change-Id: I56fb1746b8f3cea9f3db3514a86a526dffb44a61
Reviewed-on: http://gerrit.cloudera.org:8080/22015
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some Thrift request/response structs in CatalogService were changed to
add new variables in the middle, which caused a cross-version
incompatibility issue for CatalogService.
Impala cluster membership is managed by the statestore. During upgrade
scenarios where different versions of Impala daemons are upgraded one
at a time, the upgraded daemons have incompatible message formats.
Even though protocol version numbers were already defined for the
Statestore and Catalog services, they were not used. The Statestore and
Catalog server did not check the protocol version in the requests, which
allowed incompatible Impala daemons to join one cluster. This causes
unexpected query failures during rolling upgrade.
We need a way to detect this and enforce that some rules are followed:
- Statestore refuses the registration requests from incompatible
subscribers.
- Catalog server refuses the requests from incompatible clients.
- Scheduler assigns tasks to a group of compatible executors.
This patch isolates Impala daemons into separate clusters based on
the protocol version of the Statestore service to prevent incompatible
Impala daemons from communicating with each other. It covers the Thrift
RPC communications between catalogd and coordinators, and communication
between statestore and its subscribers (executors, coordinators,
catalogd and admissiond). This change should also work for future
upgrades.
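The core idea, refusing requests whose protocol version does not match,
can be sketched as below. This is a Python illustration, not the Thrift or
C++ code in the patch: the dictionary layout, the status strings and
handle_register() are assumptions, and "incompatible" is modeled here as
"not equal to the server's version".

```python
V1, V2 = 1, 2

# Assumed: the version this (sketched) statestore speaks.
SERVER_PROTOCOL_VERSION = V2


def handle_register(request):
    """Accept a registration only if the protocol versions match."""
    client_version = request.get("protocol_version")
    if client_version != SERVER_PROTOCOL_VERSION:
        return {
            "status": "ERROR",
            "message": "incompatible protocol version %r, expected %r"
                       % (client_version, SERVER_PROTOCOL_VERSION),
        }
    return {"status": "OK", "message": ""}


print(handle_register({"subscriber_id": "impalad-1", "protocol_version": V1}))
print(handle_register({"subscriber_id": "impalad-2", "protocol_version": V2}))
```

Because every response carries a status, a subscriber that is refused can
tell a version mismatch apart from a transport failure.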
The following changes were made:
- Bump StatestoreServiceVersion and CatalogServiceVersion to V2 for
all requests of Statestore and Catalog services.
- Update the request and response structs in CatalogService to ensure
each Thrift request struct has a protocol version and each Thrift
response struct has a returned status.
- Update the request and response structs in StatestoreService to
ensure each Thrift request struct has a protocol version and each
Thrift response struct has a returned status.
- Add subscriber type so that statestore could distinguish different
types of subscribers.
- Statestore checks protocol version for registration requests from
subscribers. It refuses the requests with incompatible version.
- Catalog server checks protocol version for Catalog service APIs, and
returns error for requests with incompatible version.
- Catalog daemon sends its address and the protocol version of Catalog
service when it registers with statestore. Statestore forwards the
address and the protocol version of Catalog service to all
subscribers during registration.
- Add UpdateCatalogd API for StatestoreSubscriber service so that the
coordinators could receive the address and the protocol version of
Catalog service from statestore if the coordinators register to
statestore before catalog daemon.
- Add GetProtocolVersion API for Statestore service so that the
subscribers can check the protocol version of statestore before
calling RegisterSubscriber API.
- Add startup flag tolerate_statestore_startup_delay. It is off by
default. When it is enabled, the subscriber is able to tolerate
a delay in the statestore's availability. The subscriber's
process will not exit if it cannot register with the specified
statestore on startup. Instead it enters Recovery mode, where
it loops, sleeps and retries until it successfully registers with
the statestore. This flag should be enabled during rolling upgrade.
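The startup behavior described for tolerate_statestore_startup_delay can be
sketched as follows. This is an illustrative Python model, not the
subscriber's actual code: register_subscriber(), RegistrationError and
retry_interval_s are hypothetical names, and re-raising the error stands in
for the process exiting.

```python
import time


class RegistrationError(Exception):
    """Hypothetical error raised when registration with statestore fails."""


def register_subscriber(register_fn, tolerate_startup_delay=False,
                        retry_interval_s=1.0):
    """Register once, or keep retrying when the flag is enabled.

    Returns the number of attempts that were needed.
    """
    attempts = 1
    try:
        register_fn()
        return attempts
    except RegistrationError:
        if not tolerate_startup_delay:
            # Flag off (the default): fail fast, as if the process exited.
            raise
    # Flag on: "Recovery mode", loop, sleep and retry until success.
    while True:
        time.sleep(retry_interval_s)
        attempts += 1
        try:
            register_fn()
            return attempts
        except RegistrationError:
            continue
```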
CatalogServiceVersion is defined in CatalogService.thrift. In the
future, if we make non-backward-compatible changes in the request or
response structures for CatalogService APIs, we need to bump the
protocol version of Catalog service.
StatestoreServiceVersion is defined in StatestoreService.thrift.
Similarly, if we make non-backward-compatible changes in the
request or response structures for StatestoreService APIs, we need
to bump the protocol version of Statestore service.
Message formats for KRPC communications between coordinators and
executors, and between admissiond and coordinators are defined
in proto files under common/protobuf. If we make
non-backward-compatible changes in these structures, we need to bump
the protocol version of Statestore service.
Testing:
- Added end-to-end unit tests.
- Passed the core tests.
- Ran manual test to verify old version of executors cannot register
with new version of statestore, and new version of executors cannot
register with old version of statestore.
Change-Id: If61506dab38c4d1c50419c1b3f7bc4f9ee3676bc
Reviewed-on: http://gerrit.cloudera.org:8080/19959
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
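A small example of the division difference, written to run under Python 3.
In a Python 2 file the same results require the __future__ import shown in
the comment as the first statement of the module. The variable names are
made up for illustration.

```python
# In a Python 2 source file this line must come first:
#   from __future__ import absolute_import, division, print_function
# Python 3 already behaves this way.

records = 7
batch_size = 2

# '/' is true division and yields a float.
print(records / batch_size)    # 3.5

# '//' is floor division and yields an int for int operands,
# which is what indices and record counts need.
print(records // batch_size)   # 3

values = [10, 20, 30, 40, 50]
middle = values[len(values) // 2]  # values[len(values) / 2] would raise TypeError
print(middle)                  # 30
```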
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Add retries to catalogd RPCs. Previously, connection failures triggered
a retry, but failures on the actual RPC did not trigger a retry. This
change replaces all usages of ClientCache::DoRpc() in the
CatalogOpExecutor with ClientCache::DoRpcWithRetry(). This change moves
the connection retry loop to DoRpcWithRetry(), instead of relying on the
ClientCache to retry the connection.
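The shape of the change, one loop that retries both the connection and the
RPC itself, can be sketched in Python as below. This is not the C++
ClientCache code: do_rpc_with_retry(), RpcError, connect_fn and rpc_fn are
hypothetical names, and the attempt count and backoff are assumed
parameters.

```python
import time


class RpcError(Exception):
    """Hypothetical error for a failure of the RPC itself."""


def do_rpc_with_retry(connect_fn, rpc_fn, max_attempts=3, backoff_s=0.0):
    """Connect and issue the RPC, retrying the whole sequence on failure."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Both steps sit inside the loop, so a failure in either
            # one triggers a fresh connection and a new RPC attempt.
            client = connect_fn()
            return rpc_fn(client)
        except (ConnectionError, RpcError) as e:
            last_error = e
            if attempt < max_attempts:
                time.sleep(backoff_s)
    raise last_error
```

Retrying the full connect-then-call sequence is only safe when the RPC can
be repeated without side effects, which is an assumption of this sketch.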
This patch is based on IMPALA-8904, which adds similar functionality to
statestore RPCs.
Testing:
* Renamed test_statestore_rpc_errors.py to test_services_rpc_errors.py
and added new tests for catalogd RPC errors
* Added new tests to test_restart_services.py
* Ran core tests
Change-Id: I7f33ad2b36d301fb64e70a939e71decab0ca993c
Reviewed-on: http://gerrit.cloudera.org:8080/14246
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>