IMPALA-1760: Implement shutdown command

This is the same patch except with fixes for the test failures
on EC and S3 noted in the JIRA.

This allows graceful shutdown of executors and partially graceful
shutdown of coordinators (new operations fail, old operations can
continue).

Details:
* In order to allow future admin commands, this is implemented with
function-like syntax and does not add any reserved words.
* ALL privilege is required on the server.
* The coordinator impalad that the client is connected to can be shut
down directly with ":shutdown()".
* Remote shutdown of another impalad is supported, e.g. with
":shutdown('hostname')", so that non-coordinators can be shut down
and for the convenience of the client, which does not have to
connect to the specific impalad. There is no assumption that the
other impalad is registered in the statestore; just that the
coordinator can connect to the other daemon's thrift endpoint.
This simplifies things and allows shutdown in various important
cases, e.g. statestore down.
* The shutdown time limit can be overridden to force a quicker or
slower shutdown by specifying a deadline in seconds after the
statement is executed.
* While a daemon is shutting down, a banner is shown on its root
debug page.
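
As an illustration of the function-like syntax described above, here is a minimal Python sketch that builds the statement text. The helper name `build_shutdown_statement` is hypothetical and not part of the patch. Only the ":shutdown()" and ":shutdown('hostname')" forms appear in this message; passing the deadline as a trailing integer argument is an assumption made for this sketch.

```python
def build_shutdown_statement(host=None, deadline_s=None):
    """Build a ':shutdown(...)' statement string (hypothetical helper)."""
    args = []
    if host is not None:
        # Keep the sketch simple: refuse hosts that would need escaping.
        if "'" in host or "\\" in host:
            raise ValueError("host must not contain quotes or backslashes")
        args.append("'" + host + "'")
    if deadline_s is not None:
        # Assumption: the deadline is a non-negative number of seconds.
        if deadline_s < 0:
            raise ValueError("deadline must be non-negative")
        args.append(str(int(deadline_s)))
    return ":shutdown(" + ", ".join(args) + ")"


print(build_shutdown_statement())            # local coordinator
print(build_shutdown_statement("hostname"))  # remote impalad
```

A function-like form keeps room for future admin commands without reserving new keywords, which is the design choice noted in the first bullet.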

Workflow:
1. (If a coordinator) Clients are prevented from submitting
queries to this coordinator via some out-of-band mechanism,
e.g. a load balancer.
2. The shutdown process is started via ":shutdown()".
3. A bit is set in the statestore and propagated to coordinators,
which stop scheduling fragment instances on this daemon
(if it is an executor).
4. The query startup grace period (which is ideally set to the
admission control (AC) queueing delay plus some additional leeway)
expires.
5. Once the daemon is quiesced (i.e. no fragments, no registered
queries), it shuts itself down.
6. If the daemon does not successfully quiesce (e.g. rogue clients,
long-running queries), it shuts down anyway after a longer timeout
(counted from the start of the shutdown process).
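
Steps 4 to 6 amount to a small decision rule that the shutting-down daemon can evaluate periodically. The sketch below is illustrative only and is not the code in this patch; `ShutdownState` and `should_exit` are hypothetical names, and times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class ShutdownState:
    start_s: float         # time at which shutdown was requested
    grace_period_s: float  # query startup grace period (step 4)
    deadline_s: float      # hard limit, counted from start_s (step 6)


def should_exit(state, now_s, num_fragments, num_queries):
    """Return (exit_now, reason) for a daemon that is shutting down."""
    elapsed = now_s - state.start_s
    if elapsed >= state.deadline_s:
        # Step 6: give up waiting and shut down anyway.
        return True, "deadline expired"
    if elapsed < state.grace_period_s:
        # Step 4: new fragments may still arrive, so keep running.
        return False, "in startup grace period"
    if num_fragments == 0 and num_queries == 0:
        # Step 5: nothing left to do, the daemon is quiesced.
        return True, "quiesced"
    return False, "waiting for running work"


state = ShutdownState(start_s=0, grace_period_s=10, deadline_s=60)
print(should_exit(state, 15, num_fragments=0, num_queries=0))
```

Checking the deadline first means a deadline shorter than the grace period still forces a quick shutdown, which matches the idea of overriding the time limit.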

What this does:
* Executors can be shut down without causing a service-wide outage.
* Shutting down an executor will not disrupt any short-running queries
and will wait for long-running queries up to a threshold.
* Coordinators can be shut down without query failures only if
there is an out-of-band mechanism to prevent submission of more
queries to the coordinator being shut down. If queries are submitted
to a coordinator after shutdown has started, they will fail.
* Long-running queries or other issues (e.g. stuck fragments) will
slow down but not prevent eventual shutdown.

Limitations:
* The startup grace period needs to be configured to be greater than
the latency of statestore updates + scheduling + admission +
coordinator startup. Otherwise a coordinator may send a
fragment instance to the shutting down impalad. (We could
automate this configuration as a follow-on.)
* The startup grace period means a minimum latency for shutdown,
even if the cluster is idle.
* We depend on the statestore detecting the process going down
if queries are still running on that backend when the timeout
expires. This may still be subject to existing problems,
e.g. IMPALA-2990.
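
The first limitation is simple arithmetic, sketched below. The function name, parameter names, and the numbers in the usage line are hypothetical placeholders rather than measured or recommended values.

```python
def min_startup_grace_period_s(statestore_update_s, scheduling_s,
                               admission_s, coordinator_startup_s,
                               leeway_s=0):
    """Lower bound for the startup grace period, in seconds.

    The grace period must exceed the sum of the four latencies;
    leeway_s is an extra safety margin added on top.
    """
    latencies = (statestore_update_s, scheduling_s,
                 admission_s, coordinator_startup_s)
    if any(x < 0 for x in latencies) or leeway_s < 0:
        raise ValueError("latencies and leeway must be non-negative")
    return sum(latencies) + leeway_s


# Placeholder numbers only, to show how the terms combine.
print(min_startup_grace_period_s(2, 1, 5, 1, leeway_s=1))
```

Because this value is also the minimum shutdown latency on an idle cluster, a larger margin trades shutdown speed for safety.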

Tests:
* Added parser, analysis and authorization tests.
* End-to-end test of shutting down impalads.
* End-to-end test of shutting down then restarting an executor while
queries are running.
* End-to-end test of shutting down a coordinator:
  - New queries cannot be started on the coordinator; existing queries
    continue to run.
  - Exercises various Beeswax and HS2 operations.

Change-Id: I8f3679ef442745a60a0ab97c4e9eac437aef9463
Reviewed-on: http://gerrit.cloudera.org:8080/11484
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Committed by: Impala Public Jenkins
Parent: 48640b5dfa
Commit: f46de21140

@@ -364,6 +364,8 @@ error_codes = (
 "on backend $2 at offset $3: verification of read data failed."),
 
 ("CANCELLED_INTERNALLY", 119, "Cancelled in $0"),
+
+("SERVER_SHUTTING_DOWN", 120, "Server is being shut down: $0."),
 )
 
 import sys
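
The "$0"-style placeholders in these messages are positional arguments. The sketch below shows one way such a template could be filled in. It is not Impala's actual formatting code; `format_error` is a hypothetical helper, and the reason string in the usage line is made up.

```python
import re

# The two entries visible in the hunk above: (name, code, template).
ERROR_CODES = (
    ("CANCELLED_INTERNALLY", 119, "Cancelled in $0"),
    ("SERVER_SHUTTING_DOWN", 120, "Server is being shut down: $0."),
)


def format_error(template, *args):
    """Replace $0, $1, ... in template with the matching positional args."""
    def substitute(match):
        index = int(match.group(1))
        # Leave the placeholder untouched if no argument was supplied.
        return str(args[index]) if index < len(args) else match.group(0)
    return re.sub(r"\$(\d+)", substitute, template)


name, code, template = ERROR_CODES[1]
print(code, format_error(template, "shutdown requested by admin"))
```

Leaving unmatched placeholders in place makes a missing argument visible in the output instead of raising an exception.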