This puts all of the thrift-generated python code into the
impala_thrift_gen package. This is similar to what Impyla does for its
thrift-generated python code, except that Impala uses the
impala_thrift_gen package rather than Impyla's impala._thrift_gen.
This is a preparatory patch for fixing the absolute import
issues.
This patches all of the thrift files to add the python namespace, and
includes code that applies the same patching to the thirdparty thrift
files (hive_metastore.thrift, fb303.thrift).
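As a rough illustration of what the namespace patching amounts to (the
helper below is a hypothetical sketch, not the actual script used
here), each .thrift file gains a Python namespace so its generated
module lands under impala_thrift_gen:

def add_py_namespace(thrift_text, module_name):
    # Prepend a Python namespace declaration so generated code lands in
    # the impala_thrift_gen package, e.g.
    #   namespace py impala_thrift_gen.hive_metastore
    line = "namespace py impala_thrift_gen.%s\n" % module_name
    if line in thrift_text:
        return thrift_text  # already patched
    return line + thrift_text

if __name__ == "__main__":
    print(add_py_namespace("struct Foo { 1: i32 id }\n", "example"))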
Putting all the generated python into a package makes it easier
to understand where the imports are getting code. When the
subsequent change rearranges the shell code, the thrift generated
code can stay in a separate directory.
This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.
Testing:
- Ran a core job
Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Allow administrators to configure per-user limits on the queries that
can run in the Impala system.
There are two parts to this. First, we must track the total counts of
queries in the system on a per-user basis. Second, there must be a user
model that allows rules controlling per-user limits on the number of
queries that can be run.
In a Kerberos environment the user names used in both the user model
and at runtime are short user names, e.g. testuser when the Kerberos
principal is testuser/scm@EXAMPLE.COM.
TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data, which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue, and we do not want to block other queries because of quotas, and
(2) it is easier for users to understand what is going on when queries
are rejected at submission time.
Note that when running in configurations without an Admission Daemon,
Admission Control does not have perfect information about the system,
and over-admission is possible for User-Level Admission Quotas in the
same way that it is for other Admission Control limits.
The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.
Two new elements, "userQueryLimit" and "groupQueryLimit", can be added
to the fair-scheduler.xml file. These elements can be placed on the
root configuration, which applies to all pools, or on a pool
configuration.
The "userQueryLimit" element has two child elements: "user" and
"totalCount". The "user" element contains the short names of users; it
can be repeated, or have the value "*" as a wildcard that matches all
users. The "groupQueryLimit" element has two child elements: "group"
and "totalCount". The "group" element contains group names.
Both the root-level rules and the pool-level rules must pass for a new
query to be queued. The rules dictate the maximum number of queries
that a user can run. When evaluating rules at either the root level or
the pool level, once a rule matches a user, no further rules are
evaluated.
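A minimal sketch of how such a configuration could be read and matched;
the XML fragment, element nesting and helper below are illustrative
assumptions, not the exact schema or the code added to
RequestPoolService:

import xml.etree.ElementTree as ET

# Illustrative fair-scheduler.xml fragment with two root-level rules.
CONF = """
<allocations>
  <userQueryLimit>
    <user>alice</user>
    <totalCount>2</totalCount>
  </userQueryLimit>
  <userQueryLimit>
    <user>*</user>
    <totalCount>5</totalCount>
  </userQueryLimit>
</allocations>
"""

def user_query_limit(root, user):
    # Once a rule matches a user, no further rules are evaluated.
    for rule in root.findall("userQueryLimit"):
        users = [u.text for u in rule.findall("user")]
        if user in users or "*" in users:
            return int(rule.findtext("totalCount"))
    return None  # no matching rule at this level

root = ET.fromstring(CONF)
print(user_query_limit(root, "alice"))     # 2
print(user_query_limit(root, "testuser"))  # 5, via the "*" wildcard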
To support reading the "userQueryLimit" and "groupQueryLimit" fields,
RequestPoolService is enhanced.
If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.
More comprehensive documentation of the user model will be provided in
IMPALA-12943.
TESTING
New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.
Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds a /catalog_ha_info page to the Statestore web UI to
show catalog HA information. The page contains the following
information: Active Node, Standby Node, and a Notified Subscribers
table. The Notified Subscribers table includes the following items:
-- Id,
-- Address,
-- Registration ID,
-- Subscriber Type,
-- Catalogd Version,
-- Catalogd Address,
-- Last Update Catalogd Time
Change-Id: If85f6a827ae8180d13caac588b92af0511ac35e3
Reviewed-on: http://gerrit.cloudera.org:8080/21418
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
To support statestore HA, we allow two statestored instances in an
Active-Passive HA pair to be added to an Impala cluster. We add the
preemptive behavior for statestored. When HA is enabled, the preemptive
behavior allows the statestored with the higher priority to become
active and the paired statestored becomes standby. The active
statestored acts as the owner of Impala cluster and provides statestore
service for the cluster members.
To enable statestore HA for a cluster, the two statestoreds in the HA
pair and all subscribers must be started with the startup flag
"enable_statestored_ha" set to true.
This patch makes the following changes:
- Defined a new service for Statestore HA.
- Statestored negotiates the role for HA with its peer statestore
instance on startup.
- Create HA monitor thread:
Active statestored sends heartbeat to standby statestored.
Standby statestored monitors peer's connection states with their
subscribers.
- Standby statestored sends heartbeats to subscribers, requesting the
  connection state between the active statestored and each subscriber.
  The standby statestored saves these connection states as a failure
  detector.
- When the standby statestored loses its connection to the active
  statestored, it checks the reported connection states and takes over
  the active role if a majority of subscribers have also lost their
  connections to the active statestored (see the sketch after this
  list).
- The new active statestored sends RPC notifications to all subscribers
  announcing the new active statestored and the active catalogd elected
  by it.
- New active statestored starts sending heartbeat to its peer when it
receives handshake from its peer.
- The active statestored enters recovery mode if it loses its
  connections to both its peer statestored and all subscribers. It
  keeps sending the HA handshake to its peer until it receives a
  response.
- All subscribers (impalad/catalogd/admissiond) register to two
statestoreds.
- Subscribers report connection state in response to the requests from
  the standby statestored.
- Subscribers switch to the new active statestored when they receive
  RPC notifications from it.
- Only active statestored sends topic update messages to subscribers.
- Add a new option "enable_statestored_ha" in script
bin/start-impala-cluster.py for starting Impala mini-cluster with
statestored HA enabled.
- Add a new Thrift API in statestore service to disable network
for statestored. It's only used for unit-test to simulate network
failure. For safety, it's only working when the debug action is set
in starting flags.
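A rough sketch of the standby's takeover decision described above; the
function and data shapes are illustrative assumptions, not the actual
implementation:

def standby_should_take_over(lost_peer_connection, subscriber_sees_active):
    # subscriber_sees_active maps subscriber id -> True if that subscriber
    # still reports a working connection to the active statestored.
    if not lost_peer_connection or not subscriber_sees_active:
        return False
    lost = sum(1 for ok in subscriber_sees_active.values() if not ok)
    # Take over only if a majority of subscribers also lost the active.
    return lost * 2 > len(subscriber_sees_active)

print(standby_should_take_over(
    True, {"impalad-1": False, "impalad-2": False, "catalogd": True}))  # True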
Testing:
- Added end-to-end tests for statestored HA.
- Passed core tests
Change-Id: Ibd2c814bbad5c04c1d50c2edaa5b910c82a6fd87
Reviewed-on: http://gerrit.cloudera.org:8080/20372
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When CatalogD HA is enabled, the statestored sends the address of the
current active catalogd to coordinators and catalogds on two different
events: catalogd registration and a new active catalogd election. The
statestored sends the active catalogd in two different kinds of RPCs.
If there is more than one election change in a short time, coordinators
and catalogds could receive these RPCs in an order different from the
order of the changes on the statestore.
To give coordinators and catalogds the same view as the statestore, we
must avoid overwriting the latest version of the active catalogd with a
previous change.
The version of the active catalogd is already in the UpdateCatalogd
RPC, but not in the response message of statestore registration. This
patch adds the active catalogd version to the response message of
statestore registration.
Coordinators and catalogds only apply changes that carry a newer
version than the last received active catalogd version. The last
received version has to be re-synced on a new registration, since the
statestore could have been restarted.
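A minimal sketch of this version check; class and field names are
illustrative, not the actual coordinator/catalogd code:

class ActiveCatalogdTracker(object):
    def __init__(self):
        # Re-synced (reset) on each new statestore registration, since
        # the statestore could have been restarted.
        self.last_version = -1
        self.active_catalogd = None

    def on_update(self, version, catalogd_address):
        # Ignore out-of-order RPCs: only apply changes that carry a newer
        # version than the last received active catalogd version.
        if version <= self.last_version:
            return False
        self.last_version = version
        self.active_catalogd = catalogd_address
        return True

t = ActiveCatalogdTracker()
print(t.on_update(2, "catalogd-b:26000"))  # True, applied
print(t.on_update(1, "catalogd-a:26000"))  # False, stale change ignored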
Allow subscribers to skip the UpdateCatalogd RPC if they cannot handle
it, for example when the statestore-id is unknown to a subscriber
because the RPC is received before registration completes. Skipped RPCs
will be resent by the statestore.
Testing:
- Added a test case to start both catalogds with flag
'force_catalogd_active' as true.
- Passed core tests
Change-Id: Ie49947e563d43c59bdd476b28c35be69848ae12a
Reviewed-on: http://gerrit.cloudera.org:8080/20276
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
To support catalog HA, we allow two catalogd instances in an Active-
Passive HA pair to be added to an Impala cluster.
We add the preemptive behavior for catalogd. When enabled, the
preemptive behavior allows the catalogd with the higher priority to
become active and the paired catalogd becomes standby. The active
catalogd acts as the source of metadata and provides catalog service
for the Impala cluster.
To enable catalog HA for a cluster, two catalogds in the HA pair and
statestore must be started with starting flag "enable_catalogd_ha".
The catalogd in an Active-Passive HA pair can be assigned an instance
priority value to indicate a preference for which catalogd should assume
the active role. The registration ID which is assigned by statestore can
be used as instance priority value. The lower numerical value in
registration ID corresponds to a higher priority. The catalogd with the
higher priority is designated as active, the other catalogd is
designated as standby. Only the active catalogd propagates the
IMPALA_CATALOG_TOPIC to the cluster. This guarantees only one writer for
the IMPALA_CATALOG_TOPIC in an Impala cluster.
The statestore, which is the registration center of an Impala cluster,
assigns the roles to the catalogds in the HA pair after both catalogds
register with the statestore. When the statestore detects that the
active catalogd is not healthy, it fails over the catalog service to
the standby catalogd. When a failover occurs, the statestore sends
notifications with the address of the active catalogd to all
coordinators and catalogds in the cluster. The events are logged in the
statestore and catalogd logs. When the catalogd with the higher
priority recovers from a failure, the statestore does not make it
active again, to avoid flip-flopping between the two catalogds.
To make a specific catalogd in the HA pair the active instance, the
catalogd must be started with the startup flag "force_catalogd_active"
so that it is assigned the active role when it registers with the
statestore. This allows administrators to manually perform a catalog
service failover.
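A rough sketch of the role assignment described above; the integer
registration IDs and helper are illustrative assumptions, not the
actual statestore code:

def choose_active_catalogd(registration_ids, forced_active=None):
    # registration_ids maps catalogd -> registration ID assigned by the
    # statestore; a lower numerical value means higher priority.
    # forced_active models a catalogd started with force_catalogd_active.
    if forced_active is not None and forced_active in registration_ids:
        return forced_active
    return min(registration_ids, key=registration_ids.get)

regs = {"catalogd-1": 17, "catalogd-2": 42}
print(choose_active_catalogd(regs))                              # catalogd-1
print(choose_active_catalogd(regs, forced_active="catalogd-2"))  # catalogd-2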
Added option "--enable_catalogd_ha" in bin/start-impala-cluster.py.
If the option is specified when running the script, the script will
create an Impala cluster with two catalogd instances in HA pair.
Testing:
- Passed the core tests.
- Added unit-test for auto failover and manual failover.
Change-Id: I68ce7e57014e2a01133aede7853a212d90688ddd
Reviewed-on: http://gerrit.cloudera.org:8080/19914
Reviewed-by: Xiang Yang <yx91490@126.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tamas Mate <tmater@apache.org>
Some Thrift request/response structs in CatalogService were changed to
add new variables in the middle, which caused a cross-version
incompatibility issue for CatalogService.
Impala cluster membership is managed by the statestore. During upgrade
scenarios where different versions of Impala daemons are upgraded one
at a time, the upgraded daemons have incompatible message formats.
Even though protocol version numbers were already defined for the
Statestore and Catalog services, they were not used. The Statestore and
Catalog servers did not check the protocol version in requests, which
allowed incompatible Impala daemons to join the same cluster. This
caused unexpected query failures during rolling upgrades.
We need a way to detect this and enforce that some rules are followed:
- Statestore refuses the registration requests from incompatible
subscribers.
- Catalog server refuses the requests from incompatible clients.
- Scheduler assigns tasks to a group of compatible executors.
This patch isolates Impala daemons into separate clusters based on the
protocol version of the Statestore service, to prevent incompatible
Impala daemons from communicating with each other. It covers the Thrift
RPC communication between catalogd and coordinators, and the
communication between the statestore and its subscribers (executors,
coordinators, catalogd and admissiond). This change should also work
for future upgrades.
Following changes were made:
- Bump StatestoreServiceVersion and CatalogServiceVersion to V2 for
all requests of Statestore and Catalog services.
- Update the request and response structs in CatalogService to ensure
each Thrift request struct has protocol version and each Thrift
response struct has returned status.
- Update the request and response struct in StatestoreService to
ensure each Thrift request struct has protocol version and each
Thrift response struct has returned status.
- Add subscriber type so that statestore could distinguish different
types of subscribers.
- Statestore checks the protocol version of registration requests from
  subscribers and refuses requests with an incompatible version (see
  the sketch after this list).
- Catalog server checks the protocol version for Catalog service APIs
  and returns an error for requests with an incompatible version.
- Catalog daemon sends its address and the protocol version of the
  Catalog service when it registers with the statestore; the statestore
  forwards the address and the protocol version of the Catalog service
  to all subscribers during registration.
- Add UpdateCatalogd API for StatestoreSubscriber service so that the
coordinators could receive the address and the protocol version of
Catalog service from statestore if the coordinators register to
statestore before catalog daemon.
- Add GetProtocolVersion API for Statestore service so that the
subscribers can check the protocol version of statestore before
calling RegisterSubscriber API.
- Add the startup flag tolerate_statestore_startup_delay. It is off by
  default. When it is enabled, the subscriber is able to tolerate a
  delay in the statestore's availability: the subscriber's process will
  not exit if it cannot register with the specified statestore on
  startup. Instead, it enters Recovery mode, where it loops, sleeps and
  retries until it successfully registers with the statestore. This
  flag should be enabled during rolling upgrades.
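As a sketch of the registration-time check mentioned in the list above
(the constant and messages are illustrative, not the actual Thrift
handler):

STATESTORE_PROTOCOL_VERSION = 2  # illustrative value for the V2 bump

def handle_register_subscriber(request_protocol_version):
    # The statestore refuses registration requests from subscribers
    # whose protocol version does not match its own.
    if request_protocol_version != STATESTORE_PROTOCOL_VERSION:
        return ("ERROR",
                "Incompatible Statestore protocol version: %s"
                % request_protocol_version)
    return ("OK", None)

print(handle_register_subscriber(2))  # ('OK', None)
print(handle_register_subscriber(1))  # rejected as incompatible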
CatalogServiceVersion is defined in CatalogService.thrift. In the
future, if we make changes to the request or response structures of the
CatalogService APIs that are not backward compatible, we need to bump
the protocol version of the Catalog service.
StatestoreServiceVersion is defined in StatestoreService.thrift.
Similarly, if we make changes to the request or response structures of
the StatestoreService APIs that are not backward compatible, we need to
bump the protocol version of the Statestore service.
Message formats for the KRPC communication between coordinators and
executors, and between the admissiond and coordinators, are defined in
proto files under common/protobuf. If we make changes to these
structures that are not backward compatible, we also need to bump the
protocol version of the Statestore service.
Testing:
- Added end-to-end unit tests.
- Passed the core tests.
- Ran manual test to verify old version of executors cannot register
with new version of statestore, and new version of executors cannot
register with old version of statestore.
Change-Id: If61506dab38c4d1c50419c1b3f7bc4f9ee3676bc
Reviewed-on: http://gerrit.cloudera.org:8080/19959
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the ability to share the per-host stats for locally
admitted queries across all coordinators. This helps to get a more
consolidated view of the cluster for stats like slots_in_use and
mem_admitted when making local admission decisions.
Testing:
Added e2e py test
Change-Id: I2946832e0a89b077d0f3bec755e4672be2088243
Reviewed-on: http://gerrit.cloudera.org:8080/17683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This work addresses a limitation in the admission controller by
appending the last known memory consumption statistics about the set of
queries running or waiting on a host or in a pool to the existing
memory exhaustion message. The statistics are logged in impalad.INFO
when a query is queued, or queued and then timed out, due to memory
pressure in the pool or on the host. The statistics can also be part of
the query profile.
The new memory consumption statistics can be either per-host stats or
aggregated pool stats. The per-host stats describe memory consumption
for every pool on a host. The aggregated pool stats describe the
aggregated memory consumption across all hosts for a pool. For each
stats type, information such as the query IDs and memory consumption of
up to the top 5 queries is provided, in addition to the min, the max,
the average and the total memory consumption for the query set.
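A minimal sketch of the reported statistics; the helper and field names
are illustrative, not the actual admission controller code:

import heapq

def summarize_query_memory(mem_by_query, top_n=5):
    # Report the IDs and memory consumption of up to the top 5 queries,
    # plus the min, max, average and total over the whole query set.
    values = list(mem_by_query.values())
    top = heapq.nlargest(top_n, mem_by_query.items(), key=lambda kv: kv[1])
    return {
        "top_queries": top,
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / float(len(values)),
        "total": sum(values),
    }

stats = summarize_query_memory({"q1": 512, "q2": 2048, "q3": 128})
print(stats["top_queries"][0])  # ('q2', 2048)
print(stats["total"])           # 2688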
When a query request is queued due to memory exhaustion, the new
consumption statistics are logged when the BE logging level is set to
2. When a query request is timed out due to memory exhaustion, the new
consumption statistics are logged when the BE logging level is set to
1.
Testing:
1. Added a new test TopNQueryCheck in admission-controller-test.cc to
verify that the topN query memory consumption details are reported
correctly.
2. Added two new tests in test_admission_controller.py to simulate
queries being queued and then timed out due to pool or host memory
pressure.
3. Added a new test TopN in mem-tracker-test.cc to
verify that the topN query memory consumption details are computed
correctly from a mem tracker hierarchy.
4. Ran Core tests successfully.
Change-Id: Id995a9d044082c3b8f044e1ec25bb4c64347f781
Reviewed-on: http://gerrit.cloudera.org:8080/16220
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The new admission control service will be written in protobuf, so
there are various admission control related structures currently
stored in Thrift that it would be convenient to convert to protobuf,
to minimize the amount of converting back and forth that needs to be
done.
This patch converts TBackendDescriptor to protobuf. It isn't used
directly in any RPCs - we serialize it ourselves to send to the
statestore as a string, so no RPC definitions are affected.
This patch is just a refactor and doesn't contain any functional
changes.
Testing:
- Passed a full run of existing tests.
Change-Id: Ie7d1e373d9c87887144517ff6a4c2d5996aa88b8
Reviewed-on: http://gerrit.cloudera.org:8080/15822
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch introduces the concept of 'backend ids', which are unique
ids that can be used to identify individual impalads. The ids are
generated by each impalad on startup.
The patch then uses the ids to fix a bug where the statestore may fail
to inform coordinators when an executor has failed and restarted. The
bug was caused by the fact that the statestore cluster membership
topic was keyed on statestore subscriber ids, which are host:port
pairs.
So, if an impalad fails and a new one is started at the same host:port
before a particular coordinator has a cluster membership update
generated for it by the statestore, the statestore has no way of
differentiating the prior impalad from the newly started impalad, and
the topic update will not show the deletion of the original impalad.
With this patch, the cluster membership topic is now keyed by backend
id, so, since the restarted impalad will have a different backend id,
the next membership update after the prior impalad failed is guaranteed
to reflect that failure.
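To illustrate the difference (the key formats below are assumptions for
illustration, not the actual topic key encoding):

import uuid

def host_port_topic_key(host, port):
    # Old scheme: a restarted impalad on the same host:port reuses the
    # same key, so its failure can be missed in a topic update.
    return "%s:%d" % (host, port)

def backend_id_topic_key():
    # New scheme: a unique backend id is generated at impalad startup,
    # so a restart always produces a new key and the old key is deleted.
    return str(uuid.uuid4())

print(host_port_topic_key("host-1", 22000))
print(backend_id_topic_key() != backend_id_topic_key())  # True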
The patch also logs the backend ids on startup and adds them to the
/backends webui page and to the query locations section of the
/queries page, for use in debugging.
Further patches will apply the backend ids in other places where we
currently key off host:port pairs to identify impalads.
Testing:
- Added an e2e test that uses a new debug action to add delay to
statestore topic updates. Due to the use of JITTER the test is
non-deterministic, but it repros the original issue locally for me
about 50% of the time.
- Passed a full run of existing tests.
Change-Id: Icf8067349ed6b765f6fed830b7140f60738e9061
Reviewed-on: http://gerrit.cloudera.org:8080/15321
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This integrates mt_dop with the "slots" mechanism that's used
for non-default executor groups.
The idea is simple - the degree of parallelism on a backend
determines the number of slots consumed. The effective
degree of parallelism is used, not the raw mt_dop setting.
E.g. if the query only has a single input split and executes
only a single fragment instance on a host, we don't want
to count the full mt_dop value for admission control.
--admission_control_slots is added as a new flag that
replaces --max_concurrent_queries, since the name better
reflects the concept. --max_concurrent_queries is kept
for backwards compatibility and has the same meaning
as --admission_control_slots.
The admission control logic is extended to take this into
account. We also add an immediate rejection code path
since it is now possible for queries to not be admittable
based on the # of available slots.
We only factor in the "width" of the plan - i.e. the number
of instances of fragments. We don't account for the number
of distinct fragments, since they may not actually execute
in parallel with each other because of dependencies.
This number is added to the per-host profile as the
"AdmissionSlots" counter.
Testing:
Added unit tests for rejection and queue/admit checks.
Also includes a fix for IMPALA-9054 where we increase
the timeout.
Added end-to-end tests:
* test_admission_slots in test_mt_dop.py that checks the
admission slot calculation via the profile.
* End-to-end admission test that exercises the admit
immediately and queueing code paths.
Added checks to test_verify_metrics (which runs after
end-to-end tests) to ensure that the per-backend
slots in use goes to 0 when the cluster is quiesced.
Change-Id: I7b6b6262ef238df26b491352656a26e4163e46e5
Reviewed-on: http://gerrit.cloudera.org:8080/14357
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This change adds support for running queries inside a single admission
control pool on one of several, disjoint sets of executors called
"executor groups".
Executors can be configured with an executor group through the newly
added '--executor_groups' flag. Note that in anticipation of future
changes, the flag already uses the plural form, but only a single
executor group may be specified for now. Each executor group
specification can optionally contain a minimum size, separated by a
':', e.g. --executor_groups default-pool-1:3. Only when the cluster
membership contains at least that number of executors for the group
will it be considered for admission.
Executor groups are mapped to resource pools by their name: An executor
group can service queries from a resource pool if the pool name is a
prefix of the group name separated by a '-'. For example, queries in
pool poolA can be serviced by executor groups named poolA-1 and
poolA-2, but not by groups named foo or poolB-1.
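A minimal sketch of this name-based mapping (the helper is illustrative,
not the actual scheduler code):

def group_can_serve_pool(group_name, pool_name):
    # An executor group can service queries from a resource pool if the
    # pool name is a prefix of the group name, separated by a '-'.
    return group_name.startswith(pool_name + "-")

print(group_can_serve_pool("poolA-1", "poolA"))  # True
print(group_can_serve_pool("poolA-2", "poolA"))  # True
print(group_can_serve_pool("foo", "poolA"))      # False
print(group_can_serve_pool("poolB-1", "poolA"))  # False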
During scheduling, executor groups are considered in alphabetical order.
This means that one group is filled up entirely before a subsequent
group is considered for admission. Groups also need to pass a health
check before being considered. In particular, they must contain at
least the minimum number of executors specified.
If no group is specified during startup, executors are added to the
default executor group. If - during admission - no executor group for a
pool can be found and the default group is non-empty, then the default
group is considered. The default group does not have a minimum size.
This change inverts the order of scheduling and admission. Prior to this
change, queries were scheduled before submitting them to the admission
controller. Now the admission controller computes schedules for all
candidate executor groups before each admission attempt. If the cluster
membership has not changed, then the schedules of the previous attempt
will be reused. This means that queries will no longer fail if the
cluster membership changes while they are queued in the admission
controller.
This change also alters the default behavior when using a dedicated
coordinator and no executors have registered yet. Prior to this change,
a query would fail immediately with an error ("No executors registered
in group"). Now a query will get queued and wait until executors show
up, or it times out after the pools queue timeout period.
Testing:
This change adds a new custom cluster test for executor groups. It
makes use of new capabilities added to start-impala-cluster.py to bring
up additional executors into an already running cluster.
Additionally, this change adds an instructional implementation of
executor group based autoscaling, which can be used during development.
It also adds a helper to run queries concurrently. Both are used in a
new test to exercise the executor group logic and to prevent regressions
to these tools.
In addition to these tests, the existing tests for the admission
controller (both BE and EE tests) thoroughly exercise the changed code.
Some of them required changes themselves to reflect the new behavior.
I looped the new tests (test_executor_groups and test_auto_scaling) for
a night (110 iterations each) without any issues.
I also started an autoscaling cluster with a single group and ran
TPC-DS, TPC-H, and test_queries on it successfully.
Known limitations:
When using executor groups, only a single coordinator and a single AC
pool (i.e. the default pool) are supported. Executors do not include
the number of currently running queries in their statestore updates,
and so admission controllers are not aware of the number of queries
admitted by other controllers per host.
Change-Id: I8a1d0900f2a82bd2fc0a906cc094e442cffa189b
Reviewed-on: http://gerrit.cloudera.org:8080/13550
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a class to track cluster membership called
ClusterMembershipMgr. It replaces the logic that was partially
duplicated between the ImpalaServer and the Coordinator and makes sure
that the local backend descriptor is consistent (IMPALA-8469).
The ClusterMembershipMgr maintains a view of the cluster membership and
incorporates incoming updates from the statestore. It also registers the
local backend with the statestore after startup. Clients can obtain a
consistent, immutable snapshot of the current cluster membership from
the ClusterMembershipMgr. Additionally, callbacks can be registered to
receive notifications of cluster membership changes. The ImpalaServer
and Frontend use this mechanism.
This change also generalizes the fix for IMPALA-7665: updates from the
statestore to the cluster membership topic are only made visible to the
rest of the local server after a post-recovery grace period has elapsed.
As part of this the flag
'failed_backends_query_cancellation_grace_period_ms' is replaced with
'statestore_subscriber_recovery_grace_period_ms'. To distinguish the
initial startup from post-recovery, a new metric
'statestore-subscriber.num-connection-failures' is exposed by the
daemon, which tracks the total number of connection failures to the
statestore over the process lifetime.
This change also unifies the naming of executor-related classes, in
particular it renames "BackendConfig" to "ExecutorGroup". In
anticipation of a subsequent change (IMPALA-8484), it adds maps to store
multiple executor groups.
This change also disables the generation of default operators from the
thrift files and instead adds explicit implementations for the ones that
we rely on. This forces us to explicitly specify comparators when
manipulating containers of thrift structs and will help prevent
accidental bugs.
Testing: This change adds a backend unit test for the new cluster
membership manager. The observable behavior of Impala does not change,
and the existing scheduler unit test and end to end tests should make
sure of that.
Change-Id: Ib3cf9a8bb060d0c6e9ec8868b7b21ce01f8740a3
Reviewed-on: http://gerrit.cloudera.org:8080/13207
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-7694 added a field in the middle of the Metrics.TUnit enum, which
broke backwards compatibility with profiles that had been written by
older versions of Impala. This change fixes the ordering by moving the
field to the end of the enum.
Additionally, it adds a warning to the top of all Thrift files that are
part of the binary profile format, and a note of caution to the main
definition in RuntimeProfile.thrift.
This change also fixes the order of all enums in our Thrift files to
make errors like this less likely in the future.
Change-Id: If215f16a636008757ceb439edbd6900a1be88c59
Reviewed-on: http://gerrit.cloudera.org:8080/12543
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds a flag --mem_limit_includes_jvm that alters memory accounting to
include the amount of memory we think that the JVM is likely to use.
By default this flag is false, so behaviour is unchanged.
We're not ready to change the default but I want to check this in to
enable experimentation.
Two metrics are counted towards the process limit:
* The maximum JVM heap size. We count this because the JVM memory
consumption can expand up to this threshold at any time.
* JVM non-heap committed memory. This can be a non-trivial amount of
  memory (e.g. I saw 150MB on one production cluster). There isn't a
  hard upper bound on this memory that I know of, but it should not
  grow rapidly.
This requires adjustments in a couple of other places:
* Admission control previously assumed that all of the process memory
  limit was available to queries (an assumption that is not strictly
  true because of untracked memory, etc, but close enough). However,
  the JVM heap makes a large part of the process limit unusable to
  queries, so we should only admit up to "process limit - max JVM heap
  size" per node (see the sketch below).
* The buffer pool is now a percentage of the remaining process limit
after the JVM heap, instead of the total process limit.
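A small sketch of the admission-control arithmetic referenced above
(names and values are illustrative):

def per_node_admittable_mem(process_mem_limit, max_jvm_heap):
    # The JVM heap can grow to its maximum at any time, so only the
    # remainder of the process limit should be handed out to queries.
    return process_mem_limit - max_jvm_heap

GB = 1024 ** 3
print(per_node_admittable_mem(12 * GB, 4 * GB) // GB)  # 8 (GB)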
Currently, end-to-end tests fail if run with this flag for two reasons:
* The default JVM heap size is 1/4 of physical memory, which means that
essentially all of the process memory limit is consumed by the JVM
heaps when we run 3 impala daemons per host, unless -Xmx is
explicitly set.
* If the heap size is limited to 1-2GB like below, then most tests pass
but TestInsert.test_insert_large_string fails because IMPALA-4865
lets it create giant strings that eat up all the JVM heap.
start-impala-cluster.py \
--impalad_args=--mem_limit_includes_jvm=true --jvm_args="-Xmx1g"
Testing:
Add a custom cluster test that uses the new option and validates the
memory consumption values.
Change-Id: I39dd715882a32fc986755d573bd46f0fd9eefbfc
Reviewed-on: http://gerrit.cloudera.org:8080/10928
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the same patch except with fixes for the test failures
on EC and S3 noted in the JIRA.
This allows graceful shutdown of executors and partially graceful
shutdown of coordinators (new operations fail, old operations can
continue).
Details:
* In order to allow future admin commands, this is implemented with
function-like syntax and does not add any reserved words.
* ALL privilege is required on the server
* The coordinator impalad that the client is connected to can be shut
down directly with ":shutdown()".
* Remote shutdown of another impalad is supported, e.g. with
":shutdown('hostname')", so that non-coordinators can be shut down
and for the convenience of the client, which does not have to
connect to the specific impalad. There is no assumption that the
other impalad is registered in the statestore; just that the
coordinator can connect to the other daemon's thrift endpoint.
This simplifies things and allows shutdown in various important
cases, e.g. statestore down.
* The shutdown time limit can be overridden to force a quicker or
slower shutdown by specifying a deadline in seconds after the
statement is executed.
* If shutting down, a banner is shown on the root debug page.
Workflow:
1. (if a coordinator) clients are prevented from submitting
queries to this coordinator via some out-of-band mechanism,
e.g. load balancer
2. the shutdown process is started via ":shutdown()"
3. a bit is set in the statestore and propagated to coordinators,
which stop scheduling fragment instances on this daemon
(if an executor).
4. the query startup grace period (which is ideally set to the AC
queueing delay plus some additional leeway) expires
5. once the daemon is quiesced (i.e. no fragments, no registered
queries), it shuts itself down.
6. If the daemon does not successfully quiesce (e.g. rogue clients,
long-running queries), after a longer timeout (counted from the start
of the shutdown process) it will shut down anyway; the sketch below
illustrates this timing.
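A rough sketch of this quiesce-then-deadline flow; the function,
arguments and polling are illustrative assumptions, not the actual
shutdown code:

import time

def run_shutdown(quiesced, grace_period_s, deadline_s, poll_s=0.1):
    # Wait out the startup grace period so in-flight scheduling
    # decisions drain, then shut down as soon as the daemon is quiesced
    # (no fragments, no registered queries), or at the deadline anyway.
    start = time.time()
    time.sleep(grace_period_s)
    while time.time() - start < deadline_s:
        if quiesced():
            return "quiesced, shutting down cleanly"
        time.sleep(poll_s)
    return "deadline reached, shutting down anyway"

print(run_shutdown(lambda: True, grace_period_s=0, deadline_s=5))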
What this does:
* Executors can be shut down without causing a service-wide outage
* Shutting down an executor will not disrupt any short-running queries
and will wait for long-running queries up to a threshold.
* Coordinators can be shut down without query failures only if
there is an out-of-band mechanism to prevent submission of more
queries to the shut down coordinator. If queries are submitted to
a coordinator after shutdown has started, they will fail.
* Long running queries or other issues (e.g. stuck fragments) will
slow down but not prevent eventual shutdown.
Limitations:
* The startup grace period needs to be configured to be greater than
the latency of statestore updates + scheduling + admission +
coordinator startup. Otherwise a coordinator may send a
fragment instance to the shutting down impalad. (We could
automate this configuration as a follow-on)
* The startup grace period means a minimum latency for shutdown,
even if the cluster is idle.
* We depend on the statestore detecting the process going down
if queries are still running on that backend when the timeout
expires. This may still be subject to existing problems,
e.g. IMPALA-2990.
Tests:
* Added parser, analysis and authorization tests.
* End-to-end test of shutting down impalads.
* End-to-end test of shutting down then restarting an executor while
queries are running.
* End-to-end test of shutting down a coordinator
- New queries cannot be started on coord, existing queries continue to run
- Exercises various Beeswax and HS2 operations.
Change-Id: I8f3679ef442745a60a0ab97c4e9eac437aef9463
Reviewed-on: http://gerrit.cloudera.org:8080/11484
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This allows graceful shutdown of executors and partially graceful
shutdown of coordinators (new operations fail, old operations can
continue).
Details:
* In order to allow future admin commands, this is implemented with
function-like syntax and does not add any reserved words.
* ALL privilege is required on the server
* The coordinator impalad that the client is connected to can be shut
down directly with ":shutdown()".
* Remote shutdown of another impalad is supported, e.g. with
":shutdown('hostname')", so that non-coordinators can be shut down
and for the convenience of the client, which does not have to
connect to the specific impalad. There is no assumption that the
other impalad is registered in the statestore; just that the
coordinator can connect to the other daemon's thrift endpoint.
This simplifies things and allows shutdown in various important
cases, e.g. statestore down.
* The shutdown time limit can be overridden to force a quicker or
slower shutdown by specifying a deadline in seconds after the
statement is executed.
* If shutting down, a banner is shown on the root debug page.
Workflow:
1. (if a coordinator) clients are prevented from submitting
queries to this coordinator via some out-of-band mechanism,
e.g. load balancer
2. the shutdown process is started via ":shutdown()"
3. a bit is set in the statestore and propagated to coordinators,
which stop scheduling fragment instances on this daemon
(if an executor).
4. the query startup grace period (which is ideally set to the AC
queueing delay plus some additional leeway) expires
5. once the daemon is quiesced (i.e. no fragments, no registered
queries), it shuts itself down.
6. If the daemon does not successfully quiesce (e.g. rogue clients,
long-running queries), after a longer timeout (counted from the start
of the shutdown process) it will shut down anyway.
What this does:
* Executors can be shut down without causing a service-wide outage
* Shutting down an executor will not disrupt any short-running queries
and will wait for long-running queries up to a threshold.
* Coordinators can be shut down without query failures only if
there is an out-of-band mechanism to prevent submission of more
queries to the shut down coordinator. If queries are submitted to
a coordinator after shutdown has started, they will fail.
* Long running queries or other issues (e.g. stuck fragments) will
slow down but not prevent eventual shutdown.
Limitations:
* The startup grace period needs to be configured to be greater than
the latency of statestore updates + scheduling + admission +
coordinator startup. Otherwise a coordinator may send a
fragment instance to the shutting down impalad. (We could
automate this configuration as a follow-on)
* The startup grace period means a minimum latency for shutdown,
even if the cluster is idle.
* We depend on the statestore detecting the process going down
if queries are still running on that backend when the timeout
expires. This may still be subject to existing problems,
e.g. IMPALA-2990.
Tests:
* Added parser, analysis and authorization tests.
* End-to-end test of shutting down impalads.
* End-to-end test of shutting down then restarting an executor while
queries are running.
* End-to-end test of shutting down a coordinator
- New queries cannot be started on coord, existing queries continue to run
- Exercises various Beeswax and HS2 operations.
Change-Id: I4d5606ccfec84db4482c1e7f0f198103aad141a0
Reviewed-on: http://gerrit.cloudera.org:8080/10744
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds the ability for a statestore subscriber to specify a key
prefix which acts as a filter. Only topic entries which match the
specified prefix are transmitted to the subscriber.
This patch makes use of the new feature for a small optimization: the
catalogd subscribes to the catalog topic with a key prefix "!" which we
know doesn't match any actual topic items. This avoids the statestore
having to reflect back the catalog contents to the catalogd, since the
catalogd ignored this info anyway.
A later patch will make use of this to publish lightweight catalog
object version numbers in the same topic as the catalog objects
themselves.
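A minimal sketch of the prefix filtering (the topic keys below are made
up for illustration):

def entries_for_subscriber(topic_entries, key_prefix):
    # Only topic entries whose key matches the subscriber's prefix are
    # transmitted. An empty prefix keeps the old behavior; catalogd's
    # "!" prefix matches nothing, so nothing is reflected back to it.
    return {k: v for k, v in topic_entries.items()
            if k.startswith(key_prefix)}

topic = {"TABLE:db.t1": "...", "DATABASE:db": "..."}
print(entries_for_subscriber(topic, ""))   # both entries
print(entries_for_subscriber(topic, "!"))  # {}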
The modification to catalogd's topic subscription is covered by existing
tests. A new specific test is added to verify the filtering mechanism.
Change-Id: I6ddcf3bfaf16bc3cd1ba01100e948ff142a67620
Reviewed-on: http://gerrit.cloudera.org:8080/11253
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
min_subscriber_topic_version is expensive to compute (it requires
iterating over all subscribers) but is only used by one
subscriber/topic pair: Impalads receiving catalog topic updates.
This patch implements a simple fix - only compute it if a subscriber
asks for it. A more complex alternative would be to maintain
a priority queue of subscriber versions, but that didn't seem worth
the complexity and risk of bugs.
Testing:
Add a statestore test to validate the versions. It looks like we had a
pre-existing test gap for validating min_subscriber_topic_version so
the test is mainly focused on adding that coverage.
Ran core tests with DEBUG and ASAN.
Change-Id: I8ee7cb2355ba1049b9081e0df344ac41aa4ebeb1
Reviewed-on: http://gerrit.cloudera.org:8080/10705
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
admission control checks
Currently the admission controller assumes that all backends have the
same process mem limit as the impalad it is running on. With this
patch, the proc mem limit for each impalad is available to the
admission controller, and it uses it to make correct admission
decisions. It currently works under the assumption that the
per-process memory limit does not change dynamically.
Testing:
Added an e2e test.
IMPALA-5662: Log the queuing reason for a query
The queuing reason is now logged both while queuing for the first
time and while trying to dequeue.
Change-Id: Idb72eee790cc17466bbfa82e30f369a65f2b060e
Reviewed-on: http://gerrit.cloudera.org:8080/10396
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit fixes an issue where the statestore may end up with stale
entries in the catalog update topic that do not correspond to the
catalog objects stored in the catalog. This may occur if the catalog
server restarts and some catalog object (e.g. table) that was known to
the catalog before the restart no longer exists in the Hive Metastore
after the restart.
Fix:
The first update for the catalog update topic that is sent by the catalog
instructs the statestore to clear any entries it may have on this topic
before applying the first update. In that way, we guarantee that the
statestore entries are consistent with the catalog objects stored in the
catalog server. Any coordinator that detects the catalog restart will
receive from the statestore a full topic update that reflects the state
of the catalog server.
Testing:
Added statestore test.
Change-Id: I74a8ade8e498ac35cb56d3775d2c67a86988d9b6
Reviewed-on: http://gerrit.cloudera.org:8080/10289
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Problem: A long running table metadata operation (e.g. refresh) could
prevent any other metadata operation from making progress if it
coincided with the catalog topic creation operations. The problem was due
to the conservative locking scheme used when catalog topics were
created. In particular, in order to collect a consistent snapshot of
metadata changes, the global catalog lock was held for the entire
duration of that operation.
Solution: To improve the concurrency of catalog operations the following
changes are performed:
* A range of catalog versions determines the catalog changes to be
included in a catalog update. Any catalog changes that do not fall in
the specified range are ignored (to be processed in subsequent catalog
topic updates).
* The catalog allows metadata operations to make progress while
collecting catalog updates.
* To prevent starvation of catalog updates (i.e. frequently updated
  catalog objects skipping catalog updates indefinitely), we keep track
  of the number of times a catalog object has skipped an update, and if
  that number exceeds a threshold the object is included in the next
  catalog topic update even if its version is not in the specified
  topic update version range (see the sketch after this list). Hence,
  the same catalog object may be sent in two consecutive catalog topic
  updates.
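A rough sketch of the selection rule referenced in the list above; the
range semantics and threshold are illustrative assumptions:

def include_in_topic_update(obj_version, range_start, range_end,
                            times_skipped, skip_threshold=2):
    # Send the object if its version falls in this update's version
    # range, or if it has already skipped too many updates (to avoid
    # starvation of frequently updated objects).
    in_range = range_start < obj_version <= range_end
    return in_range or times_skipped > skip_threshold

print(include_in_topic_update(105, 100, 110, times_skipped=0))  # True
print(include_in_topic_update(115, 100, 110, times_skipped=0))  # False
print(include_in_topic_update(115, 100, 110, times_skipped=3))  # True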
This commit also changes the way deletions are handled in the catalog and
disseminated to the impalad nodes through the statestore. In particular:
* Deletions in the catalog are explicitly recorded in a log with
the catalog version in which they occurred. As before, added and deleted
catalog objects are sent to the statestore.
* Topic entries associated with deleted catalog objects have non-empty
  values (besides keys) that contain minimal object metadata including
  the catalog version.
* The statestore no longer uses the presence or absence of topic entry
  values to identify deleted topic entries. Deleted topic entries
  should be explicitly marked as such by the statestore subscribers
  that produce them.
* Statestore subscribers now use the 'deleted' flag to determine if a
topic entry corresponds to a deleted item.
* Impalads use the deleted objects' catalog versions when updating the
local catalog cache from a catalog update and not the update's maximum
catalog version.
Testing:
- No new tests were added as these paths are already exercised by
existing tests.
- Ran all core and exhaustive tests.
Change-Id: If12467a83acaeca6a127491d89291dedba91a35a
Reviewed-on: http://gerrit.cloudera.org:8080/7731
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/8752
This reverts commit dd340b8810.
This commit caused a number of issues tracked in IMPALA-6001. The
issues were due to the lack of atomicity between the catalog version
change and the addition to the delete log of a catalog object.
Conflicts:
be/src/service/impala-server.cc
Change-Id: I3a2cddee5d565384e9de0e61b3b7d0d9075e0dce
Reviewed-on: http://gerrit.cloudera.org:8080/8667
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This commit changes the way deletions are handled in the catalog and
disseminated to the impalad nodes through the statestore. Previously,
deletions of catalog objects were not explicitly annotated with the
catalog version in which the deletion occurred and the impalads were
using the max catalog version in a catalog update in order to decide
whether a deletion should be applied to the local catalog cache or not.
This works correctly under the assumption that
all the changes that occurred in the catalog between an update's min
and max catalog version are included in that update, i.e. no gaps or
missing updates. With the upcoming fix for IMPALA-5058, that constraint
will be relaxed, thus allowing for gaps in the catalog updates.
To avoid breaking the existing behavior, this patch introduces the
following changes:
* Deletions in the catalog are explicitly recorded in a log with
the catalog version in which they occurred. As before, added and deleted
catalog objects are sent to the statestore.
* Topic entries associated with deleted catalog objects have non-empty
values (besides keys) that contain minimal object metadata including the
catalog version.
* The statestore no longer uses the presence or absence of topic entry
  values to identify deleted topic entries. Deleted topic entries
  should be explicitly marked as such by the statestore subscribers
  that produce them.
* Statestore subscribers now use the 'deleted' flag to determine if a
topic entry corresponds to a deleted item.
* Impalads use the deleted objects' catalog versions when updating the
local catalog cache from a catalog update and not the update's maximum
catalog version.
Testing:
- No new tests were added as these paths are already exercised by
existing tests.
- Ran all core tests.
Change-Id: I93cb7a033dc8f0d3e0339394b36affe14523274c
Reviewed-on: http://gerrit.cloudera.org:8080/7731
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This change allows Impala to publish the IP address and port
information of KRPC services if it's enabled via the flag
use_krpc. The information is included in a new field in the
backend descriptor published as statestore updates. Scheduler
will also include this information in the destinations of plan
fragments. Also updated the mini-cluster startup script to assign
KRPC ports to Impalad instances.
This patch also takes into account a problem found in IMPALA-5795. In
particular, the backend descriptor of the coordinator may not be found
in the backend map in the scheduler if the coordinator is not an
executor (i.e. a dedicated coordinator). The fix is to also check
against the local backend descriptor.
This patch is partially based on an abandoned patch by Henry Robinson.
Testing done: ran core tests with a patch which ignores the use_krpc
flag to exercise the code in scheduler.
Change-Id: I8707bfb5028bbe81d2a042fcf3e6e19f4b719a72
Reviewed-on: http://gerrit.cloudera.org:8080/7760
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
This commit introduces a new startup option, termed 'is_executor',
that determines whether an impalad process can execute query fragments.
The 'is_executor' option determines if a specific host will be included
in the scheduler's backend configuration and hence included in
scheduling decisions.
Testing:
- Added a custom cluster test.
- Added a new scheduler test.
Change-Id: I5d2ff7f341c9d2b0649e4d14561077e166ad7c4d
Reviewed-on: http://gerrit.cloudera.org:8080/6628
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*
A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:
find . | grep "\.java\|\.py\|\.h\|\.cc" | \
xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g
along with some manual fixes.
After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/
Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Improves the memory-based admission control mechanism by
"reserving" memory on both a cluster-wide basis and also a
per-node basis. The algorithm today only accounts for
cluster-wide memory, so we may oversubscribe particular
impalads even if queries are admitted within the allowed
cluster-wide pool memory limits.
The header comment in admission-controller.h explains the
new algorithm in much more detail.
Testing notes:
The admission control functional tests exercise the max
running queries and memory limits at a high level, though
they don’t yet validate the behavior of the new per-node mem
accounting. Those tests will be written soon. For now, there
is manual test coverage of the new mem resources behavior:
a) Submitting queries with mem limits, queries use a subset
of the nodes so they compete for resources on some nodes
more than others - sanity checked via metrics
b) Submitting queries without mem limits, verified
mem_reserved appears to be as expected.
Most* EE tests were also forced to run with admission control enabled
(by manually changing gflag defaults before running), and I checked
that there was no unexpected behavior in the following scenarios:
a) a default pool with no limits (i.e. should be same
behavior as no admission control)
b) a default pool with mem resources set, and a default query
mem limit of 1g
c) a default pool with mem resources set but no default mem
limit: this case falls back to using planner estimates
during admission (supported for legacy only), so queries
with very large estimates are rejected and those tests
fail (this is expected). The new CM UI will not allow
this configuration moving forward so it should be
uncommon.
*Excluding custom_cluster and some others for unrelated issues.
The majority of regular tests passed, those that failed were not
due to AC issues.
Change-Id: Ia0d1eb8c07457cbe4b67b7f7f57136b4774720bc
Reviewed-on: http://gerrit.cloudera.org:8080/1710
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
The statestore is a little loose about when it uses 'heartbeat' and when
it uses 'keep-alive'. This patch standardises on 'heartbeat'.
Change-Id: Ic17995667e08f068b668781b60b00106f914e3e4
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5769
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
As originally envisaged, the statestore was not intended to transmit GBs
of data to large clusters. Now that it is, some changes to the design
are required to improve stability and responsiveness.
One of the most acute problems is the way that the topic updates, which
may be extremely large, were responsible for conveying failure detection
information. The delay in sending topic heartbeats (which were often
blocked behind a queue of other updates) meant that large clusters
needed to set large timeouts for every subscriber, and to reduce the
frequency at which topic updates were sent.
This change adds a second heartbeat type to the statestore which is
responsibie only for coordinating failure information between the
subscriber and the statestore. These 'keep-alive' heartbeats are sent
much more frequently, as they have a tiny payload and do very little
work. The statestore now does not use the result of topic updates to
maintain its view of which subscribers are alive or dead, but only the
results of the keep-alive heartbeats.
I've tested these changes with a 500MB catalog on a 73 node
cluster. Keep-alive frequencies are nice and stable (at 500ms out of the
box) even during the initial large topic-update distribution phase, or
after a statestore failure.
Internally, this patch changes the nomenclature for the statestore: we
stay away from 'heartbeat' (except for the failure detector, which is
heartbeat-based), and instead use 'message' as a generic term for both
keep-alive and topic update. Externally, we need to rationalise the flag
names to control both update types. The benefit of this change is that
only the keep-alive frequency matters for cluster stability, and that
depends only on the cluster size, not on the size of the catalog. So
we can set it to, e.g., 1s out of the box with some
confidence that that will work for clusters up to ~200 nodes. The topic
update frequency may actually be set aggressively, because the
statestore will naturally throttle itself during times of heavy traffic
and nothing will time out.
Change-Id: Ia447e4ebefda890a5b810a213e97f00cca499989
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4003
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5138
Adds the ability to set per-pool memory limits. Each impalad tracks the memory
used by queries in each pool; a per-pool memory tracker is added between the
per-query trackers and the process memory tracker. The current memory usage
is disseminated via statestore heartbeats (along with the other per-pool stats)
and a cluster-wide estimate of the pool memory usage is updated when topic
updates are received. The admission controller will not admit incoming
requests if the total memory usage is already over the configured pool limit.
Change-Id: Ie9bc82d99643352ba77fb91b6c25b42938b1f745
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1508
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 64a137930a318e56a7090a317e6aa5df67ea72cd)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1623
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
This patch fixes a problem observed when a subscriber was processing a
heartbeat, and while doing so tried to re-register with the
statestore. The statestore would schedule a heartbeat for the new
registration, but the subscriber would return an error, thinking that it
was still re-registering (see UpdateState() for the try_lock logic that
gave rise to this error). The statestore, upon receiving the error,
would update its failure detector and eventually mark the subscriber as
failed, unnecessarily forcing a re-registration loop.
This only regularly happens when UpdateState() takes a long time,
i.e. when a subscriber callback takes a while. This patch also adds
metrics to measure the amount of time callbacks take.
Change-Id: I157cdfd550279a6942e7ca54fe622520c8ad5dcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1574
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
(cherry picked from commit bc0a8819e754623bc9e5e5ab805369ad8381e5b9)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1610
This adds a simple admission control mechanism which is able to make
localized admission decisions and can queue some number of queries
that are not able to execute immediately. In this change, there is
a single pool and the maximum number of concurrent queries and the
maximum number of queued queries are configurable via flags, but the
data structures all support multiple pools so that we can later add
support for getting pools and per-pool configs from Yarn and Llama
configs (i.e. fair-scheduler.xml and llama-site.xml).
Each impalad keeps track of how many queries it has executing and
how many are queued, per pool. The statestore is used to disseminate
local pool statistics to all other impalads. When topic updates are
received, a new cluster-wide total number of currently executing
queries and total number of queued queries are calculated for each
pool. Those totals are used to make localized admission decisions in
the AdmissionController.
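A minimal sketch of that localized decision; names and the exact policy
details are illustrative assumptions, not the actual AdmissionController
logic:

def admission_decision(agg_num_running, agg_num_queued,
                       max_requests, max_queued):
    # Cluster-wide totals derived from the statestore topic updates are
    # compared against the pool's configured limits.
    if agg_num_running < max_requests:
        return "ADMIT"
    if agg_num_queued < max_queued:
        return "QUEUE"
    return "REJECT"

print(admission_decision(4, 0, max_requests=5, max_queued=10))   # ADMIT
print(admission_decision(5, 3, max_requests=5, max_queued=10))   # QUEUE
print(admission_decision(5, 10, max_requests=5, max_queued=10))  # REJECT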
There are a number of per-pool metrics which are used in automated
testing (in a separate commit), and there are many assertions and debug
log statements which I've been using during manual testing.
Change-Id: I68f92c789108336fca33c2148a4e14534c77e9f0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1347
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit e52b528b8a2fa23585510eab916ecb41da82d24b)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1302
* Statestore is now one word, without camelcase, everywhere. Previous
names included StateStore, state-store and state_store,
variously. The only exception is a couple of flags that have
'state_store', and can't be changed for compatibility reasons.
* File names are also changed to reflect the standard naming.
* Most comments are now 90 chars wide (from 80 before)
Change-Id: I83b666c87991537f9b1b80c2f0ea70c2e0c07dcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1225
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins