This puts all of the thrift-generated python code into the
impala_thrift_gen package. This is similar to what Impyla does for its
thrift-generated python code, except that Impala uses the
impala_thrift_gen package rather than Impyla's impala._thrift_gen.
This is a preparatory patch for fixing the absolute import
issues.
This patches all of the thrift files to add the python namespace, and
includes code that applies the same patching to the thirdparty thrift
files (hive_metastore.thrift, fb303.thrift).
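As a rough illustration of what the namespace patching amounts to (the
helper below is a hypothetical sketch, not the actual script used
here), each .thrift file gains a Python namespace so its generated
module lands under impala_thrift_gen:

def add_py_namespace(thrift_text, module_name):
    # Prepend a Python namespace declaration so generated code lands in
    # the impala_thrift_gen package, e.g.
    #   namespace py impala_thrift_gen.hive_metastore
    line = "namespace py impala_thrift_gen.%s\n" % module_name
    if line in thrift_text:
        return thrift_text  # already patched
    return line + thrift_text

if __name__ == "__main__":
    print(add_py_namespace("struct Foo { 1: i32 id }\n", "example"))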
Putting all the generated python into a package makes it easier
to understand where the imports are getting code. When the
subsequent change rearranges the shell code, the thrift generated
code can stay in a separate directory.
This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.
Testing:
- Ran a core job
Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Allow administrators to configure per-user limits on the queries that
can run in the Impala system.
There are two parts to this. First, we must track the total counts of
queries in the system on a per-user basis. Second, there must be a user
model that allows rules controlling per-user limits on the number of
queries that can be run.
In a Kerberos environment the user names used in both the user model
and at runtime are short user names, e.g. testuser when the Kerberos
principal is testuser/scm@EXAMPLE.COM.
TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data, which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue, and we do not want to block other queries because of quotas, and
(2) it is easier for users to understand what is going on when queries
are rejected at submission time.
Note that when running in configurations without an Admission Daemon,
Admission Control does not have perfect information about the system,
and over-admission is possible for User-Level Admission Quotas in the
same way that it is for other Admission Control limits.
The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.
Two new elements, "userQueryLimit" and "groupQueryLimit", can be added
to the fair-scheduler.xml file. These elements can be placed on the
root configuration, which applies to all pools, or on a pool
configuration.
The "userQueryLimit" element has two child elements: "user" and
"totalCount". The "user" element contains the short names of users; it
can be repeated, or have the value "*" as a wildcard that matches all
users. The "groupQueryLimit" element has two child elements: "group"
and "totalCount". The "group" element contains group names.
Both the root-level rules and the pool-level rules must pass for a new
query to be queued. The rules dictate the maximum number of queries
that a user can run. When evaluating rules at either the root level or
the pool level, once a rule matches a user, no further rules are
evaluated.
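A minimal sketch of how such a configuration could be read and matched;
the XML fragment, element nesting and helper below are illustrative
assumptions, not the exact schema or the code added to
RequestPoolService:

import xml.etree.ElementTree as ET

# Illustrative fair-scheduler.xml fragment with two root-level rules.
CONF = """
<allocations>
  <userQueryLimit>
    <user>alice</user>
    <totalCount>2</totalCount>
  </userQueryLimit>
  <userQueryLimit>
    <user>*</user>
    <totalCount>5</totalCount>
  </userQueryLimit>
</allocations>
"""

def user_query_limit(root, user):
    # Once a rule matches a user, no further rules are evaluated.
    for rule in root.findall("userQueryLimit"):
        users = [u.text for u in rule.findall("user")]
        if user in users or "*" in users:
            return int(rule.findtext("totalCount"))
    return None  # no matching rule at this level

root = ET.fromstring(CONF)
print(user_query_limit(root, "alice"))     # 2
print(user_query_limit(root, "testuser"))  # 5, via the "*" wildcard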
To support reading the "userQueryLimit" and "groupQueryLimit" fields,
RequestPoolService is enhanced.
If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.
More comprehensive documentation of the user model will be provided in
IMPALA-12943.
TESTING
New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.
Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds a /catalog_ha_info page to the Statestore web UI to
show catalog HA information. The page contains the following
information: Active Node, Standby Node, and a Notified Subscribers
table. The Notified Subscribers table includes the following items:
-- Id,
-- Address,
-- Registration ID,
-- Subscriber Type,
-- Catalogd Version,
-- Catalogd Address,
-- Last Update Catalogd Time
Change-Id: If85f6a827ae8180d13caac588b92af0511ac35e3
Reviewed-on: http://gerrit.cloudera.org:8080/21418
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
To support statestore HA, we allow two statestored instances in an
Active-Passive HA pair to be added to an Impala cluster. We add the
preemptive behavior for statestored. When HA is enabled, the preemptive
behavior allows the statestored with the higher priority to become
active and the paired statestored becomes standby. The active
statestored acts as the owner of Impala cluster and provides statestore
service for the cluster members.
To enable statestore HA for a cluster, the two statestoreds in the HA
pair and all subscribers must be started with the startup flag
"enable_statestored_ha" set to true.
This patch makes the following changes:
- Defined a new service for Statestore HA.
- Statestored negotiates the role for HA with its peer statestore
instance on startup.
- Create HA monitor thread:
Active statestored sends heartbeat to standby statestored.
Standby statestored monitors peer's connection states with their
subscribers.
- Standby statestored sends heartbeats to subscribers, requesting the
  connection state between the active statestored and each subscriber.
  The standby statestored saves these connection states as a failure
  detector.
- When the standby statestored loses its connection to the active
  statestored, it checks the reported connection states and takes over
  the active role if a majority of subscribers have also lost their
  connections to the active statestored (see the sketch after this
  list).
- The new active statestored sends RPC notifications to all subscribers
  announcing the new active statestored and the active catalogd elected
  by it.
- New active statestored starts sending heartbeat to its peer when it
receives handshake from its peer.
- The active statestored enters recovery mode if it loses its
  connections to both its peer statestored and all subscribers. It
  keeps sending the HA handshake to its peer until it receives a
  response.
- All subscribers (impalad/catalogd/admissiond) register to two
statestoreds.
- Subscribers report connection state in response to the requests from
  the standby statestored.
- Subscribers switch to the new active statestored when they receive
  RPC notifications from it.
- Only active statestored sends topic update messages to subscribers.
- Add a new option "enable_statestored_ha" in script
bin/start-impala-cluster.py for starting Impala mini-cluster with
statestored HA enabled.
- Add a new Thrift API in statestore service to disable network
for statestored. It's only used for unit-test to simulate network
failure. For safety, it's only working when the debug action is set
in starting flags.
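A rough sketch of the standby's takeover decision described above; the
function and data shapes are illustrative assumptions, not the actual
implementation:

def standby_should_take_over(lost_peer_connection, subscriber_sees_active):
    # subscriber_sees_active maps subscriber id -> True if that subscriber
    # still reports a working connection to the active statestored.
    if not lost_peer_connection or not subscriber_sees_active:
        return False
    lost = sum(1 for ok in subscriber_sees_active.values() if not ok)
    # Take over only if a majority of subscribers also lost the active.
    return lost * 2 > len(subscriber_sees_active)

print(standby_should_take_over(
    True, {"impalad-1": False, "impalad-2": False, "catalogd": True}))  # True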
Testing:
- Added end-to-end tests for statestored HA.
- Passed core tests
Change-Id: Ibd2c814bbad5c04c1d50c2edaa5b910c82a6fd87
Reviewed-on: http://gerrit.cloudera.org:8080/20372
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When CatalogD HA is enabled, the statestored sends the address of the
current active catalogd to coordinators and catalogds on two different
events: catalogd registration and a new active catalogd election. The
statestored sends the active catalogd in two different kinds of RPCs.
If there is more than one election change in a short time, coordinators
and catalogds could receive these RPCs in an order different from the
order of the changes on the statestore.
To give coordinators and catalogds the same view as the statestore, we
must avoid overwriting the latest version of the active catalogd with a
previous change.
The version of the active catalogd is already in the UpdateCatalogd
RPC, but not in the response message of statestore registration. This
patch adds the active catalogd version to the response message of
statestore registration.
Coordinators and catalogds only apply changes that carry a newer
version than the last received active catalogd version. The last
received version has to be re-synced on a new registration, since the
statestore could have been restarted.
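A minimal sketch of this version check; class and field names are
illustrative, not the actual coordinator/catalogd code:

class ActiveCatalogdTracker(object):
    def __init__(self):
        # Re-synced (reset) on each new statestore registration, since
        # the statestore could have been restarted.
        self.last_version = -1
        self.active_catalogd = None

    def on_update(self, version, catalogd_address):
        # Ignore out-of-order RPCs: only apply changes that carry a newer
        # version than the last received active catalogd version.
        if version <= self.last_version:
            return False
        self.last_version = version
        self.active_catalogd = catalogd_address
        return True

t = ActiveCatalogdTracker()
print(t.on_update(2, "catalogd-b:26000"))  # True, applied
print(t.on_update(1, "catalogd-a:26000"))  # False, stale change ignored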
Allow subscribers to skip the UpdateCatalogd RPC if they cannot handle
it, for example when the statestore-id is unknown to a subscriber
because the RPC is received before registration completes. Skipped RPCs
will be resent by the statestore.
Testing:
- Added a test case to start both catalogds with flag
'force_catalogd_active' as true.
- Passed core tests
Change-Id: Ie49947e563d43c59bdd476b28c35be69848ae12a
Reviewed-on: http://gerrit.cloudera.org:8080/20276
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
To support catalog HA, we allow two catalogd instances in an Active-
Passive HA pair to be added to an Impala cluster.
We add the preemptive behavior for catalogd. When enabled, the
preemptive behavior allows the catalogd with the higher priority to
become active and the paired catalogd becomes standby. The active
catalogd acts as the source of metadata and provides catalog service
for the Impala cluster.
To enable catalog HA for a cluster, two catalogds in the HA pair and
statestore must be started with starting flag "enable_catalogd_ha".
The catalogd in an Active-Passive HA pair can be assigned an instance
priority value to indicate a preference for which catalogd should assume
the active role. The registration ID which is assigned by statestore can
be used as instance priority value. The lower numerical value in
registration ID corresponds to a higher priority. The catalogd with the
higher priority is designated as active, the other catalogd is
designated as standby. Only the active catalogd propagates the
IMPALA_CATALOG_TOPIC to the cluster. This guarantees only one writer for
the IMPALA_CATALOG_TOPIC in an Impala cluster.
The statestore, which is the registration center of an Impala cluster,
assigns the roles to the catalogds in the HA pair after both catalogds
register with the statestore. When the statestore detects that the
active catalogd is not healthy, it fails over the catalog service to
the standby catalogd. When a failover occurs, the statestore sends
notifications with the address of the active catalogd to all
coordinators and catalogds in the cluster. The events are logged in the
statestore and catalogd logs. When the catalogd with the higher
priority recovers from a failure, the statestore does not make it
active again, to avoid flip-flopping between the two catalogds.
To make a specific catalogd in the HA pair the active instance, the
catalogd must be started with the startup flag "force_catalogd_active"
so that it is assigned the active role when it registers with the
statestore. This allows administrators to manually perform a catalog
service failover.
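A rough sketch of the role assignment described above; the integer
registration IDs and helper are illustrative assumptions, not the
actual statestore code:

def choose_active_catalogd(registration_ids, forced_active=None):
    # registration_ids maps catalogd -> registration ID assigned by the
    # statestore; a lower numerical value means higher priority.
    # forced_active models a catalogd started with force_catalogd_active.
    if forced_active is not None and forced_active in registration_ids:
        return forced_active
    return min(registration_ids, key=registration_ids.get)

regs = {"catalogd-1": 17, "catalogd-2": 42}
print(choose_active_catalogd(regs))                              # catalogd-1
print(choose_active_catalogd(regs, forced_active="catalogd-2"))  # catalogd-2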
Added option "--enable_catalogd_ha" in bin/start-impala-cluster.py.
If the option is specified when running the script, the script will
create an Impala cluster with two catalogd instances in HA pair.
Testing:
- Passed the core tests.
- Added unit-test for auto failover and manual failover.
Change-Id: I68ce7e57014e2a01133aede7853a212d90688ddd
Reviewed-on: http://gerrit.cloudera.org:8080/19914
Reviewed-by: Xiang Yang <yx91490@126.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tamas Mate <tmater@apache.org>
Some Thrift request/response structs in CatalogService were changed to
add new variables in the middle, which caused a cross-version
incompatibility issue for CatalogService.
Impala cluster membership is managed by the statestore. During upgrade
scenarios where different versions of Impala daemons are upgraded one
at a time, the upgraded daemons have incompatible message formats.
Even though protocol version numbers were already defined for the
Statestore and Catalog services, they were not used. The Statestore and
Catalog servers did not check the protocol version in requests, which
allowed incompatible Impala daemons to join the same cluster. This
caused unexpected query failures during rolling upgrades.
We need a way to detect this and enforce that some rules are followed:
- Statestore refuses the registration requests from incompatible
subscribers.
- Catalog server refuses the requests from incompatible clients.
- Scheduler assigns tasks to a group of compatible executors.
This patch isolates Impala daemons into separate clusters based on the
protocol version of the Statestore service, to prevent incompatible
Impala daemons from communicating with each other. It covers the Thrift
RPC communication between catalogd and coordinators, and the
communication between the statestore and its subscribers (executors,
coordinators, catalogd and admissiond). This change should also work
for future upgrades.
Following changes were made:
- Bump StatestoreServiceVersion and CatalogServiceVersion to V2 for
all requests of Statestore and Catalog services.
- Update the request and response structs in CatalogService to ensure
each Thrift request struct has protocol version and each Thrift
response struct has returned status.
- Update the request and response struct in StatestoreService to
ensure each Thrift request struct has protocol version and each
Thrift response struct has returned status.
- Add subscriber type so that statestore could distinguish different
types of subscribers.
- Statestore checks the protocol version of registration requests from
  subscribers and refuses requests with an incompatible version (see
  the sketch after this list).
- Catalog server checks the protocol version for Catalog service APIs
  and returns an error for requests with an incompatible version.
- Catalog daemon sends its address and the protocol version of the
  Catalog service when it registers with the statestore; the statestore
  forwards the address and the protocol version of the Catalog service
  to all subscribers during registration.
- Add UpdateCatalogd API for StatestoreSubscriber service so that the
coordinators could receive the address and the protocol version of
Catalog service from statestore if the coordinators register to
statestore before catalog daemon.
- Add GetProtocolVersion API for Statestore service so that the
subscribers can check the protocol version of statestore before
calling RegisterSubscriber API.
- Add the startup flag tolerate_statestore_startup_delay. It is off by
  default. When it is enabled, the subscriber is able to tolerate a
  delay in the statestore's availability: the subscriber's process will
  not exit if it cannot register with the specified statestore on
  startup. Instead, it enters Recovery mode, where it loops, sleeps and
  retries until it successfully registers with the statestore. This
  flag should be enabled during rolling upgrades.
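As a sketch of the registration-time check mentioned in the list above
(the constant and messages are illustrative, not the actual Thrift
handler):

STATESTORE_PROTOCOL_VERSION = 2  # illustrative value for the V2 bump

def handle_register_subscriber(request_protocol_version):
    # The statestore refuses registration requests from subscribers
    # whose protocol version does not match its own.
    if request_protocol_version != STATESTORE_PROTOCOL_VERSION:
        return ("ERROR",
                "Incompatible Statestore protocol version: %s"
                % request_protocol_version)
    return ("OK", None)

print(handle_register_subscriber(2))  # ('OK', None)
print(handle_register_subscriber(1))  # rejected as incompatible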
CatalogServiceVersion is defined in CatalogService.thrift. In the
future, if we make changes to the request or response structures of the
CatalogService APIs that are not backward compatible, we need to bump
the protocol version of the Catalog service.
StatestoreServiceVersion is defined in StatestoreService.thrift.
Similarly, if we make changes to the request or response structures of
the StatestoreService APIs that are not backward compatible, we need to
bump the protocol version of the Statestore service.
Message formats for the KRPC communication between coordinators and
executors, and between the admissiond and coordinators, are defined in
proto files under common/protobuf. If we make changes to these
structures that are not backward compatible, we also need to bump the
protocol version of the Statestore service.
Testing:
- Added end-to-end unit tests.
- Passed the core tests.
- Ran manual test to verify old version of executors cannot register
with new version of statestore, and new version of executors cannot
register with old version of statestore.
Change-Id: If61506dab38c4d1c50419c1b3f7bc4f9ee3676bc
Reviewed-on: http://gerrit.cloudera.org:8080/19959
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the ability to share the per-host stats for locally
admitted queries across all coordinators. This helps to get a more
consolidated view of the cluster for stats like slots_in_use and
mem_admitted when making local admission decisions.
Testing:
Added e2e py test
Change-Id: I2946832e0a89b077d0f3bec755e4672be2088243
Reviewed-on: http://gerrit.cloudera.org:8080/17683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This work addresses a limitation in the admission controller by
appending the last known memory consumption statistics about the set of
queries running or waiting on a host or in a pool to the existing
memory exhaustion message. The statistics are logged in impalad.INFO
when a query is queued, or queued and then timed out, due to memory
pressure in the pool or on the host. The statistics can also be part of
the query profile.
The new memory consumption statistics can be either per-host stats or
aggregated pool stats. The per-host stats describe memory consumption
for every pool on a host. The aggregated pool stats describe the
aggregated memory consumption across all hosts for a pool. For each
stats type, information such as the query IDs and memory consumption of
up to the top 5 queries is provided, in addition to the min, the max,
the average and the total memory consumption for the query set.
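A minimal sketch of the reported statistics; the helper and field names
are illustrative, not the actual admission controller code:

import heapq

def summarize_query_memory(mem_by_query, top_n=5):
    # Report the IDs and memory consumption of up to the top 5 queries,
    # plus the min, max, average and total over the whole query set.
    values = list(mem_by_query.values())
    top = heapq.nlargest(top_n, mem_by_query.items(), key=lambda kv: kv[1])
    return {
        "top_queries": top,
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / float(len(values)),
        "total": sum(values),
    }

stats = summarize_query_memory({"q1": 512, "q2": 2048, "q3": 128})
print(stats["top_queries"][0])  # ('q2', 2048)
print(stats["total"])           # 2688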
When a query request is queued due to memory exhaustion, the new
consumption statistics are logged when the BE logging level is set to
2. When a query request is timed out due to memory exhaustion, the new
consumption statistics are logged when the BE logging level is set to
1.
Testing:
1. Added a new test TopNQueryCheck in admission-controller-test.cc to
verify that the topN query memory consumption details are reported
correctly.
2. Added two new tests in test_admission_controller.py to simulate
queries being queued and then timed out due to pool or host memory
pressure.
3. Added a new test TopN in mem-tracker-test.cc to
verify that the topN query memory consumption details are computed
correctly from a mem tracker hierarchy.
4. Ran Core tests successfully.
Change-Id: Id995a9d044082c3b8f044e1ec25bb4c64347f781
Reviewed-on: http://gerrit.cloudera.org:8080/16220
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The new admission control service will be written in protobuf, so
there are various admission control related structures currently
stored in Thrift that it would be convenient to convert to protobuf,
to minimize the amount of converting back and forth that needs to be
done.
This patch converts TBackendDescriptor to protobuf. It isn't used
directly in any RPCs - we serialize it ourselves to send to the
statestore as a string, so no RPC definitions are affected.
This patch is just a refactor and doesn't contain any functional
changes.
Testing:
- Passed a full run of existing tests.
Change-Id: Ie7d1e373d9c87887144517ff6a4c2d5996aa88b8
Reviewed-on: http://gerrit.cloudera.org:8080/15822
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch introduces the concept of 'backend ids', which are unique
ids that can be used to identify individual impalads. The ids are
generated by each impalad on startup.
The patch then uses the ids to fix a bug where the statestore may fail
to inform coordinators when an executor has failed and restarted. The
bug was caused by the fact that the statestore cluster membership
topic was keyed on statestore subscriber ids, which are host:port
pairs.
So, if an impalad fails and a new one is started at the same host:port
before a particular coordinator has a cluster membership update
generated for it by the statestore, the statestore has no way of
differentiating the prior impalad from the newly started impalad, and
the topic update will not show the deletion of the original impalad.
With this patch, the cluster membership topic is now keyed by backend
id, so, since the restarted impalad will have a different backend id,
the next membership update after the prior impalad failed is guaranteed
to reflect that failure.
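To illustrate the difference (the key formats below are assumptions for
illustration, not the actual topic key encoding):

import uuid

def host_port_topic_key(host, port):
    # Old scheme: a restarted impalad on the same host:port reuses the
    # same key, so its failure can be missed in a topic update.
    return "%s:%d" % (host, port)

def backend_id_topic_key():
    # New scheme: a unique backend id is generated at impalad startup,
    # so a restart always produces a new key and the old key is deleted.
    return str(uuid.uuid4())

print(host_port_topic_key("host-1", 22000))
print(backend_id_topic_key() != backend_id_topic_key())  # True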
The patch also logs the backend ids on startup and adds them to the
/backends webui page and to the query locations section of the
/queries page, for use in debugging.
Further patches will apply the backend ids in other places where we
currently key off host:port pairs to identify impalads.
Testing:
- Added an e2e test that uses a new debug action to add delay to
statestore topic updates. Due to the use of JITTER the test is
non-deterministic, but it repros the original issue locally for me
about 50% of the time.
- Passed a full run of existing tests.
Change-Id: Icf8067349ed6b765f6fed830b7140f60738e9061
Reviewed-on: http://gerrit.cloudera.org:8080/15321
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This integrates mt_dop with the "slots" mechanism that's used
for non-default executor groups.
The idea is simple - the degree of parallelism on a backend
determines the number of slots consumed. The effective
degree of parallelism is used, not the raw mt_dop setting.
E.g. if the query only has a single input split and executes
only a single fragment instance on a host, we don't want
to count the full mt_dop value for admission control.
--admission_control_slots is added as a new flag that
replaces --max_concurrent_queries, since the name better
reflects the concept. --max_concurrent_queries is kept
for backwards compatibility and has the same meaning
as --admission_control_slots.
The admission control logic is extended to take this into
account. We also add an immediate rejection code path
since it is now possible for queries to not be admittable
based on the # of available slots.
We only factor in the "width" of the plan - i.e. the number
of instances of fragments. We don't account for the number
of distinct fragments, since they may not actually execute
in parallel with each other because of dependencies.
This number is added to the per-host profile as the
"AdmissionSlots" counter.
Testing:
Added unit tests for rejection and queue/admit checks.
Also includes a fix for IMPALA-9054 where we increase
the timeout.
Added end-to-end tests:
* test_admission_slots in test_mt_dop.py that checks the
admission slot calculation via the profile.
* End-to-end admission test that exercises the admit
immediately and queueing code paths.
Added checks to test_verify_metrics (which runs after
end-to-end tests) to ensure that the per-backend
slots in use goes to 0 when the cluster is quiesced.
Change-Id: I7b6b6262ef238df26b491352656a26e4163e46e5
Reviewed-on: http://gerrit.cloudera.org:8080/14357
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This change adds support for running queries inside a single admission
control pool on one of several, disjoint sets of executors called
"executor groups".
Executors can be configured with an executor group through the newly
added '--executor_groups' flag. Note that in anticipation of future
changes, the flag already uses the plural form, but only a single
executor group may be specified for now. Each executor group
specification can optionally contain a minimum size, separated by a
':', e.g. --executor_groups default-pool-1:3. Only when the cluster
membership contains at least that number of executors for the group
will it be considered for admission.
Executor groups are mapped to resource pools by their name: An executor
group can service queries from a resource pool if the pool name is a
prefix of the group name separated by a '-'. For example, queries in
pool poolA can be serviced by executor groups named poolA-1 and
poolA-2, but not by groups named foo or poolB-1.
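A minimal sketch of this name-based mapping (the helper is illustrative,
not the actual scheduler code):

def group_can_serve_pool(group_name, pool_name):
    # An executor group can service queries from a resource pool if the
    # pool name is a prefix of the group name, separated by a '-'.
    return group_name.startswith(pool_name + "-")

print(group_can_serve_pool("poolA-1", "poolA"))  # True
print(group_can_serve_pool("poolA-2", "poolA"))  # True
print(group_can_serve_pool("foo", "poolA"))      # False
print(group_can_serve_pool("poolB-1", "poolA"))  # False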
During scheduling, executor groups are considered in alphabetical order.
This means that one group is filled up entirely before a subsequent
group is considered for admission. Groups also need to pass a health
check before being considered. In particular, they must contain at
least the minimum number of executors specified.
If no group is specified during startup, executors are added to the
default executor group. If - during admission - no executor group for a
pool can be found and the default group is non-empty, then the default
group is considered. The default group does not have a minimum size.
This change inverts the order of scheduling and admission. Prior to this
change, queries were scheduled before submitting them to the admission
controller. Now the admission controller computes schedules for all
candidate executor groups before each admission attempt. If the cluster
membership has not changed, then the schedules of the previous attempt
will be reused. This means that queries will no longer fail if the
cluster membership changes while they are queued in the admission
controller.
This change also alters the default behavior when using a dedicated
coordinator and no executors have registered yet. Prior to this change,
a query would fail immediately with an error ("No executors registered
in group"). Now a query will get queued and wait until executors show
up, or it times out after the pools queue timeout period.
Testing:
This change adds a new custom cluster test for executor groups. It
makes use of new capabilities added to start-impala-cluster.py to bring
up additional executors into an already running cluster.
Additionally, this change adds an instructional implementation of
executor group based autoscaling, which can be used during development.
It also adds a helper to run queries concurrently. Both are used in a
new test to exercise the executor group logic and to prevent regressions
to these tools.
In addition to these tests, the existing tests for the admission
controller (both BE and EE tests) thoroughly exercise the changed code.
Some of them required changes themselves to reflect the new behavior.
I looped the new tests (test_executor_groups and test_auto_scaling) for
a night (110 iterations each) without any issues.
I also started an autoscaling cluster with a single group and ran
TPC-DS, TPC-H, and test_queries on it successfully.
Known limitations:
When using executor groups, only a single coordinator and a single AC
pool (i.e. the default pool) are supported. Executors do not include
the number of currently running queries in their statestore updates,
and so admission controllers are not aware of the number of queries
admitted by other controllers per host.
Change-Id: I8a1d0900f2a82bd2fc0a906cc094e442cffa189b
Reviewed-on: http://gerrit.cloudera.org:8080/13550
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a class to track cluster membership called
ClusterMembershipMgr. It replaces the logic that was partially
duplicated between the ImpalaServer and the Coordinator and makes sure
that the local backend descriptor is consistent (IMPALA-8469).
The ClusterMembershipMgr maintains a view of the cluster membership and
incorporates incoming updates from the statestore. It also registers the
local backend with the statestore after startup. Clients can obtain a
consistent, immutable snapshot of the current cluster membership from
the ClusterMembershipMgr. Additionally, callbacks can be registered to
receive notifications of cluster membership changes. The ImpalaServer
and Frontend use this mechanism.
This change also generalizes the fix for IMPALA-7665: updates from the
statestore to the cluster membership topic are only made visible to the
rest of the local server after a post-recovery grace period has elapsed.
As part of this the flag
'failed_backends_query_cancellation_grace_period_ms' is replaced with
'statestore_subscriber_recovery_grace_period_ms'. To distinguish the
initial startup from post-recovery, a new metric
'statestore-subscriber.num-connection-failures' is exposed by the
daemon, which tracks the total number of connection failures to the
statestore over the process lifetime.
This change also unifies the naming of executor-related classes, in
particular it renames "BackendConfig" to "ExecutorGroup". In
anticipation of a subsequent change (IMPALA-8484), it adds maps to store
multiple executor groups.
This change also disables the generation of default operators from the
thrift files and instead adds explicit implementations for the ones that
we rely on. This forces us to explicitly specify comparators when
manipulating containers of thrift structs and will help prevent
accidental bugs.
Testing: This change adds a backend unit test for the new cluster
membership manager. The observable behavior of Impala does not change,
and the existing scheduler unit test and end to end tests should make
sure of that.
Change-Id: Ib3cf9a8bb060d0c6e9ec8868b7b21ce01f8740a3
Reviewed-on: http://gerrit.cloudera.org:8080/13207
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-7694 added a field in the middle of the Metrics.TUnit enum, which
broke backwards compatibility with profiles that had been written by
older versions of Impala. This change fixes the ordering by moving the
field to the end of the enum.
Additionally, it adds a warning to the top of all Thrift files that are
part of the binary profile format, and a note of caution to the main
definition in RuntimeProfile.thrift.
This change also fixes the order of all enums in our Thrift files to
make errors like this less likely in the future.
Change-Id: If215f16a636008757ceb439edbd6900a1be88c59
Reviewed-on: http://gerrit.cloudera.org:8080/12543
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds a flag --mem_limit_includes_jvm that alters memory accounting to
include the amount of memory we think that the JVM is likely to use.
By default this flag is false, so behaviour is unchanged.
We're not ready to change the default but I want to check this in to
enable experimentation.
Two metrics are counted towards the process limit:
* The maximum JVM heap size. We count this because the JVM memory
consumption can expand up to this threshold at any time.
* JVM non-heap committed memory. This can be a non-trivial amount of
  memory (e.g. I saw 150MB on one production cluster). There isn't a
  hard upper bound on this memory that I know of, but it should not
  grow rapidly.
This requires adjustments in a couple of other places:
* Admission control previously assumed that all of the process memory
  limit was available to queries (an assumption that is not strictly
  true because of untracked memory, etc, but close enough). However,
  the JVM heap makes a large part of the process limit unusable to
  queries, so we should only admit up to "process limit - max JVM heap
  size" per node (see the sketch below).
* The buffer pool is now a percentage of the remaining process limit
after the JVM heap, instead of the total process limit.
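A small sketch of the admission-control arithmetic referenced above
(names and values are illustrative):

def per_node_admittable_mem(process_mem_limit, max_jvm_heap):
    # The JVM heap can grow to its maximum at any time, so only the
    # remainder of the process limit should be handed out to queries.
    return process_mem_limit - max_jvm_heap

GB = 1024 ** 3
print(per_node_admittable_mem(12 * GB, 4 * GB) // GB)  # 8 (GB)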
Currently, end-to-end tests fail if run with this flag for two reasons:
* The default JVM heap size is 1/4 of physical memory, which means that
essentially all of the process memory limit is consumed by the JVM
heaps when we run 3 impala daemons per host, unless -Xmx is
explicitly set.
* If the heap size is limited to 1-2GB like below, then most tests pass
but TestInsert.test_insert_large_string fails because IMPALA-4865
lets it create giant strings that eat up all the JVM heap.
start-impala-cluster.py \
--impalad_args=--mem_limit_includes_jvm=true --jvm_args="-Xmx1g"
Testing:
Add a custom cluster test that uses the new option and validates the
memory consumption values.
Change-Id: I39dd715882a32fc986755d573bd46f0fd9eefbfc
Reviewed-on: http://gerrit.cloudera.org:8080/10928
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the same patch except with fixes for the test failures
on EC and S3 noted in the JIRA.
This allows graceful shutdown of executors and partially graceful
shutdown of coordinators (new operations fail, old operations can
continue).
Details:
* In order to allow future admin commands, this is implemented with
function-like syntax and does not add any reserved words.
* ALL privilege is required on the server
* The coordinator impalad that the client is connected to can be shut
down directly with ":shutdown()".
* Remote shutdown of another impalad is supported, e.g. with
":shutdown('hostname')", so that non-coordinators can be shut down
and for the convenience of the client, which does not have to
connect to the specific impalad. There is no assumption that the
other impalad is registered in the statestore; just that the
coordinator can connect to the other daemon's thrift endpoint.
This simplifies things and allows shutdown in various important
cases, e.g. statestore down.
* The shutdown time limit can be overridden to force a quicker or
slower shutdown by specifying a deadline in seconds after the
statement is executed.
* If shutting down, a banner is shown on the root debug page.
Workflow:
1. (if a coordinator) clients are prevented from submitting
queries to this coordinator via some out-of-band mechanism,
e.g. load balancer
2. the shutdown process is started via ":shutdown()"
3. a bit is set in the statestore and propagated to coordinators,
which stop scheduling fragment instances on this daemon
(if an executor).
4. the query startup grace period (which is ideally set to the AC
queueing delay plus some additional leeway) expires
5. once the daemon is quiesced (i.e. no fragments, no registered
queries), it shuts itself down.
6. If the daemon does not successfully quiesce (e.g. rogue clients,
long-running queries), after a longer timeout (counted from the start
of the shutdown process) it will shut down anyway; the sketch below
illustrates this timing.
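A rough sketch of this quiesce-then-deadline flow; the function,
arguments and polling are illustrative assumptions, not the actual
shutdown code:

import time

def run_shutdown(quiesced, grace_period_s, deadline_s, poll_s=0.1):
    # Wait out the startup grace period so in-flight scheduling
    # decisions drain, then shut down as soon as the daemon is quiesced
    # (no fragments, no registered queries), or at the deadline anyway.
    start = time.time()
    time.sleep(grace_period_s)
    while time.time() - start < deadline_s:
        if quiesced():
            return "quiesced, shutting down cleanly"
        time.sleep(poll_s)
    return "deadline reached, shutting down anyway"

print(run_shutdown(lambda: True, grace_period_s=0, deadline_s=5))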
What this does:
* Executors can be shut down without causing a service-wide outage
* Shutting down an executor will not disrupt any short-running queries
and will wait for long-running queries up to a threshold.
* Coordinators can be shut down without query failures only if
there is an out-of-band mechanism to prevent submission of more
queries to the shut down coordinator. If queries are submitted to
a coordinator after shutdown has started, they will fail.
* Long running queries or other issues (e.g. stuck fragments) will
slow down but not prevent eventual shutdown.
Limitations:
* The startup grace period needs to be configured to be greater than
the latency of statestore updates + scheduling + admission +
coordinator startup. Otherwise a coordinator may send a
fragment instance to the shutting down impalad. (We could
automate this configuration as a follow-on)
* The startup grace period means a minimum latency for shutdown,
even if the cluster is idle.
* We depend on the statestore detecting the process going down
if queries are still running on that backend when the timeout
expires. This may still be subject to existing problems,
e.g. IMPALA-2990.
Tests:
* Added parser, analysis and authorization tests.
* End-to-end test of shutting down impalads.
* End-to-end test of shutting down then restarting an executor while
queries are running.
* End-to-end test of shutting down a coordinator
- New queries cannot be started on coord, existing queries continue to run
- Exercises various Beeswax and HS2 operations.
Change-Id: I8f3679ef442745a60a0ab97c4e9eac437aef9463
Reviewed-on: http://gerrit.cloudera.org:8080/11484
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This allows graceful shutdown of executors and partially graceful
shutdown of coordinators (new operations fail, old operations can
continue).
Details:
* In order to allow future admin commands, this is implemented with
function-like syntax and does not add any reserved words.
* ALL privilege is required on the server
* The coordinator impalad that the client is connected to can be shut
down directly with ":shutdown()".
* Remote shutdown of another impalad is supported, e.g. with
":shutdown('hostname')", so that non-coordinators can be shut down
and for the convenience of the client, which does not have to
connect to the specific impalad. There is no assumption that the
other impalad is registered in the statestore; just that the
coordinator can connect to the other daemon's thrift endpoint.
This simplifies things and allows shutdown in various important
cases, e.g. statestore down.
* The shutdown time limit can be overridden to force a quicker or
slower shutdown by specifying a deadline in seconds after the
statement is executed.
* If shutting down, a banner is shown on the root debug page.
Workflow:
1. (if a coordinator) clients are prevented from submitting
queries to this coordinator via some out-of-band mechanism,
e.g. load balancer
2. the shutdown process is started via ":shutdown()"
3. a bit is set in the statestore and propagated to coordinators,
which stop scheduling fragment instances on this daemon
(if an executor).
4. the query startup grace period (which is ideally set to the AC
queueing delay plus some additional leeway) expires
5. once the daemon is quiesced (i.e. no fragments, no registered
queries), it shuts itself down.
6. If the daemon does not successfully quiesce (e.g. rogue clients,
long-running queries), after a longer timeout (counted from the start
of the shutdown process) it will shut down anyway.
What this does:
* Executors can be shut down without causing a service-wide outage
* Shutting down an executor will not disrupt any short-running queries
and will wait for long-running queries up to a threshold.
* Coordinators can be shut down without query failures only if
there is an out-of-band mechanism to prevent submission of more
queries to the shut down coordinator. If queries are submitted to
a coordinator after shutdown has started, they will fail.
* Long running queries or other issues (e.g. stuck fragments) will
slow down but not prevent eventual shutdown.
Limitations:
* The startup grace period needs to be configured to be greater than
the latency of statestore updates + scheduling + admission +
coordinator startup. Otherwise a coordinator may send a
fragment instance to the shutting down impalad. (We could
automate this configuration as a follow-on)
* The startup grace period means a minimum latency for shutdown,
even if the cluster is idle.
* We depend on the statestore detecting the process going down
if queries are still running on that backend when the timeout
expires. This may still be subject to existing problems,
e.g. IMPALA-2990.
Tests:
* Added parser, analysis and authorization tests.
* End-to-end test of shutting down impalads.
* End-to-end test of shutting down then restarting an executor while
queries are running.
* End-to-end test of shutting down a coordinator
- New queries cannot be started on coord, existing queries continue to run
- Exercises various Beeswax and HS2 operations.
Change-Id: I4d5606ccfec84db4482c1e7f0f198103aad141a0
Reviewed-on: http://gerrit.cloudera.org:8080/10744
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds the ability for a statestore subscriber to specify a key
prefix which acts as a filter. Only topic entries which match the
specified prefix are transmitted to the subscriber.
This patch makes use of the new feature for a small optimization: the
catalogd subscribes to the catalog topic with a key prefix "!" which we
know doesn't match any actual topic items. This avoids the statestore
having to reflect back the catalog contents to the catalogd, since the
catalogd ignored this info anyway.
A later patch will make use of this to publish lightweight catalog
object version numbers in the same topic as the catalog objects
themselves.
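A minimal sketch of the prefix filtering (the topic keys below are made
up for illustration):

def entries_for_subscriber(topic_entries, key_prefix):
    # Only topic entries whose key matches the subscriber's prefix are
    # transmitted. An empty prefix keeps the old behavior; catalogd's
    # "!" prefix matches nothing, so nothing is reflected back to it.
    return {k: v for k, v in topic_entries.items()
            if k.startswith(key_prefix)}

topic = {"TABLE:db.t1": "...", "DATABASE:db": "..."}
print(entries_for_subscriber(topic, ""))   # both entries
print(entries_for_subscriber(topic, "!"))  # {}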
The modification to catalogd's topic subscription is covered by existing
tests. A new specific test is added to verify the filtering mechanism.
Change-Id: I6ddcf3bfaf16bc3cd1ba01100e948ff142a67620
Reviewed-on: http://gerrit.cloudera.org:8080/11253
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
min_subscriber_topic_version is expensive to compute (it requires
iterating over all subscribers) but is only used by one
subscriber/topic pair: Impalads receiving catalog topic updates.
This patch implements a simple fix - only compute it if a subscriber
asks for it. A more complex alternative would be to maintain
a priority queue of subscriber versions, but that didn't seem worth
the complexity and risk of bugs.
Testing:
Add a statestore test to validate the versions. It looks like we had a
pre-existing test gap for validating min_subscriber_topic_version so
the test is mainly focused on adding that coverage.
Ran core tests with DEBUG and ASAN.
Change-Id: I8ee7cb2355ba1049b9081e0df344ac41aa4ebeb1
Reviewed-on: http://gerrit.cloudera.org:8080/10705
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
admission control checks
Currently the admission controller assumes that all backends have the
same process mem limit as the impalad it is running on. With this
patch, the proc mem limit for each impalad is available to the
admission controller, and it uses it to make correct admission
decisions. It currently works under the assumption that the
per-process memory limit does not change dynamically.
Testing:
Added an e2e test.
IMPALA-5662: Log the queuing reason for a query
The queuing reason is now logged both while queuing for the first
time and while trying to dequeue.
Change-Id: Idb72eee790cc17466bbfa82e30f369a65f2b060e
Reviewed-on: http://gerrit.cloudera.org:8080/10396
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit fixes an issue where the statestore may end up with stale
entries in the catalog update topic that do not correspond to the
catalog objects stored in the catalog. This may occur if the catalog
server restarts and some catalog object (e.g. table) that was known to
the catalog before the restart no longer exists in the Hive Metastore
after the restart.
Fix:
The first update for the catalog update topic that is sent by the catalog
instructs the statestore to clear any entries it may have on this topic
before applying the first update. In that way, we guarantee that the
statestore entries are consistent with the catalog objects stored in the
catalog server. Any coordinator that detects the catalog restart will
receive from the statestore a full topic update that reflects the state
of the catalog server.
Testing:
Added statestore test.
Change-Id: I74a8ade8e498ac35cb56d3775d2c67a86988d9b6
Reviewed-on: http://gerrit.cloudera.org:8080/10289
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Problem: A long running table metadata operation (e.g. refresh) could
prevent any other metadata operation from making progress if it
coincided with the catalog topic creation operations. The problem was due
to the conservative locking scheme used when catalog topics were
created. In particular, in order to collect a consistent snapshot of
metadata changes, the global catalog lock was held for the entire
duration of that operation.
Solution: To improve the concurrency of catalog operations the following
changes are performed:
* A range of catalog versions determines the catalog changes to be
included in a catalog update. Any catalog changes that do not fall in
the specified range are ignored (to be processed in subsequent catalog
topic updates).
* The catalog allows metadata operations to make progress while
collecting catalog updates.
* To prevent starvation of catalog updates (i.e. frequently updated
  catalog objects skipping catalog updates indefinitely), we keep track
  of the number of times a catalog object has skipped an update, and if
  that number exceeds a threshold the object is included in the next
  catalog topic update even if its version is not in the specified
  topic update version range (see the sketch after this list). Hence,
  the same catalog object may be sent in two consecutive catalog topic
  updates.
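A rough sketch of the selection rule referenced in the list above; the
range semantics and threshold are illustrative assumptions:

def include_in_topic_update(obj_version, range_start, range_end,
                            times_skipped, skip_threshold=2):
    # Send the object if its version falls in this update's version
    # range, or if it has already skipped too many updates (to avoid
    # starvation of frequently updated objects).
    in_range = range_start < obj_version <= range_end
    return in_range or times_skipped > skip_threshold

print(include_in_topic_update(105, 100, 110, times_skipped=0))  # True
print(include_in_topic_update(115, 100, 110, times_skipped=0))  # False
print(include_in_topic_update(115, 100, 110, times_skipped=3))  # True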
This commit also changes the way deletions are handled in the catalog and
disseminated to the impalad nodes through the statestore. In particular:
* Deletions in the catalog are explicitly recorded in a log with
the catalog version in which they occurred. As before, added and deleted
catalog objects are sent to the statestore.
* Topic entries associated with deleted catalog objects have non-empty
  values (besides keys) that contain minimal object metadata including
  the catalog version.
* The statestore no longer uses the presence or absence of topic entry
  values to identify deleted topic entries. Deleted topic entries
  should be explicitly marked as such by the statestore subscribers
  that produce them.
* Statestore subscribers now use the 'deleted' flag to determine if a
topic entry corresponds to a deleted item.
* Impalads use the deleted objects' catalog versions when updating the
local catalog cache from a catalog update and not the update's maximum
catalog version.
Testing:
- No new tests were added as these paths are already exercised by
existing tests.
- Ran all core and exhaustive tests.
Change-Id: If12467a83acaeca6a127491d89291dedba91a35a
Reviewed-on: http://gerrit.cloudera.org:8080/7731
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/8752
This reverts commit dd340b8810.
This commit caused a number of issues tracked in IMPALA-6001. The
issues were due to the lack of atomicity between the catalog version
change and the addition to the delete log of a catalog object.
Conflicts:
be/src/service/impala-server.cc
Change-Id: I3a2cddee5d565384e9de0e61b3b7d0d9075e0dce
Reviewed-on: http://gerrit.cloudera.org:8080/8667
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This commit changes the way deletions are handled in the catalog and
disseminated to the impalad nodes through the statestore. Previously,
deletions of catalog objects were not explicitly annotated with the
catalog version in which the deletion occurred and the impalads were
using the max catalog version in a catalog update in order to decide
whether a deletion should be applied to the local catalog cache or not.
This works correctly under the assumption that
all the changes that occurred in the catalog between an update's min
and max catalog version are included in that update, i.e. no gaps or
missing updates. With the upcoming fix for IMPALA-5058, that constraint
will be relaxed, thus allowing for gaps in the catalog updates.
To avoid breaking the existing behavior, this patch introduces the
following changes:
* Deletions in the catalog are explicitly recorded in a log with
the catalog version in which they occurred. As before, added and deleted
catalog objects are sent to the statestore.
* Topic entries associated with deleted catalog objects have non-empty
values (besides keys) that contain minimal object metadata including the
catalog version.
* The statestore no longer uses the presence or absence of topic entry
  values to identify deleted topic entries. Deleted topic entries
  should be explicitly marked as such by the statestore subscribers
  that produce them.
* Statestore subscribers now use the 'deleted' flag to determine if a
topic entry corresponds to a deleted item.
* Impalads use the deleted objects' catalog versions when updating the
local catalog cache from a catalog update and not the update's maximum
catalog version.
Testing:
- No new tests were added as these paths are already exercised by
existing tests.
- Ran all core tests.
Change-Id: I93cb7a033dc8f0d3e0339394b36affe14523274c
Reviewed-on: http://gerrit.cloudera.org:8080/7731
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This change allows Impala to publish the IP address and port
information of KRPC services if it's enabled via the flag
use_krpc. The information is included in a new field in the
backend descriptor published as statestore updates. Scheduler
will also include this information in the destinations of plan
fragments. Also updated the mini-cluster startup script to assign
KRPC ports to Impalad instances.
This patch also takes into account a problem found in IMPALA-5795. In
particular, the backend descriptor of the coordinator may not be found
in the backend map in the scheduler if the coordinator is not an
executor (i.e. a dedicated coordinator). The fix is to also check
against the local backend descriptor.
This patch is partially based on an abandoned patch by Henry Robinson.
Testing done: ran core tests with a patch which ignores the use_krpc
flag to exercise the code in scheduler.
Change-Id: I8707bfb5028bbe81d2a042fcf3e6e19f4b719a72
Reviewed-on: http://gerrit.cloudera.org:8080/7760
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
This commit introduces a new startup option, termed 'is_executor',
that determines whether an impalad process can execute query fragments.
The 'is_executor' option determines if a specific host will be included
in the scheduler's backend configuration and hence included in
scheduling decisions.
Testing:
- Added a custom cluster test.
- Added a new scheduler test.
Change-Id: I5d2ff7f341c9d2b0649e4d14561077e166ad7c4d
Reviewed-on: http://gerrit.cloudera.org:8080/6628
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*
A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:
find . | grep "\.java\|\.py\|\.h\|\.cc" | \
xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g
along with some manual fixes.
After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/
Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Improves the memory-based admission control mechanism by
"reserving" memory on both a cluster-wide basis and also a
per-node basis. The algorithm today only accounts for
cluster-wide memory, so we may oversubscribe particular
impalads even if queries are admitted within the allowed
cluster-wide pool memory limits.
The header comment in admission-controller.h explains the
new algorithm in much more detail.
Testing notes:
The admission control functional tests exercise the max
running queries and memory limits at a high level, though
they don’t yet validate the behavior of the new per-node mem
accounting. Those tests will be written soon. For now, there
is manual test coverage of the new mem resources behavior:
a) Submitting queries with mem limits, queries use a subset
of the nodes so they compete for resources on some nodes
more than others - sanity checked via metrics
b) Submitting queries without mem limits, verified
mem_reserved appears to be as expected.
Most* EE tests were also forced to run with admission control enabled
(by manually changing gflag defaults before running), and I checked
that there was no unexpected behavior in the following scenarios:
a) a default pool with no limits (i.e. should be same
behavior as no admission control)
b) a default pool with mem resources set, and a default query
mem limit of 1g
c) a default pool with mem resources set but no default mem
limit: this case falls back to using planner estimates
during admission (supported for legacy only), so queries
with very large estimates are rejected and those tests
fail (this is expected). The new CM UI will not allow
this configuration moving forward so it should be
uncommon.
*Excluding custom_cluster and some others for unrelated issues.
The majority of regular tests passed, those that failed were not
due to AC issues.
Change-Id: Ia0d1eb8c07457cbe4b67b7f7f57136b4774720bc
Reviewed-on: http://gerrit.cloudera.org:8080/1710
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
The statestore is a little loose about when it uses 'heartbeat' and when
it uses 'keep-alive'. This patch standardises on 'heartbeat'.
Change-Id: Ic17995667e08f068b668781b60b00106f914e3e4
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5769
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
As originally envisaged, the statestore was not intended to transmit GBs
of data to large clusters. Now that it is, some changes to the design
are required to improve stability and responsiveness.
One of the most acute problems is the way that the topic updates, which
may be extremely large, were responsible for conveying failure detection
information. The delay in sending topic heartbeats (which were often
blocked behind a queue of other updates) meant that large clusters
needed to set large timeouts for every subscriber, and to reduce the
frequency at which topic updates were sent.
This change adds a second heartbeat type to the statestore which is
responsibie only for coordinating failure information between the
subscriber and the statestore. These 'keep-alive' heartbeats are sent
much more frequently, as they have a tiny payload and do very little
work. The statestore now does not use the result of topic updates to
maintain its view of which subscribers are alive or dead, but only the
results of the keep-alive heartbeats.
I've tested these changes with a 500MB catalog on a 73 node
cluster. Keep-alive frequencies are nice and stable (at 500ms out of the
box) even during the initial large topic-update distribution phase, or
after a statestore failure.
Internally, this patch changes the nomenclature for the statestore: we
stay away from 'heartbeat' (except for the failure detector, which is
heartbeat-based), and instead use 'message' as a generic term for both
keep-alive and topic update. Externally, we need to rationalise the flag
names to control both update types. The benefit of this change is that
only the keep-alive frequency matters for cluster stability, and that
depends only on the cluster size, not on the size of the catalog. So
we can set it to, e.g., 1s out of the box with some
confidence that that will work for clusters up to ~200 nodes. The topic
update frequency may actually be set aggressively, because the
statestore will naturally throttle itself during times of heavy traffic
and nothing will time out.
Change-Id: Ia447e4ebefda890a5b810a213e97f00cca499989
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4003
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5138
Adds the ability to set per-pool memory limits. Each impalad tracks the memory
used by queries in each pool; a per-pool memory tracker is added between the
per-query trackers and the process memory tracker. The current memory usage
is disseminated via statestore heartbeats (along with the other per-pool stats)
and a cluster-wide estimate of the pool memory usage is updated when topic
updates are received. The admission controller will not admit incoming
requests if the total memory usage is already over the configured pool limit.
Change-Id: Ie9bc82d99643352ba77fb91b6c25b42938b1f745
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1508
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 64a137930a318e56a7090a317e6aa5df67ea72cd)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1623
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
This patch fixes a problem observed when a subscriber was processing a
heartbeat, and while doing so tried to re-register with the
statestore. The statestore would schedule a heartbeat for the new
registration, but the subscriber would return an error, thinking that it
was still re-registering (see UpdateState() for the try_lock logic that
gave rise to this error). The statestore, upon receiving the error,
would update its failure detector and eventually mark the subscriber as
failed, unnecessarily forcing a re-registration loop.
This only regularly happens when UpdateState() takes a long time,
i.e. when a subscriber callback takes a while. This patch also adds
metrics to measure the amount of time callbacks take.
Change-Id: I157cdfd550279a6942e7ca54fe622520c8ad5dcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1574
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
(cherry picked from commit bc0a8819e754623bc9e5e5ab805369ad8381e5b9)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1610
This adds a simple admission control mechanism which is able to make
localized admission decisions and can queue some number of queries
that are not able to execute immediately. In this change, there is
a single pool and the maximum number of concurrent queries and the
maximum number of queued queries are configurable via flags, but the
data structures all support multiple pools so that we can later add
support for getting pools and per-pool configs from Yarn and Llama
configs (i.e. fair-scheduler.xml and llama-site.xml).
Each impalad keeps track of how many queries it has executing and
how many are queued, per pool. The statestore is used to disseminate
local pool statistics to all other impalads. When topic updates are
received, a new cluster-wide total number of currently executing
queries and total number of queued queries are calculated for each
pool. Those totals are used to make localized admission decisions in
the AdmissionController.
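A minimal sketch of that localized decision; names and the exact policy
details are illustrative assumptions, not the actual AdmissionController
logic:

def admission_decision(agg_num_running, agg_num_queued,
                       max_requests, max_queued):
    # Cluster-wide totals derived from the statestore topic updates are
    # compared against the pool's configured limits.
    if agg_num_running < max_requests:
        return "ADMIT"
    if agg_num_queued < max_queued:
        return "QUEUE"
    return "REJECT"

print(admission_decision(4, 0, max_requests=5, max_queued=10))   # ADMIT
print(admission_decision(5, 3, max_requests=5, max_queued=10))   # QUEUE
print(admission_decision(5, 10, max_requests=5, max_queued=10))  # REJECT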
There are a number of per-pool metrics which are used in automated
testing (in a separate commit), and there are many assertions and debug
log statements which I've been using during manual testing.
Change-Id: I68f92c789108336fca33c2148a4e14534c77e9f0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1347
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit e52b528b8a2fa23585510eab916ecb41da82d24b)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1302
* Statestore is now one word, without camelcase, everywhere. Previous
names included StateStore, state-store and state_store,
variously. The only exception is a couple of flags that have
'state_store', and can't be changed for compatibility reasons.
* File names are also changed to reflect the standard naming.
* Most comments are now 90 chars wide (from 80 before)
Change-Id: I83b666c87991537f9b1b80c2f0ea70c2e0c07dcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1225
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins