---
title: Monitoring the health of your cluster nodes with Node Eligibility Service
shortTitle: Node Eligibility Service
intro: 'You can monitor when nodes in a {% data variables.product.product_name %} cluster have been offline long enough to cause issues by using {% data variables.product.prodname_nes %}.'
permissions: People with administrative SSH access to a {% data variables.product.product_name %} instance can monitor cluster nodes.
product: '{% data reusables.gated-features.cluster %}'
type: how_to
---
## About {% data variables.product.prodname_nes %}
In a {% data variables.product.product_name %} cluster, an individual node may become unreachable by other nodes due to a hardware or software failure. Even if you later restore the node's health, the subsequent synchronization of data can negatively impact your instance's performance.
You can proactively mitigate the impact of reduced node availability by using {% data variables.product.prodname_nes %}. This service monitors the state of your cluster's nodes and emits a warning if a node has been offline for too long. You can also prevent an offline node from rejoining the cluster. Optionally, you can allow {% data variables.product.prodname_nes %} to take ineligible nodes offline.
By default, {% data variables.product.prodname_nes %} is disabled. If you enable {% data variables.product.prodname_nes %}, your instance will alert you of unhealthy nodes by displaying a banner in the administrative web UI for {% data variables.product.product_name %}, and in CLI output for some cluster-related utilities, such as `ghe-config-apply` and `ghe-cluster-diagnostics`.
{% data variables.product.prodname_nes %} allows you to monitor the health of individual nodes. You can also monitor the overall health of your cluster. For more information, see "AUTOTITLE."
## About health and eligibility of cluster nodes
To determine whether to emit a warning or automatically adjust the configuration of your cluster, {% data variables.product.prodname_nes %} continuously monitors the health of each node. Each node regularly reports a timestamped health state, which {% data variables.product.prodname_nes %} compares to a Time To Live (TTL) duration.
Each node has a health state and an eligibility state.
- **Health** refers to the accessibility of the node within the cluster and has three possible states: `healthy`, `warning`, or `critical`.
- **Eligibility** refers to the ability of the node to work in the cluster and has two possible states: `eligible` or `ineligible`.
{% data variables.product.prodname_nes %} provides a configurable TTL setting for two states, `warn` and `fail`.

- `warn`: The node has been offline for a short period of time. This may indicate something is wrong with the node and that administrators should investigate. The default setting is 15 minutes.
- `fail`: The node has been offline for a long period of time, and reintroduction into the cluster could cause performance issues due to resynchronization. The default setting is 60 minutes.
For each node, {% data variables.product.prodname_nes %} determines health and eligibility for participation in the cluster in the following ways.
- If a node has been observed to be healthy, the health state is `healthy` and the eligibility state is `eligible`.
- If a node hasn't been observed to be healthy for longer than the `warn` TTL, the health state is `warning` and the eligibility state is `eligible`.
- If a node hasn't been observed to be healthy for longer than the `fail` TTL, the health state is `critical` and its eligibility state is `ineligible`.
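The decision rules above can be sketched as a small shell function. This is an illustrative sketch only, not the actual {% data variables.product.prodname_nes %} implementation: it maps the number of minutes since a node last reported healthy to its health and eligibility states, using the default TTLs.

```shell
# Illustrative sketch (not the actual NES implementation): classify a node
# from the minutes elapsed since it last reported healthy.
WARN_TTL=15   # minutes (default warn TTL)
FAIL_TTL=60   # minutes (default fail TTL)

classify() {
  local offline_minutes="$1"
  if [ "$offline_minutes" -ge "$FAIL_TTL" ]; then
    echo "critical ineligible"
  elif [ "$offline_minutes" -ge "$WARN_TTL" ]; then
    echo "warning eligible"
  else
    echo "healthy eligible"
  fi
}

classify 30
```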
## Enabling {% data variables.product.prodname_nes %} for your cluster
By default, {% data variables.product.prodname_nes %} is disabled. You can enable {% data variables.product.prodname_nes %} by setting the value for `app.nes.enabled` using `ghe-config`.
1. {% data reusables.enterprise_installation.ssh-into-cluster-node %}
1. To verify whether {% data variables.product.prodname_nes %} is currently enabled, run the following command.

   ```shell
   ghe-config app.nes.enabled
   ```
1. To enable {% data variables.product.prodname_nes %}, run the following command.

   ```shell
   ghe-config app.nes.enabled true
   ```
1. {% data reusables.enterprise.apply-configuration %}
1. To verify that {% data variables.product.prodname_nes %} is running, from any node, run the following command.

   ```shell
   nomad status nes
   ```
## Configuring TTL settings for {% data variables.product.prodname_nes %}
To determine how {% data variables.product.prodname_nes %} notifies you, you can configure TTL settings for `fail` and `warn` states. The TTL for the `fail` state must be higher than the TTL for the `warn` state.
1. {% data reusables.enterprise_installation.ssh-into-cluster-node %}
1. To verify the current TTL settings, run the following command.

   ```shell
   nes get-node-ttl all
   ```
1. To set the TTL for the `fail` state, run the following command. Replace MINUTES with the number of minutes to use for failures.

   ```shell
   nes set-node-ttl fail MINUTES
   ```
1. To set the TTL for the `warn` state, run the following command. Replace MINUTES with the number of minutes to use for warnings.

   ```shell
   nes set-node-ttl warn MINUTES
   ```
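Because the `fail` TTL must stay higher than the `warn` TTL, it can help to validate both values together. The following is a hypothetical helper, not part of {% data variables.product.prodname_nes %}: it checks the ordering and echoes the corresponding `nes set-node-ttl` commands rather than executing them, so the sketch is safe to run anywhere.

```shell
# Hypothetical helper (not part of NES): validate that the fail TTL exceeds
# the warn TTL, then print the commands that would apply both settings.
set_ttls() {
  local warn_minutes="$1" fail_minutes="$2"
  if [ "$fail_minutes" -le "$warn_minutes" ]; then
    echo "error: fail TTL ($fail_minutes) must be higher than warn TTL ($warn_minutes)" >&2
    return 1
  fi
  # Echo rather than execute, so the sketch has no side effects.
  echo "nes set-node-ttl warn $warn_minutes"
  echo "nes set-node-ttl fail $fail_minutes"
}

set_ttls 15 60
```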
## Managing whether {% data variables.product.prodname_nes %} can take a node offline
By default, {% data variables.product.prodname_nes %} provides alerts to notify you about changes to the health of cluster nodes. Optionally, if the service determines that an unhealthy node is ineligible to rejoin the cluster, you can allow the service to take the node offline.
When a node is taken offline, the instance removes job allocations from the node. If the node runs data storage services, {% data variables.product.prodname_nes %} updates the configuration to reflect the node's ineligibility to rejoin the cluster.
To manage whether {% data variables.product.prodname_nes %} can take a node and its services offline, you can configure `adminaction` states for the node. If a node is in the `approved` state, {% data variables.product.prodname_nes %} can take the node offline. If a node is in the `none` state, {% data variables.product.prodname_nes %} cannot take the node offline.
1. {% data reusables.enterprise_installation.ssh-into-cluster-node %}
1. To configure whether {% data variables.product.prodname_nes %} can take a node offline, run one of the following commands.

   - To allow the service to automatically take administrative action when a node goes offline, run the following command. Replace HOSTNAME with the node's hostname.

     ```shell
     nes set-node-adminaction approved HOSTNAME
     ```
   - To revoke {% data variables.product.prodname_nes %}'s ability to take a node offline, run the following command. Replace HOSTNAME with the node's hostname.

     ```shell
     nes set-node-adminaction none HOSTNAME
     ```
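The mapping between `adminaction` states and the service's permissions can be summarized in a short sketch. This helper is hypothetical, not part of {% data variables.product.prodname_nes %}; it only restates the rule above: `approved` permits the service to take the node offline, `none` does not.

```shell
# Hypothetical helper (not part of NES): report whether NES may take a node
# offline, given the node's adminaction state.
can_take_offline() {
  case "$1" in
    approved) echo "yes" ;;
    none)     echo "no" ;;
    *)        echo "unknown adminaction state: $1" >&2; return 1 ;;
  esac
}

can_take_offline approved
```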
## Viewing an overview of node health
To view an overview of your nodes' health using {% data variables.product.prodname_nes %}, use one of the following methods.
- SSH into any node in the cluster, then run `nes get-cluster-health`.
- Navigate to the {% data variables.enterprise.management_console %}'s "Status" page. For more information, see "AUTOTITLE."
## Re-enabling an ineligible node to join the cluster
After {% data variables.product.prodname_nes %} detects that a node has exceeded the TTL for the `fail` state, and after the service marks the node as ineligible, the service will no longer update the health status for the node. To re-enable a node to join the cluster, you can remove the ineligible status from the node.
1. {% data reusables.enterprise_installation.ssh-into-cluster-node %}
1. To check the current `adminaction` state for the node, run the following command. Replace HOSTNAME with the hostname of the ineligible node.

   ```shell
   nes get-node-adminaction HOSTNAME
   ```
1. If the `adminaction` state is currently set to `approved`, change the state to `none` by running the following command. Replace HOSTNAME with the hostname of the ineligible node.

   ```shell
   nes set-node-adminaction none HOSTNAME
   ```
1. To ensure the node is in a healthy state, run the following command and confirm that the node's status is `ready`.

   ```shell
   nomad node status
   ```

   - If the node's status is `ineligible`, make the node eligible by connecting to the node via SSH and running the following command.

     ```shell
     nomad node eligibility -enable -self
     ```
1. To update the node's eligibility in {% data variables.product.prodname_nes %}, run the following command. Replace HOSTNAME with the node's hostname.

   ```shell
   nes set-node-eligibility eligible HOSTNAME
   ```
1. Wait 30 seconds, then check the cluster's health to confirm the target node is eligible by running the following command.

   ```shell
   nes get-cluster-health
   ```
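The re-enablement steps above can be collected into a dry-run sketch. This is hypothetical and not part of {% data variables.product.prodname_nes %}: each command is echoed rather than executed, so you can review the sequence for a given hostname before running the commands yourself; the hostname shown is a placeholder.

```shell
# Hypothetical dry run (not part of NES) of the re-enablement sequence for
# one node. Commands are echoed, not executed, so this has no side effects.
reenable_node() {
  local hostname="$1"
  echo "nes get-node-adminaction $hostname"
  echo "nes set-node-adminaction none $hostname"
  echo "nes set-node-eligibility eligible $hostname"
  echo "nes get-cluster-health"
}

reenable_node example-node
```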
## Viewing logs for {% data variables.product.prodname_nes %}
You can view logs for {% data variables.product.prodname_nes %} from any node in the cluster, or from the node that runs the service. If you generate a support bundle, the logs are included. For more information, see "AUTOTITLE."
1. {% data reusables.enterprise_installation.ssh-into-cluster-node %}
1. To view logs for {% data variables.product.prodname_nes %} from any node in the cluster, run the following command.

   ```shell
   nomad alloc logs -job nes
   ```
1. Alternatively, you can view logs for {% data variables.product.prodname_nes %} on the node that runs the service. The service writes logs to the systemd journal.

   - To determine which node runs {% data variables.product.prodname_nes %}, run the following command.

     ```shell
     nomad job status "nes" | grep running | grep "${nomad_node_id}" | awk 'NR==2{ print $1 }' | xargs nomad alloc status | grep "Node Name"
     ```
   - To view logs on the node, connect to the node via SSH, then run the following command.

     ```shell
     journalctl -t nes
     ```