Monitor and Recover Service Nodes

Overview

Service nodes are critical in hierarchical clusters: the failure of a single service node may cause problems for hundreds of compute nodes. The cluster administrator therefore needs to monitor the service nodes closely and recover failed service nodes as soon as possible. This documentation describes how to monitor the service nodes and how to recover them when a specific service node fails.

The examples in this documentation are based on the following cluster environment:

Management Node: aixmn1(9.114.47.103) running AIX 7.1B and DB2 9.7

Service Node: aixsn1(9.114.47.115) running AIX 7.1B and DB2 9.7

Compute Node: aixcn1(9.114.47.116) running diskless AIX 7.1B

Monitoring Service Nodes

xCAT provides a monitoring plug-in infrastructure that can be used to integrate third-party monitoring software into an xCAT cluster. See [Monitoring_an_xCAT_Cluster] for more details on the xCAT monitoring infrastructure.

rmcmon is an xCAT built-in plug-in module based on IBM's Resource Monitoring and Control (RMC) subsystem, which is part of IBM's Reliable Scalable Cluster Technology (RSCT). This documentation describes how to use rmcmon, together with the RSCT conditions and responses shipped in the xCAT-rmc package, to monitor the service nodes.

In this documentation, the xCAT management node acts as the monitoring server. If you want to use separate monitoring servers, refer to [Monitoring_an_xCAT_Cluster#Define_monitoring_servers] for more details.

This documentation describes how to set up monitoring for the items listed below on the service nodes. You may not want to monitor all of them in your cluster: the network services provided by the service nodes can be customized through xCAT configuration, so some of them may not be configured on your service nodes. Customize the monitoring settings to match your cluster configuration.

1) Service nodes liveness
2) xcatd
3) Network services: named, DHCP, NFS, conserver, tftp, ftp
4) System health monitoring: memory usage, file system usage

If any of the items listed above fails on any service node, the RMC infrastructure triggers an action to notify the administrators.

Setup rmcmon

Follow the steps at [Monitoring_an_xCAT_Cluster#RMC_monitoring] to set up the rmcmon plug-in, substituting the service node hostnames for the hostnames in the commands.
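
As a rough orientation only, the command sequence for the example service node might look like the sketch below. It assumes the usual xCAT monitoring commands monadd, moncfg and monstart; the referenced page remains authoritative for the exact steps, prerequisites and options. The RUN and SN variables and the setup_rmcmon function are hypothetical helpers introduced here, not part of xCAT. By default the script only prints the commands (dry run).

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Set RUN="" on the management node to really run the commands.
RUN=${RUN-echo}

# Service node from the example environment; adjust to your noderange.
SN=${SN-aixsn1}

setup_rmcmon() {
    # Register the rmcmon plug-in with the xCAT monitoring infrastructure.
    $RUN monadd rmcmon
    # Configure rmcmon for the service node.
    $RUN moncfg rmcmon "$SN"
    # Start monitoring the service node.
    $RUN monstart rmcmon "$SN"
}

setup_rmcmon
```

Keeping the dry run as the default makes it possible to review the commands against the referenced page before anything is changed on the cluster.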

Pre-shipped Conditions for Service Nodes Monitoring

The xCAT-rmc package ships many predefined conditions for monitoring. The following conditions are specific to service node monitoring; each name indicates what the condition checks:

CheckTFTPonSN
CheckxCATonSN
CheckNAMEDonSN
CheckFTPonSN
CheckCONSonSN
CheckNTPonSN
CheckNFSonSN
CheckDHCPonSN
CheckFTPonSN_AIX

You can use lscondition <condition_name> to get more details on the condition definition.
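
To inspect all of the service node conditions in one pass, a small loop around lscondition can help. This is a minimal sketch: the RUN and SN_CONDITIONS variables and the show_sn_conditions function are hypothetical names introduced for illustration, and the script only prints the commands unless RUN is set to an empty string.

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Set RUN="" on a node where RSCT is installed to query the real definitions.
RUN=${RUN-echo}

# Service node conditions shipped with xCAT-rmc, as listed in this section.
SN_CONDITIONS="CheckTFTPonSN CheckxCATonSN CheckNAMEDonSN CheckFTPonSN CheckCONSonSN CheckNTPonSN CheckNFSonSN CheckDHCPonSN CheckFTPonSN_AIX"

show_sn_conditions() {
    # Show the definition of each condition in turn.
    for cond in $SN_CONDITIONS; do
        $RUN lscondition "$cond"
    done
}

show_sn_conditions
```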

Because service node liveness must also be monitored, the following condition is used as well:

NodeReachability

Pre-shipped Responses

The xCAT-rmc package also ships event responses that can be linked to any of the conditions. None of them is designed specifically for service node monitoring, but the following responses may be useful:

BroadcastEventsAnyTime
LogEventToxCATDatabase
LogEventToTealEvenetLog

BroadcastEventsAnyTime writes a message to all the users logged in on the management node; LogEventToxCATDatabase logs the message to the xCAT eventlog table; LogEventToTealEvenetLog logs the message to the TEAL database.

You can create your own responses to meet your event notification requirements. The following example creates a response that sends an email to the administrator's address whenever an event occurs in the cluster:

mkresponse -n "EmailAdmin" -e b -s "/opt/xcat/sbin/rmcmon/email-hierarchical-event clusteradmin@ibm.com" EmailAdminAnyTime
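
A response has no effect until it is linked to a condition. The sketch below assumes the standard RSCT commands startcondresp (link a response to a condition and start monitoring) and lscondresp (list the existing links), and uses two of the conditions named in this documentation as examples. The RUN, RESPONSE and CONDITIONS variables and the link_response function are hypothetical helpers; the conditions that apply in your cluster may differ, so check them with lscondition first. The script prints the commands by default.

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Set RUN="" on the monitoring server to really run the commands.
RUN=${RUN-echo}

# Response created by the mkresponse example above.
RESPONSE=${RESPONSE-EmailAdminAnyTime}

# Example conditions taken from this documentation; adjust as needed.
CONDITIONS="NodeReachability CheckxCATonSN"

link_response() {
    # Link the response to each condition and start monitoring it.
    for cond in $CONDITIONS; do
        $RUN startcondresp "$cond" "$RESPONSE"
    done
    # List the condition/response links to verify the result.
    $RUN lscondresp
}

link_response
```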

Service Node Liveness Monitoring

Network Services Monitoring on Service Nodes

