{{:Design Warning}}
HA Service Nodes for AIX - using External NFS Server
NOTICE: This document is a work in progress!
An AIX diskless node depends on filesystems NFS mounted from a server. Normally this server is the NIM master.
In a hierarchical cluster the service node is the NIM master. If the NFS service on a service node fails, or the service node itself goes down, all the compute nodes served by that service node will be taken down immediately.
This document illustrates how to use an external highly available NFS server together with a backup service node to provide enhanced availability for cluster compute nodes.
The external NFS server must be set up with HA capability, for example, SONAS or GPFS CNFS.
The core idea of this external NFS server support is to redirect all the NFS mounts on the stateless/statelite nodes to an external NFS server. The external NFS server can be any configuration, as long as it provides HA NFS capability and sufficient performance to handle thousands of NFS mounts.
If a service node fails, the compute nodes served by it will not be affected immediately, and applications can continue to run on the compute nodes. However, to complete the transfer to the backup service node you must run the xCAT "snmove" command as soon as the service node failure is detected.
This is an example of the basic steps you could perform to set up an HA service node (HASN). These steps should be integrated into the process described in the hierarchical cluster document ([Setting_Up_an_AIX_Hierarchical_Cluster]).
Details for the example:
MN: xcatmn1
SNs: xcatsn11, xcatsn12
CNs: compute01 - compute20
networks:
- ethernet for MN and SN installs
- HFI for SN and CN installs
groups:
- xcatsn11nodes
- xcatsn12nodes
Set the node attribute “nfsserver” to hostname of the external NFS server. Example:
_**chdef node01 nfsserver=extnfs01**_
Add the service node values to the "servicenode" and "xcatmaster" attributes of the node definitions.
Note:
xcatmaster: The hostname of the xCAT service node as known by the node.
servicenode: The hostname of the xCAT service node as known by the management node.
To specify a backup service node you must specify a comma-separated list of two service nodes for the "servicenode" value. The first one will be the primary and the second will be the backup for that node.
For the "xcatmaster" value you should only include the primary name of the service node as known by the node.
In the following example "xcatsn1,xcatsn2" are the names of the service nodes as known by the xCAT management node and "xcatsn1-hfi0" is the name of the primary service node as known by the nodes.
_**chdef node01 servicenode="xcatsn1,xcatsn2" xcatmaster="xcatsn1-hfi0"**_
See [Setting_Up_an_AIX_Hierarchical_Cluster#Using_a_backup_service_node] for more information on setting up primary and backup service nodes for the compute nodes.
Note: When using backup service nodes you should consider splitting the compute nodes between the two service nodes for performance reasons.
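For example, the compute nodes from the example details above could be split between the two service nodes. (The node ranges and the HFI hostnames `xcatsn11-hfi0`/`xcatsn12-hfi0` are assumptions based on this document's naming; adjust for your cluster.)

```shell
# First half: xcatsn11 is primary, xcatsn12 is backup
chdef compute01-compute10 servicenode="xcatsn11,xcatsn12" xcatmaster=xcatsn11-hfi0

# Second half: xcatsn12 is primary, xcatsn11 is backup
chdef compute11-compute20 servicenode="xcatsn12,xcatsn11" xcatmaster=xcatsn12-hfi0
```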
Create an xCAT group for the nodes that are assigned to each SN. This will be useful when setting node attributes as well as providing an easy way to switch a set of nodes back to their original server.
To create an xCAT node group (SN1nodes) containing all the nodes that have the service node "xcatsn1" you could run a command similar to the following.
mkdef -t group -o SN1nodes -w servicenode=xcatsn1
AIX statelite is the shared_root stateless configuration plus an "overlay" of specific files or directories, defined by the information in the statelite, litefile, and litetree tables.
Put the statelite persistent directory in a /install subdirectory.
The statelite persistent files may be written to frequently by the AIX diskless clients, so it is important to keep the statelite directories synchronized between the service nodes.
The statelite files and directories must be kept synchronized with the backup service node, for example, by a periodic cron job.
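A minimal sketch of such a synchronization job, assuming the statelite persistent directory is /install/statelite and the backup SN is reachable as xcatsn12 (both are assumptions to adjust for your cluster):

```shell
# Mirror the statelite persistent data to the backup SN.
# Could be run from root's crontab on the primary SN, e.g. every 10 minutes:
#   */10 * * * * /usr/local/sbin/sync_statelite >/dev/null 2>&1
rsync -a --delete /install/statelite/ xcatsn12:/install/statelite/
```

Note that rsync only narrows the window of data loss; files written between synchronization runs would still be lost on a failover.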
Example table entries are needed for the statelite, litefile, and litetree tables. See the statelite documentation for details.
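As a rough sketch, the entries could be set with chtab. The hostnames, paths, and values below are hypothetical; verify the actual column names with "tabdump -d statelite" (and likewise for litefile and litetree) before using them:

```shell
# Hypothetical statelite persistent mount served by the external NFS server
chtab node=compute statelite.statemnt=extnfs01:/nodedata

# Hypothetical litefile entry: keep /etc/basecust persistent for all images
chtab image=ALL litefile.file=/etc/basecust litefile.options=persistent

# Hypothetical litetree entry pointing at a directory on the external server
chtab priority=1 litetree.directory=extnfs01:/litetree
```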
- Create the NIM resources (spot, shared_root, etc.) by creating the xCAT osimage. For example:

_**mknimimage -V -t diskless -r -s /myimages 71dskls**_
The paging space is mounted from the service nodes, and the AIX diskless clients may write to it. If the paging space becomes unavailable while a diskless client is using it, the operating system may crash immediately. So the directory containing the paging spaces must also be synchronized between the service nodes.
- A dump resource can be set up on the backup service node (SN2), but the node will have to be rebooted before it will dump to the backup SN.
AIX allows a host to discover if one of its gateways is down (called dead gateway detection) and, if so, choose a backup gateway, if one has been configured. Dead gateway detection can be passive (the default) or active.
In passive mode, a host looks for an alternate gateway when normal TCP or ARP requests notice the gateway is not responding. Passive mode provides a "best effort" service and requires very little overhead, but there may be a lag of several minutes before a dead gateway is noticed.
In active mode, the host periodically pings the gateways. If a gateway fails to respond to several consecutive pings, the host chooses an alternate gateway. Active mode incurs added overhead, but maintains high availability for your network.
- In a postscript, create an additional static route using the backup SN as the gateway. If the primary SN crashes, AIX DGD switches to the backup gateway (SN2). This could be done with the makeroutes command or a new postscript.
Using makeroutes to set up multiple default gateways seems like the logical thing to do; however, there are some issues.
For example, a "backupgateway" postscript (name still to be decided) could:

- create the backup static route, using the backup SN as the gateway, so that if the primary SN crashes AIX DGD switches to the backup gateway (SN2)
- The tuning and options for DGD still need to be investigated: active vs. passive mode, and tunables such as dgd_retry_time.
    if test -z "`lsdev -C -S1 -F name -l inet0`"
    then
        mkdev -t inet
    fi
    chdev -l inet0 -a route=net,-hopcount,0,-netmask,<mask>,-if,<int>,-active_dgd,-static,<netname>,<gateway>

Example values: mask = 255.255.0.0, int = en0, netname = 9.114.47.0, gateway = 9.114.154.126
Use the "no" command to set the DGD tunables.
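For example, the DGD-related network tunables can be adjusted as follows (the values shown are only illustrative, not recommendations; check the AIX "no" documentation for the defaults and valid ranges):

```shell
# Enable passive DGD globally; -p makes the change persist across reboots
no -p -o passive_dgd=1

# Tunables used by DGD (illustrative values)
no -p -o dgd_packets_lost=3
no -p -o dgd_ping_time=5
no -p -o dgd_retry_time=5
```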
Set up monitoring for the service nodes ([Monitor_and_Recover_Service_Nodes#Monitoring_Service_Nodes]) so that when any of the service nodes fails, you will get a notification.
Run mkdsklsnode with the new "-s | --setuphanfs" flag. Do not specify the "-p|--primarySN" or "-b|--backupSN" flags with the mkdsklsnode command, because we want the /install file systems on the service node pairs to be identical.
- Configure both service nodes in a pair concurrently.
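A sketch of this step, assuming the osimage 71dskls from the earlier example and the node groups defined above (the exact option syntax for the new flag should be verified with "mkdsklsnode -h"):

```shell
# Initialize the diskless nodes, setting up HA NFS on both SN pairs
mkdsklsnode -V --setuphanfs -i 71dskls xcatsn11nodes
mkdsklsnode -V --setuphanfs -i 71dskls xcatsn12nodes
```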
Run rnetboot or rbootseq/rpower to start the diskless boot
After the service node is down, hardware control through that service node will no longer work. The user needs to change the hardware control point's (hcp's) servicenode attribute and rerun the mkhwconn command.
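For example, if a hardware control point was being managed through the failed service node (the node name cec01 is hypothetical):

```shell
# Point the hardware control point at the backup SN
chdef cec01 servicenode=xcatsn12

# Rebuild the hardware connections through the new SN
mkhwconn cec01 -t
```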
The NFSv4 mounts on the compute nodes will fail over to the standby service node transparently and automatically.
As soon as the failover is detected, you must run the xCAT snmove command.
- If any non-critical services, such as conserver, monserver, and ntpserver, are not configured for automatic failover between the service nodes, move them to the standby service node manually.
- Run snmove to re-target the nodes and update the xCAT database.
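For example, using the node group for the failed service node defined earlier (verify the available snmove options against the man page):

```shell
# Move the nodes served by the failed xcatsn11 over to their backup SN
snmove xcatsn11nodes -V
```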
Additional items to handle during the move:
- reset /etc/xcatinfo on the node
- syslog/errlog configuration?
- LL (LoadLeveler) configuration file (switch the order of the SNs?)
1) Name Resolution
Set by the snmove command? (To be confirmed.)
Both the primary and backup service nodes could be specified in /etc/resolv.conf on the compute nodes to provide nameserver redundancy. However, because a dead nameserver can degrade hostname resolution performance, using /etc/hosts for hostname resolution is the better approach for the service node redundancy scenario.
2) Conserver
Set by the snmove command.
After the service node is down, rcons for the compute nodes will not work. The user needs to change the compute nodes' conserver attribute and rerun the makeconservercf command.
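For example, using the group of nodes that were served by the failed xcatsn11:

```shell
# Point the consoles at the backup SN and regenerate the conserver config
chdef xcatsn11nodes conserver=xcatsn12
makeconservercf xcatsn11nodes
```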
3) Monserver
After the service node is down, monitoring through that service node will no longer work. The user needs to change the compute nodes' monserver attribute and rerun the monitoring configuration procedure.
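A sketch of that procedure, assuming the rmcmon monitoring plug-in is in use (both the plug-in name and the monserver value format are assumptions to verify for your setup):

```shell
# Point monitoring at the backup SN
chdef xcatsn11nodes monserver=xcatsn12

# Reconfigure the monitoring plug-in for those nodes, including the remote SN
moncfg rmcmon xcatsn11nodes -r
```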
- How do you force the failover back to the original SN? Presumably by running snmove again (possibly with a special option?) to re-target the nodes to the primary SN.