DRAFT! This is a work-in-progress and is not complete!!!!
(Using NFS v4 Client Replication Failover with GPFS filesystems.)
AIX diskless nodes depend on their service nodes for many services: bootp, tftp, default gateway, name serving, NTP, etc. The most significant service is NFS, which is used to access OS files, statelite data, and paging space. This document describes how to use GPFS and NFSv4 client replication failover support to provide continuous operation of the full HPC cluster if the NFS services provided by a service node become unavailable, whether due to failure of that service node or for other reasons.
During normal cluster operation, if a compute node is no longer able to access its NFS server, it will fail over to the configured replica backup server. Since both NFS servers use GPFS to back the filesystem, the replica server can continue to serve identical data to the compute node.
Storage Setup Configuration 1
Storage Setup Configuration 2
Recommendations for the GPFS setup
Layout of the file systems on the external disks:
There are a few components that normally run on the service nodes that under certain circumstances need access to the application GPFS cluster. Since a service node can't be (directly) in 2 GPFS clusters at once, some changes in the placement or configuration of these components must be made, now that the service nodes are in their own GPFS cluster. The components that can be affected by this are:
There are a few different ways to satisfy these requirements:
Similarly, if you need to stop and restart GPFS on a service node, make sure to stop/start these services in the following order:
exportfs -ua
exportfs -a
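For example, a minimal sketch of one such stop/start sequence, assuming the default GPFS commands and the standard AIX NFS subsystem group (adjust for your environment):
# stopping: quiesce NFS before taking GPFS down
exportfs -ua
stopsrc -g nfs
mmshutdown
# starting: bring GPFS back before serving NFS again
mmstartup
mmgetstate     # wait until the node state is "active"
startsrc -g nfs
exportfs -a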
The paging space currently does not support NFSv4 client replication failover. This may cause problems if the primary service node goes down and the compute node requires paging to remain operational.
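A quick, hedged way to gauge the exposure is to check whether the compute nodes are actively paging ('compute' is the nodegroup name used in the examples below):
xdsh compute "lsps -s" | xcoll     # a non-zero %Used means the node is relying on paging space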
There is currently an issue with using NFSv4 client replication failover for read/write files, even when GPFS is ensuring that the files are identical regardless of which SN they are accessed from. A small timing window exists in which the client sends a request to update a file and the server applies the update, but crashes before it sends the acknowledgement to the client. When the client fails over to the other server (which has the updated file thanks to GPFS) and resends the update request, the client detects that the modification time it expects differs from the one the server reports. It then bails out, marking the file "dead" until the client closes and reopens the file. This is a precaution, because the NFS client has no way of verifying that this is the exact same file it updated on the other server. AIX development is sizing a configuration option that would tell the client not to mark the file dead in this case, because GPFS is ensuring the consistency of the files between the servers.
Note - we have not yet directly experienced this condition in any of our testing.
xCAT 2.7.2 including the following code updates:
Base: AIX 7.1.D (7.1.1.0)
Initial code drop of STNFS failover support:
**(AIX CMVC defect 816890)**
**HPCstnfs.111202.epkg.Z**
STNFS Patches from Duen-wen to fix hang with 'ls /etc/nfs' downloaded from ausgsa:
/usr/lib/drivers/stnfs.ext
/usr/lib/ras/autoload/stnfs64.kdb
STNFS Patch from Duen-wen to fix I/O errors from 'ls -lR /' after failover downloaded from ausgsa:
**(AIX CMVC defect 822215):**
/usr/lib/drivers/stnfs.ext
NFS Patches from Duen-wen to fix access failures to libC in /usr filesystem downloaded from ausgsa:
**(AIX CMVC defect 826634):**
/usr/lib/drivers/nfs.ext
/usr/lib/drivers/nfs.netboot.ext
NIM patch to turn off TCB-enabled during SPOT build (locally modified on EMS by Linda Mellor based on instructions from Paul Finley). This is ONLY required for sharedinstall=all (not needed for sharedinstall=sns):
**(AIX CMVC defect 824583):**
/usr/lpp/bos.sysmgt/nim/methods/c_instspot
NOTE: All STNFS/NFS defects are fixed and will be shipped in AIX 7.1.F (7.1.2, GA 5/2012). We will need to work with AIX support if efixes need to be built for a different version of AIX.
EMS:
SNs:
CNs:
domain:
network defs:
osimage for compute nodes:
**Note**: If you are starting over with a new cluster, refer to the
https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster
document for details on how to install an xCAT EMS and service nodes (SNs).
Do not remove any xCAT or NIM definitions on the EMS.
Do not remove any postscripts or statelite information from the EMS.
In the following example, "compute" is the name of an xCAT node group containing all the cluster compute nodes.
_**xdsh compute "/usr/sbin/shutdown -F &"**_
The following command will remove all the NIM client definitions from both primary and backup service nodes. See the rmdsklsnode man page for additional details.
_**rmdsklsnode -V -f compute**_
The existing NIM resources need to be removed on each service node (with the original /install filesystem still in place).
In the following example, "service" is the name of the xCAT node group containing all the xCAT service nodes, and "<osimagename>" should be substituted with the actual name of an xCAT osimage object.
_**rmnimimage -V -f -d -s service <osimagename>**_
See rmnimimage for additional details.
When this command is complete it would be good to check the service nodes to make sure there are no other NIM resources still defined. For each service node (or from the EMS with 'xdsh service'), run lsnim to list any NIM resources that remain. Remove any leftover resources that are no longer needed (you should NOT remove basic NIM resources such as master, network definitions, etc.).
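For example, a hedged sketch of checking and cleaning up (the resource name is a placeholder):
xdsh service "lsnim -c resources" | xcoll      # list the NIM resources remaining on each SN
nim -o remove <resource_name>                  # run on the SN to remove an individual leftover resource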
On each service node, clean up the NFS exports.
Re-do the exports
exportfs -ua
exportfs -a (if there are any entries left in /etc/exports)
On each service node, deactivate (in whatever way you choose: rename, overmount, etc.) the local /install filesystem and activate the GPFS shared /install filesystem. Depending on how the /install filesystem was originally created, this may also require updates to /etc/filesystems.
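One possible way to do this, assuming you want to keep (rename) the local filesystem and the GPFS filesystem was created with /install as its default mount point; treat this as a sketch rather than the only supported procedure:
umount /install                     # take the local /install offline
chfs -m /install.local /install     # move the local mount point aside in /etc/filesystems
chfs -A no /install.local           # don't auto-mount the local copy at boot
mmmount /install                    # mount the shared GPFS filesystem at /install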
The contents of the backed up local /install, EXCEPT for /install/nim, must be copied back to the shared /install directory in GPFS.
Since the other contents of the /install directory should be the same on all SNs, you can just log in to one SN and use rsync to copy the files and directories to the shared file system (an example is sketched below).
_**ssh <targetSN>**_
rsync .......????...
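A hedged example of what that rsync might look like, assuming the local copy was renamed to /install.local as in the sketch above (verify the paths before running):
rsync -av --exclude=/nim /install.local/ /install/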
If your cluster was set up with NFSv3, you will need to convert all existing NIM images to NFSv4. On the EMS, for each OS image definition, run:
_**mknimimage -u <osimage_name> nfs_vers=4**_
On each service node, the AIX OS startup order has to be changed to start GPFS before NFS. Edit /etc/inittab on each service node.
_**vi /etc/inittab**_
Move the call to /etc/rc.nfs to AFTER the start of GPFS, making sure GPFS is active before starting NFS.
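As an illustration only (the exact entries and run levels in your /etc/inittab may differ), the result should have the GPFS entry ahead of the NFS entry, similar to:
mmfs:2:once:/usr/lpp/mmfs/bin/mmautoload >/dev/console 2>&1
rcnfs:23456789:wait:/etc/rc.nfs > /dev/console 2>&1 # Start NFS Daemons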
On each service node, the AIX OS shutdown order has to be changed to shut down the NFS server before GPFS, so that NFS doesn't keep trying to serve files backed by GPFS. Add the following to /etc/rc.shutdown on each service node:
_**vi /etc/rc.shutdown and add:**_
_**stopsrc -s nfsd**_
_**exit 0**_
Verify that the following attributes and values are set in the xCAT site definition:
nameservers="<xcatmaster>"
domain=<domain_name> (this is required by NFSv4)
useNFSv4onAIX="yes"
sharedinstall="sns"
You could set these values using the following command:
_**chdef -t site nameservers="<xcatmaster>" domain=mycluster.com useNFSv4onAIX="yes" sharedinstall="sns"**_
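You can verify the settings afterwards, for example:
lsdef -t site -i nameservers,domain,useNFSv4onAIX,sharedinstall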
Verify that all required software and updates are installed.
If you intend to define dump resources for your compute nodes then make sure you have installed the prerequisite software. See [XCAT_AIX_Diskless_Nodes#ISCSI_dump_support] for details.
Verify that all required software and updates are installed.
You can use the updatenode command to update the SNs.
If you intend to define dump resources for your compute nodes then make sure you have installed the prerequisite software. See [XCAT_AIX_Diskless_Nodes#ISCSI_dump_support] for details.
Create node groups for each primary SN
[How_are_we_assigning_nodes_to_primary_and_backup_SNs????]
The following example assumes you are using a 'compute' nodegroup entry in your xCAT postscripts table.
_**chdef -t group compute -p postscripts=setupnfsv4replication**_
The "servicenode" attribute values must be the names of the service nodes as they are known by the EMS. The "xcatmaster" attribute value must be the name of the primary server as known by the nodes.
_**chdef -t node -o <SNgroupname> servicenode=<primarySN>,<backupSN> xcatmaster=<nodeprimarySN>**_
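For example, using the service node names from the test cluster shown later in this document (the group name and the -hf0 interface name are illustrative):
chdef -t node -o c250f10c12ap01nodes servicenode=c250f10c12ap01,c250f10c12ap17 xcatmaster=c250f10c12ap01-hf0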
- create or update the NIM installp_bundle files that you wish to use with the new osimage
- copy in any changes to the bundle files shipped with xCAT
create NIM installp_bundle resources
_These are the bundles I used:_
nim -Fo define -t installp_bundle -a location=/install/nim/installp_bundle/xCATaixCN71.bnd \
    -a server=master xCATaixCN71
nim -Fo define -t installp_bundle -a location=/install/nim/installp_bundle/xCATaixHFIdd.bnd \
    -a server=master xCATaixHFIdd
nim -Fo define -t installp_bundle -a location=/install/nim/installp_bundle/IBMhpc_base.bnd \
    -a server=master IBMhpc_base
nim -Fo define -t installp_bundle -a location=/install/nim/installp_bundle/IBMhpc_all.bnd \
    -a server=master IBMhpc_all
ex. "mknimimage -V -r -D -s 71Dsp3_lpp_source -t diskless 71Dsp3tst installp_bundle="xCATaixCN71" configdump=selective"
- add additional software, updates, ifixes, etc. to the lpp_source
- update the spot
- create a dump resource and add it to the osimage def (optional)
- specify a configdump value (optional)
ex. mknimimage -V -u 71Dsp3tst
create NIM lpp_source for image
- add all HPC software to the lpp_source
- add all efixes to the lpp_source:
# Base STNFS failover support:
cp /xcat/stnfs/HPCstnfs.111202.epkg.Z /install/nim/lpp_source/71Dtst_lpp_source/emgr/ppc
nim -Fo check 71Dtst_lpp_source
create osimage on MN (mknimimage)
"mknimimage -t diskless -r -D -s 71D_lpp_source 71Dcompute otherpkgs=”HPCstnfs.111202.epkg.Z” _<other mknimimage input as needed>>_"
_My actual command:_
mknimimage --force -V -r -s 71Dtst_lpp_source -t diskless 71Dtst \
    installp_bundle="xCATaixCN71,xCATaixHFIdd,IBMhpc_base,IBMhpc_all" \
    otherpkgs="HPCstnfs.111202.epkg.Z" synclists=/install/custom/aix/compute.synclist
Note - there is a known xCAT bug: when you use multiple installp_bundle files with mknimimage, the rpm.rte lpp MUST be listed in the first bundle file you specify to xCAT (i.e. xCATaixCN71). The rpm command is needed by later steps in mknimimage to install rpms into the image.
- add custom patches to the spot:
cp /xcat/stnfs/stnfs.ext /install/nim/spot/71Dtst/usr/lib/drivers/stnfs.ext
cp /xcat/stnfs/stnfs64.kdb /install/nim/spot/71Dtst/usr/lib/ras/autoload/stnfs64.kdb
cp /xcat/stnfs/nfs.ext /install/nim/spot/71Dtst/usr/lib/drivers/nfs.ext
cp /xcat/stnfs/nfs.netboot.ext /install/nim/spot/71Dtst/usr/lib/drivers/nfs.netboot.ext
nim -Fo check 71Dtst
- verify the efixes applied to the spot:
xcatchroot -i 71Dtst 'emgr -l'
- the statelite table statemnt entry MUST use $noderes.xcatmaster (required)
do statelite setup
- The admin must create the persistent directory in the shared filesystem on the service nodes and add it to /etc/exports. We recommend creating it in GPFS under /install (for example, /install/statelite_data), since xCAT mkdsklsnode will NFSv4 export /install with the correct replica info for you.
The statelite table should be set up so that each service node is the NFS server for its compute nodes. You should use the "$noderes.xcatmaster" substitution string instead of specifying the actual service node so that when xCAT changes the service node database values for the compute nodes during an snmove operation, this table will still have correct information. It should look something like:
#node,image,statemnt,mntopts,comments,disable
"compute",,"$noderes.xcatmaster:/install/statelite_data",,,
Reminder: if you have an entry in your litefile table for persistent AIX logs, you MUST redirect your console log to another location, especially in this environment. The NFSv4 client replication failover support logs messages during failover, and if the console log location is in a persistent directory that is actively failing over, you can hang the failover. If you have an entry in your litefile table similar to:
tabdump litefile
#image,file,options,comments,disable
:
"ALL","/var/adm/ras/","persistent","for GPFS",
be sure that you have a postscript that runs during node boot to redirect the console log:
/usr/sbin/swcons -p /tmp/conslog
(or some other local location)
For more information, see: [XCAT_AIX_Diskless_Nodes#Preserving_system_log_files]
Note: when using a shared file system across the SNs, you must run the mkdsklsnode command on the backup SNs first and then run it for the primary SNs.
ex. mkdsklsnode -V -S -b -i 71Dsp3tst c250f10c12ap02-hf0
verify - check nim setup on backup SN
run mkdsklsnode for compute nodes
- first, make sure /etc/exports on the service nodes does not contain any old entries. If it does, remove them and run 'exportfs -ua'
- mkdsklsnode -S -V -i 71Dcompute compute
notes:
- use the -S flag to set up the NFSv4 replication settings on the SNs
- the site.sharedinstall value tells us to do the primary only and to copy resources to one SN (shared file system)
ex. mkdsklsnode -V -S -p -i 71Dsp3tst c250f10c12ap02-hf0
verify - check nim setup on primary SN
verify the NFSv4 replication is exported correctly for your service node pairs:
xdsh service cat /etc/exports | xcoll
====================================
c250f10c12ap01
====================================
/install -replicas=/install@20.10.12.1:/install@20.10.12.17,vers=4,rw,noauto,root=*
====================================
c250f10c12ap17
====================================
/install -replicas=/install@20.10.12.17:/install@20.10.12.1,vers=4,rw,noauto,root=*
_**rbootseq compute hfi**_
_**rpower compute on**_
- NFSv4 replication
- dump device configuration
verify NFSv4 replication is configured correctly on the compute node:
xdsh <node> nfs4cl showfs
xdsh <node> nfs4cl showfs /usr
- simple test of NFSv4 client replication failover:
s1 = service node 1
s2 = service node 2
c1 = all compute nodes managed by s1, backup s2
c2 = all compute nodes managed by s2, backup s1
xdsh c1,c2 nfs4cl showfs | xcoll
_# should show c1 filesystems served by s1 and c2 filesystems served by s2_
xdsh s1 stopsrc -s nfsd
xdsh c1,c2 ls /usr | xcoll
xdsh c1,c2 nfs4cl showfs | xcoll
_# should show all nodes getting /usr from s2 now (depending on NFS caching, it may take additional activity on the c1 nodes to have all filesystems failover to s2)_
**TESTING NOTE**: At this point, you can restart NFS on s1. You can continue testing by shutting down NFS on s2 and watching all nodes fail over to s1. Once NFS is back up on both service nodes, over time, the clients should eventually switch back to using their primary server.
Use the node groups that were created for each primary SN to simplify the process.
_**snmove -V <targetnodegroup>**_
On the MN:
- node defs (lsdef <node>) - check the servicenode and xcatmaster values
On the nodes:
- default gateway (netstat -nr)
- dump device (sysdumpdev) - list dump target info???
- statelite tables - in root dir
- .client_data files -? in s-r/node/c-d
- /etc/xcatinfo file
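A hedged set of commands covering most of this checklist (adjust the node names as needed):
lsdef <node> -i servicenode,xcatmaster      # on the EMS
xdsh <node> netstat -nr                     # default gateway should now point at the new SN
xdsh <node> sysdumpdev -l                   # dump device / target info
xdsh <node> cat /etc/xcatinfo               # the server recorded here should now be the new SN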
_**xdsh compute "shutdown -F &"**_
_**rpower compute on**_