
Warning: the HA service node capability described in this document is not for general use. It is a complex environment that should only be used by select p775 customers running AIX who work directly with xCAT development while setting it up.
(Using AIX NFS v4 Client Replication Failover with GPFS filesystems.)
AIX diskless nodes depend on their service nodes for many services: bootp, tftp, default gateway, name serving, NTP, etc. The most significant service is NFS to access OS files, statelite data, and paging space. This document describes how to use GPFS and NFSv4 client replication failover support to provide continuous NFS operation of your HPC cluster if the NFS services provided by a service node become unavailable, whether due to failure of that service node or for other reasons.
If you experience a failure with one of your service nodes, and you have determined that the issue cannot be resolved quickly, you may then wish to use the xCAT snmove command to move the nodes managed by that service node to the defined backup service node in order to maintain full xCAT management and functionality of those nodes. The snmove command for AIX diskless nodes provides the following functions for the nodes:
The main premise of this support is to use a shared /install filesystem across all service nodes. This filesystem will reside in a separate GPFS cluster, and NFSv4 with defined replicas will be used to export that filesystem to the diskless nodes in the cluster.
Optionally, future support may include allowing the xCAT EMS to also be connected to that same /install filesystem (this is not supported yet, and this document describes only the service node configuration).
OPTIONAL: The EMS can also be attached to the external disks and included in the GPFS cluster. In this case, the /install filesystem will be common across the EMS and all service nodes.
NOTE: This option is not yet supported by xCAT.
Each service node will NFSv4 export the /install filesystem with its backup service node NFS server replica specified (automatically set in the /etc/exports file by the xCAT mkdsklsnode command).
During normal cluster operation, if a compute node is no longer able to access its NFS server, it will fail over to the configured replica backup server. Since both NFS servers use GPFS to back the filesystem, the replica server will be able to continue serving identical data to the compute node. However, there is no automatic failover capability for the dump resource - no dump capability will be available until the xCAT snmove command is run to retarget the compute node's dump device to its backup service node.
There are a few components (e.g. LoadLeveler daemons) that normally run on the service nodes that under certain circumstances need access to the application GPFS cluster. Since a service node can't be directly in 2 GPFS clusters at once, some changes in the placement or configuration of these components must be made, now that the service nodes are in their own GPFS cluster. Our recommendation is that the service nodes remotely mount the GPFS application cluster file system so that the LL daemons can still have access to this file system even though the service node is primarily in the SN GPFS cluster.
The only component this doesn't work for is the TEAL GPFS monitoring collector. This component needs to be on a node that is actually in the GPFS application cluster (remotely mounting it is not sufficient), but it also needs access to the DB. To solve this, the collector node can run the new db2driver package, which uses a minimal DB2 client to run the required ODBC interface on a diskless node. This can be a utility node that has network access to the xCAT EMS, either through appropriate routing or through an ethernet interface connected to the EMS network.
For reference, in case you want to run the components on other nodes when you switch to HA SN, here are the component requirements:
The support described in this document is for high availability of NFSv4 to your diskless nodes. You will need to review all other services that your diskless nodes require from its service node. Other items may require additional configuration to your cluster. Some things to check:
external network routing -- If your nodes require continuous access to an external network, and you have currently set up a route through the service node, this network connection will be lost if the service node fails. The xCAT snmove command will change the default route for a node to its backup service node. However, this is a manual admin operation, and the route will not be switched over until the snmove command is run. If external network access cannot be interrupted by a service node failure, you will need to implement something like the AIX dead gateway detection function (and be aware of the possible side effects, such as OS jitter or other issues, when using that support). xCAT does not provide this function.
error logging -- If your nodes are forwarding syslog messages to their service node (the default xCAT configuration), subsequent messages will be lost if the service node goes down. The xCAT snmove command will reconfigure syslog on the nodes to forward messages to the new service node.
You will need to remember that with a shared /install filesystem, the NIM files created there are visible to multiple NIM masters. xCAT code accommodates this when running NIM commands on multiple service nodes accessing the same directories and files. If you run NIM commands directly, remember that you can easily corrupt the NIM environment on another server without even realizing it. Use extreme caution when running NIM commands directly, and understand how the results of that command may affect other NIM servers accessing the identical /install/nim directories and files.
Similarly, if you need to stop and restart GPFS on a service node, make sure to stop/start these services in the following order:
exportfs -ua
exportfs -a
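For reference, here is a minimal sketch of the complete sequence on one service node. It assumes the standard GPFS mmshutdown/mmstartup commands and the AIX stopsrc/startsrc commands; verify the details against your GPFS level before using it.
exportfs -ua        # unexport all NFS filesystems first
stopsrc -s nfsd     # stop the NFS server before GPFS goes down
mmshutdown          # stop GPFS on this service node
# ... perform the maintenance that required the restart ...
mmstartup           # restart GPFS and wait for /install to be mounted
startsrc -s nfsd    # restart the NFS server
exportfs -a         # re-export the filesystems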
If the xCAT xdsh command (and some other xCAT hierarchical commands) detects that the primary service node is not available, it will use the backup service node to process the command. You will notice a performance degradation while xCAT times out trying to contact the primary service node if it is down. Also, if the primary service node is still active but has network problems communicating with its compute nodes, the NFSv4 client replication will fail over to the backup service node to maintain node operation. However, xCAT xdsh commands to those nodes will fail, since xCAT can still reach the service node but the service node can no longer reach its compute nodes.
There is currently an issue with using NFSv4 replication client failover for read/write files, even when GPFS is ensuring that the files are the same regardless of which SN they are accessed from. A small timing window exists in which the client sends a request to update a file and the server updates it, but the server crashes before it sends the acknowledgement to the client. When the client fails over to the other server (which has the updated file thanks to GPFS) and resends the update request, the client will detect that its recorded modification time for the file differs from the server's and will bail out, marking the file "dead" until the client closes and reopens the file. This is a precaution, because the NFS client has no way of verifying that this is the exact same file that it updated on the other server. AIX development is sizing a configuration option that would tell the client not to mark the file dead in this case, because GPFS is ensuring the consistency of the files between the servers.
Note - we have not yet directly experienced this condition in any of our testing.
If "site.sharedinstall=all" (currently not supported), all NIM resources on the EMS will be created directly in the GPFS filesystem, including your lpp_source and spot resources. By default, NIM resources cannot be created with associated files in a GPFS filesystem (only jsf or jsf2 filesystems are supported). To bypass this restriction all NIM commands must be run with either the environment variable "NIM_ATTR_FORCE=yes" set, or by using the 'nim -F' force flag directly on each command. All xCAT commands have been changed to accommodate this setting. However, it is often necessary for an admin to run NIM commands directly. When doing so, be sure to use one of these force options.
1) xCAT version 2.7.2 or higher.
2) Base OS: AIX 7.1 TL1 SP3 (7100-01-03-1207)
3) AIX ifixes:
IV22221m03.epkg.Z
Package Content:
IV14334: stnfs I/O error after NFSv4 replication failover
IV16758: error accessing file after NFSv4 client replication failover
IV15048: additional privileges required for rolelist command
IV22221: IMPLEMENT NFS V4 PAGING SPACE REPLICATION
IV27038: NFSV4 diskless client hang with nfsrgyd dependency
These fixes are available from the AIX download site:
ftp://public.dhe.ibm.com/aix/ifixes/
4) xCAT ifixes
SF3548436.120801.epkg.Z
A tar file (SF3548436.ifix.tar) containing this epkg file is attached to svn bug #3548436. View the svn bug report and download the tar file to get the ifix.
(https://sourceforge.net/tracker/?atid=1006945&group_id=208749&func=browse)
This ifix is for xCAT version 2.7.2. It is a cumulative ifix that includes the following previous fixes:
SF3526650.120629.epkg.Z
SF3547581.120724.epkg.Z
Make sure to read the README file that is included in the tar file before applying this fix.
This procedure assumes the following:
**Note**: If starting over with a new cluster then refer to the
[Setting_Up_an_AIX_Hierarchical_Cluster]
document for details on how to install an xCAT EMS and service nodes (SN).
Do not remove any xCAT or NIM information from the EMS.
Note: the following 2 diagrams do 'not' imply that you can have 2 separate /install file systems in GPFS within the same xCAT cluster - one for one set of service nodes and another for the other set of service nodes. All the service nodes in the xCAT cluster must use the same /install file system in the same GPFS cluster.
Storage Setup Configuration When Service Nodes Also Have Local Disks for their OS

Storage Setup Configuration When Service Nodes Do Not Have Local Disks for their OS

Perform the necessary admin steps to assign the fibre channel I/O adapter slots to the selected xCAT SN octant/LPAR (the xCAT chvm command may be used to do this). The xCAT SN LPAR and serving CEC may need to be taken down to make I/O slot changes to the xCAT SN configuration.
Ensure that the assigned SAN disks being used with the GPFS cluster can be allocated to the assigned fibre channel adapters, and that the SAN disks can be seen on the target xCAT SNs.
Recommendations for the GPFS setup
Layout of the file systems on the external disks:
For now, mount the GPFS /install filesystem on a temporary mount point on the SNs, such as /install-gpfs. This will need to be changed to /install later in the process.
Since this process will remove all existing data from the local /install/nim directories on your service nodes, you may choose to make a backup copy of the /install filesystem at this time.
The contents of the local /install filesystems on your SNs will need to be copied into the new shared GPFS /install-gpfs filesystem. You should NOT copy over the /install/nim directory -- this will need to be completely re-created in order to ensure NIM is configured correctly for each SN.
Most of the xCAT data should be identical for each local SN /install directory in your cluster. This includes sub-directories such as:
/install/custom
/install/post
/install/prescripts
/install/postscripts
It will only be necessary to copy these subdirectories from one SN into /install-gpfs. Therefore, you can just log into one SN and use rsync to copy the files and directories to the shared file system.
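For example, from one SN (a sketch; it assumes the default subdirectory names listed above):
rsync -av /install/custom /install/post /install/prescripts /install/postscripts /install-gpfs/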
You must create the directory for your persistent statelite data in the /install-gpfs filesystem. e.g. from one SN:
mkdir /install-gpfs/statelite_data
(Optional) At this time, you may choose to place an initial copy of your persistent data into the /install-gpfs filesystem. However, since the compute nodes in your cluster are currently running, they are still updating their persistent files, so you will need to re-sync this data again later after bringing down the cluster. Depending on the amount and stability of your persistent data, the subsequent rsync can take much less time and help reduce your cluster outage time.
Use rsync to do the initial copy from your current statelite directory. You should run this rsync from one SN at a time to copy data into the shared /install-gpfs filesystem. This will ensure that if you happen to have more than one SN that has a subdirectory for the same compute node, you will not run into collisions copying from multiple SNs at the same time. Make sure to use the rsync -u (update) option to ensure stale data from an older SN does not overwrite the data from an active SN.
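For example (a sketch; it assumes your current local statelite directory is also named /install/statelite_data -- substitute your actual directory names):
rsync -avu /install/statelite_data/ /install-gpfs/statelite_data/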
Note: You do not need to worry about changing /etc/exports to correctly export your statelite directory since you are placing it in /install-gpfs. Later in the process, you will rename the filesystem to /install, and xCAT will add the correct /install export for NFSv4 replication to /etc/exports when mkdsklsnode runs.
The statelite table will need to be set up so that each service node is the NFS server for its compute nodes. You should use the "$noderes.xcatmaster" substitution string instead of specifying the actual service node so that when xCAT changes the service node database values for the compute nodes during an snmove operation, this table will still have correct information. Also, you should specify NFS option "vers=4" as a mount option. The entry should look something like:
#node,image,statemnt,mntopts,comments,disable
"compute",,"$noderes.xcatmaster:/install/statelite_data","vers=4",,
REMINDER: If you have an entry in your litefile table for persistent AIX logs, you MUST redirect your console log to another location, especially in this environment. The NFSv4 client replication failover support logs messages during failover, and if the console log location is in a persistent directory that is actively failing over, you can hang your failover. If you have an entry in your litefile table similar to:
tabdump litefile
#image,file,options,comments,disable
"ALL","/var/adm/ras/","persistent","for GPFS",
Be sure that you have a postscript that runs during node boot to redirect the console log:
/usr/sbin/swcons -p /tmp/conslog
(or some other local location)
For more information, see: XCAT_AIX_Diskless_Nodes/#Preserving_system_log_files
If you have any other non-xCAT data in your local /install filesystems, you will first need to determine if this data is identical across all service nodes, or if you will need to create a directory structure to support unique files for each SN. Based on that determination, copy the data into /install-gpfs as appropriate.
nameservers= (set to null so xCAT will not create a resolv.conf resource for the nodes)
domain=<domain_name> (this is required by NFSv4)
useNFSv4onAIX="yes" (must specify NFSv4 support)
sharedinstall="sns" (indicates /install is shared across service nodes)
You could set these values using the following command:
chdef -t site nameservers="" domain=mycluster.com \
    useNFSv4onAIX="yes" sharedinstall="sns"
Also, verify that your xCAT networks table does not have a nameservers value set for your cluster network:
lsdef -t network -i nameservers
On the EMS, update the /mntdb2 entry in /etc/exports to use NFSv4:
/mntdb2 -vers=4,sec=sys:krb5p:krb5i:krb5:dh,rw
and run the 'exportfs -a' command to re-export the change.
Verify that all required software and updates are installed.
See XCAT_HASN_with_GPFS/#Software_Pre-requisites for a complete list.
If you intend to define dump resources for your compute nodes, make sure you have installed the prerequisite software. See XCAT_AIX_Diskless_Nodes/#ISCSI_dump_support for details.
Verify that the NFSv4 domain has been set consistently on all service nodes:
xdsh service cat /etc/nfs/local_domain | xcoll
If not, copy the file from the EMS to each SN:
xdcp service /etc/nfs/local_domain /etc/nfs/
Verify that all required software and updates are installed.
See XCAT_HASN_with_GPFS/#Software_Pre-requisites for a complete list.
You can use the updatenode command to update the SNs.
If you intend to define dump resources for your compute nodes, make sure you have installed the prerequisite software. See XCAT_AIX_Diskless_Nodes/#ISCSI_dump_support for details.
NOTE: If any software changes you are making require you to reboot the service node, you may wish to postpone this work until you shutdown the cluster nodes later in the process.
On each service node, the AIX OS startup order has to be changed to start GPFS before NFS and xcatd, since both rely on the /install directory being available. Edit /etc/inittab on each service node.
vi /etc/inittab
Move the calls to /etc/rc.nfs and xcatd to AFTER the start of GPFS, making sure GPFS is active before starting these services.
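For reference, the relevant entries should end up in roughly this order. The mmfs and rcnfs entries shown are the AIX/GPFS defaults; the xcatd entry is elided since its exact form depends on your installation.
mmfs:2:once:/usr/lpp/mmfs/bin/mmautoload >/dev/console 2>&1
rcnfs:23456789:wait:/etc/rc.nfs > /dev/console 2>&1
xcatd:2:once:...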
On each service node, the AIX OS shutdown order has to be changed to shut down the NFS server before GPFS, so that NFS doesn't keep trying to serve files backed by GPFS. Add the following to /etc/rc.shutdown on each service node:
vi /etc/rc.shutdown and add:
stopsrc -s nfsd
exit 0
You may wish to keep copies of these files on the EMS and add them to synclists for your service nodes. Then, if you ever need to re-install your service nodes, these files will be updated correctly at that time.
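For example, a hedged sketch of service node synclist entries (standard xCAT synclist syntax; the EMS-side paths are illustrative):
/install/custom/sn/inittab -> /etc/inittab
/install/custom/sn/rc.shutdown -> /etc/rc.shutdown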
(There is nothing unique required in this step for HASN support)
Create or update NIM installp_bundle files that you wish to use with your osimages.
Also, if you are upgrading to a new version of xCAT, you should check any installp_bundles you use that were provided as sample bundle files by xCAT. If these sample bundle files have been updated in the new version of xCAT, you should update your NIM installp_bundle files appropriately.
The list of bundle files you have defined may include:
To define a NIM installp_bundle resource you can run a command similar to the following:
nim -Fo define -t installp_bundle -a location=/install/nim/installp_bundle/xCATaixCN71.bnd \
-a server=master xCATaixCN71
You can modify a bundle file by simply editing it. It does not have to be re-defined.
Note: It is important when using multiple bundle files that the first bundle in the list be xCATaixCN71 since this contains the 'rpm' lpp package which is required for installing subsequent bundle files.
If your cluster was setup with NFSv3 you will need to convert all existing NIM images to NFSv4. On the EMS, for each existing OS image definition, run:
mknimimage -u <osimage_name> nfs_vers=4
You will need to build images with the correct version of AIX, all of the required fixes for NFSv4 client replication failover support, and your desired HPC software stack. You can use existing xCAT osimage definitions or you can create new ones using the xCAT mknimimage command.
To create a new osimage you could run a command similar to the following:
mknimimage -V -r -s /myimages -t diskless <osimage name> \
installp_bundle="xCATaixCN71,xCATaixHFIdd,IBMhpc_base,IBMhpc_all"
Whether you are using an existing lpp_source or you created a new one you must make sure you copy any new software prerequisites or updates to the NIM lpp_source resource for the osimage.
The easiest way to do this is to use the "nim -o update" command.
For example, to copy all software from the /tmp/myimages directory you could run the following command.
nim -o update -a packages=all -a source=/tmp/myimages <lpp_source name>
This command will automatically copy installp, rpm, and emgr packages to the correct location in the lpp_source subdirectories.
Once you have copied all your software to the lpp_source, it would be good to run the following two commands:
nim -Fo check <lpp_source name>
chkosimage -V <osimage name>
See chkosimage for details.
You can use the xCAT mknimimage, xcatchroot, or xdsh commands to update the spot software on the EMS.
For example, to install the IV15048m03.120516.epkg.Z ifix you could run the following command.
mknimimage -V -u <osimage name> otherpkgs="E:IV15048m03.120516.epkg.Z"
Check the spot.
nim -Fo check <spot name>
Verify that the ifixes are applied to the spot.
xcatchroot -i <spot name> "emgr -l"
Some notes regarding spot updates:
This procedure recommends that you do not run your nodes with an /etc/resolv.conf file, and instead use a fully populated /etc/hosts file in your diskless image. See Limitations, Notes and Issues [XCAT_HASN_with_GPFS] above.
If your OS image has a resolv_conf resource assigned to it then remove it.
lsdef -t osimage -o <image name> -i resolv_conf
If set, remove it:
chdef -t osimage -o <image name> resolv_conf=''
Also, make sure the "nameservers" attribute is NOT set in either the xCAT site definition or the xCAT network definitions. If "nameservers" is set, xCAT will attempt to create a default NIM "resolv_conf" resource for the nodes.
By default, xCAT will use the /etc/hosts from your EMS to populate the OS image. If you require a different /etc/hosts, copy it into your spot and shared_root:
cp <your compute node hosts> /install/nim/spot/<spot name>/usr/lpp/bos/inst_root/etc/hosts
cp <your compute node hosts> /install/nim/shared_root/<shared root name>/etc/hosts
You may also want to configure the order of name resolution search to use local files first by setting the correct values in the netsvc.conf file in those same locations in the spot and shared_root.
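For example, searching local files before DNS uses the standard AIX netsvc.conf syntax:
hosts = local, bind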
Dump resource
Due to current NIM limitations a dump resource cannot be created in the shared file system.
If you wish to define a dump resource to be included in an osimage definition you must use NIM directly to create the resource in a separate local file system on the EMS. (For example /export/nim.)
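For example, a sketch of defining such a resource (the directory is illustrative; see the NIM documentation for additional dump resource attributes):
nim -o define -t dump -a server=master -a location=/export/nim/dump <dump res name>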
Once the dump resource is created you can add its name to your osimage definition.
chdef -t osimage -o <osimage name> dump=<dump res name>
Note: If you have multiple osimages you should create a different NIM dump resource for each one.
When the mkdsklsnode command creates the resources on the SNs it will create the dump resources in a local filesystem with the same name, e.g. /export/nim. If you want these directories to exist in filesystems on the external storage subsystem, you will need to create those filesystems and have them available on each SN before running the mkdsklsnode command.
Paging resource
NOTE: NIM does not support paging space with NFSv4 client failover. Therefore, the paging space files will need to be created manually in the shared /install GPFS filesystem and then configured on the nodes through an xCAT postscript. That postscript will also need to remove the default swapnfs0 paging space on the nodes, since the NIM NFS mount of that space was not done with the correct NFSv4 client replication information, and it may cause a system hang in a failover situation.
NOTE 2: The mkps flags ':fur' are specific to failover support. If you created the postscript below before using the paging failover support, make sure to replace the old ':wam' flags with ':fur'.
On one SN, create the paging files for all of the compute nodes in your cluster in the shared /install filesystem. Do NOT place the files in the same directory NIM will be using for the node's paging resource containing the initial swapnfs0 paging device (e.g. do NOT put swapnfs1 in /install/nim/paging/<image>_paging/<node>) since all the files in that directory are removed during rmdsklsnode.
For example, to create 128G of swap space for each node do:
mkdir /install/paging
# For each compute node:
mkdir /install/paging/<compute node>
dd if=/dev/zero of=/install/paging/<node>/swapnfs1 bs=1024k count=65536
dd if=/dev/zero of=/install/paging/<node>/swapnfs2 bs=1024k count=65536
Set up a new postscript to run on the compute node to activate that paging space with replication/failover support and disable the default swapnfs0. e.g. 'vi nfsv4mkps':
#!/bin/sh
# Remove any NFS paging spaces left over from a previous run of this script
rmps swapnfs1
rmps swapnfs2
# Create the NFS paging spaces with the failover flags (:fur) and activate them
mkps -t nfs $MASTER /install/paging/`hostname -s`/swapnfs1:fur
mkps -t nfs $MASTER /install/paging/`hostname -s`/swapnfs2:fur
swapon /dev/swapnfs1
swapon /dev/swapnfs2
# Deactivate and remove the default NIM paging space; its NFS mount was not
# set up with replication information and could hang the node during failover
swapoff /dev/swapnfs0
rmps swapnfs0
Add the postscript to your xCAT node definitions:
chdef <compute_nodes> -p postscripts=nfsv4mkps
Typically to view paging configuration information on a node, you would use the 'lsps' command. The "Server Hostname" listed in the output of this command is the primary NFS server for the paging space. Even if NFSv4 client failover has occurred, the 'lsps' output will not reflect the actual NFS server that is servicing the mount. AIX does not provide a utility to view the current active NFS server for the paging space.
Review all NIM resources you have defined on your service nodes:
xdsh service lsnim | xcoll
lsdef -t osimage
Remove all OS images from your service nodes that are not actively being used by the cluster at this time. You can use the rmnimimage command to help with this:
rmnimimage -V -s service <image name>
Review the remaining NIM resources, again:
xdsh service lsnim | xcoll
Use NIM commands to remove any remaining unused resources from each service node. Do NOT remove any resources actively being used by your running cluster, or the NIM master or network definitions. If you're not sure, do not remove the resources at this time; wait until you bring down the compute nodes later.
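For example, to remove one leftover resource with standard NIM syntax:
nim -o remove <resource name>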
If you do not already have a separate nodegroup for all nodes managed by a service node, you may wish to create these nodegroups now to make the following management easier.
xCAT has provided a new postscript setupnfsv4replication that needs to be run on each node to set up the NFS v4 client support for failover. This script contains some tuning settings that you may wish to adjust for your cluster:
$::NFSRETRIES = 3;
$::NFSTIMEO = 1; # in tenths of a second, i.e. NFSTIMEO=1 is 0.1 seconds
NFSTIMEO is how long NFS will wait for a response from the server before timing out. NFSRETRIES is the number of times NFS will retry contacting the server before initiating a failover to the backup server. If you change this postscript, remember to copy the updates to your new shared /install filesystem on one of the service nodes:
xdcp <any service node> /install/postscripts/setupnfsv4replication /install-gpfs/postscripts/
Note: If you have already migrated to the shared /install directory on your service nodes, the target directory would be /install/postscripts instead.
The following example assumes you are using a 'compute' nodegroup entry in your xCAT postscripts table.
chdef -t group compute -p postscripts=setupnfsv4replication
The "servicenode" attribute values must be the names of the service nodes as they are known by the EMS. The "xcatmaster" attribute value must be the name of the primary server as known by the nodes.
chdef -t node -o <SNgroupname> servicenode=<primarySN>,<backupSN> xcatmaster=<nodeprimarySN>
In the following example, "compute" is the name of an xCAT node group containing all the cluster compute nodes.
xdsh compute "/usr/sbin/shutdown -F &"
The following command will remove all the NIM client definitions from both primary and backup service nodes. See the rmdsklsnode man page for additional details.
rmdsklsnode -V -f compute
The existing NIM resources need to be removed on each service node. (With the original /install filesystem still in place.)
In the following example, "service" is the name of the xCAT node group containing all the xCAT service nodes, and "<osimagename>" should be substituted with the actual name of an xCAT osimage object.
rmnimimage -V -f -d -s service <osimagename>
See rmnimimage for additional details.
When this command is complete, it would be good to check the service nodes to make sure there are no other NIM resources still defined. For each service node (or from the EMS with 'xdsh service'), run lsnim to list any NIM resources that remain. Remove any leftover resources that are no longer needed (you should NOT remove basic NIM resources such as master, network, etc.).
On each service node, clean up the NFS exports.
exportfs -ua
exportfs -a (if there are any entries left in /etc/exports)
Use rsync to copy all the persistent data from your current statelite directory, along with any other data from your /install directory that may have changed since the initial copy. Even if you took an initial copy earlier in the process, you will want to do this again now to pick up any changes written since then. You should run this rsync from one SN at a time to copy data into the shared /install-gpfs filesystem. Especially for statelite data, this will ensure that if you happen to have more than one SN with a subdirectory for the same compute node, you will not run into collisions copying from multiple SNs at the same time. Make sure to use the rsync -u (update) option so that stale data from an older SN does not overwrite the data from an active SN.
On each service node, deactivate (in whatever way you choose: rename, overmount, etc.) the local /install filesystem. Change your GPFS cluster configuration so the mount point for your shared GPFS /install-gpfs filesystem is switched to /install. Depending on how the old local /install filesystem was originally created, this may also require updates to /etc/filesystems.
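For example, a sketch of switching the mount point (it assumes the GPFS device for the filesystem is named installfs; substitute your actual device name):
mmumount installfs -a          # unmount on all nodes in the SN GPFS cluster
mmchfs installfs -T /install   # change the default mount point
mmmount installfs -a           # remount, now at /install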
If you postponed updating software on your service nodes because of required reboots, you should apply that software now and reboot the SNs.
After the SNs come back up, make sure that the admin GPFS cluster is running and NFS has started correctly.
Make sure /etc/exports on the service nodes does not contain any old entries. If any remain, remove them and run:
exportfs -ua
When using a shared file system across the SNs you must run the mkdsklsnode command on the backup SNs first and then run it for the primary SNs.
This is necessary since there are some install-related files that are server-specific. The server that is configured last is the one the node will boot from first.
NOTE: When running mkdsklsnode you may, in certain cases, see the following error:
Error: there is already one directory named "", but the entry in litefile table is set to one file, please check it
Error: Could not complete the statelite setup.
Error: Could not update the SPOT
If you see this error simply re-run the command.
mkdsklsnode -V -S -b -i <osimage name> <noderange>
Use the -S flag to setup the NFSv4 replication settings on the SNs.
Note: Once you are sure that the "mkdsklsnode -S" option has been run for each service node in the cluster, it is no longer necessary to include the "-S" option in subsequent calls to mkdsklsnode. (Although it will not cause any problems if you do.)
Also, be aware that the first time you run mkdsklsnode, it will take a while to copy all of the resources down to your service node shared /install directory. mkdsklsnode uses the rsync command, so after the initial copy, subsequent updates should be much quicker.
If you are using a dump resource you can specify the dump type to be collected from the client. The values are "selective", "full", and "none". If the configdump attribute is set to "full" or "selective", the client will automatically be configured to dump to an iSCSI target device. The "selective" memory dump will avoid dumping user data. The "full" memory dump will dump all the memory of the client partition. Selective and full memory dumps will be stored in a subdirectory of the dump resource allocated to the client. This attribute is saved in the xCAT osimage definition.
For example:
mkdsklsnode -V -S -b -i <osimage name> <noderange> configdump=selective
To verify the setup on the SNs you could use xdsh to run the lsnim command on the SNs.
To check for the resource and node definitions you could run:
xdsh <SN name> "lsnim"
To get the details of a NIM client definition you could run:
xdsh <SN name> "lsnim -l <nim client name>"
To set up the primary service nodes run the same command you just ran on the backup SNs only use the "-p" option instead of the "-b" option.
mkdsklsnode -V -S -p -k -i <osimage name> <noderange>
Note the use of the '-k' flag. By default, the mkdsklsnode command will run the NIM sync_roots operation on the xCAT management node to sync the shared_root resource with any changes that may have been made to the spot. If you KNOW you have not made any image changes since your last mkdsklsnode run, you can skip this step (and save yourself several minutes) by using the '-k' flag.
Verify the NFSv4 replication is exported correctly for your service node pairs:
xdsh service cat /etc/exports | xcoll
====================================
c250f10c12ap01
====================================
/install -replicas=/install@20.10.12.1:/install@20.10.12.17,vers=4,rw,noauto,root=*
====================================
c250f10c12ap17
====================================
/install -replicas=/install@20.10.12.17:/install@20.10.12.1,vers=4,rw,noauto,root=*
Set the nodes to boot over the HFI network and power them on:
rbootseq compute hfi
rpower compute on
If you specified a dump resource you can check if the primary dump device has been set on the node by running:
xdsh <node> "sysdumpdev"
Verify the NFSv4 replication is configured correctly for your nodes:
xdsh compute,storage nfs4cl showfs | xcoll
xdsh compute,storage nfs4cl showfs /usr
Depending on your network load, you may actually see some nodes using the NFS server from their backup service node. This is acceptable. If you see too much random activity, and would like to tune the failover to be more tolerant of heavy network traffic, you can modify the values in the xCAT setupnfsv4replication postscript as previously described in XCAT_HASN_with_GPFS/#Update_xCAT_node_definitions.
If you want to test your NFSv4 client replication failover, here is a simple procedure to stop the nfsd daemon on one service node and watch the nodes switch over to using NFS from their backup service node:
s1 = service node 1
s2 = service node 2
c1 = all compute nodes managed by s1, backup s2
c2 = all compute nodes managed by s2, backup s1
xdsh c1,c2 nfs4cl showfs | xcoll
should show c1 filesystems served by s1 and c2 filesystems served by s2
xdsh s1 stopsrc -s nfsd
xdsh c1,c2 ls /usr | xcoll
xdsh c1,c2 nfs4cl showfs | xcoll
should show all nodes getting /usr from s2 now (depending on NFS caching, it may take additional activity on the c1 nodes to have all filesystems failover to s2)
TESTING NOTE: At this point, you can restart NFS on s1. You can continue testing by shutting down NFS on s2 and watching all nodes fail over to s1. Once NFS is back up on both service nodes, over time, the clients should eventually switch back to using their primary server.
[Monitor_and_Recover_Service_Nodes]
The nodes will continue running after the primary service node goes down; however, you should move the nodes to the backup SN as soon as possible.
Use the xCAT snmove command to move a set of nodes to the backup service node.
Note: Since we have already run mkdsklsnode on the backup SN we know that the NIM resources have been defined and nodes initialized.
In the case where a primary SN fails you can run the snmove command with the node group you created for this SN. For example, if the name of the node group is "SN27group" then you could run the following command:
snmove -V SN27group
You can also specify scripts to be run on the nodes by using the "-P" option.
snmove -V SN27group -P myscript
(Make sure the script has been added to the /install/postscripts directory and that it has the correct permissions.)
The snmove command performs several steps to both keep the nodes running and to prepare the backup service node for the next time the nodes need to be booted.
This includes the following:
You can verify some of these steps by running the following commands.
lsdef <noderange> -i servicenode,xcatmaster -c | xcoll
xdsh <noderange> "/bin/sysdumpdev"
Make sure the primary dump device has been reset.
xdsh <noderange> "/bin/netstat -rn"
xdsh <noderange> "/bin/cat /etc/xcatinfo"
Verify that the server listed is the new SN.
The nodes should continue running after the primary SN goes down; however, it is advisable to reboot the nodes as soon as possible.
When the nodes are rebooted they will automatically boot from the new SN.
xdsh compute "shutdown -F &"
rpower compute on
During development and testing of this support, we have gathered some diagnostic hints and tips that may help you if you are experiencing problems in configuring or executing the support described in this document:
During diskless node boot, if the node hangs with LED values of 610 or 611, that usually indicates there is a problem mounting the NFS filesystems correctly. Some things to check:
Your NFS daemons are running correctly on the service nodes, and the /install entry in /etc/exports contains the correct replica information, for example:
/install -noauto,replicas=/install@20.1.1.1:/install@20.1.2.1,vers=4,rw,root=*
The process for switching nodes back will depend on what must be done to recover the original service node. Essentially the SN must have all the NIM resources and definitions restored and operations completed before you can use it.
If you are using the xCAT statelite support then you must make sure you have the latest files and directories copied over and that you make any necessary changes to the statelite and/or litetree tables.
If all the configuration is still intact you can simply use the snmove command to switch the nodes back.
If the configuration must be restored, you will have to run the mkdsklsnode command. This command will re-configure the SN using the common osimages defined on the xCAT management node.
Remember that this SN would now be considered the backup SN, so when you run mkdsklsnode you need to use the "-b" option.
Once the SN is ready you can run the snmove command to switch the node definitions to point to it. For example, to move all the nodes in the "SN27group" back to the original SN you could run the following command.
snmove -V SN27group
The next time you reboot the nodes they will boot from the original SN.
Because TEAL GPFS monitoring must be in the application GPFS cluster, we must move the teal.gpfs-sn component off the service node to a utility node that is in the compute node cluster. Since the TEAL package requires access to the database server, we will also be installing and configuring a new db2driver package that provides a minimal DB2 client that can run the required ODBC interface on a diskless node.
You will have to obtain the db2driver code and the required level of the teal.gpfs-sn code from IBM that supports this function. The db2driver code is available at the following location on fix central and is available to anyone holding the HPC DB2 license.
http://www-933.ibm.com/support/fixcentral/swg/selectFixes?parent=ibm/Information+Management&product=ibm/Information+Management/IBM+Data+Server+Client+Packages&release=9.7.&platform=All&function=fixId&fixids=-dsdriver-*FP005&includeSupersedes=0
The following DB2 driver software package was tested and works with the DB2 9.7.4 or 9.7.5 WSER Server code.
v9.7fp5_aix64_dsdriver.tar.gz
You will need to obtain the appropriate levels of the TEAL, GPFS, and rsct.core.sensorrm code. The teal.gpfs-sn lpp is required on the node, along with the corresponding gpfs.base and rsct.core.sensorrm levels.
We will configure the Data Server Client in the /db2client directory on the EMS machine. We will use this setup to update the image for the utility node that will run it.
**Install the unzip rpm, if not already available**
Note: the Data Server Client code requires unzip. Make sure it is available before continuing.
AIX:
Get unzip from the AIX Toolbox for Linux Applications, if not already available.
rpm -i unzip-5.51-1.aix5.1.ppc.rpm
(This installs unzip on a diskfull node. For AIX diskless nodes, add unzip to the statelite image.)
**Extract the Data Server Client code on the EMS**
mkdir /db2client
cd /db2client
cp ..../v9.7fp5_aix64_dsdriver.tar.gz .
gunzip v9.7fp5_aix64_dsdriver.tar.gz
tar -xvf v9.7fp5_aix64_dsdriver.tar
export PATH=/db2client/dsdriver/bin:$PATH
export LIBPATH=/db2client/dsdriver/lib:$LIBPATH
The installDSdriver script run below will only automatically set up the 64-bit driver. We must manually extract the 32-bit driver.
cd /db2client/dsdriver
./installDSdriver
cd odbc_cli_driver/aix32
uncompress *.tar.Z
tar -xvf *.tar
Note: the package we downloaded had subdirectories that were not owned by the bin owner/bin group. To be sure, do the following:
cd /db2client
chown -R bin *
chgrp -R bin *
**Create shared lib on 32-bit path (AIX)**
cd /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/lib
ar -x libdb2.a       # extract the shared object member (shr.o) from the archive
mv shr.o libdb2.so   # rename it so the ODBC driver manager can load it as a .so
The DB2 Data Server Client has several configuration files that must be setup.
The db2dsdriver.cfg configuration file contains database directory information and client configuration parameters in a human-readable format.
The db2dsdriver.cfg configuration file is an XML file that is based on the db2dsdriver.xsd schema definition file. It contains various keywords and values that can be used to enable features for a supported database through ODBC, CLI, .NET, OLE DB, PHP, or Ruby applications. The keywords can be associated globally for all database connections, or with a specific data source name (DSN) or database connection.
cd /db2client/dsdriver/cfg
cp db2dsdriver.cfg.sample db2dsdriver.cfg
chmod 755 db2dsdriver.cfg
vi db2dsdriver.cfg
Here is a sample setup for a node accessing the xcatdb database on the Management Node p7saixmn1.p7sim.com
<configuration>
<dsncollection>
<dsn alias="xcatdb" name="xcatdb" host="p7saixmn1.p7sim.com" port="50001"/>
</dsncollection>
<databases>
<database name="xcatdb" host="p7saixmn1.p7sim.com" port="50001">
</database>
</databases>
</configuration>
The CLI/ODBC initialization file (db2cli.ini) contains various keywords and values that can be used to configure the behavior of CLI and the applications using it.
The keywords are associated with the database alias name, and affect all CLI and ODBC applications that access the database.
cd /db2client/dsdriver/cfg
cp db2cli.ini.sample db2cli.ini
chmod 0600 db2cli.ini
Here is a sample db2cli.ini file containing the information needed to access the xcatdb database, using instance xcatdb and password cluster. Note: this file should only be readable by root.
[xcatdb]
uid=xcatdb
pwd=cluster
For 32 bit, copy the /db2client/dsdriver/cfg files to /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/cfg
cd /db2client/dsdriver/cfg
cp db2cli.ini /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/cfg
cp db2dsdriver.cfg /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/cfg
The unixODBC files are still needed. The following are sample configuration files that must be created and added to the image (this step comes later). Substitute your database (xcatdb) password for db2root below. The remaining information in the files should be the same.
cat /db2client/odbc.ini
[xcatdb]
Driver = DB2
DATABASE = xcatdb
cat /db2client/odbcinst.ini
[DB2]
Description = DB2 Driver
Driver = /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/lib/libdb2.so
FileUsage = 1
DontDLClose = 1
Threading = 0
cat /db2client/db2cli.ini
[xcatdb]
pwd=db2root
uid=xcatdb
You will need the following levels to support the teal-gpfs on the utility node.
teal.gpfs-sn (SP5)
gpfs.base 3.4.0.13 or later
rsct.core.sensorrm 3.1.2.0 or later
We have created a new diskless image for the teal.gpfs-sn utility node. Here is a sample bundle file:
# sample bundle file for teal-gpfs utility node
I:rpm.rte
I:openssl.base
I:openssl.license
I:openssh.base
I:openssh.man.en_US
I:openssh.msg.en_US
I:gpfs.base
I:gpfs.gnr
I:gpfs.msg.en_US
I:rsct.core.sensorrm
I:teal.gpfs-sn
# RPMs
R:popt*
R:rsync*
# using Perl 5.10.1
R:perl-Net_SSLeay.pm-1.30-3*
R:perl-IO-Socket-SSL*
R:unixODBC*
# unzip is optional, since we are setting up db2driver on the EMS
R:unzip*
With this additional bundle file, build the diskless image for the teal-gpfs utility node.
Copy the /db2client directory and subdirectories that were set up on the EMS into the image.
cp -rp /db2client /install/nim/spot/<image name>/usr/lpp/bos/inst_root
Copy /db2client/odbc.ini into the image.
cp /db2client/odbc.ini /install/nim/spot/<image name>/usr/lpp/bos/inst_root/etc
Copy /db2client/odbcinst.ini into the image.
cp /db2client/odbcinst.ini /install/nim/spot/<image name>/usr/lpp/bos/inst_root/etc
Copy /db2client/db2cli.ini into the image
cp /db2client/db2cli.ini /install/nim/spot/<image name>/usr/lpp/bos/inst_root
Copy /etc/xcat/cfgloc into the image
mkdir /install/nim/spot/<image name>/usr/lpp/bos/inst_root/etc/xcat
cp /etc/xcat/cfgloc /install/nim/spot/<image name>/usr/lpp/bos/inst_root/etc/xcat
Configure IP forwarding such that the Utility Node can access the DB2 Server on the EMS.
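One hedged possibility (a sketch only; your network design will differ, and the addresses below are placeholders): enable IP forwarding on the service node that sits between the utility node and the EMS network, and add a route on the utility node.
no -o ipforwarding=1                                              # on the service node
route add -net <EMS network> -netmask <netmask> <SN gateway IP>   # on the utility node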
After the Utility node boots, test the DB2 setup by running the following, to see if you can connect to the DB2 database on the EMS.
/usr/local/bin/isql -v xcatdb
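If the connection succeeds, isql presents a SQL> prompt. As a hedged smoke test (assuming the default xCAT DB2 setup, where the xcatdb instance owns the tables), you could query a small table such as site:
SQL> select * from site;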
Wiki: Monitor_and_Recover_Service_Nodes
Wiki: XCAT_Documentation
Wiki: XCAT_HASN_with_GPFS
Wiki: XCAT_Overview,_Architecture,_and_Planning