XCAT_HASN_with_GPFS


DRAFT! This is a work-in-progress and is not complete.

XCAT High Availability Service Nodes (HASN)

(Using NFS v4 Client Replication Failover with GPFS filesystems.)

AIX diskless nodes depend on their service nodes for many services: bootp, tftp, default gateway, name serving, NTP, etc. The most significant service is NFS to access OS files, statelite data, and paging space. This document describes how to use GPFS and NFSv4 client replication failover support to provide continuous operation of the full HPC cluster if the NFS services provided by a service node become unavailable, whether due to failure of that service node or for other reasons.

Overview of Hardware and Cluster Configuration

Using a shared filesystem

  • All service nodes in the cluster are FC connected to external disks which will be used to hold a common copy of the node images, statelite files, and paging space.
  • The disks are owned by GPFS, and all service nodes are in one GPFS cluster. Note that this is a separate small management cluster (referred to here as the "admin GPFS cluster"), disjoint from the large GPFS application data cluster (referred to here as the "application GPFS cluster") to which all of the compute and storage nodes belong.
  • A common /install filesystem in the admin GPFS cluster will be used for all data except for dump resources.
  • OPTIONAL: The EMS can also be attached to the external disks and included in the GPFS cluster. In this case, the /install filesystem will be common across the EMS and all service nodes.

    NOTE: This option is not yet supported by xCAT.

  • Each service node will NFSv4 export the /install filesystem with its backup service node NFS server replica specified (automatically set in the /etc/exports file by the xCAT mkdsklsnode command).

  • Compute node definitions in xCAT will have both a primary and a backup service node defined. The cluster will be configured with these "pairs" of service nodes, such that each service node in a pair is a backup to the other.
  • Compute nodes will NFSv4 mount the appropriate /install filesystems and be configured for NFSv4 client replication failover. This includes the /usr read-only filesystem, the shared-root filesystem (managed by STNFS on the compute node), and the filesystem used for xCAT statelite data. Paging support is not yet available from AIX. The dump resources must NOT reside in a GPFS filesystem - they can either be a jfs* filesystem in the attached storage or reside on the local service node hard drives.
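
To illustrate, the NFSv4 replica export that mkdsklsnode writes into /etc/exports might look like the following sketch (sn1 and sn2 are hypothetical service node hostnames; consult the exportfs documentation for your AIX level for the exact option syntax):

```text
/install -vers=4,replicas=/install@sn1:/install@sn2
```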

During normal cluster operation, if a compute node is no longer able to access its NFS server, it will failover to the configured replica backup server. Since both NFS servers are using GPFS to back the filesystem, the replica server will be able to continue to serve the identical data to the compute node. However, there is no automatic failover capability for the dump resource - no dump capability will be available until the xCAT snmove command is run to retarget the compute node's dump device to its backup service node.

Considerations for Other Software Components

There are a few components that normally run on the service nodes that under certain circumstances need access to the application GPFS cluster. Since a service node can't be (directly) in 2 GPFS clusters at once, some changes in the placement or configuration of these components must be made, now that the service nodes are in their own GPFS cluster. The components that can be affected by this are:

  • LL schedd daemons - need to store the spool files in a common file system. This could be the SN admin GPFS cluster if all the schedd's run on the SNs. But the schedd's normally need a file system that is common with the compute nodes where the application executables are stored, which is often the user home directories in the application GPFS cluster. One way to accomplish this may be to use GPFS multi-cluster support to make the user home directories available on the SN from the GPFS application I/O nodes.
  • LL central mgr, regional mgr, and schedd's - need access to the database if using the database option for LL, which currently is only available from the xCAT EMS and service nodes. LL is not capable of accessing the DB2 database through the new db2driver package that uses a minimum DB2 client to run the ODBC interface from a diskless node. The following LL features need the database option: rolling updates, checkpoint/restart, and the future energy aware scheduling feature. If you do not need to use these features, you can run LL with the traditional config files instead of the database option and move these LL services off the SNs onto utility nodes.
  • This design does not support checkpoint/restart. This function requires access to the application GPFS cluster.
  • TEAL GPFS monitoring - needs to run on the GPFS monitoring collector node of the application GPFS cluster and needs access to the database. The collector node can run the new db2driver package that uses a minimum DB2 client to run the required ODBC interface on a diskless node. This can be a utility node that has network access to the xCAT EMS either through appropriate routing or through an ethernet interface connected to the EMS network.

Limitations, Notes and Issues

  • Remember that with a shared /install filesystem, the NIM files created there are visible to multiple NIM masters. xCAT code accommodates this when running NIM commands on multiple service nodes accessing the same directories and files. If you run NIM commands directly, remember that you can easily corrupt the NIM environment on another server without realizing it. Use extreme caution when running NIM commands directly, and understand how the results of a command may affect other NIM servers accessing the identical /install/nim directories and files.
  • Since the NIM masters on both the primary and backup SNs for a compute node need to manipulate the identical client directories and files in the shared /install filesystem, you MUST NOT run the mkdsklsnode command on both the primary and backup service nodes at the same time for a given xCAT node. Also, in order for the xCAT snmove function to work correctly, you must run "mkdsklsnode -b ..." to create your NIM machine definitions on the backup service node BEFORE running "mkdsklsnode -p ...", which creates the correct client files to reference the active NIM master.
  • The dump resource CANNOT reside in a GPFS filesystem. This is not supported by AIX. The resources can either be a jfs* filesystem in the attached storage subsystem or reside in the local SN harddrives. There is no automatic failover capability for the dump resource - no dump capability will be available until the xCAT snmove command is run to retarget the compute node's dump device to its backup service node.
  • The service node OS service startup order (/etc/inittab) must be changed to start the admin GPFS cluster before trying to start NFS. Running NFS with no active GPFS filesystem backing the exported directories has caused strange hangs on compute nodes that had that service node registered as their primary NFS server, even after failing over to the backup replica server. Therefore, you should modify /etc/inittab on the service nodes to control the startup order of NFS and GPFS correctly, moving the call to rc.nfs to after the start of GPFS.
  • The service node OS service shutdown order has to be changed to shutdown the NFS daemons before GPFS, so that NFS doesn't keep trying to serve files backed by GPFS.
  • Similarly, if you need to stop and restart GPFS on a service node, make sure to stop/start these services in the following order:

  1. exportfs -ua
  2. stopsrc -g nfs
  3. mmshutdown
  4. mmstartup
  5. startsrc -g nfs (or /etc/rc.nfs)
  6. exportfs -a

  • The paging space currently does not support NFSv4 client replication fail over. This may cause problems if the primary service node goes down, and the compute node requires paging to remain operational. xCAT development has started preliminary testing with a prototype version of the paging space support, and some notes have been included in the process below.

  • There is currently an issue with using NFSv4 client replication failover for read-write files, even when GPFS is ensuring that the files are identical regardless of which SN they are accessed from. A small timing window exists in which the client sends a request to update a file and the server updates it, but the server crashes before it sends the acknowledgement to the client. When the client fails over to the other server (which has the updated file thanks to GPFS) and resends the update request, the client detects that the modification times the client and server each record for the file differ, and bails out, marking the file "dead" until the client closes and reopens it. This is a precaution, because the NFS client has no way of verifying that this is the exact same file that it updated on the other server. AIX development is sizing a configuration option that would tell the client not to mark the file dead in this case, because GPFS is ensuring the consistency of the files between the servers.

    Note - we have not yet directly experienced this condition in any of our testing.

  • If "site.sharedinstall=all" (currently not supported), all NIM resources on the EMS will be created directly in the GPFS filesystem, including your lpp_source and spot resources. By default, NIM resources cannot be created with associated files in a GPFS filesystem (only jfs or jfs2 filesystems are supported). To bypass this restriction, all NIM commands must be run with either the environment variable "NIM_ATTR_FORCE=yes" set, or with the 'nim -F' force flag directly on each command. All xCAT commands have been changed to accommodate this setting. However, it is often necessary for an admin to run NIM commands directly. When doing so, be sure to use one of these force options.
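
For example, a manual NIM check against a resource whose files live in GPFS could be forced in either of these two equivalent ways (the resource name is a placeholder):

```text
NIM_ATTR_FORCE=yes nim -o check <lpp_source name>
nim -Fo check <lpp_source name>
```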

Software Pre-requisites

[Need_final_statement_for_prereqs]

xCAT 2.7.2 including the following code updates:

Base: AIX 7.1.D (7.1.1.0)

Initial code drop of STNFS failover support:

**(AIX CMVC defect 816890)**
**HPCstnfs.111202.epkg.Z**

STNFS Patch from Duen-wen to fix I/O errors from 'ls -lR /' after failover downloaded from ausgsa:

**(AIX CMVC defect 822215):**
/usr/lib/drivers/stnfs.ext

NFS Patches from Duen-wen to fix access failures to libC in /usr filesystem downloaded from ausgsa:

**(AIX CMVC defect 826634):**
/usr/lib/drivers/nfs.ext 
/usr/lib/drivers/nfs.netboot.ext

NIM patch to turn off TCB-enabled during SPOT build (locally modified on EMS by Linda Mellor based on instructions from Paul Finley). This is ONLY required for sharedinstall=all (not needed for sharedinstall=sns):

**(AIX CMVC defect 824583):**
/usr/lpp/bos.sysmgt/nim/methods/c_instspot

From AIX Development/Service:

   ==> swinf -u 816890
  U843487|IV11645|816890|bos|bos.net.nfs.client|bos71F onc71F pkg71F|aix|limited stnfs replication support|7.1.1.15|
  U846403|IV11645|816890|bos|bos.net.nfs.client|aix71D|aix|limited stnfs replication support|7.1.1.3|7100-01-03
  U846654|IV11646|816890|bos|bos.net.nfs.client|bos71H onc71H pkg71H|aix|limited stnfs replication support|7.1.2.0|

   ==> swinf -u 822215
  U843487|IV14334|822215|bos|bos.net.nfs.client|onc71F|aix|stnfs replication will not work if the mount point is not the root of a FS.|7.1.1.15|
  U846654|IV15285|822215|bos|bos.net.nfs.client|onc71H|aix|stnfs replication will not work if the mount point is not the root of a FS.|7.1.2.0|

   ==> swinf -u 826634
  U842986|IV16857|826634|bos|bos.net.nfs.client|onc61S|aix|During file recovery, regular file becomes symlink|6.1.7.15|
  U849900|IV18488|826634|bos|bos.net.nfs.client|onc61N|aix|During file recovery, regular file becomes symlink|6.1.6.19|
  U845434|IV16681|826634|bos|bos.net.nfs.client|onc61V|aix|During file recovery, regular file becomes symlink|6.1.8.0|
  U843487|IV16758|826634|bos|bos.net.nfs.client|onc71F|aix|During file recovery, regular file becomes symlink|7.1.1.15|
  U846654|IV17125|826634|bos|bos.net.nfs.client|onc71H|aix|During file recovery, regular file becomes symlink|7.1.2.0|

  816890 will be in 7100-01-03, i.e. 71D SP3.

  The other two, 822215 and 826634 do not show up in 71D as of now.

NOTE: All STNFS/NFS defects are fixed and will be shipped in AIX 7.1.F (7.1.2, GA 5/2012). We will need to work with AIX support if efixes need to be built for a different version of AIX.

HASN Setup Process

Assumptions

This procedure assumes the following:

  1. You are starting with an existing cluster.
  2. The EMS is installed with the correct xCAT code, configured, and operational.
  3. Service Nodes (SNs) are installed with correct xCAT code, configured, and operational.
  4. Network routing from the EMS to the compute nodes is set up correctly so that if one service node goes down, there are other routes to reach the compute node network.
  5. Release 2.7.2 of xCAT is installed on the EMS and SNs.
  6. You are running AIX 7.1.1 SP3 release levels that contain the NFSv4 fixes on the EMS and service nodes, and new compute node images will be built with the same AIX level and fixes.

Preparing an existing cluster

**Note**: If starting over with a new cluster then refer to the
https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster
document for details on how to install an xCAT EMS and service nodes (SN).

Do not remove any xCAT or NIM information from the EMS.

Hardware setup for the shared file system

Storage Setup Configuration 1

[File:StorageSetup01.jpg]

  • This configuration may be used when the systems housing the service nodes have enough slots available to accommodate internal drives for the service node as well as the fiber channel HBAs for connectivity to the external storage to be used with the GPFS setup.
  • The amount of storage in the configuration should be sized based on the desired storage capacity as well as the overall I/O throughput required of the setup.
  • If there are more service nodes than the host ports on the fiber channel storage controller units then the use of fiber channel switches may be required.
  • Typically the configuration would have two identically configured fiber channel controller setups. Using two controller setups along with GPFS replication will provide data protection beyond that provided at the RAID array level.
  • The external storage is typically configured in 4+2P RAID6 arrays with one LUN per array. Using a 256KB segment size during array creation allows a GPFS file system block size of 1MB (4 data disks x 256KB = one full stripe). Alternately, a 128KB segment size may be used, which allows a GPFS file system block size of 512KB.

Storage Setup Configuration 2

[File:StorageSetup02.jpg]

  • This configuration may be used when the systems housing the service nodes DO NOT have enough slots available to accommodate both internal drives for the service node and the fiber channel HBAs for connectivity to the external storage used with the GPFS setup. This necessitates that the service nodes boot over fiber channel from the external storage.
  • The amount of storage in the configuration should be sized based on the desired storage capacity as well as the overall I/O throughput required of the setup.
  • If there are more service nodes than the host ports on the fiber channel storage controller units then the use of fiber channel switches may be required.
  • Typically the configuration would have two identically configured fiber channel controller setups. Using two controller setups along with GPFS replication will provide data protection beyond that provided at the RAID array level.
  • The external storage is typically configured in 4+2P RAID6 arrays with one LUN per array. Using a 256KB segment size during array creation allows a GPFS file system block size of 1MB (4 data disks x 256KB = one full stripe). Alternately, a 128KB segment size may be used, which allows a GPFS file system block size of 512KB.
  • The disks to be used for the boot of the service node(s) can be RAIDED (for example RAID1) or non-RAIDED. The use of storage partitioning is recommended to isolate the disks used for the service node booting from those to be used with the GPFS setup.

Software setup for the shared file system

Perform the necessary admin steps to assign the fibre channel I/O adapter slots to the selected xCAT SN octant/LPAR (the xCAT chvm command may be used to do this). The xCAT SN LPAR and serving CEC may need to be taken down to make I/O slot changes to the xCAT SN configuration.

Ensure that the SAN disks being used with the GPFS cluster are mapped to the assigned fibre channel adapters and can be seen on the target xCAT SNs.

Create shared file system on SNs (GPFS)

[PUNEET/BIN_-_PLEASE_ADD_DETAIL_HERE_AS_REQUIRED]

Recommendations for the GPFS setup

  • All the service nodes from a cluster/sub-cluster should typically belong to a GPFS cluster.
  • The GPFS cluster should be configured over the Ethernet interfaces on the service nodes.
  • Recommended block sizes for the GPFS file system in the setup are 1MB or 512KB.
  • Optionally the EMS can also be a part of this GPFS setup.

Layout of the file systems on the external disks:

  • There is only ONE COMMON /install file system across all service nodes in the cluster/sub-cluster. Optionally, this file system can also be available on the EMS, and written directly there.
  • There is only one file system for the statelite persistent files. To make NFS exports simple, this can be under the /install filesystem.
  • The paging spaces will need to be under /install/nim, for example /install/nim/paging. This directory should be configured such that it is not replicated at the GPFS level. This is for optimal use of the space in the GPFS file system as well as for performance reasons.

For now, mount the GPFS /install filesystem on a temporary mount point on the SNs, such as /install-gpfs. This will need to be changed to /install later in the process.

(Optional) Back up local /install on SNs

Since this process will remove all existing data from the local /install/nim directories on your service nodes, you may choose to make a backup copy of the /install filesystem at this time.

Migrate xCAT /install contents

The contents of the local /install filesystems on your SNs will need to be copied into the new shared GPFS /install-gpfs filesystem. You should NOT copy over the /install/nim directory -- this will need to be completely re-created in order to ensure NIM is configured correctly for each SN.

Most of the xCAT data should be identical for each local SN /install directory in your cluster. This includes sub-directories such as:

 /install/custom 
 /install/post 
 /install/prescripts 
 /install/postscripts

It will only be necessary to copy these subdirectories from one SN into /install-gpfs. Therefore, you can just log into one SN and use rsync to copy the files and directories to the shared file system.

   ssh <targetSN>
   rsync .......????...

Migrate statelite data

You must create the directory for your persistent statelite data in the /install-gpfs filesystem. e.g. from one SN:

  mkdir /install-gpfs/statelite_data

(Optional) At this time, you may choose to place an initial copy of your persistent data into the /install-gpfs filesystem. However, since the compute nodes in your cluster are currently running, they are still updating their persistent files, so you will need to resync this data again later after bringing down the cluster. Depending on the amount and stability of your persistent data, the subsequent rsync can take much less time and help reduce your cluster outage time.

Use rsync to do the initial copy from your current statelite directory. You should run this rsync from one SN at a time to copy data into the shared /install-gpfs filesystem. This will ensure that if you happen to have more than one SN that has a subdirectory for the same compute node, you will not run into collisions copying from multiple SNs at the same time. Make sure to use the rsync -u (update) option to ensure stale data from an older SN does not overwrite the data from an active SN.

Note: You do not need to worry about changing /etc/exports to correctly export your statelite directory since you are placing it in /install-gpfs. Later in the process, you will rename the filesystem to /install, and xCAT will add the correct /install export for NFSv4 replication to /etc/exports when mkdsklsnode runs.

Statelite setup

The statelite table will need to be set up so that each service node is the NFS server for its compute nodes. You should use the "$noderes.xcatmaster" substitution string instead of specifying the actual service node so that when xCAT changes the service node database values for the compute nodes during an snmove operation, this table will still have correct information. It should look something like:

_**#node,image,statemnt,mntopts,comments,disable**_
_**"compute",,"$noderes.xcatmaster:/install/statelite_data",,,**_
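
One way to set this entry is with the xCAT chtab command (the 'compute' nodegroup name is an example; note the single quotes, which keep the shell from expanding $noderes):

```text
chtab node=compute statelite.statemnt='$noderes.xcatmaster:/install/statelite_data'
```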

REMINDER: If you have an entry in your litefile table for persistent AIX logs, you MUST redirect your console log to another location, especially in this environment. The NFSv4 client replication failover support logs messages during failover, and if the console log location is in a persistent directory that is itself actively failing over, the failover can hang. If you have an entry in your litefile table similar to:

  tabdump litefile

   _**#image,file,options,comments,disable**_
   _**"ALL","/var/adm/ras/","persistent","for GPFS",**_

Be sure that you have a postscript that runs during node boot to redirect the console log:

_**/usr/sbin/swcons -p /tmp/conslog**_

(or some other local location)

For more information, see: [XCAT_AIX_Diskless_Nodes#Preserving_system_log_files]

Migrate non-xCAT /install contents

If you have any other non-xCAT data in your local /install filesystems, you will first need to determine if this data is identical across all service nodes, or if you will need to create a directory structure to support unique files for each SN. Based on that determination, copy the data into /install-gpfs as appropriate.

Configure the EMS

  • Verify that the following attributes and values are set in the xCAT site definition:

    nameservers="<xcatmaster>"
    domain=<domain_name> (this is required by NFSv4)
    useNFSv4onAIX="yes"
    sharedinstall="sns"

You could set these values using the following command:

_**chdef -t site nameservers="&lt;xcatmaster&gt;" domain=mycluster.com**_
_**useNFSv4onAIX="yes" sharedinstall="sns"**_
  • Verify that all required software and updates are installed.

[NEED_LIST]

Configure the SNs

Verify that all required software and updates are installed.

[NEED_LIST?]

You can use the updatenode command to update the SNs.

If you intend to define dump resources for your compute nodes then make sure you have installed the prerequisite software. See [XCAT_AIX_Diskless_Nodes#ISCSI_dump_support] for details.

NOTE: If any software changes you are making require you to reboot the service node, you may wish to postpone this work until you shutdown the cluster nodes later in the process.

Configure SN startup and shutdown

On each service node, the AIX OS startup order has to be changed to start GPFS before NFS. Edit /etc/inittab on each service node.

_**vi /etc/inittab**_

Move the call to /etc/rc.nfs to AFTER the start of GPFS, making sure GPFS is active before starting NFS.
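
After the edit, the relevant portion of /etc/inittab might look like the following (the gpfs entry shown here is illustrative - entry names and run levels vary by installation - while the rcnfs line is the stock AIX entry):

```text
gpfs:2:once:/usr/lpp/mmfs/bin/mmstartup > /dev/console 2>&1
rcnfs:23456789:wait:/etc/rc.nfs > /dev/console 2>&1 # Start NFS Daemons
```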

On each service node, the AIX OS shutdown order has to be changed to shutdown the NFS server before GPFS, so that NFS doesn't keep trying to serve files backed by GPFS. Add the following to /etc/rc.shutdown on each service node:

_**vi /etc/rc.shutdown and add:**_
_**stopsrc -s nfsd**_
_**exit 0**_

You may wish to keep copies of these files on the EMS and add them to synclists for your service nodes. Then, if you ever need to re-install your service nodes, these files will be updated correctly at that time.

Preparing OS Images

Update or create NIM installp_bundle resources

(There is nothing unique required in this step for HASN support)

Create or update NIM installp_bundle files that you wish to use with your osimages.

Also, if you are upgrading to a new version of xCAT, you should check any installp_bundles that you use that were provided as sample bundle files by xCAT. If these sample bundle files are updated in the new version of xCAT you should update your NIM installp_bundle files appropriately.

The list of bundle files you should have defined include:

  1. xCATaixCN71
  2. xCATaixHFIdd
  3. IBMhpc_base
  4. IBMhpc_all

To define a NIM installp_bundle resource you can run a command similar to the following:

_**nim -Fo define -t installp_bundle -a location=/install/nim/installp_bundle/xCATaixCN71.bnd**_
_**-a server=master xCATaixCN71**_

You can modify a bundle file by simply editing it. It does not have to be re-defined.

Convert all existing NIM images to NFSv4

If your cluster was setup with NFSv3 you will need to convert all existing NIM images to NFSv4. On the EMS, for each existing OS image definition, run:

_**mknimimage -u &lt;osimage_name&gt; nfs_vers=4**_

Create and/or update xCAT osimages

You will need to build images with the correct version of AIX, all of the required fixes for NFSv4 client replication failover support, and your desired HPC software stack. You can use existing xCAT osimage definitions or you can create new ones using the xCAT mknimimage command.

To create a new osimage you could run a command similar to the following:

_**mknimimage -V -r -s /myimages -t diskless &lt;osimage name&gt;**_
_**installp_bundle="xCATaixCN71,xCATaixHFIdd,IBMhpc_base,IBMhpc_all"**_

Updating the lpp_source

Whether you are using an existing lpp_source or you created a new one you must make sure you copy any new software prerequisites or updates to the NIM lpp_source resource for the osimage.

The easiest way to do this is to use the "nim -o update" command.

For example, to copy all software from the /tmp/myimages directory you could run the following command.

_**nim -o update -a packages=all -a source=/tmp/myimages &lt;lpp_source name&gt;**_

This command will automatically copy installp, rpm, and emgr packages to the correct location in the lpp_source subdirectories.

Once you have copied all your software to the lpp_source, it would be good to run the following two commands.

_**nim -Fo check &lt;lpp_source name&gt;**_

And:

_**chkosimage -V &lt;spot name&gt;**_

See chkosimage for details.

Updating the spot

You can use the xCAT mknimimage, xcatchroot, or xdsh commands to update the spot software on the EMS.

For example, to install the HPCstnfs.111202.epkg.Z ifix you could run the following command.

_**mknimimage -V -u &lt;spot name&gt; otherpkgs="E:HPCstnfs.111202.epkg.Z"**_

Check the spot.

_**nim -Fo check &lt;spot name&gt;**_

Verify that the ifixes are applied to the spot.

_**xcatchroot -i &lt;spot name&gt; "emgr -l"**_

Special handling for dump and paging resources

Dump resource

Due to current NIM limitations a dump resource cannot be created in the shared file system.

If you wish to define a dump resource to be included in an osimage definition you must use NIM directly to create the resource in a separate local file system on the EMS. (For example /export/nim.)

Once the dump resource is created you can add its name to your osimage definition.

_**chdef -t osimage -o &lt;osimage name&gt; dump=&lt;dump res name&gt;**_

When the mkdsklsnode command creates the resources on the SNs it will create the dump resources in a local filesystem with the same name, e.g. /export/nim. If you want these directories to exist in filesystems on the external storage subsystem, you will need to create those filesystems and have them available on each SN before running the mkdsklsnode command.

Paging resource

[TBD_-_this_section_will_be_expanded_once_the_paging_failover_support_becomes_available]

On one SN, create the paging files for all of the compute nodes in your cluster in the shared /install filesystem. For example, to create 128G of swap space for each node do:

  mkdir /install/paging
  # For each compute node:
  mkdir /install/paging/<compute node>
  dd if=/dev/zero of=/install/paging/<node>/swapnfs1 bs=1024k count=65536
  dd if=/dev/zero of=/install/paging/<node>/swapnfs2 bs=1024k count=65536
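
For a large cluster, the per-node commands above can be wrapped in a loop, sketched below (PAGING_ROOT, NODES, and BLOCKS are illustrative parameters; BLOCKS=65536 with the 1MB block size gives the 64GB-per-file example above):

```shell
# Create two NFS paging files for each compute node under the shared
# filesystem. Run this on one SN only.
PAGING_ROOT=${PAGING_ROOT:-/install/paging}
NODES=${NODES:-}          # space-separated compute node names, e.g. "c1 c2"
BLOCKS=${BLOCKS:-65536}   # number of 1MB blocks per paging file (65536 = 64GB)
for n in $NODES; do
    mkdir -p "$PAGING_ROOT/$n"
    dd if=/dev/zero of="$PAGING_ROOT/$n/swapnfs1" bs=1024k count="$BLOCKS"
    dd if=/dev/zero of="$PAGING_ROOT/$n/swapnfs2" bs=1024k count="$BLOCKS"
done
```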

Set up a new postscript to run on the compute node to activate that paging space with replication/failover support and disable the default swapnfs0:

  #!/bin/sh

  # Remove any NFS paging devices left over from a previous boot
  rmps swapnfs1
  rmps swapnfs2

  # Create NFS paging spaces in the shared filesystem; the ':fur' suffix
  # requests replication/failover support
  mkps -t nfs $MASTER /install/paging/`hostname -s`/swapnfs1:fur
  mkps -t nfs $MASTER /install/paging/`hostname -s`/swapnfs2:fur

  # Activate the new paging spaces
  swapon /dev/swapnfs1
  swapon /dev/swapnfs2

  # Deactivate and remove the default paging space
  swapoff /dev/swapnfs0
  rmps swapnfs0

NOTE: The paging space failover support is NOT available yet, so if a diskless node is paging during failover, the paging activity will hang. Also, the flags ':fur' are specific for failover support. If you are setting up paging in preparation for this future function, use the flags ':wam' instead.

Installing cluster nodes

Create compute node groups for primary service nodes

Create node groups for each primary SN

[How_are_we_assigning_nodes_to_primary_and_backup_SNs????]

Update xCAT node definitions

  • Add new postscript setupnfsv4replication.

The following example assumes you are using a 'compute' nodegroup entry in your xCAT postscripts table.

_**chdef -t group compute -p postscripts=setupnfsv4replication**_
  • Set primary and backup SNs in node definition.

The "servicenode" attribute values must be the names of the service nodes as they are known by the EMS. The "xcatmaster" attribute value must be the name of the primary server as known by the nodes.

_**chdef -t node -o &lt;SNgroupname&gt; servicenode=&lt;primarySN&gt;,&lt;backupSN&gt; xcatmaster=&lt;nodeprimarySN&gt;**_

Update postscripts and prescripts - (optional)

????? need postscript for creating additional paging???

What others????

[TBD]

Shut down the cluster nodes

In the following example, "compute" is the name of an xCAT node group containing all the cluster compute nodes.

_**xdsh compute "/usr/sbin/shutdown -F &"**_

Remove the NIM client definitions from the SNs

The following command will remove all the NIM client definitions from both primary and backup service nodes. See the rmdsklsnode man page for additional details.

_**rmdsklsnode -V -f compute**_

Remove NIM resources from the SNs

The existing NIM resources need to be removed on each service node. (With the original /install filesystem still in place.)

In the following example, "service" is the name of the xCAT node group containing all the xCAT service nodes, and "<osimagename>" should be substituted with the actual name of an xCAT osimage object.

_**rmnimimage -V -f -d -s service &lt;osimagename&gt;**_

See rmnimimage for additional details.

When this command is complete, it would be good to check the service nodes to make sure there are no other NIM resources still defined. For each service node (or from the EMS with 'xdsh service'), run lsnim to list any NIM resources that remain. Remove any leftover resources that are no longer needed (you should NOT remove basic NIM resources such as master, network, etc.).

Clean up the NFS exports

On each service node, clean up the NFS exports.

  1. Edit /etc/exports and remove all entries related to /install. The xCAT mkdsklsnode command will create new entries for NFSv4 replication when it is run later in this process.
  2. If your statelite persistent directory will not be located in the shared /install GPFS filesystem (we strongly recommend that it IS located in shared /install), edit /etc/exports and add an NFSv4 entry specifying the correct replica server.
  3. Re-do the exports

    exportfs -ua
    exportfs -a (if there are any entries left in /etc/exports)

Migrate statelite data

Use rsync to copy all the persistent data from your current statelite directory. Even if you did an initial copy earlier in the process, you will need to do this again now to pick up any changes that have been written since then. You should run this rsync from one SN at a time to copy data into the shared /install-gpfs filesystem. This will ensure that if you happen to have more than one SN that has a subdirectory for the same compute node, you will not run into collisions copying from multiple SNs at the same time. Make sure to use the rsync -u (update) option to ensure stale data from an older SN does not overwrite the data from an active SN.
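A sketch of this copy, run from one SN at a time (the paths are illustrative; adjust them to your statelite layout):

```shell
# On each SN in turn, merge its local statelite data into the shared
# GPFS filesystem. -a preserves permissions/ownership/timestamps;
# -u (update) skips files that are already newer at the destination,
# so stale data from an older SN cannot overwrite data from an active SN.
rsync -auv /install/statelite/ /install-gpfs/statelite/
```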

Switch to shared GPFS /install directory

On each service node, deactivate (in whatever way you choose: rename, overmount, etc.) the local /install filesystem. Change the mount point for your shared GPFS /install-gpfs filesystem to /install. Depending on how the old local /install filesystem was originally created, this may also require updates to /etc/filesystems.

[NEED_EXAMPLES]
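One possible sequence, as a sketch only (the GPFS device name "installfs" is an assumption, and your /etc/filesystems updates may differ):

```shell
# Deactivate the old local /install (here by renaming its mount point)
umount /install
mv /install /install.local
mkdir /install

# Change the default mount point of the shared GPFS filesystem and mount it
# ("installfs" is an illustrative GPFS device name)
mmchfs installfs -T /install
mmmount installfs
```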

Additional SN software updates

If you postponed updating software on your service nodes because of required reboots, you should apply that software now and reboot the SNs.

After the SNs come back up, make sure that the admin GPFS cluster is running and NFS has started correctly.

Run mkdsklsnode

Make sure /etc/exports on each service node does not contain any stale entries. If it does, remove them and run:

_**exportfs -ua**_

When using a shared file system across the SNs you must run the mkdsklsnode command on the backup SNs first and then run it for the primary SNs.

This is necessary since there are some install-related files that are server-specific. The server that is configured last is the one the node will boot from first.

mkdsklsnode for backup SNs
_**mkdsklsnode -V -S -b -i &lt;osimage name&gt;  &lt;noderange&gt;**_

Use the -S flag to setup the NFSv4 replication settings on the SNs.

If you are using a dump resource, you can specify the type of dump to be collected from the client. The valid values are "selective", "full", and "none". If the configdump attribute is set to "full" or "selective", the client will automatically be configured to dump to an iSCSI target device. A "selective" memory dump avoids dumping user data; a "full" memory dump dumps all the memory of the client partition. Selective and full memory dumps are stored in a subdirectory of the dump resource allocated to the client. This attribute is saved in the xCAT osimage definition.

For example:

_**mkdsklsnode -V -S -b -i &lt;osimage name&gt;  &lt;noderange&gt; configdump=selective**_

To verify the setup on the SNs you could use xdsh to run the lsnim command on the SNs.

To check for the resource and node definitions you could run:

_**xdsh &lt;SN name&gt; "lsnim"**_

To get the details of a NIM client definition you could run:

_**xdsh &lt;SN name&gt; "lsnim -l &lt;nim client name&gt;"**_
mkdsklsnode for primary SNs

To set up the primary service nodes run the same command you just ran on the backup SNs only use the "-p" option instead of the "-b" option.

_**mkdsklsnode -V -S -p -i &lt;osimage name&gt;  &lt;noderange&gt;**_

Verify the NFSv4 replication setup

Verify the NFSv4 replication is exported correctly for your service node pairs:

_**xdsh service cat /etc/exports | xcoll**_


====================================
c250f10c12ap01
====================================
/install -replicas=/install@20.10.12.1:/install@20.10.12.17,vers=4,rw,noauto,root=*
====================================
c250f10c12ap17
====================================
/install -replicas=/install@20.10.12.17:/install@20.10.12.1,vers=4,rw,noauto,root=*

Boot nodes

_**rbootseq compute hfi**_
_**rpower compute on**_

Verify node setup and function

If you specified a dump resource you can check if the primary dump device has been set on the node by running:

_**xdsh &lt;node&gt; "sysdumpdev"**_

Verify the NFSv4 replication is configured correctly on a compute node:

_**xdsh &lt;node&gt; nfs4cl showfs**_


_**xdsh &lt;node&gt; nfs4cl showfs /usr**_

[??????]

_Simple test of NFSv4 client replication failover:_

    s1 = service node 1
    s2 = service node 2
    c1 = all compute nodes managed by s1, backup s2
    c2 = all compute nodes managed by s2, backup s1

    xdsh c1,c2 nfs4cl showfs | xcoll
        (should show c1 filesystems served by s1 and c2 filesystems served by s2)
    xdsh s1 stopsrc -s nfsd
    xdsh c1,c2 ls /usr | xcoll
    xdsh c1,c2 nfs4cl showfs | xcoll
        (should now show all nodes getting /usr from s2; depending on NFS caching, it may take additional activity on the c1 nodes before all filesystems fail over to s2)


**TESTING NOTE:** At this point, you can restart NFS on s1. You can continue testing by shutting down NFS on s2 and watching all nodes fail over to s1. Once NFS is back up on both service nodes, the clients should, over time, switch back to using their primary server.

SN failover process

Discover a primary SN failure

https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Monitor_and_Recover_Service_Nodes#Monitoring_Service_Nodes

Move nodes to the backup SN

The nodes will continue running after the primary service node goes down, however, you should move the nodes to the backup SN as soon as possible.

Use the xCAT snmove command to move a set of nodes to the backup service node.

Note: Since we have already run mkdsklsnode on the backup SN we know that the NIM resources have been defined and nodes initialized.

In the case where a primary SN fails you can run the snmove command with the node group you created for this SN. For example, if the name of the node group is "SN27group" then you could run the following command:

_**snmove -V SN27group**_

You can also specify scripts to be run on the nodes by using the "-P" option.

_**snmove -V SN27group -P myscript**_

(Make sure the script has been added to the /install/postscripts directory and that it has the correct permissions.)
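For example, a minimal postscript sketch (the script name and its contents are purely illustrative):

```shell
#!/bin/sh
# /install/postscripts/myscript -- illustrative example only.
# Record which service node this client now points to, using the
# server entry that xCAT maintains in /etc/xcatinfo.
logger -t snmove "new service node info: $(cat /etc/xcatinfo)"
```

Remember to make the script executable (e.g. chmod 755) before running snmove.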

Verify the move

The snmove command performs several steps to both keep the nodes running and to prepare the backup service node for the next time the nodes need to be booted.

This includes the following:

  1. Update the "servicenode" and "xcatmaster" attributes in xCAT node definitions.
  2. Update the xCAT statelite tables in the case where the failed SN is specifically mentioned.
  3. Re-do the xCAT statelite files and copy them to the new SN. (These files are used internally by xCAT and will be needed for when the nodes are next booted.)
  4. Restores server-specific system configuration files to the .client_data directory in the shared_root resource. These are files that NIM will use the next time the node is booted. [(TBD_-_insert_description_of_client_data_files.)]
  5. Re-target the primary dump device on the nodes to the new SN, if a dump resource has been allocated. This means that future system dumps will go to the new SN.
  6. Update the /etc/xcatinfo file on the nodes to point to the new SN.
  7. Update the default gateway on the nodes.
  8. Run specified scripts on the nodes. (See the man page for details.)

You can verify some of these steps by running the following commands.

  • Check if the node definitions have been modified:

    lsdef <noderange>

  • Check the primary dump device on the nodes.

    xdsh <noderange> "/bin/sysdumpdev"

Make sure the primary dump device has been reset.

  • Check the default gateway.

    xdsh <noderange> "/bin/netstat -rn"

  • Check the contents of the /etc/xcatinfo file.

    xdsh <noderange> "/bin/cat /etc/xcatinfo"

See if the server is the name of the new SN.

Reboot nodes - (optional)

The nodes should continue running after the primary SN goes down; however, it is advisable to reboot them as soon as possible.

When the nodes are rebooted they will automatically boot from the new SN.

_**xdsh compute "shutdown -F &"**_
_**rpower compute on**_

Reverting to the primary service node

The process for switching nodes back will depend on what must be done to recover the original service node. Essentially the SN must have all the NIM resources and definitions restored and operations completed before you can use it.

If you are using the xCAT statelite support then you must make sure you have the latest files and directories copied over and that you make any necessary changes to the statelite and/or litetree tables.

If all the configuration is still intact you can simply use the snmove command to switch the nodes back.

If the configuration must be restored then you will have to run the mkdsklsnode command. This command will re-configure the SN using the common osimages defined on the xCAT management node.

Remember that this SN would now be considered the backup SN, so when you run mkdsklsnode you need to use the "-b" option.

Once the SN is ready you can run the snmove command to switch the node definitions to point to it. For example, to move all the nodes in the "SN27group" back to the original SN you could run the following command.

_**snmove -V SN27group**_

The next time you reboot the nodes they will boot from the original SN.

Working in a HASN environment

Removing NIM client definitions

  • must run rmdsklsnode on primary first and then the backup SN

Removing old NIM resources

  • Must run rmnimimage for one SN first then run it for the rest.

Setting up the Teal GPFS monitoring utility node

Because Teal GPFS monitoring must run in the application GPFS cluster, we must move the teal.gpfs-sn package (this name may change) off the service node to a utility node that is in the application GPFS cluster with the compute nodes. Since the teal package requires access to the database server, we will also install and configure a new db2driver package that has a minimal DB2 client that can run the required ODBC interface on a diskless node.

Software prereqs

You will have to obtain the db2driver code and the level of the teal.gpfs-sn code from IBM that supports this function. The db2driver code is available at the following location on IBM Fix Central to anyone holding the HPC DB2 license.

http://www-933.ibm.com/support/fixcentral/swg/selectFixes?parent=ibm/Information+Management&product=ibm/Information+Management/IBM+Data+Server+Client+Packages&release=9.7.&platform=All&function=fixId&fixids=-dsdriver-*FP005&includeSupersedes=0

The following DB2 driver software package was tested and works with the DB2 9.7.4 or 9.7.5 WSER Server code.

v9.7fp5_aix64_dsdriver.tar.gz

You will need to obtain the appropriate level of TEAL. Only the teal.gpfs-sn lpp is required on the node.

Setup DB2 Data Server Client code on the EMS

We will configure the Data Server Client in the /db2client directory on the EMS machine. We will use this setup to update the image for the utility node that will run it.

**Get and install unzip, if not already available**

The Data Server Client code requires unzip. Make sure it is available before continuing.

On AIX, get unzip from the AIX Toolbox for Linux Applications, if not already available. For diskful nodes:

    rpm -i unzip-5.51-1.aix5.1.ppc.rpm

For AIX diskless nodes, unzip needs to be added to the statelite image.

**Extract the Data Server Client code on the EMS**

    mkdir /db2client
    cd /db2client
    cp ..../v9.7fp5_aix64_dsdriver.tar.gz .
    gunzip v9.7fp5_aix64_dsdriver.tar.gz
    tar -xvf v9.7fp5_aix64_dsdriver.tar

**Set up the Data Server Client environment**

Set the path to the Data Server Client code. You should add these exports to your .profile on AIX. (Linux TBD.)

    export PATH=/db2client/dsdriver/bin:$PATH
    export LIBPATH=/db2client/dsdriver/lib:$LIBPATH

**Install the driver**

The installDSdriver script automatically sets up only the 64-bit driver; the 32-bit driver must be extracted manually.

    cd /db2client/dsdriver
    ./installDSdriver
    cd odbc_cli_driver
    cd *32
    uncompress *.tar.Z
    tar -xvf *.tar
**Fix directory and file owner/group**

Note: some sub-directories in the downloaded package were not set to the bin owner and bin group. To be sure, do the following:

    cd /db2client
    chown -R bin *
    chgrp -R bin *

**Create a shared library on the 32-bit path (AIX)**

cd /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/lib
ar -x libdb2.a
mv shr.o libdb2.so

Configure DB2 Data Server Client

The DB2 Data Server Client has several configuration files that must be setup.

db2dsdriver.cfg

The db2dsdriver.cfg configuration file contains database directory information and client configuration parameters in a human-readable format.

The db2dsdriver.cfg configuration file is an XML file based on the db2dsdriver.xsd schema definition file. It contains various keywords and values that can be used to enable features for a supported database through ODBC, CLI, .NET, OLE DB, PHP, or Ruby applications. The keywords can be associated globally for all database connections, or with a specific data source name (DSN) or database connection.

cd /db2client/dsdriver/cfg
cp db2dsdriver.cfg.sample  db2dsdriver.cfg
chmod 755 db2dsdriver.cfg
vi db2dsdriver.cfg

Here is a sample setup for a node accessing the xcatdb database on the Management Node p7saixmn1.p7sim.com

&lt;configuration&gt;
  &lt;dsncollection&gt;
    &lt;dsn alias="xcatdb" name="xcatdb" host="p7saixmn1.p7sim.com" port="50001"/&gt;
  &lt;/dsncollection&gt;
  &lt;databases&gt;
     &lt;database name="xcatdb" host="p7saixmn1.p7sim.com" port="50001"&gt;
     &lt;/database&gt;
  &lt;/databases&gt;
&lt;/configuration&gt;
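If your driver level includes the db2cli validate option (an assumption; it is present in recent 9.7 fix packs), you could sanity-check this configuration from the client:

```shell
# Parse db2dsdriver.cfg and report the settings seen for the xcatdb DSN
db2cli validate -dsn xcatdb
```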
db2cli.ini

The CLI/ODBC initialization file (db2cli.ini) contains various keywords and values that can be used to configure the behavior of CLI and the applications using it.

The keywords are associated with the database alias name, and affect all CLI and ODBC applications that access the database.

cd /db2client/dsdriver/cfg
cp db2cli.ini.sample db2cli.ini
chmod 0600 db2cli.ini

Here is a sample db2cli.ini file containing the information needed to access the xcatdb database, using instance xcatdb and password cluster. Note this file should only be readable by root.

[xcatdb]
uid=xcatdb
pwd=cluster

For 32 bit, copy the /db2client/dsdriver/cfg files to /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/cfg

cd /db2client/dsdriver/cfg
cp db2cli.ini /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/cfg
cp db2dsdriver.cfg /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/cfg
Using unixODBC

The unixODBC files are still needed. The following are sample configurations:

cat /etc/odbc.ini
[xcatdb]
Driver   = DB2
DATABASE = xcatdb


cat /etc/odbcinst.ini
[DB2]
Description =  DB2 Driver
Driver   = /db2client/dsdriver/odbc_cli_driver/aix32/clidriver/lib/libdb2.so
FileUsage = 1
DontDLClose = 1
Threading = 0
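With these files in place, you could verify connectivity using unixODBC's isql tool (a sketch; the DSN, user, and password match the samples above):

```shell
# Connect to the xcatdb DSN defined in /etc/odbc.ini using the sample
# credentials from db2cli.ini; "quit" exits the interactive prompt.
echo "quit" | isql xcatdb xcatdb cluster
```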

Build the teal-gpfs diskless image

We create a new diskless image for the teal.gpfs-sn node. Here is a sample bundle file:

# sample bundle file for teal-gpfs utility node
I:rpm.rte
I:openssl.base
I:openssl.license
I:openssh.base
I:openssh.man.en_US
I:openssh.msg.en_US
I:gpfs.base
I:gpfs.gnr
I:gpfs.msg.en_US
I:rsct.core.sensorrm
I:teal.gpfs-sn
# RPMs
R:popt*
R:rsync*
# using Perl 5.10.1
R:perl-Net_SSLeay.pm-1.30-3*
R:perl-IO-Socket-SSL*
R:unixODBC*
# unzip is optional since we are setting up db2driver on the EMS
R:unzip*

With this additional bundle file, build the diskless image for the teal-gpfs utility node.

Copy db2driver and odbc config files into the image

Copy the /db2client directory into the image.

Copy /etc/odbc.ini into the image.

Copy /etc/odbcinst.ini into the image.

Copy the db2cli.ini file into the image.

Copy /etc/xcat/cfgloc into the image.

Configure Network for IP forwarding

Configure IP forwarding so that the utility node can access the DB2 server on the EMS.
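On AIX this could look like the following sketch, run on the node that routes between the utility node and the EMS (the assumption that a service node acts as the router is illustrative):

```shell
# Enable IP forwarding on the routing node (takes effect immediately)
no -o ipforwarding=1

# Make the setting persistent across reboots
no -p -o ipforwarding=1
```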

