{{:Design Warning}}
AIX diskless nodes depend on their service nodes for many services: bootp, tftp, default gateway, name serving, NTP, etc. The most significant service is NFS, which the nodes use to access OS files, statelite data, and paging space. Since providing HA NFS service for a large number of nodes is not trivial, there is a spectrum of solutions that range from partial HA to complete HA. The complexity of setting up and maintaining each solution grows with the level of HA it provides. So we will give a summary of the choices so you can pick the one that best fits your HA and maintenance goals. They are listed from least HA to most HA.
The most basic capability is that xCAT makes it easy to manually move compute nodes to another service node (SN) when their SN fails. Without any other provisions, this means that the compute nodes serviced by the failed SN go down, and they are rebooted after they are moved to the other SN. This is achieved by the following xCAT commands:
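As a sketch of this manual move, the sequence below uses the xCAT `snmove` and `rnetboot` commands. The group name `sn1nodes` is an assumption for illustration, and exact flags may vary by xCAT release; check the man pages on your management server.

```
# Move all compute nodes currently served by the failed sn1
# to their backup service node (flags may vary by xCAT release)
snmove -s sn1

# Reboot the moved nodes so they boot from the new SN
# ("sn1nodes" is an assumed group name for sn1's compute nodes)
rnetboot sn1nodes
```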
See Setting Up an AIX Hierarchical Cluster for details. NIM should be configured to use NFSv4 by setting the site table attribute useNFSv4onAIX. This positions the nodes to use NFSv4 client replication failover when you are ready, and also allows for large (up to 64 GB) paging spaces. For additional details about using NFSv4 with NIM, see the AIX NIM NFSv4 support mini-design.
As an incremental improvement to this manual approach, the SNs can be FC-connected to 2 DS3524 external disks which contain the node images, statelite files, and paging space. This increases the disk capacity of each SN and improves write performance. The recommended organization and layout of the data on the external disk is:
The 2nd LV/file system will hold the readwrite/statelite files for the compute nodes served by this SN. It should be mirrored, and named and mounted at a location that relates to the SN name. For example, name/mount the LV on the 1st SN at /nodedata1. This gives you the option, in the case of an SN failure, to mount its statelite LV/file system on the SN that is taking over, without having to copy the statelite data there. (In this case, the snmove command will need a new flag to tell it not to sync the statelite files to the backup SN, because they will already be there.) Each set of compute nodes can be pointed to a different statelite location by defining groups for each set of nodes served by an SN, and then filling in the statelite table with something like:
"sn1nodes",,"$noderes.xcatmaster:/nodedata1","soft,timeo=30",,
"sn2nodes",,"$noderes.xcatmaster:/nodedata2","soft,timeo=30",,
Using the variable $noderes.xcatmaster means the node will mount its statelite files from the correct SN, even if snmove has been used to move the node to another SN. But hardcoding /nodedata1 for all of the nodes that are originally served by sn1 means that even if those nodes are moved over to sn2, they will still mount their statelite files from /nodedata1 on sn2 (which sn2 can mount from the external disk).
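A minimal sketch of creating the mirrored statelite LV/file system for the 1st SN, assuming the external-disk volume group is named extvg (the VG name, LV name, and size are assumptions for illustration):

```
# Create a mirrored (-c 2) jfs2 LV of 64 logical partitions
# in the external-disk volume group "extvg" (names assumed)
mklv -y nodedata1lv -t jfs2 -c 2 extvg 64

# Create the file system on it and mount it at /nodedata1
crfs -v jfs2 -d nodedata1lv -m /nodedata1 -A yes
mount /nodedata1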
The 3rd LV/file system will hold the paging spaces for the compute nodes served by this SN. Each node will have two 64 GB paging files created by a postscript (in addition to the small one created by NIM during boot time). This should not be mirrored, to improve write performance, and should be mounted at the same location on each SN. This is because xCAT associates the NIM resource for the paging space with the OS image, not the node. Assuming all of your compute nodes are running the same image, the compute nodes from sn1 will expect the paging files at the same path as the compute nodes from sn2. Also, it is easiest to mount the paging LV/file system at /install/nim/paging, because mknimimage wants to put all resource files under the same top level directory. These choices also position you well for the next solution using GPFS. The downside of this approach is that until you use GPFS to share a single paging file system, each SN has to allocate enough paging space on its own LV/file system for its own compute nodes plus nodes that can be failed over to it. This means you need to allocate 2 * n * p amount of paging space, where n is the number of SNs and p is the amount of paging space 1 set of compute nodes needs. This amount could be reduced to (n+1) * p by leaving unallocated on the shared disk physical partitions equal to p, and then, when an SN fails, assigning those unallocated physical partitions to the SN that takes over for it.
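To make the sizing trade-off concrete, the short shell sketch below works through the 2 * n * p versus (n+1) * p arithmetic for an assumed cluster of 4 SNs where each set of compute nodes needs 128 GB of paging space (both numbers are illustrative assumptions):

```shell
# Illustrative paging-space sizing (n and p are assumed values)
n=4        # number of service nodes
p=128      # GB of paging space one SN's set of compute nodes needs

# Naive layout: every SN pre-allocates space for its own nodes
# plus one failed-over set -> 2 * n * p
naive=$((2 * n * p))

# Keeping p worth of physical partitions unallocated on the shared
# disk and assigning them only at takeover time -> (n + 1) * p
improved=$(( (n + 1) * p ))

echo "naive: ${naive} GB, improved: ${improved} GB"
# With n=4 and p=128 this prints: naive: 1024 GB, improved: 640 GB
```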
At a high level, the recommended steps to set up the SNs with external disks in this way are:
When, for example, sn1 fails, and you want to move its compute nodes to sn2:
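The takeover can be sketched as follows, assuming sn1's statelite LV lives in a volume group named extvg on the shared external disk (the VG and group names are assumptions, and exact snmove flags may vary by xCAT release):

```
# On sn2: bring sn1's statelite file system online from the shared disk
varyonvg extvg
mount /nodedata1

# Point sn1's compute nodes at their backup SN and reboot them
snmove -s sn1
rnetboot sn1nodes
```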
As another incremental improvement to this manual approach, the readonly OS files (STNFS and the /usr file system) can easily be failed over to the secondary SN in real time using NFSv4 replication. Since mkdsklsnode has already copied the OS image to both SNs, and it is not changed by the compute nodes, NFSv4 client failover is sufficient to allow the compute nodes to keep accessing the OS files. This can be configured automatically by adding the --setuphanfs flag to mkdsklsnode.
In this approach, neither the statelite files nor the paging space is failed over in real time. If the paging space is currently not being used (i.e. you haven't exhausted real memory on the compute nodes), the nodes can continue to run even though the NFS server for the paging space (the primary SN) is not available. The effect of the statelite files not failing over in real time depends on what you define as statelite files for your compute nodes. If they are just a few simple log files, you have the option to tell xCAT to mount the statelite files with the soft option, which will cause the writes to fail when the primary SN goes down, but not hang those processes on the compute nodes.
Note: there is currently a question about whether AIX writes to the paging space even before real memory is exhausted. There are apparently several complex conditions in which AIX will write to paging space before real memory is fully used up. This is being investigated.
In this approach, the SNs are all FC-connected to 2 DS3524 external disks which contain the node images, statelite files, and paging space. The file systems containing this data are mounted on the SNs. The statelite files are in a GPFS file system, and GPFS coordinates all updates to it from each SN. From the nodes, both readonly and statelite files are failed over using the NFSv4 client replication failover feature.
In this approach, it is best to also have the EMS connected to the external disks, so that it can write image files directly into /install on the external disks. (Although this is not an absolute requirement.) Then mkdsklsnode doesn't have to copy the OS files to any of the SNs, since they will already be on the external disks for all SNs to see. (Use the new -d mkdsklsnode flag for this.) mkdsklsnode still has to create the NIM resources on each SN (because NIM definitions are stored in the ODM and not in /install). It is not necessary to sync the statelite files between SNs, because they all see the same copy of them. The client is configured for NFSv4 client fail over using the mkdsklsnode --setuphanfs flag and the setupnfsv4replication postscript. For more details about this approach, see the HA NFS on AIX service nodes mini-design. Some additional information can be found in the HA Service Nodes for AIX mini-design.
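Using the flags named above, the mkdsklsnode invocation might look like the sketch below. The osimage name 71image and node group compute are assumptions for illustration; check your xCAT release for the exact flag names.

```
# -d           : don't copy OS files to the SNs; /install is already
#                on the shared external disks that all SNs can see
# --setuphanfs : configure NFSv4 replication failover for the clients
mkdsklsnode -d --setuphanfs -i 71image compute
```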
The layout of the file systems on the external disks will be similar to what is described in the previous section, but with the following differences:
There only needs to be one file system for the statelite persistent files. So instead of having /nodedata1, /nodedata2, etc., there can be a single file system /nodedata (or if you want to keep it in the /install file system, /install/nodedata) that has a subdirectory for each node. This way you don't need a separate entry in the statelite table for each SN's set of nodes. Instead, the statelite table can look like this:
"sn1nodes",,"$noderes.xcatmaster:/nodedata",,,
The paging spaces and dump files can be under /install (as described in the previous section), for example /install/nim/paging and /install/nim/dump, respectively. But the difference from the previous section is that you only have 1 copy of these directories, shared by all SNs, instead of a separate copy for each SN. These directories should be configured in GPFS to not be mirrored, for performance reasons.
There are a few components that normally run on the service nodes that under certain circumstances need access to the application GPFS cluster. Since a node can't be (directly) in 2 GPFS clusters at once, some changes in the placement or configuration of these components must be made, now that the SNs are in their own GPFS cluster. The components that can be affected by this are:
There are a few different ways to satisfy these requirements:
With this approach, a separate set of 3 Linux servers is set up with GPFS/CNFS to provide an HA NFS service. Redundant routing is set up from the HFI network, through the SNs, to the LAN the CNFS service is on, using AIX's dead gateway detection capability. The xCAT SNs network boot the compute nodes, pointing them to the CNFS service. (The admin sets the noderes.nfsserver attribute for the compute nodes.) From the nodes, both readonly and statelite files are always available without the need for the NFSv4 client replication feature.
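Setting the noderes.nfsserver attribute can be sketched with the xCAT chdef command; the group name compute and the service address cnfs are assumptions for illustration:

```
# Point the compute nodes at the CNFS service address
# ("compute" group and "cnfs" hostname are assumed)
chdef -t node -o compute nfsserver=cnfs
```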
A few extra steps are needed when copying new images to the SNs to get the OS, statelite, and paging files on the CNFS server. See External NFS Server Support With AIX Stateless And Statelite for details. Some additional information can be found in the HA Service Nodes for AIX - using External NFS Server mini-design.
Note: currently AIX paging space only supports NFSv2 or NFSv4, whereas CNFS only supports NFSv2 or NFSv3. That only leaves NFSv2 in common, and NFSv2 has a 2 GB file size limit, which is not big enough for paging space. This means that to store your AIX paging in CNFS, you'll have to create multiple paging spaces. NIM will create one during node boot, then you can add more using the mkps command in a postscript.
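A postscript adding an extra NFS paging space might look like the sketch below. The server name cnfs and the per-node path under /install/nim/paging are assumptions; $NODE is the standard xCAT postscript environment variable. Check the mkps man page on your AIX level for the exact NFS syntax.

```
# Activate (-a) an additional NFS paging space now and at boot (-n),
# backed by a per-node file exported by the CNFS service
# (server name and path are assumed for illustration)
mkps -a -n -t nfs cnfs /install/nim/paging/$NODE/pagefile2
```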
In this approach, the SNs are FC-attached to 2 DS3524 external disks and are all configured in a Power HA cluster. Power HA (version 7.1.1) will provide takeover function for all 3 categories of NFS: readonly OS image files, readwrite statelite persistent files, and paging space files. The HA NFS component of Power HA will be used. This helps with configuring the resource groups more easily. For the admin, it is mostly a process of stating the file systems to export. Mutual takeover can be used so that all service nodes can be active (i.e. no cold standby SNs needed). NFSv4 will be used for its improved security and recovery as compared to NFSv2/3. Through Power HA's IP takeover and handling of NFS locks and dup cache, the failover is transparent to the NFS clients, except for a delay during the failover.
Some Power HA documentation for further details:
The following is old but has good information. It does not match the current interface.
The LV/file system layout on the external disks will be similar to what is described in the Manual Service Node Fail Over section, with a few differences:
This solution is still being investigated.