Note:
This documentation illustrates how to set up a second management node, or standby management node, in your cluster to provide high availability management capability when no shared disks are configured between the two management nodes. If DB2 is used in your cluster, this documentation applies only to xCAT 2.5 or newer releases.
When the primary xCAT management node fails, the administrator can easily have the standby management node take over the role of the management node, avoiding long periods during which the cluster has no active cluster management function available.
The xCAT high availability management node (HAMN) feature is not designed for automatic setup or automatic failover. This documentation describes how to synchronize various data between the primary management node and standby management node automatically, and how to perform some manual steps to have the standby management node take over the management node role when failures occur on the primary management node. However, a high availability application such as IBM Tivoli System Automation (TSA) can be used to achieve automatic failover; this documentation also describes how to configure HAMN with TSA to perform automatic failover.
The primary management node will be taken down during the failover process, so any NFS mounts or other network connections from the compute nodes to the management node should be temporarily disconnected during the failover. If network connectivity is required for compute node run-time operations, you should consider some other way to provide high availability for those network services, unless the compute nodes can also be taken down during the failover. This also implies:
1\. This HAMN approach is primarily intended for clusters in which the management node manages diskful nodes or Linux stateless nodes. This also includes hierarchical clusters in which the management node only directly manages the diskful or Linux stateless service nodes, and the compute nodes managed by the service nodes can be of any type.
2\. This documentation is **not** primarily intended for clusters in which the nodes directly managed by the management node are Linux statelite or AIX diskless nodes, because those nodes depend on the management node being up to run their operating systems over NFS. But if the nodes use only read-only NFS mounts from the management node, then you can use this document as long as you recognize that your nodes will go down while you are failing over to the standby management node.
Note: If you are using twin-tailed shared disks between the primary management node and standby management node, the steps below are quite different; please see [Setup_HA_Mgmt_Node_With_Shared_Disks].
xCAT HAMN requires that the operating system version, xCAT version and database version all be identical on the two management nodes.
The hardware type/model is not required to be the same on the two management nodes, but it is recommended that both have similar hardware capability, so that they can run the same operating system and provide similar management capability.
Since the management node needs to provide IP services through broadcast such as DHCP to the compute nodes, the primary management node and standby management node should be in the same subnet to ensure the network services will work correctly after failover.
The HAMN setup can be performed at any time during the life of the cluster. This documentation assumes the HAMN setup is performed from the very beginning of the cluster setup. You can skip the corresponding steps if part of the setup has already been done in your cluster.
Twin-tailed disks are not required for this support since different methods are used to ensure the data synchronization between the primary management node and standby management node. However, if you have twin-tailed disks in your cluster, then the data synchronization will be easier. You can put the related directories and files listed in section Setup Database Replication and section Files Synchronization onto the twin-tailed disks, re-mount the twin-tailed disks to the standby management node during the failover, and the corresponding steps to keep the data synchronized can be skipped.
The examples in this documentation are based on the following cluster environment:
Primary Management Node: aixmn1(9.114.47.103) running AIX 6.1L and DB2 9.7
Standby Management Node: aixmn2(9.114.47.104) running AIX 6.1L and DB2 9.7
You need to substitute the hostnames and ip address with your own values when setting up your HAMN environment.
The procedure described in [Setting_Up_a_Linux_xCAT_Mgmt_Node] or [XCAT_AIX_Cluster_Overview_and_Mgmt_Node] can be used for the xCAT setup on the primary management node. If DB2 will be used as the xCAT database system, please refer to the doc [Setting_Up_DB2_as_the_xCAT_DB].
The procedure described in [Setting_Up_a_Linux_xCAT_Mgmt_Node] or [XCAT_AIX_Cluster_Overview_and_Mgmt_Node] can also be used for the xCAT setup on the standby management node. The database system on the standby management node should be the same as the one running on the primary management node.
If shared disks are used between the two management nodes, when setting up the standby management node, the shared disks should not be mounted on the standby management node. Make sure the xcatd can be up and running with whatever database is used as part of the xCAT setup verification on the standby management node.
When installing and configuring DB2 software on the standby management node, you should follow the instructions in [Setting_Up_DB2_as_the_xCAT_DB]. Install DB2 and run db2sqlsetup to setup the xCAT database.
After the xCAT setup is done on the standby management node, perform the following additional configuration steps:
On AIX:
stopsrc -s xcatd
rmssys -s xcatd
On Linux:
service xcatd stop
chkconfig --level 345 xcatd off
service dhcpd stop
chkconfig --level 2345 dhcpd off
The most important data that needs to be kept synchronized between the primary management node and standby management node is the xCAT database. Most commercial database systems, and some free database systems such as PostgreSQL and MySQL, provide a database replication feature that can be used for high availability. The configuration for database replication differs considerably among database systems, so this documentation cannot cover all of the configuration scenarios. It focuses on the database replication configuration for DB2, and also provides documentation links for the replication setup of some of the other database systems. You can refer to the "Setup DB2 as the xCAT Database" document link at [Setting_Up_DB2_as_the_xCAT_DB] for more details on how to set up DB2 as the xCAT database.
DB2 High Availability Disaster Recovery (HADR) is a database replication feature that provides a high availability solution. HADR transmits the log records from the primary database server to the standby server. The HADR standby replays all the log records to its copy of the database, keeping it synchronized with the primary database server. Applications can only access the primary database and have no access to the standby database.
HADR communication between the primary and the standby is through TCP/IP, so the primary database server and standby database server do not need to be in the same subnet.
This documentation will only describe some basic configuration steps for HADR setup. There may be some configuration deviations in different cluster environments, so please refer to the following links for more details:
Please be aware that all the DB2 commands in this section should be run as xcatdb unless otherwise noted.
Before proceeding with the DB2 HADR setup, all DB2 clients should be disconnected from the DB2 database server. In an xCAT environment the only DB2 client should be xcatd, so xcatd on both the management node and the service nodes needs to be stopped using the command stopsrc -s xcatd. If any other DB2 client is running on the management node, disconnect it as well.
One way to ensure that all clients are disconnected is to run the following on the management node:
su - xcatdb
db2 force application all
Several configuration parameters need to be updated for HADR on both the primary management node and standby management node.
su - xcatdb
db2 UPDATE DB CFG FOR XCATDB USING LOGRETAIN ON
db2 UPDATE DB CFG FOR XCATDB USING TRACKMOD ON
db2 UPDATE DB CFG FOR XCATDB USING LOGINDEXBUILD ON
db2 UPDATE DB CFG FOR XCATDB USING INDEXREC RESTART
The xcatdb on the primary management node and standby management node should be synchronized before setting up HADR; otherwise errors will occur when trying to start HADR.
Note: as of xCAT 2.6, the xcatdb instance directory for DB2 can be changed by setting the site table databaseloc attribute to the filesystem you would like to use. The examples below use the default of /var/lib/db2. If you have changed the site.databaseloc setting, use your new directory instead. For example, if databaseloc is set to the following:
"databaseloc","/databaseloc",,
then /var/lib/db2 should be replaced with /databaseloc/db2.
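That substitution can be scripted. The following is a minimal sketch, assuming the `tabdump site` CSV line format shown above; the helper name `dbloc_from_tabdump` is our own, not an xCAT command:

```shell
# Derive the DB2 instance directory from a `tabdump site` CSV line.
# Falls back to the default /var/lib/db2 when databaseloc is not set.
dbloc_from_tabdump() {
    # $1: one CSV line, e.g. "databaseloc","/databaseloc",,
    loc=$(printf '%s\n' "$1" | sed -n 's/^"databaseloc","\([^"]*\)".*/\1/p')
    [ -n "$loc" ] || loc=/var/lib
    printf '%s/db2\n' "$loc"
}

dbloc_from_tabdump '"databaseloc","/databaseloc",,'
# prints /databaseloc/db2
```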
as root
mkdir /var/lib/db2/backup
chown xcatdb:xcatdb /var/lib/db2/backup
as xcatdb
db2 BACKUP DB XCATDB TO /var/lib/db2/backup/
The command output will be something like:
Backup successful. The timestamp for this backup image is: 20100805161232
Record the timestamp for later use; this timestamp is also part of the filenames saved in /var/lib/db2/backup.
Note: if you get an error such as "SQL1035N The database is currently in use. SQLSTATE=57019", make sure the xcatd daemons on the management node and service nodes are not running. Deactivating the xcatdb using the command "db2 DEACTIVATE DB XCATDB" may also be helpful.
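If you script the backup, the timestamp can be captured directly from the BACKUP command output instead of being copied by hand. This is a sketch assuming the message format shown above; `extract_backup_timestamp` is our own helper name:

```shell
# Pull the backup timestamp out of the `db2 BACKUP` output so it can be
# passed to the later RESTORE ... TAKEN AT command.
extract_backup_timestamp() {
    printf '%s\n' "$1" | sed -n 's/.*timestamp for this backup image is: \([0-9][0-9]*\).*/\1/p'
}

# Typical use (as xcatdb):
#   ts=$(extract_backup_timestamp "$(db2 BACKUP DB XCATDB TO /var/lib/db2/backup/)")
```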
Copy the xcatdb backup from the primary management node to standby management node:
scp -rp /var/lib/db2/backup xcatdb@aixmn2:/var/lib/db2/
Restore the xcatdb database:
su - xcatdb
db2 RESTORE DATABASE XCATDB FROM "/var/lib/db2/backup" TAKEN AT 20100805161232 REPLACE HISTORY FILE
You will be prompted with the following question:
SQL2539W Warning! Restoring to an existing database that is the same as the
backup image database. The database files will be deleted.
Do you want to continue? (y/n)
Answer: y
Add the following lines into /etc/services on both the primary management node and standby management node. You need to run as root to edit /etc/services.
DB2_HADR_1 55001/tcp
DB2_HADR_2 55002/tcp
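After editing, you can verify that both entries are present and use distinct ports before continuing. This is a small sketch; `check_hadr_services` is our own helper name, and you would point it at /etc/services on each node:

```shell
# Check that the two HADR service entries exist in the given services
# file and that their ports differ.
check_hadr_services() {
    f="$1"
    p1=$(awk '$1=="DB2_HADR_1"{print $2}' "$f")
    p2=$(awk '$1=="DB2_HADR_2"{print $2}' "$f")
    if [ -z "$p1" ] || [ -z "$p2" ]; then
        echo "missing HADR entry"; return 1
    fi
    if [ "$p1" = "$p2" ]; then
        echo "HADR ports collide"; return 1
    fi
    echo "ok: DB2_HADR_1=$p1 DB2_HADR_2=$p2"
}

# Typical use, on both management nodes:
#   check_hadr_services /etc/services
```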
Use the following commands to configure the HADR parameters.
Substitute the IP addresses in the example with your addresses.
On primary management node:
su - xcatdb
db2 UPDATE ALTERNATE SERVER FOR DATABASE XCATDB USING HOSTNAME 9.114.47.104 PORT 60000
db2 UPDATE DB CFG FOR XCATDB USING HADR_LOCAL_HOST 9.114.47.103
db2 UPDATE DB CFG FOR XCATDB USING HADR_LOCAL_SVC DB2_HADR_1
db2 UPDATE DB CFG FOR XCATDB USING HADR_REMOTE_HOST 9.114.47.104
db2 UPDATE DB CFG FOR XCATDB USING HADR_REMOTE_SVC DB2_HADR_2
db2 UPDATE DB CFG FOR XCATDB USING HADR_REMOTE_INST xcatdb
db2 UPDATE DB CFG FOR XCATDB USING HADR_SYNCMODE NEARSYNC
db2 UPDATE DB CFG FOR XCATDB USING HADR_TIMEOUT 3
db2 UPDATE DB CFG FOR XCATDB USING HADR_PEER_WINDOW 120
db2 CONNECT TO XCATDB
db2 QUIESCE DATABASE IMMEDIATE FORCE CONNECTIONS
db2 UNQUIESCE DATABASE
db2 CONNECT RESET
On Standby management node:
su - xcatdb
db2 UPDATE ALTERNATE SERVER FOR DATABASE XCATDB USING HOSTNAME 9.114.47.103 PORT 60000
db2 UPDATE DB CFG FOR XCATDB USING HADR_LOCAL_HOST 9.114.47.104
db2 UPDATE DB CFG FOR XCATDB USING HADR_LOCAL_SVC DB2_HADR_2
db2 UPDATE DB CFG FOR XCATDB USING HADR_REMOTE_HOST 9.114.47.103
db2 UPDATE DB CFG FOR XCATDB USING HADR_REMOTE_SVC DB2_HADR_1
db2 UPDATE DB CFG FOR XCATDB USING HADR_REMOTE_INST xcatdb
db2 UPDATE DB CFG FOR XCATDB USING HADR_SYNCMODE NEARSYNC
db2 UPDATE DB CFG FOR XCATDB USING HADR_TIMEOUT 3
db2 UPDATE DB CFG FOR XCATDB USING HADR_PEER_WINDOW 120
On the standby management node, start HADR as the standby database:
db2 DEACTIVATE DATABASE XCATDB
db2 START HADR ON DATABASE XCATDB AS STANDBY
On the primary management node, start HADR as the primary database:
db2 DEACTIVATE DATABASE XCATDB
db2 START HADR ON DATABASE XCATDB AS PRIMARY
If you get any message other than "DB20000I The START HADR ON DATABASE command completed successfully", make sure all the steps described above have been done correctly, or refer to the DB2 information center for troubleshooting.
HADR can be in the wrong state even if the "START HADR" command returns successfully. The commands "db2 GET SNAPSHOT FOR DB ON XCATDB" or "db2pd -d xcatdb -hadr" can be used to verify the HADR status. The HADR status output is quite similar between these two commands; here is an example:
**db2 GET SNAPSHOT FOR DB ON XCATDB**
HADR Status
Role = Primary
State = Peer
Synchronization mode = Nearsync
Connection status = Connected, 08/05/2010 20:33:00.412948
Peer window end = 08/05/2010 21:03:07.000000 (1281013387)
Peer window (seconds) = 120
Heartbeats missed = 0
Local host = 9.114.47.103
Local service = DB2_HADR_1
Remote host = 9.114.47.104
Remote service = DB2_HADR_2
Remote instance = xcatdb
timeout(seconds) = 3
Primary log position(file, page, LSN) = S0000002.LOG, 18, 000000000FA18D7C
Standby log position(file, page, LSN) = S0000002.LOG, 18, 000000000FA18D7C
Log gap running average(bytes) = 0
**db2pd -d xcatdb -hadr**
Database Partition 0 -- Database XCATDB -- Active -- Up 0 days 01:17:11
HADR Information:
Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
Primary Peer Nearsync 0 0
ConnectStatus ConnectTime Timeout
Connected Thu Aug 5 20:33:00 2010 (1281011580) 3
PeerWindowEnd PeerWindow
Thu Aug 5 21:52:07 2010 (1281016327) 120
LocalHost LocalService
9.114.47.103 DB2_HADR_1
RemoteHost RemoteService RemoteInstance
9.114.47.104 DB2_HADR_2 xcatdb
PrimaryFile PrimaryPg PrimaryLSN
S0000002.LOG 66 0x000000000FA4869D
StandByFile StandByPg StandByLSN
S0000002.LOG 66 0x000000000FA4869D
The attributes "Role", "State" and "ConnectStatus" need to be checked. For an operating HADR environment, the "Role" should be "Primary" or "Standby", the "State" should be "Peer", and the "ConnectStatus" should be "Connected". If any of these attributes is not correct, go back and check the HADR settings and try to restart HADR; if the problem persists, refer to the DB2 documentation or contact the DB2 service team.
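Such a check can be automated for monitoring. The sketch below scans saved `db2pd -d xcatdb -hadr` output with coarse substring matching ("Disconnected" also contains "Connected", so it is screened out first; a stricter parser would split the status fields); `hadr_healthy` is our own helper name:

```shell
# Report whether saved HADR status output looks healthy:
# state Peer, connection status Connected.
hadr_healthy() {
    out="$1"
    case "$out" in *Disconnected*) echo "not connected"; return 1 ;; esac
    case "$out" in *Peer*) : ;; *) echo "not in Peer state"; return 1 ;; esac
    case "$out" in *Connected*) echo "HADR healthy" ;; *) echo "not connected"; return 1 ;; esac
}

# Typical use:
#   hadr_healthy "$(db2pd -d xcatdb -hadr)"
```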
After the HADR setup is done, we should verify the database synchronization between the primary management node and standby management node. Here are the recommended steps:
On the primary management node:
On the standby management node:
On the primary management node:
Besides the HADR related commands described above, there are other HADR commands that are useful for administration and debugging. When debugging errors, a good resource is the DB2 Information Center at http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp . For example, the message SQL1117N can be found in Database reference > Messages > SQL Messages > SQL1000 - SQL1499
Stop HADR :
db2 STOP HADR ON DATABASE XCATDB
Note: On the HADR standby database server, after HADR is stopped the database is in the "ROLL-FORWARD PENDING" state and the xcatdb cannot be activated; the error "SQL1117N A connection to or activation of database "XCATDB" cannot be made because of ROLL-FORWARD PENDING. SQLSTATE=57019" is returned. To fix this, use the command "db2 ROLLFORWARD DATABASE XCATDB TO END OF LOGS AND COMPLETE".
Check xcatdb configuration :
db2 CONNECT TO XCATDB
db2 GET DB CFG
Takeover HADR role:
db2 TAKEOVER HADR ON DATABASE XCATDB USER xcatdb USING cluster
OR
db2 TAKEOVER HADR ON DATABASE XCATDB USER xcatdb USING cluster BY FORCE
The "BY FORCE" option should be used only if the primary database server is not functional.
PostgreSQL provides the "Continuous Archiving and Point-In-Time Recovery (PITR)" feature, which can be used for a high availability cluster configuration. See http://www.postgresql.org/docs/8.4/interactive/warm-standby.html and http://www.postgresql.org/docs/8.4/interactive/continuous-archiving.html for more details.
However, this feature essentially performs a backup on the primary database server and a restore on the standby database server. PITR is not real-time replication: the backup interval is configured manually in postgresql.conf and the recovery interval in recovery.conf, and it accumulates a large number of database log files that consume a large amount of disk space (about 16MB each). Based on these considerations, using the database backup command pg_dump and restore command pg_restore is a better solution for the xCAT PostgreSQL database replication.
On the primary management node:
Add crontab entries to dump the database and copy the dump file to the standby management node.
Here is an example of the crontab entry for user postgres:
0 3 * * * /var/lib/pgsql/bin/pg_dump -f /tmp/xcatdb -F t xcatdb
Here is an example of the crontab entries for user root:
0 4 * * * scp /tmp/xcatdb aixmn2:/tmp/
On the standby management node:
Stop xcatd and PostgreSQL.
AIX:
stopsrc -s xcatd
su - postgres
/var/lib/pgsql/bin/pg_ctl -D /var/lib/pgsql/data stop
Linux:
service xcatd stop
su - postgres
service postgresql stop
Add a crontab entry to restore the database. Here is an example of the crontab entry for user postgres:
0 5 * * * /var/lib/pgsql/bin/pg_restore -d xcatdb -c /tmp/xcatdb
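Taken together, the three crontab entries form a daily pipeline. The one-hour gaps are there so each step finishes before the next begins; widen them if your dump or copy takes longer. Collected in one place (hostname aixmn2 is from the example environment above):

```
# On the primary management node, user postgres: dump the database
0 3 * * * /var/lib/pgsql/bin/pg_dump -f /tmp/xcatdb -F t xcatdb
# On the primary management node, user root: copy the dump to the standby
0 4 * * * scp /tmp/xcatdb aixmn2:/tmp/
# On the standby management node, user postgres: restore the dump
0 5 * * * /var/lib/pgsql/bin/pg_restore -d xcatdb -c /tmp/xcatdb
```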
This documentation will not cover the details for setting up replication for the database systems other than DB2. Here are some useful links for setting up database replication for various database systems supported by xCAT.
MySQL: http://dev.mysql.com/doc/refman/5.5/en/replication.html
SQLite: SQLite does not provide a replication feature. However, since SQLite is a file-based database, you can use a file copy or synchronization mechanism on Unix/Linux to achieve database synchronization. Note that SQLite does not support a hierarchical xCAT cluster, because it does not support the database clients that are required on the service nodes in a hierarchical cluster.
To make the standby management node ready for an easy takeover, a number of files must be kept synchronized between the primary management node and standby management node.
A straightforward way to keep files synchronized is to use rsync. rsync is shipped with xCAT as part of xcat-dep on AIX, and is also shipped with the Linux distributions. You can see more details on the official rsync website http://samba.org/rsync/. You can use crontab to automate the synchronization. This documentation uses rsync and crontab as the file synchronization solution; you can use your own file synchronization solution as long as it keeps the corresponding files synchronized between the primary management node and standby management node.
The SSL credentials need to be identical on the primary management node and standby management node. The xcatd requests submitted from service nodes and compute nodes depend on the SSL credentials.
For the ssh authentication between the primary management node, standby management node, service nodes and compute nodes to keep working after a failover, the ssh keys should be kept synchronized between the primary management node and standby management node.
The SSL credentials reside in the directories /etc/xcat/ca, /etc/xcat/cert and $HOME/.xcat/. The ssh keys are in the directory /etc/xcat/hostkeys.
Here is an example of the crontab entries for synchronizing the SSL credentials and SSH keys:
0 1 * * * /usr/bin/rsync -Lprgotz /etc/xcat/ca /etc/xcat/cert /etc/xcat/hostkeys aixmn2:/etc/xcat
0 1 * * * /usr/bin/rsync -Lprgotz $HOME/.xcat aixmn2:$HOME/
Note: You can backup the $HOME/.ssh directory in case some information from the $HOME/.ssh on the primary management node is needed after failover. This is an optional step:
0 1 * * * /usr/bin/rsync -Lprgotz $HOME/.ssh aixmn2:$HOME/sshbackup/
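After the rsync jobs have run, you can spot-check that a synchronized file really is identical on both nodes. A minimal sketch: fetch the remote copy first, e.g. `scp aixmn2:/etc/xcat/ca/ca-cert.pem /tmp/ca-cert.remote` (the exact filename depends on your setup), then compare checksums; `files_match` is our own helper name:

```shell
# Compare two local copies of a synchronized file by checksum.
files_match() {
    a=$(cksum < "$1")
    b=$(cksum < "$2")
    if [ "$a" = "$b" ]; then
        echo MATCH
    else
        echo DIFFER
        return 1
    fi
}

# Typical use:
#   files_match /etc/xcat/ca/ca-cert.pem /tmp/ca-cert.remote
```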
The node deployment packages are under the directory specified by the "installdir" attribute in the xCAT site table. The default location is /install directory. The node deployment packages need to be synchronized to the standby management node.
For Linux, this is easily achieved by copying the whole /install directory from the primary management node to the standby management node. However, copying the whole /install directory is not enough for AIX: the NIM resources must also be created on the standby management node, which requires some manual steps.
Here is an example of the crontab entries for synchronizing the node deployment packages:
0 2 * * * /usr/bin/rsync -Lprogtz /install aixmn2:/
If you do not want to do the manual steps on the standby management node to re-create the NIM resources, the AIX feature High Availability Network Installation Manager (HANIM) can be used to keep the NIM resources synchronized between the primary management node and standby management node. Please refer to the AIX redbook "NIM from A to Z in AIX 5L" at http://www.redbooks.ibm.com/redbooks/pdfs/sg247296.pdf for more details about HANIM.
Several network services are configured on the management node, such as DNS, DHCP and HTTP. These network services are mainly controlled by configuration files. However, some of the configuration files contain information related to the local hostname and IP addresses, so simply copying them to the standby management node may not work. Regenerating these configuration files is easy and quick using xCAT commands such as makedhcp, makedns or nimnodeset, as long as the xCAT database contains the correct information.
While it is easier to configure the network services on the standby management node by running xCAT commands when failing over, a couple of exceptions are /etc/hosts and /etc/resolv.conf, which may be modified on your primary management node as ongoing cluster maintenance occurs. Since /etc/hosts and /etc/resolv.conf are very important for xCAT commands, they will be synchronized between the primary management node and standby management node. Here is an example of the crontab entry for synchronizing /etc/hosts and /etc/resolv.conf:
0 2 * * * /usr/bin/rsync -Lprogtz /etc/hosts /etc/resolv.conf aixmn2:/etc/
Besides the files mentioned above, there may be some additional customization files and production files that need to be copied over to the standby management node, depending on your local unique requirements. You should always try to keep the standby management node as an identical clone of the primary management node. Here are some example files that can be considered:
/.profile
/.rhosts
/etc/auto_master
/etc/auto/maps/auto.u
/etc/motd
/etc/security/limits
/etc/resolv.conf
/etc/netsvc.conf
/etc/ntp.conf
/etc/inetd.conf
/etc/passwd
/etc/security/passwd
/etc/group
/etc/security/group
/etc/exports
/etc/dhcpsd.cnf
/etc/services
/etc/inittab
(and more)
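Rather than listing every such file on the rsync command line, you can keep the list in one file and sync it with a single crontab entry. This is a sketch: /etc/xcat/hamn-synclist is a hypothetical path holding one path per line (e.g. the files above), and --files-from is a standard rsync option that reads the source list from a file:

```
0 2 * * * /usr/bin/rsync -Lprogtz --files-from=/etc/xcat/hamn-synclist / aixmn2:/
```

Keeping the list in one file makes it easy to review and extend as your site adds customization files.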
Note: if the IBM HPC software stack is configured in your environment, please refer to the xCAT wiki page [IBM_HPC_Stack_in_an_xCAT_Cluster] for additional steps required for HAMN setup.
The standby management node should be taken into account when doing any maintenance work in the xCAT cluster with HAMN setup.
At this point, the HAMN setup is complete, and customer workloads and system administration can continue on the primary management node until a failure occurs. The xcatdb and files on the standby management node will continue to be synchronized until such a failure occurs.
When the primary management node fails for whatever reason, the failover process should be started. There are two methods to perform the failover: manual failover and automatic failover. For manual failover, the administrator follows the procedure below in the event of a failure on the primary management node. Alternatively, HAMN can be configured with IBM Tivoli System Automation (TSA) to achieve automatic failover; see [Configure_HAMN_with_TSA] for more details.
Use the description in the section "Setup Database Replication" to fail over the database replication to the standby management node if necessary. Using the DB2 HADR configuration as an example, there are two scenarios that require different procedures. If the outage is a known outage, where the standby management node takes over before the primary management node goes down, the command "db2 TAKEOVER HADR ON DATABASE XCATDB USER xcatdb USING cluster" can be used to fail over HADR. If the outage is an unknown outage, where the primary management node remains in control until it goes down, the command "db2 TAKEOVER HADR ON DATABASE XCATDB USER xcatdb USING cluster BY FORCE" can be used. The "BY FORCE" option is required when the DB2 database on the primary management node is not functional.
If the primary management node is not totally dead, shut it down. The standby management node cannot take over the management role while the primary management node is still up, since the standby management node will be configured with the hostname and IP address that the primary management node was using. When the primary management node is shut down, the service nodes and compute nodes may no longer function, depending on the type of node installation that was used. If xCAT is still active on the primary management node at this time, rpower and xdsh can be used to shut down the nodes if needed.
Using DB2 as an example, the following commands can be used to stop DB2:
db2 STOP HADR ON DATABASE XCATDB
Note: If you get the error message SQL1769N Stop HADR cannot complete. Reason code = "2", try running the command
db2 DEACTIVATE DATABASE XCATDB USER XCATDB USING cluster
and then rerun the
db2 STOP HADR ON DATABASE XCATDB
db2 connect reset
db2 force application all
db2 terminate
db2stop
Change the IP address and hostname on the standby management node to the values the primary management node was using. On AIX, for example:
/usr/sbin/mktcpip -h'aixmn1' -a'9.114.47.103' -m'255.255.255.192' -i'en1' -g'9.114.47.126' -t'N/A'
Note: the mktcpip command will also update /etc/hosts. If this is not desired, you can use the chdev command instead.
It is recommended that you open a console to the standby management node prior to making any ethernet interface changes. Also, keep the console open, to observe any errors while issuing commands in the remainder of this section.
Update the database configuration to use the new ip address and new hostname. For DB2, use the following command:
re-login as xcatdb
db2gcf -u -p 0 -i xcatdb
This command will update the DB2 database configuration file "/var/lib/db2/sqllib/db2nodes.cfg" and start DB2. Note the path /var/lib/db2 is the default and may have been changed by setting the site table databaseloc attribute.
For PostgreSQL, update the line "listen_addresses = 'x.x.x.x'" in the file /var/lib/pgsql/data/postgresql.conf and update the line "host all all x.x.x.x/32 md5" in the file /var/lib/pgsql/data/pg_hba.conf.
If the xcatdb is in the "ROLL-FORWARD PENDING" state, roll the database forward:
db2 ROLLFORWARD DB XCATDB TO END OF LOGS
db2 ROLLFORWARD DB XCATDB COMPLETE
Verify that xcatdb is usable via db2:
db2 CONNECT TO XCATDB USER XCATDB USING cluster
db2 LIST TABLES
For Linux: the operating system images definitions are already in the xCAT database, and the operating system image files are already in /install directory.
For AIX: If HANIM is being used to keep the NIM resources synchronized, no manual steps are needed to create the NIM resources on the standby management node. Otherwise, the operating system image files are in the /install directory, but you will have to create the NIM resources manually. The following manual steps can be used to re-create the NIM resources:
For AIX:
If the NIM master is not initialized, run the command
nim_master_setup -a mk_resource=no -a device=<source directory>
to initialize the NIM master, where <source directory> is the location of the AIX installation images.
Run the following command to list all the AIX operating system images.
lsdef -t osimage -l
For each osimage:
Create the lpp_source resource:
/usr/sbin/nim -Fo define -t lpp_source -a server=master -a \
location=/install/nim/lpp_source/<osimagename>_lpp_source <osimagename>_lpp_source
Create the spot resource:
/usr/lpp/bos.sysmgt/nim/methods/m_mkspot -o -a server=master -a \
location=/install/nim/spot/ -a source=no <osimage>
Check if the osimage has any of the following resources:
"installp_bundle", "script", "root", "tmp", "home",
"shared_home", "dump" and "paging"
If yes, use commands
/usr/sbin/nim -Fo define -t <type> -a server=master -a \
location=<location> <osimagename>_<type>
to create all the necessary NIM resources, where <location> is the resource location returned by
lsdef -t osimage -l
If the osimage has a shared_root resource defined, the shared_root resource directory needs to be removed before recreating the shared_root resource; here is an example:
rm -rf /install/nim/shared_root/71Bshared_root/
/usr/sbin/nim -Fo define -t shared_root -a server=master -a \
location=/install/nim/shared_root/71Bshared_root -a spot=71Bcosi 71Bshared_root
Note: If the NIM master was already up and running on the standby management node prior to failover, the NIM master hostname needs to be changed; you can use smit nim to perform the NIM master hostname change.
If you see ssh problems when trying to ssh to the compute nodes or any other nodes, the host entries in the ssh files under the directory $HOME/.ssh need to be updated.
Run nimnodeset or mkdsklsnode
Before running nimnodeset or mkdsklsnode, make sure the entries in the file /etc/exports match the exported NIM resource directories; otherwise you will get an exportfs error and nimnodeset/mkdsklsnode will not finish successfully.
Run rpower <noderange> reset or rnetboot <noderange> to initiate the network boot.
When the previous primary management node is back up and running, you may want to fail back to it. Since the xCAT database and related files were not kept up to date on the previous primary management node while it was down, failing back to the previous primary management node is not a simple action. You must go through all the steps described in this documentation to set up the previous standby management node as the new primary management node and the previous primary management node as the new standby management node, and then perform a failover from the new primary management node to the new standby management node.
Wiki: Configure_HAMN_with_TSA
Wiki: Highly_Available_Management_Node
Wiki: Setting_Up_DB2_as_the_xCAT_DB
Wiki: Setup_HA_Mgmt_Node_With_Shared_Disks