
!!! This documentation is still under construction !!!
This documentation describes a process for upgrading the software and firmware on a Linux Power775 Cluster.
Before upgrading, back up the cluster data. A good way to do this is to run xcatsnap. It saves the DB2 database, system files such as /etc/hosts, /install/postscripts, and so on. You may want to prune the eventlog and auditlog tables first to keep the database backup small.
tabprune -a auditlog
tabprune -a eventlog
Create a directory that can hold 50G or more for xcatsnap. Then run:
xcatsnap -d <directory>
It will create a compressed tar file. Save this file to a different system for later reference. Also save the following files that xCAT uses:
/etc/ssh/sshd_config
/etc/exports
/etc/httpd/conf.d/xcat.conf
(any other files that need to be checked?)
Save any other data you think is important from both the EMS and the service nodes.
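Since xcatsnap needs a directory that can hold 50G or more, it may be worth checking free space before running it. The following is a minimal sketch, not part of xCAT; the function names and the SNAPDIR variable are hypothetical, and it assumes a POSIX df.

```shell
#!/bin/sh
# free_kb DIR: print the free space of the filesystem holding DIR, in KB.
free_kb() {
    # df -P prints one POSIX-format data line; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_free_gb DIR MIN_GB: succeed if DIR has at least MIN_GB gigabytes free.
has_free_gb() {
    avail_kb=$(free_kb "$1") || return 1
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# SNAPDIR is a hypothetical path; substitute your own xcatsnap directory.
SNAPDIR=${SNAPDIR:-/tmp}
if has_free_gb "$SNAPDIR" 50; then
    echo "$SNAPDIR has room for the snapshot"
else
    echo "$SNAPDIR has less than 50 GB free" >&2
fi
```

If the check passes, run xcatsnap -d against the same directory.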
For TEAL, find all the alerts that are active:
tllsalert
Fix and close each alert (for posterity):
tlchalert -i <rec_id> -s close
Back up the database tables (the backup could be large). This creates a tltab*.tar file:
/opt/teal/sbin/tltab -p /tmp -d
For xCAT, list the monitoring plug-ins:
monls
For each monitoring plug-in that is "monitored", run:
monstop <plug-in-name>
monstop <plug-in-name> -r
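The per-plug-in loop above can be scripted. This is a sketch only: it assumes each line of monls output starts with the plug-in name followed by its state, and the sample text stands in for real output. It prints the monstop commands rather than running them.

```shell
#!/bin/sh
# monitored_plugins: read `monls` output on stdin and print the names of
# plug-ins whose state column is exactly "monitored".
# Assumption: each line looks like "<name> <state> ...".
monitored_plugins() {
    awk '$2 == "monitored" { print $1 }'
}

# Sample text standing in for real `monls` output (illustrative only).
sample='snmpmon   monitored
rmcmon    not-monitored'

for plugin in $(printf '%s\n' "$sample" | monitored_plugins); do
    # Print the commands instead of running them; drop "echo" to execute.
    echo monstop "$plugin"
    echo monstop "$plugin" -r
done
```

On a live EMS, replace the sample with `monls | monitored_plugins`, and keep the list of stopped plug-ins so they can be restarted after the upgrade.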
For service nodes:
xdsh <service_node_group> llctl stop
For EMS:
service cnmd stop
service teal stop
xdsh <node_groups> mmdelcallback <name> #Prevent flood of msg's when shutting down GPFS
xdsh <compute_group> mmshutdown
xdsh <gpfs_group> mmshutdown
Note: these GPFS commands take a long time to finish. Please be patient and let them finish. Do not press Ctrl-C, otherwise you may corrupt the database.
xdsh <service_node_group> "service xcatd stop"
rpower <service_node_group> off
rpower <compute_node_group> off
rpower <login_node_group> off
rpower <gpfs_node_group> off
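Because the power-off order matters and the commands are disruptive, a dry-run wrapper can help rehearse the sequence first. This is a hypothetical helper, not an xCAT feature; the group names are placeholders for the groups defined in your cluster.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1 (the default here), else execute it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# power_off_groups GROUP...: power off each node group in the given order.
power_off_groups() {
    for group in "$@"; do
        run rpower "$group" off
    done
}

# Group names below are hypothetical; the order follows the steps above.
power_off_groups service compute login gpfs
```

Setting DRY_RUN=0 in the environment makes the same script issue the real rpower commands.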
The details are documented at the following link, but the basic instructions are in the next steps.
Setting_Up_DB2_as_the_xCAT_DB/#appendix-binstalling-db2-fix-packs.
At this point you must stop xCAT on the EMS. All applications that access the database on the EMS (TEAL, ISNM, LL, and xCAT) must be stopped. The others were stopped in previous steps.
service xcatd stop
1. Get the DB2 fix pack
Use the HPC DVD supplied to you for the HPC DB2 licensed product.
2. Check disk space
While a fix pack is being installed there are two copies of the DB2 code in /opt, so additional space is required. Updating the DB2 server code in /opt on the Management Node needs at least 3.5 gigabytes of free space.
3. Stop the DB2 server
Stop the DB2 database on the EMS:
su - xcatdb
db2 force applications all; db2 terminate;
db2stop    # or: db2stop force
4. Install the DB2 fix pack on the EMS
Details on installing DB2 fix packs can be found at the following link, but the basic instructions are below:
Setting_Up_DB2_as_the_xCAT_DB/#appendix-binstalling-db2-fix-packs.
5. Prepare the DB2 code directory for install on the service nodes.
lsdef -t site -o clustersite -i db2installloc    # get the location of the DB2 install code directory
Remove all the old DB2 files under <db2installloc>.
Copy the DB2 tarball with the new DB2 fix pack to <db2installloc>, then unzip and untar it.
Change directory to the location of the FixPack code which you extracted.
cd <db2installloc>/wser
./installFixPack -b /opt/ibm/db2/V9.7
If you get an error, read the error log. It may suggest that you use:
./installFixPack -b /opt/ibm/db2/V9.7 -f db2lib
6. Restart the database
su - xcatdb
db2start
exit
Restart xCAT and run the following commands to verify that the DB2 upgrade succeeded:
service xcatd start
tabdump site
lsxcatd -a
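The verification above can be reduced to a pass/fail check. This sketch assumes that lsxcatd -a reports the database engine on a line of the form dbengine=DB2; confirm that against the output on your own EMS before relying on it. The function name is hypothetical.

```shell
#!/bin/sh
# uses_db2: read `lsxcatd -a` output on stdin; succeed if it reports DB2.
# Assumption: the output contains a line starting with "dbengine=DB2".
uses_db2() {
    grep -q '^dbengine=DB2'
}

# Only attempt the live check where lsxcatd is actually installed.
if command -v lsxcatd >/dev/null 2>&1; then
    if lsxcatd -a | uses_db2; then
        echo "xcatd is running against DB2"
    else
        echo "xcatd is not reporting DB2; check the DB2 and xcatd logs" >&2
    fi
fi
```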
Now stop xCAT; the EMS will be upgraded to Red Hat 6.1.
service xcatd stop
Because the service nodes will be reinstalled, this step is skipped for them. The operating system, DB2, all the HPC software, and xCAT are upgraded during the reinstall.
The detailed procedure on updating the operating system on EMS can be found at
su - xcatdb
db2start
exit
tar xjvf xcat-core*.tar
tar xjvf xcat-deps*.tar
Move back any repo files in the /etc/yum.repos.d directory that you might have renamed in the previous step when upgrading RedHat.
Make sure the following repo files are correct.
cat /etc/yum.repos.d/xcat-core
[xcat-core-local]
name=local copy of xCAT core
baseurl=file:/install/post/otherpkgs/rhels6.1/ppc64/xcat/xcat-core
enabled=1
gpgcheck=0
cat /etc/yum.repos.d/xcat-deps
[xcat-dep-local]
name=local copy of xCAT deps
baseurl=file:/install/post/otherpkgs/rhels6.1/ppc64/xcat/xcat-dep
enabled=1
gpgcheck=0
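The repo definitions above only work if each baseurl directory exists and holds yum metadata in a repodata/ subdirectory. A minimal sketch of that check follows; the function names are hypothetical, and the repo files to test are passed as arguments.

```shell
#!/bin/sh
# repo_dir FILE: print the local directory named by the first
# "baseurl=file:" line in a yum repo file ("file:/x" and "file:///x" both work).
repo_dir() {
    sed -n 's|^baseurl=file:\(//\)\{0,1\}||p' "$1" | head -n 1
}

# repo_ready FILE: succeed if that directory exists and contains repodata/.
repo_ready() {
    dir=$(repo_dir "$1")
    [ -n "$dir" ] && [ -d "$dir/repodata" ]
}

# Check every repo file given on the command line.
for f in "$@"; do
    if repo_ready "$f"; then
        echo "$f: ok"
    else
        echo "$f: baseurl directory missing or has no repodata" >&2
    fi
done
```

Run it with the two repo files as arguments before the yum commands that follow.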
yum clean metadata
yum check-update
yum update '*xCAT*'
To verify that xCAT is running correctly, run:
lsxcatd -a
Download the DFM and Hardware Server packages from Fix Central, or use the DVD supplied.
HW server: http://www-933.ibm.com/support/fixcentral/swg/selectFixes?parent=ibm~ClusterSoftware&product=ibm/Other+software/IBM+High+Performance+Computing+(HPC)+Hardware+Server&release=All&platform=All&function=all
rpm -Uvh xCAT-dfm-*.ppc64.rpm ISNM-hdwr_svr-*.ppc64.rpm
Perform the necessary firmware upgrades for the BPAs, FSPs and HMCs. Detailed instructions are here:
XCAT_Power_775_Hardware_Management/#updating-the-bpa-and-fsp-firmware-using-xcat-dfm.
If LL is installed on the EMS, upgrade the LL rpms following the normal procedure.
Reference LoadLeveler documentation: Tivoli Workload Scheduler LoadLeveler library
Upgrade the TEAL rpms and start teal following normal procedure.
For teal information: https://sourceforge.net/apps/mediawiki/pyteal/index.php?title=Main_Page (TBD)
Upgrade the ISNM rpms and start cnmd following normal procedure
Reference ISNM configuration from P775 Guide: http://www.ibm.com/developerworks/wikis/download/attachments/162267485/p775_planning_installation_guide.rev1.2.pdf?version=1
1. Set up the repository for the new operating system (optional)
copycds <iso file name>
2. Copy the HPC rpms to a new directory, such as /install/post/otherpkgs/rhels6.2/ppc64/, and run createrepo for each HPC sub-directory.
3. Obtain the latest HFI kernel and device driver rpms and copy them to the following locations:
/install/kernels/kernel-2.6.32-131.0.15.el6.20120106b2.ppc64.rpm
/install/kernels/kernel-headers-2.6.32-131.0.15.el6.20120106b2.ppc64.rpm
/install/hfi/dd/hfi_util-2.19-0.el6.ppc64.rpm
/install/hfi/dd/hfi_ndai-1.7.3-0.el6.ppc64.rpm
/install/hfi/dd/net-tools-1.60-102.el6.ppc64.rpm
/install/hfi/dhcp/dhclient-hfi-4.2.1-2.P1.el6_2.ppc64.rpm
/install/hfi/dhcp/dhcp-common-hfi-4.2.1-2.P1.el6_2.ppc64.rpm
/install/hfi/dhcp/dhcp-hfi-4.2.1-2.P1.el6_2.ppc64.rpm
Then run:
createrepo /install/kernels
createrepo /install/hfi/dd
createrepo /install/hfi/dhcp
4. Copy the xCAT core rpms and deps rpms to /install/post/otherpkgs/rhels6.x/ppc64/ and untar them.
5. If not already done, prepare the DB2 code directory for install on the service nodes. (Optional)
lsdef -t site -o clustersite -i db2installloc    # get the location of the DB2 install code directory
Remove all the old PTF 4 DB2 files under <db2installloc>
Copy DB2 tarball with fix pack 5 under <db2installloc>, uncompress and untar it.
6. Change the os attribute to the new operating system for all the nodes (optional)
chdef service,compute,gpfs,login os=rhels6.2
7. Change some attributes for all the images for the new operating system: (Optional)
chdef -t osimage -o <image_name> \
osvers=rhels6.2 \
otherpkgdir=/install/post/otherpkgs/rhels6.2/ppc64/ \
rootimgdir=/install/netboot/rhels6.2/ppc64/<imgname> \
kernelver=2.6.32-131.0.15.el6.20120106b2.ppc64
8. Verify the os image definitions are set to use all the correct files, directories, pkglists, etc. Change any other attributes as needed:
lsdef -t osimage -o <image_name> -l
9. Modify the HPC installation scripts to work with the new HPC software
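Before moving on, it can be useful to confirm that the rpms from step 3 actually landed in the directories that createrepo was run on. The helper below is a hypothetical sketch; the glob patterns are derived from the file names listed in step 3.

```shell
#!/bin/sh
# missing_rpms DIR PATTERN...: print each glob pattern that matches no
# file in DIR, so gaps are visible before the nodes are installed.
missing_rpms() {
    dir=$1; shift
    for pat in "$@"; do
        found=no
        # Unquoted $pat is deliberate: let the shell expand the glob.
        for f in "$dir"/$pat; do
            if [ -e "$f" ]; then found=yes; fi
        done
        [ "$found" = yes ] || echo "$dir/$pat"
    done
}

# Any line printed here names a pattern with no matching rpm.
missing_rpms /install/kernels 'kernel-2.6.32-*.ppc64.rpm' 'kernel-headers-2.6.32-*.ppc64.rpm'
missing_rpms /install/hfi/dd 'hfi_util-*.rpm' 'hfi_ndai-*.rpm' 'net-tools-*.rpm'
missing_rpms /install/hfi/dhcp 'dhclient-hfi-*.rpm' 'dhcp-common-hfi-*.rpm' 'dhcp-hfi-*.rpm'
```

No output means every pattern matched at least one file.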
1. Before starting the service node installation, make sure the osimage for the service nodes has been updated correctly.
nodeset <service_node_group> osimage=<image_name>
rpower <service_node_group> off
rpower <service_node_group> on
2. After the installation, make sure xcatd and DB2 are running properly.
xdsh <service_node_group> lsxcatd -a |xcoll
3. Install HFI kernel and device drivers
xdsh <service_node_group> rpm -ivh /install/kernels/kernel-2.6.32-*.ppc64.rpm
xdsh <service_node_group> rpm -ivh /install/kernels/kernel-headers-2.6.32-*.ppc64.rpm --force
xdsh <service_node_group> rpm -ivh /install/hfi/dd/hfi_util-*.el6.ppc64.rpm
xdsh <service_node_group> rpm -ivh /install/hfi/dd/hfi_ndai-*.el6.ppc64.rpm
xdsh <service_node_group> rpm -ivh /install/hfi/dd/net-tools-*.el6.ppc64.rpm --force
xdsh <service_node_group> rpm -ivh /install/hfi/dhcp/dhcp-common-hfi-4.2.1-2.P1.el6_2.ppc64.rpm --force
xdsh <service_node_group> rpm -ivh /install/hfi/dhcp/dhclient-hfi-4.2.1-2.P1.el6_2.ppc64.rpm --force
xdsh <service_node_group> rpm -ivh /install/hfi/dhcp/dhcp-hfi-4.2.1-2.P1.el6_2.ppc64.rpm --force
xdsh <service_node_group> /sbin/new-kernel-pkg --mkinitrd --depmod --install 2.6.32-131.0.15.el6.20120106b2.ppc64
xdsh <service_node_group> /sbin/new-kernel-pkg --rpmposttrans 2.6.32-131.0.15.el6.20120106b2.ppc64
4. Change yaboot to boot from customized kernel on all service nodes.
Setting_Up_a_Linux_Hierarchical_Cluster/#change-yaboot-to-boot-from-customized-kernel.
5. Reboot the service nodes:
xdsh <service_node_group> reboot
6. After the service nodes are all up, configure the HFI interfaces:
updatenode <service_node_group> -P confighfi
7. Create GPFS gplbin rpm
Log in to one service node, create the GPFS gplbin rpm, and copy it to the correct /install/post/otherpkgs/... directory.
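The new-kernel-pkg commands in step 3 take the kernel version string, which must match the kernel rpm that was installed. One way to avoid retyping it is to derive it from the rpm file name, as in this small sketch (the function name is hypothetical, and it is meant for the main kernel rpm, not kernel-headers).

```shell
#!/bin/sh
# kver_from_rpm FILE: derive the kernel version string from a kernel rpm
# file name by dropping the directory, the "kernel-" prefix and ".rpm".
kver_from_rpm() {
    name=${1##*/}
    name=${name#kernel-}
    printf '%s\n' "${name%.rpm}"
}

KVER=$(kver_from_rpm /install/kernels/kernel-2.6.32-131.0.15.el6.20120106b2.ppc64.rpm)
echo "$KVER"   # prints 2.6.32-131.0.15.el6.20120106b2.ppc64

# Printed, not executed: the commands that would use the derived version.
echo xdsh '<service_node_group>' /sbin/new-kernel-pkg --mkinitrd --depmod --install "$KVER"
echo xdsh '<service_node_group>' /sbin/new-kernel-pkg --rpmposttrans "$KVER"
```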
The images need to be regenerated because the OS level and the HPC software have changed. Make sure they pick up the new kernel and HFI driver. The following is the normal process:
genimage <image_name>
liteimg <image_name>
nodeset <compute_node_group> osimage=<image_name>
rbootseq <compute_node_group> hfi
rpower <compute_node_group> off
rpower <compute_node_group> on
Do the same for GPFS servers and login nodes.
xdsh <gpfs_group> mmstartup
xdsh <compute_group> mmstartup
xdsh <node_groups> mmaddcallback <name>
Note: these GPFS commands take a long time to finish. Please be patient and let them finish. Do not press Ctrl-C, otherwise you may corrupt the database.
If you monitor the serviceable events on the HMC:
moncfg rmcmon <hmc_group> -r
moncfg rmcmon <hmc_group>
If you monitor PNSD:
moncfg rmcmon <service_node_group> -r
moncfg rmcmon <service_node_group>
moncfg rmcmon <compute_node_group> -r
moncfg rmcmon <compute_node_group>
If you monitor GPFS:
moncfg rmcmon <service_node_group> -r
moncfg rmcmon <service_node_group>
Start up RMC monitoring:
monstart rmcmon
monstart rmcmon -r
Start up the GPFS monitoring:
tlgpfschnode -C <cluster> -N <service_node_group> -e
Make sure that the condition/responses are set up and running:
lscondresp | grep Active
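"TealAnyNodeEventNotify" "TealNotifyEventLogged" "c250mgrs21-pvt" "Active"
For each other monitoring plug-in (<plug-in>) that was stopped before the upgrade, reconfigure and restart it: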
moncfg <plug-in> <service_node_group>
moncfg <plug-in> <service_node_group> -r
moncfg <plug-in> <compute_node_group>
moncfg <plug-in> <compute_node_group> -r
monstart <plug-in>
monstart <plug-in> -r
Your backup EMS must also be upgraded, including the operating system, DB2, xCAT and HPC software. After you have stabilized the Primary EMS and cluster, you should proceed to follow these steps to upgrade the backup EMS. The steps you will have to do are very similar to the steps that you just did on the Primary EMS.