This cookbook provides information about starting up an xCAT-managed Power 775 HPC cluster, along with verification steps as the system is being started. Everything described in this document is only supported in xCAT 2.6.6 and above. If you have other System p hardware, see [XCAT_System_p_Hardware_Management] .
Furthermore, this is intended only as a post-installation procedure.
More information about the Power 775 related software can be found at:
The following terms will be used in this document:
xCAT DFM: Direct FSP Management is the term we will use to describe the ability of the xCAT software to communicate directly with the System p server's service processor, without using an HMC for management.
Frame node: A node with hwtype set to frame; it represents a high-end System p server 24-inch frame.
CEC node: A node with the hwtype attribute set to cec; it represents a System p CEC (i.e. one physical server).
BPA node: A node with hwtype set to bpa; it represents one port on one BPA (each BPA has two ports). For xCAT's purposes, the BPA is the service processor that controls the frame. From the system admin's perspective, the admin should always use the Frame node definition for the xCAT hardware control commands, and xCAT will figure out which BPA nodes and which of their IP addresses to use for the hardware service processor connections.
FSP node: A node with hwtype set to fsp; it represents one port on an FSP. In a CEC with redundant FSPs there are two FSPs, and each FSP has two ports, so xCAT defines four FSP nodes per server with redundant FSPs. As with the relationship between Frame and BPA nodes, system admins always use the CEC node for the hardware control commands; xCAT automatically uses the four FSP node definitions and their attributes for the hardware connections.
Service node: This is an LPAR which assists in the hierarchical management of xCAT by extending the capabilities of the EMS. Service nodes (SNs) have a full disk image and serve the diskless OS images for the nodes they manage.
IO node: This is an LPAR which has attached disk storage and provides access to the disk for applications. In 775 clusters the IO node will be running GPFS and will be managing the attached storage as part of the GPFS storage.
Compute node: This is a node which is used for customer applications. Compute nodes in a 775 cluster have no local disks or ethernet adapters. They are diskless nodes.
Utility node: This is a general term which refers to a non-compute node/LPAR and a non-IO node/LPAR. Examples of LPARs in a Utility node are the Service Node, Login Node, and local customer nodes for backup of data, or other site-specific functions.
Login node: This is an LPAR defined to allow users to log in and submit jobs to the cluster. The login node will most likely have an Ethernet adapter connecting it to the customer VLAN for access.
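For reference, a quick way to see which of these node definitions exist in the xCAT database is to select on the hwtype attribute with lsdef. This is only a sketch; it assumes the hwtype attribute is populated for your node definitions as described above:
$ lsdef -t node -w hwtype==frame
$ lsdef -t node -w hwtype==cec
$ lsdef -t node -w hwtype==fsp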
In a 775 cluster there are interrelationships and dependencies in the hardware and software architecture which require the startup to be performed in an orderly fashion. This document will explain these relationships and dependencies and describe in detail how to properly bring the system up to an HPC running state where users may login and start to submit jobs.
Each set of hardware has a designated role in the cluster. This section will describe each part of the hardware and its role.
The Ethernet switch hardware is key to any computer complex and provides the networking layer for IP communication. In a 775 cluster, the switch hardware supports the Cluster Management LAN, which is used by xCAT for OS distribution from the EMS to the SNs as well as for administration from the EMS to the SNs. This hardware also supports the Cluster Service LAN, which connects the EMSs, SNs, HMCs, FSPs, and BPAs together to provide access to the service processors within each Frame and CEC.
To begin understanding the flow of the start-up process, let's first distinguish the different hardware responsibilities in the order in which each set of hardware becomes involved in the bring-up process.
The xCAT Executive Management Server (EMS) is the central point of control for administration of the cluster. The EMS contains the xCAT DB as well as the Central Network Manager (CNM) and its DB, and TEAL and its DB.
The HMCs are used for Service Focal Point and Repair and Verify procedures. During initial installation and configuration the HMCs will be assigned Frames and CECs which they will monitor for any hardware failures.
Each Service node is an LPAR within a building block; it has a full disk image and serves the diskless OS images for the nodes it manages. All diskless nodes require that the SN supporting them is up and running before they can successfully boot. Some administrative operations issued in xCAT on the EMS are pushed out to the SNs, which perform the operations in a hierarchical manner; this is needed for system administration performance.
The IO node is the LPAR with attached storage. It contains the GPFS software which manages the global filesystem for the cluster. All compute nodes are dependent on the IO nodes to be operational before they can mount the global filesystem.
There are some areas which are outside of the scope of this process. In order to draw a boundary on what hardware is part of the start-up process and what is considered a prerequisite, we will list some assumptions. It is assumed that the site has power and that everything is in place to begin the start-up process. This includes that the site cooling is up and operational and that all power to the devices (switches, EMS, HMCs, frames, etc.) is ready to be applied.
The network switch hardware is a gray area in this process as some network switch hardware is part of the HPC cluster and others may be outside the cluster. For this discussion, we will make the assumption that all network switches that are customer site specific and not HPC cluster specific are up and operational.
There are some manual tasks involved in this process which require an IBM Systems Engineer or a site administrator to manually start equipment. There should be people available to perform these tasks, and they should be very familiar with the power-on controls needed for each task they are to perform. Examples include powering on the Ethernet network switches, EMS, HMCs, frames, etc. These are all manual tasks which will need to be performed by a person when it is time to do that step.
This process also assumes that all initial cabling and configuration, both hardware and software, has been done prior to this process and that the entire system has gone through booting and testing to eliminate any hardware or software problems prior to performing this procedure.
As the cluster is started, it is critical that hardware or software dependencies are up and operational before any hardware or software item which has the dependency can complete successfully. Let's take a high-level view of the dependencies to help outline the flow of the start-up process. This section is intended to give a rough idea of the dependencies; it will not go into any detail as to how to accomplish each task or verify its completion.
Ethernet Switches - At the top of the dependencies is the HPC cluster ethernet switch hardware as well as any customer ethernet switch hardware. These will be the first items that need to be started.
EMS and HMCs - The next level of dependency is the EMS and HMCs. These can both be started at the same time once the network switches have been started.
Frames - Once the EMS and HMCs are started then we can begin to start the 775 hardware by powering on all of the frames. The frames are dependent on both the Switches and the EMS in order to come up properly.
CECs - Once the frame is powered on the CECs can be powered on. The CECs depend on the switches, EMS, and frames. Applying power to the CECs brings up the HFI network hardware, which is critical to distributing the operating system to diskless nodes, as well as for application communication.
SN - The SNs can be started once the CECs are powered on; they are dependent on the switches, EMS, frames, and CECs.
IO node - The IO node can be started once the SN is operational. The IO node is dependent on the switches, EMS, frame, CEC, and SN.
Login and Compute nodes - Last in the list is the starting of the login and compute nodes. These can be done once the SN and IO nodes are up and operational. The login and compute nodes require the SN to be operational for loading the OS images. Login and compute nodes depend on the switches, EMS, frames, CECs, SNs, and IO nodes.
Once the login and compute nodes have started, the admin can begin to evaluate the HPC cluster state by checking the various components of the cluster.
This section will document the start-up procedure. Each sub-section will discuss the prerequisites, the process for this step, and the verification for completion. As we mentioned previously there are some assumptions on the current site state which must be met prior to starting this process; these include cooling and power and initial configuration and verification of the cluster performed during installation.
Before we begin with the start-up procedure, we should discuss the benefit of using xCAT group names. xCAT supports the use of group names, which allow devices/nodes to be grouped in a logical fashion by node type. We recommend that the following node groups be in place prior to performing this procedure: frame, cec, bpa, fsp, service, storage, and compute. Other node groups may be used to serve site-specific purposes.
Creating node groups will significantly enhance the capability to start a given group of nodes at the same time. Without these definitions, an administrator would have to issue many separate commands when a single command could be used.
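A quick way to confirm that a group exists and contains the expected members is to list it with nodels; the group names below are the ones recommended above, so substitute your own if they differ:
$ nodels service
$ nodels storage
$ nodels compute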
It is also key to note any failures in the start-up process and continue when possible. There may be an issue with some part of the cluster starting up which does not affect other parts of the cluster. When this occurs you should continue with the boot process for all areas that are successful while retrying or diagnosing the section with the failure. This will allow the rest of the cluster to continue to start, which is more efficient than holding up the entire cluster start-up. Notes on specific areas where this could happen and how to address failures will be added where appropriate. It is not possible to identify every possible error, and documenting all failures and concerns would make this document very difficult to read. We will focus these notes on the most critical areas or the areas where it is more common to see an issue during start-up.
This step is the powering on of the hardware required to administer the system. As we mentioned previously, a critical aspect of the cluster is the starting of the network switch hardware. The switches should be powered on at this time.
Network Switch Verification: Physical inspection of the lights should indicate whether the switches are up and running and whether the ports are active.
Power-on any external disks used for dual-EMS support. This is required prior to starting the primary EMS.
Once the Ethernet switches and the EMS shared disk drives are up, it is time to power on the primary EMS and the HMCs. The backup EMS will be started after the cold start is complete and the cluster is operational. It is not needed for the cluster start-up, and spending time to start it would take away from the limited time for the entire cluster start-up process. Starting the primary EMS and the HMCs is a manual step which requires the administrator to push the power button on each of these systems to start the boot process. They can be started at the same time since they have no dependency on each other.
Perform the following on the primary EMS. Note: Do not start up the backup EMS at this time, and do not perform these steps on a backup EMS. The backup EMS start-up process will be described after the cluster start-up process.
Mount external, shared disks
$ mount /dev/sdc1 /etc/xcat
$ mount /dev/sdc2 /install
$ mount /dev/sdc3 ~/.xcat
$ mount /dev/sdc4 /databaseloc
Start the DB2 Monitoring daemon
$ /opt/ibm/db2/V9.7/bin/db2fmcd &
Next start the xcatdb instance:
$ su - xcatdb
$ db2start
$ exit
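A simple way to confirm that the instance came up is to check for the DB2 engine process, db2sysc (a general DB2 check, not specific to xCAT):
$ ps -ef | grep db2sysc | grep -v grep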
Start the xCAT daemon
For AIX:
$ restartxcatd
For Linux:
$ service xcatd start
Start DHCP
For AIX:
$ startsrc -s dhcpsd
For Linux:
$ service dhcpd restart
Start hardware server
For AIX:
$ /opt/isnm/hdwr_svr/bin/hdwr_svr
For Linux:
$ service hdwr_svr start
Start TEAL
$ service teal start
Start CNM
For AIX:
$ /usr/bin/chnwm -a
For Linux:
$ service cnmd start
Start LoadLeveler
Note: The LoadLeveler CM will be running on the SN when they are started. There is no need to check status for Loadleveler on the EMS at this time as it needs the Loadleveler CM to be started before it can communicate.
$ llctl start
The EMS has a console attached and the administrator can monitor the boot process and await a login prompt. Once the OS has completed booting the administrator can login and begin to evaluate the state of xCAT.
Verify that the xCAT daemon is running and can access the xCAT database:
$ lsxcatd -a
Verify that ssh is configured on the HMCs
$ rspconfig hmc sshconfig
Verify that DHCP is running
$ service dhcpd status
Verify that conserver is running
$ service conserver status
Verify that hardware server is running
For AIX:
$ ps -eaf | grep hdwr_svr
For Linux:
$ service hdwr_svr status
Verify that CNM started
For AIX:
$ ps -eaf | grep cnmd | grep -v grep
For Linux:
$ service cnmd status
You can now verify that the CNM daemon and the HFI configuration are working by executing the CNM command "lsnwloc" to display frame-cage and supernode drawer information and "nmcmd" to dump the drawer status information. This will list the current state of the drawers working with CNM. Please reference the HPC using the 9125-F2C guide for more detail about CNM commands, implementation, and debug.
$ /opt/isnm/cnm/bin/lsnwloc
FR0017-CG03-SN000-DR0
FR0017-CG04-SN000-DR1
FR0017-CG05-SN000-DR2
$ /opt/isnm/cnm/bin/nmcmd -D -D
Frame 17 Cage 3 Supernode 0 Drawer 0 RUNTIME
Frame 17 Cage 5 Supernode 0 Drawer 2 RUNTIME
Frame 17 Cage 4 Supernode 0 Drawer 1 RUNTIME
Verify that TEAL has started.
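A minimal check, assuming the teal init script used in the start-up step above also supports a status action:
$ service teal status
$ ps -ef | grep teal | grep -v grep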
Using the xCAT EMS, verify that each of the HMCs is up and running.
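One simple reachability check from the EMS, assuming the HMCs are defined in an hmc node group at your site:
$ pping hmc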
The powering on of the frames is a manual process which requires one or more people to walk around and turn on the frame EPO buttons on the front of each frame. This will apply power to the frames' bulk power. Each frame will apply power to its BPAs, which will in turn cause them to boot. Once booted, each frame and its BPAs will stop at rack standby.
The frame BPAs take about 3 minutes to boot once power is applied. To verify the state of a frame the administrator may issue the following command at the EMS:
$ rpower frame state
Each frame should come back with a state of "Both BPAs at rack standby"
To apply power to all of the CEC FSPs each frame will need to exit rack standby mode. To exit rack standby mode for each frame issue the following command:
$ rpower frame exit_rackstandby
Once the frame has been given the exit rack standby command, the BPAs will exit rack standby and proceed to the standby state. The BPAs will also apply power to the FSPs in the frame.
Issue the following command to verify that the frame BPAs are in standby.
$ rpower frame stat
Issue the following command to verify that the CEC FSPs are in standby.
$ rpower cec stat
To verify the CEC power state, issue:
$ nodels cec nodelist.status
On the EMS, check for HFI links that are down and determine whether the number of down links is at an acceptable level before continuing to boot the diskless nodes.
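One possible way to list HFI hardware and links that are down, assuming the ISNM CNM package at your site provides the lsnwdownhw command alongside lsnwloc and nmcmd:
$ /opt/isnm/cnm/bin/lsnwdownhw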
At this stage we have applied power to the frame and the CECs and we are ready to boot the service nodes. These are the first nodes to boot within the 775s since they supply the OS images for the remaining nodes. To power on the SN the administrator will issue:
$ rpower service on
The verification process for the Service Node includes validating that the OS booted, that critical daemons and services started, and that communication to the EMS and other SNs is working. The following commands can be issued from the EMS to all of the service nodes at the same time. Note - This process assumes that the service nodes are already configured not to start GPFS and LoadLeveler automatically. GPFS is not available until the storage nodes are booted in the next step, and LoadLeveler requires GPFS.
Verify the Service Node state
$ nodels service nodelist.status
Verify that the xCAT daemon, xcatd, is running on the service nodes.
$ xdsh service -v lsxcatd -a
Verify that the EMS is able to communicate with service nodes
$ xdsh service -v lsxcatd -a
Verify HFI connectivity in between the service nodes
$ xdsh service -v ping <service node1 hfi0> -c 5
Note: GPFS needs to be mounted for LoadLeveler but is not available at this time in the start-up process. Once storage nodes are started there will be a step to validate GPFS on the service nodes.
As before, check from the EMS for any HFI links that are down and confirm that the number of down links is at an acceptable level before continuing to boot the diskless nodes.
The disk enclosures will receive power when the frame exits rack standby mode. This will make them start up and be operational and ready when the nodes to which they are attached are started. At this point we have power to the frames, CECs, and SNs. We are ready to boot the storage nodes, which will start the LPARs attached to the IO and begin to bring up GPFS on each of these nodes. To power on the storage nodes, issue the following command:
$ rpower storage on
To verify that the storage nodes powered on successfully, we need to validate that the OS booted properly and is running, and check that the required services are active and configured properly.
Check that the operating system is up and running on the storage nodes, then verify that GPFS is active.
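Following the pattern used earlier for the service nodes, a reasonable OS-level check is:
$ nodels storage nodelist.status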
$ xdsh storage -v mmgetstate
Due to the significant number of disk drives, there can be a delay here.
Start GPFS with mmstartup and verify its state with mmgetstate, then start LoadLeveler with llctl start; llstatus can be run from the EMS as a local command to check the LoadLeveler status.
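A sketch of that sequence, run from the EMS and assuming GPFS is installed in the default /usr/lpp/mmfs location on the storage nodes:
$ xdsh storage -v /usr/lpp/mmfs/bin/mmstartup
$ xdsh storage -v /usr/lpp/mmfs/bin/mmgetstate
$ llctl start
$ llstatus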
At this point all critical parts of the infrastructure are operational, including the switches, EMS, HMCs, frames, CECs, SNs, and storage nodes. It is time to start all of the compute nodes and the login nodes. Issue the following command to start the compute and login nodes, as shown in the sketch below:
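Assuming the compute node group recommended earlier (and a corresponding login group, if one is defined at your site), the power-on follows the same pattern as for the service and storage nodes:
$ rpower login on
$ rpower compute on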
To verify that the login and compute nodes powered on successfully, you need to validate that the OS booted properly and is running, and check that GPFS is available on all compute nodes. Since there are most likely only a few login nodes, the task of checking them is fairly simple. The compute nodes are typically more numerous, so there are suggestions below on how to get a summary of the total number of successful boots rather than attempting to evaluate each node individually.
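One rough way to summarize compute node status from the EMS, assuming nodels prints one "node: status" pair per line:
$ nodels compute nodelist.status | awk '{print $2}' | sort | uniq -c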
Finally, verify the LoadLeveler status from the EMS:
$ llstatus