xCAT Wiki

An extreme cluster/cloud administration toolkit

Brought to you by: besawn, cxhong, gurevich, obihoernchen, victorhu

XCAT_775_Startup_Procedure

There is a newer version of this page. You can find it here.

Introduction

This cookbook will provide information about starting the xCAT HPC system Power 775 hardware along with verification steps as the system is being started. Everything described in this document is only supported in xCAT 2.6.6 and above. If you have other system p hardware, see [XCAT_System_p_Hardware_Management] .

Furthermore, this is intended only as a post-installation procedure.

More information about the Power 775 related software can be found at:

Terminology

The following terms will be used in this document:

xCAT DFM: Direct FSP Management is the name that we will use to describe the ability for xCAT software to communicate directly to the System p server's service processor without the use of the HMC for management.

Frame node: A node with hwtype set to frame represents a high end System P server 24 inch frame.

BPA node: is node with a hwtype set to bpa and it represents one port on one bpa (each BPA has two ports). For xCAT's purposes, the BPA is the service processor that controls the frame. The relationship between Frame node and BPA node from system admin's perspective is that the admin should always use the Frame node definition for the xCAT hardware control commands and xCAT will figure out which BPA nodes and their ip addresses to use for hardware service processor connections.

CEC node: A node with attribute hwtype set to cec which represents a System P CEC (i.e. one physical server).

FSP node: FSP node is a node with the hwtype set to fsp and represents one port on the FSP. In one CEC with redundant FSPs, there will be two FSPs and each FSP has two ports. There will be four FSP nodes defined by xCAT per server with redundant FSPs. Similar to the relationship between Frame node and BPA node, system admins will always use the CEC node for the hardware control commands. xCAT will automatically use the four FSP node definitions and their attributes for hardware connections. Compute node: This is a node which is used for customer applications. Compute nodes in a 775 cluster have no local disks or ethernet adapters. They are diskless nodes. Service node: This is an LPAR which assists in the hierarchical management of xCAT by exending the capabilities of the EMS. SN have a full disk image and are used to serve the diskless OS images for the nodes that it manages. IO node: This is an LPAR which has attached disk storage and provides access to the disk for applications. In 775 clusters the IO node will be running GPFS and will be managing the attached storage as part of the GPFS storage. Login node: This is an LPAR defined to allow the users to login and submit the jobs in the cluster. The login node will most likely have an ethernet adapter connecting it to the customer VLAN for access. Utility node: This is a general term which refers to a non-compute node/LPAR and a non-IO node/LPAR. Examples of LPARs in a Utility node are the Service Node, Login Node, and local customer nodes for backup of data, or other site-specific functions.

Overview

In a 775 cluster there are interrelationships and dependencies in the hardware and software architecture which require the startup to be performed in an orderly fashion. This document will explain these relationships and dependencies and describe in detail how to properly bring the system up to an HPC running state where users may login and start to submit jobs.

Hardware roles

Each set of hardware has a designated role in the cluster. This section will describe each part of the hardware and its role.

Ethernet Network Switch

The Ethernet switch hardware is key to any computer complex and provides the networking layer for IP communication. In 775 cluster, the switch hardware is used to support the Cluster Management LAN which is used by xCAT for OS distribution from the EMS to SN as well as administrations from the EMS to the SN. This hardware is also used to support the Cluster Service LAN which connects the EMSs, SNs, HMCs, FSPs, and BPAs together to provide access to the service processors within each Frame nad

To begin understanding of the flow of the start-up process lets first distinguish the different hardware responsibilities in the order in which each set of hardware becomes involved in the bring up process.

EMS

The xCAT Executive management Server is the central point of control for administration of the cluster. The EMS contains the xCAT DB as well as the Central Network Manager and its DB, and TEAL and its DB.

HMCs

The HMCs are used for Service Focal Point and Repair and Verify procedures. During initial installation and configuration the HMCs will be assigned Frames and CECs which they will monitor for any hardware failures.

SN

The Service nodes will be an LPAR within a building block which consists of a full disk image and will serve the diskless OS images for the nodes which it manages. All diskless nodes will require that the SN supporting them is up and running prior to being able to successfully boot. Some administrative operations in xCAT issued on the EMS are pushed out to the SN to perform the operations in a hierarchical manner which is needed for system administration performance.

IO node

The IO node is the LPAR with attached storage. It contains the GPFS software which manages the global filesystem for the cluster. All compute nodes are dependent on the IO nodes to be operational before they can mount the global filesystem.

Start-up Assumptions

There are some areas which are outside of the scope of this process. In order to draw a boundary on what hardware is part of the start up process and what is considered a prerequisite we will list some assumptions. It is assumed that the site has power and that everything is in place to begin the start up process. This would include the site cooling is up and operational and all power to the devices(switch, EMS, HMC, frames, etc) is ready to be applied.

The network switch hardware is a gray area in this process as some network switch hardware is part of the HPC cluster and others may be outside the cluster. For this discussion, we will make the assumption that all network switches that are customer site specific and not HPC cluster specific are up and operational.

There are some manual tasks involved in this process which require an IBM Systems Engineer or a site administrator to manually start equipment. There should be people available to perform these tasks and they should be very familiar with the power on controls needed for each task they are too perform. Examples include powering on the Ethernet network switches, EMS, HMC, frames, etc. These are all manual tasks which will need to be performed by a person when its time to do that step.

This process also assumes that all initial cabling and configuration, both hardware and software, has been done prior to this process and that the entire system has gone through booting and testing to eliminate any hardware or software problems prior to performing this procedure.

Dependencies

As the cluster is started, it is critical that hardware or software dependencies are up and operational prior to the successful completion of a hardware or software item which has the dependency. Lets take a high level view of the dependencies to help outline the flow of the startup process. This section is intended to give a rough idea of dependencies and it will not go into any detail as to how to accomplish the task or verify its completion.

Ethernet Switches - At the top of the dependencies is the HPC cluster ethernet switch hardware as well as any customer ethernet switch hardware. These will be the first items that need to be started.

EMS and HMCs - The next level of dependency is the EMS and HMCs. These can both be started at the same time once the network switches have been started.

Frames - Once the EMS and HMCs are started then we can begin to start the 775 hardware by powering on all of the frames. The frames are dependent on both the Switches and the EMS in order to come up properly.

CECs - Once the frame is powered on the CECs can be powered on. The CECs depend on the switches, EMS, and frames. Applying power to the CECs brings up the HFI network hardware, which is critical to distributing the operating system to diskless nodes, as well as for application communication.

SN - The SN can be started once the CECs are powered on and is dependent on the switches, EMS, frame, CEC.

IO node - The IO node can be started once the SN is operational. The IO node is dependent on the switches, EMS, frame, CEC, and SN.

Login and Compute nodes - Last in the list is the starting of the login and compute nodes. These can be done once the SN and IO nodes are up and operational. The login and compute node require the SN to be operational for the OS images loading. Login and Compute nodes depend on switches, EMS, frame, CEC, SN, and IO nodes.

Once the login and compute nodes have started, the admin can begin to evaluate the HPC cluster state by checking the various components of the cluster.

Start-up Procedure

This section will document the start-up procedure. Each sub-section will discuss the prerequisites, the process for this step, and the verification for completion. As we mentioned previously there are some assumptions on the current site state which must be met prior to starting this process; these include cooling and power and initial configuration and verification of the cluster performed during installation.

Before we begin with the start-up procedure, we should discuss the benefit of using xCAT group names. xCAT supports the use of group names and which allow the grouping of devices/nodes in a logical fashion to support a given types of nodes. We recommend that the following node groups be in place prior to performing this procedure: frame, cec, bpa, fsp, service, storage, and compute. Other node groups may be used to serve site specific purposes.

Creating node groups will significantly enhance the capability to start a given group of nodes at the same time. Without these definitions, an administrator would have to issue many separate commands when a single command could be used.

Power on Network Switch Hardware

This step is the powering of the hardware required to administer this system. As we mentioned previously a critical aspect to the cluster is the starting of the network switch hardware. These should be powered on at this time.

Network Switch Verification

Physical inspection of the lights should indicate whether the switches are up and running and ports are active. ARE THERE OTHER THINGS TO CHECK ON SWITCH?

Power on external disks attached to the EMS

Before powering on the EMS, power-on any external disks used for dual-EMS support.

Power on the EMS and HMCs

Once the Ethernet switches are up and operational, it is time to power on the EMS(s) and the HMCs. These are manual steps which require the admin to push the power button on these systems in order to start the boot process. They can be started at the same time as they do not have a dependency on each other.

Bring-up EMS

Perform the following on the main EMS. Do not perform this on a back-up EMS.

Mount external, shared disks
Start xCAT deamon

On Linux:

service xcatd start

On AIX:

mkssys -p /opt/xcat/sbin/xcatd -s xcatd -u 0 -S -n 15 -f 15 -a "-f"

startsrc -s xcatd
Start DB2

su - xcatdb
db2start
exit
Start DHCP
Start hardware server

For AIX:

/opt/isnm/hdwr_svr/bin/hdwr_svr
For Linux:

service hdwr_svr start
Start TEAL
Start CNM

For AIX:

/usr/bin/chnwm –a
For Linux:

service cnmd start
Start LoadLeveler

EMS verification

The EMS has a console attached and the administator can monitor the boot process and await a login propmpt. Once the OS has completed booting the administrator can login and begin to evaluate the state of xCAT.

verify that the xCAT DB is operational:
1. nodels example?? others?
verify that DHCP is running
Verify that hardware server is running

For AIX:
> ps -eaf | grep hdwr_svr
root 3473548 1 0 Jun 21 - 0:19 /opt/isnm/hdwr_svr/bin/hdwr_svr
If it is not running, execute,

/opt/isnm/hdwr_svr/bin/hdwr_svr
For Linux:
>service hdwr_svr status
hdwr_svr (pid 28631) is running...
If it is not running, execute:

service hdwr_svr start
verify that CNM started

For AIX:

ps -eaf | grep cnmd | grep -v grep
root 7590 1 0 Sep21 ? 00:12:36 /opt/isnm/cnm/bin/cnmd
For Linux:
service cnmd start
Verify it is running:
> service cnmd status
cnmd (pid 6908) is running...
verify that TEAL has started - ??
verify that the xCAT??

HMC verification

Using xCAT EMS verify that each of the HMCs are up and running.

verify that the HMC is running and that we have connectivity - dsh hmc ???

Power on Frames

The powering on of the frames is a manual process which requires one or more people to walk around and turn on the EPO buttons on the front of the frame. This will apply power to the frames bulk power. Each frame will apply power to the BPAs which will in turn cause them to boot and acquire their DHCP address from the EMS. Once booted each frame will stop at rack standy.

Frame power verification

To verify the state of a frame the administrator may issue the following command at the EMS:

rpower frame17 state

frame17: BPA state - Both BPAs at standby

Each frame should come back with a state of "Rack Standby"

Power on CECs

To apply power to the CECs each frame will need to exit rack standby mode. To exit rack standby mode for frame 17 issue the following command:

rpower frame17 exit_rackstandby

CEC power on verification

To verify the CEC power state you issue:

rpower f17c01  state

f17c01: CEC standby

Power on Service Nodes

At this stage we have applied power to the frame and the CECs and we are ready to boot the service nodes. These are the first nodes to boot within the 775s since they supply the OS for the remaining nodes. Top power on the SN the administrator will issue:

rpower <sn node name> on

Service Node power on verification

The verification process for the Service Node includes validating OS boot, critical daemons and services started and available, as well as communication to the EMS are available.

SN LPAR state - use this command to validate that the SN LPAR is operational.
1. rpower f17sn01 state
2. f17sn01: Running
Check to see that --- OS support is available here - could use some help ---
EMS is able to communicate with SN
1. xdsh f17sn01 date
2. f17sn01: Fri Sep 23 15:32:55 EDT 2011

Power on storage nodes

The disk enclosures will receive power when the frame exits rack_standby mode. This will make them start and be operational and ready when the nodes to which they are attached are started. At this point we have power to the frames, CECs, and SN. We are ready to boot the storage nodes which will start the LPARs attached to the IO and begin to bringup GPFS on each of these nodes. To power on the storage nodes issue the following command to each storage node:

rpower storage on

storage node power on verification

To verify the storage node successful power on we need to validate that the OS booted properly is running and we need to check to see if required services are active and configured properly.

storage node LPAR state - use this command to validate that the storage node LPAR is operational.
1. $ nodels storage nodelist.status
2. f17IO01: Running
Check to see that --- OS support is available here - could use some help ---
1. ssh or telnet to the node
2. Issue GPFS command to verify that its running and operational - what is the command for this?

At this point all critical parts of the infrastructure are operational including swithes, EMS, HMC frames, CECs, SN and storage node. Its time to start all of the compute nodes and the login nodes. Issue the following command to start the compute and login nodes:

To verify that the login and compute nodes successful power on you need to validate that the OS booted properly is running and check to see if GPFS is available on all compute nodes. Since there are a most likely a few login nodes the task for checking them is pretty simple. The compute nodes are typically more numerous and there are suggestions below on how to get a summary of the total number of successful rather than attempting to evaluate each one individually.

login node LPAR state - use this command to validate that the login node LPAR is operational.
1. $ rpower login state
2. f17ln01: Running
Issue xdsh command to use "date" to verify login node is up and running. More specific login node application checking could be used at the administrators discretion.
1. $ xdsh login date
compute node LPAR state - use this command to validate that the compute node LPAR is operational.
1. $ rpower compute state
2. f17ln01: Running
3. the above would be repeated for every compute node --> the admin may need to pipe this to a file or grep to sort our running and non-running lpars
4. $ rpower compute state | grep running | wc -l ---> this would provide the number of running compute lpars
Issue xdsh command to verify compute node connectivity and that GPFS is mounted at the same time.
1. $ xdsh compute ls -l /gpfs/some/dir
2. same comment on piping output to sort out successful entries