xCAT Wiki

An extreme cluster/cloud administration toolkit

Brought to you by: besawn, cxhong, gurevich, obihoernchen, victorhu

XCAT_support_for_AIX_on_system_P_Flex_Blades

Introduction
Setting up the xCAT MN for IBM Flex
xCAT Hierarchy and MySQL databases
Downloading and Installing DFM
Define the CMMs and Switches
CMM_Discovery_and_Configuration
Create node object definitions of flex blade servers
Create the hardware server connection for the blades' FSPs
- Update the FSP firmware (optional)
Get the MAC addresses for the nodes
Using AIX service nodes
Booting Flex blade nodes
Appendix 1: IBM Flex Recovery and CMM Redundancy
- Replacement of CMM
- CMM Redundancy
Appendix 2: CMM and Felxible Service Processor(FSP) password
- Errors caused by an FSP authentication problem
Appendix 3: Updating Firmware on Flex Ethernet and IB Switch Modules
- Firmware Update using CLI
**Appendix 4 Perform Deferred Firmware upgrades for Flex blade CEC **
Appendix 5 Connect Flex P blades to HMC for Teal SFP (Not yet supported)
Diagnostics
- lshwconn LINE DOWN after power outage

Introduction

AIX overview

xCAT may be used to support cluster environments that use the AIX operating system.

In an xCAT cluster the single point of control is the xCAT management node. In an AIX cluster the management node must be an AIX system and must be configured as an AIX NIM master.

Before using xCAT on an AIX cluster you should become familiar with the AIX operating system and Network Installation Manager (NIM) tools. For more information about NIM, see the IBM AIX Installation Guide and Reference. (<http://www-03.ibm.com/servers/aix/library/index.html>)

This document assumes that you have already installed and configured your xCAT AIX management node by following the process described in the AIX overview document. [XCAT_AIX_Cluster_Overview_and_Mgmt_Node].

For large scale cluster environments xCAT provides support for using additional installation servers. In an xCAT cluster these additional servers are referred to as service nodes.

For an xCAT on AIX cluster there is a primary NIM master which is on the xCAT management node(MN). The service nodes(SN) are configured as additional NIM masters. All commands are run on the management node. The xCAT support automatically handles the NIM setup on the low level service nodes and the distribution of the NIM resources. All installation resources for the cluster are managed from the primary NIM master. The NIM resources are automatically replicated on the low level masters when they are needed.

You can set up one or more service nodes in an xCAT cluster. The number you need will depend on many factors including the number of nodes in the cluster, the type of node deployment, the type of network etc.

AIX service nodes must be diskfull (NIM standalone) systems.

This document contains a section on how to set up and use xCAT service nodes. If you do not need service nodes in your cluster then simply skip this section.

xCAT uses AIX/NIM commands to support diskless and standalone NIM clients. For standalone clients you may choose either an "rte" or a "mksysb" type installation.

For more information on using these installation methods please refer to the following documents.

Installing AIX standalone nodes (using NIM rte method) [XCAT_AIX_RTE_Diskfull_Nodes]
Cloning AIX nodes (install using AIX mksysb image) [XCAT_AIX_mksysb_Diskfull_Nodes]
Installing AIX diskless nodes (using stateless,statelite,stateful methods) [XCAT_AIX_Diskless_Nodes]

Flex overview

IBM Flex combines networking, storage and servers in a single offering. It consist of an IBM Flex Chassis, one or two Chassis Management Modules(CMM) and Power 7 and/or x86 compute node servers. The type of the management module for IBM Flex is 'cmm', and the blade servers include the IBM Flex System™ p260, p460, and 24L Power 7 servers as well as the IBM Flex System™ x240 Compute Node which is an x86 Intel-processor based server. In this document only the management of POWER 7 blade server will be covered.

IBM Flex System™ p260, p460, and 24L Power 7 servers need to be managed by a xCAT Management Node (MN) which is to be created on a standalone System P7 server. There needs to be an ethernet commectivity between the xCAT MN to the CMMs, and to all the compute node through the Ethernet Switch Module. The xCAT support uses the hardware type 'hwtype=blade' to manage the P7 Flex blade servers working through the CMM management module). IBM Flex xCAT will use a management type of 'mgt=fsp' to control the POWER 7 servers which is done through the xCAT DFM (Direct FSP Management)). For xCAT IBM Flex Power 7 servers, the management approach is mixture of 'blade' and 'fsp'. Most of the discovery work will be done through CMM and the hardware management work with the server's FSP directly.

Terminology

The following terms will be used in this document:

Direct FSP Management(DFM) - This is the name that we will use to describe the ability for xCAT software to communicate directly to the IBM FLex Power pblade's service processor without the use of the HMC for management.
Chassis Management Module(CMM) - This term is used to reflect the pair of management modules installed in the rear of the chassis which have an Ethernet connection. The CMM is used to discover the servers within the chassis and for some data collection regarding the servers and chassis.
blade node - Blade node refers to a node with the hwtype set to blade and represents the whole blade server. The hcp attribute of the blade is set to the FSP's IP.
FSP - Flexible Service Processor (FSP). This is the service processor within the IBM Flex Power blade.

Setting up the xCAT MN for IBM Flex

It is required that you create the xCAT MN on a standalone System P7 server that has proper ethernet connectivity to the CMMs and the PuerFlex blades. The xCAT administrator should reference the general xCAT MN procedures listed in the System P Linux or System P AIX guides and then follow the PureFlex instructions listed below.

For System P Linux:
XCAT_pLinux_Clusters/#install-xcat-on-the-management-node.

For System P AIX:
[XCAT_AIX_Cluster_Overview_and_Mgmt_Node]

xCAT Hierarchy and MySQL databases

If you are using service nodes you must switch to a database that supports remote access. XCAT currently supports MySQL, and PostgreSQL. As a convenience, the xCAT site provides downloads for MySQL and PostgreSQL.

You may continue to use the SQlite database that is installed by default with xCAT if you are not using service nodes.

( xcat-postgresql-snap201007150920.tar.gz and xcat-mysql-201005260807.tar.gz )

The HPC solution for IBM Flex only supports the MySQL database at this time.

See the following xCAT documents for instructions on how to configure MySQL database.

[Setting_Up_MySQL_as_the_xCAT_DB]

When configuring the new database you will need to add access for each of your service nodes. The process for this is described in the documentation mentioned above.

The database tar files that are available on the xCAT web site may contain multiple versions of RPMs - one for each AIX operating system level. When you are copying required software to your lpp_source resource make sure you copy the rpm that coincides with your OS level. Do not copy multiple versions of the same rpm to the NIM lpp_source directory.

Downloading and Installing DFM

This requires the new xCAT Direct FSP Management(dfm) plugin and hardware server(hdwr_svr) plugin, which are not part of the core xCAT open source, but are available as a free download from IBM. You must download this and install them on your xCAT management node.

Download the suitable dfm and hdwr_svr packages from IBM Fix Central for supported OS. Once you have downloaded these packages, install the hardware server package first, and then install DFM.

Download xCAT-dfm RPM: http://www-933.ibm.com/support/fixcentral/swg/selectFixes?parent=ibm~ClusterSoftware&product=ibm/Other+software/IBM+direct+FSP+management+plug-in+for+xCAT&release=All&platform=All&function=all

Download ISNM-hdwr_svr packages: http://www-933.ibm.com/support/fixcentral/swg/selectFixes?parent=ibm~ClusterSoftware&product=ibm/Other+software/IBM+High+Performance+Computing+%28HPC%29+Hardware+Server&release=All&platform=All&function=all

The ISNM hardware server base isnm.hdwr_svr and PTFs images along with the xCAT DFM rpm package needs to be downloaded and then installed on the xCAT MN .

Download the ISNM hardware server installp and the DFM aix rpm packages to the xCAT MN, and place the packages in a directory such as

    /install/post/otherpkgs/aix/ppc64/dfm

Install the hdwr_svr installp packages, and then install the dfm rpm package.

    cd /install/post/otherpkgs/aix/ppc64/dfm
    inutoc .
    installp -agQXYd . isnm.hdwr_svr  
    rpm -Uvh  xCAT-dfm*.aix5.3.ppc.rpm

Define the CMMs and Switches

Define the CMMs
Define the Switches
Fill in More xCAT Tables
- The passwd Table
- The networks Table
Declare a dynamic range of addresses for discovery

Define the CMMs

First just add the list of CMMs and the groups they belong to:

    nodeadd cmm[01-15] groups=cmm,all

Now define attributes that are the same for all CMMs. These can be defined at the group level. For a description of the attribute names, see the node object definition.

    chdef -t group cmm hwtype=cmm mgt=blade

Next define the attributes that vary for each CMM. There are 2 different ways to do this. Assuming your naming conventions follow a regular pattern, the fastest way to do this is use regular expressions at the group level:

    chdef -t group cmm mpa='|(.*)|($1)|' ip='|cmm(\d+)|10.0.50.($1+0)|'

Note: The Flow for CMM IP addressing is 1) initially each CMM obtains a DHCP address from a dynamic range of IP addresses specified later, 2) This DHCP address will be listed when we do CMM discovery using lsslp 3) CMM configuration steps will change the CMM DHCP obtained ip address to the permanent static IP address which is specified here.

This chdef might look confusing at first, but once you parse it, it's not too bad. The regular expression syntax in xcat database attribute values follows the form:

    |pattern-to-match-on-the-nodename|value-to-give-the-attribute|

You use parentheses to indicate what should be matched on the left side and substituted on the right side. So for example, the mpa attribute above is:

    |(.*)|($1)|

This means match the entire nodename (.*) and substitute it as the value for mpa. This is what we want because for CMMs the mpa attribute should be set to itself.

For the ip attribute above, it is:

    |cmm(\d+)|10.0.50.($1+0)|

This means match the number part of the node name and use it as the last part of the IP address. (Adding 0 to the value just converts it from a string to a number to get rid of any leading zeros, i.e. change 09 to 9.) So for cmm07, the ip attribute will be 10.0.50.7.

For more information on xCAT's database regular expressions, see http://xcat.sourceforge.net/man5/xcatdb.5.html . To verify that the regular expressions are producing what you want, run lsdef for a node and confirm that the values are correct.

If you don't want to use regular expressions, you can create a stanza file containing the node attribute values:

    cmm01:
      objtype=node
      mpa=cmm01
      ip=10.0.50.1
    cmm02:
      objtype=node
      mpa=cmm02
      ip=10.0.50.2
    ...

Then pipe this into chdef:

    cat <stanzafile> | chdef -z

When you are done defining the CMMs, listing one should look like this:

    lsdef cmm07
    Object name: cmm07
        groups=cmm,all
        hwtype=cmm
        ip=10.0.50.7
        mgt=blade
        mpa=cmm07
        postbootscripts=otherpkgs
        postscripts=syslog,remoteshell,syncfiles

Define the Switches

    nodeadd switch[1-4] groups=switch,all
    chdef -t group switch ip='|switch(\d+)|10.0.60.($1+0)|'

Fill in More xCAT Tables

The passwd Table

There are several passwords required for management:

blade - The userid and password for the CMM.
ipmi - The userid and password used to communicate with the IPMI service on the IMM (BMC) of each blade. To avoid problems, this should be the same as the CMM userid and password above.
system - The root id and password which will be set on the node OS during node deployment and used later for the administrator to login to the node OS.

Use tabedit to give the passwd table contents like:

    key,username,password,cryptmethod,comments,disable
    "blade","USERID","PASSW0RD",,,
    "ipmi","USERID","PASSW0RD",,,
    "system","root","cluster",,,

The networks Table

All networks in the cluster must be defined in the networks table. When xCAT was installed, it ran makenetworks, which created an entry in this table for each of the networks the management node is connected to. Now is the time to add to the networks table any other networks in the cluster, or update existing networks in the table.

For a sample Networks Setup, see the following example: Setting_Up_a_Linux_xCAT_Mgmt_Node/#appendix-a-network-table-setup-example.

Declare a dynamic range of addresses for discovery

If you want to use hardware discovery, 2 dynamic ranges must be defined in the networks table: one for the service network (CMMs and IMMs), and one for the management network (the OS for each blade). The dynamic range in the service network (in our example 10.0) is used while discovering the CMMs and IMMs using SLP. The dynamic range in the management network (in our example 10.1) is used when booting the blade with the genesis kernel to get the MACs.

    chdef -t network 10_0_0_0-255_255_0_0 dynamicrange=10.0.255.1-10.0.255.254
    chdef -t network 10_1_0_0-255_255_0_0 dynamicrange=10.1.255.1-10.1.255.254

CMM_Discovery_and_Configuration

Overview
Optional Discovery Method 1 - Mapping the CMMs to the switch port information (Development)
Optional Discovery Method 2 - Manually Discovering the CMMs Instead of Using the Switch Ports
CMM Configuration
CMM Security Password Expiration
Redundant CMM Support
Update the CMM firmware (optional)

Overview

In this section you will perform the CMM discovery and configuration tasks for the CMMs.

During the CMM discovery process all CMMs are discovered using Service Location Protocol(SLP) and the xCAT lsslp command. There are two methods which will allow mapping the SLP discovered CMMs to the CMMs predefined in the xCAT DB. You can either use method 1 which map the SLP data to the switch SNMP data together to update the xCAT DB or you can use method 2 to capture the SLP information to a file and edit it manually and then update the xCAT DB.

Two factors will determine which method you use. If this is a large configuration with many chassis and you are able to enable SNMP on the switch that the CMMs are connected to then method 1 would be preferred. If you are only defining a few chassis then method 2 might be an easier choice.

Note: xCAT Flex discovery now does not support the CMM with both primary and standby port.

Optional Discovery Method 1 - Mapping the CMMs to the switch port information (Development)

This supported will be available in xCAT 2.8.2 and later. This method requires SNMP access to the Ethernet switch where the CMMs are connected. If you can't configure SNMP on your switches, then use the section after:

CMM_Discovery_and_Configuration/#optional-discovery-method-2-manually-discovering-the-cmms-instead-of-using-the-switch-ports to discover and define the CMMs to xCAT.

In large clusters the most automated method for discovering is to map the SLP CMM information to the Ethernet switch SNMP data from which each chassis CMM is connected.

To use this method the xCAT switch and switches tables must be configured. The xCAT switch table will need to be updated with the switch port that each CMM is connected. The xCAT switches table must contain the SNMP access information.

Add the CMM switch/port information to the switch table.

 tabdump switch
 node,switch,port,vlan,interface,comments,disable
 "cmm01","switch","0/1",,,,
 "cmm02","switch","0/2",,,,

where: node is the cmm node object name switch is the hostname of the switch port is the switch port id. Note that xCAT does not need the complete port name. Preceding non numeric characters are ignored.

If you configured your switches to use SNMP V3, then you need to define several attributes in the switches table. Assuming all of your switches use the same values, you can set these attributes at the group level:

    tabch switch=switch switches.snmpversion=3 switches.username=xcatadmin \
         switches.password=passw0rd switches.auth=SHA



   tabdump switches
   switch,snmpversion,username,password,privacy,auth,linkports,sshusername,...
    "switch","3","xcatadmin","passw0rd",,"SHA",,,,,,

Note: It might also be necessary to allow authentication at the VLAN level

    snmp-server group xcatadmin v3 auth context vlan-230

Discover and update the xCAT CMM node definitions with the MAC, Model Type, and Serial Number.

    lsslp -s CMM -w

Verify that the CMMs have been updated with the mac, mtm, and serial information.

    lsdef cmm01
    cmm01:
           objtype=node
           mpa=cmm01
           nodetype=mp
           mtm=789392X
           serial=100037A
           side=2
           groups=cmm,all
           mgt=blade
           mac=5c:f3:fc:25:da:99
           hidden=0
           otherinterfaces=10.0.0.235
           hwtype=cmm

Optional Discovery Method 2 - Manually Discovering the CMMs Instead of Using the Switch Ports

If you can't enable SNMP on your switches, use this more manual approach to discover your hardware. If you have already discovered your hardware using spldiscover of lsslp --flexdiscover, skip this whole section.

Assuming your CMMs have at least received a dynamic address from the DHCP server, you can run lsslp to discover them and create a stanza file that contains their attributes that can be used to update the existing CMM nodes in the xCAT database. The problem is that without the switch port information, lsslp has no way to correlate the responses from SLP to the correct nodes in the database, so you must do that manually. Run:

    lsslp -m -z -s CMM > cmm.stanza

and it will create a stanza file with entries for each CMM that look like this:

    Server--SNY014BG27A01K:
           objtype=node
           mpa=Server--SNY014BG27A01K
           nodetype=mp
           mtm=789392X
           serial=100CF0A
           side=1
           groups=cmm,all
           mgt=blade
           mac=3440b5df0abe
           hidden=0
           otherinterfaces=10.0.0.235
           hwtype=cmm

Note: the otherinterfaces attribute is the dynamic IP address assigned to the CMM.

The first thing we want to do is strip out the non-essential attributes, because we have already defined them at a group level:

    grep -v -E '(mac=|nodetype=|groups=|mgt=|hidden=|hwtype=)' cmm.stanza > cmm2.stanza

Now edit cmm2.stanza and change each "<node>:" line and mpa to have the correct node name. Then put these attributes into the database:

    cat cmm2.stanza | chdef -z

CMM Configuration

For a new CMM the user USERID password is set as expired and you must use the xCAT rspconfig command to change the password to a new password before any other commands can access the CMM.

rspconfig cmm01 USERID=<new password>

Note: If password for CMM has been changed after discovery, you must make sure the correct password for CMM user USERID is updated into mpa table: chtab mpa=<cmm> mpa.username=USERID mpa.password=<password>; . You can then run the rspconfig command listed above.

Once a new password is use rspconfig to set the IP address of each CMM to the permanent (static) address specified in the ip attribute:

    rspconfig cmm01 initnetwork=*

Note: The rspconfig command with the initnetwork option will set the CMM IP address
to a the static IP address specified in the cmm01 node object ip attribute value.
The changing of the CMM network definition and will reset the CMM to boot
with the new value which will cause the CMM to temporarily loose its ethernet connection.

Checking the CMM definition will show that the DHCP value stored in otherinterfaces
has been removed since it is no longer being used.
You should use ping to test the IP address defined in the CMM node ip attribute to know when the CMM comes up before issuing other commands.

Once the CMM is back up and operationals use rspconfig to set the CMM to allow SSH and SNMP.

    rspconfig cmm01 sshcfg=enable 
    rspconfig cmm01 snmpcfg=enable

Note: If you receive error cmmxx: Failed to login to cmmxx, you can run "ssh USERID@cmm01" and set the ssh password for the CAT MN. If this does not work, we may need to check the passwords being referenced on the target CMM and in the xCAT database.

Note: If the cmm was previously defined and the rspconfig sshcfg=enable fails, you may need to clean up the old ssh entry in the know_hosts table on the xCAT MN. You can run "makeknownhosts cmm01 -r" to clean this ssh entry.

Check the values to make sure they were enabled properly.

    rspconfig cmm01 sshcfg snmpcfg
    cmm01: SSH: enabled
    cmm01: SNMP: enabled

Test the SSH connection to the CMM with the rscan CMM info command.

    ssh USERID@cmm01 info
    system> info
    UUID: 5CFB E60F 2EFB 4143 9154 B677 2A37 2143 
    Manufacturer: IBM (BG)
    Manufacturer ID: 20301
    Product ID: 336
    Mach type/model: 789392X
    Mach serial number: 100037A
    Manuf date: 2411
    Hardware rev: 52.48
    Part no.: 88Y6660
    FRU no.: 81Y2893
    FRU serial no.: Y130BG16D022
    CLEI: Not Available
    CMM bays: 2
    Blade bays: 14
    I/O Module bays: 4
    Power Module bays: 6
    Blower bays: 10
    Rear LED Card bays: 1
    U Height of Chassis 10
    Product Name: IBM Chassis Midplane

Test the SNMP connection to the CMM using rscan.

    rscan cmm01
    type    name             id      type-model  serial-number  mpa    address
    cmm     SN#Y014BG27A01K  0       789392X     100CF0A        cmm01  cmm01
    blade   node01           1       789523X     1082EAB        cmm01  10.0.0.232
    blade   node02           2       789523X     1082EBB        cmm01  10.0.0.231

CMM Security Password Expiration

The default security setting for the CMM is secure. This setting will require that the CMM user USERID password be changed within 90 days by default. You can change the password expiration date with the CMM accseccfg command. The following are examples of changing the expiration date.

List the security settings. The -pe is the password expiration:

   > ssh USERID@cmm01 accseccfg -T mm[1]
    system&gt; accseccfg -T mm[1]
    Custom settings:
    -alt 300
    -am local
    -cp on
    -ct 0
    -dc 2
    -de on
    -ia 120
    -ici off
    -id 180
    -lf 20
    -lp 2
    -mls 0
    -pc on
    -pe 90
    -pi 0
    -rc 5
    -wt user

You can change the password expiration date using the CMM flex command accseccfg .

     ssh USERID@cmm01 accseccfg -pe 300 -T mm[1] (set expiration days to 300)
     ssh USERID@cmm01 accseccfg -pe 0 -T mm[1]   (set expiration date to not expire)

More details on the CMM accseccfg command can be found at: http://publib.boulder.ibm.com/infocenter/flexsys/information/index.jsp?topic=%2Fcom.ibm.acc.cmm.doc%2Fcli_command_accseccfg.html

Redundant CMM Support

The xCAT support for CMM redundancy is to use the second CMM as the default standby CMM that has its own ethernet connection into the HW VLAN. For CMM discovery, it is recommended that the Flex cluster admin only plug in and connect the Bay 1 CMM as the primary CMM, where the admin does discovery and configuration of the Flex cluster with one primary CMM. When the primary CMM is fully working as a "static" IP with proper firmware levels, the admin can plug in the second Bay 2 CMM into the Flex chassis, and it will automatically come online as a standby CMM with same CMM firmware as the primary CMM. You can see more information about CMM recovery with Redundant CMM in a different section below.

Update the CMM firmware (optional)

This section specifies how to update the CMM firmware. You can run the xCAT "rinv cmm firm" command to list the cmm firmware level.

     rinv  cmm firm

The CMM firmware can be updated by loading the new cmefs.uxp firmware file using the CMM update command working with the http or tftp interface. Since the AIX xCAT MN does not usually support http, we have provided CMM update instructions working with tftp. The administrator needs to download firmware from IBM Fix Central. The compressed tar file will need to be uncompressed and unzipped to extract the firmware update files. You need to place the cmefs.uxp file in the /tftpboot directory on the xCAT MN for CMM update to work properly.

Once the firmware is unzipped and the cmefs.uxp is placed in the /tftpboot directory on the xCAT MN you can use the CMM update command to update the firmware on one chassis at a time or on all chassis managed by xCAT MN. More details on the CMM update command can be found at: http://publib.boulder.ibm.com/infocenter/flexsys/information/index.jsp?topic=%2Fcom.ibm.acc.cmm.doc%2Fcli_command_update.html

The format of the update command is: flash (-u) the CMM firmware file and reboot (-r) afterwards

    update -T system:mm[1] -r -u tftp://<server>/<update file>

flash (-u), show progress (-v), and reboot (-r) afterwards

    update -T system:mm[1] -v -r -u tftp://<server>/<update file>

Note: Make sure the CMM firmware file cmefs.uxp is placed in /tftpboot directory on xCAT MN. The tftp interface from the CMM will reference the /tftpboot as the default location.

To update firmware and restart a single CMM cmm01 from xCAT MN 70.0.0.1 use:

    ssh USERID@cmm01 update -T system:mm[1] -v  -r -u tftp://70.0.0.1/cmefs.uxp

If unprompted password is setup on all CMMs then you can use xCAT psh to update all CMMs in the cluster at once.

    psh -l USERID cmm update -T system:mm[1] -v -u tftp://70.0.0.1/cmefs.uxp

If you are experiencing a "Unsupported security level" message after the CMM firmware was updated then you should run the following command to overcome this issue.

    rspconfig cmm sshcfg=enable snmpcfg=enable

You can run the xCAT "rinv cmm firm" command to list the new cmm firmware.

     rinv  cmm firm

Add the CMM node names into the /etc/hosts, and dns resolution if being used for name resolution.

    makehosts cmm
    makedns cmm

(For "makehosts" details see: http://xcat.sourceforge.net/man1/makehosts.1.html )

Create node object definitions of flex blade servers

There are different methods used to create the flex blade node objects in the xCAT database. One method is to create the predefined node objects, and then update the node objects using ""rscan -u "". The other method is to create ""rscan -z"", and then manually update the flex blade stanza file. The admin can then create the node objects using the stanza file.

Create predefined nodes in the Database first and run rscan -u to Discovery

This implementation should only be used when there are uniformed blade configurations working in the chassis. If there are mixtures of single and double wide blades in the chassis, the admin will need to remove unused blade node objects.

First just create the predefined node based on cmm and blade location, add the list of blades and the groups they belong too:

    nodeadd cmm[01-02]node[01-14] groups=all,blade

Change the blade definitions with the common attributes.

    chdef -t  group blade mgt=fsp cons=fsp

The attribute 'mpa' should be set to the node name of cmm. The attribute 'slotid' should be set to the physical slot id of the blade. The attribute 'hcp' should be set to the IP that admin try to assign to the fsp of the blade. Use chdef with patterns that will map to the settings you require.

    'chdef -t group blade mpa='|cmm(\d+)node(\d+)|cmm($1)|'slotid='|cmm(\d+)node(\d+)|($2+0)|' \
     hcp='|cmm(\d+)node(\d+)|10.0.($1+0).($2+0)|****

List the blade entries to review the blade the definitions created.

    [root@c870f3ap01 ~]# nodels blade
    cmm01node01
    cmm01node03
    cmm01node05
    cmm01node07
    cmm01node09
    cmm01node10
    cmm01node11

Use lsdef to check each entry to validate the hcp, slotid, and hsp attributes:

    [root@c870f3ap01 ~]# lsdef cmm01node01
    Object name: cmm01node01
    cons=fsp
    groups=blade,all
    hcp=12.0.0.32
    hwtype=blade
    id=1
    mgt=fsp
    mpa=cmm01
    mtm=789542X
    nodetype=ppc,osi
    parent=cmm01
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,syncfiles
    serial=10F752A
    slotid=1

Run rscan -u to discover all the compute node servers.

The rscan -u will match the xCAT nodes which have been defined in the xCAT database and update them instead of create a new one. It will also provide an error message that specifies if the blade node object is not found in the xCAT database. This type of error should happen when there is a configuration where the chassis contains both single wide and double wide blade configurations. The admin can execute the rmdef command for any unused blade node objects.

    rscan cmm -u

(For "rscan" details see: http://xcat.sourceforge.net/man1/rscan.1.html )

If there are a mixture of single and double wide blade in the chassis, the admin should remove the unused blade objects from the xCAT DB.

    rmdef  <cmmxxnodeyy>

Create object definition of flex blades by discovery using stanza files

This method is suggested when there are a different mix of flex blades being used in the flex blade cluster.

The rscan command reads the actual configuration of blade server in the CMM and creates node definitions in the xCAT database to reflect them. This command will create node objects for the target CMM, and the flex blades on the CMM in a stanza file. The admin should manually update the different nodes objects to specify the proper node names they want to use in the xCAT cluster. The admin may also want to change the hcp=<FSP IP> to be a different IP address than what was provided by DHCP server. If the CMM node object is already created, you can remove the CMM entries from the stanza file. You may need to add the "id=0" attribute to cmm objects later.

There are unique differences between System P and System X Flex blade node objects working with rscan command. The big differences are the following attributes.

For System P Flex blades

    mgt=fsp
    cons=fsp
    id=1
    slotid=<blade slot>  
    hcp=<FSP IP>

For System X Flex blades

    mgt not set,   admin can update  with mgt=ipmi
    cons not set,  admin can update with cons=ipmi
    id is not used
    slotid=<blade slot>

there is no hcp

Run the rscan command against all of the CMMs to create a stanza file for the definitions of all the compute node servers.

rscan cmm -z >nodes.stanza

The Power 7 compute node stanza file is like this:

    SN#YL10JH184084:
           objtype=node
           nodetype=ppc,osi
           slotid=1
           id=1
           mtm=789542X
           serial=10F69BA
           mpa=flexcmm01
           parent=flexcmm01
           hcp=70.0.0.41
           groups=blade,all
           mgt=fsp
           cons=fsp
           hwtype=blade
    SN#Y110UF18P003:
           objtype=node
           nodetype=ppc,osi
           slotid=3
           id=1
           mtm=789522X
           serial=10F75AA
           mpa=flexcmm01
           parent=flexcmm01
           hcp=70.0.0.22
           groups=blade,all
           mgt=fsp
           cons=fsp
           hwtype=blade

In a stanza file, the user can get the blade server with the attributes hcp (fsp of the blade), mtm, serial and id attributions. For the stanza file above, the node SN#YL10JH184084 is a power blade(nodetype=ppc,hwtype=blade,mpa=cmm01). In order to easily access or operate those compute node servers, the user can edit the stanza file and give the node the name user want them to be for definition of each compute node server.

For Power 7 compute nodes the administrator will change the object name and hcp attribute for the IP of fsp. For example, the user can modify the definition of SN#YL10JH184084 as followings:

    cmm01node01:
       objtype=node
       cons=fsp
       groups=blade,all
       hcp=70.0.0.41
       hwtype=blade
       slotid=3
       id=1
       mgt=fsp
       mpa=cmm01
       mtm=789542X
       nodetype=ppc,osi
       parent=flexcmm01
       serial=10F69BA
       slotid=1

Then create the definitions in the database:

    cat nodes.stanza | mkdef -z

If CMM node objects are not updated from the target stanza file, make sure that the """id=0""" attribute is updated for the CMMs.

     chdef cmm  id=0

Set the network configuration for the fsp

The FSP for the System P flex blade will initially be setup as a dynamic IP address. The admin can choose to use this IP, or has the option to change it to another static IP address in the service VLAN. This FSP IP is controlled by the hcp attribute for the node. You can use mkdef/chdef or rscan to update the hcp entries to set the proper FSP IP addresses. The rspconfig command with the network=* option will set the FSP IP address to the value you specified in the hcp attribute.

    chdef  cmm01node01 hcp=12.0.0.101    
    rspconfig blade network=*

Modify blade server device names

In order to conveniently manage the blade servers, the customer may wan to have a cleaner name for the blade node. The following command can be used to modify a blade device name.

    rspconfig singlenode textid="cmm01node01"

The following command can be used to change a group of blade device name to the node names that are defined in xCAT DB.

    rspconfig blade textid=*

Create the hardware server connection for the blades' FSPs

1. Add the blade's fsp connections for the DFM management:

    mkhwconn blade -t

(For "mkhwconn" details see: http://xcat.sourceforge.net/man1/mkhwconn.1.html )

2. check the connections are LINE_UP:

    lshwconn blade

(For "lshwconn" details see: http://xcat.sourceforge.net/man1/lshwconn.1.html )

3. make sure the blade server powered on

    rpower blade state 
    rpower blade on

(For "rpower" details see: http://xcat.sourceforge.net/man1/rpower.1.html )

Update the FSP firmware (optional)

This is accomplished by using the rflash xCAT command from the xCAT Management node. The admin should download the supported GFW from the IBM Fix central website, and place it in a directory that is available to be read by the xCAT Management node. The default firmware option with rflash is working with "disruptive". Since the Flex blades work with DFM, the admin may use the rflash "deferred" firmware option which is listed in the Appendix.

1. Use rinv command to get the current firmware levels of the blades' FSPs:

    rinv bladenoderange firm

(For "rinv" details see: http://xcat.sourceforge.net/man1/rinv.1.html )

2.Use the rflash command to update the firmware levels for the blades' FSPs. Then validate that the new firmware is loaded:

For firmware disruptive update, you should make sure the blade in power off state firstly.

     rpower bladenoderange off

And then use rflash to do the update:

    rflash bladenoderange -p <directory> --activate disruptive

(For "rflash" details see: http://xcat.sourceforge.net/man1/rflash.1.html )

    rinv bladenoderange firm

Note: If there is an error during the rflash update where the firmware is not loaded properly, you ran reference the firmware recovery procedure at the following xCAT document location.
XCAT_Power_775_Hardware_Management/#recover-the-system-from-a-pp-situation-because-of-the-failed-firmware-update

3. Verify that the blades are healthy, then power on and boot up the blades:

    rpower bladenoderange state
    rvitals bladenoderange lcds
    rpower bladenoderange on

(For "rvitals" details see: http://xcat.sourceforge.net/man1/rvitals.1.html )

Get the MAC addresses for the nodes

IBM Flex POWER 7 blades support getting the mac address through the CMM.

Set the 'getmac' attribute to 'blade'

chdef cmm01node01 getmac=blade

Note: Since the Firmware is not stable at present, the following 2 steps are recommended to get the mac address for the specified interface.

run the getmacs command to display all the macs

    getmacs cmm01node01 -d

Add option '-i' to specify the interface

The option '-i' for 'getmacs' can be used to specify the interface whose mac address will be collected. The admin shall exactly know which interface is connected.

Note: If 4 mac addresses are gotten, they all are the mac addresses of the blade. The N can start from 0(map to the eth0 of the blade) to 3. If 5 mac addresses are gotten, the 1st mac address must be the mac address of the blade's FSP, so the N will start from 1(map to the eth0 of the blade) to 4.

    getmacs cmm01node01 -i enN

(For "getmacs" details see: http://xcat.sourceforge.net/man1/getmacs.1.html )

Update /etc/hosts

All of the cluster nodes should be added to the /etc/hosts file on the xCAT management node. You can either edit the /etc/hosts file by hand, or use [http://xcat.sourceforge.net/man8/makehosts.8.html makehosts].

If you edit the file by hand, it should look similar to:

127.0.0.1  localhost localhost.localdomain
50.1.2.3  mgmtnode-public mgmtnode-public.cluster.com
10.0.0.100  mgmtnode mgmtnode.cluster.com
10.0.0.1  node1 node1.cluster.com
10.0.0.2  node2 node2.cluster.com

On AIX systems the order of the short hostname and long hostname are typically reversed.

If your node names and IP addresses follow a regular pattern, you can easily populate /etc/hosts by putting a regular expression in the xCAT hosts table and then running '''makehosts'''. To do this, you need to first create an initial definition of the nodes in the database, if you haven't done that already:

mkdef node[01-80] groups=compute,all

Next, put a regular expression in the hosts table. The following example will associate IP address 10.0.0.1 with node1, 10.0.0.2 with node2, etc:

chdef -t group -o compute ip='|node(\d+)|10.0.0.($1+0)|'

Then run

makehosts compute

and the following entries will be added to /etc/hosts:

10.0.0.1 node01 node01.cluster.com
10.0.0.2 node02 node02.cluster.com
10.0.0.3 node03 node03.cluster.com

For an explanation of the regular expressions, see the [http://xcat.sourceforge.net/man5/xcatdb.5.html xCAT database man page].

Note that it is a convention of xCAT that for Linux systems the short hostname is the primary hostname for the node, and the long hostname is an alias.

On AIX the order is typically reversed. To have the long hostname be the primary hostname, you can use the -l option on the [http://xcat.sourceforge.net/man8/makehosts.8.html makehosts] command.

xCAT node group support ( static and dynamic)

A node group is essentially a named collection of cluster nodes that can be used as a simple way to target an action to a specific set of nodes. The node group names can be used in any xCAT command that targets a node range.

XCAT supports both static and dynamic groups. A static group is defined to contain a specific set of cluster nodes. A dynamic node group is one that has its members determined by specifying a selection criteria for node attributes. If a nodes attribute values match the selection criteria then it is dynamically included as a member of the group. The actual group membership will change over time as nodes have attributes set or unset. This provides flexible control over group membership by defining the attributes that define the group, rather than the specific node names that belong to the group. The selection criteria is a list of "attr<operator>val" pairs that can be used to determine the members of a group, (see below).

Note: Dynamic node group support is available in xCAT version 2.3 and later.

In xCAT, the definition of a static group has been extended to include additional attributes that would normally be assigned to individual nodes. When a node is part of a static group definition it can inherit the attributes assigned to the group. This feature can make it easier to define and manage cluster nodes in that you can generally assign nodes to the appropriate group and then just manage the group definition instead of multiple node definitions. This feature is not supported for dynamic groups.

To list all the attributes that may be set for a group definition you can run:

    lsdef -t group -h

When a node is included in one or more static groups a particular node attribute could actually be stored in several different object definitions. It could be in the node definition itself or it could be in one or more static group definitions. The precedence for determining which value to use is to choose the attribute value specified in the node definition if it is provided. If not, then each static group that the node belongs to will be checked to see if the attribute is set. The first value that is found is the value that is used. The static groups are checked in the order that they are specified in the "groups"attribute of the node definition.

NOTE: In a large cluster environment it is recommended to focus on group definitions as much as possible and avoid setting the attribute values in the individual node definition. (Of course some attribute values, such as a MAC addresses etc., are only appropriate for individual nodes.) Care must be taken to avoid confusion over which values will be inherited by the nodes.

Group definitions can be created using the mkdef command, changed using the chdef command, listed using the lsdef command and removed using the rmdef command.

Creating a static node group

There are two basic ways to create xCAT static node groups. You can either set the "groups"attribute of the node definition or you can create a group definition directly.

You can set the "groups"attribute of the node definition when you are defining the node with the mkdef or nodeadd command or you can modify the attribute later using the chdef or nodech command. For example, if you want a set of nodes to be added to the group "aixnodes",you could run chdef or nodech as follows.

    chdef -t node -p -o node01,node02,node03 groups=aixnodes

    nodech node01,node02,node03 groups=aixnodes

The "-p"(plus) option specifies that "aixnodes"be added to any existing value for the "groups"attribute. The "-p"(plus) option is not supported by nodech command.

The second option would be to create a new group definition directly using the mkdef command as follows.

    mkdef -t group -o aixnodes members="node01,node02,node03"

These two options will result in exactly the same definitions and attribute values being created in the xCAT database.

Creating a dynamic node group

The selection criteria for a dynamic node group is specified by providing a list of "attr<operator>val" pairs that can be used to determine the members of a group. The valid operators include: "==", "!=", "=~"and "!~". The "attr" field can be any node definition attribute returned by the lsdef command. The "val" field in selection criteria can be a simple sting or a regular expression. A regular expression can only be specified when using the "=~" or "!~" operators. See <http://www.perl.com/doc/manual/html/pod/perlre.html> for information on the format and syntax of regular expressions.

Operator descriptions:

    == Select nodes where the attribute value is exactly this value.
    != Select nodes where the attribute value is not this specific value.
    =~ Select nodes where the attribute value matches this regular expression.
    !~ Select nodes where the attribute value does not match this regular expression.

The selection criteria can be specified using one or more "-w attr<operator>val"options on the command line.

If the "val"field includes spaces or any other characters that will be parsed by shell then the "attr<operator>val" needs to be quoted.

For example, to create a dynamic node group called mygroup, where the hardware control point is hmc01 and the partition profile is not set to service.

    mkdef -t group -o mygroup -d -w hcp==hmc01 -w pprofile!=service

To create a dynamic node group called pslesnodes, where the operating system name includes sles and the architecture includes ppc.

    mkdef -t group -o pslesnodes -d -w os=~sles[0-9]+ -w arch=~ppc

To create a dynamic node group called nonpbladenodes where the node hardware management method is not set to blade and the architecture does not include ppc

    mkdef -t group -o nonpbladenodes -d -w mgt!=blade -w 'arch!~ppc'

Using AIX service nodes

If you wish to use xCAT service nodes in your cluster environment you must follow the process described in this section to properly install and configure the service nodes.

Select the following link.

[Using_AIX_service_nodes]

[XCAT_AIX_RTE_Diskfull_Nodes]

Booting Flex blade nodes

Make sure the blades are in the "on" state

To check the state you can run:

    rpower bladenoderange stat

If node is off then run:

    rpower bladenoderange on

(For "rpower" details see: http://xcat.sourceforge.net/man1/rpower.1.html )

Using a remote console

Make sure the SOL on the CMM has been disabled

It is important that the admin disable the Serial Over Lan (SOL) support on the CMM, so that xCAT DFM can control the remote console for the System P flex blades. Please execute the rspconfig command to each CMM. Run the following commands:

    rspconfig cmm solcfg 
    rspconfig cmm solcfg=disable

(For "rspconfig" details see: http://xcat.sourceforge.net/man1/rspconfig.1.html )

Update the conserver configuration

Run the following commands:

    makeconservercf 
    stopsrc -s conserver 
    startsrc -s conserver

(For "makeconservercf" details see: http://xcat.sourceforge.net/man1/makeconservercf.1.html )

Check rcons function

Run the rcons command to check if it is functioning properly.

    rcons onebladenode

(For "rcons" details see: http://xcat.sourceforge.net/man1/rcons.1.html )

If the blade is in the "off" state, it will specify "Destination BLADE is in POWER OFF state, Please power it on and wait.". If this is the case then you need to change the blade to the "on" state.

rpower onebladenode on

set the bootlist on the node

rbootseq requires that the node is powered on. Use the onstandby rpower option to power the node on and leave it in the standby state.

    rpower bladenoderange onstandby

Use the xCAT rbootseq command to set the boot device on the nodes.

    rbootseq bladenoderange net

(For "rbootseq" details see: http://xcat.sourceforge.net/man1/rbootseq.1.html )

Initiate a network boot of the nodes

Use the rpower command to initiate a network boot of the node.

    rpower bladenoderange reset

Appendix 1: IBM Flex Recovery and CMM Redundancy

The CMM is the gateway for the hardware management and monitoring communication for the Flex chassis and the Flex P7 blades. If you lose the network communication between the xCAT MN and the primary CMM, you can not execute any hardware management commands to the CMM or blades. If the Flex P7 blades and Ethernet SM are running, the blades should be able to keep running for some time.

Replacement of CMM

If you only have one CMM configured in your Flex chassis, you will need to work with IBM service to fix this CMM quickly, since you will not be able to properly manage the Flex blades until you have a working CMM. The CMM replacement activity is to execute CMM HW discovery on new CMM, where you locate the new MAC address and current DHCP dynamic IP address for CMM. You then update the CMM node object's "mac" and "otherinterfaces" attributes with data found from hardware discovery. Once the CMM node object has new data, we execute the configuration CMM steps working with rspconfig. Once the CMM is configured using the static IP, the DHCP and mac address is not referenced.

The following scenario is to replace the CMM working with node object "cmm01" with a static IP of 10.1.100.1.

     lsslp -m -z -s CMM &gt; /tmp/cmm01.stanza   (Locate new mac and DHCP IP for replacement CMM
     chdef cmm01 otherinterfaces=&lt;dhcpip&gt; mac=&lt;macaddr&gt;  (Update cmm01 object with new mac and current DHCP IP) 
     rspconfig cmm01 USERID=&lt;new_passwd&gt;  (Set password for USERID for new cmm01
     rspconfig cmm01 initnetwork=*      (Set new cmm01 back to original static IP
     rspconfig cmm01 sshcfg=enable snmpcfg=enable  Enable ssh and snmp for new cmm01

CMM Redundancy

The recommended support strategy with xCAT is to setup each Flex chassis with 2 CMM's where the primary CMM is located in bay 1 and the standby CMM is in bay 2. Each CMM needs to have their own ethernet connection into the xCAT HW VLAN, and the primary CMM must be configured as a static IP that is listed in the xCAT DB. The xCAT MN only can communicate with the primary CMM when executing hardware management commands. The Standby CMM is only there as a backup, and will take ownership as the primary CMM using the same static IP. The xCAT Flex only supports the default CMM redundancy configuration, and does not support the advanced failover settings. The activity for CMM fail over is that the standby CMM takes over the roll of the primary CMM, and that failed CMM is setup as the standby CMM when registered by the Flex chassis. The xCAT MN will lose it's network connection to the primary CMM during the CMM fail over, but will automatically reconnect back to the new primary CMM when it completes the failover in about 3-4 minutes

The fail over from the primary CMM to the standby CMM happens in the following scenarios.

 Admin executes software failover from the CMM GUI 
 Admin executed software failover using the CMM CLI
 Admin physically pulls out primary CMM from the Flex chassis

Fail over software reset from CMM GUI

The admin will have a network connection into the CMM, and has activated the CMM GUI. They will reference the "Mgt Module Management" and select on the "Restart" . The admin selects the "Restart and Switch to Standby Management Module" . This will cause the primary CMM to reset, and will change the setting of primary to the "Standby CMM" which now becomes the new primary CMM when the fail over completes.

Fail over software reset from CMM CLI

The admin will have a network connection into the CMM, and has a ssh connection into primary CMM with USERID from xCAT MN. The admin use CMM CLI command "env -T" to get to the primary CMM then executes command "reset -f" for the CMM failover. This will cause the primary CMM to reset, and will change the setting of primary to the "Standby CMM" which now becomes the new primary CMM when the fail over completes.

     # ssh USERID@cmm01
      Hostname:              cmm01
      Static IP address:     10.0.100.1
      Burned-in MAC address: 5F:FF:FF:FF:FF:FF
      DHCP:                  Disabled - Use static IP configuration.
     system&gt; env -T system:mm[1]
     OK
     system:mm[1]&gt; reset -f

Fail over hardware reset of CMM

The scenario is when there is a physical activity where the primary CMM is pulled from the chassis. There are different reasons why the admin may want pull out the CMM. This could be when the CMM is no longer working properly or there is an issue with the ethernet interface of the primary CMM. At this time when the primary CMM is pulled, it will do an automatic failover to the standby CMM, and the standby CMM is now the primary. The admin can work IBM or network support to understand the CMM or network failure. When the failed CMM is ready, the admin can just plug it in the Flex chassis, and it will now become the new Standby CMM. The admin can schedule a CMM software fail over if they want to swap back to the original CMM primary.

Appendix 2: CMM and Felxible Service Processor(FSP) password

In the IBM Flex chassis the architecture is designed to simplify some aspects of the systems management of the chassis. As part of this goal the IBM Flex system has integrated the CMM USERID and password into the IBM Flex system p compute nodes FSP. This is done through an internal LDAP server on the CMM serving the userids and passwords to LDAP on the FSPs. What this means to the system xCAT administrator is that the CMM USERID is tightly coupled with xCAT DFM authentication of the FSP. xCAT hardware control failures to authenticate on the FSP is likely the result of an issue with the chassis CMM USERID password. This section will provide commands which will help you determine that you have an authentication problem, verify that its an issue with the CMM USERID password, as well as how to resolve the problem.

Errors caused by an FSP authentication problem

The system administrator may first notice a problem with some of the hardware control commands giving an authentication error.

    > rpower cmm01node01 stat
    cmm01node01: Error: state=CEC AUTHENTICATION FAILED, type=02, MTMS=7895-42X*10F752A, sp=primary, slot=A, ipadd=12.0.0.32, alt_ipadd=unavailable

Checking the connection to the FSP shows that the authenication for this FSP is failing:

    > lshwconn cmm01node01
    cmm01node01: sp=primary,ipadd=12.0.0.32,alt_ipadd=unavailable,state=CEC AUTHENTICATION FAILED

This could be caused by the USERID password being expired on the CMM. You can check with the following:

    > ssh USERID@cmm01 users -T mm[1]
    system&gt; users -T mm[1]
    Users
    =====
    USERID
      Group(s): supervisor
      Max 0 session(s) allowed
      1 active session(s)
      Account is active
      **Password is expired**
      Password is compliant
      Number of SSH public keys installed for this user: 3
    User Permission Groups
    ======================

In order to correct this problem you need to activate the CMM USERID and then remove and add the connections to the FSP.

    > ssh USERID@cmm01 accseccfg -pe 0 -T mm[1]

Checking the USERID password is active:

    > ssh USERID@cmm01 users -T mm[2]
    system&gt; users -T mm[2]
    Users
    =====
    USERID
      Group(s): supervisor
      Max 0 session(s) allowed
      1 active session(s)
      **Account is active**
      Password does not expire
      Password is compliant
      Number of SSH public keys installed for this user: 3
    User Permission Groups
    ======================

Second you need to remove and add back each FSP connection for this chassis to create new connections:

    > rmhwconn cmm01node01
    > mkhwconn cmm01node01 -t

The last step is to check the connection:

    > lshwconn cmm01node01
    cmm01node01: sp=primary,ipadd=12.0.0.32,alt_ipadd=unavailable,state=LINE UP

Appendix 3: Updating Firmware on Flex Ethernet and IB Switch Modules

This section provides manual procedures to help update the firmware for Ethernet and Infiniband (IB) Switch modules. There is more detail information can be referenced in the IBM Flex System documentation under Network switches: http://publib.boulder.ibm.com/infocenter/flexsys/information/

The IB6131 Switch module is a Mellanox IB switch, and you down load firmware (image-PPC_M460EX-SX_3.2.xxx.img) from the Mellanox website into your xCAT Management Node or server that can communicate to Flex IB6131 switch module. We provided the firmware update procedure for the Mellanox IB switches including IB6131 Switch module in our xCAT document Managing the Mellanox Infiniband Network:
Managing_the_Mellanox_Infiniband_Network/#mellanox-switch-and-adapter-firmware-update

The IBM Flex system supports Ethernet switch modules models (EN2092 (1GB), EN4093 (10GB), and the firmware is available from the IBM Support Portal http://www-947.ibm.com/support/entry/portal/overview?brandind=hardware~puresystems~pureflex_system. The firmware update procedure used with the Flex Ethernet (EN2092) switch module which will reference two firmware images for OS (GbScSE-1G-10G-7.5.1.xx_OS.img) and Boot (GbScSE-1G-10G-7.5.1.x_Boot.img). These images should be placed on the xCAT MN or FTP server in the /tftpboot directory. Make sure that this server has proper ethernet communication to the Ethernet switch module.

Firmware Update using CLI

1) Login to the Ethernet switch using the "admin" userid and specify the admin password.

       ssh admin@<switchipaddr>

2) Get into boot directory, and list current image settings with cur command. This includes 2 OS images called image1 and image2,and will specify which image is the current boot image.

       >> boot
       >> cur

3) Get the new Ethernet OS image file from the ftp server to replace the older image on the ethernet switch using gtimg command. The gtimg command will prompt you for full path OS image file name, ftp/root userid, and password. It will ask to specify "data" port, and a confirmation to complete the download, and flashes the update. An example of EN2092 OS image would be "GbScSE-1G-10G-7.5.1.0_OS.img", and replaces "image2" on the ethernet switch.

       >> gtimg image2 <FTP server> GbScSE-1G-10G-7.5.1.0_OS.img
          Enter name of file on FTP/TFTP server: /tftpboot/GbScSE-1G-10G-7.5.1.0_OS.img
          Enter username for FTP server or hit return for TFTP server: root
          Enter password for username on FTP server:  <root password>
          Enter the port to use for downloading the image ["data"|"mgt"]: "data"
          Confirm download operation [y/n]: y

4) Get the new Ethernet boot image file from the ftp server to replace cuurent boot image on the ethernet switch using gtimg command. The gtimg command will prompt you for full path OS image file name, ftp/root userid, and password. It will ask to specify "data" port, and a confirmation to complete the download, and flashes the update. An example of EN2092 OS image would be "GbScSE-1G-10G-7.5.1.0_Boot.img", and will point to new boot image2.

       >> gtimg image2 <FTP server> GbScSE-1G-10G-7.5.1.0_Boot.img
          Enter name of file on FTP/TFTP server: /tftpboot/GbScSE-1G-10G-7.5.1.0_Boot.img
          Enter username for FTP server or hit return for TFTP server: root
          Enter password for username on FTP server:  <root password>
          Enter the port to use for downloading the image ["data"|"mgt"]: "data"
          Confirm download operation [y/n]:  y

5) Validate the current image settings with cur command, where image2 now has the latest firmware level, and that the current boot image is working with latest image2 file. You can then execute the reset command to boot the ethernet switch using the latest firmware level.

       >> cur_
       >> reset

Appendix 4 Perform Deferred Firmware upgrades for Flex blade CEC

Deferred firmware update Background

It may take some time to execute a disruptive firmware update in a large cluster. To reduce the down time of the cluster, customers may want to flash new firmware levels while the Flex blades are up and running, The deferred firmware update will load the new firmware into the T (temp) side, but will not activate it like the disruptive firmware. The customer can continue to run with the P (perm) side and can wait for a maintenance window where they can activate and boot the blades/cec with new firmware levels.

temp/perm side, pending_power_on_side attributes in Deferred firmware update

The deferred firmware update includes 2 parts: The first part (1) is to apply the firmware to the T (temp) sides of Flex blade FSPs when the cluster is up and running. The second part (2) is to activate the new firmware on the blades at a scheduled time.

The default setting is that the CEC/FSPs are working from the temp side (current_power_on_side). During part(1) of the deferred firmware update implementation, the CEC will continue to run on the perm side while the rflash of the new firmware levels will installed to the temp side. It is very important that the perm side contains the current stable version of firmware. The perm side is usually only used as a recovery environment when working with firmware updates.

When executing a reboot to the blade (FSPs), it will run on the side which the pending_power_on_side attribute is set. After we finish the part (1), the admin will want to make sure the pending_power_on_side attribute is set to "perm" if the blades want to be rebooted working with the older stable firmware. When you are ready to activate the new firmware and reboot the blades, you will want to make sure the pending_power_on_side attribute is set to "temp".

The procedure of the deferred firmware update

Before starting the deferred firmware update, the admin should first make sure that the most recent stable firmware level has been applied to the P (perm) side. We should note that T-side firmware will be moved over to the P-side automatically when we execute the rflash of the new firmware into the T (temp) side.

1.1 Apply the firmware for Flex blades

      rinv <blade> firm

1.2 Apply the new GFW code into the blade's FSPs

      rflash <blade> -p <rpm_directory> --activate deferred

1.3 Check to make sure the proper Firmware levels have been loaded into the temp side (new) and the perm side (previous) for the Frames or CECs. The rflash working with "deferred" should now specify Current Power on side to now be "perm":

      rinv <blade> firm

2. Setup Cecs/blades pending power to Perm (needed for CEC/blade reboot -- power off/on)

In part 1, the new firmware is now loaded on the temp side. If you need to keep the Flex blade active for a period of time (such as several days) we need to make sure we are working with previous firmware level, which is running on the P-side. You should change the pending_power_on_side attribute from temp to perm.

      rspconfig <blade> pending_power_on_side
      If not, set CEC's the pending power on side to P-side:
      rspconfig <blade> pending_power_on_side=perm

3.Activate the new firmware at schedule time

The new firmware level has been loaded on the temp side, and it is time to activate the blade/CECs with new firmware level. The admin should make sure the pending_power_on_side is now set back from perm to temp.

3.1 Check if the pending power on side for CEC are on T-side

      rspconfig <blade> pending_power_on_side
      If not, set the pending power on side to T-side
      rspconfig <blade> pending_power_on_side=temp

3.2 Power off the target Flex blades

      rpower <blade> off

3.3 Reboot the service processor for the CECs/blade

       rpower <blade> resetsp

Wait for 5-10 minutes for FSPs to restart. When the connections become LINE_UP again, the FSPs have finished the reboot.

       lshwconn <blade>

3.4 Verify that the cec/blade updates are the new firmware level and that they are using the temp side for the current_power_on_side .

       rinv <blade> firm

3.5 Power on the Flex blades and bring up the Flex blade cluster. The power on of the flex blades will be based on the install environment.

If this is a diskful environment, the admin should be able to "rpower <blade> on " to bring the blade up on the local disk.

      rpower  <blade> on

If this is a diskless environment, the admin should power up the blade to onstandby, set the boot sequence to network, and reset the blade.

      rpower <blade> onstandby  
      rbootseq <blade> net 
      rpower <blade> reset

Appendix 5 Connect Flex P blades to HMC for Teal SFP (Not yet supported)

This section discusses the capability to allow System P flex blades to be connected to HMCs for Service Focal Point (SFP). You make sure that the target HMCs have been defined as HMC nodes in your xCAT database. You allow the xCAT MN to make the hardware connection between the blades and the HMC. The xCAT MN will continue to use the xCAT DFM for remote hardware commands to ccmmunicate directly to the blade FSPs.

The admin needs to create an HMC node object using the mkdef or chdef commands for each HMC. The admin can also set the username and password directly to the HMC node object which will be added to the ppchcp table. T You need to make sure that there is proper SSH connection from the xCAT MN to the HMC.

    mkdef -t node -o hmc1 groups=hmc,all nodetype=ppc hwtype=hmc mgt=hmc username=hscroot password=abc1234
    rspconfig  <HMC node>  sshcfg=enable

You need to execute the mkhwconn -s <HMC> to Flex P blades to reference the target HMC working with the "sfp" attribute. The command will create new hardware connection on the HMC to the Flex P blades. After the connection is made to the HMC, the CE should be able to use the HMC for SFP events.

    mkhwconn blade -s <HMC node>

The CE should now be able to locate the SFP events from the HMC. There is a possibility that Teal may require this setup.

If you want to remove the hardware connection from the HMC to the Flex P blade, use the rmhwconn -s <HMC> command.

    rmhwconn blade -s <HMC node>

Diagnostics

lshwconn LINE DOWN after power outage

Testing has shown that when a chassis looses power and is started back up it is possible that the connections to the blade FSPs will be LINE DOWN. If this occurs you should reset the CMM for the chassis with this problem.

    >> ssh USERID@cmm01 service -T mm[1] -vr

Wiki: Setting_Up_MySQL_as_the_xCAT_DB
Wiki: Using_AIX_service_nodes
Wiki: XCAT_AIX_Cluster_Overview_and_Mgmt_Node
Wiki: XCAT_AIX_POWER_Blade_Nodes
Wiki: XCAT_AIX_RTE_Diskfull_Nodes
Wiki: XCAT_Documentation
Wiki: XCAT_Overview,_Architecture,_and_Planning