XCAT_iDataPlex_Cluster_Quick_Start


This document describes the steps necessary to quickly set up a cluster with IBM System x rack-mounted servers. Although the examples given in this document are specific to iDataPlex hardware (because that is the most common server type used for clusters), the basic instructions apply to any x86_64, IPMI-controlled, rack-mounted servers.

xCAT Installation on an iDataplex Configuration

This document is meant to get you going as quickly as possible and therefore only goes through the most common scenario. For additional scenarios and setup tasks, see [XCAT_iDataPlex_Advanced_Setup].

Example Configuration Used in This Document

This configuration has a single dx360 management node and 167 other dx360 servers as compute nodes. The OS deployed will be Red Hat Enterprise Linux 6.2, x86_64 edition.

In our example, the management node is known as 'mgt', the node names are n1-n167, and the domain will be 'cluster'. We will use the BMCs in shared mode, so each BMC will share the NIC over which the node's operating system communicates with the xCAT management node. This is called the management LAN. We will use subnet 172.16.0.0 with a netmask of 255.240.0.0 (/12) for it. (This provides an IP address range of 172.16.0.1 - 172.31.255.254.) We will use the following subsets of this range:

  • The management node: 172.20.0.1
  • The node OSes: 172.20.100+racknum.nodenuminrack
  • The node BMCs: 172.29.100+racknum.nodenuminrack
  • The management port of the switches: 172.30.50.switchnum
  • The DHCP dynamic range for unknown nodes: 172.20.255.1 - 172.20.255.254

The network is physically laid out such that the port number on a switch is equal to the U position number within the column.

Overview of Cluster Setup Process

Here is a summary of the steps required to set up the cluster and what this document will take you through:

  1. Prepare the management node - doing these things before installing the xCAT software helps the process go more smoothly.
  2. Install the xCAT software on the management node.
  3. Configure some cluster-wide information.
  4. Define a little bit of information in the xCAT database about the ethernet switches and nodes - this is necessary to direct the node discovery process.
  5. Have xCAT configure and start several network daemons - this is necessary for both node discovery and node installation.
  6. Discover the nodes - during this phase, xCAT configures the BMCs, collects many attributes about each node, and stores them in the database.
  7. Set up the OS images and install the nodes.

Distro-specific Steps

  • [RH] indicates that the step only needs to be done for RHEL and Red Hat based distros (CentOS, Scientific Linux, and in most cases Fedora).
  • [SLES] indicates that the step only needs to be done for SLES.

Command Man Pages and Database Attribute Descriptions

Prepare the Management Node for xCAT Installation

{{:Prepare the Management Node for xCAT Installation}}

Install xCAT on the Management Node

{{:Install xCAT on the Management Node}}

Configure xCAT

Load the e1350 Templates

Several xCAT database tables must be filled in while setting up an iDataPlex cluster. To make this process easier, xCAT provides several template files in /opt/xcat/share/xcat/templates/e1350/. These files contain regular expressions that describe the naming patterns in the cluster. With xCAT's regular expression support, one line in a table can define one or more attribute values for all the nodes in a node group. (For more information on xCAT's database regular expressions, see http://xcat.sourceforge.net/man5/xcatdb.5.html .) To load the default templates into your database:

cd /opt/xcat/share/xcat/templates/e1350/
for i in *csv; do tabrestore $i; done
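
After the restore, you can inspect what was loaded; tabdump prints a table in the same CSV form that the template files use. The commands below are just one way to look at the result, and the row shown in the comment is only illustrative of the general |pattern|replacement| regex form, not the exact e1350 contents:

tabdump hosts        # node name -> IP mapping rows
tabdump switch       # node -> switch/switchport mapping rows
# A regex-based row generally looks like: "groupname","|name-pattern|value-expression|",...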

These templates contain entries for a lot of different node groups, but we will be using the following node groups:

  • ipmi - the nodes controlled via IPMI.
  • idataplex - the iDataPlex nodes
  • 42perswitch - the nodes that are connected to 42-port switches
  • compute - all of the compute nodes
  • 84bmcperrack - the BMCs that are in a fully populated rack of iDataPlex
  • switch - the ethernet switches in the cluster

In our example, ipmi, idataplex, 42perswitch, and compute will all have the exact same membership because all of our iDataPlex nodes have those characteristics.

The templates automatically define the following attributes and naming conventions:

  • The iDataPlex compute nodes:
    • node names are of the form <string><number>, for example n1
    • ip: 172.20.100+racknum.nodenuminrack
    • bmc: the bmc with the same number as the node
    • switch: divide the node number by 42 to get the switch number
    • switchport: the nodes are plugged into 42-port ethernet switches in order of node number
    • mgt: 'ipmi'
    • netboot: 'xnba'
    • profile: 'compute'
    • rack: node number divided by 84
    • unit: in the range of A1 - A42 for the 1st 42 nodes in each rack, and in the range of C1 - C42 for the 2nd 42 nodes in each rack
    • chain: 'runcmd=bmcsetup,shell'
    • ondiscover: 'nodediscover'
  • The BMCs:
    • node names are of the form bmc<nodenum>, for example bmc1
    • ip: 172.29.100+racknum.nodenuminrack
  • The management connection to each ethernet switch:
    • node names are of the form switch<number>, for example switch1
    • ip: 172.30.50.switchnum

For a description of the attribute names listed above, see the node object definition.
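
As a worked example of these conventions (a sketch; it assumes the divisions above round up): node n100 is the 16th node in rack 2 (100 - 84 = 16), so its OS IP is 172.20.102.16, its BMC is bmc100 at 172.29.102.16, and it is plugged into port 16 of switch3 (100/42, rounded up). You can check what the templates actually produced with:

lsdef n100 -i ip,bmc,switch,switchport,rack,unit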

If these conventions don't work for your situation, you can either:

  1. modify the regular expressions - see [XCAT_iDataPlex_Advanced_Setup#Template_modification_example]
  2. or manually define each node - see [XCAT_iDataPlex_Advanced_Setup#Manually_setup_the_node_attributes_instead_of_using_the_templates_or_switch_discovery]

Add Nodes to the nodelist Table

Now you can use the power of the templates to define the nodes quickly. By simply adding the nodes to the correct groups, they will pick up all of the attributes of that group:

nodeadd n1-n167 groups=ipmi,idataplex,42perswitch,compute,all
nodeadd bmc1-bmc167 groups=84bmcperrack 
nodeadd switch1-switch4 groups=switch

To see the list of nodes you just defined:

nodels

To see all of the attributes that the combination of the templates and your nodelist have defined for a few sample nodes:

lsdef n100,bmc100,switch2

This is the easiest way to verify that the regular expressions in the templates are giving you attribute values you are happy with. (Or, if you modified the regular expressions, that you did it correctly.)

Networks Table

All networks in the cluster must be defined in the networks table. When xCAT was installed, it ran makenetworks, which created an entry in this table for each of the networks the management node is connected to. Now is the time to add or update entries for any other networks in the cluster.

For a sample Networks Setup, see the following example: [Setting_Up_a_Linux_xCAT_Mgmt_Node#Appendix_A:_Network_Table_Setup_Example]
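
To review what makenetworks created and anything you have added (the object names you see will depend on your subnets):

lsdef -t network          # list the defined network objects
tabdump networks          # the same data in raw table form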

Declare a dynamic range of addresses for discovery

If you want to use hardware discovery, a dynamic range must be defined in the networks table. It is used by the nodes to get an IP address before xCAT knows their MAC addresses.

In this case, we'll designate 172.20.255.1-172.20.255.254 as a dynamic range:

chdef -t network 172_16_0_0-255_240_0_0 dynamicrange=172.20.255.1-172.20.255.254
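
To confirm that the range was stored on the network object used in this example:

lsdef -t network 172_16_0_0-255_240_0_0 -i dynamicrange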

Declare use of SOL

If you are not using a terminal server, configuring SOL (Serial Over LAN) is recommended but not required. To instruct xCAT to configure SOL in the installed operating systems on dx340 systems:

chdef -t group -o compute serialport=1 serialspeed=19200 serialflow=hard

For dx360-m2 and newer use:

chdef -t group -o compute serialport=0 serialspeed=115200 serialflow=hard

passwd Table

The template created a default passwd table. This includes the system entry, which is the password that will be assigned to root when the node is installed. You can modify this table using tabedit. To change the default password for root on the nodes, change the system line. To change the password to be used for the BMCs, change the ipmi line.

tabedit passwd
#key,username,password,cryptmethod,comments,disable
"system","root","cluster",,,
"ipmi","USERID","PASSW0RD",,,

switch Table

The table templates already put group-oriented regular expression entries in the switch table. Use lsdef for a sample node to verify that the switch and switchport attributes are correct. If not, use chdef or tabedit to change the values.

If you configured your switches to use SNMP V3, then you need to define several attributes in the switches table. Assuming all of your switches use the same values, you can set these attributes at the group level:

tabch switch=switch switches.snmpversion=3 switches.username=xcat switches.password=passw0rd switches.auth=sha

noderes Table

The template created a basic noderes table which defines node resources during install. In the template, servicenode and xcatmaster are not defined, so they will default to the Management Node.

At this point, xCAT should be ready to begin managing services.

Begin using xCAT to configure the system and discover nodes

Setup /etc/hosts file

Since the mapping between the xCAT node names and IP addresses has been added to the hosts table by the e1350 templates, you can run the makehosts xCAT command to create the /etc/hosts file from the xCAT hosts table. (You can skip this step if creating /etc/hosts manually.)

makehosts switch,idataplex,ipmi

Verify that the entries have been created in /etc/hosts. For example, your /etc/hosts should look like this:

127.0.0.1               localhost.localdomain localhost
::1                     localhost6.localdomain6 localhost6
###
172.20.0.1 mgt mgt.cluster
172.20.101.1 n1 n1.cluster
172.20.101.2 n2 n2.cluster
172.20.101.3 n3 n3.cluster
172.20.101.4 n4 n4.cluster
172.20.101.5 n5 n5.cluster
172.20.101.6 n6 n6.cluster
172.20.101.7 n7 n7.cluster
              .
              .
              .

Setup DNS

To get the hostname/IP pairs copied from /etc/hosts to the DNS on the MN:

  • Ensure that /etc/sysconfig/named does not have ROOTDIR set
  • Set site.forwarders to your site-wide DNS servers that can resolve site or public hostnames. The DNS on the MN will forward any requests it can't answer to these servers.

    chdef -t site forwarders=1.2.3.4,1.2.5.6

  • Edit /etc/resolv.conf to point the MN to its own DNS. (Note: this won't be required in xCAT 2.8 and above.)

    search cluster
    nameserver 172.20.0.1

  • Run makedns

    makedns && service named start

For more information about name resolution in an xCAT Cluster, see [Cluster_Name_Resolution].
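
To spot-check that the MN's DNS is now answering for cluster names (the hostname and server address below are the ones used in this document's example):

nslookup n1 172.20.0.1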

Setup DHCP

You usually don't want your DHCP server listening on your public (site) network, so set site.dhcpinterfaces to your MN's cluster facing NICs. For example:

chdef -t site dhcpinterfaces=eth1

Then run the following to set up the network stanza of the DHCP configuration (including the dynamic range):

makedhcp -n

The IP/MAC mappings for the nodes will be added to DHCP automatically as the nodes are discovered.
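
If you want to confirm that the dynamic range made it into the generated configuration, look for the range statement in the dhcpd config file (the path below is the usual location on RHEL 6; other distros or releases may place it elsewhere):

grep -n "range" /etc/dhcp/dhcpd.conf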

Setup TFTP

Nothing to do here - the TFTP server is set up by xCAT during the management node install.

Setup conserver

makeconservercf
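
makeconservercf (re)generates the conserver configuration; the conserver daemon has to be restarted to pick up the change. On a SysV-init management node, such as the RHEL 6 MN in this example, that is typically:

service conserver restart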

Discover nodes

If you want to update node firmware when you discover the nodes, follow the steps in [XCAT_iDataPlex_Advanced_Setup#Updating_Node_Firmware] before continuing.

If you want to automatically deploy the nodes after they are discovered, follow the steps in [XCAT_iDataPlex_Advanced_Setup#Automatically_Deploying_Nodes_After_Discovery] before continuing. (But if you are new to xCAT we don't recommend this.)

Now walk over to the systems, press the power buttons, and on the MN watch the nodes discover themselves by running:

tail -f /var/log/messages

Look for the dhcp requests, the xCAT discovery requests, and the "<node> has been discovered" messages.

A quick summary of what is happening during the discovery process is:

  • the nodes request a DHCP IP address and PXE boot instructions
  • the DHCP server on the MN responds with a dynamic IP address and the xCAT genesis boot kernel
  • the genesis boot kernel running on the node sends the MAC and MTMS to xcatd on the MN
  • xcatd asks the switches which port this MAC is on so that it can correlate this physical node with the proper node entry in the database. Then it:
    • stores the node's MTMS in the db
    • puts the MAC/IP pair in the DHCP configuration
    • sends several of the node attributes to the genesis kernel on the node
  • the genesis kernel configures the BMC with the proper IP address, userid, and password, and then just drops into a shell

After a successful discovery process, the following attributes will be added to the database for each node. (You can verify this by running lsdef <node> ):

  • bmcpassword - the password xCAT uses when running hardware control operations against the BMC
  • mac - the MAC address of the in-band NIC used to manage this node
  • mtm - the hardware machine type and model
  • serial - the hardware serial number
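
For example, to display just a few of these discovered values for node n1:

lsdef n1 -i mac,mtm,serial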

If you cannot discover the nodes successfully, see the next section [#Manually_Discover_Nodes].

If at some later time you want to force a re-discover of a node, run:

makedhcp -d <noderange>

and then reboot the node(s).

Manually Discover Nodes

If you just have a few nodes and can't configure the switch for SNMP, you can manually set up the xCAT tables instead, and then run the BMC setup process to configure the BMC on the nodes:

  • Add the mac address for each node to the xCAT database:

This MAC address can be obtained from the back panel of the machine and should belong to the NIC that is connected to the management network.

chdef n1 mac="xx:xx:xx:xx:xx:xx"
chdef n2 mac="yy:yy:yy:yy:yy:yy"
  .
  .
  .
  • Add the nodes to dhcp

    makedhcp idataplex

  • Set the current node operation to bmcsetup, and the next one to wait in a shell

    nodeset idataplex runcmd=bmcsetup
    chdef idataplex currchain=shell

  • Then walk over and manually power on the nodes.

When the nodes boot, xCAT (really DHCP) should instruct the nodes to download the xCAT genesis boot kernel and run the bmcsetup script to configure the BMC (IMM) properly.
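
As a convenience for the "Add the mac address" step above, if you have collected the MACs in a file (a hypothetical macs.txt with one "nodename xx:xx:xx:xx:xx:xx" pair per line), a small shell loop avoids running chdef by hand for every node:

# macs.txt is a hypothetical file: one "nodename mac" pair per line
while read node mac; do
    chdef "$node" mac="$mac"
done < macs.txt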

Monitoring Node Discovery

When the bmcsetup process completes on each node (about 5-10 minutes), xCAT genesis will drop into a shell and wait indefinitely (and change the node's currstate attribute to "shell"). You can monitor the progress of the nodes using:

watch -d 'nodels ipmi chain.currstate|xcoll'

Before all nodes complete, you will see output like:

====================================
n1,n10,n11,n75,n76,n77,n78,n79,n8,n80,n81,n82,n83,n84,n85,n86,n87,n88,n89,n9,n90,n91
====================================
shell

====================================
n31,n32,n33,n34,n35,n36,n37,n38,n39,n4,n40,n41,n42,n43,n44,n45,n46,n47,n48,n49,n5,n50,n51,n52,
 n53,n54,n55,n56,n57,n58,n59,n6,n60,n61,n62,n63,n64,n65,n66,n67,n68,n69,n7,n70,n71,n72,n73,n74
====================================
runcmd=bmcsetup

When all nodes have made it to the shell, xcoll will just show that the whole nodegroup "ipmi" has the output "shell":

====================================
ipmi
====================================
shell

When the nodes are in the xCAT genesis shell, you can ssh or psh to any of the nodes to check anything you want.

Verify HW Management Configuration

At this point, the BMCs should all be configured and ready for hardware management. To verify this:

# rpower ipmi stat | xcoll
====================================
ipmi
====================================
on

For iDataPlex nodes you also need to enable the uEFI console redirection with the ASU command:

set uEFI.RemoteConsoleRedirection Enable

See [XCAT_iDataPlex_Advanced_Setup#Updating_ASU_Settings_on_the_Nodes] to set this ASU setting.

Now run:

rcons <node>

to verify that you can see the genesis shell prompt (after pressing Enter). To exit rcons, type Ctrl-Shift-E (all together), then "c", then ".".

You are now ready to choose an operating system and deployment method for the nodes....

Deploying Nodes

  • If you want to install your nodes as stateful (diskful) nodes, follow the next section [#Installing_Stateful_Nodes].
  • If you want to define one or more stateless (diskless) OS images and boot the nodes with those, see section [#Deploying_Stateless_Nodes]. This method has the advantage of managing the images in a central place, and having only one image per node type.
  • If you want to have nfs-root statelite nodes, see [XCAT_Linux_Statelite]. This has the same advantage of managing the images from a central place. It has the added benefit of using less memory on the node while allowing larger images. But it has the drawback of making the nodes dependent on the management node or service nodes (i.e. if the management/service node goes down, the compute nodes booted from it go down too).
  • If you have a very large cluster (more than 500 nodes), at this point you should follow [Setting_Up_a_Linux_Hierarchical_Cluster] to install and configure your service nodes. After that you can return here to install or diskless boot your compute nodes.

Installing Stateful Nodes

This section describes the process for setting up xCAT to install nodes; that is, how to install an OS on the disk of each node.

Create Redhat repository

The copycds command copies the contents of the Linux distro media to /install/<os>/<arch> so that it will be available for installing nodes or creating diskless images.

  • Obtain the Redhat ISOs or DVDs.
  • If using an ISO, copy it to (or NFS mount it on) the management node, and then run:

    copycds <path>/RHEL6.2-Server-20080430.0-x86_64-DVD.iso

  • If using a DVD, put it in the DVD drive of the management node and run:

    copycds /dev/dvd # or whatever the device name of your dvd drive is

Tip: if this is the same distro version as your management node, create a .repo file in /etc/yum.repos.d with content similar to:

[local-rhels6.2-x86_64]
name=xCAT local rhels 6.2
baseurl=file:/install/rhels6.2/x86_64
enabled=1
gpgcheck=0

This way, if you need some additional RPMs on your MN at a later time, you can simply install them using yum. Or if you are installing other software on your MN that requires additional RPMs from the distro, they will automatically be found and installed.
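
For example, to pull a package from just the local repo defined above (createrepo is only an arbitrary example package name):

yum --disablerepo="*" --enablerepo=local-rhels6.2-x86_64 install createrepo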

Select or Create an osimage Definition

The copycds command also automatically creates several osimage definitions in the database that can be used for node deployment. To see them:

lsdef -t osimage          # see the list of osimages
lsdef -t osimage <osimage-name>          # see the attributes of a particular osimage

From the list above, select the osimage for your distro, architecture, provisioning method (in this case install), and profile (compute, service, etc.). Although it is optional, we recommend you make a copy of the osimage, changing its name to a simpler name. For example:

lsdef -t osimage -z rhels6.2-x86_64-install-compute | sed 's/^[^ ]\+:/mycomputeimage:/' | mkdef -z

This displays the osimage "rhels6.2-x86_64-install-compute" in a format that can be used as input to mkdef, but on the way there it uses sed to modify the name of the object to "mycomputeimage".

Initially, this osimage object points to templates, pkglists, etc. that are shipped by default with xCAT. Some attributes, for example otherpkglist and synclists, won't have any value at all because xCAT doesn't ship a default file for them. You can now change or fill in any osimage attributes that you want. A general convention is that if you are modifying one of the default files that an osimage attribute points to, copy it into /install/custom and have your osimage point to it there. (If you modify the copy under /opt/xcat directly, it will be overwritten the next time you upgrade xCAT.)
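
As a sketch of that convention for the pkglist attribute (the default path shown is where xCAT normally ships its pkglists; check the osimage first to see the exact file name on your system):

# See which files the osimage currently points to
lsdef -t osimage mycomputeimage -i template,pkglist
# Copy the default pkglist into /install/custom and point the osimage at the copy
mkdir -p /install/custom/install/rh
cp /opt/xcat/share/xcat/install/rh/compute.rhels6.x86_64.pkglist /install/custom/install/rh/
chdef -t osimage mycomputeimage pkglist=/install/custom/install/rh/compute.rhels6.x86_64.pkglist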

But for now, we will use the default values in the osimage definition and continue on. (If you really want to see examples of modifying/creating the pkglist, template, otherpkgs pkglist, and sync file list, see the section [#Deploying_Stateless_Nodes]. Most of the examples there can be used for stateful nodes too.)

Begin Installation

If you already have a different OS on your nodes and you haven't configured your nodes to always boot from the network, then run rsetboot to instruct them to boot from the network for the next boot:

rsetboot compute net

The nodeset command tells xCAT what you want to do next with this node, and powering on the node starts the installation process:

nodeset compute osimage=mycomputeimage
rpower compute boot

Tip: when nodeset is run, it processes the kickstart or autoyast template associated with the osimage, plugging in node-specific attributes, and creates a specific kickstart/autoyast file for each node in /install/autoinst. If you need to customize the template, make a copy of the file pointed to by the osimage's template attribute and edit that copy (or the files it includes).

Monitor installation

It is possible to use the wcons command to watch the installation process for a sampling of the nodes:

wcons n1,n20,n80,n100

or rcons to watch one node

rcons n1

Additionally, nodestat may be used to check the status of a node as it installs:

nodestat n20,n21
n20: installing man-pages - 2.39-10.el5 (0%)
n21: installing prep

Note: the percentage complete reported by nodestat is not necessarily reliable.

You can also watch nodelist.status until it changes to "booted" for each node:

nodels compute nodelist.status | xcoll

Once all of the nodes are installed and booted, you should be able to ssh to all of them from the MN (without a password), because xCAT should have automatically set up the ssh keys (if the postscripts ran successfully):

xdsh compute date

If there are problems, see [Debugging_xCAT_Problems].

Deploying Stateless Nodes

{{:Using_Provmethod=osimagename}}

Useful Applications of xCAT commands

This section gives some examples of using key commands and command combinations in useful ways. For any xCAT command, typing 'man <command>' will give details about using that command. For a list of xCAT commands grouped by category, see [XCAT_Commands]. For all the xCAT man pages, see http://xcat.sourceforge.net/man1/xcat.1.html .

Adding groups to a set of nodes

In this configuration, a handy convenience group would be the lower systems in the chassis, the ones able to read temperature and fanspeed. In this case, the odd systems would be on the bottom, so to do this with a regular expression:

# nodech '/n.*[13579]$' groups,=bottom

or explicitly

chdef -p n1-n9,n11-n19,n21-n29,n31-n39,n41-n49,n51-n59,n61-n69,n71-n79,n81-n89,
n91-n99,n101-n109,n111-n119,n121-n129,n131-n139,n141-n149,n151-n159,n161-n167 groups="bottom"

Listing attributes

We can list discovered and expanded versions of attributes (actual VPD values will appear instead of the asterisks):

# nodels n97 nodepos.rack nodepos.u vpd.serial vpd.mtm 
n97: nodepos.u: A-13
n97: nodepos.rack: 2
n97: vpd.serial: ********
n97: vpd.mtm: *******

You can also list all the attributes:

# lsdef n97
Object name: n97
   arch=x86_64
        .
   groups=bottom,ipmi,idataplex,42perswitch,compute,all
        .
        .
        .
   rack=1    
   unit=A1

Verifying consistency and version of firmware

xCAT provides parallel commands and the sinv (inventory) command to analyze the consistency of the cluster. See [Parallel_Commands_and_Inventory].

Combining the use of in-band and out-of-band utilities with the xcoll utility, it is possible to quickly analyze the level and consistency of firmware across the servers:

mgt# rinv n1-n3 mprom|xcoll 
==================================== 
n1,n2,n3
==================================== 
BMC Firmware: 1.18

The BMC does not report the BIOS version, so to check that, use psh:

mgt# psh n1-n3 dmidecode|grep "BIOS Information" -A4|grep Version|xcoll 
==================================== 
n1,n2,n3
==================================== 
Version: I1E123A

To update the firmware on your nodes, see [XCAT_iDataPlex_Advanced_Setup#Updating_Node_Firmware].

Verifying or Setting ASU Settings

To do this, see [XCAT_iDataPlex_Advanced_Setup#Updating_ASU_Settings_on_the_Nodes].

Managing the IB Network

xCAT has several utilities to help manage and monitor the Mellanox IB network. See [Managing_the_Mellanox_Infiniband_Network].

Reading and interpreting sensor readings

If the configuration is louder than expected (an iDataPlex chassis should nominally have a fairly modest noise impact), find the nodes with elevated fan speeds:

# rvitals bottom fanspeed|sort -k 4|tail -n 3
n3: PSU FAN3: 2160 RPM
n3: PSU FAN4: 2240 RPM
n3: PSU FAN1: 2320 RPM

In this example, the fan speeds are pretty typical. If fan speeds are elevated, there may be a thermal issue. On a dx340 system, speeds near 10,000 RPM probably indicate either a defective sensor or a misprogrammed power supply.

To find the warmest detected temperatures in a configuration:

# rvitals bottom temp|grep Domain|sort -t: -k 3|tail -n 3
n3: Domain B Therm 1: 46 C (115 F)
n7: Domain A Therm 1: 47 C (117 F)
n3: Domain A Therm 1: 49 C (120 F)

Change tail to head in the above examples to seek the slowest fans/lowest temperatures. Currently, an iDataplex chassis without a planar tray in the top position will report '0 C' for Domain B temperatures.

For more options, see rvitals manpage: http://xcat.sourceforge.net/man1/rvitals.1.html

Where Do I Go From Here?

Now that your basic cluster is set up, here are suggestions for additional reading:

