This document describes the steps necessary to quickly set up a cluster with IBM System x rack-mounted servers. Although the examples given in this document are specific to iDataPlex hardware (because that is the most common server type used for clusters), the basic instructions apply to any x86_64, IPMI-controlled, rack-mounted servers.
This document is meant to get you going as quickly as possible and therefore only goes through the most common scenario. For additional scenarios and setup tasks, see [XCAT_iDataPlex_Advanced_Setup].
This configuration will have a single dx360 Management Node with 167 other dx360 servers as compute nodes. The OS deployed will be Red Hat Enterprise Linux 6.2, x86_64 edition. Here is a diagram of the racks:
In our example, the management node is named 'mgt', the node names are n1-n167, and the domain is 'cluster'. We will use the BMCs in shared mode, so each BMC shares the NIC that the node's operating system uses to communicate with the xCAT management node. This is called the management LAN. We will use subnet 172.16.0.0 with a netmask of 255.240.0.0 (/12) for it. (This provides an IP address range of 172.16.0.1 - 172.31.255.254.) We will use the following subsets of this range for:
The network is physically laid out such that the port number on a switch is equal to the U position number within a column, like this:
Here is a summary of the steps required to set up the cluster and what this document will take you through:
{{:Prepare the Management Node for xCAT Installation}}
{{:Install xCAT on the Management Node}}
Several xCAT database tables must be filled in while setting up an iDataPlex cluster. To make this process easier, xCAT provides several template files in /opt/xcat/share/xcat/templates/e1350/. These files contain regular expressions that describe the naming patterns in the cluster. With xCAT's regular expression support, one line in a table can define one or more attribute values for all the nodes in a node group. (For more information on xCAT's database regular expressions, see http://xcat.sourceforge.net/man5/xcatdb.5.html .) To load the default templates into your database:
cd /opt/xcat/share/xcat/templates/e1350/
for i in *csv; do tabrestore $i; done
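For example, a single regular-expression row in one of these templates can assign an attribute value to every node in a group. The row below shows the style of entry used in the hosts table; the values are illustrative of the naming scheme in this example, not the exact shipped file content:

#node,ip,hostnames,otherinterfaces,comments,disable
"compute","|n(\d+)|172.20.101.($1+0)|",,,,

This says: for any node in the compute group whose name matches n<number>, set its ip attribute to 172.20.101.<number>.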
These templates contain entries for a lot of different node groups, but we will be using the following node groups:
In our example, ipmi, idataplex, 42perswitch, and compute will all have the exact same membership because all of our iDataPlex nodes have those characteristics.
The templates automatically define the following attributes and naming conventions:
For a description of the attribute names in bold above, see the node object definition.
If these conventions don't work for your situation, you can either:
Now you can use the power of the templates to define the nodes quickly. By simply adding the nodes to the correct groups, they will pick up all of the attributes of that group:
nodeadd n1-n167 groups=ipmi,idataplex,42perswitch,compute,all
nodeadd bmc1-bmc167 groups=84bmcperrack
nodeadd switch1-switch4 groups=switch
To see the list of nodes you just defined:
nodels
To see all of the attributes that the combination of the templates and your nodelist have defined for a few sample nodes:
lsdef n100,bmc100,switch2
This is the easiest way to verify that the regular expressions in the templates are giving you attribute values you are happy with. (Or, if you modified the regular expressions, that you did it correctly.)
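If you only want to spot-check a few attributes, you can limit the lsdef output to the ones of interest (the attribute names here are just examples):

lsdef n100 -i ip,bmc,mgt,switch,switchport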
All networks in the cluster must be defined in the networks table. When xCAT was installed, it ran makenetworks, which created an entry in this table for each network the management node is connected to. Now is the time to add any other cluster networks to the networks table, or to update the existing entries as needed.
For a sample Networks Setup, see the following example: [Setting_Up_a_Linux_xCAT_Mgmt_Node#Appendix_A:_Network_Table_Setup_Example]
If you want to use hardware discovery, a dynamic range must be defined in the networks table. The nodes use it to obtain an IP address before xCAT knows their MAC addresses.
In this case, we'll designate 172.20.255.1-172.20.255.254 as a dynamic range:
chdef -t network 172_16_0_0-255_240_0_0 dynamicrange=172.20.255.1-172.20.255.254
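You can verify the result with lsdef:

lsdef -t network 172_16_0_0-255_240_0_0 -i net,mask,dynamicrange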
If you are not using a terminal server, configuring Serial Over LAN (SOL) is recommended, although not required. To instruct xCAT to configure SOL in installed operating systems on dx340 systems:
chdef -t group -o compute serialport=1 serialspeed=19200 serialflow=hard
For dx360 M2 and newer systems, use:
chdef -t group -o compute serialport=0 serialspeed=115200 serialflow=hard
The template created a default passwd table. This includes the system entry, which is the password that will be assigned to root when the node is installed. You can modify this table using tabedit. To change the default password for root on the nodes, change the system line. To change the password to be used for the BMCs, change the ipmi line.
tabedit passwd
#key,username,password,cryptmethod,comments,disable
"system","root","cluster",,,
"ipmi","USERID","PASSW0RD",,,
The table templates already put group-oriented regular expression entries in the switch table. Use lsdef for a sample node to see if the switch and switchport attributes are correct. If not, use chdef or tabedit to change the values.
If you configured your switches to use SNMP V3, then you need to define several attributes in the switches table. Assuming all of your switches use the same values, you can set these attributes at the group level:
tabch switch=switch switches.snmpversion=3 switches.username=xcat switches.password=passw0rd switches.auth=sha
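You can verify the resulting values with:

tabdump switches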
The template created a basic noderes table which defines node resources during install. In the template, servicenode and xcatmaster are not defined, so they will default to the Management Node.
At this point, the node definitions are complete and xCAT is ready to begin managing the cluster services set up in the following steps.
Since the mapping between xCAT node names and IP addresses has been added to the hosts table by the e1350 template, you can run the makehosts xCAT command to create the /etc/hosts file from the xCAT hosts table. (You can skip this step if you prefer to create /etc/hosts manually.)
makehosts switch,idataplex,ipmi
Verify that the entries have been created in /etc/hosts. For example, your /etc/hosts should look like this:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
###
172.20.0.1 mgt mgt.cluster
172.20.101.1 n1 n1.cluster
172.20.101.2 n2 n2.cluster
172.20.101.3 n3 n3.cluster
172.20.101.4 n4 n4.cluster
172.20.101.5 n5 n5.cluster
172.20.101.6 n6 n6.cluster
172.20.101.7 n7 n7.cluster
.
.
.
To get the hostname/IP pairs copied from /etc/hosts to the DNS on the MN:
Set site.forwarders to your site-wide DNS servers that can resolve site or public hostnames. The DNS on the MN will forward any requests it can't answer to these servers.
chdef -t site forwarders=1.2.3.4,1.2.5.6
Edit /etc/resolv.conf to point the MN to its own DNS. (Note: this won't be required in xCAT 2.8 and above.)
search cluster
nameserver 172.20.0.1
Run makedns and start the DNS server:
makedns && service named start
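As a quick sanity check (assuming the bind-utils package is installed on the MN), verify that the cluster DNS answers for one of the node names:

nslookup n1 172.20.0.1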
For more information about name resolution in an xCAT Cluster, see [Cluster_Name_Resolution].
You usually don't want your DHCP server listening on your public (site) network, so set site.dhcpinterfaces to your MN's cluster facing NICs. For example:
chdef -t site dhcpinterfaces=eth1
Then run the following to set up the network stanza part of the DHCP configuration (including the dynamic range):
makedhcp -n
The IP/MAC mappings for the nodes will be added to DHCP automatically as the nodes are discovered.
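To confirm that the network stanza and dynamic range made it into the generated DHCP configuration (on RHEL 6 this file is normally /etc/dhcp/dhcpd.conf), you can, for example, run:

grep -i range /etc/dhcp/dhcpd.conf
service dhcpd status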
Nothing to do here - the TFTP server is set up by xCAT during the Management Node install.
Run makeconservercf to generate the conserver configuration for the nodes:

makeconservercf
If you want to update node firmware when you discover the nodes, follow the steps in [XCAT_iDataPlex_Advanced_Setup#Updating_Node_Firmware] before continuing.
If you want to automatically deploy the nodes after they are discovered, follow the steps in [XCAT_iDataPlex_Advanced_Setup#Automatically_Deploying_Nodes_After_Discovery] before continuing. (But if you are new to xCAT we don't recommend this.)
Now walk over to the systems, press the power buttons, and watch on the MN as the nodes discover themselves by running:
tail -f /var/log/messages
Look for the dhcp requests, the xCAT discovery requests, and the "<node> has been discovered" messages.
A quick summary of what is happening during the discovery process is:
After a successful discovery process, the following attributes will be added to the database for each node. (You can verify this by running lsdef <node> ):
If you cannot discover the nodes successfully, see the next section [#Manually_Discover_Nodes].
If at some later time you want to force a re-discover of a node, run:
makedhcp -d <noderange>
and then reboot the node(s).
If you just have a few nodes and can't configure the switch for SNMP, you can manually set up the xCAT tables instead, and then run the BMC setup process to configure the BMC on the nodes:
The MAC address can be obtained from the back panel of the machine; it should be the MAC of the NIC that is connected to the management network. Define it for each node:
chdef n1 mac="xx:xx:xx:xx:xx:xx"
chdef n2 mac="yy:yy:yy:yy:yy:yy"
.
.
.
Add the nodes to DHCP:
makedhcp idataplex
Set the current node operation to bmcsetup, and the next operation to wait in a shell:
nodeset idataplex runcmd=bmcsetup
chdef idataplex currchain=shell
Then walk over and manually power on the nodes.
When the nodes boot, xCAT (really DHCP) should instruct the nodes to download the xCAT genesis boot kernel and run the bmcsetup script to configure the BMC (IMM) properly.
When the bmcsetup process completes on each node (about 5-10 minutes), xCAT genesis will drop into a shell and wait indefinitely (and change the node's currstate attribute to "shell"). You can monitor the progress of the nodes using:
watch -d 'nodels ipmi chain.currstate|xcoll'
Before all nodes complete, you will see output like:
###### ========================
n1,n10,n11,n75,n76,n77,n78,n79,n8,n80,n81,n82,n83,n84,n85,n86,n87,n88,n89,n9,n90,n91
###### ========================
shell
###### ========================
n31,n32,n33,n34,n35,n36,n37,n38,n39,n4,n40,n41,n42,n43,n44,n45,n46,n47,n48,n49,n5,n50,n51,n52,
n53,n54,n55,n56,n57,n58,n59,n6,n60,n61,n62,n63,n64,n65,n66,n67,n68,n69,n7,n70,n71,n72,n73,n74
###### ========================
runcmd=bmcsetup
When all nodes have made it to the shell, xcoll will just show that the whole nodegroup "ipmi" has the output "shell":
###### ========================
ipmi
###### ========================
shell
When the nodes are in the xCAT genesis shell, you can ssh or psh to any of the nodes to check anything you want.
At this point, the BMCs should all be configured and ready for hardware management. To verify this:
# rpower ipmi stat | xcoll
###### ========================
ipmi
###### ========================
on
For iDataPlex nodes you also need to enable the uEFI console redirection with the ASU command:
set uEFI.RemoteConsoleRedirection Enable
See [XCAT_iDataPlex_Advanced_Setup#Updating_ASU_Settings_on_the_Nodes] to set this ASU setting.
Now run:
rcons <node>
to verify that you can see the genesis shell prompt (after hitting Enter). To exit rcons, type Ctrl-Shift-E (all keys together), then "c", then ".".
You are now ready to choose an operating system and deployment method for the nodes....
This section describes the process for setting up xCAT to install nodes; that is, how to install an OS on the disk of each node.
The copycds command copies the contents of the Linux distro media to /install/<os>/<arch> so that it can later be used to install nodes or to build diskless images.
If using an ISO, copy it to (or NFS mount it on) the management node, and then run:
copycds <path>/RHEL6.2-Server-20080430.0-x86_64-DVD.iso
If using a DVD, put it in the DVD drive of the management node and run:
copycds /dev/dvd # or whatever the device name of your dvd drive is
Tip: if this is the same distro version as your management node, create a .repo file in /etc/yum.repos.d with content similar to:
[local-rhels6.2-x86_64]
name=xCAT local rhels 6.2
baseurl=file:/install/rhels6.2/x86_64
enabled=1
gpgcheck=0
This way, if you need additional RPMs on your MN at a later time, you can simply install them using yum. Or, if you are installing other software on your MN that requires additional RPMs from the distro, they will automatically be found and installed.
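For example, once the .repo file is in place, an RPM from the distro can be installed directly (the package name here is only an illustration):

yum install -y screen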
The copycds command also automatically creates several osimage definitions in the database that can be used for node deployment. To see them:
lsdef -t osimage # see the list of osimages
lsdef -t osimage <osimage-name> # see the attributes of a particular osimage
From the list above, select the osimage for your distro, architecture, provisioning method (in this case install), and profile (compute, service, etc.). Although it is optional, we recommend you make a copy of the osimage, changing its name to a simpler name. For example:
lsdef -t osimage -z rhels6.2-x86_64-install-compute | sed 's/^[^ ]\+:/mycomputeimage:/' | mkdef -z
This displays the osimage "rhels6.2-x86_64-install-compute" in a format that can be used as input to mkdef, but on the way there it uses sed to modify the name of the object to "mycomputeimage".
Initially, this osimage object points to templates, pkglists, etc. that are shipped by default with xCAT. And some attributes, for example otherpkglist and synclists, won't have any value at all because xCAT doesn't ship a default file for that. You can now change/fill in any osimage attributes that you want. A general convention is that if you are modifying one of the default files that an osimage attribute points to, copy it into /install/custom and have your osimage point to it there. (If you modify the copy under /opt/xcat directly, it will be over-written the next time you upgrade xCAT.)
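As a sketch of that convention, using the pkglist attribute as an example (the file name below is illustrative; check your osimage's pkglist attribute for the path actually shipped with your xCAT release):

lsdef -t osimage mycomputeimage -i pkglist
mkdir -p /install/custom/install/rh
cp /opt/xcat/share/xcat/install/rh/compute.rhels6.x86_64.pkglist /install/custom/install/rh/
chdef -t osimage mycomputeimage pkglist=/install/custom/install/rh/compute.rhels6.x86_64.pkglist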
But for now, we will use the default values in the osimage definition and continue on. (If you really want to see examples of modifying/creating the pkglist, template, otherpkgs pkglist, and sync file list, see the section [#Deploying_Stateless_Nodes]. Most of the examples there can be used for stateful nodes too.)
If you already have a different OS on your nodes and you haven't configured your nodes to always boot from the network, then run rsetboot to instruct them to boot from the network for the next boot:
rsetboot compute net
The nodeset command tells xCAT what you want to do next with this node, and powering on the node starts the installation process:
nodeset compute osimage=mycomputeimage
rpower compute boot
Tip: when nodeset is run, it processes the kickstart or autoyast template associated with the osimage, plugging in node-specific attributes, and creates a specific kickstart/autoyast file for each node in /install/autoinst. If you need to customize the template, make a copy of the file pointed to by the osimage's template attribute and edit that copy (or the files it includes).
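Following the same /install/custom convention described earlier, a sketch of that tip might look like this (the template file name is illustrative; check the osimage's template attribute for the real path):

lsdef -t osimage mycomputeimage -i template
mkdir -p /install/custom/install/rh
cp /opt/xcat/share/xcat/install/rh/compute.rhels6.x86_64.tmpl /install/custom/install/rh/
chdef -t osimage mycomputeimage template=/install/custom/install/rh/compute.rhels6.x86_64.tmpl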
It is possible to use the wcons command to watch the installation process for a sampling of the nodes:
wcons n1,n20,n80,n100
or rcons to watch one node
rcons n1
Additionally, nodestat may be used to check the status of a node as it installs:
nodestat n20,n21
n20: installing man-pages - 2.39-10.el5 (0%)
n21: installing prep
Note: the percentage complete reported by nodestat is not necessarily reliable.
You can also watch nodelist.status until it changes to "booted" for each node:
nodels compute nodelist.status | xcoll
Once all of the nodes are installed and booted, you should be able to ssh to all of them from the MN (without a password), because xCAT should have automatically set up the ssh keys (if the postscripts ran successfully):
xdsh compute date
If there are problems, see [Debugging_xCAT_Problems].
{{:Using_Provmethod=osimagename}}
This section gives some examples of using key commands and command combinations in useful ways. For any xCAT command, typing 'man <command>' will give details about using that command. For a list of xCAT commands grouped by category, see [XCAT_Commands]. For all the xCAT man pages, see http://xcat.sourceforge.net/man1/xcat.1.html .
In this configuration, a handy convenience group would be the lower systems in each chassis, the ones able to read temperature and fan speed. In this case, the odd-numbered systems are on the bottom, so to create the group with a regular expression:
# nodech '/n.*[13579]$' groups,=bottom
or explicitly
chdef -p n1-n9,n11-n19,n21-n29,n31-n39,n41-n49,n51-n59,n61-n69,n71-n79,n81-n89,\
n91-n99,n101-n109,n111-n119,n121-n129,n131-n139,n141-n149,n151-n159,n161-n167 groups="bottom"
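To list the members of the new group and verify it contains the nodes you expect:

nodels bottom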
We can list discovered and expanded versions of attributes (actual VPD data will appear instead of the asterisks):
# nodels n97 nodepos.rack nodepos.u vpd.serial vpd.mtm
n97: nodepos.u: A-13
n97: nodepos.rack: 2
n97: vpd.serial: ********
n97: vpd.mtm: *******
You can also list all the attributes:
# lsdef n97
Object name: n97
arch=x86_64
.
groups=bottom,ipmi,idataplex,42perswitch,compute,all
.
.
.
rack=1
unit=A1
xCAT provides parallel commands and the sinv (inventory) command to analyze the consistency of the cluster. See [Parallel_Commands_and_Inventory].
Combining the use of in-band and out-of-band utilities with the xcoll utility, it is possible to quickly analyze the level and consistency of firmware across the servers:
mgt# rinv n1-n3 mprom|xcoll
====================================
n1,n2,n3
====================================
BMC Firmware: 1.18
The BMC does not report the BIOS version, so to do the same check for the BIOS, use psh:
mgt# psh n1-n3 dmidecode|grep "BIOS Information" -A4|grep Version|xcoll
====================================
n1,n2,n3
====================================
Version: I1E123A
To update the firmware on your nodes, see [XCAT_iDataPlex_Advanced_Setup#Updating_Node_Firmware].
To do this, see [XCAT_iDataPlex_Advanced_Setup#Updating_ASU_Settings_on_the_Nodes].
xCAT has several utilities to help manage and monitor the Mellanox IB network. See [Managing_the_Mellanox_Infiniband_Network].
If the configuration is louder than expected (iDataPlex chassis nominally have a fairly modest noise impact), find the nodes with elevated fan speeds:
# rvitals bottom fanspeed|sort -k 4|tail -n 3
n3: PSU FAN3: 2160 RPM
n3: PSU FAN4: 2240 RPM
n3: PSU FAN1: 2320 RPM
In this example, the fan speeds are fairly typical. If fan speeds are elevated, there may be a thermal issue. In a dx340 system, speeds near 10,000 RPM usually indicate either a defective sensor or a misprogrammed power supply.
To find the warmest detected temperatures in a configuration:
# rvitals bottom temp|grep Domain|sort -t: -k 3|tail -n 3
n3: Domain B Therm 1: 46 C (115 F)
n7: Domain A Therm 1: 47 C (117 F)
n3: Domain A Therm 1: 49 C (120 F)
Change tail to head in the above examples to find the slowest fans/lowest temperatures. Currently, an iDataPlex chassis without a planar tray in the top position will report '0 C' for Domain B temperatures.
For more options, see the rvitals man page: http://xcat.sourceforge.net/man1/rvitals.1.html
Now that your basic cluster is set up, here are suggestions for additional reading: