
xCAT supports clusters of all sizes. This document is a collection of hints, tips, and special considerations when working with large clusters, especially clusters with more than 128 nodes.
The information in this document should be viewed as example data only. Many of the suggestions are based on anecdotal experiences and may not apply to your particular environment. Suggestions in different sections of this document may recommend different or conflicting settings since they may have been provided by different people for different cluster environments. Often there is a significant amount of flexiblity in most of these settings -- you will need to resolve these differences in a way that works best for your cluster.
In addition to this document, more recommendations on tuning can be found at https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/System%20X%20Cluster%20Tuning%20Recommendations .
In an xCAT cluster, the single point of control is the xCAT management node. However, in order to provide sufficient scaling and performance for large clusters, it may also be necessary to have additional servers to help handle the deployment and management of the cluster nodes. In an xCAT cluster these additional servers are referred to as service nodes. xCAT only supports these two levels of management hierarchy: the xCAT management node as the top level, and xCAT service nodes as the second level.
A service node is managed by the xCAT management node. A compute node can be managed by either the xCAT management node or by a service node. All xCAT commands are initiated from the management node. xCAT will then internally distribute any hierarchical commands to the service nodes as required, and gather and consolidate the results.
When does it become important to start thinking about a hierarchical cluster? There are many factors to consider. Most clusters of 128 compute nodes or less can easily be handled by a single xCAT management node. Larger non-hierarchical clusters may be feasible given enough resources on your xCAT management node and cluster management network.
Factors:
For details on setting up a Hierarchical xCAT cluster see:
[Setting_Up_an_AIX_Hierarchical_Cluster]
[Setting_Up_a_Linux_Hierarchical_Cluster]
Many services are used during node deployment, depending on the operating system and type of compute nodes: DHCP, TFTP, bootp, HTTP, NFS, conserver, and others. Any one of these services can become a bottleneck when trying to deploy too many nodes from a single server.
The xCAT hierarchical support provides different options for distributing these services across your service nodes. Adjusting the way these services are distributed may provide increased deployment performance and stability.
xCAT uses NFS in hierarchical clusters in various ways:
Note: For Linux stateful node installs, xCAT typically uses http instead of NFS to deploy operating system packages from the service nodes to the stateful nodes.
By default, xCAT will set up NFS servers on your xCAT management node and all service nodes when needed. However, xCAT does provide the option of using an external NFS server with HA capability using, for example, SONAS or GPFS CNFS. See:
[External_NFS_Server_Support_With_Linux_Statelite]
[External_NFS_Server_Support_With_AIX_Stateless_And_Statelite]
An external NFS server can also be a good solution for service nodes in a hierarchical cluster. You can then have /install on the external server shared by the management server and all service nodes. Any change on /install on the management server is immediately available on the service nodes.
By default, xCAT runs conserver on the management node for all of your compute nodes. This may be adequate for your cluster, especially if you are able to use the "console on demand" option:
chdef -t site consoleondemand=yes
When set to 'yes', xCAT will configure conserver to connect and create the console output for nodes only when the user opens the console. Default is 'no' on Linux and 'yes' on AIX. Using console on demand allows conserver to scale very well for any size cluster. However, you lose the ability to continuously log all console output.
If you would like to have your service nodes run conserver for all of the compute nodes they service, you will need to explicitly specify this to xCAT by turning on the 'setupconserver' attribute for the service nodes and by assigning the correct conserver server to each compute node. For example:
chdef service setupconserver=1
chdef compute1 conserver=service1
By default on Linux clusters, each service node should be configured as a DHCP server for its compute nodes. There are MANY different network configurations that are possible in a hierarchical cluster. Providing the correct network and DHCP interface information to xCAT is important in allowing xCAT to automatically set up your DHCP configurations succinctly and correctly.
First, specify to xCAT that you want your service nodes to run DHCP by turning on the 'setupdhcp' attribute
chdef service setupdhcp=1
Set your site table to tell xCAT which adapter interfaces DHCP should listen on:
chdef -t site dhcpinterfaces='yourMN|nicM0,nicm1;service|nicS'
where nicM1(,nicM2,...) is the interface name(s) on your xCAT management node that is attached to your cluster management network, and where nicS is the interface name on your service nodes that is attached to the management network that its compute nodes are connected to. Setting these values will limit the networks that your DHCP servers will need to respond to.
By default, xCAT will configure the dhcpd.leases file to be identical on every dhcp server in the cluster, such that each leases file will contain an entry for EVERY compute node in your cluster. If your cluster management network is fully disjoint, that is you have a separate subnet for each service node and the compute nodes it manages, you can tell xCAT to only create leases entries for the compute nodes that a service node manages:
chdef -t site disjointdhcps=1
When you first installed xCAT, your xCAT networks table was initially populated with information on all of the networks your management node is connected to. Edit your networks table entries include only the management networks in your cluster and that the information is correct. To list the current entries:
lsdef -t network -l
and use chdef to modify any entries, and rmdef to remove any network definitions that are not required to manage your cluster.
Some Linux customers have reduced DHCP timeouts by creating separate subnets/vlans from the service nodes to compute nodes for the DHCP traffic and the HTTP traffic. When simultaneously deploying very large numbers of nodes, this allows DHCP to continue to respond to new node requests while HTTP requests for OS images are served on another subnet.
Author's note: I do not have specific instructions on what needs to be set up in xCAT to make this happen. If you have successfully done this, we would appreciate the input. Please post to the xCAT mailing list. I'm guessing that you need to set the correct nfsserver attribute for the node to the service node's interface on the HTTP subnet, and to make sure that the node's interface on that subnet gets configured early enough to be available for the HTTP transfers of the OS images.
There are many other services that xCAT can distribute and automatically configure on your service nodes. To see a complete list:
tabdump -d servicenode
Each of these services may require additional configuration in xCAT. You will need to search the xCAT documentation for more information on setting up and distributing a specific service.
Adjusting operating system tunables can improve large scale cluster performance, avoid bottlenecks, and prevent failures. The following sections are a collection of suggestions that have been gathered from various large scale HPC clusters. You should investigate and evaluate the validity of each suggestion before applying them to your cluster.
After creating your diskless using genimage (but before packing it using packimage), you can chroot into the image and change any tuning parameters. (The diskless root directory is stored in the rootimgdir attribute of the osimage definition.) For example:
cd /install/netboot/rhels6.3/x86_64/compute
chroot imgdir
After making changes to the root image, you need to run packimage.
For any of the suggestions listed below that you would like to apply to your AIX diskless image, xcat provides the command:
[xcatchroot](http://xcat.sourceforge.net/man1/xcatchroot.1.html)
This command allows you to modify the diskless image so that the tuning parameters are set during your node boot. Changes with xcatchroot could include anything that you wish to modify using the AIX vmo or no commands, system parameters, etc. xcatchroot can also be used to check and verify attributes. An important note is that changing some attributes in the diskless image will also require a bosboot to be run against that image. To run the bosboot against the image, you will need to run nim -o check image. It is a good idea to run the nim -o check of the image once any modifications have been done.
The "best practices" tuning settings for large linux clusters are documented in:
Linux System Tuning Recommendations
That web page is based on experience with many large linux clusters. It contains settings for:
Although there are a few settings in that page that are not needed on some types of nodes (xCAT management/service nodes, GPFS storage nodes, login nodes, or compute nodes), it doesn't hurt to set them on all types of nodes and it is much simpler.
Address Resolution Protocol (ARP) is a network protocol that maps a network layer protocol address to a data link layer hardware address. For example, ARP is used to resolve an IP address to the corresponding Ethernet address.
In very large networks, the ARP table can become overloaded, which can give the appearance that the cluster software is slow.
There are several ARP tuning values that can be set on AIX. Read the discussion on "ARP Cache Tuning" in the "AIX Performance Guide"
The following are the default settings for ARP attributes in AIX 7.1:
arpqsize = 12
arpt_killc = 20
arptab_bsiz = 7
arptab_nb = 149
With these settings, the ARP table can hold 149 buckets (arptab_nb), with 7 entries per bucket (arptab_bsiz), for a total of 1043 entries. If a server connects to more than 1043 hosts concurrently, the system will purge an entry from the ARP table and replace it with a new request. Up to 12 requests (arpqsize) can be queued while this is processing. Entries will remain in the table for 20 minutes (arpt_killc).
Once you have determined the appropriate values for your cluster, you can change change the attributes by running the no command with the new values:
no -r .o arptab_nb=256
no -r .o arptab_bsiz=8
no -p -o arpqsize=64
You must reboot your server for these changes to take affect.
The following AIX attribute settings are recommended on all servers and compute nodes in large clusters:
no -p -o tcp_recvspace=524288
no -p -o tcp_sendspace=524288
no -p -o udp_recvspace=655360
no -p -o udp_sendspace=65536
no -p -o rfc1323=1
no -p -o sb_max=8388608
chdev -l sys0 -a maxuproc='8192'
Check ulimit settings:
ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 131072
stack(kbytes) 32768
memory(kbytes) 32768
coredump(blocks) 2097151
nofiles(descriptors) 2000
threads(per process) unlimited
processes(per user) unlimited
If not unlimited, change the ulimit setting to unlimited, for this session. coredump is optional:
ulimit -m unlimited
ulimit -n 102400
ulimit -d unlimited
ulimit -f unlimited
ulimit -s unlimited
ulimit -t unlimited
ulimit -u unlimited
Edit the /etc/security/limits file to make the settings permanent on reboot:
root:
fsize = -1
core= -1
cpu= -1
data= -1
rss= -1
stack= -1
nofiles= 102400
nproc = -1
Note you should not set nofiles to unlimited (-1). This may cause problems for some system applications.
If you are using MySQL as your xCAT database, you will want to refer to the MySQL documentation for advise on tuning your MySQL server, which is running on your xCAT management node:
MySQL: Tuning Server Parameters
According to this documentation, the two most important variables to configure are key_buffer_size and table_open_cache.
xCAT uses NFS in various ways:
Note: For Linux stateful node installs, xCAT typically uses http instead of NFS to deploy operating system packages to the nodes.
[External_NFS_Server_Support_With_Linux_Statelite]
[External_NFS_Server_Support_With_AIX_Stateless_And_Statelite]
When using an external NFS server with HA capability such as SONAS or GPFS CNFS,
the suggestions in this section may not be relevant.
Refer to that product documentation for tuning recommendations.
For heavily loaded Linux NFS servers, increasing the NFSD count may help improve performance:
For SLES, edit /etc/sysconfig/nfs and set USE_KERNEL_NFSD_NUMBER to a higher value than the default of 8.
For RedHat, update /etc/init.d/nfs and change RPCNFSDCOUNT to a higher
value than the default value of 8.
Very large clusters have used NFSD values as high as 64 or 128. Restart your NFS service
to ensure that the new value takes effect.
# increase TCP max buffer size
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.rmem_default = 65536
net.core.wmem_default = 65536
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 33554432 33554432
net.ipv4.tcp_wmem = 4096 33554432 33554432
net.ipv4.tcp_mem= 33554432 33554432 33554432
net.ipv4.route.flush=1
net.core.netdev_max_backlog=1500
rw,rsize=32768,wsize=32768,hard,intr
For servers that are handling large numbers of NFS operations, especially NFS writes, ensure that your disk subsystem is capable of processing the load. If you are running with a local SCSI drive, if possible you may want to consider moving the filesystem to an attached storage subsystem (RAID, etc.). If that is not option, it may be necessary to add additional local disk drives or create your filesystem so that it spans multiple drives to remove physical bottlenecks in your use of the disk subsystem.
An example of creating a filesystem spanning multiple local disks on AIX:
# Create a volume group across 8 disks
mkvg -f -y xcatvg -s 64 hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9
# Create a logical volume for the jfs2log that spans multiple disks
# This is important because the jfs2log can become a bottleneck
# with many writes to the filesystem
mklv -y install_log -t jfs2log -a c -S 4K xcatvg 32 hdisk2 hdisk3
# Create a logical volume for the NFS filesystem across all the disks
mklv -y install_lv -t jfs2 -a c -e x xcatvg 1000 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9
# Create the NFS filesystem
crfs -v jfs2 -d install_lv -m /install -A yes -p rw -a agblksize=4096 -a isnapshot=no
One special consideration for NFS tuning is statelite persistent files. By default, the management node or service nodes provide the NFS server for the statelite compute nodes. If you define statelite persistent files in the litefile table that the compute nodes will do a lot of writing to, this can cause performance problems on the NFS server side if it is not sized and configured properly for that load. Since many statelite compute nodes are served by a single MN or SN, when the compute nodes all write to a statelite file, the load is multiplied on the MN/SN. For diskfull compute nodes, these writes are not an issue, because each compute node handles it own writes. But in a statelite cluster, you must carefully consider which files are defined as statelite persistent files, how often writes are done to them, and how your NFS servers are sized. Examples of files that are typically defined as statelite files that can cause this problem are:
For MNs and SNs that act as the NFS servers, often the performance bottleneck can be its disk. If the NFS server only has one disk (or 2 mirrored together), the disk often can't keep up with the NFS server daemon for many simultaneous writes. This problem can be addressed by one of the following:
#image,file,options,comments,disable
"ALL","/var/mmfs/","persistent",,
If you are using RMC for cluster monitoring, you may need to increase the TCP/IP buffer sizes on your xCAT management node only to prevent incorrect node status from being returned. If the buffer size is too low for the number of nodes in the cluster, a node could be reported as down even though it can be reached using ping and the RMC subsystem on the node is active.
To temporarily increase the TCP/IP buffer size on AIX, run the following from the command line:
no -o udp_recvspace=262144
You must also recycle RMC. Run the following from the command line:
/usr/sbin/rsct/bin/rmcctrl -k
/usr/sbin/rsct/bin/rmcctrl -s
For a permanent change use the no command and the /etc/tunables/nextboot file, as follows:
See the AIX documentation for detailed command usage information. Note: To use larger values on AIX, increase the udp_recvspace value in increments of 262144, not to exceed the sb_max value.
no .L udp_recvspace
---------------------------------------------------------------------------------
NAME CUR DEF BOOT MIN MAX UNIT TYPE DEPENDENCIES
---------------------------------------------------------------------------------
udp_recvspace 262144 262144 262144 4K 2G-1 byte C sb_max
---------------------------------------------------------------------------------
no .L sb_max
-------------------------------------------------------------------------------
NAME CUR DEF BOOT MIN MAX UNIT TYPE DEPENDENCIES
-------------------------------------------------------------------------------
sb_max 1M 1M 1M 1 2G-1 byte D
-------------------------------------------------------------------------------
For a discussion of apache2 tuning, see the external web page: "Apache Performance Tuning"
The default settings for Apache 2 on Red Hat and SLES may not allow enough simultaneous HTTP client connections to support the installation of more than about 50 nodes at a time. To enable greater scaling, the MaxClients and ServerLimit directives need to be increased from 150 to 1000. On Red Hat, change (or add) these directives in
/etc/httpd/conf/httpd.conf
On SLES (with Apache2), change (or add) these directives in
/etc/apache2/server-tuning.conf
Note: this information was contributed by Jonathan Dye and is provided here as an example. The details may have to be changed for distro or apache version.
This is simplest if you set noderes.nfsserver to a separate apache server, and then you can configure it to reverse proxy and cache. For some reason mod_mem_cache doesn't seem to behave as expected, so you can use mod_disk_cache to achieve a similar result: make a tmpfs on the apache server and configure its mountpoint to be the directory that CacheRoot points to. Also tell it to ignore /install/autoinst since the caching settings are really aggressive. Do a recursive wget to warm the cache and watch the tmpfs fill up. Then do a bunch of kickstart installs. Before this, the apache server on the xcat management node may have been a bottleneck during kickstart installs. After this change, it no longer should be.
Here's the apache config file:
ProxyRequests Off # don't be a proxy, just allow the reverse proxy
CacheIgnoreCacheControl On
CacheStoreNoStore On
CacheIgnoreNoLastMod On
CacheRoot /var/cache/apache2/tmpfs
CacheEnable disk /install
CacheDisable /install/autoinst
CacheMaxFileSize 1073741824
# CacheEnable mem / # failed attempt to do in-memory caching
# MCacheSize 20971520
# MCacheMaxObjectSize 524288000
# through ethernet network
# ProxyPass /install http://172.21.254.201/install
# through IB network
ProxyPass /install http://192.168.111.2/install
conserver is an application used on xCAT management and service nodes which allows multiple users to watch serial console at the same time. On AIX large cluster, the default PTY setting prevents conserver forking too many consoles for compute nodes at the same time. It is strongly suggested to tune following PTY setting if there are more than one hundred nodes in the cluster.
The default PTY setting on AIX 71D is:
> lsattr -El pty0
ATTnum 256 Maximum number of Pseudo-Terminals True
BSDnum 16 Maximum number of BSD Pseudo-Terminals True
autoconfig available STATE to be configured at boot time True
csmap sbcs N/A
The ATTnum is used by pts devices and can be enlarged by:
> chdev -l pty0 -a ATTnum=2048
pty0 changed
After appliy the changes, restart conserver by:
> stopsrc -s conserver
0513-044 The conserver Subsystem was requested to stop.
> startsrc -s conserver
0513-059 The conserver Subsystem has been started. Subsystem PID is 40173700.
See "IBM Tuning Guide for High Performance Computing Applications for IBM Power 6": <http://www.ibm.com/developerworks/wikis/download/attachments/137167333/Power6_optimization.pdf?version=1>
This document contains recommended compiler options, discussion of using SMT (simultaneous multi-threading), running legacy executables, MPI performance optimizations, high performance libraries, and benchmark results.
When using GPFS with IP over IB, you may need to address connection backlog issues when a high rate of incoming connection requests result in connection failures.
To temporarily change the somaxconn setting on Linux, run the following from the command line:
echo "8124" > /proc/sys/net/core/somaxconn
To permanently change this parameter, add the following to /etc/sysctl.conf and reboot the Linux server:
net.core.somaxconn=8192
The Parallel Environment Installation Guide includes information on tuning for network performance. The PE documents are available here: "Parallel Environment (PE) Library"
Review these documents for additional settings and detailed explanations:
To enable more than 8 tasks in shared memory:
echo 268435456 > /proc/sys/kernel/shmmax
To avoid socket send and receive buffer overflow:
echo 1048576 > /proc/sys/net/core/wmem_max
echo 8388608 > /proc/sys/net/core/rmem_max
# Or, on Power Linux:
sysctl -w net.core.wmem_max=1048576
sysctl -w net.core.rmem_max=8388608
To avoid datagram reassembly failures when MP_UDP_PACKET_SIZE is greater than the MTU:
echo 1048576 > /proc/sys/net/ipv4/ipfrag_low_thresh
echo 8388608 > /proc/sys/net/ipv4/ipfrag_high_thresh
To enable jumbo frames:
ifconfig eth<x> mtu 9000 up
The switch must be configured with jumbo frame enabled.
To turn off interrupt coalescing on device eth0 on Power Linux:
ifdown eth0
rmmod e1000
insmod /lib/modules/2.6.5-7.97-pseries64/kernel/drivers/net/e1000/e1000.ko \
InterruptThrottleRate=0,0,0
ifconfig eth0
The path to e1000.ko may vary with kernel level.
To allow multiple core dumps:
echo 1 > /proc/sys/fs/suid_dumpable
# Or, on Power Linux:
/sbin/sysctl -w fs.suid_dumpable=1
To obtain more dmesg detail:
echo 1 > /proc/sys/kernel/print-fatal-signals
# Or, on Power Linux:
/sbin/sysctl -w kernel.print-fatal-signals=1
Verify this check passes for PE.
/opt/ibmhpc/pecurrent/ppe.poe/bin/pe_node_diag
To avoid socket send and receive buffer overflow:
no -o sb_max=8388608
To turn off interrupt coalescing on device ent0 for better latency:
ifconfig ent0 detach
chdev -l ent0 -a intr_rate= 0
If your user applications are using large pages, see The PE 5.2.1 Operation and Use Guide: "Using POE with AIX Large Pages"
On all compute nodes:
vmo -p -o lgpg_regions=64 -o lgpg_size=16777216
If your user applications require shared memory segments to be pinned, on all compute nodes:
vmo -p -o v_pinshm=1
If you decide you would like full system dumps to aid in problem diagnosis, on all compute nodes:
chdev -l sys0 -a fullcore=true
If your user applications perform better with Simultaneous Mult-Threading turned off (this is application dependent), on all compute nodes:
smtctl -m off
xCAT provides several settings that may affect the management of your large-scale cluster.
The xCAT site table contains many global settings for your xCAT cluster. To display current values, run the command:
lsdef -t site -l
This will only list those attributes that have been explicitly set in your site table. Many other attributes are supported. To list all possible attribute names, a brief description of each, and default values, go to http://xcat.sourceforge.net/man5/site.5.html or run:
lsdef -t site -h
For large clusters, you consider changing the default settings for some of these attributes to improve the performance on a large-scale cluster or if you are experiencing timeouts or failures in these areas:
consoleondemand : When set to 'yes', conserver connects and creates the console output for a node only when the user explicitly opens the console using rcons or wcons. Default is 'no' on Linux, 'yes' on AIX. Setting this to 'yes' can reduce the load conserver places on your xCAT management node. If you need this set to 'no', you may then need to consider setting up multiple servers to run the conserver daemon, and specify the correct server on a per-node basis by setting each node's conserver attribute.
nodestatus : If set to 'n', the nodelist.status column will not be updated during the node deployment, node discovery and power operations. Default is 'y', always update nodelist.status. Setting this to 'n' for large clusters can eliminate one node-to-server contact and one xCAT database write operation for each node during node deployment, but you will then need to determine deployment status through some other means.
precreatemypostscripts: (yes/1 or no/0, only for Linux). Default is no. If yes, it will instruct xcat at nodeset and updatenode time to query the database once for all of the nodes passed into the command and create the mypostscript file for each node, and put them in a directory in tftpdir(such as: /tftpboot). This prevents xcatd from having to create the mypostscript files one at a time when each deploying node contacts it, so it will speed up the deployment process. (But it also means that if you change database values for these nodes, you must rerun nodeset.) If precreatemypostscripts is set to no, the mypostscript files will not be generated ahead of time. Instead they will be generated when each node is deployed.
svloglocal : if set to 1, syslog on the service node will not get forwarded to the mgmt node. The default is to forward all syslog messages. The tradeoff on setting this attribute is reducing network traffic and log size versus having local management node access to all system messages from across the cluster.
skiptables : a comma separated list of tables to be skipped by dumpxCATdb. A recommended setting is "auditlog,eventlog" because these tables can grow very large. Default is to skip no tables.
useNmapfromMN : When set to yes, nodestat command should obtain the node status using nmap (if available) from the management node instead of the service node. This will improve the performance in a flat network. Default is 'no'.
disjointdhcps : If set to '1', the .leases file on a service node only contains the nodes it manages. The default value is '0', which means include all the nodes in the service node's subnet. This should be set to '1' when you have many service nodes and compute nodes on a single flat subnet, but you don't want them to behave like a service node pool, instead you want to explicitly assign compute nodes to specific service nodes.
dhcplease : The lease time for the dhcp client. The default value is 43200.
xcatmaxconnections : Number of concurrent xCAT protocol requests before requests begin queueing. This applies to both client command requests and node requests, e.g. to get postscripts. Default is 64.
xcatmaxbatchconnections : (xCAT2.8.3 and later) Number of concurrent xCAT connections allowed from the nodes. Number must be less than xcatmaxconnections. See useflowcontrol attribute.
useflowcontrol : (xCAT 2.8.3 and later) If yes, postscripts use xcatd to control access to the server. If no, postscripts sleep and retry. On a new install, it will be set to yes. Not supported on AIX.
ipmidispatch : Whether or not to send ipmi hw control operations to the service node of the target compute nodes. Default is 'y'.
ipmiretries : The # of retries to use when communicating with BMCs. Default is 3.
ipmisdrcache : If set to 'no', then the xCAT IPMI support will not cache locally the target node's SDR cache to improve performance. Default is 'no'. Set to 'yes' to cache the SDR.
syspowerinterval : The number of seconds the rpower command to servers will wait between performing the action for each set of servers. The number of servers in each set is controlled by the 'syspowermaxnodes' setting. This is used for controlling the power on speed in large clusters. Default is 0.
syspowermaxnodes : The number of servers to power on at one time before waiting 'syspowerinterval' seconds to continue on to the next set of nodes. Currently only used for IPMI servers and must be set if 'syspowerinterval' is set.
hwctrldispatch : Whether or not to send hw control operations to the service node of the target nodes. Default is 'y'.
blademaxp : The maximum number of concurrent processes for blade hardware control.
powerinterval : The number of seconds that rpower command will wait between performing the action on each LPARs. It is used for controlling the cluster boot up speed in large clusters. LPARs of different HCPs (HMCs or FSPs) are done in parallel. E.g. if there are 8 lpars in each cec, and the powerinterval is set to 30, it will take about 6 minutes. This value is only used for system p servers. If you are having trouble with some LPARs not booting successfully, you might try 30 for this setting. Default is 0;
syspowerinterval : The number of seconds the rpower command to servers will wait between performing the action for each system p CEC. This is used for controlling the power on speed in large clusters. For p775 CECs, the recommended setting is 30. Default is 0.
ppcmaxp : The max # of processes for PPC hw ctrl. Default is 64.
ppcretry : The max # of PPC hw connection attempts to the HMC before failing. It is only used for hardware control commands through the HMC. Default is 3.
ppctimeout : The timeout, in milliseconds, to use when communicating with PPC hardware through the HMC. It only is used for hardware control commands through the HMC. A value of '0' means use the default. Default is 60.
maxssh : The max # of SSH connections at any one time to the hw ctrl point for PPC. Default is 8.
fsptimeout : The timeout, in milliseconds, to use when communicating with FSPs. A value of '0' means use the default. The default value is 30. If you are experiencing connection timeouts on a heavily loaded management network, you may want to experiment with setting this to a larger value.
As of xCAT 2.8.3, there is available the useflowcontrol attribute in the site table. When this is set to yes/1, it works with the xcatmaxconnections and xcatmaxbatchconnects attributes to control requests from the nodes. This affects performance on installs and updatenode running the postscripts.
If xcatmaxconnection = 64 and xcatmaxbatchconnections = 50, then the daemon will only allow 50 concurrent connections from the nodes. This will allow 14 connections still to be available on the management node for xCAT commands (e.g nodels).
These attributes may be changed based on the size of your cluster. When increasing, consider the other tunables that also may need to increase in this document. xcatmaxbatchconnections should always be less than xcatmaxconnections.
On a new install of xCAT 2.8.3 or later release, useflowcontrol will be set to yes. It is not supported on AIX.
In addition to the site table settings, some xCAT commands provide options for timeouts, retries, and fanouts to help improve the potential for command success under different conditions. If you are experiencing problems running any of these commands, you may wish to experiment with some of these values:
xCAT provides several commands to make management of large clusters easier. Here are some that you may wish to review and use:
The xCAT database contains over 50 different tables, although for most clusters you only need entries in a small number of commonly used tables. The majority of the tables are keyed by the xCAT node name. For very large clusters, consolidating individual node entries into entries keyed by a nodegroup name can reduce the amount of information stored in the database, and make handling the data more manageable.
With the xCAT *def commands (lsdef, mkdef, chdef, rmdef), you can manipulate common attribute values for node groups using "-t group" (object type of 'group'). These will create database entries that are keyed by nodegroup name instead of individual node names.
For example:
chdef -t group -o compute nodetype="osi,ppc" hwtype=lpar os=rhels6 arch=ppc64
will set many common attributes in the nodetype table and ppc table for the nodegroup "compute".
See the xCAT documentation [Node_Group_Support] for details on using node groups and nodegroup entries in the xCAT database.
Another useful feature to use in conjunction with nodegroup entries, are to have entries that include regular expressions. This is especially useful where you would normally require an individual table entry for each node because of unique data values. When those data values follow an identifiable pattern, specifying the pattern as a regular expression enables you to create a single entry for the nodegroup. See the xCAT Database manpage for more details and for examples.
Running the 'syncfiles' postscript during the boot will trigger a sync process for each node, this will cost a long time in a large cluster (6 minutes for 516 nodes/2 service nodes).
An efficient approach is to:
Remove the 'syncfiles' postscript from the postscripts attribute
That means the syncfiles process will NOT be run during the boot of node.
chtab node=xcatdefaults postscripts.postscripts=syslog,remoteshell
Update the sync files after the booting of node
Run 'updatenode -F' to update the sync files after the installation.
updatenode noderange -F
Add the synclist configuration file to the osimage so that the sync file process will be performed during the genimage.
Run 'updatenode -F' against the booted node if there's sync file need to be updated to the running node.
updatenode noderange -F
Run 'xdcp -i -F' to sync the files to the osimage if there's sync file need to be updated. (Remember to run the 'packimage'/'nodeset' to regenerate the rootimg.gz)
xdcp -i /install/netboot/sles11/x86_64/compute/rootimg -F /tmp/mysynclist
The xCAT site table has a powerinterval setting that can be used to specify the number of seconds that the rpower command will wait between performing actions on each target object. You can use this to control the cluster boot up speed in large p-Series clusters. This can help stagger DHCP, tftp, http, and NFS requests to your service nodes when simultaneously booting large numbers of stateless or statelite nodes and may impact how you choose to distribute these services across your cluster. You will need to experiment to find a value that works best for your cluster size and hardware:
chdef -t site powerinterval=2
Default is 0.
If the service nodes are running as diskful systems and there are two or more disks on the service nodes, using disks mirror on the service nodes is an effective way to provide high availability for the service nodes. In case of hard disks faiure, another mirrored disk can take over, after the disk replacement, the mirror can help re-sync the data to the new disk.
On AIX, the commands mirrorvg and unmirrorvg can be used to manage the rootvg mirror with another disk.
On Linux, both the RedHat kickstart and SuSE autoyast provide options to support disk mirror when performing operating system deployment. The kickstart template file /opt/xcat/share/xcat/install/rh/service.raid1.rhel6.ppc64.tmpl and autoyast template file /opt/xcat/share/xcat/install/sles/service.raid1.sles11.tmpl give examples on the disks mirror. The command dmraid can be used to manage the disks mirror on Linux.
Wiki: External_NFS_Server_Support_With_AIX_Stateless_And_Statelite
Wiki: External_NFS_Server_Support_With_Linux_Statelite
Wiki: Node_Group_Support
Wiki: Setting_Up_a_Linux_Hierarchical_Cluster
Wiki: Setting_Up_an_AIX_Hierarchical_Cluster
Wiki: XCAT_2.6_Release_Notes
Wiki: XCAT_Documentation
Wiki: XCAT_Power_775_Hardware_Management