
Xeon Phi is an Intel Coprocessor which is designed with Intel Many Integrated Core (MIC) Architecture. Xeon Phi coprocessor is shipped as a PCIe card which can be installed on Xeon server. Each Xeon server could install one or more (8 or more) Xeon Phi PCIe cards, the Xeon Phi Coprocessor is depending on Xeon Server to function and all the management work for Xeon Phi coprocessor need to be done through the Xeon server. The relationship between 'Xeon Phi coprocessor' and 'Xeon Server' is similar with the relationship between 'Virtual Machine Guest' and 'Hypervisor' in the virtualization industry . In this document, the 'Xeon Server' which has 'Xeon Phi Coprocessor' installed is called 'host node' for 'Xeon Phi Coprocessor' since it's a host of 'Xeon Phi' card. 'Xeon Phi Coprocessor' is called 'mic node' because of its arch is 'Intel Many Integrated Core (MIC) Architecture'.
This document describes how to use xCAT to manage hardware discovery, hardware control and software installation for mic host and mic nodes in a cluster. 'mic node' can run in offload and native mode, this document only focus on the management for native mode that manage each mic card as a 'mic node'.
xCAT supports the 'host node' to be stateful and stateless. The steps which are specific for stateful and stateless will be described separated with the marks [stateful] and [stateless].
This document will be organized that follows the procedure that how cluster administrator manages a 'Xeon Phi' cluster. Following is the usual management procedure for Xeon Phi cluster:
The example environment will be used in this document:
**Host Node**: michost1
**MIC Nodes**: michost1-mic0, michost1-mic1
**MPSS Package**: mpss-3.1-rhel-6.2.tar
**kit name for MPSS**: mpss
**Partial kit package for MPSS**: mpss-3.1-0.1.1-x86_64.NEED_PRODUCT_PKGS.tar.bz2
**MIC osimage Name**: mpss3.1
**Host osimage Name**: **[stateful]** rhels6.2-x86_64-install-compute/**[stateless]** rhels6.2-x86_64-netboot-compute
**Virtual Bridge Name on Host Node**: xbr0 (External Bridge)/ micbr0 (Internal Bridge)
**Network for mic node** 192.168.0.1/24
**IP of Bridge** 192.168.0.254
**IP of mic nodes** 192.168.0.1 (michost1-mic0); 192.168.0.2 (michost1-mic1)
Note: The Xeon Phi support is available in xCAT 2.8.2 and higher release. In xCAT 2.8.2 release, only MPSS 2.x is supported. In xCAT 2.8.3 and higher release, only MPSS 3.1 and higher version is supported.
The Many Integrated Core (MIC) Platform Software Stack (MPSS) is a collection of software that is used to support all the functionalities of mic nodes. Without MPSS installed on 'host node', 'mic node' cannot be function.
xCAT offers a kit package named 'mpss' to manage the install/update of MPSS rpms. Since MPSS rpms include binaries and drivers for specific kernel version, xCAT offers a 'Partial kit' that user could build complete kit from the 'partial kit' + 'MPSS rpms'.
mkdir -p /kit/
cd /kit/
Download the mpss partial kit 'mpss-3.1-0.1.1-x86_64.NEED_PRODUCT_PKGS.tar.bz2' to /kit/
Get in the web site kit for mpss and download the up to date partial mpss kit.
Download the MPSS tar file which shipped by Intel to /kit/ (e.g. the MPSS tar file name is mpss-3.1-rhel-6.2.tar)
Get in the download path MPSS for Xeon Phi, and select the MPSS package base on the version of Operating System for your 'host node'.
Extract the files from MPSS tar file
tar xvf mpss-3.1-rhel-6.2.tar
Then you can see a new created directory like 'mpss-3.1' which contains all the MPSS rpm packages. In the /kit/mpss-3.1/, you can see lots of rpms named mpss*.rpm.
Build complete kit from partial kit
buildkit addpkgs mpss-3.1-0.1.1-x86_64.NEED_PRODUCT_PKGS.tar.bz2 --pkgdir /kit/mpss-3.1
After a successful building, you can see a created complete kit package in /kit/: mpss-3.1-0.1.1-x86_64.tar.bz2
addkit mpss-3.1-0.1.1-x86_64.tar.bz2
Then you can run 'lsdef -t kit' and 'lsdef -t kitcomponent' to display the added kit and kit component. In this example, the kit 'mpss-3.1-0.1.1-x86_64' and kit component 'mpss_compute-3.1-0.1.1-rhels-6-x86_64' will be displayed.
Xeon Phi cards are installed on 'host node' and all the management operations are performed on 'host node', therefore the 'host node' needs to be installed and configured before performing any management against the 'mic node'. The Operating System for 'host node' should be a latest RedHat Enterprise Server or SUSE Linux Enterprise Server.
Note: currently, only Redhat Linux is supported for 'host node'.
This document assumes:
[stateful] Add the mpss kit component to the osimage of your 'host node'
addkitcomp -a -i rhels6.2-x86_64-install-compute mpss_compute-3.1-0.1.1-rhels-6-x86_64
Then the mpss rpms will be installed to your 'host node' during os deployment. Or you could run 'updatenode <host node>' to install mpss rpms for an OS ready 'host node'.
[stateless] Add the mpss kit component to the osimage of your 'host node'
addkitcomp -a -i rhels6.2-x86_64-netboot-compute mpss_compute-3.1-0.1.1-rhels-6-x86_64
Since configuring mic needs 'perl' pakcage to run on 'host node', you have to add 'perl' to the .pkglist for host node osimage.
Add 'perl' package to .pkglist of host node osimage
Then you need run 'genimage' command to install mpss rpms into the root image for stateless 'host node'
genimage rhels6.2-x86_64-netboot-compute
Repeat the steps in Build Kit for MPSS to build the new version of mpss kit. And repeat the step in Install mpss kit on your Host Node to update the MPSS rpms on 'host node'.
In each install or update of mpss by kit, the mpss rpm packages which have been installed will be removed first and then install the new rpms.
If you see the error message like following during install/update, try to reboot host node after install/update to make the new mic driver takes effect. Or you can try to reboot the host node before installing/updating.
Removing MIC Module: FATAL: Module mic is in use.
The virtual network bridge needs to be created on the 'host node' to enable the 'mic node' communication with other 'mic node', 'host node' and even beyond the 'host' in the cluster.
xCAT supports two types of Bridge on 'host node':
**External Bridge** \- the bridge needs be bound to a physical Ethernet interface. 'mic node' can use it to communicate outside of 'host node'.
**Internal Bridge** \- this is an internal bridge for a specific 'host node'. That means the 'mic nodes' only can communicate inside 'host node'. If you want the 'mic node' gets outside of 'host node', a router needs be configured.
An xCAT postscript 'xHRM' can be used to create virtual bridge on the 'host node' when install/update a 'host node'.
chdef michost1 -p postscripts='xHRM bridgeprereq xbr0'
Usage of bridgeprereq option for 'xHRM' postscript
xHRM bridgeprereq xbr0 - Create a bridge named xbr0 which is attached to the primary interface that used for the 'host node' installing.
xHRM bridgeprereq eth0:xbr0 - Create a bridge name xbr0 which is attached to the 'eth0' interface on 'host node'.
xHRM bridgeprereq eth0:xbr0 192.168.0.254 - Create a bridge name xbr0 which is attached to the 'eth0' and assign static IP '192.168.0.254' for the bridge.
You can run updatenode command to create a new bridge in anytime
updatenode michost1
The Internal Bridge will be created automatically when you run 'nodeset' against a 'mic node'. But you need assign an IP and bridge type in 'host node' definition first.
Set the IP and Bridge type for Internal Bridge 'micbr0'
chdef michost1 nicips.micbr0=192.168.0.254 nictypes.micbr0=Internal
Each Xeon Phi card is defined as an xCAT general node which can be managed individually. rscan command can be used to discover the mic node and define them into xCAT database. The default name of mic node will be <host name>-micX, 'X' is the device number of the mic which started from 0 to n.
# rscan michost1
type name id host
mic michost1-mic0 0 michost1
mic michost1-mic1 1 michost1
rscan michost1 -z > /tmp/micstanza
cat /tmp/micstanza | mkdef -z
Except the auto discovered attributes, there are other several attributes need to be set to configure the mic node.
ip - Set the IP of the mic node so that it could communicate with other mic node.
micbridge - The bridge which created on host node to offer communication channel for mic node.
miconboot (Optional) - Set whether auto boot of the mic node when host node boot up. Default is 'yes'.
micvlog (Optional) - Set whether enable the verbose console log for mic node. Default is 'no'.
micpowermgt (Optional) - Set the power management for mic node. Default is 'on' for all power management features.
Make sure the IP for 'mic node' is in the same network with 'Bridge IP', otherwise the 'mic node' cannot join to the Bridge.
chdef michost1-mic0 ip=192.168.0.1
chdef michost1-mic1 ip=192.168.0.2
The Bridge name must be the one which set by 'xHRM' postscript (External Bridge) or the one which set in nics table (Internal Bridge).
chdef michost1-mic0 micbridge=xbr0 # set the External bridge
chdef michost1-mic0 micbridge=micbr0 # set the Internal bridge
You can set the power management state for 'mic node' through 'micpowermgt' attribute. Four sub-attributes are supported for power management: cpufreq, corec6, pc3 and pc6. The valid values are 'on' or 'off'. Refer to doc of MPSS to get more information of 'Power Management'.
chdef michost1-mic0 micpowermgt='cpufreq=on!corec6=off!pc3=off!pc6=off'
Note: Currently, setting of 'cpufreq' does not work. And set others attributes to 'on' does not work too. This is a bug of Intel.
This section uses the terms osimage and ramfs to source that is used to boot mic node. Following are the concepts of them:
**osimage** \- A directory which contains files to create ramfs for mic node. It's created by xCAT on xCAT management node and will be mounted to 'host node' so that all the host node could share the same build resource. Part of files which are used to create ramfs come from it.
**ramfs** \- It is a cpio.gz file which will be loaded to mic node as root image for mic Linux system. The ramfs for each mic node needs to be created base on the osimage, Get generated ramfs will be put in /var/mpss of host node.
You can run copycds command to create osimage definition and create a file structure in xCAT repository path /install (refer to the osimage.pkgdir attribute). The argument for copycds command should be the tar file of MPSS which is downloaded from Intel web site.
copycds mpss-3.1-rhel-6.2.tar
After the successful running of copycds, the directory '/install/mpss3.1' is created. And a new osimage object named mpss3.1-rhels6.2-compute is created too. You will use this osimage name to run the genimage and nodeset command to configure the osimage.
Mic node only has 8G memory that will be shared by 64 cores. To reduce the memory usage by the Operating System which running on mic node, Intel supplies a reduced Linux kernel and basic file system. For end user, except the basic file system, there are several ways to add additional files or software for the mic file system.
The additional file and software need be installed in the osimage (The directory is /install/<osimage name>) so that all ramfs for mic could get the additions.
xCAT supports three formats to add files or software to osimage:
Following is the directory structure in the <pkgdir of osimage>, you can use any of them to add files or software.
--<pkgdir of osimage> e.g. /install/mpss3.1
--system (files for system boot)
--common.filelist
--common
--overlay
--rpm
--simple
--s1,s2,...
--simple.cfg
--rootimg (2.8.4 and later) / package (2.8.3)
--the base file for fs
--opt/mic
--yy.filelist
--xx.filelist
--xx
Note: In xCAT 2.8.3, the path '/<pkgdir of osimage>/overlay/<package>/' is supported as rootimage directory when installing rpms in the osimage. xCAT released a patch patch for genimage which changed osimage root directory to /<pkgdir of osimage>/overlay/<osimage>. All customers who want to use kit or otherpkgs to install software in mic osimage are recommended to apply this patch.
For installation of software which contains a lot files, xCAT recommends user to use '<pkgdir of osimage>/overlay/rootimg/ + ./opt/mic/yy.filelist' or '<pkgdir of osimage>/overlay/rpm'
Example: Install a rpm named pe-1.0-1.x86_64.rpm into '<pkgdir of osimage>/overlay/rootimg/' by otherpkgs mechanism.
1. Create rpm pe-1.0-1.x86_64.rpm.
It includes following files:
/etc/init.d
/etc/init.d/pe
/etc/pe.cfg
/opt/mic
/opt/mic/pe.filelist
/opt/pe
/opt/pe/pe.bin
/root/pe.run
/opt/mic/pe.filelist
/opt/mic/pe.filelist is a must have configuration file for xCAT to know where to install the files in mic ramfs. It has the entries like following:
file /etc/init.d/pe etc/init.d/pe 0755 0 0
file /root/pe.run root/pe.run 0644 0 0
file /etc/pe.cfg etc/pe.cfg 0644 0 0
dir /opt 0755 0 0
dir /opt/pe 0755 0 0
file /opt/pe/pe.bin opt/pe/pe.bin 0755 0 0
2. Add the rpm in otherpkgs list:
mkdir -p /install/post/otherpkgs/mic3.1/x86_64/test
cp pe-1.0-1.x86_64.rpm /install/post/otherpkgs/mic3.1/x86_64/test
Create a test.otherpkglist configuration file with following line
wxp/pe-1.0-1
Add path of test.otherpkglist to osimage.otherpkglist
3. Run genimage to install rpm to mic osimage
genimage mpss3.1-rhels6.2-compute
The base file system for mic node includes some shared libraries for binaries to use. But sometimes you need the libraries which are built by customers themselves or installed from Intel compiler package, following steps can be followed to add additional libraries.
Use '<pkgdir of osimage>/overlay/simple/s1,s2... + simple.cfg' format to add a bunch of libraries to /usr/lib64.
Make a directory '/usr/lib64' at <pkgdir of osimage>/overlay/simple/
Copy all the additional libraries to '<pkgdir of osimage>/overlay/simple/usr/lib64'
Add a simple format configuration file at <pkgdir of osimage>/overlay/simple/usr/lib64/simple.cfg which contains configuration entry:
/usr/lib64->/usr
Run nodeset command against mic node to take effect.
If you want to add a start script that will be run during the boot of mic node, follow these steps:
Add a start script in /etc/init.d/. e.g. /etc/init.d/start
Add a symbol link from /etc/init.d/ to /etc/rc5.d/ e.g. /etc/rc5.d/S50start -> /etc/init.d/start
e.g. Add a start script named 'start' with start order 'S50'
mkdir <pkgdir of osimage>/overlay/st
vi <pkgdir of osimage>/overlay/st.filelist to add following lines
file /etc/init.d/start etc/init.d/start 0755 0 0
slink /etc/rc5.d/S50start ../../etc/init.d/start 0755 0 0
"/etc/hosts",
"/etc/group",
"/etc/passwd",
"/etc/shadow",
"/etc/resolv.conf",
"/etc/nsswitch.conf",
"/etc/ssh/ssh_host_rsa_key",
"/etc/ssh/ssh_config",
"/etc/ssh/ssh_host_key",
"/etc/ssh/sshd_config",
"/etc/ssh/ssh_host_dsa_key",
"/etc/ssh/ssh_host_key.pub",
"/root/.ssh/id_rsa",
"/root/.ssh/id_rsa.pub",
"/root/.ssh/authorized_keys",
Run genimage command:
genimage mpss3.1-rhels6.2-compute
xCAT supports to mount certain directories from nfs server to 'mic node' during the booting of 'mic node'.
image,file,options,comments,disable
"mpss3.1-rhels6.2-compute","/nfs/a:/a","micmount",,
"mpss3.1-rhels6.2-compute","/nfs/b:/b","micmount",,
The columns in litefile table:
image: The name of mpss osimage
file: To specify the source and destination of mount operation. The format must be: src:dest.
options: The mic mount flag. Must be set to 'micmount'
This step is optional that if not set nfs server in statelite table, the nfs server will be the 'host node' of 'mic node'
Set the nfs server in statelite table:
#node,image,statemnt,mntopts,comments,disable
"michost1-mic0",,"192.168.0.100:/nfsroot",,,
Without nfs server set in statelite table, the nfs server will be 'host node'.
192.168.0.254:/nfs/a /a nfs nolock 1 1
With nfs server set in statelite table, the nfs server will get from statelite table.
192.168.0.100:/nfsroot/nfs/a /a nfs nolock 1 1
Note1: xCAT will not cover the setup of nfs server. User needs make sure the nfs server has been set up correctly before booting of mic node.
Note2: [Stateless] The flag 'fsid=0' needs be set for nfs export. e.g. /nfs *(rw,no_root_squash,sync,no_subtree_check,fsid=0)
The osimage is locate at xCAT Management Node or Service Node, it will be mounted to 'host node' when run the 'nodeset' or 'rflash' commands. So if you have Service Node, the /install directory on Service Node should be synced from xCAT Management Node instead of a mount from Management Node. And after any change in the osimage, the syncing of /install directory to Service Node is necessary to be done again:
cd /
prsync install <sn>:/
When the osimage for mic node is ready, you can create ramfs for each mic node. The osimage is located at xCAT Management Node or Service Node, but ramfs will be created at mic host node (/var/mpss).
nodeset command will generate configuration file for each mic node base on the previous setting, and generate ramfs which 'mic node' boots from for each mic node.
nodeset michost1-mic0 osimage=mpss3.1-rhels6.2-compute
If not set miconboot to 'no', after nodeset, the mic node will be on boot status. Then you can log in to the mic node to check the configuration.
[Stateless] Note: Since the reboot of 'host node' will lose all the configuration for mic node, you have to rerun 'nodeset' for each mic node after rebooting of 'host node'.
rpower michost1-mic0 stat
michost1-mic0: online (mode: linux image: /opt/intel/mic/mnt/lib/firmware/mic/uos.img)
rinv michost1-mic0
michost1-mic0: HOST OS : Linux
michost1-mic0: OS Version : 2.6.32-220.el6.x86_64
michost1-mic0: Driver Version : 6720-15
michost1-mic0: MPSS Version : 2.1.6720-15 michost1-mic0: HOST OS : Linux
michost1-mic0: OS Version : 2.6.32-220.el6.x86_64
michost1-mic0: Driver Version : 3.1-0.1.build0
michost1-mic0: MPSS Version : 3.1
michost1-mic0: Host Physical Memory : 65902 MB
michost1-mic0: Flash Version : NotAvailable
michost1-mic0: SMC Firmware Version : NotAvailable
michost1-mic0: SMC Boot Loader Version : NotAvailable
michost1-mic0: uOS Version : NotAvailable
michost1-mic0: Device Serial Number : NotAvailable
michost1-mic0: Vendor ID : 0x8086
michost1-mic0: Device ID : 0x2250
michost1-mic0: Subsystem ID : 0x2500
michost1-mic0: Coprocessor Stepping ID : 3
michost1-mic0: PCIe Width : x16
michost1-mic0: PCIe Speed : 5 GT/s
michost1-mic0: PCIe Max payload size : 256 bytes
michost1-mic0: PCIe Max read req size : 4096 bytes
michost1-mic0: Coprocessor Model : 0x01
michost1-mic0: Coprocessor Model Ext : 0x00
michost1-mic0: Coprocessor Type : 0x00
michost1-mic0: Coprocessor Family : 0x0b
michost1-mic0: Coprocessor Family Ext : 0x00
michost1-mic0: Coprocessor Stepping : B1
michost1-mic0: Board SKU : B1PRQ-5110P
michost1-mic0: ECC Mode : NotAvailable
michost1-mic0: SMC HW Revision : NotAvailable
michost1-mic0: Total No of Active Cores : 60
michost1-mic0: Voltage : 0 uV
michost1-mic0: Frequency : 1052631 kHz
michost1-mic0: GDDR Vendor : Elpida
michost1-mic0: GDDR Version : 0x1
michost1-mic0: GDDR Density : 2048 Mb
michost1-mic0: GDDR Size : 7936 MB
michost1-mic0: GDDR Technology : GDDR5
michost1-mic0: GDDR Speed : 5.000000 GT/s
michost1-mic0: GDDR Frequency : 2500000 kHz
michost1-mic0: GDDR Voltage : 0 uV
Before running rcons command against mic node, you need make sure the screen rpm package has been installed on 'host node' (for both stateful and stateless host node)
rcons michost1-mic0
The 'Xeon Phi Card' has 'bootloader' (SMC) and 'firmware' (FLASH) which are used to boot mic node to standby stat. rflash command can be used to update the 'bootloader' and 'firmware' in the flash for the 'mic node'. By default, the up to date firmware will be searched from '/usr/share/mpss/flash' which installed by mpss kit for 'host node'.
rflash michost1-mic0
/etc/mpss/*.conf - the configuration files of mic node
/var/mpss/ - the source files for ramfs of mic node
/tmp/mictmp/miccfg.<host node> - the configuration file that is used by xCAT to configure mic node
/tmp/mictmp/micflash.<host node> - the configuration file that is used by xCAT to flash mic node
/var/log/xcat/configmic.log - the log file for the steps that how xCAT configures mic node
/var/log/xcat/flash.log - the log file for the steps that how xCAT flash mic node
<xcat management node>:/install/mpss3.1 on /var/mpss/mnt type nfs
Check the nfs server has been started correctly.
Check the /etc/fstab on mic node has correct entries.
Run 'show mount -e <nfs server>' to check the exported directories on nfs server.
Symptom:
Your mic card cannot be found or used by the micctrl command or mpss failed to start that it complains the mic card is not installed.
In the syslog you can see the error message of kernel like following:
kernel: mic 0000:2a:00.0: device not available because of BAR 0 [0x000000-0x1ffffffff] collisions
kernel: pci_enable failed board #0
kernel: mic: probe of 0000:2a:00.0 failed with error -22
Solution:
Generally you need enable the MMIO for larger than 4GB
Enable large BAR setting: Integrated Devices >> Memory I/O larger than 4GB in Bios Settings >> Enabled
If your are using IBM server, try to update the UEFI firmware to the latest version
1. Download the firmwares from IBM web page
IBM web site -> support -> download fix, driver -> select the machine type you want to update
2. Download the firmwares for uEFI and IMM
The files has suffix .uxz, .xml need be downloaded
Note: you also could try to download the UpdateXpess service packs which includes all the firmwares for the target nodes. It's a little big. If you don't know which one to update or when you want to update all the firmware, do it this way.
3. Download the UXSP which will be used to run the update
Download it from here: [UXSPI](http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=SERV-XPRESS)
4. Transport all the firmware files and uxsp(bin file) to the target node, and run following command to do the update
./ibm_utl_uxspi_9.41_rhel6_32-64.bin update --local=./ -L -u
5. Reboot the node for the firmware to take effect.
Symptom:
For any Xeon Phi card issue that you do NOT know how to handle like following cases:
1. Xeon Phi card fails to reset: reset failed
2. mic* command: No devices found : opendir: /sys/class/mic: No such file or directory
3. service mpss unload: Removing MIC Module: FATAL: Module mic is in use
4. rflash failed to finish the update for SMC or flash and it causes the mic node does not work.
Solution:
Just recycle the host node. The best approach is unplug the power cable and waiting for a while, then plug back the power cable.
Wiki: Stateless
Wiki: XCAT_2.8.2_Release_Notes
Wiki: XCAT_2.8.3_Release_Notes
Wiki: XCAT_Documentation