We use SystemImager to maintain (rebuild) our small HPC clusters. Conceptually it's very simple:

  1. Build a node (the 'golden client') just the way you want it.
  2. si_prepareclient on the golden client starts rsyncd on the node, making it accessible to the 'image server'.
  3. si_getimage on the image server copies the entire node into a directory, and analyzes it to produce a script that will recreate the image (with exclusions for files which should differ between nodes).
  4. si_updateclient on a target node fetches the script from the image server; the script configures the target (disk partitioning, etc.) and fetches the image contents, making the target match the golden client.
  5. If the node is dead or brand-new, there's a DHCP/PXE/TFTP process for bootstrapping far enough to run the script and then match the golden client.
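
As a rough sketch, steps 2 through 4 boil down to one command on each machine. The host and image names below are made up, and the flags are the ones I recall from the SystemImager man pages, so check each tool's --help against your SI version before relying on them. The function only builds the command lines; it does not run anything.

```python
import shlex


def imaging_commands(golden_client, image_server, image_name):
    """Return the command to run on each machine, as argument lists.

    All three arguments are hypothetical names supplied by the caller.
    The flags are assumptions based on the SystemImager man pages.
    """
    return {
        # Step 2: start rsyncd on the golden client for the image server.
        "golden client": ["si_prepareclient", "--server", image_server],
        # Step 3: pull the golden client's files into an image directory.
        "image server": ["si_getimage", "--golden-client", golden_client,
                         "--image", image_name],
        # Step 4: make a target node match the stored image.
        "target node": ["si_updateclient", "--server", image_server,
                        "--image", image_name],
    }


def show(commands):
    """Format the commands as shell-quoted lines, one per machine."""
    return "\n".join(
        f"{where}: {shlex.join(argv)}" for where, argv in commands.items()
    )
```

Calling show(imaging_commands("node01", "imgsrv", "compute-image")) returns one shell-quoted line per machine, which is handy for pasting into a runbook.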

Once the SI system is all set up, it's quick & easy to rebuild nodes. Unfortunately there are several complications:

  • The DHCP & TFTP dependencies are somewhat complicated, so bringing up SI without breaking anything is tricky. TFTP & pxelinux are not terribly well documented.
  • The "Latest Stable Release" is SystemImager 4.0.2 from December 2007. One of the key components of SystemImager is a generic kernel & Linux initrd (initial RAMdisk) which include a default set of drivers. But the release is so old that it cannot handle current hardware. There are several newer development versions but they're not fully baked and choosing between them is confusing.
  • SI doesn't yet support grub2 or ext4, which are required for large disks (GPT partition tables).

The workaround I got from the very helpful folks on sisuite-users@ was to use SALI, a modern kernel/initrd pair for SystemImager. Unfortunately SALI's a bit different -- in the process of adding grub2 support, they broke compatibility with the scripts that SI generates. Here's a quick recap of the steps I used (mostly from sisuite-users@) to use SALI:

  • Drop the 2 SALI files into the TFTP directory (normally /var/lib/tftpboot/ or /tftpboot/).
  • Specify the SALI files in /var/lib/tftpboot/pxelinux.cfg/default or equivalent.
  • Add a couple lines to /etc/dhcpd.conf.
  • Set SCRIPTNAME= in pxelinux.cfg/default.
  • In the script created by SI:
    • Change DISK_SIZE entries to "DISK_SIZE=$(get_disksize $DISK0)".
    • Remove -v1 from mkswap arguments.
    • Add -I 128 to mke2fs for the /boot FS.
    • Remove "-o defaults" from mount commands.
    • SystemImager's final line in the script is "shutdown -r now", which fails on SALI. Use "reboot" instead until SALI 1.3, which should support shutdown.
  • On our newer cluster, SALI does bizarre things with console redirection. I had to type into the (virtual VGA) console, while output appeared on the serial console. The serial console recognized and echoed my input, but did not execute it.
  • (Not SALI related): Make sure the scripts (normally in /var/lib/systemimager/scripts) are executable -- SI left mine non-executable for some reason.
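
The script edits above are mechanical enough to automate. Here is a minimal sketch in Python, assuming each command sits on its own line in the generated script and that you know which device holds /boot. The function names and the boot_dev parameter are mine, not part of SI or SALI, and the exact line formats SI emits may differ from what these patterns expect, so diff the result before trusting it.

```python
import os
import re


def patch_for_sali(script_text, boot_dev):
    """Apply the SALI edits to the text of an SI-generated script.

    boot_dev is the device holding /boot (for example "/dev/sda1");
    it is a hypothetical parameter the caller must supply.
    """
    patched = []
    for line in script_text.splitlines():
        # Let SALI work out the disk size itself.
        if re.match(r"\s*DISK_SIZE=", line):
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + "DISK_SIZE=$(get_disksize $DISK0)"
        # Drop -v1 from mkswap arguments.
        if re.search(r"\bmkswap\b", line):
            line = re.sub(r"\s+-v1(?=\s|$)", "", line)
        # Add -I 128 to the mke2fs call that formats /boot.
        if (re.search(r"\bmke2fs\b", line) and boot_dev in line.split()
                and "-I 128" not in line):
            line = re.sub(r"\bmke2fs\b", "mke2fs -I 128", line, count=1)
        # Remove "-o defaults" from mount commands.
        if re.search(r"\bmount\b", line):
            line = re.sub(r"\s+-o\s+defaults(?=\s|$)", "", line)
        # "shutdown -r now" fails on SALI; reboot works.
        if line.strip() == "shutdown -r now":
            line = line.replace("shutdown -r now", "reboot")
        patched.append(line)
    return "\n".join(patched) + "\n"


def make_executable(path):
    """Add execute permission for user, group and other to a script."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | 0o111)
```

To use it, read a script from /var/lib/systemimager/scripts, write the output of patch_for_sali() to a new file, compare the two, and then call make_executable() on the one you keep.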