sukhanov.net

NFS-root

I haven't posted many clustering articles here, but I've been doing a lot of cluster work recently: building a cluster for the SC07 Cluster Challenge, as well as rebuilding two clusters (Seawulf & Galaxy) from the ground up at Stony Brook University. I'll try to post more about this experience as time goes on.

We have about 235 nodes in Seawulf and 150 in Galaxy. To boot all the nodes we use PXE (netboot). This allows for great flexibility and ease of administration -- really, it's the only sane way to bootstrap a cluster. Our bootstrapping system used to be configured so that each machine would do a plain PXE boot and then a linuxrc script would download a compressed system image over TFTP, decompress it to a RAM disk and do a pivot root. This system works quite well, but it does have some deficiencies. It relies on many custom scripts to keep the boot images in working order, and many of these scripts are quite sloppily written, so if anything doesn't work as expected you have to spend some time trying to coax it back up. Anything but the most trivial system upgrade requires a reboot of the whole cluster (which purges the job queue and annoys users). On almost every upgrade something would go wrong and I'd have to spend a long day figuring it out. Finally, with this configuration you always have to be careful not to install anything that would bloat the system image -- after all, it's all kept in RAM, so a larger image means more wasted RAM.

During a recent migration from a mixed 32/64-bit cluster to a pure 64-bit system, I decided to re-architect the whole configuration to use NFS-root instead of linuxrc/pivot-root. I had experience with this style of configuration from a machine we built for the SC07 Cluster Challenge; however, that was a small cluster (13 nodes, 100 cores), so I was worried about whether NFS-root would be feasible in a cluster 20 times larger. After some pondering I decided to go for it. I figured that Linux does a good job of caching disk I/O in RAM, so any applications used regularly on a node would be cached on the node itself (and also on the NFS server). Furthermore, if the NFS server got overloaded, other techniques could be applied to reduce the load (staggered boot, NFS tuning, distributing the load across several servers, local caching for network file systems). And so I put the whole system together on a test cluster and installed the most important software: MPI, PBS (Torque + Maui + Gold), and all the bizarre configurations.

Finally, one particularly interesting day, this whole configuration got put to the test. I installed the server machines, migrated over all my configurations and scripts, and halted all the nodes. Then I started everything back up while monitoring the stress on the NFS-root server as 235 nodes each started asking it for hundreds of files. The NFS-root server behaved quite well: with only 8 NFS server threads, the system never went over 75% CPU utilization. The cluster did take a little longer to boot, though. I assume that with just 8 NFS threads the nodes spent most of that time standing in line waiting for their files to be served. Starting more NFS threads (64-128) should alleviate this, but it might put more stress on the NFS server, and since the same machine does a lot of other things I'm not sure it's a good idea. It's really a non-issue, since the cluster rarely gets rebooted, especially now that most of the system can be upgraded live without a reboot.
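For anyone who wants to experiment with the thread count, here is a minimal Python sketch of reading and changing it at runtime. It assumes a Linux NFS server with the nfsd control filesystem mounted at /proc/fs/nfsd, where (per nfsd(7)) the "threads" file reports the running thread count and accepts a new one. The helper names are made up for illustration; this is not how our servers are configured.

```python
from pathlib import Path

# Assumption: the nfsd control filesystem is mounted at /proc/fs/nfsd and
# exposes the current thread count in its "threads" file (see nfsd(7)).
NFSD_THREADS = Path("/proc/fs/nfsd/threads")


def read_nfsd_threads(path: Path = NFSD_THREADS) -> int:
    """Return the number of nfsd threads recorded in `path`."""
    return int(path.read_text().split()[0])


def set_nfsd_threads(count: int, path: Path = NFSD_THREADS) -> None:
    """Request `count` nfsd threads (needs root on a real server)."""
    if count < 1:
        raise ValueError("thread count must be positive")
    path.write_text(f"{count}\n")
```

Both helpers take the path as a parameter, so they can be tried against an ordinary file before being pointed at a live server. A change made this way does not survive a restart of the NFS service.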

There are a couple of things to consider if you want to NFS-root a whole cluster. You most likely want to export your NFS share read-only to all machines but one; you don't want all the machines hammering each other's files. This does require some trickery. You have to address the following paths:

/var
You cannot mount this on a local partition, because most package management systems make changes to /var and you would have to go far out of your way to keep the copies in sync. We use an init script which copies /varImage into a tmpfs (RAM file system) mounted on /var at boot.
/etc/mtab
This is a pain in the ass; I don't know whose great idea it was to have this file. It maintains a list of all currently mounted file systems (much the same information as /proc/mounts). In fact, the mount man page says: "It is possible to replace /etc/mtab by a symbolic link to /proc/mounts, and especially when you have very large numbers of mounts things will be much faster with that symlink, but some information is lost that way, and in particular working with the loop device will be less convenient, and using the 'user' option will fail." And that is exactly what we do. NOTE: autofs does not support the symlink hack; I have filed a bug in Debian.
/etc/network/run (this might be a debianism)
We use a tmpfs for this as well.
/tmp
We mount this on a local disk partition.
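
To make the /var and /etc/mtab handling concrete, here is a rough Python sketch of the two steps. It is not our actual init script (that would normally be a few lines of shell, e.g. a tmpfs mount followed by cp -a). The function names are invented, and the sketch assumes the tmpfs is already mounted on the target directory. Note that shutil.copytree keeps modes and timestamps but not file ownership, which a real /var copy would need to preserve.

```python
import shutil
from pathlib import Path


def populate_var(image: Path, target: Path) -> None:
    """Copy the pristine /var image into an already mounted tmpfs /var."""
    # symlinks=True copies links as links instead of following them;
    # dirs_exist_ok=True (Python 3.8+) allows copying into the existing
    # mount point. Ownership is NOT preserved by this call.
    shutil.copytree(image, target, symlinks=True, dirs_exist_ok=True)


def link_mtab(etc_dir: Path) -> None:
    """Replace etc_dir/mtab with a symlink to /proc/mounts."""
    mtab = etc_dir / "mtab"
    # is_symlink() also catches a dangling link, which exists() would miss.
    if mtab.is_symlink() or mtab.exists():
        mtab.unlink()
    mtab.symlink_to("/proc/mounts")
```

On a real node the calls would look like populate_var(Path("/varImage"), Path("/var")) at boot, while link_mtab(Path("/etc")) only needs to run once, from the single node that has the root file system mounted read-write.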

All in all, the NFS-root system works quite well. I bet that with some tweaking and a slightly more powerful NFS-root server (we're using a dual-socket 3.4 GHz Xeon with 2 MB of cache and 2 GB of RAM), the NFS-root way of bootstrapping a cluster can be pushed to serve over 1000 nodes. More than that would probably require distributing the load across several servers. By changing the exports on the NFS server, any one node can become a read-write node, and software can be installed or upgraded on it like on any regular machine; the changes propagate to all other nodes (minus daemon restarts). Later the node can be changed back to read-only -- all without a reboot.
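
As an illustration of the exports switch, here is a small Python sketch that builds the /etc/exports line for the shared root. The export path, network, host name and option list are all placeholders I made up, not our real configuration. As I understand exports(5), a single-host entry takes precedence over a network entry that matches the same client, which is what lets one node be read-write while the rest stay read-only.

```python
from typing import Optional


def exports_line(path: str, network: str, rw_host: Optional[str] = None,
                 options: str = "no_root_squash,sync,no_subtree_check") -> str:
    """Build one /etc/exports line: read-only for the whole network,
    plus an optional read-write entry for a single maintenance node."""
    clients = []
    if rw_host:
        # Assumption: the host entry wins over the network entry below.
        clients.append(f"{rw_host}(rw,{options})")
    clients.append(f"{network}(ro,{options})")
    return f"{path} " + " ".join(clients)


# Hypothetical names: /srv/nfsroot, 10.0.0.0/16 and node001 are examples only.
print(exports_line("/srv/nfsroot", "10.0.0.0/16", rw_host="node001"))
print(exports_line("/srv/nfsroot", "10.0.0.0/16"))
```

After writing the chosen line into /etc/exports, running exportfs -ra on the server re-reads the file. Calling the function again without rw_host gives the line that puts the node back to read-only.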

Debian LILUG News Software Super Computers 2008-03-02T13:25:11-05:00