dnsmasq -- buy 1 get 2 free!

I mentioned earlier that we netboot (PXE) our cluster. Before NFS-root takes over, a few things have to happen: the kernel needs to be served, an IP assigned, and DNS look-ups made to figure out where the servers are, and so on. Three protocols are in the mix at this stage: TFTP, DHCP, and DNS. We used to run three individual applications to handle all of this -- atftpd, BIND9, and ISC DHCP -- each a perfectly fine application in its own right. But it becomes too much to look after: a config file for each daemon, plus databases with node information. Our configuration used MySQL and PHP to generate the databases for these daemons, so that only one central configuration had to be maintained -- which in turn meant yet another daemon to keep running. Add it all together and it becomes one major headache.

Several months ago I had installed OpenWrt onto a router at home. While configuring it I came across something called dnsmasq. By default, OpenWrt uses dnsmasq to handle both DNS and DHCP. I thought it was spiffy to merge the two services -- after all, they are so often run together (on internal networks). The name stuck in my head as something to pay a bit more attention to. Somewhere along the line I got some more experience with dnsmasq and discovered it also had TFTP support. Could what we were using four daemons for be accomplished with just one?

So when the opportunity arose I dumped all the node address information out of the MySQL database into a simple awk-parsable flat file. I wrote a short parsing script which takes this central database and spits out one file, dnsmasq.hosts (with name/IP pairs), and another, dnsmasq.nodes (with MAC-address/name pairs). Finally I configured the master (static) dnsmasq.conf to start all the services I needed (DNS, DHCP, TFTP) and to include the dnsmasq.hosts and dnsmasq.nodes files. Since dnsmasq.nodes includes a category tag, it is trivial to tell which group of nodes should get which TFTP images and what kind of DHCP leases they should be served.

Dnsmasq couldn't offer a simpler or more intuitive configuration; with half a day's work I was able to greatly improve upon the old system and make it a lot more manageable. I have only one gripe with dnsmasq: I wish it were possible to have just one configuration line per node, that is, the name, IP, and MAC address all on one line. If that were the case I wouldn't even need an awk script to generate the config files (although the script turned out to be handy, because I also use the same flat file to generate a node list for torque). But it's understandable, since there are installations that run only a DHCP server or only a DNS server, and there having DHCP and DNS information on one line wouldn't make much sense.

Scalability is something to consider with dnsmasq. Its website claims it has been tested with installations of up to 1000 machines, which may or may not be a problem depending on what type of configuration you're building. I do wonder what happens at the thousands-of-machines level: how will its performance degrade, and how does that compare to the other TFTP/DHCP/DNS servers (BIND9 is known to cope quite well with a lot of data)?

Here are some configuration examples:

Master Flat file node database

#NODES file; it needs to be processed by nodesFileGen
#nodeType nodeIndex nic# MACAddr

nfsServer 01 1
nfsServer 02 1

headNode 00 1 00:00:00:00:00:00

#Servers based on the supermicro p2400 hardware (white 1u supermicro box)
server_sm2400 miscServ 1 00:00:00:00:00:00
server_sm2400 miscServ 2 00:00:00:00:00:00

#dual 2.4ghz supermicro nodes
node2ghz 01 1 00:00:00:00:00:00
node2ghz 02 1 00:00:00:00:00:00
node2ghz 03 1 00:00:00:00:00:00
...[snip]...

#dual 3.4ghz dell nodes
node3ghz 01 1 00:00:00:00:00:00
node3ghz 02 1 00:00:00:00:00:00
node3ghz 03 1 00:00:00:00:00:00
...[snip]...

Flat File DB Parser script

#!/bin/bash

#input sample
#type number nic# mac addr
#nodeName 07 1 00:00:00:00:00:00

#output sample
#ip hostname
#10.0.103.10 nodeName10
awk '
  /^headNode.*/      {printf("10.0.0.3 %s\n", $1)};
  /^server_sm2400.*/ {printf("10.0.3.%d %s\n", $3, $2)};
  /^nfsServer.*/     {printf("10.0.1.%d %s%02d\n", $2, $1, $2)};
  /^node2ghz.*/      {printf("10.0.100.%d %s%02d\n", $2, $1, $2)};
  /^node3ghz.*/      {printf("10.0.101.%d %s%02d\n", $2, $1, $2)};
' ~/data/nodes.db > /etc/dnsmasq.hosts

#output sample
#mac,netType,hostname,hostname
#00:00:00:00:00:00,net:nodeName,nodeName10,nodeName10
awk '
  /^headNode.*/      {printf("%s,net:%s,%s,%s\n", $4, $1, $1, $1)};
  /^server_sm2400.*/ {printf("%s,net:%s,%s,%s\n", $4, $1, $2, $2)};
  /^node2ghz.*/      {printf("%s,net:%s,%s%02d,%s%02d\n", $4, $1, $1, $2, $1, $2)};
  /^node3ghz.*/      {printf("%s,net:%s,%s%02d,%s%02d\n", $4, $1, $1, $2, $1, $2)};
' ~/data/nodes.db > /etc/dnsmasq.nodes

#output sample
#hostname np=$CPUS type
#nodeName10 np=8 nodeName
awk '
  /^node2ghz.*/ {printf("%s%02d np=2 node2ghz\n", $1, $2)};
  /^node3ghz.*/ {printf("%s%02d np=2 node3ghz\n", $1, $2)};
' ~/data/nodes.db > /var/spool/torque/server_priv/nodes

#Let's reload dnsmasq now
killall -HUP dnsmasq
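
For reference, given the node2ghz entries in the database above, the generated files end up containing lines like these (MACs zeroed out here just as in the sample database):

#/etc/dnsmasq.hosts (name/IP pairs, pulled in via addn-hosts)
10.0.100.1 node2ghz01
10.0.100.2 node2ghz02

#/etc/dnsmasq.nodes (MAC, group tag, name -- pulled in via dhcp-hostsfile)
00:00:00:00:00:00,net:node2ghz,node2ghz01,node2ghz01
00:00:00:00:00:00,net:node2ghz,node2ghz02,node2ghz02

#/var/spool/torque/server_priv/nodes
node2ghz01 np=2 node2ghz
node2ghz02 np=2 node2ghz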

dnsmasq.conf

interface=eth0
dhcp-lease-max=500
domain=myCluster
enable-tftp
tftp-root=/srv/tftp
dhcp-option=3,10.0.0.1
addn-hosts=/etc/dnsmasq.hosts
dhcp-hostsfile=/etc/dnsmasq.nodes

dhcp-boot=net:misc,misc/pxelinux.0,nodeServer,10.0.0.2
dhcp-range=net:misc,10.0.200.0,10.0.200.255,12h

dhcp-boot=net:headNode,headNode/pxelinux.0,nodeServer,10.0.0.2
dhcp-range=net:headNode,10.0.0.3,10.0.0.3,12h

dhcp-boot=net:server_sm2400,server_sm2400/pxelinux.0,nodeServer,10.0.0.2
dhcp-range=net:server_sm2400,10.0.0.3,10.0.0.3,12h

dhcp-boot=net:node2ghz,node2ghz.cfg,nodeServer,10.0.0.2
dhcp-range=net:node2ghz,10.0.100.0,10.0.100.255,12h

dhcp-boot=net:node3ghz,node3ghz.cfg,nodeServer,10.0.0.2
dhcp-range=net:node3ghz,10.0.101.0,10.0.101.255,12h
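
The dhcp-boot filenames are relative to tftp-root, so the TFTP directory ends up laid out roughly like this (a sketch showing only the paths named in the config above):

/srv/tftp/
  misc/pxelinux.0
  headNode/pxelinux.0
  server_sm2400/pxelinux.0
  node2ghz.cfg
  node3ghz.cfg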
Debian  LILUG  News  Software  Super Computers  2008-03-13T00:30:40-04:00
NFS-root

I haven't posted many clustering articles here but I've been doing a lot of work on them recently, building a cluster for SC07 Cluster Challenge as well as rebuilding 2 clusters (Seawulf & Galaxy) from the ground up at Stony Brook University. I'll try to post some more info about this experience as time goes on.

We have about 235 nodes in Seawulf and 150 in Galaxy. To boot all the nodes we use PXE (netboot), which allows for great flexibility and ease of administration -- really it's the only sane way to bootstrap a cluster. Our bootstrapping system used to be configured so that a machine would do a plain PXE boot and then, using a linuxrc script, the kernel would download a compressed system image over TFTP, decompress it to a ram-disk and do a pivot root. This system works quite well but it has some deficiencies. It relies on many custom scripts to keep the boot images in working order, and many of those scripts are quite sloppily written, so if anything doesn't work as expected you have to spend some time coaxing it back up. Anything but the most trivial system upgrade requires a reboot of the whole cluster (which purges the job queue and annoys users), and on almost every upgrade something would go wrong and I'd lose a long day figuring it out. Finally, with this configuration you always have to be careful not to install anything that would bloat the system image -- after all, it is all kept in RAM, and a larger image means more wasted RAM.

During a recent migration from a mixed 32/64-bit cluster to a pure 64-bit system, I decided to re-architect the whole configuration to use NFS-root instead of linuxrc/pivot-root. I had experience with this style of configuration from a machine we built for the SC07 Cluster Challenge; however, that was a small cluster (13 nodes, 100 cores), so I was worried whether NFS-root would be feasible in a cluster 20 times larger. After some pondering I decided to go for it. I figured that Linux does a good job of caching disk IO in RAM, so any applications used regularly on each node would be cached on the nodes themselves (and also on the NFS server); furthermore, if the NFS server got overloaded, other techniques could be applied to reduce the load (staggered boot, NFS tuning, server distribution, local caching for network file systems). And so I put together the whole system on a test cluster and installed the most important software: MPI, PBS (torque + Maui + gold), and all the bizarre configurations.
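
The netboot side of NFS-root is nothing exotic: the nodes still PXE boot, the kernel just mounts its root over NFS instead of unpacking an image into RAM. A minimal pxelinux entry looks something like this (the server address, kernel names and export path here are illustrative, not our exact setup):

DEFAULT nfsroot
LABEL nfsroot
  KERNEL vmlinuz-amd64
  APPEND initrd=initrd.img-amd64 root=/dev/nfs nfsroot=10.0.0.2:/srv/nfsroot ip=dhcp ro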

Finally, one particularly interesting day this whole configuration got put to the test. I installed the server machines, migrated over all my configurations and scripts, halted all the nodes, and started everything back up -- while monitoring the stress the NFS-root server was enduring as 235 nodes started asking it for hundreds of files each. The NFS-root server behaved quite well: using only 8 NFS server threads, the system never went over 75% CPU utilization, although the cluster took a little longer to boot. I assume that with just 8 NFS threads most of the time the nodes were simply standing in line waiting for their files to be served. Starting more NFS threads (64-128) should alleviate this, but it might put more stress on the NFS server, and since the same machine does a lot of other things I'm not sure it's a good idea. Really it's a non-issue, since the cluster rarely gets rebooted, especially now that most of the system can be upgraded live without a reboot.
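
For what it's worth, bumping the thread count on Debian is a one-line change; a sketch, assuming the stock nfs-kernel-server packaging:

# /etc/default/nfs-kernel-server -- raise the thread count (default is 8)
RPCNFSDCOUNT=64
# or adjust it on a running server:
rpc.nfsd 64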

There are a couple of things to consider if you want to NFS-root a whole cluster. You most likely want to export your NFS share read-only to all machines but one, so that the machines aren't hammering each other's files. This does require some trickery; you have to address the following paths (a rough sketch of how these are handled follows the list):

  • /var
    You cannot mount this to a local partition, as most package management systems make changes to /var and you would have to go far out of your way to keep the copies in sync. We use an init script which takes /varImage and copies it into a tmpfs /var (RAM file system) on boot.

  • /etc/mtab
    This is a pain in the ass; I don't know whose great idea it was to have this file. It maintains a list of all currently mounted file systems (information not unlike that in /proc/mounts). In fact the mount man page says that "It is possible to replace /etc/mtab by a symbolic link to /proc/mounts, and especially when you have very large numbers of mounts things will be much faster with that symlink, but some information is lost that way, and in particular working with the loop device will be less convenient, and using the 'user' option will fail." And that is exactly what we do. NOTE: autofs does not support the symlink hack; I have filed a bug in Debian.

  • /etc/network/run (this might be a debianism)
    We use a tmpfs for this also

  • /tmp
    We mount this to a local disk partition
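
Pulled together, the boot-time fixups amount to something like this (a minimal sketch, not our exact init script; the device name and tmpfs size are examples):

#!/bin/sh
# run early at boot on every read-only NFS-root node

# /var: fresh tmpfs populated from the pristine /varImage copy
mount -t tmpfs -o size=128m tmpfs /var
cp -a /varImage/. /var/

# /etc/network/run: another tmpfs (debianism)
mount -t tmpfs tmpfs /etc/network/run

# /tmp: local disk partition
mount /dev/sda1 /tmp

# /etc/mtab is a symlink to /proc/mounts baked into the image itself:
#   ln -sf /proc/mounts /etc/mtab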

All in all the NFS-root system works quite well. I bet that with some tweaking and a slightly more powerful NFS-root server (we're using a dual-socket 3.4GHz Xeon with 2MB cache and 2GB of RAM) the NFS-root way of bootstrapping a cluster can be pushed to serve over 1000 nodes; more than that would probably require some distribution of the servers. By changing the exports on the NFS server, any one node can become a read-write node and software can be installed/upgraded on it like on any regular machine; the changes propagate to all other nodes (minus daemon restarts). Later the node can be switched back to read-only -- all without a reboot.
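
The exports side is just as simple; something along these lines (the subnet, export path, and the name of the read-write node are illustrative, and the more specific host entry is listed first):

# /etc/exports -- root image read-only for everyone, read-write for one build node
/srv/nfsroot  buildNode(rw,no_root_squash,sync)  10.0.0.0/255.255.0.0(ro,no_root_squash,async)

# re-export after editing, no restart needed:
exportfs -ra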

Debian  LILUG  News  Software  Super Computers  2008-03-02T13:25:11-05:00
