On Managing 30+ Servers

by Will Roberts

When we started out, we only had a handful of servers, so I was setting each one up by hand and manually applying every configuration change to each server. That meant spending an hour or more on each new setup, then probably 30-45 minutes on each change, depending on its complexity. Setup time doesn’t really have a scalability problem, since it’s a one-off cost per server, though it does mean I can’t be doing something else at the same time. The bigger issue is rolling out a change to all our active servers: a 5-minute change suddenly becomes a 2.5-hour chore when you’ve got 30 servers.

At around 15 servers we reached a tipping point, and I realized I was going to need a more automated way to set up servers and roll out changes. I don’t know all the ins and outs of Bourne shell scripting, but I’ve managed to write some fairly creative scripts over time, so that’s where I started. Pushing a trivial update to existing machines is now a matter of running a script once per server, and since we know all our hostnames, we can simply loop over the hosts and run it against each one in turn. I’ve toyed with the idea of running the scripts in parallel (there shouldn’t be an issue), but for the moment I’ve left them serial so I can see the result for each box in turn.
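The parallel idea might look something like the sketch below. It is not part of our tooling: run_parallel and its log-directory argument are hypothetical, and it assumes per-host scripts that take the hostname as their first argument, like the ones in this post. Each host’s output goes to its own log file so results don’t interleave, and a one-line summary per host is printed once everything has finished.

```shell
#!/bin/sh
# Hypothetical sketch: run one per-host script against many hosts in parallel.
# Usage: run_parallel SCRIPT LOGDIR HOST...
# Each host's output is written to LOGDIR/HOST.log; a one-line summary per host
# is printed after all jobs finish. Returns the number of failed hosts.
run_parallel() {
  script=$1
  logdir=$2
  shift 2
  mkdir -p "$logdir" || return 1

  # Start every host in the background, remembering "PID:HOST" pairs.
  jobs_list=""
  for host in "$@"; do
    "$script" "$host" > "$logdir/$host.log" 2>&1 &
    jobs_list="$jobs_list $!:$host"
  done

  # Wait for each job in start order and report its exit status.
  failed=0
  for job in $jobs_list; do
    pid=${job%%:*}
    host=${job#*:}
    if wait "$pid"; then
      echo "$host: ok"
    else
      echo "$host: FAILED (see $logdir/$host.log)"
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}
```

The trade-off is the one mentioned above: instead of watching each box live, you get a summary at the end plus a log file per host to dig into.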

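#!/bin/sh

# Keeping the inventory host in one variable makes it easier to move the
# database to another box.
MYSQL_HOST="mysql.database"

# QUERY picks which hosts to run against; the calling script must set it.
: "${QUERY:?QUERY must be set by the calling script}"

SERVERS=$(mysql --host="$MYSQL_HOST" -u oursupersecretuser -s --skip-column-names -e "$QUERY" wonder_proxy)

# A line matching "HOST 2" in this file marks a host as known to be down.
export HOST_STATUS=/home/lilypad/billing/host_status.txt

SCRIPT=$1
shift

for i in $SERVERS
do
  # The socket path must match the ControlPath used for master connections.
  if [ -S "/home/lilypad/.ssh/master-wproxy@$i:22" ]; then
    : # A master connection is already up; nothing to do.
  elif ! grep -q "$i 2" "$HOST_STATUS" 2> /dev/null; then
    # Not known to be down: start a background master connection.
    nohup ssh -MNf "$i" < /dev/null > /dev/null 2>&1
  fi
  # Run the per-host script with the hostname followed by any extra arguments.
  "$SCRIPT" "$i" "$@"
done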

The script above is the basic loop I use to run my other scripts on our machines. The MYSQL_HOST variable makes it easier to migrate the database from one box to another, which has already happened (and was an absolute pain the first time). The custom query in QUERY lets other scripts call this one and select only certain portions of our network. Once we have the list of hosts, we make sure each one has an active SSH master connection, or try to start one if the host isn’t known to be down. Then the script is executed with the hostname as its first argument, followed by any extra arguments.
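The calling side isn’t shown here, but a wrapper such as the run_all_vpn.sh used further down presumably just sets QUERY before handing off to the loop. A minimal sketch written as a shell function, assuming the loop script is saved as ./run_all.sh; that file name, the RUN_ALL override, and the servers table with its role and active columns are all hypothetical, since the real schema isn’t part of this post.

```shell
#!/bin/sh
# Hypothetical wrapper: restrict the loop to VPN servers by setting QUERY.
# The table and column names below are made up for illustration, and the loop
# script's path (./run_all.sh) is an assumption; RUN_ALL can override it.
VPN_QUERY="SELECT hostname FROM servers WHERE role = 'vpn' AND active = 1"

run_all_vpn() {
  # A prefix assignment exports QUERY to the loop script for this one call
  # only, so the calling shell's environment is left untouched.
  QUERY="$VPN_QUERY" "${RUN_ALL:-./run_all.sh}" "$@"
}
```

Called as run_all_vpn ./push_ipsec_conf.sh, it behaves like the one-liner below; other subsets of the network would just be other wrappers with a different query.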

The per-host scripts are all fairly simple, and I tend to reuse (and mangle) them for other purposes as needed. Here’s an example:

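#!/bin/sh
# Push our ipsec.conf to host $1 and restart IPsec there.
# Stop at the first failure so IPsec is never restarted after a failed copy.
set -e

scp /data/proxy-setup/ipsec/etc/ipsec.conf "$1:~/"
ssh "$1" sudo cp ipsec.conf /etc/
ssh "$1" rm ipsec.conf
ssh "$1" sudo /etc/init.d/ipsec restart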

Pretty simple, but it’s nice not to type those 4 commands 37 times whenever I make a tiny change. To push it, I just run:

./run_all_vpn.sh ./push_ipsec_conf.sh
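Since these scripts get reused for other config files, the pattern could also be captured once. A hypothetical generalization (the push_config and run names and the DRY_RUN switch are not from our scripts): copy a file to the host’s home directory, move it into place with sudo, clean up, and restart a service, stopping at the first failed step. With DRY_RUN=1 it prints the commands instead of running them, which is a cheap way to check a change before it goes to every box.

```shell
#!/bin/sh
# Hypothetical generalization of push_ipsec_conf.sh.
# Usage: push_config HOST LOCAL_FILE DEST_DIR SERVICE
# Set DRY_RUN=1 to print each command instead of executing it.

# Run a command, or just echo it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

push_config() {
  host=$1
  src=$2
  dest_dir=$3
  service=$4
  name=$(basename "$src")

  # Each step runs only if the previous one succeeded, so the service is
  # never restarted after a failed copy.
  run scp "$src" "$host:~/" &&
    run ssh "$host" sudo cp "$name" "$dest_dir/" &&
    run ssh "$host" rm "$name" &&
    run ssh "$host" sudo "/etc/init.d/$service" restart
}
```

With it, the body of push_ipsec_conf.sh would reduce to push_config "$1" /data/proxy-setup/ipsec/etc/ipsec.conf /etc ipsec.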

The ssh -MNf command in the first script opens a master connection that lets later SSH sessions to the same host share a single TCP connection. Each subsequent connection then skips both the TCP handshake and the SSH handshake (key exchange and authentication). The flags, as explained by the ssh(1) man page:

-M
Places the ssh client into “master” mode for connection sharing. Multiple -M options places ssh into “master” mode with confirmation required before slave connections are accepted. Refer to the description of ControlMaster in ssh_config(5) for details.
-N
Do not execute a remote command. This is useful for just forwarding ports (protocol version 2 only).
-f
Requests ssh to go to background just before command execution. This is useful if ssh is going to ask for passwords or passphrases, but the user wants it in the background.
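The -M/-N/-f combination only pays off if later ssh and scp calls can find the master’s control socket. The loop script checks for ~/.ssh/master-wproxy@HOST:22, which suggests a ControlPath of ~/.ssh/master-%r@%h:%p in ssh_config, but that is an inference rather than something shown here. A sketch that passes the path explicitly with -o and uses ssh -O check to ask whether a master is already running, instead of testing for the socket file (ensure_master and CONTROL_PATH are hypothetical names):

```shell
#!/bin/sh
# Hypothetical helper for connection sharing. The ControlPath pattern is an
# assumption inferred from the socket name the loop script tests for
# (master-wproxy@HOST:22, i.e. %r@%h:%p).
CONTROL_PATH="$HOME/.ssh/master-%r@%h:%p"

# Start a background master connection to HOST unless one is already running.
ensure_master() {
  host=$1
  # -O check asks an existing master whether it is alive (exit status 0 if so).
  if ssh -o ControlPath="$CONTROL_PATH" -O check "$host" 2> /dev/null; then
    return 0
  fi
  # -M: master mode, -N: no remote command, -f: go to background.
  ssh -o ControlPath="$CONTROL_PATH" -MNf "$host" < /dev/null > /dev/null 2>&1
}

# Later sessions that use the same ControlPath ride the existing connection,
# e.g. (not run here):
#   ssh -o ControlPath="$CONTROL_PATH" "$host" uptime
#   scp -o ControlPath="$CONTROL_PATH" somefile "$host":
```

Putting the same ControlPath in ssh_config instead means plain ssh and scp calls, like the ones in push_ipsec_conf.sh, reuse the master without any extra flags.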