The Problem With Time
A frequent issue with our hosts is that the proxies have the incorrect time. This makes billing unreliable and makes issues harder to track down, since the timestamps in the logs are wrong. The best part is that the providers with this problem rarely seem to care. Depending on the type of VPS we have with the provider, there are two solutions.
The first option is to bug the provider. If they’re running an NTP daemon then all of the nodes will inherit the correct time and everyone wins. Most of the time, though, we can’t get them to do this.
The second option is to run our own NTP daemon, which we’re fine with, but this sometimes still requires help from the provider. If our VPS is based on OpenVZ or Virtuozzo, the provider can set an option that allows our node to maintain its own wallclock. However, if our VPS is based on Xen, we can do it ourselves like so:
echo 1 > /proc/sys/xen/independent_wallclock
And more permanently by adding this line to /etc/sysctl.conf:
xen.independent_wallclock = 1
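To avoid adding the same line twice when a setup script runs more than once, the sysctl.conf edit can be wrapped in a small helper. This is only a sketch: the function name ensure_sysctl_line is made up for illustration, and it assumes the guest kernel actually exposes xen.independent_wallclock.

```shell
#!/bin/sh
# ensure_sysctl_line FILE KEY VALUE
# Hypothetical helper: appends "KEY = VALUE" to FILE unless a line that
# already sets KEY is present. The dots in KEY are treated as regex
# wildcards by grep, which is harmless for this purpose.
ensure_sysctl_line() {
    file=$1
    key=$2
    value=$3
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file" 2>/dev/null; then
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Usage on a Xen guest, as root (left commented out on purpose):
# ensure_sysctl_line /etc/sysctl.conf xen.independent_wallclock 1
# sysctl -p
```

Running sysctl -p afterwards reloads /etc/sysctl.conf, so the setting takes effect without a reboot.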
We use SNMP to query the time on our machines using a custom OID that just returns the current time as UTC seconds since the epoch. We then compare that number to the same value from our monitoring machine, which is synched to the global NTP servers. Differences of up to 5 seconds either way are considered acceptable, though obviously we’d prefer it were always dead on. We consider an offset of 5-60 seconds to be a warning state, and anything beyond 60 seconds to be an error state.
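The comparison itself is simple enough to sketch in a few lines of shell. The function name classify_offset is hypothetical, and it takes the two epoch values as arguments rather than doing the SNMP query itself. How the exact boundaries are handled (5 counts as OK, 60 counts as a warning) is an assumption.

```shell
#!/bin/sh
# classify_offset REMOTE_EPOCH LOCAL_EPOCH
# Hypothetical helper: prints OK, WARNING, or ERROR based on how far the
# remote machine's clock is from the monitoring machine's clock.
classify_offset() {
    remote=$1
    local_now=$2
    diff=$((remote - local_now))
    # The direction of the drift does not matter, only its size.
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    if [ "$diff" -le 5 ]; then
        echo OK
    elif [ "$diff" -le 60 ]; then
        echo WARNING
    else
        echo ERROR
    fi
}
```

In a real check the first argument would be the value returned by the custom OID and the second would be the output of date +%s on the monitoring machine.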
If the time is properly synched and we get complaints that something wasn’t working for our customers at a given time, we know exactly what time we should be looking at in the logs on the machine. When the time is wrong, we have to start doing some mental math to figure out when that proxy thought the error event was, and if it communicates with another machine whose time is also off, we then have to do another calculation to figure out where the matching entries would be in that machine’s logs. So each machine in the chain that has the incorrect time makes it more difficult for us to track down what might have gone wrong.
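To make that mental math concrete, here is a small sketch with made-up numbers. The function name to_local_log_time and the offsets are purely illustrative. An offset here means the machine's clock minus the correct time, so a negative offset is a clock running behind.

```shell
#!/bin/sh
# to_local_log_time REPORTED_EPOCH OFFSET_SECONDS
# Hypothetical helper: converts the true time of an event into the
# timestamp a machine with the given clock offset would have logged.
to_local_log_time() {
    echo $(( $1 + $2 ))
}

reported=1700000000   # when the customer says the problem happened (example value)
proxy_offset=-42      # proxy clock is 42 seconds behind (example value)
backend_offset=95     # second machine is 95 seconds ahead (example value)

echo "proxy log time:   $(to_local_log_time "$reported" "$proxy_offset")"
echo "backend log time: $(to_local_log_time "$reported" "$backend_offset")"
```

With these example offsets, a single event shows up 137 seconds apart in the two logs, which is exactly the kind of gap that makes a chain of requests hard to follow.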