We've been running our global ping statistics for years. Every hour we have every server on our network ping every other server, and record that data in MySQL. In fact, generating these ping stats led to one of our first blog posts. But as our network has grown, so has the burden of keeping all that data around – currently ~1B rows or ~30GB of data with ~15GB of indexes.
"So you have a lot of data", you say. Indeed we do! And it's affecting us in a few ways:
- Our nightly backup is full, not incremental. So every night we're backing up that 30GB, which means that
- Backups take hours (wall time)
- Backups use considerable I/O for those hours
- We're using & paying for a fair amount of space in S3 & Backblaze for data that hasn't changed, in some cases for years.
- Attempts to use the data can take hours (it's indexed for the exact way we use it on the site, nothing else)
- Using the data can block inserts, which makes other things really grumpy.
So we wanted to reduce the burden on our systems without deleting the interesting ping data. It seemed like there must be a good cloud solution for this, but we weren't quite sure what it was. I asked my followers on Twitter, and we received two great suggestions: AWS Simple Storage Service (S3) with Athena, and Google BigQuery. The two products appeared to offer similar solutions. We're already using S3 in our backup system, so that's the direction we went.
S3 & Athena
These tools are pretty great. S3 allows you to upload arbitrary files to buckets. These files can be private, public, or managed through complicated Access Control Lists (ACLs). We use them for backups and for our global screenshot service, ShotSherpa. Amazon Athena lets you take files you've uploaded to S3 as CSV, JSON, Apache common log, etc., then query those files as if they were a well-indexed database. It's kind of magic really.
Uploading to S3
We decided to shard our data into CSV files by date (e.g. 2016-01-01.csv). We originally considered grouping data by month (2016-01.csv), which would have given us far fewer files, but would have either: A) had us overwriting the current month's data every day, or B) left us unable to access the current month's data in Athena.
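The day-per-file scheme is easy to generate mechanically. As a rough sketch (the start date and day count here are placeholders, not our actual range), GNU date can produce the shard names:

```shell
#!/bin/bash
# Print one day-sharded CSV filename per day, starting from START.
# START and the seq range are placeholder values for illustration.
START='2016-01-01'
for i in $(seq 0 2)
do
    DAY=$(date --date="$START +$i days" +%Y-%m-%d)
    echo "$DAY.csv"
done
```

The same loop drives the export and upload further down; the filename doubles as both the local outfile name and the S3 key.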
To do the upload, we bit the bullet and let MySQL index the timestamp column for a few hours, then got our export together, which resembled the following:
mysql -u export -e "SELECT * FROM network_pings
    WHERE timestamp >= '2015-01-01' AND timestamp < '2015-01-02'
    INTO OUTFILE '/var/lib/mysql-files/2015-01-01.csv'
    FIELDS TERMINATED BY ',' ENCLOSED BY ''
    LINES TERMINATED BY '\n';" wonderproxy
The export file is then ready to be uploaded to S3; for that we use the incredibly handy s3cmd.
s3cmd put ./2015-01-01.csv s3://wn-ping-data/2015-01-01.csv
s3cmd uploads the file to S3; once the upload succeeds we delete the local copy. To upload all of our legacy data, we scripted the export.
Setting Up Athena
Using Athena is remarkably similar to using any regular SQL system: first you define tables, and then you can query the data. The only difference from normal SQL workflows is that you never insert data.
Defining the tables was a bit of a pain: I found the AWS documentation on the available data types hard to use and harder to find (I ended up at the Hive documentation instead). Eventually I was able to define my table in the same form as it exists locally.
CREATE EXTERNAL TABLE `pings`(
    `source` int,
    `destination` int,
    `timestamp` timestamp,
    `min` double,
    `avg` double,
    `max` double,
    `mdev` double)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://ping-datas/'
TBLPROPERTIES ('has_encrypted_data'='false')
Migrating Old Data
We had a few years of data to migrate and didn't want to babysit the process. Bash to the rescue!
#!/bin/bash
BEGIN='2017-04-12'
for i in `seq 0 19`
do
    DATE_START=`date --date="$BEGIN +$i days" +%Y-%m-%d`
    DATE_STOP=`date --date="$DATE_START +1 days" +%Y-%m-%d`
    echo $DATE_START $DATE_STOP
    S3CONF=/etc/backup/s3cfg
    OUTFILE="/var/lib/mysql-files/$DATE_START.csv"
    BUCKET="s3://wn-ping-data/$DATE_START.csv"
    mysql --defaults-file=/etc/mysql/debian.cnf -e "SELECT * FROM network_pings
        WHERE timestamp >= '$DATE_START' AND timestamp < '$DATE_STOP'
        INTO OUTFILE '$OUTFILE'
        FIELDS TERMINATED BY ',' ENCLOSED BY ''
        LINES TERMINATED BY '\n';" wonderproxy
    s3cmd -c $S3CONF put $OUTFILE $BUCKET
    rm $OUTFILE
done
A day's migration took about 6 minutes. Surprisingly, the script ran a little faster as it got closer to the present: more data, but presumably easier for MySQL to find.
We ran into a number of problems along the way:
- After a year's data had been migrated, I attempted to delete it from our MySQL database. The deletion took ~24 hours, and the background process to clean up after such a long-running transaction took just over a day. That was more than 48 hours of an overloaded MySQL server that I hadn't planned for with a "quick delete".
- Creating the index on timestamp while data was being inserted slowed the insertion of new ping data considerably, nearly exhausting available connections.
- After the disastrous attempt to delete one year, we were afraid to delete more. We ended up creating a new table, copying some data over, renaming everything, then dropping the old table. The drop table took ~3 minutes to run, blocking all other queries while it ran. Still better than 2 days of MySQL load warnings, but rather inelegant.
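For anyone facing the same purge, the swap amounts to something like the following (the table name matches our schema, but the cutoff date is a placeholder; this is a sketch, not exactly what we ran, so test on a copy first):

```sql
-- Keep only recent rows: build a fresh table, copy what we keep,
-- then swap names atomically and drop the old table.
CREATE TABLE network_pings_new LIKE network_pings;

INSERT INTO network_pings_new
    SELECT * FROM network_pings
    WHERE timestamp >= '2017-01-01';  -- placeholder cutoff

RENAME TABLE network_pings TO network_pings_old,
             network_pings_new TO network_pings;

DROP TABLE network_pings_old;
```

MySQL's multi-table RENAME TABLE is atomic, so inserts see either the old table or the new one, never a gap; the blocking cost is concentrated in the final DROP.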
We're incredibly happy with how things stand now:
- Nightly backups take ~6 hours less than they used to.
- Access to the ping data locally is faster. Pages like the ping between Toronto & Sydney load much faster.
- We're able to access our ping data.
select count(*) from pings;
1,048,048,968
select ssource.name, sdest.name,
    min(min) as min, avg(avg) as avg, max(max) as max, avg(mdev) as mdev
from pings
inner join servers ssource on pings.source = ssource.id
inner join servers sdest on pings.destination = sdest.id
group by ssource.name, sdest.name;
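Athena drops each query's results into S3 as a CSV file, so once downloaded they're easy to post-process locally. A minimal sketch (the columns mirror the aggregate query above, but the rows here are made-up sample values, not real results):

```python
import csv
import io

# Sample rows standing in for a downloaded Athena results CSV.
# Columns mirror the aggregate query: source, destination, min, avg, max, mdev.
sample = io.StringIO(
    "source,destination,min,avg,max,mdev\n"
    "Toronto,Sydney,198.1,203.4,251.9,4.2\n"
    "Toronto,London,71.3,74.8,90.2,2.1\n"
)

reader = csv.DictReader(sample)
# Find the route with the lowest average latency in the result set.
best = min(reader, key=lambda row: float(row["avg"]))
print(best["source"], best["destination"], best["avg"])
```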
We wanted the data off our infrastructure, and that was certainly accomplished. Along the way we also loaded our MySQL server in new and interesting ways, which led to better monitoring and reporting. We're also reducing our backup cost (ping data is backed up once, not every single night forever), which is nice. As an unexpected bonus, we've also made our ping data much more accessible: queries that would have taken hours now execute in seconds.