Free Happy Hour Is Over (Sorry!)
These days it seems like every bug needs a flashy name, so we've dubbed the billing bug we recently fixed "Free Happy Hour". Free Happy Hour had a good long run and it's yet another example of the difficulty of dealing with time.
The Short Version
When calculating the daily usage for each WonderProxy account, we inadvertently excluded midnight of the previous day from our calculations. Thus, we were only billing for 23 hours out of the day, making traffic between midnight and 1am "free" for our customers. Whoops.
Fixing this bug has likely caused some customers' bills to increase by a small percentage going forward. We made the fix on November 21st, but didn't go back and correct the earlier data, so no one's getting a shocking bill for the last n
months of happy hours. It only seems fair since it was our math mistake.
The Longer Version
A few months ago we noticed that some of the overages we were calculating for our customers did not match the account data. The difference was small, and it was always in our favor, so we wanted to dig into things.
Our billing system is fairly simple. The proxies record information about each connection including username and amount transferred. We sum those up in hourly buckets in our usage
table, and in daily buckets in our sum_usage_daily
table. All of the data on our website and invoices is pulled from the daily table, because no one really wants to sit around and wait for a query against a table with hundreds of millions of rows.
Due to timing issues related to calculating overages and processing our logs, we know that it is possible for a customer's usage to increase after we'd already processed their bill. A small error in the customer's favor is okay with us and couldn't be the culprit for this bug.
Data vs Calculation
Our first assumption was that the website and billing code were using different but similar methods to determine how much traffic each customer used on a given server within the month. A reasonable concern (we may very well have used different methods in the past) but thankfully, the website and billing call were calling the exact same function with the exact same parameters.
With the code identical, we sighed, "it must be the data".
As alluded to, we use two tables for tracking usage. The usage
table is effectively immutable: records are added to it and never modified. If we receive more logs for a previous hour, we add additional records. It seemed unlikely that this table could be the cause of a problem.
Conversely, the sum_usage_daily
table is not immutable: records are added and updated every hour so that we have a single daily record for each server on a customer's account. If there was going to be a problem, it had to be with sum_usage_daily
.
As we dug in to this bug we noticed that charges would match the account data on the day they were posted, but there would be a difference the next day. This turned out to be critically important and caused us to look more carefully at how we updated the values in sum_usage_daily
.
Careful With Dates
The processing of our log files can sometimes be delayed if we're having trouble getting data from the proxy due to maintenance, outages, etc. To compensate for log processing delays, when we update sum_usage_daily
we actually update the sums for past two days rather than just the most recent 24 hours. If we hadn't been updating a period greater than 24 hours, we may not have noticed that correct numbers were becoming incorrect during the updates.
This was part of the WHERE
clause on the query that updated the daily values:
timestamp > date_sub(date(NOW()), INTERVAL 2 DAY)
timestamp
is a MySQL TIMESTAMP that stores the sum for each hour that is parsed out of our logs. The values look like 2016-10-01 23:00:00
and 2016-10-02 00:00:00
. At this point you may have spotted our problem: midnight on a day is not strictly greater than that day.
Solution & Testing
The solution here appears very simple: just change >
to >=
and you're done. Indeed, that was the solution, but this is our billing code and there was understandably some hesitance about changing things on a whim. How could we prove that this change corrected our bug but didn't affect the data for other days?
Our problem was that the sum for the oldest day in the WHERE
clause was wrong. So if we ran the query for 2 days, Monday would be 20GB and Tuesday 15GB, but if we ran the query for 3 days, Sunday would be 17GB, Monday 21GB, and Tuesday 15GB. The sum for Monday should have been the same both times as it was for Tuesday.
With that in mind, we were able to ensure that the old update query did in fact cause the sums to be different and that adding the =
caused the sums to remain consistent and correct.
Having fixed the bug and ensured that the fix did fix it, all that was left was to come up with a catchy name…