The Shadow Spec: When a bug becomes canon
Have you ever been tracing through legacy code and stumbled across an error that seemed almost too obvious? Fixing the incorrect behaviour may seem like the right thing to do, but that’s when you learn that some bugs have been around so long they have been promoted into the shadowy realm of undocumented specification. When a behaviour has existed long enough, right or wrong, it becomes an expected part of your software, and fixing it becomes perilous.
The bug
This particular enterprise software product featured a snapshot of data on every single page, showing contextually relevant information pulled from a huge table of intermingled data in a classic relational database. Because the table had grown exceptionally large (over 100 billion rows), the only way to access it in a reasonable timeframe was to query against predefined composite indices built from specific collections of columns. That gave quick access to the data, but only if the filtering and sorting in the query spec matched an index. So, to minimise the cost of hitting this huge table on every page load, each query used one of two very specific query specs (filter + sort) designed to line up perfectly with an existing database index. The code and the database were both deep legacy, written a decade earlier by people long since departed from the team.
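To make that index-alignment constraint concrete, here is a minimal, hypothetical sketch using Python’s built-in sqlite3 module. The table, columns, and index are invented for illustration and bear no resemblance to the real schema, but the query-planner behaviour is the same in spirit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (account_id INTEGER, created_at INTEGER, "
    "score REAL, payload TEXT)"
)
# A composite index: it only helps queries whose filter + sort line up with it.
conn.execute("CREATE INDEX idx_account_created ON events (account_id, created_at)")

aligned = ("SELECT payload FROM events "
           "WHERE account_id = ? ORDER BY created_at LIMIT 50")
misaligned = ("SELECT payload FROM events "
              "WHERE account_id = ? ORDER BY score LIMIT 50")

for label, sql in (("aligned", aligned), ("misaligned", misaligned)):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)):
        print(label, row)
# The aligned query walks idx_account_created and streams results out in
# order; the misaligned one still filters via the index but needs an extra
# sort step ("USE TEMP B-TREE FOR ORDER BY"), which is what gets expensive
# on a table with billions of rows.
```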
The mistake
Being the new Senior Developer on the team, I felt a lot of pressure to prove my worth and cement my position as a team leader. So I was always looking around for ways to improve the product. During an unrelated debug session, I discovered that one of the two query specs was not being honoured properly. The results were very close, but not technically correct, and definitely not in line with our documentation. I figured that somewhere in the last decade, this had been broken, and amazingly, nobody had noticed. The existing tests did not use large enough datasets to trigger the issue.
The fix involved reading a bit more data from the database and rearranging it in memory, which seemed a small price to pay for a correct system. The risk looked minimal, and I couldn’t imagine a scenario where serving the “wrong” data made sense. I made the change, patted myself on the back, and moved on to other things. A few weeks later, the code went live. Then the emails started to arrive!
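Before getting to those emails, it helps to picture what that “small cost” actually was. I won’t reproduce the real change, but conceptually it had roughly this shape; the sketch below is hypothetical, every name is invented, and it reuses the toy schema from the earlier sketch.

```python
def snapshot_rows(conn, account_id, limit=50, overscan=4):
    # Fetch through the existing (account_id, created_at) index, which is
    # cheap, but over-fetch so there is something to re-order afterwards.
    rows = conn.execute(
        "SELECT payload, score, created_at FROM events "
        "WHERE account_id = ? ORDER BY created_at DESC LIMIT ?",
        (account_id, limit * overscan),            # extra rows read per call
    ).fetchall()
    # Re-order in memory by the documented key (highest score first).
    rows.sort(key=lambda r: r[1], reverse=True)    # extra CPU + memory per call
    return rows[:limit]
```

A few extra rows and one in-memory sort are negligible for a single request; the trouble is that this code path ran on every page view for millions of users.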
The fallout
First up was the UX optimisation team, who had noticed a 100–200 ms degradation in time-to-render metrics across the entire application and traced it back to my change. This was considered a huge performance gap and a top priority to fix. Next came the capacity-planning folks, who were seeing similar bumps in database buffer gets and CPU utilisation. My cost was indeed small, but it was being paid on every single page view for millions of users. This was before our move to cloud-based infrastructure, so we owned a finite number of physical machines in data centres around the world, and there was no way to scale up to meet the new demand without a month-long provisioning process. But even that was nothing compared to the customer issues that started flowing into support.
It turned out this behaviour had been noticed by someone after all: one of our largest enterprise customers. They had opted simply to work around it, which made it de facto official behaviour as far as they were concerned. These customers had their own test teams exercising our software (often more effectively than we did), and my little change was blowing up test suites from New York to Tokyo. People were not happy.
I was also not happy. I felt like I had let overconfidence override my common sense. I had let the team down and, worse, had tarnished the company’s reputation with some very important customers.
The revert
Luckily, the fix was as simple as reverting my change and pushing out an emergency release, which took only a few hours. I gave a silent thank you to the release engineers who could manage such a quick turnaround; without the operational work that had created these emergency release channels, this story could have had a very different ending. I also added internal tests that explicitly codify the “correct” behaviour, along with code documentation and known-issue notes to explain the situation.
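“Codifying” the behaviour essentially meant writing characterization tests: tests that pin the long-standing output, documented or not, so that any future well-intentioned fix fails loudly in CI rather than in production. The sketch below is hypothetical; fetch_snapshot() and its data are invented stand-ins for the real query path.

```python
import unittest

def fetch_snapshot(rows, limit=3):
    # Legacy behaviour being pinned: results come back in index order
    # (created_at), regardless of what the documentation says about
    # ordering by score.
    return sorted(rows, key=lambda r: r["created_at"])[:limit]

class SnapshotOrderingCharacterization(unittest.TestCase):
    def test_results_follow_index_order_not_documented_order(self):
        rows = [
            {"item_id": "a", "created_at": 3, "score": 0.9},
            {"item_id": "b", "created_at": 1, "score": 0.1},
            {"item_id": "c", "created_at": 2, "score": 0.5},
        ]
        got = [r["item_id"] for r in fetch_snapshot(rows)]
        # The documented (score-first) order would be ["a", "c", "b"];
        # the Shadow Spec says otherwise, and this test guards it.
        self.assertEqual(got, ["b", "c", "a"])

if __name__ == "__main__":
    unittest.main()
```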
Internally, we acknowledged it was odd for our system not to follow our own documentation, but it was deemed unnecessary to explain the nuance externally, given that no one had complained about it in over ten years.
The company was not one to punish people for honest mistakes, so aside from the embarrassment of such a huge gaffe so early in my tenure (and much good-natured ribbing), I enjoyed many more years of gainful employment. And I now knew that being right did not always mean being correct.
Lessons learned
This experience gave me a very sharp sense that working for a large enterprise software concern would be very different from the years I had spent at various smaller startups. Having customers (especially huge customers who write huge cheques) adds a new layer of responsibility that has to be remembered at every step of the development process. A good organisation instils that respect for the customer at every level. And a big part of that is honouring the Shadow Spec.