The following is a combination horror story and true crime documentary of developers tricking themselves into seeing things that aren’t there, not seeing things right in front of their faces, and breaking various rules of debugging. The faint of heart should close the browser tab now.
No, it’s not a Heisenbug. Heisenbugs don’t repro the same when you’re looking into them. This is…something else. – Actual quote
There’s one part of this story that I will not address. We have to use 64-bit integers (C# long) for the ids of one of our resources due to downstream / legacy dependencies. They end up being incredibly large, non-sequential values. That’s not the point here, just some context.
A Problem Is Reported
INT. A developer’s desk, early morning, before coffee, bright blue sunny day.
An email is received reporting an issue where the ids of certain resources are intermittently being returned as two less than the true value. Or one less. Or one or two above. Always “close”, but often incorrect.
There is a Fiddler .saz to prove it, and screenshots of that same Fiddler .saz.
A fearless developer figures this must be something pretty silly, opens up the issue in the issue tracking system, and decides to take it on. There will be much weeping.
The Plot Thickens
INT. The same developer’s desk. After standup. A second developer joins to pair.
The developer has traced through the entire command side of the CQRS system and confirmed that the ID is not strangely mutated before publishing an event to the bus.
Also traced through the read model updaters to ensure no strange mutations occur there.
Queried the read model store directly. Data is correct at rest. It’s got to be in the query side or API itself.
We have some message handlers that fire late in the pipeline, just before the response stream is written. The original dev has set breakpoints there and confirmed the ID is correct before it goes out the door.
Yet, Fiddler keeps showing that odd ids always become even. Even ids always stay even, but sometimes go up or down by two. We record each id->mutatedId pairing to look for deeper patterns.
And then, on a whim, add
Accept: text/xml to the Fiddler composer instead of taking the default
It’s correct in the XML.
INT. Kitchenette. Coffee. Sanity beginning to slip.
That Annoying BAMB BAMB Sound From Law and Order
INT. The same developer’s desk. Plus another. Total of three developers.
The third developer quickly confirms what we’ve already done, just to make sure we didn’t miss something silly.
It’s only happening to json. Could it be in the serializer? The message formatter? What is going on?
For the sake of reduction, we go to File > New Project. Eliminate all of our code from this and spin up a barebones project. F5 and see the correct values displayed in the browser. What’s different about the new project? Do we have a different version of the json serializer package? A different version of our web api framework package? What else is our the pipeline, but only for json and not for xml?
EXT. Beautiful, bright blue day. Sunny. Warm. The opposite of the cold grey oppression of this bug.
Walk to lunch, eat lunch, decide to work from a coffee shop and wrap our heads around this thing.
Frayed Ends of Sanity
EXT. Outdoor tables at local cafe.
We create more reductions. The json serializer and message formatters seem fine, but we still aren’t sure what else is in the pipeline.
EXT. Indoor tables at a different, local cafe.
We start mutating the ids in the data store itself and then querying them through the API, attempt to establish a predictable pattern.
We try smaller numbers and find that id = 7 stays 7, even though it’s odd. But we know this isn’t an overflow issue because many higher, even numbers remain correct.
We write a quick script to create 1000 resources with sequential ids starting with one of the large problem ids. Then we request each of them through our API and dump the id we expect alongside the id the API returned. They all match. None are wrong. This was in LinqPad, using C#, and the .NET HttpWebRequest.
But even as LinqPad is showing the right ids, Fiddler is showing the wrong ids…
We saw even/odd, but we didn’t see that they were all, for example, multiples of four.
When we started a new Web API project from scratch, we viewed those responses in a browser instead of in Fiddler. We changed too much at once, including what we used to view the output.
A few years ago, I read this book and recommended it to everyone I worked with. As I look at what we went through in this story, I see that we held fast to some of the principles for good debugging, but got caught out on a couple others.
What Went Well
- We had good evidence of the problem and quickly established a scenario to reproduce the problem
- We eliminated the command side, event bus, and read models very quickly. This let us isolate the problem to the response pipeline.
- We further isolated it to only the json output, further supporting the case that our commands and queries were not the cause of mutation.
Lessons Learned Things We Knew Better But WTH Were We Doing
- Fiddler is a great tool, but it isn’t showing you the raw response unless you’re actually looking at the raw tab
- We narrowed the issue down to “the response pipeline,” but didn’t consider what the tools used to view the output might do after the response was delivered.
- The prime directive to turn only one knob at a time includes what you’re using to view output. If this had been a CSS thing, we certainly would have counted different browsers as knobs we were turning. Here, we tricked ourselves into thinking the raw text response in the browser was the same as the json tab in Fiddler.