The Art of Debugging
I often end up being pulled into various debugging sessions. I find that having a structured approach for debugging can help you figure out what is going on and fix even the nastiest of bugs. This is a story about a few different bugs that I run into.
We have some new features coming in RavenDB 5.4 which touch the manner in which we manage some data. When we threw enough data at RavenDB, it died. Quite rudely, I have to say. I described the actual bug here, if you care. There wasn’t really much to finding this one.
The process crashes, we don’t know why. The process will usually gives us good signs about what exactly is happening. In this case, attaching a debugger and running the code and seeing where it failed was sufficient. I’ll admit that I mostly pressed F10 and F11 until I found the offending function, and then I didn’t believe it for some time.
Still in the same area, however, we began to run into a problem with impossible results. Invariants were broken, quite harshly. The problem was that this was something we could only reproduce after ~30 minutes or so of running a big benchmark. This does not make it easy to figure out what is going on. What is worse, there is a distinct gap between the time that the bug showed up and when it actually happened.
That made it very hard to figure out what is going on. This is where I think the structured approach shines.
Architecture – I hate concurrency debugging. I’m roughly 20+ years of experience in writing parallel programs, it is not sufficient to debug concurrency, I’m afraid. As a result, the design of RavenDB goes to great lengths to avoid having to do concurrency coordination. There are very well defined locations where we are applying concurrency (handling requests) and there are places where we are using serial manner (modifying data). The key is that we are using multiple serial processes at the same time. Each index is bound to a thread, for example, but they are independent of one another. The error in question was in an index, and I was able to narrow it down to a particular area of the code. Somewhere along the way, we messed things up.
Reproducing – I had a way to reproduce the issue, it was just utterly unhelpful for debugging purposes. But since the process is a serial one, it meant that it is also fairly consistent. So I added a bunch of printf() that logged the interesting operations that we had:
And here is what this looked like:
Each line is a particular command that I’m executing. I wrote a trivial parser that would read from the file and perform the relevant operations.
For ease of explanation, imagine that I’m writing a hash table, and this set of operations prime it to expose the issue.
The file contained hundreds of millions of operations. Somewhere toward the end, we run into the corrupted state. That is too much to debug, but I don’t need to do that. I can write code to verify the state of the data structure.
There is one problem here, the cost. Verifying the data structure is an operation that is O(N*logN + N) at least (probably O(N^2), to be honest, didn’t bother to run the proper analysis). I can’t run it on each command. The good news is that I don’t need to.
Here is what I did:
That did a full verification every 100,000 runs. This meant that I was confident that this was roughly the error way. Then I just repeated the process, raising the minimum start for doing the check and reducing the gaps in the frequencies.
This is a pretty boring part, to be honest. You run the code, do some emails, check it again, update a couple of numbers, and move on. This bisection approach yields great results, however. It means that I could very quickly narrow down to the specific action that caused the error.
That was in step 48,292,932, by the way. The actual error was discovered far later.
Fixing – that part I’m going to skip, this is usually fairly obvious once you know what is going on. In the case above, there was some scenario where we were off by 4 bytes, and that caused… havoc.
The key here is that for most of the debugging process, I don’t really need to do any sort of thinking. I figure out what information I need to gather, then I write the code that would do that for me.
And then I basically keep running things until I narrowed the field down to the level where the problem is clear and I can fix that.