You know the kind of issue. No one can reliably reproduce it, and you wouldn’t believe the bug existed if it weren’t for the screenshot. I’m talking about impossible bugs.
I have recently been fixing a few bugs that are as difficult as they come. In this post I would like to give some tips on how to solve them.
It might sound like psychobabble, but your attitude affects your motivation, which in turn affects the outcome. You might be your own worst enemy, but you can always change that.
Someone has to fix the problem, so why not you? If you think someone else is going to fix the problem, then you are hardly in a motivated state of mind.
It is great to brainstorm some ideas about what the problem might be, but before investing lots of time hacking away at code based on an assumption, stop! Treat your assumption as a hypothesis and then think of how you can scientifically test that hypothesis.
Have a set goal and do not stray from it. You need to be dedicated and single-minded. If during your investigation you find other issues, instead of fixing them, document them and get back to the task at hand. Otherwise you are just complicating an already complicated problem.
If you’re sleep deprived or stale when problem solving, your brain is going to be sluggish. Ideas will often come when you walk away from a problem or come back to it.
There is always something that can be done. It might be finding a simple workaround or rewriting entire modules. What you are doing is just an exercise in searching for a solution as efficiently as possible.
Once you have the right mindset, how do you go about solving impossible bugs?
Keep notes on what you tried and any other details about the problem as they come to hand. If the problem becomes complex you may need to retrace your steps. It also helps to have something documented if you need to involve other people. Make sure that in your notes you can distinguish proven facts from possible theories. And if you are working with many people it also helps to be able to distinguish each person’s notes.
Keep a list of available options and execute the list in an order that is efficient. Usually this means trying the quick things first. Cross off the options as they become exhausted and add in new entries as you think of them.
Find the problem and verify that it is the problem before you start working on the solution; otherwise you could be wasting time fixing something that “ain’t broke”.
If you can, watch the problem occurring. Listen carefully to people who encounter the problem, and then ask questions to clarify the issue. Be careful how you ask questions: you don’t want to lead people into telling you what you want to hear instead of something factual.
Finding patterns that relate to the problem will help you come up with a hypothesis about what the problem is. The patterns could be usage behaviour leading up to the problem, or environmental conditions under which the problem occurs.
If you can reproduce the problem to some extent and your software is released in revisions, it often pays to check whether the issue was introduced in a recent revision. If it was, you should be able to do a binary search over the versions until you know when it started occurring. From there, if you have a good source control system, you should be able to track down the delinquent change.
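Here is a minimal sketch of that binary search in Python. The version names and the `is_bad` check are hypothetical stand-ins for "install this revision and try to reproduce the bug", and the sketch assumes the oldest version is good, the newest is bad, and the bug stays once it appears.

```python
from typing import Callable, Sequence


def first_bad_version(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is true.

    Assumes versions are ordered oldest to newest, the oldest is known
    good and the newest is known bad.
    """
    good, bad = 0, len(versions) - 1  # indexes of a known-good and a known-bad version
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid   # the bug is already present here, look earlier
        else:
            good = mid  # still fine here, look later
    return versions[bad]


# Hypothetical release history: the bug crept in at 1.6.
versions = [f"1.{n}" for n in range(10)]
tested = []


def is_bad(version: str) -> bool:
    tested.append(version)  # remember what we had to test by hand
    return int(version.split(".")[1]) >= 6


culprit = first_bad_version(versions, is_bad)
print(culprit, "found after testing", tested)
```

With ten releases this needs three manual tests instead of up to eight, and the saving grows quickly with longer histories. Some source control tools can drive the same search for you; git, for example, has a `bisect` command.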
Learn how to use a debugger and a profiler.
I can recommend YourKit (not free) for Java and .NET. I have also heard of a Reflector add-in called Deblector (free).
Do what you need to. Add logging to your code, re-deploy and re-test.
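As a rough illustration, this is the sort of temporary logging I mean, sketched with Python's standard `logging` module. The `orders` logger name and the `apply_discount` function are made up for the example; the point is to record the inputs, the suspicious condition and the result so the next test run tells you something.

```python
import io
import logging


def make_logger(stream) -> logging.Logger:
    """Build a debug-level logger that writes timestamped lines to `stream`."""
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()               # avoid duplicate handlers on re-runs
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(threadName)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def apply_discount(logger: logging.Logger, price: float, percent: float) -> float:
    # Log the inputs on the way in and the result on the way out.
    logger.debug("apply_discount called: price=%r percent=%r", price, percent)
    if not 0 <= percent <= 100:
        logger.warning("percent out of range: %r", percent)
    result = price * (100 - percent) / 100
    logger.debug("apply_discount result=%r", result)
    return result


stream = io.StringIO()  # in a real deployment this would be a file or stderr
log = make_logger(stream)
apply_discount(log, 200, 150)
log_text = stream.getvalue()
print(log_text)
```

The in-memory stream just keeps the example self-contained. Including the timestamp and thread name in every line costs nothing and often turns out to be the detail you need later.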
Environmental problems are usually related to the operating system, framework or other software that your program relies on. These types of faults are usually easy to reproduce, but not on the machine where all of your development tools are installed. :)
Configuration problems can sometimes be misdiagnosed as environmental issues as they can have the same symptoms. Just like environmental issues, the problems are likely to be easy to reproduce. If you can isolate the software and configuration from the environment and still get the problem, you can suspect the configuration.
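One cheap way to build that suspicion is to compare the configuration of a machine that works with one that fails. The sketch below assumes both configurations have already been loaded into plain dictionaries; the setting names and values are invented for illustration.

```python
ABSENT = "<absent>"  # marker for a setting that exists on only one side


def diff_config(working: dict, failing: dict) -> dict:
    """Return {key: (working value, failing value)} for every setting that differs."""
    differences = {}
    for key in sorted(working.keys() | failing.keys()):
        a = working.get(key, ABSENT)
        b = failing.get(key, ABSENT)
        if a != b:
            differences[key] = (a, b)
    return differences


# Hypothetical settings from a machine that works and one that fails.
working = {"timeout_s": 30, "cache": "on", "locale": "en_US"}
failing = {"timeout_s": 5, "cache": "on", "tls": "1.0"}

differences = diff_config(working, failing)
for key, (good_value, bad_value) in differences.items():
    print(f"{key}: working={good_value!r} failing={bad_value!r}")
```

Each difference becomes one more entry on the list of options to try: change one setting at a time on the failing machine and re-test.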
Threading issues will usually be difficult to reproduce, even on the same machine. The problem may occur on one test run but not on another similar test run. If you can find patterns in the places where the issue arises, you might be able to deduce where the problem stems from. Logging can often help you track down threading issues.
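To show what that logging can look like, here is a small Python sketch of a deliberately unsafe counter. Everything in it is made up for the example, and the `sleep` is only there to widen the race window. Each log entry records a timestamp and the thread name, which is what lets you reconstruct the interleaving afterwards.

```python
import threading
import time

events = []   # (timestamp, thread name, message); list.append is atomic in CPython
counter = 0   # shared state, updated without a lock on purpose


def log(message: str) -> None:
    events.append((time.monotonic(), threading.current_thread().name, message))


def unsafe_increment() -> None:
    global counter
    seen = counter               # read
    log(f"read {seen}")
    time.sleep(0.01)             # widen the window between read and write
    counter = seen + 1           # write, possibly based on a stale read
    log(f"wrote {seen + 1}")


threads = [
    threading.Thread(target=unsafe_increment, name=f"worker-{i}") for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Print the entries in time order to see how the threads interleaved.
for stamp, name, message in sorted(events):
    print(f"{stamp:.4f} {name:9} {message}")
print("final counter:", counter)
```

If the log shows both workers reading the same value before either one writes, an update has been lost, and you have your pattern. The exact interleaving can differ from run to run, which is exactly why the log is worth having.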
Memory leaks can be the root cause of some intermittent bugs that are usually difficult to reproduce. The pattern you should look for is “after running for a long period” or “do this many times over”. Getting good information about the steps required to reproduce the problem and profiling the application will often get a result sooner.