Principles of troubleshooting
tl;dr Here are some common-sense troubleshooting steps.
“What have you done to try to fix the problem?”
Outages are stressful
We’ve all been there. Either you’ve been the engineer being called and given insufficient information, which forces you to pry more information out of people. Or you’re the desperate engineer calling somebody for help. You’ve thrown all sorts of solutions at the problem without really understanding the problem, so none of them worked. You are receiving a lot of heat for this one. This needs to be fixed NOW.
Everybody is stressed, and upset, and the above situation is usually miserable for everybody involved, no matter what side of the camp you’re on.
Much of what follows can be seen as common sense, but common sense is rarely common, and often comes after — usually painful — experience.
Really skilled designers and engineers are those who can think. They have developed good taste, and spend a lot of time thinking before they start solving something. They also spend a lot of time asking questions. They’ve built up a framework of principles that helps them act quickly and solve problems quickly when the need arises.
The order may vary from time to time, and some steps may be left out or bleed together, but I generally find myself using most of these steps.
A lot of us do these instinctually, but we can’t articulate them. This is my attempt at documenting the process.
You receive a phone call. Something broke. You are charged with investigating and finding a solution. You will be applying the fix if it’s one of your systems.
The steps I normally follow
1. Define the problem
2. Verify & replicate
3. Research
4. Hypothesis / Snap assessment
5. Test environment
6. Isolate the problem
7. Adjust your hypothesis (if there’s learning)
8. Apply and re-apply the fix (if found)
9. Deploy & Verify
10. Continue to monitor for problems
1. Define the problem. What is the actual state? What is the expected state? This is always the first question I ask. Sometimes just getting this information out of people can be difficult. “It’s broken”, or “the build is broken” are not sufficient. You can deduce that the build shouldn’t be broken, but in most cases, when the person reporting the bug defines ‘expected’ vs ‘actual’ behavior it goes a long way in defining the problem. What are the symptoms? Is that a symptom, or a manifestation of a deeper issue (code smell)?
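One way to make the expected vs. actual question concrete is to treat a report as incomplete until both are filled in. The following is a minimal sketch, not a prescribed format: the `ProblemReport` class and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemReport:
    """Hypothetical structure for a bug report; field names are illustrative."""
    summary: str
    expected: str = ""
    actual: str = ""
    symptoms: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of the fields still needed to define the problem."""
        missing = []
        if not self.expected.strip():
            missing.append("expected")
        if not self.actual.strip():
            missing.append("actual")
        return missing


# "The build is broken" on its own leaves both questions unanswered.
vague = ProblemReport(summary="The build is broken")

useful = ProblemReport(
    summary="Build fails on the main branch",
    expected="The build finishes and produces a bundle",
    actual="The build stops with a syntax error in a JS file",
    symptoms=["No bundle is produced"],
)

print(vague.missing_fields())   # the questions still to ask the reporter
print(useful.missing_fields())
```

The point is only that "expected" and "actual" are separate pieces of information, and a report without both is not yet a problem definition.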
2. Verify. Is this problem still consistently occurring? This really includes 2 important steps:
Can I replicate this?
What steps are needed to consistently replicate this?
Best case: It is simple to replicate, and you see it every time.
Worst case: They can’t replicate it, or the defect replicates inconsistently because a variable is needed (high load, or something else). If they can’t replicate it, tell them it may (depending on the issue) have been a fluke, and ask them to alert you when they see it again and to record any variables they can think of that contribute to the environment being in that state. This may include taking a screenshot, copying the HTML, or other measures.
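When a defect replicates inconsistently, it can help to put a number on "sometimes". The sketch below runs a reproduction attempt many times and reports how often the defect appeared. The `flaky_attempt` function is a hypothetical stand-in for the real reproduction steps, and its 20% failure rate is made up for the example.

```python
import random
from typing import Callable


def failure_rate(attempt: Callable[[], bool], runs: int = 100) -> float:
    """Run `attempt` `runs` times and return the fraction of runs that failed.

    `attempt` returns True when the defect did NOT appear, False when it did.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if not attempt())
    return failures / runs


# Hypothetical stand-in for the real reproduction steps.
rng = random.Random(42)  # seeded so the sketch is repeatable


def flaky_attempt() -> bool:
    # Pretend the defect shows up in roughly 20% of attempts.
    return rng.random() >= 0.2


rate = failure_rate(flaky_attempt, runs=200)
print(f"defect appeared in {rate:.0%} of runs")
```

A measured rate also gives you a baseline: if a candidate fix does not move that number, it probably was not the fix.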
Is this really a defect?
Sometimes it is working as designed but the bug reporter isn’t aware how the feature is supposed to work. If it is working as designed, and everybody is happy, you can move on. If it’s working as designed, and they don’t like it, you can start a discussion about changing it, but it is no longer an emergency.
3. Research (this step is optional if you’ve experienced the error or something similar before). Unless everything you are doing is homegrown (unlikely), there is a good chance somebody else has encountered this problem. Doing a quick Google search with the error message, or the quirky CSS behavior, goes a long way here. You are leveraging the internet. It’s big. The challenge here can be knowing the right term to search for (‘collapsed float’, ‘missing semicolon’). More experienced team members can help if you are struggling to even come up with search results.
What changed in the environment?
This includes looking into the environment. What changed? What variables in the environment have changed recently? Is this a new bug, or has this always been happening (but nobody noticed)? If it is a new bug, you can look into what has changed.
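If you keep even a rough record of the environment (library versions, feature flags, providers), "what changed?" becomes a diff. The sketch below compares two snapshots. The snapshot contents are invented for illustration, and `diff_environment` is a hypothetical helper, not part of any tool mentioned here.

```python
def diff_environment(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (old, new)} for every variable that was added, removed, or changed.

    A value of None means the variable was absent in that snapshot.
    """
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Invented snapshots: one from when things worked, one from today.
last_week = {"jquery": "1.7.2", "css_bundle": "v41", "cdn": "provider-a"}
today = {"jquery": "1.8.0", "css_bundle": "v41", "cdn": "provider-a", "ab_test": "on"}

for name, (old, new) in diff_environment(last_week, today).items():
    print(f"{name}: {old} -> {new}")
```

Anything that shows up in the diff is a candidate variable to test; anything that does not can usually be deprioritized.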
How much is affected?
This yields very useful clues. If it’s only 1 app, instead of all 10, it’s likely not a platform or upstream issue. What is different about your app?
Netflix has a dashboard that shows them things that have recently changed. If something major breaks, and you haven’t changed anything, it’s likely something else in your environment changed, perhaps your provider, or something else upstream. In Netflix’s case, they hadn’t changed anything, so they called AWS, and it turned out to be an issue on AWS’s side.
Frequently, you have no control over that part of the environment. It’s fine for you to call up another team and ask them to fix it. You should, however, have high confidence that their piece is the one causing the problem, and bring detailed instructions on how to replicate it and (if you have that knowledge) how to fix it. NOBODY likes the hot potato game that happens when you assign a defect to a team that has nothing to do with it.
Often this could also mean that an external dependency broke (3rd party, or another team at the company).
4. Hypothesis / Snap assessment. At this point you likely have at least an inkling as to where the problem is. Even if it’s just a general region (“this is a display issue, it’s most likely CSS, or maybe even JS related”, “the functionality isn’t loading, it’s probably a JS issue, maybe syntax, or missing JS file”).
If you don’t have any clue whatsoever, make an educated guess, or do some more research. This could be very broad “one of the dependencies appears to be broken”.
5. Set up a test environment: You need an environment where you can manipulate the variables freely and test your hypothesis. That is the only way you can determine with high confidence that you’ve found the problem and found a resolution. Sometimes using Firebug or the Chrome Dev Tools is sufficient, since that lets you manipulate the variables.
The main goal here really is learning. You tweak the variables in the environment (preferably one at a time) to see what happens. How does the rest of the environment behave? When I change variable A, it has a strange effect on variable B…
Your main purpose is to replicate the problem in your controlled environment.
Best case scenario: The problem appears easily, and it stays.
Worst case: You have to tweak several variables (when this user-agent appears AND coming from this country, the page fails). Logs can be useful in tracking that down.
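When the failure depends on a combination of variables, grouping log records by that combination can show which pairing is at fault. This is a rough sketch: the log records are hand-written examples rather than real data, and the record keys (`user_agent`, `country`, `ok`) are assumptions about what your logs contain.

```python
from collections import Counter


def failure_rates_by_combo(records, keys):
    """Return {combo: failure_rate}, grouping records by the values of `keys`."""
    totals, failures = Counter(), Counter()
    for record in records:
        combo = tuple(record[k] for k in keys)
        totals[combo] += 1
        if not record["ok"]:
            failures[combo] += 1
    return {combo: failures[combo] / totals[combo] for combo in totals}


# Hypothetical, hand-written log records (not real data).
logs = [
    {"user_agent": "IE8", "country": "DE", "ok": False},
    {"user_agent": "IE8", "country": "DE", "ok": False},
    {"user_agent": "IE8", "country": "US", "ok": True},
    {"user_agent": "Chrome", "country": "DE", "ok": True},
    {"user_agent": "Chrome", "country": "US", "ok": True},
]

rates = failure_rates_by_combo(logs, keys=("user_agent", "country"))
# Print the worst combinations first.
for combo, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(combo, f"{rate:.0%}")
```

In this invented data neither the user agent nor the country fails on its own; only the pairing does, which is exactly the case that is hard to spot by eye.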
The reason this environment is so important is that you can now rapidly iterate. I’ve worked in places where getting a change out to the staging or prod servers can take 24 hours. You do NOT want to be in the position of waiting 24 hours just to find out your fix didn’t work properly. This kind of slow iteration kills a lot of things, including excitement and confidence in your abilities.
Very frequently I have to try 2 or 3 things before I find the proper solution. Once you know what will work, you need to figure out what is the most maintainable and elegant way to apply that fix. Usually there are several ways to solve the problem. If possible, I try to avoid fixes that are a temporary band-aid or an awkward solution that incurs technical debt. If that is the only thing feasible, then make sure you plan in time & budget to fix it properly the following deploy or iteration.
Even worse case:
Sometimes the problem appears in an area that you have no control over. If the problem is upstream, then in your test you can still simulate the variables that are coming from upstream. If you meet with another team to tell them their code is causing you problems, you want to have a HIGH level of confidence that you are right, or else you lose trust and seem like an amateur.
Charles Proxy has been very helpful for that. Wrong JS file? Wrong variables? Wrong headers? Charles will help in all those situations, ESPECIALLY since it is on the OS level and will work in all browsers.
6. Isolate the problem. If you still have no idea what is causing the issue, you need to spend all your efforts on isolating the problem.
It should not ever take very long to isolate the problem at least to a certain region. This is one of the most important steps.
This is like finding a leaky pipe: you have to shut off certain parts to see if the downstream is still polluted.
You isolate the problem by changing one variable at a time to see what changes the state of the problem. Having a controlled environment is important in this case. If you are in production, you can’t have confidence that the environment won’t change, and your iteration time might be awful. It may take anywhere from 5 minutes to a day or longer to get a tweak into production; locally you can do it in 5 seconds.
With CSS, changing variables means you tweak CSS rules in Firebug / Chrome Dev Tools. This identifies a majority of CSS problems, is fast, and can be done in prod (oh, this float is collapsing, you are being overridden by this global style here). In IE, you can use Charles, or download all the files, change the CSS, and verify what it looks like in IE.
Sometimes you have CSS / JS syntax errors. The simplest way I’ve seen to find those issues is to comment out parts of your code (you can start with half if you are impatient). This will help you track down the exact line that is causing the issue.
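Commenting out half the code at a time is a bisection, and it can be sketched in a few lines. The version below assumes exactly one line is at fault and that it fails on its own, regardless of which other lines are present. The `has_syntax_error` checker is a toy stand-in (it only counts braces per line), not a real CSS parser.

```python
from typing import Callable, Sequence


def find_offending_line(
    lines: Sequence[str],
    is_broken: Callable[[Sequence[str]], bool],
) -> int:
    """Return the index of the single line that makes `is_broken` true.

    Assumes exactly one line is at fault and that it triggers the failure
    whenever it is present, whatever other lines are included.
    """
    if not is_broken(lines):
        raise ValueError("the full input does not reproduce the problem")
    low, high = 0, len(lines)          # the culprit is somewhere in lines[low:high]
    while high - low > 1:
        mid = (low + high) // 2
        if is_broken(lines[low:mid]):  # "comment out" the second half
            high = mid
        else:                          # so the culprit must be in the second half
            low = mid
    return low


def has_syntax_error(chunk: Sequence[str]) -> bool:
    # Toy check: every rule sits on one line, so braces must balance per line.
    return any(line.count("{") != line.count("}") for line in chunk)


stylesheet = [
    ".header { float: left; }",
    ".nav { margin: 0; }",
    ".sidebar { width: 200px;",   # missing closing brace
    ".footer { clear: both; }",
]

index = find_offending_line(stylesheet, has_syntax_error)
print(f"line {index + 1}: {stylesheet[index]}")
```

Each round halves the search space, so even a file with a thousand lines needs only about ten rounds. If two lines interact to cause the failure, the single-culprit assumption breaks and you will need to narrow things down by hand.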
7. Adjust your hypothesis. If you are following the previous steps properly, you are learning a lot about the problem space and how changing variables is affecting the environment. This should cause you to either adjust your hypothesis, or narrow down the scope of the problem and increase your confidence in a solution (We are 90% confident the problem is occurring because of A, and B is a common solution. We will try that out).
Sometimes this means moving in an entirely different direction with your research (“We thought it was A, but now we are fairly certain the problem is over here instead”).
8. Apply and re-apply your fix. Often, during your experimentation in your sandbox (step 5 or 6) you’ve stumbled on the solution (adding/removing this line in my JS fixes it).
So now you believe adding this one line will fix the problem. Now you need to verify the following:
Will that alone fix the issue?
Will it fix it in all environments (browsers)?
Will that fix it in all use cases?
Very frequently, at this point, I’ve changed so many variables that I think the last change did it, but it was really a combination of several steps. You don’t want to wait till you push to production to find out you were wrong. That makes you look sloppy.
So you revert your code completely, and apply ONLY your suggested fix. If that fixes it, you are golden. If NOT, you need to retrace your steps; perhaps it is 2 steps combined that will fix your problem. I’ve had times where it was working, but I couldn’t figure out why. It’s important to keep good track of your changes (either with source control like SVN/Git, or by manually writing them down).
If you can quickly apply it in prod via the DevTools / Charles method, that is valuable since you can test it without having to actually deploy it. Spending 15 min vs 24 hours can be well worth it to avoid pain for customers and yourself/management.
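The revert-and-reapply check can be made systematic: start from a clean baseline, try each change alone, then pairs, and stop at the smallest set that works. The sketch below assumes you can write a `fixes_problem` function that reverts everything, applies only the given subset, and reports the result. Here that function is faked, and the change names are invented.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Optional


def smallest_fix(
    changes: Iterable[str],
    fixes_problem: Callable[[FrozenSet[str]], bool],
) -> Optional[FrozenSet[str]]:
    """Return the smallest set of changes that fixes the problem, or None.

    `fixes_problem(subset)` is expected to revert everything, apply only
    `subset` to the clean baseline, and report whether the problem is gone.
    """
    candidates = sorted(set(changes))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if fixes_problem(frozenset(subset)):
                return frozenset(subset)
    return None


# Invented list of everything that was tweaked while experimenting.
tried = ["add clearfix", "bump z-index", "remove global override"]


def fixes_problem(applied: FrozenSet[str]) -> bool:
    # Faked result: in this example the fix needs two of the changes together.
    return {"add clearfix", "remove global override"} <= applied


print(sorted(smallest_fix(tried, fixes_problem)))
```

This tries every combination, so it only makes sense for the handful of changes you would realistically have made in a debugging session. It shows the principle rather than something to run against dozens of changes.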
9. Deploy & verify. Push the fix to staging, run all your tests (manual or auto). Push to prod, and verify that the solution worked. This includes verifying that it works in all use cases and all environments (browsers). This includes verifying that you didn’t apply a bad fix (you didn’t break anything else, or introduce new bugs, or affect anything else adversely). If your code ends up fixing the problem, but breaking something else, find a fix, and raise that quickly (this will get you bonus points, because you own it).
10. Continue to monitor for problems. If something breaks, you should be the one to raise the issue. That shows that you own the problem. You should know something is broken before the testers do.