What's Wrong With Your Approach to Troubleshooting

When something goes wrong, it helps to already have a method for figuring out what is happening and what you need to do to fix it.

Two things to know:

  1. Know what the "fixed" version looks like. Preferably this is a command you can run that gives a known output when things work. For example: I'm trying to figure out why SSH asks for a password when I've set up the keys properly (or so I thought). So my test is "ssh servername uptime", and it should work without asking for a password.

  2. Describe the problem at the right level. A user complaining that they can't ping a server should not send you off to fix the server. The person's job isn't to sit around and ping a machine all day. They want to get some task done, like using the machine as their DNS server. Example: Once a user complained that they couldn't ping a machine halfway around the world. I spent the day tracking down sysadmins in that part of the company to find out what was wrong with that machine. It had been decommissioned, and my questions put them in a panic because they thought maybe they had powered off the wrong machine. I contacted the user and said "besides needing to ping this machine, what would you like to be doing with it?". It turned out that they wanted to run a certain job on it, and if they had been following the proper procedure their tasks would have been automatically redirected to the replacement machine. I had wasted my entire day and the time of the local sysadmins. Another reason "I can't ping" isn't the right thing to be testing: Firewalls are often configured to drop ping packets but let other packets through. Test what you want to go through.
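
Both points above can be turned into small repeatable checks. The following is a minimal sketch, not a definitive tool: the function names are made up for illustration, "servername" is the placeholder from the SSH example, and it assumes the OpenSSH client, whose BatchMode=yes option makes ssh fail instead of prompting for a password.

```python
import socket
import subprocess


def command_succeeds(command, timeout=15):
    """Return True if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,   # do not let the child read our stdin
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command not found, not executable, or it hung past the timeout.
        return False
    return result.returncode == 0


def tcp_port_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port can be opened.

    This tests the traffic you actually care about instead of ping.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# The "fixed" test from item 1, written so it can never stop to ask
# for a password (assumes the OpenSSH client).
SSH_KEY_TEST = ["ssh", "-o", "BatchMode=yes", "servername", "uptime"]
```

Calling command_succeeds(SSH_KEY_TEST) after each attempted fix tells you whether key-based login works yet. For a user who needs the machine as a DNS server, something like tcp_port_open("servername", 53) is closer to the real question than a ping, though it only covers DNS over TCP.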

Two strategies:

  1. Additive: Keep adding components until the problem starts. The last thing you added is the problem. Example: Web browsers can't talk to a server. Between the server and the user is a load balancer, a firewall, a cache, and the user's local web proxy. First try sending queries directly to the server, then through the LB to the server, then through the firewall to the LB to the server, and so on, adding one component each time.

  2. Subtractive: Keep removing components until the problem goes away. The last thing you removed was the problem. Example: A machine with dozens of cards won't boot. Keep removing cards until the machine boots.
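
The two strategies are simple loops, and writing them out makes the stopping condition explicit. This is a rough sketch under stated assumptions: the function names are hypothetical, each probe is a callable you supply that returns True when that step works, and the subtractive version assumes the full set of components is already known to be broken.

```python
def first_failing_step(steps):
    """Additive: run the probes in order, each adding one more component.

    steps is a list of (name, probe) pairs. Returns the name of the
    first step whose probe fails, or None if every probe passes.
    """
    for name, probe in steps:
        if not probe():
            return name
    return None


def remove_until_working(components, works):
    """Subtractive: remove components one at a time until things work.

    works(remaining) must return True when the system works with only
    the remaining components. Returns the component whose removal made
    the system work, or None if it never worked.
    """
    remaining = list(components)
    while remaining:
        removed = remaining.pop()      # take one component out
        if works(remaining):
            return removed             # the last thing removed is the suspect
    return None
```

For the web example, the steps would be probes for "direct to server", "through the LB", "through the firewall", and so on, in that order. For the machine that won't boot, works would be "does it boot with these cards installed?".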

Two bits of dumb luck:

  1. Forget everything I said. The problem is being caused by the last change made to the system. (This works 99% of the time. The catch is that 99% of the time you don't know what the last change actually was.)

  2. When all else fails, check for stupid things. http://whatexit.org/tal/mywritings/dumb-things-to-check.html Example: A crazy problem just couldn't be explained. Then we checked the configuration file: a user had edited it by copying it to a Windows box, editing it there, then copying it back. It now had a ^M at the end of every line. We never noticed because our text editor silently hid this fact. Sadly, the software that read the configuration file turned those ^Ms into non-breaking spaces, which screwed up tons of other procedures.
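
The ^M check in particular is easy to automate, because a ^M is just a carriage-return byte that many editors hide. Here is a small sketch of one way to look for it. The function name is made up for illustration, and the file must be read as raw bytes so that nothing translates the line endings first.

```python
def lines_with_carriage_return(data):
    """Return the 1-based numbers of lines that end in a carriage return.

    data is the raw bytes of the file. A line that ends with b"\r"
    before its newline is what an editor would show as a trailing ^M.
    """
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if line.endswith(b"\r")
    ]
```

To check a file, read it in binary mode, for example with open(path, "rb"), and pass the bytes in. An empty list means no Windows-style line endings were found.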

October 31 2019
