Servers are complex machines that generate a lot of heat when they work! A lot. Once in a while you might find your server room smelling funny. Don't panic! We gotchu! Here's what you do.
You've got the "How" pretty well nailed down:
You can improve your chances of finding the problem quickly in a number of ways - improved monitoring is often the easiest. Some questions to ask:
This is a more interesting question.
Hitting the big red switch can cost your company a huge amount of money in a hurry: a clean-agent release can run into the tens of thousands of dollars, and the outage and recovery costs after an emergency power off (EPO, "dropping the room") can be devastating.
You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.
Conversely, a fire in a server room can cost your company its data/equipment, and more importantly your staff's lives.
Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.
The guidelines that follow are my personal limitations, which I apply in the absence of (or in addition to) any other clearly defined procedures or rules. They've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.
If you see smoke or fire, drop the room
This should go without saying but let's say it anyway: If there is an active fire (or smoke indicating that there soon will be) you evacuate the room, cut the power, and discharge the fire suppression system.
Exceptions may exist (exercise some common sense), but this is almost always the correct action.
If you're proceeding to troubleshoot, always have at least one other person involved
This is for two reasons. First, you do not want to be wandering around a datacenter, have a rack go up in the row you're walking down, and have nobody know you're there. Second, the other person is your sanity check on troubleshooting versus dropping the room. Should you make the call to hit the Big Red Switch, you have the benefit of a second person concurring with the decision (which helps to avoid the career-limiting aspects of such a decision if someone questions it later).
Exercise prudent safety measures while troubleshooting
Make sure you always have an escape path (an open end of a row and a clear path to an exit).
Keep someone stationed at the EPO / fire suppression release.
Carry a fire extinguisher with you (Halon or other clean-agent, please).
Remember rule #1 above.
When in doubt, leave the room. Protect your breathing: use a respirator or an oxygen mask. This might save your health in the event of a chemical fire.
Set a limit and stick to it
More accurately, set two limits:
The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so that when you DO pull power you're not crashing a bunch of active machines and your recovery time is much shorter. Remember, though, that if the orderly shutdown is taking too long you may have to let a few systems crash in the name of safety.
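If your shutdown is scripted, the "orderly, but only up to a limit" idea can be sketched roughly as below. This is a minimal illustration, not a real procedure: the function name `orderly_shutdown`, the `shut_down_host` callback, and the `time_limit_s` parameter are all hypothetical, and how you actually stop a host (IPMI, SSH, a hypervisor API) is up to you.

```python
import time
from typing import Callable, Iterable, List, Tuple


def orderly_shutdown(
    hosts: Iterable[str],
    shut_down_host: Callable[[str], None],  # hypothetical: however you stop one host
    time_limit_s: float,                    # hypothetical: the time limit you agreed on
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], List[str]]:
    """Shut hosts down one at a time until the time limit runs out.

    Returns (stopped, abandoned): hosts that were shut down cleanly, and
    hosts that were never attempted because the limit had already expired.
    """
    deadline = clock() + time_limit_s
    stopped: List[str] = []
    abandoned: List[str] = []
    for host in hosts:
        if clock() >= deadline:
            # Out of time: this host will crash when the power is pulled.
            abandoned.append(host)
            continue
        shut_down_host(host)
        stopped.append(host)
    return stopped, abandoned
```

Pass the hosts in priority order, with the systems that are most painful to recover from a crash first, so that whatever gets abandoned when the limit expires is the least costly to lose. Note that the sketch only checks the clock between hosts; a single shutdown that hangs would still need its own timeout.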
Trust your gut
If you are concerned about safety at any time, call the troubleshooting off and clear the room.
You may or may not drop the room based on a gut feeling, but regrouping outside the room in (relative) safety is prudent.
If there isn't imminent danger, you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: their mandate is to protect people, then property, but they're obviously the experts in dealing with fires, so you should do what they say!)
Until next time!