When you think of a computer professional, you probably imagine someone who sits at a terminal screen entering cryptic commands. This is part of it, but I want to start with the fundamentals. In this article, we’ll look at some ways to determine whether a computer problem is being caused by software, or by hardware.
In the digital age, we sometimes forget that power supplies get old, cables become intermittent and hard drives fail. These problems can be surprisingly difficult to track down too. The PC will appear to work fine for a while, then suddenly shut off or reboot for no obvious reason and with no warning. I know from browsing the community forums online that this can throw even the most experienced professional. They’ll spend hours reinstalling the OS and tinkering with configuration settings … when the problem could be as simple as a dirty heat sink.
Here’s the rule: Unless you have made a change in software or system settings immediately prior to the problem appearing, it’s probably the hardware.
That rule will save you a lot of grief in the long run, but it only works if you can vouch for the software. Do regular backups (e.g., in Windows, you can use “restore points”) so that you have a “known good” configuration to fall back on when you need to rule the software out. Remember, “software changes” could include malware that was installed without your knowledge, so you should regularly scan any Windows PC for viruses and trojans, especially if it accesses the Internet. Once the software checks out, don’t overlook the hardware!
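For those who like to see a rule of thumb written down as logic, here is a minimal Python sketch of it. The `Incident` record and the `first_suspect` function are names I made up for illustration, and the two yes/no questions are a simplification of the judgment call described above, not a diagnostic tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record of what is known when a PC starts misbehaving."""
    recent_software_change: bool  # install, update or settings change just before the fault
    malware_scan_clean: bool      # result of the most recent virus/trojan scan


def first_suspect(incident: Incident) -> str:
    """Apply the rule of thumb to decide where to look first."""
    # Malware counts as a software change made without your knowledge.
    if incident.recent_software_change or not incident.malware_scan_clean:
        return "software"
    return "hardware"


# A PC that started shutting off with no recent changes and a clean scan:
print(first_suspect(Incident(recent_software_change=False, malware_scan_clean=True)))
# prints: hardware
```

The answer is only a starting point for troubleshooting; a "software" result doesn't rule out a hardware fault that happened to show up at the same time.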
To lay the groundwork, let’s review what a PC needs in order to work reliably:
-A clean, reliable AC power source
-Plenty of cooling air
-Good electrical connections on all plugs, peripherals and hardware
Good AC power
You’d think this was obvious, and it should be. Unfortunately, with computers becoming ubiquitous in the typical broadcast facility, we’re just plugging them in wherever there’s an available outlet. As a result, you’ll find PCs on the same AC branch with vending machines, air conditioners and worse. This may work at first, but as the computer ages, the power supply becomes less able to smooth out noise on the AC line and you’ll start having problems.
If you’re fortunate enough to be doing a new installation, don’t pass up the opportunity to do it right to begin with. You should specify dedicated, single-outlet-per-breaker runs for all digital equipment, with big UPS units on all critical systems.
In an existing building, it’s not so easy. If your budget won’t allow hiring an electrician to install a dedicated outlet, you may have to become creative. You may actually find yourself running a heavy extension cord from a known-good source of AC in some cases. However you do it, the AC power to a PC must be clean and reliable.
Of course, “reliable” is a relative term. You must also make sure that your UPS units are working and have fresh batteries. Ironically, the power companies have gotten better at switching to backups during maintenance, which introduces a new issue: You may not even notice the lights flickering, but your PC will. The first sign of a problem might be someone shouting, “My computer just locked up!” You should also keep a known-good UPS handy as a spare for testing.
From my experience, the batteries in most UPS units last only a couple of years at most. The better ones have a “test” button that should be used regularly. Make this part of your regular maintenance schedule. When you install a UPS, make sure it’s well-ventilated, because heat is the number one killer of batteries — and on that subject in general …
Heat is your enemy
This rule applies to all electronics, and the figures shown here were the inspiration for this article.
Fig. 1: This clogged heat sink was obscured by the fan.
Most of us wouldn’t think of running a transmitter with the same air filters for years, but we tend to forget that dust collects in computers too. A layer of dust acts as an insulator, trapping heat. This can cause components to become dangerously warm, ultimately resulting in failure.
This is especially true of microprocessors, which concentrate a lot of heat into a relatively tiny core area. Without a heat sink and a fan to draw out and dissipate that waste heat, the CPU temperature can rise to a dangerous level in a matter of seconds. Modern PC processors usually include a protection circuit that will do an immediate, emergency shutdown if the processor core becomes too hot. The symptom is that the PC will run for anywhere from a few seconds to several hours, then suddenly (and usually without warning) shut off, just as if someone flipped the power switch.
When you run across this, assuming that it’s not something obvious (like a loose power cord), it’s likely heat buildup. Open the PC case and start cleaning. The fans should turn smoothly and without noise.
Dust can hide where you won’t notice it. Fig. 1 shows a typical Pentium-class heat sink that appeared to be fine at first glance. But once the fan was removed, it became obvious that it was essentially useless due to a thick blanket of dust.
Don’t assume that this only happens to big desktop PCs, either. Fig. 2 is an example from a laptop. In this case, the user kept restarting the computer over and over, and the processor was finally destroyed by overheating. The moral of the story is to train your employees: If a PC cuts off for no apparent reason, they can check for loose power plugs or a flaky UPS. But tell them not to keep pushing the power button! If the problem doesn’t clear up, they need to leave that PC alone until you can investigate.
One other tip: While you’ve got the PC opened for routine cleaning, examine the motherboard. Power it up and use an infrared thermometer to check for unusual hot spots. Nothing inside that box should show more than 140 degrees Fahrenheit (and the lower, the better). Look at the capacitors. Are they starting to bulge and look “swollen” on top? If so, replace that motherboard or plug-in card ASAP. It’s a failure waiting to happen.
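If your infrared thermometer reads in Celsius, or you want to log readings during routine cleaning, the 140-degree ceiling is easy to check with a few lines of Python. This is a minimal sketch: the function names and the sample readings are hypothetical, and only the 140 degrees Fahrenheit limit comes from the tip above.

```python
from typing import Dict, List

MAX_SAFE_F = 140.0  # ceiling quoted above, in degrees Fahrenheit


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def hot_spots(readings_f: Dict[str, float], limit_f: float = MAX_SAFE_F) -> List[str]:
    """Return the names of components reading above the limit, sorted by name."""
    return sorted(name for name, temp in readings_f.items() if temp > limit_f)


# Hypothetical readings taken with the cover off, in degrees Fahrenheit.
readings = {"CPU heat sink": 128.0, "voltage regulator": 151.5, "hard drive": 104.0}

print(f"{fahrenheit_to_celsius(MAX_SAFE_F):.1f}")  # prints: 60.0
print(hot_spots(readings))                         # prints: ['voltage regulator']
```

In other words, the limit works out to 60 degrees Celsius, and anything the script lists deserves a closer look before you put the cover back on.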
Bad connections
The symptoms in this case will be random hangs, missing or blank screens and other weirdness. Here you really will wonder if it is the hardware or software. But remember the rule that we stated above: If you haven’t changed the software or system settings, it’s most likely a hardware issue.
Fig. 2: This laptop suffered a complete processor failure due to dust on the heat sink.
The connections to, from and inside your PC must work at very high (or even UHF) frequencies. You wouldn’t dream of doing a sloppy job on a Type N connector for an STL link. Equivalent care should be used when you crimp an RJ-45 plug onto CAT-5e for your 10 or 100 Base-T network. (Note: if it’s Gigabit Ethernet, you may need to hire a pro and/or use special test equipment to build and maintain the cables.) Take your time and do it properly. Buy one of those inexpensive cable testers and use it. Wiggle the cable at both ends and make sure the indicator lights remain solidly lit.
These problems can be quite difficult to run down as well. One obvious clue is if you move the PC to a different network connection and it starts working properly. I have mixed feelings about network patch panels and wall jacks too. While they can make your installation a lot neater-looking, they also add points of failure. Speaking from experience, punch-type wall jacks are notorious for becoming intermittent over time. Every engineer should keep a 100–200-ft known-good CAT-5 cable handy for troubleshooting. If the PC seems to work properly on that cable, you’ve got a bad connection somewhere in the existing run.
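An intermittent run is easier to prove if you can put a number on it. The Python sketch below is one way to do that, under some assumptions of mine: it repeatedly tries to open a TCP connection to a machine you know is up, and reports the fraction of attempts that failed. The function names, the address and the port are all hypothetical, and a failed attempt can also mean the far end is busy or down, so treat the result as a hint rather than a cable test.

```python
import socket
import time
from typing import List


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and name lookup errors
        return False


def failure_rate(results: List[bool]) -> float:
    """Fraction of failed probes, from 0.0 (all good) to 1.0 (all failed)."""
    if not results:
        raise ValueError("no probe results to summarize")
    return results.count(False) / len(results)


def watch_link(host: str, port: int, attempts: int = 20, pause: float = 0.5) -> float:
    """Probe the same host repeatedly and return the overall failure rate."""
    results = []
    for _ in range(attempts):
        results.append(probe(host, port))
        time.sleep(pause)
    return failure_rate(results)


# Example call (hypothetical address of a server on your own network):
#   print(watch_link("192.168.1.10", 80))
```

Run it once on the suspect wall jack and once on the known-good cable. If the failure rate drops to zero on the known-good cable, the existing run is the likely culprit.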
Inside the PC, corrosion and loose connections can be a problem as well. If a RAM stick isn’t making good contact with its socket, you can get random hangs. Remove the cover on the PC and gently push on the RAM while it’s running. Because poor RAM contact can also cause random reboots (less likely than a hang, but still possible), you should always check the RAM and all motherboard connections whenever you’re cleaning the innards of a PC.
Old, worn hardware
The most common failure, of course, is with disk drives. There are some free utilities online that will help you to troubleshoot these issues; those rate a separate article of their own in the future. For now, don’t forget that peripherals and plug-ins won’t last forever, either — sound cards, network cards and USB “sticks” will all fail in time. The best idea here is to keep some known-good spares handy for troubleshooting, or to swap a known-good unit from one PC into the one that’s giving trouble.
A lot of this is just common sense and standard troubleshooting techniques, but I’ll repeat the key point in closing: if your PC hasn’t had major software upgrades or configuration changes made recently, but is shutting down for no apparent reason, crack the cover! It’s probably the hardware.
Stephen M. Poole, CBRE-AMD, CBNT, is market chief engineer at Crawford Broadcasting in Birmingham, Ala.