[Fig. 1: RAID 5]

I was copying the data from our existing mail server to new hardware. We had some kind of glitch. To this day, I’m not sure what it was or how it happened. But the old mail server hung every time I tried to reboot. The BIOS utilities said that the RAID array was corrupt and would have to be rebuilt.
That was aggravating, but not the end of the world, I thought. I pulled up our most recent backup and started restoring. A few hours later, I really started to sweat. The backup was corrupted. There were data errors in the backup!
This story has a happy ending: Having learned the hard way that one backup is never enough, I dropped to the second-oldest backup. I loaded it, started the server … and said a prayer of thanks. It worked!
Everyone repeats this mantra endlessly: “Make backups. Keep backups. Check your backups.” But every one of us also gets busy. The transmitter is hit by lightning, or we’re doing three remotes on the same day. We think, “Ah, the server has been running fine, I can put it off until tomorrow.” But tomorrow turns into the next day and before you know it, into next week.
Then disaster strikes and you’re searching for your 6-month-old backup, hoping that it’s OK.
So I’ll start by stating what should be obvious: Backing up your data is just as important as tower inspections, checking the logs, changing air filters and doing PM at your transmitter sites. Find the time to do it.
CAVEATS ABOUT RAID
I think that any critical file server should use RAID, a “Redundant Array of Independent Disks.” But several strong caveats apply.
First, be careful which type of RAID you use. More on that in a moment.
Second, become familiar with your RAID. Read the manual. One common misconception is that RAID will automagically rebuild and restore a failed array. That is not necessarily so! It depends on the capabilities of your RAID controller and how you’ve configured it.
Third, remember that RAID only protects from drive failures. If your software writes bogus data to the array, it’s still bogus. If you have a power failure, the entire array could be corrupted. Use a good uninterruptible power supply and test it regularly to prevent problems of this type.
Fourth and finally, make and keep good backups. If you think you don’t need them just because you’re using a RAID array, you are going to be badly burned eventually. What if the RAID controller itself has a hardware failure?
TYPES OF RAID
There is a great deal of information about RAID available on the Internet, starting with Wikipedia. I’ll just hit the highlights.
RAID 0 is the simplest and, for our purposes, arguably useless. All it does is stripe drives together into One Big Disk. It provides no protection for your data; in fact, a single drive failure takes out the entire array. RAID 1 is much better; it “mirrors” the data between two or more drives, in essence creating a live copy of everything for you. RAID 3 and 4 are rarely used nowadays, so I’ll skip them entirely.
Designed for larger storage arrays is RAID 5 (Fig. 1). If you buy a name-brand server from Dell or HP, their websites will allow you to choose this, as well as the controller and drive set that you want. Ask their sales rep for recommendations too.
You want a RAID setup that will report a drive failure, then rebuild the data on the replacement drive. (Again, this is not automatic; you must choose a RAID setup that does this and then configure it.) You also want a good hardware controller that handles the number crunching, to keep down the load on your main processor. Avoid so-called “software RAID” — i.e., RAID implemented in software drivers — for serious, high-volume servers.
For example, we order our Dell servers with the PERC (PowerEdge RAID Controller) installed and ready to go with matched drives. Ask around, do some Web searches. Look at customer reviews. Sites like Tiger Direct and PC Mall also have enterprise divisions that can help you make a good decision.
MY RECOMMENDATION FOR LARGE ARRAYS: RAID 5
How does RAID 5 work? Briefly, referring to Fig. 1: When you write a file to disk, it will be chopped into blocks, labeled “A,” “B” and so on in the illustration. Block A1 is written to the first drive, Block A2 to the second, and so on.
But a key feature is illustrated by the blocks labeled “Ap,” “Bp” and so on. The “p” stands for “parity”: the “check” data that the array can use to detect errors, and to reconstruct the contents of a failed drive on the fly. Note how even the parity data is distributed across the drives, so that no single drive becomes a bottleneck, or a single point of failure, for the parity writes.
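A tiny sketch may help show how parity lets the array rebuild a lost block. Real controllers do this in hardware, block by block, but the underlying operation in RAID 5 is a simple XOR; the byte strings below are made-up stand-ins for the “A1,” “A2” and “Ap” blocks in Fig. 1.

```python
# Minimal sketch of RAID 5-style parity, using the XOR operation that
# real RAID 5 controllers apply block by block. Blocks must be the
# same length; these sample values are hypothetical.

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Three data blocks ("A1", "A2", "A3" in Fig. 1 terms), one per drive:
a1 = b"HELLO, W"
a2 = b"ORLD! TH"
a3 = b"IS IS A "

# The parity block "Ap", stored on a fourth drive:
ap = xor_blocks([a1, a2, a3])

# Simulate losing the second drive, then rebuild its block from the
# surviving data plus parity:
rebuilt_a2 = xor_blocks([a1, a3, ap])
print(rebuilt_a2 == a2)  # the lost block is recovered exactly
```

The same property is why losing a second drive before the rebuild finishes is fatal: with two blocks missing, the XOR no longer has enough information to recover either one.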
What about speed? Reading from the disk is nice and quick. In fact, if one program wants the “A1” block, while a second wants the “B2” block, since they’re on separate drives, a good controller will even multitask the requests, satisfying them almost concurrently.
Writes are a different story, and this is the first inescapable tradeoff with RAID 5. The data must be split, stuffed on different drives, and the “check” data generated and stored as well. This is why you want a good, “smart” controller like Dell’s PERC; it does all of this for the operating system, saving the load on your main processor(s).
The second tradeoff is drive space. Only the useless RAID 0 allows you to use all available space. The others must use at least some of the available space to store a copy and/or corrective data. With a typical RAID 5 using four 1 terabyte drives, for example, you can expect about 3 terabytes of available space. The remainder is used for the recovery information.
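The capacity arithmetic above is easy to sketch. This is a rough model only; it ignores filesystem overhead and any space the controller reserves, and the function name is my own, not anything from a vendor tool.

```python
# Rough usable-capacity estimates for common RAID levels. A sketch
# only: real arrays lose a bit more to formatting and metadata.

def usable_tb(level, drives, drive_size_tb):
    if level == 0:
        return drives * drive_size_tb        # all space, no protection
    if level == 1:
        return drive_size_tb                 # mirrors: one drive's worth
    if level == 5:
        return (drives - 1) * drive_size_tb  # one drive's worth goes to parity
    raise ValueError("RAID level not modeled here")

print(usable_tb(5, 4, 1.0))  # four 1 TB drives in RAID 5 -> 3.0 TB usable
```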
Carnegie Mellon released a study a few years ago showing that if one drive in a RAID array fails, it is entirely possible to have a second failure before the first drive finishes rebuilding (!). This is especially true of very large arrays: the more data there is, the longer the rebuild and repair take.
The moral here is simple: If your RAID array reports a failure, treat it as an emergency. Following your manufacturer’s instructions, replace that drive ASAP. You should also keep a spare drive on hand, ready to go, for a hot swap. There’s no point in having RAID if you don’t have a known-good spare ready.
TEST THOSE BACKUPS!
This takes time, and you’ll have to figure out a way to do it that doesn’t disrupt operations. In the case of our mail server, I actually have a second machine built and ready to go. From time to time, I load the backup onto it and ensure that the server runs properly.
Be careful where you store your backups as well. Never, ever store a backup on the same drive as the original. I have been amazed at the number of people who will do this. If the drive fails, how do you know the backup won’t get trashed along with the original data?
But nowadays, it’s not just a simple matter of writing it to CD or DVD. Our mail server, for example, typically has over 50 gigabytes of data. That would span several disks. In fact, I use the test server that I just mentioned, killing two birds with one stone. Once a week, I copy all mail data from the main server to the backup over our network. I can then immediately confirm a good backup by “running” that data on the spare server.
Ideally, you’d store some backups off-site. But there’s a world of difference between storing traffic and billing data (which might fit on the aforementioned DVD), and storing hundreds of gigabytes of music, spot advertising or other data. There is no one good answer, but consider a second server. If you have a high-speed data link between your studio complex and a transmitter site, you might even put that spare/backup server at that remote site.
Just don’t overlook physics: even with a 100 Megabit data link to the transmitter site, it’s going to take hours to copy gigabytes of data. You can estimate the top limit by dividing the network speed by 10: With 100 Base-T, you can’t expect better than 10 Megabytes per second.
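The divide-by-10 rule of thumb is easy to turn into a quick estimate. The function below is my own illustration of that arithmetic, nothing more; it assumes roughly one payload byte per ten nominal bits once protocol overhead is counted.

```python
# The divide-by-10 rule of thumb as arithmetic: a 100 Mbit/s link
# moves roughly 10 megabytes per second after protocol overhead.

def transfer_hours(data_gb, link_mbps):
    """Rough copy-time estimate, assuming ~link_mbps/10 MB/s of throughput."""
    effective_mb_per_s = link_mbps / 10     # e.g. 100 Mbit/s -> ~10 MB/s
    seconds = (data_gb * 1000) / effective_mb_per_s
    return seconds / 3600

print(round(transfer_hours(50, 100), 1))   # 50 GB over 100 Base-T: ~1.4 hours
```

Scale that up to hundreds of gigabytes and you can see why an overnight copy window, or a faster link, matters.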
Finally, read up on your operating system. Do some Google searches on phrases like, “backup Windows server 2012” to see what others are doing. See what problems they’re experiencing.
Disk imaging is another idea, but that’s a topic for another article. With this technique, you take the server offline, then make a “snapshot” of the entire drive or RAID array. That’s a great way to do it, but it’s not for the faint of heart or the technically challenged.
However you do it, your critical servers must keep running. You must be able to restore them as quickly as possible when they fail. To do that, I recommend that you start with something like RAID 5 with a good hardware controller.
Next, do regular backups, following your software’s recommendations. Find out what other users recommend. For example, I joined the online forums for our audio automation (RCS NexGen) primarily so that I could swap ideas with other people who use that system.
Keep more than one backup! With our mail server, I keep the latest one and the one previous to that. I’d keep even more if I had the space.
Finally, test those backups to make sure that they will, in fact, save your bacon when and if you have a need for them.
Stephen Poole is market chief at Crawford Broadcasting in Birmingham, Ala.
Got an idea for a future column on radio IT? Write to us at firstname.lastname@example.org.