Availability, Capacity and Continuity Management Are Part of IT Service Delivery
In the June 8 and Aug. 17, 2005 issues, the author discussed IT Service Management; the series continues here. Readers can now review all articles in the series at www.rwonline.com by clicking on the "IT Service Management" tab.
In the previous installment of our series on IT Service Management, we examined the Service Level Agreement (SLA) and Financial Management portions of Service Delivery, the half of ITSM relating to facility planning.
The other three areas involved here are Availability, Capacity and Continuity Management.
While broadcast engineers tend to be familiar with these areas, there is often room for improvement in both planning and actual deployment. Additionally, these three areas should be viewed in relationship to budgeting and SLAs. Service Delivery is not designed in a vacuum, but in accordance with business models and the pre-defined fulfillment of what has been promised to the user or client and to the owners.
When you need it
Availability Management basically means having resources there when you need them – in the right place, at the right time and in good, well-understood working order. It involves having spares, scheduling maintenance at convenient times, properly assessing the usage and reliability of systems, and ensuring that downtimes and recovery times are appropriate for facility needs.
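The tradeoff between downtime and recovery time can be quantified with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The article itself does not give a formula, so treat this as a minimal sketch; the failure and repair figures are hypothetical:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a service is usable,
    given mean time between failures (MTBF) and mean time to repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A hypothetical playout server failing every 2,000 hours on average
# and taking 4 hours to restore:
uptime = availability(2000, 4)
print(f"{uptime:.4%}")                                # prints 99.8004%
print(f"{(1 - uptime) * 8760:.1f} hours down/year")   # about 17.5 hours a year
```

Even "three nines" of availability still means hours of downtime a year, which is why the SLA has to spell out what downtime is acceptable.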
Fig 1: For the user, availability agreements are a black box. For the broadcast engineer, they’re based on a hybrid of IT and broadcast solutions using both in-house and outsourced support contracts.
It also involves the procedures and training for getting services and equipment back up and running. Backup gear and procedures are useless if they cannot be deployed quickly enough to fit the users’ needs. (Ironically, recovery may be easier when short-staffed at night, when activities are focused on basic routines.)
The SLA itself helps get around the philosophical question of whether needs are “real” or “frivolous” – if they’ve been contracted, they should be delivered.
Under IT-related availability concerns, some problems appear as a steadily increasing consumption of resources, while others appear only sporadically, but devastatingly so. The former might include the increase in employees, the steady buildup of e-mail or video archives or increased Voice over IP use (such as Skype). The sporadic ones would include peer-to-peer file sharing, viruses and spyware, as well as more traditional facility issues with HVAC, telephones and electricity. A spyware program can quickly eat up all available CPU and network resources while spreading across a facility, while an effective virus can require software maintenance on every machine in the plant.
Quickly restoring essential core services while this happens is a must, though preferably in a way not exposed to the same threats. If critical broadcast systems use the same Internet connection as users, a workaround that restores broadcast capabilities separately may be necessary.
Of course simply replacing faulty equipment time and again is not a strategy. Identifying faults and migrating to problem-free solutions is the ideal. Detecting early warning signs helps prevent actual failures, and this can partially be done using software that baselines a system and tracks movements in performance over time and changes to the configuration. Knowing what processes run on a clean system can be a great help in detecting viruses and spyware.
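Knowing what runs on a clean system lends itself to a simple baseline comparison. A sketch, assuming the process lists have already been captured (the capture itself is platform-specific, e.g. via `ps` or a system-monitoring tool); all process names here are hypothetical:

```python
def new_processes(baseline: set[str], snapshot: set[str]) -> set[str]:
    """Return processes running now that were absent from the clean-system baseline."""
    return snapshot - baseline

# Hypothetical baseline recorded on a freshly installed, known-clean machine:
baseline = {"init", "sshd", "playout", "automation"}

# Hypothetical snapshot taken during routine maintenance:
snapshot = {"init", "sshd", "playout", "automation", "spyagent"}

suspects = new_processes(baseline, snapshot)
print(sorted(suspects))   # prints ['spyagent'] -- investigate anything unexpected
```

The same diff-against-baseline idea extends to configuration files, installed software and open network ports.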
Additionally, providing appropriate regular maintenance and staying on top of developments in security, hardware, software and firmware for the different systems are important but time-consuming tasks.
ITSM includes steady improvements in resources and procedures, including efficiency. However, changes and upgrades should not be done just to increase version numbers – they should fix known problems or prevent reasonably serious potential problems.
Some Availability issues relate to policy or politics – how resources are scheduled, appropriate computer use, security in the workplace, access from home, ease of system use and whether employees can be easily held accountable for related actions.
With many issues, it is easier to handle problems through technical means than behavioral change (system use will be greatest at 9 a.m. and 1 p.m. unless you radically stagger schedules). But when problems threaten the business plan through unsupportable costs, security risks or system breakdowns, a more general or better thought-out approach is required.
With security issues, a good solution is usually one the user follows automatically. The Japanese term for this is “poka-yoke”: a procedure or device designed so that it cannot be used the wrong way.
Capacity Management involves planning performance and throughput and making sure equipment, software, data and signal paths can handle the strains put on them.
Without baseline measurements and simulated loads, capacity management becomes a dangerous guessing game. A system load with five users cannot simply be extrapolated to 100 users without testing. A “gigabit” network will not provide a gigabit’s worth of actual data transfer, poorly maintained software will bog down a screaming-fast computer, and what is fast for one user may not be enough for another with more demanding chores.
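The danger of linear extrapolation can be made concrete with numbers. All of the throughput figures below are hypothetical, but the shape is typical: per-user throughput collapses as the system saturates.

```python
# Hypothetical aggregate throughput (MB/s) measured at increasing user counts:
measured = {5: 50, 25: 210, 50: 330, 100: 360}

naive = measured[5] / 5 * 100       # linear extrapolation from the 5-user test
actual = measured[100]
print(f"Extrapolated from 5 users: {naive:.0f} MB/s")   # prints 1000 MB/s
print(f"Measured at 100 users:     {actual} MB/s")      # prints 360 MB/s

# Per-user throughput shrinks as contention grows:
for users, mbps in sorted(measured.items()):
    print(f"{users:3d} users -> {mbps / users:5.1f} MB/s each")
```

The five-user test suggests nearly three times the capacity the loaded system actually delivers, which is exactly why testing has to continue up to realistic user counts.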
Depending on the task, system bottlenecks are usually network bandwidth, hard-drive access speed, memory and CPU speed, in that order. But in one case we discovered a poorly designed driver for a disk controller copying audio from disk at 1/100th the available network speed. In another instance, we found out that our Internet provider would “borrow” network bandwidth if we didn’t monitor it closely and complain enough.
Unfortunately, systems are frequently purchased without any effective performance evaluation, or in an unrealistic test lab scenario. Several approaches exist for providing more quantitative measures for the facility.
There are load simulators to stress-test resources such as Web servers, databases, graphics and network traffic. A virtual machine environment such as VMware may allow you to run a number of user sessions from the same machine. A conference computer rental service can be an affordable way to bring in 50 user systems for a few days of testing (especially if you book during conference low season).
But the central points are: 1) system effects can change non-linearly or simply break down with increased usage, 2) bottlenecks can occur where not expected, 3) basic system parameters do not adequately describe how it will perform with a particular application, so 4) keep testing until you understand how your systems behave and can pinpoint important metrics.
A fifth important point would be that systems and facilities change over time, so keep track of changes. This will be discussed in more detail with Configuration and Change Management in the next installment on Service Support; but on the Service Delivery side, Capacity Planning needs to track and anticipate growth in usage and other trends.
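Tracking a baselined metric over time can be as simple as a tolerance check against the recorded baseline. A sketch with hypothetical throughput samples; the 20 percent tolerance is an assumption, not a standard:

```python
def drifted(baseline: float, current: float, tolerance: float = 0.20) -> bool:
    """Flag a metric that has moved more than `tolerance` (fractional) from baseline."""
    return abs(current - baseline) / baseline > tolerance

# Hypothetical weekly samples of file-server read throughput (MB/s),
# against a baseline of 95 MB/s recorded at deployment:
baseline = 95.0
for week, mbps in enumerate([93.0, 90.0, 88.0, 70.0], start=1):
    if drifted(baseline, mbps):
        print(f"week {week}: {mbps} MB/s -- investigate before users notice")
```

Only week 4 trips the check here, but the slide in weeks 2 and 3 is the early warning sign; plotting the trend catches what a single threshold misses.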
Some of the Availability concerns mentioned above need to be taken into account, such as increased user storage and network usage from inappropriate use and spyware. While an administrative approach may eventually produce a better solution, not all increases in usage are bad or malicious – some increase productivity or company morale or further another purpose. In some instances simply increasing capacity is cost-effective no matter what the cause.
Occasionally, a good capacity strategy will not work for all users or systems, so different classes have to be created (even an exception of “one”). Tradeoffs in ease of administration or application, cost, effectiveness, streamlining policy and circumventing politics are just some of the considerations. Excess capacity can be as much of a burden to an organization as too little. Finding the right balance is the tricky part, but don’t get bogged down by the perfect solution. ITSM is a constantly evolving, steadily improving process.
Service Continuity Management means coping with emergencies and disasters, and maintaining or recovering the required broadcast capability.
This need not be fully operational – prior risk assessment helps us to decide the level of recovery required according to the severity of the disaster. Fortunately, improvements in IT systems help make for more robust and elaborate backup systems on the cheap. The big caveat is that these systems still must be carefully constructed, with appropriate procedures and training put in place, and practice run-throughs and final emergency procedures carried out.
Since telecommunications plays a huge role in the modern broadcast facility, it’s wise to assess your telecom providers’ emergency backup resources and investigate alternative transmission paths. But if the alternatives all go through the same data center or suffer from the same outages, the redundancy is not complete.
Normally dependable telecom systems easily go haywire during a large-scale disaster, with telephone lines blocked, data centers shut down and other problems. But for smaller emergencies, telecommuting via xDSL or WiFi, deploying portable recording systems and moving to stripped-down off-site facilities with Internet and low bitrate satellite connections can be sufficient and affordable.
Fig 2: Having simple diagrams of roles, procedures, systems and timelines can make continuity planning much easier and recovery much simpler under duress.
Emergency backup systems are often seen as unused spares and cannibalized accordingly, so taking regular inventory is important to avoid surprises. Maintenance can go wrong under any circumstances; during disasters this is doubly true. Basics such as transportation, procurement, shipping and communication can function poorly if at all, so having simple solutions of last resort that rely on as few external requirements as possible is ideal.
Discussing backup plans with those who have been through disasters helps to identify non-obvious but important issues. Our computer facilities at UCLA went essentially undamaged during the 1994 L.A. earthquake, but housed next to chemistry labs that sprung a few gas leaks, we were unable to enter for days. During the first of four Florida hurricanes in 2004, NASA’s Kennedy Space Center had its Web and e-mail servers under a sheet of plastic after losing the roof, with no backups outside the storm zone.
Every UPS electricity backup has a time limit, and a plan is needed for what happens when that final moment approaches. Financial systems need low-tech contingencies too, as employees still want to get paid during a disaster even if the system is not specifically broadcast-related.
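Estimating that final moment is, to a first approximation, simple arithmetic: battery energy divided by load. Real UPS runtime curves are nonlinear and batteries age, so treat this as a rough sketch with hypothetical figures and a hypothetical inverter-efficiency assumption:

```python
def ups_runtime_minutes(battery_wh: float, load_w: float,
                        efficiency: float = 0.9) -> float:
    """Rough UPS runtime estimate: usable battery energy divided by load.
    `efficiency` approximates inverter and conversion losses (assumed 90%)."""
    return battery_wh * efficiency / load_w * 60

# A hypothetical 1,440 Wh battery bank carrying a 600 W equipment rack:
print(f"{ups_runtime_minutes(1440, 600):.0f} minutes")   # prints 130 minutes
```

Roughly two hours of runtime sounds generous until you ask what the plan is at minute 115: an orderly shutdown script, a generator transfer, or a scramble.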
For a media organization, resources such as movie archives or sound libraries might be as valuable as any computer system, while keeping contact with advertisers and reporters and handling public relations might be critical for business survival. A Business Impact Assessment takes into account the value of different assets and activities to the particular organization, and helps to rate priorities on continuity and recovery plans.
Each broadcast facility will come up with its own priorities for broadcast continuity, but one important factor should be kept in mind: a surprising number of businesses never open their doors again after a serious disaster. If that possibility is not part of the organization’s risk acceptance, a Continuity Plan that provides a basic guarantee of survival needs to be put into place.
As a final note, international flavors can be important in planning for some global broadcast organizations. Working around differences in holidays, work hours and habits, approaches towards solutions, language issues and time zones can be both frustrating and entertaining. While younger people across the globe more and more adopt similar habits and speak more English, it is doubtful that Spain will abandon the afternoon siesta anytime soon, as just one tiny example.