With the advent of rolling blackouts in California, the possibility of power shortages in New York City, and the likelihood of brownouts in other areas, system managers are prompted to look closely at the status of their emergency backup power, battery back-up systems, and system backup and restore devices. This touches on the larger question that many system managers are also asking, "What failures could my system be subject to, and how will the system react?"
Power Loss and Orderly Shutdown
Windows based computers require an orderly system shutdown during which cached and in-memory system data is written to the disk. If power is turned off before the shutdown is started or completed, data will be lost, and the system may not be able to start up without manual intervention and possible restoration of data from a backup tape. Regardless of the operating system, corruption of real-time application databases is likely in any sudden power-off of a computer. This makes the use of an uninterruptible power supply (UPS) critical. For a workstation computer, a $200 UPS can provide 15 minutes of computer operating time in which to save and print your current work.
For servers or single-computer systems in continuous unattended operation, a $300 intelligent UPS can signal the operating system to perform an orderly shutdown well before the battery power runs out.
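The decision an intelligent UPS setup has to make can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `UpsStatus` fields, the `should_shut_down` function and the two-minute safety margin are hypothetical names and values chosen for the example, not the interface of any particular UPS product.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of what an intelligent UPS might report."""
    on_battery: bool            # True when utility power is out
    runtime_remaining_s: float  # estimated seconds of battery power left


def should_shut_down(status: UpsStatus,
                     shutdown_duration_s: float,
                     safety_margin_s: float = 120.0) -> bool:
    """Return True when an orderly shutdown should start now.

    The shutdown starts while the battery can still cover the time the
    operating system needs to flush its data, plus an assumed margin.
    """
    if not status.on_battery:
        return False  # utility power is present, nothing to do
    return status.runtime_remaining_s <= shutdown_duration_s + safety_margin_s


if __name__ == "__main__":
    # A server that needs about 3 minutes to shut down cleanly.
    for remaining in (900.0, 400.0, 300.0):
        status = UpsStatus(on_battery=True, runtime_remaining_s=remaining)
        print(remaining, should_shut_down(status, shutdown_duration_s=180.0))
```

The value that matters most here is the measured shutdown time of the real system. It tends to grow as applications and databases are added, which is one reason a shutdown that worked at installation can fail later.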
A more expensive UPS can provide up to 4 hours of full battery-based operation, allowing the system to keep running even if the power is out for several hours.
Will Your System Pass the Test?
How will your system react in the event of a scheduled or unscheduled blackout or brownout? When was it last tested?
Whether or not you are in an area that may be subjected to power shortages, it's worth doing a check of any critical system and its backup devices if you haven't done so lately.
Recent experience with several facilities revealed these shortcomings:
In short, trying a simple automated shutdown test - which the systems were intended to perform - took critical functions or the entire system offline. These systems were all installed correctly, had passed inspection and were initially functioning properly. None of these failures should have happened.
Such failures point out the need to periodically verify that emergency shutdown and restart procedures actually still function the way they were intended to.
Strengthening Your System Against Other Failures
Power outages are just one type of failure that can affect a system. Accidental damage and component failures can also take a system offline, unless the system is designed to withstand those kinds of failures. The best time to address issues of fault tolerance is in the design stage, before the system is specified and purchased. But there are things that you can do to improve a system after it is installed. This is especially true for small and medium sized systems, which usually are good candidates for cost-effective enhancements.
Redundancy
The key ingredient to tolerating or recovering from failures is redundancy. To be able to tolerate a component failure and keep on running, a system must have an alternate component that can take over the functionality of the failed component. To restore a system to use after a data storage component failure (such as a hard drive crash), there must be another data storage component with an alternate copy of the data to be restored.
When redundant components are hot swappable, repairs can be made without stopping the system. You simply unplug the failed component and replace it with a new one.
The PC components that most commonly fail are power supplies and hard disks. Redundant solutions exist for both of these components.
Dual Power Supplies
A number of computer stores now sell computer cases for under $500 that contain dual hot swappable power supplies. Under normal operations each power supply provides half the power needed. When one power supply fails the other one takes over by switching to the full level of power. The failed power supply can be replaced without having to power down the system.
Data Recovery
In the event of a complete loss of hard drives, or a loss of the entire computer, backup tapes let you restore the system to its last saved state. For unattended automated full backups (operating system and data), you need a tape capacity that is greater than the total hard drive capacity of the system (not counting duplicate RAID drives). The tape drives and tapes of just a few years ago had much smaller capacities than those available now. If your current tape backup system isn't correctly sized, now is a good time to update it.
Backup Software
Three excellent backup products are Backup Exec by VERITAS Software (formerly Seagate Backup Exec), Networker by Legato Systems, Inc., and TapeWare by Yosemite Technologies. These products offer options for backing up open files and databases (most other backup software cannot), which means that your system remains entirely online and functioning during the backup process. That is an absolute requirement for security systems that log historical data. It also means that you can run scheduled unattended weekly full backups and daily incremental backups and change the backup tape once per week for convenience. The products also offer point-in-time recovery options that allow you to restore a system to its exact condition at an earlier point in time. Both companies have versions for UNIX, Windows NT4, Windows 2000 and Novell NetWare.
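To see how a weekly full backup and daily incremental backups combine during a restore, consider the following Python sketch. It shows only the general full-plus-incremental idea and makes no claim about how the products named above implement their recovery options; the `restore_chain` function and the sample schedule are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def restore_chain(backups: Dict[date, str], target: date) -> List[date]:
    """Return the backup dates to restore, in order, to reach `target`.

    `backups` maps each backup date to "full" or "incremental".
    The chain is the latest full backup on or before the target date,
    followed by every incremental taken after it, up to the target.
    """
    fulls = [d for d, kind in backups.items() if kind == "full" and d <= target]
    if not fulls:
        raise ValueError("no full backup on or before the target date")
    base = max(fulls)
    incrementals = sorted(
        d for d, kind in backups.items()
        if kind == "incremental" and base < d <= target
    )
    return [base] + incrementals


if __name__ == "__main__":
    # Assumed schedule: a full backup every seventh day, incrementals between.
    start = date(2001, 6, 3)
    schedule = {
        start + timedelta(days=n): ("full" if n % 7 == 0 else "incremental")
        for n in range(14)
    }
    for step in restore_chain(schedule, start + timedelta(days=10)):
        print(step, schedule[step])
```

Every backup in the chain has to be readable for the restore to succeed, so a test restore should exercise the incremental backups as well as the full backup.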
Pricing for these backup products for a single Windows 2000 server (such as a Dell PowerEdge 6400) with the open file option and point-in-time recovery options ranges approximately from $1,000 to $2,500 depending upon the feature set and options. At the time of publication Legato's product pricing was under review and not available for publication.
If you already have a backup product for Windows or NetWare that cannot back up open files, chances are that you can use St. Bernard Software's Open File Manager to enable open file backup.
Backup software has become much more capable and feature-rich in recent years. This makes it important to closely examine the features of backup products to make sure you get the best match to your system and your requirements.
Real-time Backup
Recently Legato released a significant breakthrough in backup technology, which they call Celestra. Application data is backed up live as it is changed. Currently this product is available only for Solaris and HP-UX operating systems. Watch for more announcements about Celestra from Legato in the near future. There are other real-time backup solutions (such as PowerQuest's DataKeeper and StorActive's LiveBackup) but their backup action is triggered when a file is closed. That means that the open files and databases of security applications would not get backed up as they change. Such products are suitable for non-critical systems that can be backed up when the systems are not in use.
RAID
RAID technology (Redundant Array of Inexpensive Disks) is a method of using several hard disk drives to provide fault tolerance in the event that one or more disks fail. There are about a dozen different ways of utilizing the multiple drives, each one called a RAID level. In RAID 1 two hard disks of equal capacity duplicate or mirror each other's contents. One disk continuously and automatically backs up the other disk. This method is also known as disk mirroring when one disk controller is used or disk duplexing when two independent hard-disk controllers are used.
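The mirroring idea behind RAID 1 can be illustrated with a toy model in Python. This is a sketch for explanation only: the `MirroredPair` class is hypothetical, it keeps data in memory rather than on disks, and real controllers handle many details (rebuild speed, write ordering, error detection) that are ignored here.

```python
from typing import Dict, List


class MirroredPair:
    """Toy model of RAID 1: every write goes to both disks."""

    def __init__(self) -> None:
        # One dictionary per disk, mapping block number to data.
        self.disks: List[Dict[int, bytes]] = [{}, {}]
        self.failed: List[bool] = [False, False]

    def _healthy(self) -> List[int]:
        return [i for i in (0, 1) if not self.failed[i]]

    def write(self, block: int, data: bytes) -> None:
        healthy = self._healthy()
        if not healthy:
            raise IOError("both disks have failed")
        for i in healthy:
            self.disks[i][block] = data  # same data on every healthy disk

    def read(self, block: int) -> bytes:
        healthy = self._healthy()
        if not healthy:
            raise IOError("both disks have failed")
        return self.disks[healthy[0]][block]

    def fail(self, index: int) -> None:
        """Simulate a disk crash: the disk and its contents are lost."""
        self.failed[index] = True
        self.disks[index] = {}

    def replace(self, index: int) -> None:
        """Install a blank disk and rebuild it from the surviving mirror."""
        survivor = 1 - index
        if self.failed[survivor]:
            raise IOError("no surviving disk to rebuild from")
        self.disks[index] = dict(self.disks[survivor])
        self.failed[index] = False


if __name__ == "__main__":
    pair = MirroredPair()
    pair.write(0, b"event log, part 1")
    pair.fail(0)                      # first disk crashes
    print(pair.read(0))               # data is still available
    pair.replace(0)                   # swap in a new disk and rebuild
```

Between `fail` and `replace` the pair is running on a single disk, which is why a failed mirror should be replaced promptly.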
Windows NT and Windows 2000 have RAID capabilities built in to the operating system as an option of the NT file system. No RAID disk controllers are required; you just need the right number of disk drives for the level of RAID that you want to use. Software based RAID may not be appropriate for systems that are processor-intensive, since the RAID processing itself will consume processor cycles and could lower overall system performance.
Not long ago RAID controllers required SCSI drives (Small Computer System Interface, pronounced "skuzzy"), which are often more than twice the price of an IDE drive (Integrated Drive Electronics) of equivalent size. Now Radio Shack Online and other online and retail sources sell Arco Computer Products RAID IDE controllers for as low as $230 (Arco's DupliDisk II).
Computers Designed for 24 Hour Operation
Dell, Gateway, Compaq, IBM and most other vendors make server-quality computers with multiple cooling fans, dual power supplies, and RAID-controlled SCSI hard disks all of which are hot swappable. These computers contain heat sensors and software that can warn you by beeper or fax of a rising computer temperature or of a failure of one of the redundant components. At the time of this writing single CPU systems that include a tape backup unit and Backup Exec software start at around $6,000.
System Availability and Recovery
Fault tolerant systems can be placed into two categories: passively redundant and actively redundant. Passively redundant systems allow full recovery from failures, and are called High Availability systems. Actively redundant systems keep operating in spite of failures, and are called Assured Availability systems. Systems with geographically distributed redundant components can survive the complete destruction of one or more geographical locations, and are called Disaster Tolerant systems.
Figure 1 below shows the relative ranking of fault tolerant systems in terms of operational availability and cost. The sidebar Fault Tolerant Systems Technology presents a spectrum of fault tolerance and system recovery configurations.
Figure 1. The relative ranking of fault-tolerant systems in terms of operational availability and cost.
High Availability Systems
In a passively redundant system the redundant components are not actively engaged in the computing tasks, and there are no CPU processors running tasks in parallel. Redundant components wait until they are activated or signaled to take over for a failed component. This transition is called the failover (failing over to the new component). After the primary component is repaired or replaced the system is switched back from the redundant component to the primary component. This transition is called the failback.
The failover and failback are both noticeable to the users, whose applications have to be automatically or manually restarted after either transition with a consequent loss of any unsaved data. Failback often requires a reboot of the computers. Acceptable failover time is usually measured in minutes. Failback time is usually longer than failover time, typically minutes or even hours, because the systems must be restarted, and any databases synchronized between the failover system and the primary system before the failback is complete.
In the failed over state the system is vulnerable to a second failure, and must be repaired and failed back in order to restore full system protection.
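The failover and failback transitions described above can be summarized as a small state machine. The Python sketch below is illustrative only; the `StandbyPair` class, its method names and its three states are hypothetical simplifications of a primary server with one standby.

```python
from enum import Enum


class State(Enum):
    PRIMARY_ACTIVE = "primary running, standby waiting (protected)"
    FAILED_OVER = "standby running, primary down (unprotected)"
    DOWN = "both servers down"


class StandbyPair:
    """Toy model of a passively redundant primary/standby pair."""

    def __init__(self) -> None:
        self.state = State.PRIMARY_ACTIVE

    def active_server_failed(self) -> None:
        if self.state is State.PRIMARY_ACTIVE:
            # Failover: the standby takes over for the failed primary.
            self.state = State.FAILED_OVER
        elif self.state is State.FAILED_OVER:
            # Second failure before repair: nothing is left to take over.
            self.state = State.DOWN

    def repair_and_fail_back(self) -> None:
        if self.state is not State.FAILED_OVER:
            raise RuntimeError("failback only applies in the failed over state")
        # Failback: primary repaired, data resynchronized, protection restored.
        self.state = State.PRIMARY_ACTIVE


if __name__ == "__main__":
    pair = StandbyPair()
    pair.active_server_failed()
    print(pair.state.value)
    pair.repair_and_fail_back()
    print(pair.state.value)
```

The model makes the exposure window explicit: between a failover and the following failback, a single further failure takes the system down.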
Backup servers (manual startup), standby servers (manual or automatic failover) and failover clusters (automatic failover) are all examples of passively redundant systems. Some failover cluster technologies allow up to 32 servers to be employed, thereby eliminating single points of failure. Passively redundant systems are known as high availability systems.
Note that if active redundancy were used with selected components in a system, failure of one of the redundant components would not cause a system failover. For example, if RAID disk drives and dual power supplies were used, failure of a single disk or power supply would not cause a system failover. Only a failure of multiple disks or both power supplies, or failure of a non-redundant component, would cause a system failover.
Assured Availability Systems
In an actively redundant system parallel processing is utilized so that the computing tasks are being simultaneously performed by two or more components. A failed component is detected instantly (usually within milliseconds) and is immediately isolated from the rest of the system before data can become corrupted. This type of system requires two or more processing CPUs and memory in each half of the system in addition to other redundant components.
All components are hot swappable, and can be replaced without stopping or restarting the system. Thus component failures and repairs are completely invisible to the system users. This is called an assured availability system, because it continues to operate regardless of component failure and repair. Assured availability systems are sometimes also called continuous availability systems to stress the fact that there is no failover or failback lapse in system operations.
Note that until a failed component is repaired, the system is vulnerable to a second similar component failure, and must have the component repaired to restore full system protection.
Disaster Tolerant Systems
Placing all the computing elements in the same room (single-site configuration) leaves the system vulnerable to environmental disasters. A disaster tolerant system is one that can continue to provide application and data availability in spite of a major catastrophe such as a fire, earthquake, or bombing. Local disaster tolerant systems place redundant parts of the system in different locations within a building or campus (split-site configuration). A remote disaster tolerant system places redundant parts of the system one or more kilometers apart (remote-site configuration).
To keep all users on line with a disaster tolerant system, special network issues must be addressed since a disaster will typically disable a portion of the connecting network.
Uptime Guarantees
Manufacturers do not usually make high-end systems available directly to end customers, but provide them through qualified vendors to ensure that systems are installed, configured and can be maintained correctly. Often the vendors offer an option for guaranteed levels of service. Compaq, for example, has service plans that offer a portion of your money back if they don't meet the uptime guarantee requirements.
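When evaluating an uptime guarantee, it helps to convert the percentage into hours of downtime per year. The short Python sketch below does that arithmetic. It assumes a 365-day year and counts all downtime against the guarantee; a real service plan may define uptime differently, for example by excluding scheduled maintenance. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Convert an uptime percentage into allowed downtime hours per year."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for uptime in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(uptime)
        print(f"{uptime}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99 percent uptime still allows 87.6 hours of downtime per year, while 99.9 percent allows about 8.76 hours.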
For details on how to specify uptime, and what uptime percentages mean in terms of actual downtime hours, see the sidebar Putting Downtime Into Perspective.
Where Does Your System Stand?
How much fault tolerance your system requires obviously depends upon the intended use of the system. The purpose of this article is to introduce the range of fault tolerance technologies available; to point out some cost-effective solutions that are often overlooked for small-end systems; and to highlight the importance of knowing where your existing system stands.
The vendors and manufacturers mentioned in this article do not comprise a complete list of technology sources. They are included here as examples of specific technologies, and as a starting point for your own research into the subject of fault tolerant computer offerings.
Due to the complexities involved and the variations between requirements, many of the vendors stress that the technology factors and cost factors involved cannot be accurately determined until a close examination is made of the system requirements. It is also important to know that vendor offerings can differ significantly; there are no apples-to-apples comparisons that can be made between products.
It is also important to pay attention to the terminology used. If the vendor material you are reviewing seems to be confusing, or seems to conflict with what you have read previously, check the vendor's use of terminology closely. For a head start on defining special terms see the sidebar The Terminology Maze.
Designing, implementing and sustaining a disaster-tolerant solution requires specific knowledge of your security application and in-depth dialog with solution providers. A full solution requires the establishment of the true requirements, and the selection of the appropriate storage subsystem, host CPUs, operating system, and networking between systems or sites as well as the determination of the appropriate fault-tolerant technology.
Copyright © 2003 Cygnus Business Media. All rights reserved.