Disasters do happen
Although terror attacks, flu pandemics and gas explosions grab the headlines, the cause of most disasters is far more humdrum. Here, a series of pie charts based on 998 actual disasters in the UK illustrates the most common reasons customers invoke SunGard, while a commentary explains the trends behind the often surprising statistics.
For many, the concept of a disaster revolves around smoke-and-rubble scenarios: the jumbo jet crashing into the building, flood, fire or bomb. However, in SunGard’s experience of supporting almost 1,000 disasters in the UK, the reality is far more mundane. An awareness of what causes these disasters is useful in determining preventative measures to avoid such events. The chart below shows a breakdown of all the invocations between 1995 and 2007:

However, this profile changes if the invocations are divided into those for IT recovery, and those for Workplace/Office Recovery. The chart below shows all 634 technology invocations between 1999 and September 2007.

The average Technology invocation lasted 22 days. This duration reflects the fact that once companies are running production services on a disaster recovery system, it takes time to arrange a convenient window to migrate back onto the production system.
The chart below shows all 211 workplace/office recovery invocations between 1999 and September 2007.

The average workplace invocation lasted seven days, with companies keen to return to their normal environment as quickly as possible in order to reduce the disruption to staff.
Hardware failures generally occur during planned upgrades or, in smaller businesses, when personnel are trying to apply some newly learned trick or shortcut! The introduction of change management processes has reduced the number of invocations that occur during an upgrade, and these have been relatively infrequent since 2002. Many technology invocations of our services are due to inadequacies on the part of the maintenance company - usually, delays in identifying and then sourcing the defective part following a hardware failure. Although hardware is becoming more fault resilient, continued distributed processing combined with greater intolerance of downtime has kept the number of invocations for this type of failure at a high level, though the trend is slightly down.
Data corruption used to be the second likeliest cause of an invocation, but the trend has been clearly down since 2002. Companies should be aware, though, that while data replication provides protection in the event of hardware failure and data loss, corruption in the primary database is also replicated, so a robust backup solution is still required.
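The point about replication can be illustrated with a toy model. The sketch below is a simplified illustration, not a description of any particular product: the `ReplicatedStore` class, its method names and the sample record are all hypothetical. It assumes replication copies every write to the replica immediately, while a backup is a separate point-in-time snapshot.

```python
import copy


class ReplicatedStore:
    """Hypothetical primary/replica pair with a separate list of backups."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.backups = []  # point-in-time snapshots of the primary

    def write(self, key, value):
        # Assumption: replication mirrors every write, good or bad.
        self.primary[key] = value
        self.replica[key] = value

    def take_backup(self):
        # A backup captures the primary as it was at this moment.
        self.backups.append(copy.deepcopy(self.primary))

    def restore_latest_backup(self):
        # Roll both copies back to the most recent snapshot.
        self.primary = copy.deepcopy(self.backups[-1])
        self.replica = copy.deepcopy(self.backups[-1])


store = ReplicatedStore()
store.write("account-42", "balance=100")
store.take_backup()

# Simulated corruption: the bad write reaches the replica as well.
store.write("account-42", "###corrupt###")
replica_is_corrupt = store.replica["account-42"] == "###corrupt###"

# Only the earlier snapshot still holds the good value.
store.restore_latest_backup()
restored_value = store.primary["account-42"]
```

In this model the replica protects against losing the primary, but it faithfully mirrors the corrupted record; only the earlier snapshot allows the good value to be recovered.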
While fires can destroy entire buildings, more frequently the fire occurs in isolated areas of the building where it destroys structured cabling or power control systems. Not only does the fire itself cause damage, but so does the large quantity of water needed to extinguish the blaze.
Floods can affect entire buildings. However, more commonly they cause failures in only part of the building. Coolant leaks in air-conditioning units and mainframe chillers, and ruptured pipes, may cause minimal damage to the building fabric, but they can cause numerous problems for water-intolerant electrical gear and computer systems.
Water will always find its way down through the building to the basement. There it could disrupt power distribution systems or encounter the nerve centre of most companies - the telephone switch. Without the ability to use the phone, most organisations are totally isolated from their customers and suppliers. Recent examples of flooding have been caused by a failure to clear leaves from gutters and downpipes, resulting in the build-up and sudden ingress of rainwater; by leaking water vending machines; and by joint failures in high-pressure water mist fire suppression systems.
Severe weather rarely affects Europe and invocations for this reason have generally occurred in the USA. However, high winds and electrical storms have caused outages in computer systems even when the building is generally unaffected. Lightning strikes have affected microwave links, while high winds have been known to blow condensing units off roofs, leading to air-conditioning failures (categorised as Environmental Failure).
Power failures are a common and increasing cause of systems outage and the most common cause of workplace invocation. Fire, floods and storms have already been mentioned, but these frequently cause systems outages due to their effects on the power supply to the systems concerned. Computer systems can be protected from the effects of power outage through the use of Uninterruptible Power Supplies (UPS) and generators. However, in our experience, the complex control gear associated with these protective measures can sometimes fail and result in an outage.
Building outages are rare, but they create the most serious types of problem and will quickly highlight any deficiencies in business continuity planning. Many of these outages have resulted from criminal or terrorist activity either damaging the building beyond repair or preventing access to the building (for example when it is treated as a crime scene). From the planning point of view, it must be assumed that nothing can be salvaged in time to resume normal operations, so solutions will be required for everything from computer systems to stationery and critical files and documents.
When communications are considered, research and our experience suggest most networks fail every week. A fire may not affect your company’s building; however, if it disrupts your local telephone exchange or electricity substation, then the effects are disastrous and widespread. The trend for communications failure is increasing.
Disasters also fall into the ‘other’ category. Strike action may prevent staff accessing the building, disaffected former employees could physically damage systems or spread software viruses, and building moves can go badly wrong.
Environmental failures also affect systems, either directly, when the air-conditioning or chiller units fail and the systems overheat, or indirectly after a period of time. One customer suffered a complete systems failure due to salt corrosion in the processors; five years earlier an engineer had connected a fan incorrectly in an air handling unit and, rather than sucking air out of the building, it was blowing in moist air from the nearby sea!
The need for solutions can be the result of a disaster many miles away. For example, one travel company had to rapidly establish a new call centre on a temporary basis to accommodate the need to rebook its customers to new locations at short notice after the original resort was shut down following a diphtheria outbreak. They were able to invoke their recovery contract and overcame the difficulty.
However, the nature of recovery has changed: companies now experience as many problems associated with sudden and unexpected growth as they do outages. While these situations do not allow a customer to invoke the service, SunGard does allow “additional processing”. For example, a building society was suddenly forced to increase the capacity of its call centre temporarily when it de-mutualised and floated on the stock market. Another company did not anticipate how successful an advertising campaign would be and was forced to expand its call centre temporarily to accommodate the increased sales. Another customer, an outsourced call centre, was required to implement a new contract before a planned expansion had been completed.
All these companies were able to maintain their availability to service their customers by utilising additional processing capabilities. They use the same resources as they would for a recovery, but we treat this use as we would a test: if another customer requires the backup capabilities, that customer gets priority and the company using additional processing is required to vacate within an hour.
The potential causes of disasters are numerous and united in a common truth: the disaster will always be the one that is unexpected.
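The priority rule for additional processing can be sketched as a small model. This is an illustration only, not a description of SunGard's actual systems: the `RecoverySuite` class, the request kinds and the status strings are hypothetical. The one-hour notice comes from the text; the handling of every other case (for example, two competing recovery requests) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple

VACATE_NOTICE = timedelta(hours=1)  # "required to vacate within an hour"


@dataclass
class Occupancy:
    customer: str
    kind: str  # "recovery" or "additional_processing" (hypothetical labels)
    vacate_by: Optional[datetime] = None


class RecoverySuite:
    """Hypothetical shared facility used for recovery and additional processing."""

    def __init__(self) -> None:
        self.occupant: Optional[Occupancy] = None

    def request(self, customer: str, kind: str,
                now: datetime) -> Tuple[str, Optional[datetime]]:
        if self.occupant is None:
            self.occupant = Occupancy(customer, kind)
            return ("granted", None)
        if kind == "recovery" and self.occupant.kind == "additional_processing":
            # A real invocation takes priority: the current user gets notice.
            self.occupant.vacate_by = now + VACATE_NOTICE
            return ("displaces_occupant", self.occupant.vacate_by)
        # Assumption: any other clash is simply refused in this sketch.
        return ("unavailable", None)


suite = RecoverySuite()
now = datetime(2007, 9, 1, 9, 0)
first = suite.request("Retailer", "additional_processing", now)
second = suite.request("Bank", "recovery", now)
third = suite.request("Insurer", "additional_processing", now)
```

Here the retailer's additional processing is granted while the suite is free, the bank's recovery invocation gives the retailer until 10:00 to vacate, and the insurer's additional processing request is refused because the suite is in use.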