3 Phases of Disaster Recovery
Every day I wake up and read or watch the news. It’s easy to find chaos in the world. It seems as though there is often an earthquake, tornado, flood, or some other disaster that could destroy my IT infrastructure. There also appears to me to be an increase in natural disasters and other events that have the potential to impact the business operations of many organizations, regardless of size. I believe that establishing a good offense in advance can help prepare your organization if it is faced with a disaster. The key to this approach is leveling the playing field by helping your organization understand Disaster Recovery (DR). There are a number of places on the internet to obtain a definition of DR, but I want to share what it means to me.
I believe that in order to have a comprehensive strategy that eliminates the element of surprise and produces a viable and reliable recovery methodology, you need to have an offensive, three-phased approach to DR. If you view Disaster Recovery as a defensive strategy, you may have already lost the battle.
The first phase of this approach is implementing “local fault tolerance,” which ensures there are no single points of failure within your infrastructure and primary data center. Before you think you have this covered, take a step back and assess this first phase of the offensive approach. It’s easy to get comfortable and confident by thinking that something as simple as having two power supplies instead of one, or clustered servers on a virtual machine, is sufficient for local fault tolerance. An end-to-end assessment should be conducted on the infrastructure and data center facility to confirm that your designs are resilient and capable of sustaining local disruptions. It is possible that previously installed systems have designs that do not meet the current standards of fault tolerance.
The second phase of this approach is implementing “near-by redundancy.” Some call this their “high-availability” site. It is a separate facility that is less than 30 miles from the primary site. You may ask why you would have a site so close to the primary site, and whether this is a big risk. The risk is acceptable when looking at the bigger picture. One of the advantages of having a recovery site in close proximity to the primary site is being able to leverage proximity-based technology, such as fiber optics and metropolitan Ethernet networks, for high-speed replication. The configuration of this site could be “hot,” which requires little to no activation time; “warm,” which requires some activation time; or “cold,” which requires significant activation and configuration time. My preference is to configure this as a hot site because it provides you with an active-active configuration of production systems that can support near-zero downtime. The hot site should be fully equipped with the capacity to respond within minutes of a major disaster, ensuring that the recovery point objectives (RPO) and recovery time objectives (RTO) for the various business processes are met. A hot site tends to be one of the more expensive configurations, so it is important to be selective about which systems are most critical to the business and to manage expectations for the agreed-upon recovery times.
The third phase of my offensive approach is the implementation of a “remote recovery site.” I believe it is essential to have a remote recovery site that is at least 250 miles from the primary site, or that utilizes cloud service providers at multiple remote locations. One of the main reasons I believe this is so critical is that disasters are typically localized. When you stretch those boundaries far enough apart from the primary site, you can reduce the possibility of many single points of failure and the risk of having both locations impacted by the same event. This site can also be configured in a hot, warm, or cold configuration. However, there will be some limitations on the size and speed of data replication. Also, keep in mind that when selecting an alternate location, you’ll want to consider your staff. If you do not have the resources to staff both locations, then your staff may have to travel to this alternate site to assist in additional manual efforts to restore services. Although a majority of systems may be engineered with automated recovery, there still may be a need for manual recovery of some systems or technology components.
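To make the distance guidelines concrete, here is a minimal sketch that checks whether a candidate site falls into the near-by range (less than 30 miles) or the remote range (at least 250 miles) relative to a primary site. It is only an illustration: the function names and the example coordinates are hypothetical, and it uses straight-line (great-circle) distance, which ignores real-world factors such as shared power grids, flood plains, or network routes.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

NEARBY_MAX_MILES = 30    # near-by site: less than 30 miles from primary
REMOTE_MIN_MILES = 250   # remote site: at least 250 miles from primary


def distance_miles(a, b):
    """Great-circle distance between two (lat, lon) points given in degrees,
    computed with the haversine formula."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def classify_site(primary, candidate):
    """Label a candidate site by its distance from the primary site."""
    d = distance_miles(primary, candidate)
    if d < NEARBY_MAX_MILES:
        return "near-by"
    if d >= REMOTE_MIN_MILES:
        return "remote"
    return "neither"  # too far for near-by, too close for remote


# Illustrative coordinates only (approximate city centers), not a recommendation.
primary = (41.8781, -87.6298)      # Chicago
candidate_a = (42.0451, -87.6877)  # Evanston
candidate_b = (43.0389, -87.9065)  # Milwaukee
candidate_c = (39.7392, -104.9903) # Denver

for name, site in [("A", candidate_a), ("B", candidate_b), ("C", candidate_c)]:
    print(name, classify_site(primary, site))
```

A candidate that lands in the "neither" band is not automatically a bad choice, but it gives up some of the replication advantages of a near-by site without gaining the separation that a remote site is meant to provide.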
"A significant technology disruption combined with a poorly designed DR strategy could be the perfect storm that cripples a business"
Regardless of the type of disaster, human-induced or natural, each has the potential to disrupt or disable a business for months if proper planning is not in place. A significant technology disruption combined with a poorly designed DR strategy could be the perfect storm that cripples a business. Having a solid offensive approach and a sound DR strategy could make the difference in limiting the level of exposure and potential loss to an organization.
However, in my opinion, to actually limit the level of exposure and potential loss to an organization, it is not good enough to just implement the three-phase approach. It is also critical to test the overall design of the offensive approach and the effectiveness of each of the three phases. Such testing is often overlooked. It is important to discuss the recovery approach and review supporting documentation, but the real proof will be having the test results in your hand as evidence that you can deliver what is intended. DR planning is most certainly not glamorous, and it can be quite difficult to justify the costs for something that might happen, but if done properly, a robust offensive approach and solid DR strategy can make your IT organization more flexible, adaptable, and ready for those unavoidable situations that could disrupt your technology environment.
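As one way to picture what "test results in your hand" might look like, the sketch below compares measured results from a DR test against the agreed-upon RTO and RPO for each system. Everything here is hypothetical: the class names, the system names, and the numbers are made up for illustration, and a real test report would cover far more than two measurements per system.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    rto_minutes: float  # agreed maximum time to restore the service
    rpo_minutes: float  # agreed maximum window of lost data


@dataclass
class DrillResult:
    system: str
    recovery_minutes: float   # time to restore, as measured in the test
    data_loss_minutes: float  # data-loss window, as measured in the test


def evaluate(objectives, results):
    """Return (system, problems) for each tested system that missed an objective."""
    failures = []
    for r in results:
        obj = objectives[r.system]
        problems = []
        if r.recovery_minutes > obj.rto_minutes:
            problems.append("RTO missed")
        if r.data_loss_minutes > obj.rpo_minutes:
            problems.append("RPO missed")
        if problems:
            failures.append((r.system, problems))
    return failures


# Hypothetical objectives and test measurements.
objectives = {
    "payments": Objective(rto_minutes=15, rpo_minutes=5),
    "reporting": Objective(rto_minutes=240, rpo_minutes=60),
}
results = [
    DrillResult("payments", recovery_minutes=22, data_loss_minutes=3),
    DrillResult("reporting", recovery_minutes=180, data_loss_minutes=45),
]

for system, problems in evaluate(objectives, results):
    print(system, problems)
```

In this made-up run, the payments system restores too slowly for its objective, which is exactly the kind of gap that a discussion or a document review alone would not reveal.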