Disaster Recovery: A Continuous Journey
Let's start with an analogy. Most of us have some sort of insurance on our homes—either home owner’s or renter’s insurance. We pay premiums to our insurance companies and hope that we never have to use our policies. In the enterprise information technology (IT) space, this is exactly what a Disaster Recovery (DR) program is—and how it should be viewed. A proper DR program requires significant investment in people, processes and technology year after year with the hope that it never gets used. Just like a home owner’s or renter’s insurance policy–when something happens and you need it–you expect the insurance company to be responsive, have a clearly defined process to follow, and help you resolve your issues as quickly as possible. This is their business—and they need to be good at it.
In the past, DR has often sat lower on the list of priorities for some IT organizations, mainly because DR has been difficult to integrate well enough with Business Continuity Planning (BCP) practices. As IT professionals, it is important that we provide business leaders with good counsel so they understand the challenges and limitations of DR and the effort required to meet expectations when an issue arises. It is crucial to achieve this business alignment before any technology decisions are made. In my experience, choosing and implementing the technology is actually the least challenging part of implementing a DR program.
There are a few key critical success factors that will help facilitate the successful implementation and ongoing maintenance and growth of an IT DR program.
• Executive alignment at the CXO level
The business will view the effort to implement a DR program as a non-revenue-generating activity, so IT professionals need executive support to procure the resources required to ensure success.
• Single source of truth for your data and IT inventory–a centralized CMDB (Configuration Management Database)
• Accurate inventory that is mapped in this fashion:
• Hardware components -> Applications -> Business Processes
• A clear understanding of which business processes are critical to be up and running for the business to function (i.e., the business processes that are attached to revenue). This is of particular importance and requires engaging directly with key business stakeholders, so that the entire DR program is aligned properly with the BCP plans of the business. The business will have some systems that need restoration immediately and some that can withstand days or even weeks of downtime before the impact is felt. IT typically does not have this information readily available, so engaging the business is critical.
• Clearly defined Recovery Time Objectives (RTOs, how long it takes to recover) that meet the needs of the business. Recovery Point Objectives (RPOs, how up to date the data in your DR environment is) are driven mainly by technical considerations, but RTOs are driven by people and processes. This is an iterative cycle: initial RTOs should be set as targets, then revised and updated as the process matures.
• Establish a dedicated program and invest it with authority
Everyone in IT (Infrastructure, AppDev, PMO) and the business needs to understand that maintaining a functioning DR program is a priority of the organization. The investment (time and money) in the numerous tests that need to occur should be sanctioned, expected work. It’s important to set a policy based on the criticality/tier level of each application, as well as the frequency of testing (once a quarter, once a year, etc.). Publish a DR testing calendar at the beginning of the year that sets expectations with all parties regarding the availability of testers, subject matter experts, etc. Just like the insurance policy we talked about before, a DR program will require continuous investment to stay current and valid.
Define on paper the specific steps required to restore an application in a DR event. Ideally, this would be a manual process that is fully understood and tested before you begin any tool-based work. There is an old adage: you can’t automate a process that you don’t understand. It applies perfectly to DR.
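The inventory mapping and tiered recovery objectives described above can be sketched as a simple data model. Everything here is illustrative, a minimal sketch rather than any particular CMDB product's schema; the tier names, hosts, and hour values are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Tier:
    """Recovery tier policy: targets and test cadence (illustrative values)."""
    name: str
    rto_hours: int        # Recovery Time Objective: how long to recover
    rpo_hours: int        # Recovery Point Objective: how stale data may be
    tests_per_year: int   # DR test frequency for this tier

@dataclass
class Application:
    name: str
    tier: Tier
    hardware: list = field(default_factory=list)   # hardware components it runs on
    processes: list = field(default_factory=list)  # business processes it supports

# Hypothetical tiers: critical revenue systems vs. deferrable back-office apps
tier1 = Tier("critical", rto_hours=4, rpo_hours=1, tests_per_year=4)
tier3 = Tier("deferrable", rto_hours=72, rpo_hours=24, tests_per_year=1)

billing = Application("billing", tier1,
                      hardware=["db-cluster-01", "app-vm-12"],
                      processes=["order-to-cash"])

def impact_of(component, apps):
    """Walk the hardware -> application -> business-process chain to answer:
    if this component is lost, which processes are impacted, and how fast
    must we recover them?"""
    hits = [a for a in apps if component in a.hardware]
    return [(a.name, p, a.tier.rto_hours) for a in hits for p in a.processes]

print(impact_of("db-cluster-01", [billing]))
```

The point of the chain is exactly this kind of query: a single lookup from a failed component to the revenue processes it supports and the recovery clock attached to them.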
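The documented manual restore process can start as nothing more than an ordered checklist that a person walks through. A sketch of that idea follows; the step names are hypothetical, and the executor stops at the first failure, which mirrors how a real runbook is exercised during a test:

```python
# A runbook is an ordered list of verified, human-executable steps.
# Automate only after every step is understood and has been run manually.
RUNBOOK = [
    "Declare the DR event and notify stakeholders",
    "Restore the database from the latest replicated snapshot",
    "Start application servers in the recovery site",
    "Repoint DNS / load balancers to the recovery site",
    "Run smoke tests and hand over to the business for validation",
]

def execute(runbook, do_step):
    """Walk the runbook in order, stopping at the first failed step."""
    for i, step in enumerate(runbook, 1):
        ok = do_step(step)
        print(f"[{i}/{len(runbook)}] {'OK  ' if ok else 'FAIL'} {step}")
        if not ok:
            return False
    return True

# Dry run: every step "succeeds"
execute(RUNBOOK, do_step=lambda step: True)
```

Keeping the steps as plain, ordered text is the feature, not a limitation: the same list that a person follows during a test is what you later hand to an automation platform, once every step has been proven by hand.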
Notice that none of these critical success factors is a technology. This is important to note, as many organizations attempt to lead with a tool first and shape their DR strategy around it. In almost every instance, I have seen this approach fail. First and foremost, DR is a people and process challenge; the tools you select will determine which RPOs and RTOs you can achieve. A plethora of tools is available today for in-house use (if you are exploring DR for legacy applications), as well as automation platforms that can help tighten your RTOs to as short a timeframe as possible. These all require various levels of investment.
The next step in the journey is to continue exploring new technologies (both in-house and cloud-based) that allow applications to run from multiple locations in an active-active scenario. This will help validate the DR testing conducted and relieve much of the time burden associated with a robust DR program. Clearly, this is no small feat and requires refactoring many applications. For the foreseeable future, DR will remain a core component of our BCP strategy.