The Progressive Levels of DR Testing

Testing a disaster recovery (DR) plan is a progressive process. Each level of testing is more elaborate and more intrusive to the organization than the last, introducing more risk while also solidifying the capabilities of the DR plan.

Testing can be performed in a series of levels, each with increasing:

Technical scope, including systems, recovery technology, communications, etc.
Risk of business disruption to production data and system availability
Breadth of participation and involvement of IT staff, the user community, and third-party support personnel

The feasibility of the DR plan is verified in incremental steps, making sure each small component of the recovery infrastructure actually works. Only when you are confident that all individual elements are in correct working order is it appropriate to carry out a complete live failover and failback test, the ultimate operational test of disaster recovery capabilities.

Levels of DR testing:

1. Tabletop Exercise

Scope: To start, the preliminary “test” focuses on checking plans and documentation with core recovery team members. This may involve simulating a disaster, conducting training, and walking through communication, declaration and recovery processes and strategies.

Verification Objectives: Documentation is complete and any gaps are identified and filled. Recovery team is oriented and growing familiar with the process. People are getting comfortable with their responsibilities (this exercise is also used during Business Continuity Plan tests).

2. System-level

Scope: Start by validating that the purchased recovery technology works by itself, and then in tandem with the production and recovery sites. Where there are shared platforms (e.g. virtualized storage or processing), groups of systems will likely be tested together. The focus will be mostly on the critical systems.

Verification Objectives: Recovery processes and technologies have been installed and configured correctly. The recovery site environment is ready.

3. Production Failover

Scope: This involves disconnecting users and isolating production systems after-hours. There will be a failover to the recovery site without updating recovery or production site data. Usually this focuses only on the most critical systems.

Verification Objectives: All failover processes work to the recovery site and production data is intact and accessible there.

4. Recovery Site Capacity

Scope: With an after-hours outage, production is isolated, and projected user volumes are tested at the recovery site following a simulated disaster. This validates the capacity of communications and systems infrastructure at the DR site. Usually this focuses on the most critical systems.

Verification Objective: Sufficient performance from the recovery site.

5. Failover and Failback

Scope: This is an after-hours outage that tests the ability to switch operations from the primary site to the secondary site, and back again. Although production data is carefully protected during this testing, a second phase of this test should validate that changes made at the recovery site can be failed back successfully to production. Usually this focuses on the most critical systems.

Verification Objective: Failback to primary site works.

6. Live Failover and Failback

Scope: A full-blown failover and failback of production systems during business hours, involving the general user population. This will involve an actual shutdown and disconnection of production at the primary site and the redirection of operations to the recovery site. It may be scheduled in advance to coincide with a planned building outage, a major network upgrade or a critical systems migration. The organization may also choose to run it without warning, to determine how well it performs with available resources under a simulated crisis.

Verification Objective: Failover and failback work under normal operating conditions. Usually this focuses on the most critical systems. Workarounds may be deployed for lower priority systems during an extended outage.
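The progression through the six levels can be sketched in a few lines of code, which may help when tracking test status in a script or runbook. This is only an illustrative sketch: the names Level, may_run and next_allowed_level are hypothetical, and the strict rule that every lower level must pass before a higher one is attempted is an assumption drawn from the incremental approach described above, not a fixed requirement.

```python
from __future__ import annotations

from enum import IntEnum


class Level(IntEnum):
    """The six DR test levels, ordered from least to most intrusive."""
    TABLETOP = 1
    SYSTEM_LEVEL = 2
    PRODUCTION_FAILOVER = 3
    RECOVERY_SITE_CAPACITY = 4
    FAILOVER_AND_FAILBACK = 5
    LIVE_FAILOVER_AND_FAILBACK = 6


def may_run(level: Level, passed: set[Level]) -> bool:
    # Assumption: a level is allowed only once every lower level has passed.
    return all(lower in passed for lower in Level if lower < level)


def next_allowed_level(passed: set[Level]) -> Level | None:
    # Return the lowest level not yet passed, or None when all have passed.
    for level in Level:
        if level not in passed:
            return level
    return None


if __name__ == "__main__":
    passed = {Level.TABLETOP, Level.SYSTEM_LEVEL}
    print(next_allowed_level(passed).name)
    print(may_run(Level.LIVE_FAILOVER_AND_FAILBACK, passed))
```

With the tabletop and system-level tests recorded as passed, the sketch reports the production failover as the next candidate and refuses the live failover and failback until the intermediate levels are complete.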

Careful planning and testing of a DR plan are crucial to making sure it runs smoothly in a crisis. The overall goal is to verify that the working parts function correctly, individually and as a whole, and that the people involved are comfortable and well equipped to handle the disaster.

Steve Tower