Testing the DR Plan: Failover vs. Failback

An organization’s disaster recovery (DR) plan cannot be considered reliable until it has undergone complete failover and failback testing. Failover and failback refer to the ability to transfer production systems to, and back from, a secondary site in the event of a disaster. DR testing is a progressive process: a complete failover and failback evaluation comes at the end of a series of stages, each a broader and more complete test than the previous one.

A failover test demonstrates whether a business can recover at an alternate DR site and run applications when the primary site is disconnected.

A failback test demonstrates whether a business can return from the recovery site back to the primary site, assuming this site is still intact.

A successful failover and failback test demonstrates that, in a simulated disaster, a business can recover at a secondary site, operate successfully on an interim basis, and then return to the primary site without loss of data and within expected timeframes.
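The pass criteria above (no data loss, and both transitions completed within expected timeframes) can be written down as an explicit check. The following is only a minimal sketch: the `DrTestResult` fields, the `meets_objectives` function, and the example time limits are hypothetical names and values chosen for illustration, not part of any particular DR tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrTestResult:
    """Measurements from one failover/failback exercise (hypothetical fields)."""
    failover_duration: timedelta   # time to bring systems up at the recovery site
    failback_duration: timedelta   # time to return processing to the primary site
    records_lost: int              # data lost across the whole exercise


def meets_objectives(result: DrTestResult,
                     max_failover: timedelta,
                     max_failback: timedelta) -> bool:
    """Return True only if no data was lost and both moves met their time limits."""
    return (
        result.records_lost == 0
        and result.failover_duration <= max_failover
        and result.failback_duration <= max_failback
    )


# Example with assumed limits: 4 hours to fail over, 8 hours to fail back.
result = DrTestResult(
    failover_duration=timedelta(hours=3),
    failback_duration=timedelta(hours=9),
    records_lost=0,
)
passed = meets_objectives(result, timedelta(hours=4), timedelta(hours=8))
```

In this example `passed` is false because the failback ran over its limit, even though the failover itself was on time and no data was lost.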

A failover test can simulate the “loss” of production by disconnecting networks from the production environment, making it unavailable to users. The goal is to verify that systems fail over (that is, transfer processing) to a secondary or recovery site and that the site is able to function in a crisis. If the data center is actually destroyed, there would be no use for a failback until a new primary site has been constructed and is up and running.

A failover test evaluates DR capabilities before they are needed. A comprehensive plan will build and test recovery processes server by server and application by application. Because a company may have hundreds of application systems and servers to accommodate in a disaster, they will likely need to be grouped according to priority and the recovery technology employed. Each group will then need to be tested individually before a full test is run, to ensure smooth execution of the DR plan as a whole. A complete failover test is a sequential run-through of all critical infrastructure and applications included in the DR plan. Failback should only be tested once it has been confidently concluded that:

the failover test is successful,
production-like data can be processed at the recovery site, and
the primary site can be re-synchronized from the recovery site.
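As a rough illustration of the grouping and gating described here, the sketch below sorts systems into test groups by priority and recovery technology, and refuses to proceed to failback testing unless all three conditions hold. The `System` type, the function names, the sample systems, and the convention that priority 1 is the most critical are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    priority: int      # assumed convention: 1 = most critical
    technology: str    # recovery technology used for this system


def group_for_testing(systems):
    """Group systems by (priority, technology), most critical group first."""
    groups = defaultdict(list)
    for system in systems:
        groups[(system.priority, system.technology)].append(system.name)
    # Sorting the keys orders groups by priority, then by technology name.
    return [(key, sorted(names)) for key, names in sorted(groups.items())]


def ready_for_failback(failover_ok, data_processed_ok, resync_ok):
    """Failback testing is allowed only when all three conditions are met."""
    return failover_ok and data_processed_ok and resync_ok


# Hypothetical inventory used only to show the grouping.
systems = [
    System("billing", 1, "replication"),
    System("wiki", 3, "tape restore"),
    System("orders", 1, "replication"),
]
plan = group_for_testing(systems)
```

Each entry in `plan` is a group that can be tested on its own before the full sequential run-through. A real inventory would likely carry more attributes, such as dependencies between applications.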

If access to a primary site is lost, a business is likely to fail over successfully if it has practiced DR testing. A complete operational failover and failback test should not be run until all components have first been tested individually and the recovery team fully understands how to accommodate the amount of data change at the recovery site and how to restore that data rapidly to the original primary site.

Steve Tower