Consider the Real Costs & Risks of Engaging the Whole Organization in DR Tests


The extent to which a DR test impacts a business is described in terms of intrusiveness, ranging from virtually invisible to a fully engaging disruption. Non-intrusive testing validates whether a particular technology or system can run live in parallel with production without disturbing it. Intrusive testing has an impact on production and relies on users to participate and validate the effectiveness of DR capabilities. DR testing should be progressive: it starts with narrow, non-intrusive exercises and becomes bolder, broader and more intrusive with each round. Compliance requirements (e.g. industry or trade regulations) and the vulnerability of the business to IT risks usually drive these testing decisions.

Non-intrusive Testing

A non-intrusive test is an exercise that does not affect production. It validates that a system can “go live” at the secondary site. Non-intrusive testing generally involves technical staff and does not require interaction with regular users. This stage of testing ensures that applications come up “live”, but it does not run transactions through the recovery site that would then need to be transferred back to production at the primary site. For that reason, the non-intrusive phases of testing cannot guarantee a successful failover or failback. This is a preliminary stage of testing: a good first assessment before a full-blown user test involving a production “outage” and failover. Non-intrusive activity may also be performed as part of maintaining a DR plan, by desk-checking it or by running individual procedural components from time to time with limited impact on day-to-day operations.

Intrusive Testing

A fully intrusive test involves a complete disconnect of users from production at the primary site and a switchover to systems at the working recovery site. Depending on how aggressively you test, you may choose to stop access to production via the network, validate that data is “current”, run transactions successfully through the recovery site, then fail back to production with all updates and systems completely current. It is intrusive because it could impact business operations in several ways (depending on the level of test you undertake):

  • production systems may be inaccessible to users for periods of time during failover and failback transfers
  • users may be required to validate that recovery site systems are working correctly, with acceptable data
  • users may be required to “transact” with the recovery systems to ensure they perform as expected under “disaster scenarios”, and
  • in a failback test, users must validate that data and system operations have transferred back successfully to production at the primary site

Since it involves interacting with actual production systems, intrusive testing requires (and builds) confidence that the DR processes will fail over and fail back successfully and limit the amount of “lost data”.
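The sequence described above (disconnect, validate data, transact at the recovery site, fail back, validate again) can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Site` class, `replicate` and `run_intrusive_test` are hypothetical names invented here, the "sites" are in-memory dictionaries, and a real test would drive actual network, replication and application tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical stand-in for a data center; not a real API."""
    name: str
    records: dict = field(default_factory=dict)
    accepting_users: bool = False


def replicate(source: Site, target: Site) -> None:
    """Copy all records from source to target (stand-in for real replication)."""
    target.records = dict(source.records)


def run_intrusive_test(primary: Site, recovery: Site, test_transactions: dict) -> list:
    """Walk through failover and failback, returning a log of (step, passed)."""
    log = []

    # 1. Stop user access to production at the primary site.
    primary.accepting_users = False
    log.append(("disconnect primary", not primary.accepting_users))

    # 2. Validate that the recovery site's data is current.
    log.append(("recovery data current", recovery.records == primary.records))

    # 3. Fail over and have users transact against the recovery site.
    recovery.accepting_users = True
    recovery.records.update(test_transactions)
    transacted = all(recovery.records.get(k) == v for k, v in test_transactions.items())
    log.append(("transactions at recovery", transacted))

    # 4. Fail back: transfer updates to the primary and restore user access.
    recovery.accepting_users = False
    replicate(recovery, primary)
    primary.accepting_users = True

    # 5. Validate that production holds every update made during the test.
    log.append(("failback data complete", primary.records == recovery.records))
    return log


primary = Site("primary", {"order-1": "shipped"}, accepting_users=True)
recovery = Site("recovery")
replicate(primary, recovery)  # assumption: replication was healthy before the test

results = run_intrusive_test(primary, recovery, {"order-2": "new"})
for step, passed in results:
    print(f"{step}: {'ok' if passed else 'FAILED'}")
```

Each logged step corresponds to one point at which users or technical staff would sign off during a real exercise; a failed step is where the test would normally stop and production access would be restored.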

An example of an intrusive failback test is Google’s annual, company-wide, multi-day Disaster Recovery Testing event (DiRT)[1], which tests its disaster recovery capabilities live. According to its account of DiRT, Google validates its capabilities “live” and with full user involvement. For instance, with its internal systems, it has actually disconnected one of its data centers and diverted processing to a secondary center while users were receiving and sending information. This is a live failover and failback performed while users are active on the system. It is a true full-blown test that gives Google strong evidence that its DR plans work.

The more intrusive the test becomes, the greater the risk to the business. For instance, conducting a “simulated” failback test on one small component is a non-intrusive test with very little risk of impact to the business; conducting a full-scale failback test on the entire system is a highly intrusive test with much greater risk potential, because it involves many users and systems and takes production off-line. A single snag can put business operations in jeopardy. The safest way to test is non-intrusively, especially when it is unclear how data and users on the primary site will be affected. However, a more comprehensive or “true” test of an organization’s DR capabilities is a real failback test that fully engages the business. An organization needs to realize that there is great value in progressing through all stages of testing. A fully intrusive test does carry higher risks, but the reward is much greater confidence that the plan will work in a real crisis.
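One way to picture this progression is as an ordered ladder of test stages, where a stage is attempted only after the one below it has passed. The sketch below is an assumption for illustration: the stage names in `DrTestLevel` and the rule in `next_test_level` are hypothetical, not a standard classification, and an organization would define its own stages and pass criteria.

```python
from enum import IntEnum


class DrTestLevel(IntEnum):
    """Hypothetical test stages, ordered from least to most intrusive."""
    DESK_CHECK = 1
    COMPONENT_SIMULATION = 2
    NON_INTRUSIVE_GO_LIVE = 3
    USER_FAILOVER = 4
    FULL_FAILOVER_AND_FAILBACK = 5


def next_test_level(history):
    """Return the next stage to attempt, given (level, passed) results in order.

    A failed stage is repeated; a passed stage unlocks the one above it.
    """
    level = DrTestLevel.DESK_CHECK
    top = DrTestLevel.FULL_FAILOVER_AND_FAILBACK
    for attempted, passed in history:
        if attempted == level and passed and level < top:
            level = DrTestLevel(level + 1)
    return level


history = [
    (DrTestLevel.DESK_CHECK, True),
    (DrTestLevel.COMPONENT_SIMULATION, False),  # a snag: repeat before moving on
]
print(next_test_level(history).name)
```

Ordering the stages this way keeps the riskiest exercise, the full failover and failback, behind a record of successful lower-risk tests.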


[1] Kripa Krishnan, Google: “Weathering the Unexpected”, ACM Queue, Sep. 16, 2012.

Steve Tower

With many years of professional IT experience, and training as a Certified Management Consultant, a Project Management Professional, a Professional Engineer and a Member, Business Continuity Institute, Steve Tower has the skills and abilities required to assist with even the most complex disaster recovery planning initiatives. Below, Steve discusses the necessary tools involved in setting up a disaster recovery plan and program.