Assume failure and Resiliency testing

06 February, 2014 by Jack Vamvas

"Everything fails, all the time" is a mantra attributed to Werner Vogels, CTO of Amazon.com.

Considering the ideas behind this mantra raises some interesting practical issues:

 1)     An infrastructure will fail

As an infrastructure grows, managing it at scale becomes non-trivial. Add environments that support multiple technologies and undergo constant change, and the number of failures and errors increases. Accept that failure will occur and manage systems based on this assumption.

Compare this with the old-style thinking that underpinned “build-once” systems, where engineers threw a system over the wall and only reviewed it during an outage.

2)     Test for these failures in the Production environment, with engineers available to fix the problems

This point will raise debate. My colleagues argue that purposefully disabling infrastructure services should never occur in a Production environment and should only ever be done in a Non-Production environment. But how many organisations have a Non-Production setup that exactly matches Production, with the same resiliency, server configurations, database loads etc.?

3)     If the failure occurs again, the infrastructure must recover automatically with no disruption to the user experience

Once repeated failures are identified, apply fixes or steps that allow the services to recover automatically with no or minimal impact on the users. It’s a challenge, but worth it. It’s all about continuously testing and improving, and ultimately gaining back time to focus on core skills.
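To make points 2 and 3 a little more concrete, here is a minimal Python sketch of the idea: deliberately inject a failure, then check that the caller recovers without anyone stepping in. Everything in it is an assumption made for illustration. FlakyService is a hypothetical stand-in for a real dependency, call_with_recovery is a made-up helper name, and retrying with exponential backoff is just one possible recovery step, not a recommendation for any particular platform.

```python
import time


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then works."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def query(self):
        self.calls += 1
        if self.remaining_failures > 0:
            # Injected failure: simulates an outage of the dependency.
            self.remaining_failures -= 1
            raise ConnectionError("injected failure")
        return "ok"


def call_with_recovery(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call operation, retrying on ConnectionError with exponential backoff.

    The delay doubles after each failed attempt. If the final attempt
    also fails, the error is re-raised so that an engineer can step in.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    service = FlakyService(failures_before_success=2)
    result = call_with_recovery(service.query)
    print(result, "after", service.calls, "calls")
```

The injected failures play the role of the deliberate outage from point 2, and the retry loop is the automatic recovery from point 3. In a real system the recovery step is more likely to be a service restart or a failover than a simple retry, but the test has the same shape: break something on purpose and confirm the user never notices.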

Read More

 Powershell and Disaster Recovery preparation

SQL Performance tuning - Asking the right question

SQL Server DBA Top 10 automation tasks


Author: Jack Vamvas (http://www.sqlserver-dba.com)

