"Everything fails, all the time" is a mantra attributed to Werner Vogels, CTO of Amazon.com.
Considering the ideas behind this mantra raises some interesting practical issues:
1) An infrastructure will fail
As infrastructure grows, scale becomes non-trivial. Add environments that support multiple technologies and change constantly, and failures and errors become more frequent. Accept that failure will occur and manage systems on that assumption.
Compare this with the old-style thinking that underpinned “build-once” systems, where engineers threw a system over the wall and only reviewed it during an outage.
2) Test for these failures in the Production environment, with engineers available to fix the problems
This point will raise debate. My colleagues argue that purposefully disabling infrastructure services should never happen in a Production environment and should only ever be done in Non-Production. But how many organisations have a Non-Production setup that exactly matches Production, with the same resiliency, server configurations, database loads and so on? Where the two differ, a failure test in Non-Production says little about how Production will behave.
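To make the idea concrete, here is a minimal sketch of a controlled failure test. It is an illustration only: the `fleet` dictionary, the `run_experiment` function and the `min_healthy` and `engineers_on_call` parameters are all hypothetical names, and a real setup would call whatever tooling actually stops and starts your services.

```python
import random
from typing import Dict, Optional


def run_experiment(
    fleet: Dict[str, bool],
    rng: random.Random,
    min_healthy: int = 2,
    engineers_on_call: bool = True,
) -> Optional[str]:
    """Disable one running instance, but only when it is safe to do so.

    `fleet` maps a (hypothetical) instance name to True if it is up.
    Returns the name of the disabled instance, or None if nothing was done.
    """
    # Guard 1: never inject a failure unless someone is around to fix it.
    if not engineers_on_call:
        return None

    up = sorted(name for name, is_up in fleet.items() if is_up)

    # Guard 2: keep at least `min_healthy` instances running afterwards.
    if len(up) <= min_healthy:
        return None

    victim = rng.choice(up)
    fleet[victim] = False  # stand-in for really stopping the service
    return victim


if __name__ == "__main__":
    fleet = {"web-1": True, "web-2": True, "web-3": True}
    print("disabled:", run_experiment(fleet, random.Random()))
    print("fleet now:", fleet)
```

The two guards carry the argument: the test only runs when engineers are available, and it refuses to reduce the fleet below a floor you choose.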
3) If the failure occurs again, the infrastructure must recover automatically with no disruption to the user experience
Once you have identified repeated failures, apply fixes or steps that allow the services to recover automatically with no or minimal impact on users. It’s a challenge, but worth it. It’s all about continuously testing and improving, and ultimately gaining back time to focus on core skills.
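As a rough sketch of what automatic recovery can look like, the snippet below restarts unhealthy services and only escalates to a person when restarts keep failing. The `Service` class, the `heal` function and the `max_restarts` limit are hypothetical names made up for illustration; in practice the restart would go through your orchestrator or init system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    """Hypothetical stand-in for a real service (process, container, VM)."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # A real implementation would call an orchestrator here.
        self.healthy = True
        self.restarts += 1


def heal(services: List[Service], max_restarts: int = 3) -> List[str]:
    """Restart unhealthy services; return the names that need a human."""
    escalate = []
    for svc in services:
        if svc.healthy:
            continue
        if svc.restarts >= max_restarts:
            # Repeated restarts have not helped, so stop and page someone.
            escalate.append(svc.name)
        else:
            svc.restart()
    return escalate


if __name__ == "__main__":
    fleet = [Service("web"), Service("db", healthy=False)]
    print("needs attention:", heal(fleet))
    print(fleet)
```

The restart cap is the design choice worth noting: a service that keeps failing after several automatic recoveries is a new problem to investigate, not one to hide behind endless restarts.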