tags: #publish
links: [[Engineering]], [[Management Articles]], [[Software Architecture]], [[SRE and DevOps]]
created: 2021-10-05 Tue

---

# How complex systems fail

https://how.complexsystems.fail/
by **Richard I. Cook, MD**

Originally written in a medical and patient safety context, this beautifully succinct summary applies to all complex organisations and large engineering projects, particularly software and networking. It encourages a different way of looking at the whole system as a dynamic thing.

- Systems include the humans using and developing them. You have to look at the people together with the technology or process. You can't expect to replace all the people with different people and have the overall system behave the same.
- Complex systems are always in partial failure to some degree. They are always on the verge of catastrophe! They need to be designed for this, and the people need to think accordingly. Partial failure is a source of information for continually tuning the system. One should be wary of thinking about "causes" of failures in terms of "actions" or "events". Trying to remedy each of them tends to increase complexity. Failure is a continuous state and a normal circumstance, not a discrete event to be prevented.
- The humans using and building the system need to be continuous risk managers, gamblers and creators of safety.
- Severe failures occur when the layers of partial defenses against partial failures are fully overwhelmed.
- Changes introduce new forms of failure. This includes changes attempting to mitigate minor risks - they may make catastrophic failure more likely.
- Paradoxically, **trying to prevent *severe* failures by preventing *all* failures may be doomed** - you *need* the continuous partial failures, and a process for noticing and learning from them, in order to learn about the system's changing circumstances, tune it, and keep everyone up to speed enough to prevent, and deal with, severe failures.
Trying to prevent *all* failures may lead to a period of stability, then a catastrophic failure as soon as surrounding circumstances change.

See also [[Engineering Overreach]] - you probably want to at least *try* to keep the complexity down.

See also [[SLOs and rare events]]. Those rare events are the defenses being overwhelmed.
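
To make the "layers of partial defenses" idea concrete, here is a minimal toy sketch. It is not from Cook's article: the function names and the per-layer probabilities are hypothetical, and it assumes each layer fails independently of the others, which real systems do not guarantee (correlated failures are often what makes the overwhelm possible).

```python
from math import prod


def p_some_layer_down(layer_failure_probs):
    """Chance that at least one defensive layer is in partial failure.

    Assumption (illustrative only): layers fail independently.
    """
    return 1 - prod(1 - p for p in layer_failure_probs)


def p_all_layers_down(layer_failure_probs):
    """Chance that every layer is failing at once, i.e. the defenses
    are fully overwhelmed. Same independence assumption as above."""
    return prod(layer_failure_probs)


# Hypothetical numbers: four layers, each failing 10% of the time.
normal = [0.1, 0.1, 0.1, 0.1]

# Hypothetical change in circumstances: every layer now fails 50% of the time.
stressed = [0.5, 0.5, 0.5, 0.5]

if __name__ == "__main__":
    print(f"normal:   some layer down {p_some_layer_down(normal):.2%}, "
          f"all down {p_all_layers_down(normal):.4%}")
    print(f"stressed: some layer down {p_some_layer_down(stressed):.2%}, "
          f"all down {p_all_layers_down(stressed):.4%}")
```

With these made-up numbers, some layer is failing about 34% of the time, so partial failure is the normal state, while all four fail together only about 0.01% of the time. When circumstances shift each layer to 50%, the all-layers-down case jumps to 6.25%. The layers did not change; the surroundings did.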