tags: #publish
links: [[SLOs]], [[SRE and DevOps]]
created: 2022-03-02 Wed

---

# SLOs and rare events

**How do we think about rare events** - vs just keeping error rates low on a typical weekly basis?

Stealing from [[Heuristics That Almost Always Work]] - if I claim 99.99999% reliability and things look near-100% solid for a few years, and then I inevitably have a two-hour outage, you'll think I'm doing awesomely well for those years. Then, after the outage, you'll be really unhappy, because my "prediction" was actually just delivering four nines while saying "it's fine, look, our numbers show at least five nines since we started measuring" every time you asked me, rather than having a process to deliver nine nines. (This is essentially the [[Smooth-Sailing fallacy]]. See also [[Engineering Overreach]] - you can *easily* make it too hard to even analyse. [[How complex systems fail]] digs in a bit more.)

The timescale of measurement _definitely_ matters once we start talking about 5-nines and beyond. What you're worried about long term is not having any extended outages at all, not just keeping normal error rates low: holding five nines means never having even ~15 minutes of downtime across 3 years. It's not very meaningful to claim 5-nines yet be given a pass for an occasional large outage and start counting again - you can't meaningfully "catch up" a 5-nines SLO; that's just a lie. Unless I can see significant process and funding mechanisms that show otherwise, my default when I see a high SLO is to suspect that's what people are actually doing.

Delivering that level of reliability long term is _hard_, just as predicting rare events is hard in the article above. **It likely requires architecting the whole org for it, including planning and project selection, hiring, funding, and narrowness of focus.** But if you don't, you can kind of get away with it for a while and not be challenged much until that outage happens. I think this matters at 4-nines too, and it's easy to be dishonest with yourself about it if you've been pretty reliable for a while.
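
A minimal sketch of the nines arithmetic above, assuming a 3-year measurement window; the function and variable names are my own illustration, not taken from any SLO tooling:

```python
# Back-of-the-envelope availability arithmetic for the claims in this note.

MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(nines: int, years: float) -> float:
    """Total downtime allowed for an availability target of `nines` nines
    over a window of `years` years."""
    allowed_unavailability = 10 ** -nines  # e.g. 5 nines -> 1e-5
    return allowed_unavailability * years * MINUTES_PER_YEAR


def realised_availability(outage_minutes: float, years: float) -> float:
    """Availability actually delivered over the window, given total outage time."""
    window_minutes = years * MINUTES_PER_YEAR
    return 1 - outage_minutes / window_minutes


if __name__ == "__main__":
    # Five nines over 3 years: roughly 15 minutes of total downtime allowed.
    print(f"5-nines budget over 3 years: {downtime_budget_minutes(5, 3):.1f} min")

    # A single two-hour outage in 3 years drops you to roughly four nines,
    # however good the measured numbers looked before it happened.
    print(f"One 2h outage in 3 years: {realised_availability(120, 3):.6f}")
```

Running this prints a budget of about 15.8 minutes for five nines over 3 years, and a realised availability of about 0.999924 (roughly four nines) for a single two-hour outage in that window.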