tags: #publish
links: [[Software Architecture]], [[SRE and DevOps]],
created: 2021-06-07 Mon
---
# Shuffle Sharding
https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/
An architecture to **drastically reduce blast radius**, via **randomisation of shard allocation per customer**.
It's especially useful when you have lots of potentially noisy-neighbour customers and you don't want them to disrupt others.
## Implementation
- Infra is sharded with a decent number of shards (e.g. at least a few tens of them)
- Load balancing with healthchecks. Customer workloads can be executed on any of several shards at any given time (i.e. *shards are somewhat stateless*) and load balancing detects unhealthy or overloaded infra.
- Each customer is randomly allocated to a list of N different shards: *the random allocation is different for every customer* (or, at least, very few customers share the same sets - the amount of sharing determines the blast radius).
What this means is that blast radius is extremely low for otherwise problematic scenarious:
- If a given customer is being a noisy neighbour, other customers are minimally effected because they all have other places they can execute, away from that noisy customer.
- If a given shard is down, instead of all customers on that shard being down or becoming a thundering herd elsewhere, every customer on that shard can transparently run somewhere else instead, and the impact of this traffic shifting is minimal because each customer's load moves somewhere different.