Shuffle Sharding - Hyperbolic Cloud

tags: #publish links: [[Software Architecture]], [[SRE and DevOps]], created: 2021-06-07 Mon --- # Shuffle Sharding https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/ An architecture to **drastically reduce blast radius**, via **randomisation of shard allocation per customer**. It's especially useful when you have lots of potentially noisy-neighbour customers and you don't want them to disrupt others. ## Implementation - Infra is sharded with a decent number of shards (e.g. at least a few tens of them) - Load balancing with healthchecks. Customer workloads can be executed on any of several shards at any given time (i.e. *shards are somewhat stateless*) and load balancing detects unhealthy or overloaded infra. - Each customer is randomly allocated to a list of N different shards: *the random allocation is different for every customer* (or, at least, very few customers share the same sets - the amount of sharing determines the blast radius). What this means is that blast radius is extremely low for otherwise problematic scenarious: - If a given customer is being a noisy neighbour, other customers are minimally effected because they all have other places they can execute, away from that noisy customer. - If a given shard is down, instead of all customers on that shard being down or becoming a thundering herd elsewhere, every customer on that shard can transparently run somewhere else instead, and the impact of this traffic shifting is minimal because each customer's load moves somewhere different.