What Are Availability Zones? A Guide to Multi-AZ Kubernetes

Flynn

May 21, 2026

Buoyant Enterprise for Linkerd

One of the things that differentiates EKS, Amazon’s managed Kubernetes service, from its peers is its near insistence that clusters must always span multiple availability zones (AZs). This is such an important facet of EKS that it’s literally the first thing we talk about in A Kubernetes engineer’s guide to migrating to Amazon EKS. That article focuses mostly on how AZs affect you; here, we’ll talk about why they exist and why they’re important.

An availability zone is a physically isolated group of data centers within an AWS region, designed to fail independently of other AZs while remaining connected by low-latency networking. AZs appear at all the hyperscale cloud providers, with somewhat different names. They’re partly a way to mitigate large-scale failures and partly a simple reflection of physics. In all cases they’re firmly rooted in the physical side of computing, so that’s where we’ll start.

What does physical infrastructure look like in a Kubernetes world?

In Kubernetes, you build clusters out of nodes, each of which is a machine (either physical or virtual) that’s used to run some of the cluster’s workloads. The smallest clusters have a single node; the largest have thousands. At some point, of course, you have physical hardware running each node, whether that physical hardware is dedicated to a single node or to multiple virtualized nodes.

How racks and data centers affect cloud providers

In larger installations, you’ll see racks full of server hardware. In turn, you’ll see large numbers of racks collected into data centers. Racks and data centers reflect the reality that it’s easier to manage infrastructure in groups than to deal with every individual machine. For example, a rack with one incoming power cable and one incoming network cable can include power distribution, network switches, and cooling inside the rack, and a data center can provide physical security, fire suppression, and backup power for every machine contained in it. (For all that we like to pretend the physical world doesn’t affect our cloud native environments, it very much does.)

The physical details here also dictate some realities of networking. Machines in the same rack tend to share a network switch, giving them a very fast, low-latency connection and requiring them to all be on the same subnet. Crossing racks will usually involve at least a second switch and may require routing. Crossing into a second data center will almost always involve routers and may involve a slower network (for example, you may have a 10Gbps network in each data center, but only 1Gbps connecting them).
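
To put rough numbers on that bandwidth gap, here is a small back-of-the-envelope sketch. It assumes the 10Gbps and 1Gbps figures from the example above, a hypothetical 100 GB data set, and no protocol overhead or contention, so treat the results as a best case rather than a measurement.

```python
def transfer_seconds(size_gb, link_gbps):
    """Seconds to move size_gb gigabytes over a link_gbps link.

    Ignores protocol overhead and competing traffic (a simplifying assumption).
    """
    gigabits = size_gb * 8  # 1 byte = 8 bits
    return gigabits / link_gbps


dataset_gb = 100  # hypothetical data set size

print(transfer_seconds(dataset_gb, 10))  # inside one data center
print(transfer_seconds(dataset_gb, 1))   # across the slower inter-data-center link
```

Under these assumptions, the same copy takes 80 seconds inside the data center and 800 seconds across the slower link.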

All these things can, and do, fail.

Everyone who’s worked with a computer knows that machines can crash.
A rack failing all at once most commonly happens because the rack lost power or network. This can happen for any number of low-tech reasons like a circuit breaker tripping or a technician unplugging the wrong cable.
An entire data center could fail because its neighborhood or city lost power, or from cables getting cut by construction work. More dramatic causes would be fire or water damage: these are rare but slow to recover from.

Interestingly, all of these failures basically look the same to Kubernetes: when the hardware running nodes crashes, Kubernetes sees it as all the nodes on that hardware failing. If your cluster only has one node, this is catastrophic: the whole cluster is gone. This is a major reason that clusters support multiple nodes: a properly constructed cluster can ride out any single node failing. (This requires you to maintain excess capacity in terms of CPU and storage, of course.)

A rack failing looks like a lot of nodes failing all at once. If all of a cluster’s nodes are in the same rack, this could be a problem. A whole data center failing is far worse, of course: it’s very likely that whole clusters will be taken out at that point.
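
The idea that a node, a rack, and a data center are just progressively larger failure domains can be sketched in a few lines. Everything in this example is hypothetical: the `Node` type, the node names, the rack and data center labels, and the 16-core requirement are made up for illustration, and real capacity planning also has to consider memory, storage, and scheduling constraints.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    datacenter: str
    cpu_cores: int


def surviving_capacity(nodes, domain):
    """Map each value of `domain` to the CPU cores left if that whole domain fails."""
    total = sum(n.cpu_cores for n in nodes)
    lost = defaultdict(int)
    for n in nodes:
        lost[getattr(n, domain)] += n.cpu_cores
    return {value: total - cores for value, cores in lost.items()}


def survives_any_single_failure(nodes, domain, required_cores):
    """True if the workloads still fit after losing any one member of `domain`."""
    remaining = surviving_capacity(nodes, domain)
    return all(left >= required_cores for left in remaining.values())


# Hypothetical cluster: three racks, two of them in the same data center.
nodes = [
    Node("node-a", rack="r1", datacenter="dc1", cpu_cores=8),
    Node("node-b", rack="r1", datacenter="dc1", cpu_cores=8),
    Node("node-c", rack="r2", datacenter="dc1", cpu_cores=8),
    Node("node-d", rack="r3", datacenter="dc2", cpu_cores=8),
]
required_cores = 16  # assumed total demand of the cluster's workloads

for domain in ("name", "rack", "datacenter"):
    print(domain, survives_any_single_failure(nodes, domain, required_cores))
```

With these made-up numbers, the cluster rides out the loss of any single node or rack, but losing dc1 leaves only 8 cores, which is not enough.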

How zones and regions collect data centers

AZs are the next level above data centers: they collect data centers that are physically close to each other into a single organizational unit. “Physically close” here means close enough that it’s still possible to get very high-bandwidth, low-latency connections between the data centers, but far enough apart that a problem at one data center probably won’t affect the others. This means that the size of an AZ can vary depending on where it is and what’s around it.

AZs, in turn, are collected into regions, each of which is guaranteed to contain at least three AZs and is likely to span multiple metro areas. (Basically, we can think of a data center as building-scale, an availability zone as a group of data centers at neighborhood scale, and a region as city-scale or possibly larger.) The physical separation between a region’s AZs means that you’ll typically see higher latencies between AZs than within a single AZ, even though the links between AZs are still high-bandwidth.

Both AZs and regions can fail, of course. The Hollywood scenario here is a meteor strike that destroys Boston, but in the real world the causes are more prosaic: blackouts, fires, or people making mistakes with BGP configuration.

How do multi-AZ clusters mitigate infrastructure failures?

With the physical background out of the way, we can (finally!) talk about the reason to use multiple AZs in a single cluster: it’s a balance of performance against the possibility of a rack or data center failing.

All the hyperscalers offer this capability, but using Amazon as an example for a moment: since Amazon doesn’t really expose data centers or racks directly, we can’t tell how EKS distributes nodes within an AZ. It might keep all the nodes of a given cluster in the same rack for performance, or it might deliberately spread them among all the data centers for reliability. Requiring clusters to have nodes in multiple AZs, though, makes this question irrelevant: for any failure up to and including an entire AZ, you’re covered.

This is a much bigger deal than it might seem.

Servers and racks fail all the time at the level of a hyperscale provider, which can absolutely affect your clusters. Recovery is really easy, though: the provider can just swap in nodes from other racks, and your cluster will do the right thing. In the ideal case, your users won’t even notice.

Data centers and entire AZs fail much more rarely, but when they do fail, recovery is a much bigger deal. Just to pick one simple example, suppose your entire cluster was housed in the failed infrastructure. Not only would you need to spin up a new cluster running somewhere else (which can be painful enough), but in that process you’d likely get a new set of external IPs, which would in turn force DNS changes with all they entail.

Insisting that clusters span AZs as a matter of course insulates you from all of these failures. An entire AZ going off the air becomes something easy to manage: just add some new nodes in yet another AZ, and Kubernetes will automatically start scheduling workloads onto them.
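
To see why spreading matters for an individual workload, here is a minimal sketch that places replicas evenly across zones and counts how many are left after the worst single-zone failure. It is not how the Kubernetes scheduler works internally: the round-robin placement, the function names, and the replica counts are assumptions for illustration. (In a real cluster you would typically ask for this kind of spread with topology spread constraints keyed on the `topology.kubernetes.io/zone` node label.)

```python
from itertools import cycle


def spread_replicas(replicas, zones):
    """Assign replicas to zones round-robin, so zone counts differ by at most one."""
    placement = {zone: 0 for zone in zones}
    for _, zone in zip(range(replicas), cycle(zones)):
        placement[zone] += 1
    return placement


def worst_case_survivors(placement):
    """Replicas still running after the most heavily loaded zone fails."""
    return sum(placement.values()) - max(placement.values())


three_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

multi_az = spread_replicas(6, three_zones)
single_az = spread_replicas(6, ["us-east-1a"])

print(multi_az, worst_case_survivors(multi_az))
print(single_az, worst_case_survivors(single_az))
```

With six replicas over three zones, any single-zone failure still leaves four replicas serving traffic; with all six in one zone, the same failure leaves none.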

Of course, this raises questions around performance and cost. Networking within an AZ offers lower latency and more bandwidth than networking across AZs, largely because of the physical reality of building a network inside a building as opposed to across a town. Because of that reality, Amazon charges for bandwidth consumed by inter-AZ traffic. I won’t repeat what A Kubernetes engineer’s guide to migrating to Amazon EKS says on these topics; suffice it to say that there are things you can do to minimize the burden, but you will definitely need to think about it!
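
As a rough way to think about that cost, the sketch below estimates a monthly inter-AZ bill. It rests on several loud assumptions: callers and backends are spread evenly across zones, each request picks a backend uniformly at random (so no zone-aware routing), and `PRICE_PER_GB` is a placeholder rather than a quoted AWS price. Check current AWS pricing and your own traffic patterns before relying on any number it produces.

```python
def cross_zone_fraction(zone_count):
    """Share of requests that leave the caller's zone.

    Assumes backends are chosen uniformly at random across equally sized zones.
    """
    return (zone_count - 1) / zone_count


def monthly_cross_az_cost(gb_per_day, zone_count, price_per_gb, days=30):
    """Estimated monthly charge for service-to-service traffic that crosses zones."""
    cross_az_gb = gb_per_day * cross_zone_fraction(zone_count) * days
    return cross_az_gb * price_per_gb


PRICE_PER_GB = 0.02  # placeholder rate, NOT a quoted AWS price

# Hypothetical workload: 500 GB/day of east-west traffic in a three-AZ cluster.
print(round(monthly_cross_az_cost(500, 3, PRICE_PER_GB), 2))
```

The interesting part is `cross_zone_fraction`: with three zones and no zone-aware routing, about two thirds of the traffic crosses a zone boundary, which is why keeping traffic in-zone where possible can make a real difference.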

What happens when an entire AWS region fails?

To round out our discussion: though it’s thankfully uncommon for entire regions to fail, when it happens, a multi-AZ cluster isn’t going to help you; you’ll need to use a multi-cluster or multi-cloud strategy instead.

The reason, again, has to do with the reality of networking. The extra latency of cross-region communications causes trouble when you’re talking about running multiple replicas of the Kubernetes control plane in multiple regions, and that’s without managing the state that the cloud provider needs to track for which node belongs to which cluster. Requiring clusters to live entirely within one region makes it easier to get everything correct.
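
One way to see the latency problem: the Kubernetes control plane keeps its state in etcd, which uses the Raft consensus protocol, so a write commits only after a majority of etcd members have acknowledged it. The sketch below is a simplified model of that rule, not etcd’s implementation, and the round-trip times are hypothetical values chosen only to contrast nearby AZs with distant regions.

```python
def commit_latency_ms(follower_rtts_ms):
    """Approximate time for a leader to commit one write.

    The leader counts toward the majority, so it needs acknowledgements from
    (quorum - 1) followers and waits for the slowest of the fastest such group.
    Ignores disk and processing time (a simplifying assumption).
    """
    members = len(follower_rtts_ms) + 1  # followers plus the leader
    quorum = members // 2 + 1
    acks_needed = quorum - 1
    if acks_needed == 0:
        return 0.0
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Hypothetical round-trip times from the leader to its two followers.
same_region_other_azs = [1.0, 2.0]
other_regions = [70.0, 140.0]

print(commit_latency_ms(same_region_other_azs))
print(commit_latency_ms(other_regions))
```

In this model every write to a three-member control plane waits on the nearest follower, so moving the followers to other regions multiplies the cost of each write by the cross-region round trip.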

Multi-cluster and multi-cloud operations are a topic for another article; for now, we’ll just note that these options open up some fascinating operational possibilities beyond improving reliability.

Key takeaways: Building resilient clusters across availability zones

Failure is a fact of life when we start talking about Kubernetes, enough so that Kubernetes itself manages small-scale failures by design, and all the cloud providers offer techniques for mitigating the effects of larger-scale failures. EKS takes this a step further by all but requiring multi-AZ clusters; hopefully, this article provides some needed context for why this is important and how it benefits you.

FAQ

What is an availability zone in AWS?

An availability zone (AZ) is a physically isolated group of one or more data centers within an AWS region. Each AZ has independent power and networking but is connected to other AZs in the same region by low-latency, high-bandwidth links, so workloads can span AZs for resilience.

Why does EKS require multiple availability zones?

EKS requires multiple availability zones so that a single rack, data center, or AZ failure cannot take down an entire Kubernetes cluster. With nodes spread across at least two AZs, Kubernetes can reschedule workloads onto healthy nodes when one zone goes offline.

How many availability zones are in an AWS region?

Every AWS region contains at least three availability zones, though some regions have more. AWS guarantees this minimum so that customers can architect highly available systems, including multi-AZ Kubernetes clusters on EKS, with enough physical separation to survive a single-AZ outage.

What is the difference between an availability zone (AZ) and a region?

A region is a large geographic area (often spanning multiple metro areas), while an AZ is a smaller, physically isolated group of data centers inside that region. AZs in the same region share fast, low-latency networking. Regions are separated by higher latencies and designed to fail independently.

Can a Kubernetes cluster span multiple regions?

A standard EKS cluster cannot span multiple regions because cross-region network latency disrupts the Kubernetes control plane. To survive a full region failure, you need a multi-cluster or multi-cloud architecture.
