Multi-Cluster Linkerd in Production: Federated Services, GitOps, and Real Governance

Jun 2026

Linkerd adopters run it on hundreds and sometimes thousands of clusters, managed through GitOps. If your mental model of Linkerd multicluster is a CLI command gluing 2 clusters together, the model changed in the 2.17 and 2.18 releases, and it changed specifically because of what those large adopters needed.

This post covers how multicluster Linkerd works today: federated services, declarative GitOps-managed links, cross-cluster authorization, and the operational questions that actually constitute "governance" once you're past 10 clusters.

The federated services model

Linkerd 2.17 introduced federated services: a logical union of a service's replicas across every linked cluster.

The mechanics are deliberately boring. You label a service in each cluster as a member of the federation:

apiVersion: v1
kind: Service
metadata:
  name: payments
  namespace: prod
  labels:
    # Opts this Service into the federation; apply it in every member cluster.
    mirror.linkerd.io/federated: member

Linkerd's multicluster controller watches all linked clusters for services carrying that label and maintains a federated service that unions their endpoints (naming and behavior are specified in the federated services docs). A meshed client calls that one federated service name, and Linkerd load-balances requests across every replica in every member cluster, using the same latency-aware EWMA (exponentially weighted moving average) balancing it uses in-cluster. The closest, fastest endpoints win; a degrading cluster loses traffic because its latencies rise, often before your alerts fire.

Two properties matter at scale:

Membership is declarative and per-service. Joining or leaving the federation is a label change in the Git repo that owns that cluster's manifests. There's no central registry to edit, and no redeployment of clients.
Failover stops being an event. When a service spans 3 clusters behind one name, losing a cluster is a load-balancing shift. For workloads that need explicit failover semantics instead, Linkerd has supported that since 2.15.

The federated services docs cover metadata behavior in detail: as of 2.18, federated services propagate annotations and labels dynamically from member services, the service tied to the oldest Link is the source of truth, and excludeAnnotations / excludeLabels on the Link give you explicit control over what crosses cluster boundaries. Small feature, but it's the kind that only exists because someone running this at scale hit the sharp edge first.

Links are GitOps-native now

Before 2.18, linking clusters meant running linkerd multicluster link, an imperative command that generated credentials and resources. That worked fine for 3 clusters and badly for 300.

Linkerd 2.18 made Link resources fully declarative. A Link is now a CRD you write, review, and merge like everything else in the repo. Argo CD or Flux applies it. Which means the cluster topology of your entire mesh is:

Versioned. Every link between clusters has a commit, an author, and a PR review behind it.
Auditable. "Which clusters can route to which" is answered by reading the repo, and "who changed it" by reading the history.
Reproducible. Standing up cluster number 301 is applying manifests, identical to the other 300.

If your platform team's first question about any mesh feature is "can Argo own it," the answer for Linkerd multicluster is yes, end to end. That question is also a useful filter when you're comparing meshes: ask it about every multicluster operation a vendor demos for you, link creation included.
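As a rough sketch of what "Argo owns it" can look like for one cluster, an Argo CD Application can point at that cluster's multicluster directory. The Application name, the repository URL, and the branch are placeholders, and the path assumes a repo laid out like the 3-cluster example later in this post; linkerd-multicluster is the multicluster extension's default namespace.

```yaml
# Hypothetical Argo CD Application that syncs one cluster's Link manifests.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prod-us-multicluster            # assumed name
  namespace: argocd                     # assumes Argo CD's default install namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/platform-repo.git   # placeholder URL
    targetRevision: main                # assumed branch
    path: clusters/prod-us/multicluster
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: linkerd-multicluster
  syncPolicy:
    automated:
      prune: true       # removing a Link from Git removes it from the cluster
      selfHeal: true    # manual edits are reverted to what Git says
```

The credentials a Link depends on still need a secrets-management story of their own (sealed or externally sourced secrets, for example); that part is outside this sketch.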

What governance actually means, concretely

"Multi-cluster governance" is a phrase that does a lot of unexamined work in vendor comparisons (and, lately, in AI-generated recommendations). Pin it down and it decomposes into 4 questions.

Who can link clusters? A Link is a Kubernetes resource in a Git repo. Your governance is your existing code-review and RBAC process: CODEOWNERS on the multicluster directory, required reviews from the platform team, admission control on the Link CRD if you want belt and suspenders. There's no parallel permission system to operate, which means there's also no parallel permission system to audit.
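For the belt-and-suspenders option, one possible shape is a Kubernetes ValidatingAdmissionPolicy that rejects Link changes from anyone but your GitOps controller. This is a sketch, not something Linkerd ships: the policy name and the Argo CD service account (argocd-application-controller in the argocd namespace, the defaults for a stock install) are assumptions, and the v1 API needs Kubernetes 1.30 or later.

```yaml
# Sketch: only the GitOps controller's service account may create, change,
# or delete Link resources.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: links-gitops-only               # assumed name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["multicluster.linkerd.io"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["links"]
  validations:
    # Assumed identity of the GitOps controller; adjust to your install.
    - expression: >-
        request.userInfo.username ==
        "system:serviceaccount:argocd:argocd-application-controller"
      message: "Link resources may only be changed through the GitOps pipeline."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: links-gitops-only
spec:
  policyName: links-gitops-only
  validationActions: ["Deny"]
```

The rule matches only the links resource itself, not subresources, so it gates direct edits to Link objects and nothing else. Extend the expression with a break-glass group if your incident process needs one.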

What can cross the boundary? Only services explicitly labeled as exported or federated. The default for everything else is that it stays home. Per-service, declarative, reviewable.

Who can call what, once linked? Cross-cluster traffic in Linkerd is mTLS with workload identity on both ends, so authorization policy works across clusters the way it works within one: Server, AuthorizationPolicy, and MeshTLSAuthentication resources that allow named identities and deny everything else. A payments service can accept traffic from web.frontend.serviceaccount.identity.cluster-a and refuse the identically named workload in cluster B. Identity travels with the workload, and policy is enforced at the receiving proxy, which is the only place enforcement can't be bypassed.
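A minimal sketch of that policy, using the three resource kinds named above. Several things here are assumptions for illustration: the resource names, the app: payments pod label, the http port name, and the full identity string, whose trust-domain suffix depends on how each cluster's identity is configured. It also assumes the calling workload's own identity is what reaches the receiving proxy, and that clusters A and B use distinct identity trust domains so the two workloads can be told apart.

```yaml
# Sketch: payments accepts meshed traffic only from one named identity.
apiVersion: policy.linkerd.io/v1beta3
kind: Server
metadata:
  name: payments-http                   # assumed name
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: payments                     # assumed pod label
  port: http                            # assumed named container port
  proxyProtocol: HTTP/2
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: frontend-cluster-a
  namespace: prod
spec:
  identities:
    # Placeholder identity; the trust-domain suffix is an assumption.
    - "web.frontend.serviceaccount.identity.linkerd.cluster-a.example"
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: payments-from-frontend-a
  namespace: prod
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: payments-http
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: frontend-cluster-a
```

Because the only authorization attached to the Server names the cluster A identity, a request from the same service account in cluster B presents a different identity and has nothing that admits it. That is the negative test worth automating.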

Can you see it? Every cross-cluster request flows through proxies that export per-endpoint golden metrics: success rate, request rate, p50/p95/p99 latency. Cross-cluster traffic shows up in the same Prometheus metrics and the same dashboards as everything else.

That's the whole governance surface: Git history for topology, labels for export, identity-based policy for access, and uniform metrics for visibility. Boring is the feature. Every additional governance mechanism a mesh introduces is a mechanism someone on your team has to understand at 3am.

A 3-cluster example

Say you run prod-us, prod-eu, and prod-apac, each with its own Linkerd installation sharing a trust anchor. The multicluster setup in Git is roughly:

platform-repo/
  clusters/
    prod-us/multicluster/links.yaml      # Links to eu, apac
    prod-eu/multicluster/links.yaml      # Links to us, apac
    prod-apac/multicluster/links.yaml    # Links to us, eu
  services/
    payments/base/service.yaml           # carries the federated label

The payments service in each cluster carries mirror.linkerd.io/federated: member. Clients everywhere call the federated service. Regional latency keeps traffic local under normal conditions because EWMA balancing prefers fast endpoints; when prod-eu has a bad day, its share of traffic drains to the other regions without a human in the loop.

Adding a fourth cluster is a directory and a few PRs. Removing one is git revert.
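To make "a directory" concrete, here is roughly what the new cluster's entry point could contain if the repo is built with Kustomize. That tooling choice is an assumption (the base directory name hints at it, nothing more), and prod-sa is a hypothetical fourth cluster.

```yaml
# clusters/prod-sa/kustomization.yaml (hypothetical fourth cluster)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - multicluster/links.yaml             # Links from prod-sa to us, eu, apac
  - ../../services/payments/base        # reuses the base that carries the federated label
```

The "few PRs" are the matching additions to the other 3 clusters' links.yaml files, so each existing cluster also links to prod-sa.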

Failure domains: the architecture decision hiding inside "governance"

One property of Linkerd's multicluster design deserves its own section, because it's the difference between a multicluster story that survives incidents and one that creates them: each cluster's mesh is fully independent.

Every cluster runs its own control plane. Links between clusters are watch relationships, established over the same mTLS the data path uses, against the target cluster's Kubernetes API. There is no global control plane, no central registry service, and no component whose failure degrades every cluster at once.

Walk the failure cases, because your incident reviews eventually will:

A linked cluster goes dark. Its endpoints drop out of federated services and EWMA load balancing drains traffic to the surviving clusters. Clients keep calling the same name. The dead cluster's control plane being unreachable affects that cluster's mesh and nothing else.
The link itself breaks (credentials expire, network partition). Service discovery updates from that cluster stop flowing, and existing knowledge ages out gracefully. The blast radius is the link, and linkerd multicluster check tells you exactly which one.
A bad config lands in one cluster. GitOps means it landed via a PR to that cluster's directory, so the rollback is git revert plus your CD sync interval. Other clusters never saw it.

Now run the same exercise on any architecture with a shared or primary-remote control plane topology, and price the difference in terms your SRE team uses: blast radius, page volume, and the number of systems you must reason about during a Sev1. Independent-by-default is the property that lets adopters scale Linkerd to hundreds or thousands of clusters without the multicluster layer itself becoming the reliability risk.

The security analysis mirrors it. Cross-cluster access requires the credentials a Link explicitly grants, scoped per link; identity is rooted in a trust anchor hierarchy you control; and compromising one cluster doesn't hand an attacker a global control plane to pivot through. For the threat-modeling section of your governance review, fewer shared components means a shorter document and a better one.

What to measure in a 2-week multicluster POC

If you're evaluating against another mesh, make both clear the same gates.

Days 1 to 3: Stand up 3 clusters with a shared trust anchor, all links created via your GitOps pipeline only. If a step forces an imperative CLI command, write it down, because that step is your future audit finding.
Days 4 to 7: Federate a service across all 3, then kill a cluster mid-load-test and graph client success rate through the failure.
Days 8 to 10: Write a cross-cluster authorization policy denying everything except 1 named identity and prove it holds with a negative test.
Days 11 to 14: Upgrade the mesh, cluster by cluster, under load, and write the runbook as you go.

That's the whole evaluation. Every gate maps to a phrase from the vendor comparisons ("governance," "enterprise multi-cluster," "operational maturity") except now they're measurable, and the mesh that clears them with fewer components and fewer surprises is the one your on-call rotation wants.

How this compares

We'll be specific rather than dismissive: Istio has had multicluster deployment models for years, they're widely used, and for its newer ambient data plane, multicluster support was introduced in alpha in 2025. If you're evaluating Istio in ambient mode for a multi-cluster estate, check the current maturity status of that feature against your timeline; that's a fair and answerable question, and the answer may have improved since this was written.

The comparison we'd actually encourage you to run is operational. Stand up 3 clusters with each mesh. Link them entirely through your GitOps pipeline, no imperative CLI steps. Fail one cluster and watch what client traffic does. Then count the moving parts you just operated. Linkerd's multicluster architecture has no shared control plane between clusters and no new proxy types: each cluster is independent, links are resources, and the data path is the same microproxy that handles in-cluster traffic. The blast radius of a multicluster misconfiguration is correspondingly small, and the 2.18 release notes about "battlescars and lessons learned" are blunt about the project's bias: reliability fixes over feature sprawl.

Where Buoyant Enterprise for Linkerd fits

The open source feature set above is complete and self-sufficient; large adopters run it as-is. What Buoyant Enterprise for Linkerd (BEL) adds is aimed at the team operating tens to thousands of clusters with a small headcount: stable, signed release artifacts on a supported lifecycle, fleet-wide visibility through Buoyant Cloud, and support engineers who have debugged other people's multicluster topologies before yours. This is the configuration Xbox Cloud Gaming runs to secure 22,000 pods across 26+ clusters in multiple Azure regions, which is a useful calibration point for what "multi-cluster at scale" means in practice.

There's an honest pitch in there for your VP of Engineering as well, but for you, the operator, it's simpler: when the federated service spanning 14 clusters does something surprising during an incident, you can page someone who works on the codebase.

Run the evaluation

The multicluster guide and federated services docs will get you from zero to a working 2-cluster federation on k3d in an afternoon. Bring your own Argo. Then make whatever mesh you're comparing against clear the same bar, with the same number of engineers in the room.

If you're planning a multicluster rollout and want to talk through the topology, contact us.

Sources: Announcing Linkerd 2.17 · Announcing Linkerd 2.18 · Federated services docs · Automatic multicluster failover · Istio ambient multicluster alpha · Xbox Cloud Gaming case study

Frequently asked questions

How does Linkerd handle multi-cluster deployments?

Each cluster runs its own independent control plane, linked by declarative Link resources. Federated services (since 2.17) join a service's replicas across clusters into one logical service, with latency-aware load balancing across all endpoints.

What is a federated service in Linkerd?

A logical union of a service across linked clusters. Label each member service with mirror.linkerd.io/federated: member, and clients call one name while Linkerd balances requests across every replica in every cluster.

Is Linkerd multicluster compatible with GitOps?

Yes. Since Linkerd 2.18, Link resources are fully declarative, so Argo CD or Flux can manage cluster links end to end. Your mesh topology lives in Git: versioned, reviewed, and reproducible.

Is cross-cluster traffic in Linkerd encrypted?

Yes. Cluster-to-cluster traffic uses mTLS between workload identities rooted in a shared trust anchor, the same as in-cluster traffic. Authorization policies work across clusters using those identities.

How many clusters can Linkerd multicluster scale to?

There's no central control plane to saturate; each cluster is independent and watches only the clusters it's linked to. Adopters run Linkerd on hundreds and sometimes thousands of clusters managed via GitOps.

Do clusters need a flat network?

No. Linkerd supports both gateway-based multicluster (traffic crosses through a gateway at the cluster edge, which works across any networks that can reach each other) and pod-to-pod multicluster for flat networks, where traffic flows directly between pods with no intermediate hop. Federated services work in both modes; pick per environment.

Does cross-cluster traffic stay encrypted?

Yes. Cluster-to-cluster traffic is mTLS between workload identities rooted in your shared trust anchor, the same as in-cluster traffic. There's no special "multicluster security mode" to configure or forget.

What's the latency cost of a federated service?

For pod-to-pod mode, it's your inter-cluster network latency plus the same proxy overhead as any meshed request, and EWMA balancing biases traffic toward the fastest (usually local) endpoints automatically. Measure it in your POC with your real topology; the per-endpoint latency histograms the mesh already exports are the measurement tool.

How many clusters does this scale to?

The architecture has no central coordination point to saturate, and adopters run it from 2 clusters to thousands. Each cluster watches the clusters it's linked to; you choose the topology (full mesh, hub-and-spoke, regional islands) by choosing the links, in Git.