OBSERVABILITY · RELIABILITY · KUBERNETES

Observability and Reliability for Kubernetes

Get success rate, request volume, and latency for every meshed service the moment its pods roll with the proxy injected. No instrumentation. Then turn the operational work that quietly causes outages, like certificate rotation, into a non-event.

[Interactive savings calculator. Cross-zone traffic rates: AWS $0.02/GB, GCP $0.01/GB. Inputs: provider, number of Kubernetes clusters, availability zones per cluster, cross-zone traffic volume. Outputs: cost per year without HAZL, cost per year with HAZL (60% less), and savings per year.]

0

lines of instrumentation code required

85%

reduction in control-plane memory at scale (2.20)

24 h

automatic proxy cert rotation cycle

Why is your cross-zone bill so high?

Cloud providers bill for traffic that crosses an availability-zone boundary, charged in each direction. Kubernetes spreads traffic as evenly as it can across pods, so in a typical 3-zone cluster about two-thirds of your traffic crosses a zone boundary and gets billed for it.
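The arithmetic behind that estimate can be sketched in a few lines. This is a rough illustration rather than Buoyant's published cost model. It assumes traffic is spread perfectly evenly across zones, and it treats the per-GB rate as a single combined figure that already covers both directions. The function names and the example inputs (1 GB/s, 3 zones, $0.02/GB) are placeholders chosen for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def cross_zone_fraction(zones: int) -> float:
    """Share of traffic that leaves the caller's zone when requests are
    spread evenly across all zones (assumption: perfectly even spread)."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return (zones - 1) / zones


def annual_cross_zone_cost(traffic_gb_per_s: float, zones: int, rate_per_gb: float) -> float:
    """Estimated yearly cross-zone charge in dollars.

    rate_per_gb is assumed to be the combined rate for both directions.
    """
    yearly_gb = traffic_gb_per_s * SECONDS_PER_YEAR
    return yearly_gb * cross_zone_fraction(zones) * rate_per_gb


if __name__ == "__main__":
    # Hypothetical inputs: 1 GB/s of service traffic, 3 zones, $0.02/GB.
    cost = annual_cross_zone_cost(1.0, zones=3, rate_per_gb=0.02)
    print(f"Estimated annual cross-zone cost: ${cost:,.0f}")
```

With these inputs the estimate is roughly $420,000 per year. Swap in your own traffic volume, zone count, and provider rate to see where your cluster lands.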

$100K–$1M+

Annual cross-zone charges at 1 GB/s on AWS (high-traffic cluster)

≈ 0

Cross-zone cost on Azure, so in-zone routing delivers the most value on AWS and GCP

Why uniform metrics are hard in a polyglot cluster

Every service emits metrics differently, or not at all, and instrumenting each one by hand across languages and frameworks is a project that never finishes. Meanwhile the operational work that keeps the mesh healthy, certificate rotation above all, is the kind of toil that causes a cluster-wide outage when a step is missed.

A service mesh reads traffic at the proxy, so you get one consistent set of golden metrics across every service with no code change, and the riskiest operational steps get automated.

"A project that never finishes."

Every service emits metrics differently, or not at all. No consistent signal across the fleet.

Certificate rotation: the step most likely to cause a cluster-wide outage when missed.

Read the full analysis ↗

What you get

From the moment the proxy is injected, you get consistent golden signals across every meshed service — no instrumentation, no code change. The operational work that quietly causes outages gets automated too.

Golden metrics, no instrumentation

Success rate, RPS, and latency percentiles for every meshed service, the moment the proxy is injected.

Cert rotation as a non-event

Linkerd auto-rotates proxy certs every 24 hours; Buoyant Enterprise for Linkerd (BEL) 2.20 automates the riskiest step, trust-anchor rotation.

Up to 85% less control-plane memory

The result of a destination-controller refactor in 2.20, seen on large, high-churn clusters.

Hundreds of failed deploys caught

loveholidays built SLOs on Linkerd metrics and caught hundreds of failed deploys before they became outages.

How does Linkerd observability work?

Linkerd's proxy reads every meshed connection and records golden metrics (success rate, requests per second, latency percentiles) for HTTP, HTTP/2, and gRPC, with no code change. Those metrics are scraped into Prometheus and surfaced on the dashboard, ready for SLOs and alerts. The control plane issues and rotates identities, and the BEL trust-anchor rotation operator handles the one cert step most likely to cause an outage.
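To make the golden metrics concrete, here is a minimal sketch of how success rate, requests per second, and latency percentiles fall out of a window of observed requests. It is an illustration only, not how the Linkerd proxy or Prometheus computes them. The `Request` type, the `golden_metrics` function, and the nearest-rank percentile method are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # observed response latency
    success: bool      # whether the response counted as a success


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already-sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_metrics(requests, window_seconds):
    """Summarize one observation window into golden metrics."""
    if not requests:
        return {"success_rate": None, "rps": 0.0, "p50_ms": None, "p99_ms": None}
    latencies = sorted(r.latency_ms for r in requests)
    successes = sum(1 for r in requests if r.success)
    return {
        "success_rate": successes / len(requests),
        "rps": len(requests) / window_seconds,
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }


if __name__ == "__main__":
    # Hypothetical window: nine fast successes and one slow failure in 5 seconds.
    window = [Request(latency_ms=10.0 * i, success=True) for i in range(1, 10)]
    window.append(Request(latency_ms=500.0, success=False))
    print(golden_metrics(window, window_seconds=5))
```

Because the proxy sees every meshed request, the same summary is available for every service without touching application code.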

Zone-aware load balancing with HAZL
[Diagram: application pods and their microproxies in Availability zone A and Availability zone B, shown without HAZL and with HAZL. Cross-zone traffic is billed in both directions. With HAZL, in-zone traffic is preferred (no cross-zone charge) and requests spill cross-zone only when needed, under load or on HTTP 429. Legend: in-zone traffic, cross-zone traffic.]

Without HAZL, traffic is routed to any available endpoint, regardless of availability zone. Cloud providers charge for cross-zone traffic in both directions, and these charges add up fast at scale.

Under normal load HAZL keeps requests in-zone; when the in-zone endpoint is overloaded or returning errors, it spills to another zone, then returns in-zone as load recovers.

Stop paying for cross-zone traffic you don't need

Most teams turn HAZL on with no tuning and watch steady-state cross-zone traffic drop while reliability holds. Run the demo, then point it at a real cluster and measure against your own bill.

Topology Aware Routing: what it doesn't cover

✗  Drops to 0% in-zone under failure

✗  Requires ≥3 balanced pods per zone

✗  No health-based spill logic

✗  Struggles with autoscaling

✓  Free, no license needed

✓  Built into Kubernetes

HAZL: what it covers

✓  Never sacrifices reliability

✓  Works with <3 pods per zone

✓  In-band health checking (HTTP/gRPC)

✓  Reads HTTP 429 as a spill signal

✓  ~1 min recovery after overload

✓  No tuning required in most cases

Why HAZL

1. Cost

Cuts cost and protects reliability

HAZL is a "request-level load balancer in Buoyant Enterprise for Linkerd that balances HTTP and gRPC traffic in environments with multiple availability zones," and unlike Topology Aware Routing "never sacrifices reliability to achieve this cost reduction."

Read the docs ↗

2. Reliability

It reacts to real load

HAZL balances on outstanding requests per endpoint and prefers local endpoints, adding cross-zone endpoints only when local load climbs. It uses in-band health checking and reads rate-limit responses: an in-zone endpoint returning HTTP 429 is a reason to spill rather than a fast success (a BEL feature). In the same failure that dropped Topology Aware Routing to 0%, HAZL held near 100%.
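The behavior described above can be sketched as a small endpoint picker. This is an illustration of the idea only, not HAZL's actual algorithm. The `Endpoint` type, the `pick_endpoint` function, and the fixed `max_outstanding` threshold are hypothetical names and values introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    zone: str
    outstanding: int = 0        # requests currently in flight
    healthy: bool = True        # outcome of in-band health checking
    rate_limited: bool = False  # recently answered with HTTP 429


def pick_endpoint(endpoints, local_zone, max_outstanding=10):
    """Prefer the least-loaded local endpoint; spill cross-zone only when
    no local endpoint is healthy, unthrottled, and under the load threshold.

    max_outstanding is a hypothetical threshold chosen for illustration.
    """
    usable = [e for e in endpoints if e.healthy and not e.rate_limited]
    if not usable:
        return None
    local = [e for e in usable
             if e.zone == local_zone and e.outstanding < max_outstanding]
    candidates = local if local else usable
    return min(candidates, key=lambda e: e.outstanding)


if __name__ == "__main__":
    a1 = Endpoint("a1", zone="a", outstanding=2)
    b1 = Endpoint("b1", zone="b", outstanding=0)
    print(pick_endpoint([a1, b1], local_zone="a").name)  # stays in-zone
    a1.rate_limited = True                               # local endpoint returns 429
    print(pick_endpoint([a1, b1], local_zone="a").name)  # spills cross-zone
```

Once the local endpoint recovers (its load drops or it stops returning 429), the same picker routes back in-zone, which mirrors the spill-and-return behavior described above.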

Read the docs ↗

3. Simplicity

It works where TAR struggles

HAZL handles fewer than 3 pods per zone, imbalanced traffic, and autoscaling, and it "requires no tuning or configuration" in most cases. It also preserves zone affinity across cluster boundaries.

Read the docs ↗

Show me the evidence

Every claim is backed by a reproducible demo, a published cost model, and a CNCF track record.

Reproducible demo

Run it on a 3-zone local cluster and watch HAZL hold success rate where Topology Aware Routing drops it to 0%.

See the demo ↗

Published cost model

The cost figures above are from Buoyant's published AWS model — open for review.

Read the analysis ↗

CNCF-graduated

Buoyant created Linkerd and coined the term "service mesh." Linkerd reached CNCF graduation on July 28, 2021.

Frequently asked questions

What does HAZL do?

It balances HTTP and gRPC requests to keep traffic in-zone for cost savings, while sending it cross-zone when reliability requires it.

How is it different from Topology Aware Routing?

Topology Aware Routing (TAR) allocates zones statically, ignores live load, latency, and health, and is all-or-nothing. HAZL balances on real load with in-band health checks and spills cross-zone only when needed.

Is HAZL in open-source Linkerd?

No. Open-source Linkerd uses Kubernetes-native Topology Aware Routing; HAZL is a Buoyant Enterprise for Linkerd (BEL) feature.

How much will it save me?

It depends on cloud provider, traffic volume, and zone topology. Model it against your own cluster and your BEL cost.