Observability and Reliability for Kubernetes
Get success rate, request volume, and latency for every meshed service the moment its pods roll with the proxy injected. No instrumentation. Then turn the operational work that quietly causes outages, like certificate rotation, into a non-event.
0 lines of instrumentation code required
Up to 85% reduction in control-plane memory at scale (2.20)
24 h automatic proxy cert rotation cycle
Why uniform metrics are hard in a polyglot cluster
Every service emits metrics differently, or not at all, and instrumenting each one by hand across languages and frameworks is a project that never finishes. Meanwhile the operational work that keeps the mesh healthy, certificate rotation above all, is the kind of toil that causes a cluster-wide outage when a step is missed.
A service mesh reads traffic at the proxy, so you get one consistent set of golden metrics across every service with no code change, and the riskiest operational steps get automated.
"A project that never finishes."
Every service emits metrics differently, or not at all. No consistent signal across the fleet.
Certificate rotation: the step most likely to cause a cluster-wide outage when missed.
What you get
From the moment the proxy is injected, you get consistent golden signals across every meshed service — no instrumentation, no code change. The operational work that quietly causes outages gets automated too.
Golden metrics, no instrumentation
Success rate, RPS, and latency percentiles for every meshed service, the moment the proxy is injected.
Cert rotation as a non-event
Linkerd auto-rotates proxy certs every 24 hours; Buoyant Enterprise for Linkerd (BEL) 2.20 automates the riskiest step, trust-anchor rotation.
Up to 85% less control-plane memory
The result of a destination-controller refactor in 2.20, seen on large, high-churn clusters.
Hundreds of failed deploys caught
loveholidays built SLOs on Linkerd metrics and caught hundreds of failed deploys before they became outages.
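As a rough illustration of how an SLO can sit on top of a success-rate metric, here is a minimal sketch of the standard burn-rate arithmetic. The function names and the alert threshold of 10 are assumptions made for this example; they are not taken from Linkerd, BEL, or the loveholidays setup.

```python
def burn_rate(success_rate, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on pace;
    higher values mean it is being spent faster.
    """
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (1.0 - success_rate) / allowed_error_rate


def should_alert(success_rate, slo_target, threshold=10.0):
    """Flag a window whose burn rate reaches the (assumed) threshold."""
    return burn_rate(success_rate, slo_target) >= threshold
```

A deploy that pushes success rate to 98% against a 99.9% target burns budget roughly twenty times faster than allowed, which is the kind of signal an SLO alert is meant to catch early.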
How does Linkerd observability work?
Linkerd's proxy reads every meshed connection and records golden metrics (success rate, requests per second, latency percentiles) for HTTP, HTTP/2, and gRPC, with no code change. Those metrics are scraped into Prometheus and surfaced on the dashboard, ready for SLOs and alerts. The control plane issues and rotates identities, and the BEL trust-anchor rotation operator handles the one cert step most likely to cause an outage.
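To make the three golden metrics concrete, the sketch below computes them from a window of observed requests. It is only an illustration of the arithmetic: `RequestSample` and its fields are hypothetical, not Linkerd's metric schema, and the nearest-rank percentile is a choice made for this example rather than a description of how the proxy records latency.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class RequestSample:
    """One observed request. Field names are hypothetical."""
    latency_ms: float
    success: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already-sorted, non-empty list."""
    rank = max(1, ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def golden_metrics(samples, window_seconds):
    """Summarize a window into success rate, RPS, and latency percentiles."""
    if not samples:
        return {"success_rate": None, "rps": 0.0,
                "p50_ms": None, "p95_ms": None, "p99_ms": None}
    latencies = sorted(s.latency_ms for s in samples)
    successes = sum(1 for s in samples if s.success)
    return {
        "success_rate": successes / len(samples),
        "rps": len(samples) / window_seconds,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Because the proxy sees every meshed request, the same summary can be produced for every service without each service exporting anything itself.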
Zone-aware load balancing with HAZL
Under normal load HAZL keeps requests in-zone; when the in-zone endpoint is overloaded or returning errors, it spills to another zone, then returns in-zone as load recovers.
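The in-zone, spill, and return behavior described above can be pictured as a small state machine. The sketch below is a toy model only: `ZoneRouter`, its thresholds, and its inputs are assumptions for illustration and do not describe HAZL's actual implementation or configuration.

```python
from dataclasses import dataclass


@dataclass
class ZoneRouter:
    """Toy in-zone/cross-zone switch. All thresholds are assumed values."""
    spill_above: float = 0.8      # in-zone load fraction that triggers spilling
    return_below: float = 0.5     # load fraction at which traffic returns in-zone
    max_error_rate: float = 0.05  # in-zone error rate that triggers spilling
    spilling: bool = False

    def update(self, in_zone_load, in_zone_error_rate):
        """Record the latest in-zone signals and return whether to spill."""
        if in_zone_load > self.spill_above or in_zone_error_rate > self.max_error_rate:
            self.spilling = True
        elif in_zone_load < self.return_below:
            self.spilling = False
        # Between the two load thresholds the previous state is kept.
        return self.spilling

    def allowed_zones(self, local_zone, all_zones):
        """Zones a request may be sent to in the current state."""
        return list(all_zones) if self.spilling else [local_zone]
```

The gap between the two load thresholds is a design choice in this sketch: it keeps the router from flapping between zones when load hovers near a single cutoff.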
Stop paying for cross-zone traffic you don't need
Most teams turn HAZL on with no tuning and watch steady-state cross-zone traffic drop while reliability holds. Run the demo, then point it at a real cluster and measure against your own bill.
What Topology Aware Routing doesn't cover
✗ Drops to 0% in-zone under failure
✗ Requires ≥3 balanced pods per zone
✗ No health-based spill logic
✗ Struggles with autoscaling
✓ Free, no license needed
✓ Built into Kubernetes
What HAZL covers
✓ Never sacrifices reliability
✓ Works with <3 pods per zone
✓ In-band health checking (HTTP/gRPC)
✓ Reads HTTP 429 as a spill signal
✓ ~1 min recovery after overload
✓ No tuning required in most cases
Why HAZL
1. Cost
2. Reliability
3. Simplicity
Cuts cost and protects reliability
HAZL is a "request-level load balancer in Buoyant Enterprise for Linkerd that balances HTTP and gRPC traffic in environments with multiple availability zones," and unlike Topology Aware Routing "never sacrifices reliability to achieve this cost reduction."
It reacts to real load
HAZL balances on outstanding requests per endpoint and prefers local endpoints, adding cross-zone endpoints only when local load climbs. It uses in-band health checking and reads rate-limit responses: an in-zone endpoint returning HTTP 429 is treated as a reason to spill, not as a fast success (a BEL feature). In the same failure that dropped Topology Aware Routing to 0%, HAZL held near 100%.
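One way to picture balancing on outstanding requests with a local preference is the sketch below. It is a simplified model under stated assumptions, not HAZL's algorithm: the `Endpoint` record, the `local_limit` ceiling, and the rule that a 429 or unhealthy mark simply excludes an endpoint are all hypothetical, and recovery of excluded endpoints is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical endpoint record; not a Linkerd or BEL type."""
    name: str
    zone: str
    outstanding: int = 0        # requests sent but not yet answered
    rate_limited: bool = False  # caller sets this after seeing HTTP 429
    healthy: bool = True        # caller sets this from in-band observation


def pick_endpoint(endpoints, local_zone, local_limit=4):
    """Prefer the least-loaded usable local endpoint; otherwise spill.

    local_limit is an assumed per-endpoint ceiling on outstanding requests.
    Returns None when no endpoint is usable.
    """
    usable = [e for e in endpoints if e.healthy and not e.rate_limited]
    local = [e for e in usable
             if e.zone == local_zone and e.outstanding < local_limit]
    # Fall back to every usable endpoint, including other zones, when no
    # local endpoint has headroom.
    candidates = local or usable
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.outstanding)
```

Treating a 429 as "not usable" instead of "answered quickly" is the point of the example: a balancer that only watched latency would see a rate-limited endpoint as fast and send it more traffic.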
It works where TAR struggles
HAZL handles fewer than 3 pods per zone, imbalanced traffic, and autoscaling, and it "requires no tuning or configuration" in most cases. It also preserves zone affinity across cluster boundaries.
Show me the evidence
Every claim is backed by a reproducible demo, a published cost model, and a CNCF track record.
Reproducible demo
Run it on a 3-zone local cluster and watch HAZL hold success rate where Topology Aware Routing drops it to 0%.
Published cost model
The cost figures above are from Buoyant's published AWS model — open for review.
Frequently asked questions
What does HAZL do?
It balances HTTP and gRPC requests to keep traffic in-zone for cost savings, while sending it cross-zone when reliability requires it.