Benchmarking gRPC Load Balancing on Kubernetes: Linkerd vs Istio vs Cilium Under a Slow Pod

Your gRPC service shows 8ms p50 latency. Dashboards are green, and 1 in 5 of your connections is pinned to a slow pod with no way out. It's not a bug. It's gRPC's architectural “load balancing problem,” which shows up in benchmarks but is easy to miss if you're only watching the median.

Here's the punchline up front: when we degraded one backend with a brutal 200ms delay, every other setup we tested got worse, while Linkerd's p99 dropped to 29ms, a hair below its own all-healthy baseline of 30ms. One mesh got faster under failure; the rest didn't. This post explains why.

In June 2026, Buoyant's Escalation Engineering team ran 420 benchmarks across six isolated EKS clusters, testing how vanilla Kubernetes, Cilium (L4 and L7), Istio Ambient, Istio Sidecar, and Linkerd each handle a single degraded backend pod under gRPC load. The test harness and raw results can be reviewed or replicated here. The results clarify something that trips up a lot of production setups. Connection-level load balancing and request-level load balancing are not interchangeable for gRPC, and the difference shows up in your tail latency whether your pods are healthy or not.

Why L4 load balancers and gRPC don't mix

L4 load balancers make routing decisions at connection time. That works with HTTP/1.1, where clients open multiple connections to parallelize requests, which gives the load balancer multiple chances to distribute work across backends. If one connection lands on a slow pod, others land elsewhere.

gRPC runs over HTTP/2, which multiplexes all requests over a single long-lived connection. One connection is sufficient and optimal by design, so clients don't open more. An L4 load balancer picks a pod when the connection opens, and every request on that connection goes to that pod until the connection closes or drops, which in production can last for hours.

With 5 backends and 1 degraded pod, the math is straightforward. 1 in 5 connections lands on the slow pod and stays there with no platform signal and no automatic recovery.
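To make the pinning concrete, here is a minimal sketch of the two routing models. It is not taken from the test harness: the pod names are hypothetical, the connection-level pick is a simple stand-in for whatever the L4 balancer chooses, and the request-level picker is plain round robin rather than any mesh's real algorithm.

```python
from collections import Counter
from itertools import cycle

# Hypothetical pod names: four healthy backends and one degraded one.
PODS = ["pod-0", "pod-1", "pod-2", "pod-3", "pod-slow"]


def connection_level(pods, connection_id, num_requests):
    """L4 style: pick a backend once, when the connection opens.

    Every request multiplexed on that connection goes to the same pod.
    The modulo is only a stand-in for the balancer's one-time choice.
    """
    chosen = pods[connection_id % len(pods)]
    return Counter({chosen: num_requests})


def request_level(pods, num_requests):
    """L7 style: pick a backend for every request (round robin here)."""
    picker = cycle(pods)
    return Counter(next(picker) for _ in range(num_requests))


if __name__ == "__main__":
    # A single gRPC connection whose one pick happened to be the slow pod.
    print("connection-level:", dict(connection_level(PODS, 4, 1000)))
    print("request-level:   ", dict(request_level(PODS, 1000)))
```

In the connection-level case all 1,000 requests land on the one pod chosen at connect time; in the request-level case each of the 5 pods receives 200.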

The test setup

The benchmark used an identical gRPC echo application (unary request/response) deployed to 6 isolated EKS clusters on 3x m5.xlarge nodes (4 vCPU / 16 GiB) in AWS us-east-1, EKS 1.35. Each cluster had 4 healthy backend pods and 1 pod with a delay injected via in-app `time.Sleep` controlled by `ARTIFICIAL_DELAY_MS`. A single client connection drove 200 concurrent streams for 60 seconds. Each scenario ran 10 times, producing 420 total runs. Error rates were under 0.2% across all conditions, consistent with end-of-window HTTP/2 stream cancellations.

The conditions tested:

| Condition | Load balancing | Mechanism | Version |
|---|---|---|---|
| Vanilla Kubernetes | Connection-level | kube-proxy, no mesh | 1.35 |
| Cilium (L4/eBPF) | Connection-level | eBPF datapath replacing kube-proxy | 1.19.4 |
| Cilium (L7/Envoy) | Request-level | CiliumEnvoyConfig, LEAST_REQUEST | 1.19.4 |
| Istio Ambient (ztunnel) | Connection-level | Waypoint enrolled | 1.30.0 |
| Istio Sidecar (Envoy) | Request-level | Per-pod Envoy sidecar, LEAST_REQUEST | 1.30.0 |
| Linkerd | Request-level | EWMA latency-weighted balancing | preview-26.5.5 |


One note on scope

Istio Ambient ran with a namespace-enrolled waypoint. It still showed connection-level (bimodal) behavior: ztunnel tunnels all RPCs to one waypoint pod per connection, so the waypoint’s Envoy never load-balances per request (Istio #56864).

For Cilium, request-level routing isn’t the default. It requires installing Cilium with its Envoy L7 datapath enabled, a CiliumEnvoyConfig resource (here using LEAST_REQUEST), and a service.cilium.io/lb-l7=enabled annotation on the Service. If you’re running Cilium in its default eBPF mode, you have the L4 condition, not the L7 condition shown here.

The baseline already shows the problem

Before introducing any degraded pod, the baseline run with all pods healthy shows the issue:

| Mesh | p50 | p99 | RPS |
|---|---|---|---|
| Vanilla | 7.34ms | 65.53ms | 9,175 |
| Cilium (L4) | 8.07ms | 71.11ms | 8,099 |
| Istio Ambient | 8.91ms | 75.34ms | 7,604 |
| Istio Sidecar | 33.46ms | 49.64ms | 5,593 |
| Cilium (L7) | 44.52ms | 72.61ms | 4,144 |
| Linkerd | 17.55ms | 30.07ms | 9,374 |

Vanilla p50 at 7ms looks fine, but vanilla p99 at 65ms is already more than 2x Linkerd's 30ms, and no pod has degraded yet. The gap comes from head-of-line blocking, not proxy overhead. With a single HTTP/2 connection and uneven request durations, some requests end up queuing behind longer-running ones on the same pod. The tail pays the price.

Linkerd's higher p50 (17ms vs 7ms for vanilla) reflects real proxy overhead. You're paying ~10ms per request for the ability to route at the request level. Whether that's a reasonable trade-off is exactly what the degraded-pod tests answer.
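The overhead figures quoted in this post can be reproduced from the baseline table as the p50 difference against vanilla. The snippet below is only that subtraction; treating the p50 delta as "proxy overhead" is the post's framing, and the function name is illustrative.

```python
# Baseline p50 values (ms) copied from the all-healthy table above.
BASELINE_P50_MS = {
    "Vanilla": 7.34,
    "Cilium (L4)": 8.07,
    "Istio Ambient": 8.91,
    "Istio Sidecar": 33.46,
    "Cilium (L7)": 44.52,
    "Linkerd": 17.55,
}


def overhead_vs_vanilla(p50_ms):
    """Return each condition's p50 minus the vanilla p50, in milliseconds."""
    base = p50_ms["Vanilla"]
    return {name: round(value - base, 2) for name, value in p50_ms.items()}


if __name__ == "__main__":
    for name, delta in overhead_vs_vanilla(BASELINE_P50_MS).items():
        print(f"{name:14s} +{delta:.2f} ms")
```

This yields roughly 10ms for Linkerd, 26ms for Istio Sidecar, and 37ms for Cilium L7, with the connection-level conditions within about 2ms of vanilla.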

One slow pod: the coin flip

The primary test injected a 50ms delay into one of the 5 backend pods. The key number to look at is p50 standard deviation:

| Mesh | p50 Mean | p50 Std Dev | p99 | RPS | Consistent? |
|---|---|---|---|---|---|
| Vanilla | 12.38ms | ±13.64ms | 71.72ms | 7,817 | No |
| Cilium (L4) | 12.58ms | ±13.56ms | 69.51ms | 7,837 | No |
| Istio Ambient | 12.32ms | ±13.88ms | 70.78ms | 7,797 | No |
| Cilium (L7) | 41.37ms | ±0.80ms | 107.61ms | 4,100 | Yes |
| Istio Sidecar | 30.71ms | ±1.14ms | 87.92ms | 5,515 | Yes |
| Linkerd | 16.17ms | ±1.12ms | 66.55ms | 9,588 | Yes |

The ±13ms standard deviation on L4 conditions isn't noise. At 200 concurrency with 1 slow pod, landing on a healthy pod results in very low latency, but landing on the slow pod collapses performance. The L4 conditions show a 12ms average because most connections land on healthy pods, but the variance tells you what happens to the unlucky 20%. Their p50 is not 12ms.

The 3 L7 conditions (Cilium L7, Istio Sidecar, Linkerd) all show ±0.8-1.1ms std dev, meaning request-level routing works and the variance comes from the mesh itself rather than which pod the connection happened to land on.
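A rough way to see why pinned runs produce a standard deviation as large as the mean is to model each run's p50 as a two-point distribution: a healthy value with probability 4/5 and a slow value with probability 1/5. The sketch below assumes uniform, independent pod selection, and the 8ms and 58ms inputs are hypothetical round numbers, not measurements from the benchmark.

```python
import math


def pinned_run_stats(healthy_p50_ms, slow_p50_ms, pods=5, slow_pods=1):
    """Mean and population std dev of per-run p50 when each run is pinned.

    Assumes every run lands on one pod chosen uniformly at random, so the
    per-run p50 takes one of only two values (a bimodal distribution).
    """
    p_slow = slow_pods / pods
    mean = (1 - p_slow) * healthy_p50_ms + p_slow * slow_p50_ms
    std = abs(slow_p50_ms - healthy_p50_ms) * math.sqrt(p_slow * (1 - p_slow))
    return mean, std


if __name__ == "__main__":
    # Hypothetical inputs: ~8 ms on a healthy pod, ~58 ms on the slow one.
    mean, std = pinned_run_stats(healthy_p50_ms=8.0, slow_p50_ms=58.0)
    print(f"mean p50 = {mean:.1f} ms, std dev = {std:.1f} ms")
```

With those inputs the model gives a mean of 18ms and a std dev of 20ms. The exact values depend on how many of the 10 runs happened to land on the slow pod, but the shape matches the table: the spread is on the order of the gap between the two modes, which is what separates it from measurement noise.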

More connections don't fix it

A common workaround is to open multiple connections to give the L4 load balancer more attempts to distribute across backends. The mitigation test checked this:

| Connections | Vanilla p50 | Vanilla p99 | Linkerd p50 | Linkerd p99 |
|---|---|---|---|---|
| 1 (skew baseline) | 12.38ms ±13.64ms | 71.72ms ±6.30ms | 16.17ms ±1.12ms | 66.55ms ±1.36ms |
| 2 | 9.47ms ±14.61ms | 59.79ms ±9.51ms | 16.55ms ±1.12ms | 67.02ms ±1.30ms |
| 4 | 4.69ms ±0.92ms | 54.51ms ±12.63ms | 16.89ms ±1.12ms | 67.55ms ±1.33ms |
| 8 | 4.39ms ±0.75ms | 48.73ms ±12.39ms | 17.54ms ±1.06ms | 68.25ms ±1.41ms |

More connections do improve vanilla's p50 substantially (12ms to 4ms at conn=8). Each connection still has a 1 in 5 chance of landing on the slow pod, but with 8 connections most of the traffic flows to healthy pods, so the median request no longer hinges on a single coin flip. The tail is a different story: p99 falls only from 72ms at conn=1 to 49ms at conn=8, and its std dev grows from ±6ms to ±12ms. Any connection that lands on the slow pod is still fully pinned. You need enough connections to statistically dilute the bad pod, and you still have no recourse when you don't.

Linkerd's p99 stays flat at ~67ms ±1.4ms regardless of connection count, without any application changes.
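The trade-off behind the multi-connection workaround can be put in numbers. Assuming each connection independently lands on a uniformly random pod (a simplification of how an L4 balancer behaves), the chance that at least one connection is pinned to the slow pod rises with the connection count, even as the chance that all traffic is pinned falls. The function names below are illustrative.

```python
def p_any_connection_on_slow(connections, pods=5, slow_pods=1):
    """Probability that at least one connection is pinned to a slow pod."""
    p_healthy = (pods - slow_pods) / pods
    return 1 - p_healthy ** connections


def p_all_connections_on_slow(connections, pods=5, slow_pods=1):
    """Probability that every connection is pinned to a slow pod."""
    return (slow_pods / pods) ** connections


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(
            f"conn={n}: any pinned = {p_any_connection_on_slow(n):.1%}, "
            f"all pinned = {p_all_connections_on_slow(n):.4%}"
        )
```

Under this model, with 8 connections there is about an 83% chance that some connection sits on the slow pod, which is consistent with a median that recovers while the tail stays exposed.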

Why EWMA outperforms LEAST_REQUEST

All 3 L7 conditions route correctly (low std dev). The difference between them is algorithms and overhead.

Envoy's LEAST_REQUEST (used by Cilium L7 and Istio Sidecar) is concurrency-weighted: it routes to the pod with the fewest in-flight requests. To detect a slow pod, it needs to accumulate requests and observe congestion building. This works well enough at p50, but at high skew it fails at the tail. At 200ms injected delay, Istio Sidecar p99 hits 226ms. LEAST_REQUEST occasionally sends a request to the slow pod before the concurrency signal catches up, and when the slow pod adds 200ms, that request drags the p99 up accordingly. Cilium L7 p99 at the same skew level hits 247ms for the same reason.

This isn't a misconfiguration but a property of the algorithm: LEAST_REQUEST reacts to congestion that's already built up. It doesn't preemptively avoid a pod that's responded slowly once.

Linkerd uses EWMA (exponentially weighted moving average), which is latency weighted. It tracks observed latency per pod, and a single slow response is enough to deprioritize that pod immediately before congestion builds.

The result at 200ms skew is stark: Linkerd p99 drops to 29ms, lower than its own healthy-pod baseline of 30ms, while Istio Sidecar sits at 226ms and Cilium L7 at 247ms, roughly 8x higher. EWMA penalizes the slow pod so aggressively that it almost never appears in the tail.
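The difference between the two signals can be illustrated with a toy model. This is a simplified sketch, not Envoy's or Linkerd's actual implementation: the class names, the `alpha` smoothing factor, and the peak-sensitive update rule are assumptions chosen to show the idea, and real balancers differ in details such as host sampling, time-based decay, and how latency is combined with in-flight load.

```python
class LeastRequestPicker:
    """Concurrency-weighted: pick the pod with the fewest in-flight requests."""

    def __init__(self, pods):
        self.in_flight = {pod: 0 for pod in pods}

    def pick(self):
        # With equal in-flight counts there is no signal, so ties go to
        # the first pod in insertion order.
        return min(self.in_flight, key=self.in_flight.get)


class LatencyEwmaPicker:
    """Latency-weighted: pick the pod with the lowest latency estimate."""

    def __init__(self, pods, alpha=0.3, initial_ms=1.0):
        self.alpha = alpha  # assumed smoothing factor for the moving average
        self.estimate_ms = {pod: initial_ms for pod in pods}

    def observe(self, pod, latency_ms):
        current = self.estimate_ms[pod]
        if latency_ms > current:
            # Peak-sensitive update: one slow response raises the estimate at once.
            self.estimate_ms[pod] = latency_ms
        else:
            # Faster responses pull the estimate down gradually.
            self.estimate_ms[pod] = self.alpha * latency_ms + (1 - self.alpha) * current

    def pick(self):
        return min(self.estimate_ms, key=self.estimate_ms.get)


if __name__ == "__main__":
    pods = ["slow", "a", "b", "c", "d"]  # hypothetical pod names

    least_request = LeastRequestPicker(pods)
    print("least-request, nothing in flight:", least_request.pick())

    ewma = LatencyEwmaPicker(pods)
    ewma.observe("slow", 200.0)  # a single slow response
    for pod in pods[1:]:
        ewma.observe(pod, 8.0)
    print("latency EWMA after one slow response:", ewma.pick())
```

In this toy setup the least-request picker has no way to tell the pods apart until requests pile up, so it can still select the slow pod, while the latency-weighted picker steers away after one 200ms observation and only drifts back as faster responses arrive.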

The throughput gap also grows with skew severity:

| Condition | RPS at skew_50ms | RPS at skew_200ms |
|---|---|---|
| Linkerd | 9,588 | 9,773 |
| Istio Sidecar | 5,515 | 5,981 |
| Cilium (L7) | 4,100 | 4,079 |


At skew_200ms, Linkerd is 63% ahead of Istio Sidecar and 140% ahead of Cilium L7 in throughput. The proxy overhead numbers explain part of this: Linkerd adds ~10ms per request vs. Envoy's ~26-37ms on the tested hardware. At skew_50ms, Cilium L7 routes correctly but delivers 4,100 RPS vs Linkerd's 9,588, a 57% throughput reduction for the same request-level routing capability.
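The percentage comparisons follow directly from the RPS table. The short script below recomputes them; the helper names are illustrative, and the only inputs are the published RPS figures.

```python
# RPS figures copied from the table above.
RPS = {
    "Linkerd": {"skew_50ms": 9588, "skew_200ms": 9773},
    "Istio Sidecar": {"skew_50ms": 5515, "skew_200ms": 5981},
    "Cilium (L7)": {"skew_50ms": 4100, "skew_200ms": 4079},
}


def percent_ahead(a, b):
    """How much higher a is than b, as a percentage of b."""
    return (a / b - 1) * 100


def percent_below(a, b):
    """How far a falls short of b, as a percentage of b."""
    return (1 - a / b) * 100


if __name__ == "__main__":
    linkerd = RPS["Linkerd"]
    for other in ("Istio Sidecar", "Cilium (L7)"):
        ahead = percent_ahead(linkerd["skew_200ms"], RPS[other]["skew_200ms"])
        below = percent_below(RPS[other]["skew_50ms"], linkerd["skew_50ms"])
        print(f"Linkerd vs {other}: +{ahead:.1f}% at skew_200ms; "
              f"{other} is {below:.1f}% below Linkerd at skew_50ms")
```

At skew_200ms this gives about 63.4% and 139.6% in Linkerd's favor; at skew_50ms the shortfalls are about 42.5% for Istio Sidecar and 57.2% for Cilium L7.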

What this costs to run

Throughput is infrastructure cost in disguise. The fewer requests each pod serves, the more pods you provision for the same load. The connection-level options (Vanilla, Cilium L4, and Istio Ambient, which behaves connection-level here despite its L7 waypoint; see the scope note) are the cheapest to operate, with near-zero overhead and ~7,800 RPS, but that low cost buys you the coin flip, since they can’t route around a degraded pod. Among the options that route correctly, the spread is stark: Istio Sidecar and Cilium L7 deliver 42% and 57% less throughput than Linkerd, respectively, which means proportionally more nodes to serve the same traffic. Linkerd is both the only request-level option that holds its tail under failure and the cheapest to operate among the ones that actually work.

Throughput was measured per cluster; cost-to-serve assumes it scales roughly linearly as nodes are added.
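As a back-of-the-envelope illustration of that linear-scaling assumption, the sketch below estimates how many benchmark-sized clusters (3x m5.xlarge each) would be needed to serve a target load. The 50,000 RPS target is a hypothetical number picked for the example, and real capacity planning would need headroom and workload-specific measurements.

```python
import math

# Throughput per benchmark cluster at skew_50ms, copied from the results above.
RPS_AT_SKEW_50MS = {
    "Linkerd": 9588,
    "Istio Sidecar": 5515,
    "Cilium (L7)": 4100,
}


def clusters_needed(target_rps, rps_per_cluster):
    """Benchmark-sized clusters required, assuming throughput scales linearly."""
    return math.ceil(target_rps / rps_per_cluster)


if __name__ == "__main__":
    target = 50_000  # hypothetical target load, not from the benchmark
    for name, rps in RPS_AT_SKEW_50MS.items():
        print(f"{name:14s} {clusters_needed(target, rps)} clusters")
```

For the hypothetical 50,000 RPS target this works out to 6 clusters for Linkerd, 10 for Istio Sidecar, and 13 for Cilium L7.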

The Scorecard: Every Condition, Every Axis that Matters

| Condition | Request-level routing | Tail under failure (p99 @ skew_50 / skew_200) | Throughput | Proxy overhead | All green? |
|---|---|---|---|---|---|
| Vanilla | Connection-level | 71.72ms, bimodal / pinned | 7,817 RPS | ~0ms | No |
| Cilium (L4) | Connection-level | 69.51ms, bimodal / pinned | 7,837 RPS | ~0ms | No |
| Istio Ambient | Connection-level | 70.78ms, bimodal / pinned | 7,797 RPS | ~0ms | No |
| Cilium (L7) | Request-level | 107.61ms / 247ms | 4,100 RPS | ~37ms | No |
| Istio Sidecar | Request-level | 87.92ms / 226ms | 5,515 RPS | ~26ms | No |
| Linkerd | Request-level | 66.55ms / 29ms | 9,588 RPS | ~10ms | Yes |

The connection-level options keep overhead near zero, but they cannot route around a slow pod, so the tail collapses the moment one degrades. The Envoy-based L7 options route correctly but pay for it twice: 26-37ms of overhead and 42-57% less throughput. Linkerd is the only row that’s green all the way across: request-level routing, the best tail under failure, the highest throughput, and the lowest overhead of any mesh that routes correctly.

The Bottom Line

If you’re running gRPC on vanilla Kubernetes, Cilium’s default eBPF mode, or Istio Ambient (even with a waypoint enrolled, as tested here), you have connection-level load balancing. Your p99 is already inflated by head-of-line blocking, and the day a pod degrades, a fraction of your users get pinned to it with no recovery until the connection drops.

The solution is request-level load balancing, but that requires an L7 proxy in the data path. Three of the conditions we tested provide it (Cilium L7, Istio Sidecar, and Linkerd), but they are not equivalent:

  • Cilium L7 routes correctly but adds ~37ms of overhead and drops to 4,100 RPS. That's a 57% throughput cut versus Linkerd, with a 107.61ms p99 at skew_50 that blows out to 247ms at skew_200.
  • Istio Sidecar routes correctly but adds ~26ms of overhead and delivers 5,515 RPS. That's 42% less throughput than Linkerd, with p99 reaching 226ms at skew_200.
  • Linkerd routes correctly with EWMA, adds only ~10ms, sustains the highest throughput in the field (9,588 -> 9,773 RPS), and is the only option whose tail latency improved under failure (29ms p99 at skew_200, below its own healthy baseline). 

If you run gRPC at scale, the conclusion is direct: deploy Linkerd. It’s the one configuration in this benchmark that eliminates the coin flip without making you pay for it in latency or compute. Get started at buoyant.io/linkerd 

Don’t take our word for it. The entire test harness and raw results are available on this GitHub repo. Run it against your own workload and watch the tail.