Benchmarking gRPC Load Balancing on Kubernetes: Linkerd vs Istio vs Cilium Under a Slow Pod

Your gRPC service shows 8ms p50 latency. Dashboards are green, and 1 in 5 of your connections is pinned to a slow pod with no way out. It's not a bug. It's gRPC's architectural “load balancing problem,” which shows up in benchmarks but is easy to miss if you're only watching the median.
Here's the punchline up front: when we degraded one backend with a brutal 200ms delay, every connection-level setup we tested got worse, while Linkerd's p99 dropped to 29ms, a hair below its own all-healthy baseline of 30ms. One mesh got faster under failure; the rest didn't. In this blog post, you'll learn why.
In June 2026, Buoyant's Escalation Engineering team ran 420 benchmarks across six isolated EKS clusters, testing how vanilla Kubernetes, Cilium (L4 and L7), Istio Ambient, Istio Sidecar, and Linkerd each handle a single degraded backend pod under gRPC load. The test harness and raw results can be reviewed or replicated here. The results clarify something that trips up a lot of production setups. Connection-level load balancing and request-level load balancing are not interchangeable for gRPC, and the difference shows up in your tail latency whether your pods are healthy or not.
Why L4 load balancers and gRPC don't mix
L4 load balancers make routing decisions at connection time. That works with HTTP/1.1. Clients open multiple connections to parallelize requests, which gives the load balancer multiple chances to distribute work across backends. If one connection lands on a slow pod, others land elsewhere.
gRPC runs over HTTP/2, which multiplexes all requests over a single long-lived connection. One connection is sufficient and optimal by design, so clients don't open more. An L4 load balancer picks a pod when the connection opens, and every request on that connection goes to that pod until the connection closes or drops, which in production can last for hours.
With 5 backends and 1 degraded pod, the math is straightforward. 1 in 5 connections lands on the slow pod and stays there with no platform signal and no automatic recovery.
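The pinning behavior can be illustrated with a small simulation. This is a sketch, not the benchmark harness: the latency constants and function names are hypothetical, and the request-level picker is a plain uniform choice rather than any mesh's real algorithm. It only shows that a connection-level choice is made once per connection, while a request-level choice is made for every request.

```python
import random

HEALTHY_MS = 7.0   # hypothetical latency of a healthy pod
SLOW_MS = 57.0     # hypothetical latency of the degraded pod


def pod_latencies(n_pods=5, n_slow=1):
    """Latency per pod: the first n_slow pods are degraded."""
    return [SLOW_MS] * n_slow + [HEALTHY_MS] * (n_pods - n_slow)


def connection_level(pods, n_requests, rng):
    """L4 behavior: pick one pod at connect time, send every request to it."""
    pinned = rng.choice(pods)
    return [pinned] * n_requests


def request_level(pods, n_requests, rng):
    """L7 behavior (uniform random here): pick a pod for every request."""
    return [rng.choice(pods) for _ in range(n_requests)]


def slow_fraction(latencies):
    """Fraction of requests that were served by the degraded pod."""
    return sum(1 for x in latencies if x == SLOW_MS) / len(latencies)


if __name__ == "__main__":
    rng = random.Random(42)
    pods = pod_latencies()
    for run in range(5):
        l4 = slow_fraction(connection_level(pods, 1000, rng))
        l7 = slow_fraction(request_level(pods, 1000, rng))
        print(f"run {run}: L4 slow fraction={l4:.2f}  L7 slow fraction={l7:.2f}")
```

In every run the connection-level fraction is either 0 or 1, which is the coin flip in miniature, while the request-level fraction hovers around one fifth.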
The test setup
The benchmark used an identical gRPC echo application (unary request/response) deployed to 6 isolated EKS clusters on 3x m5.xlarge nodes (4 vCPU / 16 GiB) in AWS us-east-1, EKS 1.35. Each cluster had 4 healthy backend pods and 1 pod with a delay injected via in-app `time.Sleep` controlled by `ARTIFICIAL_DELAY_MS`. A single client connection drove 200 concurrent streams for 60 seconds. Each scenario ran 10 times, producing 420 total runs. Error rates were under 0.2% across all conditions, consistent with end-of-window HTTP/2 stream cancellations.
The conditions tested: vanilla Kubernetes, Cilium L4, Cilium L7, Istio Ambient, Istio Sidecar, and Linkerd.
Notes on scope
Istio Ambient ran with a namespace-enrolled waypoint. It still showed connection-level (bimodal) behavior: ztunnel tunnels all RPCs to one waypoint pod per connection, so the waypoint’s Envoy never load-balances per request (Istio #56864).
For Cilium, request-level routing isn’t the default. It requires installing Cilium with its Envoy L7 datapath enabled, a CiliumEnvoyConfig resource (here using LEAST_REQUEST), and a service.cilium.io/lb-l7=enabled annotation on the Service. If you’re running Cilium in its default eBPF mode, you have the L4 condition, not the L7 condition shown here.
The baseline already shows the problem
Before introducing any degraded pod, the baseline run with all pods healthy shows the issue:
Vanilla p50 at 7ms looks fine, but vanilla p99 at 65ms is already more than 2x Linkerd's 30ms, and no pod has degraded yet. The gap comes from head-of-line blocking, not proxy overhead. With a single HTTP/2 connection and uneven request durations, some requests end up queuing behind longer-running ones on the same pod. The tail pays the price.
Linkerd's higher p50 (17ms vs 7ms for vanilla) reflects real proxy overhead. You're paying ~10ms per request for the ability to route at the request level. Whether that's a reasonable trade-off is exactly what the degraded-pod tests answer.
One slow pod: the coin flip
The primary test injected a 50ms delay into one of the 5 backend pods. The key number to look at is p50 standard deviation:
The ±13ms standard deviation on L4 conditions isn't noise. At 200 concurrency with 1 slow pod, landing on a healthy pod results in very low latency, but landing on the slow pod collapses performance. The L4 conditions show a 12ms average because most connections land on healthy pods, but the variance tells you what happens to the unlucky 20%. Their p50 is not 12ms.
The 3 L7 conditions (Cilium L7, Istio Sidecar, Linkerd) all show ±0.8-1.1ms std dev, meaning request-level routing works and the variance comes from the mesh itself rather than which pod the connection happened to land on.
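One way to see why the spread in the L4 conditions is structural rather than noise: if each run is either fully pinned to the slow pod or not, the per-run p50 follows a two-point distribution. The sketch below computes its mean and standard deviation. The 7ms and 57ms inputs are illustrative placeholders, not measured values from this benchmark, and `two_point_stats` is a hypothetical helper name.

```python
import math


def two_point_stats(healthy_ms, slow_ms, p_slow):
    """Mean and std dev of a per-run p50 that equals slow_ms with
    probability p_slow and healthy_ms otherwise (a Bernoulli mixture)."""
    mean = (1 - p_slow) * healthy_ms + p_slow * slow_ms
    std = math.sqrt(p_slow * (1 - p_slow)) * abs(slow_ms - healthy_ms)
    return mean, std


if __name__ == "__main__":
    mean, std = two_point_stats(healthy_ms=7.0, slow_ms=57.0, p_slow=0.2)
    print(f"mean={mean:.1f}ms std={std:.1f}ms")
```

The standard deviation depends only on the gap between the two modes and the pinning probability, so repeating the test more times does not shrink it.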
More connections don't fix it
A common workaround is to open multiple connections to give the L4 load balancer more attempts to distribute across backends. The mitigation test checked this:
More connections do improve vanilla's p50 substantially (12ms to 4ms at conn=8), because each connection now carries only a fraction of the load and most of them land on healthy pods. But p99 improves far less, from 72ms at conn=1 to 49ms at conn=8, and std dev stays at ±12ms. Any connection that lands on the slow pod is still fully pinned to it. You need enough connections to statistically avoid the bad pod, and you still have no recourse when you don't.
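How far extra connections can help is easy to estimate under a simplifying assumption: each connection picks a backend independently and uniformly at random. Real kube-proxy or eBPF selection may differ in detail, so treat this as a back-of-the-envelope sketch; the function names are hypothetical.

```python
def p_all_connections_healthy(n_conns, n_pods=5, n_slow=1):
    """Probability that every connection lands on a healthy pod,
    assuming independent, uniform backend selection per connection."""
    return ((n_pods - n_slow) / n_pods) ** n_conns


def expected_slow_connections(n_conns, n_pods=5, n_slow=1):
    """Expected number of connections pinned to a degraded pod."""
    return n_conns * n_slow / n_pods


if __name__ == "__main__":
    for conns in (1, 2, 4, 8):
        p = p_all_connections_healthy(conns)
        e = expected_slow_connections(conns)
        print(f"conn={conns}: P(no slow connection)={p:.3f}, expected slow={e:.1f}")
```

Under this assumption, adding connections makes it less likely that a client avoids the slow pod entirely. It only dilutes how much of the traffic is pinned there, which is consistent with the median improving while the tail stays exposed.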
Linkerd's p99 stays flat at ~67ms ±1.4ms regardless of connection count, without any application changes.
Why EWMA outperforms LEAST_REQUEST
All 3 L7 conditions route correctly (low std dev). The difference between them is algorithms and overhead.
Envoy's LEAST_REQUEST (used by Cilium L7 and Istio Sidecar) is concurrency-weighted: it routes to the pod with the fewest in-flight requests. To detect a slow pod, it needs to accumulate requests and observe congestion building. This works well enough at p50, but at high skew it fails at the tail. At 200ms injected delay, Istio Sidecar p99 hits 226ms. LEAST_REQUEST occasionally sends a tail request to the slow pod before the concurrency signal catches up, and when the slow pod adds 200ms per request, that request drags the p99 up accordingly. Cilium L7 p99 at the same skew level hits 247ms for the same reason.
This isn't a misconfiguration but a property of the algorithm: LEAST_REQUEST reacts to congestion that's already built up. It doesn't preemptively avoid a pod that's responded slowly once.
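A minimal sketch of a concurrency-weighted picker makes the limitation concrete. It assumes the power-of-two-choices variant that Envoy documents for equally weighted hosts; `Backend` and `pick_least_request` are illustrative names, not Envoy APIs.

```python
import random


class Backend:
    def __init__(self, name):
        self.name = name
        self.in_flight = 0  # requests sent but not yet answered


def pick_least_request(backends, rng, choices=2):
    """Sample `choices` distinct backends at random and return the one
    with the fewest in-flight requests (power of two choices by default)."""
    candidates = rng.sample(backends, choices)
    return min(candidates, key=lambda b: b.in_flight)


if __name__ == "__main__":
    rng = random.Random(7)
    pods = [Backend(f"pod-{i}") for i in range(5)]
    # With no requests in flight anywhere, the picker has no signal:
    # a slow pod is as eligible as any other.
    idle_picks = {pick_least_request(pods, rng).name for _ in range(200)}
    print(sorted(idle_picks))
```

The only input is `in_flight`. A pod that answered slowly a moment ago but currently has a short queue looks just as attractive as a fast one, so it keeps receiving some requests until its queue visibly grows.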
Linkerd uses EWMA (exponentially weighted moving average), which is latency-weighted. It tracks observed latency per pod, and a single slow response is enough to deprioritize that pod immediately, before congestion builds.
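The latency-weighted idea can be sketched in a few lines. This is not Linkerd's implementation: the smoothing factor `alpha`, the initial estimate, and the cost formula (latency estimate multiplied by outstanding requests plus one) are assumptions chosen to keep the example small.

```python
class EwmaBackend:
    """Tracks a smoothed latency estimate for one pod (illustrative only)."""

    def __init__(self, name, alpha=0.3, initial_ms=1.0):
        self.name = name
        self.alpha = alpha          # hypothetical smoothing factor
        self.ewma_ms = initial_ms   # current latency estimate
        self.in_flight = 0          # outstanding requests

    def observe(self, latency_ms):
        """Fold one observed response latency into the moving average."""
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

    def cost(self):
        """Lower is better: latency estimate scaled by outstanding work."""
        return self.ewma_ms * (self.in_flight + 1)


def pick_lowest_cost(backends):
    """Route to the backend with the lowest latency-weighted cost."""
    return min(backends, key=lambda b: b.cost())


if __name__ == "__main__":
    pods = [EwmaBackend(f"pod-{i}") for i in range(5)]
    for p in pods[:4]:
        p.observe(7.0)        # healthy responses
    pods[4].observe(200.0)    # a single slow response
    print(pick_lowest_cost(pods).name, round(pods[4].ewma_ms, 1))
```

In this sketch one slow observation raises the pod's estimate enough that it loses to healthy pods even when they already carry several in-flight requests, which a purely concurrency-based signal cannot express.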
The result at 200ms skew is stark: Linkerd p99 drops to 29ms, lower than its own healthy-pod baseline of 30ms, while Istio Sidecar sits at 226ms and Cilium L7 at 247ms, roughly 8x higher. EWMA penalizes the slow pod so aggressively that it almost never appears in the tail.
The throughput gap also grows with skew severity:
At skew_200ms, Linkerd is 63% ahead of Istio Sidecar and 140% ahead of Cilium L7 in throughput. The proxy overhead numbers explain part of this: Linkerd adds ~10ms per request vs. Envoy's ~26-37ms on the tested hardware. At skew_50ms, Cilium L7 routes correctly but delivers 4,100 RPS vs Linkerd's 9,588, a 57% throughput reduction for the same request-level routing capability.
What this costs to run
Throughput is infrastructure cost in disguise. The fewer requests each pod serves, the more pods you provision for the same load. The connection-level options, whether Vanilla, Cilium L4, or Istio Ambient (which behaves connection-level here despite its L7 waypoint; see the scope note), are the cheapest to operate, with near-zero overhead and ~7,800 RPS, but that low cost buys you the coin flip, since they can’t route around a degraded pod. Among the options that route correctly, the spread is stark: Cilium L7 and Istio Sidecar deliver 43-57% less throughput than Linkerd, which means proportionally more nodes to serve the same traffic. Linkerd is both the only request-level option that holds its tail under failure and the cheapest to operate among the ones that actually work.
Throughput was measured per cluster; cost-to-serve assumes it scales roughly linearly as nodes are added.
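To translate a throughput gap into capacity, the arithmetic can be sketched as follows. It uses the RPS figures quoted in this post and relies on the same linear-scaling assumption stated above; the helper names are hypothetical.

```python
def throughput_reduction_pct(baseline_rps, other_rps):
    """Percent less throughput than the baseline."""
    return (baseline_rps - other_rps) / baseline_rps * 100


def relative_capacity_needed(baseline_rps, other_rps):
    """How many times more capacity is needed to match the baseline,
    assuming throughput scales linearly with nodes."""
    return baseline_rps / other_rps


if __name__ == "__main__":
    linkerd = 9588
    for name, rps in (("Cilium L7", 4100), ("Istio Sidecar", 5515)):
        cut = throughput_reduction_pct(linkerd, rps)
        cap = relative_capacity_needed(linkerd, rps)
        print(f"{name}: {cut:.1f}% less throughput, {cap:.2f}x capacity to match")
```

A percentage cut understates the cost: losing a little over half the throughput means needing more than twice the capacity to serve the same traffic.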
The scorecard: every condition, every axis that matters
The connection-level options keep overhead near zero, but they cannot route around a slow pod, so the tail collapses the moment one degrades. The Envoy-based L7 options route correctly but pay for it twice: 26-37ms of overhead and 43-57% less throughput. Linkerd is the only row that’s green all the way across, including request-level routing, the best tail under failure, the highest throughput, and the lowest overhead of any mesh that routes correctly.
The bottom line
If you’re running gRPC on vanilla Kubernetes, Cilium’s default eBPF mode, or Istio Ambient (even with a namespace-enrolled waypoint, as tested here), you have connection-level load balancing. Your p99 is already inflated by head-of-line blocking, and the day a pod degrades, a fraction of your users get pinned to it with no recovery until the connection drops.
The solution is request-level load balancing, but that requires an L7 proxy in the data path. Three of the conditions we tested provide it (Cilium L7, Istio Sidecar, and Linkerd), but they are not equivalent:
- Cilium L7 routes correctly but adds ~37ms of overhead and drops to 4,100 RPS. That's a 57% throughput cut versus Linkerd, with a 107.61ms p99 at skew_50ms that blows out to 247ms at skew_200ms.
- Istio Sidecar routes correctly but adds ~26ms of overhead and delivers 5,515 RPS. That's 43% less throughput than Linkerd, with p99 reaching 226ms at skew_200ms.
- Linkerd routes correctly with EWMA, adds only ~10ms, sustains the highest throughput in the field (9,588 -> 9,773 RPS), and is the only option whose tail latency improved under failure (29ms p99 at skew_200ms, below its own healthy baseline).
If you run gRPC at scale, the conclusion is direct: deploy Linkerd. It’s the one configuration in this benchmark that eliminates the coin flip without making you pay for it in latency or compute. Get started at buoyant.io/linkerd.
Don’t take our word for it. The entire test harness and raw results are available on this GitHub repo. Run it against your own workload and watch the tail.

