Kubernetes was built for stateless applications, but that doesn't necessarily translate into how LLM inference works. For inference, the model has to be loaded into GPU memory, and you need fast RDMA (Remote Direct Memory Access) connections to shard across nodes. Additionally, autoscaling on CPU or memory doesn't tell you whether the model is actually saturated. So why is almost everyone running inference on Kubernetes?
In this episode of the AI Kubernetes Show, William Morgan spoke with Abdel Sghiouar, Developer Advocate at Google, KubeCon Co-chair, and Co-host of the Kubernetes Podcast, about this topic.
This blog post was generated by AI from the interview transcript, with some editing.
Why run LLM inference on Kubernetes instead of VMs?
Everyone runs LLM inference on Kubernetes for two reasons: automation and resiliency. On a bare VM, you need to provision the machine, install GPU drivers, configure storage connectors, set up whatever hardware plugins you need, and then start the workload. With Kubernetes, the pod lands on a node that already has all of that in place, and the workload is restarted automatically on failure. While automatic restart might sound like table stakes in 2026, Abdel points out that recovering from failure in a distributed system is surprisingly hard if you try to do it yourself.
Take running databases on Kubernetes. In the early days, it was an awkward fit. StatefulSets, persistent volume claims, and operator patterns evolved to address it. Now most teams run databases on Kubernetes without much thought. LLM inference is going through the same evolution: the open question is which primitives need to be extended or added, not whether Kubernetes is the right platform.
LLM inference on Kubernetes: the primitives that actually break
There are three places where standard Kubernetes assumptions don't hold.
Model loading. Container images are heavy, but models are even heavier. Managed Kubernetes flavors can be optimized for LLMs. GKE, for example, has a proprietary image-streaming feature for fast container image download at pod startup. But what are the open source equivalents? Options include a file system with models cached on the cluster, or using Docker images as volumes. In their KubeCon talk "Optimizing LLM Inference for the Rest of Us," Abdel and Mofi Rahman went through every GKE-specific optimization and mapped each one to an open source equivalent.
Autoscaling. Horizontal pod autoscalers on CPU or memory don't map to LLM load. The right signals are accelerator usage and queue depth. LLMs process requests in batches, and those batches run serially. If the model is busy, everything else queues, and queue size directly predicts latency. Autoscaling on it requires custom metrics and a different mental model than standard web services.
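To make the queue-depth signal concrete, here is a minimal sketch of the ratio rule that the Horizontal Pod Autoscaler documents (desired = ceil(current × metric / target)), applied to an average queue depth per replica. The function name, the parameter names, and the replica bounds are assumptions for illustration, and the sketch leaves out the tolerance band and stabilization windows that a real autoscaler applies.

```python
import math


def desired_replicas(current_replicas: int,
                     avg_queue_depth: float,
                     target_queue_depth: float,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Scale so the average queue depth per replica approaches the target."""
    if target_queue_depth <= 0:
        raise ValueError("target_queue_depth must be positive")
    # Ratio rule: ceil(current * observed metric / target metric).
    raw = math.ceil(current_replicas * avg_queue_depth / target_queue_depth)
    # Clamp to the configured bounds (hypothetical defaults above).
    return max(min_replicas, min(max_replicas, raw))
```

With two replicas, an observed average of 12 queued requests per replica, and a target of 4, the function returns 6. With an average of 1 queued request, it scales down to the minimum of 1.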
Routing. Every major open source serving engine (e.g., vLLM, Ollama, Triton) has standardized on the OpenAI API spec, which passes the model ID in the JSON request body, not the URL path. Most load balancers, however, route on path, not on body. To address this, the open source community built the Gateway API Inference Extension. A router pod (called the "endpoint picker") sits in the cluster, extracts the model ID from the request body, and routes to the right backend. llm-d builds on this with queue-size-aware routing and policies for mixture-of-experts models.
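The body-based routing idea can be sketched in a few lines. This is not the Gateway API Inference Extension's actual code. It is a toy illustration that assumes an OpenAI-style JSON body with a "model" field, and the model names, pod names, and queue sizes in the table are hypothetical.

```python
import json
from typing import Dict

# Hypothetical table: model ID -> pods serving it, with current queue sizes.
BACKENDS: Dict[str, Dict[str, int]] = {
    "llama-3-8b": {"pod-a": 3, "pod-b": 0},
    "mistral-7b": {"pod-c": 5},
}


def pick_endpoint(raw_body: bytes, backends: Dict[str, Dict[str, int]]) -> str:
    """Read the model ID from the request body and pick a pod for it."""
    payload = json.loads(raw_body)
    model_id = payload.get("model")
    if model_id not in backends:
        raise LookupError(f"no backend serves model {model_id!r}")
    pods = backends[model_id]
    # Queue-size-aware choice: the pod with the fewest waiting requests.
    return min(pods, key=pods.get)
```

A path-based load balancer never sees the "model" field, which is why the decision has to happen in a component that parses the body. Picking the shortest queue is one simple policy; a production router would weigh more signals.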
How agentic workloads differ from LLM inference
LLM inference is big and hardware-hungry, but agentic workloads introduce a different problem.
An agent is an application process talking to an LLM with access to tools and MCP servers. As Abdel puts it, it's like a monolithic application running on a distributed system. The agent looks like a single entity, but it generates network calls to an LLM, to tools, and to external services, often in unplanned patterns, because the LLM is non-deterministic.
Two problems this creates have no clean solutions yet:
Resource management. If a tool process runs inside the agent's container and decides to pull terabytes of data into memory, the container's resource limits don't distinguish between agent memory and tool memory. The runtime resource picture becomes unpredictable in a way a standard microservice never is.
Identity. When an agent calls an external endpoint via an MCP server, which identity does it present? The agent's service account? The identity of the user who prompted it? Neither is obviously correct, and neither is straightforward to implement. In the end, it all comes down to basics such as authentication (authn), authorization (authz), and policy management, but the interaction patterns within an agent make them more complex than in any traditional workload.
One concrete example Abdel raised from a conversation at Google Next was someone who runs agents on GKE that call Anthropic. How do you handle retries on failures? How do you manage latency for requests that might be megabytes each way (large prompts, multimodal inputs, and large responses)? This is not standard web traffic, and nobody has clean answers yet.
Agent Sandbox: isolating AI-generated code execution in Kubernetes
If an agent generates code and executes it, you don't want that execution to happen in the agent's own container. That's what the Agent Sandbox project is for. It pushes code execution to a separate pod, sandboxed with gVisor or Kata Containers.
Let's say you prompt an agent to write a function to extract your environment variables. Without sandboxing, it will execute that function inside its container and return the results. With the sandbox, that execution is isolated.
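The environment-variable example can be reproduced locally as a rough analogy. The sketch below is not how Agent Sandbox works (that project runs code in a separate pod under gVisor or Kata Containers). It only shows that where generated code executes determines what it can see, using a child process with a scrubbed environment. All names here, including the allowlist, are assumptions.

```python
import os
import subprocess
import sys
from typing import Dict, Optional

# Stand-in for agent-generated code: list every variable name it can see.
GENERATED_CODE = "import os; print(sorted(os.environ))"

# Variables the interpreter may need to start; everything else is withheld.
ALLOWED = ("PATH", "LD_LIBRARY_PATH", "SYSTEMROOT")


def scrubbed_env() -> Dict[str, str]:
    """Copy only the allowlisted variables from the current environment."""
    return {k: os.environ[k] for k in ALLOWED if k in os.environ}


def run_generated(code: str, env: Optional[Dict[str, str]] = None) -> str:
    """Run code in a child interpreter; env=None inherits the caller's environment."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling run_generated(GENERATED_CODE) exposes every variable the parent holds, including any secrets, while run_generated(GENERATED_CODE, scrubbed_env()) returns only the allowlisted names plus anything the interpreter sets for itself. Process-level scrubbing is far weaker than a sandboxed pod, which also isolates the file system, network, and kernel surface.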
Key open source projects for LLM inference on Kubernetes
The space is moving fast, but here are a few projects worth tracking:
- KServe: Serving inference “the Kubernetes way,” this project provides CRDs for managing inference workloads declaratively, including scaling and versioning.
- Gateway API Inference Extension: The body-routing layer that makes model-ID-based request routing possible. This is the foundation for llm-d and similar tools.
- llm-d: Routing and serving patterns for LLM inference, with queue-size-aware routing and mixture-of-experts model support.
- Agent Sandbox: Isolated code execution for agentic workloads, sandboxed with gVisor or Kata Containers.
- DRA (Dynamic Resource Allocation): A Kubernetes API that flips the traditional resource model. Instead of requesting specific hardware, you describe workload characteristics and let the cluster match them. Think storage classes, but for accelerators and networking. The platform admin describes cluster capabilities; developers describe what they need without specifying the hardware model.
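The DRA idea of describing needs instead of naming hardware can be illustrated with a toy matcher. Real DRA works through Kubernetes API objects and selector expressions evaluated by the scheduler. The classes, attribute names, and device names below are hypothetical and exist only to show the matching concept.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Device:
    """What the platform admin advertises about one accelerator."""
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class Claim:
    """What the developer asks for, without naming a hardware model."""
    minimums: Dict[str, float] = field(default_factory=dict)  # e.g. memory_gib >= 40
    exact: Dict[str, object] = field(default_factory=dict)    # e.g. interconnect == "rdma"


def satisfies(device: Device, claim: Claim) -> bool:
    """Check every stated requirement against the device's attributes."""
    for key, minimum in claim.minimums.items():
        if device.attributes.get(key, 0) < minimum:
            return False
    for key, wanted in claim.exact.items():
        if device.attributes.get(key) != wanted:
            return False
    return True


def allocate(claim: Claim, devices: List[Device]) -> Optional[Device]:
    """Return the first device that satisfies the claim, or None."""
    return next((d for d in devices if satisfies(d, claim)), None)
```

A claim for at least 40 GiB of accelerator memory with an RDMA interconnect would skip a 16 GiB Ethernet-attached device and land on an 80 GiB RDMA-attached one, without the developer ever naming a GPU model.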
What's coming for AI workloads on Kubernetes
In the near future, let’s say a year from now, Abdel thinks that more organizations will start running their own LLMs. Inference providers will continue to move core capabilities to higher subscription tiers, which will create incentives to self-host.
Another near-term trajectory is that Kubernetes will become the universal control plane, where neither the node nor the hardware will matter. DRA is already moving us in that direction. Soon, you will deploy a workload, describe what it needs, and let the cluster provision the right underlying infrastructure, rather than the current model where you provision resources first and schedule workloads on top.
In the longer term, maybe five years from now, clusters will get dramatically larger. Abdel said that Google can host up to 135K nodes and AWS up to 130K, numbers that are only expected to increase. At that scale, the control plane architecture will have to change, perhaps becoming stateless and no longer co-located with the workloads.
And service mesh capabilities may move closer to the platform itself. mTLS could become a first-class cluster feature rather than something you install and operate separately.
Watch the full episode. Abdel covers a lot of ground, including a deep dive on the open source optimization layers from his KubeCon talk and his perspective on open-weight models.
FAQs
Why run LLM inference on Kubernetes instead of VMs?
Because Kubernetes provides automation and resiliency. You get a node with GPU drivers and storage connectors ready on pod startup, plus automatic workload restart. LLM inference is going through the same evolution that databases went through when Kubernetes wasn't yet ready for them.
What is the Gateway API Inference Extension, and why does it exist?
Most load balancers route on URL paths, but the OpenAI-spec APIs pass model IDs in the JSON request body. The Gateway API Inference Extension adds a router that extracts model IDs from request bodies and routes to the right backend pod.
How does Kubernetes autoscaling work for LLM inference?
Standard CPU/memory autoscaling doesn't reflect LLM load. The right signals are accelerator utilization and queue depth. LLMs process requests in batches that run serially, so queue size directly predicts latency under load.
What makes agentic workloads different from standard microservices on Kubernetes?
Agentic workloads are non-deterministic. The LLM decides which tools to call, often in unpredictable patterns. This creates hard problems for resource management, identity (agent vs. user), and traffic that can be megabytes per request.
What is llm-d and what problem does it solve?
llm-d is a routing layer for LLM inference traffic in Kubernetes. It builds on the Gateway API Inference Extension to add queue-size-aware routing, so requests go to the least-loaded backend rather than any available pod.




