
Agentic AI Security on Kubernetes: Understanding Payload Routing and Defense-in-Depth


In this episode of The Kubernetes AI Show, Buoyant Tech Evangelist Flynn chats with Shane Utt about the foundational challenges and ongoing efforts to bridge the gap between AI workloads and cloud native networking infrastructure. Flynn and Shane are both co-leads of the AI Gateway Working Group and contributors to the Kubernetes Gateway API. 

The genesis of the AI Gateway Working Group

This blog post was generated by AI from the interview transcript, with some editing.

The AI Gateway Working Group began as an extension of an existing project and grew out of a need to address a wider set of use cases for the average Kubernetes user. The impetus for creating the group was the Gateway API Inference Extension (GIE) from 2024, which led in turn to the llm-d project. That initial effort was focused on a niche audience: companies running AI at scale on Kubernetes. That scope, however, left out a much broader range of Kubernetes users.

Today, the working group is focused on the more typical Kubernetes user: developers doing inference in their applications, but often calling Large Language Models (LLMs) outside the cluster rather than hosting their own LLMs inside it.

The group focuses on meeting users where they are (doing inference without hosting their own LLMs), enabling multi-cloud and multi-cluster routing (especially for data sovereignty), and implementing failover, such as being able to fall back to an external resource if an internal model fails.
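
To make the failover idea concrete, here's a minimal sketch in Python. This isn't working-group code: the endpoint URLs and the OpenAI-style request shape are assumptions, and a real gateway would implement this in the data plane rather than in application code.

```python
import json
import urllib.error
import urllib.request

# Hypothetical endpoints: try the in-cluster model server first, then
# fall back to an external provider. Both URLs are illustrative.
ENDPOINTS = [
    "http://llm.internal.svc.cluster.local/v1/chat/completions",
    "https://api.external-provider.example/v1/chat/completions",
]

def complete_with_failover(payload: dict, timeout: float = 10.0) -> dict:
    """Try each endpoint in priority order; fall back on failure."""
    last_error = None
    for url in ENDPOINTS:
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(request, timeout=timeout) as resp:
                return json.load(resp)
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # this endpoint failed; try the next one
    raise RuntimeError(f"all endpoints failed: {last_error}")
```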

Routing models vs. microservices

AI models represent a fundamentally different networking challenge than traditional microservices due to their unique operational characteristics: microservices are typically small and fast, with every instance of a given microservice being interchangeable. Models, on the other hand, are big, expensive, and slower, and, more importantly, instances of a given model tend not to be interchangeable.

This non-interchangeability is what makes advanced routing logic a must. The GIE, for example, provides an inference routing extension as an alternative to the Kubernetes Service API. It operates at a lower level and allows for more advanced and optimized routing by looking at advertised capabilities and metrics coming out of your serving workload, such as KV cache awareness. Separately, a higher-level AI Gateway stack handles AI Gateway routing, which focuses on semantic routing by looking at the payload.
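
As a rough illustration of what metrics-aware endpoint picking looks like, consider the sketch below. The metric names and the selection policy are placeholders rather than the GIE's actual schema or algorithm.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    """Illustrative per-replica state an endpoint picker might track.
    The metric names are assumptions, not the GIE's actual schema."""
    address: str
    queue_depth: int      # requests waiting on this replica
    kv_cache_util: float  # fraction of KV cache in use, 0.0-1.0

def pick_endpoint(replicas: list[Replica]) -> Replica:
    # Prefer replicas with short queues and spare KV cache; a real
    # picker would also weigh prefix-cache affinity, adapters, etc.
    return min(replicas, key=lambda r: (r.queue_depth, r.kv_cache_util))

replicas = [
    Replica("10.0.0.1:8000", queue_depth=4, kv_cache_util=0.9),
    Replica("10.0.0.2:8000", queue_depth=1, kv_cache_util=0.4),
]
print(pick_endpoint(replicas).address)  # -> 10.0.0.2:8000
```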

This highlights a significant shift in infrastructure. Historically, routing relied on headers, but many of these AI use cases turn that practice on its head: infrastructure now has to be retrofitted to look into the body of the request. This is required to determine which model is being requested, whether the request is safe (guarding against illicit actions), and whether a response is passing out sensitive information (for example, to comply with GDPR).
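
Here's a hedged sketch of what "looking into the body" means in practice: parse the payload to find the requested model, and screen the response for data that must not leak. The route table, model names, and redaction pattern are purely illustrative.

```python
import json
import re

# Very rough illustration of payload-aware routing and response
# screening. The route table and the PII pattern are placeholders.
MODEL_ROUTES = {
    "gpt-4o": "external-gateway",
    "llama-3-8b": "in-cluster-pool",
}
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def route_for(raw_body: bytes) -> str:
    """Look inside an OpenAI-style request body to pick a backend."""
    body = json.loads(raw_body)
    return MODEL_ROUTES.get(body.get("model"), "default-pool")

def screen_response(text: str) -> str:
    """Redact data the response must not leak (e.g., for GDPR)."""
    return EMAIL_PATTERN.sub("[redacted]", text)

raw = b'{"model": "llama-3-8b", "messages": [{"role": "user", "content": "hi"}]}'
print(route_for(raw))                                # -> in-cluster-pool
print(screen_response("Contact alice@example.com"))  # -> Contact [redacted]
```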

Efficiency, hardware, and HPC

Many of the issues faced by AI are not new problems but echoes of challenges found in High Performance Computing (HPC), primarily due to the intense hardware demands of LLMs.

Running LLM workloads is notoriously difficult and slow. A major efficiency headache is low hardware utilization, which often hovers around 20-30%. This low rate is partly because the standard Kubernetes scheduler isn't particularly effective at handling the scale required.

The good news is that the problem of leveraging specialized hardware efficiently has long been addressed in HPC. A lot of the solutions being applied to LLMs are simply HPC principles adapted for new use cases. For example, CPUs are good at decisions and fairly good at I/O, while GPUs are good at massively parallel number-crunching but not good at either decisions or I/O. This complicates scheduling for AI workloads – but HPC workloads have already had to think about this, so cross-pollination between HPC and AI is critical.
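
One HPC technique that carries over directly is pipelining: keep the CPU busy with decisions and I/O so the GPU never starves. The toy producer-consumer sketch below simulates that division of labor; the sleeps stand in for real preprocessing and kernel execution.

```python
import queue
import threading
import time

# Toy simulation of the HPC-style division of labor: a CPU thread
# handles decisions and I/O (here, "loading" batches) while a single
# consumer stands in for the GPU doing parallel number-crunching.
# Keeping the queue non-empty is what keeps the "GPU" utilized.
batches: queue.Queue = queue.Queue(maxsize=4)

def cpu_producer(n_batches: int) -> None:
    for i in range(n_batches):
        time.sleep(0.01)   # simulated I/O + preprocessing
        batches.put(f"batch-{i}")
    batches.put(None)      # sentinel: no more work

def gpu_consumer() -> None:
    while (batch := batches.get()) is not None:
        time.sleep(0.05)   # simulated kernel execution
        print(f"processed {batch}")

threading.Thread(target=cpu_producer, args=(8,)).start()
gpu_consumer()
```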

A significant complication, though, is that the data scientists who drive the development of LLMs are often in a separate world from the network engineers who have the expertise to handle routing and system efficiency. The disconnect between these two worlds can make building efficient LLM systems a frustrating process, leading again to a need to cross-train between these two separate fields.

The agentic paradigm and security

The shift to agentic AI, which introduces autonomy and tool-calling capabilities, represents a new frontier for networking and security.

An agent is a thing running in a loop that uses tools to accomplish a goal, which requires a certain amount of autonomy – without that, you just have a script. This autonomy is what makes an agentic system so powerful and potentially the killer app that every Kubernetes user will want: solid, safe automation of workloads and infrastructure.
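
That definition maps onto a surprisingly small amount of code. Below is a skeletal agent loop in Python; call_llm and the tool registry are stand-ins, not any particular framework's API.

```python
import json

# Skeletal agent loop matching the definition above: call a model,
# execute any tool it requests, feed the result back, and repeat until
# the model answers directly (or we run out of steps).
TOOLS = {
    "get_pod_count": lambda namespace: 12,  # placeholder implementation
}

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for a real model call; returns a tool request or answer."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "tool" not in reply:  # model answered directly: done
            return reply["content"]
        result = TOOLS[reply["tool"]](**reply.get("args", {}))
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "step budget exhausted"
```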

However, this autonomy is deeply terrifying from a security perspective. The agentic model doesn't just perform simple actions; it needs to approximate the user's intent. The critical thing to remember is that LLMs are not human: they don’t share human motivation and they don’t have the ability to rationalize. They operate purely on token associations, and an accidental association could easily lead to a destructive action.

The defense against this risk has to be a robust, defense-in-depth strategy, with security implemented at every single layer. For instance, using Model Context Protocol (MCP) servers as the first line of defense is better than using simple API wrappers. The protocol includes a feature called elicitation, which is a way for the system to bypass the agent and query the user for confirmation on potentially dangerous actions, like asking, “Are you sure you want to take down production today?”
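
The elicitation flow is easy to sketch. The snippet below models the idea only; it is not the actual MCP SDK, and the tool names are hypothetical.

```python
# Illustration of the elicitation idea: before a dangerous tool runs,
# the server goes around the agent and asks the human directly. This
# models the flow only; it is not the actual MCP SDK API.
DANGEROUS_TOOLS = {"delete_deployment", "drain_node"}

def elicit_confirmation(action: str) -> bool:
    answer = input(f"Are you sure you want to: {action}? [y/N] ")
    return answer.strip().lower() == "y"

def invoke_tool(name: str, run) -> str:
    if name in DANGEROUS_TOOLS and not elicit_confirmation(name):
        return "refused: user did not confirm"  # the agent never decides
    return run()

print(invoke_tool("delete_deployment", lambda: "deployment deleted"))
```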

Ultimately, the greatest risk is humans. We have a powerful tendency to anthropomorphize almost everything, which makes it very easy for users to fall into the trap of believing that they can trust their agents to act in their best interests. This passive trust is very dangerous because it’s based on a fundamental misunderstanding: AI agents are not humans, and you cannot ascribe human motivation to them.

Stay in touch with Flynn and Shane

You can connect with Flynn on the CNCF and Linkerd Slack. Shane is also on the CNCF Slack and on LinkedIn.

FAQ

How do AI models fundamentally differ from traditional microservices in a networking context?

Unlike microservices, which are typically small, fast, and interchangeable, AI models are generally large, expensive, and slower, and, critically, instances of a given model tend not to be interchangeable. This non-interchangeability necessitates more advanced routing logic.

Why is there a shift from header-based to payload-based routing for AI workloads on Kubernetes?

Historically, routing relied on headers, but for many AI use cases, the networking infrastructure must be retrofitted to look into the body of the request (the payload). 

This is essential for:

  • Determining the specific model being requested.
  • Semantic routing based on content.
  • Security, by guarding against illicit actions or ensuring the response doesn't pass out sensitive information (e.g., to comply with GDPR).

What is the security risk associated with Agentic AI systems?

The autonomy of Agentic AI is a challenge from a security perspective because LLMs do not share human motivation or the ability to rationalize. They operate purely on token associations, meaning an accidental association could easily lead to a destructive action.

What defense strategy is recommended for securing Agentic AI?

A robust defense-in-depth security strategy implemented at every layer is required. For example, using Model Context Protocol (MCP) servers as the first line of defense is recommended; the protocol includes an elicitation feature that bypasses the agent and queries the user for confirmation on potentially dangerous actions.
