In this episode of The Kubernetes Show, we talked with Keith Mattix, Senior Principal Software Engineer at Microsoft, Istio maintainer, and Co-Lead for the CNCF AI Gateway Working Group. We chatted about the current challenges and architectural shifts required to integrate AI workflows into enterprise-ready Kubernetes platforms.
This blog post was generated by AI from the interview transcript, with some editing.
The rise of AI has created a real split in organizations: traditional platform engineering stacks and data science workflows just don't mesh. On one side, the platform team stack usually relies on established Kubernetes practices, Helm charts, and standardized tools for things like metrics, observability, and traffic management. The data science stack, on the other hand, often involves different environments and tools like R, Python, or MATLAB for building and iterating on models.
This creates a serious tug-of-war when it comes to deciding who owns the stack and what tools will be used when shipping models to production. This divide goes way beyond simple inference requests and covers the entire model lifecycle, including training, experiment tracking, and reinforcement learning. The Kubernetes AI Gateway Working Group is focused on creating a seamless experience and migration path for running AI workflows on Kubernetes.
While foundational models from hyperscalers are powerful, they aren't always fast or specific enough for enterprise needs. The high cost of training these massive models is what drives the need for specialization.
Data scientists are constantly figuring out how to adapt general models. One common approach is Transfer Learning, which essentially means taking the core knowledge (embeddings) and applying it to a different problem or context.
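To make the idea concrete, here is a minimal sketch of transfer learning in Python: the foundation model's embeddings are treated as frozen features, and only a small task-specific classifier is trained on top. The embeddings here are random stand-ins (an assumption for the sake of a runnable example); in practice they would come from the model's encoder.

```python
# Minimal transfer-learning sketch: reuse embeddings from a foundation model
# as fixed features, and train only a small task-specific classifier on top.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for foundation-model embeddings of support tickets.
# In a real system these vectors would come from the model's encoder, not random data.
X_train = rng.normal(size=(200, 768))   # 200 examples, 768-dim embeddings
y_train = rng.integers(0, 2, size=200)  # binary labels, e.g. "billing" vs "technical"

# Only this lightweight head is trained; the foundation model stays frozen.
head = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_new = rng.normal(size=(5, 768))       # embeddings of unseen tickets
print(head.predict(X_new))              # task-specific predictions
```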
Another technique is Low-Rank Adaptation (LoRA), typically packaged as LoRA adapters. This lets developers take a foundational model and make small, strategic tweaks to its weights. The goal is to narrow the model to a specific use case—for example, creating one adapter for English responses and another for Chinese responses—making the same foundational model effective for hyper-specific topics.
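The core idea behind LoRA is that the frozen base weight matrix gets a trainable low-rank correction, so only a tiny fraction of parameters is tuned. The toy PyTorch layer below is an illustrative sketch of that idea, not the API of any particular adapter library; the layer name, rank, and scaling values are assumptions.

```python
# Toy LoRA layer in PyTorch: the frozen base weight W gets a trainable
# low-rank update B @ A, so only a tiny fraction of parameters is tuned.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # foundation weights stay frozen
        self.base.bias.requires_grad_(False)
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(1024, 1024)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")   # only the adapter is trainable
```

Swapping adapters (say, one per language) means swapping only these small A and B matrices while the foundational model itself stays untouched.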
Finally, there's Retrieval-Augmented Generation (RAG). This method couples a foundational model with a mechanism to retrieve and use information from an internal knowledge base, making the model's output more grounded in proprietary data.
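A minimal sketch of the retrieval step looks like this: embed the question and the internal documents, pick the closest matches by cosine similarity, and prepend them to the prompt. The `embed` function here is a placeholder (a hashing trick, assumed for illustration) standing in for a real embedding endpoint, and the final model call is left as a comment.

```python
# Minimal RAG sketch: retrieve the most relevant internal documents and
# prepend them to the prompt so the model answers from proprietary data.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a real system would call the foundation
    # model's embedding endpoint instead of hashing the text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    scores = []
    for doc in docs:
        d = embed(doc)
        scores.append(float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d))))
    top = np.argsort(scores)[::-1][:k]          # indices of the k best matches
    return [docs[i] for i in top]

knowledge_base = [
    "Expense reports must be filed within 30 days.",
    "The on-call rotation changes every Monday.",
    "GPU quota requests go through the platform team.",
]
question = "How do I get more GPU quota?"
context = "\n".join(retrieve(question, knowledge_base))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# The assembled prompt would then be sent to the foundational model.
print(prompt)
```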
This drive for specialization is a direct response to the need for application-specific models, such as those used in healthcare or HR systems. It's also driven by serious concerns around data privacy, liability, and the evolving laws focused on AI safety.
To tackle the complexity of an end-to-end AI workflow, the working group's main strategy is to think in terms of personas. This approach, which is similar to how Gateway API development has been handled, helps define APIs and "hook points" that meet the specific needs of different users. This effort is focused on three key personas. The data scientist is primarily focused on training models, running inference, and using the feedback loop to improve the model. The inference platform owner is responsible for running and managing the fleet of Kubernetes clusters and securing resources like GPUs. Finally, the application developer is running a general workload and needs to "make an inference call" to an AI model.
Managing the cost of LLM requests is a critical new challenge for platform engineers, particularly when dealing with token rate limits from an LLM provider. Mattix's best technical advice for handling this is to centralize the policy. You can achieve this by routing all outgoing cluster traffic through a gateway, such as an egress gateway. This approach allows platform engineers to throttle individual applications and enforce global quotas, which prevents any single application from monopolizing the entire budget.
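Here is a rough sketch of the quota logic an egress gateway could enforce: each application gets its own token budget, and a global cap stops any single team from draining the shared pool. The class, application names, and budget numbers are all hypothetical.

```python
# Sketch of centralized quota enforcement at an egress gateway: per-app
# token budgets plus a global cap shared across the whole cluster.
class TokenQuota:
    def __init__(self, global_limit: int, per_app_limits: dict[str, int]):
        self.global_limit = global_limit
        self.per_app_limits = per_app_limits
        self.global_used = 0
        self.app_used = {app: 0 for app in per_app_limits}

    def allow(self, app: str, tokens: int) -> bool:
        # Reject if either the app's budget or the global budget would be exceeded.
        app_limit = self.per_app_limits.get(app, 0)
        if self.app_used.get(app, 0) + tokens > app_limit:
            return False
        if self.global_used + tokens > self.global_limit:
            return False
        self.app_used[app] = self.app_used.get(app, 0) + tokens
        self.global_used += tokens
        return True

# Hypothetical budgets: two apps share a 1M-token global allowance.
quota = TokenQuota(global_limit=1_000_000,
                   per_app_limits={"chatbot": 600_000, "summarizer": 500_000})
print(quota.allow("chatbot", 400_000))     # True: within both budgets
print(quota.allow("summarizer", 500_000))  # True: app budget fits, global still has room
print(quota.allow("chatbot", 150_000))     # False: the global cap would be exceeded
```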
On the non-technical side, communication is vital. It's helpful to view application developers as your customers. This means giving them a heads-up about token spend and setting up internal office hours where they can request more quota for their service or for an event. This kind of customer empathy creates a healthier business overall.
The rise of AI agents introduces a significant challenge for identity and access management. When an agent runs a task on a user's behalf, it absolutely should not inherit all of the user's permissions—that is a security team's worst fear.
An agent is essentially an LLM running in a loop to accomplish a specific goal. The core issue, the "identity gap," is that agent identities sit awkwardly between the traditional models of short-lived machine identities and long-lived human identities. Agents should get only the permissions they need, and only for as long as they need them (just-in-time access).
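The sketch below illustrates both halves of that sentence: an LLM running in a loop, restricted to a task-scoped allowlist of tools rather than inheriting the user's full permissions. The `call_llm` function is a hard-coded placeholder and the tool names are invented for illustration.

```python
# Sketch of an agent as "an LLM running in a loop", restricted to a small,
# task-scoped set of tools rather than inheriting the user's full permissions.
def call_llm(prompt: str) -> dict:
    # Placeholder for a real model call; here it always searches once, then stops.
    if "Observation" in prompt:
        return {"action": "finish", "input": "done"}
    return {"action": "search_docs", "input": "vacation policy"}

TOOLS = {
    "search_docs": lambda q: f"Top result for '{q}'",
    # Note: no tool that writes to HR systems or reads payroll; this task doesn't need one.
}

def run_agent(goal: str, allowed_tools: set[str], max_steps: int = 5) -> str:
    prompt = f"Goal: {goal}"
    for _ in range(max_steps):
        decision = call_llm(prompt)
        if decision["action"] == "finish":
            return decision["input"]
        if decision["action"] not in allowed_tools:
            return f"Blocked: agent tried to use '{decision['action']}'"
        observation = TOOLS[decision["action"]](decision["input"])
        prompt += f"\nObservation: {observation}"
    return "Stopped: step limit reached"

print(run_agent("Summarize the vacation policy", allowed_tools={"search_docs"}))
```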
Right now, the main way to tackle this is with agent sandboxing. You should definitely look into technologies like gVisor, KVM, or WebAssembly tools such as Wazet to isolate the agent's runtime. The goal is to strictly lock down an agent's capabilities.
For the near term, while the industry develops standards for what's being called "agentic identity," it's best to lean heavily on an ephemeral, short-lived machine identity model. Think 30-minute or one-hour lifetimes to mitigate risk.
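In code, an ephemeral identity of that kind boils down to a credential that carries explicit scopes and a short expiry, checked before every action. The sketch below is only an illustration of that pattern; the dataclass, scope names, and 30-minute TTL are assumptions, not a reference to any specific identity system.

```python
# Sketch of an ephemeral agent credential: scoped to specific actions and
# expiring after a short window, so a leaked token has a limited blast radius.
import time
from dataclasses import dataclass, field

@dataclass
class AgentCredential:
    agent_id: str
    scopes: set[str]
    ttl_seconds: int = 1800                      # e.g. a 30-minute lifetime
    issued_at: float = field(default_factory=time.time)

    def is_valid(self, scope: str) -> bool:
        # A request is allowed only if the credential is fresh AND the scope was granted.
        not_expired = time.time() < self.issued_at + self.ttl_seconds
        return not_expired and scope in self.scopes

cred = AgentCredential(agent_id="ticket-triage-agent",
                       scopes={"tickets:read", "tickets:comment"})
print(cred.is_valid("tickets:read"))    # True while the credential is fresh
print(cred.is_valid("tickets:delete"))  # False: never granted to this agent
```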
The introduction of AI fundamentally changes the software engineering landscape, particularly the long-held expectation of a "pure function"—that the same input will always yield the same output. Large Language Models (LLMs) are non-deterministic, making that traditional expectation obsolete.
To navigate this new environment, developers should approach everything with an immense amount of suspicion. Instead of focusing on what can go right, the mindset needs to shift toward anticipating everything that can go wrong. This suspicion naturally leads to adopting a zero trust approach to AI agents.
This means implementing strong guards directly in the infrastructure to prevent common issues. These issues include prompt injection, agents returning explicit or inappropriate content, or accidentally leaking sensitive data like API keys. Ultimately, isolating the agent in its runtime as much as possible is the best path forward.
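As a rough illustration of what such infrastructure-side guards might look like, the sketch below blocks an obvious prompt-injection phrase on the way in and redacts anything shaped like a credential on the way out. The patterns and phrases are illustrative and deliberately incomplete, not a production filter.

```python
# Sketch of infrastructure-side guards on agent input/output: block obvious
# prompt-injection phrases and redact anything that looks like a credential.
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key IDs
]
INJECTION_PHRASES = ["ignore previous instructions", "disregard your system prompt"]

def guard_input(user_prompt: str) -> str:
    lowered = user_prompt.lower()
    if any(phrase in lowered for phrase in INJECTION_PHRASES):
        raise ValueError("Possible prompt injection detected")
    return user_prompt

def guard_output(agent_reply: str) -> str:
    # Redact anything that matches a known secret shape before it leaves the cluster.
    for pattern in SECRET_PATTERNS:
        agent_reply = pattern.sub("[REDACTED]", agent_reply)
    return agent_reply

print(guard_output("The key is sk-abcdefghijklmnopqrstuvwx"))  # the key is redacted
```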
You can connect with Keith Mattix and follow his work through the following platforms and communities, as mentioned in the interview:
What is the primary conflict between data science and platform engineering teams in the AI space?
The main conflict is a "tug of war" over the technology stack: platform teams prefer their established Kubernetes and DevOps tools, while data science teams prefer their own distinct tools (like R or MATLAB) for model development, creating dissonance in organizations.
What are some techniques used to adapt a large foundational model for a specific enterprise use case?
Techniques include transfer learning, which involves placing embeddings in a new context, and Low-Rank Adaptation (LoRA) adapters, which allow for tweaks to the weights of a foundational model to make it hyper-specific for a particular domain or language. Retrieval-Augmented Generation (RAG) can also be used to connect models to an internal knowledge base.
How can a platform engineer manage the global token spend for LLM providers?
The recommended technical solution is to centralize the policy using an egress gateway to ensure all outward-bound cluster traffic is checked for token usage, allowing the platform engineer to "throttle individual applications" and enforce quotas globally.
What is an AI agent, and why is its identity a security challenge?
An AI agent is an LLM running in a loop to accomplish a certain goal. Its identity is a security challenge because it acts on behalf of a human user but should not possess all the user's permissions. It needs a scoped, just-in-time identity, falling in between traditional human and machine identities.
What is the recommended security mindset for working with non-deterministic AI models and agents?
The recommended mindset is one of an immense amount of suspicion and taking a zero trust approach. Engineers should "think about everything that can go wrong" and prioritize agent sandboxing and using short-lived, ephemeral machine identities to mitigate risk.