S02 E04 - Running Multi-agent AI on Kubernetes: Lessons from Imagine Learning

In this episode of The AI Kubernetes Show, Blake Romano, Staff Software Engineer at Imagine Learning, walks through what it actually looks like to build and run AI agents on Kubernetes at scale. He talks about the architecture choices, the failures, and why the organizational context you bring to the LLM matters more than which Software Development Kit (SDK) you use.

Imagine Learning is a K-12 education company building digital platforms for students and educators, and Blake has been driving AI and platform engineering initiatives there.

The problem that started it all

Imagine Learning was migrating to a new internal developer platform (IDP). The platform team was small, but the number of engineers with questions was not.

Blake and his director spent an afternoon building a proof of concept: an AI chatbot backed by AWS Bedrock, Bedrock Agents, and knowledge bases. A day later, they had something testable: a prototype that answered doc questions. That's how it all started.

From RAG to multi-agent

The original chatbot used RAG (Retrieval-Augmented Generation). The team vectorized their documentation, put it into a knowledge base, and let the agent retrieve and respond. It worked but wasn't scalable.

The current architecture is a multi-agent system. An orchestrator agent sits in front of a set of specialized sub-agents, each running as a Kubernetes deployment. The orchestrator routes based on the type of request. Question about a deployed environment? Go to the Argo CD agent. Question about how to configure something? Go to the IDP documentation agent. Question about a ticket? Go to the ticketing system agent. And so on.

The agents don't talk to each other directly. The orchestrator calls them as tools via API, synthesizes the responses, and sends back a single answer.

Here's a concrete example: a developer asks why their S3 bucket isn't deploying. The orchestrator hits the Argo CD agent to get the current resource state, then hits the documentation agent to check how S3 buckets are supposed to be configured in the platform. The chatbot response? “You configured this field incorrectly, which is why it's in a failed state.”
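The routing pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not Imagine Learning's implementation: the agent functions, their canned responses, the keyword lists, and the names `Tool`, `route`, and `orchestrate` are all hypothetical. In the real system each sub-agent is a separate Kubernetes deployment called over an API, and an LLM, not keyword matching, would pick the tools and synthesize the answer.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical stand-ins for sub-agents. Each returns a canned placeholder
# string so the sketch runs on its own; a real sub-agent would be a separate
# service reached over an API.
def argocd_agent(question: str) -> str:
    return "argocd: (placeholder) current state of the deployed resource"


def docs_agent(question: str) -> str:
    return "docs: (placeholder) how the platform expects this to be configured"


def ticketing_agent(question: str) -> str:
    return "ticketing: (placeholder) related tickets"


@dataclass
class Tool:
    name: str
    keywords: List[str]  # naive routing hints (assumption); an LLM would choose in practice
    call: Callable[[str], str]


TOOLS = [
    Tool("argocd", ["deploy", "environment", "sync"], argocd_agent),
    Tool("docs", ["configure", "bucket", "how"], docs_agent),
    Tool("ticketing", ["ticket"], ticketing_agent),
]


def route(question: str) -> List[Tool]:
    """Select every sub-agent whose keywords appear in the question."""
    q = question.lower()
    return [t for t in TOOLS if any(k in q for k in t.keywords)]


def orchestrate(question: str) -> str:
    """Call the selected sub-agents as tools and merge their responses.

    Sub-agents never call each other; only the orchestrator talks to them.
    Joining the responses stands in for the LLM synthesis step.
    """
    responses = [t.call(question) for t in route(question)]
    if not responses:
        return "No agent could handle this request."
    return "\n".join(responses)
```

Calling `orchestrate("Why is my S3 bucket not deploying?")` in this sketch consults the Argo CD and documentation stand-ins and skips ticketing, which mirrors the S3 example above.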

The key shift from RAG was replacing vectorized, static documentation with real-time MCP tool calls. The documentation agent now uses the doc system's search API directly, so there is no re-vectorization and no staleness. Tightly scoped system prompts per agent mean each sub-agent has context appropriate to its domain.

Why GitOps makes this work

Blake points to GitOps as a major differentiator in making AI genuinely useful for platform teams and their consumers. When your infrastructure configuration is code in source control, the LLM can reference it.

If you need to understand how Linkerd is configured in a given cluster, you can point the LLM at the GitOps files that manage that configuration, and it can start to understand them. Then point MCP at the Prometheus server scraping metrics from that cluster, and the model can correlate configuration state with runtime behavior. That's a meaningfully different capability than "generate code from a prompt."

The implication is straightforward: platform teams that haven't moved to infrastructure-as-code and GitOps will get less out of these tools. The LLM is only as useful as the context it can see, and GitOps puts that context somewhere an LLM can reach.

Measuring whether any of this works

Measuring AI platform quality is hard. Thumbs up or thumbs down on agent responses gives you some signal but mostly captures failures. Developers don't go back to rate a response that worked. Just like the bathroom cleanliness button at the airport, you only press it when it's bad.

The practical approach at Imagine Learning leans on DORA metrics: deployment frequency, change failure rate, and time from PR to production. The idea is to tag work done with AI assistance through the ticketing system and correlate that tag with velocity metrics. It's early, and the attribution is imperfect. 
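As a rough sketch of that correlation, the snippet below computes the three metrics separately for changes tagged as AI-assisted and for everything else. The `Change` record, its field names, and the `ai_assisted` flag are assumptions made for illustration; the episode does not describe how the tag is actually stored or joined to deployment data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List


@dataclass
class Change:
    pr_opened: datetime
    deployed: datetime
    failed: bool       # did this deployment cause a production failure?
    ai_assisted: bool  # hypothetical tag carried over from the ticketing system


def dora_summary(changes: List[Change], window: timedelta) -> Dict[str, float]:
    """Deployment frequency, change failure rate, and PR-to-production time."""
    if not changes:
        return {
            "deploys_per_day": 0.0,
            "change_failure_rate": 0.0,
            "median_lead_time_hours": 0.0,
        }
    window_days = window.total_seconds() / 86400
    lead_hours = [
        (c.deployed - c.pr_opened).total_seconds() / 3600 for c in changes
    ]
    return {
        "deploys_per_day": len(changes) / window_days,
        "change_failure_rate": sum(c.failed for c in changes) / len(changes),
        "median_lead_time_hours": median(lead_hours),
    }


def compare_by_tag(
    changes: List[Change], window: timedelta
) -> Dict[str, Dict[str, float]]:
    """Split changes on the AI-assisted tag and summarize each group."""
    tagged = [c for c in changes if c.ai_assisted]
    untagged = [c for c in changes if not c.ai_assisted]
    return {
        "ai_assisted": dora_summary(tagged, window),
        "not_tagged": dora_summary(untagged, window),
    }
```

A side-by-side comparison like this shows correlation only. Tagged and untagged work may differ in size or risk, which is part of why the attribution stays imperfect.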

Blake doesn't think anyone has a great answer to a lot of these problems. What he's confident in is that the framework isn't new. Justifying LLM inference spend is the same problem as justifying EC2 spend: map usage to deployment velocity, feature delivery, and support ticket deflection. The granularity is harder to get right, but the approach is the same.

Guardrails matter more than they used to

The threat model for a platform has shifted. You used to be able to rely on engineers not accidentally deleting the database. The code got reviewed. The people writing it understood the consequences.

With AI in the loop, you have to assume the agent will eventually find the delete function and call it. Platform guardrails that were nice-to-have before are now load-bearing.

Imagine Learning bakes those guardrails into the build phase, deployment phase, and code generation phase: preventing secrets from being committed, enforcing approved AWS services, and requiring architectures the team knows how to support.
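Two of those guardrails, blocking committed secrets and enforcing approved AWS services, can be sketched as a check that runs before a change is accepted. Everything here is illustrative: the two regex patterns are a tiny sample rather than a maintained ruleset, the allowlist contents are invented, and the function names are hypothetical. The episode does not say which tools Imagine Learning uses for these checks.

```python
import re
from typing import Dict, List

# Illustrative patterns only (assumption); a real scanner would rely on a
# maintained ruleset rather than two regular expressions.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

# Hypothetical allowlist; the actual approved services are not named in the source.
APPROVED_AWS_SERVICES = {"s3", "sqs", "dynamodb"}


def find_secrets(files: Dict[str, str]) -> List[str]:
    """Scan file contents (path -> text) line by line for secret-like strings."""
    findings = []
    for path, content in files.items():
        for lineno, line in enumerate(content.splitlines(), start=1):
            for name, pattern in SECRET_PATTERNS.items():
                if pattern.search(line):
                    findings.append(f"{path}:{lineno}: possible {name}")
    return findings


def find_unapproved_services(requested: List[str]) -> List[str]:
    """Return requested services that are not on the allowlist, sorted."""
    return sorted({s.lower() for s in requested} - APPROVED_AWS_SERVICES)


def check_change(files: Dict[str, str], requested_services: List[str]) -> List[str]:
    """Collect all violations; an empty list means the change may proceed."""
    violations = find_secrets(files)
    violations += [
        f"service not approved: {s}"
        for s in find_unapproved_services(requested_services)
    ]
    return violations
```

The same check applies whether a person or an agent wrote the change, which is the point of putting the guardrail in the platform and not in code review.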

Blake also flags something less obvious: non-engineers are starting to vibe code internal tools. Marketing is automating content workflows. Product teams are building quick prototypes. The platform needs to support those use cases with safe paths that don't require deep engineering context to follow.

"You need some blessed paths for people that don't have that engineering background now starting to write some code," said Blake.

Organizational context is the durable investment

Blake's framework for where to put engineering time: the harness changes, the organizational context doesn't. Protocols like A2A (agent-to-agent communication) and MCP (connecting agents to external data sources) are worth investing in because they're becoming standards. The specific SDK or orchestration framework is not worth over-investing in because it will change.

At the end of the day, the agents will change. And so will the LLMs and the tooling. The real value is in the organizational context you're bringing into these LLMs.

That means investing in how you put context into system prompts and skills (structured context blocks for agent behavior), how you expose your services via MCP, and how you design your information architecture so AI can actually parse your docs. The framework wrapping that context is a commodity.

The swappability argument is real in practice. If a vendor ships an MCP server for their tool, you can plug it into Claude Code for local debugging, into your orchestrator agent for Slack-accessible troubleshooting, or into any CLI your organization uses. The connection point is MCP, and it works across contexts.

Platform engineers become air traffic controllers

Blake's prediction for where this lands in two to three years is that engineers will stop thinking at the function level and start thinking at the system level. Debugging a memory leak in a specific function becomes an agent problem. Deciding whether the platform's data model can support a new product capability stays human.

You won't be really thinking at the micro level as to what this function is doing. You'll be thinking much broader as to how the platform is working at a high level.

On code review, Blake looked at roughly 10,000 lines of code in a single day. The current review model doesn't scale at that volume. His working theory is that review should move upstream. Humans should focus on architecture and planning artifacts while letting agents flag code-level issues and only pull in a human reviewer where the agent flags something uncertain.

The part where engineers stay irreplaceable is judgment. Which platform abstractions actually hold. Whether a new feature will create reliability problems downstream. Whether the product requirement makes sense for the data model. Those decisions require organizational context that can't be fully encoded in a system prompt yet.

What Blake would tell other platform teams

Prototype fast and throw things away when they don't work. The cost of building and discarding is low enough now that paralysis is the bigger risk.

Keep the core engineering fundamentals: test coverage, observability, DORA metrics, and security guardrails. These aren't obsolete. If anything, they matter more because you're shipping faster and the blast radius of a bad decision is larger.

Start small and continue to iterate on it over time. And right now, it's so easy and so cheap to proof of concept something.

Frequently Asked Questions 

How do you architect a multi-agent AI system on Kubernetes? 

Run specialized sub-agents as Kubernetes deployments. An orchestrator routes requests to the right sub-agent (Argo CD, docs, ticketing) via API, synthesizes responses, and returns one answer. Each sub-agent has scoped system instructions and tooling.

Why does GitOps make AI more useful for platform teams? 

When the infrastructure configuration is code in source control, LLMs can read it directly. Point MCP at Prometheus and the model can correlate config state with runtime behavior, going beyond code generation to real operational reasoning.  

How do you measure the value of AI platform tooling? 

Apply DORA metrics: deployment frequency, change failure rate, and time from PR to production. Tag AI-assisted work in your ticketing system and correlate with velocity. LLM inference costs are real, and executives will ask.