During this AI Kubernetes Show episode, we chatted with Alexa Griffith, a senior software engineer in AI infrastructure at Bloomberg and a tech lead for CNCF TAG Infrastructure. We dove into the major cultural and technological shift happening as organizations embrace the new era of generative AI.
This blog post was generated by AI from the interview transcript, with some editing.
Bloomberg’s journey into cloud native technology was closely tied to its AI ambitions. The company began exploring how to build a platform on Kubernetes back in 2016. AI use cases were the perfect low-risk starting point since they initially had more relaxed Service Level Agreements (SLAs) compared to other core products.
This early work led to open source contributions, including KServe (formerly KFServing), now a CNCF incubating project. The project got started with help from Bloomberg’s engineering team, including Dan Sun, Alexa’s manager. KServe provides an abstraction layer on top of Kubernetes, making it easier to manage scalable, reliable model deployments for teams self-hosting models, even on-premises.
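To give a feel for that abstraction, here’s a minimal sketch following the pattern in KServe’s Python SDK documentation. The name, namespace, and example storage URI are placeholders, not Bloomberg’s actual setup.

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
    constants,
)

# One declarative object describes the whole deployment; KServe turns it
# into pods, autoscaling, and routing on the underlying Kubernetes cluster.
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + "/v1beta1",
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="sklearn-demo", namespace="models"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)

KServeClient().create(isvc)
```

The point of the abstraction is visible in what’s absent: no Deployments, Services, or autoscaler configuration, just a pointer to model artifacts.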
More recently, the shift toward generative AI has called for new tools. The Envoy AI Gateway is a newer open source project built on Envoy. This project was a collaboration between Bloomberg and Tetrate engineers and is specifically designed to meet the unique demands of the GenAI era.
The move to generative AI (GenAI) is an evolution for the existing infrastructure, not a full-blown revolution. The current predictive AI platform, which is all about composability and interoperability, isn't going anywhere and is still fully supported.
However, GenAI does introduce some interesting new platform requirements. When it comes to load patterns, expect to see more token streaming and a lot more GPU usage. Platform teams also have to figure out how to simplify and standardize the communication with all the different models and providers out there, which often use wildly different protocols.
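To illustrate the streaming load pattern, here’s a short sketch using the OpenAI Python client against a generic OpenAI-compatible endpoint; the base URL, key, and model name are placeholders.

```python
from openai import OpenAI

# base_url and model are placeholders for any OpenAI-compatible endpoint.
client = OpenAI(base_url="http://llm-gateway.internal/v1", api_key="not-a-real-key")

stream = client.chat.completions.create(
    model="example-model",
    messages=[{"role": "user", "content": "Summarize this earnings report."}],
    stream=True,  # tokens arrive incrementally instead of as one response
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g., the final one) carry no content
        print(delta, end="", flush=True)
```

From the platform’s perspective, each request is now a long-lived connection trickling tokens, which changes how you think about load balancing and timeouts.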
The key to all this, especially for agentic systems, is the Model Context Protocol (MCP). Bloomberg’s CTO calls it the “API of the agentic AI era” because it provides a unified, common way for agents to communicate with various tools and other agents.
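As a rough illustration of that unified interface, here’s a minimal MCP tool server sketched with the official Python SDK’s FastMCP helper; the tool itself is a stub invented for this example.

```python
from mcp.server.fastmcp import FastMCP

server = FastMCP("quote-tools")

@server.tool()
def latest_price(ticker: str) -> float:
    """Return the latest price for a ticker (stubbed for illustration)."""
    return 123.45  # a real server would query a market-data service here

if __name__ == "__main__":
    # Any MCP-capable agent can now discover and call latest_price()
    # through the same protocol it uses for every other tool.
    server.run()
```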
One of the biggest hurdles when working with GenAI is the non-deterministic nature of LLMs. Essentially, you can give an LLM the exact same prompt multiple times and get different outputs each time. This core challenge means we need a whole new playbook for how we control and monitor these systems.
This non-determinism really complicates things when it comes to managing cost and ensuring good observability. On the cost front, we measure expenditure in tokens, but because the output length is unpredictable, cost becomes highly variable. When you throw agentic systems into the mix, the complexity skyrockets. The LLM decides how many queries to make, and for a single input prompt, that number is also non-deterministic. For observability, it becomes crucial to implement robust distributed tracing and monitoring. Without these tools, understanding what the system is doing and why becomes next to impossible.
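One common way to attack both problems at once is to wrap every LLM call in a trace span that records token usage and estimated cost. Here’s a sketch using the OpenTelemetry Python API and an OpenAI-style client; the per-token prices are made up for illustration.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.platform")

# Hypothetical per-1K-token prices; real pricing varies by provider and model.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def traced_completion(client, model, messages):
    # Wrapping each call in a span lets you reconstruct agentic chains of
    # non-deterministic calls after the fact.
    with tracer.start_as_current_span("llm.completion") as span:
        response = client.chat.completions.create(model=model, messages=messages)
        usage = response.usage
        cost = (usage.prompt_tokens * PRICE_PER_1K_INPUT
                + usage.completion_tokens * PRICE_PER_1K_OUTPUT) / 1000
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.tokens.input", usage.prompt_tokens)
        span.set_attribute("llm.tokens.output", usage.completion_tokens)
        span.set_attribute("llm.cost.usd", cost)
        return response
```

Because cost is recorded per call rather than estimated up front, the variability of output length stops being a blind spot and becomes a measurable distribution.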
Managing the unpredictability of large language models (LLMs) often boils down to building a "deterministic cage around a non-deterministic core."
New security concerns are popping up as LLMs become more capable. For example, they can be used for more effective “red teaming,” which amounts to more intelligent hacking. The good news is that LLMs can also be leveraged to boost security, particularly when it comes to code review.
A common pattern is using one LLM as a judge to evaluate the accuracy of others. This is done to catch issues like hallucinations or the accidental leaking of personal data. To create that necessary deterministic cage around the evaluation process, methods like running the models multiple times or using different configurations are essential.
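Here’s a rough sketch of what that might look like: a judge model run several times at varying temperatures, with a majority vote as the final verdict. The prompt wording and PASS/FAIL scheme are illustrative, not a specific Bloomberg implementation.

```python
from collections import Counter

def judge_once(client, judge_model, answer, temperature):
    """Ask a judge model for a PASS/FAIL verdict on a candidate answer."""
    prompt = (
        "Evaluate the following answer for hallucinations or leaked personal "
        f"data. Reply with exactly PASS or FAIL.\n\nAnswer:\n{answer}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip().upper()

def majority_verdict(client, judge_model, answer, runs=5):
    # Repeated runs at varied temperatures turn one noisy judgment into a
    # more stable, quasi-deterministic signal.
    temperatures = (0.0, 0.3, 0.7, 0.3, 0.0)[:runs]
    verdicts = [judge_once(client, judge_model, answer, t) for t in temperatures]
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / len(verdicts)
```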
When it comes to Retrieval-Augmented Generation (RAG) systems, there’s a common misconception. Because RAG systems pull directly from documents, they may seem safer, but a recent Bloomberg research study found that they aren’t necessarily so.
The GenAI landscape is rapidly evolving. With models being constantly released, deprecated, and updated, platform teams need to create a critical layer of abstraction to truly empower developers. This is where the concept of a model garden comes in.
The platform team's core mission is to abstract this complexity through tools like a model garden—essentially a centralized registry. This approach lets users always work with the newest models and understand what they're using. Developers can access crucial benchmarking data, easily check latencies and features, and make informed, use-case-driven choices without needing to know every intricate detail of a specific model version.
When a platform team is juggling different model providers, they need to create a unified user experience. The core challenge is that providers like Anthropic and OpenAI all have their own APIs and use unique specifications. For example, one might use a label like 'reasoning' with a value of 'medium-high,' while another opts for an integer from one to five. This is where the Envoy AI Gateway can help. Its job is to manage traffic across hybrid cloud environments and different models, ultimately simplifying a user's request so that it can be easily interchanged across these various systems.
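Conceptually, that translation layer boils down to mapping one canonical parameter onto each provider’s dialect. The sketch below is purely illustrative; the provider styles and field names are not real API fields.

```python
CANONICAL_EFFORTS = ("low", "medium", "medium-high", "high")

def to_provider_params(effort: str, provider: str) -> dict:
    """Map one canonical 'reasoning effort' setting onto a provider's dialect."""
    if effort not in CANONICAL_EFFORTS:
        raise ValueError(f"unknown effort level: {effort}")
    if provider == "label-style":
        # A provider that expresses effort as a string label.
        return {"reasoning": effort}
    if provider == "integer-style":
        # A provider that expresses effort as an integer from 1 to 5.
        return {"reasoning_level": 1 + CANONICAL_EFFORTS.index(effort)}
    raise ValueError(f"unknown provider: {provider}")
```

Users express intent once, in the canonical vocabulary, and the gateway handles the fan-out to each provider’s spec.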
The metrics for evaluating model performance have changed. We’re moving past traditional system health checks and focusing on what really matters for the user experience with GenAI: latency. Because of token streaming, latency is now critical. The two key measurements are time to first token, which dictates how “snappy” the initial response feels to the user, and token flow, which measures how steadily subsequent tokens arrive after that first one.
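Measuring both is straightforward once responses stream. Here’s a small, generic sketch that times any iterable of token chunks; it assumes nothing about a particular client library.

```python
import time

def measure_streaming_latency(stream):
    """Measure time to first token and average token flow for a stream.

    `stream` is any iterable of text chunks, e.g. token deltas from an
    OpenAI-compatible streaming response.
    """
    start = time.monotonic()
    ttft = None
    gaps, last = [], None
    for _chunk in stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start       # how "snappy" the response feels
        if last is not None:
            gaps.append(now - last)  # steadiness of the ongoing token flow
        last = now
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, avg_gap
```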
For platform teams, the real challenge is taking the huge volume of evaluation scores generated for every single prompt and turning that into something meaningful and actionable, like a Service Level Objective (SLO), to get a true picture of the system's performance.
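A simple starting point is to collapse per-prompt scores into a compliance ratio against a quality threshold, then set the SLO on that ratio. The threshold and target below are placeholders.

```python
def slo_compliance(scores: list[float], threshold: float = 0.8) -> float:
    """Fraction of per-prompt evaluation scores meeting a quality threshold.

    An SLO might then read: "at least 95% of responses score >= 0.8",
    turning a flood of raw eval scores into one pass/fail signal.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(score >= threshold for score in scores) / len(scores)

# Example: slo_compliance([0.9, 0.7, 0.85, 0.95]) == 0.75, violating a 95% target.
```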
You can connect with Alexa Griffith through the following channels: