During this AI Kubernetes Show episode, we chatted with Alexa Griffith, a senior software engineer in AI infrastructure at Bloomberg and a tech lead for CNCF TAG Infrastructure. We dove into the major cultural and technological shift happening as organizations embrace the new era of generative AI.
This blog post was generated by AI from the interview transcript, with some editing.
Bloomberg’s journey into cloud native technology was closely tied to its AI ambitions. The company began exploring how to build a platform on Kubernetes back in 2016. AI use cases were the perfect low-risk starting point since they initially had more relaxed Service Level Agreements (SLAs) compared to other core products.
This early work led to open source contributions, including KServe (formerly KFServing), now a CNCF incubating project. The project got started with help from Bloomberg’s engineering team, including Dan Sun, Alexa’s manager. KServe provides an abstraction layer on top of Kubernetes, making it easier to manage scalable, reliable model deployments for teams self-hosting models, even on-premises.
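To give a feel for that abstraction, here’s a minimal sketch following the pattern in KServe’s Python SDK documentation. The name, namespace, and example storage URI are placeholders, not Bloomberg’s actual setup.

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
    constants,
)

# One declarative object describes the whole deployment; KServe turns it
# into pods, autoscaling, and routing on the underlying Kubernetes cluster.
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + "/v1beta1",
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="sklearn-demo", namespace="models"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)

KServeClient().create(isvc)
```

The point of the abstraction is visible in what’s absent: no Deployments, Services, or autoscaler configuration, just a pointer to model artifacts.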
More recently, the shift toward generative AI has called for new tools. The Envoy AI Gateway is a newer open source project built on Envoy. This project was a collaboration between Bloomberg and Tetrate engineers and is specifically designed to meet the unique demands of the GenAI era.
The move to generative AI (GenAI) is an evolution for the existing infrastructure, not a full-blown revolution. The current predictive AI platform, which is all about composability and interoperability, isn't going anywhere and is still fully supported.
However, GenAI does introduce some interesting new platform requirements. When it comes to load patterns, expect to see more token streaming and a lot more GPU usage. Platform teams also have to figure out how to simplify and standardize the communication with all the different models and providers out there, which often use wildly different protocols.
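To illustrate the streaming load pattern, here’s a short sketch using the OpenAI Python client against a generic OpenAI-compatible endpoint; the base URL, key, and model name are placeholders.

```python
from openai import OpenAI

# base_url and model are placeholders for any OpenAI-compatible endpoint.
client = OpenAI(base_url="http://llm-gateway.internal/v1", api_key="not-a-real-key")

stream = client.chat.completions.create(
    model="example-model",
    messages=[{"role": "user", "content": "Summarize this earnings report."}],
    stream=True,  # tokens arrive incrementally instead of as one response
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g., the final one) carry no content
        print(delta, end="", flush=True)
```

From the platform’s perspective, each request is now a long-lived connection trickling tokens, which changes how you think about load balancing and timeouts.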
The key to all this, especially for agentic systems, is the Model Context Protocol (MCP). Bloomberg’s CTO calls it the “API of the agentic AI era” because it provides a unified, common way for agents to communicate with various tools and other agents.
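As a rough illustration of that unified interface, here’s a minimal MCP tool server sketched with the official Python SDK’s FastMCP helper; the tool itself is a stub invented for this example.

```python
from mcp.server.fastmcp import FastMCP

server = FastMCP("quote-tools")

@server.tool()
def latest_price(ticker: str) -> float:
    """Return the latest price for a ticker (stubbed for illustration)."""
    return 123.45  # a real server would query a market-data service here

if __name__ == "__main__":
    # Any MCP-capable agent can now discover and call latest_price()
    # through the same protocol it uses for every other tool.
    server.run()
```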
One of the biggest hurdles when working with GenAI is the non-deterministic nature of LLMs. Essentially, you can give an LLM the exact same prompt multiple times and get different outputs each time. This core challenge means we need a whole new playbook for how we control and monitor these systems.
This non-determinism really complicates things when it comes to managing cost and ensuring good observability. On the cost front, we measure expenditure in tokens, but because the output length is unpredictable, cost becomes highly variable. When you throw agentic systems into the mix, the complexity skyrockets. The LLM decides how many queries to make, and for a single input prompt, that number is also non-deterministic. For observability, it becomes crucial to implement robust distributed tracing and monitoring. Without these tools, understanding what the system is doing and why becomes next to impossible.
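One common way to attack both problems at once is to wrap every LLM call in a trace span that records token usage and estimated cost. Here’s a sketch using the OpenTelemetry Python API and an OpenAI-style client; the per-token prices are made up for illustration.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.platform")

# Hypothetical per-1K-token prices; real pricing varies by provider and model.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def traced_completion(client, model, messages):
    # Wrapping each call in a span lets you reconstruct agentic chains of
    # non-deterministic calls after the fact.
    with tracer.start_as_current_span("llm.completion") as span:
        response = client.chat.completions.create(model=model, messages=messages)
        usage = response.usage
        cost = (usage.prompt_tokens * PRICE_PER_1K_INPUT
                + usage.completion_tokens * PRICE_PER_1K_OUTPUT) / 1000
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.tokens.input", usage.prompt_tokens)
        span.set_attribute("llm.tokens.output", usage.completion_tokens)
        span.set_attribute("llm.cost.usd", cost)
        return response
```

Because cost is recorded per call rather than estimated up front, the variability of output length stops being a blind spot and becomes a measurable distribution.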
Managing the unpredictability of large language models (LLMs) often boils down to building a "deterministic cage around a non-deterministic core."
New security concerns are popping up as LLMs become more capable. For example, they can be used for more effective “red teaming,” which amounts to more intelligent hacking. The good news is that LLMs can also be leveraged to boost security, particularly when it comes to code review.
A common pattern is using one LLM as a judge to evaluate the accuracy of others. This is done to catch issues like hallucinations or the accidental leaking of personal data. To create that necessary deterministic cage around the evaluation process, methods like running the models multiple times or using different configurations are essential.
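Here’s a rough sketch of what that might look like: a judge model run several times at varying temperatures, with a majority vote as the final verdict. The prompt wording and PASS/FAIL scheme are illustrative, not a specific Bloomberg implementation.

```python
from collections import Counter

def judge_once(client, judge_model, answer, temperature):
    """Ask a judge model for a PASS/FAIL verdict on a candidate answer."""
    prompt = (
        "Evaluate the following answer for hallucinations or leaked personal "
        f"data. Reply with exactly PASS or FAIL.\n\nAnswer:\n{answer}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip().upper()

def majority_verdict(client, judge_model, answer, runs=5):
    # Repeated runs at varied temperatures turn one noisy judgment into a
    # more stable, quasi-deterministic signal.
    temperatures = (0.0, 0.3, 0.7, 0.3, 0.0)[:runs]
    verdicts = [judge_once(client, judge_model, answer, t) for t in temperatures]
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / len(verdicts)
```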
When it comes to Retrieval-Augmented Generation (RAG) systems, there’s a common misconception. Because RAG systems pull directly from documents, they may seem safer, but a recent Bloomberg research study found that they aren’t necessarily so.
The GenAI landscape is rapidly evolving. With models being constantly released, deprecated, and updated, platform teams need to create a critical layer of abstraction to truly empower developers. This is where the concept of a model garden comes in.
The platform team's core mission is to abstract this complexity through tools like a model garden—essentially a centralized registry. This approach lets users always work with the newest models and understand what they're using. Developers can access crucial benchmarking data, easily check latencies and features, and make informed, use-case-driven choices without needing to know every intricate detail of a specific model version.
When a platform team is juggling different model providers, they need to create a unified user experience. The core challenge is that providers like Anthropic and OpenAI all have their own APIs and use unique specifications. For example, one might use a label like 'reasoning' with a value of 'medium-high,' while another opts for an integer from one to five. This is where the Envoy AI Gateway can help. Its job is to manage traffic across hybrid cloud environments and different models, ultimately simplifying a user's request so that it can be easily interchanged across these various systems.
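Conceptually, that translation layer boils down to mapping one canonical parameter onto each provider’s dialect. The sketch below is purely illustrative; the provider styles and field names are not real API fields.

```python
CANONICAL_EFFORTS = ("low", "medium", "medium-high", "high")

def to_provider_params(effort: str, provider: str) -> dict:
    """Map one canonical 'reasoning effort' setting onto a provider's dialect."""
    if effort not in CANONICAL_EFFORTS:
        raise ValueError(f"unknown effort level: {effort}")
    if provider == "label-style":
        # A provider that expresses effort as a string label.
        return {"reasoning": effort}
    if provider == "integer-style":
        # A provider that expresses effort as an integer from 1 to 5.
        return {"reasoning_level": 1 + CANONICAL_EFFORTS.index(effort)}
    raise ValueError(f"unknown provider: {provider}")
```

Users express intent once, in the canonical vocabulary, and the gateway handles the fan-out to each provider’s spec.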
The metrics for evaluating model performance have changed. We’re moving past traditional system health checks and focusing on what really matters for the user experience with GenAI: latency. Because of token streaming, latency is now critical. The two key measurements are time to first token, which dictates how “snappy” the initial response feels to the user, and token flow, which measures how steadily subsequent tokens arrive after that first one.
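Measuring both is straightforward once responses stream. Here’s a small, generic sketch that times any iterable of token chunks; it assumes nothing about a particular client library.

```python
import time

def measure_streaming_latency(stream):
    """Measure time to first token and average token flow for a stream.

    `stream` is any iterable of text chunks, e.g. token deltas from an
    OpenAI-compatible streaming response.
    """
    start = time.monotonic()
    ttft = None
    gaps, last = [], None
    for _chunk in stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start       # how "snappy" the response feels
        if last is not None:
            gaps.append(now - last)  # steadiness of the ongoing token flow
        last = now
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, avg_gap
```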
For platform teams, the real challenge is taking the huge volume of evaluation scores generated for every single prompt and turning that into something meaningful and actionable, like a Service Level Objective (SLO), to get a true picture of the system's performance.
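A simple starting point is to collapse per-prompt scores into a compliance ratio against a quality threshold, then set the SLO on that ratio. The threshold and target below are placeholders.

```python
def slo_compliance(scores: list[float], threshold: float = 0.8) -> float:
    """Fraction of per-prompt evaluation scores meeting a quality threshold.

    An SLO might then read: "at least 95% of responses score >= 0.8",
    turning a flood of raw eval scores into one pass/fail signal.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(score >= threshold for score in scores) / len(scores)

# Example: slo_compliance([0.9, 0.7, 0.85, 0.95]) == 0.75, violating a 95% target.
```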
You can connect with Alexa Griffith through the following channels: