S02 E02 - The AI-Native Workflow: How Schonfeld Used Kubernetes to Manage Explosive Code Volume

In this AI Kubernetes Show episode, host William Morgan spoke with Scott Feinberg, who leads AI and platform efforts at multi-strategy hedge fund Schonfeld, about how his team uses Kubernetes to rapidly scale an internal AI platform and manage the volume of code generated by AI agents.

From white glove to standardization at Schonfeld

This blog post was generated by AI from the interview transcript, with some editing.

Scott Feinberg joined Schonfeld to overhaul the firm's platform strategy. The previous platform, which relied on white glove service, simply couldn't scale. His first step was to standardize the cloud, DevOps, and SRE practices across the firm.

At a multi-strategy hedge fund, the nature of the work is unique: technologists are often developers or researchers working on quantitative analysis (“quants”) who demand a high degree of autonomy, since the programs they write essentially generate profit. Previously, the white glove approach meant new hires were asked open-ended questions, including what servers they wanted and what they wanted on them. This inevitably led to custom, unsupported, and disparate environments.

To address this, the focus shifted to establishing a standard set of tools and a robust platform to handle massive growth. The infrastructure quickly scaled, with Kubernetes clusters increasing from a handful to over 100. Standardization was often welcomed. Developers just needed a good, reliable platform so they could focus on adding their own proprietary expertise on top.

The evolution of SchonAI

Following the release of ChatGPT in late November 2022, Schonfeld began building an internal AI platform, called SchonAI. The initial prototype was a straightforward wrapper around the OpenAI API, deployed as a Lambda function and accessed through Slack. The high adoption rate quickly signaled that a full, extensible platform was necessary.

The team replatformed the application, moving it from a Lambda function to a production-ready FastAPI app running on Kubernetes. To better support this push, the SRE team was re-badged and dedicated to AI development. This made sense, as one of SRE's core functions is eliminating toil, something AI quickly proved adept at for the wider business.

To avoid vendor lock-in, the team initially developed a custom abstraction layer for the model APIs. But keeping up with the vendors' frequent API changes became cumbersome, so this custom layer was later replaced with the external framework LiteLLM.
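As a rough illustration of what a custom abstraction layer over model APIs can look like, here is a minimal Python sketch. It is not Schonfeld's implementation: `ChatProvider`, `EchoProvider`, `ModelRouter`, and the provider names are all hypothetical, and the adapter returns a canned string instead of calling a real vendor SDK.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ChatResult:
    """Vendor-neutral response shape returned to callers."""
    text: str
    provider: str


class ChatProvider(Protocol):
    """Interface each vendor adapter implements (hypothetical)."""
    name: str

    def complete(self, prompt: str) -> ChatResult: ...


class EchoProvider:
    """Stand-in adapter; a real one would call a vendor SDK here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> ChatResult:
        return ChatResult(text=f"[{self.name}] {prompt}", provider=self.name)


class ModelRouter:
    """Routes requests by provider name so callers never import a vendor SDK."""

    def __init__(self) -> None:
        self._providers: dict[str, ChatProvider] = {}

    def register(self, provider: ChatProvider) -> None:
        self._providers[provider.name] = provider

    def complete(self, provider_name: str, prompt: str) -> ChatResult:
        if provider_name not in self._providers:
            raise KeyError(f"unknown provider: {provider_name}")
        return self._providers[provider_name].complete(prompt)


router = ModelRouter()
router.register(EchoProvider("vendor_a"))
router.register(EchoProvider("vendor_b"))
result = router.complete("vendor_b", "Summarize this filing.")
print(result.text)
```

The maintenance cost described above lives in the adapters: every time a vendor adds or changes a request option, the matching adapter has to be updated, which is the work an external framework takes over.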

SchonAI's core functionality

SchonAI is now a full multi-tenant platform. Feinberg and team built out 150 tools (connectors to internal APIs) and 30 different bot configurations. Their initial focus has been on reducing toil, particularly in finance-specific workflows. For example, they are helping teams with document analysis, such as extracting data from massive documents like SEC filings or earnings transcripts. Another key area is legal processes, where they've built an automatic redliner that reviews legal contracts against a template and generates comparisons. Finally, their email integration feature summarizes bulk reports and newsletters. This helps investment staff capture the interesting research in their inbox without getting buried in the noise.

Security and data handling with entitlements

Security and entitlement management were a core focus when building the system, especially since it deals with sensitive data. The central principle driving the architecture is that the AI never has access to anything a user doesn't already have access to.

Data access is based on the user's entitlements in external systems and is enforced via an OAuth exchange. This means nothing ever gets sent to the AI that the user couldn't personally look up somewhere else.
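One way to picture this principle is the Python sketch below. It is illustrative only: `UserToken` stands in for whatever the OAuth exchange yields, and `DocumentStore`, the scope names, and the sample documents are assumptions, not details from the interview. The point is that the external system filters by the user's own token before anything reaches the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserToken:
    """Hypothetical result of an OAuth exchange: a token scoped to one user."""
    user: str
    scopes: frozenset[str]


@dataclass(frozen=True)
class Document:
    doc_id: str
    required_scope: str
    body: str


class DocumentStore:
    """Stand-in for an external system that enforces its own entitlements."""

    def __init__(self, docs: list[Document]) -> None:
        self._docs = docs

    def search(self, token: UserToken) -> list[Document]:
        # The store, not the AI layer, decides what this user may read.
        return [d for d in self._docs if d.required_scope in token.scopes]


def build_model_context(store: DocumentStore, token: UserToken) -> str:
    """Only documents fetched with the user's own token reach the prompt."""
    visible = store.search(token)
    return "\n".join(f"{d.doc_id}: {d.body}" for d in visible)


store = DocumentStore([
    Document("research-1", "research:read", "Sector outlook notes"),
    Document("legal-7", "legal:read", "Draft contract terms"),
])
analyst = UserToken(user="analyst", scopes=frozenset({"research:read"}))
context = build_model_context(store, analyst)
print(context)
```

Because the AI layer never holds broader credentials than the user, there is no separate permission model to keep in sync with the source systems.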

For handling large data volumes, like extensive email inboxes, the team implemented an approach similar to Retrieval-Augmented Generation (RAG). This technique, sometimes referred to as agentic search, uses smaller, more cost-effective subagents to look at subsections of data, pull out useful information, and then consolidate the final results.
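The fan-out and consolidate pattern described above can be sketched as follows. This is a simplified illustration under stated assumptions: `subagent_extract` uses keyword matching as a stand-in for a call to a smaller model, and the function names, chunk size, and sample inbox are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split the corpus into fixed-size subsections."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def subagent_extract(emails: list[str], query: str) -> list[str]:
    """Stand-in for a small, cheap model call: keep only relevant items.

    A real subagent would prompt a model; keyword matching is used here
    purely so the sketch runs without network access.
    """
    return [e for e in emails if query.lower() in e.lower()]


def consolidate(findings: list[list[str]]) -> list[str]:
    """Merge per-chunk findings, dropping duplicates but keeping order."""
    seen: set[str] = set()
    merged: list[str] = []
    for group in findings:
        for item in group:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def agentic_search(emails: list[str], query: str, chunk_size: int = 2) -> list[str]:
    chunks = chunk(emails, chunk_size)
    # Each chunk is handled independently, so the work can fan out.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(lambda c: subagent_extract(c, query), chunks))
    return consolidate(findings)


inbox = [
    "Broker note: semiconductor earnings preview",
    "Lunch menu for Friday",
    "Newsletter: earnings season highlights",
    "Office closure reminder",
    "Broker note: semiconductor earnings preview",
]
hits = agentic_search(inbox, "earnings")
print(hits)
```

Only the consolidated findings would then be passed to the larger model, which keeps the expensive call small regardless of inbox size.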

The platform engineering advantage

Investing in a Kubernetes-first infrastructure paid off when AI drove a massive surge in code volume. The established setup provides standardized pipelines and templates for both Java and Python applications, delivering a fully bootstrapped environment with secrets management and deployment capabilities built into the Kubernetes cluster. It also streamlines processes, essentially solving 90% of the common headaches for new developers.

These standardized setups are absolutely critical because the barrier to writing code is now so low. This means a lot more people are deploying code than ever before. The platform enforces guardrails, preventing the AI from creating non-standard deployment pipelines, for example, by stopping it from choosing ECS when EKS is the standard.
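A guardrail of this kind can be as simple as a policy check that runs before a deployment is accepted. The sketch below is a hypothetical illustration, not the platform's actual mechanism: `DeploymentRequest`, `check_deployment`, and the `APPROVED_TARGETS` set are assumed names, with EKS as the only approved target to match the example above.

```python
from dataclasses import dataclass

# Assumed platform policy: EKS is the only approved deployment target.
APPROVED_TARGETS = {"eks"}


@dataclass(frozen=True)
class DeploymentRequest:
    service: str
    target: str  # e.g. "eks" or "ecs"
    author: str  # human or agent identifier


def check_deployment(request: DeploymentRequest) -> list[str]:
    """Return a list of policy violations; an empty list means approved."""
    violations = []
    if request.target.lower() not in APPROVED_TARGETS:
        violations.append(
            f"{request.service}: target '{request.target}' is not approved; "
            f"use one of {sorted(APPROVED_TARGETS)}"
        )
    return violations


agent_request = DeploymentRequest(service="report-bot", target="ecs", author="ai-agent")
standard_request = DeploymentRequest(service="report-bot", target="eks", author="ai-agent")

print(check_deployment(agent_request))
print(check_deployment(standard_request))
```

The same check applies whether the author is a person or an agent, which is what lets one set of standards cover both.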

Cloud Developer Environments (CDEs), which were originally implemented to make human onboarding easier, have also proven indispensable for AI agents. The same infrastructure the team uses to bring on new human engineers is now used to onboard agents. An AI agent can now open a GitHub PR, spin up a CDE, execute its code, and push the changes for the PR.

Rethinking architectures for an AI-native world

There's a striking parallel between embracing AI and adopting serverless architectures. To maximize the value of AI, organizations need to fundamentally change their core approach.

Simply dropping AI into existing workflows won't cut it. Instead, the key is to envision a completely new process. The mindset should be "How would I build this now that I have this capability?" This is very similar to how you approach serverless-first architectures. It's all about defining your AI-native architecture and AI-native workflow.

The renaissance of building

Feinberg has an optimistic view of the long-term impact of AI on technology roles. Given the rapid rate of change, flexibility is absolutely critical. Building in a way that isn't tied down to one specific system will be increasingly valuable.

There has never been a better time to be a builder. AI isn't going to replace creativity or strategic direction; the real focus will shift toward the operators and executors who bring innovative ideas.

The most valued engineers will be those with a true product engineering mindset, the people who challenge assumptions, ask deep questions, and have strong opinions about what they are creating. The days of simply building a spec handed to you are coming to an end.

Looking ahead, we can expect to see more jobs, more code, and more applications than ever before. In this environment, platform engineering as a function is going to become significantly more valuable.

FAQ

What are the advantages of standardizing your Kubernetes platform?

Standardization provides a robust platform and standard tools, allowing organizations to scale more easily and letting developers focus on business logic instead of infrastructure maintenance.

What challenges do organizations face when building abstractions over LLM APIs to avoid vendor lock-in?

Organizations build custom abstraction layers, but vendors frequently change their APIs (e.g., Anthropic adds new flags often), making maintenance cumbersome. One solution is to replace the custom layers with external frameworks like LiteLLM.

What is the most important principle for managing security and sensitive data access when integrating AI platforms with enterprise systems?

AI should never access anything the user doesn't already have access to. Access should be enforced via an OAuth exchange based on the user's entitlements in external systems to ensure secure data handling.

How does defining an AI-native workflow maximize value vs. adding AI to existing business processes?

Maximizing AI value requires a fundamental change in an organization's core approach. Instead of dropping AI into existing workflows, organizations must wipe the slate clean and define an AI-native workflow based on the new capability, mirroring the shift to serverless-first thinking.