Platform Engineering's AI and Observability Playbook

A conversation with Kasper Nissen at KubeCon

This episode features Kasper Nissen, Principal Developer Advocate at Dash0, and is hosted by William Chia on the AI Kubernetes Show at KubeCon in Atlanta. They discuss how the state of AI is changing platform engineering, with a focus on observability, hallucinations, and the future of developer workflows.

Observability's unique position in the age of LLMs

This blog post was generated by AI from the transcript, with some editing.

Nissen shared his thoughts on the intersection of AI tooling and observability, noting that observability in general is uniquely positioned in the world of AI and LLMs. This is primarily due to the established data structures within the observability space.

Observability tools are key to fighting a core AI problem: hallucinations. The OpenTelemetry project helps here by providing semantic conventions, strong documentation, and well-structured, usable data that an LLM can work with. Feeding the model a higher volume of data and using agreed-upon, semantically meaningful data structures as context improves the quality of the output and reduces hallucinations.
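To illustrate what agreed-upon data structures buy an LLM, here is a minimal sketch in plain Python (no OpenTelemetry dependency). The attribute names follow OpenTelemetry's stable HTTP semantic conventions, but the `well_known_attributes` helper itself is hypothetical:

```python
# Hypothetical sketch: span attributes named per OpenTelemetry's HTTP
# semantic conventions, so any consumer (human or LLM) can interpret
# them without guessing what each field means.
SEMCONV_HTTP_KEYS = {
    "http.request.method",
    "http.response.status_code",
    "url.path",
    "server.address",
}

def well_known_attributes(span_attributes: dict) -> dict:
    """Return only the attributes that follow the agreed-upon naming,
    i.e. the portion of the telemetry an LLM can rely on as context."""
    return {k: v for k, v in span_attributes.items() if k in SEMCONV_HTTP_KEYS}

span = {
    "http.request.method": "GET",
    "http.response.status_code": 503,
    "url.path": "/checkout",
    "x-team-internal-flag": "abc",  # ad-hoc key with no shared meaning
}
print(well_known_attributes(span))
```

The point is the naming contract: because `http.response.status_code` means the same thing in every service, a model reasoning over telemetry doesn't have to guess what the data represents.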

This combination of rich, structured data is driving a change in how developers interact with telemetry. Nissen sees a shift where people are "moving away from looking at dashboards to asking things in a natural language." This means a user can simply ask, "Hey, is my service up and running?" and get a response that details any issues, the impact, and what needs to be done about it. 

Platform engineering as an enabler

Platform engineering plays an important role for developers, such as React or Java developers, who might not focus heavily on observability.

The platform team's main job is to ensure feature developers have a solid foundation for telemetry. As Nissen puts it, all the data and telemetry should be ready out of the box and well-structured, so developers don't need to be telemetry experts. It should simply be a good default provided by the platform.
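A minimal sketch of that "good default" idea in plain Python: the platform team ships baseline resource attributes and developer telemetry is layered on top. The keys mirror OpenTelemetry resource conventions, but the `with_platform_defaults` helper is a hypothetical illustration, not a real platform API:

```python
# Hypothetical sketch: the platform team ships default resource
# attributes, and developer-supplied telemetry is merged on top, so a
# feature developer never has to configure these fields themselves.
PLATFORM_DEFAULTS = {
    "service.namespace": "shop",           # set by the platform team
    "deployment.environment.name": "prod",
    "k8s.cluster.name": "east-1",
}

def with_platform_defaults(dev_attributes: dict) -> dict:
    """Merge developer-supplied attributes over platform defaults;
    developer values win on conflict."""
    merged = dict(PLATFORM_DEFAULTS)
    merged.update(dev_attributes)
    return merged

# A developer only declares what is unique to their service:
print(with_platform_defaults({"service.name": "checkout"}))
```

In practice the same layering is usually done by the platform's instrumentation libraries or an OpenTelemetry Collector, but the principle is identical: defaults from the platform, specifics from the developer.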

This partnership, which Nissen calls the age-old partnership of dev and ops, lets the platform engineering team expose observability data from runtime applications. By combining this prepared data with AI and semantic conventions, developers can start talking to the agent using natural language for troubleshooting, which is much easier than scrolling through all the different traces, logs, and metrics.

The power of AI troubleshooting  

AI excels at ingesting huge amounts of information and quickly filtering it to surface what would take a person much longer to find. That capability makes AI-based troubleshooting a powerful tool.

Nissen shared a great customer example involving Dash0's product, Agent0: In one instance, Kafka experts worked for about an hour trying to solve an issue without any luck. When they finally asked Agent0 what was wrong, the tool provided the answer in two minutes—the problem was a partition that simply needed a restart.

Incident response is another critical area where AI troubleshooting saves a lot of time, especially when incidents happen at 3 a.m. The main goal is to get a solution that essentially gives you a straight answer: This is what is happening, and this is the actual impact. 

Looking ahead, we might see multiple AI agents working together to pinpoint and even fix an issue, meaning an on-call responder could simply merge a pull request to resolve the incident.

Craftsmanship and the AI trade-off

While AI clearly helps with acceleration and automation, it's important to remember that AI boosts development but doesn't take the place of real craftsmanship. Nissen brought up a major industry concern: if new engineers are only taught to use a large language model (LLM), they might miss out on developing essential critical thinking and craftsmanship skills. 

Balancing productivity and knowledge

There's a delicate balance between depending on a coding agent and keeping up your own expertise. When using a coding agent, developers need to be very methodical, perhaps by utilizing test-driven development (writing a test, watching it fail, and then making it pass). Fundamentally, you need to grasp how to build good software to create something that is actually usable and maintainable over time.
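A tiny illustration of that test-first loop (a hypothetical example; `slugify` is a made-up function, not from any library): the developer writes the test before asking the agent for an implementation, so the agent's output is verified rather than trusted.

```python
# Step 1: write the test first. Run it while slugify() is still missing
# or stubbed and watch it fail; that failure proves the test can fail.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces  everywhere ") == "spaces-everywhere"

# Step 2: write (or accept from a coding agent) just enough
# implementation to make the test pass.
import re

def slugify(text: str) -> str:
    """Lowercase, strip punctuation, and join words with hyphens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)

test_slugify()  # passes now; the agent's output is checked, not trusted
```

Keeping this loop in the developer's hands, rather than the agent's, is one concrete way to preserve the judgment the paragraph above is worried about losing.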

Nissen argued that easing the learning process too much by reducing the initial struggle, pain, and failure could ultimately undermine your development as a craftsperson and prevent you from achieving that necessary level of expertise.

Nissen and Chia agreed that while AI is an excellent tool for eliminating repetitive and trivial tasks, professionals should actively seek out other challenges, like certifications, to keep their minds sharp and their knowledge relevant.

Organizational advice: Platform as a product

When adopting AI, organizations should really look to the principles of platform engineering. Nissen’s final advice centers on treating the platform and its components, including AI, as a product and considering your developers as your customers. This means providing good support, proper documentation, and perhaps even internal advocacy.

The core mindset of platform engineering—optimizing routines, structuring for the goal, and serving internal customers—is the exact mindset needed for AI adoption. The platform organization should be responsible for providing these features or capabilities as platform products.

Platform engineering organizations must focus on building in compliance, safeguards, and defaults around using LLMs. They should also provide scorecards for the different models and evaluate them so that the organization can safely adopt these tools.

FAQ

Why is observability uniquely positioned to work with AI and LLMs?

Observability is uniquely positioned because of projects like OpenTelemetry, which has semantic conventions, excellent documentation, and specifications. This means there's a lot of high-quality, usable data out there for an LLM to work with, which really helps cut down on hallucinations.

How is the interaction with observability data expected to change?

The way we interact with these systems is changing. Instead of poring over dashboards, developers will soon be asking questions in plain language, like, "Is my service up and running?"

What is the role of platform engineers in integrating AI and observability?

Platform engineers can really help out by providing a solid foundation for our developers. This means having all the data and telemetry ready out-of-the-box, structured and good to go, so developers don't have to be telemetry experts.

What is a key capability of AI in troubleshooting?

AI is tremendously good at ingesting large amounts of information to filter what would take a human a lot longer to get to, enabling faster and more efficient troubleshooting.

What is the primary risk of relying on AI in software engineering?

Be careful: even though AI accelerates development, it's not a replacement for real craftsmanship. If engineers lean on AI too heavily, they risk losing the critical thinking and hands-on skill that define good engineering.
