Observability Summit North America 2026: Full Schedule

May 21-22, 2026

The Sched app allows you to build your schedule, but is not a substitute for your event registration. You must be registered for Observability Summit North America 2026.

Please note: This schedule is displayed in Central Daylight Time (UTC -5).

The schedule is subject to change.

10:20am CDT

[CANCELLATION] Scaling a Proprietary-to-OpenTelemetry Migration With AI-Assisted, Spec-Driven Workflows - Ying Mo & Paras Kampasi, IBM

Thursday May 21, 2026 10:20am - 10:45am CDT

Level One | Ballroom B

This talk presents a practical methodology for migrating a large proprietary observability platform to an OpenTelemetry-native architecture, using a GenAI-assisted workflow paired with a robust spec-driven strategy. Faced with hundreds of custom Java-based sensors, the engineering team designed a spec-driven conversion process that leverages GenAI to extract specifications, generate unit tests, and assist in implementing Go-based OpenTelemetry receivers. Each stage incorporates human review and test feedback loops to address the reliability limitations of GenAI and ensure functional correctness.

Additionally, a data-driven feasibility evaluation was conducted prior to large-scale conversion, where defined task types were benchmarked with and without GenAI to quantify effort savings and highlight where GenAI provides the greatest value.

Attendees will learn a reproducible workflow for large-scale migrations from proprietary to OpenTelemetry, how to pair GenAI with automated testing to manage risk, and insights on where GenAI accelerates real-world engineering tasks without compromising quality.

Speakers

Ying Mo

Senior Software Engineer, IBM

Ying Mo is a Senior Software Engineer at IBM, recently working on IBM Instana, an observability platform, where he leads the engineering team transforming the product to be OpenTelemetry-native. He is always enthusiastic about bringing innovative ideas into the product by leveraging open source technology...

Paras Kampasi

Technical Product Manager, IBM

I work at the intersection of OpenTelemetry, observability, and modern cloud-native practices, helping teams make complex systems understandable and reliable. I speak and write about practical ways to apply open standards, close feedback loops between SREs and product teams, and turn...

End-User Case Studies

Content Experience Level Advanced

10:50am CDT

OpenTelemetry GenAI in Practice: What the Spec Says Vs. What You Actually See - Zach Groves, Datadog

Thursday May 21, 2026 10:50am - 11:15am CDT

Level One | Ballroom B

OpenTelemetry’s GenAI semantic conventions are evolving quickly. Version 1.37 marked a major shift in how LLM behavior is expressed using standard spans and attributes. While later releases refined and clarified the spec, real-world adoption remains uneven, and “GenAI-compatible” can mean very different things across the ecosystem.

In this talk, I’ll share hands-on lessons from implementing and validating GenAI support in real emitters, including close collaboration with Strands. Implementing the 1.37 spec on both sides surfaced semantic ambiguities that only became clear in practice and ultimately led to stronger implementations.

I’ll also outline the current GenAI instrumentation landscape: Strands emitting 1.37+ compliant spans; OpenLLMetry, which mixes newer conventions with legacy and custom attributes; and OpenInference, which claims OpenTelemetry compatibility but does not emit GenAI semantic convention attributes.

Finally, I’ll show how these gaps surface in practice—teams believing they emit 1.37-compliant telemetry but sending pre-1.37 or non-spec data—and briefly touch on transition guidance like OTEL_SEMCONV_STABILITY_OPT_IN.

Speakers

Zach Groves

Software Engineer II, Datadog

Zach learned to code at Barnes & Noble during rest days between climbing while living in a van. He spent 3 years on the support team before moving over to the engineering team at Datadog (3 years on APM and 1 on LLM Obs). He currently works on LLM Obs OTel compatibility. He likes scuba...

AI and MCP in Observability

Content Experience Level Intermediate

11:20am CDT

AI-Powered Root Cause Analysis at Scale: From Theory To Production Lessons From Nubank's 120M+ Cus - Letícia Mota & Yevgeny Gladun, Nubank

Thursday May 21, 2026 11:20am - 11:45am CDT

Level One | Ballroom B

This session presents an AI-powered SRE Agent designed to autonomously orchestrate complex, multi-source investigations by querying internal observability providers and knowledge bases.
A primary focus is the "Data Volume Problem." Modern observability systems generate terabytes of metrics and logs daily; at Nubank’s scale, the Prometheus MCP alone has more than 23,000 metrics available, while log queries can span billions of rows. The team overcame LLM context limits through on-premises data filtering, intelligent summarization, and selective context assembly. This architecture utilizes "Expert Guides" to reduce 23,000 raw metrics to approximately 14 relevant data points before LLM processing.
The talk covers multi-source orchestration using the Model Context Protocol (MCP) for pluggable tool discovery, allowing the AI to progressively load and correlate only the observability sources relevant to an investigation.
The platform enables the delivery of expert instructions for any specific scenario through targeted, versioned prompts. This transformation allows the platform to scale across the enterprise, performing virtually any investigative task beyond its original root cause analysis mission.

Speakers

Letícia Mota

Nubank

Letícia is a Product Manager at Nubank with 8+ years of experience. After working with data & image recognition products, she now works with Resilience and Troubleshooting products, including a DR Test Platform and an SRE Agent for Nubank.

Yevgeny Gladun

Staff Runtime Platforms Engineer, Nubank

Yevgeny Gladun is a Staff Engineer at Nubank with nearly 20 years of software development experience. Over his four-year tenure at Nubank, he transitioned from scaling Data ETL pipelines to deep architectural analysis of microservice interactions. As part of the Runtime Platforms...

AI and MCP in Observability

Content Experience Level Advanced

11:50am CDT

⚡ Lightning Talk: Summarizing the Noise: LLM Observability With Open Data Hub, vLLM, KServe and Prometheus - Twinkll Sisodia, Red Hat

Thursday May 21, 2026 11:50am - 12:00pm CDT

Level One | Ballroom B

As large language models (LLMs) move into production, raw metrics alone aren’t enough. This talk presents an open-source AI observability solution built on Open Data Hub (ODH) that deploys LLMs using vLLM and KServe, scrapes inference metrics using Prometheus, and feeds them into a summarization model to generate actionable insights. We’ll demonstrate a working UI that translates low-level metrics like latency, GPU usage, and token throughput into human-readable summaries—giving platform teams an intelligent way to monitor LLMs at scale. No dashboards to interpret—just straight answers from your models about your models.

Speakers

Twinkll Sisodia

Senior Software Engineer, OpenShift AI (Red Hat), Red Hat

Twinkll Sisodia is a Senior Software Engineer at Red Hat, building scalable, production-ready GenAI solutions. She works with partners to integrate their technology into OpenShift AI and contributes to open source in AI observability, platform optimization, and sustainability. Her...

AI and MCP in Observability

Content Experience Level Intermediate

12:05pm CDT

⚡ Lightning Talk: From Collector To Terminal: A Better Way To See Your OpenTelemetry Logs - Jon Reeve, ControlTheory

Thursday May 21, 2026 12:05pm - 12:15pm CDT

Level One | Ballroom B

The OpenTelemetry Collector is powerful, but the "debug exporter" only shows raw output. What if you could see your OpenTelemetry logs - with structure, filters, and context - right in your terminal?

This talk introduces Gonzo, an open-source, OTLP-native terminal UI that visualizes logs from the Collector or any OTLP-capable source in real time. Learn how to validate both source instrumentation and Collector pipelines - including components like filelog, k8sattributes, and transform - without a backend.

Whether debugging, testing configs, or teaching OTel, Gonzo offers a faster, clearer way to understand your telemetry as it flows.

Key Takeaways:
- Validate source instrumentation and Collector pipelines end-to-end
- See enriched OTel logs with structure and context in the terminal
- Debug and iterate on OTel configs faster - no backend required

Speakers

Jon Reeve

CPO and Co-founder, ControlTheory

Jonathan Reeve is a co-founder of ControlTheory, where he helps teams take control of their observability data with smarter, more efficient telemetry pipelines. A passionate advocate for OpenTelemetry and open standards, Jonathan focuses on making observability more scalable, cost-effective...

The Future of Open Source Observability

Content Experience Level Beginner

1:25pm CDT

Unified End-to-End Observability: How Comcast Generates SpanMetrics at Enterprise Scale - Raghu Vamshi Challa, Comcast

Thursday May 21, 2026 1:25pm - 1:50pm CDT

Level One | Ballroom B

Enterprises often struggle with the "black box" nature of proprietary APM tools and the high cost of distributed tracing at scale. In this session, we will demonstrate how Comcast tackled this challenge by migrating 350 critical applications from AppDynamics to a cloud-native OpenTelemetry (OTel) stack, achieving a truly unified end-to-end observability experience.

We will pull back the curtain on the architecture that powers this migration. Specifically, we will show how we leveraged the OpenTelemetry Collector to generate Request, Error, and Duration (R.E.D.) metrics from trace data using the SpanMetrics connector. A key highlight will be our unique deployment of Conduit, which serves as a resilient transport layer to ensure data integrity and effective load balancing in a high-volume environment.

Attendees will leave with a blueprint for breaking free from APM vendor lock-in. To help the community fast-track this transition, we will also be sharing and walking through our reusable, battle-tested Grafana dashboards that can be leveraged by any enterprise.

Speakers

Raghu Challa

Comcast Engineer 6, Software Development & Engineering - Backend Engineering, Comcast

Raghu is an Observability Lead at Comcast, driving the enterprise-wide migration from legacy APM tools to OpenTelemetry. He specializes in designing high-scale telemetry pipelines that process massive volumes of trace data. Raghu is passionate about democratizing observability and...

End-User Case Studies

Content Experience Level Advanced

1:55pm CDT

Policy as Code Meets OpenTelemetry: The Next Frontier of Observability - Christopher Voisey, EnforceAuth

Thursday May 21, 2026 1:55pm - 2:20pm CDT

Level One | Ballroom B

Modern observability stacks excel at capturing signals about infrastructure health, application performance, and request flows. Yet one critical class of decisions remains largely invisible: authorization.
In distributed systems, authorization decisions increasingly determine not only whether an action succeeds, but also whether data is accessed, tools are invoked, or automated agents are allowed to act. These decisions are often evaluated outside application code using Policy as Code frameworks, yet their outcomes are rarely observable in a structured, privacy-preserving way.
In this session, we explore how Policy as Code, Open Policy Agent, and the OpenTelemetry project can be combined to treat authorization decisions as observable events. We examine what it means to observe a decision without logging sensitive inputs, how decision structure differs from traditional metrics and traces, and why decision-level observability is becoming essential in cloud-native and AI-driven systems.
Attendees will leave with a conceptual framework for thinking about authorization as telemetry, and a clearer understanding of where observability is heading as systems become more autonomous and policy-driven.

Speakers

Christopher Voisey

Field CTO, EnforceAuth

Chris is a technology leader with 20+ years of experience designing and delivering secure, cloud-native systems. He has led engineering and solutions teams across startups and enterprises, helping organizations adopt policy-as-code, zero-trust architectures, and modern observability...

CNCF Observability Projects

Content Experience Level Intermediate

2:25pm CDT

What's the Best Way To Reduce Storage Requirements Without Losing Insights? Push AI To the Edge! - Alex Degitz, ElastiFlow Inc

Thursday May 21, 2026 2:25pm - 2:50pm CDT

Level One | Ballroom B

During this session, we’ll discuss ElastiFlow’s Edge Observability strategy, which includes an OTel-native edge processing node with local DuckDB storage for all OTel signals and a model-agnostic agentic AI system (we often run it with OpenAI’s gpt-oss-20b) that exposes its tools through an MCP server.

Instead of just forwarding OTel signals from various Edge collectors, the signals are analyzed and routing decisions are made. Alerts are sent to the Observability Platform right away, while logs are stored locally and analyzed for patterns. Instead of forwarding all logs, we might only care about a few conditions of interest, often correlated with other signals, and send these to the Observability Platform, while less interesting logs can be aggressively aggregated.

With this approach, we were able to reduce the storage and ingest cost of Observability Platforms by half while actually decreasing the mean time to insight.

Speakers

Alex Degitz

VP of Product, ElastiFlow Inc

Alex has been building Automation and Observability products for 10+ years and has been advocating to break down silos between operations teams ever since.

AI and MCP in Observability

Content Experience Level Intermediate

2:55pm CDT

One Pane to Rule Them All: Uniting the Prometheus Community with OpenSearch Dashboards, Logs, and Trace - Anirudha Jadhav and Kevin Fallis, AWS

Thursday May 21, 2026 2:55pm - 3:20pm CDT

Level One | Ballroom B

As infrastructure scales across regions and clusters, Prometheus deployments fragment into isolated islands of metrics—disconnected from logs, traces, and the dashboards operators actually live in.

This talk is for the Prometheus community. If you've wrestled with federation sprawl, alert duplication, or the gap between your metrics and the rest of your observability story, this session is for you.

We'll demonstrate how OpenSearch's distributed data source support lets multiple Prometheus clusters coexist natively alongside logs and traces in a single unified interface, with no data migration and no parallel stacks.

You'll learn:

- Unified querying across Prometheus clusters
- SLO tracking wired directly into dashboards
- Application management that finally connects the signals your teams have been operating in isolation

This is about completing the observability loop the Prometheus community has always needed: open, composable, and community-driven.

Speakers

Kevin Fallis

Principal Senior Solutions Architect, Amazon Web Services

Kevin Fallis is a seasoned leader, architect, and developer with experience across many industry verticals and disciplines such as agriculture, ad tech, financial services, networking, security, telecommunications and of course search technologies. His passion helps others leverage...

Anirudha Jadhav

Sr. Engineering Leader, Amazon Web Services

Anirudha is a Senior Manager, Software Development at Amazon Web Services (AWS), leading development of insight engines and visualization platforms for the OpenSearch Project. He specializes in distributed systems, data analytics, and search technologies, including architecting one...

AI and MCP in Observability

Content Experience Level Intermediate

3:40pm CDT

From Data Dumps To Smart Context: Building MCP Servers That AI Can Actually Use - Thomas Johnson, Multiplayer

Thursday May 21, 2026 3:40pm - 4:05pm CDT

Level One | Ballroom B

Most MCP servers fail the same way: they expose observability data without understanding what AI models need to reason effectively. The result? Tools that overwhelm models with metrics, miss critical context, and introduce unnecessary security exposure.

At Multiplayer, we built an MCP server to give AI coding assistants access not just to production telemetry but to full stack data: frontend screens and data, backend traces, logs, and request/response content and headers. What we learned challenges the "more data is better" assumption that drives most integrations.

This talk shares the hard lessons from moving an MCP server into production. You'll learn why filtered, intent-driven context outperforms comprehensive data access, how to design tools that align with developer workflows rather than API surfaces, and the security trade-offs that matter when LLMs query your observability stack.

We'll cover practical design patterns for MCP servers in the observability space: scoping data by blast radius, surfacing relationships over raw metrics, and handling authentication without compromising developer experience. This talk is about what works when AI meets production systems.

Speakers

Thomas Johnson

CTO and Co-founder, Multiplayer

Co-founder and CTO at Multiplayer, with 20+ years of experience as a backend developer building large-scale distributed software (and robots!)

AI and MCP in Observability

Content Experience Level Any

4:10pm CDT

Why Are Your AI’s Decisions Hard To Explain: Trace Every Decision With Agentic AI Observability - Dhiraj Kumar Jain & Vikash Agrawal, Amazon Web Services

Thursday May 21, 2026 4:10pm - 4:35pm CDT

Level One | Ballroom B

Agentic AI systems represent a fundamental shift in software architecture: autonomous agents reason, plan, invoke tools, and orchestrate complex workflows without deterministic control flow. This breaks many assumptions behind traditional observability.

When agents independently make decisions, failures no longer follow a single request path. How do you debug emergent behavior across multiple agent steps? How do you analyze and control token-driven costs? How do you ensure reliability when outputs are non-deterministic?

This session explores why observability is a first-class requirement in the agentic AI era and how OpenSearch can act as the analytical backbone for understanding autonomous AI systems in production. We will cover practical techniques for instrumenting agent workflows with OpenTelemetry and indexing traces, logs, metrics, and AI decision artifacts into OpenSearch for deep correlation and analysis.

Attendees will learn battle-tested patterns for tracing agent reasoning and tool usage, investigating failures and hallucinations, monitoring latency and cost signals, and building dashboards that make agentic AI systems transparent, debuggable, and production-ready.

Speakers

Dhiraj Kumar Jain

Sr. Software Engineer, AWS

Dhiraj is a software engineer at Amazon Web Services (AWS), where he’s working on building a next-gen log analytics platform with CloudWatch Logs, helping scale it to handle vast amounts of data. Before this, he worked on Amazon AuroraDB.

A distributed systems enthusiast, Dhiraj loves diving into complex, large-scale problems and building software for the next billion users. When he’s not scaling systems, you’ll find him at tech meetups and hackathons...

Vikash Agrawal

Vikash Agrawal, Amazon Web Services

Vikash Agrawal is a Software Development Manager at Amazon Web Services (AWS), leading initiatives in the AWS CloudWatch team. Previously, he played a key role in developing Amazon Q Developer, a Generative AI-powered assistant for developers. With over a decade of experience in software...

AI and MCP in Observability

Content Experience Level Intermediate

10:20am CDT

Observing the Observers: Bringing OpenTelemetry to Autonomous AI Agents - Abdel Fane, OpenA2A

Friday May 22, 2026 10:20am - 10:45am CDT

Level One | Ballroom B

Traditional observability assumes humans operate systems. AI agents break that model—they make autonomous decisions, execute operations without approval, and drift in capability over time. Yet most organizations have zero observability into their AI agent infrastructure.

When developers spin up MCP servers through Claude Desktop or Cursor, security and ops teams are blind. No metrics. No traces. No logs. Just autonomous agents accessing databases, calling APIs, and modifying production systems—completely outside your observability stack.

This talk explores how to instrument AI agents and MCP servers using familiar CNCF tools. We'll cover:

• Why traditional APM fails for autonomous agents (no request/response, emergent behavior, capability drift)
• Detecting anomalies in agent behavior (statistical baselines vs. ML-driven detection)
• Correlating agent actions to business outcomes

You'll see working demos of agent observability plus open-source code for instrumenting LangChain, CrewAI, and custom agents.

Walk away with patterns to extend your existing observability stack to AI agents before they become your biggest blind spot.

Speakers

Abdel Fane

CEO & Founder, OpenA2A

Abdel is a cybersecurity architect with 17+ years of experience securing enterprise environments across healthcare, finance, and government sectors. He has led security initiatives at Grail, Booz Allen Hamilton, Protiviti, and Allstate, specializing in cloud security & DevSecOps.

AI and MCP in Observability

Content Experience Level Intermediate

10:50am CDT

Don't Let Users Find Your Outages: Synthetic Monitoring for Kubernetes Platforms - Kate Agnew, Marriott & David Norton, Platformers

Friday May 22, 2026 10:50am - 11:15am CDT

Level One | Ballroom B

No platform owner wants to be told their platform is down by a user. A core responsibility of the platform operating model is ensuring a reliable platform for the organization. In practice, it isn't always easy to detect when things are broken, especially when it falls outside of the traditional metrics coverage.

In our work, we adopted synthetic monitoring using Kuberhealthy, a CNCF project, to gain better visibility into whether the Kubernetes platform is operating as a user would expect. Synthetic monitoring allows us to replicate application developer workflows to validate end-to-end functionality of the platform.

Come and learn about implementing synthetics, how to not break things, and broadly how to improve stability with Kubernetes using synthetic monitoring.

Speakers

Kate Agnew

Sr. Director of Platform Engineering, Marriott

Kate Agnew is a Sr Director of Platform Engineering at Marriott, where she manages the enterprise Kubernetes and Service Mesh platform. Prior to Marriott, she held a similar platform leadership role at Optum, and has had multiple other leadership and technology positions at smaller...

David Norton

President and Principal Consultant, Platformers

David Norton is a founder and principal consultant at Platformers. He has been working in cloud platform engineering since 2016. Prior to that, he worked as an application developer.

David lives in St. Louis Park, MN, and usually enjoys spending time with his family, playing pickleball, reading, and fishing...

Integrating Observability into DevOps Practices

Content Experience Level Beginner

11:20am CDT

Applying Observability to the Internet of Living Things (IoLT) - Sophia Solomon, Elastic

Friday May 22, 2026 11:20am - 11:45am CDT

Level One | Ballroom B

We see IoT everywhere, from smart fridges to air quality sensors, but what about applying observability to billions of living things? Introducing Meowy, my virtual cat with a full observability stack. In this talk, I'll build a digital pet from scratch in Go, instrument it with OpenTelemetry, and visualize its "life" in real time, live-tracking its habits, moods, and (attempted) escapes.

I'll show how to create a RESTful "cat API," instrument it for tracing, and set up alerting with the ELK stack and Kibana visualizations. We'll cover observability basics (logs, metrics, and traces), how to apply them to our digital pet, how to structure telemetry data for "living" systems using AI tools, and how to query all our cat stats with an MCP-connected AI agent. By the end, we'll calculate the average MPH (meows per hour) and expand our understanding of observability applications. No prior observability experience required—just some Go basics and a love for any living thing, from feline to fungal!

Speakers

Sophia Solomon

Elastic

Integrating Observability into DevOps Practices

Content Experience Level Any

12:05pm CDT

⚡ Lightning Talk: A Drop-in System To Accelerate Metrics Observability by 100x Using Sketch-based Approximation - Milind Srivastava, Carnegie Mellon University

Friday May 22, 2026 12:05pm - 12:15pm CDT

Level One | Ballroom B

Metrics observability workloads are growing in scale, resulting in (a) higher cost to operate observability infrastructure, and (b) slower query latencies.

The usual approaches to deal with these are:
- sample data
- roll up data
- reduce data cardinality
- send fewer queries

All of these approaches compromise the coverage of the observability infrastructure and can result in missing important anomalous behavior.

Through our research, we have developed a radically new approach to achieve large scale, low cost, and low latency without compromising the coverage of the observability infrastructure.

Our system reduces querying cost and latency by 100x by using two key techniques:
- streaming precomputation
- sketch-based approximation

Our system is developed as a drop-in accelerator to an existing Prometheus-Grafana stack, without modifying Prometheus or Grafana.

We will release an open-source prototype of this system in Q1 2026.

Speakers

Milind Srivastava

PhD Student, Carnegie Mellon University

Milind Srivastava is a PhD student at Carnegie Mellon University working on re-imagining the design of data analytics pipelines using semantic-preserving summarization to drastically reduce costs and increase performance. He is interested in seeing his research get adopted by industry...

Scalability Challenges and Solutions

Content Experience Level Advanced

12:15pm CDT

[Rescheduled] ⚡ Lightning Talk: GPU-Scanner: Extending CNCF Observability for Multi-GPU AI Workloads - Ritika Gupta, Oracle

Friday May 22, 2026 12:15pm - 12:25pm CDT

Level One | Ballroom B

As large language models scale across hundreds of GPUs and multi-node AI systems, they’ve become a major operational challenge for infrastructure engineers. Traditional observability tools stop at the node level, leaving GPU health and utilization invisible until workloads fail or budgets spike. Imagine a 25-day training job failing on day 23 because one GPU silently throttled!

In this session, we’ll explore GPU-Scanner, an open-source observability extension for Kubernetes GPU clusters. Built to integrate with Prometheus & OpenTelemetry, GPU-Scanner adds both active and passive GPU health checks, capturing throughput, TFLOPs, memory diagnostics, thermal consistency, and long-run stability metrics.

We’ll demo real-world failure modes, like catching a GPU “off the bus” or detecting thermal throttling, and show how alerts flow into your existing observability stack. Leave with a practical playbook to proactively validate GPU clusters and maximize reliability and utilization.

Speakers

Ritika Gupta

Senior SWE - AI Incubations, Oracle

With a knack for transforming chaos into seamless solutions, Ritika Gupta creates technologies that bind Kubernetes, containers, and the cloud ecosystem using cloud-native tooling. She actively contributes to Kubernetes as a sig-windows member. Her expertise spans container orchestration...

Community-Driven Development in Observability

Content Experience Level Intermediate

1:25pm CDT

eBPF Application Instrumentation for Java: Challenges, Design, and Real-World Examples - Endre Sara, Causely, Inc & Stephen Lang, Grafana Labs

Friday May 22, 2026 1:25pm - 1:50pm CDT

Level One | Ballroom B

Java is one of the most widely used languages for enterprise applications. Frameworks such as Spring Boot and Quarkus make observability straightforward when the OpenTelemetry Java agent can be injected.

In many production environments, however, modifying application code or JVM startup parameters is not possible. In these cases, eBPF-based instrumentation enables observability without code changes, but applying eBPF to Java is challenging. JVM abstraction layers, differences across JDK versions, and the diversity of frameworks and libraries complicate generic instrumentation. The problem becomes even harder when applications rely on TLS-encrypted communication such as HTTPS, gRPC, databases, and messaging systems, where payloads are opaque.

This talk explains how the OpenTelemetry eBPF Instrumentation (OBI) project addresses these challenges, covering key design decisions, trade-offs, and current limitations. The discussion is grounded in real-world examples, including Spring Boot services using HTTPS and gRPC, and a Quarkus application with TLS-encrypted PostgreSQL and Kafka, showing what is possible today with agentless Java observability using eBPF.

Speakers

Stephen Lang

Staff Software Engineer, Grafana Labs

Stephen is a Staff Software Engineer on Grafana's Beyla team and an approver for the OpenTelemetry eBPF Instrumentation (OBI) project.

Endre Sara

Co-Founder, Causely, Inc

Endre is a Co-Founder of Causely, where he’s building the IT industry’s first causal reasoning. Previously, Endre was VP of Advanced Engineering at Turbonomic. Prior to Turbonomic, Endre was a VP at Goldman Sachs. Endre holds an M.E. in Electrical Engineering from the Technical...

CNCF Observability Projects

Content Experience Level Beginner

1:55pm CDT

Breaking Free from Vendor Lock-In: Nubank DIY Observability Success - Diego Rocha, AWS & Otavio Valadares, Nubank

Friday May 22, 2026 1:55pm - 2:20pm CDT

Level One | Ballroom B

Nubank is the largest digital bank outside Asia, operating in Brazil, Mexico, and Colombia, and serving over 120 million customers. As a cloud-native company, Nubank's distributed digital environment relies on more than 4,000 microservices, generating nearly 1 petabyte of monitoring logs daily. To better manage this volume and reduce operational costs by over 50%, Nubank recently transitioned from an external vendor to an in-house log platform. In this talk, we'll share the platform architecture and the challenges encountered during the migration journey.

Speakers

Diego Rocha

Sr. Solutions Architect, AWS

Otavio Valadares

Lead Software Engineer, Nubank

Scalability Challenges and Solutions

Content Experience Level Intermediate

2:25pm CDT

The Legend of Config: Breath of the Cluster - Henrik Rexed, Dynatrace

Friday May 22, 2026 2:25pm - 2:50pm CDT

Level One | Ballroom B

Configuring Ingress, Gateway API, or service meshes in Kubernetes can feel like exploring an open world without a map: one wrong turn, and traffic vanishes. In this session, we’ll explore how to detect and prevent misconfigurations using OpenTelemetry, eBPF-based instrumentation (OBI), and enriched logs from service meshes and ingress controllers. Like a hero collecting tools to unlock new areas, we’ll show how to identify relevant data sources, parse and process their output, and apply common correlation rules to understand the impact of configuration changes. We’ll demonstrate how these techniques can be applied across observability platforms to reduce tool sprawl and improve operational efficiency. Attendees will leave with a practical, backend-agnostic approach to building a multi-source observability strategy for Kubernetes networking.

Speakers

Henrik Rexed

Cloud Native advocate & CNCF Ambassador, Dynatrace

Henrik is a Cloud Native Advocate at Dynatrace and a CNCF Ambassador. Prior to Dynatrace, Henrik worked for more than 15 years as a Performance Engineer. Henrik is also one of the organizers of the WOPR and KCD Austria conferences and the owner of the YouTube channel Isit... Read More →

Community-Driven Development in Observability

Content Experience Level Intermediate

2:55pm CDT

Let Them Eat Bugs: Practical Showcase of Agentic Issue Resolution - May Walter, Hud

Friday May 22, 2026 2:55pm - 3:20pm CDT

Level One | Ballroom B

What if we could move a big chunk of bug fixing and solving production issues to agentic AI? That would be so cool. In this talk we will go through the end-to-end process of setting up a background agentic workflow that detects production errors, finds their root causes, assesses the right solution, and opens a PR, so you wake up in the morning to tasks almost fully completed for you by your loyal agent.

Together we will dive into the entire process of setting up this system that is currently running in real production environments - understanding the different tools, the infra challenges, the agentic accuracy spectrum, and more…

Speakers

May Walter

Co-Founder & CTO, Hud

May Walter is a software engineer, researcher, entrepreneur and serial CTO. She is currently Co-Founder and CTO of Hud, building a Runtime Code Sensor to bridge the gap between coding agents and production. Before Hud she was a founding CTO at Santa, and CTO at Bond (acquired by REEF... Read More →

AI and MCP in Observability

Content Experience Level Intermediate

3:40pm CDT

Inside the Telemetry Data Plane: Constraints, Tradeoffs, and Scale - Eduardo Silva & José Lecaros, Chronosphere | A Palo Alto Networks Company

Friday May 22, 2026 3:40pm - 4:05pm CDT

Level One | Ballroom B

Modern telemetry systems often struggle not because of missing features, but because of hidden constraints in how data is buffered, scheduled, and moved through the system. This session explores the practical realities of building a telemetry data plane that must operate under extreme throughput, tight latency budgets, and strict resource limits.

Using real-world experience from developing a high-performance open source telemetry agent, we’ll examine how design tradeoffs around buffering, concurrency, and I/O shape system behavior at scale. Topics include user-space serialization strategies, adaptive buffering models, memory-mapped persistence, and multithreaded I/O coordination, along with how these choices interact with core Linux primitives such as epoll, asynchronous I/O, and zero-copy techniques.

Rather than focusing on APIs or products, this talk dives into the mechanics and constraints that determine whether a telemetry system remains predictable under load. The discussion is grounded in production lessons learned from operating at billions of events per minute and highlights patterns that apply broadly to collectors, agents, and streaming systems.

Speakers

José Lecaros

Support Engineer, Chronosphere | A Palo Alto Networks Company

He works as a Support Engineer at Chronosphere, helping both customers and the Fluent community. He's been a developer and support engineer for 20+ years.

Eduardo Silva

Distinguished Engineer, Chronosphere | A Palo Alto Networks Company

Eduardo is an entrepreneur and Software Engineer. He is one of the Fluentd project maintainers and the creator of Fluent Bit, a lightweight logs, metrics, and traces processor.

Integrating Observability into DevOps Practices

Content Experience Level Intermediate

4:10pm CDT

[CANCELLATION] The Missing Layer in eBPF Observability: Storage - Kritik Sachdeva, IBM

Friday May 22, 2026 4:10pm - 4:35pm CDT

Level One | Ballroom B

Modern observability has embraced eBPF for profiling CPU usage and tracing network paths in production systems. Yet one critical layer remains largely under-instrumented: storage. Despite being a frequent source of performance issues, storage I/O is still treated as a black box, especially in cloud native environments.

This talk will walk through the basic storage I/O path in Linux and Kubernetes, highlight where traditional metrics fall short, and discuss the kinds of storage latency and wait signals that eBPF can surface at runtime without requiring kernel modifications or specialized debugging setups.

Using simple examples, the session will show how hidden storage latency and queuing effects surface in real workloads, and why these blind spots become more visible with data-intensive and AI workloads where applications or GPUs often wait on storage without clear indicators.

By the end of this talk, attendees will gain a practical understanding of where storage observability breaks down today, what eBPF can realistically help uncover at a foundational level, and how to reason about storage-related performance issues alongside CPU and networking metrics.

Speakers

Kritik Sachdeva

Technical Support Professional, IBM

I’m Kritik Sachdeva, currently working as a Support Professional at IBM. I’ve been working with Ceph and OpenShift for the past 5 years, and since college I have had a great interest in technologies like K8s, containers, and Ceph.

Since then, I’ve enjoyed exploring how different... Read More →

The Future of Open Source Observability

Content Experience Level Beginner
