May 21-22, 2026
Learn more and Register to Attend

The Sched app allows you to build your schedule, but is not a substitute for your event registration. You must be registered for Observability Summit North America 2026.

Please note: This schedule is automatically displayed in Central Daylight Time (UTC -5). To see the schedule in your preferred timezone, select from the drop-down menu located at the bottom of the menu to the right.

The schedule is subject to change.
Venue: Level One | Ballroom A
Thursday, May 21
 

9:00am CDT

Keynote: Welcome + Opening Remarks
Thursday May 21, 2026 9:00am - 9:10am CDT

Level One | Ballroom A

9:15am CDT

Sponsored Keynote: Zero-Code Observability: Close the Coverage Gaps That Cause Outages - Eden Federman, Odigos
Thursday May 21, 2026 9:15am - 9:20am CDT
The outages that hurt most start across multiple vectors: compiled languages, third-party applications, legacy services, hard-to-instrument areas, and latency-sensitive workloads. In this session, Odigos co-founder and CTO Eden Federman will talk about how eBPF-based instrumentation with OpenTelemetry output delivers full distributed tracing across every service in your cluster — in minutes, with no code changes and <1% overhead.
Speakers
Eden Federman

Co-founder & CTO, Odigos
Eden is the Co-Founder & CTO of Odigos, leading the company's technical vision with deep expertise as an OpenTelemetry maintainer and eBPF innovator. With a background spanning major engineering roles, including contributions at Verizon Media, Taboola, and OpenTelemetry, Eden leads...
Level One | Ballroom A

9:25am CDT

Sponsored Keynote: The Work Before the Magic: Autoremediation Readiness - Alok Bhide, Chronosphere | A Palo Alto Networks Company
Thursday May 21, 2026 9:25am - 9:30am CDT
The pitch for autoremediation is hard to resist: AI doesn't just surface issues faster — it fixes them on the spot, leaving you to kick back, validate, and observe. MTTR doesn't just shrink; it becomes a relic. Problems vanish before anyone even notices they existed.

But rush into it without solid data, proper curation, and clear policy, and you're pulling a tap with too much pressure — nothing but foam, no beer.

Closed-loop remediation isn't a shortcut. It's the payoff at the end of a disciplined, AI-driven observability practice.

In this talk, we'll walk through the three things that make autoremediation actually work:

  1. System coverage that holds up at real scale
  2. Data that's clean, navigable, and actionable
  3. Ground rules for what AI is — and isn't — allowed to do

You'll walk away with a practical readiness checklist and a clear framework for deciding where autoremediation belongs in your stack, and where it definitely doesn't.

No hype. Just the work that earns AI the right to act in production.


Speakers
Alok Bhide

Director, Product Management, Chronosphere (a Palo Alto Networks company)
Alok Bhide is the Director, Product Management at Chronosphere, a Palo Alto Networks company, and has been in the Observability space for over a decade, formerly as a Director of Product at Splunk and CPO at Universal Tennis, where he was also responsible for SRE and the Engineering...
Level One | Ballroom A

9:35am CDT

Keynote: 10 Million Spans Per Second: Lessons From Scaling OpenTelemetry at Reddit - Trevor Riles, Reddit
Thursday May 21, 2026 9:35am - 10:00am CDT
Reddit processes over 25 billion tracing events per hour across thousands of services. In this talk, we share how we scaled our OpenTelemetry-based distributed tracing platform by 67% in one year—and what broke along the way.

We'll cover our architecture: OpenTelemetry instrumentation across Python, Go, and JavaScript baseplate libraries feeding into Kafka pipelines and ClickHouse storage. You'll learn how we handled an incident that spiked ingestion to well over 10 million spans per second, the sampling strategies we developed to balance cost with debuggability, and why instrumenting three language runtimes simultaneously is harder than it sounds.

Key takeaways:
- Practical patterns for multi-language OTel instrumentation at scale
- Remote sampling strategies that adapt to traffic patterns
- ClickHouse schema design for sub-second trace queries
- Building adoption through cross-functional partnerships, not mandates

Whether you're starting your tracing journey or scaling an existing platform, this talk provides battle-tested lessons from running distributed tracing infrastructure serving one of the world's largest online communities.
Speakers
Trevor Riles

Senior Software Engineer, Reddit
Trevor Riles is a Senior Software Engineer on Reddit's Observability team, where he owns the distributed tracing platform. He previously co-presented at KubeCon on Reddit's Thanos metrics infrastructure and has been building observability systems at Reddit since 2021.
Level One | Ballroom A
  Keynote Sessions

10:20am CDT

The Invisible Tax: How Data Format Conversions Drive up Telemetry Pipeline Costs - Cijo Thomas & Joshua MacDonald, Microsoft
Thursday May 21, 2026 10:20am - 10:45am CDT
Telemetry signals traverse long pipelines before reaching observability backends. While enrichment, filtering, and redaction provide clear value, significant compute cost often comes from repeated conversion through different data formats.
Telemetry commonly flows through SDK formats, wire protocols, collector-internal formats, and backend ingestion schemas. Each boundary introduces marshaling, unmarshaling, and copying. These transformations add no new information, yet they consume CPU and memory and scale linearly with volume, creating a hidden "transform tax" that compounds dramatically at terabyte scale.
This talk will share results from measuring instrumented OpenTelemetry SDK and Collector pipelines. We quantify compute spent on pure format conversion versus value‑generating processing and show how these costs grow with scale.
Attendees will learn about conversion costs and strategies to reduce waste: eliminating unnecessary translations, aligning pipeline representations, leveraging zero‑copy techniques, and minimizing transformation hops between pipeline stages. We also examine Apache Arrow‑based representations as one approach to reducing this overhead.
Speakers
Cijo Thomas

Principal Software Engineer, Microsoft
Cijo is a Software Engineer at Microsoft specializing in Observability. He has been deeply involved with the OpenTelemetry project since its inception and is a core maintainer for the OpenTelemetry .NET and OpenTelemetry Rust implementations. His expertise extends beyond OpenTelemetry...
Joshua MacDonald

Principal Software Engineer, Microsoft
Joshua MacDonald is an OpenTelemetry contributor working in the observability industry. On the side, he writes open-source telemetry software and operates a community water system.
Level One | Ballroom A
  CNCF Observability Projects

10:50am CDT

Taming Observability at Scale in a Multi-Cluster Kubernetes Platform at Bloomberg - Joe Nathan Abellard, Bloomberg
Thursday May 21, 2026 10:50am - 11:15am CDT
Bloomberg runs a managed, multi-cluster Kubernetes platform built atop Karmada to support AI and streaming analytics workloads. This comes with challenges around observability at scale. To meet disaster recovery requirements, we use a multi-region architecture where each Karmada control plane is hosted on management clusters spanning multiple regions. This helps ensure high availability, but also adds complexity related to observability. For example, how do we aggregate and visualize metrics across multiple Prometheus servers when each management cluster has a dedicated Prometheus setup?

This talk covers our multi-region architecture to meet DR requirements and our Prometheus stack with Thanos for global metrics aggregation. We’ll explore how we choose the right signals and define meaningful alerts in a complex multi-cluster environment to curb alert fatigue, while ensuring timely issue detection. We’ll also discuss the challenges of defining SLIs and SLOs in a multi-tenant platform.
Speakers
Joe Nathan Abellard

Senior Software Engineer, Bloomberg
Joe Nathan Abellard is a Senior Software Engineer on the Cloud Native Compute Services (CNCS) Platform Engineering team at Bloomberg. He's the lead engineer and product owner for Bloomberg's large-scale, managed multi-cluster Kubernetes platform, built on the CNCF Karmada project...
Level One | Ballroom A

11:20am CDT

Quantiles at Scale: Choosing the Right Estimation Algorithms for Observability - Mike Shi, ClickHouse
Thursday May 21, 2026 11:20am - 11:45am CDT
Quantiles like p90 and p99 sit at the heart of observability. They define dashboards, drive SLOs, and shape how teams reason about system performance. They are also some of the most expensive metrics to compute, and the cost grows fast as data volumes increase.
To keep up, observability systems rely heavily on approximate quantile algorithms such as sketches and probabilistic data structures, including t-digest. These approaches work well at small and medium scale, but at tens or hundreds of petabytes, things start to creak and limitations become apparent.
We share hard-won lessons from operating ClickHouse at extreme scale, where quantile estimation must remain accurate and affordable over hundreds of petabytes of data. We break down the most common quantile algorithms used in observability today, explain their real trade-offs, and show when each approach makes sense. We also explore a critical design decision: when quantiles should be computed on the fly at query time versus pre-aggregated during ingestion.
The goal is to give you a practical framework for choosing quantile algorithms that scale, rather than blindly relying on defaults that stop working as your data grows.
Speakers
Mike Shi

Head of Product, Observability, ClickHouse
Mike leads observability at ClickHouse, where he works on building a developer-friendly observability platform. He joined ClickHouse through the acquisition of HyperDX, a company he co-founded, after spending the last five years building observability platforms for engineers, accidentally...
Level One | Ballroom A

11:50am CDT

⚡ Lightning Talk: Beyond Billions: Operating Thanos, Prometheus & OpenTelemetry at Trillion-Scale - Narendra Sanikommu, Nvidia
Thursday May 21, 2026 11:50am - 12:00pm CDT
Operating a metrics system beyond billions of data points introduces failure modes that don't exist at smaller deployments. This lightning talk shares battle-tested lessons from running Thanos, Prometheus, and OpenTelemetry in production across distributed Kubernetes environments, focusing on three critical challenges: implementing multi-tenancy without noisy neighbor problems, building rate limiting that prevents a single tenant from destabilizing the cluster, and isolating query workloads so expensive queries don't starve metric ingestion.

The talk walks through real incidents where these challenges caused production impact, including 5xx errors on Thanos Receivers from unbounded queries, Prometheus remote write lag, and partial query results from overwhelmed Store Gateways. For each problem, the talk presents the custom solutions developed, including tenant-aware rate limiting middleware and workload isolation patterns, and shares concrete configuration approaches that attendees can apply to their own deployments.
Attendees will leave with actionable techniques for scaling their observability infrastructure to trillion-scale while maintaining reliability under load.
Speakers
Narendra Sanikommu

Senior Software Engineer, Nvidia
Experienced software engineer who is passionate about solving complex software engineering challenges. With around 14 years of experience in software engineering, Narendra has a strong foundation in building and optimizing high-performance systems, particularly in Observability, Big Data...
Level One | Ballroom A
  Scalability Challenges and Solutions
  • Content Experience Level Any

12:15pm CDT

Lunch
Thursday May 21, 2026 12:15pm - 1:15pm CDT
Menu: 
MinneSalad: Romaine, Baby Lettuce Greens, Purple Cabbage, Carrot Shreds, Honey-Clover Gouda, Sweet and Spicy Pepitas, Cucumber, Shredded Daikon, Red Peppers, Blueberry Balsamic Vinaigrette (vg, gf)

Sautéed Beef Tips, Wild Rice, Carrots, Celery, Onions, Mushrooms, Topped with Cheddar Cheese and Crispy Tater Tots
Wild Rice Hot Dish: Plant-Based Ground Beef, Wild Rice, Carrots, Celery, Onions, Mushrooms (vg)
Wild Rice Cakes with Roasted Red Pepper Sauce, Roasted Brussels Sprout Medley (ve)

Homemade Dinner Rolls
Assorted Miniature Bundt Cakes
Level One | Ballroom A

1:15pm CDT

Panel: Telemetry That Matters - Diana Todea, VictoriaMetrics; Antonio Jimenez Martinez, Cisco ThousandEyes; Laura Luttmer, Dynatrace
Thursday May 21, 2026 1:15pm - 1:50pm CDT
Instrumentation has never been easier, but are we truly gaining clarity? As data volumes rise, dashboards multiply, and observability costs increase, developers may feel less insight and more friction. Are we collecting telemetry with purpose or just because we can? What problem is this data meant to solve?
This panel brings together practitioners across open standards, developer experience, and real-world reliability engineering. The discussion will examine how zero-code instrumentation affects workflows and system understanding, how meaningful telemetry improves day-to-day engineering work, and why unfiltered or unstructured data often has the opposite effect. The conversation will cover practical lessons for filtering, dropping, reducing, and shaping telemetry so teams maintain visibility without unnecessary volume or cost. Finally, we explore scaling observability across fleets of collectors with an OpAMP server, ensuring consistent signal delivery and manageability as telemetry grows.
At the center is a guiding question: What is the purpose of the telemetry we collect and how do we ensure it remains aligned with developer needs, operational requirements, and system reliability?
Speakers
Antonio Jimenez Martinez

Tech Lead Software Engineer, Cisco ThousandEyes
I am a Tech Lead Software Engineer at Cisco ThousandEyes, specializing in observability to ensure our customers can effectively monitor their products. My recent work involves using OpenTelemetry to stream telemetry data, enhancing network visibility and performance for our clients...
Diana Todea

Developer Experience Engineer, VictoriaMetrics
Diana is a Developer Experience Engineer at VictoriaMetrics. She has worked as a Senior Site Reliability Engineer focused on Observability. She is an active member of the OpenTelemetry CNCF open source project, co-organizer of Cloud Native Days Romania, co-lead of neurodiversity working...
Laura Luttmer

Sr. Product Manager, Bindplane (Dynatrace)
I am a Product Manager at Bindplane based in Albuquerque, New Mexico. With over 10 years of product experience spanning SaaS, legal, and data platforms, I focus on OpenTelemetry-native pipeline solutions, AI-powered telemetry intelligence, and helping customers get more out of their...
Level One | Ballroom A

1:55pm CDT

Taming Tenancy, Cost and Architecture at Collibra Through OpenTelemetry and Our Telemetry Backbone - Alex Van Boxel, Collibra
Thursday May 21, 2026 1:55pm - 2:20pm CDT
Operating a SaaS platform presents the same observability problems as any other enterprise, but scale and tenancy multiply the volume of observability signals, which affects both cost and effectiveness.

This session dives into the techniques Collibra used to tame these problems and how to maintain clarity when infrastructure spans virtual machines, modern Kubernetes clusters, and a complex mix of single- and multi-tenant architectures. Without the right context, telemetry data becomes a noisy, indistinguishable flood.

We will dive into the architectural decision to leverage the C4 system model, ensuring every piece of telemetry carries the vital context of what it belongs to and where it sits in the hierarchy. This gives us insight into signal attribution and enables virtual chargebacks. The presentation details the implementation of a pipeline using custom-built OpenTelemetry collectors designed to handle the data and enrich it before sending it to the appropriate backends.

This session will give you practical insights on the challenges SaaS platforms have, but the techniques that are used to tame them can be applied everywhere.
Speakers
Alex Van Boxel

Principal System Architect, Collibra
Alex Van Boxel is a Principal System Architect at Collibra. With an engineering background in R&D at Alcatel-Lucent, Progress Software, and Veepee, he loves to focus on the fundamental building blocks of the software industry. That means reading, understanding, and contributing to...
Level One | Ballroom A
  End-User Case Studies
  • Content Experience Level Any

2:25pm CDT

The Speed of Metrics, the Fidelity of Traces: Architecting Post-Collection Aggregation - Zack Owens, New Relic
Thursday May 21, 2026 2:25pm - 2:50pm CDT
As organizations adopt observability practices, they face a scalability paradox: systems now generate petabytes of traces and logs, but querying this raw telemetry over long time horizons becomes prohibitively slow and expensive due to the data volume.

The standard solution of pre-aggregating high-cardinality telemetry into metrics at collection time through features in the OpenTelemetry collector works well for known patterns but fails when engineers need to ask new questions about historical data. This creates an uncomfortable choice for engineers and operators: fast dashboards with pre-aggregated metrics, or high-fidelity traces and logs that become unusable beyond short time windows.

This talk presents a post-collection aggregation approach that enables fast queries over long time periods of detailed telemetry without changes to collector-side configuration. This session explores techniques for incremental view materialization that work with timeseries data. Attendees will leave with concrete architectural patterns which are applicable to open source databases like ClickHouse or OpenSearch to answer novel questions without sacrificing query speed or data fidelity.
Speakers
Zack Owens

Principal Software Engineer, New Relic
Zack Owens is a Principal Engineer and Architect at New Relic, focusing on the data platform and NRDB, a purpose-built timeseries database for observability.
Level One | Ballroom A
  Scalability Challenges and Solutions
  • Content Experience Level Any

2:55pm CDT

When the Cloud Fails: Debugging the "Undocumented" - Dhruv Jain, Gojek (GoTo Group) Indonesia
Thursday May 21, 2026 2:55pm - 3:20pm CDT
What happens when a system degrades under high load while all internal metrics remain “green”? At hyperscale, supporting on-demand services across Southeast Asia’s most populous countries, a team observed up to a 7% drop in message delivery. The root cause was not application code, messaging brokers, or load balancers, but a hidden limitation deep within a cloud provider’s firewall.

This war-story session presents a forensic investigation into a managed cloud load balancer and its interaction with connection-tracking tables. The talk walks through the production cutover that triggered the issue and the targeted load testing that ultimately isolated the failure to cloud infrastructure behavior invisible to standard monitoring.

Beyond root cause analysis, the session focuses on outcomes: how sustained, evidence-based debugging led the cloud provider to acknowledge the issue—initially labeled a “limitation”—and introduce a new observability metric, firewall/connections_tracked. Attendees will leave with a practical framework for debugging black-box cloud failures and identifying the node-level metrics needed to detect silent network drops before they impact users.
Speakers
Dhruv Jain

Lead Software Engineer, Gojek (GoTo Group)
Dhruv Jain is a Lead Software Engineer at Gojek, where he focuses on building and scaling MQTT infrastructure that handles millions of concurrent connections across Southeast Asia. Beyond his work at Gojek, he is an active contributor to the open-source community and Google Summer...
Level One | Ballroom A
  End-User Case Studies

3:40pm CDT

The Full Picture: Visualizing Service "Fullness" To Rethink Saturation Prevention - Tal Nordan, Independent
Thursday May 21, 2026 3:40pm - 4:05pm CDT
Saturation has long been the stepchild of "the Four Golden Signals of SRE". While latency, traffic, and errors are directly measurable through metrics like P99, RPS, and 5xx rates, monitoring just how "full" a service is relies on indirect symptoms such as CPU usage or queue depth. Yet saturation should ideally be the first signal to alert on: once it is reached, the other signals, latency and errors, spike fast.

The inability to directly observe and mitigate saturation drives excessive safety margins, chronically low CPU utilization and massive compute waste in latency-sensitive and customer-facing systems. This session introduces an open-source approach extending Envoy proxy and its seamless integration through eBPF and Cilium, to provide direct observability into service saturation, by comparing each instance's live number of concurrent requests to its true concurrency limit. We then explore how such direct visualization of saturation can help reduce MTTR and minimize waste.
Speakers
Tal Nordan

Software Engineer, Independent
An early contributor to the Envoy proxy project, now working on developing tools to detect and mitigate inefficiencies in the way services interact with each other. Over the years Tal has been a founding engineer and a contractor working on a wide variety of cloud-native data-plane...
Level One | Ballroom A

4:10pm CDT

Secure by Design: Rethinking Test Credentials for Synthetic Monitoring - Katie Kodes, Katie Kodes
Thursday May 21, 2026 4:10pm - 4:35pm CDT
Synthetic monitoring and end-to-end testing often require dangerous levels of access to production systems. Last summer, I nearly emailed my bank details to a team I was training on new testing tools. If I hadn't caught that mistake, I probably would have dumped them into an OTel collector too.

This session explores the security implications of common testing practices, and presents practical alternatives that maintain observability without compromising security.

Attendees will learn authentication and authorization patterns to improve test security across the software development lifecycle.

Implementing mitigations like health check endpoints, synthetic data, and privilege separation spans the full stack of infrastructure, development, monitoring, and governance. Attendees will leave with a shared vocabulary they can use to align business, development, security, and observability teams on safer test traffic in production.
Speakers
Katie Kodes

DevOps Architect
Katie is a DevOps architect who brings clarity to complex technical challenges across the entire stack. With experience ranging from infrastructure to front-end development, she helps teams build reliable, observable systems that deliver real business value. A passionate educator...
Level One | Ballroom A

4:40pm CDT

Closing Remarks
Thursday May 21, 2026 4:40pm - 4:45pm CDT

Level One | Ballroom A
 
Friday, May 22
 

9:00am CDT

Keynote: Welcome Back + Opening Remarks
Friday May 22, 2026 9:00am - 9:05am CDT

Level One | Ballroom A

9:10am CDT

Sponsored Keynote: OpenSearch - See Everything: Open Observability for Agentic AI - Anirudha Jadhav, Amazon Web Services
Friday May 22, 2026 9:10am - 9:15am CDT
AI is accelerating software development at an exponential pace, but we have no idea what our AI systems are actually doing. Agents operate across distributed frameworks. One request spawns dozens of hops with zero visibility. The OpenSearch Observability Stack closes that gap—built for open source contributors, with a growing focus on developers and operators using these systems every day. Open source. Linux Foundation-governed. One pipeline. Every framework. Every model. Every hop visible. The agentic era deserves open infrastructure, and we’ll share how this is a step towards building it together.
Speakers
Anirudha Jadhav

Sr. Engineering Leader, Amazon Web Services
Anirudha is a Senior Manager, Software Development at Amazon Web Services (AWS), leading development of insight engines and visualization platforms for the OpenSearch Project. He specializes in distributed systems, data analytics, and search technologies, including architecting one...
Level One | Ballroom A

9:15am CDT

Sponsored Keynote: Datadog - Every Byte Counts: How Protocol Design Shapes the Cost of Observability - Amanda Sopkin, Datadog
Friday May 22, 2026 9:15am - 9:20am CDT
Today, many organizations are pushing beyond existing limits for telemetry volume. Systems are ever-more distributed and generative AI workloads produce enormous amounts of data. As telemetry volumes grow, observability pipelines must become more efficient.

At scale, telemetry egress directly impacts observability spend. Cloud providers charge per gigabyte of data transferred across regions or providers, and those bytes add up quickly. The protocol used to encode telemetry determines how much data is sent over the network. Even modest improvements in encoding efficiency (i.e. the protocol) can translate into significant cost savings. However, the OpenTelemetry Protocol (OTLP) was not initially optimized for performance. Instead, it prioritized interoperability and easy adoption.

Today the OpenTelemetry community is exploring OTAP, a new stateful protocol for transmitting OpenTelemetry data based on Apache Arrow. By using columnar encoding and maintaining state throughout a stream, OTAP avoids repeatedly sending the same metadata, reducing payload size and network transfer. However, because OTAP relies on long-lived stateful streams rather than independent requests, there is additional architectural and operational complexity in its implementation. There are further challenges to larger adoption by the community; for example, Apache Arrow support varies significantly across languages.

Protocol design today is critical to efficiently scaling your systems. In this talk we will explore how protocol design affects telemetry egress and overall observability cost. We will go over some strategies for improving encoding efficiency, compare stateless and stateful approaches, and discuss the potential benefits and drawbacks of adopting a protocol like OTAP. Join us to learn more about how your protocol decisions can influence your costs over time.
Speakers
Amanda Sopkin

Engineering Manager, Datadog

Level One | Ballroom A

9:20am CDT

Keynote: Tracing the Agent's Mind: Extending OpenTelemetry for Deep MCP Inspection - Mustafa Dayıoğlu, TUBITAK & Zeyno Dodd, Conjectura R&D
Friday May 22, 2026 9:20am - 9:45am CDT
Production AI agents make thousands of tool-calling decisions daily, yet observability stops at the model boundary. OpenTelemetry's GenAI semantic conventions capture token counts and latencies—what the LLM processed—but not why an agent selected a specific tool. Research (McKenzie et al., 2023) demonstrates inverse scaling: more capable models exhibit unpredictable tool selection patterns. This gap leaves engineers guessing during critical production failures.

We present gen-ai-otel, an open-source OpenTelemetry extension introducing decision-level telemetry for MCP agents. A new attribute namespace (gen_ai.agent.*) captures tool selection confidence, session context, permission scope validation, and baseline deviations. The zero-sidecar architecture routes telemetry through standard Collector pipelines to existing backends—Jaeger, Prometheus, or graph databases—with low overhead and cardinality-aware attributes.

A live demo reconstructs an agent's decision chain, revealing anomalies invisible to token metrics—reducing decision-debugging time. Attendees leave with: 1) Collector configs, 2) Grafana dashboards for confidence tracking, 3) demo code and repo—all Apache 2.0 licensed.
Speakers
Mustafa Dayıoğlu

Senior Chief Researcher, TUBITAK (THE SCIENTIFIC AND TECHNOLOGICAL RESEARCH COUNCIL OF TÜRKİYE)
Mustafa Dayıoğlu (PhD, ITU) is a security architect with 25 years of experience in cybersecurity at TÜBİTAK, designing large-scale security systems serving 80 million citizens for regulated environments. Specializes in threat modeling and protocol development for AI agent systems...
Zeyno Dodd

R&D Solution Architect, Conjectura R&D
R&D Architect with 25+ years building distributed systems and leading open research collaborations. Principal collaborator on SFAMDF and GraphSentinel, open initiatives exploring proactive, federated security patterns for MCP-based agentic AI systems. Research interests include...
Level One | Ballroom A
  Keynote Sessions

10:20am CDT

Exploring Observability with MCP Servers - Tiffany Jernigan, Grafana Labs
Friday May 22, 2026 10:20am - 10:45am CDT
You may have heard of the pillars of observability: metrics, logs, traces, and, depending on who you ask, profiles. As systems grow in complexity, the need to both individually understand and correlate these signals becomes paramount for rapid incident detection, root cause analysis, and performance optimization. Yet, even with advances like OpenTelemetry, making sense of your own data often requires learning specialized query languages and navigating complex toolchains, which is a barrier for many users.

While AI tools like ChatGPT can offer general advice, they lack access to your specific observability data. This is where Model Context Protocol (MCP) servers come in. MCP servers provide a standardized way for AI assistants and other tools to securely connect to your observability data, making it easier to investigate and diagnose issues faster using natural language.

In this talk, we’ll cover MCP and demonstrate how to explore your observability data using Grafana MCP, while also touching on how the same approach can work with other MCP-compatible tools or custom MCP servers.
Speakers
Tiffany Jernigan

Senior Developer Advocate, Grafana Labs
Tiffany is senior developer advocate at Grafana Labs and a CNCF Ambassador. She also formerly worked as a software developer and developer advocate at VMware, Amazon, Docker, and Intel. Prior to that, she graduated from Georgia Tech with a degree in electrical engineering. In her... Read More →
Level One | Ballroom A
  AI and MCP in Observability

10:50am CDT

Show Me the Receipts: A Forensic Hunt for Observability - Mostafa Radwan, Datadog
Friday May 22, 2026 10:50am - 11:15am CDT
Today, observability platforms can process massive volumes of telemetry, but practitioners struggle to determine what matters during incidents, unnecessarily increasing usage bills.

This talk resolves the question: “Which telemetry data should we keep?” Learn how one team achieved 30% log reduction by flipping the script and asking “what did we actually use?” instead of “what should we collect?” They conducted a forensic audit of incident resolutions to find receipts proving which data sources truly mattered.

You’ll learn techniques for tracing backward from resolved incidents to identify which telemetry is deemed valuable and see how to map incidents to telemetry data that enabled resolution, revealing which sources proved critical, redundant, or unused.

Using OpenTelemetry (OTel) and Vector, an open-source tool for building fast and scalable observability pipelines, this approach provides a replicable pattern that the community can adapt across different environments.

You’ll leave with a framework for measuring telemetry value based on usage patterns, plus a repeatable audit process. The key question: “Where are the receipts?”
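
As a rough illustration of the audit idea described above (not the speaker's implementation), the sketch below counts how often each telemetry source was consulted across resolved incidents and labels it accordingly. The incident records, source names, the `critical_share` threshold, and the `rarely_used` label are all hypothetical; the talk's own categories and criteria may differ.

```python
from collections import Counter


def audit_telemetry_usage(incidents, inventory, critical_share=0.5):
    """Label each telemetry source by how often resolved incidents used it.

    incidents: dict mapping incident ID -> list of sources consulted (hypothetical shape)
    inventory: list of every source currently collected
    critical_share: assumed cutoff; sources used in at least this share of incidents
                    are labeled "critical"
    """
    usage = Counter()
    for sources_used in incidents.values():
        usage.update(set(sources_used))  # count each source once per incident

    total = len(incidents)
    report = {}
    for source in inventory:
        count = usage[source]
        if count == 0:
            label = "unused"
        elif total and count / total >= critical_share:
            label = "critical"
        else:
            label = "rarely_used"
        report[source] = (count, label)
    return report


# Hypothetical incident records and telemetry inventory.
incidents = {
    "INC-1": ["checkout.traces", "checkout.error_logs"],
    "INC-2": ["checkout.error_logs"],
    "INC-3": ["db.slow_query_logs", "checkout.error_logs"],
}
inventory = [
    "checkout.traces",
    "checkout.error_logs",
    "db.slow_query_logs",
    "checkout.debug_logs",
]

report = audit_telemetry_usage(incidents, inventory)
for source, (count, label) in report.items():
    print(f"{source}: used in {count} incident(s), {label}")
```

Sources labeled `unused` here are only candidates for reduction; a real audit would also need to account for compliance or security retention requirements before dropping anything.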
Speakers
Mostafa Radwan

Senior Solutions Engineer, Datadog
Mostafa is a technologist specialized in cloud native computing, observability, and security.

He started his career as a software engineer before getting in the trenches of application and production support.

He worked as a Solutions Architect at Docker where he helped enterp...
Level One | Ballroom A
  Scalability Challenges and Solutions
  • Content Experience Level Any

11:20am CDT

[CANCELLATION] AI Training in Emerging Economies: Building Africa's Largest LLM From the Ground Up - Okikiola Oliyide, Awarri
Friday May 22, 2026 11:20am - 11:45am CDT
N-ATLaS is a multilingual African-language LLM we took from research to production on Kubernetes. This talk shows the end-to-end path we used to make it reproducible, observable, and affordable: data and fine-tuning pipelines (artifacts, seeds, checkpoints), Argo-orchestrated training on mixed GPU pools, and a serving stack with Triton + KServe tuned for real traffic. I’ll walk through SRE guardrails that mattered for N-ATLaS (SLOs, golden signals, error budgets), supply-chain hygiene (image signing, provenance, model versioning), and the levers that cut cost-per-token while improving latency and uptime under pre-emptions. We’ll cover autoscaling, caching, model rollout strategies, and incident playbooks, plus what we’d change after thousands of downloads and weeks of live usage. Expect hard-learned patterns, YAML you can run, and a plain-English checklist you can lift into your own cluster, whether you’re serving English or a low-resource language model.
Speakers
Okikiola Oliyide

Lead DevOps Engineer, Awarri
Okikiola Oliyide is Lead Cloud DevOps Engineer at Awarri Technology, where he designs and operates large-scale Kubernetes platforms powering Africa’s largest LLM initiative. With 5+ years across AWS, GCP, and on-prem, he specialises in CI/CD, observability, and cost-efficient GPU...
Level One | Ballroom A
  CNCF Observability Projects

11:50am CDT

⚡ Lightning Talk: Show Me the Money: Metrics Edition - Brian Davis, Red Canary
Friday May 22, 2026 11:50am - 12:00pm CDT
Existing cloud and Kubernetes cost management tools struggle to track expenses at a granular level, leaving engineers unable to answer critical questions like: How much is one specific customer costing us in DynamoDB usage? Or, which system component is consuming the most of our Kafka cluster?

This lightning talk demonstrates how to leverage existing observability frameworks to gain detailed, low-level cost insights. Attendees will learn basic techniques to instrument standard metrics with custom labels, such as component name, customer ID, and team, for fine-grained cost allocation.

This session includes a practical case study from Red Canary, which has used this exact methodology for over five years to transform its tactical decision-making and better manage cloud spend. By treating cost allocation as an observability problem, engineers can provide the finance team with the deep data required for effective resource management.

Attendees will leave with an actionable plan for implementing a metrics-based cost tracking system (likely with the tooling you already have), independent of high-level cloud billing tools, to drive significant operational efficiency.
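
To make the label-based idea above concrete, here is a minimal sketch (an illustration under stated assumptions, not Red Canary's method): usage samples carry labels such as `customer_id` and `component`, and a shared bill is split in proportion to the usage recorded under each label value. The sample data, the label names, the `allocate_cost` helper, and the proportional-split rule are all hypothetical.

```python
from collections import defaultdict


def allocate_cost(samples, total_cost, label):
    """Split total_cost across values of `label` in proportion to recorded usage.

    samples: iterable of (labels_dict, usage_units) pairs, e.g. exported metric samples
    total_cost: the shared bill to distribute (same currency throughout)
    label: which label to group by, e.g. "customer_id" or "component"
    """
    usage = defaultdict(float)
    for labels, units in samples:
        # Samples missing the label are grouped so the gap stays visible.
        usage[labels.get(label, "unlabeled")] += units

    total_units = sum(usage.values())
    if total_units == 0:
        return {}
    return {key: total_cost * units / total_units for key, units in usage.items()}


# Hypothetical usage samples, e.g. request units consumed on a shared datastore.
samples = [
    ({"customer_id": "acme", "component": "ingest"}, 600.0),
    ({"customer_id": "globex", "component": "ingest"}, 300.0),
    ({"customer_id": "acme", "component": "search"}, 100.0),
]

by_customer = allocate_cost(samples, total_cost=2000.0, label="customer_id")
by_component = allocate_cost(samples, total_cost=2000.0, label="component")
print(by_customer)
print(by_component)
```

The same samples answer both questions from the abstract: grouping by `customer_id` gives per-customer cost, and grouping by `component` gives per-component cost.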
Speakers
Brian Davis

Principal Software Architect, Red Canary
Principal Software Architect at Red Canary, a Zscaler Company, Brian Davis has been building and monitoring complex systems for over two decades, ranging from signal-processing algorithms to complex data-processing applications, deploying these on Solaris servers, on-prem virtual...
Level One | Ballroom A
  End-User Case Studies

12:05pm CDT

⚡ Lightning Talk: Observability Debt: When Telemetry Stops Telling the Truth - Spoorthi Palakshaiah, Relevance Lab
Friday May 22, 2026 12:05pm - 12:15pm CDT
This talk introduces observability debt as an operational issue that develops over time in evolving systems. Teams often instrument services early using observability frameworks, define metrics, dashboards, alerts, and SLOs, and initially gain confidence in their ability to understand system behavior. However, production systems rarely remain static. As systems evolve through refactoring, scaling, architectural changes, asynchronous processing, and organizational shifts, observability artifacts frequently remain unchanged, creating a mismatch between what telemetry is assumed to represent and how the system actually behaves. This mismatch, referred to as observability debt, does not result from missing data but from telemetry whose meaning has drifted due to unmaintained assumptions, leading to dashboards that appear healthy, alerts that lack context, and slower incident understanding.

To make this concrete, the talk uses a minimal personal system intentionally designed to model common production patterns. Starting from a low-debt state where telemetry reflects user impact, the system evolves while observability remains static, resulting in metrics that hide localized failures.
Speakers
Spoorthi Palakshaiah

DevOps Engineer, Relevance Lab
Spoorthi is a DevOps engineer with experience designing, building, and optimizing cloud infrastructure. She works extensively with Kubernetes, infrastructure as code, CI/CD pipelines, and open source observability tools to improve system reliability, scalability, and operational efficiency...
Level One | Ballroom A

12:30pm CDT

Lunch
Friday May 22, 2026 12:30pm - 1:25pm CDT
Menu:

Smoked Turkey-Honey Dijon Wedge; Smoked Turkey, Honey-Dijon Cream Cheese, Lettuce, Marble Pumpernickel Focaccia (GF)
Ham & Swiss Wedge; Smoked Ham, Mustard Aioli, Lettuce, Egg Focaccia
Roasted Veggie Wrap (vg) 
Corn Chowder Soup (v, GF)

Blueberry Cheesecake (V) and Apple Spice Cake (vg, GF)
Level One | Ballroom A

1:25pm CDT

Beyond Dashboards: Architecting AI Agents for Autonomous Observability - Divya Mahajan, Amazon & Achin Gupta, Intuit
Friday May 22, 2026 1:25pm - 1:50pm CDT
The future of observability isn't better dashboards—it's AI agents that reason across metrics, logs, and traces alongside your engineering team.

Engineers spend hours correlating signals across Grafana, Kibana, and Jaeger, mentally stitching together what happened and why. What if an agent could do that correlation automatically?

This session presents a practical architecture for building observability agents that autonomously triage incidents across all three pillars. We'll demonstrate an agent that ingests an alert, queries metrics, searches logs, examines traces, identifies root causes, and recommends remediation, while keeping humans in the loop.

We'll cover:

* Why observability is ideal for agentic AI
* Agent architecture with LangGraph orchestration
* Integration patterns: MCP, REST APIs, and OpenTelemetry
* Tool design for metrics, logs, and traces
* Live demo: agent triaging a simulated incident
* Production considerations: reliability, cost, guardrails

Attendees leave with a working reference architecture built on CNCF ecosystem tools (Prometheus, Jaeger, Loki, Grafana). All code is open source.
Speakers
Divya Mahajan

Software Engineer, Amazon

Divya Mahajan is a Software Development Engineer at Amazon Alexa, where she builds production-grade Agentic AI and LLM systems at scale. Her work sits at the intersection of conversational AI, agentic automation, and reliable system design, with a focus on accuracy, observability...
Achin Gupta

Staff Software Engineer, Intuit
Achin Gupta is a Staff Software Engineer with 9 years of experience designing and building production-grade distributed observability backends on Kubernetes. He also focuses on AI-driven systems, developing LLM-powered workflows and multi-agent architectures, with an emphasis on observability...
Level One | Ballroom A
  AI and MCP in Observability

1:55pm CDT

One Size Does Not Fit All: A Polystore Architecture for Logs and Traces - Suman Karumuri, KalDB
Friday May 22, 2026 1:55pm - 2:20pm CDT
Observability data isn't homogeneous. Security logs require needle-in-haystack searches with multi-year compliance retention. Kernel logs are uncompressible text. Structured logs enable fast aggregations, while semi-structured logs explode cardinality. Traces demand different access patterns entirely.

Modern requirements compound this. Observability must join with other data sources. Agentic AI systems generate massive volumes of unstructured and semi-structured logs and traces. Big data platforms have emerged as popular storage alternatives.

Forcing everything into one system creates impossible tradeoffs: slow queries, runaway costs, frustrated users.

At Airbnb and Slack, operating thousands of tenants across hundreds of clusters, we built a polystore architecture routing workloads to specialized engines, unified behind a single query interface. This required changes across the entire stack: instrumentation, collection, storage, and query layers.

This talk shares routing criteria, backend tradeoffs, and techniques for unified querying. Attendees will learn to optimize observability for better performance and lower costs.
Speakers
Suman Karumuri

CEO, KalDB
Suman Karumuri is Founder and CEO of KalDB and the author of its open source serverless Lucene platform of the same name. He is co-author of the OpenTracing/OpenTelemetry specification and was previously tech lead of Zipkin. Over the past decade, he has built and run petabyte-scale log search, distributed...
Level One | Ballroom A

2:25pm CDT

Implementation of Unified Observability at Scale From Scratch - Ahmed J., Emaar
Friday May 22, 2026 2:25pm - 2:50pm CDT
Unified observability has lately been regarded as the holy grail by some. One platform, universal observability, for everything. Usually, this would be the default, but when you are at a 30-year-old non-technical enterprise, dealing with a mixture of legacy and modern systems, it's a whole different story.

In some cases, legacy decisions leave a company with multiple observability platforms for different teams, adding overhead, cost, noise, and audit complexity. This was the case at Emaar, a property developer based in Dubai, until the PE team took on the exciting project of unifying all observability into one platform. This included applications, infrastructure, network, and security. The complexity arises not just from the different data sources, but also from the number and nature of the deployment sites: data centers, hotels, malls, shops, and more, spread across 10 countries.

This talk will outline the experience of implementing a unified observability platform covering thousands of network devices, machines, and application workloads using open-source technologies, which resulted in six-figure cost savings.
Speakers
Ahmed J.

Platform Engineer, Emaar
Ahmed is a platform engineer with a background in artificial intelligence research and development. He excels at building scalable infrastructure to deploy and manage production-grade applications and models. He co-led the orchestration of modern infrastructure and observability at...
Level One | Ballroom A
  End-User Case Studies

2:55pm CDT

How Observability-First Development Lets You Ship Agents in Weeks, Not Months - Anirudha Jadhav & Kevin Fallis, AWS
Friday May 22, 2026 2:55pm - 3:20pm CDT
Building AI agents is easy, but knowing why they fail is hard. Traditional APM tools were designed for request-response services, not autonomous agents that reason, plan, and execute multi-step workflows. When your agent makes unexpected decisions, standard metrics and traces don't tell you why.

This session introduces Eval-Driven Development, which focuses on building reliable agents through continuous observability and evaluation. Using OpenSearch AgentHealth, a new open-source platform for agent observability, we'll walk you through the full agent lifecycle of building, observing, improving, and repeating. We'll share a case study comparing two production root-cause-analysis agents. One was built with observability from day one and shipped in 6 weeks, while the other was retrofitted later and took 12 months to reach production. You'll learn how we used agentic evaluation to score agent outputs and improve accuracy over time.

You'll walk away with patterns for instrumenting agents with OpenTelemetry, techniques for evaluating full decision sequences (not just outputs), and a framework for shortening your development timeline by building observability in from the start.
Speakers
Anirudha Jadhav

Sr. Engineering Leader, Amazon Web Services
Anirudha is a Senior Manager, Software Development at Amazon Web Services (AWS), leading development of insight engines and visualization platforms for the OpenSearch Project. He specializes in distributed systems, data analytics, and search technologies, including architecting one...
Kevin Fallis

Principal Senior Solutions Architect, Amazon Web Services
Kevin Fallis is a seasoned leader, architect, and developer with experience across many industry verticals and disciplines such as agriculture, ad tech, financial services, networking, security, telecommunications, and, of course, search technologies. His passion helps others leverage...
Level One | Ballroom A
  AI and MCP in Observability

3:40pm CDT

Devs, Transform (Your Data) and Roll Out!: Learning and Leveraging OTTL - Reese Lee, New Relic
Friday May 22, 2026 3:40pm - 4:05pm CDT
The OpenTelemetry Collector has emerged as one of the project’s most critical pieces for ingesting and processing your app and infrastructure data, but did you know there’s even more you can do with your data before it reaches your backend?

Enter OTTL, or OpenTelemetry Transformation Language, a domain-specific language that can interact with and modify OTel data. Yes, the Collector already comes with dozens of components that can handle a wide range of data processing, BUT using OTTL in conjunction with the components enables even more powerful data manipulation.

In this session, learn about the benefits of OTTL, when to use it, and how to get started with OTTL. Get ready to explore:
* What OTTL is: A breakdown of the syntax and the underlying architecture within the OTel Collector.
* Why it’s useful: Practical strategies for cost reduction (filtering noise), compliance (redacting PII), and standardization (normalizing attributes).
* How to use it: A live walkthrough of writing complex transformation statements for the transform and filter processors.
Speakers
Reese Lee

Senior Developer Relations Engineer, New Relic
Reese Lee is a Senior Developer Relations Engineer at New Relic focusing on technical enablement via workshops, blog posts, documentation, and more. She is a Maintainer of the OpenTelemetry End User SIG, where she enjoys learning about interesting use cases and the different ways...
Level One | Ballroom A
  CNCF Observability Projects

4:10pm CDT

Closing Remarks
Friday May 22, 2026 4:10pm - 4:15pm CDT
Level One | Ballroom A
 