May 21-22, 2026
Learn more and Register to Attend

The Sched app allows you to build your schedule, but is not a substitute for your event registration. You must be registered for Observability Summit North America 2026.

Please note: This schedule is automatically displayed in Central Daylight Time (UTC -5). To see the schedule in your preferred timezone, select from the drop-down menu located at the bottom of the menu to the right.

The schedule is subject to change.
Thursday, May 21
 

7:30am CDT

Registration
Thursday May 21, 2026 7:30am - 5:00pm CDT
Level One | Ballroom Lobby

8:00am CDT

Coat Check
Thursday May 21, 2026 8:00am - 6:15pm CDT
Level One | Ballroom Lobby

9:00am CDT

Keynote: Welcome + Opening Remarks
Thursday May 21, 2026 9:00am - 9:10am CDT
Level One | Ballroom A

9:15am CDT

Sponsored Keynote: Zero-Code Observability: Close the Coverage Gaps That Cause Outages - Eden Federman, Odigos
Thursday May 21, 2026 9:15am - 9:20am CDT
The outages that hurt most start across multiple vectors: compiled languages, third-party applications, legacy services, hard-to-instrument areas, and latency-sensitive workloads. In this session, Odigos co-founder and CTO Eden Federman will talk about how eBPF-based instrumentation with OpenTelemetry output delivers full distributed tracing across every service in your cluster — in minutes, with no code changes and <1% overhead.
Speakers
Eden Federman

Co-founder & CTO, Odigos
Eden is the Co-Founder & CTO of Odigos, leading the company's technical vision with deep expertise as an OpenTelemetry maintainer and eBPF innovator. With a background spanning major engineering roles, including contributions at Verizon Media, Taboola, and OpenTelemetry, Eden leads... Read More →
Level One | Ballroom A

9:25am CDT

Sponsored Keynote: The Work Before the Magic: Autoremediation Readiness - Alok Bhide, Chronosphere | A Palo Alto Networks Company
Thursday May 21, 2026 9:25am - 9:30am CDT
The pitch for autoremediation is hard to resist: AI doesn't just surface issues faster — it fixes them on the spot, leaving you to kick back, validate, and observe. MTTR doesn't just shrink; it becomes a relic. Problems vanish before anyone even notices they existed.

But rush into it without solid data, proper curation, and clear policy, and you're pulling a tap with too much pressure — nothing but foam, no beer.

Closed-loop remediation isn't a shortcut. It's the payoff at the end of a disciplined, AI-driven observability practice.

In this talk, we'll walk through the three things that make autoremediation actually work:

  1. System coverage that holds up at real scale
  2. Data that's clean, navigable, and actionable
  3. Ground rules for what AI is — and isn't — allowed to do

You'll walk away with a practical readiness checklist and a clear framework for deciding where autoremediation belongs in your stack, and where it definitely doesn't.

No hype. Just the work that earns AI the right to act in production.


Speakers
Alok Bhide

Director, Product Management, Chronosphere (a Palo Alto Networks company)
Alok Bhide is the Director, Product Management at Chronosphere, a Palo Alto Networks company, and has been in the Observability space for over a decade, formerly as a Director of Product at Splunk and CPO at Universal Tennis, where he was also responsible for SRE and the Engineering... Read More →
Level One | Ballroom A

9:35am CDT

Keynote: 10 Million Spans Per Second: Lessons From Scaling OpenTelemetry at Reddit - Trevor Riles, Reddit
Thursday May 21, 2026 9:35am - 10:00am CDT
Reddit processes over 25 billion tracing events per hour across thousands of services. In this talk, we share how we scaled our OpenTelemetry-based distributed tracing platform by 67% in one year—and what broke along the way.

We'll cover our architecture: OpenTelemetry instrumentation across Python, Go, and JavaScript baseplate libraries feeding into Kafka pipelines and ClickHouse storage. You'll learn how we handled an incident that spiked ingestion to well over 10 million spans per second, the sampling strategies we developed to balance cost with debuggability, and why instrumenting three language runtimes simultaneously is harder than it sounds.

Key takeaways:
- Practical patterns for multi-language OTel instrumentation at scale
- Remote sampling strategies that adapt to traffic patterns
- ClickHouse schema design for sub-second trace queries
- Building adoption through cross-functional partnerships, not mandates

Whether you're starting your tracing journey or scaling an existing platform, this talk provides battle-tested lessons from running distributed tracing infrastructure serving one of the world's largest online communities.
Speakers
Trevor Riles

Senior Software Engineer, Reddit
Trevor Riles is a Senior Software Engineer on Reddit's Observability team, where he owns the distributed tracing platform. He previously co-presented at KubeCon on Reddit's Thanos metrics infrastructure and has been building observability systems at Reddit since 2021.
Level One | Ballroom A
  Keynote Sessions

10:00am CDT

Coffee + Networking Break
Thursday May 21, 2026 10:00am - 10:20am CDT
Menu:
Assorted Scones (GF) 
Blueberry Maple Overnight Oats (v, GF) 
Assorted Fruit Yogurts, including Dairy Free and Greek Yogurts
Level One | Ballroom A+B Foyer

10:20am CDT

The Invisible Tax: How Data Format Conversions Drive up Telemetry Pipeline Costs - Cijo Thomas & Joshua MacDonald, Microsoft
Thursday May 21, 2026 10:20am - 10:45am CDT
Telemetry signals traverse long pipelines before reaching observability backends. While enrichment, filtering, and redaction provide clear value, significant compute cost often comes from repeated conversion through different data formats.
Telemetry commonly flows through SDK formats, wire protocols, collector-internal formats, and backend ingestion schemas. Each boundary introduces marshaling, unmarshaling, and copying. These transformations add no new information, yet they consume CPU and memory and scale linearly with volume, creating a hidden "transform tax" that compounds dramatically at terabyte scale.
This talk will share results from measuring instrumented OpenTelemetry SDK and Collector pipelines. We quantify compute spent on pure format conversion versus value‑generating processing and show how these costs grow with scale.
Attendees will learn about conversion costs and strategies to reduce waste: eliminating unnecessary translations, aligning pipeline representations, leveraging zero‑copy techniques, and minimizing transformation hops between pipeline stages. We also examine Apache Arrow‑based representations as one approach to reducing this overhead.
Speakers
Cijo Thomas

Principal Software Engineer, Microsoft
Cijo is a Software Engineer at Microsoft specializing in Observability. He has been deeply involved with the OpenTelemetry project since its inception and is a core maintainer for the OpenTelemetry .NET and OpenTelemetry Rust implementations. His expertise extends beyond OpenTelemetry... Read More →
Joshua MacDonald

Principal Software Engineer, Microsoft
Joshua MacDonald is an OpenTelemetry contributor working in the observability industry. On the side, he writes open-source telemetry software and operates a community water system.
Level One | Ballroom A
  CNCF Observability Projects

10:20am CDT

[CANCELLATION] Scaling a Proprietary-to-OpenTelemetry Migration With AI-Assisted, Spec-Driven Workflows - Ying Mo & Paras Kampasi, IBM
Thursday May 21, 2026 10:20am - 10:45am CDT
This talk presents a practical methodology for migrating a large proprietary observability platform to an OpenTelemetry-native architecture, using a GenAI-assisted workflow paired with a robust spec-driven strategy. Faced with hundreds of custom Java-based sensors, the engineering team designed a spec-driven conversion process that leverages GenAI to extract specifications, generate unit tests, and assist in implementing Go-based OpenTelemetry receivers. Each stage incorporates human review and test feedback loops to address the reliability limitations of GenAI and ensure functional correctness.

Additionally, a data-driven feasibility evaluation was conducted prior to large-scale conversion, where defined task types were benchmarked with and without GenAI to quantify effort savings and highlight where GenAI provides the greatest value.

Attendees will learn a reproducible workflow for large-scale migrations from proprietary to OpenTelemetry, how to pair GenAI with automated testing to manage risk, and insights on where GenAI accelerates real-world engineering tasks without compromising quality.
Speakers
Ying Mo

Senior Software Engineer, IBM
Ying Mo is a Senior Software Engineer at IBM, recently working on IBM Instana, an observability platform, and leading the engineering team in making the product OpenTelemetry-native. He is always enthusiastic about bringing innovative ideas into the product by leveraging open source technology... Read More →
Paras Kampasi

Technical Product Manager, IBM
I work at the intersection of OpenTelemetry, observability, and modern cloud-native practices, helping teams make complex systems understandable and reliable. I speak and write about practical ways to apply open standards, close feedback loops between SREs and product teams, and turn... Read More →
Level One | Ballroom B
  End-User Case Studies

10:50am CDT

OpenTelemetry GenAI in Practice: What the Spec Says Vs. What You Actually See - Zach Groves, Datadog
Thursday May 21, 2026 10:50am - 11:15am CDT
OpenTelemetry’s GenAI semantic conventions are evolving quickly. Version 1.37 marked a major shift in how LLM behavior is expressed using standard spans and attributes. While later releases refined and clarified the spec, real-world adoption remains uneven, and “GenAI-compatible” can mean very different things across the ecosystem.

In this talk, I’ll share hands-on lessons from implementing and validating GenAI support in real emitters, including close collaboration with Strands. Implementing the 1.37 spec on both sides surfaced semantic ambiguities that only became clear in practice and ultimately led to stronger implementations.

I’ll also outline the current GenAI instrumentation landscape: Strands emitting 1.37+ compliant spans; OpenLLMetry, which mixes newer conventions with legacy and custom attributes; and OpenInference, which claims OpenTelemetry compatibility but does not emit GenAI semantic convention attributes.

Finally, I’ll show how these gaps surface in practice—teams believing they emit 1.37-compliant telemetry but sending pre-1.37 or non-spec data—and briefly touch on transition guidance like OTEL_SEMCONV_STABILITY_OPT_IN.
Speakers
Zach Groves

Software Engineer II, Datadog
Zach learned to code at Barnes & Noble during rest days between climbing while living in a van. He spent 3 years on the support team before moving over to the engineering team at Datadog (3 years on APM and 1 on LLM Obs). He currently works on LLM Obs OTel compatibility. He likes scuba... Read More →
Level One | Ballroom B
  AI and MCP in Observability

10:50am CDT

Taming Observability at Scale in a Multi-Cluster Kubernetes Platform at Bloomberg - Joe Nathan Abellard, Bloomberg
Thursday May 21, 2026 10:50am - 11:15am CDT
Bloomberg runs a managed, multi-cluster Kubernetes platform built atop Karmada to support AI and streaming analytics workloads. This comes with challenges around observability at scale. To meet disaster recovery requirements, we use a multi-region architecture where each Karmada control plane is hosted on management clusters spanning multiple regions. This helps ensure high availability, but also adds complexity related to observability. For example, how do we aggregate and visualize metrics across multiple Prometheus servers when each management cluster has a dedicated Prometheus setup?

This talk covers our multi-region architecture to meet DR requirements and our Prometheus stack with Thanos for global metrics aggregation. We’ll explore how we choose the right signals and define meaningful alerts in a complex multi-cluster environment to curb alert fatigue, while ensuring timely issue detection. We’ll also discuss the challenges of defining SLIs and SLOs in a multi-tenant platform.
Speakers
Joe Nathan Abellard

Senior Software Engineer, Bloomberg
Joe Nathan Abellard is a Senior Software Engineer on the Cloud Native Compute Services (CNCS) Platform Engineering team at Bloomberg. He's the lead engineer and product owner for Bloomberg's large-scale, managed multi-cluster Kubernetes platform, built on the CNCF Karmada project... Read More →
Level One | Ballroom A

11:20am CDT

AI-Powered Root Cause Analysis at Scale: From Theory To Production Lessons From Nubank's 120M+ Cus - Letícia Mota & Yevgeny Gladun, Nubank
Thursday May 21, 2026 11:20am - 11:45am CDT
This session presents an AI-powered SRE Agent designed to autonomously orchestrate complex, multi-source investigations by querying internal observability providers and knowledge bases.
A primary focus is the "Data Volume Problem." Modern observability systems generate terabytes of metrics and logs daily; at Nubank’s scale, the Prometheus MCP alone has more than 23,000 metrics available, while log queries can span billions of rows. The team overcame LLM context limits through on-premises data filtering, intelligent summarization, and selective context assembly. This architecture utilizes "Expert Guides" to reduce 23,000 raw metrics to approximately 14 relevant data points before LLM processing.
The talk covers multi-source orchestration using the Model Context Protocol (MCP) for pluggable tool discovery, allowing the AI to progressively load and correlate only the observability sources.
The platform enables the delivery of expert instructions for any specific scenario through targeted, versioned prompts. This transformation allows the platform to scale across the enterprise, performing virtually any investigative task beyond its original root cause analysis mission.
Speakers
Letícia Mota

Nubank
Letícia is a Product Manager at Nubank with 8+ years of experience. After working with data & image recognition products, she now works with Resilience and Troubleshooting products, including a DR Test Platform and an SRE Agent for Nubank. ... Read More →
Yevgeny Gladun

Staff Runtime Platforms Engineer, Nubank
Yevgeny Gladun is a Staff Engineer at Nubank with nearly 20 years of software development experience. Over his four-year tenure at Nubank, he transitioned from scaling Data ETL pipelines to deep architectural analysis of microservice interactions. As part of the Runtime Platforms... Read More →
Level One | Ballroom B
  AI and MCP in Observability

11:20am CDT

Quantiles at Scale: Choosing the Right Estimation Algorithms for Observability - Mike Shi, ClickHouse
Thursday May 21, 2026 11:20am - 11:45am CDT
Quantiles like p90 and p99 sit at the heart of observability. They define dashboards, drive SLOs, and shape how teams reason about system performance. They are also some of the most expensive metrics to compute, and the cost grows fast as data volumes increase.
To keep up, observability systems rely heavily on approximate quantile algorithms such as sketches and probabilistic data structures, including t-digest. These approaches work well at small and medium scale, but at tens or hundreds of petabytes, things start to creak and limitations become apparent.
We share hard-won lessons from operating ClickHouse at extreme scale, where quantile estimation must remain accurate and affordable over hundreds of petabytes of data. We break down the most common quantile algorithms used in observability today, explain their real trade-offs, and show when each approach makes sense. We also explore a critical design decision: when quantiles should be computed on the fly at query time versus pre-aggregated during ingestion.
The goal is to give you a practical framework for choosing quantile algorithms that scale, rather than blindly relying on defaults that stop working as your data grows.
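As a rough illustration of the query-time versus pre-aggregation trade-off described above, the sketch below compares an exact quantile over raw values with an estimate read from a mergeable log-bucket histogram. This is not t-digest and not ClickHouse's implementation; the function names and the 1.1 bucket base are assumptions chosen for the example, and values are assumed to be positive.

```python
import math

BASE = 1.1  # assumed bucket growth factor; bounds the relative error at about 10%

def exact_quantile(values, q):
    """Nearest-rank quantile computed from the raw values (query-time style)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def build_histogram(values, base=BASE):
    """Pre-aggregate positive values into log-scale buckets.

    Bucket i counts values in the range (base**(i-1), base**i].
    """
    hist = {}
    for v in values:
        i = math.ceil(math.log(v, base))
        hist[i] = hist.get(i, 0) + 1
    return hist

def merge_histograms(h1, h2):
    """Histograms from different shards or time windows merge by adding counts."""
    merged = dict(h1)
    for i, count in h2.items():
        merged[i] = merged.get(i, 0) + count
    return merged

def histogram_quantile(hist, q, base=BASE):
    """Estimate a quantile as the upper bound of the bucket holding the target rank."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, math.ceil(q * total))
    seen = 0
    for i in sorted(hist):
        seen += hist[i]
        if seen >= rank:
            return base ** i
    raise AssertionError("unreachable: rank never exceeds the total count")
```

The design point is that per-shard histograms can be stored at ingestion and merged later, whereas per-shard p99 values cannot be combined into a correct global p99. The cost is a bounded relative error set by the bucket base.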
Speakers
Mike Shi

Head of Product, Observability, ClickHouse
Mike leads observability at ClickHouse, where he works on building a developer-friendly observability platform. He joined ClickHouse through the acquisition of HyperDX, a company he co-founded, after spending the last five years building observability platforms for engineers—accidentally... Read More →
Level One | Ballroom A

11:50am CDT

⚡ Lightning Talk: Summarizing the Noise: LLM Observability With Open Data Hub, VLLM, KServe and Prometheus - Twinkll Sisodia, Red Hat
Thursday May 21, 2026 11:50am - 12:00pm CDT
As large language models (LLMs) move into production, raw metrics alone aren’t enough. This talk presents an open-source AI observability solution built on Open Data Hub (ODH) that deploys LLMs using vLLM and KServe, scrapes inference metrics using Prometheus, and feeds them into a summarization model to generate actionable insights. We’ll demonstrate a working UI that translates low-level metrics like latency, GPU usage, and token throughput into human-readable summaries—giving platform teams an intelligent way to monitor LLMs at scale. No dashboards to interpret—just straight answers from your models about your models.
Speakers
Twinkll Sisodia

Senior Software Engineer, OpenShift AI (Red Hat), Red Hat
Twinkll Sisodia is a Senior Software Engineer at Red Hat, building scalable, production-ready GenAI solutions. She works with partners to integrate their technology into OpenShift AI and contributes to open source in AI observability, platform optimization, and sustainability. Her... Read More →
Level One | Ballroom B
  AI and MCP in Observability

11:50am CDT

⚡ Lightning Talk: Beyond Billions: Operating Thanos, Prometheus & OpenTelemetry at Trillion-Scale - Narendra Sanikommu, Nvidia
Thursday May 21, 2026 11:50am - 12:00pm CDT
Operating a metrics system beyond billions of data points introduces failure modes that don't exist at smaller deployments. This lightning talk shares battle-tested lessons from running Thanos, Prometheus, and OpenTelemetry in production across distributed Kubernetes environments, focusing on three critical challenges: implementing multi-tenancy without noisy neighbor problems, building rate limiting that prevents a single tenant from destabilizing the cluster, and isolating query workloads so expensive queries don't starve metric ingestion.

The talk walks through real incidents where these challenges caused production impact, including 5xx errors on Thanos Receivers from unbounded queries, Prometheus remote write lag and partial query results from overwhelmed Store Gateways. For each problem, the talk presents custom solutions developed—including tenant-aware rate limiting middleware and workload isolation patterns—and shares concrete configuration approaches that attendees can apply to their own deployments.
Attendees will leave with actionable techniques for scaling their observability infrastructure to trillion-scale while maintaining reliability under load.
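For readers unfamiliar with tenant-aware rate limiting, here is a minimal sketch of the general idea using one token bucket per tenant. It is not the speaker's middleware and not a Thanos or Prometheus API; the class names, the `rate` and `capacity` parameters, and the explicit `now` argument are all assumptions made for illustration.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so short bursts are allowed
        self.updated = now

    def allow(self, now, cost=1.0):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """Keeps an independent bucket per tenant so one tenant cannot drain another's budget."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, tenant, now, cost=1.0):
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[tenant] = bucket
        return bucket.allow(now, cost)
```

In a real deployment the tenant would typically come from a request header and `now` from a monotonic clock. Passing the time in explicitly just keeps the sketch deterministic.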
Speakers
Narendra Sanikommu

Senior Software Engineer, Nvidia
Experienced software engineer who is passionate about solving complex software engineering challenges. With around 14 years of experience in software engineering – has a strong foundation in building and optimizing high-performance systems particularly in Observability, Big Data... Read More →
Level One | Ballroom A
  Scalability Challenges and Solutions
  • Content Experience Level Any

12:05pm CDT

⚡ Lightning Talk: From Collector To Terminal: A Better Way To See Your OpenTelemetry Logs - Jon Reeve, ControlTheory
Thursday May 21, 2026 12:05pm - 12:15pm CDT
The OpenTelemetry Collector is powerful, but the "debug exporter" only shows raw output. What if you could see your OpenTelemetry logs - with structure, filters, and context - right in your terminal?

This talk introduces Gonzo, an open-source, OTLP-native terminal UI that visualizes logs from the Collector or any OTLP-capable source in real time. Learn how to validate both source instrumentation and Collector pipelines - including components like filelog, k8sattributes, and transform - without a backend.

Whether debugging, testing configs, or teaching OTel, Gonzo offers a faster, clearer way to understand your telemetry as it flows.

Key Takeaways:
- Validate source instrumentation and Collector pipelines end-to-end
- See enriched OTel logs with structure and context in the terminal
- Debug and iterate on OTel configs faster - no backend required
Speakers
Jon Reeve

CPO and Co-founder, ControlTheory
Jonathan Reeve is a co-founder of ControlTheory, where he helps teams take control of their observability data with smarter, more efficient telemetry pipelines. A passionate advocate for OpenTelemetry and open standards, Jonathan focuses on making observability more scalable, cost-effective... Read More →
Level One | Ballroom B

12:15pm CDT

Lunch
Thursday May 21, 2026 12:15pm - 1:15pm CDT
Menu: 
MinneSalad: Romaine, Baby Lettuce Greens, Purple Cabbage, Carrot Shreds, Honey-Clover Gouda, Sweet and Spicy Pepitas, Cucumber, Shredded Daikon, Red Peppers, Blueberry Balsamic Vinaigrette (vg, gf)

Sautéed Beef Tips, Wild Rice, Carrots, Celery, Onions, Mushrooms, Topped with Cheddar Cheese and Crispy Tater Tots
Wild Rice Hot Dish Plant-Based Ground Beef, Wild Rice, Carrots, Celery, Onions, Mushrooms (vg)
Wild Rice Cakes with Roasted Red Pepper Sauce, Roasted Brussels Sprout Medley (ve)

Homemade Dinner Rolls
Assorted Miniature Bundt Cakes
Level One | Ballroom A

1:15pm CDT

Panel: Telemetry That Matters - Diana Todea, VictoriaMetrics; Antonio Jimenez Martinez, Cisco ThousandEyes; Laura Luttmer, Dynatrace
Thursday May 21, 2026 1:15pm - 1:50pm CDT
Instrumentation has never been easier, but are we truly gaining clarity? As data volumes rise, dashboards multiply, and observability costs increase, developers may feel less insight and more friction. Are we collecting telemetry with purpose or just because we can? What problem is this data meant to solve?
This panel brings together practitioners across open standards, developer experience, and real-world reliability engineering. The discussion will examine how zero-code instrumentation affects workflows and system understanding, how meaningful telemetry improves day-to-day engineering work, and why unfiltered or unstructured data often has the opposite effect. The conversation will cover practical lessons for filtering, dropping, reducing, and shaping telemetry so teams maintain visibility without unnecessary volume or cost. Finally, we explore scaling observability across fleets of collectors with an OpAMP server, ensuring consistent signal delivery and manageability as telemetry grows.
At the center is a guiding question: What is the purpose of the telemetry we collect and how do we ensure it remains aligned with developer needs, operational requirements, and system reliability?
Speakers
Antonio Jimenez Martinez

Tech Lead Software Engineer, Cisco ThousandEyes
I am a Tech Lead Software Engineer at Cisco ThousandEyes, specializing in observability to ensure our customers can effectively monitor their products. My recent work involves using OpenTelemetry to stream telemetry data, enhancing network visibility and performance for our clients... Read More →
Diana Todea

Developer Experience Engineer, VictoriaMetrics
Diana is a Developer Experience Engineer at VictoriaMetrics. She has worked as a Senior Site Reliability Engineer focused on Observability. She is an active member of the OpenTelemetry CNCF open source project, co-organizer of Cloud Native Days Romania, co-lead of neurodiversity working... Read More →
Laura Luttmer

Sr. Product Manager, Bindplane (Dynatrace)
I am a Product Manager at Bindplane based in Albuquerque, New Mexico. With over 10 years of product experience spanning SaaS, legal, and data platforms, I focus on OpenTelemetry-native pipeline solutions, AI-powered telemetry intelligence, and helping customers get more out of their... Read More →
Level One | Ballroom A

1:25pm CDT

Unified End-to-End Observability: How Comcast Generates SpanMetrics at Enterprise Scale - Raghu Vamshi Challa, Comcast
Thursday May 21, 2026 1:25pm - 1:50pm CDT
Enterprises often struggle with the "black box" nature of proprietary APM tools and the high cost of distributed tracing at scale. In this session, we will demonstrate how Comcast tackled this challenge by migrating 350 critical applications from AppDynamics to a cloud-native OpenTelemetry (OTel) stack, achieving a truly unified end-to-end observability experience.

We will pull back the curtain on the architecture that powers this migration. Specifically, we will show how we leveraged the OpenTelemetry Collector to generate Request, Error, and Duration (R.E.D.) metrics from trace data using the SpanMetrics connector. A key highlight will be our unique deployment of Conduit, which serves as a resilient transport layer to ensure data integrity and effective load balancing in a high-volume environment.

Attendees will leave with a blueprint for breaking free from APM vendor lock-in. To help the community fast-track this transition, we will also be sharing and walking through our reusable, battle-tested Grafana dashboards that can be leveraged by any enterprise.
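To make the idea of deriving R.E.D. metrics from trace data concrete, the sketch below aggregates a list of finished spans into request, error, and duration totals per service and operation. It only illustrates the aggregation concept. The SpanMetrics connector itself is configured in the OpenTelemetry Collector, and the dictionary field names used here (`service`, `name`, `status`, `duration_ms`) are assumptions for the example, not the connector's schema or Comcast's pipeline.

```python
from collections import defaultdict

def red_metrics(spans):
    """Aggregate spans into Request, Error, and Duration totals.

    Each span is assumed to be a dict with hypothetical keys:
    'service', 'name', 'status' ('OK' or 'ERROR'), and 'duration_ms'.
    """
    totals = defaultdict(
        lambda: {"requests": 0, "errors": 0, "duration_ms_sum": 0.0}
    )
    for span in spans:
        key = (span["service"], span["name"])
        entry = totals[key]
        entry["requests"] += 1                       # Rate: one request per span
        if span["status"] == "ERROR":
            entry["errors"] += 1                     # Errors: spans with error status
        entry["duration_ms_sum"] += span["duration_ms"]  # Duration: summed latency
    return dict(totals)
```

Dividing `duration_ms_sum` by `requests` gives a mean latency per operation. A production pipeline would more likely emit a latency histogram so that percentiles can be computed downstream.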
Speakers
Raghu Challa

Comcast Engineer 6, Software Development & Engineering - Backend Engineering, Comcast
Raghu is an Observability Lead at Comcast, driving the enterprise-wide migration from legacy APM tools to OpenTelemetry. He specializes in designing high-scale telemetry pipelines that process massive volumes of trace data. Raghu is passionate about democratizing observability and... Read More →
Level One | Ballroom B
  End-User Case Studies

1:55pm CDT

Policy as Code Meets OpenTelemetry: The Next Frontier of Observability - Christopher Voisey, EnforceAuth
Thursday May 21, 2026 1:55pm - 2:20pm CDT
Modern observability stacks excel at capturing signals about infrastructure health, application performance, and request flows. Yet one critical class of decisions remains largely invisible: authorization.
In distributed systems, authorization decisions increasingly determine not only whether an action succeeds, but also whether data is accessed, tools are invoked, or automated agents are allowed to act. These decisions are often evaluated outside application code using Policy as Code frameworks, yet their outcomes are rarely observable in a structured, privacy-preserving way.
In this session, we explore how Policy as Code, Open Policy Agent, and the OpenTelemetry project can be combined to treat authorization decisions as observable events. We examine what it means to observe a decision without logging sensitive inputs, how decision structure differs from traditional metrics and traces, and why decision-level observability is becoming essential in cloud-native and AI-driven systems.
Attendees will leave with a conceptual framework for thinking about authorization as telemetry, and a clearer understanding of where observability is heading as systems become more autonomous and policy driven.
Speakers
Christopher Voisey

Field CTO, EnforceAuth
Chris is a technology leader with 20+ years of experience designing and delivering secure, cloud-native systems. He has led engineering and solutions teams across startups and enterprises, helping organizations adopt policy-as-code, zero-trust architectures, and modern observability... Read More →
Level One | Ballroom B
  CNCF Observability Projects

1:55pm CDT

Taming Tenancy, Cost and Architecture at Collibra Through OpenTelemetry and Our Telemetry Backbone - Alex Van Boxel, Collibra
Thursday May 21, 2026 1:55pm - 2:20pm CDT
Operating a SaaS platform presents the same observability problems as any other enterprise, but scale and tenancy act as a huge multiplier on the observability signals, affecting both cost and effectiveness.

This session dives into the techniques Collibra used to tame these problems and how to maintain clarity when infrastructure spans virtual machines, modern Kubernetes clusters, and a complex mix of single- and multi-tenant architectures. Without the right context, telemetry data becomes a noisy, indistinguishable flood.

We will dive into the architectural decision to leverage the C4 system model, ensuring every piece of telemetry carries the vital context of what it belongs to and where it sits in the hierarchy. This context enables both signal attribution and virtual chargebacks. The presentation details the implementation of a pipeline using custom-built OpenTelemetry collectors designed to handle the data and enrich it before sending it to the appropriate backends.

This session will give you practical insights on the challenges SaaS platforms have, but the techniques that are used to tame them can be applied everywhere.
Speakers
Alex Van Boxel

Principal System Architect, Collibra
Alex Van Boxel is a Principal System Architect at Collibra. With an engineering background in R&D at Alcatel-Lucent, Progress Software, and Veepee, he loves to focus on the fundamental building blocks of the software industry. That means reading, understanding, and contributing to... Read More →
Level One | Ballroom A
  End-User Case Studies
  • Content Experience Level Any

2:25pm CDT

What's the Best Way To Reduce Storage Requirements Without Losing Insights? Push AI To the Edge! - Alex Degitz, ElastiFlow Inc
Thursday May 21, 2026 2:25pm - 2:50pm CDT
During this session we’ll discuss ElastiFlow’s Edge Observability strategy, which includes an OTel native edge processing node with local DuckDB storage for all OTel signals and an agentic AI system that is model agnostic (we often run it with OpenAI’s gpt-oss-20b), exposing its tools through an MCP server.

Instead of just forwarding OTel signals from various Edge collectors, the signals are analyzed and routing decisions are made. Alerts are sent to the Observability Platform right away, while logs are stored locally and analyzed for patterns. Instead of forwarding all logs, we might only care about a few conditions of interest, often correlated with other signals, and send these to the Observability Platform, while less interesting logs can be aggressively aggregated.

With this approach, we were able to reduce the storage and ingest cost of Observability Platforms by half while actually decreasing the mean time to insight.
Speakers
avatar for Alex Degitz

Alex Degitz

VP of Product, ElastiFlow Inc
Alex has been building Automation and Observability products for 10+ years and has been advocating to break down silos between operations teams ever since.
Level One | Ballroom B
  AI and MCP in Observability

2:25pm CDT

The Speed of Metrics, the Fidelity of Traces: Architecting Post-Collection Aggregation - Zack Owens, New Relic
Thursday May 21, 2026 2:25pm - 2:50pm CDT
As organizations adopt observability practices, they face a scalability paradox: systems now generate petabytes of traces and logs, but querying this raw telemetry over long time horizons becomes prohibitively slow and expensive due to the data volume.

The standard solution of pre-aggregating high-cardinality telemetry into metrics at collection time through features in the OpenTelemetry collector works well for known patterns but fails when engineers need to ask new questions about historical data. This creates an uncomfortable choice for engineers and operators: fast dashboards with pre-aggregated metrics, or high-fidelity traces and logs that become unusable beyond short time windows.

This talk presents a post-collection aggregation approach that enables fast queries over long time periods of detailed telemetry without changes to collector-side configuration. This session explores techniques for incremental view materialization that work with timeseries data. Attendees will leave with concrete architectural patterns which are applicable to open source databases like ClickHouse or OpenSearch to answer novel questions without sacrificing query speed or data fidelity.
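
As a rough illustration of what incremental view materialization over timeseries data can look like, the sketch below keeps a per-bucket rollup that is updated as new rows arrive instead of being recomputed from raw data. It is a generic example, not the approach presented in the talk: the `IncrementalRollup` class, the `(timestamp, service, duration_ms)` row shape, and the 60-second bucket are assumptions.

```python
class IncrementalRollup:
    """Maintains count and sum of durations per (time bucket, service)."""

    def __init__(self, bucket_seconds: int):
        self.bucket_seconds = bucket_seconds
        self.view = {}  # (bucket_start, service) -> [count, total_duration_ms]

    def ingest(self, rows):
        # Each row is (unix_timestamp, service, duration_ms); only new rows are
        # touched, so the cost of an update is independent of history size.
        for ts, service, duration_ms in rows:
            bucket = ts - ts % self.bucket_seconds
            cell = self.view.setdefault((bucket, service), [0, 0.0])
            cell[0] += 1
            cell[1] += duration_ms

    def mean(self, bucket: int, service: str):
        # Queries read the small materialized view, not the raw telemetry.
        cell = self.view.get((bucket, service))
        if cell is None:
            return None
        return cell[1] / cell[0]


rollup = IncrementalRollup(bucket_seconds=60)
rollup.ingest([(0, "api", 10.0), (30, "api", 30.0), (61, "api", 5.0)])
rollup.ingest([(59, "api", 50.0)])  # a later batch updates the existing bucket
```

Because count and sum are both kept, averages stay exact when later batches land in an existing bucket.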
Speakers
avatar for Zack Owens

Zack Owens

Principal Software Engineer, New Relic
Zack Owens is a Principal Engineer and Architect at New Relic, focusing on the data platform and NRDB, a purpose-built timeseries database for observability.
Level One | Ballroom A
  Scalability Challenges and Solutions
  • Content Experience Level Any

2:55pm CDT

One Pane to Rule Them All: Uniting the Prometheus Community with OpenSearch Dashboards, Logs, and Trace - Anirudha Jadhav and Kevin Fallis, AWS
Thursday May 21, 2026 2:55pm - 3:20pm CDT
As infrastructure scales across regions and clusters, Prometheus deployments fragment into isolated islands of metrics—disconnected from logs, traces, and the dashboards operators actually live in.

This talk is for the Prometheus community. If you've wrestled with federation sprawl, alert duplication, or the gap between your metrics and the rest of your observability story, this session is for you.

We'll demonstrate how OpenSearch's distributed data source support lets multiple Prometheus clusters coexist natively alongside logs and traces in a single unified interface, with no data migration and no parallel stacks.

You'll learn:
  • Unified querying across Prometheus clusters
  • SLO tracking wired directly into dashboards
  • Application management that finally connects the signals your teams have been operating in isolation
This is about completing the observability loop the Prometheus community has always needed: open, composable, and community-driven.
Speakers
avatar for Kevin Fallis

Kevin Fallis

Principal Senior Solutions Architect, Amazon Web Services
Kevin Fallis is a seasoned leader, architect, and developer with experience across many industry verticals and disciplines such as agriculture, ad tech, financial services, networking, security, telecommunications and of course search technologies. His passion helps others leverage...
avatar for Anirudha Jadhav

Anirudha Jadhav

Sr. Engineering Leader, Amazon Web Services
Anirudha is a Senior Manager, Software Development at Amazon Web Services (AWS), leading development of insight engines and visualization platforms for the OpenSearch Project. He specializes in distributed systems, data analytics, and search technologies, including architecting one...
Level One | Ballroom B
  AI and MCP in Observability

2:55pm CDT

When the Cloud Fails: Debugging the "Undocumented" - Dhruv Jain, Gojek (GoTo Group) Indonesia
Thursday May 21, 2026 2:55pm - 3:20pm CDT
What happens when a system degrades under high load while all internal metrics remain “green”? At hyperscale, supporting on-demand services across Southeast Asia’s most populous countries, a team observed up to a 7% drop in message delivery. The root cause was not application code, messaging brokers, or load balancers, but a hidden limitation deep within a cloud provider’s firewall.

This war-story session presents a forensic investigation into a managed cloud load balancer and its interaction with connection-tracking tables. The talk walks through the production cutover that triggered the issue and the targeted load testing that ultimately isolated the failure to cloud infrastructure behavior invisible to standard monitoring.

Beyond root cause analysis, the session focuses on outcomes: how sustained, evidence-based debugging led the cloud provider to acknowledge the issue—initially labeled a “limitation”—and introduce a new observability metric, firewall/connections_tracked. Attendees will leave with a practical framework for debugging black-box cloud failures and identifying the node-level metrics needed to detect silent network drops before they impact users.
Speakers
avatar for Dhruv Jain

Dhruv Jain

Lead Software Engineer, Gojek (GoTo Group)
Dhruv Jain is a Lead Software Engineer at Gojek, where he focuses on building and scaling MQTT infrastructure that handles millions of concurrent connections across Southeast Asia. Beyond his work at Gojek, he is an active contributor to the open-source community and Google Summer...
Level One | Ballroom A
  End-User Case Studies

3:20pm CDT

Coffee + Networking Break
Thursday May 21, 2026 3:20pm - 3:40pm CDT
Menu:
Rice Crispy Bars (GF) 
Potato Chips (GF, Vg) and French Onion Dip (v, GF) 

Level One | Ballroom A+B Foyer

3:40pm CDT

From Data Dumps To Smart Context: Building MCP Servers That AI Can Actually Use - Thomas Johnson, Multiplayer
Thursday May 21, 2026 3:40pm - 4:05pm CDT
Most MCP servers fail the same way: they expose observability data without understanding what AI models need to reason effectively. The result? Tools that overwhelm models with metrics, miss critical context, and introduce unnecessary security exposure.

At Multiplayer, we built an MCP server to give AI coding assistants access not just to production telemetry but to full stack data: frontend screens and data, backend traces, logs, and request/response content and headers. What we learned challenges the "more data is better" assumption that drives most integrations.

This talk shares the hard lessons from moving an MCP server into production. You'll learn why filtered, intent-driven context outperforms comprehensive data access, how to design tools that align with developer workflows rather than API surfaces, and the security trade-offs that matter when LLMs query your observability stack.

We'll cover practical design patterns for MCP servers in the observability space: scoping data by blast radius, surfacing relationships over raw metrics, and handling authentication without compromising developer experience. This talk is about what works when AI meets production systems.
Speakers
avatar for Thomas Johnson

Thomas Johnson

CTO and Co-founder, Multiplayer
Co-founder and CTO at Multiplayer, with 20+ years of experience as a backend developer building large-scale distributed software (and robots!)
Level One | Ballroom B
  AI and MCP in Observability
  • Content Experience Level Any

3:40pm CDT

The Full Picture: Visualizing Service "Fullness" To Rethink Saturation Prevention - Tal Nordan, Independent
Thursday May 21, 2026 3:40pm - 4:05pm CDT
Saturation has long been the stepchild of "the Four Golden Signals of SRE". While latency, traffic, and errors are directly measurable through metrics like P99, RPS, and 5xx rates, monitoring just how "full" a service is relies on indirect symptoms such as CPU usage or queue depth. Yet saturation should ideally be the first signal to alert on, because once it is reached, the other signals (latency and errors) spike fast.

The inability to directly observe and mitigate saturation drives excessive safety margins, chronically low CPU utilization, and massive compute waste in latency-sensitive and customer-facing systems. This session introduces an open-source approach that extends Envoy proxy, integrated seamlessly through eBPF and Cilium, to provide direct observability into service saturation by comparing each instance's live number of concurrent requests to its true concurrency limit. We then explore how such direct visualization of saturation can help reduce MTTR and minimize waste.
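
The comparison at the heart of this idea is a simple ratio. The sketch below is a minimal illustration, not the project's code: the `fullness` and `fullest_instance` helpers and the sample instance data are hypothetical, and how the concurrency limit is determined is outside the scope of the example.

```python
def fullness(concurrent_requests: int, concurrency_limit: int) -> float:
    """Return how full an instance is, where 1.0 means the limit is reached."""
    if concurrency_limit <= 0:
        raise ValueError("concurrency_limit must be positive")
    return concurrent_requests / concurrency_limit


def fullest_instance(instances: dict) -> str:
    """Return the name of the instance closest to its concurrency limit."""
    return max(instances, key=lambda name: fullness(*instances[name]))


# Hypothetical snapshot: instance name -> (live concurrent requests, limit).
snapshot = {"checkout-1": (12, 16), "checkout-2": (4, 16), "checkout-3": (15, 20)}
for name, (in_flight, limit) in snapshot.items():
    print(f"{name}: {fullness(in_flight, limit):.0%} full")
```

Unlike CPU usage or queue depth, this ratio reads directly as remaining headroom per instance.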
Speakers
avatar for Tal Nordan

Tal Nordan

Software Engineer, Independent
An early contributor to the Envoy proxy project, now working on developing tools to detect and mitigate inefficiencies in the way services interact with each other. Over the years Tal has been a founding engineer and a contractor working on a wide variety of cloud-native data-plane...
Level One | Ballroom A

4:10pm CDT

Why Are Your AI’s Decisions Hard To Explain: Trace Every Decision With Agentic AI Observability - Dhiraj Kumar Jain & Vikash Agrawal, Amazon Web Services
Thursday May 21, 2026 4:10pm - 4:35pm CDT
Agentic AI systems represent a fundamental shift in software architecture: autonomous agents reason, plan, invoke tools, and orchestrate complex workflows without deterministic control flow. This breaks many assumptions behind traditional observability.

When agents independently make decisions, failures no longer follow a single request path. How do you debug emergent behavior across multiple agent steps? How do you analyze and control token-driven costs? How do you ensure reliability when outputs are non-deterministic?

This session explores why observability is a first-class requirement in the agentic AI era and how OpenSearch can act as the analytical backbone for understanding autonomous AI systems in production. We will cover practical techniques for instrumenting agent workflows with OpenTelemetry and indexing traces, logs, metrics, and AI decision artifacts into OpenSearch for deep correlation and analysis.

Attendees will learn battle-tested patterns for tracing agent reasoning and tool usage, investigating failures and hallucinations, monitoring latency and cost signals, and building dashboards that make agentic AI systems transparent, debuggable, and production-ready.
Speakers
avatar for Dhiraj Kumar Jain

Dhiraj Kumar Jain

Sr. Software Engineer, AWS
Dhiraj is a software engineer at Amazon Web Services (AWS), where he’s working on building a next-gen log analytics platform with CloudWatch Logs, helping scale it to handle vast amounts of data. Before this, he worked on Amazon AuroraDB.

A distributed systems enthusiast, Dhiraj loves diving into complex, large-scale problems and building software for the next billion users. When he’s not scaling systems, you’ll find him at tech meetups and hackathons...
avatar for Vikash Agrawal

Vikash Agrawal

Vikash Agrawal, Amazon Web Services
Vikash Agrawal is a Software Development Manager at Amazon Web Services (AWS), leading initiatives in the AWS CloudWatch team. Previously, he played a key role in developing Amazon Q Developer, a Generative AI-powered assistant for developers. With over a decade of experience in software...
Level One | Ballroom B
  AI and MCP in Observability

4:10pm CDT

Secure by Design: Rethinking Test Credentials for Synthetic Monitoring - Katie Kodes, Katie Kodes
Thursday May 21, 2026 4:10pm - 4:35pm CDT
Synthetic monitoring and end-to-end testing often require dangerous levels of access to production systems. Last summer, I nearly emailed my bank details to a team I was training on new testing tools. If I hadn't caught that mistake, I probably would have dumped them into an OTel collector too.

This session explores the security implications of common testing practices, and presents practical alternatives that maintain observability without compromising security.

Attendees will learn authentication and authorization patterns to improve test security across the software development lifecycle.

Implementing mitigations like health check endpoints, synthetic data, and privilege separation spans the full stack of infrastructure, development, monitoring, and governance. Attendees will leave with a shared vocabulary they can use to align business, development, security, and observability teams on safer test traffic in production.
Speakers
avatar for Katie Kodes

Katie Kodes

DevOps Architect
Katie is a DevOps architect who brings clarity to complex technical challenges across the entire stack. With experience ranging from infrastructure to front-end development, she helps teams build reliable, observable systems that deliver real business value. A passionate educator...
Level One | Ballroom A

4:40pm CDT

Closing Remarks
Thursday May 21, 2026 4:40pm - 4:45pm CDT

Level One | Ballroom A

4:45pm CDT

Evening Reception
Thursday May 21, 2026 4:45pm - 5:45pm CDT
Join us onsite for drinks and appetizers with fellow attendees.

Menu:
Gourmet Cheese Platter (v)
Fresh Vegetable Crudités Platter - Spinach Dip + Hummus (v) 
Wild Rice Cakes (vg, gf) with Red Pepper Sauce
Filo Tartlet - Sundried Tomato-Chicken
Level One | Ballroom A+B Foyer
 