May 21-22, 2026

The Sched app allows you to build your schedule, but is not a substitute for your event registration. You must be registered for Observability Summit North America 2026.

Please note: This schedule is automatically displayed in Central Daylight Time (UTC -5). To see the schedule in your preferred timezone, select from the drop-down menu located at the bottom of the menu to the right.

The schedule is subject to change.
Friday, May 22
 

9:20am CDT

Keynote: Tracing the Agent's Mind: Extending OpenTelemetry for Deep MCP Inspection - Mustafa Dayıoğlu, TUBITAK & Zeyno Dodd, Conjectura R&D
Friday May 22, 2026 9:20am - 9:45am CDT
Production AI agents make thousands of tool-calling decisions daily, yet observability stops at the model boundary. OpenTelemetry's GenAI semantic conventions capture token counts and latencies—what the LLM processed—but not why an agent selected a specific tool. Research (McKenzie et al., 2023) demonstrates inverse scaling: more capable models exhibit unpredictable tool selection patterns. This gap leaves engineers guessing during critical production failures.

We present gen-ai-otel, an open-source OpenTelemetry extension introducing decision-level telemetry for MCP agents. A new attribute namespace (gen_ai.agent.*) captures tool selection confidence, session context, permission scope validation, and baseline deviations. The zero-sidecar architecture routes telemetry through standard Collector pipelines to existing backends—Jaeger, Prometheus, or graph databases—with low overhead and cardinality-aware attributes.

A live demo reconstructs an agent's decision chain, revealing anomalies invisible to token metrics—reducing decision-debugging time. Attendees leave with: 1) Collector configs, 2) Grafana dashboards for confidence tracking, 3) demo code and repo—all Apache 2.0 licensed.
Speakers

Mustafa Dayıoğlu

Senior Chief Researcher, TUBITAK (THE SCIENTIFIC AND TECHNOLOGICAL RESEARCH COUNCIL OF TÜRKİYE)
Mustafa Dayıoğlu (PhD, ITU) is a security architect with 25 years of experience in cybersecurity at TÜBİTAK, designing large-scale security systems serving 80 million citizens for regulated environments. He specializes in threat modeling and protocol development for AI agent systems...

Zeyno Dodd

R&D Solution Architect, Conjectura R&D
R&D Architect with 25+ years building distributed systems and leading open research collaborations. Principal collaborator on SFAMDF and GraphSentinel, open initiatives exploring proactive, federated security patterns for MCP‑based agentic AI systems. Research interests include...
Level One | Ballroom A
  Keynote Sessions

10:20am CDT

Observing the Observers: Bringing OpenTelemetry to Autonomous AI Agents - Abdel Fane, OpenA2A
Friday May 22, 2026 10:20am - 10:45am CDT
Traditional observability assumes humans operate systems. AI agents break that model—they make autonomous decisions, execute operations without approval, and drift in capability over time. Yet most organizations have zero observability into their AI agent infrastructure.

When developers spin up MCP servers through Claude Desktop or Cursor, security and ops teams are blind. No metrics. No traces. No logs. Just autonomous agents accessing databases, calling APIs, and modifying production systems—completely outside your observability stack.

This talk explores how to instrument AI agents and MCP servers using familiar CNCF tools. We'll cover:

• Why traditional APM fails for autonomous agents (no request/response, emergent behavior, capability drift)
• Detecting anomalies in agent behavior (statistical baselines vs. ML-driven detection)
• Correlating agent actions to business outcomes

You'll see working demos of agent observability plus open-source code for instrumenting LangChain, CrewAI, and custom agents.

Walk away with patterns to extend your existing observability stack to AI agents before they become your biggest blind spot.
Speakers

Abdel Fane

CEO & Founder, OpenA2A
Abdel is a cybersecurity architect with 17+ years of experience securing enterprise environments across healthcare, finance, and government sectors. He has led security initiatives at Grail, Booz Allen Hamilton, Protiviti, and Allstate, specializing in cloud security & DevSecOps.
...
Level One | Ballroom B
  AI and MCP in Observability

11:20am CDT

[CANCELLATION] AI Training in Emerging Economies: Building Africa's Largest LLM From the Ground Up - Okikiola Oliyide, Awarri
Friday May 22, 2026 11:20am - 11:45am CDT
N-ATLaS is a multilingual African-language LLM we took from research to production on Kubernetes. This talk shows the end-to-end path we used to make it reproducible, observable, and affordable: data and fine-tuning pipelines (artifacts, seeds, checkpoints), Argo-orchestrated training on mixed GPU pools, and a serving stack with Triton + KServe tuned for real traffic. I’ll walk through SRE guardrails that mattered for N-ATLaS (SLOs, golden signals, error budgets), supply-chain hygiene (image signing, provenance, model versioning), and the levers that cut cost-per-token while improving latency and uptime under pre-emptions. We’ll cover autoscaling, caching, model rollout strategies, and incident playbooks, plus what we’d change after thousands of downloads and weeks of live usage. Expect hard-learned patterns, YAML you can run, and a plain-English checklist you can lift into your own cluster, whether you’re serving English or a low-resource language model.
Speakers

Okikiola Oliyide

Lead DevOps Engineer, Awarri
Okikiola Oliyide is Lead Cloud DevOps Engineer at Awarri Technology, where he designs and operates large-scale Kubernetes platforms powering Africa’s largest LLM initiative. With 5+ years across AWS, GCP, and on-prem, he specialises in CI/CD, observability, and cost-efficient GPU...
Level One | Ballroom A
  CNCF Observability Projects

12:05pm CDT

⚡ Lightning Talk: Observability Debt: When Telemetry Stops Telling the Truth - Spoorthi Palakshaiah, Relevance Lab
Friday May 22, 2026 12:05pm - 12:15pm CDT
This talk introduces observability debt as an operational issue that develops over time in evolving systems. Teams often instrument services early using observability frameworks, define metrics, dashboards, alerts, and SLOs, and initially gain confidence in their ability to understand system behavior. However, production systems rarely remain static. As systems evolve through refactoring, scaling, architectural changes, asynchronous processing, and organizational shifts, observability artifacts frequently remain unchanged, creating a mismatch between what telemetry is assumed to represent and how the system actually behaves. This mismatch, referred to as observability debt, does not result from missing data but from telemetry whose meaning has drifted due to unmaintained assumptions, leading to dashboards that appear healthy, alerts that lack context, and slower incident understanding. To make this concrete, the talk uses a minimal personal system intentionally designed to model common production patterns. Starting from a low-debt state where telemetry reflects user impact, the system evolves while observability remains static, resulting in metrics that hide localized failures.
Speakers

Spoorthi Palakshaiah

DevOps Engineer, Relevance Lab
Spoorthi is a DevOps engineer with experience designing, building, and optimizing cloud infrastructure. She works extensively with Kubernetes, infrastructure as code, CI/CD pipelines, and open source observability tools to improve system reliability, scalability, and operational efficiency...
Level One | Ballroom A

12:15pm CDT

[Rescheduled] ⚡ Lightning Talk: GPU-Scanner: Extending CNCF Observability for Multi-GPU AI Workloads - Ritika Gupta, Oracle
Friday May 22, 2026 12:15pm - 12:25pm CDT
As large language models scale across hundreds of GPUs and multi-node AI systems, they’ve become a major operational challenge for infrastructure engineers. Traditional observability tools stop at the node level, leaving GPU health and utilization invisible until workloads fail or budgets spike. Imagine a 25-day training job failing on day 23 because one GPU silently throttled!

In this session, we’ll explore GPU-Scanner, an open-source observability extension for Kubernetes GPU clusters. Built to integrate with Prometheus & OpenTelemetry, GPU-Scanner adds both active and passive GPU health checks, capturing throughput, TFLOPs, memory diagnostics, thermal consistency, and long-run stability metrics.

We’ll demo real-world failure modes, like catching a GPU “off the bus” or detecting thermal throttling, and show how alerts flow into your existing observability stack. Leave with a practical playbook to proactively validate GPU clusters and maximize reliability and utilization.
Speakers

Ritika Gupta

Senior SWE - AI Incubations, Oracle
With a knack for transforming chaos into seamless solutions, Ritika Gupta creates technologies that bind the Kubernetes, container, and cloud ecosystems using cloud native tooling. She actively contributes to Kubernetes as a sig-windows member. Her expertise spans container orchestration...
Level One | Ballroom B

1:25pm CDT

Beyond Dashboards: Architecting AI Agents for Autonomous Observability - Divya Mahajan, Amazon & Achin Gupta, Intuit
Friday May 22, 2026 1:25pm - 1:50pm CDT
The future of observability isn't better dashboards—it's AI agents that reason across metrics, logs, and traces alongside your engineering team.

Engineers spend hours correlating signals across Grafana, Kibana, and Jaeger, mentally stitching together what happened and why. What if an agent could do that correlation automatically?

This session presents a practical architecture for building observability agents that autonomously triage incidents across all three pillars. We'll demonstrate an agent that ingests an alert, queries metrics, searches logs, examines traces, identifies root causes, and recommends remediation, while keeping humans in the loop.

We'll cover:

• Why observability is ideal for agentic AI
• Agent architecture with LangGraph orchestration
• Integration patterns: MCP, REST APIs, and OpenTelemetry
• Tool design for metrics, logs, and traces
• Live demo: agent triaging a simulated incident
• Production considerations: reliability, cost, guardrails
Attendees leave with a working reference architecture built on CNCF ecosystem tools (Prometheus, Jaeger, Loki, Grafana). All code is open source.
Speakers

Divya Mahajan

Software Engineer, Amazon

Divya Mahajan is a Software Development Engineer at Amazon Alexa, where she builds production-grade Agentic AI and LLM systems at scale. Her work sits at the intersection of conversational AI, agentic automation, and reliable system design, with a focus on accuracy, observability...

Achin Gupta

Staff Software Engineer, Intuit
Achin Gupta is a Staff Software Engineer with 9 years of experience designing and building production grade distributed observability backends on Kubernetes. He also focuses on AI driven systems, developing LLM powered workflows and multi agent architectures, with an emphasis on observability...
Level One | Ballroom A
  AI and MCP in Observability

1:55pm CDT

Breaking Free from Vendor Lock-In: Nubank DIY Observability Success - Diego Rocha, AWS & Otavio Valadares, Nubank
Friday May 22, 2026 1:55pm - 2:20pm CDT
Nubank is the largest digital bank outside Asia, operating in Brazil, Mexico, and Colombia, and serving over 120 million customers. As a cloud-native company, Nubank's distributed digital environment relies on more than 4,000 microservices, generating nearly 1 petabyte of monitoring logs daily. To better manage this volume and reduce operational costs by over 50%, Nubank recently transitioned from an external vendor to an in-house log platform. In this talk, we'll share the platform architecture and the challenges encountered during the migration journey.
Speakers

Diego Rocha

Sr. Solutions Architect, AWS

Otavio Valadares

Lead Software Engineer, Nubank
Level One | Ballroom B

1:55pm CDT

One Size Does Not Fit All: A Polystore Architecture for Logs and Traces - Suman Karumuri, KalDB
Friday May 22, 2026 1:55pm - 2:20pm CDT
Observability data isn't homogeneous. Security logs require needle-in-haystack searches with multi-year compliance retention. Kernel logs are incompressible text. Structured logs enable fast aggregations, while semi-structured logs explode cardinality. Traces demand different access patterns entirely.

Modern requirements compound this. Observability must join with other data sources. Agentic AI systems generate massive volumes of unstructured and semi-structured logs and traces. Big data platforms have emerged as popular storage alternatives.

Forcing everything into one system creates impossible tradeoffs: slow queries, runaway costs, frustrated users.

At Airbnb and Slack, operating thousands of tenants across hundreds of clusters, we built a polystore architecture routing workloads to specialized engines, unified behind a single query interface. This required changes across the entire stack: instrumentation, collection, storage, and query layers.

This talk shares routing criteria, backend tradeoffs, and techniques for unified querying. Attendees will learn to optimize observability for better performance and lower costs.
Speakers

Suman Karumuri

CEO, KalDB
Suman Karumuri is Founder and CEO of KalDB and author of KalDB, an open source serverless Lucene platform. He is co-author of the OpenTracing/OpenTelemetry specification and was previously tech lead of Zipkin. Over the past decade, he has built and run petabyte-scale log search, distributed...
Level One | Ballroom A

2:25pm CDT

The Legend of Config: Breath of the Cluster - Henrik Rexed, Dynatrace
Friday May 22, 2026 2:25pm - 2:50pm CDT
Configuring Ingress, Gateway API, or service meshes in Kubernetes can feel like exploring an open world without a map: one wrong turn, and traffic vanishes. In this session, we’ll explore how to detect and prevent misconfigurations using OpenTelemetry, eBPF-based instrumentation (OBI), and enriched logs from service meshes and ingress controllers. Like a hero collecting tools to unlock new areas, we’ll show how to identify relevant data sources, parse and process their output, and apply common correlation rules to understand the impact of configuration changes. We’ll demonstrate how these techniques can be applied across observability platforms to reduce tool sprawl and improve operational efficiency. Attendees will leave with a practical, backend-agnostic approach to building a multi-source observability strategy for Kubernetes networking.
Speakers

Henrik Rexed

Cloud Native advocate & CNCF Ambassador, Dynatrace
Henrik is a Cloud Native Advocate at Dynatrace and a CNCF Ambassador. Prior to Dynatrace, Henrik worked for more than 15 years as a Performance Engineer. He is also one of the organizers of the conferences WOPR and KCD Austria and the owner of the YouTube channel Isit...
Level One | Ballroom B

2:25pm CDT

Implementation of Unified Observability at Scale From Scratch - Ahmed J., Emaar
Friday May 22, 2026 2:25pm - 2:50pm CDT
Unified observability has lately been regarded as the holy grail by some. One platform, universal observability, for everything. Usually, this would be the default, but when you are at a 30-year-old non-technical enterprise, dealing with a mixture of legacy and modern systems, it's a whole different story.

Legacy decisions can leave a company with multiple observability platforms for different teams, adding overhead, cost, noise, and audit complexity. This was the case at Emaar, a property developer based in Dubai, until the PE team took on the exciting project of unifying all observability into one platform. This included applications, infrastructure, network, and security. The complexity arises not just from the different data sources, but also from the number and nature of the deployment sites. This included sites across 10 countries consisting of data centers, hotels, malls, shops, etc.

This talk will outline the experience of implementing a unified observability platform consisting of thousands of network devices, machines, and application workloads using open-source technologies that resulted in six-figure cost savings.
Speakers

Ahmed J.

Platform Engineer, Emaar
Ahmed is a platform engineer with a background in artificial intelligence research and development. He excels at building scalable infrastructure to deploy and manage production-grade applications and models. He co-led the orchestration of modern infrastructure and observability at...
Level One | Ballroom A
  End-User Case Studies

2:55pm CDT

How Observability-First Development Lets You Ship Agents in Weeks, Not Months - Anirudha Jadhav & Kevin Fallis, AWS
Friday May 22, 2026 2:55pm - 3:20pm CDT
Building AI agents is easy, but knowing why they fail is hard. Traditional APM tools were designed for request-response services, not autonomous agents that reason, plan, and execute multi-step workflows. When your agent makes unexpected decisions, standard metrics and traces don't tell you why.

This session introduces Eval-Driven Development, which focuses on building reliable agents through continuous observability and evaluation. Using OpenSearch AgentHealth, a new open-source platform for agent observability, we'll walk you through the full agent lifecycle of building, observing, improving, and repeating. We'll share a case study comparing two production root-cause-analysis agents. One was built with observability from day one and shipped in 6 weeks, while the other was retrofitted later and took 12 months to reach production. You'll learn how we used agentic evaluation to score agent outputs and improve accuracy over time.

You'll walk away with patterns for instrumenting agents with OpenTelemetry, techniques for evaluating full decision sequences (not just outputs), and a framework for shortening your development timeline by building observability in from the start.
Speakers

Anirudha Jadhav

Sr. Engineering Leader, Amazon Web Services
Anirudha is a Senior Manager, Software Development at Amazon Web Services (AWS), leading development of insight engines and visualization platforms for the OpenSearch Project. He specializes in distributed systems, data analytics, and search technologies, including architecting one...

Kevin Fallis

Principal Senior Solutions Architect, Amazon Web Services
Kevin Fallis is a seasoned leader, architect, and developer with experience across many industry verticals and disciplines such as agriculture, ad tech, financial services, networking, security, telecommunications and of course search technologies. His passion helps others leverage...
Level One | Ballroom A
  AI and MCP in Observability

2:55pm CDT

Let Them Eat Bugs: Practical Showcase of Agentic Issue Resolution - May Walter, Hud
Friday May 22, 2026 2:55pm - 3:20pm CDT
What if we could move a big chunk of bug fixing and solving production issues to agentic AI? That would be so cool. In this talk we will go through the end-to-end process of setting up a background agentic workflow that detects production errors, finds their root causes, assesses the right solution and opens a PR - so you wake up in the morning to tasks almost fully completed for you by your loyal agent.

Together we will dive into the entire process of setting up this system that is currently running in real production environments - understanding the different tools, the infra challenges, the agentic accuracy spectrum, and more…
Speakers

May Walter

Co-Founder & CTO, Hud
May Walter is a software engineer, researcher, entrepreneur and serial CTO. She is currently Co-Founder and CTO of Hud, building a Runtime Code Sensor to bridge the gap between coding agents and production. Before Hud she was a founding CTO at Santa, and CTO at Bond (acquired by REEF...
Level One | Ballroom B
  AI and MCP in Observability

3:40pm CDT

Devs, Transform (Your Data) and Roll Out!: Learning and Leveraging OTTL - Reese Lee, New Relic
Friday May 22, 2026 3:40pm - 4:05pm CDT
The OpenTelemetry Collector has emerged as one of the project’s most critical pieces for ingesting and processing your app and infrastructure data, but did you know there’s even more you can do with your data before it reaches your backend?

Enter OTTL, or OpenTelemetry Transformation Language, a domain-specific language that can interact with and modify OTel data. Yes, the Collector already comes with dozens of components that can handle a wide range of data processing, BUT using OTTL in conjunction with the components enables even more powerful data manipulation.

In this session, learn about the benefits of OTTL, when to use it, and how to get started with OTTL. Get ready to explore:
* What OTTL is: A breakdown of the syntax and the underlying architecture within the OTel Collector.
* Why it’s useful: Practical strategies for cost reduction (filtering noise), compliance (redacting PII), and standardization (normalizing attributes).
* How to use it: A live walkthrough of writing complex transformation statements for the transform and filter processors.
Speakers

Reese Lee

Senior Developer Relations Engineer, New Relic
Reese Lee is a Senior Developer Relations Engineer at New Relic focusing on technical enablement via workshops, blog posts, documentation, and more. She is a Maintainer of the OpenTelemetry End User SIG, where she enjoys learning about interesting use cases and the different ways...
Level One | Ballroom A
  CNCF Observability Projects

3:40pm CDT

Inside the Telemetry Data Plane: Constraints, Tradeoffs, and Scale - Eduardo Silva & José Lecaros, Chronosphere | A Palo Alto Networks Company
Friday May 22, 2026 3:40pm - 4:05pm CDT
Modern telemetry systems often struggle not because of missing features, but because of hidden constraints in how data is buffered, scheduled, and moved through the system. This session explores the practical realities of building a telemetry data plane that must operate under extreme throughput, tight latency budgets, and strict resource limits.

Using real-world experience from developing a high-performance open source telemetry agent, we’ll examine how design tradeoffs around buffering, concurrency, and I/O shape system behavior at scale. Topics include user-space serialization strategies, adaptive buffering models, memory-mapped persistence, and multithreaded I/O coordination, along with how these choices interact with core Linux primitives such as epoll, asynchronous I/O, and zero-copy techniques.

Rather than focusing on APIs or products, this talk dives into the mechanics and constraints that determine whether a telemetry system remains predictable under load. The discussion is grounded in production lessons learned from operating at billions of events per minute and highlights patterns that apply broadly to collectors, agents, and streaming systems.
Speakers

José Lecaros

Support Engineer, Chronosphere | A Palo Alto Networks Company
He works as a Support Engineer at Chronosphere, helping both customers and the Fluent community. He's been a developer and support engineer for 20+ years.

Eduardo Silva

Distinguished Engineer, Chronosphere | A Palo Alto Networks Company
Eduardo is an entrepreneur and Software Engineer. He is one of the Fluentd project maintainers and the creator of Fluent Bit, a lightweight Logs, Metrics, and Traces processor.
Level One | Ballroom B
 