The Sched app allows you to build your schedule, but is not a substitute for your event registration. You must be registered for Observability Summit North America 2026.
Please note: This schedule is automatically displayed in Central Daylight Time (UTC -5). To see the schedule in your preferred timezone, select from the drop-down menu located at the bottom of the menu to the right.
The schedule is subject to change.
Operating a metrics system beyond billions of data points introduces failure modes that don't exist at smaller deployments. This lightning talk shares battle-tested lessons from running Thanos, Prometheus, and OpenTelemetry in production across distributed Kubernetes environments, focusing on three critical challenges: implementing multi-tenancy without noisy neighbor problems, building rate limiting that prevents a single tenant from destabilizing the cluster, and isolating query workloads so expensive queries don't starve metric ingestion.
The talk walks through real incidents where these challenges caused production impact, including 5xx errors on Thanos Receivers from unbounded queries, Prometheus remote write lag, and partial query results from overwhelmed Store Gateways. For each problem, the talk presents the custom solutions developed in response, including tenant-aware rate limiting middleware and workload isolation patterns, and shares concrete configuration approaches that attendees can apply to their own deployments. Attendees will leave with actionable techniques for scaling their observability infrastructure to trillion scale while maintaining reliability under load.
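As a rough illustration of what tenant-aware rate limiting can look like, the sketch below keeps one token bucket per tenant so that a burst from one tenant cannot consume another tenant's budget. It is not the speaker's middleware and not a Thanos or Prometheus API; the class names, the `samples` cost unit, and the default limits are all assumptions made for this example.

```python
import time


class TokenBucket:
    """Token bucket that refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated = None     # timestamp of the last refill

    def allow(self, cost, now):
        # Refill according to elapsed time, never exceeding capacity.
        if self.updated is not None:
            elapsed = max(0.0, now - self.updated)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Hypothetical limiter: one independent bucket per tenant."""

    def __init__(self, default_rate, default_burst, overrides=None):
        self.default = (default_rate, default_burst)
        self.overrides = overrides or {}  # tenant -> (rate, burst)
        self.buckets = {}

    def allow(self, tenant, samples, now=None):
        now = time.monotonic() if now is None else now
        bucket = self.buckets.get(tenant)
        if bucket is None:
            rate, burst = self.overrides.get(tenant, self.default)
            bucket = self.buckets[tenant] = TokenBucket(rate, burst)
        return bucket.allow(samples, now)


if __name__ == "__main__":
    limiter = TenantRateLimiter(default_rate=100, default_burst=200)
    for tenant, samples in [("team-a", 200), ("team-a", 50), ("team-b", 50)]:
        print(tenant, samples, limiter.allow(tenant, samples, now=0.0))
```

In a setup like this, a rejected request would typically be answered with an HTTP 429 so the sender can back off and retry; that mapping is left out of the sketch.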
Experienced software engineer who is passionate about solving complex software engineering challenges. With around 14 years of experience in software engineering, has a strong foundation in building and optimizing high-performance systems, particularly in Observability, Big Data...
Operating a SaaS platform presents the same observability problems as any other enterprise, but scale and tenancy multiply the volume of observability signals, which affects both cost and effectiveness.
This session dives into the techniques Collibra used to tame these problems and how to maintain clarity when infrastructure spans virtual machines, modern Kubernetes clusters, and a complex mix of single- and multi-tenant architectures. Without the right context, telemetry data becomes a noisy, indistinguishable flood.
We will dive into the architectural decision to leverage the C4 system model, ensuring every piece of telemetry carries the vital context of what it belongs to and where it sits in the hierarchy. This context enables both signal attribution and virtual chargebacks. The presentation details the implementation of a pipeline using custom-built OpenTelemetry collectors designed to handle the data and enrich it before sending it to the appropriate backends.
This session will give you practical insights on the challenges SaaS platforms have, but the techniques that are used to tame them can be applied everywhere.
Alex Van Boxel is a Principal System Architect at Collibra. With an engineering background in R&D at Alcatel-Lucent, Progress Software, and Veepee, he loves to focus on the fundamental building blocks of the software industry. That means reading, understanding, and contributing to...
As organizations adopt observability practices, they face a scalability paradox: systems now generate petabytes of traces and logs, but querying this raw telemetry over long time horizons becomes prohibitively slow and expensive due to the data volume.
The standard solution of pre-aggregating high-cardinality telemetry into metrics at collection time through features in the OpenTelemetry collector works well for known patterns but fails when engineers need to ask new questions about historical data. This creates an uncomfortable choice for engineers and operators: fast dashboards with pre-aggregated metrics, or high-fidelity traces and logs that become unusable beyond short time windows.
This talk presents a post-collection aggregation approach that enables fast queries over long time periods of detailed telemetry without changes to collector-side configuration. The session explores techniques for incremental view materialization that work with timeseries data. Attendees will leave with concrete architectural patterns that apply to open source databases like ClickHouse or OpenSearch and help answer novel questions without sacrificing query speed or data fidelity.
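To make the idea of incremental view materialization concrete, here is a minimal in-memory sketch: each newly arrived batch of spans is folded into per-bucket partial aggregates, and long-range queries read only those aggregates instead of rescanning raw data. This is not the approach presented in the talk and does not use ClickHouse or OpenSearch APIs; the class name, the span tuple layout `(timestamp_seconds, service, duration_ms)`, and the hourly bucket size are assumptions for illustration.

```python
from collections import defaultdict


class IncrementalRollup:
    """Hypothetical materialized view keyed by (time bucket, service).

    Stores mergeable partial aggregates (count and sum) rather than
    averages, so new batches can be folded in without recomputation.
    """

    def __init__(self, bucket_seconds=3600):
        self.bucket_seconds = bucket_seconds
        self.state = defaultdict(lambda: [0, 0.0])  # key -> [count, sum_ms]

    def apply(self, spans):
        """Fold a batch of newly arrived spans into the view."""
        for ts, service, duration_ms in spans:
            bucket = ts - ts % self.bucket_seconds  # start of the bucket
            agg = self.state[(bucket, service)]
            agg[0] += 1
            agg[1] += duration_ms

    def avg_duration(self, service, start, end):
        """Average duration for buckets starting in [start, end)."""
        count, total = 0, 0.0
        for (bucket, svc), (c, s) in self.state.items():
            if svc == service and start <= bucket < end:
                count += c
                total += s
        return total / count if count else None


if __name__ == "__main__":
    view = IncrementalRollup()
    view.apply([(10, "api", 100.0), (20, "api", 300.0)])  # first batch
    view.apply([(3700, "api", 500.0)])                    # later batch
    print(view.avg_duration("api", 0, 7200))
```

Keeping count and sum separately is the design choice that matters here: averages cannot be merged across buckets, but counts and sums can, which is what lets the view be updated batch by batch.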
Zack Owens is a Principal Engineer and Architect at New Relic, focusing on the data platform and NRDB, a purpose-built timeseries database for observability.
Most MCP servers fail the same way: they expose observability data without understanding what AI models need to reason effectively. The result? Tools that overwhelm models with metrics, miss critical context, and introduce unnecessary security exposure.
At Multiplayer, we built an MCP server to give AI coding assistants access not just to production telemetry but to full stack data: frontend screens and data, backend traces, logs, and request/response content and headers. What we learned challenges the "more data is better" assumption that drives most integrations.
This talk shares the hard lessons from moving an MCP server into production. You'll learn why filtered, intent-driven context outperforms comprehensive data access, how to design tools that align with developer workflows rather than API surfaces, and the security trade-offs that matter when LLMs query your observability stack.
We'll cover practical design patterns for MCP servers in the observability space: scoping data by blast radius, surfacing relationships over raw metrics, and handling authentication without compromising developer experience. This talk is about what works when AI meets production systems.
Today, observability platforms can process massive volumes of telemetry, but practitioners struggle to determine what matters during incidents and end up with unnecessarily high usage bills.
This talk resolves the question: “Which telemetry data should we keep?” Learn how one team achieved 30% log reduction by flipping the script and asking “what did we actually use?” instead of “what should we collect?” They conducted a forensic audit of incident resolutions to find receipts proving which data sources truly mattered.
You’ll learn techniques for tracing backward from resolved incidents to identify which telemetry proved valuable. You’ll also see how to map incidents to the telemetry data that enabled their resolution, revealing which sources proved critical, redundant, or unused.
Using OpenTelemetry (OTel) and Vector, an open-source tool for building fast and scalable observability pipelines, this approach provides a replicable pattern that the community can adapt across different environments.
You’ll leave with a framework for measuring telemetry value based on usage patterns, plus a repeatable audit process. The key question: “Where are the receipts?”
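One way to picture such an audit is a small script that walks resolved incidents and labels each telemetry source by how it was used. The sketch below is only an illustration, not the team's actual process: the incident record shape (`consulted`, `decisive`) and the three labelling rules are assumptions invented for this example.

```python
def audit_sources(all_sources, incidents):
    """Label each telemetry source from resolved-incident records.

    Assumed rules (hypothetical, for illustration only):
      critical  - enabled the resolution of at least one incident
      redundant - was consulted but never enabled a resolution
      unused    - was never consulted in any incident
    """
    consulted, decisive = set(), set()
    for incident in incidents:
        consulted.update(incident["consulted"])
        decisive.update(incident["decisive"])

    report = {}
    for source in sorted(all_sources):
        if source in decisive:
            report[source] = "critical"
        elif source in consulted:
            report[source] = "redundant"
        else:
            report[source] = "unused"
    return report


if __name__ == "__main__":
    # Source names and incident records are made up for the example.
    sources = ["app-logs", "debug-logs", "lb-access-logs", "traces"]
    incidents = [
        {"consulted": ["app-logs", "traces"], "decisive": ["traces"]},
        {"consulted": ["app-logs", "lb-access-logs"], "decisive": ["lb-access-logs"]},
    ]
    for source, label in audit_sources(sources, incidents).items():
        print(f"{source}: {label}")
```

Sources labelled unused or redundant would be candidates for review rather than automatic removal, since a small incident sample can miss rare but important cases.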
We see IoT everywhere, from smart fridges to air quality sensors, but what about applying observability to billions of living things? Introducing Meowy, my virtual cat with a full observability stack. In this talk, I'll build a digital pet from scratch in Go, instrument it with OpenTelemetry, and visualize its "life" in real time, live-tracking its habits, moods, and (attempted) escapes.
I'll show how to create a RESTful "cat API," instrument it for tracing, and set up alerting with the ELK stack and Kibana visualizations. We'll cover observability basics (logs, metrics, and traces), how to apply them to our digital pet, how to structure telemetry data for "living" systems using AI tools, and how to query all our cat stats with an MCP-connected AI agent. By the end, we'll calculate the average MPH (meows per hour) and expand our understanding of observability applications. No prior observability experience required—just some Go basics and a love for any living thing, from feline to fungal!