As a software engineer coming from ecosystems like the JVM, Go, or Node.js, you have a deep understanding of Site Reliability Engineering (SRE) principles and the modern observability stack. You know what it takes to operate systems at scale. When evaluating a new technology like Elixir, a critical question inevitably arises: “Can I really observe and operate this in production?” It’s a valid concern, born from experience with runtimes that can often feel like black boxes.
This article provides a direct, technically-grounded answer to that question. We will explore how Elixir offers a unique, two-pronged observability strategy: unparalleled real-time introspection thanks to its runtime design, combined with full integration into the standardized world of OpenTelemetry and Prometheus. The goal is to demonstrate that Elixir’s observability isn’t just an afterthought; it’s a powerful combination of deep, intrinsic runtime visibility and modern, standards-based instrumentation.
1. A Different Foundation: Observability by Design in the BEAM VM
To understand observability in Elixir, we must first look at its foundation. Elixir runs on the BEAM virtual machine, a core component of the Erlang Run-Time System (ERTS). The BEAM was designed from the ground up for concurrent, fault-tolerant systems, and the ability to inspect a live system is a fundamental part of its architecture, not a feature added later.
Lightweight Processes: The Core of Concurrency
The core unit of execution in an Elixir application is a “process.” Unlike OS threads, which are managed by the kernel and carry a comparatively large memory footprint, BEAM processes are extremely lightweight: each is an isolated entity with its own stack, heap, and message queue (mailbox), all managed by the BEAM’s preemptive schedulers. The BEAM’s ability to manage millions of these processes within a single OS process is central to both its concurrency model and its observability, because the state of every single unit of work is visible to the runtime.
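To make that scale concrete, here is a quick sketch you can run in an iex session (the process count is arbitrary and the memory figures will vary by machine):

```elixir
# Spawn 100,000 processes that each do nothing but sleep for a minute.
pids = for _ <- 1..100_000, do: spawn(fn -> Process.sleep(60_000) end)

length(pids)
# => 100000

# Total memory currently allocated for all processes on the node, in bytes.
# Each lightweight process costs on the order of a few kilobytes.
:erlang.memory(:processes)
```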
Built-in Introspection with Observer
The most striking feature for engineers new to the BEAM is Observer, a powerful graphical tool that is built directly into the runtime. It allows you to connect to a live Elixir node—in development or production—and inspect its internal state in real-time. This is not a third-party add-on; it’s a core capability of ERTS.
Observer provides critical insights across several areas:
Application Inspection: Visualize the complete supervision tree of your application. This shows exactly how your processes are structured, which processes are supervising others, and how they are linked for fault tolerance.
Process Information: Get a real-time, sortable list of every process running on the node. For each process, you can see its current memory usage, the length of its message queue, and its reduction count—a proxy for how much CPU work it has done. This is invaluable for identifying bottlenecks or memory-hungry processes instantly.
System Overview: View detailed statistics about the BEAM itself, including scheduler utilization across all cores, memory allocator statistics for different data types, and I/O activity.
These same introspection capabilities are available programmatically through functions like Process.info/2 (Elixir’s wrapper around :erlang.process_info/2), allowing you to build custom monitoring and diagnostic tools that tap directly into the runtime’s own data.
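As a rough sketch of both paths (MyApp.Worker is a hypothetical registered process name, and the printed values are illustrative):

```elixir
# Start the Observer GUI from an iex session attached to the node.
:observer.start()

# The same data is available programmatically for any process.
pid = Process.whereis(MyApp.Worker)   # hypothetical registered process

Process.info(pid, [:memory, :message_queue_len, :reductions, :current_function])
# => [memory: 2832, message_queue_len: 0, reductions: 4271,
#     current_function: {:gen_server, :loop, 7}]

# Node-wide counters are exposed the same way.
:erlang.statistics(:run_queue)   # processes waiting to be scheduled
:erlang.memory()                 # memory broken down by category
```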
Observer is unparalleled for interactive, live-system debugging, but it is not designed for the automated, long-term, and aggregated analysis required for production SRE practices like alerting and SLO tracking. For that, we turn to the industry-standard three pillars.
2. Bridging to the Familiar: The Three Pillars in Elixir
While Observer provides a powerful ‘point-in-time’ view into the VM, a robust observability strategy also requires historical data and distributed context. This is where the ecosystem bridges from its unique internal capabilities to the familiar ‘three pillars’ model, with the telemetry library serving as the cornerstone.
The Foundation: telemetry
Telemetry is a lightweight, universal event-dispatching library that has become the standard foundation for instrumentation in Elixir. Key libraries like the Phoenix web framework and the Ecto database wrapper use telemetry to emit standardized events about their internal operations (e.g., request timings, query performance). Other libraries can then attach to these events to generate logs, metrics, or traces. This architecture decouples instrumentation from consumption. Library authors (like the Phoenix team) can emit events without dictating how they are used, allowing operations teams to plug in any number of handlers for metrics, logging, or tracing without modifying the core library code. This is a crucial design choice for a flexible and maintainable ecosystem.
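To illustrate the attach model, here is a minimal sketch of a handler that logs slow Ecto queries. The event name follows Ecto’s [:app, :repo, :query] convention; the my_app prefix, the 200 ms threshold, and the module name are assumptions for the example:

```elixir
defmodule MyApp.SlowQueryLogger do
  require Logger

  # Ecto emits a [:my_app, :repo, :query] telemetry event for every query
  # (the prefix depends on your repo configuration).
  def attach do
    :telemetry.attach(
      "slow-query-logger",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    duration_ms = System.convert_time_unit(measurements.total_time, :native, :millisecond)

    if duration_ms > 200 do
      Logger.warning("Slow query (#{duration_ms}ms): #{metadata.query}")
    end
  end
end
```

The same event could just as easily feed a metric or a trace span; the emitting library (Ecto) never needs to know which.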
The Three Pillars of Observability
This telemetry-based approach provides a clean bridge to the three pillars of observability, which are essential for understanding the why behind a system’s behavior, not just the what.
Logs: Discrete, timestamped records of events. They are essential for troubleshooting specific occurrences, providing the rich, detailed context needed to understand the steps leading to an error or a specific transaction.
Metrics: Aggregated numerical representations of system health over time (e.g., request latency, error rates, CPU utilization). They are ideal for building dashboards, defining alerts on known thresholds, and analyzing historical trends.
Traces: A detailed view of a single request’s end-to-end journey as it flows through a distributed system. Traces are crucial for diagnosing latency issues, identifying bottlenecks, and understanding inter-service dependencies in a microservices architecture.
The telemetry library provides the universal event bus that allows specialized libraries to transform raw events into the metrics and traces needed for a complete observability picture.
3. Logs: Structured Logging as a First-Class Citizen
Before discussing metrics and dashboards, it’s worth addressing logging, because this is often the first observability signal teams rely on in production—and an area where Elixir is frequently underestimated.
Elixir’s logging system is built around the Logger module, which is part of the standard library and deeply integrated with the runtime. While Logger historically emitted plain text logs by default, modern Elixir applications almost universally use structured logging, and the ecosystem makes this both straightforward and idiomatic.
Structured Logs by Default, Not by Convention
Logger supports structured, key–value metadata natively. Instead of treating logs as unstructured strings that must later be parsed, Elixir encourages attaching contextual metadata—such as request IDs, user identifiers, or domain-specific fields—directly to log events.
This metadata is stored per process, meaning contextual information (like a request ID set at the boundary of a web request) is automatically attached to every subsequent log line emitted by that process, and can be handed off explicitly when work moves to other processes. This fills the role of thread-local or MDC-style context in other ecosystems, but as a first-class feature of the process model rather than a workaround.
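A minimal sketch of that pattern (the metadata keys are arbitrary examples):

```elixir
require Logger

# Set contextual metadata once, typically in a plug at the edge of a request.
Logger.metadata(request_id: "req_a1b2c3", user_id: 42)

# Every subsequent log line from this process carries that context automatically
# (provided the keys are included in the formatter's metadata configuration).
Logger.info("checkout started")

# Per-message metadata can also be attached inline.
Logger.error("payment declined", reason: :insufficient_funds)
```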
Flexible Encoding and Output
In production, Logger can be configured to emit logs in structured formats such as JSON, making it trivial to integrate with standard log aggregation systems (e.g., Elasticsearch, Loki, or commercial SaaS platforms). This is typically achieved by swapping the default formatter for a JSON formatter, without changing application code.
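A sketch of the classic console configuration, with a note on where a JSON formatter would slot in (the metadata keys are examples; newer Elixir versions expose the same options under :default_formatter, and the exact JSON formatter module depends on the library you choose, such as the community logger_json package):

```elixir
# config/prod.exs

# Plain-text output with selected metadata keys.
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id, :user_id]

# Switching to structured JSON output is a formatter swap at this same level of
# configuration (for example via a JSON formatter library); application code
# and Logger calls stay unchanged.
```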
Because Logger is part of the runtime and not a third-party abstraction, this configuration is:
- Centralized
- Consistent across the ecosystem
- Decoupled from any specific log backend
This makes it easy to evolve logging strategies over time—from local development logs to fully structured production pipelines—without rewriting instrumentation.
Logs as Part of a Cohesive Observability Strategy
Crucially, logging in Elixir does not exist in isolation. Logger integrates cleanly with the broader observability stack:
- Logs can include trace and span identifiers when OpenTelemetry is enabled
- Log volume and levels can be dynamically adjusted at runtime (see the sketch after this list)
- Logging remains lightweight, with overload protection (switching to synchronous mode or discarding messages under load) built into Logger itself rather than left to the application
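As a concrete example of that runtime adjustability, log verbosity can be changed on a live node from a remote iex session, without a deploy:

```elixir
# Raise the threshold to cut log volume during an incident...
Logger.configure(level: :warning)

# ...and lower it again when detailed debugging is needed.
Logger.configure(level: :debug)
```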
The result is a logging system that feels modern, production-oriented, and aligned with today’s expectations: structured by default, context-aware, and easy to route into both open-source and commercial observability platforms.
When OpenTelemetry is enabled, this structured logging model becomes even more powerful. Trace and span identifiers (trace_id, span_id) can be automatically injected into Logger metadata, allowing logs to be directly correlated with distributed traces. This makes it trivial to pivot from a slow or failing trace to the exact log lines emitted within that span, using standard tooling in both open-source and commercial platforms. For teams used to trace–log correlation in JVM or Go-based systems, this behavior is fully supported and idiomatic in Elixir.
4. Metrics: The Prometheus & Grafana Story
For metrics, the Elixir ecosystem has a go-to solution for integrating with the industry-standard Prometheus and Grafana stack: PromEx.
PromEx is a library designed to make exposing Prometheus-compatible metrics from an Elixir application as simple and streamlined as possible. It works by attaching its own handlers to the telemetry events emitted by your application and its dependencies, converting them into the metrics format that Prometheus scrapes.
The setup and features highlight its power and ease of use:
Simple Configuration: A single mix task, mix prom_ex.gen.config, generates the necessary configuration module, removing boilerplate and guesswork.
Effortless Phoenix Integration: PromEx.Plug is a simple plug that can be added to your Phoenix endpoint to expose a /metrics endpoint. Prometheus can then be configured to scrape this endpoint automatically.
Rich Ecosystem Plugins: PromEx offers a suite of pre-built plugins that provide deep visibility into the most common parts of an Elixir stack (a configuration sketch follows this list), including:
- PromEx.Plugins.Beam (BEAM VM metrics)
- PromEx.Plugins.Phoenix (web request metrics)
- PromEx.Plugins.Ecto (database query metrics)
- PromEx.Plugins.Oban (background job processing metrics)
Pre-built Dashboards: Crucially, each plugin comes with a tailored Grafana dashboard. This means you get meaningful, out-of-the-box visualizations for your key metrics without having to build dashboards from scratch.
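For orientation, a PromEx module (the kind the mix prom_ex.gen.config task produces) typically looks roughly like the sketch below. The MyApp/MyAppWeb module names and the plugin options are assumptions about the application:

```elixir
defmodule MyApp.PromEx do
  use PromEx, otp_app: :my_app

  @impl true
  def plugins do
    [
      # Runtime, web, database, and background-job visibility out of the box.
      PromEx.Plugins.Beam,
      {PromEx.Plugins.Phoenix, router: MyAppWeb.Router, endpoint: MyAppWeb.Endpoint},
      {PromEx.Plugins.Ecto, repos: [MyApp.Repo]},
      {PromEx.Plugins.Oban, oban_supervisors: [Oban]}
    ]
  end

  @impl true
  def dashboards do
    [
      # Each plugin ships with a matching Grafana dashboard definition.
      {:prom_ex, "beam.json"},
      {:prom_ex, "phoenix.json"},
      {:prom_ex, "ecto.json"},
      {:prom_ex, "oban.json"}
    ]
  end
end
```

The module is then added to the application’s supervision tree, and PromEx.Plug (mentioned above) exposes the scrape endpoint.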
5. Tracing: Embracing the OpenTelemetry Standard
For distributed tracing, the Elixir community has fully adopted OpenTelemetry (OTel), the Cloud Native Computing Foundation (CNCF) framework that provides a vendor-neutral standard for generating and collecting telemetry data.
Core Components and Instrumentation
Setting up OTel in a typical Phoenix application involves adding a few key libraries: opentelemetry_api, opentelemetry, opentelemetry_exporter, and instrumentation-specific packages like opentelemetry_phoenix and opentelemetry_ecto.
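A sketch of the relevant mix.exs dependencies (the version requirements are illustrative; check Hex for current releases):

```elixir
# mix.exs
defp deps do
  [
    {:opentelemetry_api, "~> 1.0"},
    {:opentelemetry, "~> 1.0"},
    {:opentelemetry_exporter, "~> 1.0"},
    {:opentelemetry_phoenix, "~> 2.0"},
    {:opentelemetry_ecto, "~> 1.0"}
  ]
end
```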
Instrumentation works in two primary ways:
Automatic Instrumentation: Libraries like opentelemetry_phoenix listen for standard :telemetry events and automatically create OTel spans from them. This provides a baseline level of tracing for web requests and database queries with minimal setup.
Manual Instrumentation: For business-logic-specific insights, you can create custom spans programmatically using functions like Tracer.with_span. This is essential for tracing business-critical logic that isn’t captured by default framework events, such as the steps within a complex financial transaction or a multi-stage data processing pipeline.
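A minimal sketch of manual span creation (the MyApp.Checkout module and its collaborators are hypothetical, and the attribute names are arbitrary):

```elixir
defmodule MyApp.Checkout do
  require OpenTelemetry.Tracer, as: Tracer

  def process(order) do
    Tracer.with_span "checkout.process" do
      Tracer.set_attributes([{"order.id", order.id}, {"order.total_cents", order.total_cents}])

      with {:ok, charge} <- charge_card(order),
           {:ok, _receipt} <- send_receipt(order, charge) do
        :ok
      end
    end
  end

  defp charge_card(order) do
    # Nested spans show up as children of "checkout.process" in the trace.
    Tracer.with_span "checkout.charge_card" do
      PaymentGateway.charge(order)
    end
  end

  defp send_receipt(order, charge), do: Mailer.send_receipt(order, charge)
end
```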
Exporting Data via the OTel Collector
The best practice for sending telemetry data in production is to use the OpenTelemetry Collector. This is a vendor-agnostic agent that receives data from your application (typically via the OTLP protocol) and can then process and export it to any number of backends. This decouples your application from the specific tracing backend (e.g., Jaeger, Zipkin, or a commercial platform), preventing vendor lock-in.
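On the application side, pointing the SDK at a local Collector is a small piece of configuration. A sketch (the endpoint is an assumption for a Collector running alongside the app; exact keys should be checked against the opentelemetry_exporter documentation for your version):

```elixir
# config/runtime.exs

# Use the batch span processor and export traces over OTLP.
config :opentelemetry,
  span_processor: :batch,
  traces_exporter: :otlp

# Send OTLP over HTTP/protobuf to a Collector on the default port.
config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "http://localhost:4318"
```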
A Note on Maturity and Performance
The OpenTelemetry SDK for Elixir/Erlang is mature and widely used in production. However, for systems with very high request throughput, some performance tuning may be required. Specifically, operators may need to adjust the configuration of the batch span processor (e.g., OTEL_BSP_SCHEDULE_DELAY_MILLIS) to balance memory usage and latency, ensuring the observability layer itself does not become a bottleneck.
One important distinction to be explicit about is profiling. Unlike the JVM (JFR) or Go (pprof), the BEAM does not yet offer a ubiquitous, always-on, low-overhead continuous profiler as part of the standard production toolchain. Instead, profiling in Elixir relies on a combination of built-in tools (:eprof, :fprof, :cprof), runtime introspection, and situational diagnostics using Observer or libraries like recon. This means profiling is typically more targeted and investigative rather than continuously sampled. For many teams this is sufficient—especially when combined with strong metrics and tracing—but it is a real trade-off to be aware of for workloads where continuous CPU or allocation profiling is a hard requirement.
6. Integrating with Commercial Platforms (Datadog, New Relic, AppSignal, etc.)
Elixir applications integrate reliably with major commercial observability platforms such as Datadog, New Relic, and AppSignal. For most production systems today, the recommended and future-proof integration path is OpenTelemetry, which provides a vendor-neutral foundation for traces, metrics, and logs.
By instrumenting your application using OpenTelemetry APIs, you decouple observability concerns from any specific backend. Telemetry data is typically exported to an OpenTelemetry Collector, which then forwards it to one or more destinations—self-hosted systems like Jaeger or Prometheus, or commercial platforms such as Datadog or New Relic. Switching backends does not require changes to application code; it is a configuration concern at the Collector level.
This OTel-first approach has two important implications:
Consistency and portability: Instrumentation remains stable even if vendors change, pricing shifts, or organizational preferences evolve.
Reduced vendor lock-in: The application depends on open standards rather than proprietary agents or SDKs.
Some platforms also offer Elixir-specific or vendor-native integrations. AppSignal, in particular, provides a tightly integrated, Elixir-first experience with minimal setup and strong defaults. This can be an attractive option for teams that prefer a managed solution and are comfortable trading some flexibility for simplicity. However, such integrations typically couple instrumentation more closely to the vendor and make later migrations more expensive.
At a high level, the trade-off is familiar:
Self-Hosted Open Source (Prometheus / Grafana / Jaeger): Maximum control and flexibility with no direct vendor costs. Requires operational ownership and expertise.
Commercial SaaS Platforms: Managed infrastructure, advanced analytics, and lower operational overhead. Cost scales with usage.
In practice, many teams adopt a hybrid model: open-source tooling for core metrics and dashboards, combined with commercial platforms for tracing, alerting, or cross-service analysis. Elixir’s OpenTelemetry-based integrations support this model well.
7. The Verdict: Is Elixir Production-Ready?
So, is Elixir’s observability stack ready for the demands of modern production systems? The answer is yes. The ecosystem provides a clear two-pronged approach: deep, real-time runtime introspection via the BEAM, combined with full support for industry-standard observability practices through Telemetry, OpenTelemetry, and Prometheus-based tooling.
The BEAM’s intrinsic visibility enables operators to inspect scheduler behavior, memory usage, and individual process state directly on a live system, answering the question “what is happening right now?” with a level of precision that is difficult to achieve in many other runtimes. At the same time, standardized metrics, logs, and traces provide the historical context required for alerting, SLOs, and distributed debugging.
This contrasts with platforms where the running application is largely opaque to the operator and observability relies primarily on sampling and post-hoc inference. In JVM or Go ecosystems, much of that visibility is reconstructed from external signals such as profiles, metrics endpoints, and logs; in the BEAM, the state of every scheduler and process is exposed directly by the runtime itself.
That said, Elixir’s observability model will not be a perfect fit for every organization. Teams that depend heavily on always-on, low-overhead continuous profiling, that are unwilling to operate BEAM-specific tooling, or that mandate vendor-specific agents may find a more natural fit elsewhere. Elixir rewards teams willing to understand the runtime it runs on—and for those teams, it offers a level of operational transparency that is difficult to replicate.

