TL;DR: Replace guesswork with data: add telemetry first, then ship, so you always know what your Elixir system is actually doing in production.
## Telemetry-Driven Development
> “The purpose of a system is what it does.”
Noah (GitHub/Slack: Nezteb) introduced the phrase *Telemetry-Driven Development* (TDD) at an Elixir meet-up. While classic TDD means “write the test, watch it fail, make it pass,” the new TDD means “make the system tell you what it’s doing, then decide if it’s finished.” The following notes condense his talk, the live demo, and the operational lessons he learned running 848 000 Nerves gateways at SmartRent.
## Why classic TDD is no longer enough
Unit tests prove code works on the developer’s laptop, but production is a distributed, concurrent, resource-starved mess. With 3.5 M IoT devices (locks, thermostats, leak sensors) hammering the platform (231 new TLS connections/s, 3 600 MQTT msgs/s, 7 k database row writes/s), guessing causes outages. Telemetry becomes the only reliable feedback loop.
## What “telemetry” actually means
Borrowed from 1950s hardware, telemetry literally means “measure at a distance.” Think of a pressure gauge on a pipe: you can’t see water, but you read the dial. In software we export three signals:
1. **Traces** – request-scoped causal graphs
2. **Metrics** – pre-aggregated numbers over time
3. **Logs** – discrete events with context
OpenTelemetry (OTel) wraps the three into one vendor-neutral spec. Elixir status today:
- **Traces** – stable
- **Metrics & Logs** – beta; usable via `:telemetry` + `:telemetry_metrics` bridges
## Libraries you will meet
| Package | Purpose |
|---------|---------|
| `opentelemetry_api` | Span/metric/log API |
| `opentelemetry` | SDK implementation |
| `opentelemetry_exporter` | gRPC/HTTP export to OTel Collector |
| `opentelemetry_telemetry` | Bridges `:telemetry` events into OTel spans |
| `telemetry` | BEAM-native dispatch library (1.0 released 2021-07-03) |
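As a rough sketch, the packages in the table might be declared in `mix.exs` like this. The version constraints are assumptions for illustration only; check Hex for the current releases before copying them.

```elixir
# mix.exs (fragment) - version constraints are illustrative assumptions
defp deps do
  [
    {:telemetry, "~> 1.0"},
    {:opentelemetry_api, "~> 1.0"},
    {:opentelemetry, "~> 1.0"},
    {:opentelemetry_exporter, "~> 1.0"},
    {:opentelemetry_telemetry, "~> 1.0"}
  ]
end
```

A library would normally list only `:opentelemetry_api` and leave the SDK and exporter to the application that ships it.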
## Local observability stack in one command
Clone the repo, run:
```bash
docker compose up
```
Grafana spins up on `http://localhost:3000` with:
- **Loki** – logs
- **Tempo** – traces
- **Mimir** – metrics
The acronym LGTM is intentional—“looks good to me,” the comment every PR hopes for.
## Demo application walk-through
A minimal Phoenix app plus a GenServer worker:
1. Worker scheduled every 10 s calls `cpu_work()` and `io_work()`
2. Each function starts an OTel span via `OpenTelemetry.Tracer.with_span/3`
3. Attributes (`cpu_ms`, `bytes_read`) attach to spans
4. Collector receives, Tempo stores, Grafana displays a waterfall
Change code, hot-reload, refresh Grafana: the feedback loop stays under 5 s and costs nothing in cloud spend.
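The worker described in the steps above might look roughly like the sketch below. This is not the demo's actual source: `Demo.Worker`, the span names, the `:path` option, and the bodies of `cpu_work/0` and `io_work/1` are all hypothetical stand-ins. Only `with_span` and the two attribute names come from the walk-through.

```elixir
# Standalone sketch; inside a Mix project, drop Mix.install and declare the dep in mix.exs.
# Only the API package is installed here, so spans are no-ops until an SDK is added.
Mix.install([{:opentelemetry_api, "~> 1.0"}])

defmodule Demo.Worker do
  use GenServer
  require OpenTelemetry.Tracer, as: Tracer

  @interval :timer.seconds(10)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    schedule()
    # :path is a hypothetical option: the file that io_work/1 reads on every tick.
    {:ok, %{path: Keyword.get(opts, :path, "mix.exs")}}
  end

  @impl true
  def handle_info(:tick, state) do
    # Parent span; the two child spans nest under it in the trace waterfall.
    Tracer.with_span "worker.tick" do
      cpu_work()
      io_work(state.path)
    end

    schedule()
    {:noreply, state}
  end

  def cpu_work do
    Tracer.with_span "cpu_work" do
      {micros, sum} = :timer.tc(fn -> Enum.reduce(1..200_000, 0, fn n, acc -> acc + n end) end)
      Tracer.set_attribute(:cpu_ms, div(micros, 1000))
      sum
    end
  end

  def io_work(path) do
    Tracer.with_span "io_work" do
      bytes = path |> File.read!() |> byte_size()
      Tracer.set_attribute(:bytes_read, bytes)
      bytes
    end
  end

  defp schedule, do: Process.send_after(self(), :tick, @interval)
end
```

Because each function opens its own span, the attributes land on the span that did the work, which is what makes the waterfall readable.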
## Writing tests against telemetry
In `MIX_ENV=test` attach a handler:
```elixir
:telemetry.attach(
  "test-handler",
  [:my_app, :work, :stop],
  # Forward each event's measurements to the test process (passed in as handler config).
  fn _event, measurements, _meta, pid ->
    send(pid, {:telemetry, measurements})
  end,
  self()
)
```
Then assert:
```elixir
assert_receive {:telemetry, %{cpu_ms: ms}} when ms > 0
```
No more flaky sleeps or `assert_process` hacks; the event itself is the synchronisation primitive.
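For context, here is a self-contained sketch of both sides: the code that emits the event and a test that waits on it. `MyApp.Work` and its `run/0` body are hypothetical; only the event name and the `cpu_ms` measurement come from the snippets above. The handler is a named function capture, which avoids the notice `:telemetry` logs when a local (anonymous) function is attached.

```elixir
# Standalone sketch: save as work_test.exs and run with `elixir work_test.exs`.
Mix.install([{:telemetry, "~> 1.0"}])
ExUnit.start()

defmodule MyApp.Work do
  # Hypothetical code under test: does some CPU work, then reports how long it took.
  def run do
    {micros, sum} = :timer.tc(fn -> Enum.reduce(1..500_000, 0, fn n, acc -> acc + n end) end)
    :telemetry.execute([:my_app, :work, :stop], %{cpu_ms: micros / 1000}, %{})
    sum
  end
end

defmodule MyApp.WorkTest do
  use ExUnit.Case

  # Handler: forwards the measurements to the pid given as handler config.
  def forward(_event, measurements, _meta, pid), do: send(pid, {:telemetry, measurements})

  setup do
    # A unique handler ID per test avoids clashes between attach calls.
    id = "test-handler-#{System.unique_integer([:positive])}"
    :ok = :telemetry.attach(id, [:my_app, :work, :stop], &__MODULE__.forward/4, self())
    on_exit(fn -> :telemetry.detach(id) end)
  end

  test "run/0 reports its CPU time" do
    # The work happens in another process; the event tells the test when it has finished.
    spawn(fn -> MyApp.Work.run() end)
    assert_receive {:telemetry, %{cpu_ms: ms}} when ms > 0, 1_000
  end
end
```

Handlers run synchronously in the process that emits the event, so the message arrives only after the work has actually completed.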
## Environment-specific pipeline
| Environment | Strategy |
|-------------|----------|
| dev | Export to stdout-spans, sampling 100 % |
| test | Attach in-process handler, no network |
| staging | Send to staging collector, 10 % sampling |
| prod | Send to regional collector, 1 % sampling, head-based probabilistic |
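Head-based probabilistic sampling means the keep-or-drop decision is made once, when the trace starts, rather than after the trace has been collected. One common way to do that is to derive the decision from the trace ID, so every service reaches the same verdict for the same trace. The sketch below illustrates the idea only: `HeadSampler` is a hypothetical module, and the SDK's own `trace_id_ratio_based` sampler differs in detail.

```elixir
defmodule HeadSampler do
  @moduledoc "Illustrative head sampler: hypothetical, not the OTel SDK's implementation."

  # Trace IDs are 128-bit; only the low 64 bits are used for the decision here.
  @modulus Integer.pow(2, 64)

  @doc "Returns true when a trace with this integer ID should be kept at the given ratio."
  def sample?(trace_id, ratio)
      when is_integer(trace_id) and trace_id >= 0 and ratio >= 0.0 and ratio <= 1.0 do
    rem(trace_id, @modulus) < ratio * @modulus
  end
end

trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736

# The low 64 bits of this ID sit at roughly 64 % of the range, so 1 % sampling drops it.
HeadSampler.sample?(trace_id, 0.01)
# A ratio of 1.0 keeps every trace, as in the dev row of the table.
HeadSampler.sample?(trace_id, 1.0)
```

The decision is deterministic per trace ID, so a sampled trace is either complete or absent, never half-recorded.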
`config/runtime.exs` reads `OTEL_EXPORTER_ENDPOINT` and `OTEL_TRACES_SAMPLER_ARG` at boot, so the same container image ships to every environment.
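A minimal sketch of what such a runtime config could look like. The two variable names are taken from the text; the default values, the HTTP protocol choice, and the parent-based sampler wrapper are assumptions.

```elixir
# config/runtime.exs (fragment)
import Config

# Defaults are assumptions for local use: a collector on localhost, sample everything.
endpoint = System.get_env("OTEL_EXPORTER_ENDPOINT", "http://localhost:4318")
{ratio, _rest} = Float.parse(System.get_env("OTEL_TRACES_SAMPLER_ARG", "1.0"))

config :opentelemetry,
  span_processor: :batch,
  traces_exporter: :otlp,
  # Respect the caller's sampling decision; sample new root traces at `ratio`.
  sampler: {:parent_based, %{root: {:trace_id_ratio_based, ratio}}}

config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: endpoint
```

The exporter can also pick up the standard `OTEL_EXPORTER_OTLP_ENDPOINT` variable on its own; the explicit config here just makes the wiring visible.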
## Continuous integration trick
CI job:
```bash
docker compose -f ci.docker-compose.yml up --exit-code-from test
```
Stack starts, tests run with full observability, stack tears itself down. Artifacts: JUnit XML *and* trace JSON for later forensics.
## Production numbers that justify the effort
SmartRent fleet:
- 848 000 Erlang nodes (Nerves gateways)
- 3.5 M leaf devices (about 4 per gateway)
- P99 end-to-end latency 350 ms
- 7 k DB rows mutated every second
Without trace IDs, operations would drown in unstructured logs; with them, a support ticket becomes “paste the trace ID, we’ll show you the exact gateway, firmware version, and query plan.”
## Current sharp edges in Elixir + OTel
1. Metrics & logs APIs still moving; expect minor breakage
2. High-throughput services need careful sampler tuning or RAM explodes
3. No standard semantic conventions yet for BEAM scheduler metrics or for mapping reductions to CPU time
4. Cross-node propagation requires custom `traceparent` parsing in MQTT/Phoenix channels
The community is active; watch erlang-otel for logging-domain PRs expected to land in the next release.
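To make item 4 concrete, here is a hand-rolled parser for the W3C `traceparent` format (`version-traceid-parentid-flags`). `TraceParent` is a hypothetical helper, and how the value travels (for example in an MQTT 5 user property or inside the payload) is an assumption. The OTel SDK ships its own W3C propagator for header-style carriers; this sketch only illustrates the wire format.

```elixir
defmodule TraceParent do
  @moduledoc "Hypothetical helper for the W3C traceparent format (version 00 only)."

  @type t :: %{trace_id: String.t(), span_id: String.t(), sampled?: boolean()}

  @spec parse(String.t()) :: {:ok, t()} | :error
  def parse(
        <<"00-", trace_id::binary-size(32), "-", span_id::binary-size(16), "-",
          flags::binary-size(2)>>
      ) do
    # Fields must be lowercase hex; all-zero trace or span IDs are invalid per the spec.
    with true <- lower_hex?(trace_id) and lower_hex?(span_id) and lower_hex?(flags),
         false <- all_zero?(trace_id) or all_zero?(span_id) do
      # The lowest bit of the flags byte is the "sampled" flag.
      sampled? = rem(String.to_integer(flags, 16), 2) == 1
      {:ok, %{trace_id: trace_id, span_id: span_id, sampled?: sampled?}}
    else
      _ -> :error
    end
  end

  def parse(_other), do: :error

  @spec format(t()) :: String.t()
  def format(%{trace_id: trace_id, span_id: span_id, sampled?: sampled?}) do
    "00-#{trace_id}-#{span_id}-#{if sampled?, do: "01", else: "00"}"
  end

  defp lower_hex?(bin), do: bin =~ ~r/\A[0-9a-f]+\z/
  defp all_zero?(bin), do: bin =~ ~r/\A0+\z/
end

# A publisher would attach TraceParent.format/1 output to the message;
# the consumer parses it before starting its own child span.
{:ok, parent} = TraceParent.parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
IO.inspect(parent.sampled?)
```

Returning `:error` for anything malformed lets the consumer fall back to starting a fresh trace instead of crashing on a bad message.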
## Take-off checklist for your own project
1. Add `:opentelemetry_api` to all apps (only the API, no SDK in libs)
2. Wrap business functions with `with_span` or `:telemetry.execute`
3. Export locally with the Docker Compose LGTM stack
4. Write at least one test that asserts on a telemetry payload
5. Deploy to staging, open Grafana, ask: **“Can I see the story of one request?”**
6. Iterate until the dashboard answers before you open a shell
## Closing rule of thumb
If you can’t graph it, you can’t gripe about it. Ship the observability first; the feature is only done when the metrics, traces, and logs say so.
---
Source: [YouTube – Telemetry-Driven Development by Nezteb](https://www.youtube.com/watch?v=irQicdafnyM)