Telemetry-Driven Development

Lobsters Hottest 04/22/26, 06:52 PM Tools

Summary

Noah at SmartRent coins "Telemetry-Driven Development" for Elixir: instrument first with OpenTelemetry, then ship, replacing guesswork with production data from 848k Nerves gateways.

<p>The repo: <a href="https://github.com/Nezteb/telemetry-driven-development" rel="ugc">https://github.com/Nezteb/telemetry-driven-development</a></p> <p><a href="https://lobste.rs/s/bonwlu/telemetry_driven_development">Comments</a></p>


Cached at: 04/22/26, 07:04 PM

TL;DR: Replace guesswork with data: add telemetry first, then ship, so you always know what your Elixir system is actually doing in production.

## Telemetry-Driven Development

> “The purpose of a system is what it does.”

Noah (GitHub/Slack: Nezteb) introduced the phrase *Telemetry-Driven Development* (TDD) at an Elixir meet-up. While classic TDD means “write the test, watch it fail, make it pass,” the new TDD means “make the system tell you what it’s doing, then decide if it’s finished.” The following notes condense his talk, the live demo, and the operational lessons he learned running 848 000 Nerves gateways at SmartRent.

## Why classic TDD is no longer enough

Unit tests prove code works on the developer’s laptop, but production is a distributed, concurrent, resource-starved mess. With 3.5 M IoT devices (locks, thermostats, leak sensors) hammering the platform at 231 new TLS connections/s, 3 600 MQTT msgs/s, and 7 k database row writes/s, guessing causes outages. Telemetry becomes the only reliable feedback loop.

## What “telemetry” actually means

Borrowed from 1950s hardware, telemetry literally means “measure at a distance.” Think of a pressure gauge on a pipe: you can’t see the water, but you can read the dial. In software we export three signals:

1. **Traces** – request-scoped causal graphs
2. **Metrics** – pre-aggregated numbers over time
3. **Logs** – discrete events with context

OpenTelemetry (OTel) wraps the three into one vendor-neutral spec.
Elixir status today:

- **Traces** – stable
- **Metrics & Logs** – beta; usable via `:telemetry` + `:telemetry_metrics` bridges

## Libraries you will meet

| Package | Purpose |
|---------|---------|
| `opentelemetry_api` | Span/metric/log API |
| `opentelemetry` | SDK implementation |
| `opentelemetry_exporter` | gRPC/HTTP export to OTel Collector |
| `opentelemetry_telemetry` | Bridges `:telemetry` events into OTel spans |
| `telemetry` | BEAM-native dispatch library (1.0 released 2021-07-03) |

## Local observability stack in one command

Clone the repo, run:

```bash
docker compose up
```

Grafana spins up on `http://localhost:3000` with:

- **Loki** – logs
- **Tempo** – traces
- **Mimir** – metrics

The acronym LGTM is intentional: “looks good to me,” the comment every PR hopes for.

## Demo application walk-through

A minimal Phoenix app plus a GenServer worker:

1. Worker scheduled every 10 s calls `cpu_work()` and `io_work()`
2. Each function starts an OTel span via `OpenTelemetry.Tracer.with_span/3`
3. Attributes (`cpu_ms`, `bytes_read`) attach to spans
4. Collector receives, Tempo stores, Grafana displays a waterfall

Change code, hot-reload, refresh Grafana: feedback in under 5 s, zero cloud cost.

## Writing tests against telemetry

In `MIX_ENV=test` attach a handler:

```elixir
# Forward every [:my_app, :work, :stop] event to the test process.
:telemetry.attach(
  "test-handler",
  [:my_app, :work, :stop],
  fn _event, measurements, _meta, pid ->
    send(pid, {:telemetry, measurements})
  end,
  self()
)
```

Then assert:

```elixir
assert_receive {:telemetry, %{cpu_ms: ms}} when ms > 0
```

No more flaky sleeps or `assert_process` hacks; the event itself is the synchronisation primitive.
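The two snippets above can be combined into one self-contained test file. The following is a minimal sketch, not the talk's actual code: `MyApp.Work.cpu_work/0` is a hypothetical stand-in for the demo worker, the event name and the `cpu_ms` key are carried over from the snippet above, and `Mix.install/1` assumes network access to fetch `:telemetry`. It also detaches the handler when the test exits so handlers do not leak between tests.

```elixir
# Run with: elixir telemetry_test.exs
# Assumption: network access, so Mix.install/1 can fetch :telemetry.
Mix.install([{:telemetry, "~> 1.0"}])
ExUnit.start()

defmodule MyApp.Work do
  # Hypothetical stand-in for the demo's cpu_work/0.
  # Wall-clock time is used as a rough proxy for CPU time.
  def cpu_work do
    started = System.monotonic_time()
    result = Enum.reduce(1..200_000, 0, &+/2)

    elapsed_us =
      System.convert_time_unit(System.monotonic_time() - started, :native, :microsecond)

    # Emit the event the test listens for; cpu_ms is a float in milliseconds.
    :telemetry.execute([:my_app, :work, :stop], %{cpu_ms: elapsed_us / 1000}, %{result: result})
    result
  end
end

defmodule MyApp.WorkTest do
  use ExUnit.Case, async: true

  # A named function capture avoids the warning :telemetry logs
  # when an anonymous function is attached as a handler.
  def handle_event(_event, measurements, _metadata, test_pid) do
    send(test_pid, {:telemetry, measurements})
  end

  setup do
    handler_id = "test-handler-#{System.unique_integer([:positive])}"

    :ok =
      :telemetry.attach(
        handler_id,
        [:my_app, :work, :stop],
        &__MODULE__.handle_event/4,
        self()
      )

    # Detach on exit so the handler does not outlive this test.
    on_exit(fn -> :telemetry.detach(handler_id) end)
    :ok
  end

  test "cpu_work/0 reports how long it took" do
    MyApp.Work.cpu_work()
    assert_receive {:telemetry, %{cpu_ms: ms}} when ms > 0
  end
end
```

Passing `self()` as the handler config is what lets the handler, which runs in the process that emits the event, route the measurement back to the test process.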
## Environment-specific pipeline

| Environment | Strategy |
|-------------|----------|
| dev | Export to stdout-spans, sampling 100 % |
| test | Attach in-process handler, no network |
| staging | Send to staging collector, 10 % sampling |
| prod | Send to regional collector, 1 % sampling, head-based probabilistic |

`runtime.exs` reads `OTEL_EXPORTER_ENDPOINT` and `OTEL_TRACES_SAMPLER_ARG`, so the same container image ships everywhere.

## Continuous integration trick

CI job:

```bash
docker compose -f ci.docker-compose.yml up --exit-code-from test
```

Stack starts, tests run with full observability, stack tears itself down. Artifacts: JUnit XML *and* trace JSON for later forensics.

## Production numbers that justify the effort

SmartRent fleet:

- 848 000 Erlang nodes (Nerves gateways)
- 3.5 M leaf devices (about 4 per gateway)
- P99 end-to-end latency 350 ms
- 7 k DB rows mutated every second

Without trace IDs, operations would drown in unstructured logs; with them, a support ticket becomes “paste the trace ID, we’ll show you the exact gateway, firmware version, and query plan.”

## Current sharp edges in Elixir + OTel

1. Metrics & logs APIs still moving; expect minor breakage
2. High-throughput services need careful sampler tuning or RAM explodes
3. BEAM scheduler and reduction→CPU mapping not yet standard semantic conventions
4. Cross-node propagation requires custom `traceparent` parsing in MQTT/Phoenix channels

Community is active; watch erlang-otel for logging-domain PRs expected to land next release.

## Take-off checklist for your own project

1. Add `:opentelemetry_api` to all apps (only the API, no SDK in libs)
2. Wrap business functions with `with_span` or `:telemetry.execute`
3. Export locally with docker-compose LGTM stack
4. Write at least one test that asserts telemetry payload
5. Deploy to staging, open Grafana, ask: **“Can I see the story of one request?”**
6. Iterate until the dashboard answers before you open a shell

## Closing rule of thumb

If you can’t graph it, you can’t gripe about it. Ship the observability first; the feature is only done when the metrics, traces, and logs say so.

---

Source: [YouTube – Telemetry-Driven Development by Nezteb](https://www.youtube.com/watch?v=irQicdafnyM)
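
As a rough illustration of the environment-specific pipeline described earlier, a `config/runtime.exs` along the following lines would let one container image pick up its endpoint and sampling ratio at boot. This is a sketch under assumptions, not the repo's actual configuration: the two environment variable names come from the notes above, while the default values, the choice of HTTP/protobuf, and the parent-based sampler wrapper are illustrative.

```elixir
# config/runtime.exs (sketch; defaults and protocol choice are assumptions)
import Config

# Variable names as given in the notes; the fallback values are hypothetical.
endpoint = System.get_env("OTEL_EXPORTER_ENDPOINT", "http://localhost:4318")

# Float.parse/1 accepts both "1" and "0.01"; a non-numeric value fails the match at boot.
{ratio, _rest} = Float.parse(System.get_env("OTEL_TRACES_SAMPLER_ARG", "1.0"))

if config_env() == :test do
  # Tests rely on in-process :telemetry handlers, so nothing is exported.
  config :opentelemetry, traces_exporter: :none
else
  config :opentelemetry,
    span_processor: :batch,
    traces_exporter: :otlp,
    # Head-based probabilistic sampling: keep `ratio` of new traces and
    # follow the parent's decision for spans that continue a trace.
    sampler: {:parent_based, %{root: {:trace_id_ratio_based, ratio}}}

  config :opentelemetry_exporter,
    otlp_protocol: :http_protobuf,
    otlp_endpoint: endpoint
end
```

With this shape, staging would run with `OTEL_TRACES_SAMPLER_ARG=0.1` and prod with `0.01`, matching the 10 % and 1 % rows of the table.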
