@Greptime: Observability has a version number now. Most teams are still on 1.0 without realizing …
Summary
This thread explains Observability 2.0, a shift from pre-aggregated metrics to storing wide events with all fields, enabling ad-hoc queries at read time. It highlights the urgency for AI agent observability and how GreptimeDB supports this model.
Cached at: 07/02/26, 02:24 PM
Observability has a version number now. Most teams are still on 1.0 without realizing it.
The three-pillar model works. But it was designed around a constraint that no longer exists: storage was expensive, so you pre-aggregated everything before writing it down.
That write-time bet is the problem.
When you instrument a service, you decide upfront which dimensions your metrics carry. High-cardinality fields like build_id, region, payment_provider are often excluded because they’d blow up cardinality. Logs keep the text but drop the schema. Traces are often sampled. Each pillar discards something, and you can’t get it back when you need it.
Observability 2.0 flips this. One wide event per request or span, every field kept. Metrics, trace views, and log views become different queries over the same raw data, derived at read time. You can GROUP BY build_id after the incident, not just before it. The idea isn’t new. Meta’s Scuba was doing it in 2013. What changed is that columnar storage on object storage made it affordable. Months of raw, high-cardinality events without the storage bill killing the project.
AI agents are where the upgrade becomes urgent. A single agent step carries model name, tokens, the full prompt, tool call params, reasoning, memory state. 50 to 200 fields per event. The questions aren’t “is it up”; they’re “was the answer accurate,” “did it hallucinate,” “why did it pick that tool.” You can only answer those by keeping the raw event. Honeycomb’s data: mature observability datasets run 100+ dimensions. Pre-aggregation can’t cover that.
We’ve been working through these trade-offs while building GreptimeDB. Three posts covering the full reasoning:
- Wide Events, Explained: The Data Model Behind Observability 2.0: https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0…
- Agent Observability: https://greptime.com/blogs/2025-12-11-agent-observability#why-o11y-1-0-struggles-with-agent-data…
- Database for Observability 2.0: https://greptime.com/blogs/2025-04-25-greptimedb-observability2-new-database…
Wide Events, Explained: The Data Model Behind Observability 2.0
Source: https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0
Introduction (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#introduction)
Most monitoring setups make you guess. When you instrument a service, you decide up front which metrics to emit, which labels they carry, and which fields your log lines include. That decision happens at write time. But the questions you actually need to answer show up at read time, usually at 2 a.m., and they are almost never the ones you planned for. “Why are checkout requests slow, but only for users on the new mobile build, in one region, paying with one specific provider?” If build_id, region, and payment_provider were not part of your metric labels, that question is unanswerable. The data to answer it was thrown away before it was ever stored.
This gap between write-time decisions and read-time questions is the problem “Observability 2.0” tries to close, and wide events are the data model that makes it work.
What a Wide Event Actually Is (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#what-a-wide-event-actually-is)
A wide event is one structured record per unit of work, carrying as much context as you can attach to it. For a web service, the unit of work is usually a single request. Instead of incrementing a few counters and writing a couple of log lines, the service emits one event when the request finishes, with every field that might matter:
```json
{
  "timestamp": "2026-06-10T14:22:03.221Z",
  "service": "checkout",
  "endpoint": "POST /api/checkout",
  "status_code": 500,
  "duration_ms": 1840,
  "user_id": "u_8821",
  "region": "ap-southeast-1",
  "build_id": "2026.6.2-rc1",
  "payment_provider": "stripe",
  "db_query_ms": 1620,
  "cache_hit": false,
  "feature_flags": ["new_checkout_ui", "async_receipts"],
  "trace_id": "a1b2c3...",
  "error": "payment gateway timeout"
}
```
That is one row. A busy endpoint produces millions of them. Each one is a fact about something that really happened, with the dimensions kept intact rather than averaged away.
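As a rough illustration of "one event when the request finishes", here is a minimal Python sketch. The handler name `handle_checkout` and the `simulate_timeout` flag are hypothetical, invented for this example; the field names follow the JSON above. A real service would write the event to its telemetry pipeline instead of printing it.

```python
import json
import time
from datetime import datetime, timezone


def handle_checkout(request: dict) -> dict:
    """Handle one request and return a single wide event describing it."""
    started = time.monotonic()
    # Start the event with the context known on arrival.
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": "checkout",
        "endpoint": "POST /api/checkout",
        "user_id": request["user_id"],
        "region": request["region"],
        "build_id": request["build_id"],
        "payment_provider": request["payment_provider"],
    }
    try:
        # Hypothetical business logic; a real handler would call the payment gateway here.
        if request.get("simulate_timeout"):
            raise TimeoutError("payment gateway timeout")
        event["status_code"] = 200
    except TimeoutError as exc:
        event["status_code"] = 500
        event["error"] = str(exc)
    finally:
        # Duration is recorded on both the success and the failure path.
        event["duration_ms"] = round((time.monotonic() - started) * 1000, 3)
    return event


event = handle_checkout({
    "user_id": "u_8821", "region": "ap-southeast-1",
    "build_id": "2026.6.2-rc1", "payment_provider": "stripe",
    "simulate_timeout": True,
})
print(json.dumps(event))  # one line, one row, written once when the request finishes
```

The point of the shape is that context accumulates on a single record during the request, and nothing is aggregated before it is written.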
Compare that to the three pillars. A metric like http_requests_total{status="500"} tells you 500s went up, but it cannot tell you they were concentrated on one build, because adding build_id and user_id as labels would blow up cardinality and cost. A plain log line has the context but no structure, so you grep it and hope. A trace has the structure but is usually sampled, so the one request you care about is often the one that got dropped. The wide event keeps the structure and the context in a form you can query.
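The same contrast fits in a few lines of Python. This is a sketch with made-up sample events, not a model of any particular metrics system: the counter stands in for a write-time metric that keeps only two labels, and the second aggregation runs over the raw events.

```python
from collections import Counter

events = [
    {"endpoint": "/api/checkout", "status_code": 500, "build_id": "2026.6.2-rc1", "region": "ap-southeast-1"},
    {"endpoint": "/api/checkout", "status_code": 500, "build_id": "2026.6.2-rc1", "region": "ap-southeast-1"},
    {"endpoint": "/api/checkout", "status_code": 500, "build_id": "2026.6.2-rc1", "region": "us-east-1"},
    {"endpoint": "/api/checkout", "status_code": 200, "build_id": "2026.6.1", "region": "us-east-1"},
]

# Write-time aggregation: only (endpoint, status_code) survive as labels.
metric = Counter((e["endpoint"], e["status_code"]) for e in events)

# Read-time aggregation over the raw events: any kept column can be a group key.
errors_by_build = Counter(e["build_id"] for e in events if e["status_code"] == 500)

print(metric[("/api/checkout", 500)])  # the metric knows how many 500s there were
print(errors_by_build)                 # the events know which build they came from
```

Once only `metric` is stored, there is no way to recover `errors_by_build` from it; the build dimension is simply gone.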
[Figure: A table of checkout request events with all columns kept. Projecting it to a metric with GROUP BY endpoint, status keeps only those two low-cardinality columns and drops the rest. The metric counts three 500s but cannot group by build_id, region, or payment_provider because those columns were discarded at write time.]
Figure 1: A metric is this table with most columns deleted at write time. It can count the 500s but not explain them, because GROUP BY build_id needs a column the metric never kept.
Where the Idea Came From (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#where-the-idea-came-from)
This is not new. Meta built an internal system called Scuba around 2013 for exactly this: ingest raw, wide events and slice them along any dimension at query time, fast enough to explore interactively. The Scuba paper (https://vldb.org/pvldb/vol6/p1057-wiener.pdf) describes a database that took in millions of events per second specifically so engineers could ask unplanned questions during an incident. Stripe later wrote about “canonical log lines,” the practice of emitting one rich, structured summary line per request instead of scattering context across many lines. Honeycomb productized the approach for the wider industry and has spent years arguing that this is what real debugging needs.
The pattern keeps getting reinvented because it solves a problem aggregation cannot: you can only group by a dimension you still have.
What Observability 2.0 Means (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#what-observability-2-0-means)
The core claim is about sources of truth. Charity Majors, Honeycomb’s co-founder, is the one who put the “1.0 versus 2.0” label on it in a post titled It’s Time to Version Observability (https://www.honeycomb.io/blog/time-to-version-observability-signs-point-to-yes), and the distinction is sharper than the usual “collect more telemetry” pitch.
In the 1.0 model you maintain three separate stores: a metrics system, a log system, and a tracing system. Each has its own storage, its own query language, and its own copy of overlapping information. To investigate an incident you jump between them and reconstruct the story by hand. Worse, each store has already discarded something at write time: metrics pre-aggregated away the high-cardinality dimensions, traces sampled away most of the requests, logs kept the text but dropped the schema.
In the 2.0 model there is one source of truth: the wide events. Metrics, traces, and log-style views are derived from those events at query time rather than stored as separate primitives. A request rate is a COUNT over events grouped by time. A latency heatmap is a distribution over the duration_ms field. A trace is the set of events sharing a trace_id. Because the raw dimensions are still there, you can group by build_id after the incident, not just before it. You can ask questions you did not anticipate, which is the entire point of observability as opposed to monitoring.
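To make "derived at query time" concrete, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for an events store. The table layout and sample rows are assumptions for illustration, and SQLite is not a columnar observability database; the only point is that all three views are plain queries over one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        ts TEXT, service TEXT, status_code INTEGER, duration_ms INTEGER,
        build_id TEXT, trace_id TEXT, error TEXT
    )
""")
rows = [
    ("2026-06-10T14:22:03", "checkout", 500, 1840, "2026.6.2-rc1", "t1", "payment gateway timeout"),
    ("2026-06-10T14:22:04", "payments", 504, 1620, "2026.6.2-rc1", "t1", "upstream timeout"),
    ("2026-06-10T14:22:41", "checkout", 200, 95, "2026.6.1", "t2", None),
    ("2026-06-10T14:23:10", "checkout", 500, 1790, "2026.6.2-rc1", "t3", "payment gateway timeout"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Metric: request count per minute (the first 16 characters of ts are YYYY-MM-DDTHH:MM).
rate = conn.execute(
    "SELECT substr(ts, 1, 16) AS minute, COUNT(*) FROM events GROUP BY minute ORDER BY minute"
).fetchall()

# Trace: every event that shares one trace_id, in time order.
trace = conn.execute(
    "SELECT service, duration_ms FROM events WHERE trace_id = ? ORDER BY ts", ("t1",)
).fetchall()

# Log view: events matching a predicate.
logs = conn.execute(
    "SELECT ts, error FROM events WHERE error LIKE '%timeout%' ORDER BY ts"
).fetchall()

# The question nobody planned for: errors grouped by build.
by_build = conn.execute(
    "SELECT build_id, COUNT(*) FROM events WHERE status_code >= 500 GROUP BY build_id"
).fetchall()

print(rate, trace, logs, by_build, sep="\n")
```

The last query is the one a pre-aggregated metric could not have answered, because `build_id` would not have been kept as a label.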
[Figure: One events table is the only thing stored. A metric is a SELECT with count over time buckets, a trace is a SELECT filtered to a shared trace_id, and a log view is a SELECT filtered by a predicate. All three are derived from the same events at query time.]
Figure 2: In Observability 2.0 the events table is the only thing stored. Metrics, traces, and logs are just different SELECTs over it, derived at query time instead of written to three separate stores.
Why This Is Happening Now (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#why-this-is-happening-now)
If wide events are obviously better for debugging, why did the industry pre-aggregate everything into metrics in the first place? Cost.
High-cardinality data is expensive to store and query. A metrics database keyed on user_id falls over because every unique value creates a new time series. So the standard advice for a decade was to keep labels low-cardinality and throw away the interesting dimensions. The three-pillar split was, in part, a workaround for storage that could not handle wide, high-cardinality data economically.
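A back-of-the-envelope calculation shows why. The distinct-value counts below are hypothetical, and the product is only an upper bound, since not every label combination occurs in practice, but it gives a feel for how quickly series counts grow once a label like user_id is added.

```python
from math import prod

# Hypothetical distinct-value counts per label; real numbers depend on the system.
low_cardinality = {"endpoint": 50, "status_code": 8, "region": 12}
high_cardinality = {**low_cardinality, "build_id": 200, "user_id": 1_000_000}


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: the product of label cardinalities."""
    return prod(label_cardinalities.values())


print(f"{max_series(low_cardinality):,}")   # 4,800
print(f"{max_series(high_cardinality):,}")  # 960,000,000,000
```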
Two things changed that. Columnar storage handles wide rows with many sparse fields far better than the row-oriented or series-oriented engines that metrics systems were built on, because a query only reads the columns it touches and columnar compression on repetitive fields is very good. And object storage made the bytes themselves cheap, so keeping months of raw events stopped being a budget conversation. Our own log benchmark against Loki and Elasticsearch (https://greptime.com/blogs/2025-08-07-beyond-loki-greptimedb-log-scenario-performance-report) shows what a columnar engine on object storage does to observability-scale data at rest. Observability 2.0 is less a new idea than an old idea that finally became affordable.
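The two columnar properties mentioned above can be sketched in plain Python. This is a toy model under stated assumptions: real engines use on-disk formats and several encodings, and run-length encoding is used here only as one common example of why repetitive columns compress well. It is not a description of how any specific database stores data.

```python
from itertools import groupby

rows = [
    {"region": "ap-southeast-1", "status_code": 200, "duration_ms": 91},
    {"region": "ap-southeast-1", "status_code": 200, "duration_ms": 88},
    {"region": "ap-southeast-1", "status_code": 500, "duration_ms": 1840},
    {"region": "us-east-1", "status_code": 200, "duration_ms": 102},
]

# Column-oriented layout: one list per field instead of one dict per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query over duration_ms touches one column and never reads the others.
slowest = max(columns["duration_ms"])


def run_length_encode(values: list) -> list:
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


encoded_region = run_length_encode(columns["region"])
print(slowest)         # 1840
print(encoded_region)  # [('ap-southeast-1', 3), ('us-east-1', 1)]
```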
That reframes the whole thing as a database problem. “Store wide events and derive metrics, traces, and logs from them” is a statement about a query engine, not a dashboard. You need a store that ingests high-cardinality structured data, compresses it on object storage, and answers both analytical questions (“group all checkout errors by build and region over the last week”) and time-series questions (“p99 latency per minute”) without forcing you into two systems again.
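For the time-series side, "p99 latency per minute" is itself just an aggregation over raw durations. The sketch below uses synthetic data and the nearest-rank percentile definition; a real engine would likely use its own percentile function, possibly an approximate one, so treat the helper as an illustration of the computation rather than of any product's behavior.

```python
import math
from collections import defaultdict

# (minute, duration_ms) pairs; synthetic data for illustration only.
samples = [("14:22", d) for d in range(1, 101)] + [("14:23", d) for d in (10, 20, 3000)]


def percentile(values: list, pct: int) -> int:
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


by_minute = defaultdict(list)
for minute, duration_ms in samples:
    by_minute[minute].append(duration_ms)

p99_per_minute = {minute: percentile(values, 99) for minute, values in sorted(by_minute.items())}
print(p99_per_minute)  # {'14:22': 99, '14:23': 3000}
```

Because the raw durations are kept, the same data can later answer p50, p99.9, or a per-build breakdown without re-instrumenting anything.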
Why AI Agents Make This Urgent (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#why-ai-agents-make-this-urgent)
The clearest case for wide events right now is AI agents. An agent application emits exactly this shape of data: one execution step carries the model name, token counts, latency, the full prompt and response, a list of tool calls with dynamic parameters, the reasoning behind a decision, and the memory state it read. That is dozens to hundreds of fields per event, much of it semi-structured, with high-cardinality keys like session_id and trace_id on every row.
The three pillars handle this badly. Stuff the prompt and tool output into logs and you lose the structure you need to query them. Force the dynamic tool-call schema into traces and it is too rigid to fit. Pre-aggregate token usage into a metric and you can no longer trace a latency spike back to the specific prompt that caused it. And the questions teams ask about agents are not “is it up” or “how fast,” but “was the answer right, was the tool choice sensible, did it hallucinate.” Those are semantic-quality questions you can only answer by keeping the raw event and deriving new dimensions from it after the fact. Honeycomb has pointed out that mature observability datasets routinely carry hundreds of dimensions, which is exactly what metric pre-aggregation cannot cover.
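"Deriving new dimensions after the fact" can look like the sketch below. The event fields, model names, and tool names are invented for the example, and `answered_without_tool` is a deliberately crude proxy, not a hallucination detector. What matters is that the new dimension is computed from stored events, with no change to the agent's instrumentation.

```python
agent_steps = [
    {"session_id": "s1", "step": 1, "model": "model-a", "prompt_tokens": 812,
     "tool_calls": [{"name": "search_orders", "params": {"user_id": "u_8821"}}],
     "response": "Order 4471 shipped on June 3."},
    {"session_id": "s1", "step": 2, "model": "model-a", "prompt_tokens": 1290,
     "tool_calls": [],
     "response": "Your refund was approved yesterday."},
    {"session_id": "s2", "step": 1, "model": "model-b", "prompt_tokens": 640,
     "tool_calls": [{"name": "lookup_refund", "params": {"order_id": 4471}}],
     "response": "Refund for order 4471 is pending."},
]


def answered_without_tool(step: dict) -> bool:
    """A dimension invented after the fact: did the step answer without calling any tool?"""
    return len(step["tool_calls"]) == 0


# Derive the new column at read time and slice by it; no re-instrumentation needed.
ungrounded = [(s["session_id"], s["step"]) for s in agent_steps if answered_without_tool(s)]
tools_used = sorted({call["name"] for s in agent_steps for call in s["tool_calls"]})
print(ungrounded)  # [('s1', 2)]
print(tools_used)  # ['lookup_refund', 'search_orders']
```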
That takes wide events from “better” to close to mandatory, and the database vendors have noticed: ClickHouse built its ClickStack around the wide-event model, and unified observability databases are moving the same way. We walk through the agent case in detail in Agent Observability: Can the Old Playbook Handle the New Game? (https://greptime.com/blogs/2025-12-11-agent-observability).
Where Wide Events Get Stored (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#where-wide-events-get-stored)
Teams do this a few different ways today.
Honeycomb remains the clearest expression of the model as a hosted product, with a query interface built around exploring events by dimension. A lot of teams build their own version on a columnar analytics database, most commonly ClickHouse, wiring up ingestion and a query layer themselves. And there are databases aimed directly at this unified shape.
GreptimeDB (https://greptime.com/product/db) is one example: an open-source database that stores metrics, logs, and traces in a single engine on object storage, and lets you query the same data with SQL for analytical slicing and PromQL for the time-series views people already have dashboards for. It is built in Rust with compute and storage separated, which is the shape this model needs, and it frames its own design around Observability 2.0 and wide events (https://greptime.com/blogs/2025-04-25-greptimedb-observability2-new-database) directly. CrateDB reaches for the same unified target from a search-and-analytics starting point, and InfluxDB (https://greptime.com/compare/influxdb)’s columnar 3.0 direction is a move toward it as well.
The takeaway is not which product. It is that the storage layer is the part of “Observability 2.0” that was actually missing, and it is the part that is now solved well enough to build on.
When Wide Events Are Not the Answer (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#when-wide-events-are-not-the-answer)
Worth being honest about the limits, because the model has real costs.
Wide events do not replace high-volume infrastructure metrics. If you are scraping CPU and memory off ten thousand containers, a counter is the right tool and emitting a wide event per sample would be wasteful. Metrics are a fine, cheap aggregate when you already know the question and the cardinality is low.
Wide events also move work onto your instrumentation. The model is only as good as the fields you attach, and getting build_id, feature_flags, and the right business dimensions onto every event takes deliberate effort across a codebase. OpenTelemetry (https://opentelemetry.io/) span attributes help, since a span is already a structured event with a place to hang context, but the team still has to decide what context matters.
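One way to lower that effort is to give code a place to hang context without threading the event through every function call. The sketch below does this with Python's standard contextvars module. The helper names `start_event` and `add_fields` are hypothetical, and this is not the OpenTelemetry API; it only plays a role similar to setting attributes on the current span.

```python
import contextvars

# Holds the wide event for the unit of work currently being handled.
_current_event = contextvars.ContextVar("current_event")


def start_event(**fields) -> dict:
    """Begin a new wide event and make it the current one for this context."""
    event = dict(fields)
    _current_event.set(event)
    return event


def add_fields(**fields) -> None:
    """Attach more context to the current event from anywhere in the call stack."""
    _current_event.get().update(fields)


def charge_card(provider: str) -> None:
    # Deep in the call stack, code adds what it knows without passing the event around.
    add_fields(payment_provider=provider, db_query_ms=1620)


event = start_event(service="checkout", build_id="2026.6.2-rc1",
                    feature_flags=["new_checkout_ui", "async_receipts"])
charge_card("stripe")
print(event["payment_provider"])  # stripe
```

The mechanism is cheap; deciding which fields every team must attach is still the hard part.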
And storage is cheaper, not free. Keeping every raw event at full fidelity for a year is affordable on object storage in a way it was not before, but it is not zero, and very high-throughput systems still need sampling strategies. The advantage is that you can now make that trade-off deliberately, rather than having it forced on you by a metrics backend that simply cannot hold the data.
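A deliberate sampling policy might look like the following sketch. The policy itself (keep every error, roughly one in ten successes) and the helper names are assumptions for illustration. Each kept event records its sample_rate so that counts can be re-weighted at query time, and the decision is a hash of trace_id so that it is repeatable rather than random.

```python
import hashlib


def sample_rate_for(event: dict) -> int:
    """Keep every error; keep roughly 1 in 10 successes (hypothetical policy)."""
    return 1 if event["status_code"] >= 500 else 10


def keep(event: dict) -> bool:
    """Deterministic decision: the same trace_id gets the same answer at a given rate."""
    rate = sample_rate_for(event)
    digest = hashlib.sha256(event["trace_id"].encode()).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0


def sampled(events: list) -> list:
    """Return kept events, each tagged with sample_rate so counts can be re-weighted."""
    return [{**e, "sample_rate": sample_rate_for(e)} for e in events if keep(e)]


events = [{"trace_id": f"t{i}", "status_code": 500 if i % 50 == 0 else 200} for i in range(1000)]
kept = sampled(events)

# Summing sample_rate estimates the original volume from the kept events.
estimated_total = sum(e["sample_rate"] for e in kept)
print(len(kept), estimated_total)
```

Every error survives at full fidelity, while the success path is thinned out and can still be counted approximately.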
The Short Version (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#the-short-version)
Observability 1.0 asks you to decide what you will want to know before anything goes wrong, and quietly discards the rest. Wide events keep the full context of each unit of work in a structured, queryable form, and Observability 2.0 derives metrics, traces, and logs from that single source of truth at read time. The idea is a decade old; what changed is that columnar engines on object storage made it cheap enough to be the default rather than a luxury. If you have ever been blocked at 2 a.m. by a dimension you did not think to add as a label, you already understand why teams are making the switch.
About Greptime (https://greptime.com/tech-content/2026-06-10-wide-events-observability-2-0#about-greptime)
GreptimeDB is an open-source, cloud-native database purpose-built for real-time observability. Built in Rust and optimized for cloud-native environments, it provides unified storage and processing for metrics, logs, and traces, deliveri