TL;DR: As AI agents run for minutes to days and span hundreds or even thousands of steps, traditional observability methods break down. We need to shift to trajectory-based, observability-driven development to monitor and debug long-running agents.
## Background: Risks and Challenges of Agents
I'm Sunny, founding engineer at Honeycomb. Honeycomb is an observability and evaluation platform. If your agent runs in production, you need to monitor it and evaluate it. We happen to have products for that.
The typical agent in 2026: we set out to deploy a "good kid" that could help the company with various tasks, but what actually went live was a "good kid" with root privileges who can run `rm -rf`. As models and agents become more powerful, we grant them more permissions, and the risk rises sharply.
### Three Real Incidents
1. **A LangChain agent ran continuously for 11 days, racked up $47,000 in costs, and accomplished nothing** (March 2025).
2. **An agent deleted system files** – the equivalent of wiping system files on your own Mac.
3. **An agent read an extremely large file, causing context explosion and contaminating the entire session**.
Anyone who has used Claude Code should be familiar with these situations.
## Three Stages of Agent Development
### Stage 1: Teaching the LLM How to Think
- Prompt engineering, context engineering, RAG, evals – concepts from machine learning, later adopted by agent companies.
### Stage 2: Rise of Orchestration Frameworks (2024–2025)
- The era of o1-mini, o3-mini.
- CrewAI, LangChain, and other frameworks became popular.
- MCP emerged, and ReAct agents appeared.
- Attempts to teach the model how to plan; state machines and state tracking began to emerge.
### Stage 3: Autonomous Agents (2026)
- Models are stronger and more autonomous (e.g., Claude Code, Opus 4.7).
- Instead of trying to teach them how to think or plan, you directly give them tools and skills and let them figure it out.
- Paradigm shifts to "harness" and "sandbox" – you build a sandbox, give tools and a credit card, and let the agent loose.
## Harness: Long-Running Loop
A harness is essentially a loop around the LLM that can run for hours or even days. Inside this loop:
- Call the LLM;
- The LLM chooses the next action: configure the sandbox, call tools, or coordinate sub-agents.
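The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular product's harness: `Action`, `run_harness`, `scripted_llm`, and the `tools` dictionary are hypothetical names, and the model is replaced by a scripted function so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str            # "tool" or "finish" (hypothetical action types)
    name: str = ""       # tool name when kind == "tool"
    argument: str = ""   # tool input, or the final answer when finishing


def run_harness(llm, tools, task, max_steps=1000):
    """Loop around the model: ask for an action, execute it, feed the result back."""
    history = [("user", task)]
    for _ in range(max_steps):
        action = llm(history)
        if action.kind == "finish":
            history.append(("agent", action.argument))
            return history
        result = tools[action.name](action.argument)
        history.append(("tool:" + action.name, result))
    # Step budget exhausted: stop instead of looping forever.
    history.append(("harness", "step budget exhausted"))
    return history


def scripted_llm(history):
    # Stand-in for a real model call: read one file, then finish.
    if not any(role.startswith("tool:") for role, _ in history):
        return Action("tool", "read", "README.md")
    return Action("finish", argument="done")


tools = {"read": lambda path: f"<contents of {path}>"}
history = run_harness(scripted_llm, tools, "summarize the README")
```

The `max_steps` cap is the simplest possible guard against the runaway-loop incident mentioned earlier; a real harness would likely also cap cost and wall-clock time.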
Borrowing an analogy from Anthropic: The **brain** is the LLM and the harness around it; the **hands** are the sandbox and tools.
Characteristics of modern harnesses:
- Can be 100–1000 steps long.
- Trend in 2026: Design the "hands" very well (security, access control), while the harness layer is often thin.
- As models become stronger, the harness layer will continue to stay thin.
## Visualization: Real Trajectory Example
I traced my own Claude Code session with HoneyHive. For example, a task to develop a feature generated **689 events**, including hundreds of tool calls (bash, read, write, file edit) and many model events (interactions between the agent and the user).
Previously we had at most 10–20 steps; now it can be hundreds or even thousands. Finding errors among them is like looking for a needle in a haystack – you can't manually go through hundreds of steps and evaluate them. This brings unprecedented challenges.
## Hooks and Skills
### Hooks
- Similar to webhooks, triggered at various execution points in the harness.
- Examples of hooks in Claude Code: pre-tool use, post-tool use, permission requests (when human permission is needed), task tracking, sub-agent spawning, and agent start/stop.
- These hooks are synchronous and can be used for observability as well as client-side evaluators to implement runtime guardrails.
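To make the idea of a synchronous hook acting as a runtime guardrail concrete, here is a small sketch. It does not use Claude Code's actual hook configuration; `HookRegistry`, the `"pre_tool_use"` event name, and `call_tool` are assumptions made up for illustration.

```python
from collections import defaultdict


class HookRegistry:
    """Hypothetical registry: hooks run synchronously, in registration order."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def on(self, event):
        def register(fn):
            self._hooks[event].append(fn)
            return fn
        return register

    def fire(self, event, payload):
        # Every hook runs; the first non-None verdict (a block reason) is kept.
        first_verdict = None
        for hook in self._hooks[event]:
            verdict = hook(payload)
            if verdict is not None and first_verdict is None:
                first_verdict = verdict
        return first_verdict


hooks = HookRegistry()
log = []


@hooks.on("pre_tool_use")
def record(payload):
    # Observability: record every attempted tool call.
    log.append(payload["tool"])


@hooks.on("pre_tool_use")
def block_destructive(payload):
    # Guardrail: refuse an obviously destructive shell command.
    if payload["tool"] == "bash" and "rm -rf" in payload["input"]:
        return "blocked: destructive command"


def call_tool(tool, tool_input):
    verdict = hooks.fire("pre_tool_use", {"tool": tool, "input": tool_input})
    if verdict is not None:
        return verdict          # the tool never runs
    return f"ran {tool}"        # placeholder for real tool execution
```

Because the hook fires before the tool executes and the harness waits for its verdict, the same mechanism serves both logging and blocking.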
### Skills
- A paradigm supported by all frontier models.
- A skill consists of three parts:
  1. The actual tool execution code;
  2. The skill's preamble (always embedded in the agent's memory, similar to a semantic hook);
  3. The skill's MD file (guides the agent on how to use the tool).
- Skills are reusable behavioral units, e.g., a "PR review" skill or a "QA feature" skill; different AI workers can use the same skill in different contexts.
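The three-part structure of a skill might be modeled roughly as follows. The `Skill` class and `build_context` function are hypothetical, and loading the MD instructions only when a skill is active is an assumption of this sketch, not something the talk specifies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Skill:
    name: str
    run: Callable[[str], str]   # 1. the tool execution code
    preamble: str               # 2. short text always kept in the agent's memory
    instructions_md: str        # 3. contents of the skill's MD file


def build_context(skills, active: Optional[str] = None) -> str:
    """Always include every preamble; add full instructions for the active skill."""
    lines = [f"[{s.name}] {s.preamble}" for s in skills]
    if active is not None:
        lines.append(next(s for s in skills if s.name == active).instructions_md)
    return "\n".join(lines)


pr_review = Skill(
    name="pr-review",
    run=lambda diff: f"reviewed {len(diff.splitlines())} lines",
    preamble="Use when asked to review a pull request.",
    instructions_md="# PR review\nRead the diff, then comment on risky changes.",
)
```

Two different AI workers could share `pr_review` and differ only in what else is in their context.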
## What Is a Trajectory
A trajectory is a sequence of events or steps. Events include: LLM events, tool calls, agent delegation, decision points, permission requests, etc.
For example, a visualization of a 689-step session trajectory shows: user turns, bash, read, edit, etc., repeated about 600 times. This lets you visualize the agent's behavior patterns and observe custom fields such as input token count, evaluator results, etc.
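As a rough data model, a trajectory can be represented as an ordered list of typed events with custom fields attached. The class and field names below are illustrative assumptions, not a schema from any tracing product.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str                 # e.g. "llm", "tool", "delegation", "permission_request"
    name: str                 # e.g. "bash", "read", "edit", "user_turn"
    input_tokens: int = 0     # custom field of the kind a dashboard might chart
    metadata: dict = field(default_factory=dict)  # evaluator results, etc.


@dataclass
class Trajectory:
    session_id: str
    events: list = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)

    def tool_sequence(self) -> list:
        """Ordered tool names, useful for spotting behavior patterns."""
        return [e.name for e in self.events if e.kind == "tool"]

    def counts_by_name(self) -> Counter:
        return Counter(e.name for e in self.events)


traj = Trajectory("session-1")
traj.add(Event("llm", "user_turn", input_tokens=350))
traj.add(Event("tool", "bash", input_tokens=1200))
traj.add(Event("tool", "read", input_tokens=800))
traj.add(Event("tool", "bash", input_tokens=4000, metadata={"evaluator": "pass"}))
```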
## AI Workers: Goals and Challenges
What enterprises really want are **AI workers** – efficient at completing specific tasks within a specific domain, not general-purpose agents. AI workers should have:
- Clear success criteria (measurable ROI);
- Clear guardrails;
- Limited blast radius (impact is controlled when errors occur);
- Horizontal scalability.
### Six Failure Modes
1. **Context rot**: A single tool call consumes a large number of tokens, and its output lingers in the session, degrading the quality of subsequent calls.
2. **Amnesia**: The agent manually tries to do something that an existing skill or tool can already do.
3. **Ergonomics**: When tools return noisy JSON or are semantically unintuitive, the agent gets confused or even mimics the tool's behavior.
4. **YOLO**: The agent confidently performs irreversible operations without asking for permission.
5. **Delegation issues**: The agent does everything itself and doesn't call sub-agents.
6. **Stochasticity**: The trajectory variance is huge; you want preferred paths rather than complete randomness.
## Dashboard: Foundation for Discovering Failure Modes
An effective dashboard is especially useful for long-running agents:
- **Token usage by tool**: Identify tools that bloat the context, e.g., bash tool consuming 70k tokens per day.
- **Step count over time**: For example, the longest session had 4600 steps.
- **Tool usage grouping**: Understand which tools the agent prefers, and which are underused or overused.
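The three dashboard views above reduce to simple aggregations over trace events. The sketch below assumes each event is a dictionary with hypothetical `session`, `tool`, and `tokens` keys, and it reports step count per session rather than over time to keep the example short.

```python
from collections import Counter


def summarize(events):
    """Aggregate raw events into the three dashboard views."""
    tokens_by_tool = Counter()
    calls_by_tool = Counter()
    steps_by_session = Counter()
    for e in events:
        steps_by_session[e["session"]] += 1
        if e.get("tool"):  # model-only events carry no tool
            tokens_by_tool[e["tool"]] += e.get("tokens", 0)
            calls_by_tool[e["tool"]] += 1
    longest = max(steps_by_session.items(), key=lambda kv: kv[1], default=None)
    return {
        "tokens_by_tool": tokens_by_tool.most_common(),
        "calls_by_tool": calls_by_tool.most_common(),
        "longest_session": longest,
    }


events = [
    {"session": "a", "tool": "bash", "tokens": 7000},
    {"session": "a", "tool": "read", "tokens": 1200},
    {"session": "a", "tool": None, "tokens": 300},
    {"session": "b", "tool": "bash", "tokens": 500},
]
summary = summarize(events)
```

Sorting by token total puts context-bloating tools at the top, which is where context rot usually starts.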
## Limitations of Eval-Driven Development
Eval-driven development will not scale, for three reasons:
1. **Capability surpasses evaluation infrastructure**: Models improve, behavior changes, evaluations must be updated; new tools, new paradigms, new models keep emerging, and evaluations can't keep up.
2. **Evaluations are static**: When an agent runs for hundreds or thousands of steps, static evaluations are not very useful. Trajectories are non-deterministic, variance is huge, and the divergence between runs compounds with step count. By step 50, different runs may already be in completely different places.
3. **Simulation difficulty**: The gap between simulation and production remains large.
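A back-of-the-envelope calculation illustrates the second point. Assume, purely for illustration, that two runs of the same task make the same choice at each step with probability `p_agree`, independently of earlier steps. Neither the independence assumption nor the specific probabilities come from the talk.

```python
def prob_same_path(p_agree: float, steps: int) -> float:
    """Probability that two runs agree on every one of `steps` choices."""
    return p_agree ** steps


after_50 = prob_same_path(0.95, 50)    # roughly 0.077
after_500 = prob_same_path(0.95, 500)  # vanishingly small
```

Even with 95% per-step agreement, fewer than one in ten pairs of runs would still share a path at step 50, which is why a fixed expected trajectory is a poor evaluation target for long runs.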
## Observability-Driven Development
A more suitable approach for long-running agents, with these steps:
1. **Instrumentation**: Use Honeycomb's out-of-the-box trace support.
2. **Deploy AI workers with strict guardrails**: Create many guardrails to understand execution boundaries and set thresholds accordingly.
3. **Collect real trajectories**: After hundreds to thousands of trajectories, clustering techniques can help identify task categories and discover new failure modes.
4. **Specialization**: Further optimize agent behavior based on trajectory patterns.
For example, set two guardrails on the trajectory: "irreversible operations" and "bash tokens exceed 10k". Once the agent crosses a threshold, alert immediately. In Honeycomb, you can set alerts on server-side evaluators, so that at minimum you are notified when the agent goes out of bounds.
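A sketch of those two guardrails as a check over trajectory events follows. The marker strings, the event keys, and the choice to apply the 10k limit per bash call (rather than cumulatively) are all assumptions of this example. It is plain Python and does not call any vendor API.

```python
# Hypothetical substrings that mark a command as irreversible.
IRREVERSIBLE_MARKERS = ("rm -rf", "git push --force", "DROP TABLE")
BASH_TOKEN_LIMIT = 10_000  # the "bash tokens exceed 10k" threshold


def check_guardrails(events):
    """Return (step_index, rule_name) for every guardrail violation."""
    alerts = []
    for step, event in enumerate(events):
        if event["tool"] != "bash":
            continue
        if any(marker in event["input"] for marker in IRREVERSIBLE_MARKERS):
            alerts.append((step, "irreversible_operation"))
        if event["output_tokens"] > BASH_TOKEN_LIMIT:
            alerts.append((step, "bash_tokens_over_limit"))
    return alerts


events = [
    {"tool": "bash", "input": "ls", "output_tokens": 120},
    {"tool": "bash", "input": "cat big.log", "output_tokens": 48_000},
    {"tool": "read", "input": "a.py", "output_tokens": 900},
    {"tool": "bash", "input": "rm -rf build/", "output_tokens": 5},
]
alerts = check_guardrails(events)
```

Substring matching is deliberately crude; it is enough to trigger an alert for review, not to decide on its own whether an operation is safe.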
This approach emphasizes learning from real trajectories rather than relying on static evaluations or simulations.
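One simple way to start on step 3 is to group trajectories by the tool-call patterns they share. The sketch below uses greedy clustering on Jaccard similarity of tool bigrams. This is a stand-in chosen for brevity; the talk does not name a specific clustering technique, and the 0.5 threshold is arbitrary.

```python
def bigrams(tool_sequence):
    """Set of consecutive tool pairs, e.g. ("read", "edit")."""
    return set(zip(tool_sequence, tool_sequence[1:]))


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(trajectories, threshold=0.5):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative_bigrams, member_indices)
    for index, sequence in enumerate(trajectories):
        features = bigrams(sequence)
        for representative, members in clusters:
            if jaccard(features, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((features, [index]))
    return [members for _, members in clusters]


trajectories = [
    ["read", "edit", "bash"],
    ["read", "edit", "bash", "bash"],
    ["search", "fetch", "write"],
]
groups = cluster(trajectories)
```

Here the two coding-style sessions land in one group and the research-style session in another; clusters that match no known task category are candidates for new failure modes.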
---
Source: https://www.youtube.com/watch?v=5XFRr4xkMQk