Dissecting model behavior through agent trajectories

arXiv cs.AI 06/17/26, 04:00 AM Papers

Summary

This paper introduces the Simple Strands Agent (SSA), a minimal harness designed to reduce the intent-execution gap between what a model intends and what its harness executes, and analyzes 138k SSA trajectories across model families such as Claude, Gemini, GPT, Grok, and Qwen to reveal fine-grained behavioral differences.

arXiv:2606.17454v1 Announce Type: new Abstract: AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.

Cached at: 06/17/26, 05:36 AM

# Dissecting model behavior through agent trajectories
Source: [https://arxiv.org/abs/2606.17454](https://arxiv.org/abs/2606.17454)
[View PDF](https://arxiv.org/pdf/2606.17454)

> Abstract: AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we **reproduce or improve on the pass@1** performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an **analysis of 138k trajectories generated by SSA**, we look beyond the `pass@1` numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
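
To make the metrics named in the abstract concrete, the following is a minimal sketch of how edit frequency, testing activity, and phase transitions might be computed from a single trajectory, alongside a single-attempt pass@1. It is not the paper's implementation: the action names, the `PHASE_OF_ACTION` mapping, and the functions `pass_at_1` and `trajectory_metrics` are all hypothetical, and the abstract does not specify how SSA actually defines phases or normalizes these quantities.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical mapping from tool actions to coarse phases (an assumption;
# the paper's actual phase taxonomy is not given in the abstract).
PHASE_OF_ACTION: Dict[str, str] = {
    "read": "explore",
    "search": "explore",
    "edit": "edit",
    "test": "test",
}


def pass_at_1(outcomes: List[bool]) -> float:
    """Fraction of tasks solved when each task gets exactly one attempt."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def trajectory_metrics(actions: List[str]) -> Dict[str, object]:
    """Summarize one trajectory given as an ordered list of action names."""
    phases = [PHASE_OF_ACTION.get(a, "other") for a in actions]
    counts = Counter(phases)
    n = len(phases)
    # Count only steps where the phase changes, keyed by (from, to).
    transitions = Counter(
        (prev, cur) for prev, cur in zip(phases, phases[1:]) if prev != cur
    )
    return {
        "edit_frequency": counts["edit"] / n if n else 0.0,
        "testing_activity": counts["test"] / n if n else 0.0,
        "phase_transitions": dict(transitions),
    }


# Illustrative trajectory (made-up data, not taken from the paper).
example = ["read", "search", "edit", "test", "edit", "test"]
metrics = trajectory_metrics(example)
print(metrics)
print(pass_at_1([True, False, True, True]))
```

Under these assumptions, two models with the same pass@1 could still differ in how often they alternate between editing and testing, which is the kind of difference the abstract says finer-grained metrics are meant to surface.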

## Submission history

From: Gaurav Gupta **\[v1\]** Tue, 16 Jun 2026 03:17:03 UTC (1,889 KB)

Similar Articles

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2057153343081111582

@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2066928605691523210
