Dissecting model behavior through agent trajectories
Summary
This paper introduces the Simple Strands Agent (SSA), a minimal harness designed to reduce the intent-execution gap, the mismatch between what a model intends and what its harness executes. It then analyzes 138k SSA trajectories across several model families (Claude, Gemini, GPT, Grok, Qwen) to reveal fine-grained behavioral differences that pass@1 scores alone do not show.
Cached at: 06/17/26, 05:36 AM
# Dissecting model behavior through agent trajectories
Source: [https://arxiv.org/abs/2606.17454](https://arxiv.org/abs/2606.17454)
[View PDF](https://arxiv.org/pdf/2606.17454)
> Abstract: AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we **reproduce or improve on the pass@1** performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an **analysis of 138k trajectories generated by SSA**, we look beyond the `pass@1` numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
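
To make the finer-grained metrics named in the abstract concrete, here is a minimal sketch of how edit frequency, testing activity, and phase transitions might be computed from a single trajectory. Everything in it is an assumption for illustration: the abstract does not give the paper's metric definitions or its code state-space representation, so the phase labels (`explore`, `edit`, `test`), the `TrajectoryStats` type, and the `summarize` function are hypothetical and not taken from SSA.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical phase labels; the paper's actual taxonomy is not given in the abstract.
EDIT, TEST, EXPLORE = "edit", "test", "explore"


@dataclass
class TrajectoryStats:
    edit_frequency: float   # fraction of steps labeled as edits
    test_frequency: float   # fraction of steps labeled as test runs
    phase_transitions: int  # number of adjacent steps whose labels differ


def summarize(steps: list[str]) -> TrajectoryStats:
    """Summarize one trajectory given as a sequence of per-step phase labels."""
    if not steps:
        return TrajectoryStats(0.0, 0.0, 0)
    counts = Counter(steps)
    # Count a transition whenever consecutive steps carry different labels.
    transitions = sum(1 for prev, cur in zip(steps, steps[1:]) if prev != cur)
    return TrajectoryStats(
        edit_frequency=counts[EDIT] / len(steps),
        test_frequency=counts[TEST] / len(steps),
        phase_transitions=transitions,
    )


# Toy trajectory: explore twice, then alternate editing and testing.
stats = summarize([EXPLORE, EXPLORE, EDIT, TEST, EDIT, TEST])
print(stats)
```

Aggregating such per-trajectory summaries by model would be one simple way to compare how models distribute effort, though the paper's own analysis may differ substantially.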
## Submission history
From: Gaurav Gupta [[view email](https://arxiv.org/show-email/cff9f92a/2606.17454)]
**[v1]** Tue, 16 Jun 2026 03:17:03 UTC (1,889 KB)

## Similar Articles
Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents
This paper introduces Base Sequence Analysis, a framework that encodes LLM agent runtime behavior into compact sequences, revealing high-risk patterns like the 'P-X-P' trigram and a verification deficit. It presents Governor, a runtime intervention system that improves task success by 6.2% and reduces token consumption by 44%.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA proposes strategic trajectory abstraction for long-horizon LLM agents, using hierarchical GRPO-style rollout with diverse strategy sampling and critical self-judgment to improve sample efficiency and final performance over frontier models and prior RL baselines.
TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation
TrajGenAgent proposes a hierarchical LLM agent framework that decouples macro-level activity planning from micro-level spatiotemporal instantiation for realistic human mobility trajectory generation without fine-tuning. It also introduces an anomaly-detection-based evaluation for behavioral fidelity.
@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2057153343081111582
A 100-page survey from UIUC, Meta, and Stanford introduces three harness layers (Interface, Mechanisms, Scaling) for AI agents, arguing that most agent failures stem from harness issues rather than reasoning flaws, and provides a taxonomy for auditing agent stacks.
@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2066928605691523210
The article distills 28 research papers into a 10-layer stack for building self-improving harnesses around AI models, emphasizing bounded, gated changes over general agent loops.