ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

arXiv cs.CL 06/11/26, 04:00 AM Papers
multi-turn os-agent training-data synthesis execution-grounded intent-simulation fine-tuning
Summary
This paper introduces ISE, a three-stage synthesis paradigm for generating multi-turn OS-agent trajectories with grounded execution, demonstrating that fine-tuning on the resulting ISE-Trace dataset significantly improves agent performance on ClawEval.
arXiv:2606.11520v1 Announce Type: new Abstract: Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms zero-shot GPT-4o and the larger Qwen3-32B base model which is four times bigger. An ablation on Stage 2 proves multi-turn simulation brings a large portion of the performance gain. We release all source code and dataset at https://github.com/Valiere01/ISE-Trace.
Original Article
View Cached Full Text
Cached at: 06/11/26, 01:38 PM
# An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories
Source: [https://arxiv.org/html/2606.11520](https://arxiv.org/html/2606.11520)
Siyuan Luo Nairong Zheng Lin Zhou†Tiankuo Yao†Shengyou Yuan† Haojia YuCong PangJiapeng LuoLewei Lu\*

###### Abstract

Training capable OS agents requires data that simultaneously captures structured user intents, multi\-turn task delegation, and grounded tool execution—properties absent from existing datasets\. We proposeISE\(Intent→\\mst@varfam@dot\\mst@varfam@slash\\rightarrowSimulate→\\mst@varfam@dot\\mst@varfam@slash\\rightarrowExecute\), a three\-stage synthesis paradigm that addresses these gaps jointly\.

Stage 1 constructs∼\\mst@varfam@dot\\mst@varfam@slash\\sim50,000 structured intents via a 4D framework \(Persona×\\mst@varfam@dot\\mst@varfam@slash\\timesDomain×\\mst@varfam@dot\\mst@varfam@slash\\timesTask×\\mst@varfam@dot\\mst@varfam@slash\\timesComplexity\); after deduplication the pool contains43,956\\mst@varfam@dot\\mst@varfam@slash 43\{,\}956unique intents and attains a Vendi Score of61\.57\\mst@varfam@dot\\mst@varfam@slash 61\.57over the*entire*pool onmpnet\-base\-v2embeddings \(cosine kernel,q=1\\mst@varfam@dot\\mst@varfam@slash\{\\mst@q\}\{=\}1\)\. Stage 2 drives multi\-turn user–agent interaction through a role\-locked user simulator that grounds each user turn in actual execution outcomes, producing 23,132 complete trajectories averaging 8\.12 user turns and 68\.24 total dialogue turns\. Stage 3 executes every tool call in a live, isolated OS workspace, yielding authentic failure–recovery dynamics rather than simulated responses\.

Fine\-tuning onISETracelifts ClawEval pass@1 from 19\.3 to 37\.7 on Qwen3\-8B \(agent tool\-use tasks, common\-denominator protocol\), surpassing both a GPT\-4o zero\-shot reference and a4×\\mst@varfam@dot\\mst@varfam@slash 4\\times\-larger Qwen3\-32B base; a Stage 2 ablation indicates multi\-turn simulation contributes a substantial share of the gain\. We release all code and data at[https://github\.com/Valiere01/ISE\-Trace](https://github.com/Valiere01/ISE-Trace)\.

spacing=nonfrench –

ISE: An Execution\-Grounded Recipe for Multi\-Turn OS\-Agent Trajectories

Siyuan Luo Nairong Zheng Lin Zhou†Tiankuo Yao†Shengyou Yuan†Haojia YuCong PangJiapeng LuoLewei Lu\*

†††Core contributors\.
\*Corresponding author\.## 1Introduction

Large language model agents are increasingly deployed in stateful operating\-system environments, yet the training data used to teach them still underrepresents four properties of real use: user intents are implicit and underspecified, actions have external side effects, users react to partial progress and failure, and successful completion is often verifiable only through environment state\. Despite rapid progress in large language models, agents still fail on more than half of realistic multi\-turn OS tasks\(Yao et al\.,[2024](https://arxiv.org/html/2606.11520#bib.bib25)\), and the bottleneck is not model capacity—it is training data\.

A closer look at current synthesis pipelines reveals three systematic structural gaps\.Gap 1 \(Intent\-first bias\):Most pipelines start from a list of available APIs or tools—e\.g\., the 16k\+ REST endpoints on RapidAPI or a curated SDK catalog\(Qin et al\.,[2023](https://arxiv.org/html/2606.11520#bib.bib13); Liu et al\.,[2024](https://arxiv.org/html/2606.11520#bib.bib10)\)—and*back\-derive*tasks from each tool \(“*get\_weather\(city\)*”→\\mst@varfam@dot\\mst@varfam@slash\\rightarrow“What’s the weather in Tokyo?”\)\. The resulting task distribution therefore mirrors the catalog rather than what users actually want; long\-tail and cross\-tool intents are systematically under\-represented\. The natural alternative—asking an LLM to free\-generate user tasks—fares no better: instruction\-tuned LLMs exhibit a well\-documented*mode collapse*toward the high\-frequency phrasings they have seen most often\(Wang et al\.,[2022](https://arxiv.org/html/2606.11520#bib.bib17)\)\(algorithmic puzzles, generic email templates, customer\-service openers\), producing tasks that look diverse on the surface but cluster in a narrow region of intent space\.Gap 2 \(Single\-turn bias\):Nearly all OS agent datasets are single\-turn\(Sun et al\.,[2024](https://arxiv.org/html/2606.11520#bib.bib15); Xu et al\.,[2024](https://arxiv.org/html/2606.11520#bib.bib22)\), failing to capture the multi\-turn task delegation, correction, and verification cycles central to real agent interactions\. Even pipelines with user simulators\(Prabhakar et al\.,[2025](https://arxiv.org/html/2606.11520#bib.bib12); Chen et al\.,[2026b](https://arxiv.org/html/2606.11520#bib.bib3)\)suffer from*role drift*—instruction\-tuned LLMs gradually adopt assistant\-style language—and*state hallucination*—simulators issue follow\-up requests based on assumed states that diverge from actual OS state\(Zhou et al\.,[2026](https://arxiv.org/html/2606.11520#bib.bib27)\)\.Gap 3 \(Simulated execution\):Tool execution is typically simulated rather than real\(Mitra et al\.,[2024](https://arxiv.org/html/2606.11520#bib.bib11); Chen et al\.,[2026a](https://arxiv.org/html/2606.11520#bib.bib2)\), training agents on a hallucinated execution distribution that diverges from actual OS behavior and producing almost no authentic failure\-recovery examples\.

These gaps compound: missing any one of them produces training data that is unrepresentative, limited, or disconnected from real execution semantics\.

![Refer to caption](https://arxiv.org/html/2606.11520v1/x1.png)Figure 1:ISETracein the concurrent agent\-data landscape\. Each circle is one corpus \(axis = avg\. dialogue turns per trajectory; y\-axis = \#trajectories on log scale\)\. Bubble area encodes tool calls per trajectory; hue encodes environment grounding \(real\-OS / simulated / web / synthetic\)\. The shaded band marks the long\-horizon×\\mst@varfam@dot\\mst@varfam@slash\\timesreal\-OS execution×\\mst@varfam@dot\\mst@varfam@slash\\times≥\\mst@varfam@dot\\mst@varfam@slash\\geq20K trajectories regime, whichISETracealone occupies among concurrent works\.We proposeISE\(Intent→\\mst@varfam@dot\\mst@varfam@slash\\rightarrowSimulate→\\mst@varfam@dot\\mst@varfam@slash\\rightarrowExecute\), a three\-stage synthesis paradigm that addresses all three gaps jointly\. Figure[1](https://arxiv.org/html/2606.11520#S1.F1)situatesISETraceagainst concurrent work\.Stage 1constructs∼\\mst@varfam@dot\\mst@varfam@slash\\sim50,000 structured intents by independently sampling four axes—Persona, Domain subset, Task sequence, Complexity—and then expanding the chosen tasks into their required tool set: on average each intent spans 2\.35 domains and 4\.40 ordered tasks, which together invoke 3\.18 distinct tools \(a derived statistic, not a fifth sampling axis\)\. After deduplication the pool contains43,956\\mst@varfam@dot\\mst@varfam@slash 43\{,\}956unique intents and attains a Vendi Score of61\.57\\mst@varfam@dot\\mst@varfam@slash 61\.57onmpnet\-base\-v2embeddings \(cosine,q=1\\mst@varfam@dot\\mst@varfam@slash\{\\mst@q\}\{=\}1\) computed over the*entire*pool\.Stage 2drives multi\-turn interaction through a role\-locked user simulator with four behavioral constraints that suppress role drift and state hallucination, producing 23,132 complete trajectories with 91\.1% containing 6–10 user turns \(avg\. 8\.12 user turns, 68\.24 total dialogue turns\)\.Stage 3grounds all tool calls in real OS execution in isolated live workspaces, ensuring trajectories reflect authentic OS behavior rather than simulated tool responses\.

#### Contributions\.

1. 1\.ISE paradigm andISETracedataset: a three\-stage recipe and the resulting 23,132\-trajectory corpus \(∼\\mst@varfam@dot\\mst@varfam@slash\\sim50,000 structured intents,43,956\\mst@varfam@dot\\mst@varfam@slash 43\{,\}956unique after deduplication; avg\. 8\.12 user turns and 68\.24 total dialogue turns per trajectory\)\.
2. 2\.Diversity and ablation evidence: full\-stack diversity quantification \(embedding, lexical, structural\) and an ablation isolating the contribution of multi\-turn simulation \(§[3](https://arxiv.org/html/2606.11520#S3), Table[5](https://arxiv.org/html/2606.11520#S5.T5)\)\. Code, data, and trained checkpoints are released\.

## 2Related Work

### 2\.1Agentic Data Synthesis

#### Tool\-first synthesis\.

Qin et al\. \([2023](https://arxiv.org/html/2606.11520#bib.bib13)\)andLiu et al\. \([2024](https://arxiv.org/html/2606.11520#bib.bib10)\)derive tasks from API catalogs, producing distributions that mirror tool space rather than user\-need space\.Mitra et al\. \([2024](https://arxiv.org/html/2606.11520#bib.bib11)\)extend this to agentic trajectories at scale, but operate without live execution\. ISE takes the opposite starting point: structured intent sampling, rather than the tool catalog, drives what trajectories to synthesize, so the training distribution is shaped by user\-need composition rather than API availability\.

#### Environment\-driven synthesis\.

Sun et al\. \([2024](https://arxiv.org/html/2606.11520#bib.bib15)\)retrospectively infer task descriptions after random GUI exploration, providing no principled coverage guarantee\.Xu et al\. \([2024](https://arxiv.org/html/2606.11520#bib.bib22)\)use web tutorials as seeds; diversity is bounded by the tutorial pool\. Both lack multi\-turn user simulation\. ISE’s 4D combinatorial sampling has no such ceiling and prospectively samples intents from user\-need space; we quantify the resulting intent\-level diversity in §[3](https://arxiv.org/html/2606.11520#S3)\.

#### Multi\-turn synthesis and verification\.

Chen et al\. \([2026a](https://arxiv.org/html/2606.11520#bib.bib2)\)is the closest competitor: it synthesizes multi\-turn tool\-use data with per\-instance LLM\-written checkers\. Our work differs in two respects: \(1\) ISE uses real OS execution rather than LLM\-written checkers—a physically deterministic verification signal; \(2\) ISE adds role\-locked multi\-turn user simulation that grounds every user turn in execution state\.Chen et al\. \([2026b](https://arxiv.org/html/2606.11520#bib.bib3)\)use constraints as generation guides in customer service without execution grounding\.Prabhakar et al\. \([2025](https://arxiv.org/html/2606.11520#bib.bib12)\)build a Blueprint\-to\-Trajectory pipeline with LLM Committee verification and strongτ\\mst@varfam@dot\\mst@varfam@slash\\tau\-bench results, but use simulated API environments\.

Zhu et al\. \([2026](https://arxiv.org/html/2606.11520#bib.bib28)\)synthesize verifiable Docker environments with deliberate error injection, an orthogonal approach to ours\.Lin et al\. \([2026](https://arxiv.org/html/2606.11520#bib.bib8)\)andYang et al\. \([2025](https://arxiv.org/html/2606.11520#bib.bib24)\)advance execution\-based evaluation without multi\-turn user simulation\.

#### Concurrent 2026 work\.

Several concurrent efforts target tool\-use or MCP environments\. Toucan\(Xu et al\.,[2025](https://arxiv.org/html/2606.11520#bib.bib23)\)synthesizes 1\.5M trajectories from∼\\mst@varfam@dot\\mst@varfam@slash\\sim500 MCP servers, of which 567,262 \(37%\) are multi\-turn\. EnvFactory\(Xu et al\.,[2026a](https://arxiv.org/html/2606.11520#bib.bib20)\)generates 2,575 trajectories from 85 verified environments with an average of 4\.82 turns and 3\.29 steps per turn\. COVERT\(Xu et al\.,[2026b](https://arxiv.org/html/2606.11520#bib.bib21)\)focuses on oracle\-preserving RL augmentations and reports BFCL v3 / ACEBench accuracy rather than corpus\-level statistics\. A parallel line of GUI\-centric work \(OpenMobile\(Cheng et al\.,[2026](https://arxiv.org/html/2606.11520#bib.bib4)\), ToolCUA\(Hu et al\.,[2026](https://arxiv.org/html/2606.11520#bib.bib7)\), CUA\-Gym\(Wang et al\.,[2026](https://arxiv.org/html/2606.11520#bib.bib16)\), Video2GUI\(Xiong et al\.,[2026](https://arxiv.org/html/2606.11520#bib.bib18)\)\) targets visual interaction rather than shell semantics\. Our work differs along two axes that the corpora above do not jointly cover: \(i\) all trajectories execute against a real shell, and \(ii\) we report embedding\-level diversity \(Vendi / Self\-BLEU / Distinct\-N\) alongside the corpus\. Table[1](https://arxiv.org/html/2606.11520#S2.T1)summarizes the comparison; whether longer per\-trajectory length translates into downstream gains is left to §[5](https://arxiv.org/html/2606.11520#S5)rather than asserted here\.

### 2\.2Agent Training Paradigms

SFT on synthetic trajectories\(Zeng et al\.,[2023](https://arxiv.org/html/2606.11520#bib.bib26); Shi et al\.,[2025](https://arxiv.org/html/2606.11520#bib.bib14)\)remains the dominant paradigm for OS agent training and is the regime we evaluate\. We deliberately separate*data composition*\(the contribution of this work\) from training\-algorithm choices: holding the base model and training objective fixed, the question is whether 4D structured intents, role\-locked multi\-turn simulation, and execution grounding move the needle\.

### 2\.3Multi\-Turn Evaluation

Yao et al\. \([2024](https://arxiv.org/html/2606.11520#bib.bib25)\)provide the standard multi\-turn benchmark with an LLM user simulator\.Zhou et al\. \([2026](https://arxiv.org/html/2606.11520#bib.bib27)\)show LLM simulators are systematically more cooperative and stylistically uniform than real users—directly motivating our role\-locking design\.Liu et al\. \([2023](https://arxiv.org/html/2606.11520#bib.bib9)\)provide broader OS\-level evaluation\.

#### Positioning\.

Table[1](https://arxiv.org/html/2606.11520#S2.T1)summarizes key dimensions against twelve contemporary baselines spanning 2023–2026\.

Table 1:Positioning of ISETrace \(ours\) against twelve contemporary agent\-trajectory corpora\.Turns: average total turns per trajectory;Tools/T: average tool calls per trajectory;Toks: average tokens per trajectory \(k=thousand\);MT: multi\-turn user simulation;Real: real OS execution \(vs\. simulated/GUI sandbox\)\. ✓ = yes;∼\\mst@varfam@dot\\mst@varfam@slash\\sim= partial; × = no; “–” = original paper does not report\. All numbers verified against source PDFs\.†Derived: EnvFactory reports 4\.82 turns and 3\.29 steps per turn; their product is reported here as an approximation, not a directly stated count\.

## 3ISETraceDataset Analysis

We characterize the dataset along three orthogonal axes—semantic \(embedding\), lexical \(n\-gram\), and structural \(tool\-call topology\)—to verify that 4D sampling combined with execution grounding produces qualitatively richer trajectories than tool\-first or single\-turn alternatives\.

#### Embedding diversity: Vendi Score\.

We compute the Vendi Score\(Friedman and Dieng,[2023](https://arxiv.org/html/2606.11520#bib.bib6)\)\(orderq=1\\mst@varfam@dot\\mst@varfam@slash\{\\mst@q\}\{=\}1, cosine kernel\) overall\-mpnet\-base\-v2embeddings111Hugging Face model id:sentence\-transformers/all\-mpnet\-base\-v2\.\. The intent pool contains43,956\\mst@varfam@dot\\mst@varfam@slash 43\{,\}956unique intents after deduplication; we evaluate Vendi both at the conventional=500\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}500subsample \(for direct comparability with prior work\) and over the*entire*pool\. Computation at full is made tractable by the identityspec\(\)⊤=spec\(\)⊤\\mst@varfam@dot\\mst@varfam@slash\\mathrm\{\{\\mst@s\}\{\\mst@p\}\{\\mst@e\}\{\\mst@c\}\}\(\{\\mst@X\}\{\}^\{\\top\}\)=\\mathrm\{\{\\mst@s\}\{\\mst@p\}\{\\mst@e\}\{\\mst@c\}\}\(\{\}^\{\\top\}\{\\mst@X\}\)on the non\-zero eigenvalues, which reduces the kernel eigendecomposition from an×\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\\times\{\\mst@N\}to a768×768\\mst@varfam@dot\\mst@varfam@slash 768\\times 768matrix\.ISETraceattains a Vendi Score of51\.27±1\.49\\mst@varfam@dot\\mst@varfam@slash 51\.27\\pm 1\.49at=500\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}500\(30 bootstraps\) and61\.57\\mst@varfam@dot\\mst@varfam@slash\\mathbf\{61\.57\}over the full pool\. Table[2](https://arxiv.org/html/2606.11520#S3.T2)reports the per\-configuration breakdown at=500\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}500, showing the score is robust across domain and persona slices and only drops noticeably under a single\-industry restriction \(Tech\-only,41\.61\\mst@varfam@dot\\mst@varfam@slash 41\.61\)\.

Table 2:Vendi Score breakdown\. Top row reports the full\-pool figure \(=43,956\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}=43\{,\}956\); subsequent rows are at=500\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}500for direct comparability with prior work\. The score is stable across multi\- vs\. single\-domain and cross\- vs\. single\-industry persona slices, but contracts when restricted to a single industry \(Tech\-only\)\.
#### Lexical diversity\.

On a length\-normalised distinct\-n protocol \(lowercased, whitespace\-tokenised, truncated to the first tokens,=5,000\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}5\{,\}000samples per corpus\),222ForISETracethe scored text is the structured*natural\_language\_intent*\(mean124\\mst@varfam@dot\\mst@varfam@slash 124words\); for the instruction baselines it is the native instruction / first user turn \(e\.g\. CodeAlpaca and Alpaca average∼15\\mst@varfam@dot\\mst@varfam@slash\{\\sim\}15words\)\. The first\-\-token truncation \(∈\{20,50\}\\mst@varfam@dot\\mst@varfam@slash\{\\mst@K\}\{\\in\}\\\{20,50\\\}\) normalises for this length gap; at=50\\mst@varfam@dot\\mst@varfam@slash\{\\mst@K\}\{=\}50the short\-instruction corpora \(CodeAlpaca, Alpaca\) have too few≥50\\mst@varfam@dot\\mst@varfam@slash\\geq\\\!50\-token examples and are omitted, leavingISETracealongside the longer\-form ShareGPT and WizardLM\.ISETrace’s lexical diversity is comparable to public instruction corpora such as ShareGPT \(Vicuna split\)\(Chiang et al\.,[2023](https://arxiv.org/html/2606.11520#bib.bib5)\)and WizardLM Evol\-Instruct\(Xu et al\.,[2023](https://arxiv.org/html/2606.11520#bib.bib19)\), confirming that its intents are non\-templated rather than rephrasings of a small seed set\. We therefore do not claim dominant lexical diversity; the decisive cross\-corpus gains appear on the embedding axis \(Vendi, above\) and on tool\-call structure \(below\)\. The length gap itself is substantive rather than incidental: eachISETraceintent encodes a multi\-step composite workload—on average4\.40\\mst@varfam@dot\\mst@varfam@slash 4\.40tasks spanning2\.35\\mst@varfam@dot\\mst@varfam@slash 2\.35domains, with95\.5%\\mst@varfam@dot\\mst@varfam@slash 95\.5\\%of intents carrying concrete numeric parameters \(thresholds, quantities, identifiers\)—whereas single\-sentence instruction corpora pose one atomic request per example\. Lexical length here is a symptom of task complexity, not padding\.

#### Coverage projection\.

Figure[2](https://arxiv.org/html/2606.11520#S3.F2)\(left\) shows a t\-SNE projection of 5,000ISETraceintents colored by primary domain; the embedding occupies a broad spread with all 10 domains overlapping rather than forming isolated clusters\. The right panel plots the Vendi Score against sample size∈\{200,500,1,000,2,000,5,000,10,000,20,000,43,956\}\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\\in\\\{200,500,1\{,\}000,2\{,\}000,5\{,\}000,10\{,\}000,20\{,\}000,43\{,\}956\\\}\(the last point being the full deduplicated pool\): the curve increases monotonically from40\.67\\mst@varfam@dot\\mst@varfam@slash 40\.67\(=200\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}200\) to61\.57\\mst@varfam@dot\\mst@varfam@slash 61\.57\(full pool\), with the marginal gain falling from\+10\.6\\mst@varfam@dot\\mst@varfam@slash\+10\.6between=200\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}200and=500\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}500to\+0\.27\\mst@varfam@dot\\mst@varfam@slash\+0\.27between=20,000\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}20\{,\}000and the full pool\. The pool is therefore close to but has not reached saturation, evidencing that the synthesis pipeline keeps producing genuinely new content rather than rephrasings of a fixed seed pool\.

![Refer to caption](https://arxiv.org/html/2606.11520v1/x2.png)Figure 2:ISETracecoverage analysis\.Left: t\-SNE projection of5,000\\mst@varfam@dot\\mst@varfam@slash 5\{,\}000sampled intents \(mpnet\-base\-v2 embeddings\), colored by primary domain—spread is broad across the embedding space with all 10 domains overlapping rather than clustered\.Right: Vendi scaling curve over∈\{200,500,1,000,2,000,5,000,10,000,20,000,43,956\}\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\\in\\\{200,500,1\{,\}000,2\{,\}000,5\{,\}000,10\{,\}000,20\{,\}000,43\{,\}956\\\}\(log \)\. The score grows monotonically from40\.67\\mst@varfam@dot\\mst@varfam@slash 40\.67\(=200\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}200\) to61\.57\\mst@varfam@dot\\mst@varfam@slash 61\.57at the full pool \(=43,956\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}43\{,\}956\); marginal gain decays from\+10\.6\\mst@varfam@dot\\mst@varfam@slash\+10\.6per first decade to\+0\.27\\mst@varfam@dot\\mst@varfam@slash\+0\.27in the last interval, indicating the pool is close to but has not reached saturation\.
#### Cross\-dataset comparison\.

To placeISETraceon the broader landscape of public agent SFT corpora, we compare against the twelve contemporary corpora of Table[1](https://arxiv.org/html/2606.11520#S2.T1)along the axis that most directly reflects interaction richness: average trajectory depth \(total turns per trajectory\)\. Figure[3](https://arxiv.org/html/2606.11520#S3.F3)visualizes this comparison\.ISETraceaverages68\.24\\mst@varfam@dot\\mst@varfam@slash 68\.24turns per trajectory—2\.68×\\mst@varfam@dot\\mst@varfam@slash 2\.68\\timesthe next\-deepest corpus \(TermiGen,25\.5\\mst@varfam@dot\\mst@varfam@slash 25\.5\) and an order of magnitude beyond the single\-step GUI and tool\-call datasets \(4\\mst@varfam@dot\\mst@varfam@slash 4–15\\mst@varfam@dot\\mst@varfam@slash 15turns\)\. This gap is the direct corpus\-level signature of the role\-locked user simulator \(§[4\.4](https://arxiv.org/html/2606.11520#S4.SS4)\): grounding each user turn in actual execution outcomes sustains long multi\-turn exchanges rather than terminating after a single request–response pair\. Turn counts are taken from Table[1](https://arxiv.org/html/2606.11520#S2.T1)\(each paper’s reported value, verified against source PDFs; ours measured\); five corpora whose papers do not report a turn count are marked “n\.r\.” rather than omitted\.

We complement this with an embedding\-diversity check under the identical Vendi protocol \(mpnet\-base\-v2, cosine,q=1\\mst@varfam@dot\\mst@varfam@slash\{\\mst@q\}\{=\}1\) applied to the first user\-role message ofISETrace\(ours, 23K\), APIGen\-MT\-5k\(Prabhakar et al\.,[2025](https://arxiv.org/html/2606.11520#bib.bib12)\), AgentTrek\(Xu et al\.,[2024](https://arxiv.org/html/2606.11520#bib.bib22)\), and Toucan\-1\.5M\(Agent\-Ark Team,[2025](https://arxiv.org/html/2606.11520#bib.bib1)\)\(4,000\-trajectory cap, bootstrapped at∈\{250,…,4,000\}\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\\in\\\{250,\\dots,4\{,\}000\\\}\)\.ISETracereaches Vendi97\\mst@varfam@dot\\mst@varfam@slash 97at=4,000\\mst@varfam@dot\\mst@varfam@slash\{\\mst@N\}\{=\}4\{,\}000—3\.5×\\mst@varfam@dot\\mst@varfam@slash 3\.5\\timesAPIGen\-MT\-5k \(28\\mst@varfam@dot\\mst@varfam@slash 28\) and on par with AgentTrek \(110\\mst@varfam@dot\\mst@varfam@slash 110\), whose higher score reflects the high lexical churn of ultra\-short single\-step web commands \(mean58\\mst@varfam@dot\\mst@varfam@slash 58characters\) rather than richer tasks; Toucan \(147\\mst@varfam@dot\\mst@varfam@slash 147\) leads, consistent with its65×\\mst@varfam@dot\\mst@varfam@slash 65\\timeslarger pool and broader MCP tool families\. We do not claim top embedding diversity: this surface measures*first\-user\-message*text, which the simulator rewrites from the underlying intent, and is not directly comparable to the61\.57\\mst@varfam@dot\\mst@varfam@slash 61\.57full\-pool intent figure in Table[2](https://arxiv.org/html/2606.11520#S3.T2)\. The decisive cross\-corpus gap is the trajectory\-depth axis above\.

![Refer to caption](https://arxiv.org/html/2606.11520v1/x3.png)Figure 3:Trajectory depth across fourteen agent corpora: average total turns per trajectory\.ISETrace\(ours, highlighted\) is the deepest by a wide margin at68\.24\\mst@varfam@dot\\mst@varfam@slash 68\.24turns,2\.68×\\mst@varfam@dot\\mst@varfam@slash 2\.68\\timesthe next\-highest corpus\. Values are from Table[1](https://arxiv.org/html/2606.11520#S2.T1)\(each paper’s reported figure, verified against source PDFs; ours measured\)—this is a published\-numbers comparison, not a single\-protocol re\-measurement\. Hatched “n\.r\.” bars denote corpora whose source paper does not report a turn count \(shown rather than dropped to avoid selection bias\)\.
#### Structural diversity: tool\-call topology\.

Trajectories average 29\.26 tool calls drawn from 4\.69 unique tools out of 16\. The top three trigrams of consecutive tool calls—exec–exec–exec\(126\.8K occurrences\);write–exec–exec\(33\.6K\);exec–write–exec\(29\.0K\); together withweb\_fetch–web\_fetch–web\_fetch\(22\.0K\)—reflect real engineering patterns \(iterative scripting, write\-and\-test, crawl chains\), not generic single\-step query/response\. Figure[4](https://arxiv.org/html/2606.11520#S3.F4)visualizes the tool co\-occurrence matrix\.

![Refer to caption](https://arxiv.org/html/2606.11520v1/figures/fig_tool_cooccurrence.png)Figure 4:Pairwise tool co\-occurrence within trajectories \(top 12 of 16 tools,log10\\mst@varfam@dot\\mst@varfam@slash\\log\_\{10\}scale; aggregated over all676,901\\mst@varfam@dot\\mst@varfam@slash 676\{,\}901tool calls in the23,132\\mst@varfam@dot\\mst@varfam@slash 23\{,\}132released trajectories, distinct from the701,447\\mst@varfam@dot\\mst@varfam@slash 701\{,\}447calls across the larger23,934\\mst@varfam@dot\\mst@varfam@slash 23\{,\}934\-session archive audited in §[4\.5](https://arxiv.org/html/2606.11520#S4.SS5)\)\. Theexec–write–readtriangle dominates \(exec×\\mst@varfam@dot\\mst@varfam@slash\\timeswrite=22\.1\\mst@varfam@dot\\mst@varfam@slash=22\.1K,write×\\mst@varfam@dot\\mst@varfam@slash\\timesread=19\.6\\mst@varfam@dot\\mst@varfam@slash=19\.6K,exec×\\mst@varfam@dot\\mst@varfam@slash\\timesread=18\.7\\mst@varfam@dot\\mst@varfam@slash=18\.7K\), reflecting the iterative “write\-script→\\mst@varfam@dot\\mst@varfam@slash\\torun→\\mst@varfam@dot\\mst@varfam@slash\\toinspect” workflow that the trigrams above make explicit\. The marginal share ofexec\(45\.6% of all calls\) is reported in the main text, not this matrix\.

## 4ISE: Synthesis Paradigm

### 4\.1Overview

ISE \(Intent→\\mst@varfam@dot\\mst@varfam@slash\\rightarrowSimulate→\\mst@varfam@dot\\mst@varfam@slash\\rightarrowExecute\) is a three\-stage synthesis paradigm that generates multi\-turn OS agent trajectories end\-to\-end\.Stage 1 \(4D Intent Construction\)produces persona\-grounded user intents;Stage 2 \(Multi\-Turn Simulation\)drives user–agent interaction via a role\-locked simulator whose every response is conditioned on actual execution state;Stage 3 \(Execution Grounding & Quality Filtering\)post\-processes trajectories with OS\-level signals to filter low\-quality examples while preserving failure\-diagnosis\-recovery behavior rather than discarding it as noise\.

We instantiate ISE on top of OpenClaw, a production agent platform providing a unified tool API, live OS execution, and reproducible workspace isolation\. The paradigm is agent\-system\-agnostic: any platform supporting live tool execution and workspace isolation can serve as the execution substrate\. Dataset statistics are summarized in Table[3](https://arxiv.org/html/2606.11520#S4.T3)\.

\(I\)Intent — 4D structured sampling\(P\)\|\|=965\\mst@varfam@dot\\mst@varfam@slash\|\\mathcal\{\{\\mst@P\}\}\|=965PersonaLLM\-synthesized user profile: 47 industries⋅\\mst@varfam@dot\\mst@varfam@slash\\cdot542 roles⋅\\mst@varfam@dot\\mst@varfam@slash\\cdot6 levels⋅\\mst@varfam@dot\\mst@varfam@slash\\cdot120 styles\(D\)\|\|=10\\mst@varfam@dot\\mst@varfam@slash\|\\mathcal\{\{\\mst@D\}\}\|=10, draw2\\mst@varfam@dot\\mst@varfam@slash 2–3\\mst@varfam@dot\\mst@varfam@slash 3Domainfunctional category of OS\-agent work \(e\.g\. Code\-Runtime, File\-IO\)\(T\)\|\|=131\\mst@varfam@dot\\mst@varfam@slash\|\\mathcal\{\{\\mst@T\}\}\|=131, draw3\\mst@varfam@dot\\mst@varfam@slash 3–6\\mst@varfam@dot\\mst@varfam@slash 6Tasksconcrete action drawn from the selected domains\(C\)\|\|=3\\mst@varfam@dot\\mst@varfam@slash\|\\mathcal\{\{\\mst@C\}\}\|=3Complexitytask difficulty tier \(simple / medium / complex\)composed intenti=\(p,,,c\)\\mst@varfam@dot\\mst@varfam@slash\{\\mst@i\}=\(\{\\mst@p\},\\,\{\\mst@D\},\\,\{\\mst@T\},\\,\{\\mst@c\}\)sampled as:=×2×∗×\\mst@varfam@dot\\mst@varfam@slash\\mathcal\{\{\\mst@I\}\}=\\mathcal\{\{\\mst@P\}\}\\times 2^\{\\mathcal\{\{\\mst@D\}\}\}\\times\\mathcal\{\{\\mst@T\}\}^\{\*\}\\times\\mathcal\{\{\\mst@C\}\}LLM realises tuple as utterance:“Compute Q1–Q4 gross margin, flag any quarter<\\mst@varfam@dot\\mst@varfam@slash<30%, write an IPO exec summary\.”Vendi Score61\.6ISE\-4D intent poolfull pool, cosine,q=1\\mst@varfam@dot\\mst@varfam@slash\{\\mst@q\}\{=\}1scales40\.7→61\.6\\mst@varfam@dot\\mst@varfam@slash 40\.7\\\!\\to\\\!61\.650k intents\(S\)Simulate — Role\-Locked User Simulatornaive simulator \(role\-drifted\)“Sure\! Here’s a draft you can use…”×\\mst@varfam@dot\\mst@varfam@slash\\timesuser simulator \(four principles, role\-locked\)“Now runpytest \-von the new module\.”✓\\mst@varfam@dot\\mst@varfam@slash\\checkmarkrole lock⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotregister⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotincremental⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotresponsive\(E\)Execute — grounded in real OS executionLLM judge / agent self\-report“Done\. Tests should pass now\.”×\\mst@varfam@dot\\mst@varfam@slash\\timesagent reply after live OS execution“1 test failed:ImportError; no file written”✓\\mst@varfam@dot\\mst@varfam@slash\\checkmarkreal OS execution replaces LLM\-as\-judge

Figure 5:The ISE synthesis paradigm at a glance\. Each of the three stages contrasts a typical failure mode of prior work \(top,×\\mst@varfam@dot\\mst@varfam@slash\\times\) with what ISE contributes \(bottom,✓\\mst@varfam@dot\\mst@varfam@slash\\checkmark\)\.*\(I\) Intent*: structured 4D sampling over×2×∗×\\mst@varfam@dot\\mst@varfam@slash\\mathcal\{\{\\mst@P\}\}\\times 2^\{\\mathcal\{\{\\mst@D\}\}\}\\times\\mathcal\{\{\\mst@T\}\}^\{\*\}\\times\\mathcal\{\{\\mst@C\}\}produces an intent pool with Vendi Score61\.57\\mst@varfam@dot\\mst@varfam@slash 61\.57\(mpnet\-base\-v2, entire post\-deduplication pool, cosine,q=1\\mst@varfam@dot\\mst@varfam@slash\{\\mst@q\}\{=\}1\)\.*\(S\) Simulate*: a role\-locked user simulator enforces four behavioral principles—*perspective lock*,*register matching*,*incremental advancement*,*responsive conditioning*—that together stop the role drift that turns instruction\-tuned LLMs back into assistants\.*\(E\) Execute*: the agent runs every turn against a real OS, and its post\-execution reply—carrying the observable outcome—is fed back to the simulator, replacing the LLM\-as\-judge / agent self\-report loop\. A walked\-through trajectory instance is shown in Figure[6](https://arxiv.org/html/2606.11520#S4.F6)\.Figure[5](https://arxiv.org/html/2606.11520#S4.F5)illustrates the pipeline\. The following subsections describe each stage\.

### 4\.2Problem Setting

We consider the problem of synthesizing supervised fine\-tuning data for an OS agent operating in a live workspace\. Each training instance is a multi\-turn interaction trajectory

τ=\{\(ut,at,et\)\}t=1,\\mst@varfam@dot\\mst@varfam@slash\\tau=\\\{\(\{\\mst@u\}\_\{\\mst@t\},\{\\mst@a\}\_\{\\mst@t\},\{\\mst@e\}\_\{\\mst@t\}\)\\\}\_\{\{\\mst@t\}=1\},\(1\)whereut\\mst@varfam@dot\\mst@varfam@slash\{\\mst@u\}\_\{\\mst@t\}is a user turn,at\\mst@varfam@dot\\mst@varfam@slash\{\\mst@a\}\_\{\\mst@t\}is an agent turn that may include tool calls, andet\\mst@varfam@dot\\mst@varfam@slash\{\\mst@e\}\_\{\\mst@t\}is the resulting environment feedback, including command outputs, file changes, execution errors, and other observable side effects\. Unlike stateless API synthesis, the workspace state evolves over time, so later user turns and later agent decisions are conditioned on a history of real external changes\.

Our goal is to synthesize a dataset=\{τi\}i=1\\mst@varfam@dot\\mst@varfam@slash\\mathcal\{\{\\mst@D\}\}=\\\{\\tau\_\{\\mst@i\}\\\}\_\{\{\\mst@i\}=1\}whose marginal over user intents covers a broad and controllable portion of realistic task space, while whose trajectories remain consistent with observable environment state\. This leads to four design requirements:

1. 1\.intent generation should cover user\-need space, not only tool space;
2. 2\.user turns should be generated from the user’s observable perspective, not from privileged agent state;
3. 3\.trajectory quality should be assessed with environment evidence whenever subgoals are externally verifiable; and
4. 4\.the synthesis process should preserve failure\-diagnosis\-recovery behavior rather than filtering it away as noise\.

ISE addresses these requirements with three stages described in the following subsections\.

### 4\.3Stage 1: 4D Intent Construction

#### Problem Formulation\.

Let denote the space of user intents for OS agent tasks\. We define as the space of structured intents over four dimensions:

=×2\[2,3\]×\[3,6\]∗×\\mst@varfam@dot\\mst@varfam@slash\\mathcal\{\{\\mst@I\}\}\\;=\\;\\mathcal\{\{\\mst@P\}\}\\times 2^\{\\mathcal\{\{\\mst@D\}\}\}\_\{\[2,3\]\}\\times\\mathcal\{\{\\mst@T\}\}^\{\*\}\_\{\[3,6\]\}\\times\\mathcal\{\{\\mst@C\}\}\(2\)where is the set of user personas, is the set of functional domains, is the set of concrete tasks within domains, and is the set of complexity levels\. Here2\[2,3\]:=\{⊆:2≤\|\|≤3\}\\mst@varfam@dot\\mst@varfam@slash 2^\{\\mathcal\{\{\\mst@D\}\}\}\_\{\[2,3\]\}:=\\\{\{\\mst@S\}\\subseteq\\mathcal\{\{\\mst@D\}\}:2\\leq\|\{\\mst@S\}\|\\leq 3\\\}denotes the family of domain subsets of size 2–3 \(a restricted power set, not the full2\\mst@varfam@dot\\mst@varfam@slash 2^\{\\mathcal\{\{\\mst@D\}\}\}\), and∗\[3,6\]:=\{⊆:3≤\|\|≤6\}\\mst@varfam@dot\\mst@varfam@slash\\mathcal\{\{\\mst@T\}\}^\{\*\}\_\{\[3,6\]\}:=\\\{\{\\mst@S\}\\subseteq\\mathcal\{\{\\mst@T\}\}:3\\leq\|\{\\mst@S\}\|\\leq 6\\\}denotes task subsets of size 3–6; we write2\\mst@varfam@dot\\mst@varfam@slash 2^\{\\mathcal\{\{\\mst@D\}\}\}and∗\\mst@varfam@dot\\mst@varfam@slash\\mathcal\{\{\\mst@T\}\}^\{\*\}as shorthand for these restricted families elsewhere\. An intenti∈\\mst@varfam@dot\\mst@varfam@slash\{\\mst@i\}\\in\\mathcal\{\{\\mst@I\}\}is thus a tuple\(p,,sub,subc\)\\mst@varfam@dot\\mst@varfam@slash\(\{\\mst@p\},\{\}\_\{\\text\{sub\}\},\{\}\_\{\\text\{sub\}\},\{\\mst@c\}\)wherep∈\\mst@varfam@dot\\mst@varfam@slash\{\\mst@p\}\\in\\mathcal\{\{\\mst@P\}\},⊆sub\\mst@varfam@dot\\mst@varfam@slash\{\}\_\{\\text\{sub\}\}\\subseteq\\mathcal\{\{\\mst@D\}\}is a sampled subset of 2–3 domains,⊆sub\\mst@varfam@dot\\mst@varfam@slash\{\}\_\{\\text\{sub\}\}\\subseteq\\mathcal\{\{\\mst@T\}\}is a corresponding set of 3–6 tasks drawn fromsub\\mst@varfam@dot\\mst@varfam@slash\{\}\_\{\\text\{sub\}\}, andc∈\\mst@varfam@dot\\mst@varfam@slash\{\\mst@c\}\\in\\mathcal\{\{\\mst@C\}\}\.

Given this formulation, the goal of forward synthesis is to sample a set of intents\{i1,…,i\}\\mst@varfam@dot\\mst@varfam@slash\\\{\{\\mst@i\}\_\{1\},\\ldots,\{\\mst@i\}\\\}from such that the marginal distributions over all four dimensions are broad and approximately uniform, then render each structured intent as a natural\-language user request via an LLM conditioned on the sampled tuple\.

#### Dimension Design\.

Persona \(\)\.Each persona is a structured object with fields including name, professional role, industry, expertise list, experience level, communication style, and free\-text*work\_context*/*common\_goals*/*tools\_preference*descriptions\. Rather than enumerating a fixed Cartesian product of attributes, we*synthesize*the persona pool with an LLM prompted to produce globally diverse, internally\-consistent profiles, then freeze the pool for the entire run\. We target1,000\\mst@varfam@dot\\mst@varfam@slash 1\{,\}000personas; after deduplication the realized pool contains965\\mst@varfam@dot\\mst@varfam@slash 965distinct \(name, role, industry\) identities spanning47\\mst@varfam@dot\\mst@varfam@slash 47industries and542\\mst@varfam@dot\\mst@varfam@slash 542professional roles, with6\\mst@varfam@dot\\mst@varfam@slash 6experience levels \(Junior / Mid\-level / Senior / Expert / Executive\)\. While the generation prompt suggests six canonical communication styles \(Analytical / Collaborative / Creative / Direct / Formal / Casual\), the LLM expands these into roughly120\\mst@varfam@dot\\mst@varfam@slash 120surface realizations \(e\.g\., “Methodical & Patient”, “Diplomatic and formal”\), and the free\-text context fields further distinguish near\-duplicate slots\. We freeze the pool—rather than resampling personas per intent—so that persona identity remains stable across the synthesis run and each persona accumulates enough trajectories for stratified analysis; at intent\-construction time a persona is drawn uniformly at random from the frozen pool\. The persona dimension controls the*linguistic register*of generated intents and preserves variation throughout role\-locking\.

Domain \(, 10 categories\) and Task \(, 131 tasks\)\.Domains partition the OS agent task space into ten functional categories \(e\.g\., Intelligence\-Core, Code\-Runtime, File\-IO, Source\-Chain, Automation\-Flow, Web\-Extraction\)\. A curated library of 131 concrete tasks spans these categories\. Each intent samples 2–3 domains and draws 3–6 tasks, yielding cross\-domain composite tasks that reflect realistic agentic workloads\. Averaged over the pool of structured intents, each intent spans 2\.35 domains, 4\.40 tasks, and 3\.18*associated*tools \(the tools its tasks require; max 9\)\. This intent\-level tool count is distinct from the tools an agent actually invokes while executing a trajectory \(4\.69 unique tools per trajectory on average; Table[3](https://arxiv.org/html/2606.11520#S4.T3)\), since one trajectory typically fulfills more than one intent\.

Complexity \(\)\.The distribution is:complex50% /medium40% /simple10%, ensuring the training distribution does not overrepresent short, low\-complexity tasks\.

#### Coverage analysis\.

Unconstrained LLM generation tends to converge on a narrow region of intent space\. Our structured sampling addresses this through*combinatorial forcing*: because each intent draws independently across ,2\[2,3\]\\mst@varfam@dot\\mst@varfam@slash 2^\{\\mathcal\{\{\\mst@D\}\}\}\_\{\[2,3\]\},∗\[3,6\]\\mst@varfam@dot\\mst@varfam@slash\\mathcal\{\{\\mst@T\}\}^\{\*\}\_\{\[3,6\]\}, and , the effective intent space grows super\-linearly with pool size\. With\|\|=965\\mst@varfam@dot\\mst@varfam@slash\|\\mathcal\{\{\\mst@P\}\}\|=965,\|\|=10\\mst@varfam@dot\\mst@varfam@slash\|\\mathcal\{\{\\mst@D\}\}\|=10,\|\|=131\\mst@varfam@dot\\mst@varfam@slash\|\\mathcal\{\{\\mst@T\}\}\|=131, and\|\|=3\\mst@varfam@dot\\mst@varfam@slash\|\\mathcal\{\{\\mst@C\}\}\|=3, the domain\-subset factor alone is\|2\[2,3\]\|=\(102\)\+\(103\)=165\\mst@varfam@dot\\mst@varfam@slash\|2^\{\\mathcal\{\{\\mst@D\}\}\}\_\{\[2,3\]\}\|=\\binom\{10\}\{2\}\+\\binom\{10\}\{3\}=165, and even under the conservative assumption that the 3–6 tasks are drawn only from the chosen 2–3 domains \(rather than all 131 tasks\), the number of distinct\(p,,sub,subc\)\\mst@varfam@dot\\mst@varfam@slash\(\{\\mst@p\},\{\}\_\{\\text\{sub\}\},\{\}\_\{\\text\{sub\}\},\{\\mst@c\}\)tuples exceeds1011\\mst@varfam@dot\\mst@varfam@slash 10^\{11\}—roughly seven orders of magnitude larger than the43,956\\mst@varfam@dot\\mst@varfam@slash 43\{,\}956unique intents actually realized\. We verify the diversity effect empirically in §[3](https://arxiv.org/html/2606.11520#S3)\.

### 4\.4Stage 2: Multi\-Turn Simulation

#### Motivation: Role Drift and State Hallucination\.

Two failure modes arise when an instruction\-tuned LLM plays the user role across multiple turns\.Role drift: the simulator gradually adopts assistant\-style language—asking open\-ended questions, offering to help, qualifying requests—which no realistic user would do\.State hallucination: the simulator issues follow\-ups based on an assumed execution state, while the real OS may have produced a different outcome \(e\.g\., the agent reports a file written, but thewritelanded under an unexpected working directory, so the path the simulator now assumes to exist never does\)\. These coupled failures must be addressed jointly for trajectories to constitute realistic training data\. Our simulator targets both via four behavioral principles:

#### Perspective lock\.

The simulator is instructed to remain in the position of an information*provider*rather than a*requester*\. This constraint counters the default tendency of instruction\-tuned LLMs to adopt assistant\-style behavior\.

#### Register matching\.

The simulator’s system prompt is conditioned on the persona’s*experience\_level*via templated instructions \(e\.g\.,“Use brief, direct technical language\. Assume the agent understands your domain\.”for Senior/Executive;“Provide full context; describe your goal in detail\.”for Junior\)\. Empirically, Junior personas show the highest lexical diversity \(Vendi Score 55\.3\) but the shortest intents, reflecting exploratory phrasing; Executive personas concentrate on domain\-specific jargon \(Vendi 31\.7\), producing dense but stylistically homogeneous prompts\. This1\.7×\\mst@varfam@dot\\mst@varfam@slash 1\.7\\timesVendi spread within a single axis demonstrates that the persona dimension produces real linguistic differentiation, not just label variation\.

#### Incremental advancement\.

The simulator advances the task one step at a time rather than restating the full intent at every turn\. Its system prompt is conditioned on the original structured intent and, after an initial overview turn, each subsequent turn is instructed to confirm the agent’s previous action and introduce the single most useful next request\. The simulator decides what to advance from the full dialogue history rather than from a fixed pre\-enumerated checklist, which keeps the turn granularity close to that of human–agent collaboration\.

#### Responsive conditioning\.

The simulator conditions each new query on the entire conversation so far, including the agent’s most recent reply after it has executed its tool calls against the live OS\. Because that reply reflects what actually happened on the machine \(a created file, a non\-zero exit, a raised exception\), the simulator’s follow\-ups track real execution state rather than an assumed one: if the prior step evidently succeeded it moves the task forward, and if it evidently failed it restates or repairs the requirement\. The full interaction loop is summarized as Stage 2 of Figure[5](https://arxiv.org/html/2606.11520#S4.F5)\.

#### Grounding in real execution\.

The simulator is never asked to imagine tool outputs\. Every agent turn is executed in a live OS environment \(§below\), and the agent’s resulting reply — which carries the observable outcome of that execution — is appended to the dialogue context before the simulator generates its next query\. Execution grounding therefore comes from running the agent against a real machine and feeding its post\-execution reply back into the loop, not from the simulator’s own account and not from an LLM acting as a success judge\. This is what keeps the user side tied to actual OS state rather than to a self\-reported claim of success\.

#### Per\-Turn Output Format and Live Execution\.

At each turn, the user simulator produces a structured tuple*\{completed, query, reason\}*\. The loop continues until*completed = true*or a safety cap of turns is reached; trajectory length is thus determined by task complexity rather than fixed truncation\. The agent executes tool calls in a live OS environment—file operations interact with a real filesystem,execcalls invoke actual shell processes with real stdout/stderr/exit codes—in an isolated workspace restored from a shared snapshot template, reducing storage from\(\)\\mst@varfam@dot\\mst@varfam@slash\{\\mst@O\}\(\{\\mst@N\}\)to\(1\)\\mst@varfam@dot\\mst@varfam@slash\{\\mst@O\}\(1\)per worker\. Figure[6](https://arxiv.org/html/2606.11520#S4.F6)walks through one such trajectory end\-to\-end\.

t1⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotuserplain English, no API surface “Preparing the end\-of\-sprint report for the Vocabulary Module\. First, get the environment ready: sync the workspace with the remote repo and upgrade all third\-party dependencies to their latest compatible versions\.”t2⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotagenttool\_call: exec→\\mst@varfam@dot\\mst@varfam@slash\\tolive shell $ git remote add origin \.\.\./vocab\-module\.git && git fetch origin fatal: could not read Username for ‘https://github\.com’: No such device or address\(Command exited with code 128\)t3⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotuserprovides credential, retries “Here’s a temporary PAT:ghp\_vo\.\.\.nABC\. Configure the credential helper, pullmain, and runpip install \- \-upgrade \-r requirements\.txt\.”t4⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotagenttool\_call: exec→\\mst@varfam@dot\\mst@varfam@slash\\togit\+\\mst@varfam@dot\\mst@varfam@slash\+GitHub API $ git fetch origin \# credential helper configured with the supplied token remote: Invalid username or token\. Password authentication is not supported\. $ curl \-H "Authorization: token \.\.\." api\.github\.com/user→\\mst@varfam@dot\\mst@varfam@slash\\to401 ‘‘Bad credentials’’t5⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotuser\(observation\-conditioned\)decision driven by the real 401 “Understood — the sandbox can’t authenticate to external Git\.Assume the remote sync and dependency updates completed; proceed to create the Python script that analyses our sprint metrics \(9/12 stories, 4 critical bugs, 150 signups, 14 min avg session\)\.”t6⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotagenttool\_call: write\+\\mst@varfam@dot\\mst@varfam@slash\+exec\+\\mst@varfam@dot\\mst@varfam@slash\+cron Writessprint\_analyzer\.py—10,694 B— then executes it against the real interpreter and installs the weekly job\. The script renders the stakeholder report and returns cleanly\.cron 0 9 \* \* 1 \(Asia/Shanghai\)⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotnext run Mon 09:00 CST⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotexit\_code = 0$ python3 sprint\_analyzer\.py Sprint 24 \| 9/12 stories 75% completion velocity\+88%\(16→\\mst@varfam@dot\\mst@varfam@slash\\to30\) wrote sprint\_report\.md \(54 ln\) EXIT\_CODE = 0A realISETracetrajectorypersona:Mei Lin — product manager, EdTech⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotdomains: \{Source\-Chain, Code\-Runtime, Automation\-Flow\}corpus id:intent\_04f8274f⋅\\mst@varfam@dot\\mst@varfam@slash\\cdot7 user turns⋅\\mst@varfam@dot\\mst@varfam@slash\\cdot30 steps⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotstatus: completedDialogue \+ live OS execution \(read top→\\mst@varfam@dot\\mst@varfam@slash\\tobottom\)✓\\mst@varfam@dot\\mst@varfam@slash\\checkmarkevery tool call executed on a live OS⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotreal exit codes & error strings⋅\\mst@varfam@dot\\mst@varfam@slash\\cdotno model self\-report, no LLM judgedrivesdrives

Figure 6:A realISETracetrajectory, reproduced verbatim from the released corpus \(intent\_04f8274f; persona: Mei Lin, a product manager at an EdTech startup\)\. Read top\-to\-bottom: each agent turn issues atool\_callthat is executed against a live OS, and the observable outcome—a real exit code, error string, or written file—is carried back into the dialogue rather than a model self\-report or an LLM judge \(*execution\-grounded*\)\. Here a credential failure surfaces authentically:git fetchreturnscould not read Username\(exit 128,t2\\mst@varfam@dot\\mst@varfam@slash\{\\mst@t\}\_\{2\}\), and the supplied token is rejected with a real GitHub API401 “Bad credentials”\(t4\\mst@varfam@dot\\mst@varfam@slash\{\\mst@t\}\_\{4\}\)\. The dashed arrow makes the observation\-conditioned dependency explicit: this real 401*drives*the user’st5\\mst@varfam@dot\\mst@varfam@slash\{\\mst@t\}\_\{5\}decision to abandon the remote sync and proceed locally, after which the agent writes and runssprint\_analyzer\.py\(10,694 B\) and installs a weekly cron job, all verified by real exit codes\.

### 4\.5Stage 3: Execution Grounding & Quality Control

#### Completion gating\.

The primary quality gate is the execution loop itself\. A trajectory is retained only if the user simulator reaches*completed = true*within the turn cap; runs that exhaust the cap or stall are discarded rather than truncated and kept\. Because every turn is executed against a real OS \(§[4\.4](https://arxiv.org/html/2606.11520#S4.SS4)\) and the simulator advances only when the agent’s post\-execution reply indicates the previous step actually landed, completion gating already filters out trajectories whose progress was never grounded in real state\. This is an LLM\-independent signal forenvironment\-verifiable subgoalsand avoids the self\-referential loop of an LLM\-as\-judge scoring its own dialogue; semantically complex goals \(e\.g\., document quality\) remain out of scope and require human evaluation\.

#### Post\-hoc audit of the finalized pool\.

To characterize the quality of the retained set we run three rule\-based, LLM\-independent checks over all23,934\\mst@varfam@dot\\mst@varfam@slash 23\{,\}934archived trajectories; none requires agent re\-execution since each is derived from the logged turns\.\(i\) Role drift: every user turn \(the simulator’s output\) is scanned against a curated lexicon of23\\mst@varfam@dot\\mst@varfam@slash 23assistant\-pattern trigger phrases \(e\.g\., “I can help you with”, “I’d be happy to”, “Sure, let me”\); a turn is flagged if≥\\mst@varfam@dot\\mst@varfam@slash\\geq1 phrase appears\. Across202,997\\mst@varfam@dot\\mst@varfam@slash 202\{,\}997user turns only0\.02%\\mst@varfam@dot\\mst@varfam@slash 0\.02\\%are flagged \(40\\mst@varfam@dot\\mst@varfam@slash 40turns in40\\mst@varfam@dot\\mst@varfam@slash 40trajectories\), evidence that the perspective\-lock constraint holds in the produced data rather than only in the prompt\.\(ii\) Stagnation: a trajectory is flagged if the agent issues the same tool with byte\-identical arguments in≥3\\mst@varfam@dot\\mst@varfam@slash\\geq 3consecutive turns; this affects0\.91%\\mst@varfam@dot\\mst@varfam@slash 0\.91\\%of trajectories\.\(iii\) Tool\-call integrity:99\.96%\\mst@varfam@dot\\mst@varfam@slash 99\.96\\%of the701,447\\mst@varfam@dot\\mst@varfam@slash 701\{,\}447logged tool calls carry well\-formed, non\-empty arguments\. In aggregate98\.9%\\mst@varfam@dot\\mst@varfam@slash 98\.9\\%of trajectories are free of both role drift and stagnation\. These rates quantify, rather than merely assert, the cleanliness of the pool; the small flagged remainder can be dropped by anyone reproducing the script\.

The result of the three\-stage ISE process isISETrace:23,132\\mst@varfam@dot\\mst@varfam@slash 23\{,\}132retained multi\-turn OS agent trajectories spanning 10 domains and965\\mst@varfam@dot\\mst@varfam@slash 965distinct personas\. Table[3](https://arxiv.org/html/2606.11520#S4.T3)summarizes key statistics\.

Table 3:ISETracedataset statistics\.

## 5Experiments

### 5\.1Setup

#### Base model\.

Qwen3\-8B, fine\-tuned on 16×\\mst@varfam@dot\\mst@varfam@slash\\timesH800 80GB\.

#### Baselines\.

\(1\)Base: Qwen3\-8B zero\-shot\. \(2\)Qwen3\-32B: a4×\\mst@varfam@dot\\mst@varfam@slash 4\\times\-larger open base, zero\-shot, as a scale reference\. \(3\)GPT\-4o: zero\-shot proprietary reference\.

#### Benchmark\.

We evaluate onClawEval, a multi\-turn OS\-agent execution benchmark whose tasks span three families by task\-id prefix:C\(user\-simulator consultation\),M\(multimodal webpage / media generation\), andT\(agent tool\-use over a real shell: file\-IO, code\-runtime, web\-fetch, automation\-flow\)\. All systems are evaluated under an identical configuration \(vLLM, temperature0\\mst@varfam@dot\\mst@varfam@slash 0, single trial, LLM judge\)\. Because theMfamily requires sandbox\-injected tools that were*not*enabled in this evaluation—making it a structural zero for every system—and theCfamily is floored by the multi\-turn user\-simulator configuration, neither family separates systems\. We therefore reportpass@1on the114\\mst@varfam@dot\\mst@varfam@slash 114T\-family tasks that received a scored trial under*every*run reported in this paper \(a single, fixed common denominator shared by all tables\), together with a per\-dimension breakdown \(completion, robustness\) on the same task set\. This common\-denominator protocol removes the floating\-n\\mst@varfam@dot\\mst@varfam@slash\{\\mst@n\}artifact whereby different systems are otherwise scored on different task subsets\.

### 5\.2Main Results

Table 4:Main results onClawEval\(pass@1, %\), computed on the common set of114\\mst@varfam@dot\\mst@varfam@slash 114T\-family \(agent tool\-use\) tasks scored under every run \(Sec\.[5\.1](https://arxiv.org/html/2606.11520#S5.SS1)\)\.Bold= best\. SFT onISETracelifts the Qwen3\-8B base from19\.3\\mst@varfam@dot\\mst@varfam@slash 19\.3to37\.7\\mst@varfam@dot\\mst@varfam@slash 37\.7\(\+18\.4\\mst@varfam@dot\\mst@varfam@slash\+18\.4absolute,1\.95×\\mst@varfam@dot\\mst@varfam@slash 1\.95\\timesrelative\), surpassing both the GPT\-4o reference and the4×\\mst@varfam@dot\\mst@varfam@slash 4\\times\-larger Qwen3\-32B base\.SFT onISETracelifts Qwen3\-8B’s ClawEval pass@1 from19\.3%\\mst@varfam@dot\\mst@varfam@slash 19\.3\\%\(22/114\\mst@varfam@dot\\mst@varfam@slash 22/114\) to37\.7%\\mst@varfam@dot\\mst@varfam@slash 37\.7\\%\(43/114\\mst@varfam@dot\\mst@varfam@slash 43/114\)—a\+18\.4\\mst@varfam@dot\\mst@varfam@slash\+18\.4\-point absolute,1\.95×\\mst@varfam@dot\\mst@varfam@slash 1\.95\\timesrelative gain \(Table[4](https://arxiv.org/html/2606.11520#S5.T4)\)\. The fine\-tuned 8B surpasses both the GPT\-4o zero\-shot reference \(25\.4%\\mst@varfam@dot\\mst@varfam@slash 25\.4\\%,\+12\.3\\mst@varfam@dot\\mst@varfam@slash\+12\.3points\) and the4×\\mst@varfam@dot\\mst@varfam@slash 4\\times\-larger Qwen3\-32B base \(30\.7%\\mst@varfam@dot\\mst@varfam@slash 30\.7\\%,\+7\.0\\mst@varfam@dot\\mst@varfam@slash\+7\.0points\), i\.e\. targeted multi\-turn data closes and reverses a wide parameter\-count gap\. Decomposing the composite score on the same task set, the gain comes primarily from task completion \(Comp:0\.367→0\.533\\mst@varfam@dot\\mst@varfam@slash 0\.367\\to 0\.533,\+45%\\mst@varfam@dot\\mst@varfam@slash\+45\\%relative\) while robustness on perturbed tool outputs holds high \(Robu:0\.925→0\.959\\mst@varfam@dot\\mst@varfam@slash 0\.925\\to 0\.959\)—so the improvement is a clean completion gain that does*not*trade away tool\-error recovery\.

#### Scaling\.

Applying the same recipe to the 32B base also helps but by a smaller margin \(Qwen3\-32B:30\.7→38\.6\\mst@varfam@dot\\mst@varfam@slash 30\.7\\to 38\.6,\+7\.9\\mst@varfam@dot\\mst@varfam@slash\+7\.9points,1\.26×\\mst@varfam@dot\\mst@varfam@slash 1\.26\\times\); the\+18\.4\\mst@varfam@dot\\mst@varfam@slash\+18\.4\-point gain at 8B is more than double the\+7\.9\\mst@varfam@dot\\mst@varfam@slash\+7\.9\-point gain at 32B, indicating the method delivers its largest benefit in the small\-model regime where headroom is greatest\. Behavior at intermediate scales is less stable and we leave a full scaling study to future work \(Sec\.[6](https://arxiv.org/html/2606.11520#S6)\)\.

### 5\.3ISE Paradigm Ablation

To isolate the contribution of multi\-turn simulation \(Stage 2\), we compare the full recipe against a single\-turn ablation that truncates every trajectory to its first user turn \(no multi\-turn simulation\), holding the base model, data scale, and training budget fixed\. Both are evaluated on the same114\\mst@varfam@dot\\mst@varfam@slash 114\-task common set\. We note this ablation checkpoint differs from the full model in turn structure but was trained as a separate run; we therefore read it as*indicative*of the value of multi\-turn data rather than a strictly controlled single\-variable ablation\.

Table 5:Stage 2 \(multi\-turn simulation\) ablation on ClawEval \(pass@1, %\), same114\\mst@varfam@dot\\mst@varfam@slash 114\-task common set\. Truncating trajectories to a single user turn removes9\.6\\mst@varfam@dot\\mst@varfam@slash 9\.6points of pass@1, with most of the drop in task completion \(Comp\) rather than robustness \(Robu\)\.Bold= best\.Removing multi\-turn simulation drops pass@1 from37\.7%\\mst@varfam@dot\\mst@varfam@slash 37\.7\\%to28\.1%\\mst@varfam@dot\\mst@varfam@slash 28\.1\\%\(−9\.6\\mst@varfam@dot\\mst@varfam@slash\-9\.6points; Table[5](https://arxiv.org/html/2606.11520#S5.T5)\), consistent with multi\-turn trajectories contributing a substantial share of the gain on agent tool\-use tasks, beyond what single\-turn data alone provides\.

### 5\.4Analysis

#### Case Study:−\\mst@varfam@dot\\mst@varfam@slash\-Stage2 failure mode\.

Single\-turn truncated models correctly complete the first sub\-task but fail when a later user turn implicitly references an earlier artifact \(“that script you just wrote”\)\. Multi\-turn training is required to learn cross\-turn referential grounding—the behavior the single\-turn ablation cannot acquire\.

## 6Limitations

ISETraceis a fixed\-size checkpoint \(23,132 trajectories, smaller than EigenData\(Chen et al\.,[2026a](https://arxiv.org/html/2606.11520#bib.bib2)\)and AgentInstruct\(Mitra et al\.,[2024](https://arxiv.org/html/2606.11520#bib.bib11)\)\); ISE is a continuously runnable pipeline, and scaling to 100k\+ trajectories is future work\. The implementation targets macOS/Linux OS terminals and does not cover Windows, GUI\-based interaction, or browser automation\. The evaluation probes OS execution over a real shell \(ClawEval,T\-family tasks\); generalization to GUI agents, embodied tasks, and other verticals requires additional validation\. Finally, role\-locking fidelity depends on the simulator backbone’s instruction\-following capability\. Our experiments fine\-tune at two scales \(8B and 32B\): the recipe helps at both but its benefit is largest in the small\-model regime and shrinks with scale \(\+18\.4\\mst@varfam@dot\\mst@varfam@slash\+18\.4vs\.\+7\.9\\mst@varfam@dot\\mst@varfam@slash\+7\.9points\), and we observe that behavior at intermediate scales is less stable; a systematic scaling study across base sizes is left to future work\.

## 7Conclusion

We introduced ISE \(Intent→\\mst@varfam@dot\\mst@varfam@slash\\rightarrowSimulate→\\mst@varfam@dot\\mst@varfam@slash\\rightarrowExecute\), a three\-stage OS agent data synthesis paradigm that addresses three systematic gaps—intent\-first bias, single\-turn bias, and simulated execution—through 4D structured intent sampling, role\-locked multi\-turn simulation, and live OS execution grounding\. The resulting corpus,ISETrace, exhibits broad embedding\-, lexical\-, and structural\-level diversity, and a Stage 2 ablation on ClawEval isolates the contribution of multi\-turn simulation\.

The central insight is that*how*data is synthesized matters as much as*what*is synthesized: execution\-grounded, role\-locked, intent\-first synthesis produces qualitatively different training signal than tool\-first or simulation\-only approaches\. Future work includes scaling to 100k\+ trajectories and extending to GUI and browser agents\.

## References

- Agent\-Ark Team \(2025\)Agent\-Ark Team\. 2025\.[Toucan\-1\.5M: A large\-scale multi\-tool agent sft dataset](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M)\.Hugging Face dataset\.Accessed 2026\-06\.
- Chen et al\. \(2026a\)Jiaao Chen, Jingyuan Qi, Mingye Gao, Wei\-Chen Wang, Hanrui Wang, and Di Jin\. 2026a\.[EigenData: A self\-evolving multi\-agent platform for function\-calling data synthesis, auditing, and repair](https://arxiv.org/abs/2603.05553)\.*arXiv preprint arXiv:2603\.05553*\.
- Chen et al\. \(2026b\)Jinpeng Chen, Cheng Gong, Hanbo Li, Ziru Liu, Zichen Tian, Xinyu Fu, Shi Wu, Chenyang Zhang, Wu Zhang, Suiyun Zhang, Dandan Tu, and Rui Liu\. 2026b\.[CoVe: Training interactive tool\-use agents via constraint\-guided verification](https://arxiv.org/abs/2603.01940)\.*arXiv preprint arXiv:2603\.01940*\.
- Cheng et al\. \(2026\)Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, and Dahua Lin\. 2026\.[OpenMobile: Building open mobile agents with task and trajectory synthesis](https://arxiv.org/abs/2604.15093)\.*arXiv preprint arXiv:2604\.15093*\.
- Chiang et al\. \(2023\)Wei\-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E\. Gonzalez, Ion Stoica, and Eric P\. Xing\. 2023\.Vicuna: An open\-source chatbot impressing GPT\-4 with 90%\* ChatGPT quality\.[https://lmsys\.org/blog/2023\-03\-30\-vicuna/](https://lmsys.org/blog/2023-03-30-vicuna/)\.
- Friedman and Dieng \(2023\)Dan Friedman and Adji Bousso Dieng\. 2023\.The vendi score: A diversity evaluation metric for machine learning\.In*Proceedings of AISTATS*\.
- Hu et al\. \(2026\)Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, and Jieping Ye\. 2026\.[ToolCUA: Towards optimal GUI\-Tool path orchestration for computer use agents](https://arxiv.org/abs/2605.12481)\.*arXiv preprint arXiv:2605\.12481*\.
- Lin et al\. \(2026\)Yusong Lin, Haiyang Wang, Shuzhe Wu, Lue Fan, Feiyang Pan, Sanyuan Zhao, and Dandan Tu\. 2026\.[CLI\-Gym: Scalable CLI task generation via agentic environment inversion](https://arxiv.org/abs/2602.10999)\.*arXiv preprint arXiv:2602\.10999*\.
- Liu et al\. \(2023\)Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, and 3 others\. 2023\.[AgentBench: Evaluating LLMs as agents](https://arxiv.org/abs/2308.03688)\.*arXiv preprint arXiv:2308\.03688*\.
- Liu et al\. \(2024\)Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, and Caiming Xiong\. 2024\.[APIGen: Automated pipeline for generating verifiable and diverse function\-calling datasets](https://arxiv.org/abs/2406.18518)\.*arXiv preprint arXiv:2406\.18518*\.
- Mitra et al\. \(2024\)Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah\. 2024\.[AgentInstruct: Toward generative teaching with agentic flows](https://arxiv.org/abs/2407.03502)\.*arXiv preprint arXiv:2407\.03502*\.
- Prabhakar et al\. \(2025\)Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, and Caiming Xiong\. 2025\.[APIGen\-MT: Agentic pipeline for multi\-turn data generation via simulated agent\-human interplay](https://arxiv.org/abs/2504.03601)\.*arXiv preprint arXiv:2504\.03601*\.
- Qin et al\. \(2023\)Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun\. 2023\.[ToolLLM: Facilitating large language models to master 16000\+ real\-world APIs](https://arxiv.org/abs/2307.16789)\.*arXiv preprint arXiv:2307\.16789*\.
- Shi et al\. \(2025\)Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Liu, Jian Yang, Ge Zhang, Jiaheng Liu, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, and Wangchunshu Zhou\. 2025\.[TaskCraft: Automated generation of agentic tasks](https://arxiv.org/abs/2506.10055)\.*arXiv preprint arXiv:2506\.10055*\.
- Sun et al\. \(2024\)Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, and Zhiyong Wu\. 2024\.[OS\-Genesis: Automating GUI agent trajectory construction via reverse task synthesis](https://arxiv.org/abs/2412.19723)\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL\)*\.
- Wang et al\. \(2026\)Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, and Tao Yu\. 2026\.[CUA\-Gym: Scaling verifiable training environments and tasks for computer\-use agents](https://arxiv.org/abs/2605.25624)\.*arXiv preprint arXiv:2605\.25624*\.
- Wang et al\. \(2022\)Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A\. Smith, Daniel Khashabi, and Hannaneh Hajishirzi\. 2022\.[Self\-Instruct: Aligning language models with self\-generated instructions](https://arxiv.org/abs/2212.10560)\.*arXiv preprint arXiv:2212\.10560*\.
- Xiong et al\. \(2026\)Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, and Hao Tian\. 2026\.[Video2GUI: Synthesizing large\-scale interaction trajectories for generalized GUI agent pretraining](https://arxiv.org/abs/2605.14747)\.*arXiv preprint arXiv:2605\.14747*\.
- Xu et al\. \(2023\)Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang\. 2023\.[WizardLM: Empowering large pre\-trained language models to follow complex instructions](https://arxiv.org/abs/2304.12244)\.*arXiv preprint arXiv:2304\.12244*\.
- Xu et al\. \(2026a\)Minrui Xu, Zilin Wang, Mengyi Deng, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, and Zhijiang Guo\. 2026a\.[EnvFactory: Scaling tool\-use agents via executable environments synthesis and robust RL](https://arxiv.org/abs/2605.18703)\.*arXiv preprint arXiv:2605\.18703*\.
- Xu et al\. \(2026b\)Siyuan Xu, Shiyang Li, Xin Liu, Tianyi Liu, Yixiao Li, Zhan Shi, Zixuan Zhang, Zilong Wang, Qingyu Yin, Jianshu Chen, Tuo Zhao, and Bing Yin\. 2026b\.[Controllable and verifiable tool\-use data synthesis for agentic reinforcement learning](https://arxiv.org/abs/2604.09813)\.*arXiv preprint arXiv:2604\.09813*\.
- Xu et al\. \(2024\)Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu\. 2024\.[AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials](https://arxiv.org/abs/2412.09605)\.In*The Thirteenth International Conference on Learning Representations \(ICLR\)*\.
- Xu et al\. \(2025\)Zhangchen Xu, Adriana Meza Soria, Shawn Tan, Anurag Roy, Ashish Sunil Agrawal, Radha Poovendran, and Rameswar Panda\. 2025\.[TOUCAN: Synthesizing 1\.5m tool\-agentic data from real\-world MCP environments](https://arxiv.org/abs/2510.01179)\.*arXiv preprint arXiv:2510\.01179*\.
- Yang et al\. \(2025\)Chen Yang, Ran Le, Yun Xing, Zhenwei An, Zongchao Chen, Wayne Xin Zhao, Yang Song, and Tao Zhang\. 2025\.[ToolMind technical Report: A large\-scale, reasoning\-enhanced tool\-use dataset](https://arxiv.org/abs/2511.15718)\.*arXiv preprint arXiv:2511\.15718*\.
- Yao et al\. \(2024\)Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan\. 2024\.[τ\\mst@varfam@dot\\mst@varfam@slash\\tau\-bench: A benchmark for tool\-agent\-user interaction in real\-world domains](https://arxiv.org/abs/2406.12045)\.*arXiv preprint arXiv:2406\.12045*\.
- Zeng et al\. \(2023\)Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang\. 2023\.[AgentTuning: Enabling generalized agent abilities for LLMs](https://arxiv.org/abs/2310.12823)\.*arXiv preprint arXiv:2310\.12823*\.
- Zhou et al\. \(2026\)Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, and Maarten Sap\. 2026\.[Mind the Sim2Real gap in user simulation for agentic tasks](https://arxiv.org/abs/2603.11245)\.*arXiv preprint arXiv:2603\.11245*\.
- Zhu et al\. \(2026\)Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, Emad Barsoum, William Yang Wang, and Wenbo Guo\. 2026\.[TermiGen: High\-fidelity environment and robust trajectory synthesis for terminal agents](https://arxiv.org/abs/2602.07274)\.*arXiv preprint arXiv:2602\.07274*\.
ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

Similar Articles

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Evaluating Multi-Agent Systems at Scale (48 minute read)

Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

Submit Feedback

Similar Articles

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents
TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Evaluating Multi-Agent Systems at Scale (48 minute read)
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents