WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

arXiv cs.CL 06/03/26, 04:00 AM Papers
multi-turn trajectory-synthesis read-intensive write-intensive tool-agents agent-training
Summary
This paper proposes WRIT, a pipeline for synthesizing multi-turn agent training trajectories that balance write-intensive and read-heavy complexity. The method generates diverse tasks and simulations, enabling small models to achieve strong performance with reduced inference cost.
arXiv:2606.02908v1 Announce Type: new Abstract: Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.
# Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents
Source: [https://arxiv.org/html/2606.02908](https://arxiv.org/html/2606.02908)
###### Abstract

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution.

We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

| Write Action | Reason | Tool Calls |
| --- | --- | --- |
| `book_reservation(user_id="emma_johnson_7098", origin="EWR", destination="IAH", flight_type="one_way", cabin="business", flights=[{date="2024-05-25", flight_number="HAT188"}], ...)` | Simple task: “I need to book a one-way business class flight from Newark to Houston on May 25. Please book the direct flight that departs at 8:00 AM and arrives at 11:30 AM.” | 1 x `get_user_details`, 1 x `search_direct_flight`, 1 x `book_reservation` |
| (same write action as above) | Read-heavy task: “I need to book a one-way business class flight from the New York area to Houston. I’m flexible between May 25 and May 26, and I can depart from either Newark or LaGuardia. Please book the fastest overall flight.” | 1 x `get_user_details`, 4 x `search_direct_flight`, 4 x `search_onestop_flight`, 1 x `book_reservation` |

Table 1: A simple task and a read-heavy task can share the same gold write action, while differing in the amount of read evidence required to determine its arguments. The simple task uses 2 read-tool calls before the write action, whereas the read-heavy task uses 9 read-tool calls before executing the same booking action. Read tools are shown in blue and write tools in orange.

## 1 Introduction

Language agents equipped with tools are becoming a practical interface for automating user-facing workflows, from booking flights to changing reservations and processing returns (Lu et al., [2024](https://arxiv.org/html/2606.02908#bib.bib22); Drouin et al., [2024](https://arxiv.org/html/2606.02908#bib.bib44); Wang et al., [2025](https://arxiv.org/html/2606.02908#bib.bib17); Fang et al., [2025](https://arxiv.org/html/2606.02908#bib.bib27); Barres et al., [2025](https://arxiv.org/html/2606.02908#bib.bib4); Qian et al., [2025](https://arxiv.org/html/2606.02908#bib.bib10); Cheng et al., [2025](https://arxiv.org/html/2606.02908#bib.bib13); Qin et al., [2025](https://arxiv.org/html/2606.02908#bib.bib12)). In these multi-turn settings, an agent must infer an incomplete or evolving user intent, ask clarifying questions, read external records, follow domain policy, and execute valid state-changing actions (Lu et al., [2024](https://arxiv.org/html/2606.02908#bib.bib22); Zhao et al., [2025](https://arxiv.org/html/2606.02908#bib.bib11); Rana et al., [2025](https://arxiv.org/html/2606.02908#bib.bib14); Burdisso et al., [2025](https://arxiv.org/html/2606.02908#bib.bib18); Zhang et al., [2024](https://arxiv.org/html/2606.02908#bib.bib35)). A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, and tool observations. High-quality trajectories are therefore the supervision that teaches an agent when to ask, when to read, which tool to call, what evidence to trust, and when it is safe to write (Zeng et al., [2025](https://arxiv.org/html/2606.02908#bib.bib16); Xu et al., [2025](https://arxiv.org/html/2606.02908#bib.bib15); Gao et al., [2026](https://arxiv.org/html/2606.02908#bib.bib20)).

Since collecting such trajectories from humans is expensive, synthetic trajectory generation has become a central route for training tool-using agents. Existing work follows several routes: executable simulation pipelines roll out interactions between user and agent models (Prabhakar et al., [2026](https://arxiv.org/html/2606.02908#bib.bib1); Chen et al., [2026](https://arxiv.org/html/2606.02908#bib.bib3); Wang et al., [2026](https://arxiv.org/html/2606.02908#bib.bib7)); LLM-driven pipelines synthesize trajectories or simulate environment feedback without a complete backend (Li et al., [2025](https://arxiv.org/html/2606.02908#bib.bib2)); and environment-scaling approaches construct many tool-use environments from which trajectories can be collected (Fang et al., [2025](https://arxiv.org/html/2606.02908#bib.bib27)). Together, these methods expand the quantity and diversity of training data and improve benchmark performance for multi-turn tool-use agents.

Most existing synthesis pipelines increase complexity by composing multiple user requests or state-changing actions into longer tasks. This trains agents for multi-step execution, sequential decision making, and long-horizon stability. Yet these pipelines mainly teach agents to do more, while overlooking the difficulty that arises before any action is taken. In realistic service scenarios, the hard part is often gathering and comparing enough read-tool evidence to determine what arguments an action should carry. Users rarely provide all necessary identifiers; instead, they express preferences and descriptions, leaving the agent to search broadly before committing to a state change. This motivates a new data-synthesis question:

Beyond teaching agents to act for longer, can we synthesize trajectories that teach them to read more carefully before they act?

Table [1](https://arxiv.org/html/2606.02908#S0.T1) makes this distinction concrete. Both tasks share the same gold write action, `book_reservation(...)`, so from a write-action perspective they are identical. The difference is what the agent must do before writing. In the simple task, the user specifies the target flight by departure and arrival time, so one local search is enough. In the read-heavy task, the user asks for the fastest overall flight across multiple dates and departure airports, so the agent must search every airport-date combination, compare all returned candidates, and recover the correct `flight_number`; the read-tool count rises from 2 to 9. An agent trained only on shallow lookups may fail on such requests because it never learned to plan broad search, integrate evidence, and defer commitment until the arguments are grounded. Read-heavy trajectories are therefore a structurally distinct form of training complexity.

Motivated by this observation, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline that synthesizes training trajectories covering both action execution and evidence-intensive decision making. First, WRIT generates service tasks with verifiable correct outcomes, spanning tasks with multiple sequential actions (i.e., write-intensive) and tasks where one action requires extensive reading and comparison (i.e., read-intensive). Second, WRIT varies how users express and reveal the same request, so training data reflects realistic conversational behaviors rather than only cooperative, fully specified interactions. Third, WRIT runs the agent and user through each task in an executable environment and retains successful interactions as complete training trajectories. Figure [1](https://arxiv.org/html/2606.02908#S2.F1) summarizes this pipeline.

We evaluate WRIT on $\tau^2$-bench using a controlled 2K-trajectory training budget against strong synthetic-data baselines.

- WRIT consistently outperforms prior trajectory synthesis methods across all three tested models (Qwen3-4B-Instruct-2507, Llama-3.1-8B-Instruct, Qwen2.5-14B-Instruct), with especially large gains on read-heavy task subsets.
- A 4B model trained with only 2K WRIT trajectories outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially narrows the gap to GPT-5.1 thinking, while using far fewer output tokens at inference time.
- Ablations confirm that both read-heavy task synthesis and user-behavior diversification contribute independently.

These results show that a small, carefully structured set of trajectories balancing write\-intensive and read\-intensive complexity can produce more capable and reliable agents than much larger but less structured datasets\. Synthetic data should teach agents not only to act more, but also to know more before they act\.

## 2 Problem Setup and Design Rationale

2.1 Problem Setup. We consider a user-facing operational domain, such as airline customer service, where an agent interacts with a user while operating over a database, a set of tools, and domain policy rules (Yao et al., [2024](https://arxiv.org/html/2606.02908#bib.bib6); Barres et al., [2025](https://arxiv.org/html/2606.02908#bib.bib4)). The tools include read tools, which observe the environment without changing it, such as `search_direct_flight(origin, destination, date)` for retrieving matching flight candidates. They also include write tools, which update the environment state, such as `book_reservation(user_id, origin, destination, flights, ...)` for creating a flight reservation. Domain policy rules constrain when write tools may be used, including rules such as “All reservations can be cancelled within 24 hours of booking.”

A task specifies what the user wants the agent to accomplish and what a correct outcome looks like. We formalize a task as a tuple consisting of a user request $u$, an initial database state $s_{\mathrm{init}}$, a gold write-action sequence $A_{\mathrm{gold}}$, and a gold final database state $s_{\mathrm{gold}}$. Here, $u$ is the natural-language goal, $s_{\mathrm{init}}$ gives the starting conditions, $A_{\mathrm{gold}}$ specifies the correct state-changing actions, and $s_{\mathrm{gold}}$ is obtained by executing $A_{\mathrm{gold}}$ from $s_{\mathrm{init}}$ in a sandboxed environment. For a booking task, for example, $s_{\mathrm{gold}}$ is the database state after the correct reservation has been created, and task success is evaluated by checking whether the executed outcome matches $s_{\mathrm{gold}}$.
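The task tuple and the state-based success check can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the `Task` class, the dictionary-based database, and the toy `book_reservation` (which takes far fewer arguments than the real tool) are not the paper's implementation.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Task:
    user_request: str    # u: natural-language goal
    init_state: dict     # s_init: starting database state
    gold_actions: list   # A_gold: list of (tool_name, kwargs) pairs
    gold_state: dict = field(default_factory=dict)  # s_gold, filled in below


def book_reservation(state, user_id, flight_number):
    """Toy write tool (hypothetical signature): append a reservation record."""
    state["reservations"].append(
        {"user_id": user_id, "flight_number": flight_number}
    )


WRITE_TOOLS = {"book_reservation": book_reservation}


def execute(state, actions):
    """Apply write actions to a sandboxed copy of `state` and return the copy."""
    sandbox = deepcopy(state)
    for tool_name, kwargs in actions:
        WRITE_TOOLS[tool_name](sandbox, **kwargs)
    return sandbox


def is_success(task, executed_actions):
    """A task succeeds when the executed outcome matches s_gold."""
    return execute(task.init_state, executed_actions) == task.gold_state


task = Task(
    user_request="Book the fastest business flight from the New York area to Houston.",
    init_state={"reservations": []},
    gold_actions=[("book_reservation",
                   {"user_id": "emma_johnson_7098", "flight_number": "HAT188"})],
)
# s_gold is obtained by executing A_gold from s_init in the sandbox.
task.gold_state = execute(task.init_state, task.gold_actions)
```

Comparing final states rather than action strings means any action sequence that reaches $s_{\mathrm{gold}}$ counts as correct in this sketch.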

While the task defines what the agent must do, a training trajectory defines how the agent does it in a real conversation. A trajectory $\tau$ is the complete multi-turn interaction record generated by simulating the task, interleaving user messages, agent responses, tool calls, and tool observations across conversation turns. As supervised fine-tuning data, a trajectory teaches the agent when to ask for more information, which tool to call and with what arguments, how to interpret tool outputs, and when to execute a write action. Our pipeline first synthesizes tasks, then uses each task to simulate a trajectory, which lets us control task difficulty independently from how the trajectory unfolds.

2.2 Two-Axis Trajectory Complexities. To synthesize useful training trajectories, we need to understand what makes a write decision difficult for the agent. The challenge is not only choosing the right write tool, but resolving the correct argument values from the user request, the conversation context, and tool observations; we call this process *argument grounding*. For example, to book the right flight, the agent must determine the specific `flight_number` by reading flight search results, rather than being told it directly. Each write action is therefore a *decision point*: before committing an action to the environment, the agent must fully ground both the tool choice and its argument values.

This framing yields two independent ways to make agent training harder and more comprehensive. The first axis is the number of write decisions in a task: increasing it produces write-heavy trajectories that train the agent on long-horizon sequential decision making. The second axis is the evidence burden of a single decision: increasing this axis produces read-heavy trajectories, where one write action requires the agent to collect and compare multiple read-tool outputs before grounding its arguments. This second axis is important and comparatively underexplored: without read-heavy trajectories, an agent trained only on simple decisions may learn to act after a single lookup and fail when a real user’s request requires searching across multiple options, dates, or alternatives before any valid write can be taken.
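One way to make the two axes concrete is to measure them on a tool-call trace. The sketch below is an assumed operationalization, not the paper's definition: it counts write calls for the first axis and, for the second, takes the largest number of read calls preceding any single write. The tool names come from Table 1; the function name `complexity` is hypothetical.

```python
READ_TOOLS = {"get_user_details", "search_direct_flight", "search_onestop_flight"}
WRITE_TOOLS = {"book_reservation"}


def complexity(tool_calls):
    """Return (number of write decisions, max reads preceding a single write)."""
    writes = 0
    reads_since_write = 0
    max_burden = 0
    for name in tool_calls:
        if name in WRITE_TOOLS:
            writes += 1
            max_burden = max(max_burden, reads_since_write)
            reads_since_write = 0  # evidence counter restarts after each write
        elif name in READ_TOOLS:
            reads_since_write += 1
    return writes, max_burden


# Tool-call traces of the two tasks in Table 1.
simple = ["get_user_details", "search_direct_flight", "book_reservation"]
read_heavy = (
    ["get_user_details"]
    + ["search_direct_flight"] * 4
    + ["search_onestop_flight"] * 4
    + ["book_reservation"]
)

print(complexity(simple))      # (1, 2)
print(complexity(read_heavy))  # (1, 9)
```

Both traces have one write decision, so they differ only along the evidence-burden axis, matching the 2 versus 9 read-tool calls reported in Table 1.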

Our synthesis objective is therefore to generate training trajectories along both axes\. Together, write\- and read\-heavy trajectories teach the agent both long\-horizon execution stability and evidence\-intensive grounding under high information load\.

![Refer to caption](https://arxiv.org/html/2606.02908v1/x3.png)

Figure 1: Overview of the WRIT pipeline.

## 3 WRIT for Multi-turn Agent Training

Guided by this goal, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for generating multi-turn agent training data in three stages. First, WRIT synthesizes write-read intensive tasks with known correct outcomes, covering both write-intensive service requests and read-heavy requests that require substantial evidence gathering. Second, WRIT designs user behavior instructions that diversify how the user expresses and reveals the same underlying task across trajectories, so that training data reflects realistic conversational variation. Third, WRIT runs the agent and user simulator through each task and behavior instruction in an executable environment, collecting successful interactions as complete supervised fine-tuning trajectories. In this workflow, the first two stages prepare the inputs, namely the task and behavior instruction, and the final stage turns them into training trajectories.

### 3.1 Write-Read Intensive Task Synthesis

WRIT first synthesizes tasks, each consisting of a user request $u$, an initial database state $s_{\mathrm{init}}$, a gold write-action sequence $A_{\mathrm{gold}}$, and a gold final state $s_{\mathrm{gold}}$. This subsection focuses entirely on task synthesis; the simulation that turns tasks into trajectories is introduced later in Section [3.3](https://arxiv.org/html/2606.02908#S3.SS3). We control task complexity through the following two branches.

3.1.1 Write-intensive task synthesis. This branch synthesizes trajectories that cover the core write operations of the domain. Each trajectory trains the agent to identify common user intents, follow domain policy, and execute write actions with correctly grounded arguments. We describe the process in four steps.

Step 1: Write prototype discovery. The synthesis starts by identifying the popular write operations and user-facing scenarios the agent should learn to handle. We use an LLM to analyze the tool definitions and domain policy rules, and automatically derive a set of operation prototypes: each prototype captures a meaningful usage pattern for a write action and is paired with a natural-language template $m$ that describes the corresponding user intent with slots for grounded argument values. For example, one prototype for `update_reservation_flights(reservation_id, cabin, flights, payment_id)` captures the pattern where the user wants to change the itinerary and payment method, producing a template such as “You want to change the itinerary for \[reservation\] to \[flight\] and use \[payment\] for any fare difference.” These templates keep the generated user requests stable and semantically aligned with the target write action.

Step 2: Valid argument instantiation. This step populates each prototype with concrete, valid argument values drawn from the current database state. For each prototype, we sample a feasible combination of database records that satisfies the prototype’s constraints, such as selecting a user, one of the user’s reservations, and a target cabin class that differs from the current one. This produces a fully instantiated gold write action $A_{\mathrm{gold}}$.

Step 3: User-request construction. Based on the sampled write tool and arguments, we construct a natural user request that expresses the intent behind the gold write action without directly exposing backend identifiers. Rather than inserting raw argument values, such as a flight number, into the request (Prabhakar et al., [2026](https://arxiv.org/html/2606.02908#bib.bib1)), we describe each argument through a natural preference, such as “the cheapest flight” instead of a literal flight ID (Chen et al., [2026](https://arxiv.org/html/2606.02908#bib.bib3)), so the agent must read the environment to resolve it. The verified descriptions are accepted and filled into the natural-language template $m$ to form the final user request $u$.

Step 4: Multi-write task generation. We finally extend single-decision trajectories into multi-step trajectories that require the agent to complete several sequential write actions. For example, we combine two write-intensive trajectories by concatenating their user requests and gold write sequences into a single compound trajectory, i.e., $u_{\mathrm{multi}} = u_{1} \oplus u_{2}$ and $A_{\mathrm{gold}}^{\mathrm{multi}} = A_{\mathrm{gold}}^{(1)} \oplus A_{\mathrm{gold}}^{(2)}$, with programmatic checks to prevent unintended execution conflicts. The resulting multi-write trajectories challenge the agent to sustain correct decision making across multiple decision points without losing track of the user’s overall goal.
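The concatenation in Step 4 can be sketched in a few lines of Python. The paper does not specify its programmatic conflict checks, so the rule used here (reject a pair when both tasks write to the same `reservation_id`) is a hypothetical stand-in, as are the `Task` class, the `cancel_reservation` tool name, and the reservation IDs.

```python
from dataclasses import dataclass


@dataclass
class Task:
    user_request: str   # u
    gold_actions: list  # A_gold: list of (tool_name, kwargs) pairs


def touched_reservations(task):
    """Reservation IDs written by the task's gold actions."""
    return {
        kwargs["reservation_id"]
        for _, kwargs in task.gold_actions
        if "reservation_id" in kwargs
    }


def compose(t1, t2):
    """Return u_1 (+) u_2 with A_gold^(1) (+) A_gold^(2), or None on conflict."""
    # Hypothetical conflict rule: two tasks must not modify the same reservation.
    if touched_reservations(t1) & touched_reservations(t2):
        return None
    return Task(
        user_request=t1.user_request + " " + t2.user_request,
        gold_actions=t1.gold_actions + t2.gold_actions,
    )


cancel = Task(
    "You want to cancel your trip to Boston.",
    [("cancel_reservation", {"reservation_id": "R1"})],
)
upgrade = Task(
    "You also want to move your Houston trip to business class.",
    [("update_reservation_flights", {"reservation_id": "R2", "cabin": "business"})],
)
clash = Task(
    "You also want to change the Boston trip to economy.",
    [("update_reservation_flights", {"reservation_id": "R1", "cabin": "economy"})],
)

combined = compose(cancel, upgrade)  # two gold write actions, in order
rejected = compose(cancel, clash)    # None: both tasks touch reservation R1
```

Keeping the gold actions in order matters because the gold final state is produced by executing them sequentially.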

3.1.2 Read-heavy task synthesis. This branch synthesizes tasks in which a single write decision requires the agent to gather and compare evidence from multiple read-tool calls before the correct argument can be determined. Unlike write-intensive tasks, where arguments can be resolved through a small number of direct lookups, read-heavy tasks force the agent to search broadly, compare candidates across multiple tool outputs, and select the correct argument based on the user’s preference. The construction proceeds in three steps.

Step 1: Read-call set construction. The synthesis process first determines the full set of read-tool calls the agent should make to resolve the target write argument, and collects their outputs as an evidence pool. Starting from an instantiated gold write action, we identify one argument as the read-heavy target, i.e., the value the agent must discover through tool use. We identify the single read-tool call whose output contains this argument, which we call the gold read call, e.g., `search_direct_flight(origin="EWR", destination="IAH", date="2024-05-25")` in Table [1](https://arxiv.org/html/2606.02908#S0.T1), then generate perturbed variants of it by varying parameters such as date or departure airport. The outputs of all these calls form the *grounding context*: the evidence pool the agent must compare to find the correct value.
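As a rough illustration of how a gold read call might be perturbed into a read-call set, the sketch below takes the Cartesian product of alternative parameter values. The helper `perturb_read_call` and the dictionary encoding of a call are assumptions; the airports and dates mirror the read-heavy task in Table 1 (LGA is the IATA code for LaGuardia).

```python
from itertools import product


def perturb_read_call(gold_call, variations):
    """Return read calls covering every combination of the varied parameters.

    `variations` maps a parameter name to the values to try; include the gold
    value in each list so the gold read call is part of the resulting set.
    """
    keys = list(variations)
    calls = []
    for values in product(*(variations[key] for key in keys)):
        call = dict(gold_call)          # copy, so the gold call is untouched
        call.update(zip(keys, values))  # overwrite the varied parameters
        calls.append(call)
    return calls


gold_call = {
    "tool": "search_direct_flight",
    "origin": "EWR",
    "destination": "IAH",
    "date": "2024-05-25",
}
read_calls = perturb_read_call(
    gold_call,
    {"origin": ["EWR", "LGA"], "date": ["2024-05-25", "2024-05-26"]},
)
print(len(read_calls))  # 4
```

Two airports and two dates give four `search_direct_flight` calls, which is consistent with the 4 x `search_direct_flight` entry for the read-heavy task in Table 1. Executing each call and storing its output would then yield the grounding context.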

Step 2: Read-inducing request generation. The pipeline then generates a natural user request that requires the agent to consult the full evidence pool rather than stopping at a single lookup. An LLM generates the user request given the read-call set and grounding context, under two requirements: the stated user preference must lead the agent to consult all specified read-tool outputs, and it must uniquely identify the correct gold argument from the returned evidence. For example, “fastest overall flight” requires comparing candidates across all the searched airport-date combinations, as shown in Table [1](https://arxiv.org/html/2606.02908#S0.T1).

Step 3: Read-heavy request verification. Finally, we verify that the generated request actually induces the intended evidence-gathering behavior and remains solvable. An LLM verifier checks the following three properties:

- Read-call coverage: the request should imply all lookup operations in the read-call set.
- Preference-grounded recovery: the verifier should recover the gold argument from the grounding context based on the stated user preference.
- Write-action consistency: the request should clearly indicate the intended write action while leaving the read-heavy target argument to be resolved from evidence.

Requests that fail any check are discarded. The read-heavy task synthesis branch has now produced a verified user request $u$ and gold write-action sequence $A_{\mathrm{gold}}$.
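The preference-grounded recovery check can be pictured with a small programmatic stand-in. In the paper this check is performed by an LLM verifier; the sketch below instead hard-codes one preference ("fastest overall flight") and tests whether it uniquely selects the gold `flight_number` from the grounding context. All flight records other than the gold number HAT188, including the durations, are invented for illustration.

```python
def recover_fastest(grounding_context):
    """Return the unique fastest flight number, or None if empty or tied."""
    candidates = [
        flight
        for flights in grounding_context.values()
        for flight in flights
    ]
    if not candidates:
        return None
    best = min(flight["duration_min"] for flight in candidates)
    winners = {
        flight["flight_number"]
        for flight in candidates
        if flight["duration_min"] == best
    }
    # The preference must identify exactly one argument value.
    return winners.pop() if len(winners) == 1 else None


def passes_recovery_check(grounding_context, gold_flight_number):
    """True when the stated preference recovers exactly the gold argument."""
    return recover_fastest(grounding_context) == gold_flight_number


# Grounding context keyed by (origin, date); values are made-up search results.
grounding_context = {
    ("EWR", "2024-05-25"): [{"flight_number": "HAT188", "duration_min": 210}],
    ("EWR", "2024-05-26"): [{"flight_number": "HAT901", "duration_min": 240}],
    ("LGA", "2024-05-25"): [{"flight_number": "HAT902", "duration_min": 225}],
    ("LGA", "2024-05-26"): [],
}

print(passes_recovery_check(grounding_context, "HAT188"))  # True
```

A tie for the fastest flight makes the request ambiguous, so in this sketch such a request would fail the check and be discarded.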

3.1.3 Gold-state construction. After either synthesis branch produces a user request $u$ and a gold write-action sequence $A_{\mathrm{gold}}$, both branches enter the same gold-state construction step. We execute $A_{\mathrm{gold}}$ in a sandboxed environment initialized with $s_{\mathrm{init}}$ to obtain the gold final state $s_{\mathrm{gold}}$. The resulting state provides the executable supervision signal used later to verify whether a simulated trajectory actually completes the intended task.

### 3.2 User Behavior Diversification

Diversifying user behavior across trajectories is essential for training agents that remain robust when real users express the same request in different ways (Ferreira et al., [2024](https://arxiv.org/html/2606.02908#bib.bib48)). The same task can unfold into many different conversations depending on how the user behaves: a user may reveal information gradually, correct a mistake mid-conversation, or add irrelevant small talk (Hu et al., [2025](https://arxiv.org/html/2606.02908#bib.bib51)). If the training trajectories always assume a cooperative and information-complete user, the agent may be fragile at test time. User behavioral variation changes the conversational path but not the underlying goal or correct write action (Hu et al., [2025](https://arxiv.org/html/2606.02908#bib.bib51)). This means we can explicitly diversify user behavior without changing the task’s supervision signal: $A_{\mathrm{gold}}$ and $s_{\mathrm{gold}}$ stay fixed.

WRIT maintains a library of reusable behavior instruction primitives, each describing a specific user behavior pattern. General task-completion primitives cover behaviors that arise in ordinary service conversations, including progressive disclosure, where the user reveals information gradually; self-correction, where the user fixes a stated value after a challenge; confirmation hesitation, where the user verifies the agent’s summary before agreeing; mild emotion; and irrelevant asides (Algherairy and Ahmed, [2025](https://arxiv.org/html/2606.02908#bib.bib50)). Policy-robustness primitives cover behaviors that specifically pressure the agent’s policy boundary, including false-premise assertions, assume-style pressure, prior-agent approval claims, complaint pressure, and social flattery (Hu et al., [2025](https://arxiv.org/html/2606.02908#bib.bib51)). These are needed because policy-sensitive tasks require the agent to refuse or redirect gracefully under adversarial user strategies that ordinary task-completion primitives do not cover.

For each synthesized task, we select a small number of compatible primitives from the library and prompt an LLM to instantiate them as concrete user-simulator instructions tailored to that task; for example, the instruction may ask the user simulator to initially give the wrong date and correct it only after the agent challenges it. These instructions govern only interaction style, namely how and when the user reveals information, not task content, namely what the user wants or which write action should be executed. The instructions are passed to the user simulator alongside the task request $u$ in Section [3.3](https://arxiv.org/html/2606.02908#S3.SS3). Appendix [J](https://arxiv.org/html/2606.02908#A10) lists the script primitives used in our implementation and provides concrete examples of instantiated scripts.
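A minimal sketch of primitive selection follows, assuming a simple compatibility rule that this section does not spell out: policy-robustness primitives are only offered for policy-sensitive tasks. The primitive names are taken from the lists above; the function `select_primitives`, its parameters, and the snake_case identifiers are hypothetical, and the later LLM instantiation step is omitted.

```python
import random

GENERAL_PRIMITIVES = [
    "progressive_disclosure",
    "self_correction",
    "confirmation_hesitation",
    "mild_emotion",
    "irrelevant_aside",
]
POLICY_PRIMITIVES = [
    "false_premise_assertion",
    "assume_style_pressure",
    "prior_agent_approval_claim",
    "complaint_pressure",
    "social_flattery",
]


def select_primitives(policy_sensitive, k=2, seed=None):
    """Sample k distinct primitives that are compatible with the task type."""
    rng = random.Random(seed)  # local RNG keeps sampling reproducible
    pool = list(GENERAL_PRIMITIVES)
    if policy_sensitive:
        # Assumed rule: boundary-pressure behaviors only for policy-sensitive tasks.
        pool += POLICY_PRIMITIVES
    return rng.sample(pool, k)


chosen = select_primitives(policy_sensitive=False, k=2, seed=0)
```

Because the primitives only shape interaction style, the same task can be paired with several different selections while $A_{\mathrm{gold}}$ and $s_{\mathrm{gold}}$ stay unchanged.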

### 3.3 Trajectory Simulation and Filtering

This is where the synthesized task and user behavior instructions come together to produce complete training trajectories (Prabhakar et al., [2026](https://arxiv.org/html/2606.02908#bib.bib1); Fang et al., [2025](https://arxiv.org/html/2606.02908#bib.bib27)). Section [3.1](https://arxiv.org/html/2606.02908#S3.SS1) provides the task $(u, s_{\mathrm{init}}, A_{\mathrm{gold}}, s_{\mathrm{gold}})$, and Section [3.2](https://arxiv.org/html/2606.02908#S3.SS2) provides the user behavior instructions. With these inputs, we initialize the executable environment at $s_{\mathrm{init}}$ and run two models simultaneously: a user simulator guided by the task request $u$ and the behavior instructions, and an agent model given the domain policy and tool definitions. They interact turn by turn: the user expresses requests according to the behavior instructions, the agent responds and issues tool calls, and the environment executes those calls until the task is completed or refused. The output is a complete trajectory $\tau$ interleaving user messages, agent responses, tool calls, and tool observations.

Because the agent model may make errors during simulation, not all trajectories successfully realize the intended task, so we filter the data to keep only correct and complete demonstrations. The retained trajectories form the training corpus $\mathcal{T}$ used for supervised fine-tuning. Since each retained trajectory comes from either a write-intensive or read-heavy task with a verified gold outcome, $\mathcal{T}$ systematically covers both axes of complexity defined in Section [2](https://arxiv.org/html/2606.02908#S2). Additional simulation details are provided in Appendix [I](https://arxiv.org/html/2606.02908#A9).
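The filtering step might look like the sketch below, assuming that "correct" means the final database state of a rollout equals the task's gold state, as suggested by the gold-state construction in Section 3.1. The rollout record layout and the function name are hypothetical, and any additional completeness checks used by the authors are not modeled.

```python
def filter_trajectories(rollouts):
    """Keep trajectories whose final database state equals the gold state."""
    return [
        rollout["trajectory"]
        for rollout in rollouts
        if rollout["final_state"] == rollout["gold_state"]
    ]


gold = {"reservations": [{"user_id": "emma_johnson_7098", "flight_number": "HAT188"}]}
rollouts = [
    # Successful rollout: the agent booked the gold flight.
    {"trajectory": "tau_1", "final_state": gold, "gold_state": gold},
    # Failed rollout: the agent never executed the write action.
    {"trajectory": "tau_2", "final_state": {"reservations": []}, "gold_state": gold},
]

corpus = filter_trajectories(rollouts)
print(corpus)  # ['tau_1']
```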

## 4 Experiments

| Model | Dataset | $\tau^2$ Retail $\mathrm{Pass}^1$ | $\tau^2$ Retail $\mathrm{Pass}^4$ | $\tau^2$ Airline $\mathrm{Pass}^1$ | $\tau^2$ Airline $\mathrm{Pass}^4$ | $\tau^2$ Average $\mathrm{Pass}^1$ | $\tau^2$ Average $\mathrm{Pass}^4$ | $\tau^2$ Retail-Hard $\mathrm{Pass}^1$ | $\tau^2$ Retail-Hard $\mathrm{Pass}^4$ | $\tau^2$ Airline-Hard $\mathrm{Pass}^1$ | $\tau^2$ Airline-Hard $\mathrm{Pass}^4$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-4B-Instruct-2507 | APIGen-MT | 50.00 ± 2.77 | 23.68 | 20.00 ± 4.90 | 6.00 | 40.85 ± 2.28 | 18.29 | 43.15 ± 2.03 | 14.52 | 17.50 ± 6.45 | 5.00 |
| Qwen3-4B-Instruct-2507 | Simia | 53.73 ± 2.62 | 25.44 | 31.00 ± 6.63 | 10.00 | 46.80 ± 2.36 | 20.73 | 42.74 ± 4.06 | 14.52 | 21.25 ± 7.50 | 0.00 |
| Qwen3-4B-Instruct-2507 | CoVe | 59.65 ± 1.60 | 31.58 | 37.50 ± 6.40 | 20.00 | 52.90 ± 2.46 | 28.05 | 53.63 ± 2.42 | 22.58 | 33.75 ± 6.29 | 20.00 |
| Qwen3-4B-Instruct-2507 | AReaL | 59.43 ± 5.13 | 32.46 | 47.00 ± 3.46 | 36.00 | 55.64 ± 3.60 | 33.54 | 52.42 ± 11.82 | 25.81 | 42.50 ± 8.66 | 25.00 |
| Qwen3-4B-Instruct-2507 | WRIT | 71.05 ± 1.24 | 47.37 | 61.00 ± 3.83 | 42.00 | 67.99 ± 1.90 | 45.73 | 66.13 ± 2.28 | 38.71 | 57.50 ± 6.45 | 40.00 |
| Llama-3.1-8B-Instruct | APIGen-MT | 42.98 ± 1.75 | 18.42 | 20.50 ± 4.12 | 6.00 | 36.13 ± 2.19 | 14.63 | 34.27 ± 0.81 | 9.68 | 22.50 ± 5.00 | 5.00 |
| Llama-3.1-8B-Instruct | Simia | 40.79 ± 3.40 | 17.54 | 23.00 ± 2.00 | 8.00 | 35.37 ± 2.49 | 14.63 | 33.47 ± 3.58 | 12.90 | 16.25 ± 4.79 | 10.00 |
| Llama-3.1-8B-Instruct | CoVe | 52.19 ± 1.13 | 27.19 | 32.00 ± 4.90 | 14.00 | 46.04 ± 1.17 | 23.17 | 51.21 ± 1.54 | 25.81 | 27.50 ± 2.89 | 15.00 |
| Llama-3.1-8B-Instruct | AReaL | 45.18 ± 0.51 | 20.18 | 43.50 ± 3.42 | 24.00 | 44.66 ± 0.77 | 21.34 | 37.50 ± 4.63 | 17.74 | 38.75 ± 2.50 | 15.00 |
| Llama-3.1-8B-Instruct | WRIT | 54.61 ± 2.90 | 31.58 | 50.00 ± 5.89 | 32.00 | 53.20 ± 3.32 | 31.71 | 47.58 ± 5.01 | 27.42 | 46.25 ± 7.50 | 30.00 |
| Qwen2.5-14B-Instruct | APIGen-MT | 50.00 ± 3.72 | 24.56 | 27.00 ± 2.00 | 12.00 | 42.99 ± 2.84 | 20.73 | 43.15 ± 5.65 | 22.58 | 17.50 ± 5.00 | 5.00 |
| Qwen2.5-14B-Instruct | Simia | 51.10 ± 2.42 | 28.95 | 34.50 ± 1.91 | 18.00 | 46.04 ± 1.76 | 25.61 | 40.32 ± 3.95 | 17.74 | 23.75 ± 4.79 | 10.00 |
| Qwen2.5-14B-Instruct | CoVe | 58.11 ± 4.08 | 31.58 | 34.50 ± 3.00 | 16.00 | 50.91 ± 3.61 | 26.83 | 53.63 ± 5.49 | 27.42 | 30.00 ± 4.08 | 10.00 |
| Qwen2.5-14B-Instruct | AReaL | 57.68 ± 5.03 | 31.58 | 43.00 ± 2.58 | 28.00 | 53.20 ± 3.90 | 30.49 | 50.81 ± 6.52 | 29.03 | 30.00 ± 5.77 | 10.00 |
| Qwen2.5-14B-Instruct | WRIT | 72.37 ± 1.68 | 47.37 | 57.50 ± 4.43 | 38.00 | 67.84 ± 2.51 | 44.51 | 66.13 ± 2.28 | 37.10 | 46.25 ± 10.31 | 20.00 |

Table 2: $\tau^2$-bench evaluation results. $\tau^2$ Retail/Airline report success over all domain tasks, $\tau^2$ Average is task-count weighted across Retail and Airline, and Retail-Hard/Airline-Hard report fixed read-heavy subsets where the hardest decision point requires about six or more read/search calls. The exact hard-subset task groupings are listed in Appendix [D](https://arxiv.org/html/2606.02908#A4). $\mathrm{Pass}^1$ includes the sample standard deviation across four trials. All numbers are percentages.

Training data. We synthesize training trajectories under the $\tau^2$-bench environment setting, which provides executable tools, domain policies, database states, and state-based success checks for multi-turn user-facing tasks (Barres et al., [2025](https://arxiv.org/html/2606.02908#bib.bib4)). Our final dataset contains 2K trajectories with balanced domain coverage, including 1K trajectories for the $\tau^2$-bench retail domain and 1K trajectories for the airline domain. To compare data synthesis recipes under the same supervised fine-tuning budget, we use a controlled 2K trajectory-level setting for all main experiments. For public baselines with larger released datasets, we uniformly sample 2K trajectories at the trajectory level. This protocol isolates the effect of trajectory quality and task composition from the effect of dataset scale; we additionally report full-size baseline comparisons in Appendix [B](https://arxiv.org/html/2606.02908#A2).

Baselines. We compare against four synthetic trajectory datasets for multi-turn user-facing agents: APIGen-MT (Prabhakar et al., [2026](https://arxiv.org/html/2606.02908#bib.bib1)), Simia (Li et al., [2025](https://arxiv.org/html/2606.02908#bib.bib2)), CoVe (Chen et al., [2026](https://arxiv.org/html/2606.02908#bib.bib3)), and AReaL (Gao et al., [2026](https://arxiv.org/html/2606.02908#bib.bib20)). These baselines cover different trajectory synthesis strategies, including simulated agent-user interaction, seed-set expansion with simulated environment feedback, rule-based argument transformation, and LLM-controlled synthetic data generation.

Evaluation. We evaluate on $\tau^2$-bench (Barres et al., [2025](https://arxiv.org/html/2606.02908#bib.bib4)), covering both retail and airline domains. In addition to the full task sets, we report performance on fixed read-heavy subsets where the hardest decision point requires about six or more read/search calls; the subset definitions are provided in Appendix [D](https://arxiv.org/html/2606.02908#A4). We use the $\mathrm{Pass}^k$ reliability metric (Yao et al., [2024](https://arxiv.org/html/2606.02908#bib.bib6)). For each task $i$, we run $n$ independent trials and let $c_i$ denote the number of successful trials. The $\mathrm{Pass}^k$ score is computed as $\frac{1}{|\mathcal{Q}|}\sum_{i\in\mathcal{Q}}\binom{c_i}{k}/\binom{n}{k}$, where $\mathcal{Q}$ is the evaluated task set. Intuitively, $\mathrm{Pass}^1$ is the average success rate over repeated trials, while $\mathrm{Pass}^k$ estimates the probability that $k$ randomly sampled trials for the same task all succeed. Larger $k$ therefore gives a stricter measure of reliability, because a model must solve the same task consistently rather than succeed only occasionally. In our experiments, we run each task four times and report $\mathrm{Pass}^1$ and $\mathrm{Pass}^4$. Higher $\mathrm{Pass}^k$ therefore indicates more consistent behavior across repeated attempts.
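The $\mathrm{Pass}^k$ formula above can be computed directly from per-task success counts. The sketch below is illustrative only: the function name `pass_hat_k` and the example counts are hypothetical and are not taken from the paper's evaluation code or results.

```python
from math import comb


def pass_hat_k(success_counts, n, k):
    """Pass^k = (1/|Q|) * sum_i C(c_i, k) / C(n, k).

    success_counts: c_i for each task i in the evaluated set Q
    n: number of independent trials per task
    k: number of sampled trials that must all succeed
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    denom = comb(n, k)
    # math.comb(c, k) is 0 whenever k > c, so tasks with fewer than k
    # successes contribute nothing to the sum.
    return sum(comb(c, k) for c in success_counts) / (len(success_counts) * denom)


# Hypothetical success counts for five tasks, each run n = 4 times.
counts = [4, 3, 0, 4, 2]
pass1 = pass_hat_k(counts, n=4, k=1)  # mean per-trial success rate
pass4 = pass_hat_k(counts, n=4, k=4)  # fraction of tasks solved in all 4 trials
```

With $n = k = 4$, the ratio $\binom{c_i}{4}/\binom{4}{4}$ is 1 only when $c_i = 4$, so $\mathrm{Pass}^4$ reduces to the fraction of tasks solved in every trial.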

Models and implementation details. We focus on non-thinking agent settings, where the deployed model must act directly without explicit long-form reasoning. We therefore fine-tune multiple instruction-tuned base models, including Qwen3-4B-Instruct-2507, Llama-3.1-8B-Instruct, and Qwen2.5-14B-Instruct. For each base model and dataset, we perform full-parameter supervised fine-tuning on the corresponding training trajectories. Dataset statistics, training hyperparameters, and additional implementation details are provided in Appendix [E](https://arxiv.org/html/2606.02908#A5).

| Variant | Retail $\mathrm{Pass}^1$ | Retail $\mathrm{Pass}^4$ | Airline $\mathrm{Pass}^1$ | Airline $\mathrm{Pass}^4$ | Average $\mathrm{Pass}^1$ | Average $\mathrm{Pass}^4$ | Retail-Hard $\mathrm{Pass}^1$ | Retail-Hard $\mathrm{Pass}^4$ | Airline-Hard $\mathrm{Pass}^1$ | Airline-Hard $\mathrm{Pass}^4$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WRIT | 71.05 ± 1.24 | 47.37 | 61.00 ± 3.83 | 42.00 | 67.99 ± 1.90 | 45.73 | 66.13 ± 2.28 | 38.71 | 57.50 ± 6.45 | 40.00 |
| w/o read-heavy | 67.11 ± 3.24 | 46.49 | 52.50 ± 5.74 | 30.00 | 62.65 ± 3.39 | 41.46 | 57.66 ± 3.05 | 33.87 | 41.25 ± 6.29 | 15.00 |
| w/o script | 69.96 ± 4.72 | 45.61 | 49.50 ± 6.61 | 26.00 | 63.72 ± 2.84 | 39.63 | 62.90 ± 8.22 | 32.26 | 45.00 ± 17.80 | 20.00 |
| w/o multi-write | 67.32 ± 3.31 | 43.86 | 55.00 ± 4.16 | 32.00 | 63.57 ± 1.15 | 40.24 | 59.27 ± 4.03 | 29.03 | 51.25 ± 8.54 | 30.00 |

Table 3: Ablation results on $\tau^2$-bench using Qwen3-4B-Instruct-2507. We report $\mathrm{Pass}^1$ and $\mathrm{Pass}^4$ on the full Retail/Airline task sets, their task-count weighted average, and the fixed read-heavy hard subsets.

### 4.1 Results and Analysis

WRIT expands the agent capability boundary and improves reliability. Table [2](https://arxiv.org/html/2606.02908#S4.T2) shows that WRIT substantially improves multi-turn agent performance across model families. On Qwen3-4B-Instruct-2507, WRIT achieves a $\tau^2$ Average $\mathrm{Pass}^1$ of 67.99, outperforming AReaL by 12.35 points, and improves $\mathrm{Pass}^4$ from 33.54 to 45.73. The same pattern holds for Llama-3.1-8B-Instruct, where WRIT improves the average $\mathrm{Pass}^1$ from 46.04 with CoVe to 53.20, and for Qwen2.5-14B-Instruct, where WRIT improves the average $\mathrm{Pass}^1$ from 53.20 with AReaL to 67.84. Higher $\mathrm{Pass}^1$ suggests that the trained agent can solve a broader set of tasks in a single attempt, while higher $\mathrm{Pass}^4$ indicates more stable behavior across repeated trials. Together, these gains show that WRIT improves both capability coverage and reliability in multi-turn user-facing settings.
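As a sanity check on the averages quoted above, the task-count weighted $\tau^2$ Average can be recomputed from the per-domain scores. The task counts used below (114 Retail, 50 Airline) are an assumption: they are not stated in this section, but they are the counts under which the Average column of Table 2 is reproduced. The helper name `weighted_average` is likewise introduced here for illustration.

```python
def weighted_average(scores, task_counts):
    """Task-count weighted mean of per-domain scores (in percent)."""
    total = sum(task_counts[d] for d in scores)
    return sum(scores[d] * task_counts[d] for d in scores) / total


# Assumed domain sizes; not given in this section of the paper.
TASK_COUNTS = {"retail": 114, "airline": 50}

# Pass^1 values for Qwen3-4B-Instruct-2507, copied from Table 2.
writ_pass1 = {"retail": 71.05, "airline": 61.00}
areal_pass1 = {"retail": 59.43, "airline": 47.00}

writ_avg = weighted_average(writ_pass1, TASK_COUNTS)
areal_avg = weighted_average(areal_pass1, TASK_COUNTS)
gap = writ_avg - areal_avg  # the reported margin over AReaL

print(f"WRIT {writ_avg:.2f}  AReaL {areal_avg:.2f}  gap {gap:.2f}")
```

Under these assumed counts, the rounded values match the 67.99, 55.64, and 12.35 figures reported in the text.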

Read-heavy synthesis addresses a key weakness of user-facing agents. The gains are especially clear on the read-heavy subsets, which correspond to difficult $\tau^2$-bench tasks requiring substantial read/search behavior before the final decision. For Qwen3-4B-Instruct-2507, WRIT improves Airline-Hard $\mathrm{Pass}^1$ from 42.50 with AReaL to 57.50, and improves $\mathrm{Pass}^4$ from 25.00 to 40.00. On Retail-Hard, WRIT also improves $\mathrm{Pass}^1$ from 53.63 with CoVe to 66.13. Similar improvements appear for the other two base models, as also shown by the $\mathrm{Pass}^k$ curves in Figure [2](https://arxiv.org/html/2606.02908#S4.F2). These results suggest that our synthesized trajectories directly address a capability gap in current user-facing agents: they need practice not only executing tools, but also gathering and comparing enough evidence before committing to a write action.

| Method | Retail | Airline | Avg. | Output Tokens (USD) |
| --- | --- | --- | --- | --- |
| GPT-5.1 thinking | 82.46 | 72.00 | 79.27 | 1,520,619 (\$17.52) |
| GPT-5.1 no-think | 69.30 | 48.00 | 62.80 | 318,180 (\$5.56) |
| WRIT-4B | 71.05 | 61.00 | 67.99 | 251,405 (–) |

Table 4: GPT-5.1 evaluation results on $\tau^2$-bench. Retail, Airline, and Avg. are percentages. Output Tokens counts agent-side completion tokens for one full $\tau^2$ evaluation, with agent-side API cost shown in parentheses.

Pipeline components specialize into complementary capabilities. Table [3](https://arxiv.org/html/2606.02908#S4.T3) shows that all three components contribute to the final performance.

Read-heavy grounding strengthens evidence-intensive decisions. Removing read-heavy trajectories only mildly reduces Retail $\mathrm{Pass}^1$ from 71.05 to 67.11, but the drop becomes much larger on Retail-Hard, from 66.13 to 57.66. The effect is even more pronounced on Airline-Hard, where $\mathrm{Pass}^1$ drops from 57.50 to 41.25 and $\mathrm{Pass}^4$ collapses from 40.00 to 15.00. This pattern directly supports our main hypothesis: read-heavy samples do not merely improve general performance, but specifically improve the agent’s capability and stability on difficult tasks that require substantial evidence gathering before acting.

Scripts improve robustness near the policy boundary. Removing scripts causes the largest full-domain drop on Airline, reducing $\mathrm{Pass}^1$ from 61.00 to 49.50 and $\mathrm{Pass}^4$ from 42.00 to 26.00; it also substantially hurts Airline-Hard, where $\mathrm{Pass}^4$ falls from 40.00 to 20.00. The retail domain is less affected, suggesting that the script layer is especially valuable in policy-sensitive settings where adversarial user patterns, such as false premises, pressure, or delayed policy-relevant information, stress the agent’s refusal and policy-following behavior.

Multi-write composition provides useful task-composition coverage. Removing multi-write trajectories lowers the average $\mathrm{Pass}^1$ from 67.99 to 63.57 and $\mathrm{Pass}^4$ from 45.73 to 40.24, with the largest drop appearing on Retail-Hard $\mathrm{Pass}^4$, from 38.71 to 29.03. This indicates that compound tasks still provide important training signal for maintaining correctness across multiple requested operations. Overall, the ablations show that both complexity-oriented sample types, multi-write and read-heavy, substantially improve performance, while scripts mainly improve stability under user-side variation and policy-boundary stress.

WRIT approaches strong API agents with substantially lower inference cost. Table [4](https://arxiv.org/html/2606.02908#S4.T4) compares WRIT with GPT-5.1 variants on $\tau^2$-bench. Although GPT-5.1 thinking achieves the highest score, it uses over 1.5M output tokens for one full Retail+Airline evaluation, reflecting the high inference cost of relying on test-time reasoning. In contrast, WRIT outperforms GPT-5.1 no-think on both domains, improving the average $\mathrm{Pass}^1$ from 62.80 to 67.99, while using fewer output tokens. This suggests that our synthesized trajectories transfer part of the required evidence-gathering and policy-following behavior into the model parameters through SFT, allowing a smaller non-thinking agent to act more efficiently at inference time. The gap to GPT-5.1 thinking further indicates that explicit reasoning remains powerful, but WRIT provides a cost-effective alternative when deployment requires direct, low-token agent behavior.

![Refer to caption](https://arxiv.org/html/2606.02908v1/x4.png)

Figure 2: $\mathrm{Pass}^k$ curves for Qwen3-4B-Instruct-2507. The horizontal axis indexes $k=1,2,3,4$.

## 5 Conclusion

We presented WRIT, a trajectory synthesis pipeline for multi-turn user-facing agents that controls task complexity along two axes: the number of write decisions and the read evidence required to resolve each decision. By combining decision-coverage tasks, read-heavy grounding tasks, and scripted user behaviors, WRIT produces clean SFT trajectories in executable environments. Experiments on $\tau^2$-bench show significant gains across models, especially on read-heavy hard subsets, demonstrating the importance of two-axis complexity control and opening a promising direction for future agentic trajectory synthesis.

## Limitations

#### Compositional hard samples\.

WRIT controls task complexity along two axes: increasing the number of decision points through multi-write tasks, and increasing the grounding difficulty of individual decision points through read-heavy tasks. In this work, we study these two axes mostly as separate sources of difficulty. We have not fully explored their composition, such as constructing multi-write tasks where each decision point is also read-heavy. Such samples may further stress long-horizon state tracking and evidence-intensive grounding at the same time.

#### Mixture of complexity types\.

Our training data contains both decision\-coverage samples and read\-heavy grounding samples, but we do not exhaustively study the optimal mixture ratio between different complexity types\. Different model families or base capabilities may benefit from different proportions of low\-read, multi\-write, read\-heavy, and policy\-robust samples\. A more systematic mixture study could clarify how each type of synthetic trajectory shapes agent behavior during supervised fine\-tuning\.

## References

- Prompting large language models for user simulation in task\-oriented dialogue systems\.Computer Speech & Language89,pp\. 101697\.External Links:[Document](https://dx.doi.org/10.1016/j.csl.2024.101697)Cited by:[§3\.2](https://arxiv.org/html/2606.02908#S3.SS2.p2.1)\.
- V\. Barres, H\. Dong, S\. Ray, X\. Si, and K\. Narasimhan \(2025\) $\tau^2$\-Bench: evaluating conversational agents in a dual\-control environment\.arXiv preprint arXiv:2506\.07982\.Cited by:[§A\.2](https://arxiv.org/html/2606.02908#A1.SS2.p1.1),[Appendix E](https://arxiv.org/html/2606.02908#A5.p1.4),[§1](https://arxiv.org/html/2606.02908#S1.p1.1),[§2](https://arxiv.org/html/2606.02908#S2.p1.1),[§4](https://arxiv.org/html/2606.02908#S4.p1.2),[§4](https://arxiv.org/html/2606.02908#S4.p3.15)\.
- K\. Basu, I\. Abdelaziz, K\. Kate, M\. Agarwal, M\. Crouse, Y\. Rizk, K\. Bradford, A\. Munawar, S\. Kumaravel, S\. Goyal, X\. Wang, L\. A\. Lastras, and P\. Kapanipathi \(2024\)NESTFUL: a benchmark for evaluating llms on nested sequences of api calls\.arXiv preprint arXiv:2409\.03797\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1)\.
- S\. Burdisso, S\. Baroudi, Y\. Labrak, D\. Grunert, P\. Cyrta, Y\. Chen, S\. Madikeri, T\. Schaaf, E\. Villatoro\-Tello, A\. Hassoon, R\. Marxer, and P\. Motlicek \(2025\)SDialog: a python toolkit for end\-to\-end agent building, user simulation, dialog generation, and evaluation\.arXiv preprint arXiv:2506\.10622\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- J\. Chen, C\. Gong, H\. Li, Z\. Liu, Z\. Tian, X\. Fu, S\. Wu, C\. Zhang, W\. Zhang, S\. Zhang, D\. Tu, and R\. Liu \(2026\)CoVe: training interactive tool\-use agents via constraint\-guided verification\.arXiv preprint arXiv:2603\.01940\.Cited by:[§A\.2](https://arxiv.org/html/2606.02908#A1.SS2.p2.1),[Appendix F](https://arxiv.org/html/2606.02908#A6.p1.2),[§1](https://arxiv.org/html/2606.02908#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.02908#S3.SS1.p5.2),[§4](https://arxiv.org/html/2606.02908#S4.p2.1)\.
- X\. Cheng, Y\. Hu, X\. Zhang, L\. Xu, L\. Tan, Z\. Pan, X\. Li, and Y\. Liu \(2025\)Beyond itinerary planning: a real\-world benchmark for multi\-turn and tool\-using travel tasks\.arXiv preprint arXiv:2512\.22673\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez,et al\.\(2024\)Workarena: how capable are web agents at solving common knowledge work tasks?\.arXiv preprint arXiv:2403\.07718\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- R\. Fang, S\. Cai, B\. Li, J\. Wu, G\. Li, W\. Yin, X\. Wang, X\. Wang, L\. Su, Z\. Zhang, S\. Wu, Z\. Tao, Y\. Jiang, P\. Xie, F\. Huang, and J\. Zhou \(2025\)Towards general agentic intelligence via environment scaling\.arXiv preprint arXiv:2509\.13311\.Cited by:[§A\.2](https://arxiv.org/html/2606.02908#A1.SS2.p2.1),[§1](https://arxiv.org/html/2606.02908#S1.p1.1),[§1](https://arxiv.org/html/2606.02908#S1.p2.1),[§3\.3](https://arxiv.org/html/2606.02908#S3.SS3.p1.4)\.
- R\. Ferreira, D\. Semedo, and J\. Magalhães \(2024\)Multi\-trait user simulation with adaptive decoding for conversational task assistants\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Cited by:[§3\.2](https://arxiv.org/html/2606.02908#S3.SS2.p1.2)\.
- J\. Gao, J\. Chen, C\. He, S\. Xu, D\. Jin, and Y\. Wu \(2026\)From self\-evolving synthetic data to verifiable\-reward rl: post\-training multi\-turn interactive tool\-using agents\.arXiv preprint arXiv:2601\.22607\.Cited by:[§A\.2](https://arxiv.org/html/2606.02908#A1.SS2.p2.1),[Appendix F](https://arxiv.org/html/2606.02908#A6.p1.2),[§1](https://arxiv.org/html/2606.02908#S1.p1.1),[§4](https://arxiv.org/html/2606.02908#S4.p2.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.External Links:[Link](https://arxiv.org/abs/2407.21783)Cited by:[Appendix E](https://arxiv.org/html/2606.02908#A5.p1.4),[Appendix F](https://arxiv.org/html/2606.02908#A6.p1.2)\.
- Z\. Hu, N\. F\. Chen, and R\. K\. Lee \(2025\)Are current task\-oriented dialogue systems able to satisfy impolite users?\.IEEE Transactions on Computational Social Systems12\(5\),pp\. 2876–2887\.External Links:[Document](https://dx.doi.org/10.1109/TCSS.2024.3521020)Cited by:[§3\.2](https://arxiv.org/html/2606.02908#S3.SS2.p1.2),[§3\.2](https://arxiv.org/html/2606.02908#S3.SS2.p2.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\)SWE\-bench: can language models resolve real\-world github issues?\.International Conference on Learning Representations\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1)\.
- J\. Y\. Koh, R\. Lo, L\. Jang, V\. Duvvur, M\. C\. Lim, P\. Huang, G\. Neubig, S\. Zhou, R\. Salakhutdinov, and D\. Fried \(2024\)VisualWebArena: evaluating multimodal agents on realistic visual web tasks\.arXiv preprint arXiv:2401\.13649\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1)\.
- Y\. Li, H\. A\. Inan, X\. Yue, W\. Chen, L\. Wutschitz, J\. Kulkarni, R\. Poovendran, R\. Sim, and S\. Rajmohan \(2025\)Simulating environments with reasoning models for agent training\.arXiv preprint arXiv:2511\.01824\.Cited by:[§A\.2](https://arxiv.org/html/2606.02908#A1.SS2.p2.1),[Appendix F](https://arxiv.org/html/2606.02908#A6.p1.2),[§1](https://arxiv.org/html/2606.02908#S1.p2.1),[§4](https://arxiv.org/html/2606.02908#S4.p2.1)\.
- J\. Lu, T\. Holleis, Y\. Zhang, B\. Aumayer, F\. Nan, F\. Bai, S\. Ma, S\. Ma, M\. Li, G\. Yin, Z\. Wang, and R\. Pang \(2024\)ToolSandbox: a stateful, conversational, interactive evaluation benchmark for llm tool use capabilities\.arXiv preprint arXiv:2408\.04682\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2023\)Gorilla: large language model connected with massive apis\.arXiv preprint arXiv:2305\.15334\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1)\.
- A\. Prabhakar, Z\. Liu, M\. Zhu, J\. Zhang, T\. M\. Awalgaonkar, S\. Wang, Z\. Liu, H\. Chen, T\. Hoang, J\. C\. Niebles,et al\.\(2026\)Apigen\-mt: agentic pipeline for multi\-turn data generation via simulated agent\-human interplay\.Advances in Neural Information Processing Systems38\.Cited by:[§A\.2](https://arxiv.org/html/2606.02908#A1.SS2.p2.1),[Appendix F](https://arxiv.org/html/2606.02908#A6.p1.2),[§1](https://arxiv.org/html/2606.02908#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.02908#S3.SS1.p5.2),[§3\.3](https://arxiv.org/html/2606.02908#S3.SS3.p1.4),[§4](https://arxiv.org/html/2606.02908#S4.p2.1)\.
- C\. Qian, Z\. Liu, A\. Prabhakar, Z\. Liu, J\. Zhang, H\. Chen, H\. Ji, W\. Yao, S\. Heinecke, S\. Savarese, C\. Xiong, and H\. Wang \(2025\)UserBench: an interactive gym environment for user\-centric agents\.arXiv preprint arXiv:2507\.22034\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- T\. Qin, F\. Bai, T\. Hu, R\. Vemulapalli, H\. S\. Koppula, Z\. Xu, B\. Jin, M\. Cemri, J\. Lu, Z\. Wang, and M\. Cao \(2025\)COMPASS: benchmarking constrained optimization in llm agents\.arXiv preprint arXiv:2510\.07043\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- M\. Rana, C\. Man, A\. E\. Msiiwa, J\. Paine, K\. Zhu, S\. Dev, V\. Sharma, and A\. M R \(2025\)AgentChangeBench: a multi\-dimensional evaluation framework for goal\-shift robustness in conversational ai\.arXiv preprint arXiv:2510\.18170\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- H\. Trivedi, T\. Khot, M\. Hartmann, R\. Manku, V\. Dong, E\. Li, S\. Gupta, A\. Sabharwal, and N\. Balasubramanian \(2024\)Appworld: a controllable world of apps and people for benchmarking interactive coding agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 16022–16076\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1)\.
- Z\. Wang, Q\. Chang, H\. Patel, S\. Biju, C\. Wu, Q\. Liu, A\. Ding, A\. Rezazadeh, A\. Shah, Y\. Bao, and E\. Siow \(2025\)MCP\-Bench: benchmarking tool\-using llm agents with complex real\-world tasks via mcp servers\.arXiv preprint arXiv:2508\.20453\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- Z\. Wang, Y\. Lu, Y\. Zhang, P\. Chen, Z\. Dong, J\. Huang, J\. Gesi, X\. Tang, C\. Luo, Q\. Liu, Y\. Sang, H\. Lu, M\. Li, J\. Lai, and D\. Wang \(2026\)Trajectory2Task: training robust tool\-calling agents with synthesized yet verifiable data for complex user intents\.arXiv preprint arXiv:2601\.20144\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p2.1)\.
- Z\. Xu, A\. M\. Soria, S\. Tan, A\. Roy, A\. S\. Agrawal, R\. Poovendran, and R\. Panda \(2025\)TOUCAN: synthesizing 1\.5m tool\-agentic data from real\-world mcp environments\.arXiv preprint arXiv:2510\.01179\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.External Links:[Link](https://arxiv.org/abs/2505.09388)Cited by:[Appendix E](https://arxiv.org/html/2606.02908#A5.p1.4),[Appendix F](https://arxiv.org/html/2606.02908#A6.p1.2)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024a\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.External Links:[Link](https://arxiv.org/abs/2412.15115)Cited by:[Appendix E](https://arxiv.org/html/2606.02908#A5.p1.4)\.
- J\. Yang, C\. E\. Jimenez, A\. Wettig, K\. Lieret, S\. Yao, K\. Narasimhan, and O\. Press \(2024b\)SWE\-agent: agent\-computer interfaces enable automated software engineering\.arXiv preprint arXiv:2405\.15793\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan \(2024\) $\tau$\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.arXiv preprint arXiv:2406\.12045\.Cited by:[§A\.2](https://arxiv.org/html/2606.02908#A1.SS2.p1.1),[Appendix E](https://arxiv.org/html/2606.02908#A5.p1.4),[§2](https://arxiv.org/html/2606.02908#S2.p1.1),[§4](https://arxiv.org/html/2606.02908#S4.p3.15)\.
- X\. Zeng, W\. Liu, L\. Wang, L\. Li, F\. Mi, Y\. Wang, L\. Shang, X\. Jiang, and Q\. Liu \(2025\)ToolACE\-MT: non\-autoregressive generation for agentic multi\-turn interaction\.arXiv preprint arXiv:2508\.12685\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- Z\. Zhang, S\. Cui, Y\. Lu, J\. Zhou, J\. Yang, H\. Wang, and M\. Huang \(2024\)Agent\-safetybench: evaluating the safety of llm agents\.arXiv preprint arXiv:2412\.14470\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- W\. Zhao, X\. Wang, C\. Ma, L\. Kong, Z\. Yang, M\. Tuo, X\. Shi, Y\. Zhai, and X\. Cai \(2025\)MUA\-RL: multi\-turn user\-interacting agent reinforcement learning for agentic tool use\.arXiv preprint arXiv:2508\.18669\.Cited by:[§1](https://arxiv.org/html/2606.02908#S1.p1.1)\.
- Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, Z\. Luo, Z\. Feng, and Y\. Ma \(2024\)LlamaFactory: unified efficient fine\-tuning of 100\+ language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),Bangkok, Thailand\.External Links:[Link](http://arxiv.org/abs/2403.13372)Cited by:[Appendix E](https://arxiv.org/html/2606.02908#A5.p1.4)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried,et al\.\(2024\)Webarena: a realistic web environment for building autonomous agents\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 15585–15606\.Cited by:[§A\.1](https://arxiv.org/html/2606.02908#A1.SS1.p1.1)\.

## Appendix A Related Work

### A.1 Synthetic trajectories for agent training

Recent agent research increasingly uses full interaction trajectories as supervision for teaching models agentic capabilities, rather than relying only on final-answer labels. In web and workflow environments, agent traces describe how models navigate interfaces, use tools, and complete multi-step tasks (Zhou et al., [2024](https://arxiv.org/html/2606.02908#bib.bib42); Koh et al., [2024](https://arxiv.org/html/2606.02908#bib.bib43); Drouin et al., [2024](https://arxiv.org/html/2606.02908#bib.bib44); Trivedi et al., [2024](https://arxiv.org/html/2606.02908#bib.bib45)). In software engineering, trajectories capture repository navigation, code editing, tool execution, and issue resolution (Jimenez et al., [2024](https://arxiv.org/html/2606.02908#bib.bib46); Yang et al., [2024b](https://arxiv.org/html/2606.02908#bib.bib47)). Other work studies tool-use or function-calling traces across heterogeneous APIs and environments (Patil et al., [2023](https://arxiv.org/html/2606.02908#bib.bib28); Basu et al., [2024](https://arxiv.org/html/2606.02908#bib.bib23); Xu et al., [2025](https://arxiv.org/html/2606.02908#bib.bib15); Zeng et al., [2025](https://arxiv.org/html/2606.02908#bib.bib16)). These studies show that trajectory-level supervision can teach models intermediate actions and tool interactions that are difficult to learn from final-answer labels alone.

### A.2 Trajectory synthesis for multi-turn user-facing agents

Training multi-turn user-facing agents requires trajectories that capture dialogue state tracking, user intent clarification, policy adherence, tool use, and state-changing execution (Yao et al., [2024](https://arxiv.org/html/2606.02908#bib.bib6); Barres et al., [2025](https://arxiv.org/html/2606.02908#bib.bib4)). Recent work has therefore studied how to synthesize such trajectories without relying on large-scale human collection.

APIGen-MT proposes a two-phase pipeline that first constructs task blueprints with ground-truth actions and then realizes them as multi-turn interactions through simulated human-agent interplay (Prabhakar et al., [2026](https://arxiv.org/html/2606.02908#bib.bib1)). Simia expands small seed datasets into more diverse training trajectories, using reasoning models to simulate environment feedback and support data augmentation without a fully implemented executable backend (Li et al., [2025](https://arxiv.org/html/2606.02908#bib.bib2)). CoVe focuses on rule-based argument transformation: it replaces directly exposed tool arguments with predefined indirect descriptions, so that the agent must recover the hidden argument through tool use (Chen et al., [2026](https://arxiv.org/html/2606.02908#bib.bib3)). AReaL/EigenData follows an LLM-controlled generation pipeline, where an LLM drives the construction of synthetic tasks, dialogues, tool calls, and executable checkers for multi-turn tool-use training (Gao et al., [2026](https://arxiv.org/html/2606.02908#bib.bib20)). AgentScaler broadens the setting by constructing many synthetic function-calling environments from which agent trajectories can be collected (Fang et al., [2025](https://arxiv.org/html/2606.02908#bib.bib27)).

These works commonly increase task difficulty by composing multiple user requests or write actions into compound tasks. This produces longer trajectories and trains agents for long-horizon execution, but it mainly increases the number of decision points. Our work studies a complementary axis of complexity. Instead of only asking the agent to execute more write actions, WRIT constructs read-heavy tasks where a single write action requires substantial read-tool evidence before its arguments can be resolved. This differs from fixed rule-based argument rewriting: WRIT generates natural user requests that induce specified read-heavy behavior, and uses reverse-selection verification to ensure that the intended write argument remains recoverable from the returned evidence.

## Appendix B Full-Size Dataset Comparison on $\tau^2$-Bench

Our main experiments use uniformly sampled 2K trajectory\-level training sets for all datasets\. This design is intended to isolate data quality and task composition while holding the SFT data budget fixed across methods\. However, several public baselines are released at larger scales, such as APIGen\-MT\-5K, Simia\-90K, and CoVe\-12K\. A natural concern is therefore that the 2K sampling protocol could understate the performance of these baselines by discarding useful examples\.

To address this possible confound, we run an additional full-size comparison using the same base model, training recipe, and evaluation protocol as the main Qwen3-4B-Instruct-2507 experiments. Specifically, we train on each dataset at its available full scale and evaluate with the same strict $\mathrm{Pass}^k$ computation, where context-window and model-side failures are counted as incorrect. This experiment is not meant to replace the controlled 2K-budget comparison; rather, it verifies that our conclusions are not an artifact of uniformly downsampling larger baseline datasets. The full-size results are reported in Table [5](https://arxiv.org/html/2606.02908#A2.T5), with the corresponding $\mathrm{Pass}^k$ curves shown in Figure [3](https://arxiv.org/html/2606.02908#A2.F3).

| Dataset | Retail $\mathrm{Pass}^1$ | Retail $\mathrm{Pass}^4$ | Airline $\mathrm{Pass}^1$ | Airline $\mathrm{Pass}^4$ | Average $\mathrm{Pass}^1$ | Average $\mathrm{Pass}^4$ | Retail-Hard $\mathrm{Pass}^1$ | Retail-Hard $\mathrm{Pass}^4$ | Airline-Hard $\mathrm{Pass}^1$ | Airline-Hard $\mathrm{Pass}^4$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| APIGen-MT-5K | 51.54 ± 1.10 | 19.30 | 23.00 ± 3.46 | 8.00 | 42.84 ± 0.30 | 15.85 | 43.95 ± 2.42 | 11.29 | 22.50 ± 5.00 | 10.00 |
| Simia-90K | 51.97 ± 3.54 | 27.19 | 44.00 ± 6.73 | 20.00 | 49.54 ± 2.74 | 25.00 | 45.16 ± 5.43 | 24.19 | 28.75 ± 11.81 | 5.00 |
| CoVe-12K | 61.62 ± 4.01 | 41.23 | 38.00 ± 9.38 | 18.00 | 54.42 ± 4.06 | 34.15 | 56.85 ± 5.16 | 33.87 | 31.25 ± 14.36 | 5.00 |
| AReaL-2K | 59.43 ± 5.13 | 32.46 | 47.00 ± 3.46 | 36.00 | 55.64 ± 3.60 | 33.54 | 52.42 ± 11.82 | 25.81 | 42.50 ± 8.66 | 25.00 |
| WRIT-2K | 71.05 ± 1.24 | 47.37 | 61.00 ± 3.83 | 42.00 | 67.99 ± 1.90 | 45.73 | 66.13 ± 2.28 | 38.71 | 57.50 ± 6.45 | 40.00 |

Table 5: Full-size dataset comparison on $\tau^2$-bench using Qwen3-4B-Instruct-2507. The main experiments use uniformly sampled 2K training sets to control the data budget across methods; this appendix experiment instead trains each dataset at its available full scale (APIGen-MT-5K, Simia-90K, CoVe-12K, AReaL-2K, and WRIT-2K) to rule out the possibility that the 2K sampling protocol unfairly disadvantages larger public baselines.

![Refer to caption](https://arxiv.org/html/2606.02908v1/x5.png)

Figure 3: $\mathrm{Pass}^k$ degradation curves for the full-size dataset comparison on $\tau^2$-bench using Qwen3-4B-Instruct-2507. The horizontal axis indexes $k=1,2,3,4$; panel titles report the number of evaluated tasks. Unlike the controlled 2K-budget main comparison, this setting trains each dataset at its available full scale, including APIGen-MT-5K, Simia-90K, CoVe-12K, AReaL-2K, and WRIT-2K.

## Appendix C Additional $\mathrm{Pass}^k$ Curves

We provide additional $\mathrm{Pass}^k$ curves to complement the main results. Figure [4](https://arxiv.org/html/2606.02908#A3.F4) reports the curves for Llama-3.1-8B-Instruct, Figure [5](https://arxiv.org/html/2606.02908#A3.F5) reports the curves for Qwen2.5-14B-Instruct, and Figure [6](https://arxiv.org/html/2606.02908#A3.F6) visualizes the ablation variants on Qwen3-4B-Instruct-2507.

![Refer to caption](https://arxiv.org/html/2606.02908v1/x6.png)

Figure 4: $\mathrm{Pass}^k$ degradation curves on $\tau^2$-bench for Llama-3.1-8B-Instruct. The horizontal axis indexes $k=1,2,3,4$; panel titles report the number of evaluated tasks.

![Refer to caption](https://arxiv.org/html/2606.02908v1/x7.png)

Figure 5: $\mathrm{Pass}^k$ degradation curves on $\tau^2$-bench for Qwen2.5-14B-Instruct. The horizontal axis indexes $k=1,2,3,4$; panel titles report the number of evaluated tasks.

![Refer to caption](https://arxiv.org/html/2606.02908v1/x8.png)

Figure 6: $\mathrm{Pass}^k$ curves for the ablation study on $\tau^2$-bench using Qwen3-4B-Instruct-2507. The horizontal axis indexes $k=1,2,3,4$; panel titles report the number of evaluated tasks.

| Domain | # Tasks | Task IDs |
| --- | --- | --- |
| Retail | 62 | 2, 3, 4, 5, 8, 9, 19, 20, 21, 23, 24, 25, 26, 27, 29, 30, 31, 32, 35, 36, 37, 38, 45, 49, 53, 54, 55, 58, 62, 63, 64, 66, 68, 70, 71, 74, 76, 79, 81, 82, 83, 84, 85, 86, 87, 90, 91, 93, 94, 95, 98, 99, 100, 101, 102, 104, 105, 106, 107, 111, 112, 113 |
| Airline | 20 | 1, 2, 4, 5, 7, 8, 9, 10, 15, 17, 18, 19, 27, 35, 38, 39, 41, 42, 43, 44 |

Table 6: Read-heavy task subsets used for $\tau^2$ Retail-Hard and $\tau^2$ Airline-Hard evaluation.

## Appendix D Read-Heavy Subsets in $\tau^2$-Bench

We define read-heavy tasks as tasks where the hardest decision point requires approximately six or more read/search tool calls before the final write action, refusal, or answer. The resulting task IDs for each $\tau^2$-bench domain are listed in Table [6](https://arxiv.org/html/2606.02908#A3.T6).
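One way to read this definition operationally is sketched below. The paper does not specify how calls are counted or attributed to a decision point, so everything here is an assumption: a trajectory is reduced to a hypothetical sequence of `"read"` and `"decision"` labels, reads are attributed to the next decision that follows them, and the "approximately six" criterion is treated as a hard threshold of 6.

```python
def max_reads_before_decision(steps):
    """Largest number of read/search calls preceding any single decision.

    steps: sequence of labels, "read" for a read/search tool call and
           "decision" for a write action, refusal, or final answer.
           Other labels (e.g. plain dialogue turns) are ignored.
    """
    hardest = 0
    reads = 0
    for step in steps:
        if step == "read":
            reads += 1
        elif step == "decision":
            hardest = max(hardest, reads)
            reads = 0  # the next decision starts with a fresh evidence count
    return hardest


def is_read_heavy(steps, threshold=6):
    """Assumed hard cutoff for the paper's 'approximately six or more' rule."""
    return max_reads_before_decision(steps) >= threshold


# Hypothetical traces: one evidence-heavy decision vs. several light ones.
heavy = ["read"] * 7 + ["decision"]
multi_write = ["read", "read", "decision", "message", "read", "decision"]
```

The second trace contains more decisions but is not read-heavy under this reading, which mirrors the distinction the paper draws between the two complexity axes.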

## Appendix E Dataset Statistics and Training Details

Dataset statistics are shown in Table [7](https://arxiv.org/html/2606.02908#A5.T7), with example user requests for different synthesized task types shown in Table [8](https://arxiv.org/html/2606.02908#A5.T8). The SFT hyperparameters are summarized in Table [9](https://arxiv.org/html/2606.02908#A5.T9). We fine-tune three instruction-following backbones: Qwen3-4B-Instruct-2507 Yang et al. ([2025](https://arxiv.org/html/2606.02908#bib.bib38)), Llama-3.1-8B-Instruct Grattafiori et al. ([2024](https://arxiv.org/html/2606.02908#bib.bib40)), and Qwen2.5-14B-Instruct Yang et al. ([2024a](https://arxiv.org/html/2606.02908#bib.bib39)). We run our experiments on four NVIDIA RTX PRO 6000 GPUs with 96GB memory each. We use LLaMA-Factory Zheng et al. ([2024](https://arxiv.org/html/2606.02908#bib.bib41)) for full-parameter supervised fine-tuning and evaluate agents with the official $\tau^2$-bench implementation Barres et al. ([2025](https://arxiv.org/html/2606.02908#bib.bib4)). During evaluation, the agent temperature is set to 0, while the user simulator uses GPT-4.1 with temperature 0.5. This follows the reliability-oriented protocol of $\tau$-bench Yao et al. ([2024](https://arxiv.org/html/2606.02908#bib.bib6)), which evaluates deterministic agents under stochastic user simulations. The resulting $\mathrm{Pass}^k$ metric measures whether an agent can solve the same underlying task consistently across $k$ independent user interactions, thereby testing robustness to user-side uncertainty.

| Domain | # Traj. | Plain | Read-heavy | Multi-write | Scripted | Avg. Turns | Avg. Tool Calls |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Retail | 1000 | 576 | 278 | 291 | 146 | 25.02 | 6.56 |
| Airline | 1000 | 322 | 230 | 187 | 448 | 24.23 | 6.66 |
| Total | 2000 | 898 | 508 | 478 | 594 | 24.63 | 6.61 |

Table 7: Statistics of the WRIT SFT dataset. Read-heavy denotes trajectories whose target write arguments require evidence from multiple read-tool outputs, Multi-write counts trajectories with at least two state-changing write actions, and Scripted counts trajectories paired with user-side interaction scripts. Average turns are computed over non-system messages, and average tool calls count executed assistant tool calls. All 2,000 trajectories are clean-success trajectories that reach the gold database state without tool execution errors.

| Task type | Gold write-action sequence $A_{\mathrm{gold}}$ | User request $u$ |
| --- | --- | --- |
| Plain (Single-write) | `book_reservation(...)` | You want to book a new one-way trip from IAH to LAS on May 22 in economy cabin, choosing the cheapest available option, with 1 checked bag, without travel insurance, and paying with your Mastercard ending in 1756. |
| Multi-write | `modify_user_address(...)` + `modify_pending_order_items(...)` | You want to update your default account address to 101 Highway, Apt 1, New York, NY 10001. You also want to modify the black Desk Lamp to the white USB-powered one in your pending order and use your gift card ending with 4233 for any price difference. |
| Read-heavy | `book_reservation(...)` | I need to book a one-way business class ticket for myself from the New York area to Phoenix on May 20th. I am flexible on which airport I depart from; please check Newark, JFK, and LaGuardia. I need to arrive in Phoenix by 3:00 PM. Among all the direct and one-stop options that meet this arrival time, I want the cheapest business class seat. Please use my Visa ending in 8898 for payment. |

Table 8: Examples of synthesized user requests for three task types in our SFT dataset. Plain tasks contain a single low-read write action, multi-write tasks compose multiple state-changing actions into one request, and read-heavy tasks require the agent to compare evidence from multiple read-tool calls before executing the gold write action.

| Hyperparameter | Value |
| --- | --- |
| Fine-tuning type | Full-parameter SFT |
| Epochs | 2 |
| Learning rate | $1\times 10^{-5}$ |
| Optimizer | AdamW |
| Learning-rate schedule | Cosine decay with 10 warmup steps |
| Maximum sequence length | 16,384 tokens |
| Precision | BF16 |
| Gradient clipping | Maximum gradient norm 1.0 |

Table 9: SFT hyperparameters used in our main experiments.

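As an illustration of the learning-rate schedule in Table 9, the following sketch evaluates a cosine decay with 10 warmup steps and a peak rate of $1\times 10^{-5}$. The linear warmup shape, the decay to zero, and the total step count are assumptions (they mirror common cosine-with-warmup schedulers); the paper does not specify them.

```python
import math

PEAK_LR = 1e-5      # from Table 9
WARMUP_STEPS = 10   # from Table 9


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimizer step.

    Assumptions: linear warmup from 0 to peak_lr, then cosine decay to 0
    over the remaining steps. total_steps is a hypothetical training length.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 250 optimizer steps.
for s in (0, 5, 10, 130, 250):
    print(s, lr_at(s, total_steps=250))
```

With only 10 warmup steps, almost the entire run sits on the cosine portion of the curve.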
## Appendix F Licenses of Models and Datasets

We use publicly released models, benchmarks, and synthetic trajectory datasets. The Qwen3-4B-Instruct-2507 model (Yang et al., [2025](https://arxiv.org/html/2606.02908#bib.bib38)) and Qwen2.5-14B-Instruct are released under the Apache-2.0 license, while Llama-3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.02908#bib.bib40)) is released under the Llama 3.1 Community License. For benchmark resources, $\tau$-bench and $\tau^2$-bench are released under the MIT License. For baseline datasets, APIGen-MT (Prabhakar et al., [2026](https://arxiv.org/html/2606.02908#bib.bib1)) is released under CC-BY-NC-4.0, CoVe (Chen et al., [2026](https://arxiv.org/html/2606.02908#bib.bib3)) and AReaL (Gao et al., [2026](https://arxiv.org/html/2606.02908#bib.bib20)) are released under Apache-2.0, and the Simia (Li et al., [2025](https://arxiv.org/html/2606.02908#bib.bib2)) code repository is released under the MIT License, while its dataset card does not separately specify a license.

## Appendix G Intended Use of Existing and Created Artifacts

We use existing models, benchmarks, and baseline datasets consistently with their intended research use. The base models are used for supervised fine-tuning and evaluation of tool-using agents, and the $\tau^2$-bench environments are used as benchmark settings for evaluating multi-turn user-facing task-completion agents. Public baseline datasets are used only for research comparison under their released access conditions and licenses.

The trajectories created by WRIT are intended for research on training and evaluating multi-turn user-facing agents in executable tool environments. They are designed to study synthetic trajectory generation, supervised fine-tuning, read-heavy task complexity, and robustness to user-side interaction variation. Since the trajectories are derived from research benchmarks and synthetic environments, they should be used only for research and evaluation purposes, not for deployment in real customer-service systems without additional validation, safety review, and compliance checks.

## Appendix H Use of AI Assistants

AI assistants were used only for writing assistance, including language polishing, clarity improvements, and minor editing of manuscript text\. They were not used to generate experimental results, alter reported numbers, or make scientific claims without author review\. All final content, analyses, and conclusions were reviewed and approved by the authors\.

## Appendix I Agent-User Simulation Details

Following the $\tau^2$-bench prompting setup, we use separate models for the agent and the user simulator during trajectory synthesis. The agent model is prompted as a customer-service agent with access to the domain policy, while the user simulator is prompted to play the customer role according to the synthesized task and the sampled interaction script. Tables [10](https://arxiv.org/html/2606.02908#A9.T10) and [11](https://arxiv.org/html/2606.02908#A9.T11) show the prompt templates used in our implementation.

| Component | Prompt Template |
| --- | --- |
| Agent model | `<instructions>`<br>You are a customer service agent that helps the user according to the `<policy>` provided below.<br>In each turn you can either:<br>- Send a message to the user.<br>- Make a tool call.<br>You cannot do both at the same time.<br>Try to be helpful and always follow the policy. Always make sure you generate valid JSON only.<br>`</instructions>`<br>`<policy>`<br>`{domain_policy}`<br>`</policy>` |

Table 10: Prompt template used for the agent model during trajectory synthesis.

| Component | Prompt Template |
| --- | --- |
| User simulator | # User Simulation Guidelines<br>You are playing the role of a customer contacting a customer service representative.<br>Your goal is to simulate realistic customer interactions while following specific scenario instructions.<br>…<br>`<scenario>`<br>`{user_scenario}`<br>`{optional_interaction_script}`<br>`</scenario>` |

Table 11: Prompt template used for the user simulator during trajectory synthesis. The optional interaction script is included when a script is sampled for the task.

We use Qwen3.6-Plus as the agent model and GPT-5.1 as the user simulator. The decoding temperature is set to 0.2 for the agent model and 0.7 for the user simulator, so that the agent behavior remains relatively stable while the user simulator preserves conversational diversity.
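The prompt assembly described in Tables 10 and 11 might look roughly like the sketch below. The function names, the newline used to join the scenario and the optional script, and the `SAMPLING` dictionary are assumptions for illustration; the `…` line is the elision from Table 11 and is kept as-is rather than filled in.

```python
AGENT_TEMPLATE = """<instructions>
You are a customer service agent that helps the user according to the <policy> provided below.
In each turn you can either:
- Send a message to the user.
- Make a tool call.
You cannot do both at the same time.
Try to be helpful and always follow the policy. Always make sure you generate valid JSON only.
</instructions>
<policy>
{domain_policy}
</policy>"""

# The "…" line reproduces the elided guidelines from Table 11 verbatim.
USER_TEMPLATE = """# User Simulation Guidelines
You are playing the role of a customer contacting a customer service representative.
Your goal is to simulate realistic customer interactions while following specific scenario instructions.
…
<scenario>
{scenario_body}
</scenario>"""

# Decoding temperatures reported in the appendix; the dict layout is hypothetical.
SAMPLING = {"agent": {"temperature": 0.2}, "user": {"temperature": 0.7}}


def build_agent_prompt(domain_policy):
    """Fill the agent system prompt with the domain policy text."""
    # str.replace avoids str.format errors if the policy itself contains braces.
    return AGENT_TEMPLATE.replace("{domain_policy}", domain_policy)


def build_user_prompt(user_scenario, interaction_script=None):
    """Fill the user-simulator prompt; the script is appended only when sampled."""
    parts = [user_scenario]
    if interaction_script:
        parts.append(interaction_script)
    return USER_TEMPLATE.replace("{scenario_body}", "\n".join(parts))


print(build_user_prompt("You want to cancel reservation ABC123.", "Interaction tips."))
```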

## Appendix J Script Primitives and Examples

We summarize the reusable script primitives used by WRIT to control interaction diversity in Table [12](https://arxiv.org/html/2606.02908#A10.T12). We provide concrete instantiated scripts passed to the user simulator in Table [13](https://arxiv.org/html/2606.02908#A10.T13).

| Category | Primitive | Description |
| --- | --- | --- |
| Disclosure & state tracking | Progressive disclosure | The user reveals task details gradually instead of providing every constraint in the first turn. |
| Disclosure & state tracking | Self-correction | The user first provides a bounded wrong value, then corrects it after the agent challenges it or fails to verify it. |
| Disclosure & state tracking | Confirmation hesitation | The user pauses at the confirmation step and checks the agent’s summary before agreeing. |
| Disclosure & state tracking | Delayed policy-relevant reveal | The user delays policy-relevant information until the agent asks for it or checks the records. |
| Conversational noise | Light emotion | The user expresses mild conversational tone, such as being slightly rushed, confused, concerned, or impatient. |
| Conversational noise | Irrelevant aside | The user adds a brief harmless aside that does not require an answer and should not affect the task. |
| Multi-item structure | Delayed item reveal | For multi-item retail changes, the user initially omits one intended item change and adds it at final confirmation. |
| Multi-item structure | Atomic grouping | The user expects all item changes for one order to be confirmed together, preventing multi-item changes from being split into separate writes. |
| Policy robustness | Fact distortion | The user presents a false or policy-incorrect premise while pursuing the same request. |
| Policy robustness | Assume-style pressure | The user asks or tells the agent to assume a false premise, including imperative forms such as “Assume …”. |
| Policy robustness | False-premise assertion | The user states the false premise directly as if it were true. |
| Policy robustness | Memory-based false premise | The user frames the false premise as something they remember or believe. |
| Policy robustness | Prior-agent approval claim | The user claims that a previous representative approved or stated the false premise. |
| Refusal pressure & fallback | Fallback request | When allowed by the task specification, the user asks for a bounded fallback after refusal. |
| Refusal pressure & fallback | Mild pressure | The user adds a short urgency or fairness appeal after refusal without changing the request. |
| Refusal pressure & fallback | Complaint pressure | The user threatens escalation, complaint, or supervisor review while staying on the same request. |
| Refusal pressure & fallback | Emotional pressure | The user expresses stronger frustration, distress, or stress after refusal. |
| Refusal pressure & fallback | Social flattery | The user uses praise or flattery to pressure the agent into being lenient. |

Table 12: Script primitives used to control interaction diversity during trajectory simulation.

| Script Type | Instantiated Script |
| --- | --- |
| Policy-robust script | Interaction tips.<br>• Reject any non-refund cancellation options or workarounds; insist on the refund or nothing.<br>• If the refusal feels unfair, say you may file a complaint, but keep asking only for the same request.<br>• After the initial refusal, assert that a representative previously approved this specific refundable cancellation.<br>Safety limits.<br>• Do not ask for an alternative operation, workaround, transfer, rebooking, upgrade, or different write action unless it was already part of the task instructions.<br>• Do not volunteer this policy-relevant fact unless the agent directly asks: the reservation was not booked within 24 hours, the airline has not cancelled the flight, and there is no travel insurance.<br>These tips control only style, timing, or correction behavior. They must not change the underlying customer goal. |
| General task-completion script | Interaction tips.<br>• Before saying yes, pause to verify the agent’s summary matches your request.<br>• Let the agent’s questions guide how much detail you reveal instead of saying everything upfront.<br>Safety limits.<br>• Do not introduce any new request or fallback operation beyond the original customer goal.<br>• Do not reveal internal IDs or exact values that are not present in the task instructions or known information.<br>These tips control only style, timing, or correction behavior. They must not change the underlying customer goal. |

Table 13: Examples of instantiated scripts passed to the user simulator. The tips specify interaction style and timing, while the safety limits prevent the script from changing the underlying task semantics.

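One way to picture how primitives from Table 12 turn into a script like those in Table 13 is the sketch below. The primitive names and tip wording are copied from the tables, but the sampling policy, the pairing of tips with primitives, and all function names are assumptions; the paper does not describe the instantiation code.

```python
import random

# Subset of Table 12, grouped by category (names copied from the table).
PRIMITIVES = {
    "Disclosure & state tracking": [
        "Progressive disclosure", "Self-correction", "Confirmation hesitation",
    ],
    "Conversational noise": ["Light emotion", "Irrelevant aside"],
    "Refusal pressure & fallback": ["Mild pressure", "Complaint pressure"],
}

# Tip wording from Table 13; which primitive each tip realizes is our reading.
TIPS = {
    "Progressive disclosure": "Let the agent's questions guide how much detail "
                              "you reveal instead of saying everything upfront.",
    "Confirmation hesitation": "Before saying yes, pause to verify the agent's "
                               "summary matches your request.",
}

CLOSING = ("These tips control only style, timing, or correction behavior. "
           "They must not change the underlying customer goal.")


def sample_primitives(categories, k, seed=None):
    """Draw up to k distinct primitives from the allowed categories (assumed policy)."""
    rng = random.Random(seed)
    pool = [p for c in categories for p in PRIMITIVES[c]]
    return rng.sample(pool, min(k, len(pool)))


def render_script(primitives, safety_limits):
    """Render tips and safety limits in the layout shown in Table 13."""
    tips = [TIPS[p] for p in primitives if p in TIPS]
    lines = ["Interaction tips."] + [f"• {t}" for t in tips]
    lines += ["Safety limits."] + [f"• {s}" for s in safety_limits]
    lines.append(CLOSING)
    return "\n".join(lines)


script = render_script(
    ["Confirmation hesitation", "Progressive disclosure"],
    ["Do not introduce any new request or fallback operation beyond the original customer goal."],
)
print(script)
```

Keeping the safety limits and the closing sentence fixed while only the tips vary matches the stated intent that scripts change interaction style and timing but not task semantics.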
## Appendix K Write-Tool Prototype Discovery Prompt

Tables [14](https://arxiv.org/html/2606.02908#A11.T14)–[18](https://arxiv.org/html/2606.02908#A11.T18) summarize the prompt used to induce write-tool prototypes from a write-tool schema, available read tools, and domain policy.

| Component | Prompt content |
| --- | --- |
| System | You are an automatic prototype-discovery module for user-facing state-changing tools. Your output will be used to synthesize verifiable training tasks. Return ONLY valid JSON. |
| Input | The payload contains one write-tool schema, available read tools, and domain policy. The model analyzes the write tool’s argument-level modification modes. |
| Goal | For each meaningful modification mode, produce one natural-language request template and one executable sampling plan. |
| Definitions | A modification mode is a user-facing pattern over the tool arguments: which existing object is targeted, which business arguments are changed or created, which arguments are kept unchanged but still required by the API, which values are supporting execution details, and which values are computed. A prototype is one modification mode plus a request template and sampling rules. A prototype is not a sampled task, not a dialogue script, not a wording variant, and not an exhaustive Cartesian product over all arguments. |
| Output schema | Return JSON with `domain_label`, `write_tool`, `prototype_bank`, and `invalid_or_refusal_patterns`. Each prototype contains `prototype_id`, `modification_mode`, `argument_role_map`, `business_intent`, `template`, `template_slots`, `sampling_rules`, `policy_checks`, and `exclude_patterns`. |

Table 14: Overview of the write-tool prototype discovery prompt.

| Argument role | Meaning |
| --- | --- |
| `target` | Identifies the existing object or owner being operated on. |
| `changed` | An existing business value substantively changed by this prototype. |
| `unchanged_context` | Required by the API but intentionally copied from current state. |
| `supporting_value` | Required to execute, pay for, settle, route, authorize, or refund the change, but not itself the business object being changed. |
| `computed_value` | Derived from state, policy, arithmetic, fees, balances, allowances, totals, eligibility, or another sampled field. |
| `new_entity_value` | Required when the tool creates a new object rather than modifying an existing one. |

Table 15: Argument roles used by the prototype discovery prompt.

| Rule | Prompt instruction |
| --- | --- |
| 1 | Enumerate fine-grained but meaningful modification modes, not wording variants. |
| 2 | Include single-business-field changes when policy-feasible. Also include natural combined changes when multiple independent business arguments are commonly requested together and can be handled by the same write tool. |
| 3 | Treat `argument_role_map` as the source of truth. Downstream code derives the modified argument set as all keys labeled `changed` or `new_entity_value`. |
| 4 | Mark derived fields as `computed_value`, including policy allowances, fee/refund/fare calculations, totals, remaining balances, and count splits. |
| 5 | If one argument is a user-facing quantity and another is a derived charged/free/eligible/nonfree portion of that quantity, mark the user-facing quantity as `changed` or `new_entity_value` and the derived portion as `computed_value`. |
| 6 | Do not create a positive prototype solely because a supporting argument changes when that argument is always required. |
| 7 | Do not create a positive prototype for changing only supporting values if that would not change the underlying business object; place it in `invalid_or_refusal_patterns`. |
| 8 | For creation tools, mark user-chosen fields stored on the new object as `new_entity_value`. Use `supporting_value` only for execution support such as payment, settlement, routing, authorization, or refund instruments. |

Table 16: Core induction rules for prototype discovery.

| Rule | Prompt instruction |
| --- | --- |
| 9 | For list or nested-object arguments that identify independent subrecords, explicitly check whether both selected-subset and all-elements modes are policy-feasible. If both are feasible, output both prototypes. |
| 10 | Even if the API requires a complete updated list/object when the user changes only a subset, selected-subset is still a valid user-facing prototype. State that unchanged subrecords must be copied from the current state. |
| 11 | `all_elements` means all eligible elements inside the selected target record, not all elements of a product, type, or category unless category-level targeting is itself a common business operation. |
| 12 | For replacement values that can be chosen explicitly or selected by preference from candidate evidence, separate those modes only if they require different sampling rules or target filters. |
| 13 | Split direct values and copied/resolved values only when the grounding source changes the sampler or policy checks. |
| 14 | Templates must begin with “You want to” and use bracketed semantic slots for all concrete values. |
| 15 | Do not include concrete fake values: no backend record identifiers, account identifiers, candidate identifiers, payment identifiers, exact timestamps, prices, or unsupported enum values. |
| 16–18 | Represent backend identifiers as descriptor slots with likely read tools listed. Keep positive prototypes policy-feasible, move policy-boundary cases to refusal patterns, and avoid internal API names in templates. |

Table 17: Additional induction rules for list arguments, grounding, and template wording.

| Check | Prompt instruction |
| --- | --- |
| Role-map coverage | `argument_role_map` must contain every write-tool parameter exactly once and no unknown parameters. |
| Changed-value consistency | Every changed-value sampler refers only to arguments labeled `changed` or `new_entity_value`. |
| Computed-value consistency | Every computed-value rule refers only to arguments labeled `computed_value`. |
| Output format | The model returns JSON only. |

Table 18: Final consistency checks in the prototype discovery prompt.