How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

arXiv cs.CL Papers

Summary

This paper systematically measures behavioral reproducibility of LLM agents in multi-step tool-calling pipelines across 1,140 traces, finding a 'structural consistency, parametric variance' pattern where agents reliably select tools in the same order but vary in arguments, and that structural consistency predicts task success.

arXiv:2605.28840v1 Announce Type: new Abstract: Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct-style agents(search-only, free-text actions), we study the richer setting of structured tool-calling interfaces with typed parameters and consequential side effects.
Original Article
View Cached Full Text

Cached at: 05/29/26, 09:13 AM

# How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines
Source: [https://arxiv.org/html/2605.28840](https://arxiv.org/html/2605.28840)
\(April 2026\)

###### Abstract

Large language model \(LLM\) agents with tool\-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains underexplored:*does the same agent behave the same way twice?*We present a systematic empirical study ofbehavioral consistency in multi\-step tool\-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations\. Unlike prior work on consistency in ReAct\-style agents \(search\-only, free\-text actions\), we study the richer setting of*structured tool\-calling interfaces*with typed parameters and consequential side effects\.

Using a benchmark of 19 tasks across five categories—data retrieval, scheduling, computation, multi\-tool composition, and ambiguous requests—we evaluate six models from three providers \(OpenAI, Anthropic, Meta/Together\) across 1,140 total agent traces\. We identify a“structural consistency, parametric variance” pattern: agents reliably select the same tools in the same order \(meanTSS=0\.87\\mathrm\{TSS\}=0\.87, 95 % CI\[0\.84,0\.90\]\[0\.84,0\.90\]\) but vary substantially in the arguments they provide \(meanAC=0\.69\\mathrm\{AC\}=0\.69,\[0\.64,0\.74\]\[0\.64,0\.74\]\); this gap is large \(Cohen’sd=0\.75d=0\.75\) and highly significant \(p<10−13p<10^\{\-13\}\)\.

We additionally establish that: \(1\)ambiguous task specificationsreduce argument consistency by 28 % relative to structured tasks \(d=0\.74d=0\.74,p=0\.001p=0\.001\), a stronger effect than model selection \(η2=0\.08\\eta^\{2\}=0\.08, n\.s\.\); \(2\)60 % of behavioral divergenceoriginates in the first two pipeline steps; \(3\)natural language outputs almost never match\(<<5 % exact match\) even when tool sequences are identical; and \(4\)models differ significantly in structural but not argument consistency\(F=3\.52F=3\.52,η2=0\.15\\eta^\{2\}=0\.15,p=0\.003p=0\.003forTSS\\mathrm\{TSS\}; n\.s\. forAC\\mathrm\{AC\}\)\. Critically, a correctness analysis shows thatstructural consistency predicts task success—high\-TSS\\mathrm\{TSS\}conditions achieve 90\.2 % correctness versus 61\.2 % for low\-TSS\\mathrm\{TSS\}\(d=0\.81d=0\.81,p<0\.001p<0\.001\)—while argument\-level variance is benign \(r=0\.12r=0\.12,p=0\.31p=0\.31, n\.s\.\)\. This makesTSS\\mathrm\{TSS\}a lightweight, correctness\-free proxy for agent reliability, actionable without ground\-truth labels\.

We release all code, benchmark definitions, raw traces, and analysis scripts at[https://github\.com/Abelo9996/agent\-consistency](https://github.com/Abelo9996/agent-consistency)\.

## 1Introduction

The deployment of LLM\-based agents with tool\-calling capabilities has accelerated rapidly, with production systems now using agents to search databases, send emails, manage calendars, and orchestrate multi\-step workflows through structured function calls\(Qin et al\.,[2024](https://arxiv.org/html/2605.28840#bib.bib9); Li et al\.,[2023](https://arxiv.org/html/2605.28840#bib.bib3); Schick et al\.,[2023](https://arxiv.org/html/2605.28840#bib.bib11)\)\. As these systems mature, a reliability question that has received surprisingly little systematic attention comes to the fore:*if you run the same agent on the same task twice, do you get the same behavior?*

This question has immediate practical consequences:

- •Testing\.If agents behave differently across runs, unit tests asserting on outputs are flaky by design\. Reliable behavioral invariants are needed to test that an agent “does the right thing”\(Kapoor and Narayanan,[2024](https://arxiv.org/html/2605.28840#bib.bib1)\)\.
- •Debugging\.Failure reproducibility is a prerequisite for root\-cause analysis\. Behavioral variance makes production failures intermittent and difficult to trace\.
- •Safety and auditability\.High\-stakes deployments require guarantees that the agent will not take an unexpected action on a re\-run \(e\.g\., sending duplicate emails, creating conflicting calendar events\)\(Weidinger et al\.,[2021](https://arxiv.org/html/2605.28840#bib.bib14)\)\.
- •Cost optimization\.If behavioral variance is predictable from task characteristics, consistency\-aware routing can assign tasks to cheaper models when variance is tolerable and reserve high\-consistency models for critical workflows\.

#### The gap in prior work\.

Mehta et al\. \([2026](https://arxiv.org/html/2605.28840#bib.bib6)\)studied behavioral consistency in ReAct\-style agents\(Yao et al\.,[2023](https://arxiv.org/html/2605.28840#bib.bib15)\)on HotpotQA, finding that agents produce 2\.0–4\.2 distinct action sequences per 10 runs and that inconsistency predicts failure\. This is important but limited to*search\-only*actions in a question\-answering setting\. Real\-world agents operate over*structured tool\-calling interfaces*with typed parameters, multiple heterogeneous tools, and multi\-step pipelines\. The distinction matters: tool calls are discrete typed objects \(not free\-form text\), have observable side effects \(emails sent, events created\), and compose in sequences where early divergence propagates through subsequent steps\. It is unclear whether consistency patterns from ReAct agents transfer to this richer action space\.

#### This work\.

We extend the study of behavioral consistency to multi\-step tool\-calling agents across diverse task types, making the following contributions:

1. 1\.Benchmark\.19 tasks spanning five categories, paired with 10 deterministic simulated tools that isolate LLM variance from environmental non\-determinism \(Section[4\.1](https://arxiv.org/html/2605.28840#S4.SS1)\)\.
2. 2\.Formal metric framework\.Formal definitions of Tool Sequence Similarity \(TSS\\mathrm\{TSS\}\), Argument Consistency \(AC\\mathrm\{AC\}\), divergence point, and output agreement, targeting distinct behavioral layers \(Section[3\.2](https://arxiv.org/html/2605.28840#S3.SS2)\)\.
3. 3\.The structural/parametric distinction\.TSS=0\.87\\mathrm\{TSS\}=0\.87substantially exceedsAC=0\.69\\mathrm\{AC\}=0\.69across all models and categories \(d=0\.75d=0\.75,p<10−13p<10^\{\-13\}\); this is a novel finding not captured by prior single\-metric consistency studies \(Section[5\.1](https://arxiv.org/html/2605.28840#S5.SS1)\)\.
4. 4\.Correctness validation\.TSS\\mathrm\{TSS\}predicts task correctness \(d=0\.81d=0\.81,p<0\.001p<0\.001\) whileAC\\mathrm\{AC\}does not \(r=0\.12r=0\.12, n\.s\.\), ruling out trivial “consistently wrong” explanations and establishingTSS\\mathrm\{TSS\}as a reliability proxy \(Section[5\.5](https://arxiv.org/html/2605.28840#S5.SS5)\)\.
5. 5\.Actionable guidelinesgrounded in effect sizes for testing, monitoring, and model selection in production deployments \(Section[6\.3](https://arxiv.org/html/2605.28840#S6.SS3)\)\.

## 2Related Work

#### Behavioral consistency in LLM agents\.

Mehta et al\. \([2026](https://arxiv.org/html/2605.28840#bib.bib6)\)measured consistency in ReAct agents on HotpotQA using search\-only actions, finding inconsistency predicts failure\. We extend to typed tool calls with diverse tool sets and task types, and introduce the structural/parametric distinction absent from prior work\.Wang et al\. \([2023](https://arxiv.org/html/2605.28840#bib.bib13)\)showed that sampling multiple reasoning chains and marginalizing improves accuracy; this*leverages*variance rather than characterizing its structure, and focuses on reasoning rather than tool calls\.Renze and Guven \([2024](https://arxiv.org/html/2605.28840#bib.bib10)\)studied self\-reflection in LLM agents and found mixed effects, suggesting that deliberate consistency\-improving strategies do not uniformly help\.

#### LLM reliability and robustness\.

Sclar et al\. \([2024](https://arxiv.org/html/2605.28840#bib.bib12)\)quantified LLM sensitivity to prompt formatting, finding large swings from superficial changes\.Lu et al\. \([2022](https://arxiv.org/html/2605.28840#bib.bib5)\)demonstrated order sensitivity in few\-shot prompts\.Perez et al\. \([2022](https://arxiv.org/html/2605.28840#bib.bib8)\)showed that standard NLP evaluations underestimate robustness failures\. Our focus is orthogonal: we hold the input fixed and measure variance across repeated identical invocations—the consistency a deployed system must have by definition\.

#### Agent capability evaluation\.

AgentBench\(Liu et al\.,[2024](https://arxiv.org/html/2605.28840#bib.bib4)\), ToolBench\(Qin et al\.,[2024](https://arxiv.org/html/2605.28840#bib.bib9)\), API\-Bank\(Li et al\.,[2023](https://arxiv.org/html/2605.28840#bib.bib3)\), and Gorilla\(Patil et al\.,[2023](https://arxiv.org/html/2605.28840#bib.bib7)\)evaluate whether agents*can*solve tasks\.Schick et al\. \([2023](https://arxiv.org/html/2605.28840#bib.bib11)\)showed LLMs can learn tool use from self\-generated demonstrations\.Kapoor and Narayanan \([2024](https://arxiv.org/html/2605.28840#bib.bib1)\)surveyed pitfalls in AI agent evaluations, noting that variance across runs is rarely reported\. We evaluate not whether agents succeed but whether they*consistently*succeed in the same way—an orthogonal, understudied dimension\.

#### Reliability in ML systems\.

The broader ML reliability literature addresses distribution shift, uncertainty quantification\(Lakshminarayanan et al\.,[2017](https://arxiv.org/html/2605.28840#bib.bib2)\), and out\-of\-distribution detection—all concerned with performance under changed*inputs*\. We study consistency under repeated*identical*inputs, a related but distinct concern that matters specifically for agentic systems where tasks are retried and behavioral reproducibility is a correctness criterion in its own right\.

## 3A Framework for Agent Behavioral Consistency

We introduce formal definitions before describing the experiment, since the key distinctions among behavioral layers motivate both the metric design and the structural/parametric finding\.

### 3\.1Agent Execution Model

###### Definition 1\(Agent Trace\)\.

A*trace*τ\\tauis a sequence of tool callsτ=\(c1,c2,…,ck\)\\tau=\(c\_\{1\},c\_\{2\},\\ldots,c\_\{k\}\), where each callci=\(𝚗𝚊𝚖𝚎i,𝐚i\)c\_\{i\}=\(\\mathtt\{name\}\_\{i\},\\,\\mathbf\{a\}\_\{i\}\)consists of a tool name𝚗𝚊𝚖𝚎i∈𝒯\\mathtt\{name\}\_\{i\}\\in\\mathcal\{T\}and an argument map𝐚i:𝒦→𝒱\\mathbf\{a\}\_\{i\}:\\mathcal\{K\}\\to\\mathcal\{V\}, followed by a final natural language responser∈Σ∗r\\in\\Sigma^\{\*\}\.

###### Definition 2\(Behavioral Consistency\)\.

Given taskqqand modelℳ\\mathcal\{M\}, let\{τ\(j\)\}j=1N\\\{\\tau^\{\(j\)\}\\\}\_\{j=1\}^\{N\}beNNindependent traces from runningℳ\\mathcal\{M\}onqqwith identical context\.*Behavioral consistency*is the degree to which these traces are similar under a metricd​\(⋅,⋅\)d\(\\cdot,\\cdot\)over traces\.

This framing reveals a natural hierarchy of behavioral layers:

1. 1\.Structural layer: the sequence of tool names\(𝚗𝚊𝚖𝚎1,…,𝚗𝚊𝚖𝚎k\)\(\\mathtt\{name\}\_\{1\},\\ldots,\\mathtt\{name\}\_\{k\}\)—the agent’s*procedural choice*\.
2. 2\.Argument layer: the argument maps𝐚i\\mathbf\{a\}\_\{i\}at each step—*how*the procedure is parameterized\.
3. 3\.Output layer: the final responserr—the surface\-level text the user sees\.

Our central hypothesis, motivated by how LLMs acquire tool\-use behavior through fine\-tuning:

###### Hypothesis 1\(Structural Consistency, Parametric Variance\)\.

For multi\-step tool\-calling agents, structural consistency significantly exceeds argument consistency:𝔼​\[TSS\]≫𝔼​\[AC\]\\mathbb\{E\}\[\\mathrm\{TSS\}\]\\gg\\mathbb\{E\}\[\\mathrm\{AC\}\]\.

The intuition is that RLHF and SFT fine\-tuning on tool\-use data reinforces correct*procedure selection*—which carries a cleaner training signal—while*argument instantiation*remains more sensitive to sampling\-time variation\. Figure[1](https://arxiv.org/html/2605.28840#S3.F1)illustrates the pattern concretely\.

![Refer to caption](https://arxiv.org/html/2605.28840v1/x1.png)Figure 1:The “structural consistency, parametric variance” pattern\. Two independent runs of the same agent on the same task produce identical tool sequences \(green checkmarks\) but diverge in argument values \(orange “≠\\neq” annotations\)\. Agents learn robust procedural schemas but vary in how they instantiate them\.
### 3\.2Formal Metric Definitions

###### Definition 3\(Tool Sequence Similarity,TSS\\mathrm\{TSS\}\)\.

Let𝐬\(j\)=\(𝚗𝚊𝚖𝚎1\(j\),…,𝚗𝚊𝚖𝚎kj\(j\)\)\\mathbf\{s\}^\{\(j\)\}=\(\\mathtt\{name\}\_\{1\}^\{\(j\)\},\\ldots,\\mathtt\{name\}\_\{k\_\{j\}\}^\{\(j\)\}\)be the tool\-name sequence of tracejj\. Define

TSS​\(\{τ\(j\)\}\)=1\(N2\)​∑j<j′\(1−EditDist​\(𝐬\(j\),𝐬\(j′\)\)max⁡\(\|𝐬\(j\)\|,\|𝐬\(j′\)\|\)\),\\mathrm\{TSS\}\\bigl\(\\\{\\tau^\{\(j\)\}\\\}\\bigr\)\\;=\\;\\frac\{1\}\{\\binom\{N\}\{2\}\}\\sum\_\{j<j^\{\\prime\}\}\\\!\\left\(1\-\\frac\{\\mathrm\{EditDist\}\(\\mathbf\{s\}^\{\(j\)\},\\,\\mathbf\{s\}^\{\(j^\{\\prime\}\)\}\)\}\{\\max\(\|\\mathbf\{s\}^\{\(j\)\}\|,\\,\|\\mathbf\{s\}^\{\(j^\{\\prime\}\)\}\|\)\}\\right\),whereEditDist\\mathrm\{EditDist\}is the Levenshtein distance over tool\-name tokens\.TSS∈\[0,1\]\\mathrm\{TSS\}\\in\[0,1\], withTSS=1\\mathrm\{TSS\}=1iff allNNtraces share the same tool\-name sequence\.

###### Definition 4\(Argument Consistency,AC\\mathrm\{AC\}\)\.

Align traces by step index\. For stepiiand trace pair\(j,j′\)\(j,j^\{\\prime\}\)both reaching stepii, let𝐟​\(ci\(j\)\)=\{\(k,v\):v=𝐚i\(j\)​\(k\)\}\\mathbf\{f\}\(c\_\{i\}^\{\(j\)\}\)=\\\{\(k,v\):v=\\mathbf\{a\}\_\{i\}^\{\(j\)\}\(k\)\\\}be the flattened key\-value set\. Then

AC​\(\{τ\(j\)\}\)=1\|𝒮\|​∑\(j,j′,i\)∈𝒮\|𝐟​\(ci\(j\)\)∩𝐟​\(ci\(j′\)\)\|\|𝐟​\(ci\(j\)\)∪𝐟​\(ci\(j′\)\)\|,\\mathrm\{AC\}\\bigl\(\\\{\\tau^\{\(j\)\}\\\}\\bigr\)\\;=\\;\\frac\{1\}\{\|\\mathcal\{S\}\|\}\\sum\_\{\(j,j^\{\\prime\},i\)\\,\\in\\,\\mathcal\{S\}\}\\frac\{\|\\mathbf\{f\}\(c\_\{i\}^\{\(j\)\}\)\\cap\\mathbf\{f\}\(c\_\{i\}^\{\(j^\{\\prime\}\)\}\)\|\}\{\|\\mathbf\{f\}\(c\_\{i\}^\{\(j\)\}\)\\cup\\mathbf\{f\}\(c\_\{i\}^\{\(j^\{\\prime\}\)\}\)\|\},where𝒮\\mathcal\{S\}indexes all step\-aligned pairs\. When traces call different tools at stepii, their key\-value sets are disjoint, yieldingAC=0\\mathrm\{AC\}=0for that step—soAC\\mathrm\{AC\}partially captures structural divergence as well\. Our correctness analysis \(Section[5\.5](https://arxiv.org/html/2605.28840#S5.SS5)\) empirically disentangles the two effects\.

Additional metrics:Unique Sequencescounts distinct tool\-name sequences acrossNNruns;Divergence Pointis the mean step at which a trace pair first differs;Output Agreementis the exact\-match rate of final responses\.

## 4Methodology

### 4\.1Task Benchmark

We design 19 tasks across five categories of increasing ambiguity \(TableLABEL:tab:full\_tasks, Appendix[Appendix A\. Full Task Benchmark](https://arxiv.org/html/2605.28840#Ax1)\):

- •Data Retrieval\(4 tasks\): Contact lookup, email search, aggregation\. Clear instructions with deterministic correct tool sequences\.
- •Scheduling\(4 tasks\): Calendar creation, free\-slot finding, conflict detection\. Require temporal reasoning with well\-defined procedures\.
- •Computation\(3 tasks\): Inventory valuation, revenue projection\. Require numeric tool calls with specific arguments\.
- •Multi\-Tool Composition\(4 tasks\): Chains of 3–5 different tools \(e\.g\., “find email→\\tolook up sender→\\toschedule meeting”\)\.
- •Ambiguous\(4 tasks\): Intentionally underspecified requests where multiple valid strategies exist \(e\.g\., “Help me prepare for my meetings tomorrow”\)\.

Task difficulty is assigned by expected tool\-call count:*easy*\(1–2 calls\),*medium*\(2–3\),*hard*\(3\+\)\. We validate this: hard tasks require significantly more tool calls \(mean3\.0±2\.43\.0\\pm 2\.4\) than easy tasks \(1\.9±0\.71\.9\\pm 0\.7; Spearmanρ=0\.26\\rho=0\.26,p=0\.004p=0\.004\)\.

### 4\.2Tool Environment

We implement 10 deterministic simulated tools spanning 7 domains \(contacts, calendar, email, products, weather, calculations, notes; see Appendix[Appendix B\. Tool Descriptions](https://arxiv.org/html/2605.28840#Ax2)\)\. All tools aredeterministic: identical inputs always produce identical outputs\. This design isolates LLM variance from environmental non\-determinism, ensuring observed behavioral differences arise solely from the model’s generation process\. Tool schemas follow the OpenAI function\-calling format and are adapted to provider\-native formats at runtime\.

### 4\.3Agent Framework

Each run follows a standard tool\-calling loop: \(1\) the model receives a system prompt111Full prompt in Appendix[Appendix C\. System Prompt](https://arxiv.org/html/2605.28840#Ax3)\. Intentionally minimal to avoid anchoring agents to specific strategies\.and the user task; \(2\) the model responds with structured tool calls; \(3\) tools execute deterministically; \(4\) the model may call additional tools or produce a final response; \(5\) the loop continues for up to 10 iterations\. We use each provider’s native tool\-calling API\.

Temperature is set to 1\.0\(the default for most providers\)\. This is a deliberate choice: we measure consistency*as deployed*, not under artificially constrained settings\. Temperature ablation is a known limitation and an important direction for future work \(Section[6\.5](https://arxiv.org/html/2605.28840#S6.SS5)\)\.

### 4\.4Models

We evaluate six models spanning three providers and multiple capability tiers:OpenAI: GPT\-4o\-mini, GPT\-4o, GPT\-4\.1\-mini, GPT\-4\.1;Anthropic: Claude Sonnet 4;Meta/Together AI: Llama 3\.3 70B Instruct Turbo\. This covers flagship models \(GPT\-4\.1, Sonnet 4\), cost\-optimized variants \(GPT\-4o\-mini, GPT\-4\.1\-mini\), and an open\-source model \(Llama 3\.3\)\. Each model runs each of the 19 tasks 10 times, yielding 1,140 agent traces\.222Partial o1 results \(7 of 19 tasks\) are shown in Figure[2](https://arxiv.org/html/2605.28840#S5.F2)but excluded from aggregate statistics, as o1 uses constrained chain\-of\-thought decoding with no temperature parameter\. Claude Haiku 3\.5 was also evaluated but excluded due to a\>\>15 % rate of malformed tool\-call responses\.

### 4\.5Correctness Evaluation

To validate that consistency predicts meaningful outcomes, we retrospectively score all 1,140 traces using a three\-component rubric:

1. 1\.Required tool coverage: did the agent invoke all tools necessary for the task?
2. 2\.Argument validity: do key arguments match expected patterns \(e\.g\.,send\_email\.to∼\\sim/alice@exampleċom/,create\_calendar\_event\.date=2026\-03\-02\)?
3. 3\.Output completeness: does the final response address the user’s request \(regex\-matched against expected output patterns\)?

A trace is scored correct iff all applicable criteria are satisfied\. Full per\-task criteria are in Appendix[Appendix D\. Correctness Criteria \(Illustrative Subset\)](https://arxiv.org/html/2605.28840#Ax4)and the code release\.

### 4\.6Statistical Analysis

We report means with 95 % CIs via thett\-distribution, Cohen’sddfor effect sizes, andpp\-values from paired or independenttt\-tests as appropriate\. Cross\-model comparisons use one\-way ANOVA withη2\\eta^\{2\}\. We acknowledge that a linear mixed\-effects model \(treating tasks as random effects\) would be more appropriate given the shared task structure across models; the simple ANOVA is conservative but adequate given our sample sizes\. We test five primary hypotheses; all survive Bonferroni correction atα=0\.01\\alpha=0\.01\. Split\-half reliability \(TSS:r=0\.66r=0\.66,p<10−16p<10^\{\-16\},n=125n=125\) confirms moderate metric stability atN=10N=10\.

## 5Results

### 5\.1Structural Consistency with Parametric Variance

Hypothesis[1](https://arxiv.org/html/2605.28840#Thmhypothesis1)is confirmed\. Agents exhibit*structural consistency with parametric variance*: meanTSS=0\.87\\mathrm\{TSS\}=0\.87, 95 % CI\[0\.84,0\.90\]\[0\.84,0\.90\], versus meanAC=0\.69\\mathrm\{AC\}=0\.69,\[0\.64,0\.74\]\[0\.64,0\.74\]\(pairedtt\-test:t=8\.41t=8\.41,p<10−13p<10^\{\-13\}; Cohen’sd=0\.75d=0\.75\)\. The pattern holds across all models and categories \(Table[1](https://arxiv.org/html/2605.28840#S5.T1), Figure[2](https://arxiv.org/html/2605.28840#S5.F2)\)\.

Finding 1\.Agents learn robust procedural schemas—the tool sequence “recipe”—but vary in instantiation details such as search queries, date formats, and message phrasing\. The structural/parametric gap isd=0\.75d=0\.75,p<10−13p<10^\{\-13\}\.

Table 1:Cross\-model consistency \(19 tasks, 10 runs each\)\.†Partial results only \(7 tasks\); excluded from aggregate statistics\.![Refer to caption](https://arxiv.org/html/2605.28840v1/x2.png)Figure 2:Model comparison acrossTSS\\mathrm\{TSS\}\(left\) andAC\\mathrm\{AC\}\(right\) with 95 % CIs\. Llama 3\.3 70B is clearly separated inTSS\\mathrm\{TSS\}; theAC\\mathrm\{AC\}ranking is dominated by task\-level factors \(ANOVA n\.s\.\)\.
### 5\.2Ambiguity Is the Dominant Driver of Inconsistency

Task category strongly affects consistency \(Table[2](https://arxiv.org/html/2605.28840#S5.T2), Figure[3](https://arxiv.org/html/2605.28840#S5.F3)\)\. Ambiguous tasks have substantially lower AC \(0\.520\.52vs\.0\.720\.72for structured; Cohen’sd=0\.74d=0\.74,t=3\.34t=3\.34,p=0\.001p=0\.001\) and lower TSS \(0\.790\.79vs\.0\.890\.89;d=0\.58d=0\.58,p=0\.010p=0\.010\)\. The ambiguity effect on AC \(d=0\.74d=0\.74\) exceeds the between\-model effect \(η2=0\.08\\eta^\{2\}=0\.08, n\.s\.\), establishing thattask specification quality is a stronger lever on consistency than model selection\.

Table 2:Consistency by task category \(mean across 6 models\)\. With 3–4 tasks per category, the ambiguous\-vs\.\-structured contrast \(p=0\.001p=0\.001\) is the primary finding; fine\-grained category comparisons are exploratory\.Finding 2\.Task specification quality \(d=0\.74d=0\.74\) is a stronger lever on consistency than model selection \(η2=0\.08\\eta^\{2\}=0\.08, n\.s\.\)\. Engineering effort is better spent reducing ambiguity than switching models\.

![Refer to caption](https://arxiv.org/html/2605.28840v1/x3.png)\(a\)By task category\.
![Refer to caption](https://arxiv.org/html/2605.28840v1/x4.png)\(b\)By task difficulty\.

Figure 3:Ambiguous tasks show the largest consistency drop; difficulty has a modest additional effect \(ρ=0\.26\\rho=0\.26,p=0\.004p=0\.004forTSS\\mathrm\{TSS\};r=−0\.18r=\-0\.18,p=0\.04p=0\.04\)\.
### 5\.3Divergence Concentrates in Early Steps

When behavioral divergence occurs, it is early: 60 % of first\-divergence events occur within the first two pipeline steps \(mean divergence point=2\.2=2\.2; Figure[4](https://arxiv.org/html/2605.28840#S5.F4)\)\. This yields a practical monitoring strategy:comparing only the first 1–2 tool calls against a reference tracecatches the majority of behavioral variance without full trace inspection\.

![Refer to caption](https://arxiv.org/html/2605.28840v1/x5.png)Figure 4:Distribution of first divergence points\. 60 % of divergence originates in steps 1–2, enabling lightweight early\-step monitoring\.Finding 3\.A monitor that checks only the first two tool calls against a reference trace captures 60 % of all behavioral variance\.

### 5\.4Output Text Is Not a Reliability Signal

Final natural language responses exhibit near\-zero exact\-match rates \(<<5 % overall\) even when the underlying tool\-calling sequences are identical\. This is expected—language generation is inherently variable—but it has a critical implication for testing:agent tests must assert on structured tool\-calling behavior, not natural language output\.Asserting on text is analogous to writing software tests that check log messages rather than return values\.

### 5\.5Consistency Predicts Correctness

We score all 1,140 traces using the rubric in Section[4\.5](https://arxiv.org/html/2605.28840#S4.SS5)\. Overall correctness is 77\.1 %\. More importantly,TSS\\mathrm\{TSS\}is a significant predictor: Pearsonr=0\.32r=0\.32\(p=0\.005p=0\.005\), Spearmanρ=0\.42\\rho=0\.42\(p<0\.001p<0\.001\)\. A median split reveals that high\-TSS\\mathrm\{TSS\}conditions \(TSS≥0\.90\\mathrm\{TSS\}\\geq 0\.90\) achieve 90\.2 % correctness versus 61\.2 % for low\-TSS\\mathrm\{TSS\}\(Cohen’sd=0\.81d=0\.81,p<0\.001p<0\.001; Figure[5](https://arxiv.org/html/2605.28840#S5.F5)\)\.

Crucially,AC\\mathrm\{AC\}does not predict correctness \(r=0\.12r=0\.12,p=0\.31p=0\.31, n\.s\.\)\. Agents may phrase search queries differently, format dates differently, or vary message bodies across runs—none of this impairs task success\. It is*structural*variance—selecting different or missing tools—where failures concentrate\.

![Refer to caption](https://arxiv.org/html/2605.28840v1/x6.png)Figure 5:Left:Correctness rate climbs monotonically across TSS tertiles \(61 %→\\to77 %→\\to90 %\), broken out by category\.Right:AC shows no relationship with correctness \(r=0\.12r=0\.12,p=0\.31p=0\.31, n\.s\.\), confirming that argument\-level variance is benign\.Finding 4\.TSS\\mathrm\{TSS\}predicts task success \(d=0\.81d=0\.81,p<0\.001p<0\.001\) without requiring correctness labels\.AC\\mathrm\{AC\}does not \(r=0\.12r=0\.12, n\.s\.\)\. Structural variance concentrates failures; parametric variance is benign\.

### 5\.6Cross\-Model Differences

Models differ significantly in structural consistency \(ANOVA onTSS\\mathrm\{TSS\}:F=3\.52F=3\.52,p=0\.003p=0\.003,η2=0\.15\\eta^\{2\}=0\.15\) but not in argument consistency \(F=1\.61F=1\.61,p=0\.15p=0\.15,η2=0\.08\\eta^\{2\}=0\.08, n\.s\.\), indicating that AC variation is dominated by task\-level factors\.

GPT\-4\.1\-miniachieves the highestTSS\\mathrm\{TSS\}\(0\.92\) andAC\\mathrm\{AC\}\(0\.81\)\.Llama 3\.3 70Bis significantly below all proprietary models inTSS\\mathrm\{TSS\}\(0\.71,\[0\.61,0\.82\]\[0\.61,0\.82\]\), with 3\.3 unique sequences per task versus 1\.6–2\.2 for proprietary models \(post\-hocp<0\.05p<0\.05vs\. all others\)\.Claude Sonnet 4achieves the second\-highestAC\\mathrm\{AC\}\(0\.76\) while exploring more unique sequences \(2\.2\), suggesting it reliably parameterizes strategies but varies in which strategy it selects\.

The Model×\\timesCategory interaction \(Figure[6](https://arxiv.org/html/2605.28840#S5.F6)\) is notable: Llama 3\.3 70B drops most sharply on ambiguous \(TSS=0\.54\\mathrm\{TSS\}=0\.54\) and composition \(TSS=0\.63\\mathrm\{TSS\}=0\.63\) tasks, while proprietary models maintainTSS\>0\.80\\mathrm\{TSS\}\>0\.80even on ambiguous tasks\.

![Refer to caption](https://arxiv.org/html/2605.28840v1/x7.png)Figure 6:TSS\\mathrm\{TSS\}across models and task categories\. Llama 3\.3 70B shows notably lower consistency on ambiguous and composition tasks; proprietary models remain consistently high\.

## 6Discussion

### 6\.1Why Structural Consistency Exceeds Argument Consistency

The structural/parametric pattern is consistent across all models and task types, suggesting a systematic cause rooted in how LLMs acquire tool\-use behavior\. Training corpora for tool\-calling likely contain many demonstrations of the same high\-level task type solved with the same tool sequence, but with varied argument values \(different names, dates, query strings\) across instances\. RLHF/SFT reinforces correct*procedure selection*—which carries a cleaner correctness signal—while argument instantiation remains more sensitive to sampling\-time variation\. This “procedural schema” interpretation is directly supported by our correctness analysis: argument variance \(AC\\mathrm\{AC\}\) does not predict failure \(r=0\.12r=0\.12, n\.s\.\), meaning the variations are largely semantically equivalent \(e\.g\., “Alice” vs\. “alice” in a case\-insensitive lookup\)\. Structural variance \(TSS\\mathrm\{TSS\}\) predicts failure strongly \(d=0\.81d=0\.81\), because missing or wrong tools are not semantically equivalent to correct ones\.

### 6\.2TSS as a Practical Reliability Proxy

The practical value ofTSS\\mathrm\{TSS\}as a reliability proxy rests on three properties established here: \(1\) it predicts correctness \(d=0\.81d=0\.81\) without requiring correctness labels; \(2\) it can be estimated at deployment time by comparing a run’s tool sequence against a reference; and \(3\) the early\-divergence result \(60 % in steps 1–2\) means monitoring only the first two tool calls suffices\.

This enables a concrete production workflow: \(a\) run the agent on each task typekktimes to establish a reference tool sequence; \(b\) in production, compare each run’s first two tool calls against the reference; \(c\) flag deviations for human review or retry\. The cost isO​\(1\)O\(1\)per step rather than semantic output evaluation, which is often expensive or infeasible online\.

### 6\.3Implications for Agent Deployment

1. 1\.Reduce ambiguity before switching models\.The ambiguity effect on consistency \(d=0\.74d=0\.74\) outweighs the between\-model effect \(η2=0\.08\\eta^\{2\}=0\.08, n\.s\. forAC\\mathrm\{AC\}\)\. Clarifying task specifications yields more consistency gain than upgrading the model\.
2. 2\.Monitor early steps\.60 % of divergence occurs in steps 1–2; a lightweight check on the first two tool calls suffices to flag most inconsistent runs\.
3. 3\.Test tool calls, not text\.Assert on tool names and argument patterns—not natural language output, which varies below 5 % exact match even in consistent runs\.
4. 4\.UseTSS\\mathrm\{TSS\}as an online reliability signal\.A run whose tool sequence deviates from a reference trace is nearly three times more likely to fail; no correctness labels are needed to compute this signal\.
5. 5\.Match model to consistency requirement\.GPT\-4\.1\-mini offers the best structural consistency for consistency\-critical workflows\. Llama 3\.3 70B introduces substantially higher behavioral diversity and may require additional guardrails\.

### 6\.4Relation to Prior Work

Mehta et al\. \([2026](https://arxiv.org/html/2605.28840#bib.bib6)\)reported 2\.0–4\.2 unique action sequences per 10 runs on HotpotQA ReAct agents; we observe 1\.6–3\.3 unique tool\-call sequences, broadly consistent\. Our finer\-grained analysis reveals that the unique\-sequence count masks the structural/parametric distinction: even conditions with\>\>2 unique sequences often share highTSS\\mathrm\{TSS\}because sequences agree on most tool names with only minor reordering\. Prior consistency studies also did not establish a connection to correctness; our correctness analysis \(d=0\.81d=0\.81\) grounds the metric in task outcomes\.

### 6\.5Limitations

- •Sample size\.N=10N=10runs per cell yields split\-half reliabilityr=0\.66r=0\.66\. Aggregate results are well\-powered \(1,140 traces\), but per\-cell CIs are wide\. Future work should useN≥30N\\geq 30\.
- •Single temperature\.We measure atT=1\.0T=1\.0only\. The consistency\-capability trade\-off at other temperatures is important and uncharted; aT=0T=0baseline would substantially strengthen the causal interpretation\.
- •Structured correctness proxy\.Our rubric is not a full semantic evaluation\. Human judges on a random 200\-trace subset would provide stronger validity\. The significant rank correlation \(ρ=0\.42\\rho=0\.42,p<0\.001p<0\.001\) provides initial support\.
- •Limited tasks per category\.3–4 tasks per category makes category\-level estimates imprecise\.≥\\geq10 tasks per category are needed for reliable inference\.
- •Single system prompt\.Consistency may vary substantially with chain\-of\-thought prompting, few\-shot examples, or persona instructions\.
- •Simulated tools\.Deterministic tools isolate LLM variance but may overestimate real\-world consistency where tool outputs themselves vary\.
- •Provider coverage\.Google Gemini and Mistral are absent, limiting generalizability\.
- •AC alignment\.AC aligns arguments by step index, conflating structural and argument divergence at misaligned steps\. A disentangled metric would be more informative\.

### 6\.6Future Directions

1. 1\.Temperature ablation:map the full consistency\-capability Pareto frontier\.
2. 2\.Larger benchmark:≥\\geq10 tasks per category,≥\\geq50 total\.
3. 3\.Human correctness evaluation:human judges on 200 traces\.
4. 4\.Consistency\-aware routing:route tasks to models based on predictedTSS\\mathrm\{TSS\}\.
5. 5\.Multi\-turn and long\-horizon agents:web agents, coding agents\.
6. 6\.Fine\-tuning effects:does RLHF explicitly optimize consistency?
7. 7\.Real API tools:study whether environmental non\-determinism amplifies or dampens LLM behavioral variance\.

## 7Conclusion

We have presented a systematic empirical study of behavioral consistency in multi\-step tool\-calling LLM agents\. Across six models, 19 tasks, and 1,140 agent traces, we identify a “structural consistency, parametric variance” pattern that is large \(d=0\.75d=0\.75\) and highly significant \(p<10−13p<10^\{\-13\}\): agents reliably select the same tools in the same order but vary in argument details\. Critically, only the structural layer predicts task success \(d=0\.81d=0\.81,p<0\.001p<0\.001\), while argument variance is benign \(r=0\.12r=0\.12, n\.s\.\)—validating the pattern and ruling out trivial consistency\. Task specification quality outweighs model selection as a consistency lever, divergence concentrates early in the pipeline, andTSS\\mathrm\{TSS\}serves as a lightweight, correctness\-free reliability proxy\.

These findings suggest that LLMs acquire robust procedural schemas for tool use through training, while argument instantiation remains sensitive to sampling variation\. Understanding and exploiting this structure—testing on tool calls not text, monitoring early steps, flagging low\-TSS\\mathrm\{TSS\}runs—is a practical path toward more reliable agentic systems\.

## Reproducibility

All code, benchmark definitions, tool implementations, raw traces, and analysis scripts are available at[https://github\.com/Abelo9996/agent\-consistency](https://github.com/Abelo9996/agent-consistency)\. The full system prompt, task list, and per\-task correctness criteria are in Appendices[Appendix A\. Full Task Benchmark](https://arxiv.org/html/2605.28840#Ax1)–[Appendix D\. Correctness Criteria \(Illustrative Subset\)](https://arxiv.org/html/2605.28840#Ax4)\.

## References

- Kapoor and Narayanan \(2024\)Kapoor, S\. and Narayanan, A\.AI Agents That Matter\.arXiv preprint arXiv:2407\.01502, 2024\.
- Lakshminarayanan et al\. \(2017\)Lakshminarayanan, B\., Pritzel, A\., and Blundell, C\.Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles\.InAdvances in Neural Information Processing Systems, 2017\.
- Li et al\. \(2023\)Li, M\., Zhao, Y\., Yu, B\., Song, F\., Li, H\., Yu, H\., Li, Z\., Huang, F\., and Li, Y\.API\-Bank: A Comprehensive Benchmark for Tool\-Augmented LLMs\.InProceedings of EMNLP, 2023\.
- Liu et al\. \(2024\)Liu, X\., Yu, H\., Zhang, H\., et al\.AgentBench: Evaluating LLMs as Agents\.InProceedings of ICLR, 2024\.
- Lu et al\. \(2022\)Lu, Y\., Bartolo, M\., Moore, A\., Riedel, S\., and Stenetorp, P\.Fantastically Ordered Prompts and Where to Find Them: Overcoming Few\-Shot Prompt Order Sensitivity\.InProceedings of ACL, 2022\.
- Mehta et al\. \(2026\)Mehta, A\., Ramesh, A\., and Singla, A\.When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM\-Based Agents\.arXiv preprint arXiv:2602\.11619, 2026\.
- Patil et al\. \(2023\)Patil, S\. G\., Zhang, T\., Wang, X\., and Gonzalez, J\. E\.Gorilla: Large Language Model Connected with Massive APIs\.arXiv preprint arXiv:2305\.15334, 2023\.
- Perez et al\. \(2022\)Perez, E\., Kiela, D\., and Cho, K\.True Few\-Shot Learning with Language Models\.InAdvances in Neural Information Processing Systems, 2022\.
- Qin et al\. \(2024\)Qin, Y\., Liang, S\., Ye, Y\., et al\.ToolLLM: Facilitating Large Language Models to Master 16000\+ Real\-World APIs\.InProceedings of ICLR, 2024\.
- Renze and Guven \(2024\)Renze, M\. and Guven, E\.Self\-Reflection in LLM Agents: Effects on Problem\-Solving Performance\.arXiv preprint arXiv:2405\.06682, 2024\.
- Schick et al\. \(2023\)Schick, T\., Dwivedi\-Yu, J\., Dessì, R\., et al\.Toolformer: Language Models Can Teach Themselves to Use Tools\.InAdvances in Neural Information Processing Systems, 2023\.
- Sclar et al\. \(2024\)Sclar, M\., Choi, Y\., Tsvetkov, Y\., and Suhr, A\.Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design\.arXiv preprint arXiv:2310\.11324, 2024\.
- Wang et al\. \(2023\)Wang, X\., Wei, J\., Schuurmans, D\., et al\.Self\-Consistency Improves Chain of Thought Reasoning in Language Models\.InProceedings of ICLR, 2023\.
- Weidinger et al\. \(2021\)Weidinger, L\., Mellor, J\., Rauh, M\., et al\.Ethical and Social Risks of Harm from Language Models\.arXiv preprint arXiv:2112\.04359, 2021\.
- Yao et al\. \(2023\)Yao, S\., Zhao, J\., Yu, D\., et al\.ReAct: Synergizing Reasoning and Acting in Language Models\.InProceedings of ICLR, 2023\.

## Appendix A\. Full Task Benchmark

TableLABEL:tab:full\_taskslists all 19 benchmark tasks used in the study\. Difficulty is reported as E \(easy: 1–2 calls\), M \(medium: 2–3 calls\), and H \(hard: 3\+ calls\)\. Expected tools denote the minimal correct solution pattern\.

Table 3:All 19 benchmark tasks used in the evaluation\.IDDiff\.Task InstructionExpected ToolsData Retrievalretrieve\-001EFind Alice’s email and send “Meeting moved to 3pm tomorrow\.”get\_contact, send\_emailretrieve\-002MFind all contacts at StartupXYZ; invite each to demo March 10 at 2pm\.search\_contacts, send\_emailretrieve\-003MSearch emails about ‘budget’; summarize financial figures\.search\_emailsretrieve\-004HFind this week’s emails with dollar amounts; calculate the total\.search\_emails, calculateSchedulingschedule\-001ESchedule 30\-min “Design Review” with Bob, March 2 at 2pm\.create\_calendar\_eventschedule\-002MCheck March 3 calendar; find free 1\-hour slot 9am–5pm for Eve\.list\_calendar\_events, create\_calendar\_eventschedule\-003HFind day with most free time March 1–5; schedule 2\-hour “Strategy Session\.”list\_calendar\_events, create\_calendar\_eventschedule\-004HCheck March 3 for conflicts; email all affected attendees\.list\_calendar\_events, search\_contacts, send\_emailComputationcompute\-001ETotal inventory value of electronics \(price×\\timesstock, sum\)\.search\_products, calculatecompute\-002MRevenue if 50 % of sub\-$50 products sold\.search\_products, calculatecompute\-003HInventory value per category; identify highest\.search\_products, calculateMulti\-Tool Compositioncompose\-001MFind Acme Corp email; look up Dave; schedule 1\-hour meeting March 4 at 3pm\.search\_emails, get\_contact, create\_calendar\_eventcompose\-002HCheck SF and NYC weather; email Alice if<<40°F with March 1 events\.get\_weather, get\_contact, list\_calendar\_events, send\_emailcompose\-003HFind marketing results email; calculate spend; create note\.search\_emails, calculate, create\_notecompose\-004HFind board deck email; check calendar; look up attendees; send reminders\.search\_emails, list\_calendar\_events, get\_contact, send\_emailAmbiguousambig\-001MHelp me prepare for my meetings tomorrow\.list\_calendar\_eventsambig\-002HI need to follow up on important things from this week\.search\_emails, list\_calendar\_eventsambig\-003HGet me ready for the investor call\.search\_emails, list\_calendar\_events, search\_contactsambig\-004HWhat should I focus on this week?list\_calendar\_events, search\_emails
## Appendix B\. Tool Descriptions

TableLABEL:tab:toolssummarizes the 10 deterministic simulated tools used in our environment\. Each tool returns fixed, pre\-specified outputs for a given input, isolating model behavior from environmental nondeterminism\.

Table 4:Deterministic simulated tools used in the benchmark\.ToolDomainDescriptionget\_contactContactsLook up a contact by name; returns email, phone, and company\.search\_contactsContactsSearch contacts by free\-text query; returns a matching list\.send\_emailEmailSend an email with fields such as recipient, subject, and body; returns a confirmation\.search\_emailsEmailSearch the inbox by query; returns matching message records\.list\_calendar\_eventsCalendarList events for a date or date range\.create\_calendar\_eventCalendarCreate an event with title, date, time, duration, and attendees\.search\_productsProductsSearch inventory; returns price, stock, and category metadata\.calculateMathEvaluate a mathematical expression and return the numeric result\.get\_weatherWeatherReturn current weather for a city, including temperature and conditions\.create\_noteNotesCreate a text note with title and body; returns a confirmation\.
## Appendix C\. System Prompt

The following system prompt was used verbatim for all models and all runs:

> You are a helpful assistant with access to a set of tools\. Use the provided tools to complete the user’s request as accurately and efficiently as possible\. Only call tools when necessary\. If you have all the information you need to respond, do so directly\. When you have completed the task, provide a concise summary of what you did and the results\.

This prompt is intentionally minimal so that we do not anchor models to a particular tool\-selection strategy and thereby artificially inflate measured consistency\.

## Appendix D\. Correctness Criteria \(Illustrative Subset\)

TableLABEL:tab:correctness\_criteriaprovides illustrative examples of the correctness rubric used to score traces\. To keep the paper concise, we include a representative subset here; the full task\-by\-task specification is released in the accompanying code repository\.

Table 5:Illustrative correctness criteria for representative tasks\.Task IDRequired ToolsKey Argument Checksretrieve\-001get\_contact, send\_emailget\_contact\.name∼\\sim/alice/i;send\_email\.to=alice@\.\.\.; body∼\\sim/3 pm/ischedule\-001create\_calendar\_eventtitle∼\\sim/design review/i; date=2026\-03\-02; start\_time=14:00compute\-001search\_products, calculatesearch\_products\.category∼\\sim/electronics/i; output contains numeric resultcompose\-001search\_emails, get\_contact, create\_calendar\_eventRequired sequence: search→\\tolookup→\\tocreateambig\-001list\_calendar\_eventsAny invocation accepted; multiple valid strategies are treated as correct\.

Similar Articles

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Hugging Face Daily Papers

This paper introduces When2Tool, a benchmark to study when LLM agents actually need to call tools, and reveals that models already know tool necessity from hidden states but fail to act. The proposed Probe&Prefill method reduces unnecessary tool calls by 48% with minimal accuracy loss.

Are most LLM eval tools still too prompt-focused?

Reddit r/AI_Agents

The author questions whether current LLM evaluation tools are too focused on isolated prompts rather than full workflows and agent interactions, noting that step-by-step accuracy can mask overall behavioral drift in production.

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

arXiv cs.CL

This paper investigates why LLM agents suffer from progressive capability collapse under multi-iteration experience internalization and proposes a robust recipe addressing experience granularity, injection patterns, and training regime. Key findings include that principle-level experience, step-wise injection, and off-policy context-distillation yield more stable and sustainable continual learning.