The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

arXiv cs.AI Papers

Summary

This paper formalizes the sim-to-real gap for foundation model agents as a Markov Decision Process problem, proposing a unified research agenda to adapt classical solutions like domain randomization for improving agent robustness and reliability in real-world deployment.

arXiv:2606.07017v1 Announce Type: new Abstract: Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process, including Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as a multilingual tool calling to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

# The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
Source: [https://arxiv.org/html/2606.07017](https://arxiv.org/html/2606.07017)
Xiaoou Liu∗, Tiejin Chen∗, Weibo Li, Xiyang Hu, Hua Wei

\(2026\)

###### Abstract\.

Foundation model agents are increasingly deployed for real\-world decision\-making, but suffer from the sim\-to\-real gap\. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon\. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim\-to\-real problem structured entirely around the four elements of a Markov Decision Process, including Observation, Action, Transition, and Reward\. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization\. We provide concrete examples, such as a multilingual tool calling to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent\. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real\-world applications\.

Sim\-to\-real, Reinforcement Learning, LLM Agents, MDP

Conference: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 3–7, 2026; Washington, DC, USA. Copyright: ACM licensed. Journal year: 2026.

## 1\.Introduction

![Refer to caption](https://arxiv.org/html/2606.07017v1/x1.png)

Figure 1. Landscape view of sim-to-real transfer for foundation model agent systems under the MDP decomposition. A policy $\pi$ trained in the simulated MDP $\mathcal{M}_s$ (left) is deployed without modification in the real MDP $\mathcal{M}_r$ (right). Performance degradation arises from discrepancies in four channels (observation, action, transition, and reward), represented as a central "gap wall." Each channel corresponds to distinct real-world shifts (e.g., multilingual observation mismatch, action-space distractors, stochastic tool execution, cost-aware reward signals) and can be stress-tested via controlled perturbations. The same decomposition aligns directly with classical sim-to-real remedies, enabling unified benchmarking and training across $\mathcal{O}$, $\mathcal{A}$, $\mathcal{R}$, and $\mathcal{T}$.

As foundation models mature (Singh et al., [2025](https://arxiv.org/html/2606.07017#bib.bib76); Grattafiori et al., [2024](https://arxiv.org/html/2606.07017#bib.bib79); QwenTeam, [2025](https://arxiv.org/html/2606.07017#bib.bib77)), foundation model (FM) agents and reinforcement learning (RL) policies are increasingly deployed for real-world decision-making, spanning tool-augmented assistants (Li et al., [2023a](https://arxiv.org/html/2606.07017#bib.bib58)), robotic control (Kim et al., [2024](https://arxiv.org/html/2606.07017#bib.bib81)), and autonomous systems (Yao et al., [2025](https://arxiv.org/html/2606.07017#bib.bib80)). While training typically occurs on clean, curated benchmarks where data is abundant and failures are safe, deployment exposes policies to distribution shifts, noisy inputs, unpredictable execution constraints, and stochastic transitions. Consequently, leaderboard performance does not imply real-world deployment reliability (Da et al., [2025](https://arxiv.org/html/2606.07017#bib.bib1)). This simulation-to-reality (sim-to-real) gap is extensively studied in robotics and classical control, which offer mature frameworks to address such discrepancies (Da et al., [2024b](https://arxiv.org/html/2606.07017#bib.bib10); Hanna and Stone, [2017](https://arxiv.org/html/2606.07017#bib.bib6)). However, as FMs become the backbone of agent systems, the community is reinventing the wheel by treating agent robustness as an entirely novel phenomenon rather than a known deployment gap (Zhu et al., [2025](https://arxiv.org/html/2606.07017#bib.bib82); Zhou et al., [2026](https://arxiv.org/html/2606.07017#bib.bib2)). We should correct this isolated perspective and start to solve the sim-to-real gap in FM agents now, because FM policies are rapidly moving from safe simulators into unpredictable production environments, where ignoring established sim-to-real frameworks leaves critical vulnerabilities unmeasured and causes severe real-world failures.

In this paper, our big and bold idea is to formalize the FM agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. Recent evaluations, such as tool-use perturbation benchmarks and studies on multilingual tool calling, are inadvertently rediscovering these exact gaps (Rabinovich and Tavor, [2025](https://arxiv.org/html/2606.07017#bib.bib22); Wang et al., [2026](https://arxiv.org/html/2606.07017#bib.bib23)), but they lack a unifying language. We advocate that the agent community adopt the same family of classical solutions, including domain randomization (Bellemare et al., [2016](https://arxiv.org/html/2606.07017#bib.bib64); Wiltzer et al., [2024](https://arxiv.org/html/2606.07017#bib.bib65)) and grounded action transformation (Hanna and Stone, [2017](https://arxiv.org/html/2606.07017#bib.bib6); Lin and Sun, [2025](https://arxiv.org/html/2606.07017#bib.bib69); Karnan et al., [2020](https://arxiv.org/html/2606.07017#bib.bib8); Da et al., [2023](https://arxiv.org/html/2606.07017#bib.bib11)). Establishing this unified formulation now is critical to prevent fragmented research efforts and to measure deployment readiness accurately.

To push the frontier and offer a bold approach to this sim-to-real gap, we operationalize the classical Markov Decision Process view across the subsequent sections of this paper. Specifically, we build upon recent surveys of sim-to-real methods in reinforcement learning (Da et al., [2025](https://arxiv.org/html/2606.07017#bib.bib1); Zhao et al., [2020](https://arxiv.org/html/2606.07017#bib.bib3)) that organize techniques by the four elements of the Markov Decision Process (MDP): observation ($\mathcal{O}$), action ($\mathcal{A}$), transition ($\mathcal{T}$), and reward ($\mathcal{R}$). We review these four gaps in traditional reinforcement learning in Section 2, including their typical causes and classical mitigation techniques, and then dedicate Section 3 to translating these classical discrepancies into the FM domain. In Section 3, we provide concrete examples for each component and propose how benchmarking paradigms can be designed to systematically expose these vulnerabilities. To demonstrate that such gaps exist in practice, we draw on recent studies on multilingual tool calling (Luo et al., [2026](https://arxiv.org/html/2606.07017#bib.bib24)), which highlight a severe observation space gap: models correctly understand user intent and select the appropriate tool, yet produce parameter values in the user's language that violate strict execution-level conventions, leading to operationally invalid tool calls. Finally, Section 4 outlines concrete research directions to harden agents against these gaps.

If all of our ideas succeed, the outcome would be a fundamental paradigm shift in which the field shares a unified vocabulary for vulnerabilities and universally adopts standardized stress test benchmarks. This collective shift would directly yield a new generation of highly trustworthy agents that maintain performance under multilingual inputs, transition perturbations, and cost-aware reward constraints. Researchers building foundation models would gain a structured methodology for robustness testing, while practitioners deploying autonomous systems would acquire a reliable blueprint for safe production environments. These groups and their deployment challenges represent the core focus of the KDD community. Therefore, this agenda directly advances KDD thematic topics by establishing a rigorous foundation for Trustworthy and Responsible Data Science and shaping Modern AI for reliable real-world Applications.

## 2\.Problem Formulation and Classical Solutions

RL is commonly formulated as a discounted Markov decision process (MDP) (Feinberg and Shwartz, [2012](https://arxiv.org/html/2606.07017#bib.bib4)) $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}(s_{t+1}\mid s_t,a_t)$ is the transition dynamics, $\mathcal{R}$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. The objective is to learn a policy $\pi$ that maximizes the expected discounted return. In sim-to-real settings, the policy is trained in a simulated MDP $\mathcal{M}_s$ and deployed in a real-world MDP $\mathcal{M}_r$. Following prior work (Da et al., [2025](https://arxiv.org/html/2606.07017#bib.bib1)), the *sim-to-real gap* of policy $\pi$ can be defined as $G(\pi):=\psi_s(\pi)-\psi_r(\pi)$, where $\psi_s(\cdot)$ and $\psi_r(\cdot)$ denote the same evaluation metric computed in simulation and in the real environment, respectively. From an MDP perspective, the source of $G(\pi)$ can be attributed to discrepancies in the core elements of the process: observation, action, transition, and reward.
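The gap definition above can be estimated empirically by running the same policy and the same metric in both environments. The following is a minimal sketch, not the authors' protocol: the episode functions, the choice of success rate as $\psi$, and the 20% failure probability are all hypothetical stand-ins chosen for illustration.

```python
import random
from typing import Callable

# Hypothetical interface: an episode function runs a fixed policy once and
# returns 1.0 on success and 0.0 on failure.
Episode = Callable[[random.Random], float]


def evaluate(episode: Episode, n_episodes: int, seed: int) -> float:
    """Monte Carlo estimate of the metric psi (here: success rate)."""
    rng = random.Random(seed)
    return sum(episode(rng) for _ in range(n_episodes)) / n_episodes


def sim_to_real_gap(sim: Episode, real: Episode,
                    n_episodes: int = 2000, seed: int = 0) -> float:
    """G(pi) = psi_s(pi) - psi_r(pi), using the same metric on both sides."""
    return evaluate(sim, n_episodes, seed) - evaluate(real, n_episodes, seed)


def sim_episode(rng: random.Random) -> float:
    # Assumed idealized simulator: the policy always succeeds.
    return 1.0


def real_episode(rng: random.Random) -> float:
    # Assumed deployment setting: 20% of episodes fail (e.g., a tool timeout).
    return 0.0 if rng.random() < 0.2 else 1.0


gap = sim_to_real_gap(sim_episode, real_episode)
print(f"estimated sim-to-real gap: {gap:.3f}")
```

A positive value means the policy looks better in simulation than in deployment; attributing that value to individual MDP elements requires perturbing one element at a time.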

- Observation gap. In sim-to-real RL, the agent's perceived observation in simulation often differs from that in the real world, i.e., $o_t^s \neq o_t^r$, which can induce a systematic performance drop after deployment. There are two common sources of the observation gap. (1) *Perception completeness*: simulators may provide ideal or fully observed signals (e.g., $o_t^s = s_t^s$), whereas real observations are partial, noisy, delayed, or subject to occlusion. (2) *Representation mismatch*: differences in sensor modalities, resolution, calibration, and encoding create a distribution shift in the observation space, even when the underlying task dynamics are similar. Techniques to mitigate the observation gap include the following. *(1) Domain Randomization* randomizes visual and sensing parameters in simulation to encourage invariance and improve out-of-distribution generalization (Tobin et al., [2017](https://arxiv.org/html/2606.07017#bib.bib5)). *(2) Domain Adaptation* aligns simulated and real observation feature distributions, for example via adversarial objectives (Bousmalis et al., [2017](https://arxiv.org/html/2606.07017#bib.bib91)) or embedding alignment (Park et al., [2021](https://arxiv.org/html/2606.07017#bib.bib92)), to reduce cross-domain discrepancy. *(3) Sensor Fusion* combines complementary modalities to reduce reliance on any single biased channel, thereby improving robustness under real-world sensing imperfections.

- Action gap. In sim-to-real RL, actions that are valid and effective in simulation may not translate faithfully to the real world. There are three common sources of the action gap: *(1) Action granularity*: simulators often use discretized or simplified action spaces and assume near-perfect execution, while real control is continuous, fine-grained, and constrained by low-level actuation limits; *(2) Execution uncertainty*: real actuation is stochastic and imperfect, so intended actions can be perturbed or unexpectedly scaled in magnitude (Bellemare et al., [2016](https://arxiv.org/html/2606.07017#bib.bib64); Wiltzer et al., [2024](https://arxiv.org/html/2606.07017#bib.bib65)); *(3) System delay*: real actuators and APIs introduce latency and jitter, so the effective action may be applied at $t+\delta$ rather than immediately as assumed in simulation. Techniques to mitigate the action gap include the following. *(1) Action Shielding* projects or filters proposed actions to ensure feasibility and safety before execution. *(2) Delay-aware Control* explicitly models latency (e.g., constant- or random-delay MDP variants) and trains policies robust to delayed actuation. *(3) Robustification to Actuation Uncertainty* uses action perturbations, noise injection, or robust RL objectives, improving stability under perturbed actions and action-scale shifts (Tan et al., [2020](https://arxiv.org/html/2606.07017#bib.bib66); Liu et al., [2024](https://arxiv.org/html/2606.07017#bib.bib67)).

- Transition gap. Sim-to-real transfer often suffers from *transition gaps*, where the next-state dynamics in simulation diverge from those in the real environment (Hanna and Stone, [2017](https://arxiv.org/html/2606.07017#bib.bib6); Lin and Sun, [2025](https://arxiv.org/html/2606.07017#bib.bib69)). Such gaps arise from inaccurate or incomplete modeling of real dynamics. Techniques to mitigate the transition gap include the following. *(1) Transition-level Domain Randomization* perturbs dynamics parameters to train policies robust to model error (Valassakis et al., [2020](https://arxiv.org/html/2606.07017#bib.bib13); Mehta et al., [2020](https://arxiv.org/html/2606.07017#bib.bib12)). *(2) Grounding Methods* (Hanna and Stone, [2017](https://arxiv.org/html/2606.07017#bib.bib6); Desai et al., [2020b](https://arxiv.org/html/2606.07017#bib.bib7); Karnan et al., [2020](https://arxiv.org/html/2606.07017#bib.bib8); Desai et al., [2020a](https://arxiv.org/html/2606.07017#bib.bib9)) learn a transformation to map simulated transitions to real transitions (Da et al., [2023](https://arxiv.org/html/2606.07017#bib.bib11); Karnan et al., [2020](https://arxiv.org/html/2606.07017#bib.bib8)). *(3) Distributionally Robust Learning* optimizes policies to perform well under unknown-but-bounded transition shifts (Smirnova et al., [2019](https://arxiv.org/html/2606.07017#bib.bib70)).

- Reward gap. In sim-to-real RL, the reward specified in simulation may not faithfully reflect the real-world objective due to incomplete task modeling or cascading effects of observation, action, and transition mismatches (Li et al., [2023b](https://arxiv.org/html/2606.07017#bib.bib72)). For example, delayed or mis-executed actions can alter realized outcomes and therefore change the reward received (Kim et al., [2026b](https://arxiv.org/html/2606.07017#bib.bib74)). Techniques to mitigate the reward gap include the following. *(1) Reward shaping* provides denser and more informative feedback while preserving the optimal policy under suitable conditions, as in potential-based shaping (Badnava et al., [2023](https://arxiv.org/html/2606.07017#bib.bib86)). Reward shaping can also incorporate structured priors such as automaton-guided shaping to better handle sparse objectives (Velasquez et al., [2021](https://arxiv.org/html/2606.07017#bib.bib88)). *(2) Reward augmentation* uses limited real-environment data to refine or supplement return signals, for instance by matching sim–real trajectory distributions or by augmenting returns for return-conditioned learning. This improves transferred policy performance under data scarcity (Guo et al., [2024](https://arxiv.org/html/2606.07017#bib.bib87)).
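To make the potential-based shaping mentioned under the reward gap concrete, the sketch below uses the standard form $r'_t = r_t + \gamma\Phi(s_{t+1}) - \Phi(s_t)$ on a toy trajectory. The chain of integer states, the sparse reward, and the potential $\Phi(s)=s$ are hypothetical choices for illustration only. The point it checks numerically is that the shaped and unshaped discounted returns differ by $\gamma^T\Phi(s_T) - \Phi(s_0)$, a term that depends only on the first and last states and not on the actions taken in between.

```python
from typing import Callable, List, Sequence


def shaped_rewards(states: Sequence[int], rewards: Sequence[float],
                   potential: Callable[[int], float], gamma: float) -> List[float]:
    """Apply r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t) along one trajectory."""
    assert len(states) == len(rewards) + 1
    return [r + gamma * potential(states[t + 1]) - potential(states[t])
            for t, r in enumerate(rewards)]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical toy chain: states 0..3, sparse reward only on reaching state 3.
states = [0, 1, 2, 3]
rewards = [0.0, 0.0, 1.0]
gamma = 0.9


def potential(s: int) -> float:
    # Assumed potential: progress toward the goal state.
    return float(s)


base = discounted_return(rewards, gamma)
dense = shaped_rewards(states, rewards, potential, gamma)
shaped = discounted_return(dense, gamma)

# Telescoping sum: the shaping terms collapse to gamma^T * Phi(s_T) - Phi(s_0).
offset = gamma ** len(rewards) * potential(states[-1]) - potential(states[0])
print(dense, shaped - base, offset)
```

Every step now carries a non-zero signal, while the ranking of policies from a given start state is unchanged when the terminal potential is fixed (commonly zero).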

Table 1. Connecting traditional sim-to-real gaps in RL to their analogues in FM agents, organized by the four MDP elements.

## 3\.Gaps on FM\-Controlled Agents and Benchmarking Paradigm

In Section 2, we introduced sim-to-real gaps across $\mathcal{O}$, $\mathcal{A}$, $\mathcal{T}$, and $\mathcal{R}$. In this section, we further analyze how these gaps manifest in FM-controlled agents, with Table [1](https://arxiv.org/html/2606.07017#S2.T1) mapping traditional sim-to-real gap sources in RL and robotics to their analogues in FM agent settings. Moreover, benchmarking sim-to-real robustness for FM agents can draw on classical mitigation strategies from RL.

### 3\.1\.Observation: Randomization and Shift

FM agents have an observation space consisting of processed textual inputs. These inputs include user queries along with tool schemas and environment feedback (Li et al., [2023a](https://arxiv.org/html/2606.07017#bib.bib58); Huang et al., [2024](https://arxiv.org/html/2606.07017#bib.bib59)). An observation gap emerges when the structured inputs found in simulated training benchmarks diverge from the noisy observations encountered during real-world deployment. In practice, this gap manifests through everyday user typos and large amounts of irrelevant, distracting context (Liu et al., [2025c](https://arxiv.org/html/2606.07017#bib.bib89)), as well as two deeper structural challenges: multilingual misalignment (Luo et al., [2026](https://arxiv.org/html/2606.07017#bib.bib24)) and multi-modal noise. Multilingual misalignment describes failures that occur when the agent receives input in languages other than English. Multi-modal noise represents the shift from clean text observations to complex inputs, such as imperfect speech transcriptions or noisy optical character recognition outputs, that corrupt the underlying semantic meaning (Yu et al., [2024](https://arxiv.org/html/2606.07017#bib.bib17); Chen et al., [2025a](https://arxiv.org/html/2606.07017#bib.bib90)).

A rigorous benchmarking paradigm should systematically expose these observation vulnerabilities through controlled randomization and distribution shift methodologies. Evaluators should inject noise into the text observations while keeping the underlying correct action unchanged. This process involves introducing minor typographical errors and query paraphrases that mimic real-world user interactions. Furthermore, distribution shift protocols must evaluate policies against entirely novel environments, such as replacing standard English instructions with diverse multilingual user inputs or previously unseen tool structures (Kim et al., [2026a](https://arxiv.org/html/2606.07017#bib.bib63)), without altering the core functional task.
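One way to realize "inject noise while keeping the correct action unchanged" is to perturb only the query text and carry the gold action through untouched. The sketch below is an illustrative assumption, not a benchmark from the paper: `inject_typos`, `perturbed_cases`, the adjacent-letter swap model, and the example tool call are all hypothetical.

```python
import random
from typing import Any, Dict, List, Sequence


def inject_typos(text: str, rate: float, rng: random.Random) -> str:
    """Swap adjacent letters with probability `rate` per eligible position."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


def perturbed_cases(query: str, gold_action: Dict[str, Any],
                    rates: Sequence[float], seed: int = 0) -> List[Dict[str, Any]]:
    """Build test cases at several noise levels; the gold action never changes."""
    rng = random.Random(seed)
    return [{"rate": r, "query": inject_typos(query, r, rng), "gold_action": gold_action}
            for r in rates]


# Hypothetical task: the tool name and arguments are made up for this example.
query = "Book a flight from Boston to Denver on Friday"
gold = {"tool": "book_flight", "args": {"origin": "BOS", "destination": "DEN"}}
cases = perturbed_cases(query, gold, rates=[0.0, 0.1, 0.3])
for case in cases:
    print(case["rate"], case["query"])
```

Scoring the agent's tool call against `gold_action` at each rate gives a robustness curve for the observation channel alone. A fixed seed keeps the perturbed set reproducible across models.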

### 3\.2\.Action: Randomization and Distractors

For FM agents, the action space translates directly to the repertoire of accessible tools or application programming interfaces\. An action gap manifests when the perfectly distinct tool sets of training environments are replaced by the overlapping and noisy functional landscapes of real deployment environments\(Jianget al\.,[2025](https://arxiv.org/html/2606.07017#bib.bib68)\)\.

To expose this vulnerability, evaluators should systematically expand and confuse the available action space to force complex disambiguation. This process involves injecting distractor tools that share the same name as the ground-truth tool but contain empty or incorrect parameter descriptions. Evaluators can additionally introduce redundant and highly similar tools to clutter the selection pool. Perturbing the presented action set in this manner tests whether an agent relies excessively on surface cues like tool names rather than demonstrating deep semantic understanding.
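The action-space perturbation described above can be sketched as a function that takes a tool list and returns an enlarged, more confusable one. Everything here is a hypothetical illustration: the dictionary-based tool schema, the `is_distractor` evaluator label, and the `_v2`/`_v3` naming of near-duplicates are assumptions, not a format defined by the paper.

```python
import copy
from typing import Any, Dict, List

Tool = Dict[str, Any]


def add_distractors(tools: List[Tool], gold_name: str, n_similar: int = 2) -> List[Tool]:
    """Return a new tool list with a same-name shadow tool and near-duplicates."""
    gold = next(t for t in tools if t["name"] == gold_name)
    out = copy.deepcopy(tools)

    # Distractor 1: same name as the gold tool, but parameter descriptions emptied.
    shadow = copy.deepcopy(gold)
    shadow["parameters"] = {p: "" for p in gold["parameters"]}
    shadow["is_distractor"] = True  # evaluator-only label
    out.append(shadow)

    # Distractor 2..n: redundant tools that differ only in their name.
    for i in range(n_similar):
        twin = copy.deepcopy(gold)
        twin["name"] = f"{gold_name}_v{i + 2}"
        twin["is_distractor"] = True
        out.append(twin)
    return out


def agent_view(tools: List[Tool]) -> List[Tool]:
    """Strip evaluator-only fields before the tool list is shown to the agent."""
    return [{k: v for k, v in t.items() if k != "is_distractor"} for t in tools]


# Hypothetical tool schemas for illustration.
tools = [
    {"name": "get_weather", "description": "Current weather for a city.",
     "parameters": {"city": "City name in English", "unit": "celsius or fahrenheit"}},
    {"name": "get_time", "description": "Current local time for a city.",
     "parameters": {"city": "City name in English"}},
]
perturbed = add_distractors(tools, "get_weather")
print([t["name"] for t in agent_view(perturbed)])
```

In practice the perturbed list would also be shuffled with a fixed seed so that position does not leak which entries are distractors.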

### 3\.3\.Transition: Realism and Robustness

The transition function for FM agents corresponds to the environment response following a tool execution\. A transition gap emerges because agents trained under idealized assumptions expect immediate and perfect success while real deployment environments present stochastic dynamics with transient timeouts or partial application programming interface failures\(Zhouet al\.,[2025](https://arxiv.org/html/2606.07017#bib.bib71)\)\.

To expose this inherent vulnerability, researchers should subject agents to modified execution dynamics that systematically vary transition fidelity to mimic actual production environments\. An effective evaluation methodology should inject configurable perturbations, such as varying timeout rates and unpredictable retry patterns, alongside incomplete data payloads like partial JSON responses\. For instance, forcing an initial tool call to return a transient error mandates that the agent interprets the failure correctly and autonomously initiates a recovery sequence\. Exposing models to these realistic transition conditions reveals whether policies inherently possess grounded behavior analogous to robust targets\(Chenet al\.,[2026](https://arxiv.org/html/2606.07017#bib.bib55)\)and whether they can plan under uncertainty\(Liuet al\.,[2025b](https://arxiv.org/html/2606.07017#bib.bib54); Daet al\.,[2024a](https://arxiv.org/html/2606.07017#bib.bib35)\)\.
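A minimal sketch of such a perturbation harness wraps a deterministic tool so that it sometimes times out or returns a truncated JSON payload, and pairs it with a simple retry loop standing in for the agent's recovery behavior. The names `make_flaky`, `call_with_recovery`, `TransientToolError`, and the `get_weather` tool are hypothetical, and the rates are arbitrary; only the standard-library `json` and `random` calls are real.

```python
import json
import random
from typing import Any, Callable, Dict, Optional, Tuple


class TransientToolError(Exception):
    """Raised by the wrapper to simulate a timeout."""


def make_flaky(tool: Callable[..., Dict[str, Any]], timeout_rate: float,
               truncate_rate: float, rng: random.Random) -> Callable[..., str]:
    """Wrap a clean tool so it returns JSON text with configurable failures."""
    def flaky(**kwargs: Any) -> str:
        roll = rng.random()
        if roll < timeout_rate:
            raise TransientToolError("simulated timeout")
        payload = json.dumps(tool(**kwargs))
        if roll < timeout_rate + truncate_rate:
            return payload[: len(payload) // 2]  # partial JSON response
        return payload
    return flaky


def call_with_recovery(flaky_tool: Callable[..., str], max_attempts: int,
                       **kwargs: Any) -> Tuple[Optional[Dict[str, Any]], int]:
    """Retry on timeouts and unparseable payloads; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return json.loads(flaky_tool(**kwargs)), attempt
        except (TransientToolError, json.JSONDecodeError):
            continue
    return None, max_attempts


def get_weather(city: str) -> Dict[str, Any]:
    # Hypothetical deterministic "simulated" tool.
    return {"city": city, "temp_c": 21}


rng = random.Random(0)
flaky = make_flaky(get_weather, timeout_rate=0.3, truncate_rate=0.2, rng=rng)
result, attempts = call_with_recovery(flaky, max_attempts=10, city="Paris")
print(result, attempts)
```

Sweeping `timeout_rate` and `truncate_rate` separately makes it possible to report robustness per perturbation type rather than as a single aggregate score.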

### 3\.4\.Reward: Realism and Metadata

A critical vulnerability of FM\-controlled agents arises because agents typically overfit to accuracy and ignore underlying operational constraints such as latency\(Kimet al\.,[2026b](https://arxiv.org/html/2606.07017#bib.bib74)\)\. The reward gap emerges when the simulated environment fails to capture these true operational costs of real\-world deployment\. Furthermore, empirical observations from recent tool\-use perturbation frameworks demonstrate that reward conditions consistently induce the most severe performance degradation\(Vuddantiet al\.,[2025](https://arxiv.org/html/2606.07017#bib.bib75)\)\.

To benchmark this gap systematically, evaluators should construct scenarios where redundant tools are deliberately provided, so the agent is forced to evaluate cost and latency metadata to select the optimal path. Beyond simple redundancy, benchmarking protocols should incorporate misleading tool names and hidden usage fees to rigorously test whether the model blindly optimizes for surface-level functional relevance instead of the true underlying utility. This approach shifts the evaluation metric from basic semantic alignment to holistic operational efficiency, ensuring that policies learn to maximize actual deployment goals.
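As an illustration of shifting from accuracy-only scoring to operational scoring, the sketch below compares redundant tools that all succeed but differ in cost and latency. The linear reward form, its weights, and the three tool records are assumptions made for this example; the paper does not prescribe a specific reward function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolOutcome:
    name: str
    success: bool
    cost_usd: float
    latency_s: float


def operational_reward(outcome: ToolOutcome, cost_weight: float = 1.0,
                       latency_weight: float = 0.1) -> float:
    """Assumed linear trade-off: task success minus weighted cost and latency."""
    return (float(outcome.success)
            - cost_weight * outcome.cost_usd
            - latency_weight * outcome.latency_s)


# Hypothetical redundant tools: all complete the task, with different metadata.
candidates = [
    ToolOutcome("premium_search", True, cost_usd=0.50, latency_s=0.2),
    ToolOutcome("basic_search", True, cost_usd=0.01, latency_s=0.8),
    ToolOutcome("search_pro_max", True, cost_usd=0.05, latency_s=4.0),  # appealing name, slow
]

# Accuracy-only scoring cannot separate the candidates.
accuracy_only = [float(c.success) for c in candidates]

# Operational scoring does, and it does not favor the most appealing name.
best = max(candidates, key=operational_reward)
print(accuracy_only, best.name)
```

An agent that picks a tool by name similarity alone would score identically to the optimal agent under `accuracy_only`, which is exactly the blind spot the reward-gap benchmark is meant to reveal.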

### 3\.5\.A Canonical Example: Observation Gap

Action, transition, and reward gaps are covered above and in Table [1](https://arxiv.org/html/2606.07017#S2.T1). Here we zoom in on one *observation*-gap setting: the *language–execution boundary*. Standard tool-calling benchmarks (Patil et al., [2025](https://arxiv.org/html/2606.07017#bib.bib20); Chen et al., [2024](https://arxiv.org/html/2606.07017#bib.bib25)) assume predominantly English user queries and language-consistent execution (e.g., parameter values that match API conventions). In reality, users may issue queries in many languages while tool interfaces (e.g., function names, parameter identifiers, or expected value formats) remain language-invariant (e.g., English-only). Multilingual tool-calling studies (Luo et al., [2026](https://arxiv.org/html/2606.07017#bib.bib24)) show that models often produce tool calls that are *semantically correct* (correct intent, correct tool selection) yet *operationally invalid*, because parameter values are copied from the user's language (e.g., Chinese or Hindi) and violate execution-level conventions. For example, Luo et al. ([2026](https://arxiv.org/html/2606.07017#bib.bib24)) show that when the instruction is switched from English to Chinese, the error rate increases from 13.5% to 28.5% for GPT5 (Singh et al., [2025](https://arxiv.org/html/2606.07017#bib.bib76)) and from 5.5% to 46.5% for Qwen-Next-80B (QwenTeam, [2025](https://arxiv.org/html/2606.07017#bib.bib77)), as shown in Table [2](https://arxiv.org/html/2606.07017#S4.T2). This failure mode is called *parameter value language mismatch*: the observation (a user query in language L) leads the model to fill parameters with values in L, but the execution environment expects values in a fixed convention (e.g., English). The gap is thus in the *observation space*: what the agent observes (multilingual natural language) is drawn from a different distribution than what was emphasized in training (often English-centric benchmarks), and the mapping from observation to action (parameter values) must satisfy execution constraints that are not fully reflected in the training distribution.
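The distinction between semantically correct and operationally invalid calls can be made measurable with a small classifier over predicted tool calls. The sketch below is a simplified assumption, not the evaluation procedure of the cited study: it treats the API convention as ASCII-only and uses `str.isascii()` as a crude proxy for "value written in the user's language"; the tool names and arguments are hypothetical.

```python
from typing import Any, Dict

ToolCall = Dict[str, Any]


def classify_call(call: ToolCall, gold: ToolCall) -> str:
    """Label a predicted tool call, separating semantic from execution-level errors."""
    if call["tool"] != gold["tool"]:
        return "wrong_tool"  # semantic error: intent or tool selection failed
    for name, expected in gold["args"].items():
        value = call["args"].get(name)
        if value == expected:
            continue
        # Assumption: the execution environment expects ASCII (e.g., English) values.
        non_ascii_value = isinstance(value, str) and not value.isascii()
        ascii_convention = isinstance(expected, str) and expected.isascii()
        if non_ascii_value and ascii_convention:
            return "parameter_value_language_mismatch"
        return "wrong_value"
    return "valid"


# Hypothetical case: the user asks in Chinese for the weather in Beijing.
gold = {"tool": "get_weather", "args": {"city": "Beijing", "unit": "celsius"}}
predictions = [
    {"tool": "get_weather", "args": {"city": "Beijing", "unit": "celsius"}},
    {"tool": "get_weather", "args": {"city": "北京", "unit": "celsius"}},
    {"tool": "get_forecast", "args": {"city": "Beijing", "unit": "celsius"}},
    {"tool": "get_weather", "args": {"city": "Shanghai", "unit": "celsius"}},
]
labels = [classify_call(p, gold) for p in predictions]
print(labels)
```

The second prediction selects the right tool and the right city, so a purely semantic judge would accept it, yet it fails at the execution boundary. Counting each label separately yields the kind of error decomposition needed to tell observation-gap failures apart from ordinary misunderstandings.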

## 4\.Research Directions

Table 2. Comparison of error rates for each model across different languages. The results show a significant increase in error rates for non-English languages, highlighting the parameter value language mismatch.

The MDP formulation above maps naturally to a set of open research directions, organized by the four components of the agent loop (Figure [2](https://arxiv.org/html/2606.07017#S4.F2)). We outline concrete opportunities for each component and close with cross-cutting challenges.

![Refer to caption](https://arxiv.org/html/2606.07017v1/x2.png)

Figure 2. Overview of research directions organized around the MDP loop. The MDP components are Observation ($\mathcal{O}$), Action ($\mathcal{A}$), Transition ($\mathcal{T}$), and Reward ($\mathcal{R}$).

- Observation ($\mathcal{O}$): Alignment across languages and modalities. The observation gap manifests most acutely in multilingual and multi-modal settings. Multilingual alignment requires training or decoding procedures that enforce parameter-value conventions across languages, so that a date format or unit convention in one language maps correctly to the same API argument as its counterpart in another. Recent multilingual agent benchmarks (Hofman et al., [2025](https://arxiv.org/html/2606.07017#bib.bib84)) reveal that non-English prompts cause disproportionate performance drops, yet few methods explicitly target the language-execution mismatch at the system level. Multi-modal extension broadens observations from text to vision, audio, and mixed-modality inputs, where sim-to-real benchmarks for vision-language agents and FM-controlled robots remain largely absent. Data augmentation, such as synthesizing paraphrases, noisy OCR outputs, or culturally varied tool descriptions, can improve robustness without additional real-world data. Error decomposition protocols that separate semantic misunderstanding from execution-level formatting errors are essential for diagnosing which part of the observation pipeline fails.

- Action ($\mathcal{A}$): Safe and delay-tolerant execution. Real-world tool calls face latency, ambiguity, and risk, none of which are captured by clean-benchmark evaluation. Delay-aware policies must learn to act under variable and sometimes unpredictable tool-response times, a challenge studied extensively in real-time RL for robotics but largely unexplored for FM agents. Action shielding offers a complementary safety layer: recent work such as ShieldAgent (Chen et al., [2025b](https://arxiv.org/html/2606.07017#bib.bib85)) demonstrates that guardrail agents can intercept unsafe or invalid tool calls before execution through formal verification; adapting such mechanisms to the sim-to-real setting, where the simulated action space may not fully cover real API constraints, is a natural next step. Distractor benchmarks that inject irrelevant or adversarial tool options into the action space test whether agents can resist spurious choices, an analog of the distractor objects studied in robotic manipulation.

- Transition ($\mathcal{T}$): Bridging simulated and real tool calls. The transition function is where sim-to-real mismatch is often most severe: simulated tool calls return clean, deterministic outputs, while real APIs produce timeouts, partial failures, rate-limit errors, and non-deterministic orderings. GAT-style mapping, which adapts grounded action transformation from multi-agent settings to tool-use agents, can learn a correction function that maps simulated tool outcomes to realistic feedback distributions. Realism benchmarks should systematically vary transition fidelity by injecting configurable perturbations such as timeout rates, retry patterns, and partial JSON responses, so that robustness can be measured per perturbation type. Learned dynamics models, including LLM-based or neural surrogates that predict real-world transition distributions from simulation experience, could enable planning under transition uncertainty without costly real-world rollouts.

- Reward ($\mathcal{R}$): Robust training under distributional shift. Standard RL objectives for tool-use agents overfit to clean reward signals; when deployed, noisy or delayed rewards cause the largest performance drops. Curriculum randomization, which progressively increases the severity of $\mathcal{O}$/$\mathcal{A}$/$\mathcal{T}$/$\mathcal{R}$ perturbations during training, has proven effective in robotics sim-to-real transfer and can be directly ported to agent training loops by ramping up API noise or timeout frequency over the course of training. Retry-aware training objectives that explicitly reward recovery behavior, such as successfully completing a task after an initial tool failure, encourage agents to develop robust fallback strategies rather than failing silently. Multi-objective optimization that jointly maximizes clean-task performance and stress robustness addresses the well-known tension between average-case accuracy and worst-case reliability, ensuring that hardening against perturbations does not degrade nominal performance.
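The curriculum randomization idea in the reward direction above, ramping perturbation severity over training, can be sketched as a simple schedule. The linear shape, the warm-up fraction, and the 0.3 ceiling are assumptions chosen for illustration; the returned value could be used, for example, as the timeout probability of a perturbed tool environment.

```python
def perturbation_severity(step: int, total_steps: int, start: float = 0.0,
                          end: float = 0.3, warmup_frac: float = 0.2) -> float:
    """Hold `start` during warm-up, then ramp linearly to `end` at the final step."""
    warmup = int(total_steps * warmup_frac)
    if step <= warmup:
        return start
    progress = (step - warmup) / max(1, total_steps - warmup)
    return start + (end - start) * min(1.0, progress)


# Severity at a few checkpoints of a hypothetical 1000-step training run.
checkpoints = (0, 200, 600, 1000)
schedule = [perturbation_severity(s, 1000) for s in checkpoints]
for step, severity in zip(checkpoints, schedule):
    print(step, round(severity, 3))
```

Starting from clean episodes lets the policy learn the nominal task first; the later ramp exposes it to progressively harsher conditions, which is one way to pursue robustness without sacrificing clean-task performance early in training.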

Cross-cutting challenges. Several questions span all four MDP components. It remains unknown how sim-to-real robustness scales with model size and pretraining diversity, or whether larger models naturally close $\mathcal{O}$/$\mathcal{A}$/$\mathcal{T}$/$\mathcal{R}$ gaps without targeted intervention. Shared perturbation libraries, public leaderboards reporting per-gap and combined robustness scores, and reproducibility practices such as fixed seeds, perturbation schedules, and version-controlled tool schemas are needed to reduce community fragmentation. Joint protocols that reflect real-world correlations among gap types (for example, noisy observations co-occurring with delayed transitions), and that balance coverage across languages and tool ecosystems against evaluation cost, represent the most important open benchmarking challenge.

## 5. Conclusion

Foundation model agents are rapidly transitioning from curated, failure-free benchmarks to real-world deployments where inputs, tools, and environments are noisy and shifting. We argue that many of the resulting reliability failures are not fundamentally new but mirror the traditional sim-to-real gap studied in reinforcement learning. We therefore propose a unifying MDP-based framework, structured around observation, action, transition, and reward, to formalize FM agent robustness and evaluation. We hope this Blue Sky formulation catalyzes a paradigm shift in how the field diagnoses vulnerabilities, designs benchmarks, and ultimately builds reliable, real-world AI systems aligned with KDD’s goals in responsible data science and modern, dependable applications.

###### Acknowledgements.

The work was partially supported by NSF awards #2442477 and #2550203. We thank the Amazon Research Awards, Cisco Faculty Research Awards, and Toyota Faculty Research Awards. The views and conclusions in this paper are those of the authors and should not be interpreted as representing any funding agencies.

