Efficient Skill Grounding via Code Refactoring with Small Language Models

arXiv cs.AI Papers

Summary

This paper presents RECENT, a framework that enables efficient skill grounding in embodied agents using small language models (sLMs) by refactoring code-based skills rather than regenerating them from scratch, achieving performance comparable to LLM-based methods.

arXiv:2606.07999v1 Announce Type: new Abstract: Effective skill grounding is essential for deploying reusable skills in embodied agents, as even minor embodiment or environmental differences can render an entire skill incompatible. This challenge is particularly pronounced in embodied settings, where agents must operate in dynamic, partially observable environments without access to large language models (LLMs). In this setting, reliance on LLMs is impractical, while small language models (sLMs) remain insufficient for the effective skill grounding required for reliable long-horizon control. We present RECENT, a refactoring-centric agent framework that enables efficient skill grounding with sLMs by decoupling skill semantics from embodiment- and environment-specific execution binding. By representing skills as executable code, RECENT preserves the semantic intent encoded in a skill's control structure while grounding it by modifying only execution bindings through localized refactoring, rather than regenerating code from scratch. We evaluate RECENT across diverse skill grounding scenarios spanning multiple robot embodiments in dynamic environments, demonstrating robust long-horizon performance when deployed with an sLM. Across all scenarios, RECENT achieves the best performance among sLM-based Code-as-Policies (CaP) methods and matches the task performance of LLM-based CaP.


# Efficient Skill Grounding via Code Refactoring with Small Language Models
Source: [https://arxiv.org/html/2606.07999](https://arxiv.org/html/2606.07999)
Wonje Choi, Saehun Chun, Daehee Lee, Jooyoung Kim, Chaeun Lee, Honguk Woo

###### Abstract

Effective skill grounding is essential for deploying reusable skills in embodied agents, as even minor embodiment or environmental differences can render an entire skill incompatible. This challenge is particularly pronounced in embodied settings, where agents must operate in dynamic, partially observable environments without access to large language models (LLMs). In this setting, reliance on LLMs is impractical, while small language models (sLMs) remain insufficient for the effective skill grounding required for reliable long-horizon control. We present RECENT, a refactoring-centric agent framework that enables efficient skill grounding with sLMs by decoupling skill semantics from embodiment- and environment-specific execution binding. By representing skills as executable code, RECENT preserves the semantic intent encoded in a skill’s control structure while grounding it by modifying only execution bindings through localized refactoring, rather than regenerating code from scratch. We evaluate RECENT across diverse skill grounding scenarios spanning multiple robot embodiments in dynamic environments, demonstrating robust long-horizon performance when deployed with an sLM. Across all scenarios, RECENT achieves the best performance among sLM-based Code-as-Policies (CaP) methods and matches the task performance of LLM-based CaP.

Embodied AI, Language Model

## 1 Introduction

Recent embodied control systems increasingly exploit the planning capabilities of large language models (LLMs) to solve tasks by composing learned skills represented as neural sub-policies or code snippets (Brohan et al., [2023](https://arxiv.org/html/2606.07999#bib.bib5); Wang et al., [2023a](https://arxiv.org/html/2606.07999#bib.bib42)). A skill inherently couples functional semantics, which specify what is to be achieved, with executable components, which determine how the skill can be realized by a particular robot in a given environment. However, because robots differ in morphology, actuation, and sensing, and task environments vary in object properties, physical constraints, and operational conditions, the executability of a skill is highly deployment-dependent, rendering direct skill reuse across contexts infeasible. Consequently, existing skill representations make it difficult to explicitly separate functional semantics from executable components, hindering deployment-time grounding and leading to relearning or regeneration when execution contexts change.

Such designs incur substantial computational overhead, frequently requiring additional training to relearn neural sub-policies when deployment conditions change. Representing skills as executable code partially alleviates this issue by enabling grounding through inference at test time rather than retraining. However, existing code-based approaches still typically rely on regenerating the entire skill implementation, instead of adapting only deployment-specific components while preserving the overall skill structure. This regeneration overhead becomes particularly pronounced on capacity-limited devices, where online access to large-scale LLMs is not guaranteed and skill grounding must be performed using on-device computing resources, rendering LLM inference impractical. On the other hand, small language models (sLMs) enable efficient inference but offer limited reasoning capacity (Choi et al., [2024](https://arxiv.org/html/2606.07999#bib.bib9)), and existing skill grounding approaches that rely on regeneration remain ill-suited for dynamic environments.

To address these challenges, we present RECENT, a refactoring-centric agent framework that enables sLMs to perform efficient skill grounding. Skill grounding becomes efficient when invariant semantic intent is separated from deployment-specific execution bindings, allowing grounding to be bootstrapped across diverse deployment contexts. By representing skills as well-defined executable code, RECENT preserves semantic intent in functional logic that can be reused across embodiment and environment differences, while isolating execution bindings as localized components for on-demand modification. As a result, sLMs are restricted to localized editing at deployment time, resolving embodiment mismatches through lightweight code modifications without the extensive reasoning associated with regenerating code from scratch. Environmental variations are handled through in-situ adaptation, where execution-time feedback is incorporated to progressively patch the executing code without frequent execution interruptions. Figure [1](https://arxiv.org/html/2606.07999#S1.F1) illustrates a code-level comparison between existing skill grounding approaches that rely on inefficient full regeneration and our framework, which achieves efficient skill grounding through localized code refactoring.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x1.png)

Figure 1: Key concept comparing (top) the skill grounding procedure used in existing approaches with (bottom) our refactoring-centric skill grounding procedure.

Specifically, we adopt a skill ontology to declaratively encode skill semantics, robot capabilities, and their relationships, providing a unified foundation for reusable skill representations. Guided by this ontology, we construct (i) an offline skill repository using an LLM, in which semantic intent is explicitly encoded in executable skill code along with metadata describing functional requirements and adaptation cues. By validating this semantic intent offline on a common robot platform prior to deployment, subsequent skill grounding no longer needs to reason about complex task semantics and can instead focus solely on resolving execution bindings. Deployment-time skill grounding in RECENT addresses embodiment mismatches through (ii) ontology-based reasoning and environmental variations through (iii) in-situ adaptation, both realized via Fill-in-the-Middle (FIM) (Bavarian et al., [2022](https://arxiv.org/html/2606.07999#bib.bib3))-based localized code refactoring rather than end-to-end regeneration from scratch. Embodiment mismatches are amenable to sLM-based editing because relevant execution bindings are explicitly identified through ontology-level comparisons between skill requirements and target robot capabilities, constraining grounding to localized pre-execution code edits. In contrast, environmental variations are handled by deferring adaptation to execution time, during which an sLM proactively patches yet-to-be-executed code snippets under unit-level validity checks, preserving executability without altering global task semantics.

We evaluate RECENT across diverse skill grounding scenarios using sLMs under deployment constraints, spanning multiple robot embodiments, dynamic environments, and capacity-limited device settings. Specifically, we design diverse long-horizon robotic manipulation tasks spanning multiple robot embodiments in CoppeliaSim (Rohmer et al., [2013](https://arxiv.org/html/2606.07999#bib.bib30)) and Genesis (Genesis Authors, [2024](https://arxiv.org/html/2606.07999#bib.bib12)). Across all scenarios, RECENT, deployed with the sLM Qwen2.5-Coder-7B (Hui et al., [2024](https://arxiv.org/html/2606.07999#bib.bib17)), outperforms the Code as Policies (CaP) baseline (Liang et al., [2023](https://arxiv.org/html/2606.07999#bib.bib26)) instantiated with the same-sized distilled sLM CodeV-R1 (Zhu et al., [2025](https://arxiv.org/html/2606.07999#bib.bib54)), achieving an improvement of 58.81 percentage points (pp) in task success rate (SR), a 99.09 pp reduction in grounding overhead (GO), an average of only 0.71 execution interruptions (EI) per task, approaching zero, and a 93.29% relative reduction in idle time (IT). Its performance is comparable to that of LLM-based CaP using GPT-5.2-Codex (OpenAI, [2025](https://arxiv.org/html/2606.07999#bib.bib29)), with differences of only 6.58 pp in SR, while outperforming it on average across the remaining metrics despite operating under deployment constraints, as shown in Table [1](https://arxiv.org/html/2606.07999#S5.T1), achieving a substantial 57.81 pp improvement in GO and average reductions of 22.95% and 77.52% in EI and IT, respectively.

Our contributions are summarized as follows:

- We present RECENT, a refactoring-centric agent framework that enables efficient skill grounding with sLMs under deployment constraints, making long-horizon control practical without reliance on large-scale LLM inference.
- We represent skills as executable code that separates invariant semantic intent from deployment-specific execution binding, enabling skill grounding through localized code refactoring rather than regeneration from scratch.
- We evaluate RECENT on diverse skill grounding scenarios and show consistent performance improvements over distilled sLM-based CaP in SR, GO, EI, and IT, while remaining comparable to LLM-based CaP.

## 2 Related Work

#### LLM\-based Embodied Control\.

In embodied control, recent work increasingly leverages the reasoning capabilities of LLMs for high\-level task planning over predefined skill policies\(Huang et al\.,[2022](https://arxiv.org/html/2606.07999#bib.bib15); Brohan et al\.,[2023](https://arxiv.org/html/2606.07999#bib.bib5); Song et al\.,[2023](https://arxiv.org/html/2606.07999#bib.bib36)\)\. Building on recent advances in the code\-writing capabilities of LLMs\(Chen et al\.,[2021](https://arxiv.org/html/2606.07999#bib.bib8); Roziere et al\.,[2023](https://arxiv.org/html/2606.07999#bib.bib31); Hui et al\.,[2024](https://arxiv.org/html/2606.07999#bib.bib17); Guo et al\.,[2024](https://arxiv.org/html/2606.07999#bib.bib13); Zhu et al\.,[2024](https://arxiv.org/html/2606.07999#bib.bib53)\)embodied policies can be represented and executed in programmatic forms, commonly referred to as*Code\-as\-Policies*\(Liang et al\.,[2023](https://arxiv.org/html/2606.07999#bib.bib26); Huang et al\.,[2023b](https://arxiv.org/html/2606.07999#bib.bib16),[a](https://arxiv.org/html/2606.07999#bib.bib14); Burns et al\.,[2024](https://arxiv.org/html/2606.07999#bib.bib6); Li et al\.,[2024](https://arxiv.org/html/2606.07999#bib.bib25); Mu et al\.,[2024](https://arxiv.org/html/2606.07999#bib.bib28); Vemprala et al\.,[2023](https://arxiv.org/html/2606.07999#bib.bib41); Singh et al\.,[2022](https://arxiv.org/html/2606.07999#bib.bib35); Wang et al\.,[2023b](https://arxiv.org/html/2606.07999#bib.bib44)\)\. Rather than mapping instructions to a fixed set of predefined skills, these methods prompt LLMs to generate Python\-like programs that directly invoke perception and motor APIs, enabling motion\-level control in embodied agents\. Unlike existing approaches that rely on large\-scale LLMs at deployment, our work focuses on reliable long\-horizon embodied control under deployment constraints by representing skills as reusable code and enabling sLMs to ground them through localized code refactoring rather than regeneration from scratch\.

#### Skill Grounding\.

In embodied agents, skills are temporally extended and reusable behavioral patterns that encapsulate low\-level control, enabling higher\-level planning and composition for solving complex tasks\(Kober & Peters,[2009](https://arxiv.org/html/2606.07999#bib.bib20); Rozo et al\.,[2020](https://arxiv.org/html/2606.07999#bib.bib32); Kroemer et al\.,[2021](https://arxiv.org/html/2606.07999#bib.bib21)\)\. Parametric approaches typically ground such skills through retraining or fine\-tuning neural policies, mapping abstract skill representations to concrete robot actions while entangling task semantics with execution details within the learned policies\(Xu et al\.,[2023](https://arxiv.org/html/2606.07999#bib.bib49); Wang et al\.,[2024](https://arxiv.org/html/2606.07999#bib.bib43); Doshi et al\.,[2024](https://arxiv.org/html/2606.07999#bib.bib10)\)\. More recently, LLM\-based embodied agents have represented skills as function\-level code, where task\-level decisions and execution\-specific details are generated jointly within a single program\(Tziafas & Kasaei,[2024](https://arxiv.org/html/2606.07999#bib.bib40); Sarch et al\.,[2023](https://arxiv.org/html/2606.07999#bib.bib34); Li et al\.,[2025a](https://arxiv.org/html/2606.07999#bib.bib23)\)\. We focus on how skills should be structured to support efficient grounding\. Specifically, we separate semantic intent from execution context, so that functional logic is preserved, while only execution\-specific components are subject to adaptation\. This separation reduces the reasoning burden on the sLM by preserving semantic intent in functional logic and restricting deployment\-time grounding to localized edits over execution bindings\. As a result, skill grounding naturally reduces to refactoring execution\-specific code fragments rather than regenerating entire skill implementations from scratch\.

#### Programmatic Control\.

Recent advances in LLMs have spurred growing interest in programmatic control, where agents generate, execute, and repair code to solve complex tasks (Xia & Zhang, [2023](https://arxiv.org/html/2606.07999#bib.bib47); Yang et al., [2025](https://arxiv.org/html/2606.07999#bib.bib50); Bouzenia et al., [2025](https://arxiv.org/html/2606.07999#bib.bib4); Xia et al., [2025](https://arxiv.org/html/2606.07999#bib.bib48)). By explicitly exposing control flow and intermediate program structure, these approaches facilitate structured reasoning, improved generalization, and compositionality compared to end-to-end learned control policies. Programmatic control has also been applied to embodied agents, enabling code generation and execution over perception and control modules (Liang et al., [2023](https://arxiv.org/html/2606.07999#bib.bib26); Huang et al., [2023b](https://arxiv.org/html/2606.07999#bib.bib16)). Unlike programmatic agents in digital environments, embodied code agents must reason over continuous states and interact with the physical world under partial observability (Ahn et al., [2025](https://arxiv.org/html/2606.07999#bib.bib1); Meng et al., [2025](https://arxiv.org/html/2606.07999#bib.bib27); Ying et al., [2025](https://arxiv.org/html/2606.07999#bib.bib51)). In embodied settings, generated code often coordinates perception, decision-making, and actuation in a closed-loop manner, where continuous control and environmental feedback are essential for handling uncertainty and long-horizon dependencies. Unlike existing approaches, RECENT performs in-situ adaptation to environmental variations during execution, enabling continuous control without interrupting program execution.

## 3 Problem Formulation

We formulate an embodied task as $\tau = (\mathcal{S}, \mathcal{A}, \mathcal{G}, T)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively. Due to partial observability (Sutton & Barto, [2018](https://arxiv.org/html/2606.07999#bib.bib38)), at each timestep $t$ the agent receives an observation $o_t$ that provides incomplete information about the underlying state $s_t \in \mathcal{S}$. The environment dynamics are defined by the transition function $T: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$. We denote by $\mathcal{G} \subset \mathcal{S}$ the set of goal states, and define each task $\tau$ as a composite objective consisting of multiple individual goals $g \in \mathcal{G}$. To solve a task $\tau$, the agent is provided offline with a set of reference skills $\mathcal{X} = \{\chi_1, \dots, \chi_K\}$, where each skill $\chi_k$ is a temporally extended action and may not be directly executable in the target deployment setting. Our objective is to optimize an sLM $\pi_{\mathrm{sLM}}$ over a set of tasks $\mathcal{T}$ to maximize task success rates while minimizing skill grounding overhead and preserving execution continuity:

$$\max_{\pi_{\mathrm{sLM}}} \mathbb{E}_{\tau \sim \mathcal{T}} \Big[ \mathrm{SR}(\pi_{\mathrm{sLM}}, \tau) - \lambda\, \mathrm{C}_{\mathrm{gro}}(\pi_{\mathrm{sLM}}, \tau; \mathcal{X}) - \mu\, \mathrm{C}_{\mathrm{exe}}(\pi_{\mathrm{sLM}}, \tau) \Big]. \tag{1}$$

Here, $\mathrm{SR}(\pi_{\mathrm{sLM}}, \tau)$ denotes the task success rate for executing $\tau$ under the decisions of $\pi_{\mathrm{sLM}}$, indicating whether all goal conditions are satisfied. The grounding cost $\mathrm{C}_{\mathrm{gro}}(\pi_{\mathrm{sLM}}, \tau; \mathcal{X})$ measures the deployment-time overhead incurred when grounding $\chi_k \in \mathcal{X}$ to $\tau$. The execution cost $\mathrm{C}_{\mathrm{exe}}(\pi_{\mathrm{sLM}}, \tau)$ penalizes execution disruptions that interrupt task progress. The coefficients $\lambda$ and $\mu$ balance task success against grounding and execution costs.
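
As a rough illustration of how the objective in Eq. (1) trades success against the two costs, the sketch below estimates it from a batch of logged rollouts. The `Rollout` record, the `objective` function, and the default values of `lam` and `mu` are hypothetical and chosen only for illustration; the paper does not specify how the costs are measured or weighted.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rollout:
    """One executed task (hypothetical record, not from the paper)."""
    success: bool          # SR: were all goal conditions satisfied?
    grounding_cost: float  # C_gro: deployment-time grounding overhead
    execution_cost: float  # C_exe: penalty for execution disruptions


def objective(rollouts, lam=0.1, mu=0.1):
    """Monte Carlo estimate of Eq. (1): mean of SR - lam*C_gro - mu*C_exe."""
    return mean(
        float(r.success) - lam * r.grounding_cost - mu * r.execution_cost
        for r in rollouts
    )
```

Larger `lam` or `mu` makes a policy that succeeds only after heavy regeneration or frequent interruptions score worse than one that succeeds with small, uninterrupted edits.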

![Refer to caption](https://arxiv.org/html/2606.07999v1/x2.png)

Figure 2: Overview of RECENT. (i) *Offline skill repository* stores reusable skill code with ontology-based metadata specifying functional intent, robot embodiment, and semantic relations. (ii) *Ontology-based reasoning* maps each skill to the target robot, diagnosing determined conflicts and undetermined warnings. (iii) *In-situ adaptation* patches code at execution time using environment feedback.

## 4 Approach

We present RECENT, a code refactoring-centric agent framework that enables sLMs to perform efficient skill grounding under deployment constraints, ensuring task success, minimizing grounding overhead, and preserving execution continuity. In RECENT, skills are generated and validated by an LLM with a clear separation between semantic intent and execution-specific components within the skill code. At deployment time, grounding is performed by an sLM by refactoring only the execution-specific components, enabling skills to be adapted to the target execution context with minimal modification. To enable systematic skill reuse across diverse robots, RECENT adopts a skill ontology that explicitly structures each skill according to shared semantic definitions (Tenorth & Beetz, [2017](https://arxiv.org/html/2606.07999#bib.bib39)). The ontology encodes a skill’s functional requirements (e.g., its definition, preconditions, and effects), robot embodiment profiles (e.g., available primitive APIs), and semantic relations between them (e.g., requires, provides, and substitutable-by). Based on this representation, we maintain an offline repository of reusable skills, where each skill is generated as executable code using cloud-scale LLMs and annotated with ontology-based metadata. All skills in the repository are validated on a commonly used robot, such as Franka Emika Panda, to ensure functional correctness prior to deployment. By construction, this representation enables skills to be systematically grounded across diverse deployment conditions.

Skill grounding in RECENT is performed at deployment time based on the execution context, which we decompose into determined and undetermined factors depending on their resolvability before execution. Before execution, embodiment-related factors are largely determined by the known capabilities of the target robot. Accordingly, RECENT leverages the skill ontology to either select directly compatible skills or identify substitutable code snippets to resolve embodiment mismatches. In these cases, an sLM performs localized code editing, such as replacing API calls, adjusting parameters, or rewiring interfaces, while preserving the overall functional structure for efficient reuse. During execution, environment-related factors, such as object properties, contact dynamics, partial observability, and scene constraints, may remain undetermined. To handle these factors, RECENT monitors execution-time environmental feedback and performs anticipatory in-situ code patching on yet-to-be-executed code fragments when a skill is predicted to violate its expected behavior. In this process, lightweight autonomous validation checks are inserted for undetermined factors to guide the sLM in deciding when and where patching is required, enabling uninterrupted long-horizon execution without restarting the entire skill.

Figure [2](https://arxiv.org/html/2606.07999#S3.F2) provides an overview of RECENT, which operates in three phases. Prior to deployment, we construct (i) an *offline skill repository*, where reusable skill code is generated using an LLM based on a skill ontology and validated for executability on a general-purpose robot. At deployment time, RECENT performs skill grounding through two additional phases: (ii) *Ontology-based reasoning* for embodiment gaps, which enables an sLM to perform targeted pre-execution editing for determined, embodiment-specific factors; and (iii) *In-situ adaptation* to environmental variations, which enables the sLM to perform on-demand patching for undetermined, environment-dependent factors.

### 4.1 Offline skill repository

The offline skill repository serves as the foundation for systematic skill reuse in RECENT. We adopt a skill ontology as a logical knowledge base that encodes skill semantics, robot embodiment capabilities, and semantic relations between them, following Tenorth & Beetz ([2017](https://arxiv.org/html/2606.07999#bib.bib39)). Specifically, the ontology captures each skill’s functional intent, including preconditions and expected effects, as well as robot embodiment profiles, such as available primitive APIs and execution constraints. In addition, it encodes semantic relations between skills and capabilities, including requires, provides, and substitutable-by. Each skill in the repository is generated as executable code using a cloud-scale LLM and is structured to separate semantic intent from execution-specific components. All skills are validated offline on a representative embodiment to ensure functional correctness and basic executability.

#### Skill ontology querying\.

Based on the skill ontology, we define an ontology query operator $\mathcal{Q}$ that extracts a robot-conditioned skill specification by retrieving ontology facts consistent with both the skill definition and the target embodiment. For a skill $\chi \in \mathcal{X}$ and a robot $r \in \mathcal{R}$, we define:

$$\mathcal{Q}(\chi, r) = \langle \mathcal{C}(\chi), \mathcal{P}(\chi), \mathcal{E}(\chi), \mathcal{I}(\chi, r) \rangle. \tag{2}$$

Here, $\mathcal{C}(\chi)$ denotes the set of required capabilities and operational constraints specified for skill $\chi$ (e.g., gripper type, sensing modalities), $\mathcal{P}(\chi)$ denotes preconditions that must hold before execution (e.g., object visibility, reachable poses, collision-free workspaces), and $\mathcal{E}(\chi)$ denotes the expected effects after execution (e.g., object pose changes, grasp success, state transitions). Crucially, $\mathcal{I}(\chi, r)$ denotes a robot-conditioned primitive API interface derived from the ontology that attempts to satisfy the required capabilities in $\mathcal{C}(\chi)$ under the capability profile of robot $r$. If no interface fully satisfies $\mathcal{C}(\chi)$, the remaining unmet requirements are exposed as embodiment gaps and deferred to the diagnosis operator in *Ontology-based reasoning* (Section [4.2](https://arxiv.org/html/2606.07999#S4.SS2)). The resulting interface specifies typed inputs, outputs, and control parameters, and serves as an explicit grounding contract between abstract skill intent and embodiment-specific execution. This ontology-query formulation provides a specification that enables LLMs to generate skill code enriched with structured metadata.
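
To make the shape of $\mathcal{Q}(\chi, r)$ concrete, here is a minimal sketch that models the ontology as plain Python records. The class names (`Skill`, `Robot`, `QueryResult`), the `query` function, and the capability and API strings are all hypothetical; the paper's ontology follows Tenorth & Beetz and is presumably far richer than a dictionary lookup.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    capabilities: frozenset  # C(chi): required capabilities and constraints
    preconditions: tuple     # P(chi): must hold before execution
    effects: tuple           # E(chi): expected after execution


@dataclass
class Robot:
    name: str
    provides: dict           # capability -> primitive API name (hypothetical)


@dataclass
class QueryResult:
    capabilities: frozenset
    preconditions: tuple
    effects: tuple
    interface: dict          # I(chi, r): capability -> API that satisfies it
    gaps: frozenset          # unmet requirements, left for the diagnosis step


def query(skill: Skill, robot: Robot) -> QueryResult:
    """Build a robot-conditioned specification in the spirit of Eq. (2)."""
    interface = {
        cap: robot.provides[cap]
        for cap in sorted(skill.capabilities)
        if cap in robot.provides
    }
    gaps = frozenset(c for c in skill.capabilities if c not in interface)
    return QueryResult(
        skill.capabilities, skill.preconditions, skill.effects, interface, gaps
    )
```

In this sketch, any capability the robot does not provide ends up in `gaps`, mirroring how unmet requirements are exposed as embodiment gaps rather than silently dropped.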

#### LLM\-based skill generation and validation\.

Let $r^{\mathrm{src}}$ denote a general-purpose robot embodiment (e.g., *Franka Emika Panda*), which serves as a reference robot for offline skill validation. For each skill $\chi$, we query $\mathcal{Q}(\chi, r^{\mathrm{src}})$ and use a cloud-scale LLM $\pi_{\mathrm{LLM}}$ to generate an executable skill implementation $\tilde{\chi}$ enriched with ontology-aligned metadata. The resulting offline skill repository $\tilde{\mathcal{X}}$ is defined as the set of skills that satisfy canonical validation:

$$\tilde{\mathcal{X}} = \big\{ \tilde{\chi} \mid \chi \in \mathcal{X},\ \tilde{\chi} = \pi_{\mathrm{LLM}}(\mathcal{Q}(\chi, r^{\mathrm{src}})),\ \mathrm{Validate}(\tilde{\chi}, r^{\mathrm{src}}) = 1 \big\}. \tag{3}$$

The validation procedure $\mathrm{Validate}(\cdot)$ executes $\tilde{\chi}$ on $r^{\mathrm{src}}$ and verifies consistency with its specification by checking satisfaction of testable preconditions in $\mathcal{P}(\chi)$ and alignment between observed outcomes and the expected effects in $\mathcal{E}(\chi)$. Only validated skills are stored as reusable entries in the offline skill repository for deployment-time grounding.
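
The filtering in Eq. (3) can be sketched as a generate-then-validate loop. In the sketch below, `generate` stands in for $\pi_{\mathrm{LLM}}$ and a dictionary stands in for the reference robot's simulated state; the `run(state)` entry point, the state keys, and the function names are assumptions made for illustration only.

```python
from typing import Callable, Dict, Iterable


def validate(code: str, precondition: Callable, effect: Callable) -> bool:
    """Run candidate skill code on a toy reference state and check P and E."""
    namespace: dict = {}
    state = {"cube_visible": True, "holding": None}  # toy stand-in for r_src
    try:
        exec(code, namespace)          # assumed to define run(state)
        if not precondition(state):    # testable preconditions in P(chi)
            return False
        namespace["run"](state)
        return bool(effect(state))     # expected effects in E(chi)
    except Exception:
        return False                   # non-executable code fails validation


def build_repository(
    skills: Iterable[str],
    generate: Callable[[str], str],
    precondition: Callable,
    effect: Callable,
) -> Dict[str, str]:
    """Keep only generated skills that pass validation, as in Eq. (3)."""
    repository = {}
    for name in skills:
        code = generate(name)
        if validate(code, precondition, effect):
            repository[name] = code
    return repository
```

A real validator would execute the skill in a simulator or on hardware; the point here is only that entries enter the repository after passing both precondition and effect checks.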

### 4.2 Ontology-based reasoning for embodiment gaps

At deployment time, RECENT grounds skills to a target robot by reasoning over embodiment mismatches through the ontology. We categorize deployment-time uncertainties into determined embodiment factors and undetermined environmental factors. This phase resolves determined factors before execution via localized code editing, while deferring undetermined factors to *in-situ adaptation* (Section [4.3](https://arxiv.org/html/2606.07999#S4.SS3)).

#### Ontology\-based compatibility analysis\.

Given a target robot $r^{\mathrm{tgt}}$, RECENT queries the skill ontology to assess the compatibility of each retrieved skill with the target embodiment. Specifically, for a candidate skill implementation $\tilde{\chi}$, we evaluate whether the capability profile of $r^{\mathrm{tgt}}$ satisfies the required capabilities $\mathcal{C}(\chi)$ and whether the robot’s available primitive APIs are consistent with the skill’s interface on the ontology. If all required capabilities are directly supported by $r^{\mathrm{tgt}}$, the skill is deemed compatible and selected without modification. Otherwise, RECENT performs ontology-based embodiment gap analysis by decomposing unmet capability requirements into determined and undetermined factors. Given $\tilde{\chi}$ and $r^{\mathrm{tgt}}$, we define an ontology-based diagnosis operator $\mathcal{D}$ as:

$$\mathcal{D}(\tilde{\chi}, r^{\mathrm{tgt}}) = \langle \Delta_{\mathrm{det}}(\tilde{\chi}, r^{\mathrm{tgt}}), \Delta_{\mathrm{und}}(\tilde{\chi}, r^{\mathrm{tgt}}) \rangle. \tag{4}$$

Here, $\Delta_{\mathrm{det}}$ denotes determined conflicts whose resolution is fully specified by ontology-derived substitutions, allowing corresponding code interfaces and API calls to be refactored prior to execution. In contrast, $\Delta_{\mathrm{und}}$ denotes undetermined, environment-contingent warnings (e.g., sensing visibility, contact stability, object physical properties) whose satisfaction cannot be certified prior to execution and must be validated online during *in-situ adaptation* (Section [4.3](https://arxiv.org/html/2606.07999#S4.SS3)).
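
A toy version of the diagnosis operator in Eq. (4) might look like the following. The `diagnose` function, the `substitutable_by` mapping, and the `env_contingent` set are hypothetical simplifications of the ontology relations; raising an error when no substitute exists is also an assumption, since the text does not say how unresolvable gaps are handled.

```python
def diagnose(required, provided, substitutable_by, preconditions, env_contingent):
    """Split grounding issues into determined conflicts and undetermined warnings.

    required / provided: sets of capability names.
    substitutable_by: capability -> list of candidate substitute capabilities.
    preconditions: the skill's preconditions P(chi).
    env_contingent: preconditions that can only be checked at execution time.
    """
    determined = []  # (unmet capability, ontology-derived substitute)
    for cap in sorted(required):
        if cap in provided:
            continue  # directly supported, nothing to edit
        substitutes = [s for s in substitutable_by.get(cap, []) if s in provided]
        if not substitutes:
            # Assumption: gaps without a substitute are treated as unresolvable.
            raise ValueError(f"no substitute available for {cap!r}")
        determined.append((cap, substitutes[0]))
    undetermined = [p for p in preconditions if p in env_contingent]
    return determined, undetermined
```

For example, a skill written for a parallel gripper and deployed on a suction-gripper robot would yield one determined conflict to edit before execution, while a precondition such as object visibility would be returned as an undetermined warning to check online.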

#### Localized editing for determined conflicts\.

For determined conflicts $\Delta_{\mathrm{det}}(\tilde{\chi}, r^{\mathrm{tgt}})$, RECENT resolves grounding via localized code editing using sLM-based FIM (Bavarian et al., [2022](https://arxiv.org/html/2606.07999#bib.bib3)). Each conflict specifies an editing target defined by the start and end locations of the code span to be modified, together with ontology-derived substitutions as hints. Accordingly, determined conflicts are represented as a set of infilling tasks:

$$\Delta_{\mathrm{det}}(\tilde{\chi}, r^{\mathrm{tgt}}) = \{(m_i, \kappa_i)\}_{i=1}^{N}, \tag{5}$$

where each infilling task $(m_i, \kappa_i)$ consists of a code snippet $m_i$ to be refactored and a set of ontology-derived substitution hints $\kappa_i$. Each infilling task $(m_i, \kappa_i)$ is resolved locally by an sLM $\pi_{\mathrm{sLM}}$ using FIM to generate a replacement snippet:

$$m_i' \leftarrow \pi_{\mathrm{sLM}}\big(\psi^{\mathrm{pre}}(m_i; \tilde{\chi}), \psi^{\mathrm{suf}}(m_i; \tilde{\chi}), \kappa_i\big), \tag{6}$$

where $\psi^{\mathrm{pre}}(m_i; \tilde{\chi})$ and $\psi^{\mathrm{suf}}(m_i; \tilde{\chi})$ denote the prefix and suffix code segments surrounding $m_i$, respectively. The updated $\tilde{\chi}$ is obtained as $\tilde{\chi} \leftarrow [\psi^{\mathrm{pre}}(m_i; \tilde{\chi}) \oplus m_i' \oplus \psi^{\mathrm{suf}}(m_i; \tilde{\chi})]$, where $\oplus$ denotes concatenation.
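
The prefix/suffix splice in Eqs. (5) and (6) can be sketched as follows. The prompt layout uses the `<|fim_prefix|>`, `<|fim_suffix|>`, and `<|fim_middle|>` special tokens documented for Qwen2.5-Coder; passing hints as leading comments is an assumption, as the paper does not state how $\kappa_i$ is injected. The `infill` callable is a placeholder for the sLM, and all function names are hypothetical.

```python
from typing import Callable, Sequence


def build_fim_prompt(prefix: str, suffix: str, hints: Sequence[str]) -> str:
    """Assemble a FIM prompt; hints are prepended as comments (assumption)."""
    hint_block = "".join(f"# hint: {h}\n" for h in hints)
    return (
        f"<|fim_prefix|>{hint_block}{prefix}"
        f"<|fim_suffix|>{suffix}<|fim_middle|>"
    )


def refactor_span(
    code: str,
    start: int,
    end: int,
    hints: Sequence[str],
    infill: Callable[[str], str],
) -> str:
    """Replace code[start:end] (the snippet m_i) with the model's infill."""
    prefix, suffix = code[:start], code[end:]   # psi_pre and psi_suf
    replacement = infill(build_fim_prompt(prefix, suffix, hints))
    return prefix + replacement + suffix        # psi_pre + m_i' + psi_suf


# Usage with a stub in place of the sLM.
skill = "def pick(robot, obj):\n    robot.close_gripper()\n    robot.lift(obj)\n"
target = "robot.close_gripper()"
start = skill.index(target)
grounded = refactor_span(
    skill,
    start,
    start + len(target),
    hints=["close_gripper is substitutable-by enable_suction"],
    infill=lambda prompt: "robot.enable_suction()",
)
```

Because only the span between `start` and `end` is regenerated, the surrounding control structure is carried over byte for byte, which is what keeps the edit within reach of a small model.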

### 4.3 In-situ adaptation to environmental variations

This phase addresses undetermined factors $\Delta_{\mathrm{und}}(\tilde{\chi}, r^{\mathrm{tgt}})$ whose validity cannot be resolved before execution and instead depends on execution-time environmental conditions. In contrast to determined conflicts, these factors are evaluated and resolved in-situ during skill execution.

#### Unit\-test for undetermined warnings\.

For each undetermined warning in $\Delta_{\mathrm{und}}(\tilde{\chi}, r^{\mathrm{tgt}})$, RECENT instantiates a lightweight, localized validation check in the form of an autonomous unit test. Each unit test encodes expected preconditions or effects associated with a specific subsequent code segment and is continuously evaluated using execution-time observations (e.g., object states, contact outcomes, or sensory feedback). Importantly, these validation checks are applied only to code segments that have not yet been executed, enabling anticipatory detection of potential violations without interrupting ongoing execution. When a validation check is violated, the corresponding undetermined factor is determined and treated as an immediate patching target. If a unit test for $u \in \Delta_{\mathrm{und}}(\tilde{\chi}, r^{\mathrm{tgt}})$ is violated, we convert it into an observation-conditioned infilling task:

$$u \xrightarrow{\,o_t\,} (m, o_t), \tag{7}$$

where $m$ denotes the target code snippet to be refactored, and $o_t$ denotes the environment observation at timestep $t$.

#### Execution\-time adaptive patching\.

Given a patching target $(m, o_t)$, RECENT performs localized execution-time adaptation via FIM. Formally, given the current skill code $\tilde{\chi}$, we generate a replacement snippet using $\pi_{\mathrm{sLM}}$:

$$m' \leftarrow \pi_{\mathrm{sLM}}\big(\psi^{\mathrm{pre}}(m; \tilde{\chi}),\ \psi^{\mathrm{suf}}(m; \tilde{\chi}),\ o_t\big), \tag{8}$$

and update the skill code by patching only the targeted snippet as $\tilde{\chi} \leftarrow [\psi^{\mathrm{pre}}(m; \tilde{\chi}) \oplus m' \oplus \psi^{\mathrm{suf}}(m; \tilde{\chi})]$. This formulation modifies only the invalidated snippet while preserving the remaining code structure and ongoing execution context.
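The anticipatory check and patch of Eqs. (7) and (8) can be illustrated with the following sketch. It assumes the skill is already split into segments and that each undetermined warning is attached to one segment as a boolean precondition; `Segment`, `find_violation`, `patch`, the observation key `cup_upright`, and the stub `fake_slm` are all hypothetical and stand in for components the paper does not specify at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Observation = Dict[str, object]


@dataclass
class Segment:
    code: str
    # Unit test attached to this segment; None means no undetermined warning.
    check: Optional[Callable[[Observation], bool]] = None
    executed: bool = False


def find_violation(segments: List[Segment], obs: Observation) -> Optional[int]:
    """Return the index of the first pending segment whose check fails, else None."""
    for i, seg in enumerate(segments):
        if seg.executed or seg.check is None:
            continue  # checks apply only to code that has not run yet
        if not seg.check(obs):
            return i
    return None


def patch(segments: List[Segment], idx: int, obs: Observation,
          slm: Callable[[str, str, Observation], str]) -> None:
    prefix = "".join(s.code for s in segments[:idx])       # psi_pre
    suffix = "".join(s.code for s in segments[idx + 1:])   # psi_suf
    segments[idx].code = slm(prefix, suffix, obs)          # m' replaces m only


def fake_slm(prefix: str, suffix: str, obs: Observation) -> str:
    # Stand-in for the sLM: re-orient the object before grasping it.
    return "reorient(cup)\ngrasp(cup)\n"


segments = [
    Segment("move_to(cup)\n", executed=True),
    Segment("grasp(cup)\n", check=lambda o: bool(o["cup_upright"])),
    Segment("place(cup, shelf)\n"),
]
obs = {"cup_upright": False}          # observation o_t at the current timestep

idx = find_violation(segments, obs)
if idx is not None:
    patch(segments, idx, obs, fake_slm)
patched_skill = "".join(s.code for s in segments)
print(patched_skill)
```

Because the violated segment has not started executing, the patch lands before the robot reaches it, so the already executed prefix and the untouched suffix are both preserved.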

## 5Evaluation

### 5\.1Experimental setting

To evaluate the efficiency and robustness of skill grounding under deployment constraints, we assess RECENT on long-horizon embodied manipulation tasks where embodiment gaps and environmental variations cause incompatibilities between offline-generated skills and deployment-time execution conditions. These tasks are designed to examine whether RECENT, when deployed with an sLM, can ground reusable skills via localized code refactoring instead of policy regeneration, while sustaining continuous execution over long horizons.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x3.png)

Figure 3: Evaluation settings. (left) Kinematic variations, transferring from the source Panda robot to the target robots UR5 and Sawyer. (right) End-effector variations, transferring from the Franka Hand to the Robotiq 2F-85 and vacuum grippers.

#### Environments.

We design four evaluation settings spanning two types of embodiment mismatches: (1) *Kinematic variation*, where the source and target robots differ in manipulator structure, such as link configurations, joint layouts, and degrees of freedom; and (2) *End-effector variation*, where the grasping mechanism differs, such as parallel-jaw versus vacuum grippers. For kinematic variation, we use Franka Emika Panda as the source embodiment and evaluate deployment on UR5 and Sawyer manipulators, which differ in kinematic structure and joint configuration. All robots are implemented in CoppeliaSim (Rohmer et al., [2013](https://arxiv.org/html/2606.07999#bib.bib30)), enabling consistent evaluation across heterogeneous kinematic and control interfaces. For end-effector variation, we use Panda equipped with a parallel-jaw gripper as the source, and evaluate deployment using a Robotiq 2F-85 parallel gripper and a vacuum gripper. These settings are implemented in Genesis (Genesis Authors, [2024](https://arxiv.org/html/2606.07999#bib.bib12)), which supports accurate modeling of contact dynamics and suction-based manipulation. Figure [3](https://arxiv.org/html/2606.07999#S5.F3) summarizes the embodiment settings for evaluation. Beyond embodiment mismatches, all scenarios incorporate environmental variations that introduce dynamics and partial observability, including object pose perturbations, interaction uncertainties, and sensing noise. This setup ensures that skill execution requires grounding under both embodiment gaps and environmental variations.

#### Tasks\.

We construct each task as a continual sequence of subtasks, forming long-horizon manipulation problems that require up to 54 sequential API calls to complete under an optimal execution path. Each subtask is adapted from existing manipulation benchmarks, including RLBench, VIMA, and RoboGen (James et al., [2020](https://arxiv.org/html/2606.07999#bib.bib18); Jiang et al., [2022](https://arxiv.org/html/2606.07999#bib.bib19); Wang et al., [2023c](https://arxiv.org/html/2606.07999#bib.bib45)). (1) *Kinematic variation* consists of 20 tasks, each requiring an average of 24.6 API calls. Each task combines up to 5 subtasks selected from 4 subtask categories: single-object manipulation, inter-object manipulation, precision interaction, and tool use. (2) *End-effector variation* consists of 12 tasks, each requiring an average of 29 API calls. Each task consists of subtasks that require stable grasping and accurate interactions between objects and the gripper, including object placement, articulated object interaction, scene rearrangement, and tool use.

Table 1: Performance on long-horizon manipulation tasks under Kinematic and End-effector variation settings. Higher values indicate better performance for SR and GC, whereas lower values indicate better performance for GO, EI, and IT. All results are averaged over five runs with different random seeds.

#### Baselines\.

For comparison, we group the baselines into three categories. (1) Code-as-Policies: We include methods following the Code-as-Policies paradigm, which generate executable policy code from task descriptions. We evaluate Code as Policies (CaP) (Liang et al., [2023](https://arxiv.org/html/2606.07999#bib.bib26)) and its variants: CaP (Liang et al., [2023](https://arxiv.org/html/2606.07999#bib.bib26)) instantiated with an sLM; CaP-CodeV-R1, a distilled variant of CaP (Zhu et al., [2025](https://arxiv.org/html/2606.07999#bib.bib54)); and CaP-Codex, which employs GPT-5.2-Codex (OpenAI, [2025](https://arxiv.org/html/2606.07999#bib.bib29)). We additionally include SCoT (Li et al., [2025b](https://arxiv.org/html/2606.07999#bib.bib24)), which augments CaP with explicit reasoning for policy code generation. (2) Embodied Agentic Programming: We include embodied agentic programming methods that synthesize and execute programs online and integrate policy code generation into the embodied perception-action feedback loop. We use ProgPrompt (Singh et al., [2022](https://arxiv.org/html/2606.07999#bib.bib35)) and RoboInspector (Ying et al., [2025](https://arxiv.org/html/2606.07999#bib.bib51)) as representative baselines. (3) Automated Program Repair: We include automated program repair methods that iteratively revise programs based on reactive, error-driven execution feedback. We use RepairAgent (Bouzenia et al., [2025](https://arxiv.org/html/2606.07999#bib.bib4)) and Agentless (Xia et al., [2025](https://arxiv.org/html/2606.07999#bib.bib48)) as representative baselines.

#### Metrics\.

We evaluate performance using 5 metrics aligned with the objective in Eq. ([1](https://arxiv.org/html/2606.07999#S3.E1)), covering task success, grounding efficiency, and execution continuity. For robustness, we report (1) Task Success Rate (SR) (%), which measures whether an agent successfully completes all required subtasks of a given task, and (2) Goal-Conditioned Success Rate (GC) (%), which evaluates whether individual goal conditions are satisfied. For grounding efficiency, we measure (3) Grounding Overhead (GO) (%), defined as the ratio of the number of tokens generated by the sLM during skill grounding to the token length of the original skill code. For execution continuity, we report (4) the Execution Interruption Count (EI), defined as the number of pauses in skill execution caused by failures or environmental mismatches at runtime, and (5) the Idle Time (IT), measured as the cumulative duration required to resolve such interruptions before skill execution resumes.
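As a worked illustration of these definitions, the sketch below computes each metric from toy logs. The function names and the log layout are hypothetical, and pooling goal conditions across tasks for GC is an assumption: the text does not say whether GC is pooled or averaged per task.

```python
from typing import List, Tuple


def success_rate(subtask_flags: List[List[bool]]) -> float:
    # SR: a task counts only if every one of its subtasks succeeded.
    return 100.0 * sum(all(task) for task in subtask_flags) / len(subtask_flags)


def goal_conditioned_success(goal_flags: List[List[bool]]) -> float:
    # GC: share of satisfied goal conditions (pooled over tasks; an assumption).
    flat = [g for task in goal_flags for g in task]
    return 100.0 * sum(flat) / len(flat)


def grounding_overhead(generated_tokens: int, skill_tokens: int) -> float:
    # GO: tokens generated during grounding relative to the original skill length.
    return 100.0 * generated_tokens / skill_tokens


def interruption_stats(pauses: List[Tuple[float, float]]) -> Tuple[int, float]:
    # EI: number of pauses; IT: total seconds between each pause and its resume.
    return len(pauses), sum(resume - pause for pause, resume in pauses)


flags = [[True, True], [True, False]]          # two toy tasks, two checks each
sr = success_rate(flags)                       # one of two tasks fully succeeds
gc = goal_conditioned_success(flags)           # three of four conditions hold
go = grounding_overhead(generated_tokens=20, skill_tokens=250)
ei, it = interruption_stats([(10.0, 11.5), (30.0, 30.5)])
print(sr, gc, go, ei, it)
```

Note that GO is not capped at 100%: a method that regenerates more tokens than the original skill contains yields a value above 100%, which is how regeneration-based baselines can exceed that mark.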

#### Implementation\.

All experiments are implemented in Python 3.9. We use Qwen2.5-Coder-7B as the default sLM, accessed via HuggingFace (Wolf et al., [2019](https://arxiv.org/html/2606.07999#bib.bib46)). All baseline methods are evaluated under identical configurations and executed on off-the-shelf NVIDIA RTX 4090 GPUs. Additional experimental details are in Appendix [C](https://arxiv.org/html/2606.07999#A3).

### 5\.2Main result

Table [1](https://arxiv.org/html/2606.07999#S5.T1) presents a comparison of RECENT and eight competitive baselines on long-horizon embodied manipulation tasks across two skill grounding scenarios, *Kinematic variation* and *End-effector variation*, both evaluated in dynamic environments. Across both scenarios, all baselines are provided with the same set of reference skills, which are used for grounding and reuse to synthesize executable policy code for task execution. Under the same sLM setting, RECENT consistently outperforms all baselines in task success, grounding efficiency, and execution continuity. In terms of task performance, RECENT achieves the highest SR and GC, with 73.00% SR and 81.76% GC in *Kinematic variation*, and 82.50% SR and 89.84% GC in *End-effector variation*. For grounding efficiency, it attains the lowest GO of 7.85% and 2.51% in *Kinematic variation* and *End-effector variation*, respectively. RECENT further preserves execution continuity, maintaining near-zero EI of 0.72 and 0.69 and achieving the lowest IT of 1.77 sec and 2.34 sec across the two scenarios.

Averaged across *Kinematic variation* and *End-effector variation*, RECENT outperforms the strongest baseline, RepairAgent, achieving higher task success with improvements of 30.67 pp in SR and 16.87 pp in GC, while reducing GO, EI, and IT by 49.87 pp, 0.72, and 65.40 sec, respectively. Compared to the distilled sLM-based baseline CaP-CodeV-R1, RECENT significantly improves SR and GC by 58.81 pp and 47.24 pp, and reduces GO by 99.09 pp, EI by 2.75, and IT by 28.58 sec. At the same time, RECENT achieves performance comparable to the fully accessible LLM-based baseline CaP-Codex, with absolute differences of only 6.58 pp in SR and 0.21 in EI, while achieving a substantial 57.81 pp improvement in GO and a 7.09 sec reduction in IT.

Specifically, the *Kinematic variation* setting is designed to evaluate skill grounding under kinematic embodiment mismatches between source and target manipulators with different kinematic structures, including variations in joint configurations, degrees of freedom, and workspace geometry (e.g., Panda $\rightarrow$ UR5, Sawyer). Under this setting, embodiment mismatches primarily arise at the motion level, such as discrepancies in joint limits, feasible trajectory waypoints, and inverse kinematics solutions. This setting evaluates whether a skill can be grounded to respect target-specific kinematic constraints while preserving its functional semantics. The SCoT baseline relies on structured intermediate representations to guide code synthesis, but resolving motion-level kinematic mismatches in this formulation still requires the sLM to synthesize kinematically valid code without localized modification targets, whereas RECENT identifies deterministic kinematic conflicts and resolves them through minimal, ontology-guided refactoring. Compared to SCoT, RECENT improves SR by 58.50 pp and GC by 43.38 pp, while reducing GO from 107.08% to 7.85%.

The *End-effector variation* setting is designed to evaluate adaptation under changes in the grasping mechanism. Embodiment mismatches arise from differences in end-effector interfaces, primarily involving substitutions of grasp and release primitives (e.g., parallel-jaw grasp versus suction attach and detach) and the corresponding low-level control APIs. This setting assesses whether end-effector-related adaptations can be localized while preserving functional semantics. Failures in this setting are often induced by contact outcomes and grasp uncertainty that emerge only at execution time. The RoboInspector baseline repairs such failures through full code regeneration based on categorized error feedback, which results in excessive token overhead, whereas RECENT restricts modifications to ontology-identified end-effector-related code spans. RECENT improves SR and GC by 43.33 pp and 25.04 pp, while reducing GO by 49.69 pp, EI by 4.38, and IT by 29.31 sec compared to the RoboInspector baseline.

### 5\.3Ablation study

Table 2: Ablation study on sLM family and model scale

Table [2](https://arxiv.org/html/2606.07999#S5.T2) reports the performance of RECENT across different sLM families (i.e., Qwen2.5-Coder (Hui et al., [2024](https://arxiv.org/html/2606.07999#bib.bib17)) and CodeGemma (Zhao et al., [2024](https://arxiv.org/html/2606.07999#bib.bib52))) and model sizes ranging from 1.5B to 7B. This ablation evaluates the robustness of our framework across varying sLM architectures and model capacity levels. Across model families, performance remains largely consistent in terms of SR, GO, and EI. When using 7B models, both sLM families achieve identical SR of 83.33%, while GO differs by only 0.58 pp (2.64% for Qwen2.5-Coder and 3.22% for CodeGemma) and EI by 0.11 (0.28 and 0.39), indicating that our framework is compatible with recently advancing code generation sLMs. IT shows larger variation across families and scales (2.62 sec for Qwen2.5-Coder 7B and 19.40 sec for CodeGemma 7B), reflecting differences in per-call inference latency across sLM implementations rather than differences in grounding behavior. When scaling down to lighter models, robustness is largely preserved at the 3B scale, whereas noticeable performance degradation emerges below 2B. Despite this degradation, even the smallest models consistently outperform baselines using 7B models, as shown in Table [1](https://arxiv.org/html/2606.07999#S5.T1). These results are enabled by RECENT's decomposed grounding process, which offloads deployment-time mismatch diagnosis from the sLM to the offline-constructed ontology, whose compatibility relations deterministically identify conflicts and localize refactoring targets. By reducing the reasoning burden on the sLM, this design decouples skill grounding from model-specific reasoning capacity, enabling consistent performance across model families and scales.

Table 3: Ablation study on offline skill validation

#### Ablation study on offline skill validation.

Table [3](https://arxiv.org/html/2606.07999#S5.T3) shows the effect of offline skill validation in RECENT. We ablate offline skill validation by replacing the repository with skills that have undergone grounding once but are not canonically validated. Under this setting, performance degrades, with SR decreasing by 10.00 pp, GO increasing by 6.46 pp, and EI increasing by 0.20, while IT remains comparable. These results arise because skills without offline validation do not guarantee the functional correctness of their semantic intent, which can manifest as deployment-time mismatches and require the sLM to perform refactoring beyond the localized scope identified by the ontology. Offline validation reduces deployment-time grounding cost and supports more stable long-horizon execution. Even so, RECENT without offline skill validation remains competitive with the baselines reported in Table [1](https://arxiv.org/html/2606.07999#S5.T1), reflecting the robustness of its deployment-time grounding procedure.

Table 4: Ablation study on RECENT

#### Ablation study on RECENT components.

Table [4](https://arxiv.org/html/2606.07999#S5.T4) reports the contribution of individual components of RECENT to overall performance. The *w/o. Ontology guidance* setting disables ontology-based guidance in phase (ii), forcing determined conflicts to be resolved through full code regeneration rather than FIM-based localized editing with ontology-derived substitution hints. Without ontology guidance, SR drops by 61.66 pp, showing that embodiment conflicts must be explicitly diagnosed and localized for effective grounding. The *$\rightarrow$ w/ API name similarity* setting restores localized FIM editing, but replaces ontology-derived substitutions with heuristic API-name matching. Although API-name similarity partially recovers performance, it still underperforms RECENT by 25.00 pp in SR, indicating that localized editing alone is insufficient when substitutions are not semantically grounded. This limitation is particularly evident for capability-level equivalences such as open\_gripper $\rightarrow$ deactivate\_vacuum. In the *w/o. In-situ adaptation* setting, anticipatory patching in phase (iii) is disabled, so environment-dependent mismatches are no longer resolved by patching yet-to-be-executed code spans during execution and are instead handled reactively after failures occur. Disabling in-situ adaptation decreases SR by 16.66 pp and increases EI by 0.40, showing that reactive recovery interrupts execution more often and reduces task completion reliability. The *$\rightarrow$ Full re-generation* setting replaces FIM-based localized editing in phases (ii) and (iii) with full code regeneration, while retaining ontology-guided diagnosis and validation. Without infilling, mismatches are resolved by regenerating the entire skill code rather than only the affected spans, increasing GO by 86.84 pp and IT by 89.31 sec while decreasing SR by 56.66 pp.
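The gap between name-based matching and capability-level substitution can be made concrete with a small sketch. Everything here is illustrative: the target API list, the `move_gripper` entry, and the two lookup tables are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for whatever name-similarity heuristic the ablation actually uses.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# Hypothetical API of a vacuum-gripper target embodiment.
TARGET_API: List[str] = ["activate_vacuum", "deactivate_vacuum", "move_gripper"]

# Hypothetical ontology fragment: source primitive -> abstract capability,
# and abstract capability -> primitive on the target embodiment.
SOURCE_CAPABILITY: Dict[str, str] = {"open_gripper": "release", "close_gripper": "grasp"}
TARGET_PRIMITIVE: Dict[str, str] = {"release": "deactivate_vacuum", "grasp": "activate_vacuum"}


def by_name_similarity(source: str, candidates: List[str]) -> str:
    # Pick the candidate whose name looks most like the source primitive.
    return max(candidates, key=lambda c: SequenceMatcher(None, source, c).ratio())


def by_ontology(source: str) -> str:
    # Resolve through the shared capability instead of the surface name.
    return TARGET_PRIMITIVE[SOURCE_CAPABILITY[source]]


name_pick = by_name_similarity("open_gripper", TARGET_API)
ontology_pick = by_ontology("open_gripper")
print(name_pick, ontology_pick)
```

In this toy setup the string heuristic is drawn to `move_gripper` because it shares the `_gripper` suffix, while `open_gripper` and `deactivate_vacuum` have almost no characters in common even though they play the same release role.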

## 6Conclusion

In this work, we present RECENT, a refactoring-centric agent framework that enables efficient skill grounding with sLMs under deployment constraints. By representing skills as executable code that explicitly separates invariant semantic intent from embodiment- and environment-specific execution bindings, RECENT enables code refactoring rather than full code regeneration during deployment. Extensive experiments demonstrate that RECENT significantly outperforms distilled sLM-based CaP baselines in success rate, grounding efficiency, and execution stability, while achieving task performance comparable to LLM-based CaP.

#### Limitation and future work\.

RECENT is scoped to deployment settings in which a shared and stable skill ontology is established through one-time offline construction and validation, and it performs deployment-time grounding within the resulting ontology. Within this scope, our framework supports execution-binding edits and structural adaptations of skill logic, as long as the required behavior remains expressible using primitives and capability relations already represented in the ontology. When the required capabilities are not represented in the ontology, ontology-level capability extension is needed. Rather than deployment-time grounding, these cases correspond to incremental skill learning (Lee et al., [2024](https://arxiv.org/html/2606.07999#bib.bib22)), where new capabilities must be acquired and integrated into the skill ontology. Extending RECENT to support such ontology-level expansion and persistent repository evolution remains an important future direction for enabling lifelong skill learning across evolving embodiments and deployment environments.

## Acknowledgement

This work was supported by Institute of Information & communications Technology Planning & Evaluation \(IITP\) grant funded by the Korea government \(MSIT\), \(RS\-2022\-II220043, Adaptive Personality for Intelligent Agents, RS\-2022\-II221045, Self\-directed multi\-modal Intelligence for solving unknown, open domain problems, RS\-2025\-02218768, Accelerated Insight Reasoning via Continual Learning, RS\-2025\-25442569, AI Star Fellowship Support Program \(Sungkyunkwan Univ\.\), RS\-2026\-25543726, Development of Leading Talent in Medical Domain\-Specific Generative AI, RS\-2026\-25528384, Resource\-Intensive AI Technologies Based on Sustainable GPU Integrated Platforms, RS\-2019\-II190421, Artificial Intelligence Graduate School Program \(Sungkyunkwan University\)\), the National Research Foundation of Korea \(NRF\) grant funded by the Korea government \(MSIT\) \(No\. RS\-2026\-25474409\), IITP\-ITRC \(Information Technology Research Center\) grant funded by the Korea government \(MSIT\) \(IITP\-2025\-RS\-2024\-00437633, 10%\), IITP\-ICT Creative Consilience Program grant funded by the Korea government \(MSIT\) \(IITP\-2026\-RS\-2020\-II201821, 10%\), the AI Computing Infrastructure Enhancement \(GPU Rental Support\) User Support Program funded by the Ministry of Science and ICT \(MSIT\), Republic of Korea \(No\. RQT\-25\-120157\), and by Samsung Electronics Co\., Ltd\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## References

- Ahn et al\. \(2025\)Ahn, S\., Choi, W\., Lee, J\., Park, J\., and Woo, H\.Towards reliable code\-as\-policies: A neuro\-symbolic framework for embodied task planning\.*arXiv preprint arXiv:2510\.21302*, 2025\.
- Akyürek et al\. \(2025\)Akyürek, E\., Damani, M\., Zweiger, A\., Qiu, L\., Guo, H\., Pari, J\., Kim, Y\., and Andreas, J\.The surprising effectiveness of test\-time training for few\-shot learning\.In*Proceedings of the 42nd International Conference on Machine Learning*, 2025\.URL[https://openreview\.net/forum?id=asgBo3FNdg](https://openreview.net/forum?id=asgBo3FNdg)\.Poster\.
- Bavarian et al\. \(2022\)Bavarian, M\., Jun, H\., Tezak, N\., Schulman, J\., McLeavey, C\., Tworek, J\., and Chen, M\.Efficient training of language models to fill in the middle\.*arXiv preprint arXiv:2207\.14255*, 2022\.
- Bouzenia et al\. \(2025\)Bouzenia, I\., Devanbu, P\., and Pradel, M\.Repairagent: An autonomous, llm\-based agent for program repair\.In*Proceedings of the IEEE/ACM 47th International Conference on Software Engineering*, 2025\.
- Brohan et al\. \(2023\)Brohan, A\., Chebotar, Y\., Finn, C\., Hausman, K\., Herzog, A\., Ho, D\., Ibarz, J\., Irpan, A\., Jang, E\., Julian, R\., et al\.Do as i can, not as i say: Grounding language in robotic affordances\.In*Proceedings of the 6th Conference on Robot Learning*, 2023\.
- Burns et al\. \(2024\)Burns, K\., Jain, A\., Go, K\., Xia, F\., Stark, M\., Schaal, S\., and Hausman, K\.Genchip: generating robot policy code for high\-precision and contact\-rich manipulation tasks\.In*Proceedings of the 37th IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\)*, pp\. 9596–9603\. IEEE, 2024\.
- Carion et al\. \(2026\)Carion, N\., Gustafson, L\., Hu, Y\.\-T\., et al\.Sam 3: Segment anything with concepts\.In*International Conference on Learning Representations*, 2026\.
- Chen et al\. \(2021\)Chen, M\., Tworek, J\., Jun, H\., Yuan, Q\., Pinto, H\. P\. D\. O\., Kaplan, J\., Edwards, H\., Burda, Y\., Joseph, N\., Brockman, G\., et al\.Evaluating large language models trained on code\.*arXiv preprint arXiv:2107\.03374*, 2021\.
- Choi et al\. \(2024\)Choi, W\., Kim, W\. K\., Yoo, M\., and Woo, H\.Embodied CoT distillation from LLM to off\-the\-shelf agents\.In*Proceedings of the 41st International Conference on Machine Learning*, 2024\.
- Doshi et al\. \(2024\)Doshi, R\., Walke, H\., Mees, O\., Dasari, S\., and Levine, S\.Scaling cross\-embodied learning: One policy for manipulation, navigation, locomotion and aviation\.*arXiv preprint arXiv:2408\.11812*, 2024\.
- Fang et al\. \(2023\)Fang, H\.\-S\., Wang, C\., Fang, H\., Gou, M\., Liu, J\., Yan, H\., Liu, W\., Xie, Y\., and Lu, C\.Anygrasp: Robust and efficient grasp perception in spatial and temporal domains\.*IEEE Transactions on Robotics*, 39\(5\):3929–3945, 2023\.
- Genesis Authors \(2024\)Genesis Authors\.Genesis: A universal and generative physics engine for robotics and beyond, 2024\.URL[https://genesis\-embodied\-ai\.github\.io/](https://genesis-embodied-ai.github.io/)\.
- Guo et al\. \(2024\)Guo, D\., Zhu, Q\., Yang, D\., Xie, Z\., Dong, K\., Zhang, W\., Chen, G\., Bi, X\., Wu, Y\., Li, Y\., et al\.Deepseek\-coder: When the large language model meets programming–the rise of code intelligence\.*arXiv preprint arXiv:2401\.14196*, 2024\.
- Huang et al\. \(2023a\)Huang, S\., Jiang, Z\., Dong, H\., Qiao, Y\., Gao, P\., and Li, H\.Instruct2act: Mapping multi\-modality instructions to robotic actions with large language model\.*arXiv preprint arXiv:2305\.11176*, 2023a\.
- Huang et al\. \(2022\)Huang, W\., Xia, F\., Xiao, T\., Chan, H\., Liang, J\., Florence, P\., Zeng, A\., Tompson, J\., Mordatch, I\., Chebotar, Y\., et al\.Inner monologue: Embodied reasoning through planning with language models\.*arXiv preprint arXiv:2207\.05608*, 2022\.
- Huang et al\. \(2023b\)Huang, W\., Wang, C\., Zhang, R\., Li, Y\., Wu, J\., and Fei\-Fei, L\.Voxposer: Composable 3d value maps for robotic manipulation with language models\.*arXiv preprint arXiv:2307\.05973*, 2023b\.
- Hui et al\. \(2024\)Hui, B\., Yang, J\., Cui, Z\., Yang, J\., Liu, D\., Zhang, L\., Liu, T\., Zhang, J\., Yu, B\., Lu, K\., et al\.Qwen2\. 5\-coder technical report\.*arXiv preprint arXiv:2409\.12186*, 2024\.
- James et al\. \(2020\)James, S\., Ma, Z\., Arrojo, D\. R\., and Davison, A\. J\.Rlbench: The robot learning benchmark & learning environment\.*IEEE Robotics and Automation Letters*, 2020\.
- Jiang et al\. \(2022\)Jiang, Y\., Gupta, A\., Zhang, Z\., Wang, G\., Dou, Y\., Chen, Y\., Fei\-Fei, L\., Anandkumar, A\., Zhu, Y\., and Fan, L\.Vima: General robot manipulation with multimodal prompts\.*arXiv preprint arXiv:2210\.03094*, 2022\.
- Kober & Peters \(2009\)Kober, J\. and Peters, J\.Learning motor primitives for robotics\.In*2009 IEEE International Conference on Robotics and Automation*, pp\. 2112–2118\. IEEE, 2009\.
- Kroemer et al\. \(2021\)Kroemer, O\., Niekum, S\., and Konidaris, G\.A review of robot learning for manipulation: Challenges, representations, and algorithms\.*Journal of machine learning research*, 22\(30\):1–82, 2021\.
- Lee et al\. \(2024\)Lee, D\., Yoo, M\., Kim, W\. K\., Choi, W\., and Woo, H\.Incremental learning of retrievable skills for efficient continual task adaptation\.In*Advances in neural information processing systems*, volume 37, pp\. 17286–17312, 2024\.
- Li et al\. \(2025a\)Li, F\., Tagkopoulos, P\., and Tagkopoulos, I\.Skillflow: Scalable and efficient agent skill retrieval system\.*arXiv preprint arXiv:2504\.06188*, 2025a\.
- Li et al\. \(2025b\)Li, J\., Li, G\., Li, Y\., and Jin, Z\.Structured chain\-of\-thought prompting for code generation\.*ACM Transactions on Software Engineering and Methodology*, 34\(2\):1–23, 2025b\.
- Li et al\. \(2024\)Li, Y\., Wang, L\., Piao, S\., Yang, B\.\-H\., Li, Z\., Zeng, W\., and Tsung, F\.Mccoder: streamlining motion control with llm\-assisted code generation and rigorous verification\.*arXiv preprint arXiv:2410\.15154*, 2024\.
- Liang et al\. \(2023\)Liang, J\., Huang, W\., Xia, F\., Xu, P\., Hausman, K\., Ichter, B\., Florence, P\., and Zeng, A\.Code as policies: Language model programs for embodied control\.In*2023 IEEE International Conference on Robotics and Automation \(ICRA\)*, pp\. 9493–9500\. IEEE, 2023\.
- Meng et al\. \(2025\)Meng, Y\., Sun, Z\., Fest, M\., Li, X\., Bing, Z\., and Knoll, A\.Growing with your embodied agent: A human\-in\-the\-loop lifelong code generation framework for long\-horizon manipulation skills\.*arXiv preprint arXiv:2509\.18597*, 2025\.
- Mu et al\. \(2024\)Mu, Y\., Chen, J\., Zhang, Q\., Chen, S\., Yu, Q\., Ge, C\., Chen, R\., Liang, Z\., Hu, M\., Tao, C\., Sun, P\., Yu, H\., Yang, C\., Shao, W\., Wang, W\., Dai, J\., Qiao, Y\., Ding, M\., and Luo, P\.Robocodex: multimodal code generation for robotic behavior synthesis\.In*Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, pp\. 36434–36454\. PMLR, 2024\.
- OpenAI \(2025\)OpenAI\.Addendum to gpt\-5\.2 system card: Gpt\-5\.2\-codex\.Technical report, December 2025\.URL[https://cdn\.openai\.com/pdf/ac7c37ae\-7f4c\-4442\-b741\-2eabdeaf77e0/oai\_5\_2\_Codex\.pdf](https://cdn.openai.com/pdf/ac7c37ae-7f4c-4442-b741-2eabdeaf77e0/oai_5_2_Codex.pdf)\.
- Rohmer et al\. \(2013\)Rohmer, E\., Singh, S\. P\. N\., and Freese, M\.Coppeliasim \(formerly v\-rep\): a versatile and scalable robot simulation framework\.In*Proc\. of The International Conference on Intelligent Robots and Systems \(IROS\)*, 2013\.
- Roziere et al\. \(2023\)Roziere, B\., Gehring, J\., Gloeckle, F\., Sootla, S\., Gat, I\., Tan, X\. E\., Adi, Y\., Liu, J\., Sauvestre, R\., Remez, T\., et al\.Code llama: Open foundation models for code\.*arXiv preprint arXiv:2308\.12950*, 2023\.
- Rozo et al\. \(2020\)Rozo, L\., Guo, M\., Kupcsik, A\. G\., Todescato, M\., Schillinger, P\., Giftthaler, M\., Ochs, M\., Spies, M\., Waniek, N\., Kesper, P\., et al\.Learning and sequencing of object\-centric manipulation skills for industrial tasks\.In*2020 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\)*, pp\. 9072–9079\. IEEE, 2020\.
- Rubin et al\. \(2022\) Rubin, O\., Herzig, J\., and Berant, J\. Learning to retrieve prompts for in\-context learning\. In *Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pp\. 2655–2671, Seattle, United States, July 2022\. Association for Computational Linguistics\. doi: 10\.18653/v1/2022\.naacl\-main\.191\. URL [https://aclanthology\.org/2022\.naacl\-main\.191/](https://aclanthology.org/2022.naacl-main.191/)\.
- Sarch et al\. \(2023\) Sarch, G\., Wu, Y\., Tarr, M\., and Fragkiadaki, K\. Open\-ended instructable embodied agents with memory\-augmented large language models\. In *Findings of the Association for Computational Linguistics: EMNLP 2023*, pp\. 3468–3500, 2023\.
- Singh et al\. \(2022\) Singh, I\., Blukis, V\., Mousavian, A\., Goyal, A\., Xu, D\., Tremblay, J\., Fox, D\., Thomason, J\., and Garg, A\. Progprompt: Generating situated robot task plans using large language models\. *arXiv preprint arXiv:2209\.11302*, 2022\.
- Song et al\. \(2023\) Song, C\. H\., Wu, J\., Washington, C\., Sadler, B\. M\., Chao, W\.\-L\., and Su, Y\. Llm\-planner: Few\-shot grounded planning for embodied agents with large language models\. In *Proceedings of the 19th IEEE/CVF International Conference on Computer Vision*, 2023\.
- Sundaralingam et al\. \(2023\) Sundaralingam, B\., Hari, S\. K\. S\., Fishman, A\., Garrett, C\., Wyk, K\. V\., Blukis, V\., Millane, A\., Oleynikova, H\., Handa, A\., Ramos, F\., Ratliff, N\., and Fox, D\. curobo: Parallelized collision\-free minimum\-jerk robot motion generation, 2023\.
- Sutton & Barto \(2018\) Sutton, R\. S\. and Barto, A\. G\. *Reinforcement learning: An introduction*\. MIT press, 2018\.
- Tenorth & Beetz \(2017\) Tenorth, M\. and Beetz, M\. Representations for robot knowledge in the knowrob framework\. *Artificial Intelligence*, 247:151–169, 2017\.
- Tziafas & Kasaei \(2024\) Tziafas, G\. and Kasaei, H\. Lifelong robot library learning: Bootstrapping composable and generalizable skills for embodied control with language models\. In *2024 IEEE International Conference on Robotics and Automation \(ICRA\)*, pp\. 515–522\. IEEE, 2024\.
- Vemprala et al\. \(2023\) Vemprala, S\., Bonatti, R\., Bucker, A\., and Kapoor, A\. Chatgpt for robotics: Design principles and model abilities\. *Published by Microsoft*, 2023\.
- Wang et al\. \(2023a\) Wang, G\., Xie, Y\., Jiang, Y\., Mandlekar, A\., Xiao, C\., Zhu, Y\., Fan, L\., and Anandkumar, A\. Voyager: An open\-ended embodied agent with large language models\. *arXiv preprint arXiv:2305\.16291*, 2023a\.
- Wang et al\. \(2024\) Wang, T\., Bhatt, D\., Wang, X\., and Atanasov, N\. Cross\-embodiment robot manipulation skill transfer using latent space alignment\. *arXiv preprint arXiv:2406\.01968*, 2024\.
- Wang et al\. \(2023b\) Wang, Y\., Gonzalez\-Pumariega, G\., Sharma, Y\., and Choudhury, S\. Demo2code: From summarizing demonstrations to synthesizing code via extended chain\-of\-thought\. *Advances in Neural Information Processing Systems*, 2023b\.
- Wang et al\. \(2023c\) Wang, Y\., Xian, Z\., Chen, F\., Wang, T\.\-H\., Wang, Y\., Fragkiadaki, K\., Erickson, Z\., Held, D\., and Gan, C\. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation\. *arXiv preprint arXiv:2311\.01455*, 2023c\.
- Wolf et al\. \(2019\) Wolf, T\., Debut, L\., Sanh, V\., Chaumond, J\., Delangue, C\., Moi, A\., Cistac, P\., Rault, T\., Louf, R\., Funtowicz, M\., et al\. Huggingface’s transformers: State\-of\-the\-art natural language processing\. *arXiv preprint arXiv:1910\.03771*, 2019\.
- Xia & Zhang \(2023\) Xia, C\. S\. and Zhang, L\. Keep the conversation going: Fixing 162 out of 337 bugs for $0\.42 each using chatgpt\. *arXiv preprint arXiv:2304\.00385*, 2023\.
- Xia et al\. \(2025\) Xia, C\. S\., Deng, Y\., Dunn, S\., and Zhang, L\. Demystifying llm\-based software engineering agents\. *Proc\. ACM Softw\. Eng\.*, 2025\.
- Xu et al\. \(2023\) Xu, M\., Xu, Z\., Chi, C\., Veloso, M\., and Song, S\. Xskill: Cross embodiment skill discovery\. In *Conference on robot learning*, pp\. 3536–3555\. PMLR, 2023\.
- Yang et al\. \(2025\) Yang, B\., Cai, Z\., Liu, F\., Le, B\., Zhang, L\., Bissyandé, T\. F\., Liu, Y\., and Tian, H\. A survey of llm\-based automated program repair: Taxonomies, design paradigms, and applications\. *arXiv preprint arXiv:2506\.23749*, 2025\.
- Ying et al\. \(2025\) Ying, C\., Du, L\., Cheng, P\., and Shu, Y\. Roboinspector: Unveiling the unreliability of policy code for llm\-enabled robotic manipulation\. *arXiv preprint arXiv:2508\.21378*, 2025\.
- Zhao et al\. \(2024\) Zhao, H\., Hui, J\., Howland, J\., Nguyen, N\., Zuo, S\., et al\. Codegemma: Open code models based on gemma\. *arXiv preprint arXiv:2406\.11409*, 2024\.
- Zhu et al\. \(2024\) Zhu, Q\., Guo, D\., Shao, Z\., Yang, D\., Wang, P\., Xu, R\., Wu, Y\., Li, Y\., Gao, H\., Ma, S\., et al\. Deepseek\-coder\-v2: Breaking the barrier of closed\-source models in code intelligence\. *arXiv preprint arXiv:2406\.11931*, 2024\.
- Zhu et al\. \(2025\) Zhu, Y\., Huang, D\., Lyu, H\., Zhang, X\., Li, C\., Shi, W\., Wu, Y\., Mu, J\., Wang, J\., Zhao, Y\., Jin, P\., Cheng, S\., Liang, S\., Zhang, X\., Zhang, R\., Du, Z\., Guo, Q\., Hu, X\., and Chen, Y\. Qimeng\-codev\-r1: Reasoning\-enhanced verilog generation\. In *Advances in Neural Information Processing Systems*\. NeurIPS, 2025\. Poster presentation\.

## Appendix A Implementation Details of the Approach

### A\.1 Skill Ontology Structure

Figure [4](https://arxiv.org/html/2606.07999#A1.F4) illustrates the schema of the skill ontology, which connects robots, capabilities, skills, and primitives used for deployment\-time reasoning\. Figure [5](https://arxiv.org/html/2606.07999#A1.F5) shows a partial instance graph of the skill ontology, where querying detects that the suction embodiment lacks the specific gripper interface required by `open_gripper` and infers a valid substitution to the embodiment\-compatible primitive `deactivate_vacuum`\.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x4.png)

Figure 4: Schema of the skill ontology, defining classes and relations among skills, capabilities, robot embodiments, and primitives\.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x5.png)

Figure 5: Partial instance graph of the skill ontology, illustrating ontology\-based reasoning, where a capability mismatch is detected and resolved by substituting an embodiment\-compatible primitive\.

### A\.2 Skill Ontology Construction

#### Pipeline inputs\.

We construct the ontology from robot embodiment descriptions, primitive API specifications, and reference skill code using a deterministic pipeline\. The ontology follows a shared schema adapted from prior robot knowledge representations \(Tenorth & Beetz, [2017](https://arxiv.org/html/2606.07999#bib.bib39)\), and uses a fixed set of alignment rules to derive relations between robot capabilities, primitive APIs, and skill requirements\. Each robot embodiment is annotated once with its embodiment profile and primitive API metadata, including available capabilities, parameter specifications, preconditions, and expected effects\. Specifically, the pipeline integrates three sources of information\. Robot embodiment specifications obtained from URDF and SRDF files provide robot structure information such as joints, links, planning groups, end\-effectors, and kinematic constraints\. Primitive API specifications defining executable robot interfaces are parsed to obtain callable interfaces, parameter signatures, capability requirements, preconditions, and expected effects\. Executable reference skill code is analyzed to identify primitive call sites, parameter bindings, and control\-flow structure\. Among these, URDF and SRDF files are standard artifacts typically shipped with the robot, while primitive API metadata is provided once per embodiment as a lightweight, structured annotation\.

#### Deterministic relation instantiation\.

All extracted information is represented as semantic triples within a shared ontology graph\. Based on these triples, relations such as `substitutableBy` and `functionalEquivalent` are instantiated automatically through deterministic compatibility matching rules\. For example, if two primitive APIs provide equivalent effects under compatible preconditions, the ontology instantiates a substitution relation between them\. This construction avoids task\-specific or robot\-pair\-specific substitution rules\. Once the robot and primitive specifications are provided, the same pipeline instantiates the ontology relations automatically\. The ontology additionally performs capability inference using deterministic rules defined over embodiment and API specifications\. For example, robots equipped with valid planning groups and end\-effectors are assigned motion\-planning capabilities, while gripper\-equipped embodiments are assigned grasp\-related capabilities\. Optionally, an sLM may provide auxiliary capability suggestions using primitive descriptions, but all inferred relations are validated against the existing ontology constraints before incorporation\.
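
The effect/precondition matching rule described above can be illustrated with a minimal sketch. The `Primitive` record, its field names, the embodiment labels, and the concrete compatibility test (the substitute requires nothing beyond what the original already requires) are assumptions made for illustration; the paper's actual rule set is not reproduced here.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Primitive:
    """Hypothetical primitive API record; field names are illustrative."""
    name: str
    embodiment: str
    preconditions: frozenset
    effects: frozenset


def instantiate_substitutions(primitives):
    """Derive (subject, predicate, object) triples with one generic rule."""
    triples = set()
    for a, b in permutations(primitives, 2):
        same_effects = a.effects == b.effects
        # Assumed compatibility test: b needs nothing beyond what a requires.
        compatible = b.preconditions <= a.preconditions
        if a.embodiment != b.embodiment and same_effects and compatible:
            triples.add((a.name, "substitutableBy", b.name))
    return triples


primitives = [
    Primitive("close_gripper", "parallel_jaw",
              frozenset({"ee_at_object"}), frozenset({"object_held"})),
    Primitive("activate_vacuum", "suction",
              frozenset({"ee_at_object"}), frozenset({"object_held"})),
    Primitive("open_gripper", "parallel_jaw",
              frozenset({"object_held"}), frozenset({"object_released"})),
    Primitive("deactivate_vacuum", "suction",
              frozenset({"object_held"}), frozenset({"object_released"})),
]
triples = instantiate_substitutions(primitives)
```

Because the rule is stated over specifications rather than over robot pairs, adding a new embodiment in this sketch only means appending its `Primitive` records.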

#### One\-time extension cost\.

Once the robot embodiment and primitive specifications are provided, ontology construction is fully automated and implemented as a deterministic Python pipeline, requiring approximately 0\.08 seconds across all experiments without task\-specific manual edits\. This confines the extension effort for a new embodiment to providing its URDF/SRDF files and primitive API metadata, without requiring task\-specific or robot\-pair\-specific substitution rules\. Moreover, because the ontology is shared across tasks, adding new skills does not require modifying the ontology itself, and newly generated skills can be directly validated and inserted into the offline skill repository\.

### A\.3 Skill Ontology Scale

#### Ontology composition\.

Table [5](https://arxiv.org/html/2606.07999#A1.T5) summarizes the scale of the instantiated ontology used across all experiments\. Each entry in the ontology is represented as a *\(subject, predicate, object\)* triple, encoding relations among robots, primitives, capabilities, parameters, preconditions, and effects\. Although compact, the ontology covers multiple robot embodiments, primitive APIs, capability annotations, and primitive\-level specifications, including parameters, preconditions, and expected effects\. These relations provide the ontology\-level evidence used to localize and guide deployment\-time refactoring\.

Table 5: Scale of the instantiated ontology used in our experiments\.

#### Coverage of derived relations\.

The 40 substitution and equivalence relations reported in Table [5](https://arxiv.org/html/2606.07999#A1.T5) span both cross\-embodiment primitive substitutions and within\-embodiment functional equivalences over the 40 primitive APIs and 29 capabilities\. Cross\-embodiment relations connect primitives across grippers and manipulators with different interfaces, such as `close_gripper` $\leftrightarrow$ `activate_vacuum` for grasp acquisition and `open_gripper` $\leftrightarrow$ `deactivate_vacuum` for release\. Together with the 1,232 semantic triples encoding robot\-capability, primitive\-precondition, and primitive\-effect bindings, these relations provide sufficient coverage for the deployment\-time grounding scenarios evaluated in our experiments\. All relations are derived by the ontology construction pipeline described in Appendix [A\.2](https://arxiv.org/html/2606.07999#A1.SS2) rather than authored per task or robot pair\.

### A\.4 Refactoring Pipeline Details

We provide additional implementation details of the refactoring pipeline used in RECENT\. The goal of this pipeline is not to regenerate an entire skill program, but to refactor only localized execution bindings that are inconsistent with the target embodiment or the current environment\. At a high level, RECENT first detects a deployment\-time mismatch through ontology\-based compatibility analysis or execution\-time unit tests, converts it into a structured infilling task augmented with ontology\-derived substitution hints, and resolves it through FIM\-based span replacement\. The patched code is then validated and either executed or further adapted using runtime observations\. This unified procedure is used for both pre\-execution embodiment grounding \(Section [4\.2](https://arxiv.org/html/2606.07999#S4.SS2)\) and in\-situ adaptation during execution \(Section [4\.3](https://arxiv.org/html/2606.07999#S4.SS3)\)\. The implementation, execution scripts, and configuration files used in our experiments are included in the supplementary material\. Below, we describe the four stages of the pipeline: \(1\) detecting a mismatch and diagnosing it as a determined conflict or an undetermined warning, \(2\) constructing an intermediate refactoring unit from the detected mismatch, \(3\) converting it into an FIM prompt and generating a replacement span with the sLM, and \(4\) validating the patched code and adapting it further during execution\.

#### Mismatch detection and diagnosis\.

Before any code refactoring, RECENT identifies where and why the current skill code is inconsistent with the deployment context\. At pre\-execution time, the ontology\-based diagnosis operator $\mathcal{D}(\tilde{\chi}, r^{\text{tgt}}) = \langle \Delta_{\text{det}}, \Delta_{\text{und}} \rangle$ \(Eq\. [4](https://arxiv.org/html/2606.07999#S4.E4)\) compares the skill’s required capabilities against the capability profile of $r^{\text{tgt}}$, decomposing unmet capability requirements into determined conflicts $\Delta_{\text{det}}$, which can be resolved by ontology\-derived substitutions, and undetermined, environment\-contingent warnings $\Delta_{\text{und}}$, whose satisfaction depends on execution\-time conditions\. At execution time, each warning $u \in \Delta_{\text{und}}$ is instantiated as an autonomous unit test that monitors the corresponding precondition or effect against incoming observations $o_t$\. When such a unit test is violated, the corresponding undetermined factor is determined and converted into an observation\-conditioned infilling task $u \xrightarrow{o_t} (m, o_t)$ \(Eq\. [7](https://arxiv.org/html/2606.07999#S4.E7)\)\. Each detected mismatch, whether from pre\-execution diagnosis or in\-situ violation, is then passed to the next stage along with its ontology\-derived substitution hints \(for $\Delta_{\text{det}}$\) or its triggering observation \(for $\Delta_{\text{und}}$\)\.
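
As a rough illustration of the pre\-execution split into determined conflicts and undetermined warnings, the sketch below uses a hypothetical encoding: the capability profile maps each capability to `True` (supported), `False` (statically missing), or `None` (only decidable from execution\-time observations). The capability names, this three\-valued encoding, and the handling of unresolvable conflicts are assumptions, not the paper's data model.

```python
def diagnose(skill_calls, profile, substitutes):
    """Split unmet requirements into (determined, undetermined) lists.

    skill_calls: list of (primitive_name, required_capability) pairs.
    profile:     capability -> True / False / None (None = environment-contingent).
    substitutes: primitive_name -> ontology-derived substitute primitive.
    """
    determined, undetermined = [], []
    for primitive, capability in skill_calls:
        status = profile.get(capability, False)
        if status is True:
            continue  # requirement already met by the target embodiment
        if status is None:
            # Cannot be settled before execution: becomes a runtime unit test.
            undetermined.append((primitive, capability))
        elif primitive in substitutes:
            determined.append((primitive, substitutes[primitive]))
        else:
            # The source does not describe this case; the sketch simply fails.
            raise LookupError(f"no substitute known for {primitive}")
    return determined, undetermined


skill_calls = [
    ("move", "motion_planning"),
    ("close_gripper", "finger_grasp"),
    ("place", "reachable_target"),
    ("open_gripper", "finger_grasp"),
]
profile = {"motion_planning": True, "finger_grasp": False, "reachable_target": None}
substitutes = {"close_gripper": "activate_vacuum", "open_gripper": "deactivate_vacuum"}
delta_det, delta_und = diagnose(skill_calls, profile, substitutes)
```

Here `delta_det` carries the substitution hint for each conflicting call, while `delta_und` lists the conditions that would be monitored during execution.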

#### Intermediate refactoring unit\.

Given a mismatch detected in the previous stage, RECENT converts it into a structured intermediate refactoring unit\. Each unit is extracted from the original skill code and consists of the target code span, its surrounding context, the span type, and grounding metadata:

$$z = \langle m, \psi_{\mathrm{pre}}, \psi_{\mathrm{suf}}, \rho, \kappa \rangle,$$

where $m$ denotes the editable target span, $\psi_{\mathrm{pre}}$ and $\psi_{\mathrm{suf}}$ denote the prefix and suffix surrounding the span, $\rho$ denotes the span type, and $\kappa$ denotes structured grounding metadata\. The span type $\rho$ specifies whether the target corresponds to an API substitution site, an interface or parameter binding point, a preparatory step, or a warning\-triggering statement\. The metadata $\kappa$ includes the target embodiment, ontology\-derived substitute APIs, interface constraints, parameter constraints, and, when applicable, the execution\-time observation that triggered the patch\. For determined conflicts, the span $m$ is typically an API call, argument binding, or wrapper interface whose substitute can be inferred from the ontology\. For undetermined warnings, the span $m$ is selected from yet\-to\-be\-executed statements whose expected preconditions or effects are violated by execution\-time observations\. This representation allows the sLM to operate on localized execution bindings while preserving the global control structure of the original skill\.
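
A minimal sketch of how such a unit could be held in code is given below. The class name, field names, the line\-granular span selection, and the example skill are illustrative assumptions; the source only specifies the five components of the tuple.

```python
from dataclasses import dataclass, field


@dataclass
class RefactoringUnit:
    span: str        # m: editable target span
    prefix: str      # psi_pre: code before the span
    suffix: str      # psi_suf: code after the span
    span_type: str   # rho: e.g. "api_substitution"
    metadata: dict = field(default_factory=dict)  # kappa: grounding hints


def extract_unit(code, line_no, span_type, metadata):
    """Cut one line (0-indexed) out of the skill code as the target span."""
    lines = code.splitlines(keepends=True)
    return RefactoringUnit(
        span=lines[line_no],
        prefix="".join(lines[:line_no]),
        suffix="".join(lines[line_no + 1:]),
        span_type=span_type,
        metadata=metadata,
    )


SKILL = (
    "def place_block(env, task, pos):\n"
    "    move(env, task, target_pos=pos)\n"
    "    open_gripper(env, task)\n"
)
unit = extract_unit(
    SKILL, 2, "api_substitution",
    {"target_embodiment": "suction", "substitute_api": "deactivate_vacuum"},
)
```

Concatenating `prefix`, `span`, and `suffix` recovers the original skill, which is what lets a patched span be reinserted without touching the surrounding control flow.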

#### FIM prompt construction and span generation\.

The refactoring unit $z$ is then converted into a localized FIM\-based infilling task\. The sLM receives $\psi_{\mathrm{pre}}$ and $\psi_{\mathrm{suf}}$ as the prefix and suffix of the FIM template, together with the structured grounding hints in $\kappa$ rendered as inline comments preceding the target span\. These hints include the target embodiment, ontology\-derived substitute APIs, interface and parameter constraints, and, for in\-situ adaptation, the execution\-time observation that triggered the patch\. The span type $\rho$ further controls how the hint comments are organized, so that the sLM is presented with the minimal set of grounding cues relevant to the current edit\. Given this prompt, the sLM generates a replacement span $m'$, which is constrained to fill the region between $\psi_{\mathrm{pre}}$ and $\psi_{\mathrm{suf}}$\. This keeps generation length proportional to the size of the edit rather than the size of the skill, and prevents the sLM from rewriting unrelated code regions\.
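
The prompt layout described above might look like the following sketch. The sentinel strings are placeholders, since FIM tokens differ between code models, and the comment format for the hints is an assumption rather than the template used in the paper.

```python
def build_fim_prompt(prefix, suffix, hints, indent="    ",
                     fim_prefix="<|fim_prefix|>",
                     fim_suffix="<|fim_suffix|>",
                     fim_middle="<|fim_middle|>"):
    """Assemble a prefix-suffix-middle prompt with hints as inline comments.

    The hint comments are appended to the prefix so that they directly
    precede the span the model is asked to fill in.
    """
    hint_block = "".join(f"{indent}# {key}: {value}\n" for key, value in hints.items())
    return f"{fim_prefix}{prefix}{hint_block}{fim_suffix}{suffix}{fim_middle}"


prompt = build_fim_prompt(
    prefix="def place_block(env, task, pos):\n    move(env, task, target_pos=pos)\n",
    suffix="    return True\n",
    hints={
        "target embodiment": "suction",
        "substitute API": "deactivate_vacuum(env, task)",
    },
)
```

Only the text generated after the final sentinel is taken as the replacement span, so the output length is bounded by the edit rather than by the skill.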

#### In\-situ validation and patching\.

The generated span $m'$ is reinserted into the skill code at the position of $m$, producing a patched skill\. In our implementation, the patched code is then re\-evaluated against the unit tests that originally detected the violation \(Section [4\.3](https://arxiv.org/html/2606.07999#S4.SS3)\), using the latest execution\-time observation to re\-check the relevant precondition or effect\. If the patched span passes this check, execution resumes from the same point without restarting prior steps, preserving the long\-horizon execution context\. If it fails, the corresponding undetermined factor is treated as an immediate patching target, and a new refactoring unit is constructed by repeating the previous two stages on a narrower or adjacent span\. This loop terminates either when validation succeeds or when consecutive failed patches reach a preset limit, at which point the skill is reported as unrecoverable\.
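
The validate\-and\-retry loop can be sketched as below. This is a simplification under stated assumptions: the span is kept fixed across attempts instead of being narrowed or shifted, `generate_span` is a stand\-in for the sLM call, `unit_test` is a stand\-in for the observation\-based check, and the limit of three failures is arbitrary.

```python
def patch_until_valid(code, span, generate_span, unit_test, max_failures=3):
    """Return the patched code, or None if the skill is deemed unrecoverable."""
    for attempt in range(max_failures):
        candidate = generate_span(code, span, attempt)
        patched = code.replace(span, candidate, 1)  # reinsert at the span position
        if unit_test(patched):
            return patched  # execution can resume from the same point
    return None  # consecutive failures reached the preset limit


code = "move(env, task, target_pos=pos)\nopen_gripper(env, task)\n"
candidates = ["release(env, task)", "deactivate_vacuum(env, task)"]


def stub_slm(code, span, attempt):
    # Stand-in for the sLM: returns a canned candidate per attempt.
    return candidates[attempt]


def stub_test(patched):
    # Stand-in for the unit test that originally flagged the violation.
    return "deactivate_vacuum(" in patched


recovered = patch_until_valid(code, "open_gripper(env, task)", stub_slm, stub_test)
gave_up = patch_until_valid(code, "open_gripper(env, task)", stub_slm, stub_test,
                            max_failures=1)
```

In the stubbed run the first candidate fails validation and the second passes, while a limit of one failure reports the skill as unrecoverable.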

## Appendix B Baseline Implementations

### B\.1 CaP\-variants

#### Code\-as\-Policies\.

We implement Code\-as\-Policies \(CaP\) as a minimal lower\-bound baseline by directly introducing code generation into an embodied agent without any additional reasoning, planning, or execution\-aware mechanisms\. The agent is prompted with a task description, a list of available primitive skills, and a small number of in\-context examples, and generates a complete executable policy as a single Python program\. The generated code is executed directly in the simulator without intermediate validation or localization\. Upon execution failure, simulator error messages are appended to the prompt and the entire program is regenerated, without performing partial edits or structured repair\.
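
For contrast with localized refactoring, the full\-program regeneration protocol can be sketched as follows. The function names, the error\-comment format, and the attempt limit are assumptions; `fake_lm` and `fake_sim` are stubs standing in for the language model and the simulator.

```python
def cap_regenerate(base_prompt, generate, execute, max_attempts=3):
    """Regenerate the whole program after each failure, feeding errors back."""
    prompt = base_prompt
    for attempt in range(1, max_attempts + 1):
        program = generate(prompt)       # complete policy, not a partial edit
        ok, error = execute(program)
        if ok:
            return program, attempt
        prompt += f"\n# Simulator error: {error}"
    return None, max_attempts


def fake_lm(prompt):
    # Stub: only produces the target-compatible call once it has seen an error.
    if "Simulator error" in prompt:
        return "open_ur5_ee(env, task)"
    return "open_gripper(env, task)"


def fake_sim(program):
    # Stub: the target robot only exposes open_ur5_ee.
    if program.startswith("open_ur5_ee"):
        return True, ""
    return False, "NameError: open_gripper is not defined"


program, attempts = cap_regenerate("Task: release the block", fake_lm, fake_sim)
```

Every retry in this loop pays for a complete program, which is the cost that the grounding overhead metric is designed to expose.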

#### CaP\-CodeV\-R1\.

CaP\-CodeV\-R1 follows the same implementation as the CaP baseline while replacing the backbone language model with `zhuyaoyu/CodeV-R1-Distill-Qwen-7B`\. The prompting format, in\-context examples, and full\-program regeneration strategy are kept identical to CaP, ensuring that the overall framework and interaction protocol remain unchanged\. This baseline isolates the effect of a distilled code\-oriented language model with limited capacity, without introducing any additional embodied reasoning or execution\-aware mechanisms\.

#### SCoT\.

We implement SCoT as a baseline that produces structured intermediate outputs prior to code generation\. Instead of relying on external symbolic resources, the agent generates structured representations that specify primitive skill substitutions and parameter mappings before synthesizing an executable policy\. The final code is generated based on this intermediate structure and executed directly in the simulator\. This implementation follows the original SCoT formulation of generating structured guidance before code synthesis, without applying localized code modification or execution\-aware refinement\.

#### CaP\-Codex\.

CaP\-Codex follows the same implementation as the CaP baseline, while using `GPT-5.2-Codex` as the backbone language model with its default reasoning setting\. The overall prompting structure and full\-program generation protocol are largely aligned with CaP, but we provide richer and less\-processed inputs that are feasible for a stronger model to interpret, including raw numerical state information \(e\.g\., object positions and quaternions\) and unfiltered simulator error logs\. Unlike small language models, Codex can robustly handle these raw signals without inducing severe hallucination or performance degradation\. This baseline evaluates the effect of a large\-scale, code\-specialized model within the CaP paradigm, without introducing additional planning, localization, or execution\-aware refinement mechanisms\.

### B\.2 Embodied Code Agent

#### ProgPrompt\.

We adapt ProgPrompt as a one\-shot code generation baseline by providing the source robot’s reference code as an in\-context example within the prompt\. The agent is provided with a Pythonic program header that imports available actions and their expected parameters, along with a list of environment objects, enabling the agent to generate executable task plans as function implementations\. The agent is instructed to translate the reference code from the source robot’s primitive skills to the target robot’s primitive skills while preserving task logic\. Generated code is executed directly in the simulator\.

#### RoboInspector\.

We adopt RoboInspector’s behavior detection taxonomy \(Nonsense, Disorder, Infeasible, Badpose\) as a pre\-execution filter for generated code\. The detector analyzes code structure, action sequences, and constraint patterns before simulation\. When unreliable behaviors are detected or execution fails, the feedback refiner regenerates code with categorized error descriptions, enabling behavior\-aware repair prompts\.

### B\.3 Automatic Program Repair Agents

#### RepairAgent\.

We adopt RepairAgent as a baseline framework by implementing its core components for embodied policy repair, while keeping state transitions and tool execution programmatic\. In our implementation, the finite state machine deterministically switches between generate, repair, and revise based on simulator execution outcomes\. A middleware executes the generated policy in the embodied environment and exposes structured signals as tool outputs\. These outputs are incrementally appended to a dynamic prompt, together with the current scene and task constraints, enabling failure\-localized edits\. As a result, the repeated execution and repair state transition can become a bottleneck in long\-horizon tasks under noisy simulator feedback\.

#### Agentless\.

We implement Agentless with CaP by constructing constraint\-aware prompts that include robot\-specific primitive skill specifications, available skill primitives, scene object information, and workspace bounds\. The prompt explicitly specifies physical constraints and available actions, allowing the agent to generate executable code without iterative refinement\. Upon execution failure, error feedback is appended to the prompt and the code is regenerated at an appropriate granularity, targeting individual code lines, functions, or entire files depending on the localization result\.

## Appendix C Experimental Setting

We evaluate the agents under four evaluation settings spanning two types of embodiment mismatches: \(1\) *Kinematic variation* and \(2\) *End\-effector variation*, all implemented in simulation environments\. We further evaluate the agents in real\-world environments, which are described in a separate subsection\. Table [6](https://arxiv.org/html/2606.07999#A3.T6) summarizes their key differences, and we describe each setting in detail in the following sections\.

Table 6: Evaluation settings in simulation\.

### C\.1 Kinematic variation

In the *Kinematic variation* setting, the agent is evaluated on its ability to adapt base skills to target environments with different kinematic structures, such as link configurations, joint layouts, or degrees of freedom\. This requires the agent to abstract task\-relevant motion patterns from embodiment\-specific details and re\-instantiate them under altered morphological constraints\.

#### Evaluation Scenario\.

We instantiate two evaluation scenarios for the *Kinematic variation* setting, using the Franka Emika Panda \(Panda\) as the source embodiment and evaluating transfer to the UR5 and Sawyer manipulators, respectively\. Example scenes for the scenarios are shown in Figure [6](https://arxiv.org/html/2606.07999#A3.F6), and a summary is provided in Table [7](https://arxiv.org/html/2606.07999#A3.T7)\.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x6.png)

\(a\) Example of Panda \(source\)

![Refer to caption](https://arxiv.org/html/2606.07999v1/x7.png)

\(b\) Example of UR5 \(target\)

![Refer to caption](https://arxiv.org/html/2606.07999v1/x8.png)

\(c\) Example of Panda \(source\)

![Refer to caption](https://arxiv.org/html/2606.07999v1/x9.png)

\(d\) Example of Sawyer \(target\)

Figure 6: Example scenes of kinematic variation scenarios\.

Table 7: Summary of kinematic variation scenarios\.

#### Task Settings\.

We construct each task as a continual stream of up to five subtasks sampled from a pool of four subtask groups\. Each subtask group targets a distinct aspect of manipulation commonly evaluated in embodied intelligence benchmarks\. By composing these subtasks into a stream, we generate long\-horizon tasks that require diverse manipulation skills\. The subtask groups are as follows:

- Single\-Object Manipulation\. The agent manipulates a single target object in isolation\. Tasks in this category focus on fundamental motor skills such as grasping, lifting, and repositioning\.
- Inter\-Object Manipulation\. The agent manipulates objects in relation to other objects in the scene\. This requires understanding spatial relationships and coordinating actions that involve multiple entities, such as placing one object inside or on top of another\.
- Precision Interaction\. The agent performs tasks that demand high accuracy in end\-effector positioning and orientation\. Small deviations can lead to task failure, requiring fine\-grained control over pose and motion trajectories\.
- Tool\-use\. The agent leverages tools to accomplish goals that cannot be achieved through direct manipulation\. The agent must recognize and effectively use the tool to accomplish the task\.

Each subtask group consists of several individual subtasks, distinguished by their objectives and scene layouts\. We implement the subtasks for the *Kinematic variation* setting using the CoppeliaSim engine \(Rohmer et al\., [2013](https://arxiv.org/html/2606.07999#bib.bib30)\)\. The individual subtasks for each group are listed in Table [8](https://arxiv.org/html/2606.07999#A3.T8)\.

Table 8: Description of kinematic variation subtasks\.

#### Primitive APIs\.

Table [9](https://arxiv.org/html/2606.07999#A3.T9) summarizes the set of executable primitive APIs and their corresponding instruction templates designed for various robotic platforms within the RLBench environment\. These APIs provide a structured interface for robot control at the function level, such as picking, placing, and alignment\.

Table 9: Executable APIs in RLBench\.

| Kinematic Type | Template | Example API |
| --- | --- | --- |
| Panda | Pick \[Object\] | `pick(env, task, target_pos=[x,y,z], approach_axis='z')` |
| Panda | Place \[Receptacle Object\] | `place(env, task, target_pos=[x,y,z], approach_axis='z')` |
| Panda | Move \[Object\] | `move(env, task, target_pos=[x,y,z], timeout=FLOAT)` |
| Panda | Push \[Object\] | `push(env, task, target_pos=[x,y,z], approach_axis='z')` |
| Panda | Align To Quaternion \[Object\] | `align_to_quaternion(env, task, quaternion=q, ...)` |
| Panda | Align Two Axes | `align_two_axes(env, task, axis_dirs=(1,1), ...)` |
| Panda | Open Gripper | `open_gripper(env, task)` |
| Panda | Close Gripper | `close_gripper(env, task)` |
| UR5 | Ur5 Grasp At \[Object\] | `ur5_grasp_at(env, task, grasp_pos=[x,y,z], timeout_s=5.0)` |
| UR5 | Ur5 Release At \[Receptacle Object\] | `ur5_release_at(env, task, place_pos=[x,y,z], timeout_s=5.0)` |
| UR5 | Ur5 Move To \[Object\] | `ur5_move_to(env, task, target_pos=[x,y,z], timeout_s=3.0)` |
| UR5 | Ur5 Align Gripper \[Object\] | `ur5_align_gripper(env, task, reference_quat=q, ...)` |
| UR5 | Open Ur5 EE | `open_ur5_ee(env, task, gripper_open=1.0, velocity=0.2)` |
| UR5 | Close Ur5 EE | `close_ur5_ee(env, task, gripper_close=0.0, velocity=0.2)` |
| Sawyer | Sawyer Pick \[Object\] | `sawyer_pick(env, task, target_object=obj, target_pos=[x,y,z])` |
| Sawyer | Sawyer Place \[Receptacle Object\] | `sawyer_place(env, task, place_pos=[x,y,z])` |
| Sawyer | Sawyer Move To \[Object\] | `sawyer_move_to(env, task, target_pos=[x,y,z])` |
| Sawyer | Sawyer Align Gripper \[Object\] | `sawyer_align_gripper(env, task, reference_quat=q, ...)` |
| Sawyer | Sawyer Open Gripper | `sawyer_open_gripper(env, task, amount=1.0, velocity=0.2)` |
| Sawyer | Sawyer Close Gripper | `sawyer_close_gripper(env, task, amount=0.0, velocity=0.2)` |

### C\.2 End\-effector variation

In the *End\-effector variation* setting, the agent is evaluated on its ability to adapt base skills to target environments with different grasping mechanisms, such as parallel\-jaw grippers versus vacuum grippers\. This requires the agent to generalize manipulation intent beyond specific grasp geometries and to adjust grip strategies and approach vectors accordingly\.

#### Evaluation Scenario\.

We instantiate two evaluation scenarios for the *End\-effector variation* setting, using the Franka Emika Panda with its default parallel\-jaw gripper, Franka Hand, as the source embodiment\. The first scenario evaluates transfer to a target environment with identical kinematics but equipped with a vacuum gripper; the second evaluates transfer to a Panda manipulator with a Robotiq 2F\-85 gripper\. Example scenes for the scenarios are shown in Figure [7](https://arxiv.org/html/2606.07999#A3.F7), and a summary is provided in Table [10](https://arxiv.org/html/2606.07999#A3.T10)\.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x10.png)

\(a\) Example of parallel\-jaw gripper \(source\)

![Refer to caption](https://arxiv.org/html/2606.07999v1/x11.png)

\(b\) Example of 2F\-85 gripper \(target\)

![Refer to caption](https://arxiv.org/html/2606.07999v1/x12.png)

\(c\) Example of parallel\-jaw gripper \(source\)

![Refer to caption](https://arxiv.org/html/2606.07999v1/x13.png)

\(d\) Example of vacuum gripper \(target\)

Figure 7: Example scenes of end\-effector variation scenarios\.

Table 10: Summary of end\-effector variation scenarios\.

#### Task Settings\.

We construct each task as a continual stream of up to five subtasks sampled from a pool of four subtask groups, akin to the *Kinematic variation* setting\. The specific subtask groups are as follows:

- Object Placement\. The agent picks up a target object and places it at a specified location\. This involves identifying the object, grasping it appropriately, and positioning it according to spatial instructions\.
- Articulated Object Interaction\. The agent interacts with objects that have movable joints, such as doors or drawers\. The agent must understand the kinematic constraints and apply appropriate forces along the correct axis of motion\.
- Scene Rearrangement\. The agent rearranges multiple objects to match a target configuration\. This involves planning a sequence of pick\-and\-place actions while considering dependencies and spatial relationships among objects\.
- Tool\-use\. The agent leverages tools to accomplish goals that cannot be achieved through direct manipulation\. The agent must recognize and effectively use the tool to accomplish the task\.

Each subtask group consists of several individual subtasks, distinguished by their objectives and scene layouts\. We implement the subtasks for the *End\-effector variation* setting using the Genesis simulator \(Genesis Authors, [2024](https://arxiv.org/html/2606.07999#bib.bib12)\)\. The individual subtasks for each group are listed in Table [11](https://arxiv.org/html/2606.07999#A3.T11)\.

Table 11: Description of end\-effector variation subtasks\.

#### Primitive APIs\.

Table [12](https://arxiv.org/html/2606.07999#A3.T12) summarizes the set of executable primitive APIs and their corresponding instruction templates designed for various robotic platforms within the Genesis environment\. These APIs provide a structured interface for robot control at the function level, such as picking, placing, and alignment\.

Table 12: Executable APIs in Genesis\.

### C\.3 Evaluation Metrics

We evaluate RECENT and all baselines using five metrics that capture both task performance and computational efficiency during skill grounding\. Let $\mathcal{T} = \{t_1, t_2, \ldots, t_N\}$ denote the set of evaluation tasks, where each task $t_i$ consists of a sequence of subtasks $\{s_{i,1}, s_{i,2}, \ldots, s_{i,M_i}\}$\.

Success Rate \(SR% $\uparrow$\)\. The task\-level success rate measures the proportion of tasks that are fully completed\. A task $t_i$ is considered successful if and only if all of its constituent subtasks are executed successfully:

$$\mathrm{SR} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[ \bigwedge_{j=1}^{M_i} \texttt{success}(s_{i,j}) \right], \tag{9}$$

where $\mathbf{1}[\cdot]$ is the indicator function and $\texttt{success}(s_{i,j})$ returns true if subtask $s_{i,j}$ is completed successfully\.

Goal\-Conditioned Success Rate \(GC% $\uparrow$\)\. The subtask\-level success rate measures the proportion of individual subtasks that are completed successfully, providing a finer\-grained measure of execution progress:

$$\mathrm{GC} = \frac{1}{\sum_{i=1}^{N} M_i} \sum_{i=1}^{N} \sum_{j=1}^{M_i} \mathbf{1}\left[ \texttt{success}(s_{i,j}) \right]. \tag{10}$$
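
The difference between the task\-level and subtask\-level success rates can be seen on a small worked example. The sketch below assumes each task is recorded as a list of per\-subtask booleans; the outcomes are made up for illustration and the functions return fractions, which are multiplied by 100 when reported as percentages.

```python
def success_rate(results):
    """SR: fraction of tasks whose subtasks all succeeded."""
    return sum(all(task) for task in results) / len(results)


def goal_conditioned_success(results):
    """GC: fraction of individual subtasks that succeeded."""
    total_subtasks = sum(len(task) for task in results)
    return sum(sum(task) for task in results) / total_subtasks


# Three hypothetical tasks with 3, 4, and 2 subtasks.
results = [
    [True, True, True],
    [True, False, True, True],
    [True, True],
]
sr = success_rate(results)              # 2 of 3 tasks fully completed
gc = goal_conditioned_success(results)  # 8 of 9 subtasks completed
```

A single failed subtask removes the whole second task from SR while costing GC only one subtask out of nine, which is why GC gives the finer\-grained view of long\-horizon progress.
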
Grounding Overhead \(GO% $\downarrow$\)\. The grounding overhead quantifies the additional computational cost incurred during skill adaptation relative to the original skill complexity\. For each subtask $s_{i,j}$, let $\tau_{\text{source}}(s_{i,j})$ denote the token count of the source skill code \(excluding import statements\), and let $\tau_{\text{adapt}}(s_{i,j})$ denote the total number of tokens generated during repair and revise operations\. The grounding overhead is computed as:

$$\mathrm{GO} = \frac{\#\,\text{tokens generated during grounding}}{\#\,\text{tokens in original skill code}} \times 100 \tag{11}$$

A lower GO indicates that the grounding process requires fewer token generations relative to the original skill size, reflecting more efficient adaptation through localized modifications rather than extensive code regeneration\.

Execution Interruption Count \(EI count↓\\downarrow\)\.The number of execution interruptions measures how frequently the agent must pause execution to perform error correction\. For each tasktit\_\{i\}, we count the number of LM calls with action typerepairorreviseacross all subtasks:

EI=1N​∑i=1N∑j=1Mi\(crepair​\(si,j\)\+crevise​\(si,j\)\),\\mathrm\{EI\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\sum\_\{j=1\}^\{M\_\{i\}\}\\left\(c\_\{\\text\{repair\}\}\(s\_\{i,j\}\)\+c\_\{\\text\{revise\}\}\(s\_\{i,j\}\)\\right\),\(12\)wherecrepair​\(si,j\)c\_\{\\text\{repair\}\}\(s\_\{i,j\}\)andcrevise​\(si,j\)c\_\{\\text\{revise\}\}\(s\_\{i,j\}\)denote the number of repair and revise calls for subtasksi,js\_\{i,j\}, respectively\. Lower EI indicates more stable execution with fewer runtime interventions\.

Idle Time \(IT sec↓\\downarrow\)\.The idle time measures the cumulative duration of LM inference calls required for error correction during execution\. For each tasktit\_\{i\}, we sum the duration of allrepairandreviseoperations:

IT=1N​∑i=1N∑j=1Mi∑k∈ℛi,jdk,\\mathrm\{IT\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\sum\_\{j=1\}^\{M\_\{i\}\}\\sum\_\{k\\in\\mathcal\{R\}\_\{i,j\}\}d\_\{k\},\(13\)whereℛi,j\\mathcal\{R\}\_\{i,j\}denotes the set of repair and revise LM calls for subtasksi,js\_\{i,j\}, anddkd\_\{k\}is the duration \(in seconds\) of callkk\. Lower IT indicates faster adaptation with reduced inference latency, which is critical for real\-time embodied control\.

SR and GC are performance metrics where higher values indicate better task completion\. GO, EI, and IT are efficiency metrics where lower values indicate more effective skill grounding\. Together, these metrics capture the trade\-off between task success and computational overhead that is central to efficient skill grounding with sLMs\.
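
The three efficiency metrics can be sketched in the same style from per-subtask logs. The `SubtaskLog` record and its field names are hypothetical. Pooling token counts over all subtasks before taking the GO ratio is an assumption, since the definition above does not state how the ratio is aggregated across subtasks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubtaskLog:
    """Hypothetical per-subtask record; field names are illustrative only."""
    source_tokens: int     # tau_source: tokens in the source skill code
    adapt_tokens: int = 0  # tau_adapt: tokens generated by repair and revise calls
    repair_calls: int = 0  # c_repair
    revise_calls: int = 0  # c_revise
    call_durations: List[float] = field(default_factory=list)  # d_k in seconds


def grounding_overhead(tasks: List[List[SubtaskLog]]) -> float:
    """GO (%): generated tokens relative to source tokens (assumed pooled)."""
    generated = sum(s.adapt_tokens for t in tasks for s in t)
    original = sum(s.source_tokens for t in tasks for s in t)
    return 100.0 * generated / original


def execution_interruptions(tasks: List[List[SubtaskLog]]) -> float:
    """EI: mean number of repair and revise calls per task, as in Eq. (12)."""
    calls = sum(s.repair_calls + s.revise_calls for t in tasks for s in t)
    return calls / len(tasks)


def idle_time(tasks: List[List[SubtaskLog]]) -> float:
    """IT (sec): mean summed repair and revise duration per task, as in Eq. (13)."""
    return sum(sum(s.call_durations) for t in tasks for s in t) / len(tasks)


# Hypothetical logs for two tasks.
logs = [
    [SubtaskLog(200, 20, repair_calls=1, call_durations=[1.5]), SubtaskLog(300)],
    [SubtaskLog(500, 30, revise_calls=1, call_durations=[2.5])],
]

go = grounding_overhead(logs)
ei = execution_interruptions(logs)
it = idle_time(logs)
print(f"GO = {go:.2f}%, EI = {ei:.2f}, IT = {it:.2f} sec")
```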

### C.4 Failure Categorization

Table [13](https://arxiv.org/html/2606.07999#A3.T13) reports representative failure cases observed during skill execution and groups them by error type and underlying cause. The failures mainly arise from hallucinated variables or objects, hallucinated primitive API usage, and execution disruptions caused by motion planning or inverse kinematics failures. Hallucination errors occur when generated policy code refers to undefined variables or nonexistent scene objects, such as `variation_index`, `ball_position`, or `drawer`. Hallucinated API errors arise when the code invokes unsupported interfaces or incorrect arguments, for example calling `get_object_position` from the environment or passing `approach_distance` to `sawyer_pick()`. Other failures are caused by non-executable code, unreachable objects, blocked waypoints, workspace violations, or empty repair continuations. These errors show that deployment-time mismatches are not limited to syntactic mistakes, but also involve inconsistencies between execution bindings, robot capabilities, and environment constraints.

Table 13: Error Analysis and Classification

| Error Type | Error Analysis | Error Message |
| --- | --- | --- |
| NameError | Hallucination | `name 'variation_index' is not defined` |
| | | `name 'ball_position' is not defined` |
| | | `name 'target_name' is not defined` |
| AttributeError | Hallucination | `numpy.ndarray has no attribute 'index'` |
| | Hallucination (API) | `Environment has no attribute 'get_object_position'` |
| SkillFailure | Motion Planning & IK Failures | `WAYPOINT_FAIL - Blocked by another object.` |
| | | `WAYPOINT_FAIL - Waypoint too close to target center.` |
| | | `Alignment issues detected: Gripper misaligned with 'stand'` |
| | Non-executable Code | `Failed to get initial code: task.step() timed out` |
| PathOutOfWorkspace | Motion Planning & IK Failures | `Alignment issues detected: Gripper misaligned with 'tv_frame'.` |
| | | `All path planning strategies failed` |
| RuntimeError | Non-executable Code | `The call failed on the V-REP side. Return value: -1` |
| | Motion Planning & IK Failures | `Object 'green_cube' is not reachable` |
| | | `Object 'lemon' is not reachable)` |
| | Non-executable Code | `Revise returned empty continuation` |
| | | `Repair returned empty continuation` |
| TypeError | Hallucination (API) | `sawyer_pick() got unexpected keyword argument 'approach_distance'.` |
| | Hallucination | `'NoneType' object is not subscriptable` |
| | | `unsupported operand type(s) for +: 'NoneType' and 'float'` |
| | | `expected np.ndarray (got NoneType)` |
| ValueError | Hallucination | `Object 'drawer' not found in scene` |
| SyntaxError | Non-executable Code | `'(' was never closed` |
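
The grouping in Table 13 can be illustrated with a small rule-based classifier over error messages. This is only a sketch: the regular expressions below are hypothetical rules written to reproduce the table's categories on its listed messages, and the paper does not state how the categorization was actually performed.

```python
import re

# Hypothetical (pattern, category) rules, checked in order; the first match wins.
# API-specific rules come first so that they take precedence over generic ones.
RULES = [
    (r"^Environment has no attribute|unexpected keyword argument",
     "Hallucination (API)"),
    (r"is not defined|has no attribute|NoneType|not found in scene",
     "Hallucination"),
    (r"WAYPOINT_FAIL|misaligned|path planning|not reachable",
     "Motion Planning & IK Failures"),
    (r"timed out|V-REP|empty continuation|was never closed",
     "Non-executable Code"),
]


def categorize(message: str) -> str:
    """Map an error message to a Table 13 category, or 'Unclassified'."""
    for pattern, category in RULES:
        if re.search(pattern, message):
            return category
    return "Unclassified"


examples = [
    "name 'ball_position' is not defined",
    "Environment has no attribute 'get_object_position'",
    "Object 'green_cube' is not reachable",
    "Repair returned empty continuation",
]
for msg in examples:
    print(f"{categorize(msg):32s} <- {msg}")
```

Matching on the message rather than the exception type reflects the table itself, where a single type such as `RuntimeError` or `TypeError` appears under more than one category.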

### C.5 Implementation Details

All methods are evaluated under the same software and hardware configuration summarized in Table [14](https://arxiv.org/html/2606.07999#A3.T14), using identical simulator setups, task definitions, primitive API specifications, and random seeds across five runs. For sLM-based methods, model execution is further constrained by the VRAM limits in Table [15](https://arxiv.org/html/2606.07999#A3.T15), which emulate deployment settings with limited on-device memory. The 7B models are executed under a 12 GB memory budget using FP8 precision, while smaller models use FP16 precision within their corresponding memory budgets. These constraints ensure that performance differences mainly reflect the grounding procedure rather than differences in computational resources or evaluation conditions.
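
A back-of-the-envelope calculation shows how precision interacts with a fixed memory budget. The sketch below counts weight storage only and ignores the KV cache, activations, and runtime overhead, so it is a rough lower bound rather than a description of the actual deployment. The helper name is hypothetical, and 1 GB is taken as 10^9 bytes.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes); weights only."""
    return num_params * bits_per_param / 8 / 1e9


BUDGET_GB = 12.0  # memory budget stated for the 7B models

fp8_gb = weight_memory_gb(7e9, 8)    # one byte per parameter
fp16_gb = weight_memory_gb(7e9, 16)  # two bytes per parameter

print(f"7B @ FP8:  {fp8_gb:.1f} GB, fits in budget: {fp8_gb <= BUDGET_GB}")
print(f"7B @ FP16: {fp16_gb:.1f} GB, fits in budget: {fp16_gb <= BUDGET_GB}")
```

Under these assumptions, the weights of a 7B model take roughly 7 GB at FP8 and 14 GB at FP16, which is one plausible reading of why FP8 is paired with the 12 GB budget.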

Table 14: Implementation details for evaluation

Table 15: Small language model configurations under fixed memory budgets used in ablation studies

## Appendix D Additional Experimental Results

### D.1 Main experiment details

Tables [16](https://arxiv.org/html/2606.07999#A4.T16), [17](https://arxiv.org/html/2606.07999#A4.T17), [18](https://arxiv.org/html/2606.07999#A4.T18), and [19](https://arxiv.org/html/2606.07999#A4.T19) provide per-scenario breakdowns of the main experiments reported in Table [1](https://arxiv.org/html/2606.07999#S5.T1). The aggregated results in Section [5.2](https://arxiv.org/html/2606.07999#S5.SS2) summarize overall performance across embodiment settings, and these detailed results show how each embodiment mismatch affects deployment-time grounding difficulty. The Panda → UR5 and Panda → Sawyer settings evaluate kinematic variation arising from differences in manipulator structure and motion constraints. The parallel-jaw gripper → vacuum gripper and parallel-jaw gripper → 2F-85 gripper settings evaluate end-effector variation under changes in grasping mechanisms and execution interfaces.

Table 16: Performance on continual embodied tasks under kinematic variation (Panda → UR5)

#### Panda → UR5 kinematic variation

As shown in Table [16](https://arxiv.org/html/2606.07999#A4.T16), RECENT achieves the best task performance with 75.00% SR and 82.43% GC, outperforming the strongest baseline, RepairAgent, by 25.00 pp in SR and 15.40 pp in GC. It also reduces GO from 28.95% to 11.81%, EI from 1.65 to 0.70, and IT from 42.84 sec to 2.47 sec compared to RepairAgent. Compared to CaP-Codex, RECENT improves SR by 7.00 pp and GC by 2.88 pp, while reducing GO by 55.57 pp, EI by 0.57, and IT by 4.37 sec. These results show that localized refactoring of execution bindings can resolve kinematic embodiment gaps while preserving the functional structure of the skill.

Table 17: Performance on continual embodied tasks under kinematic variation (Panda → Sawyer)

#### Panda → Sawyer kinematic variation

Table [17](https://arxiv.org/html/2606.07999#A4.T17) further evaluates kinematic variation in the Panda → Sawyer setting, where RECENT obtains the highest SR of 71.00% and a GC of 81.08%. This corresponds to improvements of 29.00 pp in SR and 19.76 pp in GC over RoboInspector, the strongest baseline in this setting. RECENT also requires substantially lower grounding cost, reducing GO from 256.86% to 3.89%, EI from 1.55 to 0.74, and IT from 92.17 sec to 1.08 sec compared to RoboInspector. Against CaP-Codex, RECENT improves SR by 4.00 pp and reduces GO by 97.31 pp, EI by 0.45, and IT by 8.06 sec. These results indicate that localized refactoring can handle larger kinematic gaps without increasing deployment-time grounding overhead.

Table 18: Performance on continual embodied tasks under end-effector variation (Panda with a parallel-jaw gripper → vacuum gripper)

#### Parallel-jaw gripper → vacuum gripper end-effector variation

Table [18](https://arxiv.org/html/2606.07999#A4.T18) shows detailed results for the parallel-jaw gripper → vacuum gripper setting, where grasping behavior must be adapted from contact-based gripper control to suction-based manipulation. RECENT achieves the highest SR of 83.33% and GC of 91.40%, improving over the strongest baseline, RepairAgent, by 33.33 pp in SR and 15.59 pp in GC. It also reduces GO from 65.38% to 2.79%, EI from 1.12 to 0.36, and IT from 94.70 sec to 1.90 sec compared to RepairAgent. Compared to CaP-Codex, RECENT improves SR by 11.66 pp and GC by 4.09 pp, while reducing GO by 35.33 pp, EI by 0.26, and IT by 8.74 sec. These results show that execution-binding refactoring can efficiently adapt grasp and release primitives under changes in grasping mechanisms.

Table 19: Performance on continual embodied tasks under end-effector variation (Panda with a parallel-jaw gripper → 2F-85 gripper)

#### Parallel-jaw gripper → 2F-85 gripper end-effector variation

The results in Table [19](https://arxiv.org/html/2606.07999#A4.T19) correspond to the parallel-jaw gripper → 2F-85 gripper setting. Since both source and target embodiments use parallel grippers, the grounding challenge lies in adapting embodiment-specific primitive interfaces rather than changing the grasping mechanism itself. In this setting, RECENT achieves 81.67% SR and 88.28% GC, surpassing RepairAgent by 27.34 pp in SR and 16.46 pp in GC. The efficiency gain is more pronounced, with GO decreasing from 46.46% to 2.23% and IT decreasing from 74.05 sec to 2.78 sec compared to RepairAgent. Compared to CaP-Codex, RECENT achieves higher SR by 3.64 pp and reduces GO by 43.04 pp and IT by 7.16 sec. This indicates that preserving the skill structure and editing only embodiment-specific execution bindings is especially effective when the required adaptation is localized to gripper interfaces.

### D.2 Ablation experiment details

Table [20](https://arxiv.org/html/2606.07999#A4.T20) and Table [21](https://arxiv.org/html/2606.07999#A4.T21) provide detailed ablation results for Table [2](https://arxiv.org/html/2606.07999#S5.T2) and Table [4](https://arxiv.org/html/2606.07999#S5.T4), respectively.

Table 20: Ablation on model choice and sLM family

Table 21: Ablation on RECENT modules

### D.3 Real-world deployment

We include an additional real-world deployment study to assess whether RECENT can perform skill grounding on physical robot systems with heterogeneous embodiments. This study is designed to evaluate the practical applicability of our skill grounding procedure under realistic perception and manipulation conditions, complementing the simulation-based evaluations in Section [5](https://arxiv.org/html/2606.07999#S5) with evidence from physical robot deployments.

#### Robot setup.

We evaluate our method on two heterogeneous real-world robot embodiments. The source embodiment is a 7-DoF Franka Research 3 (FR3) robotic arm equipped with a parallel two-finger gripper (Figure [8(a)](https://arxiv.org/html/2606.07999#A4.F8.sf1)), while the target embodiment is a 6-DoF UR7e arm equipped with a Robotiq 2F-85 parallel gripper (Figure [8(b)](https://arxiv.org/html/2606.07999#A4.F8.sf2)). Beyond differences in kinematics and end-effector hardware, the two platforms also differ in how their primitives are exposed and composed. While some primitives are shared across platforms, the FR3 implementation encapsulates object detection within guarded pick primitives to maintain a safer abstraction boundary around low-level motion control, whereas the UR7e implementation exposes detection and picking as separate primitives, using a modular interface that accepts detected object poses explicitly in picking or motion-level calls. Each robot operates in a tabletop workspace with a similar task layout. For perception, each robot is equipped with an Intel RealSense D435 RGB-D camera mounted in an eye-in-hand configuration on the end-effector. The captured RGB-D observations are processed using SAM 3 (Carion et al., [2026](https://arxiv.org/html/2606.07999#bib.bib7)) for class-agnostic segmentation, followed by AnyGrasp (Fang et al., [2023](https://arxiv.org/html/2606.07999#bib.bib11)) to estimate feasible grasp poses for detected objects.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x14.png) (a) FR3 with a two-finger gripper (source embodiment)
![Refer to caption](https://arxiv.org/html/2606.07999v1/x15.png) (b) UR7e with a Robotiq 2F-85 gripper (target embodiment)

Figure 8: Real-world robot platforms for deployment. The embodiments differ in robot morphology and gripper hardware.

#### Safety safeguards for real-robot execution.

RECENT performs code refactoring at the skill level and does not directly bypass low-level robot safety mechanisms. Ontology-based unit tests validate not only API-level compatibility but also embodiment-level execution consistency, so unsafe or infeasible patched behaviors can be treated as validation failures and fed back into the refactoring loop. At the control level, patched skill code is executed through safeguarded robot controllers and motion planners, including configured workspace bounds, joint-limit checks, and collision-aware trajectory validation. In our implementation, collision-aware motion planning is handled with cuRobo (Sundaralingam et al., [2023](https://arxiv.org/html/2606.07999#bib.bib37)) when available, while hardware-level safety limits and collision halting are enforced by the robot execution stack.

#### Scan\-and\-bag task\.

We evaluate RECENT on a real-world scan-and-bag task in a checkout-counter setting, where the robot sequentially scans checkout items placed on the counter using a barcode scanner and places the scanned items into a bagging basket. Figure [9](https://arxiv.org/html/2606.07999#A4.F9) shows the real-world checkout-counter setup, where the unscanned area, the scanning station, the barcode scanner, and the bagging basket are annotated. In this task, a human places checkout items in the unscanned area, and the robot detects the items, estimates feasible grasp points, picks one item, moves it through the scanning station to scan its barcode with the tabletop barcode scanner, and places the scanned item into the basket located next to the scanning station. A trial is considered successful only when the robot completes this process for all five checkout items placed on the checkout counter. As illustrated in Figure [10](https://arxiv.org/html/2606.07999#A4.F10) and Figure [11](https://arxiv.org/html/2606.07999#A4.F11), each trial consists of five repeated item-level subgoals, where the robot sequentially picks, scans, and bags checkout items. Each subgoal corresponds to the complete pick-scan-place sequence for a single target item.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x16.png)

Figure 9: Real-world checkout-counter setup for the scan-and-bag task. The unscanned area (where items awaiting scanning are placed), scanning station, barcode scanner, and bagging basket are annotated in the scene.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x17.png)

Figure 10: Task sequence for the real-world scan-and-bag task with pick-and-place obstacle handling. The label $N/5$ indicates that the robot has completed scan-and-bag for $N$ out of five checkout items. The second row provides a detailed view of the obstacle-handling condition during the scan-and-bag sequence for the second item. After picking the second item, the madeleine cake pouch, a graspable teddy bear is introduced into the scanning station. The robot handles this obstacle using the pick-and-place removal branch, which is already available in the source skill, and then resumes the original scan-and-bag sequence.

![Refer to caption](https://arxiv.org/html/2606.07999v1/x18.png)

Figure 11: Task sequence for the real-world scan-and-bag task with sweeping obstacle handling. The label $N/5$ indicates that scan-and-bag has been completed for $N$ out of five checkout items. The second row details the obstacle-handling condition during scan-and-bag for the second item. After picking the second item, the vanilla wafer pouch, a non-graspable camping stove box is introduced into the scanning station. Since this obstacle cannot be reliably removed by pick-and-place, a sweeping branch is patched into the obstacle-handling routine during execution, shown in the orange highlighted region, and the original scan-and-bag sequence is then resumed.

#### Obstacle-handling condition.

During the task, an unexpected obstacle may be introduced into the scanning station and block the barcode scanner, for example when a checkout item is mistakenly placed in front of the scanner\. Before executing the scan step, the robot checks whether the scanning station is occupied by an unexpected obstacle using the RGB\-D observation\. If the scanning station is blocked, the robot must remove the obstacle from the scanning station to the disposal area and then resume the original scan\-and\-bag sequence\. When the robot is already holding a target item, it first places the item at a temporary location, removes the obstacle, and then continues the task\.

An obstacle is considered graspable if AnyGrasp provides a stable top-grasp pose for pick-and-place removal, and non-graspable if its shape or placement makes stable grasping unreliable, requiring a sweeping motion instead. The obstacle-handling routine is already included in the source skill. In the source setting, all obstacles are graspable and can therefore be removed using only the pick-and-place branch. In the target real-world setting, some obstacles are non-graspable, requiring the robot to select either pick-and-place or sweeping depending on the graspability of the observed obstacle. Thus, this task is designed not merely as a real-world execution demo, but as an environment-dependent adaptation setting in which RECENT must locally reground the obstacle-removal behavior based on execution-time observations.
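
The obstacle-handling condition described above can be sketched as a small decision routine. All step names are hypothetical placeholders rather than the primitives of either robot platform, and re-picking the target item from the temporary location is an assumption about what "continues the task" implies.

```python
from typing import List


def plan_before_scan(blocked: bool, graspable: bool, holding_item: bool) -> List[str]:
    """Return the (hypothetical) step sequence to run before and including the scan."""
    if not blocked:
        return ["scan_item"]

    steps: List[str] = []
    if holding_item:
        # Free the gripper first when a target item is already held.
        steps.append("place_item_at_temporary_location")

    # Select the removal branch from the observed graspability.
    if graspable:
        steps.append("pick_and_place_obstacle_to_disposal")
    else:
        steps.append("sweep_obstacle_to_disposal")

    if holding_item:
        # Assumption: the item is re-picked before the sequence resumes.
        steps.append("pick_item_from_temporary_location")

    steps.append("scan_item")
    return steps


# Source setting: every obstacle is graspable, so only one branch is exercised.
print(plan_before_scan(blocked=True, graspable=True, holding_item=True))
# Target setting: a non-graspable obstacle requires the sweeping branch.
print(plan_before_scan(blocked=True, graspable=False, holding_item=True))
```

The two calls differ only in the removal step, which mirrors the point made above: the surrounding control structure is unchanged and only the obstacle-removal behavior needs to be regrounded.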

#### Trial configuration\.

We conduct 10 trials with trial-level variations in the item sets and obstacle-handling conditions. Each trial corresponds to a complete scan-and-bag episode in which the robot processes all five checkout items placed on the checkout counter. Among the 10 trials, 8 include an obstacle-handling condition, while the remaining 2 are conducted without obstacle introduction to evaluate nominal scan-and-bag execution. For each trial, the obstacle-handling condition is defined by whether an obstacle is introduced, when it occurs within the task sequence, the type of obstacle, and whether the obstacle is graspable. The obstacle introduction timing is specified with respect to the task phase, such as after picking an item, before scanning, or after scanning. Human intervention is limited to placing the checkout items and introducing obstacles according to the trial configuration, while robot execution and recovery are performed autonomously. Table [22](https://arxiv.org/html/2606.07999#A4.T22) summarizes the trial-level configuration, including the item set assignment and the obstacle-handling condition for each trial. The corresponding item sets are detailed in Table [23](https://arxiv.org/html/2606.07999#A4.T23).

Table 22: Trial configurations for the real-world scan-and-bag task. Each trial corresponds to a complete episode in which the robot processes five checkout items.

Table 23: Item set definitions used in the real-world scan-and-bag trials. Each item set contains five checkout items placed on the checkout counter.

#### Evaluation results.

Table [24](https://arxiv.org/html/2606.07999#A4.T24) reports the real-world scan-and-bag results over 10 trials. RECENT successfully completes 8 out of 10 trials, whereas CaP completes 4 out of 10 trials. Grounding overhead is reduced from 125.80% with CaP to 15.70% with RECENT, and execution interruptions decrease from 3.00 to 0.40 per trial. The average interruption time is also reduced from 23.51 sec with CaP to 2.34 sec with RECENT. The two failed trials occur when an obstacle is introduced immediately before the scan motion, leaving insufficient time and no valid yet-to-be-executed recovery span for localized patching before the scanner interaction.

Table 24: Performance comparison on the real-world scan-and-bag task over 10 trials.

The main difference comes from how the two methods handle obstacle-induced execution-time mismatches. When the scanning station is occupied by a non-graspable obstacle, RECENT does not wait for a failed grasp attempt to trigger full recovery. Instead, the unit-test-based validation detects the graspability violation before the corresponding obstacle-removal code is executed, and RECENT patches only the affected, yet-to-be-executed code span by replacing the grasp-based removal branch with a sweep-based removal branch. This anticipatory localized patching allows the robot to clear the scanning station and resume the scan-and-bag sequence with low interruption time and low grounding overhead. In contrast, CaP handles the same mismatch through policy regeneration. This makes recovery less stable, as the regenerated program may fail to produce a consistent obstacle-handling branch, and even successful recovery often requires longer execution pauses because unrelated parts of the task program are regenerated together with the obstacle-handling logic.
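
The idea of patching only a yet-to-be-executed span can be illustrated with a toy model in which a skill is a flat list of step names and an index marks the next step to run. This is a deliberately simplified sketch, not the RECENT implementation: the step names and the `patch_pending_span` helper are hypothetical, and real skills are executable code rather than string lists.

```python
from typing import List, Tuple


def patch_pending_span(
    steps: List[str], next_index: int, old: str, new: str
) -> Tuple[List[str], bool]:
    """Replace `old` with `new` only among steps that have not run yet.

    Returns the (possibly patched) step list and whether a patch was applied.
    """
    executed, pending = steps[:next_index], steps[next_index:]
    if old not in pending:
        # No pending occurrence remains, so there is nothing to patch locally.
        return list(steps), False
    patched = [new if step == old else step for step in pending]
    return executed + patched, True


skill = [
    "pick_item",
    "check_scanning_station",
    "remove_obstacle_by_grasp",
    "scan_item",
    "place_in_basket",
]

# Violation detected before the removal step runs: only that step is replaced.
early, early_ok = patch_pending_span(
    skill, 2, "remove_obstacle_by_grasp", "remove_obstacle_by_sweep"
)

# Violation detected after the removal step already ran: no pending span is left.
late, late_ok = patch_pending_span(
    skill, 3, "remove_obstacle_by_grasp", "remove_obstacle_by_sweep"
)

print(early, early_ok)
print(late, late_ok)
```

In the first call, every step other than the removal branch is left untouched, which is the contrast with full policy regeneration drawn above. The second call loosely mirrors the reported failure mode in which no valid pending span remains to be patched.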

### D.4 Comparison with deployment-time learning baselines

We compare RECENT with learning-based baselines operating at deployment time, while our main experiments primarily focus on baselines centered on code generation and program repair. This comparison examines a different deployment-time adaptation strategy, where reference samples are used to adapt the model or prompt to the target task. We find that such learning-based adaptation can improve task success, but still incurs substantial grounding cost, measured by GO and Initial Grounding Time (IGT), and remains below RECENT in success rate. In contrast, RECENT shifts much of the adaptation cost from deployment time to the offline stage. This is enabled by its offline-validated skill repository and localized execution-binding refactoring, which reduce deployment-time cost while achieving higher task success.

#### Deployment\-time learning baselines\.

We consider two learning-based baselines that use reference samples at deployment time. Test-Time Training (TTT) (Akyürek et al., [2025](https://arxiv.org/html/2606.07999#bib.bib2)) retrieves top-$k$ examples and trains a task-specific LoRA module before generating the initial executable code. When execution requires code regeneration, we train a separate LoRA module for the failure-conditioned regeneration objective and use it to produce the corrected code. Efficient Prompt Retriever (EPR) (Rubin et al., [2022](https://arxiv.org/html/2606.07999#bib.bib33)) retrieves top-$k$ examples and directly uses them as in-context demonstrations for both initial code generation and corrected code generation. Both baselines require deployment-time reference samples, where $k$ denotes the number of retrieved examples. We evaluate TTT with $k=2$ and $k=10$, and EPR with $k=2$, using deterministic oracle retrieval results for both baselines to isolate the effect of deployment-time adaptation from retrieval errors.

#### Evaluation metrics\.

We report the same metrics used in the main experiments, namely SR, GC, GO, EI, and IT, and additionally include IGT for this comparison\. IGT measures the wall\-clock time from receiving a target task to obtaining the first executable code, including any deployment\-time adaptation required before execution\. This additional metric captures the pre\-execution latency introduced by deployment\-time learning baselines, such as task\-specific LoRA updates or retrieval\-conditioned prompting, which is not reflected in IT because IT only measures robot interruption time during execution\.
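
The distinction between IGT and IT can be sketched from a per-task event log. The log format, phase labels, and durations below are hypothetical. Summing stage durations is an assumption that approximates wall-clock IGT only when the pre-execution stages run sequentially with no gaps between them.

```python
from typing import List, Tuple

# Hypothetical event log for one task: (phase, kind, duration in seconds).
Event = Tuple[str, str, float]

events: List[Event] = [
    ("pre_execution", "adapt", 30.0),     # e.g. a task-specific LoRA update
    ("pre_execution", "generate", 5.0),   # producing the first executable code
    ("execution", "repair", 2.0),         # LM call that pauses the robot
    ("execution", "revise", 1.5),         # LM call that pauses the robot
]


def initial_grounding_time(log: List[Event]) -> float:
    """IGT: time spent before the first executable code is available."""
    return sum(d for phase, _, d in log if phase == "pre_execution")


def idle_time_for_task(log: List[Event]) -> float:
    """Per-task IT contribution: repair and revise durations during execution."""
    return sum(
        d for phase, kind, d in log
        if phase == "execution" and kind in ("repair", "revise")
    )


igt = initial_grounding_time(events)
it_task = idle_time_for_task(events)
print(f"IGT = {igt:.1f} sec, IT contribution = {it_task:.1f} sec")
```

In this hypothetical log, most of the latency sits before execution and would be invisible if only IT were reported.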

#### Evaluation results\.

Table [25](https://arxiv.org/html/2606.07999#A4.T25) shows the comparison with deployment-time learning baselines. RECENT achieves a 15.78 pp higher SR than TTT with $k=10$ and reduces IGT by 6.49× compared with EPR with $k=2$. This is because TTT and EPR perform task-level grounding at deployment time, increasing adaptation cost. Even with oracle retrieval, they reconstruct task-level executable policy code from retrieved samples, causing not only execution bindings but also functional logic to be regenerated together. RECENT stabilizes semantic intent in advance through the offline-validated skill repository and only locally edits embodiment- and environment-specific execution bindings at deployment time, thereby achieving more stable and successful grounding with lower overhead.

Table 25: Performance comparison with deployment-time learning baselines.
