DART: Semantic Recoverability for Structured Tool Agents

arXiv cs.AI Papers

Summary

DART introduces semantic recoverability for structured tool agents, formalizing a criterion to determine whether a local checkpoint restore remains valid after downstream commitments. Experiments across three LLM-driven domains show it correctly recovers all commitment-sensitive cases where baseline local recovery fails, and a safety audit finds no unsafe rollbacks.

arXiv:2605.23311v1 Announce Type: new Abstract: When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists. This tension is acute in commitment-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints-or blocks otherwise. Across three LLM-driven domains and external validation on a LangGraph-based substrate, DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check.


# DART: Semantic Recoverability for Structured Tool Agents
Source: [https://arxiv.org/html/2605.23311](https://arxiv.org/html/2605.23311)
Ke Yang¹, Panpan Li², Zonghan Wu³, Kejin Xu¹, Huaxi Huang⁴, Xiaoshui Huang⁵

¹ MOS Intelligent Connectivity Technology Co\. Ltd\.; ² Sichuan Vocational College of Post and Telecom; ³ East China Normal University; ⁴ Shanghai Artificial Intelligence Laboratory; ⁵ Shanghai Jiao Tong University

###### Abstract

When a structured tool agent fails mid\-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists\. This tension is acute in commitment\-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output\. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment\. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints, or blocks local rollback otherwise\. Across three LLM\-driven domains and external validation on a LangGraph\-based substrate, DART correctly recovers all evaluated commitment\-sensitive cases where baseline local recovery fails, and a five\-domain safety audit finds no unsafe admitted rollbacks\. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check\.

## 1 Introduction

Large language models are increasingly deployed as tool\-using agents in production settings such as workflow assistants, scheduling and booking systems, and multi\-stage orchestration pipelines\. Failures during execution are inevitable, and the ability to recover without replaying an entire task from scratch is operationally critical\. A broad class of such deployments can be characterized as structured tool agents: tool\-using agents whose execution is organized by explicit control flow, observable action boundaries, and persisted traces\. Their explicit structure makes partial\-progress reuse feasible, yet existing recovery mechanisms do not check whether a local restore is semantically correct, particularly when downstream work has already been committed\.

Current recovery approaches for such systems fall into three broad patterns\. The first assumes that the recoverable object is known in advance: classical workflow\-exception and runtime\-repair methods define exception scopes, compensation handlers, or service\-level regions at design time, so what to recover is never a runtime question \[[6](https://arxiv.org/html/2605.23311#bib.bib6),[7](https://arxiv.org/html/2605.23311#bib.bib7),[11](https://arxiv.org/html/2605.23311#bib.bib11),[12](https://arxiv.org/html/2605.23311#bib.bib12),[8](https://arxiv.org/html/2605.23311#bib.bib8),[10](https://arxiv.org/html/2605.23311#bib.bib10),[9](https://arxiv.org/html/2605.23311#bib.bib9)\]\. The second assumes that the rollback boundary is known in advance: distributed\-snapshot and transaction\-oriented protocols fix the scope at the process, transaction, or checkpoint level and enforce consistency within it, so how far to roll back is never a runtime question either \[[1](https://arxiv.org/html/2605.23311#bib.bib1),[2](https://arxiv.org/html/2605.23311#bib.bib2),[3](https://arxiv.org/html/2605.23311#bib.bib3),[5](https://arxiv.org/html/2605.23311#bib.bib5),[4](https://arxiv.org/html/2605.23311#bib.bib4),[17](https://arxiv.org/html/2605.23311#bib.bib17)\]\. The third provides the mechanism but not the criterion: modern agent runtimes such as LangGraph expose resume and retry primitives that make local restore mechanically possible, yet they offer no way to determine whether a restored execution is still semantically valid once downstream work has already been committed \[[13](https://arxiv.org/html/2605.23311#bib.bib13),[14](https://arxiv.org/html/2605.23311#bib.bib14),[15](https://arxiv.org/html/2605.23311#bib.bib15),[16](https://arxiv.org/html/2605.23311#bib.bib16),[18](https://arxiv.org/html/2605.23311#bib.bib18)\]\.

Across all three lines, recovery faces a fundamental dilemma\. Whole\-task rerun is always safe but wasteful, because it replays an arbitrary amount of already\-completed work\. Local restore is efficient but can be semantically invalid when committed downstream consumers remain in place\. The root cause is that controller legality, i\.e\., the runtime’s ability to mechanically restore a prior state, does not imply semantic validity: the restored execution may no longer correspond to any valid upstream history\. This mismatch manifests in three concrete failure modes: \(i\) the runtime targets the wrong failed instance, \(ii\) local rollback invalidates committed downstream work, or \(iii\) rollback crosses an irreversible effect boundary\.

Consider a scheduling assistant with two subtasks: the first queries three participants’ calendars and proposes candidate meeting times, and the second picks one of those times and sends calendar invitations to all participants\. Suppose the first subtask completes, the second sends the invitations, and then the first is found to have failed \(e\.g\., a stale calendar cache produced an invalid candidate list\)\. The runtime can roll back the first subtask and retry it\. This rollback is controller\-legal, but the invitations from the second subtask are already committed and now refer to a time slot that may not appear in any valid retry\. Figure [1](https://arxiv.org/html/2605.23311#S1.F1) illustrates this pattern\. This example exposes a broader limitation: once rollback is localized to a failed semantic unit while committed downstream work remains in place, controller legality alone does not ensure correctness\.

![Refer to caption](https://arxiv.org/html/2605.23311v1/x1.png)

Figure 1: Commitment\-sensitive recovery regime\. Whole\-task rerun is correct but expensive because it replays an unrelated completed prefix\. Checkpoint\-aligned restore can remain controller\-legal yet become semantically invalid when committed downstream consumers remain in place\.

We study this gap as a question of semantic recoverability: when a failed instance is rolled back locally, under what conditions is the restore point not only controller\-legal but also semantically valid? Our key idea is to certify, before any local rollback is attempted, that the boundary of the failed instance is semantically closed, meaning that no committed downstream work depends on the specific output being rolled back\. Building on this idea, we introduce DART \(Deterministic Agent Runtime with Transition Guards\), a modular recovery runtime organized around four decision steps: failed\-instance localization, recoverable\-boundary certification, instance\-aligned checkpointing, and admissible rollback selection\. When all four steps succeed, DART restores the latest admissible local checkpoint\. When any step fails, it conservatively blocks the local rollback and falls back to whole\-task rerun\. Although we instantiate DART with explicit finite\-state\-machine \(FSM\) agents for observability, the underlying problem arises more broadly in any structured runtime where rollback is localized to a failed instance while committed downstream work is preserved\.

We evaluate DART across three LLM\-driven domains\. In the evaluated commitment\-sensitive cases, DART recovers every scenario correctly, whereas entry\-only restore fails and whole\-task rerun incurs substantially larger replay\. We reproduce the same contrast on a LangGraph\-based runtime: its built\-in checkpoint restore fails in the decisive commitment\-sensitive case where DART succeeds\. Outside commitment\-sensitive settings, DART remains competitive with existing approaches, and a systematic safety audit across all evaluated domains confirms that it introduces no unsafe recoveries\.

Contributions\. \(1\) We formalize semantic recoverability: when is a local rollback not only mechanically possible but also semantically valid? We show that, in commitment\-sensitive settings, the two notions diverge and controller legality alone does not prevent invalid recoveries\. \(2\) To close this gap, we define recoverable boundaries via four conditions \(decidability, closure, separability, and controllability\) that a rollback target must satisfy, and organize them into a four\-step runtime procedure realized in DART\. \(3\) We prove that the resulting blocking behavior is not merely a conservative design choice but a necessity: any runtime that admits all controller\-legal local rollbacks will produce semantically invalid executions in commitment\-sensitive settings\. Empirical validation on LangGraph confirms this result\.

## 2 Related Work

Prior work relevant to structured tool\-agent recovery typically fixes the recovery unit, fixes the rollback scope, or provides persistence without an explicit semantic admissibility criterion\.

Recovery Units Assumed in Advance\. Existing work in this area typically assumes that the recoverable unit is already known\. Classical workflow systems study exception handling and exception scopes in long\-running processes \[[6](https://arxiv.org/html/2605.23311#bib.bib6),[7](https://arxiv.org/html/2605.23311#bib.bib7)\]\. Runtime\-repair work adds guided, monitor\-driven, and workaround\-based recovery \[[11](https://arxiv.org/html/2605.23311#bib.bib11),[12](https://arxiv.org/html/2605.23311#bib.bib12),[8](https://arxiv.org/html/2605.23311#bib.bib8),[10](https://arxiv.org/html/2605.23311#bib.bib10)\], while self\-healing process repair extends this line to adaptive workflow correction \[[9](https://arxiv.org/html/2605.23311#bib.bib9)\]\. Across these lines, the recovery object is typically authored in advance as an activity, exception scope, or service\-level region\. Our setting differs in that the runtime must first identify a unique failed semantic instance before any local recovery decision can be made\.

Rollback Scope Assumed in Advance\. Complementary work fixes the rollback boundary in advance and then reasons about consistency within that scope\. Distributed snapshots and rollback\-recovery protocols restore execution state under a predefined scope \[[1](https://arxiv.org/html/2605.23311#bib.bib1),[2](https://arxiv.org/html/2605.23311#bib.bib2)\]\. Transaction\-oriented recovery and nested transactions formalize atomicity and scoped rollback \[[3](https://arxiv.org/html/2605.23311#bib.bib3),[5](https://arxiv.org/html/2605.23311#bib.bib5)\], while sagas and idempotent\-effect patterns address long\-running and externally visible actions \[[4](https://arxiv.org/html/2605.23311#bib.bib4),[17](https://arxiv.org/html/2605.23311#bib.bib17)\]\. These mechanisms are essential for consistency, but they usually assume that rollback scope has already been fixed at the process, transaction, or checkpoint level\. By contrast, our question is whether a controller\-legal local restore remains semantically admissible under the current dependency and effect context\.

Persistence Without Semantic Admissibility\. More recent runtime systems expose persistence and retry primitives, but still stop short of defining when a local restore is semantically admissible\. Modern graph runtimes such as LangGraph expose persistence, interrupts, and rollback\-oriented execution primitives \[[13](https://arxiv.org/html/2605.23311#bib.bib13),[14](https://arxiv.org/html/2605.23311#bib.bib14),[15](https://arxiv.org/html/2605.23311#bib.bib15)\], and Step Functions and Ray similarly provide retry and fault\-tolerance mechanisms \[[16](https://arxiv.org/html/2605.23311#bib.bib16),[18](https://arxiv.org/html/2605.23311#bib.bib18)\]\. Explicit control structure also provides a more inspectable execution substrate for tool agents \[[24](https://arxiv.org/html/2605.23311#bib.bib24)\], while adjacent work studies agent failure diagnosis and debugging \[[26](https://arxiv.org/html/2605.23311#bib.bib26),[27](https://arxiv.org/html/2605.23311#bib.bib27),[31](https://arxiv.org/html/2605.23311#bib.bib31)\], self\-correction under tool failures \[[28](https://arxiv.org/html/2605.23311#bib.bib28)\], and resilient multi\-agent execution \[[30](https://arxiv.org/html/2605.23311#bib.bib30),[32](https://arxiv.org/html/2605.23311#bib.bib32)\]\. SagaLLM adds transaction guarantees for multi\-agent planning \[[29](https://arxiv.org/html/2605.23311#bib.bib29)\], and diagnosability\-oriented work sharpens fault detection and localization \[[34](https://arxiv.org/html/2605.23311#bib.bib34),[35](https://arxiv.org/html/2605.23311#bib.bib35)\]\. Together, these works make structured recovery increasingly practical, but they still largely treat correctness as implicit once execution can be retried or restored\. DART isolates the missing semantic layer: when a controller\-legal local restore is semantically admissible under downstream commitments and effect boundaries\.

Taken together, prior work leaves open how to identify the recoverable unit, when a controller\-legal checkpoint is also a semantically valid recovery boundary, and whether persistence primitives suffice without dependency\- and effect\-aware admission\.

## 3 Problem Setting and Scope

We fix the execution model, observable failure signal, and subtask\-instance recovery unit assumed throughout the paper, using explicit FSMs as a transparent canonical instantiation of the broader class of explicit\-control runtimes\.

### 3\.1 Basic Execution Model

#### Definition 1 \(FSM\-governed tool agent\)\.

An agent in our scope is a tuple

$$\mathcal{G} = (S, A, \delta, M, H) \tag{1}$$

where $S$ is a finite state set, $A$ is an action set, $\delta \subseteq S \times A \times S$ is an explicit legal transition relation, $M$ is the runtime memory or context, and $H = (e_1, \ldots, e_T)$ is a recorded step history\. Each step

$$e_t = (s_t, a_t, s_{t+1}, \Delta m_t) \tag{2}$$

exposes at least a current state, an executed action, a successor state, and a memory delta\.

Eq\. \([1](https://arxiv.org/html/2605.23311#S3.E1)\) fixes the control substrate: agents with explicit, inspectable states and actions\. In our experiments these are LLM\-based tool agents with explicit FSM control\[[24](https://arxiv.org/html/2605.23311#bib.bib24)\]\. The contribution lies in the recovery layer rather than the model architecture\.
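As a minimal sketch of Definition 1 (not the authors' implementation), the tuple in Eq. (1) and the step record in Eq. (2) can be written as plain Python data structures. The names `Step`, `FSMAgent`, and the `step` method are illustrative assumptions; the only behavior taken from the text is that a step is legal only if it appears in the transition relation and that each step records a state, an action, a successor state, and a memory delta.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    """One history entry e_t = (s_t, a_t, s_{t+1}, delta m_t)."""
    state: str
    action: str
    next_state: str
    memory_delta: dict


@dataclass
class FSMAgent:
    """Hypothetical container for G = (S, A, delta, M, H)."""
    states: set
    actions: set
    delta: set  # set of legal (state, action, next_state) triples
    memory: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def step(self, state: str, action: str, next_state: str,
             memory_delta: dict) -> str:
        # Controller legality: the transition must be in the relation.
        if (state, action, next_state) not in self.delta:
            raise ValueError(f"illegal transition: {state} -{action}-> {next_state}")
        self.memory.update(memory_delta)
        # Store a copy so later edits to the caller's dict do not alter history.
        self.history.append(Step(state, action, next_state, dict(memory_delta)))
        return next_state
```

Note that this check captures controller legality only; it says nothing about whether restoring to a recorded step is semantically valid, which is the gap the later sections address.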

### 3\.2 Observable Failure Events

#### Definition 2 \(Observable failure event\)\.

An observable failure event is a tuple

$$f = (t, s, a, \sigma) \tag{3}$$

where $t$ is the failed step id, $s$ is the runtime state at failure, $a$ is the failed action, and $\sigma$ is a normalized failure signal exposed at an action boundary and consumable by the recovery runtime\.

In our setting, $\sigma$ may correspond to tool exceptions, timeouts, governor denials, missing required inputs, execution\-chain exceptions, or explicit contract violations\. We therefore do not study silent failure detection, latent semantic error discovery, or general root\-cause diagnosis\.
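The following sketch illustrates how a failure at an action boundary could be turned into the tuple of Eq. (3). It is an assumption-laden illustration: the signal labels, the mapping from Python exception types to labels, and the `observe` wrapper are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    """f = (t, s, a, sigma)."""
    step_id: int
    state: str
    action: str
    signal: str


def normalize_signal(exc: BaseException) -> str:
    # Hypothetical mapping from raw exceptions to a normalized signal sigma.
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, PermissionError):
        return "governor_denial"
    if isinstance(exc, KeyError):
        return "missing_required_input"
    return "tool_exception"


def observe(step_id, state, action, tool, **kwargs):
    """Run a tool at an action boundary.

    Returns (result, None) on success or (None, FailureEvent) on failure.
    """
    try:
        return tool(**kwargs), None
    except Exception as exc:
        return None, FailureEvent(step_id, state, action, normalize_signal(exc))
```

Only failures that surface as exceptions at the boundary are captured here, which matches the stated scope: silent failures never produce an event.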

### 3\.3 Subtask Skeletons and Subtask Instances

#### Definition 3 \(Subtask skeleton\)\.

A subtask skeleton is a reusable semantic template

$$K = (k, S_K^{\mathrm{int}}, S_K^{\mathrm{ent}}, P_K^{\mathrm{com}}, P_K^{\mathrm{exit}}, X_K^{\mathrm{in}}, X_K^{\mathrm{out}}, \pi_K^{\mathrm{eff}}) \tag{4}$$

where $k$ is the skeleton identifier, $S_K^{\mathrm{int}}$ the internal states, $S_K^{\mathrm{ent}}$ the entry states, $P_K^{\mathrm{com}}$ and $P_K^{\mathrm{exit}}$ the commit/exit predicates, $X_K^{\mathrm{in}}$ and $X_K^{\mathrm{out}}$ the input and output interface keys, and $\pi_K^{\mathrm{eff}}$ the effect policy, i\.e\., the reviewed rollback policy for effects produced by this skeleton\. In the current DART, these fields come from reviewed boundary configurations rather than automatic synthesis\. Eq\. \([4](https://arxiv.org/html/2605.23311#S3.E4)\) is the skeleton\-level reviewed recovery contract: its identity and lifecycle fields support failed\-instance localization and checkpoint binding, while its predicate, interface, and effect fields support boundary certification and rollback admissibility in Section 4\.

#### Definition 4 \(Subtask instance\)\.

A concrete runtime occurrence of a skeleton is a subtask instance

$$I = (k, \eta, o) \tag{5}$$

where $k$ is the skeleton id, $\eta$ the concrete entity id, and $o$ the ordinal for repeated occurrences\.

The recoverable unit is the instance in Eq\. \([5](https://arxiv.org/html/2605.23311#S3.E5)\), not the whole task or the skeleton template\. Stage\-level labels are insufficient once the same stage re\-enters\.
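To see why the ordinal in Eq. (5) matters, the sketch below assigns a fresh ordinal each time the same skeleton is re-entered for the same entity, so two occurrences of one stage never share an identity. The `InstanceRegistry` class, its `activate` method, and the choice to start ordinals at 1 are illustrative assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class SubtaskInstance(NamedTuple):
    """I = (k, eta, o)."""
    skeleton_id: str  # k
    entity_id: str    # eta
    ordinal: int      # o


class InstanceRegistry:
    """Hypothetical sidecar that numbers repeated occurrences."""

    def __init__(self):
        self._counts = defaultdict(int)

    def activate(self, skeleton_id: str, entity_id: str) -> SubtaskInstance:
        key = (skeleton_id, entity_id)
        self._counts[key] += 1  # assumption: ordinals start at 1
        return SubtaskInstance(skeleton_id, entity_id, self._counts[key])
```

A stage-level label such as `plan_stop` alone would conflate the first and second visit to the same stop; the triple keeps them distinct.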

### 3\.4 Scope Commitments and Non\-Goals

We study structured tool\-agent runtimes in the explicit\-control class, instantiated with explicit FSM control, observable action\-boundary failures, and subtask\-instance recovery; in the current realization, recoverable boundaries come from reviewed boundary configurations rather than automatic discovery\. The system does not eliminate semantic review, but concentrates it onto audited boundary, interface, and effect objects\. We do not address silent\-failure detection, general fault prediction, universal boundary synthesis, or unrestricted rollback under arbitrary irreversible side effects\.

## 4 Method

### 4\.1 Method Overview: Four Recovery Questions

Checkpoint\-based recovery becomes incomplete for failed\-instance\-local recovery when it does not determine the failed instance, the recoverable boundaries of that instance, and the rollback targets admissible under dependency and effect constraints\. DART makes these requirements explicit through four recovery questions: failed\-instance localization, recoverable\-boundary certification, instance\-aligned checkpointing, and admissible rollback selection\. Within the scoped setting studied here, these decisions are necessary for preventing semantically invalid recovery\.

Figure [2](https://arxiv.org/html/2605.23311#S4.F2) summarizes this four\-layer pipeline\. Audits and proof sketches are deferred to Appendix [A](https://arxiv.org/html/2605.23311#A1) and Appendices [G](https://arxiv.org/html/2605.23311#A7) and [H](https://arxiv.org/html/2605.23311#A8)\.

![Refer to caption](https://arxiv.org/html/2605.23311v1/x2.png)

Figure 2: Recovery method overview\. After failure, the runtime identifies the failed instance, checks boundary and admissibility conditions, and restores the latest admissible checkpoint; otherwise it falls back to whole\-task rerun\.

### 4\.2 Failed\-Instance Localization: From Observable Failure to Failed Instance

Building on Definitions 3 and 4, the first step localizes an observable failure to a concrete subtask instance\. Given the reviewed skeleton/entity structure, the runtime either identifies a unique failed instance or conservatively abstains, because all later boundary, checkpoint, and rollback judgments are indexed by that instance\. Operationally, the runtime resolves the active skeleton, bound entity, and occurrence ordinal from the FSM state, tool arguments, and sidecar registry; if the resulting $(k, \eta, o)$ is not unique, it falls back to conservative whole\-task rerun \(Table [23](https://arxiv.org/html/2605.23311#A7.T23)\)\. Localization therefore establishes the instance index consumed by the next layer\.
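The localize-or-abstain rule can be sketched as follows. This is a simplified illustration under stated assumptions: live instances are given as a flat list, the entity is read from a hypothetical `entity_id` tool argument, and returning `None` stands for the conservative fallback to whole-task rerun.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional


class SubtaskInstance(NamedTuple):
    skeleton_id: str  # k
    entity_id: str    # eta
    ordinal: int      # o


@dataclass(frozen=True)
class LiveInstance:
    instance: SubtaskInstance
    internal_states: frozenset  # FSM states owned by the instance's skeleton


def localize(failure_state: str, tool_args: dict,
             live: list) -> Optional[SubtaskInstance]:
    """Return the unique failed instance, or None to abstain."""
    entity = tool_args.get("entity_id")  # hypothetical argument name
    candidates = [
        li.instance
        for li in live
        if failure_state in li.internal_states
        and li.instance.entity_id == entity
    ]
    # Zero or several matches are both treated as "not unique": abstain.
    return candidates[0] if len(candidates) == 1 else None
```

Abstaining on ambiguity is what keeps every later check well-defined, since boundaries, checkpoints, and admissibility are all indexed by the returned instance.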

### 4\.3 Recoverable\-Boundary Certification: From Failed Instance to Certified Boundary

Building on failed\-instance localization, the second step certifies which reviewed lifecycle points of that instance are semantically recoverable rather than merely controller\-legal\. Each reviewed skeleton supplies a minimal recovery contract: an entity resolver, entry states, commit/exit predicates, conservative input/output interface keys, and an effect policy\. This step filters reviewed candidates rather than searching over arbitrary legal states or transitions, and it occurs before checkpoint materialization\. Informally, a recoverable boundary is a reviewed point from which the failed instance can resume without invalidating the surrounding execution\.

#### Definition 5 \(Recoverable boundary\)\.

Let $b$ be a reviewed commit\- or exit\-level state or transition associated with subtask instance $I$\. We call $b$ a recoverable boundary iff

$$\begin{aligned}
\mathrm{Recoverable}(b, I) \iff{} & \mathrm{Decidable}(b, I) \land \mathrm{Closed}(b, I) \\
& \land \mathrm{Separable}(b, I) \land \mathrm{Controllable}(b, I)
\end{aligned} \tag{6}$$

Operationally, each conjunct is a concrete test\. $\mathrm{Decidable}$ passes iff the candidate still maps to one unique live instance identifier $(k, \eta, o)$\. $\mathrm{Closed}$ passes iff the reviewed commit\- or exit\-level predicate holds and the declared interface handoff is semantically complete\. $\mathrm{Separable}$ passes iff restoring from that point keeps replay confined to the failed instance rather than reopening an unrelated task prefix, enforced operationally by binding checkpoints to that instance and restricting restore search to its checkpoint set\. $\mathrm{Controllable}$ passes iff the effect policy allows rollback across that frontier\. A candidate is certified only if all four tests pass\.

This definition separates controller legality from recoverability\. A legal lifecycle point may still fail certification because the runtime can no longer tell which instance failed, the current step has not yet reached a semantically complete handoff, replay would reopen work outside that instance, or rollback is disallowed by the effect policy\. Operationally, the runtime evaluates only reviewed commit\- and exit\-level candidates for the failed instance and certifies those that keep the instance identifiable, the handoff closed, replay local, and rollback allowed under the effect policy\. For example, the transition from WAITING\_POI\_SELECTION to STOP\_READY is legal but not a certified exit boundary because the current stop instance has not reached a reviewed closed handoff\.
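Eq. (6) is a plain conjunction, which the sketch below makes explicit. The `BoundaryChecks` record and the `failed_conditions` helper are hypothetical; in a real runtime each flag would be computed from the reviewed skeleton contract rather than supplied by hand.

```python
from dataclasses import dataclass

CONDITIONS = ("decidable", "closed", "separable", "controllable")


@dataclass(frozen=True)
class BoundaryChecks:
    """Outcome of the four tests for one candidate boundary b of instance I."""
    decidable: bool     # still maps to one unique live (k, eta, o)
    closed: bool        # commit/exit predicate holds, handoff complete
    separable: bool     # replay stays inside the failed instance
    controllable: bool  # effect policy allows rollback across this frontier


def recoverable(checks: BoundaryChecks) -> bool:
    """Recoverable(b, I): all four conjuncts must pass."""
    return all(getattr(checks, name) for name in CONDITIONS)


def failed_conditions(checks: BoundaryChecks) -> list:
    """Names of the conjuncts that block certification (for audit logs)."""
    return [name for name in CONDITIONS if not getattr(checks, name)]
```

For the `WAITING_POI_SELECTION` to `STOP_READY` example, one would expect `closed` to be false; assuming the other three tests pass, `failed_conditions` reports only `closed` and the candidate is rejected despite being a legal transition.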

### 4\.4 Instance\-Aligned Checkpointing: From Certified Boundary to Stable Recovery Anchor

Once Section 4\.3 certifies recoverable boundaries for the failed instance, the next step is instance\-aligned checkpointing\. Its objective is to turn those certified boundaries, together with conservative entry anchors, into concrete restore objects indexed by that instance\. Given the failed instance and these reviewed lifecycle points, the runtime constructs a recency\-ordered checkpoint set attached to that instance\. This step introduces no new semantic criterion: it materializes the recoverable structure certified in Section 4\.3 rather than re\-deciding recoverability\.

Operationally, the runtime indexes checkpoint records by instance identity\. When an instance becomes active, it records an entry checkpoint; when the instance later satisfies a reviewed commit predicate, it appends a commit checkpoint; and it optionally records an exit checkpoint when a reviewed exit\-boundary predicate holds\.

#### Stable checkpoint\.

A stable checkpoint is a concrete restore object attached to a subtask instance and a reviewed lifecycle type\. We denote it by

where $I$ is the subtask instance and $\tau \in \{\texttt{entry}, \texttt{commit}, \texttt{exit}\}$ is the lifecycle type\. We write $\mathcal{C}(I)$ for the recency\-ordered checkpoint set attached to $I$\. Eq\. \([7](https://arxiv.org/html/2605.23311#S4.E7)\) binds each checkpoint to its instance\.

Instance\-aligned checkpointing matters because later restore search is confined to $\mathcal{C}(I_f)$ for the failed instance $I_f$ rather than collapsing back to an unrelated whole\-task prefix\. Entry checkpoints provide conservative restart anchors, commit checkpoints preserve stabilized partial progress within the same instance, and exit checkpoints record reviewed completion boundaries for policy\-controlled recovery\. By default, restore selection considers only entry and commit checkpoints; exit checkpoints are recorded but used only under an explicit policy\.
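A minimal sketch of an instance-indexed checkpoint store follows. The `Checkpoint` fields, the global sequence counter used for recency, and the `include_exit` flag are assumptions made for illustration; the paper's own checkpoint representation is not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any

KINDS = ("entry", "commit", "exit")


@dataclass(frozen=True)
class Checkpoint:
    instance: tuple  # (k, eta, o)
    kind: str        # lifecycle type tau
    seq: int         # hypothetical recency counter
    snapshot: Any    # opaque restore payload


class CheckpointStore:
    def __init__(self):
        self._by_instance = defaultdict(list)
        self._seq = 0

    def record(self, instance: tuple, kind: str, snapshot: Any) -> Checkpoint:
        if kind not in KINDS:
            raise ValueError(f"unknown lifecycle type: {kind}")
        self._seq += 1
        cp = Checkpoint(instance, kind, self._seq, snapshot)
        self._by_instance[instance].append(cp)
        return cp

    def checkpoints(self, instance: tuple, include_exit: bool = False) -> list:
        """C(I), most recent first; exit checkpoints are opt-in."""
        cps = [c for c in self._by_instance.get(instance, [])
               if include_exit or c.kind != "exit"]
        return sorted(cps, key=lambda c: c.seq, reverse=True)
```

Because lookups are keyed by instance, a restore search for one failed instance never sees checkpoints belonging to another, which is the separability property the text relies on.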

### 4\.5 Admissible Rollback Selection: From Stable Anchor to Admissible Local Rollback

Building on the failed instance’s stable checkpoint set from Section 4\.4, the final step is admissible rollback selection\. Its objective is to determine which stable checkpoints of the failed instance remain safe to restore under dependency and effect constraints, and to select the latest such anchor when one exists\. The input is the observable failure event together with the failed instance and its stable checkpoints; the output is either an admissible local rollback target or rejection\. This step is necessary because restoring from a stable checkpoint may still invalidate committed downstream work or cross an irreversible effect boundary\.

#### Definition 6 \(Admissible local recovery\)\.

Given an observable failure event $f$, a recovery is admissible local recovery iff there exist a subtask instance $I$ and a stable checkpoint $c$ such that

$$\begin{aligned}
\mathrm{AdmissibleRecover}(f, I, c) \iff{} & \mathrm{Identified}(f, I) \land \mathrm{Stable}(c, I) \\
& \land \mathrm{ScopeOK}(I, c) \land \mathrm{NoCommittedConflict}(I) \\
& \land \mathrm{EffectAllowed}(I, c)
\end{aligned} \tag{8}$$

In brief, $\mathrm{Identified}$ fixes the failed instance, $\mathrm{Stable}$ requires a checkpoint attached to that instance, $\mathrm{ScopeOK}$ keeps restoration within instance scope, $\mathrm{NoCommittedConflict}$ preserves committed downstream consumers, and $\mathrm{EffectAllowed}$ respects the reviewed effect policy\.

For the identified failed instance $I_f$ and failure event $f$, let

$$\mathcal{A}(f) = \{c \in \mathcal{C}(I_f) \mid \mathrm{AdmissibleRecover}(f, I_f, c)\} \tag{9}$$

denote the admissible checkpoint set for the failed instance\. Eq\. \([9](https://arxiv.org/html/2605.23311#S4.E9)\) makes explicit that admissibility is defined over the failed instance’s own stable checkpoints rather than the whole task\.

After the failed instance and its checkpoint set are fixed, the runtime realizes this gate through dependency\- and effect\-aware vetoes over $\mathcal{C}(I_f)$\. *NoCommittedConflict* is enforced by a conservative producer–consumer relation over instance\-level read/write sets induced by the reviewed interface contract, rejecting rollback once a committed downstream instance has consumed the failed instance’s outputs\. *EffectAllowed* gates candidates by the frozen effect policy $\pi_K^{\mathrm{eff}}$, blocking rollback across disallowed effect boundaries \(Appendix [I\.1](https://arxiv.org/html/2605.23311#A9.SS1)\)\. The runtime restores the most recent checkpoint in $\mathcal{C}(I_f)$ that satisfies scope, committed\-consumer, and effect\-policy checks; if none exists, $\mathcal{A}(f) = \emptyset$ and local rollback is rejected\. Committed\-consumer blocking is therefore necessary for sound local rollback once committed downstream consumers remain in place\. When $\mathcal{A}(f)$ is nonempty, the selected restore point is

$$c^{\star}(f) = \max \mathcal{A}(f) \tag{10}$$

where the maximum is taken with respect to checkpoint recency within instance $I_f$\. Appendix [H](https://arxiv.org/html/2605.23311#A8) gives the remaining statements for this rule\.
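The selection rule in Eqs. (8) to (10) can be sketched as a filter followed by a maximum over recency. This is a simplified illustration, not DART's implementation: instances are identified by name, read/write sets are plain key sets, the effect policy is reduced to a per-checkpoint boolean, and following Eq. (8) as written the committed-consumer check depends only on the failed instance. All class and field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    instance: str
    kind: str
    seq: int  # recency within the instance
    crosses_irreversible_effect: bool = False


@dataclass(frozen=True)
class InstanceIO:
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    committed: bool = False


def no_committed_conflict(failed: InstanceIO, others: list) -> bool:
    """False once a committed instance has read any output of the failed one."""
    return not any(
        o.committed and (o.reads & failed.writes)
        for o in others
        if o.name != failed.name
    )


def select_restore(failed: InstanceIO, checkpoints: list,
                   others: list) -> Optional[Checkpoint]:
    """Return c*(f), or None when A(f) is empty (local rollback blocked)."""
    admissible = [
        c for c in checkpoints
        if c.instance == failed.name               # Stable / ScopeOK
        and no_committed_conflict(failed, others)  # NoCommittedConflict
        and not c.crosses_irreversible_effect      # EffectAllowed
    ]
    return max(admissible, key=lambda c: c.seq, default=None)
```

In the scheduling example from the introduction, the proposing subtask writes the candidate times and the inviting subtask reads them; once the invitations are committed, `select_restore` returns `None` and the runtime would fall back to whole-task rerun.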

## 5 Experiments

### 5\.1 Experimental Setup

#### Domains and Protocol\.

We evaluate DART on three LLM\-driven domains: navigation, schedule\-form, and diagnosis\. Two deterministic domains, ETL pipeline and travel planning \[[33](https://arxiv.org/html/2605.23311#bib.bib33)\], are deferred to Appendix [D](https://arxiv.org/html/2605.23311#A4)\. Across domains, failures are injected only at controlled observable action boundaries, and reviewed boundary, interface/effect, and audit specifications are frozen before evaluation\.

#### Baselines\.

We compare four recovery strategies under a matched runtime and failure protocol: whole\-task rerun \(Retry\-Only\); Coarse\-State\-Retry, which restores the latest pre\-entry snapshot at a benchmark\-defined coarse FSM anchor; Comp\-EntryOnly, which restores the failed instance’s entry checkpoint; and Comp\-Frozen, which restores the latest admissible reviewed checkpoint of the failed instance\. These comparisons isolate recovery\-policy differences on a shared execution substrate\. We report both official headline cases and commitment\-sensitive cases, where recovery must preserve committed progress and downstream dependencies\.

#### Metrics and Reporting\.

Primary metrics are success, recovery\-observed rate, failure\-to\-milestone latency, replay actions, upstream replay, and preserved completed instances; the semantic audit additionally reports safe\-equivalence and admission/blocking statistics\. Main\-text latency and replay are medians over successful runs; paired tests, no\-failure overhead, and full audit details are deferred to Appendix [E](https://arxiv.org/html/2605.23311#A5) and Appendix [G](https://arxiv.org/html/2605.23311#A7)\. We report commitment\-sensitive cases first and headline cases second\. Extended LangGraph details are in Appendix [F](https://arxiv.org/html/2605.23311#A6)\.
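The reporting convention (medians taken over successful runs only) can be illustrated with a short aggregation sketch. The per-run dictionary keys `success`, `latency`, and `replay_actions` are hypothetical, and the sample values in the usage note are made up purely to show the arithmetic; they are not results from the paper.

```python
from statistics import median


def summarize(runs: list) -> dict:
    """Success rate over all runs; medians over successful runs only."""
    ok = [r for r in runs if r["success"]]
    return {
        "success_rate": len(ok) / len(runs) if runs else 0.0,
        "median_latency": median(r["latency"] for r in ok) if ok else None,
        "median_replay": median(r["replay_actions"] for r in ok) if ok else None,
    }
```

For instance, with two successful runs of latency 2.0 and 4.0 and one failed run of latency 100.0, the median latency is 3.0: the failed run affects the success rate but not the medians. This is worth keeping in mind when comparing strategies with different success rates.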

### 5\.2In\-Domain Recovery Results

#### Commitment\-Sensitive Recovery: Entry\-Only Fails, Certified Checkpoints Succeed\.

We first examine commitment\-sensitive failures, where checkpoint choice changes the outcome, not just the cost\. Across commitment\-sensitive cases in navigation, schedule\-form, and diagnosis, entry\-only recovery fails in two ways: in navigation, recovery is observed but the run fails the end\-to\-end task contract, whereas in schedule\-form and diagnosis no local recovery is observed at all\. In schedule\-form, Coarse\-State\-Retry is omitted because after the durable submit point no fair coarse anchor remains beyond the entry\-only family\. The decisive result is not that DART improves recovery quality by degree, but that entry\-only recovery fails in all evaluated commitment\-sensitive core\-domain cases, whereas reviewed admissible checkpoints succeed throughout\. This indicates that, without an admissibility criterion, failed\-instance\-local rollback leads to systematic failure rather than isolated errors\.

Table 1: Recovery outcomes in the three core LLM domains\. Panel A shows the commitment\-sensitive regime; for schedule\-form, Coarse\-State\-Retry is omitted because after submission no fair coarse anchor remains beyond the entry\-only family\. Panel B shows official headline cases\. Status distinguishes successful completion, contract failure after attempted recovery, no local recovery observed, and explicit blocking when no admissible checkpoint exists\. Latency and replay are medians over successful runs\.

To test robustness, we repeat the decisive commitment\-sensitive row across model families\. The pattern persists: entry\-only restore fails, Comp\-Frozen succeeds with one\-step replay, and retry\-only requires full upstream replay\. Because *Replay* and *Up\. replay* are frontier\-size metrics, they are driven by recovery structure rather than model family \(Appendix Table [6](https://arxiv.org/html/2605.23311#A2.T6)\)\.

#### Official Headline Recovery: Competitive Beyond Commitment\-Sensitive Cases\.

Outside the commitment\-sensitive regime, the result is narrower\. Panel B of Table [1](https://arxiv.org/html/2605.23311#S5.T1) shows that on headline cases Comp\-Frozen remains competitive, keeps upstream replay at zero relative to whole\-task rerun, and is mostly at parity with stronger local baselines rather than uniformly dominant\. Relative to Retry\-Only, latency remains significantly lower in paired analysis \(Appendix [E](https://arxiv.org/html/2605.23311#A5)\)\.

### 5\.3 Cross\-Runtime External Validation

We test whether the same commitment\-sensitive failure pattern reappears once persistence and resume are available in an external LangGraph\-based runtime\. Table [2](https://arxiv.org/html/2605.23311#S5.T2) reports an aligned three\-way comparison across Retry\-Only, LangGraph\-SemiReal, and DART on regime\-specific intersections\. The decisive schedule\-form commitment\-sensitive row is a counterexample: Retry\-Only succeeds with a large replay frontier, LangGraph\-SemiReal drops to 0\.00 success, and DART remains admissible and succeeds with a one\-step replay frontier\. This shows that the failure is not a runtime artifact, but a general limitation of checkpoint\-aligned recovery in commitment\-sensitive local rollback settings\.

Table 2: Cross\-runtime external validation on aligned regime\-specific intersections\. In the decisive schedule\-form commitment\-sensitive case, LangGraph\-based checkpoint\-aligned restore fails, whereas DART remains admissible and succeeds with a one\-step replay frontier\.

The navigation rows and the schedule\-form entry\-aligned row serve as controls, showing that the gap appears where recovery depends on semantic admissibility beyond checkpoint alignment\. Appendix [F](https://arxiv.org/html/2605.23311#A6) further separates portability from necessity through a transplant\-control study and a blocking witness\.

### 5\.4 Semantic Audit and Blocking Calibration

The final main\-text question is whether admitted recoveries remain semantically acceptable and conservatively calibrated\. Table [3](https://arxiv.org/html/2605.23311#S5.T3) reports 54 *comparable rows* and 47 *evaluated recovery events*; full denominator details are deferred to Appendix [G](https://arxiv.org/html/2605.23311#A7)\.

Table 3: Semantic audit and blocking calibration\. Panel A summarizes the five\-domain audit for the three core LLM\-driven domains shown here; full breakdowns appear in Appendix [G](https://arxiv.org/html/2605.23311#A7)\. Panel B reports blocking calibration\. “–” indicates not applicable for that audit slice\.

| Panel | Scope | Rows / events | Safe\-equiv\. | Blocked / unsafe | Semantic | Prefix | Effect | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | Overall semantic audit | 54 | 1\.00 | – | 54 | 54 | 31 | 21 rows also admit committed\-prefix checks |
| A | Navigation | 12 | 1\.00 | – | 12 | 12 | 0 | safe\-equivalent on all comparable rows |
| A | Schedule Form | 11 | 1\.00 | – | 11 | 11 | 0 | safe\-equivalent on all comparable rows |
| A | Diagnosis | 10 | 1\.00 | – | 10 | 10 | 10 | repair outcomes remain semantically aligned |
| B | Overall calibration | 47 | – | 12 blocked / 0 unsafe | – | – | – | 35 admitted; 0/12 false\-blocked events |
| B | Reason family | 12 blocked | – | 7 effect / 5 dependency | – | – | – | blocking is structured, not random |
| B | Unsafe admission audit | 35 admitted | – | 0 unsafe / 0\.0 rate | – | – | – | 0/35 unsafe admissions |

All 54 comparable rows are safe\-equivalent, and the 47\-event calibration yields 35 admitted and 12 blocked events with 0/35 unsafe admissions and 0/12 audited false blocks\.
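The calibration counts above fit together arithmetically, and a small Python sketch makes the bookkeeping explicit\. The event records below are synthetic stand\-ins built only from the reported totals \(35 admitted, 7 effect\-blocked, 5 dependency\-blocked\); the record layout and the `summarize_calibration` helper are assumptions made for illustration, not the paper's audit tooling\.

```python
from collections import Counter


def summarize_calibration(events):
    """Tally admitted/blocked events and the unsafe-admission rate."""
    admitted = [e for e in events if e["decision"] == "admitted"]
    blocked = [e for e in events if e["decision"] == "blocked"]
    unsafe = sum(1 for e in admitted if e["unsafe"])
    return {
        "events": len(events),
        "admitted": len(admitted),
        "blocked": len(blocked),
        "unsafe_admission_rate": unsafe / len(admitted) if admitted else 0.0,
        "block_reasons": dict(Counter(e["reason"] for e in blocked)),
    }


# Synthetic records reproducing only the aggregate counts reported in Table 3.
events = (
    [{"decision": "admitted", "unsafe": False, "reason": None}] * 35
    + [{"decision": "blocked", "unsafe": False, "reason": "effect"}] * 7
    + [{"decision": "blocked", "unsafe": False, "reason": "dependency"}] * 5
)

summary = summarize_calibration(events)
print(summary)
```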

### 5\.5 Ablation Studies

We conduct ablation studies aligned with our main claims: instance\-aligned checkpointing, recoverable\-boundary certification, and committed\-consumer blocking\. Table [4](https://arxiv.org/html/2605.23311#S5.T4) shows that coarse retry remains overly broad without instance\-aligned checkpoints \(A\), that controller\-legal points may still fail boundary certification \(B\), and that removing committed\-consumer blocking permits downstream\-invalidating rollbacks \(C\)\. Collectively, DART’s gains stem from semantic admissibility, not checkpointing alone\.

Table 4: Necessity ablations for semantic admissibility\. Panel A tests instance\-aligned checkpointing, Panel B tests recoverable\-boundary certification, and Panel C tests committed\-consumer blocking\. Panel A uses representative cases; aggregates appear in Table [1](https://arxiv.org/html/2605.23311#S5.T1)\.

| Panel | Domain / setting | Comp\-Frozen | Coarse / Entry\-only | Retry\-Only | Conclusion | Key signal | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | Navigation / entry\-aligned | replay 5, latency 3122 | replay 5, latency 3122 | replay 11, latency 6359 | reviewed commit not yet needed | coarse = frozen | same entry anchor |
| A | Navigation / commit\-sensitive | replay 1, latency 872 | replay 4, latency 3751 | replay 14, latency 9432 | commit checkpoint adds real gain | frozen strictly smaller replay | preserved inst\. = 2 |
| A | Schedule / commit\-sensitive | replay 1, latency 50 | entry\-only blocked | replay 26, latency 14247 | commit checkpoint remains necessary | entry\-only not admissible | preserved inst\. = 5 |
| A | Diagnosis / commit\-sensitive | replay 2, latency 40 | replay 5 / entry\-only fails | replay 15, latency 6430 | commit checkpoint shrinks frontier | coarse still wider than frozen | preserved inst\. = 2 |
| B | Navigation wrong edge | forced exit emitted | frozen exit absent | legal edge | wrong boundary is unsafe | unresolved branch marked EXITED | WAITING\_POI\_SELECTION → STOP\_READY |
| B | Schedule wrong edge | forced exit emitted | frozen exit absent | legal edge | wrong boundary is unsafe | unresolved slot marked EXITED | WAITING\_SLOT\_SELECTION → SLOT\_READY |
| C | Navigation consumer blocking | dropped 1 committed consumer | allowed without guard | blocked under guard | blocking is necessary | scope silently expands | downstream stop invalidated |
| C | Schedule consumer blocking | dropped 2 committed consumers | allowed without guard | blocked under guard | blocking is necessary | finalized schedule invalidated | irreversible consumer present |
| C | Diagnosis consumer blocking | dropped 1 committed consumer | allowed without guard | blocked under guard | blocking is necessary | finalized repair invalidated | irreversible repair consumer |

## 6 Discussion and Limitations

Our empirical claims are scoped to observable\-failure recovery in structured tool agents under reviewed boundary configurations and the current dependency abstraction\. This scope is chosen for auditability and transparency, not because the question is unique to explicit\-FSM controllers\. Commitment\-sensitive failures need not dominate all workloads, but they become structurally unavoidable whenever a persisted runtime combines instance\-local rollback with independently committed downstream dependencies or effects\. In that setting, separating controller legality from semantic recoverability is a correctness requirement\. For deployed agent runtimes, the practical implication is that persistence primitives alone are insufficient in commitment\-sensitive settings\. Local recovery should be attempted only when semantic admissibility can be justified; otherwise, execution\-legal rollback may still be globally inconsistent\. DART shows that such checks can be layered on top of existing explicit\-control runtimes while preserving local progress when safe\.

#### Broader Impacts\.

Explicit local recovery reduces unnecessary tool invocations, preserves completed progress, and improves long multi\-stage workflows\. Mis\-specified boundaries or overly aggressive rollback could still conceal errors in high\-stakes settings, so reviewed boundaries, conservative blocking, and explicit effect policies remain safeguards rather than mere optimization choices\.

## 7 Conclusion

This paper addresses semantic recoverability in structured tool\-agent runtimes under preserved downstream commitments\. DART makes this problem explicit through failed\-instance localization, recoverable\-boundary certification, instance\-aligned checkpointing, and admissible rollback selection\. Empirically, DART improves recovery correctness on the decisive commitment\-sensitive cases where baseline local recovery fails, reduces replay relative to whole\-task rerun, and introduces no unsafe admitted rollbacks under external LangGraph validation and a five\-domain safety audit\. More broadly, the results suggest that persistence primitives are necessary but not sufficient for sound local recovery: structured runtimes that preserve downstream committed work need an explicit admissibility criterion\.

## References

- Chandy and Lamport \[1985\] K\. M\. Chandy and L\. Lamport\. Distributed snapshots: Determining global states of distributed systems\. *ACM Transactions on Computer Systems*, 3\(1\):63–75, 1985\.
- Elnozahy et al\. \[2002\] E\. N\. Elnozahy, L\. Alvisi, Y\.\-M\. Wang, and D\. B\. Johnson\. A survey of rollback\-recovery protocols in message\-passing systems\. *ACM Computing Surveys*, 34\(3\):375–408, 2002\.
- Haerder and Reuter \[1983\] T\. Haerder and A\. Reuter\. Principles of transaction\-oriented database recovery\. *ACM Computing Surveys*, 15\(4\):287–317, 1983\.
- Garcia\-Molina and Salem \[1987\] H\. Garcia\-Molina and K\. Salem\. Sagas\. In *Proceedings of the 1987 ACM SIGMOD International Conference on Management of Data*, pages 249–259, 1987\.
- Haerder and Rothermel \[1987\] T\. Haerder and K\. Rothermel\. Concepts for transaction recovery in nested transactions\. In *Proceedings of the 1987 ACM SIGMOD International Conference on Management of Data*, pages 272–286, 1987\.
- Casati et al\. \[1999\] F\. Casati, S\. Ceri, S\. Paraboschi, and G\. Pozzi\. Specification and implementation of exceptions in workflow management systems\. *ACM Transactions on Database Systems*, 24\(3\):405–451, 1999\.
- Hagen and Alonso \[2000\] C\. Hagen and G\. Alonso\. Exception handling in workflow management systems\. *IEEE Transactions on Software Engineering*, 26\(10\):943–958, 2000\.
- Baresi et al\. \[2004\] L\. Baresi, C\. Ghezzi, and S\. Guinea\. Smart monitors for composed services\. In *Proceedings of the 2nd International Conference on Service\-Oriented Computing*, pages 193–202, 2004\.
- Baresi et al\. \[2007\] L\. Baresi, S\. Guinea, and L\. Pasquale\. Self\-healing BPEL processes with Dynamo and the JBoss rule engine\. In *Proceedings of the International Workshop on Engineering of Software Services for Pervasive Environments*, pages 11–20, 2007\.
- Carzaniga et al\. \[2010\] A\. Carzaniga, A\. Gorla, N\. Perino, and M\. Pezzè\. Automatic workarounds for web applications\. In *Proceedings of the 18th ACM SIGSOFT International Symposium on Foundations of Software Engineering*, pages 237–246, 2010\.
- Simmonds et al\. \[2010a\] J\. Simmonds, S\. Ben\-David, and M\. Chechik\. Guided recovery for web service applications\. In *Proceedings of the 18th ACM SIGSOFT International Symposium on Foundations of Software Engineering*, pages 247–256, 2010\.
- Simmonds et al\. \[2010b\] J\. Simmonds, S\. Ben\-David, and M\. Chechik\. Monitoring and recovery of web service applications\. In M\. Chignell, J\. Cordy, J\. Ng, and Y\. Yesha, editors, *The Smart Internet*, volume 6400 of *Lecture Notes in Computer Science*, pages 250–288\. Springer, 2010\.
- LangChain \[2026a\] LangChain\. *LangGraph Persistence*\. Documentation, 2026\. [docs\.langchain\.com/…/persistence](https://docs.langchain.com/oss/python/langgraph/persistence)\. Accessed April 2026\.
- LangChain \[2026b\] LangChain\. *LangGraph Interrupts*\. Documentation, 2026\. [docs\.langchain\.com/…/interrupts](https://docs.langchain.com/oss/python/langgraph/interrupts)\. Accessed April 2026\.
- LangChain \[2026c\] LangChain\. *Rollback Concurrent*\. LangSmith Documentation, 2026\. [docs\.langchain\.com/langsmith/rollback\-concurrent](https://docs.langchain.com/langsmith/rollback-concurrent)\. Accessed April 2026\.
- Amazon Web Services \[2026\] Amazon Web Services\. *Error Handling in Step Functions*\. Documentation, 2026\. [docs\.aws\.amazon\.com/step\-functions/…](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html)\. Accessed April 2026\.
- Featonby \[2021\] M\. Featonby\. *Making Retries Safe with Idempotent APIs*\. Amazon Builders’ Library, 2021\. [aws\.amazon\.com/builders\-library/…](https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/)\. Accessed April 2026\.
- Ray Team \[2026\] Ray Team\. *Fault Tolerance*\. Documentation, 2026\. [docs\.ray\.io/…/fault\-tolerance\.html](https://docs.ray.io/en/latest/ray-core/fault-tolerance.html)\. Accessed April 2026\.
- Liu et al\. \[2023\] X\. Liu, H\. Zhang, Y\. Song, et al\. AgentBench: Evaluating LLMs as agents\. *arXiv preprint arXiv:2308\.03688*, 2023\.
- Schick et al\. \[2023\] T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, et al\. Toolformer: Language models can teach themselves to use tools\. *Advances in Neural Information Processing Systems*, 36, 2023\.
- Shinn et al\. \[2023\] N\. Shinn, B\. Labash, and A\. Gopinath\. Reflexion: Language agents with verbal reinforcement learning\. *Advances in Neural Information Processing Systems*, 36, 2023\.
- Yao et al\. \[2023\] S\. Yao, J\. Zhao, D\. Yu, et al\. ReAct: Synergizing reasoning and acting in language models\. In *Proceedings of the 11th International Conference on Learning Representations*, 2023\.
- Guo et al\. \[2025\] L\. Guo, W\. Liu, Y\. W\. Heng, T\.\-H\. Chen, and Y\. Wang\. Agent\-SAMA: State\-aware mobile assistant\. *arXiv preprint arXiv:2505\.23596*, 2025\.
- Zhang et al\. \[2026\] S\. Zhang, C\. Yuan, R\. Guo, X\. Yu, R\. Xu, Z\. Chen, Z\. Li, Z\. Yang, S\. Guan, Z\. Tang, S\. Hu, L\. Zhang, R\. Chen, and H\. Wang\. EvoFSM: Controllable self\-evolution for deep research with finite state machines\. *arXiv preprint arXiv:2601\.09465*, 2026\.
- Vyas and Mercangoz \[2025\] J\. Vyas and M\. Mercangoz\. Autonomous control leveraging LLMs: An agentic framework for next\-generation industrial automation\. *arXiv preprint arXiv:2507\.07115*, 2025\.
- Barke et al\. \[2026\] S\. Barke, A\. Goyal, A\. Khare, A\. Singh, S\. Nath, and C\. Bansal\. AgentRx: Diagnosing AI agent failures from execution trajectories\. *arXiv preprint arXiv:2602\.02475*, 2026\.
- Zhu et al\. \[2025\] K\. Zhu, Z\. Liu, B\. Li, M\. Tian, Y\. Yang, J\. Zhang, P\. Han, Q\. Xie, F\. Cui, W\. Zhang, X\. Ma, X\. Yu, G\. Ramesh, J\. Wu, Z\. Liu, P\. Lu, J\. Zou, and J\. You\. Where LLM agents fail and how they can learn from failures\. *arXiv preprint arXiv:2509\.25370*, 2025\.
- Vuddanti et al\. \[2025\] S\. V\. Vuddanti, A\. Shah, S\. K\. Chittiprolu, T\. Song, S\. Dev, K\. Zhu, and M\. Chaudhary\. PALADIN: Self\-correcting language model agents to cure tool\-failure cases\. *arXiv preprint arXiv:2509\.25238*, 2025\.
- Chang and Geng \[2025\] E\. Y\. Chang and L\. Geng\. SagaLLM: Context management, validation, and transaction guarantees for multi\-agent LLM planning\. *arXiv preprint arXiv:2503\.11951*, 2025\.
- Chang and Geng \[2025\] E\. Y\. Chang and L\. Geng\. ALAS: A stateful multi\-LLM agent framework for disruption\-aware planning\. *arXiv preprint arXiv:2505\.12501*, 2025\.
- In et al\. \[2026\] Y\. In, M\. Tanjim, J\. Subramanian, S\. Kim, U\. Bhattacharya, W\. Kim, S\. Park, S\. Sarkhel, and C\. Park\. Rethinking failure attribution in multi\-agent systems: A multi\-perspective benchmark and evaluation\. *arXiv preprint arXiv:2603\.25001*, 2026\.
- Huang et al\. \[2025\] J\.\-T\. Huang, J\. Zhou, T\. Jin, X\. Zhou, Z\. Chen, W\. Wang, Y\. Yuan, M\. R\. Lyu, and M\. Sap\. On the resilience of LLM\-based multi\-agent collaboration with faulty agents\. In *Proceedings of the 42nd International Conference on Machine Learning*, 2025\.
- Xie et al\. \[2024\] J\. Xie, K\. Zhang, J\. Chen, T\. Zhu, R\. Lou, Y\. Tian, Y\. Xiao, and Y\. Su\. TravelPlanner: A benchmark for real\-world planning with language agents\. In *Proceedings of the 41st International Conference on Machine Learning*, 2024\.
- Cassandras and Lafortune \[2021\] C\. G\. Cassandras and S\. Lafortune\. *Introduction to Discrete Event Systems*\. Springer, 3rd edition, 2021\.
- Sampath et al\. \[1995\] M\. Sampath, R\. Sengupta, S\. Lafortune, K\. Sinnamohideen, and D\. Teneketzis\. Diagnosability of discrete\-event systems\. *IEEE Transactions on Automatic Control*, 40\(9\):1555–1575, 1995\.

## Appendix A Appendix Roadmap

The appendix is organized as a compact support map rather than a second narrative\. The main text includes a dedicated Discussion and Limitations section that clarifies the scope of our claims, the main limitations of the current study, and the broader implications of the proposed recovery criterion\. The appendix provides the remaining supporting evidence: Appendix [B](https://arxiv.org/html/2605.23311#A2) gives cross\-model robustness on the decisive commitment\-sensitive row; Appendix [E](https://arxiv.org/html/2605.23311#A5) strengthens the main recovery tables with paired statistics and overhead diagnostics; Appendix [F](https://arxiv.org/html/2605.23311#A6) provides *external validation* on LangGraph\-based runtimes; and Appendix [G](https://arxiv.org/html/2605.23311#A7) supplies the main safety\-and\-mechanism backbone through the five\-domain audit chain, localization audit, and boundary/property evidence\. Appendix [D](https://arxiv.org/html/2605.23311#A4), Appendix [H](https://arxiv.org/html/2605.23311#A8), and Appendix [I](https://arxiv.org/html/2605.23311#A9) provide setup breadth, deferred proof support, and reproducibility details\.

Table 5: Appendix roadmap by reviewer concern\.

## Appendix B Cross\-Model Decisive\-Row Evidence

Appendix Table [6](https://arxiv.org/html/2605.23311#A2.T6) reports the full cross\-model results on the decisive schedule\-form commitment\-sensitive row used in the main text\.

Table 6: Cross\-model results on the decisive schedule\-form commitment\-sensitive row \(schedule\_live\_final\_render\_failure\_after\_submitted\)\. Across all tested model families, entry\-only restore fails, frozen local recovery succeeds with one\-step replay and zero upstream replay, and retry\-only succeeds only by replaying the full upstream prefix\. Columns are ordered by priority from left to right: success, replay width, upstream replay, preserved progress, and failure\-to\-milestone latency\.

## Appendix C Cross\-Model Generalization Results

Appendix Table [7](https://arxiv.org/html/2605.23311#A3.T7) reports the full cross\-model generalization results beyond the decisive row, covering cross\-domain, cross\-runtime, and control\-case settings\.

Table 7: Cross\-model generalization results beyond the decisive row\. We report representative results from three broader settings: a navigation commitment\-sensitive row \(cross\-domain evidence\), a LangGraph\-based schedule row \(cross\-runtime evidence\), and an official schedule\-form control case \(control\-case evidence\)\. Within each setting, we report a single representative recovery controller per model to keep cross\-model comparisons aligned\. Columns are ordered by priority from left to right: success, replay width, upstream replay, preserved progress, and failure\-to\-milestone latency\.

## Appendix D Benchmark Universe and Deterministic\-Domain Generalization

Appendix [D](https://arxiv.org/html/2605.23311#A4) broadens the empirical scope beyond the three main\-text LLM\-driven domains by documenting the full five\-domain benchmark universe and the deterministic\-domain generalization results\. The travel\-planning domain and its included cases are derived from the open\-source TravelPlanner dataset \[[33](https://arxiv.org/html/2605.23311#bib.bib33)\]\.

### D\.1 Five\-Domain Benchmark Universe

Table 8: Five\-domain benchmark universe\. The main text uses the three core LLM\-driven domains; deterministic ETL and travel\-planning results appear only in Appendix [D](https://arxiv.org/html/2605.23311#A4)\. All live aggregates use repeat = 5\.

The five\-domain universe is intentionally heterogeneous\. Navigation, schedule\-form, and diagnosis are externally grounded LLM\-driven live\-agent domains\. ETL pipeline and travel planning are deterministic domains that stress the same recovery framework without live LLM uncertainty, so we keep them as appendix\-only generalization evidence\. Travel\-planning cases instantiate TravelPlanner tasks as frozen planning skeletons with explicit commit/exit predicates and controlled observable\-failure sites, while audit\-safe\-equivalence is still evaluated with frozen domain specifications\.

### D\.2 Deterministic\-Domain Generalization

Table 9: Deterministic\-domain generalization on ETL pipeline and travel planning\. These rows use the same stronger\-baseline protocol as the main text; travel\-planning cases come from TravelPlanner \[[33](https://arxiv.org/html/2605.23311#bib.bib33)\]\. Status uses the same semantics as Table [1](https://arxiv.org/html/2605.23311#S5.T1)\. For compactness, preserved completed instances is omitted here and the table focuses on task\-level success and replay behavior in deterministic domains\. Values are medians over successful runs\.

The deterministic domains support a narrower generalization claim: in the official setting, Comp\-Frozen continues to eliminate upstream replay, and in the deterministic commitment\-sensitive settings entry\-only recovery again fails whereas Comp\-Frozen still succeeds with a one\-step replay frontier\.

## Appendix E Statistical Robustness and Efficiency

Appendix [E](https://arxiv.org/html/2605.23311#A5) complements the main\-text tables with paired significance tests, no\-failure\-path efficiency measurements, and a compact checkpoint\-granularity diagnostic\.

### E\.1 Paired Live Statistics

Table 10: Paired latency robustness for Comp\-Frozen versus Retry\-Only\. Rows use runtime\-exported pair keys and report matched\-run medians, paired median deltas with 95% bootstrap confidence intervals, and Holm\-adjusted exact tests\. Replay and upstream replay also uniformly favor Comp\-Frozen\.

The paired statistics confirm the main text’s qualitative pattern\. The largest latency reductions appear in the commitment\-sensitive rows, and the official headline setting remains significant in all three core LLM\-driven domains when compared against Retry\-Only\. Against stronger local baselines on official rows, however, Comp\-Frozen does not show uniform dominance: in navigation it is statistically indistinguishable from both Coarse\-State\-Retry and Comp\-EntryOnly on failure\-to\-milestone latency \(Holm\-adjusted $p = 0.18352$ and $1.00000$\), with replay, upstream replay, and preserved completed instances all at parity; in schedule\-form it retains selective latency improvements over both Coarse\-State\-Retry and Comp\-EntryOnly \(Holm\-adjusted $p = 0.00060$ and $0.00616$\), while replay, upstream replay, and preserved completed instances remain at parity\.
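For readers unfamiliar with the two statistical ingredients named in the Table 10 caption, the following Python sketch shows one common way to compute a paired median delta with a percentile bootstrap interval and to apply the Holm step\-down adjustment to a family of raw p\-values\. The latency numbers and raw p\-values are made up for illustration, the function names are hypothetical, and the exact test that produces the raw p\-values is not specified in the text and is not implemented here\.

```python
import random
from statistics import median


def paired_median_delta(a, b, n_boot=2000, seed=0):
    """Median of paired differences a[i] - b[i] with a 95% percentile bootstrap CI."""
    deltas = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)
    boots = sorted(
        median(rng.choices(deltas, k=len(deltas))) for _ in range(n_boot)
    )
    lo = boots[int(0.025 * (n_boot - 1))]
    hi = boots[int(0.975 * (n_boot - 1))]
    return median(deltas), (lo, hi)


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (0-indexed rank) is scaled by (m - rank).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative (made-up) matched latencies in milliseconds.
retry_only = [9400.0, 9100.0, 9800.0, 9300.0, 9600.0]
comp_frozen = [870.0, 910.0, 850.0, 900.0, 880.0]
delta, ci = paired_median_delta(retry_only, comp_frozen)

# Illustrative raw p-values for a family of three comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
print(delta, ci, adjusted)
```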

### E\.2 No\-Failure\-Path Overhead

Table 11: No\-failure\-path overhead for the currently instrumented navigation and schedule\-form domains\. Values are medians over repeat = 5\. The sidecar\-hook and snapshot columns isolate recovery\-readiness bookkeeping rather than total planner variance\.

Even though total no\-failure wall\-clock time still reflects live\-path variance, the bookkeeping attributable to Comp\-Frozen remains small in absolute terms: the sidecar hook stays on the order of tens of milliseconds, and peak serialized snapshots remain around 12–15 KB in the current setup\.

### E\.3 Checkpoint Granularity Diagnostic

Table 12: Synthetic checkpoint\-granularity diagnostic in navigation\. Once the admissible restore point lies beyond entry, reviewed commit checkpoints sharply reduce replay\. End\-to\-end latency is a synthetic harness estimate\.

## Appendix F External Validation on LangGraph\-Based Runtimes

Appendix [F](https://arxiv.org/html/2605.23311#A6) provides *external validation* of the main recoverability claim on LangGraph\-based runtimes\. We instantiate two LangGraph\-based recovery controllers and evaluate them through three complementary views: regime\-aware comparison, transplant\-control transportability, and a counterfactual blocking witness\. The goal is not to benchmark generic framework speed, but to test whether the same *commitment\-sensitive rollback failure* isolated by the admissibility analysis reappears once persistence and resume behavior are already available in an external graph runtime; this two\-domain overlay is external evidence only and is not part of the five\-domain audit denominator\.

### F\.1 Runtime Families

Table 13: Runtime families in the external validation study\. LangGraph\-Direct and LangGraph\-SemiReal are implemented using LangGraph execution and persistence primitives under our controlled benchmark protocol\.

### F\.2 Regime\-Aware Comparison

Readers interested mainly in the decisive external result may start from Table [15](https://arxiv.org/html/2605.23311#A6.T15), where the schedule\-form commitment\-sensitive row isolates the boundary failure predicted by the main admissibility analysis\. The aggregate rows here provide context: on the smaller aligned direct subset, LangGraph\-Direct is faster on raw failure\-to\-milestone time in both headline domains \(navigation: 1480\.22 ms vs\. 3248\.19 ms; schedule\-form: 1\.77 ms vs\. 2426\.61 ms\)\. The decisive difference therefore lies in the schedule\-form commitment\-sensitive regime, where rollback admissibility becomes outcome\-critical\.

Table 14: Aggregate LangGraph\-SemiReal vs\. DART summaries on the official and regime\-balanced aligned sets\. Values are medians over successful runs\.

The regime\-aware three\-way anchor is the clearest external\-validation result because it keeps only the shared regime\-specific case intersection across Retry\-Only, LangGraph\-SemiReal, and DART Comp\-Frozen\. On navigation, the semi\-real LangGraph runtime remains competitive in both entry\-aligned and commitment\-sensitive settings\. On schedule\-form, however, the commitment\-sensitive row is decisive: Retry\-Only succeeds with a large replay frontier \(32618\.41 ms, 25\.5 replayed actions\), LangGraph\-SemiReal drops to 0\.00 success, and DART remains admissible and succeeds with a one\-step replay frontier \(1109\.80 ms, 1\.0 replayed action\)\. This is the external confirmation that checkpoint\-aligned restore alone is insufficient once rollback must respect downstream commitments\.

To separate policy effects from executor effects, we also run a matched G0/G1/G2/G3 transplant\-control study on the same two\-domain regime\-balanced universe while holding the reviewed checkpoint substrate fixed between G2 and G3\. In all four domain\-regime cells, G2 and G3 both remain at 1\.00 success with identical median replay and zero upstream replay; on the current universe every G3 decision is eligible with zero fallbacks, and relative to DART\-native the transplanted controller matches replay exactly while keeping failure\-to\-milestone latency within $0.76$–$0.84\times$\. These rows are therefore a transportability check rather than blocked\-case evidence: they show that the admissibility layer ports without harming recovery on safe cases\. Because every current G3 decision is eligible, the necessity of the gate comes instead from the counterfactual witness below and the five\-domain calibration results in Appendix [G](https://arxiv.org/html/2605.23311#A7), which show what breaks when the same dependency/effect veto is disabled or overridden\.

Table 15: Three\-way anchor across Retry\-Only, LangGraph\-SemiReal, and DART Comp\-Frozen on aligned regime\-specific intersections\. The schedule\-form commitment\-sensitive row is the key safety result\.

### F\.3 Semantic Equivalence

The study does not rely only on latency\. Under the current overlay contract, all comparable LangGraph\-SemiReal versus DART pairs are safe\-equivalent: 6/6 on the official track and 8/8 on the regime\-balanced track\.

Table 16: LangGraph\-SemiReal semantic overlays\. The official overlay yields 6 comparable safe\-equivalent rows, and the regime\-balanced overlay yields 8\.

### F\.4 Mechanism Witness

The blocking witness is counterfactual evidence that the gate is necessary, not just that DART differs from LangGraph\. In the schedule\-form witness, producer ResolveSlot::slot\[0\]::0 has already been consumed by two committed downstream instances: ResolveSlot::slot\[1\]::0 and FinalizeSchedule::final::0\. With blocking on, DART rejects rollback with committed\_consumers\_present\. With the same check disabled, the producer\-commit restore becomes eligible but drops those two committed consumers\. The reconstructed LangGraph\-style restore invalidates the same pair\. So checkpoint alignment alone is not enough once rollback must preserve downstream commitments\.

Table 17: Schedule\-form blocking witness against LangGraph\-SemiReal under an explicit producer rollback request\. Without blocking, the same restore invalidates committed downstream consumers\.

| Setting | DART blocking on | DART blocking off | LangGraph\-SemiReal restore |
| --- | --- | --- | --- |
| Schedule producer rollback after finalized downstream consumers | reject with committed\_consumers\_present | rollback allowed; 2 committed consumers dropped | restore allowed; same 2 downstream consumers invalidated |
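The witness can be mimicked with a small Python sketch of a committed\-consumer guard\. The instance keys and the `committed_consumers_present` reason string come from the witness description; the `Instance` record, the `check_rollback` function, and the return shape are hypothetical and stand in for whatever the runtime actually uses\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    key: str                           # instance identifier
    committed: bool                    # whether the instance has durably committed
    consumes: frozenset = frozenset()  # producer keys whose output it has used


def check_rollback(producer_key, instances, blocking=True):
    """Decide whether rolling back `producer_key` is allowed.

    With blocking on, any committed consumer of the producer vetoes the
    rollback. With blocking off, the rollback proceeds and the committed
    consumers it would invalidate are reported as dropped.
    """
    affected = sorted(
        i.key for i in instances
        if i.committed and producer_key in i.consumes
    )
    if blocking and affected:
        return {
            "decision": "reject",
            "reason": "committed_consumers_present",
            "dropped": [],
        }
    return {"decision": "eligible", "reason": None, "dropped": affected}


producer = "ResolveSlot::slot[0]::0"
instances = [
    Instance(producer, committed=True),
    Instance("ResolveSlot::slot[1]::0", True, frozenset({producer})),
    Instance("FinalizeSchedule::final::0", True, frozenset({producer})),
]

guarded = check_rollback(producer, instances, blocking=True)
unguarded = check_rollback(producer, instances, blocking=False)
print(guarded)
print(unguarded)
```

Under these assumptions the guarded call rejects the rollback, while the unguarded call admits it and reports the two committed consumers it would drop, matching the two DART columns of Table 17\.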

## Appendix G Five\-Domain Audit Chain and Mechanism Evidence

Appendix [G](https://arxiv.org/html/2605.23311#A7) collects the five\-domain audit chain and the mechanism\-level sanity checks that underpin the main\-text semantic audit and blocking calibration claims, including semantic audit, blocking calibration, failed\-instance localization, and property\-wise necessity evidence\. Reviewers interested only in the main\-text safety claims can read this appendix in the following order: denominator flow and protocol, five\-domain semantic audit, five\-domain blocking calibration, and then the localization and boundary/property evidence\.

### G\.1 Five\-Domain Semantic Audit and Blocking Calibration

These tables establish the safety backbone behind Table [3](https://arxiv.org/html/2605.23311#S5.T3)\. Table [18](https://arxiv.org/html/2605.23311#A7.T18) aligns the main counts across the five\-domain audit chain, separating the comparable\-row denominator used for safe\-equivalence from the evaluated\-event denominator used for blocking calibration\. The reported labels are executable checks derived from frozen case specifications and reviewed domain specifications over normalized semantic, committed\-prefix, and durable\-effect projections\. Human review enters when freezing boundary, effect, and audit specifications; we do not claim an independent multi\-annotator audit for the current version\.

Table 18: Denominator flow for the five-domain audit chain. The 54-row semantic denominator counts cases where both methods yield audit-ready terminal outputs; for injected-failure runs, comparability additionally requires observed failure and recovery under Comp-Frozen. The 47-event calibration denominator is the subset that reaches the admissibility gate. Shared status labels are `ok`, `contract`, `no-recov`, and `blocked`, with the same meanings as in Table [1](https://arxiv.org/html/2605.23311#S5.T1). We use *admitted event*, *blocked event*, and *blocked checkpoint* for the three calibration units, and reserve *preserved completed instances* for the runtime metric versus *committed-prefix preservation* for the ETL/travel audit check. A comparable row is safe-equivalent iff all domain-applicable reviewed checks pass; rows failing the comparable-row inclusion rule are excluded from the safe-equivalence denominator.

Table 19: Audit protocol and denominators for Table [3](https://arxiv.org/html/2605.23311#S5.T3).

Table 20: Domain-specific reviewed specifications for audit-safe-equivalence.

Table 21: Five-domain semantic audit underlying Table [3](https://arxiv.org/html/2605.23311#S5.T3). Rows are aggregated at the comparable-row level.

Table 22: Five-domain blocking calibration by domain. Rows use the same 47 evaluated recovery events as Table [3](https://arxiv.org/html/2605.23311#S5.T3); false-blocked and unsafe-admission columns report audited event-level counts.

### G.2 Failed-Instance Localization Audit

This subsection audits failed-instance localization over the same five-domain frozen case universe defined in Table [8](https://arxiv.org/html/2605.23311#A4.T8) (54 cases, 5 repeats each, 270 repeat-level rows). It addresses a narrower but reviewer-critical question than semantic equivalence alone: whether the runtime actually localizes recovery to the correct failed instance rather than merely to the correct skeleton family. We audit this in three layers. First, we align the observed Comp-Frozen recovery scope and checkpoint type with frozen case specifications over the repeat = 5 official and commitment-sensitive cases. Across the 270 repeat-level rows, the observed recovery-scope prefix matches the specification in all rows, and the observed checkpoint type also matches in all rows. Second, we run a systematic offline ambiguity benchmark over the same universe: the benchmark keeps the observed full recovery identifiers fixed, then weakens instance keys by dropping ordinal or structural fields to test when a conservative runtime should abstain. Third, we retain three executable consequence probes. Navigation and diagnosis each yield a unique weakened-alias candidate under the frozen protocol, whereas schedule-form exposes a genuine re-entry ambiguity witness: collapsing ordinal identity creates two candidates, and forcing the stale one would erase an already committed refined value. The probes therefore illustrate concrete ambiguity damage, while the broader ambiguity benchmark provides the systematic coverage.

Table 23: Failed-instance localization audit. The table combines observed repeat-level alignment, a systematic offline ambiguity benchmark over the frozen case universe, and three executable consequence probes.

### G.3 Boundary Review Protocol and Boundary Evidence
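The abstain-on-ambiguity behavior that the offline benchmark tests can be illustrated with a minimal sketch. It assumes, purely for illustration, that an instance identifier has the layout `skeleton::slot::ordinal` (inferred from identifiers such as `ResolveSlot::slot[0]::0`); the `weaken` and `localize` helpers and the re-entry instance with ordinal `1` are hypothetical and do not come from the artifact.

```python
def weaken(instance_id: str) -> str:
    """Drop the trailing ordinal field from an assumed
    'skeleton::slot::ordinal' identifier."""
    skeleton, slot, _ordinal = instance_id.split("::")
    return f"{skeleton}::{slot}"


def localize(weak_key, registry):
    """Return the unique instance matching the weakened key.

    Returns None (abstain) when the weakened key matches zero or several
    instances, since forcing one candidate could target a stale instance.
    """
    candidates = [i for i in registry if weaken(i) == weak_key]
    return candidates[0] if len(candidates) == 1 else None


registry = [
    "ResolveSlot::slot[0]::0",
    "ResolveSlot::slot[0]::1",   # hypothetical re-entry of the same slot
    "FinalizeSchedule::final::0",
]

ambiguous = localize("ResolveSlot::slot[0]", registry)
unique = localize("FinalizeSchedule::final", registry)
print(ambiguous, unique)
```

In this sketch the weakened `ResolveSlot::slot[0]` key matches two candidates, so a conservative runtime abstains instead of picking one, while the `FinalizeSchedule::final` key still resolves uniquely.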

Table 24: Reviewed positive exit-boundary cases and paired negative controls in the two core workflow-shaped domains. Together with Table [4](https://arxiv.org/html/2605.23311#S5.T4), these rows show that controller legality is weaker than reviewed boundary validity.

Table 25: Boundary review load across workflow-shaped reviewed-boundary domains. Counts report reviewed boundary objects rather than full FSM size; ETL pipeline and travel planning instead use deterministic output/effect specifications and are outside this reviewer-burden view.

Table 26: Protocolized boundary-review telemetry from logged real sessions in a representative workflow-shaped domain. Reviewer-minute quantities are reported only when a real timed session is present; exact match checks whether the reviewed output reproduces the frozen configuration after validation.

Table 27: Cross-domain structural boundary transfer audit from candidate export to frozen configuration. Stable indicates no missing or extra skeletons and no field diffs relative to the frozen reviewed configuration; Exact match refers to candidate-to-frozen alignment at the skeleton/field level.

These tables make the scope explicit: reviewer-burden evidence is reported only for workflow-shaped reviewed-boundary domains, with timed telemetry available for Schedule Form. Within that scope, review is concentrated on a much smaller set of candidate predicates and semantic annotations than the raw FSM size might suggest.

### G.4 Failure-Signal Normalization and Snapshot-Depth Efficiency

This subsection checks two support claims: whether normalized failure signals preserve the same recovery decision at fixed sites, and whether the registry\-only sidecar remains storage\-efficient as checkpoint depth grows\.

Table 28: Failure-signal normalization adequacy matrix for the current fixed-site evaluation. At each fixed site, varying the raw observable signal leaves both the admissibility decision and the recovery signature unchanged.

Table 29: Snapshot-depth efficiency summary for the schedule-form depth benchmark. The registry-only sidecar grows much more slowly than the inline payload, while restore cost stays in the same order of magnitude.

### G.5 Property-Wise Necessity Decomposition

Table [30](https://arxiv.org/html/2605.23311#A7.T30) maps each conjunct of Eq. ([6](https://arxiv.org/html/2605.23311#S4.E6)) to the concrete counterexample or audit family that fails when it is removed. Read together, the dependency-blocking witnesses and effect-policy forced-override audits show why admissibility is needed both to preserve committed downstream semantics and to avoid replay across irreversible effect boundaries.

Table 30: Property-wise necessity decomposition for Eq. ([6](https://arxiv.org/html/2605.23311#S4.E6)). Each row points to a concrete evidence family showing what breaks if the corresponding conjunct is removed.

## Appendix H Proof Sketches for the Semantic Soundness Theorems

###### Lemma 1 (Legal edges do not imply recoverable boundaries).

Let $u=(s_i,a,s_j)$ be a controller-legal edge, i.e., $(s_i,a,s_j)\in\delta$. If there exists a subtask instance $I$ associated with $u$ such that at least one of $\mathrm{Decidable}(u,I)$, $\mathrm{Closed}(u,I)$, $\mathrm{Separable}(u,I)$, or $\mathrm{Controllable}(u,I)$ does not hold, then $u$ is not a recoverable boundary for $I$ under Eq. ([6](https://arxiv.org/html/2605.23311#S4.E6)).

###### Theorem 1 (Necessity of committed-consumer blocking).

Let $I_p$ be a producer instance and let $I_q$ be a committed downstream consumer of $I_p$. Any local-recovery policy that rolls back $I_p$ while leaving $I_q$ committed, without compensation, invalidation, or joint rollback of $I_q$, cannot guarantee semantic equivalence of the recovered execution. Therefore, committed-consumer blocking is necessary for sound failed-instance-local rollback.

###### Corollary 1 (Soundness of dependency-aware admission under conservative dependency abstraction).

Assume A3–A4\. If local rollback of a producer instance is admitted by the runtime, then there exists no committed downstream instance that would become semantically unsupported were that producer rolled back while the downstream instance remained committed\.

###### Theorem 2 (Maximal admissible checkpoint selection).

Assume the stable checkpoints of $I_f$ are totally ordered by recency under Assumption A2. If $\mathcal{A}(f)\neq\emptyset$, then Eq. ([10](https://arxiv.org/html/2605.23311#S4.E10)) returns a unique checkpoint

$$c^{\star}(f)=\max\mathcal{A}(f)$$

Moreover: (i) $c^{\star}(f)$ is admissible, i.e., $c^{\star}(f)\in\mathcal{A}(f)$; and (ii) for any checkpoint $c'\in\mathcal{C}(I_f)$ with $c^{\star}(f)\prec^{(I_f)}c'$, we have $c'\notin\mathcal{A}(f)$.

#### Proof Sketch of Lemma [1](https://arxiv.org/html/2605.23311#Thmlemma1).

By Eq. ([6](https://arxiv.org/html/2605.23311#S4.E6)), $\mathrm{Recoverable}(u,I)$ is the conjunction of four obligations: $\mathrm{Decidable}(u,I)$, $\mathrm{Closed}(u,I)$, $\mathrm{Separable}(u,I)$, and $\mathrm{Controllable}(u,I)$. Therefore, if any one of these conjuncts fails for the legal edge $u$ and instance $I$, then the conjunction itself fails, and hence $\neg\mathrm{Recoverable}(u,I)$. Controller legality certifies only controller-level reachability and does not reintroduce any missing conjunct. Thus legality is necessary for controller execution but not sufficient for recoverability.
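The conjunction argument can be made concrete with a small sketch. The `BoundaryReview` record and its boolean fields are hypothetical stand-ins for the reviewed outcome of the four obligations, and the toy transition relation `delta` is an assumption made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryReview:
    """Hypothetical reviewed outcome of the four obligations for (u, I)."""
    decidable: bool
    closed: bool
    separable: bool
    controllable: bool


def recoverable(review: BoundaryReview) -> bool:
    """Conjunction of the four obligations; one failed conjunct fails it."""
    return (
        review.decidable
        and review.closed
        and review.separable
        and review.controllable
    )


# Toy controller transition relation with a single legal edge.
delta = {("s_i", "a", "s_j")}
edge = ("s_i", "a", "s_j")

# The edge is controller-legal, but the instance is not closed at this edge.
review = BoundaryReview(decidable=True, closed=False,
                        separable=True, controllable=True)

is_legal = edge in delta
is_recoverable = recoverable(review)
print(is_legal, is_recoverable)
```

The edge passes the legality check yet fails recoverability, which is the separation the lemma states.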

#### Proof Sketch of Theorem [1](https://arxiv.org/html/2605.23311#Thmtheorem1).

Assume $I_p\rightsquigarrow I_q$ under Eq. ([14](https://arxiv.org/html/2605.23311#A9.E14)), and that $I_q$ is already committed. By Assumption A3, the committed state of $I_q$ semantically depends on outputs produced by $I_p$. Suppose a local-recovery policy rolls back $I_p$ while leaving $I_q$ committed and without compensating, invalidating, or jointly rolling back $I_q$. Then the committed state of $I_q$ becomes semantically unsupported after the producer rollback. This violates semantic equivalence of the recovered execution and is exactly the conflict ruled out by the *NoCommittedConflict* term in Eq. ([8](https://arxiv.org/html/2605.23311#S4.E8)). Therefore any sound failed-instance-local recovery policy must block such rollback.

#### Proof Sketch of Theorem [2](https://arxiv.org/html/2605.23311#Thmtheorem2).

By Assumption A2, the checkpoint set $\mathcal{C}(I_f)$ is totally ordered by recency under $\preceq^{(I_f)}$. Since $\mathcal{A}(f)\subseteq\mathcal{C}(I_f)$ and $\mathcal{A}(f)\neq\emptyset$, the maximum element $c^{\star}(f)=\max\mathcal{A}(f)$ exists and is unique. By construction, $c^{\star}(f)\in\mathcal{A}(f)$, so $c^{\star}(f)$ is admissible. Now let $c'\in\mathcal{C}(I_f)$ satisfy $c^{\star}(f)\prec^{(I_f)}c'$. If $c'$ were also admissible, then $c'\in\mathcal{A}(f)$ and $c^{\star}(f)$ would fail to be the maximum element of $\mathcal{A}(f)$, a contradiction. Therefore every checkpoint strictly later than $c^{\star}(f)$ is inadmissible. This proves that Eq. ([10](https://arxiv.org/html/2605.23311#S4.E10)) returns the unique latest admissible checkpoint within the failed instance.
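The selection rule in Theorem 2 amounts to scanning checkpoints from newest to oldest and returning the first admissible one. The sketch below assumes a finite checkpoint list already ordered by recency and an externally supplied admissibility predicate; `select_restore_point` and the checkpoint labels are hypothetical names, and returning `None` models the blocked case when no checkpoint is admissible.

```python
from typing import Callable, Optional, Sequence, TypeVar

C = TypeVar("C")


def select_restore_point(
    checkpoints: Sequence[C],
    admissible: Callable[[C], bool],
) -> Optional[C]:
    """Return the latest admissible checkpoint, or None to block.

    `checkpoints` must be ordered oldest to newest, which stands in for
    the total recency order assumed for the failed instance.
    """
    for checkpoint in reversed(checkpoints):
        if admissible(checkpoint):
            return checkpoint
    return None


checkpoints = ["c0", "c1", "c2", "c3"]
admissible_set = {"c0", "c2"}          # toy admissible set A(f)

chosen = select_restore_point(checkpoints, lambda c: c in admissible_set)
later = checkpoints[checkpoints.index(chosen) + 1:]
print(chosen, later)
```

Here the scan skips `c3`, returns `c2`, and every checkpoint after the chosen one lies outside the admissible set, matching properties (i) and (ii).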

## Appendix I Reproducibility and Additional Experimental Details

#### Runtime Realization Scope\.

The recovery method is realized in DART as an online runtime sidecar rather than an offline trace analyzer. The implementation centers on reviewed boundary configurations, step lifting, modular named checkpoints, producer-consumer dependency tracking, and rollback selection in the online recovery path. Appendix [I.1](https://arxiv.org/html/2605.23311#A9.SS1) gives the concrete realization details.

### I.1 Runtime Realization Details

We realize failed-instance identity, recoverable-boundary review, modular checkpoints, and admissibility checks in DART as a runtime sidecar attached to the normal agent loop rather than as an offline trace-analysis layer. This subsection records the concrete runtime realization of the four-layer recovery pipeline described in Section 4.

![Refer to caption](https://arxiv.org/html/2605.23311v1/x3.png)

Figure 3: Runtime sidecar overview. Reviewed boundaries define recovery contracts; the sidecar lifts steps, tracks dependencies, and restores the latest admissible checkpoint.

Normalized signals use a small runtime failure vocabulary such as `TIMEOUT`, `INVALID_OUTPUT`, or `MISSING_INPUT`. Let $\widehat{K}(k)$ denote the reviewed configuration loaded for skeleton id $k$. In the current DART realization,

$$\widehat{K}(k)=\big(\widehat{S}_{k}^{\mathrm{int}},\widehat{S}_{k}^{\mathrm{ent}},\widehat{P}_{k}^{\mathrm{com}},\widehat{P}_{k}^{\mathrm{exit}},\widehat{X}_{k}^{\mathrm{in}},\widehat{X}_{k}^{\mathrm{out}},\widehat{\pi}_{k}^{\mathrm{eff}}\big) \tag{11}$$

The sidecar observer then lifts a base execution step $e_t$ from Eq. ([2](https://arxiv.org/html/2605.23311#S3.E2)) to an enriched recovery-aware record

$$\Psi(e_t)=(k_t,I_t,R_t,W_t,c_t) \tag{12}$$

where $k_t$ is the resolved skeleton id, $I_t$ the resolved instance, $R_t$ and $W_t$ the step-level read and write sets, and $c_t$ an optional named checkpoint produced at that step. In the current system, $R_t$ and $W_t$ are assembled conservatively from explicit state/action manifests, planner maps, and reviewed tool I/O or effect annotations associated with the active skeleton, rather than inferred from arbitrary black-box runtime semantics.

At the instance level, the runtime aggregates conservative interfaces as

$$R(I)=\bigcup_{t:I_t=I}R_t,\qquad W(I)=\bigcup_{t:I_t=I}W_t \tag{13}$$

Eq. ([13](https://arxiv.org/html/2605.23311#A9.E13)) is the runtime abstraction used to derive dependency edges between instances: writes summarize produced semantic objects, while reads summarize downstream consumption that may make producer rollback unsafe. These aggregates induce the conservative producer-consumer relation

$$I_p\rightsquigarrow I_q\iff W(I_p)\cap R(I_q)\neq\varnothing \tag{14}$$

Eq. ([14](https://arxiv.org/html/2605.23311#A9.E14)) realizes committed-consumer detection as a conservative over-approximation of must-block dependency. Because the loaded interfaces intentionally over-approximate possible reads and writes, the resulting dependency relation may block some otherwise admissible rollbacks; this conservatism is part of the safety envelope studied here.
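Eqs. (12) to (14) can be traced end to end in a short sketch: lifted step records are aggregated per instance, and a dependency edge exists whenever a producer's write set intersects a consumer's read set. The `LiftedStep` type, the helper names, and the semantic object names (`request`, `slot0`, `slot1`, `schedule`) are hypothetical; only the instance identifiers are borrowed from the schedule-form witness.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class LiftedStep(NamedTuple):
    """Hypothetical mirror of the lifted record (k_t, I_t, R_t, W_t, c_t)."""
    skeleton_id: str
    instance: str
    reads: frozenset
    writes: frozenset
    checkpoint: Optional[str] = None


def aggregate_interfaces(steps):
    """Union the step-level read and write sets per instance (Eq. 13)."""
    reads, writes = defaultdict(set), defaultdict(set)
    for step in steps:
        reads[step.instance] |= step.reads
        writes[step.instance] |= step.writes
    return reads, writes


def depends(producer, consumer, reads, writes):
    """True iff W(producer) and R(consumer) intersect (Eq. 14)."""
    return bool(writes[producer] & reads[consumer])


steps = [
    LiftedStep("ResolveSlot", "ResolveSlot::slot[0]::0",
               frozenset({"request"}), frozenset({"slot0"}), "commit"),
    LiftedStep("ResolveSlot", "ResolveSlot::slot[1]::0",
               frozenset({"slot0"}), frozenset({"slot1"})),
    LiftedStep("FinalizeSchedule", "FinalizeSchedule::final::0",
               frozenset({"slot0", "slot1"}), frozenset({"schedule"})),
]

reads, writes = aggregate_interfaces(steps)
ids = list(writes)
edges = sorted(
    (p, q) for p in ids for q in ids
    if p != q and depends(p, q, reads, writes)
)
for producer, consumer in edges:
    print(f"{producer} ~> {consumer}")
```

Because the relation is computed from declared interfaces rather than observed data flow, a coarse read set would add edges and block more rollbacks, which is the conservative direction described above.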

In the default Registry-Only Sidecar path, the sidecar stores the instance registry, reconstructs snapshot-manager bookkeeping at restore time, and resumes normal execution from the chosen named checkpoint and sidecar state. Figure [3](https://arxiv.org/html/2605.23311#A9.F3) summarizes this path.

#### Benchmark Settings\.

The main text reports three integrated result tables over the three core LLM-driven domains. Semi-real live benchmarks inject failures only at observable action boundaries of the form in Eq. ([3](https://arxiv.org/html/2605.23311#S3.E3)), and the headline aggregates use repeat = 5 under the official and commitment-sensitive regimes. Appendix [D](https://arxiv.org/html/2605.23311#A4) documents the full five-domain case universe and deterministic ETL/travel generalization; Appendix [E](https://arxiv.org/html/2605.23311#A5) adds paired statistics, no-failure-path overhead, and checkpoint-granularity diagnostics; Appendix [F](https://arxiv.org/html/2605.23311#A6) provides two-domain external LangGraph evidence, including regime-aware comparison, transplant-control transportability, and the blocking witness; and Appendix [G](https://arxiv.org/html/2605.23311#A7) collects the broader audit chain.

#### Representative Reproduction Path\.

An accompanying public artifact is available at [https://github.com/KeoYang/DART](https://github.com/KeoYang/DART). It includes the DART implementation used in the paper, together with the reviewed boundary configurations, benchmark harnesses, and scripts used to validate the frozen paper artifacts and rerun the official non-live pipelines. The artifact exposes fixed case sets, automated interaction choices, explicit failure injection points, and scripts for regenerating the aggregate JSON and markdown result files used for the paper tables. Rerunning the semi-real live protocol further requires the corresponding hosted-LLM and, for the navigation domain, map-service credentials. For the T8 depth benchmark, the two compared variants correspond in the artifact to the aliases `registry_only_v1` and `inline_snapshot_manager_v1`, respectively. In particular, the artifact allows readers to trace how reviewed boundary configurations instantiate Eq. ([11](https://arxiv.org/html/2605.23311#A9.E11)), how execution steps are lifted according to Eq. ([12](https://arxiv.org/html/2605.23311#A9.E12)), how committed-consumer relations are accumulated under Eq. ([14](https://arxiv.org/html/2605.23311#A9.E14)), and how rollback targets are selected through Eq. ([10](https://arxiv.org/html/2605.23311#S4.E10)).

#### Compute Resources\.

The reported experiments were executed on a single local workstation with an Apple M3 CPU \(8 cores\) and 16 GB unified memory\. The semi\-real runs are lightweight in restoration cost; the dominant runtime cost comes from downstream re\-execution and external service latency rather than from checkpoint restore itself\.

#### Assets and Services\.

The semi\-real benchmarks use hosted LLM APIs and, in the navigation domain, a map\-search API\. These external services are accessed only through their normal provider interfaces and terms\. The public artifact does not redistribute those services or their proprietary outputs; instead, it provides the benchmark harnesses, reviewed boundary configurations, and analysis scripts needed to reproduce the reported measurements for readers with appropriate access credentials\.

#### LLM Usage\.

The evaluated domains instantiate the proposed method with LLM\-based tool agents under explicit FSM controllers\. The contribution of the paper is not a new language model or prompting method\. Instead, the proposed method operates at the runtime\-recovery layer and requires only that the agent expose explicit state transitions, action boundaries, and step histories\.
