Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Hugging Face Daily Papers 06/11/26, 12:00 AM Papers

coding-agents llm-agents runtime-checks user-corrections preference-compliance memory trace

Summary

TRACE is a skill-layer pipeline that mines user corrections from interactive coding agents and compiles them into runtime checks. On ClawArena and MemoryArena-derived tasks, it reduces repeated preference violations further than memory alone.

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
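The flow described above (atomic rules compiled into checks that must pass before the agent finishes a task) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the TRACE implementation: the `Rule` structure, the two example rules, and the retry loop are hypothetical, and the rules are hand-written here, whereas the paper mines them from chat corrections.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One atomic rule derived from a user correction (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]  # returns True when the output complies


def compile_rules() -> List[Rule]:
    # Hand-written stand-ins for rules a mining step might produce from
    # corrections such as "use spaces, not tabs" or "remove debug prints".
    return [
        Rule("no_tabs", lambda text: "\t" not in text),
        Rule("no_debug_print", lambda text: "print(" not in text),
    ]


def violated_rules(output: str, rules: List[Rule]) -> List[str]:
    """Names of the rules that the candidate output fails."""
    return [rule.name for rule in rules if not rule.check(output)]


def complete_task(draft_fn: Callable[[List[str]], str],
                  rules: List[Rule],
                  max_attempts: int = 3) -> str:
    """Gate completion: return an output only once every check passes."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        output = draft_fn(feedback)  # agent drafts, seeing earlier violations
        feedback = violated_rules(output, rules)
        if not feedback:
            return output
    raise RuntimeError(f"unresolved rule violations: {feedback}")


def fake_agent(feedback: List[str]) -> str:
    # Stand-in for a coding agent: it repeats the tab mistake until told.
    if "no_tabs" in feedback:
        return "def f():\n    return 1\n"
    return "def f():\n\treturn 1\n"


result = complete_task(fake_agent, compile_rules())
```

In this sketch the first draft is rejected by the `no_tabs` check and the second is accepted, so the user does not have to repeat the correction. The gate sits outside the model, which is the design point the abstract contrasts with memory: a remembered preference can still be ignored, while a failed check blocks completion.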

Original Article

Cached at: 06/12/26, 06:54 PM

Source: https://huggingface.co/papers/2606.13174

Abstract

TRACE is a skill-layer pipeline that mines user corrections to create runtime checks, significantly reducing preference violations in interactive LLM agents.

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user’s own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
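The violation figures above are stated as a share of applicable preference checks. One plausible reading of that metric is sketched below; the function name and the use of `None` for a check that does not apply to a task are assumptions made for illustration, and the paper's exact scoring may differ.

```python
from typing import Iterable, Optional


def violation_rate(results: Iterable[Optional[bool]]) -> float:
    """Percentage of applicable checks that were violated.

    Each entry is True (check passed), False (check violated),
    or None (check not applicable to this task; assumed convention).
    """
    applicable = [r for r in results if r is not None]
    if not applicable:
        raise ValueError("no applicable checks")
    violated = sum(1 for r in applicable if r is False)
    return 100.0 * violated / len(applicable)


# Five checks, one not applicable, two of the remaining four violated.
rate = violation_rate([True, False, None, False, True])
```

Under this reading, non-applicable checks are excluded from the denominator, so a rate of 100.0% means every check that applied to the held-out tasks was violated.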


Similar Articles

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

@adithya_s_k: You can now finetune models on agent traces directly with TRL Claude Code traces Codex traces OpenClaw traces Pi traces…

@appliedcompute: https://x.com/appliedcompute/status/2052826576723841292

Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

@omarsar0: Cool paper from Apple. Most evaluation of tool-calling agents happens after the trajectory is over. By then the wrong c…
