PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors
Source: https://huggingface.co/papers/2605.06455
Abstract
PrefixGuard enables effective online monitoring of LLM agents through trace analysis and prefix-based risk scoring, demonstrating strong performance across multiple benchmark tasks while providing diagnostic insights for alert reliability.
Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, τ^2-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach 0.900/0.710/0.533/0.557 AUPRC. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based area under the precision-recall curve (AUPRC) that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and τ^2-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas τ^2-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
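The abstract's core evaluation idea can be illustrated compactly: a prefix monitor assigns each trace a risk score before the task finishes, and the score is judged against the terminal outcome with AUPRC. The sketch below is a hedged, self-contained illustration of that protocol, not the paper's implementation; the `average_precision` helper and all trace data are assumptions made for the example, and AUPRC is computed here as average precision without external libraries.

```python
# Hypothetical sketch of prefix-warning evaluation: each trace prefix gets a
# risk score from some monitor, and terminal outcomes (1 = failure) supervise
# the ranking. The toy scores and labels below are illustrative only.

def average_precision(scores, labels):
    """Average precision (one common AUPRC estimator) of risk scores
    against binary failure labels."""
    # Rank traces by descending risk score; ties broken arbitrarily.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank  # precision at each recall step
    return ap / total_pos if total_pos else 0.0

# Toy prefix risk scores for six traces and their terminal outcomes.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(average_precision(scores, labels))  # → 1.0: all failures ranked first
```

A perfect score of 1.0 here does not by itself imply deployment utility, which is exactly the paper's first-alert point: ranking quality and low-false-alarm alerting are measured separately.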
Get this paper in your agent:
`hf papers read 2605.06455`
Don't have the latest CLI? `curl -LsSf https://hf.co/cli/install.sh | bash`
Similar Articles
TRACER: Trace-Based Adaptive Cost-Efficient Routing for LLM Classification
TRACER is an open-source system that trains lightweight ML surrogates on production traces from LLM classification endpoints, routing requests through a parity gate that activates surrogates only when agreement with the original model exceeds a specified threshold. This approach achieves 83-100% surrogate coverage on intent classification benchmarks while retaining visibility into handling boundaries and failure modes.
Towards Security-Auditable LLM Agents: A Unified Graph Representation
This paper introduces Agent-BOM, a unified graph representation for security auditing in LLM-based agentic systems. It addresses the semantic gap in post-hoc auditing by modeling static capabilities and dynamic runtime states to detect complex attack chains like memory poisoning and tool misuse.
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
The article introduces MANTRA, a framework for automatically synthesizing SMT-validated compliance benchmarks for tool-using LLM agents from natural language manuals. It demonstrates that this approach enables scalable and reliable evaluation of agent adherence to complex procedural rules.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
This paper introduces AgentForesight, a framework for online auditing and early failure prediction in LLM-based multi-agent systems. It presents a new dataset, AFTraj-22K, and a specialized model, AgentForesight-7B, which outperforms leading proprietary models in detecting decisive errors during trajectory execution.