Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Hugging Face Daily Papers

Summary

Introduces SCOUT, a framework that allocates prompt-injection detectors per request by predicting each detector's reliability and latency, improving safety and efficiency over a fixed single-detector pipeline. Also presents SCOUT-450, a benchmark for structurally complex, agent-facing injections, on which a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock time by 40% relative to an always-on GPT-4o judge, at a 5.1-point benign-utility drop.

Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage) makes this decision dynamic by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility bundles benign-pass rate and wall-clock). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock by 40% relative to an always-on GPT-4o judge, at a 5.1-point benign-utility drop. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier.

Cached at: 06/10/26, 12:08 AM

Source: https://huggingface.co/papers/2605.30837

Abstract

The SCOUT framework dynamically allocates prompt-injection detectors by predicting each detector's reliability and latency, improving safety and efficiency over fixed single-detector approaches.

Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector’s blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage) makes this decision dynamic by predicting each detector’s per-sample reliability and latency from how it behaved on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility bundles benign-pass rate and wall-clock). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock by 40% relative to an always-on GPT-4o judge, at a 5.1-point benign-utility drop. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier.
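
The allocation idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract only says that reliability and latency are predicted "from how it behaved on similar past inputs", so the k-nearest-neighbour lookup, the `PastRun` record, the feature vectors, the function names, and the "cheapest detector above the threshold, else escalate" rule below are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class PastRun:
    """One previously seen input (hypothetical record layout)."""
    features: Tuple[float, ...]      # assumed embedding of the input
    correct: Dict[str, bool]         # detector name -> was its verdict right?
    latency_ms: Dict[str, float]     # detector name -> observed latency


def predict(history: Sequence[PastRun], features: Sequence[float],
            detector: str, k: int = 3) -> Tuple[float, float]:
    """Estimate (reliability, latency) for one detector by averaging
    its behaviour over the k past inputs closest to `features`."""
    nearest = sorted(history, key=lambda r: math.dist(r.features, features))[:k]
    reliability = sum(r.correct[detector] for r in nearest) / len(nearest)
    latency = sum(r.latency_ms[detector] for r in nearest) / len(nearest)
    return reliability, latency


def allocate(history: Sequence[PastRun], features: Sequence[float],
             detectors: Sequence[str], threshold: float,
             k: int = 3) -> Tuple[List[str], bool]:
    """Return (detectors to run, escalate to LLM judge?).

    Assumed policy: run the fastest detector whose predicted reliability
    meets the operator threshold; if none does, escalate to the judge.
    """
    preds = {d: predict(history, features, d, k) for d in detectors}
    reliable = [d for d in detectors if preds[d][0] >= threshold]
    if reliable:
        fastest = min(reliable, key=lambda d: preds[d][1])
        return [fastest], False
    return [], True
```

In this sketch, raising `threshold` sends more requests to the judge (safer, slower), while lowering it lets cheap detectors handle more traffic, which is one way a single operator-facing knob could trade safety against utility.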

Get this paper in your agent:

hf papers read 2605.30837

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Understanding prompt injections: a frontier security challenge

OpenAI Blog

OpenAI publishes guidance on prompt injection attacks, a social engineering vulnerability where malicious instructions hidden in web content or documents can trick AI models into unintended actions. The company outlines its multi-layered defense strategy including instruction hierarchy research, automated red-teaming, and AI-powered monitoring systems.

Designing AI agents to resist prompt injection

OpenAI Blog

OpenAI publishes guidance on designing AI agents resistant to prompt injection attacks, arguing that modern attacks increasingly use social engineering tactics rather than simple string injections, and advocating for system-level defenses that constrain impact rather than relying solely on input filtering.