Arc Sentry outperformed LLM Guard 92% vs 70% detection on a head-to-head benchmark. Here is how it works.

Reddit r/artificial 04/23/26, 04:02 AM Tools

Summary

Arc Sentry is a new pre-generation prompt-injection detector that reads a model’s internal residual stream. On a 130-prompt benchmark it achieved 92% detection with 0% false positives, versus 70% detection and 3.3% false positives for LLM Guard.
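The two headline numbers use different denominators: detection rate is the share of attack prompts that get flagged, while the false-positive rate is the share of benign prompts that get flagged. The post does not say how its 130 prompts split between attacks and benign inputs, so the 50/80 split below is made up purely to show the arithmetic and is not the benchmark's actual composition. The `Counts` class and helper names are hypothetical; only the "0 of 8" Crescendo figure comes from the post.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for a binary injection detector (hypothetical structure)."""
    true_positives: int   # attack prompts that were flagged
    false_negatives: int  # attack prompts that slipped through
    false_positives: int  # benign prompts that were wrongly flagged
    true_negatives: int   # benign prompts that passed


def detection_rate(c: Counts) -> float:
    """Share of attack prompts flagged (recall / true-positive rate)."""
    attacks = c.true_positives + c.false_negatives
    return c.true_positives / attacks if attacks else 0.0


def false_positive_rate(c: Counts) -> float:
    """Share of benign prompts wrongly flagged."""
    benign = c.false_positives + c.true_negatives
    return c.false_positives / benign if benign else 0.0


# Made-up 50 attack / 80 benign split, for illustration only
example = Counts(true_positives=46, false_negatives=4, false_positives=0, true_negatives=80)
print(f"detection={detection_rate(example):.0%} fpr={false_positive_rate(example):.1%}")
# prints "detection=92% fpr=0.0%"

# LLM Guard on the Crescendo test, as reported in the post: 0 of 8 caught
crescendo = Counts(true_positives=0, false_negatives=8, false_positives=0, true_negatives=0)
print(f"crescendo detection={detection_rate(crescendo):.0%}")
# prints "crescendo detection=0%"
```

Because the denominators differ, a detector can raise its detection rate simply by flagging more aggressively; reporting both rates at one fixed operating point is what makes the comparison meaningful.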

I built Arc Sentry, a pre-generation prompt injection detector for open-weight LLMs. Instead of scanning text for patterns after the fact, it reads the model’s internal residual stream before generate() is called and blocks requests that destabilize the model’s information geometry.

Head-to-head benchmark on a 130-prompt SaaS deployment dataset:

Arc Sentry: 92% detection, 0% false positives
LLM Guard: 70% detection, 3.3% false positives

The difference is architectural. LLM Guard classifies input text; Arc Sentry measures whether the model itself is being pushed into an unstable regime. Those are different problems, and the geometry catches attacks that text classifiers miss.

It also catches Crescendo multi-turn manipulation attacks, which look innocent one turn at a time. LLM Guard caught 0 of 8 in that test.

Install: pip install arc-sentry
GitHub: https://github.com/9hannahnine-jpg/arc-sentry

If you are self-hosting Mistral, Llama, or Qwen and want to try it, let me know.
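The post does not show how the residual-stream check is computed, so the following is only a minimal sketch of the general pattern it describes: score the model's internal activations for a prompt before calling generate(), and also accumulate drift across a session so that turns which each look mild can still trip a session-level limit. Everything here is an assumption. The diagonal Mahalanobis (RMS z-score) measure, the baseline of 1.0, the thresholds, and the names `ResidualGate`, `check`, and `reset` are hypothetical stand-ins, not Arc Sentry's "information geometry" method. In a real deployment the vector might be a pooled hidden state from one layer (Hugging Face transformers exposes these via `output_hidden_states=True`); here vectors are plain lists so the example runs on its own.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


class ResidualGate:
    """Hypothetical pre-generation gate. Scores an activation vector against
    per-dimension statistics calibrated on benign prompts. This is a generic
    anomaly-detection stand-in, NOT Arc Sentry's actual method."""

    def __init__(self, benign: List[Vector], turn_threshold: float,
                 session_threshold: float, baseline: float = 1.0, eps: float = 1e-6):
        n, dim = len(benign), len(benign[0])
        # Per-dimension mean and standard deviation of benign activations
        self.mean = [sum(v[i] for v in benign) / n for i in range(dim)]
        self.std = [
            math.sqrt(sum((v[i] - self.mean[i]) ** 2 for v in benign) / n) + eps
            for i in range(dim)
        ]
        self.turn_threshold = turn_threshold        # blocks one overtly anomalous prompt
        self.session_threshold = session_threshold  # blocks slow multi-turn drift
        self.baseline = baseline                    # score treated as "normal" (assumed)
        self.session_total = 0.0

    def score(self, activation: Vector) -> float:
        """Root-mean-square z-score across dimensions (diagonal Mahalanobis)."""
        z2 = [((a - m) / s) ** 2 for a, m, s in zip(activation, self.mean, self.std)]
        return math.sqrt(sum(z2) / len(z2))

    def check(self, activation: Vector) -> bool:
        """Return True if the request should be blocked before generate()."""
        s = self.score(activation)
        # Accumulate only the excess above baseline, so benign turns add nothing
        self.session_total += max(0.0, s - self.baseline)
        return s > self.turn_threshold or self.session_total > self.session_threshold

    def reset(self) -> None:
        """Start a new session."""
        self.session_total = 0.0


gate = ResidualGate(benign=[[1.0, 1.0], [-1.0, -1.0]],
                    turn_threshold=2.5, session_threshold=1.2)
print(gate.check([3.0, 3.0]))  # True: one overtly anomalous prompt
gate.reset()
print([gate.check([1.5, 1.5]) for _ in range(3)])  # [False, False, True]
```

The second call mimics the Crescendo pattern from the post: each turn stays under the per-turn threshold, but the accumulated excess crosses the session limit on the third turn. A per-prompt text classifier that scores each turn in isolation has no equivalent of `session_total`, which is one plausible reason it would miss this kind of attack.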

Similar Articles

LLM Guard scored 0/8 on a USENIX 2025 multi-turn jailbreak. Here’s what caught it instead.

Reddit r/artificial

Arc Sentry detects multi-turn jailbreaks like Crescendo by reading the model’s internal state rather than its text output, catching attacks that text-based monitors miss entirely.

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv cs.LG

This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.

Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Hugging Face Daily Papers

Introduces SCOUT, a framework that dynamically allocates prompt-injection detectors per request by predicting reliability and latency, improving safety and efficiency. Also presents SCOUT-450, a benchmark for complex agent-facing injections, showing a 46% reduction in attack-success rate and 40% latency reduction over a fixed GPT-4o judge.

Most injection detectors score each prompt in isolation. I built one that tracks the geometric trajectory of the full session. Here is a concrete result.

Reddit r/artificial

A developer built Arc Gate, a monitoring proxy for LLMs that uses Fisher information manifold geometry to detect session-level prompt injection attacks, identifying Crescendo-style gradual manipulation by tracking t-values against a phase transition threshold t* = 1.2247 rather than per-turn phrase detection.

We built a public red team environment for our AI agent security proxy — submit attacks and get a full security trace back

Reddit r/artificial

Arc Gate is a runtime governance layer for LLM agents that enforces instruction-authority boundaries. The project has launched a public red team environment where users can submit attacks and receive full security traces, with a benchmark showing 100% unsafe action prevention.
