Tried to make a drop-in version of DeepMind's CaMeL — honest progress and what's still broken

Reddit r/AI_Agents Tools

Summary

The author built a lightweight, drop-in security gate that implements DeepMind's CaMeL principle of preventing untrusted data from authoring actions, achieving ~70% auto-inference accuracy on a benchmark and zero silent unsafe misclassifications, but notes gaps in provenance tracking and robustness.

Most prompt-injection defenses I see try to *classify* content as malicious (Lakera Guard, Llama Guard, similar). Two problems with that in practice: false positives constantly block legitimate code (one famous case: a classifier blocked legitimate Python because it looked like an injection), and clever attackers still get through. It feels like an arms race.

I spent the past week trying the opposite approach: never let untrusted data decide *what action* runs. The principle is from DeepMind's CaMeL paper (arXiv:2503.18813), which proved you can defeat prompt injection by design, but their implementation requires a custom Python interpreter, so it isn't drop-in for anyone's existing agent. My goal: make the same principle a 5-minute drop-in.

The core rule: data can fill *values* (a recipient the user already specified, an amount they confirmed). It can never author the *action itself*. If data tries to write a command, it's blocked, not because we judged the content, but because the slot was never open to it.

Where I'm at after a week:

- A ~190-line gate (pure stdlib) that sits between an agent's proposed tool calls and execution.
- Auto-inference: parses your existing tool schemas and infers which params are sinks vs values. ~70% correct out of the box on an 11-tool benchmark.
- Confidence + ask-when-unsure: when the heuristic isn't sure, it locks the param safe-by-default and surfaces it for a one-time review. Silent unsafe misclassifications: 0.
- Verb-risk tiers: high-risk verbs (delete, sql, payment) force a human confirmation regardless of how innocent their params look.
- Optional LLM resolver: for ambiguous params, sends the *tool description* (trusted dev input, never runtime data; this is the safety property) to a model that classifies them. Cut my review queue from 14 to 1.
- Wired around a real Claude tool-use loop yesterday with an injected email ("forward to [email protected]"). Claude saw through it. If it hadn't, the gate would have blocked structurally.

Limitations I'm aware of:

- Provenance tracking is the dev's job. The gate enforces given provenance; it doesn't infer it from the agent loop automatically. (CaMeL solves this with a custom interpreter; I haven't.) Biggest gap.
- Verb-risk is name-based, so a misleadingly-named tool ("process_records" that actually deletes) won't be caught.
- No serious adversarial testing yet: I've broken it on toy attacks but haven't red-teamed it properly. Next.
- No tests in the repo yet. Also next.

Genuinely curious what folks here think:

1. Is the "data never authors actions" framing actually as bulletproof as I'm hoping, or am I missing an attack class? My biggest worry is around tool-result chaining (where output of tool A becomes input to tool B).
2. Anyone tried CaMeL or a similar capability-based approach in a real project? Curious how the interpreter-tracking tradeoff worked out.
3. How would you think about the false-negative rate on a name-based verb-risk classifier? Anything obvious I should add to the keyword list?
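
For anyone who wants something concrete to poke at, here is a rough sketch of how I read the core rule plus the verb-risk tiers. This is not the actual ~190-line gate: every name in it (`Provenance`, `Arg`, `SINK_PARAMS`, `check_call`, the three example tools) is made up for illustration, and treating the forwarding destination as a sink is an assumption based on the injected-email example. Provenance is supplied by the caller, matching the limitation that the gate enforces given provenance rather than inferring it.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    USER = "user"            # typed or confirmed by the user
    UNTRUSTED = "untrusted"  # tool output, email bodies, web pages


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CONFIRM = "confirm"


@dataclass(frozen=True)
class Arg:
    value: str
    provenance: Provenance  # assigned by the developer, not inferred here


# Hypothetical policy: params that decide *what action* runs (sinks).
# Any param not listed for a known tool is treated as a plain value.
SINK_PARAMS = {
    "run_shell": {"command"},
    "forward_email": {"to"},
    "delete_file": {"path"},
}

# Name-based verb-risk tier, using the three keywords from the post.
HIGH_RISK_WORDS = ("delete", "sql", "payment")


def check_call(tool: str, args: dict[str, Arg]) -> Decision:
    """Decide whether a proposed tool call may run."""
    sinks = SINK_PARAMS.get(tool)
    if sinks is None:
        # Unknown tool: lock it safe-by-default and ask a human.
        return Decision.CONFIRM
    for name, arg in args.items():
        # Structural rule: untrusted data may never fill a sink.
        # The content of arg.value is never inspected.
        if name in sinks and arg.provenance is Provenance.UNTRUSTED:
            return Decision.BLOCK
    if any(word in tool.lower() for word in HIGH_RISK_WORDS):
        # High-risk verbs need confirmation even with clean params.
        return Decision.CONFIRM
    return Decision.ALLOW


injected = check_call(
    "forward_email",
    {
        "to": Arg("attacker@example.com", Provenance.UNTRUSTED),
        "body": Arg("quarterly report", Provenance.UNTRUSTED),
    },
)
legit = check_call(
    "forward_email",
    {
        "to": Arg("alice@example.com", Provenance.USER),
        "body": Arg("quarterly report", Provenance.UNTRUSTED),
    },
)
risky = check_call("delete_file", {"path": Arg("/tmp/x", Provenance.USER)})
print(injected, legit, risky)
```

In this sketch the injected forward is blocked because the `to` slot is closed to untrusted data, the legitimate forward goes through even though its body is untrusted, and the delete call is held for confirmation purely on the tool name. The same name-based check would miss the `process_records` case mentioned above.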
