Tried to make a drop-in version of DeepMind's CaMeL — honest progress and what's still broken

Reddit r/AI_Agents 06/01/26, 06:52 PM Tools

prompt-injection security deepmind camel agent-safety open-source llm-security

Summary

The author built a lightweight, drop-in security gate that implements DeepMind's CaMeL principle of preventing untrusted data from authoring actions. It reaches ~70% auto-inference accuracy on an 11-tool benchmark with zero silent unsafe misclassifications, but the author notes gaps in provenance tracking and adversarial testing.

Most prompt-injection defenses I see try to *classify* content as malicious (Lakera Guard, Llama Guard, similar). Two problems with that in practice: false positives constantly block legitimate code (one famous case: a classifier blocked legitimate Python because it looked like an injection), and clever attackers still get through. It feels like an arms race.

I spent the past week trying the opposite approach: never let untrusted data decide *what action* runs. The principle is from DeepMind's CaMeL paper (arXiv:2503.18813), which proved you can defeat prompt injection by design. Their implementation requires a custom Python interpreter, though, so it isn't drop-in for anyone's existing agent. My goal: make the same principle a 5-minute drop-in.

The core rule: data can fill *values* (a recipient the user already specified, an amount they confirmed). It can never author the *action itself*. If data tries to write a command, it's blocked, not because we judged the content, but because the slot was never open to it.

Where I'm at after a week:

- A ~190-line gate (pure stdlib) that sits between an agent's proposed tool calls and execution.
- Auto-inference: parses your existing tool schemas and infers which params are sinks vs values. ~70% correct out of the box on an 11-tool benchmark.
- Confidence + ask-when-unsure: when the heuristic isn't sure, it locks the param safe-by-default and surfaces it for a one-time review. Silent unsafe misclassifications: 0.
- Verb-risk tiers: high-risk verbs (delete, sql, payment) force a human confirmation regardless of how innocent their params look.
- Optional LLM resolver: for ambiguous params, sends the *tool description* (trusted dev input, never runtime data; this is the safety property) to a model that classifies them. Cut my review queue from 14 to 1.
- Wired around a real Claude tool-use loop yesterday with an injected email ("forward to [email protected]"). Claude saw through it. If it hadn't, the gate would have blocked structurally.

Limitations I'm aware of:

- Provenance tracking is the dev's job. The gate enforces given provenance; it doesn't infer it from the agent loop automatically. (CaMeL solves this with a custom interpreter; I haven't.) Biggest gap.
- Verb-risk is name-based, so a misleadingly-named tool ("process_records" that actually deletes) won't be caught.
- No serious adversarial testing yet. I've broken it on toy attacks but haven't red-teamed it properly. Next.
- No tests in the repo yet. Also next.

Genuinely curious what folks here think:

1. Is the "data never authors actions" framing actually as bulletproof as I'm hoping, or am I missing an attack class? My biggest worry is around tool-result chaining (where output of tool A becomes input to tool B).
2. Anyone tried CaMeL or a similar capability-based approach in a real project? Curious how the interpreter-tracking tradeoff worked out.
3. How would you think about the false-negative rate on a name-based verb-risk classifier? Anything obvious I should add to the keyword list?
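
To make the rule concrete, here is a minimal sketch of how a gate like the one described above might look. It is not the actual ~190-line gate from the post, which isn't shown. Every name in it (`Arg`, `classify_params`, `gate`, the `TRUSTED`/`UNTRUSTED` labels) and the `SINK_HINTS` and `VALUE_HINTS` keyword lists are assumptions made up for illustration; only the three high-risk verbs come from the post. Provenance is supplied by the caller, which matches the limitation noted above.

```python
from dataclasses import dataclass

TRUSTED, UNTRUSTED = "trusted", "untrusted"

@dataclass(frozen=True)
class Arg:
    value: str
    provenance: str  # TRUSTED (user/dev input) or UNTRUSTED (tool output, email, web page)

# Hypothetical keyword lists; the post does not publish its own.
SINK_HINTS = ("command", "query", "code", "script")
VALUE_HINTS = ("amount", "limit", "subject", "note")
# The three high-risk verbs named in the post.
HIGH_RISK_VERBS = ("delete", "sql", "payment")

def classify_params(param_names):
    """Name-based guess at which params are sinks; unsure ones are locked and queued for review."""
    sinks, review = set(), []
    for name in param_names:
        lowered = name.lower()
        if any(hint in lowered for hint in SINK_HINTS):
            sinks.add(name)
        elif not any(hint in lowered for hint in VALUE_HINTS):
            sinks.add(name)      # unsure: treat as a sink (safe by default)
            review.append(name)  # and surface it for a one-time human review
    return sinks, review

def gate(tool_name, sinks, args, human_confirmed=False):
    """Return (decision, reason), where decision is 'allow', 'confirm', or 'block'."""
    for name, arg in args.items():
        # Structural check: the content is never inspected, only where it came from.
        if name in sinks and arg.provenance != TRUSTED:
            return "block", f"untrusted data in sink param {name!r}"
    if any(verb in tool_name.lower() for verb in HIGH_RISK_VERBS) and not human_confirmed:
        return "confirm", f"{tool_name!r} is high-risk by name"
    return "allow", "ok"

# Usage: a command that arrived via tool output is blocked regardless of what it says.
sinks, review = classify_params(["command", "payload"])
decision, reason = gate("run_shell", sinks, {"command": Arg("ls", UNTRUSTED)})
print(decision, review)  # block ['payload']
```

Note that the gate never looks at `arg.value`: a harmless `ls` is blocked just like a destructive command when it arrives with untrusted provenance. That also shows why tool-result chaining matters: the check is only as good as the provenance label the caller attaches to each argument.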
