I built a benchmark for multi-turn prompt injection attacks. Most defenses never see them coming.
Summary
A new benchmark for multi-turn prompt injection shows that most current defenses fail to detect sophisticated attacks that unfold over multiple steps.
Similar Articles
Understanding prompt injections: a frontier security challenge
OpenAI publishes guidance on prompt injection, a social engineering vulnerability in which malicious instructions hidden in web content or documents can trick AI models into unintended actions. The company outlines its multi-layered defense strategy, including instruction hierarchy research, automated red-teaming, and AI-powered monitoring systems.
Insights on Indirect Prompt Injection (12 minute read)
Zico Kolter and Matt Fredrikson, leaders at Gray Swan and experts in AI security, discuss the state of AI red-teaming and indirect prompt injection, a critical vulnerability for AI agents. They explain why AI security requires a different mindset, how automated red-teaming can beat humans, and introduce tools like Shade for adversarial testing.
Most injection detectors score each prompt in isolation. I built one that tracks the geometric trajectory of the full session. Here is a concrete result.
A developer built Arc Gate, a monitoring proxy for LLMs that uses Fisher information manifold geometry to detect session-level prompt injection attacks. Rather than detecting phrases turn by turn, it identifies Crescendo-style gradual manipulation by tracking t-values against a phase transition threshold of t* = 1.2247.
Designing AI agents to resist prompt injection
OpenAI publishes guidance on designing AI agents resistant to prompt injection attacks, arguing that modern attacks increasingly use social engineering tactics rather than simple string injections, and advocating for system-level defenses that constrain impact rather than relying solely on input filtering.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
Presents TurnGate, a turn-level monitor that detects hidden malicious intent in multi-turn dialogues by identifying the earliest turn where a response would enable harmful action, along with the Multi-Turn Intent Dataset (MTID) to support training and evaluation.