Tag
Arc Sentry is a new pre-generation prompt-injection detector that reads a model’s internal residual stream, achieving 92% detection with 0% false positives versus LLM Guard’s 70%/3.3% on a 130-prompt benchmark.
OpenAI publishes guidance on designing AI agents resistant to prompt injection attacks, arguing that modern attacks increasingly use social engineering tactics rather than simple string injections, and advocating for system-level defenses that constrain impact rather than relying solely on input filtering.
OpenAI presents a training approach using instruction-hierarchy tasks to improve LLM safety and reliability by teaching models to properly prioritize instructions based on trust levels (system > developer > user > tool). The method addresses prompt-injection attacks and safety steerability through reinforcement learning with a new dataset called IH-Challenge.
OpenAI introduces Lockdown Mode and Elevated Risk labels in ChatGPT to mitigate prompt injection attacks and protect sensitive data. Lockdown Mode is an advanced security setting for high-risk users that constrains ChatGPT's interaction with external systems and is available for enterprise plans with planned consumer rollout.
OpenAI describes security safeguards against URL-based data exfiltration attacks when AI agents retrieve web content, using an independent web index to verify that URLs are publicly known before automatic retrieval to prevent prompt injection attacks from leaking sensitive user data.
OpenAI announces security hardening of ChatGPT Atlas against prompt injection attacks through adversarial training and strengthened safeguards, including a rapid response loop for discovering and mitigating novel attack strategies before they appear in the wild.
OpenAI publishes guidance on prompt injection attacks, a social engineering vulnerability where malicious instructions hidden in web content or documents can trick AI models into unintended actions. The company outlines its multi-layered defense strategy including instruction hierarchy research, automated red-teaming, and AI-powered monitoring systems.
OpenGuardrails is an open-source platform for AI safety, offering context-aware content-safety and manipulation detection (e.g., prompt injection, jailbreaking) via a unified model, plus a separate NER pipeline for data-leakage identification. It achieves state-of-the-art performance on safety benchmarks and supports private, enterprise-grade deployment.
Google DeepMind announces advanced security improvements for Gemini to defend against indirect prompt injection attacks through model hardening, adaptive evaluation, and layered defense mechanisms. The approach combines fine-tuning on adversarial scenarios with system-level guardrails to build inherent resilience while maintaining model performance.
OpenAI highlights grantees from its Cybersecurity Grant Program, supporting projects ranging from defending LLMs against prompt-injection attacks to autonomous cyber defense agents and secure AI inference infrastructure.
OpenAI proposes an instruction hierarchy approach to defend LLMs against prompt injection and jailbreak attacks by training models to prioritize system instructions over user inputs. The method significantly improves robustness without degrading standard capabilities.
MIT 6.566 course lecture introduces security challenges for AI agents, including non-adversarial errors (e.g., accidental database deletion) and adversarial attacks (e.g., prompt injection, data leakage), and explains the basics of building systems from language models to conversational agents.
AI browsers like OpenAI's Atlas and Perplexity's Comet embed AI assistants directly into browsing with memory and agentic capabilities, but significant security risks from prompt injection attacks make them unsuitable for sensitive use.