Tag
The author conducted an experiment on Gmail with AI agents connected via OAuth, sending obfuscated prompt injection emails. Frontier models sometimes caught the attacks, while cheap models silently executed them, revealing that agent security largely depends on model cost and token budget rather than architectural safeguards.
Arc Gate is a proxy that protects AI agents from prompt injection attacks by treating web and email content as untrusted, requiring no code changes from developers.
The article warns about security risks when AI agents execute external tools and announces new local guardrails for Tingly Box to prevent malicious actions.
The author discusses critical failure modes encountered when deploying AI agents in production, emphasizing the prevalence of prompt injection, the necessity of real-time governance and audit trails, and the requirement for ultra-fast kill switches. Treating enforcement as infrastructure rather than an afterthought is presented as the key to maintaining control and compliance.
A practitioner shares ten critical lessons for deploying AI agents in production, emphasizing code-based constraints, context management, and security over relying solely on prompts.
This paper presents MIPIAD, a multilingual defense framework against indirect prompt injection attacks using a hybrid of Qwen2.5-based classifiers and TF-IDF features with meta-ensemble learning. It demonstrates strong performance on English and Bangla benchmarks, achieving high F1 and AUROC scores while reducing cross-lingual gaps.
The article discusses a recent incident where Grok was manipulated into executing financial transactions, highlighting the broader lack of robust security layers for AI agents with tool access.
Pillar Security researchers disclosed a critical CVSS 10 vulnerability (TrustIssues) in Google's gemini-cli and related GitHub workflows, where prompt injection allowed attackers to exfiltrate secrets and compromise the repository supply chain.
Arc Sentry is a new pre-generation prompt-injection detector that reads a model’s internal residual stream, achieving 92% detection with 0% false positives versus LLM Guard’s 70%/3.3% on a 130-prompt benchmark.
OpenAI publishes guidance on designing AI agents resistant to prompt injection attacks, arguing that modern attacks increasingly use social engineering tactics rather than simple string injections, and advocating for system-level defenses that constrain impact rather than relying solely on input filtering.
OpenAI presents a training approach using instruction-hierarchy tasks to improve LLM safety and reliability by teaching models to properly prioritize instructions based on trust levels (system > developer > user > tool). The method addresses prompt-injection attacks and safety steerability through reinforcement learning with a new dataset called IH-Challenge.
OpenAI introduces Lockdown Mode and Elevated Risk labels in ChatGPT to mitigate prompt injection attacks and protect sensitive data. Lockdown Mode is an advanced security setting for high-risk users that constrains ChatGPT's interaction with external systems and is available for enterprise plans with planned consumer rollout.
OpenAI describes security safeguards against URL-based data exfiltration attacks when AI agents retrieve web content, using an independent web index to verify that URLs are publicly known before automatic retrieval to prevent prompt injection attacks from leaking sensitive user data.
OpenAI announces security hardening of ChatGPT Atlas against prompt injection attacks through adversarial training and strengthened safeguards, including a rapid response loop for discovering and mitigating novel attack strategies before they appear in the wild.
OpenAI publishes guidance on prompt injection attacks, a social engineering vulnerability where malicious instructions hidden in web content or documents can trick AI models into unintended actions. The company outlines its multi-layered defense strategy including instruction hierarchy research, automated red-teaming, and AI-powered monitoring systems.
Google DeepMind announces advanced security improvements for Gemini to defend against indirect prompt injection attacks through model hardening, adaptive evaluation, and layered defense mechanisms. The approach combines fine-tuning on adversarial scenarios with system-level guardrails to build inherent resilience while maintaining model performance.
OpenAI highlights grantees from its Cybersecurity Grant Program, supporting projects ranging from defending LLMs against prompt-injection attacks to autonomous cyber defense agents and secure AI inference infrastructure.
OpenAI proposes an instruction hierarchy approach to defend LLMs against prompt injection and jailbreak attacks by training models to prioritize system instructions over user inputs. The method significantly improves robustness without degrading standard capabilities.
AI browsers like OpenAI's Atlas and Perplexity's Comet embed AI assistants directly into browsing with memory and agentic capabilities, but significant security risks from prompt injection attacks make them unsuitable for sensitive use.