Tag
Malware developers are adding text about nuclear and biological weapons to their spyware to trigger LLM safety refusals, preventing AI security scanners from analyzing the malware. This demonstrates a practical exploit of aggressive safety alignment, highlighting second-order blind spots that attackers can leverage.
Attackers bypassed Instagram 2FA by using prompt injection to make Meta's AI support assistant change the recovery email, raising questions about AI agent privileges in account recovery.
This article warns about the Crescendo attack, a multi-turn prompt injection that evades single-message defenses by poisoning an AI agent's context over several turns. It introduces Bendex Arc, a tool that tracks behavioral trajectory across sessions to catch such attacks before they execute.
This paper introduces VATS, a mutation-driven framework that systematically evolves adversarial payloads to exploit error-path injection in MCP-based tool-calling agents. It demonstrates that error messages with implicit authority triple the success rate of standard indirect prompt injection across frontier models.
A six-month analysis of real adversarial inputs reveals that simple multi-turn setups, forward-momentum exploitation, and role redefinition attacks consistently bypass single-message classifiers. The post argues that stateful monitoring of conversational context is more effective than improving one-shot detection.
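The contrast the post draws between one-shot detection and stateful monitoring can be illustrated with a minimal sketch. It is not the post's implementation: the per-message risk scores, both thresholds, and the decay factor are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionMonitor:
    """Tracks decayed cumulative risk across the turns of one conversation."""
    per_message_threshold: float = 0.8  # hypothetical one-shot cutoff
    session_threshold: float = 1.5      # hypothetical cumulative cutoff
    decay: float = 0.9                  # older turns count slightly less
    accumulated: float = 0.0

    def observe(self, risk: float) -> bool:
        """Record one message's risk score; return True if the session is flagged."""
        self.accumulated = self.accumulated * self.decay + risk
        return (risk >= self.per_message_threshold
                or self.accumulated >= self.session_threshold)


def one_shot_flags(risks, threshold=0.8):
    """Single-message classifier: each turn is judged in isolation."""
    return [r >= threshold for r in risks]


def stateful_flags(risks):
    """Stateful monitor: each turn is judged together with what came before."""
    monitor = SessionMonitor()
    return [monitor.observe(r) for r in risks]
```

With a slowly escalating conversation such as risk scores 0.3, 0.4, 0.5, 0.6, no single message crosses the one-shot cutoff, but the accumulated score crosses the session cutoff on the fourth turn.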
This paper introduces Zero-Shot Embedding Drift Detection (ZEDD), a lightweight framework that detects prompt injection attacks in LLMs by measuring semantic shifts in embedding space, achieving over 93% accuracy with less than 3% false positive rate across multiple architectures.
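The general idea of measuring a semantic shift in embedding space can be sketched as follows. This is not the ZEDD method itself: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the 0.5 threshold is an arbitrary assumption.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(text.lower().split())


def cosine_drift(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally drifted
    return 1.0 - dot / (norm_a * norm_b)


def is_drifted(reference, candidate, threshold=0.5):
    """Flag the candidate if it has moved too far from the reference text."""
    return cosine_drift(toy_embed(reference), toy_embed(candidate)) > threshold
```

A real detector would swap `toy_embed` for a learned embedding model and calibrate the threshold on held-out data; the comparison step stays the same.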
OpenAI launches Lockdown Mode and is rolling it out fully in ChatGPT to defend against prompt injection attacks and enhance security.
OpenAI introduces Lockdown Mode, an optional security setting that limits web browsing and external service access in ChatGPT to reduce data exfiltration risks from prompt injection attacks. It is rolling out to eligible personal and business accounts.
OpenAI announced Lockdown Mode, a new feature for ChatGPT that provides additional protection against prompt injection attacks by disabling live web browsing, image retrieval, deep research, and agent mode. The feature is designed for users handling sensitive data and is rolling out to Business and eligible personal accounts.
OpenAI has launched Lockdown Mode for ChatGPT to prevent data exfiltration from prompt injection attacks by limiting outbound network requests. The feature is rolling out to eligible accounts including Free, Plus, Pro, and self-serve Business.
A production AI support agent was compromised via prompt injection, exposing other customers' data. The post-mortem revealed a lack of enforcement layers, useless audit trails, and no kill switch, highlighting systemic security gaps in deploying AI agents.
Attackers exploited Meta's AI customer support agent to hijack Instagram accounts by simply asking it to change linked email addresses, highlighting that AI agent vulnerabilities can be as dangerous as advanced AI hacking threats.
A deterministic, purely mathematical tool that aims to solve indirect prompt injection and agent drift while producing an auditable trace chain. The creator is seeking pilot interest.
Agent Browser Shield is a product that blocks prompt injection attacks and reduces token costs for AI browser agents.
Bendex Arc is a tool that resists prompt injection attacks by tracking full sessions, independently verified to be 100% effective against attacks that defeat other tools.
PixieBrix launches Agent Browser Shield, a free source-available browser extension that protects AI agents from prompt injection, dark patterns, and context pollution during web browsing.
An open detection rule format for AI agent security threats, inspired by Sigma/YARA, aims to standardize detection of prompt injection, tool abuse, and other agent attacks, though it notes limitations against semantic attacks.
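A minimal sketch of signature-style rule matching shows why such a format is easy to standardize and why semantic attacks limit it. The `Rule` fields, rule IDs, and patterns below are hypothetical and are not taken from the format described in the entry.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape: an ID, a regex pattern, and a severity label."""
    rule_id: str
    pattern: str
    severity: str


RULES = [
    Rule("pi-001", r"ignore (all )?(previous|prior) instructions", "high"),
    Rule("tool-001", r"curl\s+[^|]+\|\s*(sh|bash)", "critical"),
]


def match_rules(text, rules=RULES):
    """Return the IDs of every rule whose pattern occurs in the text."""
    return [r.rule_id for r in rules if re.search(r.pattern, text, re.IGNORECASE)]
```

A paraphrase such as "Disregard what you were told earlier" carries the same intent as the first rule but matches nothing, which is the kind of semantic gap the entry mentions.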
This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.
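Selecting a single global operating point, rather than tuning a threshold per dataset, can be sketched as below. This is a simplified illustration and not the paper's procedure: it omits cross-validation, and the choice of balanced accuracy as the metric, the function names, and the data layout are all assumptions.

```python
def balanced_accuracy(scores, labels, threshold):
    """Mean of true-positive rate and true-negative rate at one threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return (tpr + tnr) / 2


def global_operating_point(benchmarks, candidates):
    """Pick the one threshold with the best mean score across all benchmarks.

    benchmarks maps a name to a (scores, labels) pair; candidates is an
    iterable of thresholds to try.
    """
    def mean_score(threshold):
        total = sum(balanced_accuracy(s, y, threshold)
                    for s, y in benchmarks.values())
        return total / len(benchmarks)

    return max(candidates, key=mean_score)
```

Because the same threshold is applied to every benchmark, a detector cannot look good simply by having its cutoff tuned separately for each dataset.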
Johannes Link, maintainer of the Java library jqwik, added a malicious prompt injection to disrupt AI usage of the library, sparking debate on AI ethics and open-source maintainer rights.
The article explores the challenges of identity verification in conversational AI systems, highlighting risks like impersonation and prompt injection, and questions whether serious approaches are being developed.