adversarial-attacks

#adversarial-attacks

@DailyDoseOfDS_: OpenAI paid $500k for this! > A Kaggle contest to find LLM vulnerabilities DeepTeam does it for free. It implements 20+…

X AI KOLs Timeline ↗ · yesterday Cached

DeepTeam is a free, open-source tool that implements 20+ state-of-the-art attacks to detect over 50 LLM vulnerabilities, including bias and PII leakage, running locally without a dataset.

0 favorites 0 likes

#adversarial-attacks

Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs

arXiv cs.AI ↗ · 2d ago Cached

This paper analyzes the reconstruction-concealment tradeoff in intent-obfuscation jailbreak attacks on Multimodal Large Language Models (MLLMs). It proposes concealment-aware variant construction and keyword-related distractor images to exploit model vulnerabilities more effectively.

0 favorites 0 likes

#adversarial-attacks

Power to the Clients: Federated Learning in a Dictatorship Setting

arXiv cs.CL ↗ · 2026-04-20 Cached

This paper introduces 'dictator clients'—a novel class of malicious participants in federated learning capable of erasing other clients' contributions while preserving their own—and provides theoretical analysis of their impact on model convergence, including scenarios with multiple adversarial clients.

0 favorites 0 likes

#adversarial-attacks

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

Hugging Face Daily Papers ↗ · 2026-04-17 Cached

Systematic study shows LLM-based dense retrievers outperform BERT baselines on typos and poisoning but remain vulnerable to semantic perturbations, with embedding geometry predicting robustness.

0 favorites 0 likes

#adversarial-attacks

Designing AI agents to resist prompt injection

OpenAI Blog ↗ · 2026-03-11 Cached

OpenAI publishes guidance on designing AI agents resistant to prompt injection attacks, arguing that modern attacks increasingly use social engineering tactics rather than simple string injections, and advocating for system-level defenses that constrain impact rather than relying solely on input filtering.

0 favorites 0 likes

#adversarial-attacks

Understanding prompt injections: a frontier security challenge

OpenAI Blog ↗ · 2025-11-07 Cached

OpenAI publishes guidance on prompt injection attacks, a social engineering vulnerability where malicious instructions hidden in web content or documents can trick AI models into unintended actions. The company outlines its multi-layered defense strategy including instruction hierarchy research, automated red-teaming, and AI-powered monitoring systems.

0 favorites 0 likes

#adversarial-attacks

Trading inference-time compute for adversarial robustness

OpenAI Blog ↗ · 2025-01-22 Cached

OpenAI presents evidence that reasoning models like o1 become more robust to adversarial attacks when given more inference-time compute to think longer. The research demonstrates that increased computation reduces attack success rates across multiple task types including mathematics, factuality, and adversarial images, though significant exceptions remain.

0 favorites 0 likes

#adversarial-attacks

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

OpenAI Blog ↗ · 2024-04-19 Cached

OpenAI proposes an instruction hierarchy approach to defend LLMs against prompt injection and jailbreak attacks by training models to prioritize system instructions over user inputs. The method significantly improves robustness without degrading standard capabilities.

0 favorites 0 likes

#adversarial-attacks

Testing robustness against unforeseen adversaries

OpenAI Blog ↗ · 2019-08-22 Cached

OpenAI researchers developed a method to evaluate neural network robustness against unforeseen adversarial attacks, introducing a new metric called UAR (Unforeseen Attack Robustness) that assesses model performance against unanticipated distortion types beyond the commonly studied Lp norms.

0 favorites 0 likes

#adversarial-attacks

Adversarial attacks on neural network policies

OpenAI Blog ↗ · 2017-02-08 Cached

OpenAI researchers demonstrate that adversarial attacks, previously studied in computer vision, are also effective against neural network policies in reinforcement learning, showing significant performance degradation even with small imperceptible perturbations in white-box and black-box settings.

0 favorites 0 likes

adversarial-attacks

Submit Feedback