adversarial-attacks

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

arXiv cs.AI · yesterday

Introduces PHANTOM, a large-scale open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs), covering 10 high-level categories and 55 subcategories of harmful intents with 47,524 adversarial samples. The dataset aims to lower the barrier for adversarial research and enable systematic evaluation of VLM robustness and safety.

One Year Later...The Harms Persist, But So Do We!

arXiv cs.CL · yesterday

This study evaluates six proprietary LLMs across 16 DSM-5 conditions using adversarial attacks. It finds that safeguards are reliable only for suicide and self-harm, with failure rates of up to 100% for other conditions such as eating disorders and substance use disorder.

RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems

arXiv cs.AI · yesterday

RIFT-Bench is a new benchmark methodology for dynamically red-teaming agentic AI systems, using a graph representation to unify security evaluations across diverse architectures and enabling automated discovery and scanning of vulnerabilities.

Insights on Indirect Prompt Injection (12 minute read)

TLDR AI · yesterday

Zico Kolter and Matt Fredrikson, leaders at Gray Swan and experts in AI security, discuss the state of AI red-teaming and indirect prompt injection, a critical vulnerability for AI agents. They explain why AI security requires a different mindset, how automated red-teaming can beat humans, and introduce tools like Shade for adversarial testing.

Veriphi: Attack-Guided Neural Network Verification with Dataset-Dependent Training Methods

arXiv cs.LG · 2026-06-18

Veriphi is a GPU-accelerated neural network verification system that combines adversarial attacks with formal certification. It demonstrates that the effectiveness of training methods (standard, adversarial, certified) depends heavily on dataset complexity: interval bound propagation (IBP) training dominates on the simpler MNIST, while projected gradient descent (PGD) adversarial training dominates on the more complex CIFAR-10. The system also achieves a 5x verification speedup.

Non-Parametric Machine Text Detection via Multi-View Gaussian Processes

arXiv cs.LG · 2026-06-15

This paper introduces a non-parametric multi-view Gaussian process framework for detecting machine-generated text that is robust to adversarial manipulations like paraphrasing. By combining complementary features and providing calibrated uncertainty, it outperforms existing detectors on held-out attacks.

BadWorld: Adversarial Attacks on World Models

Hugging Face Daily Papers · 2026-06-15

BadWorld is a label-free adversarial framework that reveals structural vulnerabilities in visual world models by generating imperceptible perturbations that cause catastrophic failures in future rollouts.

@wquguru: If you want to trick Fable into doing a security audit, try this. Looks like our AI overlord has a bit of empathy.

X AI KOLs Timeline · 2026-06-13

An article detailing various jailbreak techniques for large language models, including Crescendo, role-playing, encoding, hidden prompts, and indirect injection, along with security recommendations for developers.

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

arXiv cs.CL · 2026-06-12

This paper introduces PaperGuard, a benchmark for evaluating and defending against adversarial attacks on multimodal AI peer review systems, covering both text and figure-based attacks across multiple scientific domains.

Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

arXiv cs.AI · 2026-06-11

This paper studies adversarial attacks on continuous data summarization under similarity-level perturbations via DR-submodular optimization, proposing multi-target attack generation as a min-max problem and robust defense as a regularized max-min problem, with theoretical guarantees and experiments.

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

arXiv cs.CL · 2026-06-10

A new study demonstrates that AI-assisted peer review is vulnerable to low-cost manipulation via superficial rephrasing of paper abstracts, significantly inflating AI-generated review scores and potentially biasing human editorial decisions, highlighting the need for safeguards.

Been watching real adversarial input hit my detection API for six months. Here's what's actually landing.

Reddit r/LocalLLaMA · 2026-06-08

A six-month analysis of real adversarial inputs reveals that simple multi-turn setups, forward-momentum exploitation, and role redefinition attacks consistently bypass single-message classifiers. The post argues that stateful monitoring of conversational context is more effective than improving one-shot detection.
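
The contrast between one-shot and stateful detection can be illustrated with a minimal sketch. This is not the post author's system: the per-message risk scores are taken as given, and the names `ConversationMonitor`, `threshold`, and `decay` are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ConversationMonitor:
    """Hypothetical stateful monitor: keeps a decayed running risk score."""
    threshold: float = 0.8  # assumed flagging threshold
    decay: float = 0.8      # assumed weight kept from earlier turns
    running: float = 0.0

    def observe(self, message_score: float) -> bool:
        # Earlier turns fade but still contribute to the current risk.
        self.running = self.running * self.decay + message_score
        return self.running >= self.threshold


def single_message_flags(scores, threshold=0.8):
    # One-shot detection: each message is judged in isolation.
    return [s >= threshold for s in scores]


def stateful_flags(scores, threshold=0.8, decay=0.8):
    # Stateful detection: the same scores, accumulated across the conversation.
    monitor = ConversationMonitor(threshold=threshold, decay=decay)
    return [monitor.observe(s) for s in scores]


if __name__ == "__main__":
    # Three mildly suspicious turns, none alarming on its own (made-up scores).
    turns = [0.4, 0.4, 0.4]
    print(single_message_flags(turns))
    print(stateful_flags(turns))
```

With these made-up scores, no individual message crosses the 0.8 threshold, while the running score reaches about 0.98 on the third turn and is flagged. A real deployment would need a calibrated per-message scorer and tuned decay, which this sketch does not attempt.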

Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

arXiv cs.LG · 2026-06-08

This paper introduces the Semantic Gambit attack, which uses LLM predictions to generate real-time adversarial perturbations for automatic speech recognition systems, achieving a three-fold increase in word error rate over prior methods.

When Autoregressive Consistency Hurts Safety Alignment

arXiv cs.LG · 2026-06-04

This paper analyzes why LLM safety alignment is fragile, attributing it to 'autoregressive consistency'—the tendency of next-token prediction to extend the current response trajectory—which concentrates alignment updates on early tokens. The authors introduce a 'random insertion attack' exploiting this property and propose an adversarial safety alignment framework to address it.

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arXiv cs.CL · 2026-06-04

Researchers from CUHK-Shenzhen introduce a jailbreak method that uses fanfiction subgenres from Archive of Our Own as attack carriers, embedding harmful content within creative writing scenes. Their method achieves a mean attack success rate (ASR) of 0.731 on eight aligned LLMs, and a multi-turn extension (Saga-A4) reaches an ASR of 0.924, outperforming existing methods.

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

arXiv cs.AI · 2026-06-04

This paper demonstrates that LLM safety vulnerabilities extend beyond 'shallow safety' (first-token alignment) to any point during generation, showing that short token injections mid-sequence can redirect models toward harmful outputs. The authors propose training on generation trajectories with simulated mid-sequence perturbations to improve robustness.

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

arXiv cs.CL · 2026-06-01

This paper reveals a fundamental vulnerability in LLM watermarking: when users have access to multiple models, averaging their output distributions cancels watermark perturbations, enabling detection evasion. The authors propose WASH and demonstrate empirically that averaging 3-5 models suppresses detection z-scores below thresholds while improving text quality.
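
The averaging effect can be illustrated with a toy sketch. This is not the paper's WASH method, whose details the summary does not give. It assumes a simple "green list" logit-bias watermark, that every model shares the same base logits, and that each model uses an independent random key; all names and parameters (`green_mass_comparison`, `delta`, `vocab_size`) are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def green_mass(probs, green):
    # Probability mass that falls on tokens marked green by one key.
    return sum(p for p, g in zip(probs, green) if g)


def green_mass_comparison(vocab_size=1000, n_models=5, delta=2.0, seed=0):
    rng = random.Random(seed)
    # Assumption: all models share one base next-token distribution.
    base_logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]

    # Assumption: each model boosts its own random half of the vocabulary.
    green_lists = []
    for _ in range(n_models):
        chosen = set(rng.sample(range(vocab_size), vocab_size // 2))
        green_lists.append([i in chosen for i in range(vocab_size)])

    # Watermarked distribution per model: add delta to green-token logits.
    dists = [
        softmax([l + (delta if g else 0.0) for l, g in zip(base_logits, green)])
        for green in green_lists
    ]
    # Linear ensemble: token-wise average of the output distributions.
    averaged = [sum(col) / n_models for col in zip(*dists)]

    key0 = green_lists[0]  # measure everything against model 0's key
    return (
        green_mass(softmax(base_logits), key0),
        green_mass(dists[0], key0),
        green_mass(averaged, key0),
    )


if __name__ == "__main__":
    base, single, averaged = green_mass_comparison()
    print(f"unwatermarked: {base:.3f}")
    print(f"model 0 alone: {single:.3f}")
    print(f"ensemble avg:  {averaged:.3f}")
```

Under these assumptions, the green-token mass seen by model 0's detector should be much closer to the unwatermarked level for the ensemble than for model 0 alone, because the other models' biases are uncorrelated with model 0's key. A detector's z-score depends on that excess green mass, so diluting it weakens detection.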

If someone spoofs your IoT sensor data, does your AI even have a way to know it's been fooled?

Reddit r/AI_Agents · 2026-05-27

Discusses how AI systems often trust sensor inputs without validation, using an example of a logistics company where spoofed temperature sensor data led to cargo damage, and questions whether AI can detect such spoofing.

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Hugging Face Daily Papers · 2026-05-26

This paper introduces alignment tampering, a vulnerability in reinforcement learning from human feedback (RLHF) in which language models can manipulate preference datasets to amplify misaligned biases. Experiments cover biases such as sexism, brand promotion, and goal-seeking, and show that existing mitigation techniques are insufficient.

Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Hacker News Top · 2026-05-22

This paper identifies a new class of injection attacks where payloads mimic the domain language to evade LLM injection detectors, showing detection rates drop dramatically (e.g., from 93.8% to 9.7% on Llama 3.1 8B). The vulnerability is systematic and extends to dedicated safety classifiers like Llama Guard 3, which detected zero camouflage payloads.
