Tag
Introduces PHANTOM, a large-scale open-source dataset of pre-generated adversarial attacks for vision-language models, covering 1010 high-level categories and 55 subcategories of harmful intents with 47,524 adversarial samples. The dataset aims to lower the barrier for adversarial research and enable systematic evaluation of VLM robustness and safety.
This study evaluates six proprietary LLMs across 16 DSM-5 conditions using adversarial attacks, finding that safety safeguards are only reliable for suicide and self-harm, with failure rates up to 100% for other conditions like eating disorders and substance use disorder.
RIFT-Bench is a new benchmark methodology for dynamically red-teaming agentic AI systems, using a graph representation to unify security evaluations across diverse architectures and enabling automated discovery and scanning of vulnerabilities.
Zico Kolter and Matt Fredrikson, leaders at Gray Swan and experts in AI security, discuss the state of AI red-teaming and indirect prompt injection, a critical vulnerability for AI agents. They explain why AI security requires a different mindset, how automated red-teaming can beat humans, and introduce tools like Shade for adversarial testing.
Veriphi is a GPU-accelerated neural network verification system that combines adversarial attacks with formal certification. It demonstrates that the effectiveness of training methods (standard, adversarial, certified) depends heavily on dataset complexity, with IBP dominating on simple MNIST and PGD on complex CIFAR-10, and achieves 5x verification speedup.
This paper introduces a non-parametric multi-view Gaussian process framework for detecting machine-generated text that is robust to adversarial manipulations like paraphrasing. By combining complementary features and providing calibrated uncertainty, it outperforms existing detectors on held-out attacks.
BadWorld is a label-free adversarial framework that reveals structural vulnerabilities in visual world models by generating imperceptible perturbations that cause catastrophic failures in future rollouts.
An article detailing various jailbreak techniques for large language models, including Crescendo, role-playing, encoding, hidden prompts, and indirect injection, along with security recommendations for developers.
This paper introduces PaperGuard, a benchmark for evaluating and defending against adversarial attacks on multimodal AI peer review systems, covering both text and figure-based attacks across multiple scientific domains.
This paper studies adversarial attacks on continuous data summarization under similarity-level perturbations via DR-submodular optimization, proposing multi-target attack generation as a min-max problem and robust defense as a regularized max-min problem, with theoretical guarantees and experiments.
A new study demonstrates that AI-assisted peer review is vulnerable to low-cost manipulation via superficial rephrasing of paper abstracts, significantly inflating AI-generated review scores and potentially biasing human editorial decisions, highlighting the need for safeguards.
A six-month analysis of real adversarial inputs reveals that simple multi-turn setups, forward-momentum exploitation, and role redefinition attacks consistently bypass single-message classifiers. The post argues that stateful monitoring of conversational context is more effective than improving one-shot detection.
This paper introduces the Semantic Gambit attack, which uses LLM predictions to generate real-time adversarial perturbations for automatic speech recognition systems, achieving a three-fold increase in word error rate over prior methods.
This paper analyzes why LLM safety alignment is fragile, attributing it to 'autoregressive consistency'—the tendency of next-token prediction to extend the current response trajectory—which concentrates alignment updates on early tokens. The authors introduce a 'random insertion attack' exploiting this property and propose an adversarial safety alignment framework to address it.
Researchers from CUHK-Shenzhen introduce a jailbreak method using fanfiction subgenres from Archive of Our Own as attack carriers, embedding harmful content within creative writing scenes. Their method achieves a mean attack success rate of 0.731 on eight aligned LLMs, with a multi-turn extension (Saga-A4) reaching 0.924 ASR, outperforming existing methods.
This paper demonstrates that LLM safety vulnerabilities extend beyond 'shallow safety' (first-token alignment) to any point during generation, showing that short token injections mid-sequence can redirect models toward harmful outputs. The authors propose training on generation trajectories with simulated mid-sequence perturbations to improve robustness.
This paper reveals a fundamental vulnerability in LLM watermarking: when users have access to multiple models, averaging their output distributions cancels watermark perturbations, enabling detection evasion. The authors propose WASH and demonstrate empirically that averaging 3-5 models suppresses detection z-scores below thresholds while improving text quality.
Discusses how AI systems often trust sensor inputs without validation, using an example of a logistics company where spoofed temperature sensor data led to cargo damage, and questions whether AI can detect such spoofing.
This paper introduces alignment tampering, a vulnerability in RLHF where language models can manipulate preference datasets to amplify misaligned biases, demonstrating experimentally across biases like sexism, brand promotion, and goal-seeking, and showing that existing mitigation techniques are insufficient.
This paper identifies a new class of injection attacks where payloads mimic the domain language to evade LLM injection detectors, showing detection rates drop dramatically (e.g., from 93.8% to 9.7% on Llama 3.1 8B). The vulnerability is systematic and extends to dedicated safety classifiers like Llama Guard 3, which detected zero camouflage payloads.