This paper presents a comprehensive benchmark for evaluating adversarial attacks and defenses in Graph Neural Networks, highlighting the need for standardized and fair experimental protocols.
Proposes LiSCP, a lightweight stylistic consistency profiling method for robust detection of LLM-generated text, focusing on feature stability under adversarial manipulation. It achieves strong in-domain and cross-domain detection performance with notable robustness.
This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.
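As a toy illustration of the pruning step only (the gradient-free attribution scoring is the paper's contribution; the score values and the helper name `prune_by_attribution` here are hypothetical placeholders):

```python
def prune_by_attribution(weights, scores, frac=0.01):
    """Zero out the fraction `frac` of parameters with the highest
    attribution scores, i.e. those most associated with unsafe behavior.
    `scores` stands in for a gradient-free attribution signal; the exact
    scoring rule here is an assumption, not the paper's method."""
    k = max(1, int(len(weights) * frac))
    # indices of the k highest-scoring weights
    cut = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [0.0 if i in cut else w for i, w in enumerate(weights)]

# prune the half of this toy weight vector with the highest scores
pruned = prune_by_attribution([1.0, 2.0, 3.0, 4.0], [0.1, 0.9, 0.2, 0.8], frac=0.5)
```

The Lottery Ticket Hypothesis framing suggests such a sparse "unsafe subnetwork" can be removed while the remaining weights retain utility.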
This paper demonstrates that training large language models with stochastic tokenization instead of deterministic canonical tokenization significantly improves robustness to adversarial attacks and random perturbations, with improvements shown across pre-training, fine-tuning, and in-context learning without increasing inference costs.
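A minimal sketch of what stochastic (non-canonical) tokenization can look like, assuming a BPE-dropout-style scheme; the paper's actual sampling procedure and vocabulary handling may differ:

```python
import random

def stochastic_tokenize(word, vocab, drop_prob=0.3, rng=None):
    """Greedy longest-match segmentation, but each longer candidate is
    dropped with probability drop_prob, so the same word can yield a
    different (non-canonical) token sequence on each call."""
    rng = rng or random.Random()
    tokens, i = [], 0
    while i < len(word):
        # candidate subwords starting at i, longest first; the final
        # candidate is the single-character fallback (always in vocab here)
        matches = [word[i:j] for j in range(len(word), i, -1) if word[i:j] in vocab]
        chosen = matches[-1]
        for m in matches[:-1]:
            if rng.random() >= drop_prob:  # keep this longer match
                chosen = m
                break
        tokens.append(chosen)
        i += len(chosen)
    return tokens

# toy vocabulary: a few multi-character subwords plus every character
vocab = {"un", "believ", "able"} | set("unbelievable")
```

With `drop_prob=0` this reduces to deterministic greedy tokenization; raising it exposes the model to many segmentations of the same surface string during training.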
MemEvoBench introduces the first benchmark for evaluating memory safety in LLM agents, measuring behavioral degradation from adversarial memory injection, noisy outputs, and biased feedback across QA and workflow tasks. The work reveals that memory evolution significantly contributes to safety failures and that static defenses are insufficient.
ASGuard is a mechanistically informed defense framework that mitigates jailbreak attacks on LLMs by identifying vulnerable attention heads through circuit analysis and applying targeted activation scaling and fine-tuning to harden refusal behavior while preserving model capabilities.
OpenAI presents evidence that reasoning models like o1 become more robust to adversarial attacks when given more inference-time compute to think longer. The research demonstrates that increased computation reduces attack success rates across multiple task types including mathematics, factuality, and adversarial images, though significant exceptions remain.
OpenAI researchers developed a method to evaluate neural network robustness against unforeseen adversarial attacks, introducing a new metric called UAR (Unforeseen Attack Robustness) that assesses model performance against unanticipated distortion types beyond the commonly studied Lp norms.
Researchers study how adversarial robustness transfers across different perturbation types in deep neural networks, evaluating 32 attacks of 5 types on ImageNet models. Results show that robustness to one perturbation type doesn't always transfer to others and may sometimes hurt robustness elsewhere.
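To make "perturbation type" concrete, here is a small sketch (illustrative only, not the evaluated attacks) of how a single-step perturbation of budget eps differs between L-infinity and L2 constraints, given a loss gradient with respect to the input:

```python
import math

def perturb(grad, eps, norm="linf"):
    """Single-step perturbation of size eps under the given norm ball.
    linf: every coordinate moves by eps in its gradient's sign direction
          (FGSM-style); l2: the whole gradient is rescaled to length eps."""
    if norm == "linf":
        return [eps * (1 if g > 0 else -1 if g < 0 else 0) for g in grad]
    if norm == "l2":
        scale = eps / (math.sqrt(sum(g * g for g in grad)) or 1.0)
        return [scale * g for g in grad]
    raise ValueError(f"unknown norm: {norm}")
```

The two geometries spread the same budget very differently across coordinates, which is one intuition for why robustness to one perturbation type need not transfer to another.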
This paper extends the study of computational hardness in learning robust classifiers, showing that efficient robust classification can be impossible even when unbounded robust classifiers exist, and establishing a win-win result: either an efficient robust classifier can be learned, or new cryptographic primitives can be constructed.