adversarial-robustness

#adversarial-robustness

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv cs.CL · 3d ago

This paper evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreak research, finding that both safety classifiers and LLM-as-judges have significant calibration and adversarial robustness issues that undermine reported ASR numbers.

Decoherence as Defence and the Magnitude of Noise Regularisation: A Rigorous N-Qubit Theory of Stochastic Quantum Neural Networks for Adversarially Robust Network Intrusion Detection

arXiv cs.CL · 4d ago

This paper presents a rigorous N-qubit theory of stochastic quantum neural networks (SQNNs) for adversarially robust network intrusion detection, proving a decoherence-contraction theorem and showing that depolarising noise provides robustness against adversarial attacks, with experiments on the NSL-KDD dataset.

MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense

arXiv cs.LG · 2026-06-17

MorphStrata introduces a layer-specific stochastic noise injection strategy for generating diverse student models in a Moving Target Defense framework to enhance adversarial robustness in time-series forecasting, achieving up to 97.97% improvement in RMSE under BIM attacks with minimal training overhead.

Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization

arXiv cs.LG · 2026-06-17

This paper establishes a characterization of the sum-of-squares degree barriers for the reweighted-hinge method in robust halfspace learning using the Christoffel function, revealing a margin-degree tradeoff and explicit outlier barriers.

When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

arXiv cs.AI · 2026-06-15

This paper studies skill-conditional trust in heterogeneous LLM agent swarms, showing that using per-skill trust scores outperforms global scores in specific regimes, but also reveals a vulnerability to reputation laundering attacks. The authors introduce the Conditional Information Value Test (CIVT) to detect such attacks and quantify trade-offs.

Neural Variability Enhances Artificial Network Robustness

arXiv cs.LG · 2026-06-15

This paper investigates how correlated noise, inspired by neural variability in the brain, can enhance the robustness of artificial neural networks against adversarial attacks and naturalistic image modifications.

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

arXiv cs.LG · 2026-06-11

This paper introduces a compute-aware evaluation framework for the adversarial robustness of LLMs, proposing risk-compute curves and FLOPs-based metrics to better assess attack costs. It finds that alignment training has non-monotonic effects and that compute costs vary across models and harm categories.

Outsmarting the Chameleon: Counterfactual Decoupling for Tactical OOD Shifts in Live Streaming Risk Assessment

arXiv cs.LG · 2026-06-03

Proposes Latent-Predictive Counterfactual Decoupling (LPCD) to address tactical out-of-distribution shifts in live streaming risk assessment by decoupling stable malicious intent from evolving narrative tactics at the latent level, achieving superior performance on large-scale industrial datasets.

RRISE: Robust Radius Inference via a Surrogate Estimator

arXiv cs.LG · 2026-06-03

RRISE introduces a learned surrogate estimator that reduces the Monte Carlo sampling cost of randomized smoothing for certified robustness to a single forward pass, maintaining accuracy within 0.84 percentage points while replacing up to 10^4 evaluations per query.
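
For context, the Monte Carlo step that the summary says RRISE replaces can be sketched as below. This is a minimal illustration of standard randomized smoothing certification on a toy 1-D classifier, not RRISE's surrogate estimator. The names `smoothed_predict` and `toy_classifier` are hypothetical, the Hoeffding bound is a simple stand-in for the tighter binomial bounds usually used, and the same samples are reused for class selection and estimation to keep the sketch short.

```python
import math
import random
from statistics import NormalDist


def smoothed_predict(classify, x, sigma, n=1000, alpha=0.001, seed=0):
    """Monte Carlo randomized smoothing for a 1-D input (illustrative only).

    Returns (predicted_class, certified_l2_radius), or (None, 0.0) to abstain.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        # One base-classifier call per Gaussian-perturbed copy of the input.
        label = classify(x + rng.gauss(0.0, sigma))
        counts[label] = counts.get(label, 0) + 1
    top_class, top_count = max(counts.items(), key=lambda kv: kv[1])
    # Hoeffding lower confidence bound on the top-class probability
    # (an assumption of this sketch; it holds with probability >= 1 - alpha).
    p_lower = top_count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # not confident enough to certify
    # Certified radius: sigma times the standard normal quantile of p_lower.
    return top_class, sigma * NormalDist().inv_cdf(p_lower)


def toy_classifier(x):
    """Hypothetical base classifier with its decision boundary at x = 0."""
    return 1 if x > 0 else 0


pred, radius = smoothed_predict(toy_classifier, x=2.0, sigma=0.5)
```

Each query here costs `n` calls to the base classifier, which is the per-query sampling cost a single-forward-pass surrogate would aim to avoid.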

Making Brain-Computer Interfaces More Secure

arXiv cs.LG · 2026-06-03

This paper proposes a lightweight CNN architecture to improve adversarial robustness in EEG-based brain-computer interfaces, evaluating it against adversarial attacks and showing better classification performance than existing models.

TASER: Task-Aware Stein Regularisation for Geometry-Driven Robustness

arXiv cs.LG · 2026-06-01

Introduces TASER, a training-time regularization framework derived from Langevin Stein operators that encourages geometric compatibility between predictors and data density, improving adversarial robustness and stability on CIFAR-10 without significant clean accuracy degradation.

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

arXiv cs.AI · 2026-06-01

Introduces PReMISE, a framework for discovering and auditing policy-level rubrics for LLM judges along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.

The Distillation Game: Adaptive Attacks & Efficient Defenses

Hugging Face Daily Papers · 2026-05-29

This paper studies distillation attacks, in which a model's outputs enable imitation, and frames them as a minimax game. It proposes a forward-pass-only defense called Product-of-Experts and shows that adaptive students recover more capability than passive evaluation suggests.

The Hamilton-Jacobi Theory of Deep Learning

Hugging Face Daily Papers · 2026-05-27

This paper identifies neural network training as a search through Hamilton-Jacobi initial-value problems, showing that residual networks, transformers, and RNNs discretize the same class of viscous Hamilton-Jacobi equations. It derives quantitative consequences including minimax optimal generalization rates, adversarial robustness bounds, and a closed-form influence function.

Provable Robustness against Backdoor Attacks via the Primal-Dual Perspective on Differential Privacy

arXiv cs.LG · 2026-05-22

This paper introduces a framework that connects randomized smoothing to differential privacy through privacy profiles, enabling tight provable robustness guarantees against backdoor attacks that jointly affect training and inference. The approach is instantiated for DP-SGD and Deep Partition Aggregation with experiments on MNIST and CIFAR-10.

Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

arXiv cs.LG · 2026-05-21

Introduces HF-KCU, a method for efficient machine unlearning in federated learning that uses Krylov subspace approximations to remove a client's contribution, achieving significant speedup over retraining while preserving model accuracy and providing robustness against adversarial perturbations.

more ai slop to slop around~

Reddit r/singularity · 2026-05-17

This post extends E8 lattice geometric activation injection to supervised LLM safety routing, using STE-snapped E8 policy heads. While achieving near-perfect routing on clean data, the approach catastrophically fails under adversarial stress, requiring a hybrid symbolic-geometric architecture with audited deterministic rules.

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

arXiv cs.CL · 2026-05-15

This paper introduces Context-Driven Decomposition (CDD), a probe to diagnose when RAG systems comply with retrieved context despite conflicting parametric knowledge, and releases the Epi-Scale benchmark for systematic study across model families.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

arXiv cs.AI · 2026-05-12

This paper introduces Latent Personality Alignment (LPA), a method that improves LLM safety by training on abstract personality traits rather than explicit harmful examples. The approach achieves better generalization against adversarial attacks and preserves model utility with significantly fewer training samples.

GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

arXiv cs.CL · 2026-05-12

This paper introduces GAMBIT, a benchmark for evaluating adversarial robustness in multi-agent LLM collectives, featuring adaptive imposters and recalibration modes to address the limitations of existing shallow evaluations.
