Tag
This paper evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreak research, finding that both safety classifiers and LLM-as-a-judge models have significant calibration and adversarial robustness issues that undermine reported ASR numbers.
This paper presents a rigorous N-qubit theory of stochastic quantum neural networks (SQNNs) for adversarially robust network intrusion detection, proving a decoherence-contraction theorem and showing that depolarising noise provides robustness against adversarial attacks, with experiments on the NSL-KDD dataset.
MorphStrata introduces a layer-specific stochastic noise injection strategy for generating diverse student models in a Moving Target Defense framework to enhance adversarial robustness in time-series forecasting, achieving up to 97.97% improvement in RMSE under Basic Iterative Method (BIM) attacks with minimal training overhead.
This paper establishes a characterization of the sum-of-squares degree barriers for the reweighted-hinge method in robust halfspace learning using the Christoffel function, revealing a margin-degree tradeoff and explicit outlier barriers.
This paper studies skill-conditional trust in heterogeneous LLM agent swarms, showing that using per-skill trust scores outperforms global scores in specific regimes, but also reveals a vulnerability to reputation laundering attacks. The authors introduce the Conditional Information Value Test (CIVT) to detect such attacks and quantify trade-offs.
This paper investigates how correlated noise, inspired by neural variability in the brain, can enhance the robustness of artificial neural networks against adversarial attacks and naturalistic image modifications.
This paper introduces a compute-aware evaluation framework for adversarial robustness of LLMs, proposing risk-compute curves and metrics based on FLOPs to better assess attack costs, finding that alignment training has non-monotonic effects and compute costs vary across models and harm categories.
This paper proposes Latent-Predictive Counterfactual Decoupling (LPCD), which addresses tactical out-of-distribution shifts in live-streaming risk assessment by decoupling stable malicious intent from evolving narrative tactics at the latent level, achieving superior performance on large-scale industrial datasets.
RRISE introduces a learned surrogate estimator that reduces the Monte Carlo sampling cost of randomized smoothing for certified robustness to a single forward pass, maintaining accuracy within 0.84 percentage points while replacing up to 10^4 evaluations per query.
This paper proposes a lightweight CNN architecture to improve adversarial robustness in EEG-based brain-computer interfaces, evaluating it against adversarial attacks and showing better classification performance than existing models.
This paper introduces TASER, a training-time regularization framework derived from Langevin Stein operators that encourages geometric compatibility between predictors and data density, improving adversarial robustness and stability on CIFAR-10 without significant clean accuracy degradation.
This paper introduces PReMISE, a framework for discovering and auditing policy-level rubrics for LLM judges along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.
This paper studies distillation attacks where model outputs can enable imitation, proposing a minimax game framework and a forward-pass-only defense called Product-of-Experts, showing that adaptive students recover more capability than passive evaluation suggests.
This paper identifies neural network training as a search through Hamilton-Jacobi initial-value problems, showing that residual networks, transformers, and RNNs discretize the same class of viscous Hamilton-Jacobi equations. It derives quantitative consequences including minimax-optimal generalization rates, adversarial robustness bounds, and a closed-form influence function.
This paper introduces a framework that connects randomized smoothing to differential privacy through privacy profiles, enabling tight provable robustness guarantees against backdoor attacks that jointly affect training and inference. The approach is instantiated for DP-SGD and Deep Partition Aggregation with experiments on MNIST and CIFAR-10.
This paper introduces HF-KCU, a method for efficient machine unlearning in federated learning that uses Krylov subspace approximations to remove a client's contribution, achieving a significant speedup over retraining while preserving model accuracy and providing robustness against adversarial perturbations.
This post extends E8 lattice geometric activation injection to supervised LLM safety routing, using STE-snapped E8 policy heads. While achieving near-perfect routing on clean data, the approach catastrophically fails under adversarial stress, requiring a hybrid symbolic-geometric architecture with audited deterministic rules.
This paper introduces Context-Driven Decomposition (CDD), a probe to diagnose when RAG systems comply with retrieved context despite conflicting parametric knowledge, and releases the Epi-Scale benchmark for systematic study across model families.
This paper introduces Latent Personality Alignment (LPA), a method that improves LLM safety by training on abstract personality traits rather than explicit harmful examples. The approach achieves better generalization against adversarial attacks and preserves model utility with significantly fewer training samples.
This paper introduces GAMBIT, a benchmark for evaluating adversarial robustness in multi-agent LLM collectives, featuring adaptive imposters and recalibration modes to address the limitations of existing shallow evaluations.