Tag
Proposes Latent-Predictive Counterfactual Decoupling (LPCD) to address tactical out-of-distribution shifts in live streaming risk assessment by decoupling stable malicious intent from evolving narrative tactics at the latent level, achieving superior performance on large-scale industrial datasets.
RRISE introduces a learned surrogate estimator that reduces the Monte Carlo sampling cost of randomized smoothing for certified robustness to a single forward pass, maintaining accuracy within 0.84 percentage points while replacing up to 10^4 evaluations per query.
This paper proposes a lightweight CNN architecture to improve adversarial robustness in EEG-based brain-computer interfaces, evaluating it against adversarial attacks and showing better classification performance than existing models.
Introduces TASER, a training-time regularization framework derived from Langevin Stein operators that encourages geometric compatibility between predictors and data density, improving adversarial robustness and stability on CIFAR-10 without significant clean accuracy degradation.
Introduces PReMISE, a framework for discovering and auditing policy-level rubrics for LLM judges along four axes: structural adequacy, reliability, preference fit, and adversarial robustness.
This paper studies distillation attacks where model outputs can enable imitation, proposing a minimax game framework and a forward-pass-only defense called Product-of-Experts, showing that adaptive students recover more capability than passive evaluation suggests.
This paper identifies neural network training as a search through Hamilton-Jacobi initial-value problems, showing that residual networks, transformers, and RNNs discretize the same class of viscous Hamilton-Jacobi equations. It derives quantitative consequences including minimax optimal generalization rates, adversarial robustness bounds, and a closed-form influence function.
This paper introduces a framework that connects randomized smoothing to differential privacy through privacy profiles, enabling tight provable robustness guarantees against backdoor attacks that jointly affect training and inference. The approach is instantiated for DP-SGD and Deep Partition Aggregation with experiments on MNIST and CIFAR-10.
Introduces HF-KCU, a method for efficient machine unlearning in federated learning that uses Krylov subspace approximations to remove a client's contribution, achieving significant speedup over retraining while preserving model accuracy and providing robustness against adversarial perturbations.
This post extends E8 lattice geometric activation injection to supervised LLM safety routing, using STE-snapped E8 policy heads. While achieving near-perfect routing on clean data, the approach catastrophically fails under adversarial stress, requiring a hybrid symbolic-geometric architecture with audited deterministic rules.
This paper introduces Context-Driven Decomposition (CDD), a probe to diagnose when RAG systems comply with retrieved context despite conflicting parametric knowledge, and releases the Epi-Scale benchmark for systematic study across model families.
This paper introduces Latent Personality Alignment (LPA), a method that improves LLM safety by training on abstract personality traits rather than explicit harmful examples. The approach achieves better generalization against adversarial attacks and preserves model utility with significantly fewer training samples.
This paper introduces GAMBIT, a benchmark for evaluating adversarial robustness in multi-agent LLM collectives, featuring adaptive imposters and recalibration modes to address the limitations of existing shallow evaluations.
This paper introduces the Adversarial Empathy Benchmark (AEB) and Emotional Consistency Score (ECS) to test the robustness of RLVER-trained models against adversarial user behaviors. Results show that while RLVER improves emotional responsiveness, it does not significantly enhance the model's ability to track user emotional states under adversarial conditions.
This paper investigates adversarial robustness in Fuzzy ARTMAP, a streaming neural architecture, by introducing WB-Softmax as a mechanism-aligned white-box attack surrogate. It evaluates progressive training and selective updating strategies to improve robustness without data replay, while also offering interpretable diagnostics for structural failures.
This paper introduces MELD, a detector for AI-generated text that uses multi-task learning with auxiliary heads for generator family, attack type, and source domain to improve robustness. MELD achieves strong performance on the RAID benchmark and maintains low false-positive rates under adversarial attacks.
This paper presents a comprehensive benchmark for evaluating adversarial attacks and defenses in Graph Neural Networks, highlighting the need for standardized and fair experimental protocols.
Proposes LiSCP, a lightweight stylistic consistency profiling method for robust detection of LLM-generated textual content, focusing on feature stability under adversarial manipulation. Achieves superior performance on in-domain and cross-domain detection with notable robustness.
This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.
This paper demonstrates that training large language models with stochastic tokenization instead of deterministic canonical tokenization significantly improves robustness to adversarial attacks and random perturbations, with improvements shown across pre-training, fine-tuning, and in-context learning without increasing inference costs.