This paper presents a robust evaluation framework and training strategies for biomedical publication-type and study-design classification, using knowledge-guided perturbations to mitigate reliance on spurious features.
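As a hedged illustration of how knowledge-guided perturbations might probe spurious-feature reliance, here is a minimal sketch; the cue lexicon, model interface, and function names are assumptions for illustration, not the paper's actual resources.

```python
# Minimal sketch: swap study-design cue phrases for neutral paraphrases and
# measure how often the classifier's prediction flips. A high flip rate
# suggests reliance on surface cues rather than study content.
import re

# Hypothetical cue lexicon; a real one would be built from domain knowledge.
CUE_SWAPS = {
    r"\brandomized controlled trial\b": "clinical study",
    r"\bdouble-blind\b": "masked",
    r"\bmeta-analysis\b": "pooled review",
}

def perturb(abstract: str) -> str:
    """Replace surface cue phrases while keeping the abstract's substance."""
    for pattern, replacement in CUE_SWAPS.items():
        abstract = re.sub(pattern, replacement, abstract, flags=re.IGNORECASE)
    return abstract

def flip_rate(model, abstracts) -> float:
    """Fraction of predictions that change under cue perturbation.
    `model` is any callable mapping an abstract string to a label."""
    flips = sum(model(a) != model(perturb(a)) for a in abstracts)
    return flips / len(abstracts)
```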
This paper introduces FragileFlow, a plug-in regularizer that improves the robustness of LLMs and VLMs by controlling 'correct-but-fragile' predictions through spectral analysis and PAC-Bayes bounds.
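FragileFlow's actual regularizer is not reproduced here; as a hedged sketch of the underlying idea, the code below penalizes a prediction that is correct but fragile, i.e. whose logit margin is small relative to the local sensitivity measured by the Jacobian's spectral norm. The penalty form, names, and model interface are all assumptions.

```python
# Sketch of a 'correct-but-fragile' penalty via spectral analysis of the
# input-output Jacobian. Assumes `model` maps a batch of inputs to logits
# and `y` is an integer true-class index.
import torch
from torch.autograd.functional import jacobian

def fragility_penalty(model, x, y: int, create_graph: bool = True):
    logits = model(x.unsqueeze(0)).squeeze(0)
    if int(logits.argmax()) != y:
        return logits.new_zeros(())          # only correct predictions count
    top2 = logits.topk(2).values
    margin = top2[0] - top2[1]               # true class is the argmax here
    # Spectral norm of d(logits)/dx: how fast logits can move locally.
    J = jacobian(lambda inp: model(inp.unsqueeze(0)).squeeze(0), x,
                 create_graph=create_graph)
    sigma = torch.linalg.svdvals(J.flatten(start_dim=1)).max()
    return torch.relu(sigma - margin)        # fragile: sensitivity > margin
```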
This paper introduces WARDEN, a distributionally robust adversarial training framework for large language models that uses f-divergence to dynamically reweight adversarial examples, significantly reducing attack success rates while maintaining computational efficiency.
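As a rough illustration of the reweighting idea (not WARDEN's actual formulation), the sketch below uses the KL instance of f-divergence DRO, whose worst-case distribution exponentially tilts per-example adversarial losses; the temperature and function names are assumptions.

```python
# KL-divergence DRO reweighting: upweight the adversarial examples the model
# currently finds hardest, so training focuses on the worst-case mixture.
import torch

def kl_dro_weights(adv_losses: torch.Tensor, temperature: float = 1.0):
    """For a KL uncertainty set, worst-case weights are proportional to
    exp(loss / temperature); rescaled so the weights average to 1."""
    w = torch.softmax(adv_losses.detach() / temperature, dim=0)
    return w * adv_losses.numel()

def reweighted_adv_loss(model, loss_fn, x_adv, y, temperature: float = 1.0):
    losses = loss_fn(model(x_adv), y)   # per-example, i.e. reduction='none'
    weights = kl_dro_weights(losses, temperature)
    return (weights * losses).mean()
```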
OpenAI announces security hardening of ChatGPT Atlas against prompt injection attacks through adversarial training and strengthened safeguards, including a rapid response loop for discovering and mitigating novel attack strategies before they appear in the wild.
This paper presents adversarial and virtual adversarial training methods adapted for text classification by applying perturbations to word embeddings in RNNs rather than raw inputs. The approach achieves state-of-the-art results on semi-supervised and supervised text classification benchmarks while reducing overfitting.
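The embedding-level perturbation amounts to a fast-gradient step taken in embedding space rather than on the discrete tokens. A minimal PyTorch sketch follows, assuming a model that consumes embeddings directly; epsilon and the normalization are illustrative choices, not the paper's exact code.

```python
# Adversarial training at the embedding layer: perturb word embeddings in the
# direction that most increases the loss, then train on the perturbed inputs.
import torch
import torch.nn.functional as F

def embedding_adversarial_loss(model, embeds, labels, epsilon: float = 1.0):
    """embeds: (batch, seq_len, dim) word embeddings; model maps embeddings
    to logits. Returns the loss on L2-bounded adversarial embeddings."""
    embeds = embeds.detach().requires_grad_(True)
    loss = F.cross_entropy(model(embeds), labels)
    grad, = torch.autograd.grad(loss, embeds)
    # L2-normalized perturbation of radius epsilon per example.
    norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1)
    r_adv = epsilon * grad / norm
    return F.cross_entropy(model(embeds + r_adv.detach()), labels)
```

In the semi-supervised (virtual adversarial) variant, the cross-entropy against labels is replaced by a KL divergence between predictions on clean and perturbed embeddings, so no labels are needed.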