Tag
This paper introduces AI-MASLD, a stress-audit framework for medical LLMs that reveals how benchmark accuracy can hide serious safety failures, and demonstrates that open-weight models can match or exceed proprietary ones on safety dimensions.
MemFail is a diagnostic benchmark that isolates failure modes of LLM memory systems by formalizing summarization, storage, and retrieval operations, and evaluating them with adversarially designed datasets.
Explains how to use Claude to perform a premortem, a technique by Daniel Kahneman, to stress-test plans by imagining they have already failed.
DetectRL-X is a comprehensive multilingual benchmark for evaluating LLM-generated text detectors across 8 languages and 6 domains, including stress testing with AI-assisted writing operations and perturbations. It reveals strengths and limitations of current detectors in multilingual scenarios.