stress-testing

#stress-testing

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

arXiv cs.AI · 3d ago

This paper introduces AI-MASLD, a stress-audit framework for medical LLMs that reveals how benchmark accuracy can hide serious safety failures, and demonstrates that open-weight models can match or exceed proprietary ones on safety dimensions.

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

arXiv cs.AI · 2026-05-27

MemFail is a diagnostic benchmark that isolates failure modes of LLM memory systems by formalizing summarization, storage, and retrieval operations, and evaluating them with adversarially designed datasets.

@itsolelehmann: POV: claude traveled 6 months into the future and told you exactly how your next move failed. it's called a premortem. …

X AI KOLs Following · 2026-05-25

Explains how to use Claude to run a premortem, a technique originated by psychologist Gary Klein and championed by Daniel Kahneman, which stress-tests a plan by imagining it has already failed and working backward to the causes.

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

arXiv cs.CL · 2026-05-18

DetectRL-X is a multilingual benchmark for evaluating LLM-generated text detectors across 8 languages and 6 domains. Its stress tests cover AI-assisted writing operations and text perturbations, and the results show where current detectors succeed and fail in multilingual settings.
