Tag
This paper introduces a SCID-anchored benchmark of 555 interviews and uses it to evaluate five LLMs for psychiatric screening. The models show potential, but they tend to discount symptom evidence when functioning is preserved or protective context is present, so careful validation is required.