Tag
This paper evaluates 42 large language models on their ability to measure item discrimination in reading comprehension assessments, finding weak alignment with human-calibrated measures and highlighting it as an open challenge for psychometric evaluation.