This paper evaluates 42 large language models on their ability to measure item discrimination in reading comprehension assessments. It finds weak alignment with human-calibrated measures and identifies discrimination estimation as an open challenge for psychometric evaluation.
This paper introduces MAFIG, a multi-agent framework that uses LLM agents and feature-specific evaluators to generate reading comprehension items with controlled difficulty by adhering to specified feature constraints. Experiments show that, compared to baseline methods, MAFIG achieves significantly higher constraint satisfaction and more robust difficulty control.
This paper proposes fine-tuning transformer encoders end-to-end for response-free item difficulty modelling of multiple-choice reading comprehension items, with component-wise and multi-task variants, and shows that multi-task learning improves performance in small-sample regimes.