LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

arXiv cs.CL 06/18/26, 04:00 AM Papers
llm item-discrimination educational-assessment psychometrics reading-comprehension evaluation zero-shot
Summary
This paper evaluates 42 large language models on their ability to measure item discrimination in reading comprehension assessments, finding weak alignment with human-calibrated measures and highlighting it as an open challenge for psychometric evaluation.
arXiv:2606.18709v1 Announce Type: new Abstract: Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
Original Article
View Cached Full Text
Cached at: 06/18/26, 05:45 AM
# LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
Source: [https://arxiv.org/html/2606.18709](https://arxiv.org/html/2606.18709)
Han Chen1,Ming Li1,2,Chenguang Wang3,Yijun Liang2,Dawei Zhou3,Hong Jiao2,Tianyi Zhou1 1MBZUAI2University of Maryland3Virginia Tech \{han\.chen, tianyi\.zhou\}@mbzuai\.ac\.aeminglii@umd\.edu

###### Abstract

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency\. While various existing works have explored whether large language models \(LLMs\) can estimate item difficulty, it remains unclear whether they can capture item discrimination\. In this work, we evaluate 42 proprietary and open\-weight LLMs in zero\-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item’s discrimination value from its content, and response\-based Classical Test Theory \(CTT\) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores\. Our results show that direct prediction yields weak alignment with human\-calibrated discrimination: the best\-performing model reaches only a Spearman correlation of 0\.152\. Response\-based CTT calibration provides a stronger but still limited signal, with the all\-persona synthetic respondent pool reaching a Spearman correlation of 0\.241\. These findings highlight item discrimination as an open challenge for LLM\-based psychometric evaluation: current LLMs contain non\-random discrimination\-relevant signal, but they do not yet reliably capture how assessment items distinguish human students\.

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Han Chen1, Ming Li1,2, Chenguang Wang3, Yijun Liang2, Dawei Zhou3, Hong Jiao2, Tianyi Zhou11MBZUAI2University of Maryland3Virginia Tech\{han\.chen, tianyi\.zhou\}@mbzuai\.ac\.aeminglii@umd\.edu

## 1Introduction

A growing question in NLP is whether large language models can serve not only as task solvers, but also as models of human behavior\. We study this question through item discrimination, a latent population\-level statistic that reflects how human response patterns vary across proficiency levels\. In educational assessment, effective items should not be evaluated by difficulty alone, but also by how well they reveal meaningful differences in test\-takers’ proficiency\(Eignor,[2013](https://arxiv.org/html/2606.18709#bib.bib66); Haladyna and Rodriguez,[2013](https://arxiv.org/html/2606.18709#bib.bib65)\)\. In classical test theory \(CTT\), this property is commonly captured by item discrimination, often measured as the correlation between item correctness and total test score\(Lord and Novick,[2008](https://arxiv.org/html/2606.18709#bib.bib67); Crocker and Algina,[1986](https://arxiv.org/html/2606.18709#bib.bib64); Moses,[2017](https://arxiv.org/html/2606.18709#bib.bib63)\)\. A high discrimination value indicates that higher\-proficiency test\-takers are more likely to answer the item correctly, whereas a low or negative value suggests weak or reversed separation between proficiency levels\(Ebel,[1972](https://arxiv.org/html/2606.18709#bib.bib68); Moses,[2017](https://arxiv.org/html/2606.18709#bib.bib63); McCowan and McCowan,[1999](https://arxiv.org/html/2606.18709#bib.bib69)\)\.

A high\-discrimination item has a response pattern that closely tracks test\-takers’ proficiency\. This makes discrimination crucial for test construction, since items with similar difficulty can differ substantially in diagnostic value\(Haladyna and Rodriguez,[2013](https://arxiv.org/html/2606.18709#bib.bib65); Moses,[2017](https://arxiv.org/html/2606.18709#bib.bib63)\)\.

Traditionally, item discrimination is estimated from human response data collected through large\-scale pretesting, which makes calibration costly and time\-consuming for newly developed items\(Crocker and Algina,[1986](https://arxiv.org/html/2606.18709#bib.bib64); Moses,[2017](https://arxiv.org/html/2606.18709#bib.bib63); McCowan and McCowan,[1999](https://arxiv.org/html/2606.18709#bib.bib69)\)\. The emergence of Large Language Models \(LLMs\) offers a potential alternative, as they can process assessment items and provide scalable judgments about item quality and difficulty\(Liet al\.,[2025b](https://arxiv.org/html/2606.18709#bib.bib32); Veeramaniet al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib71); Liet al\.,[2025a](https://arxiv.org/html/2606.18709#bib.bib1)\)\. However, prior work has focused largely on whether LLMs can predict item difficulty, typically defined as the likelihood that test\-takers answer an item correctly\(AlKhuzaeyet al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib70); Yanevaet al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib3); Liet al\.,[2025a](https://arxiv.org/html/2606.18709#bib.bib1)\)\. Whether LLMs can predict item discrimination remains less clear\. This question matters because discrimination captures a distinct aspect of item quality: how well an item differentiates test\-takers by ability rather than how difficult it is overall\(Crocker and Algina,[1986](https://arxiv.org/html/2606.18709#bib.bib64); Moses,[2017](https://arxiv.org/html/2606.18709#bib.bib63)\)\.

We therefore examine whether current LLMs can estimate empirical item discrimination values from reading\-comprehension item content under a zero\-shot setting\(Säuberliet al\.,[2025b](https://arxiv.org/html/2606.18709#bib.bib73)\)\. This question is difficult to study at scale because publicly available assessment datasets rarely include item\-level discrimination values derived from human pretesting, as such statistics require collecting response data from real test\-takers\. Thus, we use the Cambridge Multiple\-Choice Questions Reading Dataset\(Mulloolyet al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib2)\)as it provides one of the few available settings that pairs reading comprehension items with item\-level psychometric statistics, including discrimination values\(Liusieet al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib72)\)\. This setting enables us to compare LLM\-generated discrimination estimates with empirical human\-calibrated discrimination values, thereby diagnosing Human\-AI discrimination alignment while acknowledging the limited availability of suitable human\-calibrated data\.

Specifically, in this work, we evaluate two complementary approaches followingLiet al\.\([2025a](https://arxiv.org/html/2606.18709#bib.bib1)\)\. First, direct discrimination prediction: LLMs are given the full item content, including the correct answer, and asked to predict the discrimination value\. This evaluates whether LLMs can explicitly judge how well an item distinguishes higher\- and lower\-proficiency test\-takers\(Crocker and Algina,[1986](https://arxiv.org/html/2606.18709#bib.bib64); Moses,[2017](https://arxiv.org/html/2606.18709#bib.bib63)\)\. Second, response\-based calibration: LLMs answer items without access to the correct answer\. We then treat their outputs as synthetic responses and compute item discrimination from their correctness patterns using the same CTT principle applied to human response data\(Crocker and Algina,[1986](https://arxiv.org/html/2606.18709#bib.bib64); Moses,[2017](https://arxiv.org/html/2606.18709#bib.bib63); Säuberliet al\.,[2025b](https://arxiv.org/html/2606.18709#bib.bib73); Maeda,[2025](https://arxiv.org/html/2606.18709#bib.bib74)\)\. For both approaches, we compare the no\-persona baseline with low\-, medium\-, and high\-proficiency simulation settings to assess how proficiency\-conditioned simulation affects the alignment between LLM estimates and empirical item statistics\(Säuberliet al\.,[2025b](https://arxiv.org/html/2606.18709#bib.bib73); Maeda,[2025](https://arxiv.org/html/2606.18709#bib.bib74)\)\.

Our contributions are threefold\. \(1\) To our knowledge, this work is among the earliest systematic studies of Human\-AI alignment for item discrimination, framing the task as a test of whether LLMs can model latent, ability\-conditioned human response behavior from item content\. \(2\) We shift the focus beyond the widely studied problem of item difficulty prediction and highlight item discrimination as an underexplored but essential dimension of assessment quality\. \(3\) We compare explicit discrimination prediction with response\-based CTT calibration under different proficiency simulations, providing systematic evidence on both the promise and the limitations of LLMs for discrimination\-oriented item analysis and potential LLM\-assisted educational assessment workflows\.

## 2Related Work

Prior work on item quality has primarily studied item difficulty prediction using response data, textual features, machine learning models, and, more recently, LLM\-based signals\(Hambletonet al\.,[1991](https://arxiv.org/html/2606.18709#bib.bib24); DeMars,[2010](https://arxiv.org/html/2606.18709#bib.bib23); Sano,[2015](https://arxiv.org/html/2606.18709#bib.bib26); Loukinaet al\.,[2016](https://arxiv.org/html/2606.18709#bib.bib25); Devlinet al\.,[2019](https://arxiv.org/html/2606.18709#bib.bib29); Heet al\.,[2021](https://arxiv.org/html/2606.18709#bib.bib30); Benedetto,[2023](https://arxiv.org/html/2606.18709#bib.bib21); Rogoz and Ionescu,[2024](https://arxiv.org/html/2606.18709#bib.bib33); Zotoset al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib35); Liet al\.,[2025b](https://arxiv.org/html/2606.18709#bib.bib32); Fenget al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib34)\)\. Recent systematic reviews further summarize text\-based item difficulty modeling as a mature research direction across classical machine learning, neural models, and LLM\-based approaches\(Peterset al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib84)\)\. Yet difficulty alone is insufficient for evaluating item quality, since hard items do not necessarily distinguish students’ proficiency\. Item discrimination, typically measured through item\-total correlations or IRT slope parameters, captures how well an item differentiates students by ability, but has received less attention in LLM\-based item quality estimation\(DeMars,[2010](https://arxiv.org/html/2606.18709#bib.bib23); Lord,[2012](https://arxiv.org/html/2606.18709#bib.bib45); Embretson and Reise,[2025](https://arxiv.org/html/2606.18709#bib.bib46); Hanet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib61)\)\. Meanwhile, LLM\-based student simulation has been explored for education, including synthetic response generation for item calibration\(Markelet al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib47); Parket al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib41),[2024](https://arxiv.org/html/2606.18709#bib.bib58); Liuet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib56)\), but recent work argues that simulated student behavior may still misalign with real student response patterns\(Hayakawa and Saggion,[2024](https://arxiv.org/html/2606.18709#bib.bib57); Säuberliet al\.,[2025a](https://arxiv.org/html/2606.18709#bib.bib60); Srivatsaet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib59)\)\. We therefore evaluate LLMs through the stricter lens of human item discrimination, asking whether they capture not only which items are difficult, but also which items meaningfully separate students by skill level\.*The complete related work section is provided in Appendix[A](https://arxiv.org/html/2606.18709#A1)*\.

## 3Dataset and Problem Formulation

### 3\.1Dataset and Task Description

We study item\-level discrimination prediction for multiple\-choice reading comprehension assessment\. Given an item, the goal is to predict how well an item distinguishes higher\- and lower\-proficiency test\-takers\. We use the Cambridge Multiple\-Choice Questions Reading DatasetMulloolyet al\.\([2023](https://arxiv.org/html/2606.18709#bib.bib2)\)and the target label is the human\-calibrated item discrimination value obtained from student pretesting\. The dataset used in our experiments contains 793 item records derived from 120 Cambridge reading\-comprehension tasks\. Each item includes the passage, question stem, four answer options, the correct answer, and psychometric statistics, including item difficulty score and CTT\-derived discrimination score\. Let

𝒟=\{\(xi,ai∗,yi\)\}i=1N\\mathcal\{D\}=\\\{\(x\_\{i\},a\_\{i\}^\{\*\},y\_\{i\}\)\\\}\_\{i=1\}^\{N\}denote the dataset, wherexix\_\{i\}is the full item context, including the passage, question, and answer options;ai∗a\_\{i\}^\{\*\}is the ground\-truth answer; andyiy\_\{i\}is the human\-derived item discrimination label\.

Under Classical Test Theory \(CTT\), discrimination is computed as the point\-biserial correlation between item correctness and total test score:

yi=corr\(vi,s\),y\_\{i\}=\\mathrm\{corr\}\(v\_\{i\},s\),wherevi∈\{0,1\}v\_\{i\}\\in\\\{0,1\\\}indicates whether an examinee answers itemiicorrectly, andssdenotes the examinee’s total test score\. A larger positive value indicates better discrimination, as higher\-scoring examinees are more likely to answer the item correctly, whereas values near zero suggest weak separation between high\- and low\-scoring examinees\.

We evaluate two LLM\-based approaches\. In direct discrimination prediction, the model receives the item context and ground\-truth answer, then outputs a single discrimination score\. This tests whether LLMs can judge how well an item separates student proficiency levels\. In response\-based CTT calibration, the model answers each item without access to the ground\-truth answer\. We convert the resulting answers into binary correctness values and use them to construct synthetic response matrices for item discrimination prediction in Section[5\.1](https://arxiv.org/html/2606.18709#S5.SS1)\.

### 3\.2Student Simulation

To examine whether LLMs can capture proficiency\-dependent item behavior, we evaluate each model under four proficiency configurations:

𝒫=\{p0,plow,pmid,phigh\}\.\\mathcal\{P\}=\\\{p\_\{0\},p\_\{\\text\{low\}\},p\_\{\\text\{mid\}\},p\_\{\\text\{high\}\}\\\}\.The baseline configurationp0p\_\{0\}uses no explicit student role and reflects the model’s default behavior\. Forplowp\_\{\\text\{low\}\},pmidp\_\{\\text\{mid\}\}, andphighp\_\{\\text\{high\}\}, the LLM simulates lower\-, medium\-, and higher\-proficiency Cambridge English test\-takers, respectively\.

In direct discrimination prediction, these simulated student roles test whether lower\-, medium\-, and higher\-proficiency perspectives affect the model’s discrimination estimates\. In direct answering, the simulated roles generate proficiency\-conditioned response patterns, allowing us to examine whether simulated respondents induce discrimination estimates closer to those observed in real students\. The detailed simulation prompts are provided in Appendix[C](https://arxiv.org/html/2606.18709#A3)\.

### 3\.3Implementation Details

#### Models and inference\.

We evaluate 42 proprietary and open\-weight LLMs to examine whether discrimination alignment varies across model families and capability levels\. The proprietary models include GPT\-3\.5\-Turbo\(OpenAI,[2024b](https://arxiv.org/html/2606.18709#bib.bib5)\), GPT\-4o\-mini\(OpenAI,[2024a](https://arxiv.org/html/2606.18709#bib.bib6)\), GPT\-4o\(Hurstet al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib4)\), GPT\-4\.1\-mini and GPT\-4\.1\(OpenAI,[2025b](https://arxiv.org/html/2606.18709#bib.bib7)\), GPT\-o4\-mini\(OpenAI,[2025c](https://arxiv.org/html/2606.18709#bib.bib8)\), GPT\-5\(OpenAI,[2025a](https://arxiv.org/html/2606.18709#bib.bib9)\), and GPT\-5\.5\(OpenAI,[2026](https://arxiv.org/html/2606.18709#bib.bib10)\)from the OpenAI model family; Claude 3\.5 Haiku\(Anthropic,[2024](https://arxiv.org/html/2606.18709#bib.bib77)\)and Claude 3\.7 Sonnet\(Anthropic,[2025](https://arxiv.org/html/2606.18709#bib.bib11)\)from the Claude family; and Gemini 2\.0 Flash\(Google DeepMind,[2024](https://arxiv.org/html/2606.18709#bib.bib12)\), Gemini 2\.5 Flash, and Gemini 2\.5 Pro\(Comaniciet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib13)\)from the Gemini family\. The open\-weight models include Llama 2\-7B and Llama 2\-13B\(Touvronet al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib14)\), Llama 3\.1\-8B\(Grattafioriet al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib15)\), Llama 3\.2\-1B and Llama 3\.2\-3B\(Meta AI,[2024a](https://arxiv.org/html/2606.18709#bib.bib78)\), and Llama 3\.3\-70B\(Meta AI,[2024b](https://arxiv.org/html/2606.18709#bib.bib79)\)from the Llama family; Mistral\-7B\-v0\.3\(Jianget al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib80)\)from the Mistral family; OLMo 2\-7B and OLMo 2\-13B\(OLMoet al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib81)\)from the OLMo family; Phi\-3\-mini and Phi\-3\.5\-mini\(Abdinet al\.,[2024a](https://arxiv.org/html/2606.18709#bib.bib20)\), Phi\-4\(Abdinet al\.,[2024b](https://arxiv.org/html/2606.18709#bib.bib19)\), and Phi\-4\-mini\(Aboueleninet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib82)\)from the Phi family; and Qwen2\.5, Qwen3, and Qwen3\.5 models at multiple scales\(Qwen Team,[2024](https://arxiv.org/html/2606.18709#bib.bib16); Yanget al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib17); Qwen Team,[2026](https://arxiv.org/html/2606.18709#bib.bib18)\)from the Qwen family\. Models were instructed to place the final scalar estimate or answer letter in boxed form\.

#### Evaluation metrics\.

We evaluate whether LLM\-derived discrimination scores align with human\-calibrated item discrimination using two complementary criteria\. \(1\) We useSpearman’s rank correlationto assess whether LLM\-derived scores preserve the human ranking of items by discrimination\. This criterion reflects whether the model can identify which items are relatively more or less discriminative\. \(2\) We useroot mean squared error \(RMSE\)to measure the numerical deviation between LLM\-derived discrimination scores and human\-calibrated discrimination labels\.

Spearman↑\\uparrowRMSE↓\\downarrowModelBaselineLowMediumHighAll\-personaBaselineLowMediumHighAll\-personaLlama 2\-7B0\.006\-0\.024\-0\.027\-0\.007\-0\.0130\.5540\.4230\.4230\.4460\.462Llama 3\.1\-8B\-0\.147∗∗∗\-0\.121∗∗∗\-0\.123∗∗∗\-0\.136∗∗∗\-0\.132∗∗∗0\.2890\.3320\.2570\.3390\.304Llama 3\.3\-70B\-0\.080∗\-0\.123∗∗∗\-0\.110∗∗\-0\.103∗∗\-0\.104∗∗∗0\.2210\.2350\.2290\.2870\.243Mistral\-7B\-v0\.3\-0\.0040\.0070\.004\-0\.095∗∗\-0\.0220\.3590\.3320\.2630\.3860\.335OLMo 2\-13B0\.0020\.011\-0\.057\-0\.091∗\-0\.0340\.2180\.2340\.2620\.2810\.249Phi\-3\-mini0\.0010\.077∗0\.0010\.0270\.0270\.3900\.3520\.3470\.3710\.365Phi\-4\-0\.133∗∗∗\-0\.018\-0\.073∗0\.012\-0\.053∗∗0\.1600\.1560\.1610\.1710\.162Qwen2\.5\-7B\-0\.059\-0\.035\-0\.089∗\-0\.093∗∗\-0\.069∗∗0\.1720\.1860\.1740\.1940\.181Qwen3\-8B0\.0410\.0210\.0200\.0370\.0300\.1570\.1560\.1520\.1660\.158Qwen3\-32B\-0\.043\-0\.081∗0\.002\-0\.049\-0\.043∗0\.2110\.1890\.1740\.1980\.193Qwen3\-235B\-A22B\-0\.0060\.034\-0\.060\-0\.037\-0\.0170\.1480\.1470\.1480\.1610\.151Qwen3\.5\-122B\-A10B\-0\.0580\.0480\.005\-0\.058\-0\.0160\.2500\.2450\.2240\.2680\.247Claude 3\.5 Haiku\-0\.116∗∗\-0\.144∗∗∗\-0\.134∗∗∗\-0\.148∗∗∗\-0\.136∗∗∗0\.1980\.2170\.1930\.2550\.216Claude 3\.7 Sonnet\-0\.051\-0\.128∗∗∗\-0\.0120\.007\-0\.046∗∗0\.1880\.2180\.1870\.2010\.199Gemini 2\.5 Flash0\.006\-0\.025\-0\.063\-0\.067\-0\.0370\.2360\.1900\.1760\.2660\.217Gemini 2\.5 Pro\-0\.001\-0\.0440\.003\-0\.029\-0\.0180\.1820\.1600\.1830\.2010\.182GPT\-3\.5\-Turbo\-0\.021\-0\.033\-0\.046\-0\.071∗\-0\.043∗0\.2740\.3100\.2660\.2840\.284GPT\-4\.1\-0\.140∗∗∗\-0\.171∗∗∗\-0\.131∗∗∗\-0\.160∗∗∗\-0\.151∗∗∗0\.1930\.1890\.1720\.1990\.188GPT\-4o\-0\.121∗∗∗\-0\.102∗∗\-0\.100∗∗\-0\.082∗\-0\.101∗∗∗0\.1790\.1810\.1860\.2100\.189GPT\-5\.50\.129∗∗∗0\.152∗∗∗0\.120∗∗∗0\.071∗0\.118∗∗∗0\.1530\.1530\.1510\.1570\.154All\-model\-0\.091∗\-0\.0350\.013\-0\.084∗\-0\.074∗0\.1650\.1560\.1600\.1880\.163

Table 1:Direct item discrimination prediction results for 20 representative models\. Spearman evaluates rank alignment with human\-calibrated discrimination values, with significance denoted by stars:p∗<\.05\{\}^\{\*\}p<\.05,p∗∗<\.01\{\}^\{\*\*\}p<\.01, andp∗⁣∗∗<\.001\{\}^\{\*\*\*\}p<\.001\. RMSE measures numerical prediction error\. Baseline denotes no persona, Low/Medium/High denote simulated student proficiency levels, All\-persona averages across personas within each model, and All\-model averages across models\.

## 4Direct Discrimination Prediction

### 4\.1Overall Performance

Table[1](https://arxiv.org/html/2606.18709#S3.T1)reports direct discrimination prediction results for 20 representative models, while the complete results for all 42 models are provided in Appendix[D](https://arxiv.org/html/2606.18709#A4)\. Across most models and prompting conditions, Spearman correlations are close to zero or even negative, suggesting that model\-predicted discrimination scores fail to preserve the human ranking of items by discrimination\. This represents a substantial limitation for item analysis, since discrimination is primarily used to identify which items better separate higher\- and lower\-scoring examinees\.

GPT\-5\.5 is the only model that shows consistently positive correlations across all prompting conditions, with the best result under the low\-proficiency prompt\. However, the correlation remains low, suggesting that even the best\-performing model in this setting provides only weak alignment with human\-calibrated discrimination\. In contrast, several capable proprietary models produce negative correlations in the baseline setting\. These results suggest that general language understanding and reasoning ability do not necessarily translate into reliable estimates of how reading\-comprehension items discriminate among human learners\.

![Refer to caption](https://arxiv.org/html/2606.18709v1/figures/consensus_alignment.png)Figure 1:Model consensus does not imply human alignment in direct discrimination prediction\. Each point denotes one model\-persona prediction vector\. The x\-axis summarizes how similar that prediction vector is to other model\-persona predictions, measured by its mean Spearman correlation with all other prediction vectors\. The y\-axis measures human alignment, computed as the Spearman correlation with human\-calibrated item discrimination\. Points in the lower\-right region indicate a shared but reversed signal: models agree with one another, but their rankings are negatively aligned with human discrimination\.![Refer to caption](https://arxiv.org/html/2606.18709v1/figures/main1.png)Figure 2:Distributions of human\-calibrated and LLM\-predicted item discrimination values for 20 representative models\. The complete visualization covering all 42 models is provided in Appendix[D](https://arxiv.org/html/2606.18709#A4)\.To further investigate whether prediction failures reflect isolated model behavior or broader model\-side patterns, Figure[1](https://arxiv.org/html/2606.18709#S4.F1)compares inter\-model consensus with human alignment\. The figure shows that direct predictions are not simply random\. Many model\-persona prediction vectors have positive agreement with other LLM prediction vectors while maintaining negative alignment with human\-calibrated discrimination\. This lower\-right pattern indicates a shared but reversed signal: models agree with one another to some extent, but their rankings move in the opposite direction from human discrimination\. Thus, the failure of direct prediction reflects not only weak alignment, but also systematically misaligned ranking behavior\.

Proficiency\-based simulation and persona ensembling do not consistently improve direct discrimination prediction\. Although some models improve under a particular simulated persona, the effective persona varies across models\.

#### Key Finding 1

Direct Prediction Shows Weak and Often Reversed Alignment with Human Discrimination\.Direct discrimination prediction fails to reliably estimate human\-calibrated item discrimination\. Across models and simulation conditions, Spearman correlations remain near zero or negative, indicating that LLMs cannot reliably rank items by how well they distinguish stronger from weaker students\. The negative correlations are informative rather than merely noisy: they suggest that some models rely on shared heuristics that rank items in a direction opposite to human item discrimination\. AlthoughGPT\-5\.5shows consistently positive alignment, its correlation remains weak, andproficiency simulation and persona ensembling fail to provide stable gains\.

### 4\.2Distributional Analysis of Simulation

Figure[2](https://arxiv.org/html/2606.18709#S4.F2)shows distributions for 20 representative models across the four role\-simulation settings; the complete visualization for all 42 models is provided in Appendix[D](https://arxiv.org/html/2606.18709#A4)\.

Most models produce substantially narrower distributions than the human\-calibrated distribution\. Human item discrimination spans a broad range, reflecting item\-level variation in how well questions distinguish higher\- and lower\-proficiency test\-takers\. In contrast, many LLMs concentrate their predictions within a small positive interval\. In particular, LLMs rarely assign very low or negative discrimination values, placing little probability mass in the lower tail where weakly discriminating or potentially problematic items appear\.

Proficiency\-based simulation changes the scale or shape of predictions for some models, but does not generate a human\-like distribution\. For many models, the no\-persona, low\-, medium\-, and high\-proficiency curves largely overlap, indicating limited effects of simulated roles\. When differences appear, they mainly involve location or dispersion shifts rather than reconstruction of the broader human\-calibrated distribution, and are not directionally consistent across models\.

Overall, these distributional results help explain the weak correlations in Table[1](https://arxiv.org/html/2606.18709#S3.T1): direct predictions do not merely differ in scale from human\-calibrated discrimination, but also fail to preserve the item\-level variation needed for reliable discrimination analysis\. This distinction is important because RMSE captures numerical calibration error, whereas Spearman correlation more directly tests whether models preserve the human ordering of item discrimination\.

#### Key Finding 2

Distributional Compression Under Simulation\.Direct discrimination prediction fails in bothrank alignmentanddistributional calibration\. Most models producecompressed prediction distributionswithin a narrow positive range, missing broader item\-level variation observed in human pretesting and rarely assigning very low or negative values\. This reveals systematic calibration bias, while the near\-zero or negative Spearman correlations indicate item\-level ranking misalignment with human discrimination\. Proficiency simulation mainly changes thescale or shapeof predictions, but does not generate a human\-like distribution\.

## 5Discrimination from CTT Calibration

### 5\.1CTT Calibration Formulation

Section[4](https://arxiv.org/html/2606.18709#S4)examined direct discrimination prediction\. In this section, we instead estimate item discrimination from LLM answer behavior using the same CTT principle as the Cambridge dataset: discrimination is computed as the correlation between item correctness and total score\.

Table 2:Response\-based CTT calibration across synthetic respondent pools\. Spearman evaluates rank alignment with human\-calibrated item discrimination, and the 95% CI gives the bootstrap confidence interval for Spearman\. Null reports the shuffled\-response baseline that preserves synthetic examinee score distributions but removes item\-specific alignment\.Letℳ\\mathcal\{M\}denote the set of LLMs and let𝒫\\mathcal\{P\}denote the four proficiency configurations defined in Section[3\.2](https://arxiv.org/html/2606.18709#S3.SS2)\. For each itemii, modelm∈ℳm\\in\\mathcal\{M\}under configurationp∈𝒫p\\in\\mathcal\{P\}predicts an answera^i,m,p\\hat\{a\}\_\{i,m,p\}, which we convert into a binary correctness value:

vi,m,p=𝕀\(a^i,m,p=ai∗\)\.v\_\{i,m,p\}=\\mathbb\{I\}\(\\hat\{a\}\_\{i,m,p\}=a\_\{i\}^\{\*\}\)\.Each model\-configuration pair\(m,p\)\(m,p\)is treated as one synthetic test\-taker\. We construct one pool for each configuration, as well as an all\-configuration pool:

ℛp\\displaystyle\\mathcal\{R\}\_\{p\}=\{\(m,p\):m∈ℳ\},\\displaystyle=\\\{\(m,p\):m\\in\\mathcal\{M\}\\\},p∈𝒫,\\displaystyle p\\in\\mathcal\{P\},ℛall\\displaystyle\\mathcal\{R\}\_\{\\mathrm\{all\}\}=ℳ×𝒫\.\\displaystyle=\\mathcal\{M\}\\times\\mathcal\{P\}\.
For a given poolℛs\\mathcal\{R\}\_\{s\}, wheres∈𝒫∪\{all\}s\\in\\mathcal\{P\}\\cup\\\{\\mathrm\{all\}\\\}, we compute the total score of each synthetic test\-taker as

Tm,p\(s\)=∑j=1Nvj,m,p\.T^\{\(s\)\}\_\{m,p\}=\\sum\_\{j=1\}^\{N\}v\_\{j,m,p\}\.The CTT\-calibrated discrimination estimate for itemiiis then

y~i\(s\)=corr⁡\(\{vi,m,p\}\(m,p\)∈ℛs,\{Tm,p\(s\)\}\(m,p\)∈ℛs\)\.\\tilde\{y\}^\{\(s\)\}\_\{i\}=\\operatorname\{corr\}\\left\(\\\{v\_\{i,m,p\}\\\}\_\{\(m,p\)\\in\\mathcal\{R\}\_\{s\}\},\\\{T^\{\(s\)\}\_\{m,p\}\\\}\_\{\(m,p\)\\in\\mathcal\{R\}\_\{s\}\}\\right\)\.
![Refer to caption](https://arxiv.org/html/2606.18709v1/figures/density.png)Figure 3:Distributions of human\-calibrated and LLM\-calibrated \(via CTT as in Section[5\.1](https://arxiv.org/html/2606.18709#S5.SS1)\) item discrimination values across synthetic respondent pools\.This follows the CTT definition and the Cambridge practice, which treats discrimination as the point\-biserial correlation between item score and total test score\. If all synthetic test\-takers in a pool have the same correctness value for an item, the correlation is undefined; in this case, we assign the item’s CTT\-calibrated discrimination value to zero because it does not distinguish among synthetic test\-takers in that pool\.

For the null baseline, we independently shuffle each synthetic test\-taker’s correctness vector across items, preserving test\-taker\-level total scores while destroying item\-specific response alignment\. We repeat this procedure 1,000 times and report the mean Spearman correlation with human\-calibrated discrimination labels\. We compute 95% confidence intervals using item\-level bootstrap resampling with 1,000 samples\.

### 5\.2Overall Performance

Table[2](https://arxiv.org/html/2606.18709#S5.T2)reports CTT results\. Unlike direct prediction, all answer\-based settings produce positive Spearman correlations with human\-calibrated item discrimination, suggesting that LLM answer behavior contains slightly more discrimination\-relevant signal than explicit numerical judgments\. Among single\-persona settings, the high\-proficiency pool achieves the highest single\-persona Spearman correlation, while the low\-proficiency pool achieves the second highest\.

However, alignment remains limited\. Even in the all\-persona setting, Spearman correlation is weak, indicating that LLM\-generated responses cannot perfectly simulate human pretesting data\. Near\-zero null baselines suggest that the positive correlations are not merely artifacts of the CTT computation, although the gap to human discrimination remains substantial\.

Figure[3](https://arxiv.org/html/2606.18709#S5.F3)further shows that response\-based estimates are distributionally miscalibrated\. Human item discrimination occupies a lower and broader range, whereas LLM\-derived values are shifted substantially to the right, suggesting systematic overestimation\. The persona\-specific curves are also highly similar, indicating that low\-, medium\-, and high\-proficiency prompting changes the calibrated discrimination distribution only modestly\. The all\-persona pool smooths the distribution and improves RMSE, but it is not the strongest setting in rank alignment and still does not recover the human\-calibrated distribution\.

This limited persona effect is consistent with the answer accuracy results in Appendix[B](https://arxiv.org/html/2606.18709#A2)Table[4](https://arxiv.org/html/2606.18709#A2.T4): persona prompting produces only small aggregate accuracy changes for most models, and current LLMs are often already strong enough on these reading\-comprehension items that even older and smaller models, such as Llama 2\-7B, answer many items correctly\. Such high and weakly differentiated correctness rates leave limited response variation for CTT calibration to exploit\. The visible density near zero in Figure[3](https://arxiv.org/html/2606.18709#S5.F3)is consistent with this limitation: for items answered identically by all synthetic test\-takers in a pool, the discrimination estimate is undefined and assigned to zero\. Thus, answer\-based calibration captures useful behavioral signal, but its discrimination values reflect model\-response structure rather than fully human\-like item behavior\.

#### Key Finding 3

Answer Behavior Contains Signal but Remains Miscalibrated\.Response\-based CTT calibration achievesstronger alignmentwith human item discrimination than direct numerical prediction, showing that LLM answer patterns contain non\-random discrimination\-relevant signal\. However, persona\-conditioned response pools remain highly similar: prompting changes aggregate accuracy only modestly, and many models already answer many items correctly, leaving limited response variation for CTT calibration\. The resulting estimates aresystematically shifted upward, and pooling all model\-persona respondents improves numerical calibration but not rank correlation\. Thus, LLM answer behavior is a useful but miscalibrated proxy rather than a reliable substitute for human pretesting\.

### 5\.3High/Low Discrimination Item Retrieval

Beyond pure correlation evaluation, we examine a direct real\-world application of item screening: whether LLM\-derived discrimination scores support item screening\. For low\-discrimination retrieval, gold and predicted sets contain the bottomk%k\\%of items by human\-calibrated and LLM\-derived discrimination, respectively\. High\-discrimination retrieval is defined analogously using the topk%k\\%\. We report Overlap@kkfork=10,20k=10,20:

Overlap@k=\|𝒮goldk∩𝒮predk\|\|𝒮goldk\|\.\\mathrm\{Overlap\}@k=\\frac\{\|\\mathcal\{S\}^\{k\}\_\{\\mathrm\{gold\}\}\\cap\\mathcal\{S\}^\{k\}\_\{\\mathrm\{pred\}\}\|\}\{\|\\mathcal\{S\}^\{k\}\_\{\\mathrm\{gold\}\}\|\}\.Because the predicted and gold sets have the same size, this metric is equivalent to both precision and recall at the cutoff\. Random selection has expected overlap of 0\.10 atk=10k=10and 0\.20 atk=20k=20\. For readability, Table[3](https://arxiv.org/html/2606.18709#S5.T3)reports retrieval results for 20 representative models, with complete direct\-model retrieval results provided in Appendix[D](https://arxiv.org/html/2606.18709#A4)\.

Table 3:Overlap\-based retrieval of low\- and high\-discrimination items for 20 representative models\. Direct rows use LLM\-predicted discrimination scores, CTT rows use answer\-based calibrated discrimination scores, and Random gives the expected overlap under random selection\.Table[3](https://arxiv.org/html/2606.18709#S5.T3)shows that direct discrimination predictions provide limited screening utility\. Most individual models perform close to random, especially for high\-discrimination retrieval\. Even GPT\-5\.5 retrieves only 19\.0% of the lowest\-discrimination items at @10 and 29\.6% at @20, while direct All\-model falls below random for high\-discrimination retrieval\. Response\-based CTT calibration performs better than aggregate direct prediction, especially for low\-discrimination retrieval\. Among CTT settings, CTT\-All achieves the best Low@10 score, retrieving 36\.7% of the bottom 10% human low\-discrimination items, while CTT\-Baseline achieves the best Low@20 score at 37\.7%\. For high\-discrimination retrieval, CTT\-High achieves the best High@10 score, and CTT\-Low achieves the best High@20 score\. These results indicate that response\-based calibration provides a stronger screening signal after adding the expanded model set, especially for low\-discrimination item retrieval\. However, even the best setting retrieves fewer than half of the target items, so LLM\-derived discrimination remains only a partial proxy for human pretesting\.

Overall, CTT\-calibrated answer behavior provides a stronger item\-screening signal than explicit discrimination judgments, especially for low\-discrimination items\. However, even the best setting retrieves fewer than half of the target items, so LLM\-derived discrimination remains only a partial proxy for human pretesting\.

## 6Difficulty vs\. Discrimination Alignment

This study is motivated by recent work on Human\-AI difficulty alignment, which shows that LLMs struggle to estimate human item difficulty even when they can solve many items correctly\(Liet al\.,[2025a](https://arxiv.org/html/2606.18709#bib.bib1)\)\. Our results extend this question from difficulty to discrimination\. The two settings share a common challenge: general problem\-solving ability does not imply alignment with human assessment behavior\. In both cases, explicit model judgments are weakly aligned with human\-calibrated item statistics, and proficiency prompting does not reliably make models behave like lower\- or higher\-proficiency test\-takers\.

The key difference is that difficulty and discrimination capture different psychometric targets\. Difficulty is an aggregate property: it asks how likely test\-takers are to answer an item correctly\. Discrimination is relational: it asks whether correctness varies systematically with proficiency\. A difficult item may have low discrimination if both strong and weak test\-takers fail, while an easy item may still discriminate if stronger test\-takers answer it correctly more reliably\. Thus, discrimination alignment requires more than estimating an overall probability of success; it requires modeling ability\-conditioned response patterns\.

This distinction also changes how the “curse of knowledge” appears\. In difficulty alignment, model correctness can expose this problem when items that are hard for humans are easy for LLMs\. In discrimination alignment, the same issue becomes a failure to generate ability\-stratified response patterns: current models are often accurate enough that even weak\-student prompting does not reliably produce the errors needed to simulate lower\-proficiency test\-takers\. As a result, answer\-based CTT calibration performs better than direct numerical prediction, but its synthetic response matrix still lacks the proficiency\-conditioned variation needed for human\-like discrimination estimates\.

## 7Future Directions: Student Error Simulation

Our preceding analysis shows that current LLMs remain insufficient for discrimination\-oriented item analysis, even with student personas, because they do not reliably reproduce the structured, ability\-sensitive error patterns underlying human item discrimination\. This suggests that future work may need to move beyond generic proficiency simulation toward models that can explain how and why students make particular errors\. This direction is consistent with recent calls for valid LLM\-based student simulation, which argue that simulated learners should be constrained by explicit epistemic state specifications rather than relying only on generic persona prompts\(Yuanet al\.,[2026](https://arxiv.org/html/2606.18709#bib.bib83)\)\. Recent work on LLM\-based cognitive student modeling and student error simulation offers a promising direction, including simulations of misconception\-driven reasoning, realistic student\-like errors, and learning trajectories\(Sonkaret al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib75); Miroyanet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib49); Ross and Andreas,[2025](https://arxiv.org/html/2606.18709#bib.bib76)\)\. For reading\-comprehension assessment, this idea could be explored by defining domain\-specific error rules, such as lexical matching, failure to locate textual evidence, incorrect inference integration, reference confusion, or attraction to plausible distractors\. Beyond final\-answer correctness, future work could also analyze intermediate reasoning traces using cognitively grounded frameworks for model reasoning analysis, which may help identify whether simulated errors arise from plausible comprehension processes\. Although such an extension would not replace human pretesting or guarantee improved discrimination estimates, it could provide an interpretable auxiliary signal for item review by identifying whether an item elicits plausible, proficiency\-relevant errors\.

## 8Conclusion

This work evaluates whether current LLMs can estimate human item discrimination in reading comprehension assessment\. Across 42 LLMs on the Cambridge Multiple\-Choice Questions Reading Dataset, direct discrimination prediction shows weak alignment with human\-calibrated discrimination\. Response\-based CTT calibration provides a stronger non\-random signal, but remains substantially miscalibrated\. These results suggest that item discrimination is a demanding test of Human\-AI assessment alignment: LLMs may solve items and produce plausible item\-quality judgments, yet still fail to reliably capture how items distinguish learners of different proficiency levels\. More broadly, item discrimination cannot be inferred from LLM competence alone; reliable discrimination analysis still requires human response data or substantially more faithful student\-error simulation\.

## Limitations

This study has two main limitations\. First, our experiments are limited to reading\-comprehension assessment\. We adopt this scope because public datasets that pair item content with human\-calibrated discrimination statistics are scarce\. Thus, our results should be viewed as evidence from a rare available benchmark rather than a complete characterization of LLM\-based discrimination estimation across assessment settings\.

Second, we evaluate LLMs only in zero\-shot prompting settings\. This design may understate the performance that could be achieved with few\-shot demonstrations, fine\-tuning, or access to calibration data\. We choose the zero\-shot setting intentionally, however, because it tests whether LLMs can infer item discrimination from item content and general assessment knowledge, rather than learning to reproduce the distribution of an existing calibration dataset\.

## References

- M\. Abdin, J\. Aneja, H\. Awadalla, A\. Awadallah, A\. A\. Awan, N\. Bach, A\. Bahree, A\. Bakhtiari, J\. Bao, H\. Behl, A\. Benhaim, M\. Bilenko, J\. Bjorck, S\. Bubeck, M\. Cai, Q\. Cai, V\. Chaudhary, D\. Chen, D\. Chen, W\. Chen, Y\. Chen, Y\. Chen, H\. Cheng, P\. Chopra, X\. Dai, M\. Dixon, R\. Eldan, V\. Fragoso, J\. Gao, M\. Gao, M\. Gao, A\. Garg, A\. D\. Giorno, A\. Goswami, S\. Gunasekar, E\. Haider, J\. Hao, R\. J\. Hewett, W\. Hu, J\. Huynh, D\. Iter, S\. A\. Jacobs, M\. Javaheripi, X\. Jin, N\. Karampatziakis, P\. Kauffmann, M\. Khademi, D\. Kim, Y\. J\. Kim, L\. Kurilenko, J\. R\. Lee, Y\. T\. Lee, Y\. Li, Y\. Li, C\. Liang, L\. Liden, X\. Lin, Z\. Lin, C\. Liu, L\. Liu, M\. Liu, W\. Liu, X\. Liu, C\. Luo, P\. Madan, A\. Mahmoudzadeh, D\. Majercak, M\. Mazzola, C\. C\. T\. Mendes, A\. Mitra, H\. Modi, A\. Nguyen, B\. Norick, B\. Patra, D\. Perez\-Becker, T\. Portet, R\. Pryzant, H\. Qin, M\. Radmilac, L\. Ren, G\. de Rosa, C\. Rosset, S\. Roy, O\. Ruwase, O\. Saarikivi, A\. Saied, A\. Salim, M\. Santacroce, S\. Shah, N\. Shang, H\. Sharma, Y\. Shen, S\. Shukla, X\. Song, M\. Tanaka, A\. Tupini, P\. Vaddamanu, C\. Wang, G\. Wang, L\. Wang, S\. Wang, X\. Wang, Y\. Wang, R\. Ward, W\. Wen, P\. Witte, H\. Wu, X\. Wu, M\. Wyatt, B\. Xiao, C\. Xu, J\. Xu, W\. Xu, J\. Xue, S\. Yadav, F\. Yang, J\. Yang, Y\. Yang, Z\. Yang, D\. Yu, L\. Yuan, C\. Zhang, C\. Zhang, J\. Zhang, L\. L\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, and X\. Zhou \(2024a\)Phi\-3 technical report: a highly capable language model locally on your phone\.External Links:2404\.14219,[Link](https://arxiv.org/abs/2404.14219)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- M\. Abdin, J\. Aneja, H\. Behl, S\. Bubeck, R\. Eldan, S\. Gunasekar, M\. Harrison, R\. J\. Hewett, M\. Javaheripi, and P\. Kauffmann \(2024b\)Phi\-4 technical report\.arXiv preprint arXiv:2412\.08905\.Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- A\. Abouelenin, A\. Ashfaq, A\. Atkinson, H\. Awadalla, N\. Bach, J\. Bao, A\. Benhaim, M\. Cai, V\. Chaudhary, C\. Chen,et al\.\(2025\)Phi\-4\-mini technical report: compact yet powerful multimodal language models via mixture\-of\-loras\.arXiv preprint arXiv:2503\.01743\.Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- Text\-based question difficulty prediction: a systematic review of automatic approaches\.International Journal of Artificial Intelligence in Education34\(3\),pp\. 862–914\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p3.1)\.
- Anthropic \(2024\)Model card addendum: claude 3\.5 haiku and upgraded claude 3\.5 sonnet\.Note:[https://assets\.anthropic\.com/m/1cd9d098ac3e6467/original/Claude\-3\-Model\-Card\-October\-Addendum\.pdf](https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf)Accessed: 2026\-06\-12Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- Anthropic \(2025\)Claude 3\.7 sonnet system card\.Note:[https://www\.anthropic\.com/claude\-3\-7\-sonnet\-system\-card](https://www.anthropic.com/claude-3-7-sonnet-system-card)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- L\. Benedetto \(2023\)A quantitative study of nlp approaches to question difficulty estimation\.InInternational conference on artificial intelligence in education,pp\. 428–434\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- K\. Chrysafiadi and M\. Virvou \(2013\)Student modeling approaches: a literature review for the last decade\.Expert Systems with Applications40\(11\),pp\. 4715–4729\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, and E\. Rosen \(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.arXiv preprint arXiv:2507\.06261\.Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- A\. T\. Corbett and J\. R\. Anderson \(1994\)Knowledge tracing: modeling the acquisition of procedural knowledge\.User modeling and user\-adapted interaction4\(4\),pp\. 253–278\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- L\. Crocker and J\. Algina \(1986\)Introduction to classical and modern test theory\.\.ERIC\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p1.1),[§1](https://arxiv.org/html/2606.18709#S1.p3.1),[§1](https://arxiv.org/html/2606.18709#S1.p5.1)\.
- C\. DeMars \(2010\)Item response theory\.Oxford University Press\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§A\.2](https://arxiv.org/html/2606.18709#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- R\. L\. Ebel \(1972\)Essentials of educational measurement\.\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p1.1)\.
- D\. R\. Eignor \(2013\)The standards for educational and psychological testing\.\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p1.1)\.
- S\. E\. Embretson and S\. P\. Reise \(2025\)Item response theory: foundations for psychologists and social scientists\.Routledge\.Cited by:[§A\.2](https://arxiv.org/html/2606.18709#A1.SS2.p1.1),[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- C\. Fan, M\. Li, L\. Sun, and T\. Zhou \(2025\)Missing premise exacerbates overthinking: are reasoning models losing critical thinking skill?\.arXiv preprint arXiv:2504\.06514\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1)\.
- W\. Feng, P\. Tran, S\. Sireci, and A\. S\. Lan \(2025\)Reasoning and sampling\-augmented mcq difficulty prediction via llms\.InInternational Conference on Artificial Intelligence in Education,pp\. 31–45\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- Google DeepMind \(2024\)Introducing gemini 2\.0: our new ai model for the agentic era\.Note:[https://blog\.google/innovation\-and\-ai/models\-and\-research/google\-deepmind/google\-gemini\-ai\-update\-december\-2024/](https://blog.google/innovation-and-ai/models-and-research/google-deepmind/google-gemini-ai-update-december-2024/)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- A\. C\. Graesser, P\. Chipman, B\. C\. Haynes, and A\. Olney \(2005\)AutoTutor: an intelligent tutoring system with mixed\-initiative dialogue\.IEEE Transactions on Education48\(4\),pp\. 612–618\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, and A\. Vaughan \(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- T\. M\. Haladyna and M\. C\. Rodriguez \(2013\)Developing and validating test items\.Routledge\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p1.1),[§1](https://arxiv.org/html/2606.18709#S1.p2.1)\.
- R\. K\. Hambleton, H\. Swaminathan, and H\. J\. Rogers \(1991\)Fundamentals of item response theory\.Vol\.2,Sage\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- S\. Han, F\. Rijmen, A\. A\. Boykin, and S\. Lottridge \(2025\)Leveraging fine\-tuned large language models in item parameter prediction\.InProceedings of the Artificial Intelligence in Measurement and Education Conference \(AIME\-Con\): Full Papers,J\. Wilson, C\. Ormerod, and M\. Beiting Parrish \(Eds\.\),Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States,pp\. 250–264\.External Links:[Link](https://aclanthology.org/2025.aimecon-main.27/),ISBN 979\-8\-218\-84228\-4Cited by:[§A\.2](https://arxiv.org/html/2606.18709#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- A\. Hayakawa and H\. Saggion \(2024\)Can llms solve reading comprehension tests as second language learners?\.InFourth Workshop on Knowledge\-infused Learning,Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- J\. He, L\. Peng, B\. Sun, L\. Yu, and Y\. Zhang \(2021\)Automatically predict question difficulty for reading comprehension exercises\.In2021 ieee 33rd international conference on tools with artificial intelligence \(ictai\),pp\. 1398–1402\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- F\. Hsu, H\. Lee, T\. Chang, and Y\. Sung \(2018\)Automated estimation of item difficulty for multiple\-choice tests: an application of word embedding techniques\.Information Processing & Management54\(6\),pp\. 969–984\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1)\.
- Z\. Huang, Q\. Liu, E\. Chen, H\. Zhao, M\. Gao, S\. Wei, Y\. Su, and G\. Hu \(2017\)Question difficulty prediction for reading problems in standard tests\.InProceedings of the AAAI conference on artificial intelligence,Vol\.31\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1)\.
- A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, and A\. Radford \(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7b\.External Links:2310\.06825,[Link](https://arxiv.org/abs/2310.06825)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- T\. Käser and G\. Alexandron \(2024\)Simulated learners in educational technology: a systematic literature review and a turing\-like test\.International Journal of Artificial Intelligence in Education34\(2\),pp\. 545–585\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- U\. Lee, S\. Lee, J\. Koh, Y\. Jeong, H\. Jung, G\. Byun, Y\. Lee, J\. Moon, J\. Lim, and H\. Kim \(2023\)Generative agent for teacher training: designing educational problem\-solving simulations with large language model\-based agents for pre\-service teachers\.InNeurIPS’23 Workshop on Generative AI for Education \(GAIED\),Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- B\. Li, B\. Kim, and Z\. Wang \(2026\)QuestBench: can llms ask the right question to acquire information in reasoning tasks?\.Advances in Neural Information Processing Systems38\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1)\.
- M\. Li, H\. Chen, Y\. Xiao, J\. Chen, H\. Jiao, and T\. Zhou \(2025a\)Can llms estimate student struggles? human\-ai difficulty alignment with proficiency simulation for item difficulty prediction\.arXiv preprint arXiv:2512\.18880\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p3.1),[§1](https://arxiv.org/html/2606.18709#S1.p5.1),[§6](https://arxiv.org/html/2606.18709#S6.p1.1)\.
- M\. Li, H\. Jiao, T\. Zhou, N\. Zhang, S\. Peters, and R\. W\. Lissitz \(2025b\)Item difficulty modeling using fine\-tuned small and large language models\.Educational and Psychological Measurement85\(6\),pp\. 1065–1090\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.18709#S1.p3.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)Teaching models to express their uncertainty in words\.arXiv preprint arXiv:2205\.14334\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1)\.
- Y\. Liu, S\. Bhandari, and Z\. A\. Pardos \(2025\)Leveraging llm respondents for item evaluation: a psychometric analysis\.British Journal of Educational Technology56\(3\),pp\. 1028–1052\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- A\. Liusie, V\. Raina, A\. Mullooly, K\. Knill, and M\. J\. Gales \(2023\)Analysis of the cambridge multiple\-choice questions reading dataset with a focus on candidate response distribution\.arXiv preprint arXiv:2306\.13047\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p4.1)\.
- F\. M\. Lord and M\. R\. Novick \(2008\)Statistical theories of mental test scores\.IAP\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p1.1)\.
- F\. M\. Lord \(2012\)Applications of item response theory to practical testing problems\.Routledge\.Cited by:[§A\.2](https://arxiv.org/html/2606.18709#A1.SS2.p1.1),[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- A\. Loukina, S\. Yoon, J\. Sakano, Y\. Wei, and K\. Sheehan \(2016\)Textual complexity as a predictor of difficulty of listening items in language proficiency tests\.InProceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers,pp\. 3245–3253\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- H\. Maeda \(2025\)Field\-testing multiple\-choice questions with ai examinees: english grammar items\.Educational and Psychological Measurement85\(2\),pp\. 221–244\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p5.1)\.
- J\. M\. Markel, S\. G\. Opferman, J\. A\. Landay, and C\. Piech \(2023\)Gpteach: interactive ta training with gpt\-based students\.InProceedings of the tenth acm conference on learning@ scale,pp\. 226–236\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- R\. J\. McCowan and S\. C\. McCowan \(1999\)Item analysis for criterion\-referenced tests\.\.Online Submission\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p1.1),[§1](https://arxiv.org/html/2606.18709#S1.p3.1)\.
- Meta AI \(2024a\)Llama 3\.2: revolutionizing edge ai and vision with open, customizable models\.Note:[https://ai\.meta\.com/blog/llama\-3\-2\-connect\-2024\-vision\-edge\-mobile\-devices/](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)Accessed: 2026\-06\-12Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- Meta AI \(2024b\)Meta llama 3\.3 70b instruct model card\.Note:[https://huggingface\.co/meta\-llama/Llama\-3\.3\-70B\-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)Accessed: 2026\-06\-12Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- M\. Miroyan, R\. Niousha, J\. E\. Gonzalez, G\. Ranade, and N\. Norouzi \(2025\)ParaStudent: generating and evaluating realistic student code by teaching llms to struggle\.arXiv preprint arXiv:2507\.12674\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§7](https://arxiv.org/html/2606.18709#S7.p1.1)\.
- T\. Moses \(2017\)A review of developments and applications in item analysis\.Advancing human assessment: The methodological, psychological and policy contributions of ETS,pp\. 19–46\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p1.1),[§1](https://arxiv.org/html/2606.18709#S1.p2.1),[§1](https://arxiv.org/html/2606.18709#S1.p3.1),[§1](https://arxiv.org/html/2606.18709#S1.p5.1)\.
- A\. Mullooly, Ø\. Andersen, L\. Benedetto, P\. Buttery, A\. Caines, M\. J\. Gales, Y\. Karatay, K\. Knill, A\. Liusie, and V\. Raina \(2023\)The cambridge multiple\-choice questions reading dataset\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.18709#S3.SS1.p1.4)\.
- B\. Nguyen, T\. Du, M\. Yu, L\. Angrave, and M\. Jiang \(2025\)QG\-SMS: enhancing test item analysis via student modeling and simulation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 26152–26168\.External Links:[Link](https://aclanthology.org/2025.acl-long.1268/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1268),ISBN 979\-8\-89176\-251\-0Cited by:[§A\.2](https://arxiv.org/html/2606.18709#A1.SS2.p1.1)\.
- T\. OLMo, P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan,et al\.\(2024\)2 olmo 2 furious\.arXiv preprint arXiv:2501\.00656\.Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- OpenAI \(2024a\)GPT\-4o mini: advancing cost\-efficient intelligence\.Note:[https://openai\.com/index/gpt\-4o\-mini\-advancing\-cost\-efficient\-intelligence/?utm\_source=chatgpt\.com](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/?utm_source=chatgpt.com)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- OpenAI \(2024b\)Introducing apis for gpt\-3\.5 turbo and whisper\.Note:[https://openai\.com/index/introducing\-chatgpt\-and\-whisper\-apis/](https://openai.com/index/introducing-chatgpt-and-whisper-apis/)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- OpenAI \(2025a\)GPT\-5 system card\.Note:[https://openai\.com/index/gpt\-5\-system\-card/](https://openai.com/index/gpt-5-system-card/)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- OpenAI \(2025b\)Introducing gpt\-4\.1 in the api\.Note:[https://openai\.com/index/gpt\-4\-1/](https://openai.com/index/gpt-4-1/)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- OpenAI \(2025c\)Introducing openai o3 and o4\-mini\.Note:[https://openai\.com/index/introducing\-o3\-and\-o4\-mini/](https://openai.com/index/introducing-o3-and-o4-mini/)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- OpenAI \(2026\)GPT\-5\.5 System Card\.External Links:[Link](https://openai.com/index/gpt-5-5-system-card/)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- J\. Park, S\. Park, H\. Won, and K\. Kim \(2024\)Large language models are students at various levels: zero\-shot question difficulty estimation\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 8157–8177\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th annual acm symposium on user interface software and technology,pp\. 1–22\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- S\. Peters, N\. Zhang, H\. Jiao, M\. Li, T\. Zhou, and R\. Lissitz \(2025\)Text\-based approaches to item difficulty modeling in large\-scale assessments: a systematic review\.arXiv preprint arXiv:2509\.23486\.Cited by:[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- Qwen Team \(2024\)Qwen2\.5: a party of foundation models\.External Links:[Link](https://qwenlm.github.io/blog/qwen2.5/)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- Qwen Team \(2026\)Qwen3\.5: towards native multimodal agents\.Note:[https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5)Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- G\. Rasch \(1993\)Probabilistic models for some intelligence and attainment tests\.\.ERIC\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- A\. Rogoz and R\. T\. Ionescu \(2024\)Unibucllm: harnessing llms for automated prediction of item difficulty and response time for multiple\-choice questions\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),pp\. 493–502\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- A\. Ross and J\. Andreas \(2025\)Learning to make mistakes: modeling incorrect student thinking and key errors\.arXiv preprint arXiv:2510\.11502\.Cited by:[§7](https://arxiv.org/html/2606.18709#S7.p1.1)\.
- A\. Ross, M\. Srivastava, J\. Blanchard, and J\. Andreas \(2025\)Modeling student learning with 3\.8 million program traces\.arXiv preprint arXiv:2510\.05056\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- M\. Sano \(2015\)Automated capturing of psycho\-linguistic features in reading assessment text\.Inannual meeting of the National Council on Measurement in Education, Chicago, IL,Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- A\. Säuberli, D\. Frassinelli, and B\. Plank \(2025a\)Do LLMs give psychometrically plausible responses in educational assessments?\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 266–278\.External Links:[Link](https://aclanthology.org/2025.bea-1.21/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.21),ISBN 979\-8\-89176\-270\-1Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- A\. Säuberli, D\. Frassinelli, and B\. Plank \(2025b\)Do llms give psychometrically plausible responses in educational assessments?\.arXiv preprint arXiv:2506\.09796\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p4.1),[§1](https://arxiv.org/html/2606.18709#S1.p5.1)\.
- S\. Sonkar, X\. Chen, N\. Liu, R\. G\. Baraniuk, and M\. Sachan \(2024\)Llm\-based cognitive models of students with misconceptions\.arXiv preprint arXiv:2410\.12294\.Cited by:[§7](https://arxiv.org/html/2606.18709#S7.p1.1)\.
- K\. A\. Srivatsa, K\. Maurya, and E\. Kochmar \(2025\)Can LLMs reliably simulate real students’ abilities in mathematics and reading comprehension?\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 988–1001\.External Links:[Link](https://aclanthology.org/2025.bea-1.75/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.75),ISBN 979\-8\-89176\-270\-1Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.
- K\. Stasaski, G\. H\. Yang, and M\. A\. Hearst \(2020\)More diverse dialogue datasets via diversity\-informed data collection\.InProceedings of the 58th annual meeting of the association for computational linguistics,pp\. 4958–4968\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- K\. Tian, E\. Mitchell, A\. Zhou, A\. Sharma, R\. Rafailov, H\. Yao, C\. Finn, and C\. D\. Manning \(2023\)Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine\-tuned with human feedback\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 5433–5442\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, and S\. Bhosale \(2023\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- K\. VanLehn, S\. Ohlsson, and R\. Nason \(1994\)Applications of simulated students: an exploration\.Journal of artificial intelligence in education5,pp\. 135–135\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1)\.
- H\. Veeramani, S\. Thapa, N\. B\. Shankar, and A\. Alwan \(2024\)Large language model\-based pipeline for item difficulty and response time estimation for educational assessments\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),pp\. 561–566\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p3.1)\.
- V\. Yaneva, K\. North, P\. Baldwin, L\. A\. Ha, S\. Rezayi, Y\. Zhou, S\. R\. Choudhury, P\. Harik, and B\. Clauser \(2024\)Findings from the first shared task on automated prediction of difficulty and response time for multiple\-choice questions\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),pp\. 470–482\.Cited by:[§1](https://arxiv.org/html/2606.18709#S1.p3.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, and C\. Lv \(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§3\.3](https://arxiv.org/html/2606.18709#S3.SS3.SSS0.Px1.p1.1)\.
- Z\. Yuan, Y\. Xiao, M\. Li, W\. Xuan, R\. Tong, M\. Diab, and T\. Mitchell \(2026\)Towards valid student simulation with large language models\.arXiv preprint arXiv:2601\.05473\.Cited by:[§A\.3](https://arxiv.org/html/2606.18709#A1.SS3.p1.1),[§7](https://arxiv.org/html/2606.18709#S7.p1.1)\.
- L\. Zotos, H\. van Rijn, and M\. Nissim \(2024\)Are you doubtful? oh, it might be difficult then\! exploring the use of model uncertainty for question difficulty estimation\.arXiv preprint arXiv:2412\.11831\.Cited by:[§A\.1](https://arxiv.org/html/2606.18709#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.18709#S2.p1.1)\.

## Appendix ARelated Work

### A\.1Item Difficulty Prediction

Item difficulty is central to educational assessment, informing test construction, adaptive testing, and learning personalization\. Conventionally, it is estimated from examinee response data under classical test theory \(CTT\) or item response theory \(IRT\), using measures such as the proportion of correct responses\(Hambletonet al\.,[1991](https://arxiv.org/html/2606.18709#bib.bib24); DeMars,[2010](https://arxiv.org/html/2606.18709#bib.bib23); Hsuet al\.,[2018](https://arxiv.org/html/2606.18709#bib.bib22)\)\. Since field testing is costly and hard to scale, prior work has explored text\-based difficulty prediction using various item\-level features, as well as machine learning and deep learning models\(Sano,[2015](https://arxiv.org/html/2606.18709#bib.bib26); Loukinaet al\.,[2016](https://arxiv.org/html/2606.18709#bib.bib25); Huanget al\.,[2017](https://arxiv.org/html/2606.18709#bib.bib27); Devlinet al\.,[2019](https://arxiv.org/html/2606.18709#bib.bib29); Heet al\.,[2021](https://arxiv.org/html/2606.18709#bib.bib30); Benedetto,[2023](https://arxiv.org/html/2606.18709#bib.bib21)\)\. More recent studies adopt LLMs to predict difficulty directly or to derive auxiliary signals such as reasoning traces and uncertainty estimates\(Rogoz and Ionescu,[2024](https://arxiv.org/html/2606.18709#bib.bib33); Zotoset al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib35); Liet al\.,[2025b](https://arxiv.org/html/2606.18709#bib.bib32); Fenget al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib34)\)\. However, difficulty alone offers only a partial view of item quality, since items that appear textually complex may not necessarily discriminate readers’ level\. Moreover, prior work suggests that LLMs may conflate item difficulty with their own likelihood of failure\(Linet al\.,[2022](https://arxiv.org/html/2606.18709#bib.bib52); Tianet al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib54); Liet al\.,[2026](https://arxiv.org/html/2606.18709#bib.bib51); Fanet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib55)\)\. We therefore argue that difficulty is important but insufficient, and examine whether LLMs can also capture item discrimination, i\.e\., how well items distinguish different student levels\.

### A\.2Item Discrimination Prediction

Psychometrically, item discrimination is operationalized through item\-total correlations in CTT or the IRT slope parameter\(DeMars,[2010](https://arxiv.org/html/2606.18709#bib.bib23); Lord,[2012](https://arxiv.org/html/2606.18709#bib.bib45); Embretson and Reise,[2025](https://arxiv.org/html/2606.18709#bib.bib46)\)\. Unlike difficulty, which summarizes average challenge, discrimination requires modeling response variation across ability groups, making it harder to infer from item text alone and less studied in prior work\. Recent work predicts both difficulty and discrimination from item text and structured attributes, finding discrimination harder to model and more dependent on explicit metadata\(Hanet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib61)\)\. LLM\-based item analysis has also begun to evaluate discrimination through simulated student responses\(Nguyenet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib62)\)\. We therefore use human item discrimination as a test of LLM\-based item understanding: models must identify both whether an item is hard, and whether it meaningfully differentiates the skill level of human students\.

### A\.3LLM\-Student Simulation

Student simulation supports tutoring, learning\-by\-teaching, and formative evaluation, traditionally relying on hand\-crafted misconception models, rule\-based dialogue policies, or human role\-play data\(VanLehnet al\.,[1994](https://arxiv.org/html/2606.18709#bib.bib38); Chrysafiadi and Virvou,[2013](https://arxiv.org/html/2606.18709#bib.bib37); Graesseret al\.,[2005](https://arxiv.org/html/2606.18709#bib.bib39); Stasaskiet al\.,[2020](https://arxiv.org/html/2606.18709#bib.bib40)\)\. LLMs make simulation more scalable through role prompting, fine\-tuning, and agentic designs that emulate diverse learner profiles, sometimes combined with knowledge tracing or IRT\-style ability representations\(Markelet al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib47); Leeet al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib48); Parket al\.,[2023](https://arxiv.org/html/2606.18709#bib.bib41); Käser and Alexandron,[2024](https://arxiv.org/html/2606.18709#bib.bib42); Corbett and Anderson,[1994](https://arxiv.org/html/2606.18709#bib.bib43); Rasch,[1993](https://arxiv.org/html/2606.18709#bib.bib44); Lord,[2012](https://arxiv.org/html/2606.18709#bib.bib45); Embretson and Reise,[2025](https://arxiv.org/html/2606.18709#bib.bib46); Miroyanet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib49); Rosset al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib50)\)\. For assessment, recent work uses heterogeneous or ability\-aligned LLMs to generate synthetic response matrices and estimate item parameters through IRT\-style calibration\(Parket al\.,[2024](https://arxiv.org/html/2606.18709#bib.bib58); Liuet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib56)\)\. However, prior studies caution that fluent student\-like responses or lower accuracy do not ensure behavioral fidelity, as LLMs may misalign with human error distributions, item facility, distractor choices, or ability\-specific response patterns\(Hayakawa and Saggion,[2024](https://arxiv.org/html/2606.18709#bib.bib57); Säuberliet al\.,[2025a](https://arxiv.org/html/2606.18709#bib.bib60); Srivatsaet al\.,[2025](https://arxiv.org/html/2606.18709#bib.bib59)\)\. Recent work further argues that valid LLM\-based student simulation should prioritize epistemic fidelity over surface\-level realism, since highly capable LLMs may exhibit a competence paradox when asked to emulate partially knowledgeable learners\(Yuanet al\.,[2026](https://arxiv.org/html/2606.18709#bib.bib83)\)\. We therefore evaluate LLM\-student simulation by alignment with human item discrimination, asking whether models succeed and fail in ways that preserve how items differentiate the students’ skill level\.

## Appendix BModel Answer Accuracy

Table[4](https://arxiv.org/html/2606.18709#A2.T4)reports the answer accuracy of each model under the no\-persona baseline and three proficiency\-conditioned student personas\. Overall, stronger proprietary models achieve consistently high accuracy, with GPT\-5 and GPT\-5\.5 reaching the highest average correct rates\. In contrast, smaller open\-weight models show substantially lower answer accuracy, suggesting that their synthetic response patterns may be constrained by limited reading\-comprehension capability\. Across most models, the weak, medium, and strong personas only introduce small changes in accuracy, indicating that persona prompting does not reliably induce large proficiency\-separated performance gaps at the aggregate answer level\.

Table 4:Model answer accuracy under different simulated personas\. Baseline denotes the no\-persona setting, Low/Medium/High denote low\-, medium\-, and high\-proficiency student personas, and Average reports the mean accuracy across these four settings\. The results show that persona prompting produces only small aggregate accuracy changes for most models\.
## Appendix CPrompt Templates

We used two prompt templates: one for direct discrimination prediction and one for response\-based CTT calibration\. For persona\-based runs, the corresponding system message was prepended to the user prompt\. For the no\-persona setting, no system message was used\. Figure[4](https://arxiv.org/html/2606.18709#A3.F4)shows the prompt used to directly estimate item discrimination, where the gold answer was provided\. Figure[5](https://arxiv.org/html/2606.18709#A3.F5)shows the prompt used to collect model answers for response\-based CTT calibration, where the gold answer was withheld\.

Prompt Template for Direct Discrimination Prediction\(System Prompt\)No persona: \[empty\]Low\-proficiency persona:\{internallinenumbers\*\}Suppose you are a student taking the Cambridge English Test\. You are a weak student with low\-level English proficiency\.Medium persona:\{internallinenumbers\*\}Suppose you are a student taking the Cambridge English Test\. You are an average student with medium\-level English proficiency\.High\-proficiency persona:\{internallinenumbers\*\}Suppose you are a student taking the Cambridge English Test\. You are a good student with high\-level English proficiency\.\(User Prompt\)\{internallinenumbers\*\}Analyze the discrimination value of the question\. The discrimination value is the point\-biserial correlation between selecting the correct answer and the test\-taker’s total test score, ranging from \-1 to 1\. A value near 0 means the item cannot distinguish between proficiency levels; a positive value means higher\-proficiency students are more likely to answer correctly; a negative value means the item discriminates in the wrong direction\. Predict the discrimination value and provide the final value in \\boxed\{\.\.\.\}:Below is a Multiple Choice Question for Reading Comprehension\.Question: \{QUESTION\}Options:\(A\) \{OPTION\_A\}\(B\) \{OPTION\_B\}\(C\) \{OPTION\_C\}\(D\) \{OPTION\_D\}Reference Passage: \{REFERENCE\_PASSAGE\}\{internallinenumbers\*\}Common European Framework of Reference for Languages \(CEFR\) Level: \{CEFR\_LEVEL\}Correct answer: \{CORRECT\_ANSWER\}Figure 4:Prompt template for direct discrimination prediction\. The gold answer was provided only in this setting\.Prompt Template for Direct Answer Prediction\(System Prompt\)No persona: \[empty\]Low\-proficiency persona:\{internallinenumbers\*\}Suppose you are a student taking the Cambridge English Test\. You are a weak student with low\-level English proficiency\.Medium persona:\{internallinenumbers\*\}Suppose you are a student taking the Cambridge English Test\. You are an average student with medium\-level English proficiency\.High\-proficiency persona:\{internallinenumbers\*\}Suppose you are a student taking the Cambridge English Test\. You are a good student with high\-level English proficiency\.\(User Prompt\)\{internallinenumbers\*\}Answer the question below step by step, and provide the final answer in \\boxed\{\.\.\.\}:Below is a Multiple Choice Question for Reading Comprehension\.Question: \{QUESTION\}Options:\(A\) \{OPTION\_A\}\(B\) \{OPTION\_B\}\(C\) \{OPTION\_C\}\(D\) \{OPTION\_D\}Reference Passage: \{REFERENCE\_PASSAGE\}\{internallinenumbers\*\}Common European Framework of Reference for Languages \(CEFR\) Level: \{CEFR\_LEVEL\}Figure 5:Prompt template for direct answer prediction used in response\-based CTT calibration\. The correct answer was not included in the prompt\.
## Appendix DComplete Results

This appendix provides the complete model\-level results for all 42 evaluated LLMs\. The main text reports 20 representative models for readability, while this appendix includes the full set of proprietary and open\-weight models used in our experiments\. The additional results are organized as follows\. Figure[6](https://arxiv.org/html/2606.18709#A4.F6)extends the distributional analysis in Figure[2](https://arxiv.org/html/2606.18709#S4.F2)to all 42 evaluated models\. Table[5](https://arxiv.org/html/2606.18709#A4.T5)reports the complete direct discrimination prediction results\. Table[6](https://arxiv.org/html/2606.18709#A4.T6)reports the complete high\- and low\-discrimination retrieval results\.

![Refer to caption](https://arxiv.org/html/2606.18709v1/figures/main1_all.png)Figure 6:Complete distributions of human\-calibrated and LLM\-predicted item discrimination values for all 42 models\. The x\-axis shows the item discrimination value\. The y\-axis shows the estimated density of items at each discrimination value, where higher density indicates that more items fall within that range\.Spearman↑\\uparrowRMSE↓\\downarrowModelBaselineLowMediumHighAll\-personaBaselineLowMediumHighAll\-personaLlama 2\-7B0\.006\-0\.024\-0\.027\-0\.007\-0\.0130\.5540\.4230\.4230\.4460\.462Llama 2\-13B0\.0190\.010\-0\.051\-0\.011\-0\.0080\.4200\.4210\.3790\.3860\.402Llama 3\.1\-8B\-0\.147∗∗∗\-0\.121∗∗∗\-0\.123∗∗∗\-0\.136∗∗∗\-0\.132∗∗∗0\.2890\.3320\.2570\.3390\.304Llama 3\.2\-1B\-0\.100∗∗\-0\.0050\.085∗\-0\.016\-0\.0090\.5710\.7100\.6150\.6540\.637Llama 3\.2\-3B\-0\.006\-0\.0050\.0690\.0350\.0230\.4040\.4710\.4130\.4280\.429Llama 3\.3\-70B\-0\.080∗\-0\.123∗∗∗\-0\.110∗∗\-0\.103∗∗\-0\.104∗∗∗0\.2210\.2350\.2290\.2870\.243Mistral\-7B\-v0\.3\-0\.0040\.0070\.004\-0\.095∗∗\-0\.0220\.3590\.3320\.2630\.3860\.335OLMo 2\-7B\-0\.087∗0\.005\-0\.110∗∗\-0\.125∗∗∗\-0\.079∗∗∗0\.3650\.4500\.3280\.3790\.380OLMo 2\-13B0\.0020\.011\-0\.057\-0\.091∗\-0\.0340\.2180\.2340\.2620\.2810\.249Phi\-3\-mini0\.0010\.077∗0\.0010\.0270\.0270\.3900\.3520\.3470\.3710\.365Phi\-3\.5\-mini\-0\.032\-0\.079∗\-0\.082∗\-0\.073∗\-0\.066∗∗∗0\.5430\.5160\.5450\.5830\.547Phi\-4\-mini\-0\.029\-0\.016\-0\.0500\.007\-0\.0220\.3740\.3750\.3160\.3230\.347Phi\-4\-0\.133∗∗∗\-0\.018\-0\.073∗0\.012\-0\.053∗∗0\.1600\.1560\.1610\.1710\.162Qwen2\.5\-0\.5B\-0\.0360\.083∗0\.078∗\-0\.0090\.0290\.8240\.6510\.5780\.6620\.679Qwen2\.5\-1\.5B\-0\.006\-0\.0170\.0450\.0230\.0110\.7470\.7080\.6830\.7060\.711Qwen2\.5\-3B\-0\.011\-0\.0440\.017\-0\.014\-0\.0130\.4560\.4970\.4440\.5200\.479Qwen2\.5\-7B\-0\.059\-0\.035\-0\.089∗\-0\.093∗∗\-0\.069∗∗0\.1720\.1860\.1740\.1940\.181Qwen2\.5\-14B\-0\.011\-0\.020\-0\.042\-0\.062\-0\.0340\.2050\.2330\.2230\.2400\.225Qwen2\.5\-32B\-0\.080∗\-0\.152∗∗∗\-0\.127∗∗∗\-0\.086∗\-0\.111∗∗∗0\.2170\.2040\.1840\.2300\.209Qwen3\-0\.6B0\.010\-0\.0390\.111∗∗0\.0070\.0220\.5340\.5610\.5220\.5560\.543Qwen3\-1\.7B0\.013\-0\.076∗\-0\.0320\.037\-0\.0140\.2930\.2340\.2220\.2250\.243Qwen3\-4B\-0\.020\-0\.140∗∗∗\-0\.075∗\-0\.117∗∗∗\-0\.088∗∗∗0\.1940\.1580\.1560\.1810\.172Qwen3\-8B0\.0410\.0210\.0200\.0370\.0300\.1570\.1560\.1520\.1660\.158Qwen3\-14B\-0\.017\-0\.0610\.001\-0\.051\-0\.0320\.1660\.1830\.1820\.2200\.188Qwen3\-32B\-0\.043\-0\.081∗0\.002\-0\.049\-0\.043∗0\.2110\.1890\.1740\.1980\.193Qwen3\-235B\-A22B\-0\.0060\.034\-0\.060\-0\.037\-0\.0170\.1480\.1470\.1480\.1610\.151Qwen3\.5\-9B0\.000\-0\.001\-0\.047\-0\.062\-0\.0270\.1750\.1940\.1790\.1700\.180Qwen3\.5\-27B\-0\.107∗∗0\.0130\.077∗\-0\.105∗∗\-0\.0300\.1710\.1780\.1710\.1890\.177Qwen3\.5\-122B\-A10B\-0\.0580\.0480\.005\-0\.058\-0\.0160\.2500\.2450\.2240\.2680\.247Claude 3\.5 Haiku\-0\.116∗∗\-0\.144∗∗∗\-0\.134∗∗∗\-0\.148∗∗∗\-0\.136∗∗∗0\.1980\.2170\.1930\.2550\.216Claude 3\.7 Sonnet\-0\.051\-0\.128∗∗∗\-0\.0120\.007\-0\.046∗∗0\.1880\.2180\.1870\.2010\.199Gemini 2\.0 Flash\-0\.072∗\-0\.019\-0\.073∗\-0\.143∗∗∗\-0\.077∗∗∗0\.1750\.1560\.1500\.1750\.164Gemini 2\.5 Flash0\.006\-0\.025\-0\.063\-0\.067\-0\.037∗0\.2360\.1900\.1760\.2660\.217Gemini 2\.5 Pro\-0\.001\-0\.0440\.003\-0\.029\-0\.0180\.1820\.1600\.1830\.2010\.182GPT\-3\.5\-Turbo\-0\.021\-0\.033\-0\.046\-0\.071∗\-0\.043∗0\.2740\.3100\.2660\.2840\.284GPT\-4\.1\-mini\-0\.028\-0\.034\-0\.0130\.000\-0\.0190\.1750\.1840\.1810\.1820\.180GPT\-4\.1\-0\.140∗∗∗\-0\.171∗∗∗\-0\.131∗∗∗\-0\.160∗∗∗\-0\.151∗∗∗0\.1930\.1890\.1720\.1990\.188GPT\-4o\-mini\-0\.012\-0\.096∗∗\-0\.0520\.042\-0\.0290\.2280\.1610\.1650\.2080\.191GPT\-4o\-0\.121∗∗∗\-0\.102∗∗\-0\.100∗∗\-0\.082∗\-0\.101∗∗∗0\.1790\.1810\.1860\.2100\.189GPT\-o4\-mini\-0\.101∗∗\-0\.161∗∗∗\-0\.153∗∗∗\-0\.074∗\-0\.122∗∗∗0\.1460\.1500\.1430\.1460\.146GPT\-5\-0\.071∗\-0\.051\-0\.131∗∗∗\-0\.082∗\-0\.084∗∗∗0\.1600\.1600\.1650\.1630\.162GPT\-5\.50\.129∗∗∗0\.152∗∗∗0\.120∗∗∗0\.071∗0\.118∗∗∗0\.1530\.1530\.1510\.1570\.154All\-model\-0\.091∗\-0\.0350\.013\-0\.084∗\-0\.074∗0\.1650\.1560\.1600\.1880\.163

Table 5:Complete direct item discrimination prediction results for all 42 evaluated models\. Spearman evaluates rank alignment with human\-calibrated discrimination values, with significance denoted by stars:p∗<\.05\{\}^\{\*\}p<\.05,p∗∗<\.01\{\}^\{\*\*\}p<\.01, andp∗⁣∗∗<\.001\{\}^\{\*\*\*\}p<\.001\. RMSE measures numerical prediction error\. Baseline denotes no persona, Low/Medium/High denote simulated student proficiency levels, All\-persona averages across personas within each model, and All\-model averages across models\.Table 6:Overlap\-based retrieval of low\- and high\-discrimination items\. Direct rows use LLM\-predicted discrimination scores, CTT rows use answer\-based calibrated discrimination scores, and Random gives the expected overlap under random selection\.
LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Similar Articles

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

LLMs Show No Signs Of Individuated Metacognition

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

Exploring the Capability Boundaries of LLMs in Mastering Chinese Chouxiang Language

Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

Submit Feedback

Similar Articles

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
LLMs Show No Signs Of Individuated Metacognition
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
Exploring the Capability Boundaries of LLMs in Mastering Chinese Chouxiang Language
Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas