Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

Summary

Introduces Generative-Evaluative Agreement (GEA), a validity criterion for LLM-enabled adaptive assessments, and measures it on a two-stage adaptive test, finding that the model recovers about half the intended variance with systematic bias.

arXiv:2605.19529v1 Announce Type: new Abstract: When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.

# Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
Source: [https://arxiv.org/html/2605.19529](https://arxiv.org/html/2605.19529)

###### Abstract

When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM’s scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance ($r = 0.698$) with systematic positive bias. GEA is strong ($r > 0.7$) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.

Grandee Lee, Yue Wang, Che Yee Lye, and Luke Peh. Singapore University of Social Sciences, 463 Clementi Rd, 599494, Singapore. {grandeelee,wangyue,cylye,lukepehlc}@suss.edu.sg

## 1 Introduction

Computerized adaptive testing (CAT) traditionally relies on item banks pre-calibrated via Item Response Theory (IRT), where every item has known difficulty and discrimination parameters estimated from hundreds of real responses (van der Linden and Glas, [2010](https://arxiv.org/html/2605.19529#bib.bib14)). LLM-enabled adaptive assessment disrupts this paradigm: items are generated dynamically, each student potentially receives a unique test, and classical calibration (requiring 50–200 respondents per item; Lord [1980](https://arxiv.org/html/2605.19529#bib.bib6)) becomes infeasible. Revisions to rubrics, prompts, or course materials become psychometrically consequential when they change what is being measured, how performance is elicited, how responses are scored, or how scores are interpreted. In these cases, prior calibration and validity may no longer be transportable, and at least part of the item bank may need to be re-authored, relinked, re-calibrated, or revalidated (Han and Guo, [2011](https://arxiv.org/html/2605.19529#bib.bib1)). In a typical school setting, learning outcomes are updated when curricula evolve, rubrics are refined each semester as instructors identify ambiguities, and course restructuring changes the skill prerequisites for each assignment. LLM-based systems absorb these changes through prompt and rubric updates alone, but the validity of the resulting assessment must be re-established each time.

This creates a bootstrapping problem: the system cannot be validated without real student data, but cannot be deployed at scale without prior validation. Human review of each generated item is infeasible when the item space is effectively infinite, and the classical validity pipeline (pre-calibrate, validate against human raters, then deploy) does not apply. Simulation-based validation offers a pragmatic alternative. Liu et al. ([2024](https://arxiv.org/html/2605.19529#bib.bib5)) demonstrated that ensembles of LLM-simulated respondents can approximate human item calibration with correlations exceeding 0.89. Zheng et al. ([2026](https://arxiv.org/html/2605.19529#bib.bib26)) used Monte Carlo simulation to identify optimal CAT configurations before empirical evaluation. Marquez-Carpintero et al. ([2025](https://arxiv.org/html/2605.19529#bib.bib7)) reviewed LLM-simulated student profiles for pre-deployment testing of pedagogical systems. However, when the same LLM generates items, simulates student responses, and scores them, the validation loop is self-referential. If the model’s representation of skill levels is inconsistent across its generative and evaluative functions, the system validates itself against a distorted mirror.

This paper introduces Generative-Evaluative Agreement (GEA) as the formal criterion for this internal consistency: when an LLM generates a response at an intended skill level, does scoring recover that level? Valid routing decisions require scores that faithfully reflect the intended construct, but “intended difficulty” exists only in the model’s internal representation, accessed through two different computational paths (generation and evaluation) that traverse different prompt-conditioned regions of the same model. Empirical verification of their alignment is therefore a necessary (though not sufficient) validity condition for any LLM-based adaptive assessment that uses simulation for calibration.

![Refer to caption](https://arxiv.org/html/2605.19529v1/x1.png) Figure 1: GEA measurement versus closed-loop self-validation. (a) In a pure closed loop, there is no external anchor and bias is invisible. (b) GEA introduces the intended skill level $x$ as an external reference point. (c) When the generator inflates skill, GEA detects the discrepancy. (d) Bias is undetectable only if both functions share the exact same misconception, an unlikely scenario given that generation and evaluation traverse different prompt-conditioned paths.

### 1.1 Definition

Generative-Evaluative Agreement (GEA) is the degree to which an LLM’s generative representation of skill levels is consistent with its evaluative representation. Formally, if the model generates a response $\mathbf{r}$ conditioned on skill level $x$, then scoring $\mathbf{r}$ should recover $x$ within acceptable error bounds:

$$\mathbb{E}\left[\mathrm{score}(\mathbf{r}) \mid \mathbf{r} \sim \mathrm{generate}(x)\right] \approx x \tag{1}$$

Here $x, \mathrm{score}(\mathbf{r}) \in [0,1]$ are continuous per-skill scores; ordinal proficiency bands (Appendix [D](https://arxiv.org/html/2605.19529#A4)) are derived post hoc for reporting.

We operationalise “$\approx$” through two primary metrics: Pearson $r$ for rank-order fidelity and signed bias for systematic directionality. We propose two actionable benchmarks: $r > 0.7$ (strong GEA) to support fine-grained proficiency reporting, and $r > 0.4$ (moderate GEA) to support binary routing decisions. Skills below $r = 0.4$ should not be used for adaptive routing without human validation.
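As an illustration of the two metrics and the proposed benchmarks, the following is a minimal sketch rather than the authors' implementation. The function names and the toy data are hypothetical; the handling of $r$ exactly equal to 0.4 and of constant inputs is an assumption, since the text does not specify either.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either side is constant
    return sxy / math.sqrt(sxx * syy)

def signed_bias(intended, observed):
    """Mean of (observed - intended); positive means overestimation."""
    return sum(o - i for i, o in zip(intended, observed)) / len(intended)

def gea_tier(r):
    """Map r onto the benchmarks proposed in the text (assumed boundary handling)."""
    if math.isnan(r):
        return "undefined"
    if r > 0.7:
        return "strong"
    if r > 0.4:
        return "moderate"
    return "insufficient for routing"

# Toy data (hypothetical): low intended levels are overscored.
intended = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
observed = [0.2, 0.3, 0.5, 0.7, 0.8, 1.0]
r = pearson_r(intended, observed)
bias = signed_bias(intended, observed)
print(f"r = {r:.3f}, bias = {bias:+.3f}, tier = {gea_tier(r)}")
```

Note that a high $r$ can coexist with a non-trivial positive bias, which is why the two metrics are reported separately.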

Crucially, GEA measurement is *not* equivalent to closed-loop self-validation. Figure [1](https://arxiv.org/html/2605.19529#S1.F1) illustrates the distinction. In a pure closed-loop system (panel a), the model generates and scores with no external reference, so any systematic bias is invisible. In GEA measurement (panel b), the intended skill level $x$ serves as an external anchor. If the generator inflates skill (panel c), the discrepancy $\hat{x} \neq x$ reveals the generation bias. GEA can only fail to detect bias when both functions share the *exact same* misconception of $x$ (panel d), which is unlikely given the empirical evidence of divergence reviewed in Section [2.2](https://arxiv.org/html/2605.19529#S2.SS2). Figure [2](https://arxiv.org/html/2605.19529#S1.F2) shows the concrete assessment architecture in which GEA is measured.

![Refer to caption](https://arxiv.org/html/2605.19529v1/x2.png) Figure 2: Adaptive assessment flow. The LLM generates assignments and scores responses at each stage. Routing depends on the cumulative score $\bar{s}$ crossing threshold $\theta$. GEA measures consistency between the LLM’s generative and evaluative functions.

## 2 Background

### 2.1 The Closed-Loop Problem

In traditional CAT, item parameters and scoring functions are independently validated against real human response data. In LLM-based adaptive systems, the model performs both roles with no external anchor. An instructive (though imperfect) analogy comes from the generative/discriminative distinction: Ng and Jordan ([2001](https://arxiv.org/html/2605.19529#bib.bib9)) showed that models learning $P(X \mid Y)$ and $P(Y \mid X)$ can disagree at finite capacity. In an LLM, both generation and evaluation share the same weights, but are conditioned on different prompts that traverse different computational paths. Generation is dominated by fluency priors; evaluation by criterion matching. Shared architecture makes alignment *plausible* but does not *guarantee* it (Oh et al., [2024](https://arxiv.org/html/2605.19529#bib.bib10); West et al., [2023](https://arxiv.org/html/2605.19529#bib.bib18)).

### 2.2 Empirical Evidence of Divergence

LLMs struggle to simulate lower-proficiency cognitive states (Yuan et al., [2026](https://arxiv.org/html/2605.19529#bib.bib22)): expert knowledge leaks through despite skill-level prompting. Srivatsa et al. ([2025](https://arxiv.org/html/2605.19529#bib.bib13)) tested 11 LLMs against real NAEP data and found no model-prompt pair faithfully reproduced real student distributions; Wu et al. ([2025](https://arxiv.org/html/2605.19529#bib.bib19)) confirmed this for Python programming. LLMs also systematically rate their own outputs higher than equivalent text from other sources (self-preference bias; Panickssery et al. [2024](https://arxiv.org/html/2605.19529#bib.bib11)), with the mechanism identified as perplexity-based familiarity (Wataoka et al., [2024](https://arxiv.org/html/2605.19529#bib.bib17)). Even proprietary models show low intra-rater consistency at temperature $> 0$ (Lee et al., [2024b](https://arxiv.org/html/2605.19529#bib.bib15)). In simulation-based calibration, these mechanisms compound: the result may appear internally consistent but is not externally valid.

### 2.3 Implications for Calibration

When simulation is the only feasible calibration method, GEA becomes the gatekeeper for trustworthiness. If GEA is low, score distributions reflect the model’s self-consistency rather than real student performance. Wang et al. ([2025a](https://arxiv.org/html/2605.19529#bib.bib12)) argue that Generalizability Theory and Many-Facet Rasch Measurement are needed to decompose multiple simultaneous error sources rather than collapsing them into a single coefficient. Even in real deployment, the generative side affects question generation: if items are at the wrong difficulty, routing decisions are based on mis-targeted items regardless of grading accuracy.

### 2.4 Related Work

GEA connects to several research threads. The automated essay scoring (AES) literature has studied inter-rater reliability for decades (Ramesh and Dash, [2022](https://arxiv.org/html/2605.19529#bib.bib24)); GEA differs in that rater and author are the same model. From a psychometric perspective, GEA instantiates the *substantive* component of [Messick](https://arxiv.org/html/2605.19529#bib.bib8)’s ([1989](https://arxiv.org/html/2605.19529#bib.bib8)) construct validity framework, and aligns with the *Standards for Educational and Psychological Testing* (American Educational Research Association et al., [2014](https://arxiv.org/html/2605.19529#bib.bib23)) requirement for evidence that scores support intended interpretations. The LLM-as-judge paradigm (Zheng et al., [2024](https://arxiv.org/html/2605.19529#bib.bib25)) has documented self-preference and position bias; GEA extends this from evaluation-only settings to the generate-then-evaluate pipeline where generation bias compounds with evaluation bias.

## 3 Empirical Measurement of GEA

This section presents the empirical measurement of GEA, in the sense defined in Section [1.1](https://arxiv.org/html/2605.19529#S1.SS1), for the concrete case of Python object-oriented programming (OOP) coding tasks. The same Claude model performs both code generation and rubric-based evaluation against the 24-skill taxonomy (class definition, inheritance, exception handling, etc.; full list in Appendix [B](https://arxiv.org/html/2605.19529#A2)).

### 3.1 Simulation Design

#### Student profiles\.

We generated 150 synthetic student profiles, each comprising a 24-dimensional skill vector $\mathbf{x} \in [0,1]^{24}$ corresponding to the official learning outcomes (Table [2](https://arxiv.org/html/2605.19529#S4.T2)). Skills are grouped into four progressive groups: Group A (S01–S08, class basics), Group B (S01–S04, S06–S07, S09–S13, class variables and composition), Group C (S01–S04, S06, S09, S14–S21, inheritance and polymorphism), and Group D (S01, S14–S15, S22–S24, exception handling). For more details, see Appendix [B](https://arxiv.org/html/2605.19529#A2). Profiles were sampled from 10 archetypes (e.g., “Absolute Beginner,” “Lab 2 Proficient,” “Advanced”) with Gaussian noise ($\sigma = 0.04$) to produce realistic within-archetype variation.
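The sampling step can be sketched as follows. This is a hedged illustration, not the paper's code: the archetype centre, the clipping to $[0,1]$, and seeding the noise on the student ID are all assumptions (the text only states that scenario entities were seeded per student), and `sample_profile` is a hypothetical name.

```python
import random

N_SKILLS = 24
SIGMA = 0.04  # within-archetype noise level stated in the text

def sample_profile(archetype, student_id, sigma=SIGMA):
    """Perturb an archetype's 24 skill levels with Gaussian noise.

    Assumptions: values are clipped to [0, 1], and the generator is
    seeded on the student ID so a profile can be regenerated exactly.
    """
    rng = random.Random(student_id)
    return [min(1.0, max(0.0, rng.gauss(mu, sigma))) for mu in archetype]

# Hypothetical archetype centre; the real archetype vectors are not given here.
absolute_beginner = [0.1] * N_SKILLS
profile = sample_profile(absolute_beginner, student_id=7)
print([round(v, 2) for v in profile[:5]])
```

Clipping slightly distorts the noise distribution for archetypes whose centres lie near 0 or 1, which is one reason to treat this as a sketch only.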

#### Assessment slots\.

Every student attempted all 6 assignment slots regardless of skill level, bypassing the adaptive routing to ensure full coverage. Each slot tests a designated subset of the 24 skills; non-applicable skills are marked $-1.0$ in the output vector and excluded from scoring. Scenario entities were assigned deterministically per student (seeded on student ID) to ensure reproducibility.

#### Generate–then–score protocol\.

For each \(student, slot\) pair, we executed two sequential API calls to Claude Sonnet 4\.6:

1. Generate: Given the student’s full skill profile (24 skill scores with per-skill natural-language descriptors) and the assignment question, the model was prompted to produce Python code that *precisely matches* the specified skill levels, including deliberate errors, omissions, and partial implementations for low-scoring skills.
2. Score: The generated code was submitted to the same model’s scoring function with the identical rubric, which returned a 24-element observed skill vector $\hat{\mathbf{x}} \in (\{-1.0\} \cup [0,1])^{24}$ and a scalar score $s = \mathrm{round}(\mathrm{mean}(\hat{x}_i : \hat{x}_i \neq -1) \times 100)$.

This yields paired observations $(x_i, \hat{x}_i)$ for every skill $i$ that is applicable in a given slot, providing the raw material for measuring Equation [1](https://arxiv.org/html/2605.19529#S1.E1).
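The scalar score and the pairing step described above can be sketched in a few lines. The function names are hypothetical, and Python's built-in `round` (which rounds exact halves to the nearest even integer) is an assumption about the rounding rule, which the text does not specify.

```python
NOT_APPLICABLE = -1.0  # marker for skills a slot does not test

def scalar_score(observed):
    """s = round(mean of applicable observed skill levels * 100)."""
    applicable = [v for v in observed if v != NOT_APPLICABLE]
    if not applicable:
        raise ValueError("no applicable skills in this slot")
    return round(sum(applicable) / len(applicable) * 100)

def paired_observations(intended, observed):
    """Return (skill_index, intended, observed) for applicable skills only."""
    return [
        (i, x, x_hat)
        for i, (x, x_hat) in enumerate(zip(intended, observed))
        if x_hat != NOT_APPLICABLE
    ]

# Shortened 4-skill vectors for readability; the real vectors have 24 entries.
intended = [0.9, 0.1, 0.4, 1.0]
observed = [0.8, -1.0, 0.5, 1.0]
print(scalar_score(observed))
print(paired_observations(intended, observed))
```

Excluding the $-1.0$ entries before averaging matters: treating them as numeric scores would pull the mean down sharply.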

#### Scale\.

The 150 students and 6 slots produced 862 result records (a complete $150 \times 6$ design would yield 900) and 7,788 paired (true, observed) skill-level observations across 23 of 24 skills (S13, Dictionary Collection Management, was not tested in any slot). Table [1](https://arxiv.org/html/2605.19529#S4.T1) summarises the aggregate agreement statistics.

## 4 GEA Findings

Table 1: Overall GEA statistics (7,788 paired skill observations, 150 students, 23 skills). 95% bootstrap CIs from 1,000 resamples.

The pooled Pearson correlation of $r = 0.698$ indicates that the LLM’s evaluative function recovers roughly half the variance ($R^2 \approx 0.49$) in the true skill levels it was asked to generate. The positive mean bias of $+0.059$ confirms the direction predicted by self-preference bias: the model systematically overestimates the skill level of its own generated code.
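A bootstrap confidence interval for the pooled correlation can be sketched as below. The percentile method and the choice to resample individual (intended, observed) pairs are assumptions, since the text states only that 1,000 resamples were used; because observations are clustered within students, resampling whole students would be a defensible alternative. The data here are synthetic.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation; NaN when either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci(x, y, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r, resampling pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):
            stats.append(r)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper

# Synthetic pairs with a small positive bias (illustration only).
data_rng = random.Random(1)
x = [i / 49 for i in range(50)]
y = [min(1.0, max(0.0, v + data_rng.gauss(0.06, 0.1))) for v in x]
low, high = bootstrap_ci(x, y)
print(f"r = {pearson_r(x, y):.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```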

At the proficiency-level granularity used for reporting (8 ordinal levels from *Not Demonstrated* to *Mastered*; boundaries in Appendix [D](https://arxiv.org/html/2605.19529#A4)), exact classification accuracy is only 34.8%, rising to 64.4% within $\pm 1$ adjacent level. This represents moderate agreement, sufficient to distinguish broad skill bands but insufficient for fine-grained proficiency reporting.
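Exact and adjacent agreement at the band level can be computed as in the following sketch. The cut points are partly hypothetical: the upper four (0.60, 0.70, 0.80, 0.90) follow band ranges quoted elsewhere in the text, while the lower three are placeholders standing in for the Appendix D boundaries, which are not reproduced here.

```python
from bisect import bisect_right

# Seven cut points define eight ordinal bands (index 0 = lowest band).
# ASSUMPTION: 0.05, 0.25 and 0.45 are placeholders, not the paper's values.
CUTS = [0.05, 0.25, 0.45, 0.60, 0.70, 0.80, 0.90]

def band(score):
    """Index of the ordinal band that a continuous score falls into."""
    return bisect_right(CUTS, score)

def band_agreement(intended, observed):
    """Return (exact, within-one-band) agreement rates."""
    diffs = [abs(band(x) - band(x_hat)) for x, x_hat in zip(intended, observed)]
    exact = sum(d == 0 for d in diffs) / len(diffs)
    adjacent = sum(d <= 1 for d in diffs) / len(diffs)
    return exact, adjacent

# Toy data (hypothetical).
intended = [0.0, 0.65, 0.75, 0.85]
observed = [0.20, 0.95, 0.78, 0.95]
print(band_agreement(intended, observed))
```

The adjacent rate is always at least as large as the exact rate, so the gap between the two indicates how much of the disagreement is confined to neighbouring bands.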

### 4.1 Per-Skill GEA

GEA varies dramatically across skills. Table [2](https://arxiv.org/html/2605.19529#S4.T2) presents the full per-skill breakdown, sorted by Pearson $r$. After Benjamini-Hochberg correction for multiple comparisons across 23 skills, 19 correlations remain significant at $\alpha = 0.05$; the four non-significant skills are precisely the near-zero GEA tier. Three tiers emerge:
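For readers unfamiliar with the correction, the Benjamini-Hochberg step-up procedure can be sketched as follows. The p-values below are made up for illustration; the per-skill p-values from the study are not reported in this section.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans aligned with p_values:
    True where the null hypothesis is rejected at FDR level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Hypothetical p-values for four skills.
p_values = [0.205, 0.001, 0.041, 0.008]
print(benjamini_hochberg(p_values))
```

Because the procedure is step-up, a p-value that misses its own rank threshold is still rejected when a larger p-value further down the ordering meets its threshold.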

Table 2: Per-skill GEA metrics, sorted by Pearson $r$. Skills marked with $\star$ are mandatory in the assessment rubric.

| Skill | $n$ | $r$ | Bias |
| --- | --- | --- | --- |
| S05 Setter w/ Validation | 290 | .88 | +.10 |
| S09 $\star$ Class Variable | 146 | .83 | +.26 |
| S23 Raise Exception | 145 | .80 | +.12 |
| S11 Composition | 139 | .80 | +.13 |
| S12 Delegation | 139 | .80 | +.17 |
| S10 Class Var Mod | 146 | .78 | +.14 |
| S18 $\star$ Override Refine | 285 | .77 | +.13 |
| S24 Try/Except | 145 | .77 | +.11 |
| S07 Compute Method | 429 | .72 | +.04 |
| S15 `super().__init__` | 433 | .71 | +.08 |
| S06 `__str__` | 717 | .65 | −.04 |
| S01 Class Def | 848 | .62 | +.06 |
| S04 Getter Property | 716 | .59 | +.02 |
| S16 Subclass Attrs | 288 | .57 | +.21 |
| S08 Mutate Method | 287 | .55 | −.06 |
| S22 Custom Exception | 145 | .55 | +.21 |
| S02 Constructor | 717 | .51 | −.04 |
| S14 Subclass Def | 433 | .47 | +.25 |
| S03 $\star$ Private Vars | 717 | .43 | ±.00 |
| S19 ABC Definition | 114 | .08 | −.10 |
| S17 $\star$ Override Replace | 288 | .06 | +.19 |
| S21 Polymorphism | 113 | −.03 | −.07 |
| S20 Concrete Subclass | 108 | n/a | −.09 |

#### Strong GEA ($r > 0.7$; 10 skills).

Skills with concrete, syntactically verifiable indicators, such as setter validation logic (S05, $r = 0.88$), class variable declaration with underscore convention (S09, $r = 0.83$), exception raising (S23, $r = 0.80$), and composition via object attributes (S11, $r = 0.80$), show the strongest agreement. These skills have unambiguous code signatures: a `@attr.setter` with a conditional check, a `_count` class variable, a `raise CustomError()` statement. The rubric criteria map directly to syntactic patterns that both the generator and evaluator can reliably target.

#### Moderate GEA ($0.4 < r < 0.7$; 9 skills).

Foundational skills tested across many slots (S01–S04, S06) show moderate agreement ($r = 0.43$–$0.65$). These skills are near-universal in student code (almost every submission defines a class, writes a constructor, uses properties), creating a ceiling effect that compresses variance. Subclass-related skills (S14, S16) show moderate $r$ but high bias ($+0.21$ to $+0.25$), indicating the generator systematically over-performs on inheritance tasks relative to the intended skill level.

#### Near-zero GEA ($r < 0.1$; 4 skills).

Abstract design skills, including ABC definition (S19, $r = 0.08$), method overriding by replacement (S17, $r = 0.06$), polymorphism (S21, $r = -0.03$), and concrete subclass implementation (S20, constant output), show essentially no correlation between intended and observed skill levels. S20 is degenerate: the evaluator assigned nearly identical scores regardless of the true skill level, collapsing all variation. These skills require *design decisions* (choosing to define an ABC, choosing to use polymorphic dispatch) rather than *syntactic patterns*, making them harder for the generator to “partially implement” and harder for the evaluator to grade on a continuous scale.

This pattern (strong GEA for syntactically verifiable skills, weak GEA for design-level skills) is consistent with the scope limitation noted in Section [6](https://arxiv.org/html/2605.19529#S6): code assessment benefits from partially verifiable ground truth, but that benefit is concentrated in the syntactic stratum of the skill taxonomy. Design-level skills behave more like the subjective assessment domains where GEA is expected to be weakest.

### 4.2 Calibration and Bias Structure

![Refer to caption](https://arxiv.org/html/2605.19529v1/figures/fig4_calibration.png) Figure 3: Calibration curve: mean observed skill as a function of true proficiency band. Error bars show $\pm 1$ SD. The dashed diagonal represents perfect calibration. The curve lies above the diagonal at low skill levels (overestimation) and converges at high levels.

The calibration curve (Figure [3](https://arxiv.org/html/2605.19529#S4.F3)) reveals an asymmetric bias:

- Low-skill overestimation: Students in the “Not Demonstrated” band (true $\approx 0.0$) receive mean observed scores of $\approx 0.20$, a $+0.20$ bias. The LLM struggles to generate authentically incompetent code; even when instructed to omit a skill, the generated code tends to include partial or vestigial implementations that the evaluator detects.
- High-skill convergence: Students in the “Advanced” and “Mastered” bands (true $> 0.80$) are scored close to their true levels, with the curve approaching the diagonal. The model finds it easier to generate competent code and to recognise competence.
- Mid-range compression: The “Developing” through “Proficient” bands ($0.45$–$0.80$) show the highest variance (largest error bars), suggesting the model has difficulty maintaining fine-grained distinctions in the middle of the skill range.
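A calibration curve of this kind can be computed by grouping observed scores by intended level, as in the sketch below. The equal-width bins are illustrative and are not the paper's proficiency bands; the use of the population standard deviation is also an assumption, since the figure caption says only "$\pm 1$ SD". The data are toy values chosen to mimic the three patterns listed above.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, pstdev

# ASSUMPTION: illustrative equal-width bin edges, not the paper's bands.
EDGES = [0.2, 0.4, 0.6, 0.8]

def calibration_curve(intended, observed, edges=EDGES):
    """Map bin index -> (mean observed, SD of observed) per intended-level bin."""
    groups = defaultdict(list)
    for x, x_hat in zip(intended, observed):
        groups[bisect_right(edges, x)].append(x_hat)
    return {b: (mean(v), pstdev(v)) for b, v in sorted(groups.items())}

# Toy data: overscored at the bottom, noisy in the middle, accurate at the top.
intended = [0.0, 0.1, 0.5, 0.5, 0.9, 1.0]
observed = [0.2, 0.2, 0.3, 0.8, 0.9, 1.0]
curve = calibration_curve(intended, observed)
for b, (m, sd) in curve.items():
    print(f"bin {b}: mean observed = {m:.2f}, SD = {sd:.2f}")
```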

This asymmetry has direct implications for adaptive routing. The system’s routing threshold ($\geq 50$ for the High path) operates in the mid-range where calibration is poorest. The upward bias at low skill levels means that weak students are systematically overscored, making them more likely to exceed the threshold and be routed to the High path, exactly the misrouting scenario described in Section [6](https://arxiv.org/html/2605.19529#S6).

### 4.3 Proficiency-Level Classification

![Refer to caption](https://arxiv.org/html/2605.19529v1/figures/fig6_confusion.png) Figure 4: Row-normalised confusion matrix mapping true proficiency levels (rows) to observed proficiency levels (columns). The strong rightward bias in the “Approaching” through “Mastered” rows indicates systematic upward misclassification at higher skill levels.

The confusion matrix (Figure [4](https://arxiv.org/html/2605.19529#S4.F4)) reveals the proficiency-level consequences of the biases identified above.

#### Upward collapse to “Mastered\.”

The dominant pattern is a strong rightward shift: students at “Approaching” (true $0.60$–$0.70$) are classified as “Mastered” 50% of the time; “Proficient” (true $0.70$–$0.80$) collapses to “Mastered” 67% of the time; “Advanced” (true $0.80$–$0.90$) reaches 77%. The evaluator effectively treats any competent implementation as “Mastered,” failing to distinguish gradations within the upper half of the skill range. This is a direct manifestation of self-preference bias: the model’s own generated code, even when deliberately degraded, retains low perplexity from the evaluator’s perspective, inflating scores toward the ceiling.

#### Low\-skill recovery\.

The “Not Demonstrated” level is the best-recovered category at 60% exact match, consistent with the observation that the absence of a skill (e.g., no `@property` defined at all) is a binary, syntactically verifiable condition. However, even here, 14% of truly absent skills are scored as “Beginning” and 9% as “Mastered,” the latter likely reflecting cases where the generator failed to suppress the skill despite being instructed to (competence paradox leakage).

#### Implications for proficiency reporting\.

The 8-level proficiency scale is not reliably recoverable through the generate-then-score pipeline. A coarser 3- or 4-level scale would better match the system’s effective resolution. For adaptive routing (threshold at 50), the mean upward bias ($+0.059 \times 100 \approx 6$ points), which is larger still at low skill levels, inflates Stage 1 scores for weak students, increasing misrouting probability to the High path.
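The arithmetic behind the routing concern can be made concrete with a small sketch. It assumes, for illustration only, that the bias acts as a constant additive offset on the 0 to 100 scale; the "Low" path name and the helper names are hypothetical.

```python
THETA = 50         # routing threshold from the text
MEAN_BIAS = 0.059  # pooled mean bias on the [0, 1] scale

def route(score, theta=THETA):
    """Route to the High path at or above the threshold ('Low' is an assumed name)."""
    return "High" if score >= theta else "Low"

def misrouted_by_bias(true_score, bias_points, theta=THETA):
    """True if an additive bias changes the routing decision."""
    return route(true_score, theta) != route(true_score + bias_points, theta)

bias_points = round(MEAN_BIAS * 100)  # about 6 points on the 0-100 scale
for true_score in (40, 44, 46, 50):
    print(true_score, route(true_score + bias_points),
          misrouted_by_bias(true_score, bias_points))
```

Under this simplification, only students whose unbiased score lies within about 6 points below the threshold change paths; the larger low-skill bias reported in Section 4.2 would widen that window.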

## 5 Strengthening GEA

The empirical findings above reveal that GEA is partial and skill\-dependent\. This section examines the mechanisms available to strengthen it, beginning with the most impactful \(granular rubrics\) and progressing to complementary strategies that address failure modes rubrics alone cannot resolve\.

### 5.1 Rubrics as Constrained Reasoning Paths

Section [2.1](https://arxiv.org/html/2605.19529#S2.SS1) identified the core architectural problem: generation and evaluation traverse different computational paths through the model with no guarantee of alignment. Granular rubrics address this by providing a shared external specification that constrains both paths to pass through the same intermediate representation, namely the rubric’s criteria.

Wang et al. ([2025b](https://arxiv.org/html/2605.19529#bib.bib16)) formalise this as a graph reasoning problem. Without rubrics, the path from task description to score can traverse arbitrary reasoning states. With rubrics, both generation (“produce code demonstrating these specific skills”) and evaluation (“check for these specific skills”) are forced through the same intermediate checkpoints. Under the assumption that aligning these intermediate nodes increases the overlap between the model’s generative and evaluative reasoning paths, rubric-guided assessment should exhibit higher GEA than holistic assessment.

In the context of this system, the rubric defines per-assignment skill vector tables specifying exactly which of the 24 skills are applicable to each assignment and what constitutes mastery. This provides structural decomposition: both the question generator and the response evaluator reference the same granular criteria, reducing the degrees of freedom available for the two paths to diverge. The three-tier GEA pattern in Table [2](https://arxiv.org/html/2605.19529#S4.T2) provides indirect evidence for this mechanism: skills with unambiguous syntactic rubric criteria (e.g., “code contains a `@property` decorator”) show strong GEA, while skills with holistic criteria (e.g., “demonstrates appropriate use of polymorphism”) show near-zero GEA.

### 5.2 Empirical Support from the Literature

Rubric quality directly predicts scoring accuracy. Wu et al. ([2024](https://arxiv.org/html/2605.19529#bib.bib20)) found a Spearman rank correlation of $\rho = 0.94$ ($p < 0.01$) between analytic rubric alignment (with human-crafted rubrics) and automated scoring accuracy. Without rubrics, accuracy was 34.8%; with human-crafted analytic rubrics, 50.4%; with LLM-generated rubrics guided by holistic rubrics, 54.6%. The near-perfect correlation between rubric quality and scoring performance establishes that the evaluative path is strongly anchored by rubric specification. For GEA, this implies that the evaluative side of the agreement can be substantially improved through rubric design alone, without changing the model.

Rubric-aligned component extraction yields consistent gains. AutoSCORE’s two-agent design (first extract rubric-relevant components into a structured JSON representation, then score based on those components) improved QWK by up to 37% (Essay Set, GPT-4o) and 74% (Science, LLaMA-8B) over single-agent baselines (Wang et al., [2025b](https://arxiv.org/html/2605.19529#bib.bib16)). Gains were largest on complex, multi-dimensional rubrics, which parallels this system’s 24-skill vector with per-assignment coverage tables. The framework demonstrates that decomposing evaluation into criterion-level checks before holistic scoring reduces rubric misalignment and evaluator shortcuts. This system’s architecture already follows this pattern: the evaluator produces a 24-element skill vector before computing a scalar score.

Rubrics balance accuracy across proficiency levels. Lee et al. ([2024a](https://arxiv.org/html/2605.19529#bib.bib4)) found that Chain-of-Thought prompting combined with scoring rubrics yielded a 13.4% accuracy increase and, critically, more balanced accuracy across different proficiency categories. Without rubrics, LLMs were biased toward certain score ranges; with rubrics, scores distributed more evenly across levels. This directly addresses the calibration asymmetry observed in Figure [3](https://arxiv.org/html/2605.19529#S4.F3): rubric-guided evaluation may reduce the systematic overestimation at low skill levels that currently inflates routing scores.

Rubrics prevent evaluator shortcuts. Wu et al. ([2024](https://arxiv.org/html/2605.19529#bib.bib20)) discovered that providing graded student responses (without rubrics) actually *degraded* rubric alignment; the LLM found superficial keyword shortcuts instead of following the intended reasoning chain. Analytic rubrics prevented this by requiring criterion-level assessment, forcing the model through the intended reasoning path rather than surface-level pattern matching. This finding is particularly relevant to GEA: without rubrics, the evaluator may assign high scores based on surface features (e.g., code length, presence of class definitions) rather than the specific skill indicators the generator was instructed to produce or omit.

Rubrics are therefore a necessary but not sufficient condition for GEA. They are the most practical strategy within a single-model system, but complete GEA assurance requires complementary measures to address failure modes that rubric design alone cannot resolve.

### 5.3 Design Principles for GEA-Strengthening Rubrics

Based on the findings above, rubrics that maximise GEA should:

1. Decompose the construct into discrete, independently assessable skills, each with a binary or ordinal mastery indicator (as in the 24-skill vector).
2. Specify per-assignment applicability: explicitly mark which skills are assessed and which are not applicable, so neither generation nor evaluation drifts into unintended territory.
3. Define decision boundaries, not just level descriptions: specify what distinguishes mastery from non-mastery for each skill, not just what each level “looks like” holistically.
4. Use structured output formats: require the evaluator to produce criterion-level judgements (e.g., JSON skill vectors) before aggregating to a holistic score, following AutoSCORE’s component-extraction-then-scoring paradigm (Wang et al., [2025b](https://arxiv.org/html/2605.19529#bib.bib16)).
5. Avoid excessive verbosity: structural clarity outperforms exhaustive description; long rubrics can degrade performance in some models (Yoshida, [2025](https://arxiv.org/html/2605.19529#bib.bib21)).
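Principles 2 and 4 can be enforced mechanically by validating the evaluator's structured output before any aggregation. The sketch below is one possible check; the JSON layout (a top-level `"skills"` list) and the function name are hypothetical, as the paper does not specify the evaluator's output schema.

```python
import json

N_SKILLS = 24
NOT_APPLICABLE = -1.0

def parse_skill_vector(raw_json, applicable):
    """Validate an evaluator reply against a slot's applicability set.

    ASSUMPTION: the reply is a JSON object with a 'skills' list of 24 numbers.
    `applicable` is the set of 0-based skill indices the slot assesses.
    """
    vector = json.loads(raw_json)["skills"]
    if len(vector) != N_SKILLS:
        raise ValueError(f"expected {N_SKILLS} skills, got {len(vector)}")
    for i, v in enumerate(vector):
        if i in applicable:
            if not 0.0 <= v <= 1.0:
                raise ValueError(f"skill {i}: score {v} outside [0, 1]")
        elif v != NOT_APPLICABLE:
            raise ValueError(f"skill {i}: scored although not applicable")
    return vector

# Hypothetical reply for a slot that assesses the first three skills only.
reply = json.dumps({"skills": [0.5, 1.0, 0.0] + [NOT_APPLICABLE] * 21})
print(parse_skill_vector(reply, applicable={0, 1, 2})[:4])
```

Rejecting replies that score a non-applicable skill catches the case where evaluation drifts into territory the assignment was never meant to cover.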

### 5.4 Complementary Mitigations

While granular rubrics are the primary strategy for strengthening GEA, they cannot address all failure modes\. The competence paradox \(the generator producing overly competent code despite low\-skill instructions\) and self\-preference bias \(the evaluator inflating scores for model\-generated text\) require complementary interventions\.

#### Cross\-model evaluation\.

The most direct intervention is using a different model family to score responses than the one that generated them (e.g., generate with Claude, score with GPT-4). This breaks the self-preference loop whereby a model inflates scores for its own low-perplexity outputs (Panickssery et al., [2024](https://arxiv.org/html/2605.19529#bib.bib11)). The perplexity-based mechanism identified by Wataoka et al. ([2024](https://arxiv.org/html/2605.19529#bib.bib17)) implies that cross-model evaluation should be most beneficial for the upper proficiency levels where self-preference bias is strongest (Figure [4](https://arxiv.org/html/2605.19529#S4.F4)). Where cross-model evaluation is impractical due to cost or API constraints, multi-sample scoring offers a partial substitute: scoring each response multiple times at non-zero temperature and flagging high-variance items for human review surfaces the stochastic inconsistency that single-pass scoring conceals (Korthals et al., [2026](https://arxiv.org/html/2605.19529#bib.bib2)).
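Multi-sample scoring can be sketched as a thin wrapper around any scoring call. Everything specific here is an assumption: `score_once` stands in for a real model call at non-zero temperature, and the number of samples and the standard-deviation threshold are illustrative values, not recommendations from the paper.

```python
from itertools import cycle
from statistics import mean, pstdev

def multi_sample_score(score_once, response, k=5, sd_threshold=0.1):
    """Score a response k times and flag it for review if the scores disagree.

    score_once: callable(response) -> float in [0, 1]; a stand-in for a
    stochastic LLM scoring call (hypothetical interface).
    """
    samples = [score_once(response) for _ in range(k)]
    spread = pstdev(samples)
    return {
        "mean": mean(samples),
        "sd": spread,
        "needs_review": spread > sd_threshold,
    }

# Stub scorers for demonstration: one consistent, one erratic.
stable = multi_sample_score(lambda resp: 0.8, "class Foo: ...")
erratic_values = cycle([0.2, 0.9])
erratic = multi_sample_score(lambda resp: next(erratic_values), "class Foo: ...")
print(stable["needs_review"], erratic["needs_review"])
```

The mean of the samples can still be used as the score; the flag only determines which items are escalated to a human.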

#### Epistemic state specification\.

On the generation side, the competence paradox can be mitigated by moving beyond naive role-prompting (“act as a beginner”). Yuan et al. ([2026](https://arxiv.org/html/2605.19529#bib.bib22)) propose using structured misconception inventories and knowledge component graphs that constrain the generative path to produce behaviourally realistic responses. Rather than asking the model to simulate a general proficiency level, the prompt specifies which knowledge components the student has and has not acquired, which common misconceptions are active, and which error patterns should appear. This converts a vague instruction (“produce beginner code”) into a concrete specification that the generator can follow more reliably, directly improving the generative side of GEA.

#### Real\-student pilot validation\.

Simulation-derived thresholds should be validated against a small real-student cohort before operational use, following standard psychometric practice. Even a pilot of 10–20 students on a subset of skills would provide an external anchor against which to calibrate the simulation’s bias estimates. This is particularly important for the routing threshold, where the $+6$-point upward bias identified in Section [4.2](https://arxiv.org/html/2605.19529#S4.SS2) may require adjustment before deployment.

## 6 Discussion

#### Threshold sensitivity.

We swept the Stage 1 routing threshold $\theta$ from 30 to 70 (Table [3](https://arxiv.org/html/2605.19529#S6.T3)), re-routing 140 students with complete data (10 excluded due to incomplete Stage 2 records).

Table 3: Threshold sensitivity sweep. Flip%: routing changes vs. baseline $\theta = 50$. Mis%: fraction misaligned with true archetype ability. A stability plateau spans $\theta \in [45, 55]$ ($<5\%$ flips). Misclassification decreases monotonically as $\theta$ rises (45% at $\theta = 30$ to 19% at $\theta = 70$) because the positive scoring bias pushes observed scores above true ability. However, at $\theta = 70$ nearly one in five students would be reclassified. The baseline $\theta = 50$ represents a pragmatic compromise within the stability plateau.
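A flip-rate sweep of this kind can be sketched as follows. The sketch assumes that a flip is counted on the Stage 1 route only and that the sweep uses a step of 5; neither detail is stated here, and the function names are illustrative.

```python
from typing import Dict, Iterable, Sequence


def route(stage1_mean: float, theta: float) -> str:
    """Stage 1 routing rule: scores at or above theta go to the High path."""
    return "high" if stage1_mean >= theta else "low"


def flip_rate(stage1_means: Sequence[float], theta: float,
              baseline: float = 50.0) -> float:
    """Percentage of students whose route differs from the baseline route."""
    if not stage1_means:
        raise ValueError("no students to route")
    flips = sum(route(s, theta) != route(s, baseline) for s in stage1_means)
    return 100.0 * flips / len(stage1_means)


def sweep(stage1_means: Sequence[float],
          thetas: Iterable[float] = range(30, 71, 5)) -> Dict[float, float]:
    """Flip% for each candidate threshold (the step of 5 is an assumption)."""
    return {theta: flip_rate(stage1_means, theta) for theta in thetas}
```

With toy Stage 1 means of `[40, 48, 52, 60]`, moving the threshold from 50 to 45 changes the route of one student in four, which is the kind of sensitivity the Flip% column summarises.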

#### Decomposing GEA failure\.

When GEA is low, the deficit could stem from the generator, the evaluator, or both. For syntactically verifiable skills (e.g., S05, $r = 0.88$), unambiguous code signatures constrain both paths. For design-level skills (e.g., S21, $r = -0.03$), the generator likely over-produces while the evaluator lacks binary markers to assess degree. Isolating each component requires human scoring of generated code or human-written code at specified skill levels.

#### Domain dependence\.

Code assessment occupies a privileged position because programming tasks have partially verifiable ground truth: a class either defines a `@property` or it does not. Subjective domains (essay argumentation, creative writing) lack these anchors, so GEA findings here likely represent an upper bound. The rubric-as-config architecture is domain-agnostic, but empirical GEA guarantees are domain-specific.

#### Model scaling\.

We repeated the full simulation using Haiku 4\.5 with identical rubrics and profiles \(Table[4](https://arxiv.org/html/2605.19529#S6.T4)\)\.

Table 4: Model scaling comparison. Pearson $r$ is computed at the *record level* ($n = 862$ assignment records), which averages across ${\sim}9$ skills per record and is therefore higher than the pooled skill-level $r = 0.698$ in Table [1](https://arxiv.org/html/2605.19529#S4.T1). Haiku inflates scores by $+17.6$ points on average, scoring Absolute Beginners at $44.8$ (vs. Sonnet’s $19.8$), above the routing threshold, and assigning 88% of students to Advanced. The pooled $r$ difference (Sonnet 0.698 vs. Haiku 0.447) is significant (Fisher $z = 23.8$, $p < 10^{-100}$). GEA is therefore scale-dependent: self-preference bias and the competence paradox are amplified at smaller scale.
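For readers who want to run this kind of comparison on their own data, the textbook Fisher $z$ test for two correlations is sketched below. It assumes the two samples are independent, and the sample sizes `n1` and `n2` must be supplied by the user because the pooled skill-level $n$ is not given in this section; the sketch is not claimed to be the exact procedure behind the reported statistic.

```python
import math


def fisher_z_difference(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent Pearson r values."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 observations")
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher r-to-z transform
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical sample sizes, for illustration only.
z = fisher_z_difference(0.698, 1000, 0.447, 1000)
p = two_sided_p(z)
```

Because the statistic grows with the square root of the sample size, a fixed gap in $r$ becomes more significant as the number of scored skill instances increases.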

## 7 Conclusion

We introduced Generative-Evaluative Agreement (GEA) as a necessary validity criterion for LLM-enabled adaptive assessment. Using 150 synthetic profiles on a Python OOP assessment with Claude Sonnet 4.6, the model recovers roughly half the intended skill variance ($r = 0.698$, 95% CI $[.684, .712]$) with systematic positive bias, and GEA is strongly skill-dependent: high for syntactically verifiable skills, near zero for design-level skills.

## Limitations

The reported GEA estimates should be read with several scoping constraints in mind. All 150 profiles are LLM-sampled rather than drawn from real students; authentic learner errors likely differ from simulated ones, so a pilot of 10–20 real students would provide an external anchor for the bias estimates and a check on the simulated-error distribution. We evaluate two Claude models (Sonnet 4.6, Haiku 4.5) on Python OOP code only; cross-family replication (e.g., GPT-4o, Gemini, open-weight models) and subjective domains that lack the partial verifiability of code (essays, open-ended argumentation) would test whether the strong-vs-weak-GEA stratification we observe generalises beyond this setting. The rubric-decomposition argument and complementary mitigations (cross-model scoring, epistemic-state specification, multi-sample evaluation) are supported by prior work and by the per-skill GEA contrast in Table [2](https://arxiv.org/html/2605.19529#S4.T2), but we do not directly ablate rubric granularity or scorer identity on the same task; comparing holistic against decomposed rubrics, and same-model against cross-model scoring, is the most immediate empirical follow-up. Finally, GEA can in principle fail to detect bias when generator and evaluator share an identical distortion (for example, a sycophantic scorer that deflates an inflated generator’s score to match user expectations), and precisely quantifying this residual risk requires human-scored anchor responses to decompose GEA failure between the two functions (Section [6](https://arxiv.org/html/2605.19529#S6)). The intended skill level $x$ nonetheless remains an external reference that distinguishes GEA measurement from pure closed-loop self-validation, and GEA provides a concrete, measurable criterion any LLM-based assessment system can report before deployment.

## References

- American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014) Standards for educational and psychological testing. American Educational Research Association, Washington, DC. External Links: ISBN 978-0935302356. Cited by: [§2.4](https://arxiv.org/html/2605.19529#S2.SS4.p1.1).
- K. T. Han and F. Guo (2011) Potential impact of item parameter drift due to practice and curriculum change on item calibration in computerized adaptive testing. GMAC Research Report, Technical Report RR-11-02, Graduate Management Admission Council. Cited by: [§1](https://arxiv.org/html/2605.19529#S1.p1.1).
- L. Korthals, E. Akrong, G. Geller, H. Rosenbusch, R. Grasman, and I. Visser (2026) Towards reliable LLM grading through self-consistency and selective human review: higher accuracy, less work. Machine Learning and Knowledge Extraction 8(3), pp. 74. External Links: [Document](https://dx.doi.org/10.3390/make8030074), [Link](https://www.mdpi.com/2504-4990/8/3/74). Cited by: [§5.4](https://arxiv.org/html/2605.19529#S5.SS4.SSS0.Px1.p1.1).
- G. Lee, E. Latif, X. Wu, N. Liu, and X. Zhai (2024a) Applying large language models and chain-of-thought for automatic scoring. External Links: 2312.03748, [Link](https://arxiv.org/abs/2312.03748). Cited by: [§5.2](https://arxiv.org/html/2605.19529#S5.SS2.p3.1).
- N. Lee, J. Hong, and J. Thorne (2024b) Evaluating the consistency of LLM evaluators. External Links: 2412.00543, [Link](https://arxiv.org/abs/2412.00543). Cited by: [§2.2](https://arxiv.org/html/2605.19529#S2.SS2.p1.1).
- Y. Liu, S. Bhandari, and Z. A. Pardos (2024) Leveraging LLM-respondents for item evaluation: a psychometric analysis. External Links: 2407.10899, [Link](https://arxiv.org/abs/2407.10899). Cited by: [§1](https://arxiv.org/html/2605.19529#S1.p2.1).
- F. M. Lord (1980) Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates, Hillsdale, NJ. External Links: ISBN 0898590060. Cited by: [§1](https://arxiv.org/html/2605.19529#S1.p1.1).
- L. Marquez-Carpintero, A. Lopez-Sellers, and M. Cazorla (2025) Simulating students with large language models: a review of architecture, mechanisms, and role modelling in education with generative AI. External Links: 2511.06078, [Link](https://arxiv.org/abs/2511.06078). Cited by: [§1](https://arxiv.org/html/2605.19529#S1.p2.1).
- S. Messick (1989) Validity. In Educational Measurement, R. L. Linn (Ed.), pp. 13–103. Cited by: [§2.4](https://arxiv.org/html/2605.19529#S2.SS4.p1.1).
- A. Y. Ng and M. I. Jordan (2001) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Vol. 14. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2001/file/7b7a53e239400a13bd6be6c91c4f6c4e-Paper.pdf). Cited by: [§2.1](https://arxiv.org/html/2605.19529#S2.SS1.p1.2).
- J. Oh, E. Kim, I. Cha, and A. Oh (2024) The generative AI paradox on evaluation: what it can solve, it may not evaluate. External Links: 2402.06204, [Link](https://arxiv.org/abs/2402.06204). Cited by: [§2.1](https://arxiv.org/html/2605.19529#S2.SS1.p1.2).
- A. Panickssery, S. R. Bowman, and S. Feng (2024) LLM evaluators recognize and favor their own generations. External Links: 2404.13076, [Link](https://arxiv.org/abs/2404.13076). Cited by: [§2.2](https://arxiv.org/html/2605.19529#S2.SS2.p1.1), [§5.4](https://arxiv.org/html/2605.19529#S5.SS4.SSS0.Px1.p1.1).
- D. Ramesh and S. K. Dash (2022) An analysis on the state of automated essay scoring. External Links: 2205.04083, [Link](https://arxiv.org/abs/2205.04083). Cited by: [§2.4](https://arxiv.org/html/2605.19529#S2.SS4.p1.1).
- K. A. Srivatsa, K. K. Maurya, and E. Kochmar (2025) Can LLMs reliably simulate real students’ abilities in mathematics and reading comprehension?. External Links: 2507.08232, [Link](https://arxiv.org/abs/2507.08232). Cited by: [§2.2](https://arxiv.org/html/2605.19529#S2.SS2.p1.1).
- W. J. van der Linden and C. A. W. Glas (2010) Elements of adaptive testing. Statistics for Social and Behavioral Sciences, Springer, New York. External Links: [Document](https://dx.doi.org/10.1007/978-0-387-85461-8), ISBN 978-0-387-85459-5. Cited by: [§1](https://arxiv.org/html/2605.19529#S1.p1.1).
- Y. Wang, J. Huang, L. Du, Y. Guo, Y. Liu, and R. Wang (2025a) Evaluating large language models as raters in large-scale writing assessments: a psychometric framework for reliability and validity. Computers and Education: Artificial Intelligence 9, pp. 100481. External Links: [Document](https://dx.doi.org/10.1016/j.caeai.2025.100481). Cited by: [§2.3](https://arxiv.org/html/2605.19529#S2.SS3.p1.1).
- Y. Wang, Z. Ding, X. Wu, S. Sun, N. Liu, and X. Zhai (2025b) AutoSCORE: enhancing automated scoring with multi-agent large language models via structured component recognition. External Links: 2509.21910, [Link](https://arxiv.org/abs/2509.21910). Cited by: [item 4](https://arxiv.org/html/2605.19529#S5.I1.i4.p1.1), [§5.1](https://arxiv.org/html/2605.19529#S5.SS1.p2.1), [§5.2](https://arxiv.org/html/2605.19529#S5.SS2.p2.1).
- K. Wataoka, T. Takahashi, and R. Ri (2024) Self-preference bias in LLM-as-a-judge. External Links: 2410.21819, [Link](https://arxiv.org/abs/2410.21819). Cited by: [§2.2](https://arxiv.org/html/2605.19529#S2.SS2.p1.1), [§5.4](https://arxiv.org/html/2605.19529#S5.SS4.SSS0.Px1.p1.1).
- P. West, X. Lu, N. Dziri, F. Brahman, L. Li, J. D. Hwang, L. Jiang, J. Fisher, A. Ravichander, K. Chandu, B. Newman, P. W. Koh, A. Ettinger, and Y. Choi (2023) The generative AI paradox: “what it can create, it may not understand”. External Links: 2311.00059, [Link](https://arxiv.org/abs/2311.00059). Cited by: [§2.1](https://arxiv.org/html/2605.19529#S2.SS1.p1.2).
- T. Wu, J. Chen, W. Lin, M. Li, Y. Zhu, A. Li, K. Kuang, and F. Wu (2025) Embracing imperfection: simulating students with diverse cognitive levels using LLM-based agents. External Links: 2505.19997, [Link](https://arxiv.org/abs/2505.19997). Cited by: [§2.2](https://arxiv.org/html/2605.19529#S2.SS2.p1.1).
- X. Wu, P. P. Saraf, G. Lee, E. Latif, N. Liu, and X. Zhai (2024) Unveiling scoring processes: dissecting the differences between LLMs and human graders in automatic scoring. External Links: 2407.18328, [Link](https://arxiv.org/abs/2407.18328). Cited by: [§5.2](https://arxiv.org/html/2605.19529#S5.SS2.p1.2), [§5.2](https://arxiv.org/html/2605.19529#S5.SS2.p4.1).
- L. Yoshida (2025) Do we need a detailed rubric for automated essay scoring using large language models?. External Links: 2505.01035, [Link](https://arxiv.org/abs/2505.01035). Cited by: [item 5](https://arxiv.org/html/2605.19529#S5.I1.i5.p1.1).
- Z. Yuan, Y. Xiao, M. Li, W. Xuan, R. Tong, M. Diab, and T. Mitchell (2026) Towards valid student simulation with large language models. External Links: 2601.05473, [Link](https://arxiv.org/abs/2601.05473). Cited by: [§2.2](https://arxiv.org/html/2605.19529#S2.SS2.p1.1), [§5.4](https://arxiv.org/html/2605.19529#S5.SS4.SSS0.Px2.p1.1).
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2024) Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. External Links: 2306.05685, [Link](https://arxiv.org/abs/2306.05685). Cited by: [§2.4](https://arxiv.org/html/2605.19529#S2.SS4.p1.1).
- T. Zheng, Z. Jiang, J. Liu, and S. Feng (2026) Leveraging computerized adaptive testing for cost-effective evaluation of large language models in medical benchmarking. External Links: 2603.23506, [Link](https://arxiv.org/abs/2603.23506). Cited by: [§1](https://arxiv.org/html/2605.19529#S1.p2.1).

## Appendix A Assessment Architecture

The system is a conversational assessment tool delivered via Telegram Bot that conducts adaptive, scenario\-based coding assessments\. Students work through multiple progressive mini coding assignments per stage, evaluated by Claude AI against predefined rubrics\. The system routes students to a terminal difficulty level based on cumulative stage performance\.

#### Two\-stage adaptive routing\.

All students begin with Stage 1 (2 assignments covering class basics and composition). Based on their cumulative Stage 1 score relative to threshold $\theta$, students are routed to either the *High Performer* path (inheritance and exception handling) or the *Low Performer* path (reinforcement of foundational skills). After Stage 2, a terminal level is assigned:

- Advanced: Stage 2 High path, cumulative $\geq \theta$
- Intermediate: Stage 2 High path, cumulative $< \theta$; *or* Stage 2 Low path, cumulative $\geq \theta$
- Beginner: Stage 2 Low path, cumulative $< \theta$

#### Dynamic question generation\.

Questions are not pre-stored. Claude generates each assignment dynamically at runtime based on the rubric, the current stage/path, and a scenario template. Each student receives a slightly different variation of the same assignment, with the same core objective and difficulty but different scenario entities (e.g., bank account, airline, cinema), which discourages copying. Every assignment begins with an ASCII UML class diagram showing the class(es) the student must implement.

#### Scoring\.

Each submission is evaluated by Claude against the rubric for that assignment. The model returns a 24-element skill vector $\hat{\mathbf{x}} \in (\{-1.0\} \cup [0,1])^{24}$ (where $-1.0$ denotes inapplicable skills) and a scalar score $s = \mathrm{round}(\mathrm{mean}(\hat{x}_i : \hat{x}_i \neq -1) \times 100)$. The scalar score drives routing; the skill vector provides diagnostic detail.
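The scalar score can be reproduced from a skill vector with a short helper. This is a minimal sketch: the function name and the input checks are additions for illustration, and Python's built-in `round` resolves exact .5 ties to the nearest even integer, which may differ from the tie-breaking used in the deployed system.

```python
from typing import Sequence

NOT_APPLICABLE = -1.0  # sentinel for skills not assessed in this slot
N_SKILLS = 24


def scalar_score(skill_vector: Sequence[float]) -> int:
    """Mean of the applicable skill scores, scaled to 0-100 and rounded."""
    if len(skill_vector) != N_SKILLS:
        raise ValueError(f"expected {N_SKILLS} skills, got {len(skill_vector)}")
    applicable = [x for x in skill_vector if x != NOT_APPLICABLE]
    if not applicable:
        raise ValueError("no applicable skills to average")
    if any(not 0.0 <= x <= 1.0 for x in applicable):
        raise ValueError("applicable scores must lie in [0, 1]")
    return round(sum(applicable) / len(applicable) * 100)


# Three applicable skills (1.0, 0.5, 0.0); the other 21 are not assessed.
example = [1.0, 0.5, 0.0] + [NOT_APPLICABLE] * 21
print(scalar_score(example))  # 50
```

Excluding the $-1.0$ entries before averaging matters: treating them as zeros, or as negative scores, would drag every slot's score down in proportion to how few skills it tests.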

## Appendix B Skill Taxonomy

The 24 skills are organised into four progressive groups corresponding to the course’s lab sequence. Skills marked with $\star$ are mandatory anchors in the assessment rubric.

Table 5: Complete 24-skill taxonomy with descriptions and demonstration criteria.

## Appendix C Assessment Slot–Skill Coverage

Table [6](https://arxiv.org/html/2605.19529#A3.T6) shows which skills are assessed (scored $0.0$–$1.0$) in each of the 6 assessment slots. Skills not listed for a slot receive $-1.0$ (not applicable) in the skill vector.

Table 6: Slot-to-skill mapping. Each slot tests a designated subset of the 24 skills; remaining skills are marked $-1.0$.

## Appendix D Proficiency Scale

Continuous skill scores are mapped to ordinal proficiency levels using the boundaries in Table [7](https://arxiv.org/html/2605.19529#A4.T7). These levels are used for the confusion matrix analysis (Section [4.3](https://arxiv.org/html/2605.19529#S4.SS3)) and for natural-language descriptors in student profiles.

Table 7: Proficiency level boundaries.

## Appendix E Student Profile Archetypes

150 synthetic student profiles were sampled from 10 archetypes (Table [8](https://arxiv.org/html/2605.19529#A5.T8)). Each archetype defines per-group skill ranges from which individual skill scores are drawn uniformly, with Gaussian noise ($\sigma = 0.04$) added to produce within-archetype variation. Profiles were generated with a fixed random seed for reproducibility.
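The sampling procedure can be sketched as below. The sub-group sizes follow the Table 8 caption (A = S01–S08 through D = S22–S24) and $\sigma = 0.04$ comes from the text, but the `[lo, hi]` ranges, the seed value, and the clipping of noisy scores to $[0, 1]$ are assumptions made for this illustration.

```python
import random
from typing import Dict, List, Tuple

# Skills per sub-group, from the Table 8 caption (8 + 5 + 3 + 2 + 3 + 3 = 24).
GROUP_SIZES: Dict[str, int] = {"A": 8, "B": 5, "C1": 3, "C2": 2, "C3": 3, "D": 3}

# Hypothetical archetype ranges; the actual Table 8 values are not reproduced here.
HYPOTHETICAL_RANGES: Dict[str, Tuple[float, float]] = {
    "A": (0.7, 0.9), "B": (0.3, 0.6), "C1": (0.0, 0.2),
    "C2": (0.0, 0.1), "C3": (0.0, 0.1), "D": (0.0, 0.1),
}


def sample_profile(group_ranges: Dict[str, Tuple[float, float]],
                   group_sizes: Dict[str, int],
                   rng: random.Random,
                   sigma: float = 0.04) -> List[float]:
    """Draw one 24-skill profile: uniform within each range plus Gaussian noise."""
    skills: List[float] = []
    for group, n_skills in group_sizes.items():
        lo, hi = group_ranges[group]
        for _ in range(n_skills):
            score = rng.uniform(lo, hi) + rng.gauss(0.0, sigma)
            skills.append(min(1.0, max(0.0, score)))  # clipping is an assumption
    return skills


rng = random.Random(2024)  # fixed seed; the value itself is arbitrary
profile = sample_profile(HYPOTHETICAL_RANGES, GROUP_SIZES, rng)
```

Passing an explicit `random.Random` instance, rather than seeding the global generator, keeps profile generation reproducible even when other code in the pipeline also draws random numbers.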

Table 8: Archetype definitions. Each cell shows the $[\text{lo}, \text{hi}]$ range for uniform sampling of skill scores within that sub-group. Sub-groups: A = S01–S08 (class basics), B = S09–S13 (composition), C1 = S14–S16 (basic subclass), C2 = S17–S18 (mandatory overrides), C3 = S19–S21 (advanced inheritance), D = S22–S24 (exceptions).

#### Example profile.

Table [9](https://arxiv.org/html/2605.19529#A5.T9) shows a representative “Lab 2 Developing” profile. The student has strong class basics (Group A, mean $0.84$), partial composition skills (Group B, mean $0.46$), and minimal inheritance/exception knowledge (Groups C–D, $<0.10$). Each skill carries a natural-language descriptor (shown for selected skills) that is included in the generation prompt to constrain the LLM’s code output.

Table 9: Excerpt from student profile 0114 (Lab 2 Developing archetype, overall score 0.41).

## Appendix F Prompt Templates

### F.1 Code Generation Prompt (Simulation)

For each (student, slot) pair, the following prompt is sent to Claude Sonnet 4.6 to generate code that matches the student’s skill profile. The `STUDENT SKILL PROFILE` block lists only the skills applicable to the current slot, with each skill’s numeric score, proficiency level, and natural-language descriptor.

> You are simulating a student submitting a Python OOP coding assignment. The student’s exact skill profile for the skills tested in this assignment is provided below.
>
> Write Python code that a student at PRECISELY these skill levels would produce. Faithfully reflect each described weakness and strength --- do not average them out or homogenise the code.
>
> STUDENT SKILL PROFILE (relevant skills only):
>
> - S01 Class Definition: 0.82 (Advanced) --- Class is well-defined, correctly named, and structurally complete with multiple methods.
> - S03 Private Instance Variables: 0.70 (Proficient) --- self.\_\_attr used for key attributes with property getters; not all attributes are private.
> - ...
>
> ASSIGNMENT: {the generated assignment text}
>
> Rules:
>
> 1. Output Python code only --- no explanations, no markdown fences, no preamble.
> 2. For skills rated "Not Demonstrated" or "Beginning", the code must clearly exhibit the described gap.
> 3. For skills rated "Advanced" or "Mastered", that aspect of the code must be correct and complete.
> 4. Each skill reflects its own level independently --- the code can be strong in one area and weak in another.
> 5. Use realistic student-style naming and formatting consistent with the described skill levels.

### F.2 Scoring Prompt (Evaluation)

The scoring function sends the full rubric document \(including per\-assignment skill vector tables with scoring guidance\) along with the student’s submission\. The prompt instructs the model to:

> You are a coding assessment scorer for a Python OOP course. The full rubric is below.
>
> RUBRIC: {full RUBRICS.md content}
>
> \-\-\-
>
> Score the student’s submission for:
>
> - Stage: {stage}
> - Path: {path}
> - Assignment: {n} of 2
>
> Assignment given to the student: """{question text}"""
>
> Student’s submission: """{student code}"""
>
> Instructions:
>
> 1. Locate the rubric section for this stage, path, and assignment number.
> 2. Fill in the 24-element skill\_vector exactly as defined in the Skill Vector table.
>    - Use -1.0 for skills marked -1.0 (not applicable).
>    - Use a float 0.0--1.0 for all other skills, following the scoring guidance.
>    - Use intermediate values (e.g. 0.3, 0.7) freely.
> 3. Compute score = round(mean(v\_i for v\_i if v\_i != -1.0) \* 100).
> 4. Write 2--4 sentences of constructive feedback.
>
> Return ONLY a valid JSON object: {"score": <int\>, "feedback": "<text\>", "skill\_vector": [<s01\>, ..., <s24\>]}
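Because the prompt asks for a bare JSON object, a caller would typically validate the reply before using it for routing. The sketch below shows one way to do so; the function name, the added `score_consistent` field, and the decision to recompute the score locally are assumptions for illustration rather than a description of the deployed system.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = {"score", "feedback", "skill_vector"}


def parse_scoring_response(raw: str) -> Dict[str, Any]:
    """Parse the scorer's JSON reply and check it against the prompt's contract."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    vector = data["skill_vector"]
    if len(vector) != 24:
        raise ValueError(f"expected 24 skills, got {len(vector)}")
    applicable = [v for v in vector if v != -1.0]
    if not applicable or any(not 0.0 <= v <= 1.0 for v in applicable):
        raise ValueError("applicable skills must exist and lie in [0, 1]")
    # Recompute the scalar score so arithmetic slips by the model are visible.
    expected = round(sum(applicable) / len(applicable) * 100)
    data["score_consistent"] = data["score"] == expected
    return data
```

A reply whose reported `score` disagrees with the recomputed mean can then be logged or rescored instead of silently driving a routing decision.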

### F.3 Question Generation Prompt

The question generator receives the full rubric and is instructed to locate the correct section for the given stage, path, and assignment number, substitute the student’s scenario entity into the template, produce the class diagram followed by coding instructions, and return only the assignment text the student will see\. Scenario entities are drawn from pre\-defined lists:

- Stage 1: bank account, airline booking, cinema, grade tracker, event planner, inventory, point-of-sale, membership
- Stage 2 High: banking hierarchy, airline, vehicle fleet, cinema chain, course catalogue, training programme
- Stage 2 Low: contact book, recipe manager, budget tracker, shopping list, event log, book collection

Entities are assigned deterministically per student \(seeded on student ID\) to ensure reproducibility across simulation restarts\.
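One way to obtain a restart-stable assignment is to derive the list index from a hash of the student ID, as sketched below. The text only says that assignment is seeded on student ID, so the SHA-256 scheme and the function name are assumptions for this illustration. Python's built-in `hash()` is avoided because string hashes are randomised per process by default.

```python
import hashlib
from typing import List

STAGE1_ENTITIES: List[str] = [
    "bank account", "airline booking", "cinema", "grade tracker",
    "event planner", "inventory", "point-of-sale", "membership",
]


def assign_entity(student_id: str, entities: List[str]) -> str:
    """Pick an entity that depends only on the student ID and the list."""
    if not entities:
        raise ValueError("entity list is empty")
    digest = hashlib.sha256(student_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(entities)
    return entities[index]
```

Calling `assign_entity("student-0114", STAGE1_ENTITIES)` returns the same entity on every run and on every machine, which is the property the simulation needs when it is restarted.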

## Appendix G Rubric Excerpts: Strong vs. Weak GEA Skills

To illustrate the rubric specificity hypothesis (Section [5.1](https://arxiv.org/html/2605.19529#S5.SS1)), we present the scoring guidance for a strong-GEA skill and a near-zero-GEA skill as they appear in the rubric’s skill vector tables.

#### S05: Setter with Validation ($r = 0.88$, strong GEA).

The rubric specifies concrete syntactic markers at each score level:

> Scoring guidance: 1.0 = `@attr.setter` with a meaningful validation rule; 0.5 = setter present but no validation logic; 0.0 = absent.

The 8\-level proficiency descriptors used in student profiles further constrain generation:

Table 10: S05 proficiency descriptors (excerpt).

The binary nature of the criterion (setter with validation present or absent) provides an unambiguous code signature that both the generator and evaluator can reliably target.

#### S21: Polymorphism via Shared Interface ($r = -0.03$, near-zero GEA).

The rubric criterion is holistic rather than syntactic:

> Scoring guidance: 1.0 = loop/function calls same method on mixed-type list without `isinstance` checks; 0.5 = loop present but uses type checks; 0.0 = no polymorphic usage.

The corresponding proficiency descriptors:

Table 11: S21 proficiency descriptors (excerpt).

Polymorphism is a *design choice* (choosing to iterate over a heterogeneous list) rather than a syntactic pattern, making it harder for the generator to “partially implement” and harder for the evaluator to grade on a continuous scale. The descriptor levels describe *degrees of design completeness* rather than presence/absence of identifiable code tokens.

## Appendix H Routing Logic

The routing state machine operates as follows. Let $\bar{s}_k$ denote the cumulative (mean) score after completing all assignments in stage $k$.

1. Stage 1 completion: Compute $\bar{s}_1$. If $\bar{s}_1 \geq \theta$, route to Stage 2 High path; otherwise Stage 2 Low path.
2. Stage 2 completion: Compute $\bar{s}_2$.
   - High path: if $\bar{s}_2 \geq \theta$ then *Advanced*; else *Intermediate*.
   - Low path: if $\bar{s}_2 \geq \theta$ then *Intermediate*; else *Beginner*.

The threshold $\theta = 50$ is a placeholder pending real-student calibration. The threshold sensitivity analysis (Table [3](https://arxiv.org/html/2605.19529#S6.T3)) confirms a stability plateau at $\theta \in [45, 55]$ where $<5\%$ of routing decisions change.

Each stage comprises exactly 2 assignments. Scores within a stage are averaged (not summed), so $\bar{s}_k = \frac{1}{2}\sum_{j=1}^{2} s_{k,j}$ where $s_{k,j} \in [0, 100]$. Sessions are stored in memory keyed by Telegram user ID and do not survive bot restarts.
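The routing rules in this appendix translate directly into two small functions. The sketch below follows the stated logic (two assignments per stage, stage mean compared with $\theta$); the function names and string labels are illustrative choices, not the system's actual identifiers.

```python
from statistics import mean
from typing import Sequence

ASSIGNMENTS_PER_STAGE = 2


def stage_mean(scores: Sequence[float]) -> float:
    """Mean of the two assignment scores in a stage, each in [0, 100]."""
    if len(scores) != ASSIGNMENTS_PER_STAGE:
        raise ValueError("each stage comprises exactly 2 assignments")
    return mean(scores)


def stage2_path(stage1_scores: Sequence[float], theta: float = 50.0) -> str:
    """Stage 1 completion: choose the Stage 2 path."""
    return "high" if stage_mean(stage1_scores) >= theta else "low"


def terminal_level(path: str, stage2_scores: Sequence[float],
                   theta: float = 50.0) -> str:
    """Stage 2 completion: assign the terminal level for the given path."""
    passed = stage_mean(stage2_scores) >= theta
    if path == "high":
        return "Advanced" if passed else "Intermediate"
    if path == "low":
        return "Intermediate" if passed else "Beginner"
    raise ValueError(f"unknown path: {path!r}")
```

For example, Stage 1 scores of 60 and 40 average exactly 50, which meets the threshold and routes the student to the High path; Stage 2 scores of 30 and 40 on that path then yield *Intermediate*.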
