A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

arXiv cs.CL Papers

Summary

This paper introduces MAFIG, a multi-agent framework that leverages LLM agents and feature-specific evaluators to generate reading comprehension items with controlled difficulty by adhering to specified feature constraints. Experiments show MAFIG achieves significantly higher constraint satisfaction and robust difficulty control compared to baseline methods.

arXiv:2605.19316v1 Announce Type: new Abstract: Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

Cached at: 05/20/26, 08:25 AM

# A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
Source: [https://arxiv.org/html/2605.19316](https://arxiv.org/html/2605.19316)
Seonjeong Hwang¹, Jun Seo¹, Hyounghun Kim¹,², Gary Geunbae Lee¹,². ¹Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; ²Department of Computer Science and Engineering, POSTECH, Republic of Korea. {seonjeongh, sjin4861, h.kim, gblee}@postech.ac.kr

###### Abstract

Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.19316v1/x1.png)

Figure 1: Example of feature-constrained difficulty control in multiple-choice RC item generation.

Reading comprehension (RC) items are crucial in both language education and proficiency assessment. With the continuing expansion of e-learning and computer-based testing, there is a substantial need for methods that can automatically generate high-quality items encompassing a broad spectrum of difficulty levels. Recent studies have confirmed that large language models (LLMs) can generate linguistically fluent and pedagogically sound RC items Xiao et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib13)); Bezirhan and von Davier ([2023](https://arxiv.org/html/2605.19316#bib.bib25)); Lee et al. ([2024](https://arxiv.org/html/2605.19316#bib.bib14)); Mucciaccia et al. ([2025](https://arxiv.org/html/2605.19316#bib.bib15)). Nevertheless, the fine-grained control over item difficulty using LLMs remains largely underexplored.

Prior research on difficulty control for RC item generation primarily follows two paradigms. The first approach utilizes statistical frameworks, such as item response theory (IRT) Lord ([1980](https://arxiv.org/html/2605.19316#bib.bib51)), to assign difficulty parameters and subsequently train difficulty-aware generative models Uto et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib22)); Tomikawa and Uto ([2024](https://arxiv.org/html/2605.19316#bib.bib26)); Tomikawa and Masaki ([2024](https://arxiv.org/html/2605.19316#bib.bib27)). Although this approach enables psychometrically calibrated control, it necessitates substantial learner response data and suffers from limited scalability across diverse item formats. In parallel, educational measurement research has long investigated difficulty-related item features through item difficulty modeling to guide human item writers in crafting items with targeted difficulty levels Ferrara et al. ([2022](https://arxiv.org/html/2605.19316#bib.bib42)). Building on this, the second paradigm involves manipulating these features (such as Bloom’s cognitive levels Bloom et al. ([1956](https://arxiv.org/html/2605.19316#bib.bib60)) or linguistic attributes like word count and vocabulary level) to modulate difficulty Elkins et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib19)); Hwang et al. ([2024](https://arxiv.org/html/2605.19316#bib.bib30)); Yaacoub et al. ([2025](https://arxiv.org/html/2605.19316#bib.bib29)); Chen and Shiu ([2025](https://arxiv.org/html/2605.19316#bib.bib28)); Oka et al. ([2025](https://arxiv.org/html/2605.19316#bib.bib33)). While the robust instruction-following capabilities of LLMs offer promise in this direction, existing methods largely rely on direct prompting or stochastic sampling. Consequently, they often fail to adhere to specified feature constraints, thereby undermining the reliability of difficulty control.

To bridge this gap, we propose MAFIG, a Multi-Agent framework for Feature-constrained Item Generation. MAFIG is designed to generate RC items that strictly conform to multi-dimensional feature specifications as illustrated in Figure [1](https://arxiv.org/html/2605.19316#S1.F1). The framework operates through a collaborative system of role-specialized LLM agents and feature-specific evaluators. By leveraging an iterative refinement process, these agents incorporate both external domain knowledge (e.g., standardized vocabulary levels) and their internal reasoning capabilities to ensure rigorous constraint satisfaction. While MAFIG is designed for precise adherence to feature constraints, translating this capability into systematic difficulty control necessitates a constraint sequence that predictably yields items across escalating difficulty levels. Toward this end, we further propose a methodology for constructing difficulty-calibrated constraint sequences, integrating pedagogical principles with empirical verification to ensure a monotonic progression of item complexity.

We evaluate the proposed framework against two baseline approaches: (1) Level-based control, where the LLM generates items based on coarse-grained difficulty indicators (e.g., Level 1 to $N$) relying solely on its internal heuristics; and (2) Feature-based direct prompting, where the LLM is instructed to satisfy all feature constraints within a single-pass generation. Our experimental results demonstrate that MAFIG achieves state-of-the-art performance in both constraint satisfaction and difficulty calibration. Notably, we find that baselines lacking an iterative revision process struggle to satisfy multi-dimensional constraints, leading to inconsistent difficulty alignment, even when leveraging frontier reasoning models such as GPT-5 OpenAI ([2025](https://arxiv.org/html/2605.19316#bib.bib94)).

Our contributions are summarized as follows:

- We introduce MAFIG, a multi-agent framework that systematically generates RC items that strictly adhere to multi-dimensional feature constraints.
- We propose a novel methodology for constructing difficulty-calibrated constraint sequences, enabling the generation of RC items with consistently distinguishable and ordered difficulty levels.
- Through extensive experiments, we demonstrate that MAFIG significantly outperforms baselines in both constraint satisfaction and difficulty calibration. Our results suggest that adherence to fine-grained item features may play an important role in achieving more reliable difficulty control.

## 2 Related Work

#### LLM\-based Item Generation and Evaluation\.

Recent advancements in LLMs have facilitated the zero-shot synthesis of test items across diverse pedagogical domains. Without task-specific fine-tuning, LLMs are capable of producing linguistically coherent and semantically rigorous questions Elkins et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib19)); Bezirhan and von Davier ([2023](https://arxiv.org/html/2605.19316#bib.bib25)); Lee et al. ([2024](https://arxiv.org/html/2605.19316#bib.bib14)). Beyond generation, contemporary research has explored the role of LLMs as evaluative agents to verify answerability, factual consistency, and distractor quality Säuberli and Clematide ([2024](https://arxiv.org/html/2605.19316#bib.bib17)); Mucciaccia et al. ([2025](https://arxiv.org/html/2605.19316#bib.bib15)). Furthermore, LLMs have been employed as simulated students to analyze item difficulty and pedagogical alignment Lu and Wang ([2024](https://arxiv.org/html/2605.19316#bib.bib39)); Park et al. ([2024](https://arxiv.org/html/2605.19316#bib.bib40)). Collectively, these studies underscore a paradigm shift where LLMs serve as multifaceted components (both as generators and evaluators) within automated assessment pipelines.

#### Difficulty\-Controllable Item Generation\.

Early endeavors in difficulty-controllable generation primarily relied on large-scale datasets labeled with difficulty parameters, often derived from IRT Lord ([1980](https://arxiv.org/html/2605.19316#bib.bib51)) or other pedagogical criteria Gao et al. ([2018](https://arxiv.org/html/2605.19316#bib.bib20)); Uto et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib22)); Tomikawa and Uto ([2024](https://arxiv.org/html/2605.19316#bib.bib26)); Tomikawa et al. ([2024](https://arxiv.org/html/2605.19316#bib.bib23)). However, such data-driven methods are costly and frequently lack interpretability regarding the latent factors driving item difficulty. Consequently, recent research has pivoted toward prompt-based control, where target item types and difficulty levels are specified via natural language instructions. In particular, prompting LLMs through cognitive taxonomies, such as Bloom’s levels Bloom et al. ([1956](https://arxiv.org/html/2605.19316#bib.bib60)), has been widely explored to align generated items with specific reasoning demands Li and Zhang ([2024](https://arxiv.org/html/2605.19316#bib.bib21)); Yaacoub et al. ([2025](https://arxiv.org/html/2605.19316#bib.bib29)).

Despite their promise, Bloom-level prompting often exhibits inconsistent control over reasoning depth Elkins et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib19)); Hwang et al. ([2024](https://arxiv.org/html/2605.19316#bib.bib30)). Furthermore, given the difficulty variance within identical cognitive levels and the predominance of lower-level cognitive tasks (i.e., Remember and Understand) in high-stakes tests Baghaei et al. ([2020](https://arxiv.org/html/2605.19316#bib.bib44)), it is evident that such taxonomies are insufficient for achieving fine-grained calibration. While some recent studies have attempted more granular feature control Chen and Shiu ([2025](https://arxiv.org/html/2605.19316#bib.bib28)); Oka et al. ([2025](https://arxiv.org/html/2605.19316#bib.bib33)), they lack a systematic mechanism for refining items when LLMs fail to strictly satisfy specified constraints. Our work bridges this gap by introducing a multi-agent framework that iteratively revises items to ensure rigorous adherence to the feature constraints necessary for precise difficulty control.

#### Constraint\-Satisfaction Generation with LLM Agents\.

Recent advancements have repositioned LLMs as autonomous agents capable of assuming diverse roles. By integrating mechanisms such as strategic planning, self-reflection, inter-agent collaboration, and tool-augmented reasoning, these agents can navigate complex tasks and satisfy intricate, user-defined objectives Yao et al. ([2022](https://arxiv.org/html/2605.19316#bib.bib11)); Shinn et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib12)); Madaan et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib4)); Talebirad and Nadiri ([2023](https://arxiv.org/html/2605.19316#bib.bib3)). Such frameworks have demonstrated substantial efficacy in various constraint-satisfaction tasks, including controllable summarization Ryu et al. ([2024](https://arxiv.org/html/2605.19316#bib.bib10)); Retkowski and Waibel ([2025](https://arxiv.org/html/2605.19316#bib.bib9)) and chart generation Li et al. ([2025](https://arxiv.org/html/2605.19316#bib.bib8)). However, despite these technical strides, the application of multi-agent collaboration to educational assessment, where linguistic, factual, and cognitive constraints must be satisfied simultaneously, remains an underexplored frontier. Our work bridges this gap by extending the multi-agent constraint-satisfaction paradigm to the domain of RC item generation.

## 3 Method

![Refer to caption](https://arxiv.org/html/2605.19316v1/x2.png)

Figure 2: Overview of the MAFIG generation pipeline.

### 3.1 Task Formulation

In this study, we focus on the multiple-choice factual information (MCFI) format, where test-takers must identify statements that are factually consistent with a given reading passage. Specifically, we define an item as a triplet comprising a reading passage, a question stem (e.g., “According to the passage, which statement is true?”), and a set of options. Our framework generates an item by taking two inputs: a source document, which dictates the core content, and a set of feature constraints that determine the target difficulty level. Drawing upon established literature that investigates difficulty-related attributes Bormuth et al. ([1970](https://arxiv.org/html/2605.19316#bib.bib54)); Anderson ([1972](https://arxiv.org/html/2605.19316#bib.bib55)); Park ([2004](https://arxiv.org/html/2605.19316#bib.bib58)); Rafatbakhsh and Ahmadi ([2023](https://arxiv.org/html/2605.19316#bib.bib59)), we formalize six feature variables that govern either cognitive demand or item validity: vocabulary level, passage length, average sentence length, reasoning complexity, factuality, and neutrality. Detailed definitions and the operationalization of these features are provided in Appendix [A](https://arxiv.org/html/2605.19316#A1).

### 3.2 MAFIG

As illustrated in Figure [2](https://arxiv.org/html/2605.19316#S3.F2), MAFIG synthesizes multiple-choice RC items through two sequential stages: Passage Generation and Option Generation. Each stage incorporates a closed-loop generation and revision mechanism that produces the target component (either the passage or the set of options) while strictly adhering to the specified feature constraints. In the first stage, a passage is generated conditioned on the source document and passage-level constraints. This generated passage subsequently serves as the context for the option generation stage. In this second stage, the framework produces options that satisfy option-level constraints. The revision process is performed iteratively until all constraints are met or the predefined maximum iteration threshold is reached.
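The closed-loop mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `revise_until_satisfied`, the callback signatures, and the toy word-count constraint are all hypothetical, and the real system replaces the callbacks with the Evaluator and the LLM agents introduced below.

```python
from typing import Callable, List, Tuple


def revise_until_satisfied(
    draft: str,
    evaluate: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_iters: int,
) -> Tuple[str, bool, int]:
    """Evaluate the current state and revise it while violations remain.

    Returns the final state, whether all constraints are met, and the
    number of revision rounds that were used.
    """
    state = draft
    for round_index in range(max_iters):
        errors = evaluate(state)  # error report: names of violated constraints
        if not errors:
            return state, True, round_index
        state = revise(state, errors)
    # Iteration budget exhausted: report whether the last revision succeeded.
    return state, not evaluate(state), max_iters


# Toy stand-ins (hypothetical): one length constraint and a trivial reviser.
def too_long(text: str) -> List[str]:
    return ["passage_length"] if len(text.split()) > 5 else []


def drop_last_word(text: str, errors: List[str]) -> str:
    return " ".join(text.split()[:-1])


final, ok, rounds = revise_until_satisfied(
    "a b c d e f g h", too_long, drop_last_word, max_iters=20
)
print(final, ok, rounds)
```

In a two-stage setting, the same loop would run once with passage-level constraints and then again with option-level constraints, using the accepted passage as context.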

#### Evaluator\.

The Evaluator consists of a suite of specialized modules designed to quantify specific feature variables within the generated item. This component integrates rule-based modules, which leverage off-the-shelf NLP toolkits, with LLM judges for features requiring semantic understanding. By measuring the item’s attributes, the Evaluator assesses whether the specified constraints are satisfied, and generates a comprehensive error report that identifies each violation. Detailed information is provided in Appendix [A](https://arxiv.org/html/2605.19316#A1).
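To make the rule-based side of the Evaluator concrete, the sketch below measures two of the six features (passage length and average sentence length) and emits an error report. It rests on several assumptions that are not taken from the paper: constraints are modeled as inclusive `(low, high)` ranges, a regular-expression sentence splitter stands in for an off-the-shelf NLP toolkit, and the names `measure` and `error_report` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # assumed constraint format: inclusive (low, high)


def measure(passage: str) -> Dict[str, float]:
    """Measure surface features with naive whitespace and punctuation rules."""
    # Naive splitter used only for illustration; a real toolkit would be more robust.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    words = passage.split()
    return {
        "passage_length": float(len(words)),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


def error_report(passage: str, constraints: Dict[str, Range]) -> List[str]:
    """List every constraint whose measured value falls outside its range."""
    measured = measure(passage)
    report = []
    for name, (low, high) in constraints.items():
        value = measured[name]
        if not (low <= value <= high):
            report.append(f"{name}: measured {value:.1f}, expected {low}-{high}")
    return report


report = error_report(
    "The cat sat. It purred loudly on the mat.",
    {"passage_length": (5, 20), "avg_sentence_length": (6, 10)},
)
print(report)
```

Features that need semantic judgment, such as reasoning complexity or factuality, would be handled by LLM judges rather than by rules of this kind.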

#### Drafter\.

The Drafter synthesizes the initial item $state_0$ given the source context and targeted feature constraints. This agent samples multiple independent candidates to facilitate parallel revision, enabling more efficient exploration of the solution space and allowing for early termination.

#### Planner\.

Conditioned on the current $state_i$, the corresponding error report$_i$, and a revision memory containing plans from previous iterations, the Planner formulates a strategy to revise the item. This memory mechanism allows the Planner to synthesize revision strategies informed by the history of past modification attempts, thereby avoiding redundant or ineffective edits Zhang et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib18)); Shinn et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib12)). To further enhance the robustness of this process, we introduce a Creativity Enhancement Prompting strategy. Specifically, if a particular constraint remains unsatisfied for $t$ consecutive iterations, the Planner is instructed to shift from incremental adjustments to more radical revision strategies (such as excising problematic segments and regenerating them from scratch), thereby breaking the cycle of stagnation.

Following this strategic determination, the Planner: (1) selects the optimal agent to invoke (either the Reworder or the Editor) based on the nature of the violation, and (2) generates a granular instruction specifying the necessary modifications. In the option generation stage, where multiple options are processed concurrently, the Planner further designates which specific option requires intervention and provides tailored instructions.
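The stagnation trigger and the agent routing can be illustrated with a small rule-based stand-in. In the framework the Planner is an LLM, so the code below only mirrors the control logic described in the text; the class `StagnationTracker`, the function `choose_agent`, and the constraint names are hypothetical, and the routing rule (vocabulary violations to the Reworder, everything else to the Editor) is inferred from the agent descriptions in this section.

```python
from collections import Counter
from typing import Iterable


class StagnationTracker:
    """Count how many consecutive iterations each constraint has been violated."""

    def __init__(self, t: int) -> None:
        self.t = t  # threshold of consecutive failures before switching strategy
        self.streak: Counter = Counter()

    def update(self, violated: Iterable[str]) -> None:
        current = set(violated)
        # A constraint that is now satisfied loses its streak.
        for name in list(self.streak):
            if name not in current:
                del self.streak[name]
        for name in current:
            self.streak[name] += 1

    def mode(self, name: str) -> str:
        """Return 'radical' once a constraint has failed t times in a row."""
        return "radical" if self.streak[name] >= self.t else "incremental"


def choose_agent(constraint: str) -> str:
    """Route vocabulary violations to the Reworder and the rest to the Editor."""
    return "Reworder" if constraint == "vocabulary_level" else "Editor"


tracker = StagnationTracker(t=3)
for _ in range(3):
    tracker.update(["avg_sentence_length"])
print(tracker.mode("avg_sentence_length"), choose_agent("avg_sentence_length"))
```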

#### Reworder\.

The Reworder is specifically tasked with enforcing vocabulary-level constraints, a critical requirement in language proficiency assessments. Given that vocabulary regulations often vary across testing organizations and target languages, relying solely on the LLM’s internal heuristics is insufficient for ensuring rigorous alignment. To address this, we incorporate a retrieval-augmented generation (RAG) Lewis et al. ([2020](https://arxiv.org/html/2605.19316#bib.bib5)) mechanism that enables the Reworder to revise items by cross-referencing an external, level-specific vocabulary database.

The Reworder takes as input the context, the $state_i$, the target vocabulary level, and the instruction from the Planner; during option generation, it additionally receives the target option. The rewording process follows a three-step pipeline: (1) the Reworder suggests contextually appropriate alternatives for the level-violating words, (2) a rule-based retriever assigns a vocabulary level to each candidate based on the external database, and (3) the Reworder replaces the problematic terms with selected alternatives that align with the permitted level range. In cases where no valid replacements are available, the Reworder notifies the Planner that satisfying the current constraint is infeasible. Finally, the agent returns the updated $state_{i+1}$, accompanied by a message detailing any linguistic bottlenecks encountered during the process.
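The three-step pipeline can be sketched as follows. Everything here is an illustrative assumption: the tiny `VOCAB_LEVELS` dictionary stands in for the external vocabulary database, the `suggestions` mapping stands in for the alternatives an LLM would propose in step (1), the permitted range is simplified to a single maximum level, and words missing from the database are treated as permitted.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical level database (word -> level); not the resource used in the paper.
VOCAB_LEVELS: Dict[str, int] = {
    "use": 1, "employ": 3, "utilize": 4, "big": 1, "substantial": 4,
}


def lookup_level(word: str, db: Dict[str, int]) -> Optional[int]:
    """Step 2: rule-based retrieval of a word's level (None if unlisted)."""
    return db.get(word.lower())


def pick_replacement(
    candidates: List[str], max_level: int, db: Dict[str, int]
) -> Optional[str]:
    """Return the first candidate whose level is known and within the limit."""
    for word in candidates:
        level = lookup_level(word, db)
        if level is not None and level <= max_level:
            return word
    return None


def reword(
    tokens: List[str],
    suggestions: Dict[str, List[str]],
    max_level: int,
    db: Dict[str, int],
) -> Tuple[List[str], List[str]]:
    """Step 3: replace level-violating tokens and collect messages for the Planner."""
    new_tokens: List[str] = []
    messages: List[str] = []
    for tok in tokens:
        level = lookup_level(tok, db)
        if level is None or level <= max_level:  # assumption: unlisted words pass
            new_tokens.append(tok)
            continue
        replacement = pick_replacement(suggestions.get(tok, []), max_level, db)
        if replacement is None:
            new_tokens.append(tok)
            messages.append(f"no valid replacement for '{tok}'")
        else:
            new_tokens.append(replacement)
    return new_tokens, messages


new_tokens, messages = reword(
    ["They", "utilize", "substantial", "funds"],
    {"utilize": ["employ", "use"], "substantial": ["considerable"]},
    max_level=2,
    db=VOCAB_LEVELS,
)
print(new_tokens, messages)
```

The returned messages play the role of the feedback that tells the Planner when a vocabulary constraint cannot be met with the available alternatives.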

#### Editor\.

The Editor revises the item to satisfy all constraints except the vocabulary level. It takes as input the context, $state_i$, the set of constraints, and the Planner’s instruction; during option generation, it additionally receives the target option, and finally returns the revised $state_{i+1}$.¹

¹ The Editor is unable to accurately assess item features, such as sentence length and reasoning complexity. In our preliminary experiments, inaccurate self-evaluations from the Editor were found to introduce noise, confusing the Planner during subsequent revisions. Consequently, the Editor is restricted from sending feedback messages to the Planner.

#### Refiner\.

After the iterative revisions, the generated passages are passed through the Refiner\. The Refiner is prompted to make minimal revisions that improve readability and inter\-sentence coherence\.

### 3.3 Difficulty-Calibrated Feature Constraints

While MAFIG facilitates the synthesis of RC items that adhere to specified feature constraints, achieving fine-grained difficulty control necessitates a difficulty-calibrated feature constraint sequence, that is, a sequence of constraints designed to yield items with monotonically increasing difficulty. Ideally, such a sequence should be constructed through psychometric analysis. However, this approach requires an extensive corpus of items with granular feature variations coupled with large-scale learner response data, and such resources are currently not publicly available. To address this limitation, we propose an alternative methodology that integrates theoretical calibration with empirical verification.

First, we construct initial constraint sets by incrementally adjusting individual feature variables in the direction of increasing cognitive demand. However, because subtle shifts in cognitive complexity do not guarantee a perceptible change in empirical difficulty, we filter these candidates to ensure that only feature sets yielding consistently distinguishable difficulty are included in the final sequence. For this purpose, we generate RC items for each candidate feature set using MAFIG. We then perform pairwise difficulty estimation using an LLM judge Raina and Gales ([2024](https://arxiv.org/html/2605.19316#bib.bib38)) to evaluate the difficulty alignment between items generated under constraint pairs with adjacent theoretical difficulty levels.

A stochastic comparison operator $\mathbb{D}$ takes an ordered item pair $(Q_i, Q_j)$ as input and returns a comparative judgment (i.e., $1$ or $-1$) derived via Chain-of-Thought (CoT) Wei et al. ([2022](https://arxiv.org/html/2605.19316#bib.bib97)) prompting:

$$
\mathbb{D}(Q_i, Q_j) =
\begin{cases}
1, & \text{if } Q_i \succ Q_j \\
-1, & \text{if } Q_j \succ Q_i,
\end{cases}
\tag{1}
$$

where $\succ$ denotes the “more difficult than” relation. To mitigate positional bias and ensure reliability, we conduct symmetric comparisons across $N$ stochastic inferences. The Difficulty Alignment Score (DAS) is computed as:

$$
\mathrm{DAS}(Q_i, Q_j) = \frac{\sum_{n=1}^{N} x_f^{(n)} + \sum_{n=1}^{N} \left(-x_r^{(n)}\right)}{2N},
\tag{2}
$$

where $x_f^{(n)} = \mathbb{D}^{(n)}(Q_i, Q_j)$ and $x_r^{(n)} = \mathbb{D}^{(n)}(Q_j, Q_i)$ denote the forward and reversed comparison outcomes for the $n$-th stochastic sample, respectively. The score ranges from $-1$ to $1$. We retain only constraint pairs whose score exceeds a predefined threshold $\rho$. From these validated pairs, we identify the optimal constraint sequence that exhibits a strictly monotonic increase in difficulty.
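The score in Eq. (2) is straightforward to compute once the judge outputs are collected. The sketch below is a direct transcription of the formula under the assumption that forward and reversed judgments arrive as two equal-length lists of $+1$ and $-1$ values; the function name and the example judgments are hypothetical.

```python
from typing import Sequence


def difficulty_alignment_score(
    forward: Sequence[int], reverse: Sequence[int]
) -> float:
    """Compute DAS from N forward judgments D(Qi, Qj) and N reversed judgments D(Qj, Qi).

    Reversed judgments are sign-flipped so that every term is +1 when the
    judge considers Qi the more difficult item and -1 otherwise.
    """
    if not forward or len(forward) != len(reverse):
        raise ValueError("forward and reverse must be non-empty and equally long")
    if any(x not in (1, -1) for x in (*forward, *reverse)):
        raise ValueError("judgments must be +1 or -1")
    n = len(forward)
    return (sum(forward) + sum(-x for x in reverse)) / (2 * n)


# Made-up judgments for N = 4: six of the eight votes favor Qi.
score = difficulty_alignment_score([1, 1, 1, -1], [-1, -1, 1, -1])
print(score)  # 0.5
```

Averaging over both presentation orders means that a judge who always prefers the first (or always the second) item ends up with a score of zero rather than a spurious preference.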

## 4 Experiments

### 4.1 Implementation Details

We derived an eight-level difficulty-calibrated feature constraint sequence from 16 initial candidate sets by setting $\rho = 0.4$ and $N = 4$, yielding 8 stochastic inferences in total per item pair (four forward and four reversed). Comprehensive details regarding the calibrated sequence are provided in Appendix [C](https://arxiv.org/html/2605.19316#A3).
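As a worked check of what these settings imply, consider the following reading of the DAS definition (an illustration, not a statement from the paper). Let $a$ be the number of the $2N$ judgments that, after the sign flip on reversed comparisons, favor $Q_i$ as the more difficult item; the symbol $a$ is introduced here only for this example.

$$
\mathrm{DAS} = \frac{a - (2N - a)}{2N} = \frac{a}{N} - 1,
\qquad
N = 4 \;\Rightarrow\; \mathrm{DAS} \in \{-1, -0.75, \ldots, 0.75, 1\}.
$$

$$
\mathrm{DAS} > \rho = 0.4 \iff \frac{a}{4} > 1.4 \iff a \ge 6.
$$

Under this reading, a constraint pair is retained only when at least six of the eight judgments agree on the intended ordering, and $0.5$ is the smallest attainable score that passes the threshold.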

We employed Qwen3-32B Team ([2025](https://arxiv.org/html/2605.19316#bib.bib93)) in non-reasoning mode to power all LLM agents within MAFIG. The decoding parameters were configured with top-$p = 0.8$, top-$k = 20$, and a temperature of $0.7$. During the initial drafting phase, the number of parallel candidates was set to $5$. The maximum iteration rounds for passage and option generation were capped at $20$ and $100$, respectively. In cases where no candidate achieved full constraint satisfaction within the maximum allowed rounds, the framework returned a randomly selected candidate from the final pool. Our code and generated items are publicly available at our GitHub repository.² Prompt templates used in our experiments are provided in Appendix [I](https://arxiv.org/html/2605.19316#A9).

² [https://github.com/SeonjeongHwang/mafig](https://github.com/SeonjeongHwang/mafig)

### 4.2 Dataset

We utilized source documents from the Brown Corpus via the NLTK library. We randomly sampled 40 documents spanning 10 distinct genres: news, editorial, reviews, lore, government, fiction, mystery, science fiction, adventure, and romance. Only the first 50 sentences of each text were used as the source document for item generation. This selection results in a total of 320 generated items (40 source documents $\times$ 8 difficulty levels).

### 4.3 Method Comparison

Since no existing method generates items at a fine-grained difficulty level in a zero-shot manner, there is no direct baseline for comparison. Nevertheless, we construct baselines grounded in the single-pass prompting strategy adopted by most prior work Elkins et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib19)); Li and Zhang ([2024](https://arxiv.org/html/2605.19316#bib.bib21)); Hwang et al. ([2024](https://arxiv.org/html/2605.19316#bib.bib30)); Yaacoub et al. ([2025](https://arxiv.org/html/2605.19316#bib.bib29)); Chen and Shiu ([2025](https://arxiv.org/html/2605.19316#bib.bib28)), based on two distinct granularities of difficulty control:

#### Level\-based Control\.

The model is instructed to calibrate difficulty based on an abstract scale. We utilize two CoT-based prompting strategies: (i) Direct Prompting, where the target level is explicitly specified (e.g., "Generate a level 3 question on a scale of 1–8"), and (ii) Incremental Prompting, which recursively generates a level-$i$ item conditioned on the level-$(i-1)$ item.³

³ In incremental prompting, level 1 items serve as initial pivots. Higher-level items are then generated recursively; for fair evaluation, we compare item pairs derived from distinct pivot items.

#### Feature\-based Control\.

The model is prompted to satisfy specific feature constraints corresponding to a target difficulty level. Here, we examine whether rigorous constraint satisfaction enhances calibration robustness by comparing (i) Direct Prompting, which targets predefined features in a single-turn generation, and (ii) MAFIG, which employs a multi-agent revision loop until all constraints are met. We mainly used two different LLMs: Qwen3-32B in non-reasoning mode and GPT-5 with reasoning effort configured to medium.

### 4.4 Evaluation Metrics

We evaluate the performance of each item generation method across three key dimensions: (1) Constraint Satisfaction, (2) Difficulty Calibration, and (3) Item Quality. An overview of each metric follows, with formal definitions and mathematical formulations provided in Appendix [B](https://arxiv.org/html/2605.19316#A2).

#### \(1\) Constraint Satisfaction\.

This dimension quantifies the extent to which generated items adhere to the specified constraints. The Success Ratio (SR) measures the proportion of items that satisfy all target constraints simultaneously, while the Achievement Ratio (AR) computes the average fraction of individual constraints successfully met per item.
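The two ratios can be illustrated with a short sketch that follows the prose description above; the formal definitions are in the paper's Appendix B and may differ in detail. The input format (one list of per-constraint booleans per item), the function names, and the example outcomes are assumptions made for illustration.

```python
from typing import Sequence


def success_ratio(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of items that satisfy every one of their constraints."""
    return sum(all(item) for item in results) / len(results)


def achievement_ratio(results: Sequence[Sequence[bool]]) -> float:
    """Mean over items of the fraction of constraints each item satisfies."""
    return sum(sum(item) / len(item) for item in results) / len(results)


# Made-up outcomes for three items with four constraints each.
outcomes = [
    [True, True, True, True],    # all constraints met
    [True, False, True, True],   # three of four met
    [False, False, True, True],  # two of four met
]
sr = success_ratio(outcomes)
ar = achievement_ratio(outcomes)
print(sr, ar)
```

In this toy case only one of three items counts toward SR, while AR still credits the partially compliant items, which is why a method can show a moderate AR alongside a very low SR.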

#### \(2\) Difficulty Calibration\.

We assess the model’s ability to control difficulty through the Difficulty Alignment Score (DAS), which evaluates whether items intended for higher difficulty levels are empirically more challenging, and is derived from both LLM judges and human experts. The score ranges from $-1$ to $+1$, where $+1$ indicates perfect monotonic alignment, $0$ signifies inconsistent or negligible differences, and $-1$ represents a complete reversal of the intended difficulty order. Additionally, we report the Complete Alignment Ratio (CAR), defined as the proportion of item pairs where a consensus of human experts confirms that the observed difficulty aligns with the intended level.

#### \(3\) Item Quality\.

This dimension ensures that the generated items maintain high linguistic and logical standards. Validity evaluates the generated RC items in terms of their answerability, the correctness of the generated answer against the real answer, and the logical and semantic independence among options. This is measured via an LLM judge using G-Eval Liu et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib1)) on a three-point scale (1–3). Furthermore, we evaluate the Coherence and Fluency of the generated passages using UniEval Zhong et al. ([2022](https://arxiv.org/html/2605.19316#bib.bib2)), with scores normalized between 0 and 1.

## 5 Results

Table 1: Automatic evaluation results on diverse difficulty-controllable item generation methods. The best performance of each metric is in bold. Statistics of DAS are reported in Appendix [D](https://arxiv.org/html/2605.19316#A4).

![Refer to caption](https://arxiv.org/html/2605.19316v1/x3.png)

Figure 3: Distribution of Difficulty Alignment Scores assigned by human evaluators.

### 5.1 Automatic Evaluation

#### MAFIG effectively generates RC items satisfying feature constraints.

As summarized in Table [1](https://arxiv.org/html/2605.19316#S5.T1), MAFIG consistently outperforms all baseline methods in feature-based difficulty control, achieving a SR of 92.29%. This result demonstrates the framework’s capability to generate items that strictly adhere to a multi-dimensional constraint set. In contrast, Direct Prompting with Qwen3-32B fails to produce a single item that satisfies all constraints simultaneously, and even GPT-5 rarely achieves full compliance. However, an analysis of the AR reveals that Direct Prompting does not entirely disregard instructions; Qwen3-32B and GPT-5 meet over half and three-quarters of the constraints on average, respectively. These findings imply that while LLMs can partially incorporate feature-based instructions during single-pass generation, they struggle with the simultaneous optimization of multiple constraints. A case study of constraint satisfaction failures in MAFIG is provided in Appendix [G](https://arxiv.org/html/2605.19316#A7).

#### Calibrating RC difficulty remains challenging for LLMs\.

Level\-based control methods consistently yield lower DAS compared to their feature\-based counterparts\. For Qwen3\-32B, both Direct and Incremental Prompting under the level\-based setting fail to surpass a score of 0\.2\. In contrast, even the Direct Prompting under feature\-based control achieves a higher score of 0\.276, despite an AR of only 59\.1%\. A similar trend is observed with GPT\-5, which remains below 0\.3 in level\-based scenarios but reaches 0\.495 under feature\-based control\. These results underscore that fine\-grained difficulty calibration is challenging for even state\-of\-the\-art LLMs when using abstract level descriptions\. However, explicit feature specification substantially enhances calibration consistency\.

Table 2: Human evaluation results\. A statistically significant positive correlation was observed between the LLM judge and the human expert ratings \(Spearman’s $\rho=0.34$, $p<0.001$\) in Difficulty Alignment Score\.

#### Precise constraint satisfaction drives superior difficulty alignment\.

MAFIG<sub>Qwen3\-32B</sub> achieves both the highest AR \(99\.32%\) and the highest DAS \(0\.5226\), whereas Direct<sub>Qwen3\-32B</sub> shows the lowest performance in both metrics \(59\.10% and 0\.2759, respectively\)\. Although GPT\-5 relies solely on Direct Prompting, it satisfies 77\.81% of the feature constraints on average, resulting in a higher alignment score than Direct<sub>Qwen3\-32B</sub>\. These results suggest that accurately satisfying difficulty\-calibrated feature constraints leads to more reliable difficulty alignment, and that achieving such precise constraint satisfaction requires the iterative revision process of MAFIG\.

### 5\.2 Human Evaluation

We further conducted human evaluation with three domain experts\. We sampled six random source document pairs for each consecutive difficulty level and assessed corresponding RC item pairs generated by three different methods\. Among the evaluated item pairs, 65\.9% received consistent ratings from all three annotators on the question of which item is more difficult\. More details are provided in Appendix [E](https://arxiv.org/html/2605.19316#A5)\.

As illustrated in Figure [3](https://arxiv.org/html/2605.19316#S5.F3), a large proportion of item pairs generated by MAFIG<sub>Qwen3\-32B</sub> achieved DAS above 0\.5\. In contrast, while baselines rarely exhibited reversed alignment, many of their pairs fell within the \-0\.5 to 0\.5 range\. This indicates that items at adjacent levels were often indistinguishable in difficulty to human experts\. Furthermore, MAFIG<sub>Qwen3\-32B</sub> achieved a CAR of 76\.19%, which is substantially higher than that of Direct<sub>GPT\-5</sub> \(57\.14%; see Table [2](https://arxiv.org/html/2605.19316#S5.T2)\)\. Overall, these results demonstrate that the proposed framework can generate item pairs exhibiting perceptible differences in difficulty across fine\-grained levels\. They also suggest that when a model fails to satisfy the specified feature constraints, it struggles to establish distinct cognitive demands between adjacent difficulty levels, thereby limiting its calibration accuracy\. Evaluators also identified the difficulty factors driving the differences within each item pair\. These factors are further analyzed in Appendix [H](https://arxiv.org/html/2605.19316#A8)\.

## 6 Analysis

### 6\.1 Limitations of Direct Prompting with Multiple Sampling

We investigated the constraint satisfaction performance of Direct Prompting using multiple sampling\. Figure [4](https://arxiv.org/html/2605.19316#S6.F4) illustrates the SR and AR across different numbers of sampling trials \(1, 5, and 10\) for Qwen3\-32B \(non\-reasoning\) and GPT\-5 \(reasoning\)\. Increasing the number of sampling trials generally improved both SR and AR\. GPT\-5 achieved notably high performance, with AR approaching 90% in both passage and option generation\. However, the gain from increasing the number of trials from 5 to 10 was much smaller than the gain from 1 to 5\. This suggests that while stochastic sampling can enhance the likelihood of obtaining partially constraint\-compliant items, sampling alone remains insufficient to ensure full constraint satisfaction\. This result underscores the necessity of explicit revision mechanisms\. The feature\-wise analysis is covered in Appendix [F](https://arxiv.org/html/2605.19316#A6)\.

![Refer to caption](https://arxiv.org/html/2605.19316v1/x4.png) Figure 4: Constraint satisfaction performance under different numbers of sampling trials \($n$\)\.

### 6\.2 Model Generalization and Impact of Parallel Revision

We further evaluated the generalization capability of MAFIG across diverse backbone LLMs: Qwen3\-32B, Mistral\-Small\-24B \(Mistral\.AI, [2025](https://arxiv.org/html/2605.19316#bib.bib95)\), and Phi\-4\-14B \(Abdin et al\., [2024](https://arxiv.org/html/2605.19316#bib.bib91)\)\. This experiment also examined the impact of parallel revision, where $n$ item drafts \(with $n \in \{1, 5\}$\) are sampled and revised in parallel until any single draft fully satisfies the constraints\.

As shown in Figure [4](https://arxiv.org/html/2605.19316#S6.F4), in passage generation, all models except Phi\-4\-14B achieved a SR of nearly 100% within 20 rounds, albeit with different convergence speeds\. In contrast, option generation proved more challenging: with only a single draft \($n=1$\), all models except Qwen3\-32B failed to surpass a 60% SR, even after 100 revision rounds\. This may be attributed to the reasoning complexity features controlled during option generation, which are more challenging to control than surface\-level features\.

When the number of initial candidates was increased to $n=5$, all models achieved 100% success and converged substantially faster; passage generation typically terminated within five rounds\. For option generation, Qwen3\-32B required roughly half the number of revision rounds compared to the single\-draft setting\. Interestingly, performing parallel revisions on multiple drafts using lighter models \(Mistral\-Small\-24B and Phi\-4\-14B\) yielded higher overall constraint satisfaction than revising a single draft using the larger Qwen3\-32B model\. This advantage is likely due to the diversity in revision paths across drafts, where different initial drafts require distinct modifications to satisfy the constraints\.
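The parallel revision procedure discussed in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `evaluate` and `revise` are hypothetical callables standing in for the feature\-specific evaluators and the revising agents, and items are represented as plain strings.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (not from the paper):
# evaluate(item) returns the names of violated features (empty list = all met).
# revise(item, violated) returns a revised item.
Evaluate = Callable[[str], List[str]]
Revise = Callable[[str, List[str]], str]


def parallel_revision(
    drafts: List[str],
    evaluate: Evaluate,
    revise: Revise,
    max_rounds: int = 100,
) -> Tuple[Optional[str], int]:
    """Revise all drafts round by round and stop once any draft passes.

    Returns (first fully compliant draft or None, revision rounds used).
    """
    current = list(drafts)
    for round_idx in range(max_rounds + 1):
        reports = [evaluate(d) for d in current]
        for draft, violated in zip(current, reports):
            if not violated:
                return draft, round_idx
        if round_idx == max_rounds:
            break
        # Each draft follows its own revision path, driven by its own report.
        current = [revise(d, v) for d, v in zip(current, reports)]
    return None, max_rounds
```

With a toy evaluator that only checks a length constraint, starting from several drafts lets the draft closest to compliance terminate the loop early, which mirrors the faster convergence reported for $n=5$.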

### 6\.3 Ablation Study

We conducted an ablation study to assess the contribution of three strategies \(Planner’s Instruction, Reworder’s Message, and Creativity Enhancement Prompting\) to the iterative revision process in both passage and option generation \(Figure [5](https://arxiv.org/html/2605.19316#S6.F5)\)\. When the Planner’s instruction mechanism was removed and sub\-agents \(Reworder and Editor\) revised items solely based on error reports, we observed a substantially slower convergence in option generation\. Similarly, disabling the Reworder’s feedback message and the creativity enhancement mechanism also resulted in slower convergence\. In contrast, the passage generation process showed minimal performance differences across ablation settings; in fact, replacing the Planner’s instruction with direct error reports occasionally led to slightly faster convergence\. In summary, while these auxiliary mechanisms did not substantially affect surface\-level revisions in passage generation, they proved critical for accelerating convergence in option generation, where more complex semantic and reasoning\-based revisions are required to satisfy the constraints\.

![Refer to caption](https://arxiv.org/html/2605.19316v1/x5.png) Figure 5: Ablation results showing the effect of Planner’s instruction, Reworder’s message, and Creativity Enhancement on Success Ratio convergence across revision rounds\.

## 7 Computational Cost

Unlike baseline methods based on single\-pass prompting, MAFIG employs an iterative revision process to ensure that the generated RC items strictly adhere to multifaceted feature constraints\. While this approach guarantees high\-fidelity item generation, it inherently introduces a significant computational overhead compared to non\-iterative methods\. Figure [6](https://arxiv.org/html/2605.19316#S7.F6) illustrates this trade\-off by comparing the cumulative output tokens against the SR \(at $n=1$\) for both passage and option generation stages\. In passage generation, the framework demonstrates relatively efficient convergence, with 90% of cases satisfying all linguistic and structural constraints within 10 rounds at a cost of approximately 20K cumulative tokens\. In contrast, option generation requires a much more intensive search process to satisfy fine\-grained distractor constraints; achieving about 90% SR can necessitate up to 100 rounds, resulting in an accumulated token count exceeding 130K\.

![Refer to caption](https://arxiv.org/html/2605.19316v1/x6.png) Figure 6: Success Ratio and the number of cumulative output tokens across revision rounds in MAFIG\.

We acknowledge that the substantial overhead and associated latency may render MAFIG less suitable for real\-time applications or large\-scale deployments where instantaneous item generation is required\. Nevertheless, the necessity of our iterative framework remains evident because, as discussed in Section [6\.1](https://arxiv.org/html/2605.19316#S6.SS1), repeated single\-pass sampling often fails to converge on a valid solution that satisfies all complex feature constraints simultaneously\. Furthermore, in high\-stakes environments such as standardized language assessments or national\-level examinations, difficulty calibration and adherence to pedagogical constraints take precedence over cost efficiency\. In such contexts, the increased computational cost can be a justifiable trade\-off for the reliability and constraint satisfaction provided by our framework\.

## 8 Conclusion

In this study, we introduced MAFIG, a multi\-agent framework for feature\-constrained RC item generation\. MAFIG coordinates role\-specialized LLM agents in conjunction with feature\-specific evaluators to iteratively generate, assess, and revise RC items, ensuring strict adherence to all designated feature constraints\. To complement this framework, we proposed a method for constructing a difficulty\-calibrated sequence of feature constraint sets, integrating theoretical insights from educational measurement with empirical verification\. Experimental results demonstrated that our framework achieved substantially higher constraint satisfaction rates and exhibited superior difficulty alignment performance, as validated by both LLM\-based and human expert evaluations\.

## Limitations

#### Scope of Item Formats and Difficulty Factors\.

Our experiments focus on a single item format—multiple\-choice factual information \(MCFI\) questions—and a fixed set of difficulty\-related features\. Item difficulty is influenced by a wide range of factors, and the relative importance of these factors varies across item formats\. We selected MCFI items because they represent a canonical RC format in which both surface\-level features and reasoning\-related features jointly contribute to cognitive load\. In addition, passage topic and question type were treated as controlled variables to isolate fine\-grained difficulty variation within a homogeneous setting\.

Consequently, while our findings may not directly generalize to other RC formats, the proposed framework is not inherently restricted to MCFI items\. Extending the framework to alternative formats would primarily require defining additional feature evaluators and incorporating corresponding descriptions into agent prompts, though the two\-stage passage–option generation structure might necessitate adaptation depending on the target format\. Nevertheless, this study is valuable as the pioneering effort to showcase how LLM\-based multi\-agent frameworks can be leveraged for feature\-constrained item generation and subsequent fine\-grained difficulty control\.

#### Lack of Psychometric Validation Against Examinee Performance\.

In educational measurement research, a range of difficulty\-related item features have been identified and used as practical guidelines for item writers to produce questions at a desired difficulty level \(Ferrara et al\., [2022](https://arxiv.org/html/2605.19316#bib.bib42)\)\. Building on this practice, our framework generates RC items conditioned on feature constraints corresponding to a target difficulty range, and evaluates difficulty alignment through expert judgments, that is, from a *prior difficulty* perspective based on linguistic and cognitive item features, rather than on observed examinee performance\.

Consequently, the generated items are not psychometrically calibrated in an absolute sense: their difficulty has neither been validated against student error rates nor evaluated in terms of item response patterns\. While the framework is intended as a practical support tool for item writers, future work should incorporate empirical validation through examinee response data to establish psychometric grounding for the difficulty levels produced by our framework\.

#### Computational Cost and Latency\.

The proposed multi\-agent framework incurs additional computational cost due to its iterative revision process\. While our experiments show that the number of revision rounds is typically limited, this overhead may still pose challenges in large\-scale or time\-sensitive deployment scenarios\. As such, the framework is most suitable for settings where reliability and precise difficulty control are prioritized over minimal generation cost\.

#### Reliance on Feature Definitions and Evaluator Quality\.

The effectiveness of MAFIG depends on the quality of feature definitions and evaluators\. Because difficulty control is achieved through explicit feature constraints, inaccuracies or ambiguities in feature specifications, or erroneous predictions by the evaluators used to measure them, can affect the reliability of constraint satisfaction and difficulty alignment\. In particular, some features, such as reasoning complexity, rely on LLM\-based evaluators, which may introduce noise or bias\. This dependency reflects a broader limitation of feature\-based item design, even as it offers greater transparency and interpretability in how difficulty is operationalized\.

## Acknowledgments

This research was supported by the Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2025 \(Project Name: Development of an AI\-Based Korean Diagnostic System for Efficient Korean Speaking Learning by Foreigners, Project Number: RS\-2025\-02413038, Contribution Rate: 45%\); by the IITP \(Institute of Information & Communications Technology Planning & Evaluation\) \- ITRC \(Information Technology Research Center\) grant funded by the Korea government \(Ministry of Science and ICT\) \(IITP\-2026\-RS\-2024\-00437866, Contribution Rate: 45%\); and by the Institute of Information & Communications Technology Planning & Evaluation \(IITP\) grant funded by the Korea government \(MSIT\) \(No\. RS\-2019\-II191906, Artificial Intelligence Graduate School Program \(POSTECH\), Contribution Rate: 10%\)\.

## References

- M\. Abdin, J\. Aneja, H\. Behl, S\. Bubeck, R\. Eldan, S\. Gunasekar, M\. Harrison, R\. J\. Hewett, M\. Javaheripi, P\. Kauffmann, et al\. \(2024\) Phi\-4 technical report\. arXiv preprint arXiv:2412\.08905\. Cited by: [§6\.2](https://arxiv.org/html/2605.19316#S6.SS2.p1.2)\.
- How to construct achievement tests to assess comprehension\. Review of Educational Research 42 \(2\), pp\. 145–170\. Cited by: [§3\.1](https://arxiv.org/html/2605.19316#S3.SS1.p1.1)\.
- S\. Baghaei, M\. S\. Bagheri, and M\. Yamini \(2020\) Analysis of IELTS and TOEFL reading and listening tests in terms of revised Bloom’s taxonomy\. Cogent Education 7 \(1\), pp\. 1720939\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p2.1)\.
- U\. Bezirhan and M\. von Davier \(2023\) Automated reading passage generation with OpenAI’s large language model\. Computers and Education: Artificial Intelligence 5, pp\. 100161\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p1.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px1.p1.1)\.
- B\. S\. Bloom, M\. D\. Engelhart, E\. J\. Furst, W\. H\. Hill, D\. R\. Krathwohl, et al\. \(1956\) Taxonomy of educational objectives: the classification of educational goals\. Handbook 1: cognitive domain\. Longman New York\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p1.1)\.
- J\. R\. Bormuth, J\. Manning, J\. Carr, and D\. Pearson \(1970\) Children’s comprehension of between\- and within\-sentence syntactic structures\. Journal of Educational Psychology 61 \(5\), pp\. 349\. Cited by: [§3\.1](https://arxiv.org/html/2605.19316#S3.SS1.p1.1)\.
- C\. H\. Chen and M\. F\. Shiu \(2025\) KAQG: a knowledge\-graph\-enhanced RAG for difficulty\-controlled question generation\. arXiv preprint arXiv:2505\.07618\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p2.1), [§4\.3](https://arxiv.org/html/2605.19316#S4.SS3.p1.1)\.
- S\. Elkins, E\. Kochmar, I\. Serban, and J\. C\. Cheung \(2023\) How useful are educational questions generated by large language models? In International Conference on Artificial Intelligence in Education, pp\. 536–542\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p2.1), [§4\.3](https://arxiv.org/html/2605.19316#S4.SS3.p1.1)\.
- S\. Ferrara, J\. T\. Steedle, and R\. S\. Frantz \(2022\) Response demands of reading comprehension test items: a review of item difficulty modeling studies\. Applied Measurement in Education 35 \(3\), pp\. 237–253\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [Lack of Psychometric Validation Against Examinee Performance\.](https://arxiv.org/html/2605.19316#Sx1.SS0.SSS0.Px2.p1.1)\.
- Y\. Gao, L\. Bing, W\. Chen, M\. R\. Lyu, and I\. King \(2018\) Difficulty controllable generation of reading comprehension questions\. arXiv preprint arXiv:1807\.03586\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Hwang, K\. Wang, M\. Alomair, F\. Choa, and L\. K\. Chen \(2024\) Towards automated multiple choice question generation and evaluation: aligning with Bloom’s taxonomy\. In International Conference on Artificial Intelligence in Education, pp\. 389–396\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p2.1), [§4\.3](https://arxiv.org/html/2605.19316#S4.SS3.p1.1)\.
- S\. Hwang, H\. Kim, and G\. G\. Lee \(2025\) Can LLMs estimate cognitive complexity of reading comprehension items? arXiv preprint arXiv:2510\.25064\. Cited by: [Appendix A](https://arxiv.org/html/2605.19316#A1.p1.3), [Appendix A](https://arxiv.org/html/2605.19316#A1.p2.1)\.
- G\. Lai, Q\. Xie, H\. Liu, Y\. Yang, and E\. Hovy \(2017\) RACE: large\-scale reading comprehension dataset from examinations\. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp\. 785–794\. Cited by: [Appendix A](https://arxiv.org/html/2605.19316#A1.p1.3)\.
- U\. Lee, H\. Jung, Y\. Jeon, Y\. Sohn, W\. Hwang, J\. Moon, and H\. Kim \(2024\) Few\-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in English education\. Education and Information Technologies 29 \(9\), pp\. 11483–11515\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p1.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. Advances in Neural Information Processing Systems 33, pp\. 9459–9474\. Cited by: [§3\.2](https://arxiv.org/html/2605.19316#S3.SS2.SSS0.Px4.p1.1)\.
- B\. Li, Y\. Wang, J\. Gu, K\. Chang, and N\. Peng \(2025\) Metal: a multi\-agent framework for chart generation with test\-time scaling\. arXiv preprint arXiv:2502\.17651\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Li and Y\. Zhang \(2024\) Planning first, question second: an LLM\-guided method for controllable question generation\. In Findings of the Association for Computational Linguistics ACL 2024, pp\. 4715–4729\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p1.1), [§4\.3](https://arxiv.org/html/2605.19316#S4.SS3.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\) G\-Eval: NLG evaluation using GPT\-4 with better human alignment\. arXiv preprint arXiv:2303\.16634\. Cited by: [Appendix B](https://arxiv.org/html/2605.19316#A2.SS0.SSS0.Px5.p1.1), [§4\.4](https://arxiv.org/html/2605.19316#S4.SS4.SSS0.Px3.p1.1)\.
- E\. Loper and S\. Bird \(2002\) NLTK: the natural language toolkit\. arXiv preprint cs/0205028\. Cited by: [Appendix A](https://arxiv.org/html/2605.19316#A1.p2.1)\.
- F\. M\. Lord \(1980\) Applications of item response theory to practical testing problems\. Routledge\. External Links: [Document](https://dx.doi.org/10.4324/9780203056615)\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Lu and X\. Wang \(2024\) Generative students: using LLM\-simulated student profiles to support question item evaluation\. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, pp\. 16–27\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, et al\. \(2023\) Self\-refine: iterative refinement with self\-feedback\. Advances in Neural Information Processing Systems 36, pp\. 46534–46594\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px3.p1.1)\.
- Mistral\.AI \(2025\) Mistral Small 3\. Note: [https://mistral\.ai/news/mistral\-small\-3](https://mistral.ai/news/mistral-small-3)\. Accessed: 2025\-11\-07\. Cited by: [§6\.2](https://arxiv.org/html/2605.19316#S6.SS2.p1.2)\.
- S\. S\. Mucciaccia, T\. M\. Paixão, F\. W\. Mutz, C\. S\. Badue, A\. F\. de Souza, and T\. Oliveira\-Santos \(2025\) Automatic multiple\-choice question generation and evaluation systems based on LLM: a study case with university resolutions\. In Proceedings of the 31st International Conference on Computational Linguistics, pp\. 2246–2260\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p1.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Oka, Y\. Tan, T\. Ishioka, and K\. Okada \(2025\) Systematic control of multiple\-choice item difficulty through LLM\-based distractor generation\. In International Conference on Artificial Intelligence in Education, pp\. 147–157\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p2.1)\.
- OpenAI \(2025\) Introducing GPT\-5\. Note: [https://openai\.com/index/introducing\-gpt\-5](https://openai.com/index/introducing-gpt-5)\. Accessed: 2025\-11\-07\. Cited by: [Appendix B](https://arxiv.org/html/2605.19316#A2.SS0.SSS0.Px3.p2.9), [§1](https://arxiv.org/html/2605.19316#S1.p4.1)\.
- G\. Park \(2004\) Comparison of L2 listening and reading comprehension by university students learning English in Korea\. Foreign Language Annals 37 \(3\), pp\. 448–458\. Cited by: [§3\.1](https://arxiv.org/html/2605.19316#S3.SS1.p1.1)\.
- J\. Park, S\. Park, H\. Won, and K\. Kim \(2024\) Large language models are students at various levels: zero\-shot question difficulty estimation\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 8157–8177\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Rafatbakhsh and A\. Ahmadi \(2023\) Predicting the difficulty of EFL reading comprehension tests based on linguistic indices\. Asian\-Pacific Journal of Second and Foreign Language Education 8 \(1\), pp\. 41\. Cited by: [§3\.1](https://arxiv.org/html/2605.19316#S3.SS1.p1.1)\.
- V\. Raina and M\. Gales \(2024\) Question difficulty ranking for multiple\-choice reading comprehension\. arXiv preprint arXiv:2404\.10704\. Cited by: [§3\.3](https://arxiv.org/html/2605.19316#S3.SS3.p2.1)\.
- F\. Retkowski and A\. Waibel \(2025\) Zero\-shot strategies for length\-controllable summarization\. In Findings of the Association for Computational Linguistics: NAACL 2025, pp\. 551–572\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Ryu, H\. Do, D\. Kim, H\. Yu, D\. Kim, Y\. Kim, G\. G\. Lee, and J\. Ok \(2024\) Exploring iterative controllable summarization with large language models\. arXiv preprint arXiv:2411\.12460\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Säuberli and S\. Clematide \(2024\) Automatic generation and evaluation of reading comprehension test items with large language models\. In Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties \(READI\) @ LREC\-COLING 2024, pp\. 22–37\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. Advances in Neural Information Processing Systems 36, pp\. 8634–8652\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px3.p1.1), [§3\.2](https://arxiv.org/html/2605.19316#S3.SS2.SSS0.Px3.p1.3)\.
- Y\. Talebirad and A\. Nadiri \(2023\) Multi\-agent collaboration: harnessing the power of intelligent LLM agents\. arXiv preprint arXiv:2306\.03314\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px3.p1.1)\.
- Q\. Team \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388)\. Cited by: [Appendix A](https://arxiv.org/html/2605.19316#A1.p2.1), [§4\.1](https://arxiv.org/html/2605.19316#S4.SS1.p2.6)\.
- Y\. Tomikawa and U\. Masaki \(2024\) Difficulty\-controllable reading comprehension question generation considering the difficulty of reading passages\. In International Conference on Computers in Education\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1)\.
- Y\. Tomikawa, A\. Suzuki, and M\. Uto \(2024\) Adaptive question–answer generation with difficulty control using item response theory and pre\-trained transformer models\. IEEE Transactions on Learning Technologies\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Tomikawa and M\. Uto \(2024\) Difficulty\-controllable multiple\-choice question generation for reading comprehension using item response theory\. In International Conference on Artificial Intelligence in Education, pp\. 312–320\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Uto, Y\. Tomikawa, and A\. Suzuki \(2023\) Difficulty\-controllable neural question generation for reading comprehension using item response theory\. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2023\), pp\. 119–129\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2022\) Self\-consistency improves chain of thought reasoning in language models\. arXiv preprint arXiv:2203\.11171\. Cited by: [Appendix A](https://arxiv.org/html/2605.19316#A1.p2.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in Neural Information Processing Systems 35, pp\. 24824–24837\. Cited by: [Appendix A](https://arxiv.org/html/2605.19316#A1.p2.1), [§3\.3](https://arxiv.org/html/2605.19316#S3.SS3.p3.4)\.
- C\. Xiao, S\. X\. Xu, K\. Zhang, Y\. Wang, and L\. Xia \(2023\) Evaluating reading comprehension exercises generated by LLMs: a showcase of ChatGPT in education applications\. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2023\), pp\. 610–625\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p1.1)\.
- A\. Yaacoub, J\. Da\-Rugna, and Z\. Assaghir \(2025\) Assessing AI\-generated questions’ alignment with cognitive frameworks in educational assessment\. arXiv preprint arXiv:2504\.14232\. Cited by: [§1](https://arxiv.org/html/2605.19316#S1.p2.1), [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px2.p1.1), [§4\.3](https://arxiv.org/html/2605.19316#S4.SS3.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2022\) ReAct: synergizing reasoning and acting in language models\. In The Eleventh International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2605.19316#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Zhang, L\. Chen, S\. Zhang, H\. Xu, Z\. Zhao, and K\. Yu \(2023\) Large language models are semi\-parametric reinforcement learning agents\. Advances in Neural Information Processing Systems 36, pp\. 78227–78239\. Cited by: [§3\.2](https://arxiv.org/html/2605.19316#S3.SS2.SSS0.Px3.p1.3)\.
- M\. Zhong, Y\. Liu, D\. Yin, Y\. Mao, Y\. Jiao, P\. Liu, C\. Zhu, H\. Ji, and J\. Han \(2022\) Towards a unified multi\-dimensional evaluator for text generation\. arXiv preprint arXiv:2210\.07197\. Cited by: [§4\.4](https://arxiv.org/html/2605.19316#S4.SS4.SSS0.Px3.p1.1)\.

## Appendix A Feature Definitions and Feature\-Specific Evaluators

We employ six feature variables: four features control the cognitive demand of the item \(vocabulary level, passage length, average sentence length, and reasoning complexity\), while two ensure validity \(factuality and neutrality among options\)\. Features with continuous values are discretized into categorical ranges: passage length $\in$ \{short \(5–10 sentences\), medium \(11–20 sentences\), long \(21–30 sentences\)\} and average sentence length $\in$ \{short \($<10$ words\), medium \(10–15 words\), long \(15–20 words\)\}\. Vocabulary level is divided into CEFR\-based bands: \{A \(A1–A2\), B \(B1–B2\), C \(C1–C2\)\}\. The reasoning complexity of each option is classified into five levels: \{single\-sentence word matching, single\-sentence paraphrasing, single\-sentence inference, multi\-sentence inference, not enough information\} \(Lai et al\., [2017](https://arxiv.org/html/2605.19316#bib.bib62); Hwang et al\., [2025](https://arxiv.org/html/2605.19316#bib.bib45)\)\. Finally, factuality can take values \{true, false, not given\}, and neutrality must be maintained to ensure that all options are logically independent and collectively valid\.

The Evaluators comprise off\-the\-shelf NLP toolkits and LLM judges\. Passage length and sentence length are computed using the NLTK library \(Loper and Bird, [2002](https://arxiv.org/html/2605.19316#bib.bib111)\), while vocabulary level is determined by the highest CEFR level among the words contained in a passage or an option\. Following Hwang et al\. \([2025](https://arxiv.org/html/2605.19316#bib.bib45)\), we evaluate the reasoning complexity of each option along two sub\-dimensions, Evidence Scope and Transformation Level, using CoT prompting with self\-consistency decoding \(Wei et al\., [2022](https://arxiv.org/html/2605.19316#bib.bib97); Wang et al\., [2022](https://arxiv.org/html/2605.19316#bib.bib98)\), achieving Macro F1 scores of 69\.0 and 70\.8, respectively\. We use the same strategy to assess the factuality and neutrality of options\. These LLM\-based evaluations were conducted using Qwen3\-32B \(Team, [2025](https://arxiv.org/html/2605.19316#bib.bib93)\) in non\-reasoning mode\.
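The surface\-level evaluators described in this appendix can be approximated with a short sketch. It is illustrative only and makes several assumptions: a regex splitter stands in for NLTK tokenization, the CEFR lexicon is a hypothetical word\-to\-level mapping supplied by the caller, words missing from the lexicon are ignored, and an average sentence length of exactly 15 words is assigned to the medium bin because the stated medium and long ranges share that boundary.

```python
import re
from typing import Dict, List

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def split_sentences(text: str) -> List[str]:
    # Simplified stand-in for NLTK sentence tokenization (assumption).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def passage_length_bin(text: str) -> str:
    n = len(split_sentences(text))
    if 5 <= n <= 10:
        return "short"
    if 11 <= n <= 20:
        return "medium"
    if 21 <= n <= 30:
        return "long"
    return "out_of_range"


def avg_sentence_length_bin(text: str) -> str:
    sentences = split_sentences(text)
    if not sentences:
        return "out_of_range"
    avg = sum(len(s.split()) for s in sentences) / len(sentences)
    if avg < 10:
        return "short"
    if avg <= 15:  # assumption: the shared boundary 15 counts as medium
        return "medium"
    if avg <= 20:
        return "long"
    return "out_of_range"


def vocabulary_band(text: str, lexicon: Dict[str, str]) -> str:
    # Band of the highest CEFR level among words found in the lexicon.
    words = re.findall(r"[a-z']+", text.lower())
    levels = [lexicon[w] for w in words if w in lexicon]
    if not levels:
        return "unknown"
    top = max(levels, key=CEFR_ORDER.index)
    return top[0]  # "A1"/"A2" -> "A", "B1"/"B2" -> "B", "C1"/"C2" -> "C"
```

Because the vocabulary band is driven by the single highest\-level word, one C2 word is enough to move an otherwise simple passage into band C under this reading of the definition.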

## Appendix B Evaluation Metrics

#### Success Ratio\.

The SR measures the proportion of items that satisfy all target constraints simultaneously\. Formally, given a source document $S$ and a *feature constraint set* $C=\{(X_i, x_i)\}_{i=1}^{M}$, where each feature variable $X_i$ is assigned a target value $x_i$, we generate an item $Q_C$ whose passage is composed based on the content of $S$ while satisfying all feature constraints specified in $C$\. The generation is considered successful if all constraints in $C$ are satisfied as follows:

$$\text{Success}(Q_C) = \mathbb{I}\big[\,\forall i \in \{1,\ldots,M\},\; E_{X_i}(Q_C) = x_i\,\big], \tag{3}$$

where $E_{X_i}(\cdot)$ denotes the evaluation function that measures the realized value of the feature variable $X_i$ for the generated item\.

In our experiments, an item generation task is considered successful if at least one of the $n$ parallel drafts satisfies all target constraints\. Constraint satisfaction was verified using the same automated evaluators integrated within MAFIG\.
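The success indicator of Equation 3 and the at\-least\-one\-draft rule can be written as a small sketch. The dictionary\-based representation of constraints and evaluators is a hypothetical choice made for illustration; the paper does not specify a data structure.

```python
from typing import Callable, Dict, List

Constraints = Dict[str, str]                  # feature name X_i -> target value x_i
Evaluators = Dict[str, Callable[[str], str]]  # feature name X_i -> E_{X_i}(item)


def success(item: str, constraints: Constraints, evaluators: Evaluators) -> int:
    """Indicator of Equation 3: 1 only if every constraint is met."""
    return int(all(evaluators[f](item) == target for f, target in constraints.items()))


def task_success(drafts: List[str], constraints: Constraints, evaluators: Evaluators) -> int:
    """A task succeeds if at least one parallel draft is fully compliant."""
    return int(any(success(d, constraints, evaluators) for d in drafts))


def success_ratio(task_outcomes: List[int]) -> float:
    """Percentage of successful tasks."""
    return 100.0 * sum(task_outcomes) / len(task_outcomes)
```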

#### Achievement Ratio\.

To capture partial success, we measured the proportion of feature constraints satisfied by the generated item. Formally, this is calculated as the percentage of satisfied constraints in set $C$:

$$\frac{\sum_{i=1}^{M}\mathbb{I}[E_{X_i}(Q_C)=x_i]}{M}\times 100 \tag{4}$$

This metric reflects how closely a generated item aligns with the target specifications even when full satisfaction is not achieved. When multiple parallel candidate items are generated ($n>1$), we report the results based on the item achieving the maximum AR.
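
The Success and Achievement Ratio definitions, including the best-of-$n$ rule for parallel drafts, can be sketched as below. The dictionary-based item representation, the evaluator callables, and all function names are hypothetical and chosen only for illustration; they are not taken from the MAFIG code base.

```python
def satisfied_flags(item, constraints, evaluators):
    """One boolean per constraint: does E_X(item) equal the target value x?"""
    return [evaluators[name](item) == target for name, target in constraints.items()]


def success(item, constraints, evaluators):
    """Indicator of Equation 3: True only if every constraint is satisfied."""
    return all(satisfied_flags(item, constraints, evaluators))


def achievement_ratio(item, constraints, evaluators):
    """Equation 4: percentage of satisfied constraints for one item."""
    flags = satisfied_flags(item, constraints, evaluators)
    return 100.0 * sum(flags) / len(flags)


def best_of_n(drafts, constraints, evaluators):
    """A task succeeds if any draft succeeds; AR is the maximum over drafts."""
    any_success = any(success(d, constraints, evaluators) for d in drafts)
    max_ar = max(achievement_ratio(d, constraints, evaluators) for d in drafts)
    return any_success, max_ar


# Hypothetical evaluators that simply read precomputed feature values.
evaluators = {
    "passage_length": lambda item: item["length"],
    "vocabulary_level": lambda item: item["vocab"],
}
constraints = {"passage_length": "short", "vocabulary_level": "B1"}
draft_a = {"length": "short", "vocab": "B2"}  # violates the vocabulary constraint
draft_b = {"length": "short", "vocab": "B1"}  # satisfies both constraints
```

With these toy inputs, `draft_a` satisfies one of two constraints, so a single-draft run fails while a two-draft run that includes `draft_b` succeeds. The SR reported over a test set would then be the fraction of tasks for which the first return value is `True`.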

#### Difficulty Alignment Score\.

The DAS measures whether pairs of items generated for adjacent difficulty levels exhibit the intended difficulty order. We report the average DAS over item pairs sampled from adjacent difficulty levels (level-$i$ and level-$(i+1)$) and generated from identical source documents.

For evaluation using LLM judges, we adopted the pairwise comparison method defined in Equation [2](https://arxiv.org/html/2605.19316#S3.E2). GPT-5-mini (OpenAI, [2025](https://arxiv.org/html/2605.19316#bib.bib94)) served as the difficulty evaluator, with sampling parameters set to $N=4$, temperature $1.0$, and top-$p$ $1.0$. A score near $1$ indicates that the method can perfectly calibrate item difficulty. Conversely, a score near $-1$ implies that the method consistently controls difficulty in reverse of the intended order; that is, the item generated for Level $i$ is perceived as more difficult than the one generated for Level $i+1$. A score of $0$ signifies that the LLM judge exhibits low confidence, resulting in inconsistent outputs.
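
To make the interpretation of the three reference values concrete, the sketch below aggregates sampled pairwise judgments into a score in $[-1, 1]$. Equation 2 is defined in the main text and is not reproduced here, so this is only an assumed form: each sampled judgment is encoded as $+1$ when the judge picks the item from the higher level and $-1$ otherwise, and the score is the mean over the samples from both presentation orders. The function name and the encoding are hypothetical.

```python
def judge_das(forward_votes, reversed_votes):
    """Mean of signed judge votes over both presentation orders.

    Each vote is +1 if the judge picked the item from the higher level as the
    more difficult one, and -1 otherwise (assumed encoding).
    """
    votes = list(forward_votes) + list(reversed_votes)
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) / len(votes)


# N = 4 samples per presentation order, as in the reported setup.
consistent = judge_das([1, 1, 1, 1], [1, 1, 1, 1])          # intended order
inverted = judge_das([-1, -1, -1, -1], [-1, -1, -1, -1])    # reversed order
undecided = judge_das([1, -1, 1, -1], [-1, 1, -1, 1])       # inconsistent judge
```

Under this assumed aggregation, inconsistent sampling cancels out toward zero, which is why a score near zero reads as low judge confidence rather than as evidence for either ordering.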

In the evaluation with human experts, we similarly employed a pairwise estimation framework where annotators identified the more difficult question within a pair. If the item intended to be harder was correctly identified, the pair was assigned a score of $+1$; otherwise, it received $-1$. To evaluate whether the difficulty gap was non-trivial, annotators assigned one of three labels: (1) Almost no difference, where the items are nearly identical in difficulty; (2) Moderate difference, representing a gap that distinguishes “easy” and “hard” items for students of the same proficiency level; and (3) Large difference, indicating a substantial gap suitable for students of different proficiency levels.

Unlike statistically derived parameters, human perception of difficulty is inherently subjective, and an individual’s judgment of a “difficulty gap” can vary; some may perceive items within the same proficiency range as identical, while others may discern subtle nuances. We intentionally employed this three-level scheme to capture such fine-grained distinctions, encouraging annotators to recognize and report even minor variations. This granular approach ensured that evaluators remained sensitive to subtle differences during the pairwise comparison. For the final alignment metric, however, we consolidated these into two categories to focus on the presence of an educationally significant difference: Case 1 (Almost no difference) and Case 2 (Moderate or Large difference). This binary distinction evaluates whether the gap is sufficient to create a pedagogically meaningful distinction, even for students at the same proficiency level.

To compute a DAS that reflects whether the difficulty difference was educationally meaningful, we combined the three annotators’ responses with a weighted sum that yields a value between $-1$ and $+1$:

$$\frac{1}{3}\sum_{r=1}^{3} w_r a_r \tag{5}$$

where $a_r\in\{+1,-1\}$ denotes annotator $r$’s pairwise judgment, and $w_r$ represents the category weight: $w_r=0.5$ for Case 1 and $w_r=1$ for Case 2. The factor $\frac{1}{3}$ averages over the three annotators, which keeps the score within the stated range and matches the example values that follow. If all three annotators agreed with the intended ordering and perceived a distinguishable difficulty gap, the resulting score was $+1$, indicating perfectly aligned difficulty. Conversely, if at least one annotator disagreed while marking Case 1 (minimal gap), the score approached $0$ (absolute value $\approx 0.1667$), suggesting a negligible difficulty difference between the two items.
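
The human-expert DAS for a single pair can be sketched as follows. The sketch assumes that the weighted sum is averaged over the three annotators, which is what the stated range of $[-1, +1]$ and the reported value of about $0.1667$ imply; the function name and the `(judgment, case)` tuple encoding are hypothetical.

```python
# Category weights: Case 1 = almost no difference, Case 2 = moderate or large.
CASE_WEIGHTS = {1: 0.5, 2: 1.0}


def human_das(responses):
    """Average of w_r * a_r over annotators for one item pair.

    `responses` is a list of (a_r, case) tuples, where a_r is +1 if the
    annotator agreed with the intended ordering and -1 otherwise.
    """
    weighted = [CASE_WEIGHTS[case] * judgment for judgment, case in responses]
    return sum(weighted) / len(weighted)


# All three annotators agree and perceive a meaningful gap.
full_agreement = human_das([(+1, 2), (+1, 2), (+1, 2)])
# All three mark Case 1 and one of them disagrees with the intended order.
weak_signal = human_das([(+1, 1), (+1, 1), (-1, 1)])
```

The second example evaluates to $(0.5 + 0.5 - 0.5)/3 \approx 0.1667$, the near-zero magnitude mentioned in the text.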

#### Complete Alignment Ratio\.

This metric is defined as the proportion of pairs for which all human evaluators unanimously agreed that the observed ordering matched the intended direction. A higher CAR indicates a stronger ability to calibrate difficulty according to the specified feature constraints.
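
A minimal sketch of the CAR computation is given below, assuming each pair carries one $+1$ or $-1$ judgment per annotator, where $+1$ means the annotator's ordering matched the intended direction. The function name and input layout are hypothetical, and the result is returned as a fraction rather than a percentage.

```python
def complete_alignment_ratio(pair_judgments):
    """Fraction of pairs where every annotator matched the intended order.

    `pair_judgments` holds one list of +1/-1 judgments per item pair.
    """
    if not pair_judgments:
        raise ValueError("at least one pair is required")
    unanimous = sum(
        all(judgment == 1 for judgment in judgments)
        for judgments in pair_judgments
    )
    return unanimous / len(pair_judgments)


# Four pairs, three annotators each; only the first and third are unanimous.
example_pairs = [[1, 1, 1], [1, -1, 1], [1, 1, 1], [-1, -1, -1]]
```

Unlike the weighted DAS, this metric ignores the perceived size of the gap and gives no partial credit for majority agreement.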

#### Validity\.

We assessed the validity of the generated questions, including answerability, answer matching, and logical integrity, given that our target item format is MCFI (with the stem “Which statement is True based on the passage?”). We measured validity on a 3-point scale. A score of 1 means the item is unanswerable (no correct answer or multiple correct answers) or the generated answer (the statement with factuality True) is not the correct answer. A score of 2 means the item is answerable and the actual answer matches the intended answer, but the options are not mutually independent: at least two options stand in an entailment or contradiction relationship, so determining the factuality of one option also determines the factuality of the other without referencing the passage. This issue is not critical but degrades item quality. A score of 3 means the item is fully valid: it is answerable and all options are well constructed. Following Liu et al. ([2023](https://arxiv.org/html/2605.19316#bib.bib1)), validity was evaluated using GPT-5-mini by sampling eight times, and the average score was reported.

Table 3: Eight-level difficulty-calibrated feature constraint sequence. (Abbreviations: T: True, F: False, S: Single-sentence evidence, M: Multi-sentence evidence, WM: Word Matching, P: Paraphrasing, I: Inference)

#### Coherence and Fluency\.

To ensure readability and naturalness, we measured the coherence and fluency of generated passages using the UniEval framework, a widely used automatic evaluator that aligns closely with human judgments and is applied especially in text summarization.

## Appendix C Constructing Difficulty-Calibrated Feature Constraint Sequence

To construct the difficulty-calibrated feature constraint sequence utilized in our experiments, we first defined sixteen candidate constraint sets and generated corresponding items using MAFIG based on fifteen source documents (disjoint from the test set). We then computed the DAS for item pairs generated from constraint sets within a sliding window of size 5, ensuring that each item was compared with others of nearby theoretical difficulty, using GPT-5-mini as the judge with $N=4$ in Equation [2](https://arxiv.org/html/2605.19316#S3.E2). Constraint pairs with a DAS below 0.4 were filtered out, and from the remaining candidates, we constructed the final sequence, which exhibited a monotonically increasing difficulty level. Consequently, we obtained eight levels of difficulty-calibrated feature constraints that are both theoretically and empirically validated (see Table [3](https://arxiv.org/html/2605.19316#A2.T3)).
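
The windowed comparison and filtering steps can be sketched as follows. Two points are assumptions rather than details given in the text: a window of size 5 is read as "compare sets whose indices differ by fewer than 5", and the final selection is illustrated with a longest-chain search over the retained pairs, since the exact selection procedure is not specified. All function names are hypothetical.

```python
from itertools import combinations


def window_pairs(num_sets, window=5):
    """Index pairs (i, j), i < j, that fall inside one sliding window."""
    return [(i, j) for i, j in combinations(range(num_sets), 2) if j - i < window]


def keep_confident_pairs(das_by_pair, threshold=0.4):
    """Drop pairs whose DAS is below the threshold."""
    return {pair for pair, das in das_by_pair.items() if das >= threshold}


def longest_chain(num_sets, kept_pairs):
    """Longest index sequence in which every consecutive pair was retained.

    One possible selection rule, used here purely for illustration.
    """
    best = [[i] for i in range(num_sets)]  # best[j]: longest chain ending at j
    for j in range(num_sets):
        for i in range(j):
            if (i, j) in kept_pairs and len(best[i]) + 1 > len(best[j]):
                best[j] = best[i] + [j]
    return max(best, key=len)


# Toy DAS values for four candidate constraint sets.
toy_das = {(0, 1): 0.9, (1, 2): 0.1, (0, 2): 0.5, (2, 3): 0.4}
toy_kept = keep_confident_pairs(toy_das)
toy_sequence = longest_chain(4, toy_kept)
```

In the toy example, the pair `(1, 2)` is filtered out, so the chain skips candidate 1 and keeps candidates 0, 2, and 3. Under the window reading above, sixteen candidates yield 54 pairs to judge instead of all 120.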

## Appendix D Difficulty Alignment Score Statistics of LLM Judges

Table [4](https://arxiv.org/html/2605.19316#A4.T4) presents the DAS statistics across 8 repeated inferences (4 for each of the forward and reversed comparisons). Inherent variation across multiple samplings is reflected in the DAS, as higher variance pulls the mean score closer to 0. Beyond this, the STD values offer additional insights. For instance, comparing Level-based Incremental Prompting$_{\mathrm{Qwen3\text{-}32B}}$ with MAFIG$_{\mathrm{Qwen3\text{-}32B}}$, the two methods show a substantial difference in DAS while exhibiting nearly identical STDs. This suggests that although the frequency of inconsistent judgments was similar, the Level-based Incremental model more frequently produced reversed difficulty judgments (i.e., scores approaching $-1$), which offset the positive scores.

Table 4: DAS statistics of LLM judges. The range of DAS is $[-1,1]$.

## Appendix E Human Evaluation Setup

![Refer to caption](https://arxiv.org/html/2605.19316v1/x7.png)

Figure 7: Screenshots of instructions provided to human evaluators.

We recruited three professional English instructors via [Upwork](https://www.upwork.com/), selecting candidates with proven expertise in IELTS instruction or RC item development. The evaluation set comprised item pairs from adjacent difficulty levels (Level 1 to 8) sampled across six distinct source documents and three generation methods, resulting in 126 pairs per annotator. Each evaluator was compensated with $115 for the entire task, and all participants provided informed consent that their anonymized evaluations may be released for research purposes. Detailed annotation instructions are available in Figure [7](https://arxiv.org/html/2605.19316#A5.F7).

## Appendix F Feature-wise Analysis

![Refer to caption](https://arxiv.org/html/2605.19316v1/x8.png)

Figure 8: Feature-wise ARs for GPT-5 (Direct Prompting) and Qwen3-32B-based MAFIG, showing differential constraint satisfaction across feature types.

![Refer to caption](https://arxiv.org/html/2605.19316v1/x9.png)

(a) Successful examples within 100 rounds.

![Refer to caption](https://arxiv.org/html/2605.19316v1/x10.png)

(b) Failed examples within 100 rounds.

Figure 9: Visualization of feature-wise satisfaction across rounds for examples that (a) succeed or (b) fail to achieve full constraint satisfaction within 100 rounds in the option generation stage. Each cell in the heatmap represents the proportion of examples satisfying a given constraint at a specific round, where values closer to 1 indicate higher success rates and values closer to 0 indicate failure across all examples.

We compared the AR of GPT-5 under Direct Prompting and Qwen3-32B-based MAFIG (with a single draft, $n=1$) in generating items that satisfy six feature constraints used for difficulty control. For passage generation, we independently evaluated whether the generated passages satisfied the constraints on vocabulary level, passage length, and sentence length. For option generation, we assessed the satisfaction of neutrality among options and, for each option, the constraints of factuality, vocabulary level, and reasoning complexity.

As shown in Figure [8](https://arxiv.org/html/2605.19316#A6.F8), GPT-5 achieved high satisfaction ratios for surface-level and factuality-based features, specifically passage length, sentence length, and factuality. However, its satisfaction ratio sharply declined for the vocabulary level constraint, which requires alignment with an external lexical standard. Similarly, GPT-5 exhibited limited performance in satisfying deeper cognitive constraints, such as neutrality among options and reasoning complexity, both of which require analyzing cross-option relationships and inferential reasoning beyond surface features.

These findings indicate that while a reasoning-optimized model such as GPT-5 is sufficient to generate items aligned with surface-level or factual constraints, revision through MAFIG is essential for constraints that depend on external standards or demand more intricate cognitive control. Notably, GPT-5’s overall ARs were generally higher than those of Qwen3-32B before revision, suggesting that employing GPT-5 as an agent backbone within MAFIG could enable faster convergence toward fully constraint-satisfying items.

## Appendix G Case Study

Table 5: An MCFI item that fails to satisfy the feature constraints for Level 4. After 100 rounds of revision, the item still violates the neutrality and reasoning complexity constraints for options A and C.

Through the preceding experiments, we observed that passage generation was able to achieve constraint satisfaction in nearly all examples within 20 rounds (or within 5 rounds when the draft size was set to 5), whereas option generation occasionally failed to reach full satisfaction even after 100 rounds. In this section, we analyze what poses the greatest obstacles to constraint satisfaction in the option generation stage.

Figure [9](https://arxiv.org/html/2605.19316#A6.F9) illustrates which feature constraints posed the greatest obstacles to full satisfaction. Results from both successful and failed cases indicate that the model consistently struggled to satisfy the neutrality and the reasoning complexity constraints. Notably, in cases where full satisfaction was not achieved even after 100 rounds, neutrality was found to be the constraint most persistently violated.

A representative case of such neutrality violations is presented in Table [5](https://arxiv.org/html/2605.19316#A7.T5), where the target item was generated at Level 4. The passage associated with this item addresses the topic of “moral decision-making” and is written with a high degree of logical cohesion. Because the sentences within the passage are mutually interdependent, generating contradicted statements that maintain a neutral relationship with one another proves particularly challenging. This suggests that the topic itself gives rise to a passage more suited to higher-order comprehension tasks (such as Summarization or Main Idea Identification formats) than to the Factual Information format. In our experiments, the topic of each item was controlled through the source text, but was not treated as a variable requiring explicit control for difficulty calibration. In practical applications, incorporating passage topic as an additional factor in the item generation process would likely mitigate constraint satisfaction failures attributable to such misalignment between passage topic and item type.

## Appendix H Qualitative Analysis of Difficulty Factors

Table 6: Distribution of perceived difficulty factors derived from qualitative expert feedback.

To explore whether feature-based difficulty control effectively translates into actual difficulty calibration, we conducted a qualitative analysis based on the justifications provided by three domain experts during the pairwise evaluation. Experts were asked to specify the underlying factors that influenced their perception of difficulty for each item pair.

#### Clustering of Perceived Difficulty Factors\.

We employed a three\-stage prompting pipeline using GPT\-5 to systematically cluster the expert justifications\. First, we extracted general difficulty\-related keywords and phrases from each justification\. Second, the model was prompted to generate representative cluster labels by synthesizing these extracted factors\. After manual refinement, we performed a final classification step where each initial factor was mapped to its most appropriate cluster\.

As summarized in Table [6](https://arxiv.org/html/2605.19316#A8.T6), the answer verification difficulty cluster, which encompasses factors related to mapping between the passage and options as well as reasoning complexity, was the most prevalent, accounting for 57.85% of expert mentions. This aligns closely with the reasoning complexity feature we explicitly controlled in our experiments. While this dimension is particularly challenging to satisfy via single-pass prompting (as discussed in Section [F](https://arxiv.org/html/2605.19316#A6)), our results confirm that it is the most critical factor used by experts to calibrate or assess item difficulty. Conversely, factors not directly targeted by our framework, such as figurative language, tone, and option length, were also noted (approximately 6%), suggesting that these unmonitored features may introduce unintended variations in difficulty.

![Refer to caption](https://arxiv.org/html/2605.19316v1/x11.png)

Figure 10: Expert-perceived difficulty factors for each level pair.

#### Alignment between Controlled Features and Expert Perception\.

Figure [10](https://arxiv.org/html/2605.19316#A8.F10) visualizes the relationship between the intended difficulty calibration in our eight-level feature constraint sequence and the factors actually perceived by experts. In our experimental setup, adjacent difficulty levels were differentiated by intentionally increasing the cognitive complexity of specific features.

The visualization reveals strong alignment in pairs where reasoning complexity was the intended differentiator (e.g., Level 1–2, Level 3–4, Level 4–5, and Level 7–8); in these cases, experts frequently cited factors belonging to the answer verification difficulty cluster. Interestingly, this cluster was also prominent in the Level 6–7 pair, where passage length was the primary variable; we hypothesize that increased passage length made locating evidence for options more cognitively demanding, thereby indirectly affecting verification difficulty. Furthermore, intended increases in vocabulary level (Level 3–4, Level 5–7) and adjustments in sentence and passage length were accurately identified by experts as primary drivers of difficulty disparity. Consequently, MAFIG can effectively help item writers modulate the difficulty of RC items by precisely manipulating specific features in accordance with their pedagogical intentions.

However, we observed instances where unintended features were cited as difficulty factors. This likely stems from inherent correlations between linguistic features; for example, increasing sentence length often leads to more complex syntactic structures and the inclusion of more sophisticated vocabulary. While our framework manages features independently, these results suggest that modeling the interdependencies between difficulty factors is a promising avenue for future improvement of the system.

## Appendix I Prompt Templates

Figures [11](https://arxiv.org/html/2605.19316#A9.F11), [12](https://arxiv.org/html/2605.19316#A9.F12), [13](https://arxiv.org/html/2605.19316#A9.F13), [14](https://arxiv.org/html/2605.19316#A9.F14), and [15](https://arxiv.org/html/2605.19316#A9.F15) present the prompt templates used during the passage generation stage. The templates used during the option generation stage and for the LLM-based evaluation can be found in our GitHub repository: [https://github.com/SeonjeongHwang/mafig](https://github.com/SeonjeongHwang/mafig).

Figure 11: Prompt template for the Drafter agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value.

Figure 12: Prompt template for the Planner agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value.

Figure 13: Prompt template for the Editor agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value.

Figure 14: Prompt template for the Reworder agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value.

Figure 15: Prompt template for the Refiner agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value.
