Edu-Theater: A Data-Efficient Agent Framework for Scalable Learner Behavior Simulation through Staging Roll-Call
Summary
Edu-Theater is a data-efficient agent framework that uses LLM-powered generative agents to simulate learner behavior in educational settings. It employs a cohort-aware roll-call paradigm to infer learner states with less data and fewer LLM calls, achieving higher simulation accuracy than existing simulators.
Source: [https://arxiv.org/html/2606.15225](https://arxiv.org/html/2606.15225)
A Data\-Efficient Agent Framework for Scalable Learner Behavior Simulation through Staging Roll\-Call
Weibo Gao1,2, Qi Liu1,2, Linan Yue3, Zheng Zhang1,2, Yichao Du4, Fangzhou Yao1,2, Ao Yu1,2, Zhenya Huang1,2, Shijin Wang5,2
1 University of Science and Technology of China; 2 State Key Laboratory of Cognitive Intelligence; 3 Southeast University; 4 Alibaba Group; 5 iFLYTEK Co., Ltd.
###### Abstract
Large-scale learner-task interaction data are crucial for intelligent educational systems but are costly to collect and constrained by privacy and learner engagement. Learner simulators play a critical role in simulating scalable learner behavior without the need for continuous involvement of real learners. However, existing methods are predominantly individual-centric, pairing a simulator with each learner to iteratively infer latent knowledge states from dense interaction histories, which is both data- and computation-intensive, and fragile in cold-start scenarios. We propose a cohort-aware roll-call simulation paradigm that first constructs cohort-level proficiency priors and refines individual learner states through a small number of targeted diagnostic queries. Based on this paradigm, we introduce Edu-Theater, an LLM-powered agent system that performs cohort-aware learner simulation via a teacher agent and retrospective roll-call probing over learner logs. Edu-Theater enables scalable future behavior simulation without the need for dense per-learner histories. Experiments on two real-world datasets demonstrate that Edu-Theater achieves higher simulation accuracy with significantly fewer LLM calls, producing synthetic data that enhances downstream applications such as adaptive testing.
Keywords: LLM Agent, Educational Data Mining, Data Synthesis, Human Simulation
![[Uncaptioned image]](https://arxiv.org/html/2606.15225v1/x1.png)
Figure 1: Illustration of Edu-Theater, a learner behavior simulation framework powered by LLM-enabled generative agents. The system simulates a classroom environment where a central Teacher Agent (center) conducts cohort-wide diagnostic probing (right) across learners with diverse cognitive traits and learning histories. Edu-Theater uses population-aware priors to efficiently infer learner states, while also having the flexibility to revert to a traditional ‘Private Tutor’ mode (left) for individualized interactions. Our approach enables scalable, high-fidelity simulations of learner behavior in educational settings, utilizing the power of LLM-based agents to model complex learning processes.
## 1 Introduction
Education is one of the most interactive domains for generative AI \[[7](https://arxiv.org/html/2606.15225#bib.bib84)\]. Each learning task, such as an exercise or a lesson, elicits learner behaviors and leaves behind interaction traces, including submitted answers and revision trajectories. Modern tutoring platforms leverage large-scale learner-task interactions to model learning progress and provide personalized support, such as hints and task recommendations \[[19](https://arxiv.org/html/2606.15225#bib.bib30)\]. However, collecting such records is costly because learning requires time and cognitive effort, and data collection is often constrained by privacy policies \[[10](https://arxiv.org/html/2606.15225#bib.bib11)\]. As a result, training and evaluating personalized support algorithms directly on real learners’ interaction data at scale is difficult. Generative learner simulators therefore serve as a key enabler, producing scalable interaction data from observed histories and supporting efficient offline training and evaluation without continuously involving real learners \[[35](https://arxiv.org/html/2606.15225#bib.bib9)\].
We revisit learner simulation from a simple yet fundamental perspective: *How can a simulator infer a learner’s latent state from observed interaction data and use it as evidence to simulate future learning behaviors?* Most existing approaches \[[18](https://arxiv.org/html/2606.15225#bib.bib87),[35](https://arxiv.org/html/2606.15225#bib.bib9),[11](https://arxiv.org/html/2606.15225#bib.bib85),[37](https://arxiv.org/html/2606.15225#bib.bib88)\] follow an individual-centric paradigm. To understand a learner, the simulator is paired *one-to-one* with the same learner over time. It observes learning traces as the learner engages with learning tasks and gradually forms an individualized estimate of the learner’s knowledge mastery state to predict future behaviors. In this sense, a learner simulator resembles a companion private tutor. Its operation can be viewed as presenting learning tasks to the learner, summarizing knowledge mastery from the returned learning traces, and using these signals to simulate future behaviors, as illustrated in the left part of Figure [1](https://arxiv.org/html/2606.15225#S0.F1). This private-tutor-style paradigm relies on sufficiently rich interaction data to enable fine-grained personalization and long-term state tracking, but it comes with a cost. Dependence on dense histories and repeated model invocations makes it data- and computation-intensive at scale. Moreover, when observations are limited, individual-level state estimation can become unstable, making long-horizon simulation difficult. This raises a key question: *Can we make learner simulation more data-efficient, reducing cost while remaining reliable under limited observations?*
To address this challenge, the private-tutoring lens naturally leads us to a contrasting yet familiar classroom setting. When facing a newly enrolled class, teachers rarely build a deep understanding for every student from the start. Instead, they often rely on roll-call-style diagnostic probing. By observing each student’s initial behaviors from only a few learning tasks and drawing on experience with previously taught *similar cohorts*, teachers can quickly form a coarse view of the class’s proficiency distribution without exhaustively querying every student one by one. During instruction, a few rounds of roll-call questioning are then sufficient to pinpoint each individual’s learning state and anticipate subsequent learning behaviors, since most students differ only slightly from the cohort-level baseline. This observation suggests that learner simulation can benefit from cohort-aware inference \[[4](https://arxiv.org/html/2606.15225#bib.bib89)\]. The key intuition is that learners are not independent. Students with similar prior experiences tend to follow similar learning trajectories, and group-level regularities can serve as informative priors for estimating individual latent states. Motivated by this, we formalize a roll-call simulation paradigm. The simulator first constructs population-level priors from observed cohort experience by grouping learners with similar histories and estimating cohort-level proficiency distributions. It then refines each learner’s state through a small number of targeted interactive queries, thereby generating future behaviors while preserving individual variation in a single pass.
Based on this insight, we propose Edu-Theater, a data-efficient, LLM-powered agent system that operationalizes roll-call-style learner understanding for scalable learner behavior simulation. True to its name, each simulation run resembles a staged “performance” that combines cohort-aware diagnosis with targeted roll-call probing. All interactions are retrospective, and no real learners are queried, as illustrated in a toy example in the right part of Figure [1](https://arxiv.org/html/2606.15225#S0.F1). Edu-Theater begins by casting learners into cohorts (“classes”) according to their learning histories and appoints a *teacher agent* to orchestrate the simulation. Rather than modeling each learner as a full-fledged agent as in previous methods, Edu-Theater represents *learners* as lightweight, indexed log *databases* that return grounded interaction traces when queried. This design provides reliable evidence while avoiding expensive learner-side long-horizon reasoning. During rehearsal, the teacher agent maintains a cohort-level knowledge proficiency state and derives individualized learner knowledge proficiency estimates from retrospective roll-call traces, while a cognitive diagnosis prop provides auxiliary mastery references. When the two mastery assessments are inconsistent, a teacher-side retriever is used to propose informative probing tasks for further refinement. During this process, teachers from different cohorts can also discuss and share their cohort experiences and, when appropriate, adjust a learner to a more suitable cohort. Once the assessments align, the learner is considered fully understood. After rehearsal, the teacher performs scalable future behavior simulation by replaying and re-organizing existing evidence. Overall, Edu-Theater reframes learner simulation from isolated one-to-one tutoring into a cohort-aware roll-call protocol. This enables reliable simulation with substantially reduced reliance on extensive interaction histories. It is worth noting that Edu-Theater maintains the flexibility to switch to a traditional ‘Private Tutor’ mode when each cohort contains a single learner.
Our contributions are summarized as follows:
- New Perspective. We approach learner simulation from a teacher’s perspective, emphasizing the importance of data-efficient learner simulation.
- New Framework. We introduce Edu-Theater, a cohort-aware, LLM-powered simulation framework that uses population-level priors and individualized probing to efficiently predict learner behaviors with minimal data.
- Empirical Evaluation. We evaluate Edu-Theater through extensive experiments on two real-world datasets, demonstrating its superior performance and cost-efficiency compared to existing simulators, particularly under data-sparse conditions.
## 2 Preliminary
### 2.1 Roles and Props in Edu-Theater
Edu\-Theater formulates learner simulation as a staged classroom roll\-call, where all interactions are retrospective and no real learners are queried\. The system replays historical learner\-task interactions, with a teacher agent performing reasoning and generation\. Each simulation within a cohort involves two roles: a teacher agent and a cohort of learners, supported by two props for task retrieval and cognitive diagnosis estimation\.
#### 2.1.1 Roles in Edu-Theater
##### Teacher (Agent) $\mathcal{A}^{\mathcal{T}}$.
The teacher agent $\mathcal{A}^{\mathcal{T}}$ is the central actor in Edu-Theater that generates simulated learner behaviors. It maintains an internal memory state $\mathcal{M}$ summarizing retrieved interaction evidence. Given a target task $e^{\star}$ and learner-task interaction traces $\mathcal{Q}_u$, the teacher produces a predicted outcome of learner $u$ as
$$\hat{y}_{u,e^{\star}} = \mathcal{A}^{\mathcal{T}}\!\left(x_{e^{\star}}, \mathcal{M}, \mathcal{Q}_u\right), \tag{1}$$
where $\mathcal{Q}_u$ denotes a small set of historical interaction records for learner $u$, with each record in the form of a task–response pair $(x_e, y_{u,e})$, e.g., $x_e$ is a problem statement or lesson prompt, and $y_{u,e}$ is the corresponding answer or action.
##### Learner (Log Database) $\mathcal{D}_u$.
Each learner $u$ is represented as a static, indexed log database $\mathcal{D}_u$ that stores historical interactions and serves as the sole evidence source for retrospective roll-call. Given a task $e$, it returns the corresponding recorded interaction:
$$(x_e, y_{u,e}) = \mathcal{D}_u(e). \tag{2}$$
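As a rough illustration of Eq. (2), a learner can be sketched as a read-only lookup table keyed by task. The class name `LearnerLog`, the `query` method, and the example record below are assumptions made for illustration and are not taken from the Edu-Theater implementation. A missing record returns `None`, mirroring the case where a retrospective query yields no observation.

```python
from typing import Dict, Optional, Tuple

# Hypothetical record type: (task text x_e, binary response y_{u,e}).
Record = Tuple[str, int]


class LearnerLog:
    """Illustrative stand-in for the static log database D_u of one learner."""

    def __init__(self, records: Dict[str, Record]) -> None:
        # Copy so later changes to the caller's dict cannot alter the log.
        self._records = dict(records)

    def query(self, task_id: str) -> Optional[Record]:
        """Return the recorded (x_e, y_{u,e}) for a task, or None if absent."""
        return self._records.get(task_id)


log = LearnerLog({"e1": ("Which clause filters rows?", 1)})
print(log.query("e1"))  # the recorded interaction for task e1
print(log.query("e9"))  # None: this learner never attempted e9
```

Because the log never generates new behavior, every answer the teacher sees is grounded in a recorded interaction.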
#### 2.1.2 Props in Edu-Theater
##### Cognitive Diagnosis Model.
The cognitive diagnosis prop is a supervised model that fits collected learner–task interactions and outputs concept-wise mastery estimates for each learner:
$$\boldsymbol{\mu}_u = \mathrm{CD}_{\phi}(\mathcal{Q}_u), \qquad \mu_u(c) \in [0,1]. \tag{3}$$
The model is shared across all cohorts and trained by fitting task-solving correctness, minimizing the supervised loss $\sum_{u} \mathcal{L}_{\mathrm{CD}}(\phi; \mathcal{Q}_u)$. In our implementation, we adopt NeuralCD \[[29](https://arxiv.org/html/2606.15225#bib.bib10)\] as the cognitive diagnosis prop, where each dimension of $\boldsymbol{\mu}_u$ corresponds to the mastery estimate $\mu_u(c)$ of a knowledge concept $c$.
##### Teacher\-side Task Retriever\.
The task retriever $\mathcal{R}$ proposes a small set of informative probing tasks for each learner, based on the teacher’s current assessment and auxiliary diagnostic references. The specific retrieval mechanism is detailed in Section [3](https://arxiv.org/html/2606.15225#S3).
### 2.2 Problem Formulation
We consider a learner population $u \in \mathcal{U}$ and a set of learning tasks $e \in \mathcal{E}$. Each learner is associated with a set of historical interaction logs stored in a learner-specific database $\mathcal{D}_u$. Given the defined roles and supporting props, Edu-Theater operates in a staged classroom setting, where all interactions are retrospective and no real learners are queried. Given a future unseen task $e^{\star}$, the goal of learner simulation is to predict the learner’s response $\hat{y}_{u,e^{\star}}$ (Eq. ([1](https://arxiv.org/html/2606.15225#S2.E1))), while relying only on limited retrospective interaction records and efficient teacher-side model invocations.
Edu-Theater is an LLM-powered agent framework for *roll-call-style* learner simulation with *population-aware inference*. The framework adopts a theatrical abstraction: the teacher agent conducts a staged classroom performance, while learners are passive log entities whose past behaviors are replayed on demand. All probing operations are strictly retrospective, and no online interaction with real learners is involved. The system only reorganizes and summarizes existing historical logs, while the teacher performs inference and generation as if conducting a classroom roll-call. Following this metaphor, each simulation run consists of three phases, as shown in Figure [2](https://arxiv.org/html/2606.15225#S3.F2): Casting (cohort construction), Rehearsal (retrospective roll-call diagnosis and learner state estimation), and Final Performance (teacher-centered scalable simulation). Each learner $u$ is represented as a static log database $\mathcal{D}_u$, while a single teacher agent $\mathcal{A}^{\mathcal{T}}$ orchestrates evidence retrieval, learner state estimation, and behavior generation across all cohorts. All the prompts in this section are provided in the Appendix.
### 3.1 Stage I: Casting (Cohort Construction)
Casting constructs *cohort-level priors* by grouping learners with similar observed histories. Each cohort serves as a minimal *classroom unit* that contains at least one learner and is paired with a dedicated teacher agent. This design allows the teacher to reuse population-level regularities as informative baselines, rather than building fully personalized models for every learner from scratch.
For each learner $u$, we compute a representation vector $\mathbf{r}_u$ and cluster $\{\mathbf{r}_u\}$ into $K$ cohorts $\{C_k\}_{k=1}^{K}$. Each cohort $C_k$ is associated with a cohort memory $\mathcal{M}_k$ and a cohort-specific teacher agent $\mathcal{A}^{\mathcal{T}}_k$. All teacher agents share the same underlying reasoning parameters, but maintain independent memory states to support population-aware and context-sensitive inference.
When task-solving correctness logs are available, we instantiate $\mathbf{r}_u$ using a lightweight task-wise status encoding inspired by prior work \[[22](https://arxiv.org/html/2606.15225#bib.bib29)\]. For each task $e \in \mathcal{E}$, we define
$$\mathbf{r}_{u,e} = \begin{cases} [1,0,0] & \text{if $u$ attempted $e$ and } y_{u,e}=0, \\ [0,0,1] & \text{if $u$ attempted $e$ and } y_{u,e}=1, \\ [0,1,0] & \text{if $u$ did not attempt $e$}, \end{cases} \tag{4}$$
and construct
$$\mathbf{r}_u = \mathrm{Concat}(\{\mathbf{r}_{u,e}\}_{e \in \mathcal{E}}) \in \mathbb{R}^{3|\mathcal{E}|}. \tag{5}$$
This encoding yields a high-dimensional but sparse representation of each learner’s historical status over the task space.
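The status encoding of Eqs. (4) and (5) can be sketched in a few lines. The function name `encode_learner`, the dictionary-based input format, and the task identifiers are illustrative assumptions rather than part of the original system.

```python
from typing import Dict, List

# Status codes from Eq. (4): incorrect, not attempted, correct.
INCORRECT, NOT_ATTEMPTED, CORRECT = [1, 0, 0], [0, 1, 0], [0, 0, 1]


def encode_learner(responses: Dict[str, int], tasks: List[str]) -> List[int]:
    """Concatenate per-task status codes into r_u (length 3 * |E|)."""
    vector: List[int] = []
    for task in tasks:  # fixed task order so all learners share one layout
        if task not in responses:
            vector.extend(NOT_ATTEMPTED)
        elif responses[task] == 1:
            vector.extend(CORRECT)
        else:
            vector.extend(INCORRECT)
    return vector


tasks = ["e1", "e2", "e3"]
r_u = encode_learner({"e1": 1, "e3": 0}, tasks)
print(r_u)  # [0, 0, 1, 0, 1, 0, 1, 0, 0]
```

The resulting vectors can be handed to any clustering routine to form the $K$ cohorts. This section does not name the clustering algorithm, so none is assumed here.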
When correctness logs are unavailable, $\mathbf{r}_u$ can alternatively be obtained by embedding an LLM-generated summary of the learner’s historical interactions, depending on specific task requirements. Casting provides a coarse cohort prior using lightweight features, while fine-grained learner-specific adaptation is deferred to Stage II rehearsal. Cohort assignment may optionally be revised during rehearsal when newly retrieved evidence indicates systematic mismatch; unless otherwise specified, cohort membership is kept fixed for stability.
### 3.2 Stage II: Rehearsal (Retrospective Roll-call Diagnosis)
Rehearsal equips the teacher with a compact yet diagnostic view of each learner by selectively retrieving a small set of historical interaction traces. Instead of consuming full learner histories or executing long simulated trajectories as in previous methods, Edu-Theater performs *retrospective roll-call* by querying static log databases. Within each cohort $C_k$, rehearsal proceeds in two acts: whole-class retrospective diagnosis for cohort calibration, followed by individualized informative probing for learner state estimation refinement.

Figure 2: Overview of the Edu-Theater framework. The simulation process consists of three key phases: I. Casting (cohort construction), II. Rehearsal (retrospective roll-call diagnosis and learner state estimation), and III. Final Performance (teacher-centered scalable simulation). Each learner is represented as a static log database, and the teacher agent orchestrates interaction log retrieval, learner state estimation, and behavior generation.
#### 3.2.1 Act I: Whole-class Retrospective Diagnosis
Act I aims to obtain a high\-coverage cohort sketch by retrieving a diverse set of historical interactions, enabling the teacher to characterize cohort\-level proficiency patterns and calibrate the cognitive diagnosis model\.
Following coverage-based designs \[[5](https://arxiv.org/html/2606.15225#bib.bib37)\], a probe set $\mathcal{E}^{\mathrm{cov}}$ is constructed to maximize knowledge concept coverage. Each task $e$ is associated with a knowledge concept set $\mathcal{K}(e)$. For each concept $c$, we randomly sample $p$ tasks satisfying $c \in \mathcal{K}(e)$ and take their union to construct the probe set:
$$\mathcal{E}^{\mathrm{cov}} = \mathrm{Cover}(\mathcal{E}), \tag{6}$$
where $\mathrm{Cover}(\cdot)$ samples $p$ tasks per concept whenever available, and $|\mathcal{E}^{\mathrm{cov}}| \ll |\mathcal{E}|$.
For each learner $u \in C_k$ and each task $e \in \mathcal{E}^{\mathrm{cov}}$, the teacher retrieves the recorded interaction $(x_e, y_{u,e})$ from $\mathcal{D}_u$ if it exists. Missing records yield no observation. The retrieved interactions for learner $u$ form a retrospective query set
$$\mathcal{Q}_u = \{(x_e, y_{u,e}) \mid e \in \mathcal{E}^{\mathrm{cov}},\ (x_e, y_{u,e}) \in \mathcal{D}_u\}. \tag{7}$$
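A minimal sketch of the coverage sampling in Eq. (6) and the query set in Eq. (7) follows. The function names `cover` and `retrospective_queries`, the `seed` argument, and the toy task and concept labels are assumptions for illustration; only the per-concept sampling of up to $p$ tasks and the filtering by recorded interactions come from the text.

```python
import random
from typing import Dict, List, Set, Tuple


def cover(task_concepts: Dict[str, Set[str]], p: int, seed: int = 0) -> Set[str]:
    """Sample up to p tasks per concept and return their union (the probe set)."""
    rng = random.Random(seed)
    by_concept: Dict[str, List[str]] = {}
    for task in sorted(task_concepts):
        for concept in task_concepts[task]:
            by_concept.setdefault(concept, []).append(task)
    probe: Set[str] = set()
    for concept in sorted(by_concept):
        tasks = by_concept[concept]
        # "Whenever available": take all tasks if fewer than p exist.
        probe.update(rng.sample(tasks, min(p, len(tasks))))
    return probe


def retrospective_queries(
    log: Dict[str, Tuple[str, int]], probe: Set[str]
) -> Dict[str, Tuple[str, int]]:
    """Build Q_u: keep only probe tasks that appear in the learner's log."""
    return {task: log[task] for task in sorted(probe) if task in log}


task_concepts = {
    "e1": {"join"},
    "e2": {"join"},
    "e3": {"join", "select"},
    "e4": {"select"},
}
probe = cover(task_concepts, p=1)
learner_log = {"e1": ("join question", 1), "e4": ("select question", 0)}
q_u = retrospective_queries(learner_log, probe)
print(sorted(probe), q_u)
```

Tasks missing from the learner's log are skipped silently, so $\mathcal{Q}_u$ can be smaller than the probe set.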
The agent leverages its internal LLM (e.g., GPT-4o) to aggregate the collected cohort-level interaction data into the cohort memory as:
$$\mathcal{M}_k = \mathrm{LLM}\big(\{\mathcal{Q}_u\}_{u \in C_k}\big). \tag{8}$$
Here, $\mathcal{M}_k$ stores a compact description of the cohort’s overall knowledge mastery patterns. The cognitive diagnosis model is refined using aggregated learner–task interaction data $\{\mathcal{Q}_u\}_{u \in \mathcal{U}}$ via gradient-based updates:
$$\phi \leftarrow \phi - \eta \nabla_{\phi} \sum_{u \in \mathcal{U}} \mathcal{L}_{\mathrm{CD}}(\phi; \mathcal{Q}_u), \tag{9}$$
where $\eta$ is the learning rate. At this stage, both the teacher agent and the cognitive diagnosis prop only provide coarse and potentially imperfect assessments of individual learners, which motivates further refinement through subsequent individualized probing.
#### 3.2.2 Act II: Individualized Informative Probing
Act II reduces uncertainty about individual learners by selectively retrieving additional evidence through roll-call scenarios. The teacher relies on cohort-level memory and each learner’s accumulated interaction history to assess the learner. In each round $r \in \{1, \dots, R\}$, the teacher performs at most one roll-call query for each learner under a strict probing budget $R$.
Given the current cohort memory $\mathcal{M}_k^{(r)}$ at round $r$ ($\mathcal{M}_k^{(0)} = \mathcal{M}_k$) and the learner’s interaction record $\mathcal{Q}_u$, the teacher infers the learner’s current knowledge state. The interaction record $\mathcal{Q}_u$ serves as the direct individual evidence, while the cohort memory provides a population-level prior. We do not construct a separate long-term memory summary for each learner. Instead, when a learner exhibits systematic deviations or boundary behaviors, such observations are incorporated into the shared cohort memory during the update step. The resulting assessment is represented by a teacher-side estimate $\tilde{\mu}_u^{(r)}(c)$ for each knowledge concept $c$.
Meanwhile, the cognitive diagnosis prop provides an auxiliary mastery reference $\mu_u(c)$ for learner $u$ on concept $c$, based on the *same individual interaction traces*. At this stage, both the teacher agent and the cognitive diagnosis prop rely on incomplete evidence, and their assessments of individual learners are inherently biased and potentially imperfect. A learner is considered sufficiently understood only when the two assessments are largely consistent.
The comparison between these two estimates is therefore used to identify concepts on which the learner’s state remains ambiguous and may benefit from additional evidence. We define disagreement as
$$D_u^{(r)}(c) = \left|\mu_u(c) - \tilde{\mu}_u^{(r)}(c)\right|, \tag{10}$$
and quantify the internal ambiguity of the auxiliary diagnosis by
$$U_u^{(r)}(c) = \mu_u(c)\big(1 - \mu_u(c)\big). \tag{11}$$
This scoring rule prioritizes concepts that are both inconsistent with the teacher’s assessment and internally ambiguous under the auxiliary reference.
Based on $D_u^{(r)}(c)$ and $U_u^{(r)}(c)$, the teacher-side task retriever proposes a small set of candidate probing tasks for learner $u$:
$$\mathcal{C}^{(r)}(u) = \mathcal{R}(\tilde{\mu}_u^{(r)}, \mu_u), \tag{12}$$
where each candidate task $e \in \mathcal{C}^{(r)}(u)$ is scored by
$$\mathrm{Score}^{(r)}(u, e) = \sum_{c \in \mathcal{K}(e)} \big(D_u^{(r)}(c) + U_u^{(r)}(c)\big). \tag{13}$$
The highest-scoring task $e^{(r)}$ is selected and its historical interaction $(x_{e^{(r)}}, y_{u,e^{(r)}})$ is retrieved from $\mathcal{D}_u$ if it exists. If the record is missing, no observation is returned. The retrieved interaction is appended to the learner’s query set:
$$\mathcal{Q}_u \leftarrow \mathcal{Q}_u \cup \{(x_{e^{(r)}}, y_{u,e^{(r)}})\}. \tag{14}$$
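The scoring rule of Eqs. (10), (11), and (13) can be checked on a toy case. The function names, the dictionary-based mastery estimates, and all numeric values below are illustrative assumptions. In the actual system the teacher-side estimate comes from the LLM and the auxiliary reference from the cognitive diagnosis prop.

```python
from typing import Dict, Iterable, Set


def disagreement(mu_cd: Dict[str, float], mu_teacher: Dict[str, float], c: str) -> float:
    """Eq. (10): absolute gap between the two mastery estimates."""
    return abs(mu_cd[c] - mu_teacher[c])


def ambiguity(mu_cd: Dict[str, float], c: str) -> float:
    """Eq. (11): largest when the auxiliary estimate is near 0.5."""
    return mu_cd[c] * (1.0 - mu_cd[c])


def score(concepts: Set[str], mu_cd: Dict[str, float], mu_teacher: Dict[str, float]) -> float:
    """Eq. (13): sum of disagreement and ambiguity over a task's concepts."""
    return sum(disagreement(mu_cd, mu_teacher, c) + ambiguity(mu_cd, c) for c in concepts)


def select_probe(
    candidates: Iterable[str],
    task_concepts: Dict[str, Set[str]],
    mu_cd: Dict[str, float],
    mu_teacher: Dict[str, float],
) -> str:
    """Pick the highest-scoring candidate task (ties broken by task id)."""
    return max(sorted(candidates), key=lambda e: score(task_concepts[e], mu_cd, mu_teacher))


mu_cd = {"join": 0.5, "select": 0.9}       # auxiliary reference (hypothetical)
mu_teacher = {"join": 0.9, "select": 0.9}  # teacher-side estimate (hypothetical)
task_concepts = {"e1": {"join"}, "e2": {"select"}, "e3": {"join", "select"}}

best = select_probe(task_concepts, task_concepts, mu_cd, mu_teacher)
print(best)  # e3: it covers both the disputed and the mildly ambiguous concept
```

Here `e1` scores about 0.65 (0.4 disagreement plus 0.25 ambiguity), `e2` about 0.09, and `e3` about 0.74. Because every term in Eq. (13) is non-negative, tasks tagged with more concepts tend to score higher.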
After all learners are processed in the current round, the cohort memory is updated:
$$\mathcal{M}_k^{(r+1)} = \mathrm{LLM}\big(\{\mathcal{Q}_u\}_{u \in C_k}\big), \tag{15}$$
which revises the teacher’s class-level priors. The cognitive diagnosis prop is then incrementally refined using the updated $\{\mathcal{Q}_u\}_{u \in C_k}$ as auxiliary diagnostic references.
After the cohort memory is updated in each round, the teacher agent broadcasts the profiles of the learners it has just probed to a small subset of other teacher agents, and queries them one by one about whether these learners would fit better in their cohorts. Each receiving teacher compares the learner’s interaction history with its own cohort summary and returns a binary decision. Once a cohort that provides a more consistent explanation is found, the learner is reassigned to that cohort and the cross-cohort review for this learner terminates. We do not exhaustively query all teachers or search for the globally best cohort. Although doing so could yield a better reassignment, it would incur prohibitive token costs, and in practice identifying a locally better cohort already provides a meaningful improvement over making no adjustment.
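The first-accept review described above can be sketched as a short loop. The function `review_cohort` and the `fits_better` callback are hypothetical names; in Edu-Theater the binary decision would come from a receiving teacher agent comparing the learner's history with its cohort summary, which is replaced here by a plain Python predicate.

```python
from typing import Callable, Iterable, List


def review_cohort(
    home: str,
    peers: Iterable[str],
    fits_better: Callable[[str], bool],
) -> str:
    """Ask peer cohorts one by one and move to the first one that accepts."""
    for peer in peers:
        if fits_better(peer):  # binary decision from the receiving teacher
            return peer        # stop early, no search for the global optimum
    return home                # no peer accepted: keep the current cohort


asked: List[str] = []


def judge(cohort: str) -> bool:
    """Stand-in for an LLM judgment; records which cohorts were consulted."""
    asked.append(cohort)
    return cohort == "C7"


new_cohort = review_cohort("C2", ["C5", "C7", "C9"], judge)
print(new_cohort, asked)  # C7 ['C5', 'C7']
```

Cohort `C9` is never consulted, which is where the token savings of stopping at the first acceptable cohort come from.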
This iterative roll-call process continues until the probing budget is exhausted ($r = R$), indicating that further queries are unlikely to yield informative gains.
### 3.3 Stage III: Final Performance (Teacher-centered Simulation)
After rehearsal, the teacher agent is treated as a calibrated performer that encodes both cohort-level learning patterns and individual learner histories. Edu-Theater then conducts scalable simulation with the teacher as the only agentic actor, while learners remain passive log entities. At this stage, the cognitive diagnosis prop can also be applied to the same interaction records to provide auxiliary concept-wise mastery estimates $\boldsymbol{\mu}_u$ as a reference.
Given a target task $e^{\star}$, the teacher generates a simulated learner response by conditioning on the task content, the cohort memory, and the learner’s interaction history:
$$\hat{y}_{u,e^{\star}} = \mathcal{A}^{\mathcal{T}}\!\left(x_{e^{\star}}, \mathcal{M}_k, \mathcal{Q}_u, \boldsymbol{\mu}_u\right), \tag{16}$$
where $\hat{y}_{u,e^{\star}}$ is the simulated outcome for learner $u$ on task $e^{\star}$.
### 3.4 Computational Cost
The computational cost of Edu-Theater is mainly determined by LLM invocations and token consumption. In Act I, the system performs at most $|\mathcal{U}||\mathcal{E}^{\mathrm{cov}}|$ retrospective log queries, where $|\mathcal{U}|$ is the number of learners and $|\mathcal{E}^{\mathrm{cov}}|$ is the size of the coverage probe set. In Act II, it conducts at most $|\mathcal{U}|R$ informative queries under the probing budget $R$. The total number of queries is thus $O(|\mathcal{U}|(|\mathcal{E}^{\mathrm{cov}}| + R))$. The number of LLM invocations is $K(R+1)$: $K$ initial calls for cohort memory construction and $KR$ updates during iterative rehearsal. The total log units processed by the LLM are approximately $|\mathcal{U}||\mathcal{E}^{\mathrm{cov}}| + R|\mathcal{U}|$. By comparison, traditional agent methods require incremental full-sequence replay, leading to per-learner cost proportional to $T_{\text{avg}}(T_{\text{avg}}+1)/2$ and overall population cost $|\mathcal{U}| \cdot T_{\text{avg}}(T_{\text{avg}}+1)/2$, where $T_{\text{avg}}$ is the average number of tasks per learner. In our experiments, $R \approx 15 \sim 20$ and $K = 15$, both small relative to typical sequence lengths. By only accessing a compact, informative subset of logs instead of replaying full histories, Edu-Theater significantly reduces both LLM calls and token usage, yielding a much lighter computational footprint. More details are provided in the Appendix.
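To make the cost expressions concrete, the sketch below evaluates them for one configuration. $K = 15$, $R = 15$, and 500 learners follow the reported DBE-KT22 setting. The probe-set size of 40 and the average history length of 100 are hypothetical placeholders chosen only to show the arithmetic, not measured values.

```python
def llm_invocations(num_cohorts: int, budget: int) -> int:
    """K initial cohort-memory calls plus K * R updates: K(R + 1)."""
    return num_cohorts * (budget + 1)


def edu_theater_log_units(num_learners: int, probe_size: int, budget: int) -> int:
    """Upper bound |U| * |E_cov| + R * |U| on log units read by the LLM."""
    return num_learners * probe_size + budget * num_learners


def replay_log_units(num_learners: int, avg_tasks: int) -> int:
    """Incremental full-sequence replay: |U| * T_avg * (T_avg + 1) / 2."""
    return num_learners * avg_tasks * (avg_tasks + 1) // 2


print(llm_invocations(15, 15))             # 240
print(edu_theater_log_units(500, 40, 15))  # 27500 (probe size is hypothetical)
print(replay_log_units(500, 100))          # 2525000 (T_avg is hypothetical)
```

The replay cost grows quadratically in the history length, while the roll-call cost grows linearly in the probe-set size and the budget. The gap therefore widens as learner histories get longer.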
## 4 Experiments
Edu-Theater is designed to enable reliable learner response simulation under *limited retrospective interaction traces* with substantially reduced LLM-side cost. We evaluate Edu-Theater through the following research questions:
- RQ1: Under varying observation availability, how faithfully can Edu-Theater reproduce real learner response patterns?
- RQ2: How much teacher-side cost can Edu-Theater reduce compared with individual-centric LLM simulators and classical supervised simulators?
- RQ3: Which staged components (Casting, Act I coverage diagnosis, Act II discrepancy-driven probing) are necessary for stability and accuracy?
- RQ4: Can the simulated responses generated by Edu-Theater improve downstream educational applications such as adaptive testing?
### 4.1 Basic Setup
##### Dataset
We evaluate learner simulation on two task-solving datasets: DBE-KT22 \[[1](https://arxiv.org/html/2606.15225#bib.bib3)\] and EduHS. DBE-KT22 consists of online exercise interactions collected from undergraduate students enrolled in a Relational Databases course at the Australian National University (2018–2021) via the CodeBench platform. EduHS is collected from a private online practice system and contains interaction logs from high-school learners on mathematics and physics exercises. Both datasets consist of multiple-choice exercise interactions, where each task corresponds to a specific exercise and each response contains the learner’s selected option and a binary correctness label. Each record is associated with textual descriptions of the exercise and its annotated knowledge concepts. For both datasets, we randomly sample 500 learners, each with at least 20 interactions, for experiments. EduHS will be released upon acceptance. Table [1](https://arxiv.org/html/2606.15225#S4.T1) summarizes dataset statistics.
Table 1: Statistics of the two datasets.
##### Baselines\.
We select two categories of simulators as baselines\.
*(i) Supervised simulators.* We include KES \[[15](https://arxiv.org/html/2606.15225#bib.bib4)\], DKVMN \[[39](https://arxiv.org/html/2606.15225#bib.bib5)\], EERNN (with Markov) \[[27](https://arxiv.org/html/2606.15225#bib.bib7)\], SAKT \[[20](https://arxiv.org/html/2606.15225#bib.bib6)\], and DAISIM \[[42](https://arxiv.org/html/2606.15225#bib.bib8)\] as supervised simulation baselines. These methods do not rely on LLMs and model learner task-solving outcomes only at the correctness level, formulating simulation as a binary classification problem (predictions above 0.5 are treated as correct). Among them, EERNN incorporates textual exercise content, whereas the others mainly operate on historical correctness traces.
*(ii) LLM-based generative simulators.* We include EduAgent \[[35](https://arxiv.org/html/2606.15225#bib.bib9)\] and Agent4Edu \[[11](https://arxiv.org/html/2606.15225#bib.bib85)\] with multiple LLM implementations as generative baselines. Because expert-annotated cognitive factors are unavailable, we follow prior work by training NeuralCD \[[29](https://arxiv.org/html/2606.15225#bib.bib10)\] and using its concept-wise mastery estimates as auxiliary cognitive factors in the prompt. We also include a *history-conditioned LLM* baseline that directly predicts a learner’s response from a fixed number of retrieved historical records, without cohort memory or auxiliary diagnostic modeling.
Table 2: Performance on DBE-KT22, with full results on the two datasets listed in the Appendix. Higher is better ($\uparrow$). “–” indicates that the method cannot predict answers. Best results are in bold. Within each backbone LLM, the best-performing agent is highlighted with a light-blue background.
##### Implementation
We implement the teacher agent using both API\-based and open\-source LLMs\. For API\-based models, we use the OpenAI service \([https://platform\.openai\.com/docs/models](https://platform.openai.com/docs/models)\) and adopt GPT\-3\.5\-turbo\-1106 \(denoted as GPT\-3\.5\), GPT\-4o\-2024\-11\-20 \(denoted as GPT\-4o\), and GPT\-4\.1\-mini\. For open\-source models, we use Llama2\-7B and Llama3\-8B \[[28](https://arxiv.org/html/2606.15225#bib.bib1)\]\. The temperature of all LLMs is fixed at 0 to ensure deterministic generation\. For Edu\-Theater, we conducted simulations with various parameter settings \(see Section [4\.4](https://arxiv.org/html/2606.15225#S4.SS4.SSS0.Px1) for details\)\. Specifically, we report the results for $K=15$, $p=10$, and $R=15$ on the DBE\-KT22 dataset, and $K=15$, $p=15$, and $R=20$ on the EduHS dataset\. To train NeuralCD, we adopt the original parameters from its corresponding paper, with the learning rate $\eta$ set to 0\.002\. We report the mean over five runs on a Linux server equipped with 4$\times$ NVIDIA A100 80GB GPUs\.
##### Evaluation Metrics for Simulation Effectiveness\.
We evaluate simulation effectiveness from two aspects: \(1\) whether the simulator selects the same answer option as the real learner, measured by accuracy \(ACC\); and \(2\) whether the predicted task\-solving correctness matches the ground truth, measured by ACC and F1\-score\. Following \[[42](https://arxiv.org/html/2606.15225#bib.bib8)\], we further adopt ROUGE\-3 to assess the similarity between the simulated and the real correctness distributions\.
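As a concrete reading of the answer\-level and correctness\-level metrics, the following minimal sketch computes ACC and binary F1 with plain Python\. The function names and the toy data are hypothetical and not taken from the paper, and ROUGE\-3 is left out because its exact formulation over correctness traces is not specified here\.

```python
from typing import Sequence


def accuracy(pred: Sequence, truth: Sequence) -> float:
    """Fraction of positions where the prediction equals the ground truth."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def f1_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Binary F1 with 'correct' (label 1) as the positive class."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: selected options and correctness labels for five tasks.
real_options = ["A", "C", "B", "D", "A"]
sim_options = ["A", "C", "D", "D", "B"]
real_correct = [1, 0, 1, 1, 0]
sim_correct = [1, 0, 0, 1, 1]

answer_acc = accuracy(sim_options, real_options)    # answer-level ACC
correct_acc = accuracy(sim_correct, real_correct)   # correctness-level ACC
correct_f1 = f1_score(sim_correct, real_correct)    # correctness-level F1
```

Answer\-level ACC compares option labels directly, so a simulator that only outputs correctness \(as the supervised baselines do\) cannot be scored on it\.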
Table 3:Correctness simulation accuracy on the DBE\-KT22 dataset under cold\-start scenarios\.
### 4\.2 Simulation Performance \(RQ1\)
The goal of Edu\-Theater is to generate simulated learner responses that resemble real responses under limited retrospective interaction records\. We therefore evaluate response simulation performance in two settings: a data\-available scenario and a cold\-start scenario\.
##### Data\-Available Scenario\.
In this setting, each learner’s interaction log is split into 90% observed records \(for optimization\) and 10% held\-out future tasks \(for test\)\. Supervised baseline models are trained on the observed records, with the last 20% of each learner’s observed records used for validation\. Edu\-Theater is allowed to retrieve retrospective interaction records during rehearsal \(Stage II\) and then generate responses for held\-out tasks during final performance \(Stage III\)\. The learner’s task\-solving response simulation performance on DBE\-KT22 is summarized in Table [2](https://arxiv.org/html/2606.15225#S4.T2), with the full results on DBE\-KT22 and EduHS listed in the Appendix\.
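The per\-learner split can be sketched as follows\. This is an illustration only: it assumes each log is already in chronological order \(so the last 10% are the "future" tasks\) and that fractional sizes are truncated, neither of which is stated explicitly above\. The function name `split_learner_log` is hypothetical\.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def split_learner_log(
    log: List[T], observed_ratio: float = 0.9, val_ratio: float = 0.2
) -> Tuple[List[T], List[T], List[T]]:
    """Split one learner's chronologically ordered log into train/val/test.

    The first `observed_ratio` of records are observed; the rest are held out.
    The last `val_ratio` of the observed records serve as validation data.
    """
    n_observed = int(len(log) * observed_ratio)
    observed, test = log[:n_observed], log[n_observed:]
    n_val = int(len(observed) * val_ratio)
    train = observed[: len(observed) - n_val]
    val = observed[len(observed) - n_val:]
    return train, val, test


# Hypothetical learner with 20 interactions (the minimum used in the experiments).
log = list(range(20))
train, val, test = split_learner_log(log)
```

Under this reading, supervised baselines fit on `train` and tune on `val`, while the observed portion \(`train` plus `val`\) is what a retrieval\-based simulator may query during rehearsal\.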
The results indicate that: \(1\) LLM\-based simulators consistently outperform supervised baselines and native LLMs, demonstrating a stronger ability to simulate learner task\-solving behavior\. \(2\) Among supervised models, EERNN performs best due to its use of exercise text\. \(3\) GPT\-based teachers outperform Llama\-based teachers, suggesting stronger capacity in modeling learner–task interaction patterns\. \(4\) All models achieve higher performance on DBE\-KT22 than on EduHS, as DBE\-KT22 contains denser learner interaction logs\. \(5\) The answer\-level simulation accuracy is consistently lower than correctness\-level performance, since answer prediction involves a larger output space and thus poses a more challenging learning problem\.
##### Cold\-Start Scenario\.
In the cold\-start setting, we construct five levels of data sparsity by randomly selecting 0%, 20%, 30%, 50%, and 100% of each learner’s observed records\. Supervised models are trained on the available subset, while Edu\-Theater initializes cohort construction and rehearsal using only the same limited records\.
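One way to construct these sparsity levels is sketched below\. The helper `subsample_observed`, the rounding rule, and the fixed seed are assumptions made for illustration; only the ratios themselves come from the setup above\.

```python
import random
from typing import Dict, List, TypeVar

T = TypeVar("T")


def subsample_observed(
    observed: Dict[str, List[T]], keep_ratio: float, seed: int = 0
) -> Dict[str, List[T]]:
    """Randomly keep `keep_ratio` of each learner's observed records.

    Kept records preserve their original order within each learner.
    """
    rng = random.Random(seed)
    sparse: Dict[str, List[T]] = {}
    for learner_id, records in observed.items():
        n_keep = round(len(records) * keep_ratio)
        kept_idx = sorted(rng.sample(range(len(records)), n_keep))
        sparse[learner_id] = [records[i] for i in kept_idx]
    return sparse


# Hypothetical observed logs for two learners.
observed = {"u1": list(range(10)), "u2": list(range(20))}
levels = {r: subsample_observed(observed, r) for r in (0.0, 0.2, 0.3, 0.5, 1.0)}
```

At the 0% level every learner has an empty history, which is the case where a purely supervised model has nothing to train on\.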
Results on DBE\-KT22 are reported in Table [3](https://arxiv.org/html/2606.15225#S4.T3), and similar trends are observed on the EduHS dataset\. We observe that: \(1\) When no historical records are available \(0%\), supervised models cannot operate, while Edu\-Theater remains functional\. \(2\) As observation becomes sparser, supervised models degrade sharply, whereas Edu\-Theater maintains stable performance\. \(3\) Edu\-Theater achieves the highest robustness under cold\-start conditions, benefiting from cohort\-level priors and population\-aware inference\.
### 4\.3 LLM\-based Learner Simulation Cost \(RQ2\)
LLM\-based simulators generally incur higher computational costs than supervised models due to the repeated invocation of large language models during simulation\. To demonstrate that Edu\-Theater reduces such costs compared with existing agents, we analyze both data usage and API computational expense\.
##### Data Cost
A key advantage of Edu\-Theater lies in its data efficiency during the casting and rehearsal stages\. Instead of requiring full historical interaction logs to initialize learner states, Edu\-Theater constructs cohort\-level priors using only a subset of retrospective interaction records, and refines individual learners through a small number of roll\-call queries\. Specifically, Edu\-Theater initializes using only 63\.4% of the DBE\-KT22 data and 69\.5% of the EduHS data, while baseline LLM\-based simulators rely on the full training dataset\. This property makes Edu\-Theater particularly suitable for cold\-start settings and scenarios where learner histories are incomplete or sparsely observed\.
Table 4: Time and monetary costs of GPT\-based LLM simulators\. $\downarrow$ indicates lower is better\.
##### Time & Monetary Cost
Table [4](https://arxiv.org/html/2606.15225#S4.T4) reports the total wall\-clock time and monetary costs for representative LLM\-based simulators to complete both the rehearsal phase \(retrospective roll\-call diagnosis\) and the final performance phase \(teacher\-centered response generation\), under identical experimental settings\. All compared agents follow the same simulation pipeline and API usage protocol\. The reported costs include all API calls made during a full simulation run\. As shown, Edu\-Theater consistently requires less computational time and lower monetary expense than existing LLM\-based methods, demonstrating its efficiency in both learner state estimation and scalable response generation\. A similar trend is observed on EduHS\.
### 4\.4 Parameter and Module Analysis \(RQ3\)
##### Cohort Number \($K$\) Analysis
We analyze how the number of cohorts affects the simulation performance of Edu\-Theater\. Figure [3](https://arxiv.org/html/2606.15225#S4.F3)\(a\) shows the accuracy of the GPT\-3\.5\-turbo\-based Edu\-Theater with $K$ set to 1, 5, 10, 15, 20, 25 and 35\. As $K$ increases, performance improves and stabilizes after $K=15$\. Finer cohort construction leads to more homogeneous learning patterns within each cohort, which provides more informative population\-level priors for roll\-call simulation\. When $K=1$, all learners are grouped into a single cohort, resulting in degraded performance due to the high diversity of learner behaviors\.
##### Whole\-class Diagnosis Density Analysis \($p$\)
We analyze how the number of sampled tasks per knowledge concept at the Whole\-class Diagnosis stage affects simulation performance\. Recall that in Act I of rehearsal, for each concept $c$ we sample $p$ tasks satisfying $c \in \mathcal{K}(e)$ to construct the probe set $\mathcal{E}^{\mathrm{cov}}$\. We vary $p \in \{1, 3, 5, 10, 15, 20\}$ in the experiments\. Figure [3](https://arxiv.org/html/2606.15225#S4.F3)\(b\) shows a clear performance gain as $p$ increases from 1 to 5 on EduHS and from 1 to 10 on DBE\-KT22, indicating that denser concept coverage allows the teacher to form more accurate cohort\-level proficiency sketches\. However, the improvement becomes marginal once $p$ exceeds 10, and performance almost saturates for $p$ between 10 and 20\. This suggests that a small number of representative tasks per concept is already sufficient for effective cohort calibration, and additional retrospective queries mainly introduce redundant information\. Overall, the analysis demonstrates that Edu\-Theater can achieve strong simulation performance with relatively low rehearsal cost, highlighting the efficiency of its coverage\-based roll\-call design\.
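The coverage probe set recalled above can be illustrated with a short sketch\. It assumes uniform random sampling per concept and caps the sample when a concept has fewer than $p$ tasks; both choices, along with the name `build_coverage_probe_set` and the toy concept labels, are hypothetical rather than the paper's stated procedure\.

```python
import random
from typing import Dict, List, Set


def build_coverage_probe_set(
    task_concepts: Dict[str, Set[str]], p: int, seed: int = 0
) -> Dict[str, List[str]]:
    """For each concept c, sample up to p tasks e whose concept set K(e) contains c."""
    rng = random.Random(seed)
    by_concept: Dict[str, List[str]] = {}
    for task_id, concepts in task_concepts.items():
        for c in concepts:
            by_concept.setdefault(c, []).append(task_id)
    probes: Dict[str, List[str]] = {}
    for c in sorted(by_concept):
        candidates = sorted(by_concept[c])  # sorted for reproducible sampling
        probes[c] = rng.sample(candidates, min(p, len(candidates)))
    return probes


# Hypothetical task-to-concept annotation K(e).
task_concepts = {
    "e1": {"joins"},
    "e2": {"joins", "normalization"},
    "e3": {"normalization"},
    "e4": {"joins"},
    "e5": {"indexing"},
}
probes = build_coverage_probe_set(task_concepts, p=2)
# The probe set is the union of the per-concept samples.
probe_set = sorted({e for tasks in probes.values() for e in tasks})
```

Because a task may carry several concepts, the union can be smaller than $p$ times the number of concepts, which is one reason the probing cost grows sublinearly in $p$ under this sketch\.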
##### Individualized Informative Probing Budget \($R$\)
We investigate how the probing budget $R$ of individualized roll\-call affects simulation performance\. Recall that in Act II of rehearsal, the teacher performs at most $R$ retrospective queries for each learner\. We vary $R \in \{1, 5, 10, 15, 20, 25, 30\}$ for GPT\-4\.1\-mini\-based Edu\-Theater\. As shown in Figure [4](https://arxiv.org/html/2606.15225#S4.F4)\(a\), increasing $R$ consistently improves performance on EduHS, with gains gradually stabilizing when $R \geq 20$ on DBE\-KT22 and $R \geq 15$ on EduHS\. Longer roll\-call sequences provide richer retrospective interaction evidence, leading to more accurate learner state estimation\.

Figure 3: \(a\) Performance under different cohort numbers $K$\. \(b\) The impact of whole\-class diagnosis density \($p$\)\.
Figure 4: \(a\) The impact of budget \($R$\)\. \(b\) Ablation study\.
##### Ablation Study
We conduct an ablation study by removing key stages of the Edu\-Theater roll\-call framework to examine their individual contributions\. Specifically, we consider the following variants: \(i\) *w/o Casting*, where cohort construction is removed and all learners share a single teacher memory; \(ii\) *w/o Act I*, where cohort\-level retrospective diagnosis is skipped and the teacher directly performs individualized probing; \(iii\) *w/o Act II*, where individualized informative probing is disabled and learner states rely solely on cohort\-level summaries; and \(iv\) *w/o inter\-cohort disc\.*, where teacher agents do not discuss and exchange cohort memories during rehearsal\. As shown in Figure [4](https://arxiv.org/html/2606.15225#S4.F4)\(b\), removing any component leads to a clear degradation in simulation performance\. Eliminating Casting or Act II causes the most significant drops, emphasizing the importance of learner\-specific refinement beyond coarse cohort estimates\. Removing Act I also results in a substantial performance loss, highlighting the importance of population\-level priors and cohort calibration for a reliable diagnostic context\. Disabling inter\-cohort discussion only slightly decreases performance, suggesting it provides additional constraints for boundary learners, but with less impact than other components\. Overall, the ablation results confirm that Edu\-Theater’s effectiveness stems from the integration of cohort construction, cohort\-level rehearsal, and individualized roll\-call refinement, rather than any single component\.
### 4\.5 Education Algorithm Improvement \(RQ4\)
We investigate whether the simulated data generated by Edu\-Theater can improve downstream intelligent education algorithms\. We choose Computerized Adaptive Testing \(CAT\) as the evaluation task, as it is a representative and widely used paradigm in personalized assessment\. CAT relies on a cognitive diagnostic model \(e\.g\., IRT \[[3](https://arxiv.org/html/2606.15225#bib.bib38)\]\) to estimate learner abilities and designs an item selection strategy to adaptively recommend informative questions\. The collected responses are then used to refine the diagnostic model and predict future learner performance\. If the simulated data generated by Edu\-Theater can enhance the prediction accuracy of the diagnostic model, this would indicate that the simulated responses provide useful and realistic learning signals\.
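To make the CAT loop concrete, the sketch below shows one textbook form of adaptive item selection: a two\-parameter logistic \(2PL\) IRT model combined with maximum Fisher information selection\. The 2PL form, the item parameters, and all function names are assumptions for illustration; the IRT variant and selection details used in the experiments are not specified in this section\.

```python
import math
from typing import Dict, Set, Tuple


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a learner with ability theta answers correctly.

    a is the item discrimination and b is the item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(
    theta: float, items: Dict[str, Tuple[float, float]], administered: Set[str]
) -> str:
    """Pick the not-yet-administered item that is most informative at theta."""
    candidates = {i: ab for i, ab in items.items() if i not in administered}
    return max(candidates, key=lambda i: fisher_information(theta, *candidates[i]))


# Hypothetical item bank: item id -> (discrimination a, difficulty b).
items = {"q1": (1.0, -1.5), "q2": (1.2, 0.1), "q3": (0.8, 2.0)}
first = select_next_item(0.0, items, set())
second = select_next_item(0.0, items, {first})
```

Items whose difficulty is close to the current ability estimate carry the most information, which is why an item near $b \approx 0$ is selected first for a learner with $\theta = 0$ in this toy bank\.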
Table 5: The improvement of CAT services on DBE\-KT22\.

We use 60% of learner interaction data from DBE\-KT22 to train the IRT model, and reserve the remaining 40% for evaluation\. For each learner in the test set, Edu\-Theater simulates responses to 20 randomly selected unseen exercises based on the calibrated teacher and learner evidence\. The generated responses are combined with the original training set to construct an augmented dataset, denoted as DBE\-KT22\+\. We then retrain the IRT model using both DBE\-KT22 and DBE\-KT22\+ under three standard CAT strategies, including FSI \[[17](https://arxiv.org/html/2606.15225#bib.bib47)\], KLI \[[6](https://arxiv.org/html/2606.15225#bib.bib48)\], and MAAT \[[5](https://arxiv.org/html/2606.15225#bib.bib37)\]\. Each strategy recommends either 5 or 10 test items per learner\. Table [5](https://arxiv.org/html/2606.15225#S4.T5) reports the IRT prediction performance, where F1\-score denotes performance on the original dataset, and F1\-score\+ denotes performance after augmentation\. The results show consistent improvements across all strategies\. This suggests that Edu\-Theater can generate high\-quality learner response data for downstream applications\.
## 5 Related Work
### 5\.1 Learner Simulation
Learner simulation addresses the scarcity of high\-quality interaction data in educational systems \[[42](https://arxiv.org/html/2606.15225#bib.bib8),[36](https://arxiv.org/html/2606.15225#bib.bib77)\]\. Early methods primarily trained classifiers on historical data to predict future responses \[[15](https://arxiv.org/html/2606.15225#bib.bib4),[42](https://arxiv.org/html/2606.15225#bib.bib8)\], employing memory\-based architectures \[[26](https://arxiv.org/html/2606.15225#bib.bib49)\] or RNN variants such as EERNN \[[27](https://arxiv.org/html/2606.15225#bib.bib7)\] and KES \[[15](https://arxiv.org/html/2606.15225#bib.bib4)\]\. While DAISim \[[42](https://arxiv.org/html/2606.15225#bib.bib8)\] utilizes Markov decision processes to capture multi\-scale patterns, such traditional methods often struggle with interpretability and cold\-start challenges\. Recently, LLM\-based generative agents have emerged as a robust alternative\. Specifically, GenStu \[[18](https://arxiv.org/html/2606.15225#bib.bib87)\] generates exercise responses, EduAgent \[[35](https://arxiv.org/html/2606.15225#bib.bib9)\] simulates instructional learning, and Agent4Edu \[[11](https://arxiv.org/html/2606.15225#bib.bib85)\] incorporates forgetting dynamics across interaction turns\. Furthermore, CoderAgent \[[37](https://arxiv.org/html/2606.15225#bib.bib88)\] extends these simulations to coding tasks\. Despite their potential, current LLM agents for individual learner simulation entail high computational costs and rely heavily on the model’s inherent generalization capabilities\. In this work, we introduce Edu\-Theater, a novel framework that reduces computational costs and maintains effective learner simulation by leveraging cohort\-level priors and retrospective roll\-call probing, making it well suited to data\-scarce scenarios\.
### 5\.2 LLM\-based Agents
LLM\-based generative agents have demonstrated exceptional capabilities in perception, decision\-making, and execution, sparking extensive cross\-domain research \[[31](https://arxiv.org/html/2606.15225#bib.bib50)\]\. A foundational architecture proposed by Park et al\. \[[21](https://arxiv.org/html/2606.15225#bib.bib31)\], which integrates profiles, memory, action, and reflection, has enabled the simulation of complex, temporally extended human behaviors\. Building on this, subsequent research has specialized agents for communication, tool usage, recommendation, and interactive decision\-making \[[24](https://arxiv.org/html/2606.15225#bib.bib32),[34](https://arxiv.org/html/2606.15225#bib.bib51),[30](https://arxiv.org/html/2606.15225#bib.bib52),[12](https://arxiv.org/html/2606.15225#bib.bib53),[40](https://arxiv.org/html/2606.15225#bib.bib54),[38](https://arxiv.org/html/2606.15225#bib.bib43)\], alongside large\-scale multi\-agent simulations \[[9](https://arxiv.org/html/2606.15225#bib.bib55),[33](https://arxiv.org/html/2606.15225#bib.bib56),[16](https://arxiv.org/html/2606.15225#bib.bib57),[32](https://arxiv.org/html/2606.15225#bib.bib36)\]\. In educational contexts, LLM\-based agents have been increasingly leveraged to support teaching and classroom interactions \[[14](https://arxiv.org/html/2606.15225#bib.bib65),[8](https://arxiv.org/html/2606.15225#bib.bib66),[13](https://arxiv.org/html/2606.15225#bib.bib67),[2](https://arxiv.org/html/2606.15225#bib.bib70)\]\. Beyond evaluating ChatGPT’s utility in specialized fields like engineering \[[23](https://arxiv.org/html/2606.15225#bib.bib68),[25](https://arxiv.org/html/2606.15225#bib.bib69)\], recent work emphasizes pedagogical alignment\. For instance, EduAgent \[[35](https://arxiv.org/html/2606.15225#bib.bib9)\] and Agent4Edu \[[11](https://arxiv.org/html/2606.15225#bib.bib85)\] instantiate learner simulators by encoding personalized profiles and workflows, while SimClass \[[41](https://arxiv.org/html/2606.15225#bib.bib81)\] extends this to classroom\-level activities through role\-based agent modeling\.
## 6 Conclusion
In this work, we proposed a novel approach to learner simulation through Edu\-Theater, a data\-efficient, cohort\-aware simulation framework powered by large language models \(LLMs\)\. By leveraging population\-level priors and targeted roll\-call\-style diagnostic probing, Edu\-Theater offers a significant departure from traditional one\-to\-one tutoring paradigms, making learner behavior simulation more scalable and efficient\. Edu\-Theater demonstrated the feasibility of achieving personalized and accurate learner simulations with minimal data, even under sparse observation conditions\. Through empirical evaluation, we showed that our approach outperforms existing simulators in both cost\-effectiveness and prediction accuracy\. By reframing learner simulation from individual\-centric models to a cohort\-aware, roll\-call protocol, we paved the way for more efficient, cost\-effective, and scalable simulations that can support the next generation of intelligent educational systems\. Future work could explore further optimization of cohort clustering strategies and the incorporation of additional learner attributes for even finer\-grained personalization\.
## References
- \[1\] \(2022\) Dbe\-kt22: a knowledge tracing dataset based on online student evaluation\. arXiv preprint arXiv:2208\.12651\.
- \[2\] D\. Baidoo\-Anu and L\. O\. Ansah \(2023\) Education in the era of generative artificial intelligence \(ai\): understanding the potential benefits of chatgpt in promoting teaching and learning\. Journal of AI 7\(1\), pp\. 52–62\.
- \[3\] F\. B\. Baker \(2001\) The basics of item response theory\. ERIC\.
- \[4\] H\. Bi, E\. Chen, W\. He, H\. Wu, W\. Zhao, S\. Wang, and J\. Wu \(2023\) BETA\-cd: a bayesian meta\-learned cognitive diagnosis framework for personalized learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 37, pp\. 5018–5026\.
- \[5\] H\. Bi, H\. Ma, Z\. Huang, Y\. Yin, Q\. Liu, E\. Chen, Y\. Su, and S\. Wang \(2020\) Quality meets diversity: a model\-agnostic framework for computerized adaptive testing\. In 2020 IEEE International Conference on Data Mining \(ICDM\), pp\. 42–51\.
- \[6\] H\. Chang and Z\. Ying \(1996\) A global information approach to computerized adaptive testing\. Applied Psychological Measurement 20\(3\), pp\. 213–229\.
- \[7\] L\. Dai, Y\. Jiang, Y\. Chen, Z\. Guo, T\. Liu, and X\. Shao \(2024\) Agent4EDU: advancing ai for education with agentic workflows\. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence and Education, pp\. 180–185\.
- \[8\] Y\. Dan, Z\. Lei, Y\. Gu, Y\. Li, J\. Yin, J\. Lin, L\. Ye, Z\. Tie, Y\. Zhou, Y\. Wang, et al\. \(2023\) Educhat: a large\-scale language model\-based chatbot system for intelligent education\. arXiv preprint arXiv:2308\.02773\.
- \[9\] C\. Gao, X\. Lan, Z\. Lu, J\. Mao, J\. Piao, H\. Wang, D\. Jin, and Y\. Li \(2023\) S3: social\-network simulation system with large language model\-empowered agents\. arXiv preprint arXiv:2307\.14984\.
- \[10\] W\. Gao, Q\. Liu, Z\. Huang, Y\. Yin, H\. Bi, M\. Wang, J\. Ma, S\. Wang, and Y\. Su \(2021\) RCD: relation map driven cognitive diagnosis for intelligent education systems\. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pp\. 501–510\.
- \[11\] W\. Gao, Q\. Liu, L\. Yue, F\. Yao, R\. Lv, Z\. Zhang, H\. Wang, and Z\. Huang \(2025\) Agent4edu: generating learner response data by generative agents for intelligent education systems\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 39, pp\. 23923–23932\.
- \[12\] X\. Huang, J\. Lian, Y\. Lei, J\. Yao, D\. Lian, and X\. Xie \(2023\) Recommender ai agent: integrating large language models for interactive recommendations\. arXiv preprint arXiv:2308\.16505\.
- \[13\] F\. Kieser, P\. Wulff, J\. Kuhn, and S\. Küchemann \(2023\) Educational data augmentation in physics education research using chatgpt\. Physical Review Physics Education Research 19\(2\), pp\. 020150\.
- \[14\] H\. Li, T\. Xu, C\. Zhang, E\. Chen, J\. Liang, X\. Fan, H\. Li, J\. Tang, and Q\. Wen \(2024\) Bringing generative ai to adaptive learning in education\. arXiv preprint arXiv:2402\.14601\.
- \[15\] Q\. Liu, S\. Tong, C\. Liu, H\. Zhao, E\. Chen, H\. Ma, and S\. Wang \(2019\) Exploiting cognitive structure for adaptive learning\. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp\. 627–635\.
- \[16\] R\. Liu, R\. Yang, C\. Jia, G\. Zhang, D\. Zhou, A\. M\. Dai, D\. Yang, and S\. Vosoughi \(2023\) Training socially aligned language models in simulated human society\. arXiv preprint arXiv:2305\.16960\.
- \[17\] F\. M\. Lord \(2012\) Applications of item response theory to practical testing problems\. Routledge\.
- \[18\] X\. Lu and X\. Wang \(2024\) Generative students: using llm\-simulated student profiles to support question item evaluation\. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, pp\. 16–27\.
- \[19\] M\. Mladenov, C\. Hsu, V\. Jain, E\. Ie, C\. Colby, N\. Mayoraz, H\. Pham, D\. Tran, I\. Vendrov, and C\. Boutilier \(2021\) Recsim ng: toward principled uncertainty modeling for recommender ecosystems\. arXiv preprint arXiv:2103\.08057\.
- \[20\] S\. Pandey and G\. Karypis \(2019\) A self\-attentive model for knowledge tracing\. In 12th International Conference on Educational Data Mining, EDM 2019, pp\. 384–389\.
- \[21\] J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp\. 1–22\.
- \[22\] C\. Piech, J\. Bassen, J\. Huang, S\. Ganguli, M\. Sahami, L\. J\. Guibas, and J\. Sohl\-Dickstein \(2015\) Deep knowledge tracing\. Advances in neural information processing systems 28\.
- \[23\] J\. Qadir \(2023\) Engineering education in the era of chatgpt: promise and pitfalls of generative ai for education\. In 2023 IEEE Global Engineering Education Conference \(EDUCON\), pp\. 1–9\.
- \[24\] C\. Qian, X\. Cong, C\. Yang, W\. Chen, Y\. Su, J\. Xu, Z\. Liu, and M\. Sun \(2023\) Communicative agents for software development\. arXiv preprint arXiv:2307\.07924\.
- \[25\] M\. M\. Rahman and Y\. Watanobe \(2023\) ChatGPT for education and research: opportunities, threats, and strategies\. Applied Sciences 13\(9\), pp\. 5783\.
- \[26\] S\. Reddy, S\. Levine, and A\. Dragan \(2017\) Accelerating human learning with deep reinforcement learning\. In NIPS workshop: teaching machines, robots, and humans\.
- \[27\] Y\. Su, Q\. Liu, Q\. Liu, Z\. Huang, Y\. Yin, E\. Chen, C\. Ding, S\. Wei, and G\. Hu \(2018\) Exercise\-enhanced sequential modeling for student performance prediction\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 32\.
- \[28\] H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, et al\. \(2023\) Llama 2: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2307\.09288\.
- \[29\] F\. Wang, Q\. Liu, E\. Chen, Z\. Huang, Y\. Chen, Y\. Yin, Z\. Huang, and S\. Wang \(2020\) Neural cognitive diagnosis for intelligent education systems\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 34, pp\. 6153–6161\.
- \[30\] G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\) Voyager: an open\-ended embodied agent with large language models\. arXiv preprint arXiv:2305\.16291\.
- \[31\] L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, et al\. \(2024\) A survey on large language model based autonomous agents\. Frontiers of Computer Science 18\(6\), pp\. 1–26\.
- \[32\] L\. Wang, J\. Zhang, X\. Chen, Y\. Lin, R\. Song, W\. X\. Zhao, and J\. Wen \(2023\) Recagent: a novel simulation paradigm for recommender systems\. arXiv preprint arXiv:2306\.02552\.
- \[33\] L\. Wang, J\. Zhang, H\. Yang, Z\. Chen, J\. Tang, Z\. Zhang, X\. Chen, Y\. Lin, R\. Song, W\. X\. Zhao, et al\. \(2023\) User behavior simulation with large language model based agents\. arXiv preprint arXiv:2306\.02552\.
- \[34\] Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, S\. Zhang, E\. Zhu, B\. Li, L\. Jiang, X\. Zhang, and C\. Wang \(2023\) Autogen: enabling next\-gen llm applications via multi\-agent conversation framework\. arXiv preprint arXiv:2308\.08155\.
- \[35\] S\. Xu, X\. Zhang, and L\. Qin \(2024\) EduAgent: generative student agents in learning\. arXiv preprint arXiv:2404\.07963\.
- \[36\] F\. Yao, Q\. Liu, L\. Yue, W\. Gao, J\. Li, X\. Li, and Y\. He \(2024\) Adard: an adaptive response denoising framework for robust learner modeling\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 3886–3895\.
- \[37\] Y\. Zhan, Q\. Liu, W\. Gao, Z\. Zhang, T\. Wang, S\. Shen, J\. Lu, and Z\. Huang \(2025\) CoderAgent: simulating student behavior for personalized programming learning with large language models\. arXiv preprint arXiv:2505\.20642\.
- \[38\] A\. Zhang, Y\. Chen, L\. Sheng, X\. Wang, and T\. Chua \(2024\) On generative agents in recommendation\. In Proceedings of the 47th international ACM SIGIR conference on research and development in Information Retrieval, pp\. 1807–1817\.
- \[39\] J\. Zhang, X\. Shi, I\. King, and D\. Yeung \(2017\) Dynamic key\-value memory networks for knowledge tracing\. In Proceedings of the 26th international conference on World Wide Web, pp\. 765–774\.
- \[40\] J\. Zhang, Y\. Hou, R\. Xie, W\. Sun, J\. McAuley, W\. X\. Zhao, L\. Lin, and J\. Wen \(2023\) Agentcf: collaborative learning with autonomous language agents for recommender systems\. arXiv preprint arXiv:2310\.09233\.
- \[41\] Z\. Zhang, D\. Zhang\-Li, J\. Yu, L\. Gong, J\. Zhou, Z\. Liu, L\. Hou, and J\. Li \(2024\) Simulating classroom education with llm\-empowered agents\. arXiv preprint arXiv:2406\.19226\.
- \[42\] G\. Zhao, Z\. Huang, Y\. Zhuang, J\. Liu, Q\. Liu, Z\. Liu, J\. Wu, and E\. Chen \(2023\) Simulating student interactions with two\-stage imitation learning for intelligent educational systems\. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp\. 3423–3432\.