Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild
Summary
This paper proposes a system that combines a prerequisite knowledge graph with a PPO-based policy to structure Socratic tutoring with LLMs, showing improved student mastery and efficiency over heuristic and frontier model baselines.
# Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild
Source: [https://arxiv.org/html/2606.11744](https://arxiv.org/html/2606.11744)
Sidney Tio, School of Computing and Information Systems, Singapore Management University, sidney.tio.2021@phdcs.smu.edu.sg

Arunesh Sinha, Department of Management Science and Information Systems, Rutgers Business School, arunesh.sinha@rutgers.edu

Pradeep Varakantham, School of Computing and Information Systems, Singapore Management University, pradeepv@smu.edu.sg
###### Abstract
Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum\. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself\. We show that this gap is not closed by scaling models alone\. Frontier and education\-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once\. The tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student’s knowledge state from that dialogue\. We propose separating these responsibilities\. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on\. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress\. Across held\-out STEM and non\-STEM topics, our PPO\-paired tutor outperforms heuristic baselines, frontier general\-purpose models, and a model specialised for Socratic dialogue — on both the rate at which students reach full curriculum mastery and the number of turns required\. Explicit curriculum structure delivers gains that scaling the underlying model does not\.
## 1 Introduction
Large Language Model (LLM) assistants such as Claude and ChatGPT are increasingly the medium through which people learn outside the classroom. Learning and knowledge-seeking together account for an estimated 30 to 40% of consumer ChatGPT conversations (Chatterji et al., [2025](https://arxiv.org/html/2606.11744#bib.bib1)), education is the second-largest task category on Claude (Handa et al., [2025](https://arxiv.org/html/2606.11744#bib.bib2); Bent et al., [2025](https://arxiv.org/html/2606.11744#bib.bib3)), and the share of U.S. teens using ChatGPT for schoolwork doubled from 13% to 26% between 2023 and 2024 (Sidoti et al., [2025](https://arxiv.org/html/2606.11744#bib.bib4)). Unlike a textbook or a search engine, an LLM can hold a conversation, answer follow-up questions, and adapt its explanations dynamically, which makes it a natural tool for self-directed learning and warrants direct examination of how well it actually performs that role.
Figure 1: Overview of the RL-based tutoring system. *Left:* The student query is decomposed into a prerequisite knowledge graph, which is used to keep track of the curriculum. *Centre:* The RL tutoring policy selects the next topic to teach based on the interaction history with the student. *Right:* A single turn of the resulting Socratic dialogue, in which the tutor poses a question, the student responds, and the answer is evaluated.

We study this in the setting of self-directed learning in the wild, where a user initiates a session with an LLM chat interface and asks a question about something they wish to learn, with no syllabus, no instructor-designed sequence, and no prior record of the student. The only signal available is the dialogue itself, and a good tutor must use it to do two things at once: deliver content in an order that respects prerequisite relationships between concepts, and figure out where the student currently stands so that content can be adapted rather than delivered as a fixed lecture.
A growing body of work studies how to make LLMs better teachers. Systems like LearnLM (LearnLM Team, [2024](https://arxiv.org/html/2606.11744#bib.bib5)) improve the quality of individual concept explanations, making delivery more pedagogically sound and responsive to learner needs, but treat each interaction as self-contained. They do not reason about which concept to teach next, nor maintain a model of what the student has and has not yet mastered across a session. A parallel line of work trains models to elicit reasoning from learners rather than supply answers (Dinucu-Jianu et al., [2025](https://arxiv.org/html/2606.11744#bib.bib10); Liu et al., [2024](https://arxiv.org/html/2606.11744#bib.bib9)). This addresses how content is exchanged but again leaves curriculum structure unaddressed. Much of this effort has been evaluated on mathematical problem solving (Zerkouk et al., [2025](https://arxiv.org/html/2606.11744#bib.bib31)), where success is legible because the student either solves a held-out problem or does not. This framing sits uncomfortably with the self-directed setting for two reasons. First, open-ended subjects, such as the humanities, do not reduce to a final numerical answer (Garcia and Weilbach, [2023](https://arxiv.org/html/2606.11744#bib.bib32); Favero et al., [2024](https://arxiv.org/html/2606.11744#bib.bib34)). Second, and more fundamentally, math benchmarks do not require the tutor to reason about curriculum structure or infer student state from dialogue alone, since the problem itself supplies the scope.
To address this, we adopt Socratic tutoring, in which the tutor probes understanding through questions rather than direct exposition. This surfaces misconceptions, prompts students to construct their own reasoning (Shridhar et al., [2022](https://arxiv.org/html/2606.11744#bib.bib22)), and generates the kind of open-ended exchange that self-directed learning actually looks like, across subjects and not just those with clean numerical answers.
Given a student query, our system constructs a prerequisite knowledge graph in which the query sits as the goal node and supporting concepts sit as ancestors\. Curriculum delivery becomes a graph traversal problem where a tutoring policy decides which node to teach next, with the aim of maximising mastery of the goal\. Deciding what to teach next and actually teaching it require different kinds of reasoning, and thus we treat them as separate problems\. The sequencing problem is addressed by a lightweight policy trained with reinforcement learning \(RL\), while the dialogue problem is handled by an LLM that conducts the Socratic exchange at each selected node and returns a mastery estimate the sequencing policy uses to decide its next move\. For scalability, we use an LLM as a student simulator; our focus is on LLM tutoring behaviour in unstructured settings and its ability to exploit structure when provided\. We evaluate this design across a mix of STEM and non\-STEM topic graphs, and find that the learned sequencing policy paired with a tutoring LLM outperforms both heuristic baselines and LLM\-only tutors, including frontier general\-purpose models and models specialised for Socratic dialogue\. An overview is provided in Figure[1](https://arxiv.org/html/2606.11744#S1.F1)\.
Our contributions are as follows:
- We formulate informal LLM-based education as a new task, in which an arbitrary student query is decomposed into a prerequisite knowledge graph and a tutor must guide the student through it using only dialogue as evidence of their state.
- We release a Gymnasium-compatible environment (Towers et al., [2025](https://arxiv.org/html/2606.11744#bib.bib27)) for this task, covering both STEM and non-STEM domains and supporting evaluation over long horizons, beyond the short, math-focused interactions of prior work.
- We propose a two-component system that separates curriculum sequencing from dialogue generation, with a lightweight policy for node selection and an LLM tutor for Socratic instruction.
- We show that frontier and tutoring-specialised LLMs struggle on this task when used alone, and that a lightweight Proximal Policy Optimisation (PPO) policy paired with a tutoring LLM closes the gap.
## 2 Related Work
### 2.1 Tutoring Language Models
A growing line of work trains or prompts LLMs to act as tutors over math problems. These efforts share a common structure: a tutor guides a student through a single problem via dialogue, often with RL (Dinucu-Jianu et al., [2025](https://arxiv.org/html/2606.11744#bib.bib10); Wei et al., [2026](https://arxiv.org/html/2606.11744#bib.bib28)), and is rewarded by some combination of pre/post solve-rate gains on simulated students and handcrafted pedagogical qualities such as withholding answers, recognising incorrect responses, or producing good explanations (Liu et al., [2024](https://arxiv.org/html/2606.11744#bib.bib9); Scarlatos et al., [2025](https://arxiv.org/html/2606.11744#bib.bib11)). Sessions are typically capped at 16 to 20 turns, reflecting the single-problem framing (Nam et al., [2025](https://arxiv.org/html/2606.11744#bib.bib13)).
This setup works well when learning can be reduced to solving a verifiable problem, but it constrains the tutor’s role in two ways that matter for everyday learning\. The reliance on a verifiable correct answer rules out domains where understanding is open\-ended, and the short horizon rules out the kind of extended sessions that deeper learning requires\. Our setting removes both constraints\. We tutor over arbitrary topics drawn from STEM and non\-STEM subjects, where success cannot be reduced to solving a specific problem, and we evaluate over horizons of one hundred turns to reflect the longer arc of self\-directed study\.
LearnLM (LearnLM Team, [2024](https://arxiv.org/html/2606.11744#bib.bib5)) is a large pre-trained model that incorporates high-quality pedagogical dialogue into its training data, evaluated via expert preference comparisons against other models on pedagogical qualities (LearnLM Team, [2025](https://arxiv.org/html/2606.11744#bib.bib6)). Evaluation of LearnLM centres on subjective expert judgement and does not directly probe whether the model can perform tasks essential for tutoring success, such as curriculum tracking and estimating a student’s knowledge state. Worden et al. ([2026](https://arxiv.org/html/2606.11744#bib.bib12)) reports that frontier models perform poorly on several pedagogical subtasks, including cognitive state modelling and comparing question difficulty, suggesting that subjective preference scores may not translate into the structured competencies tutoring requires. Our work contributes here by making curriculum tracking and competency estimation explicit objectives, and by evaluating current models, including LearnLM, on their ability to deliver content under them.
### 2.2 Knowledge Graphs, Knowledge Tracing, and Path Recommendation
A second line of work uses knowledge graphs to encode prerequisite and topical relationships between learning content (Doroudi et al., [2019](https://arxiv.org/html/2606.11744#bib.bib17); Chang et al., [2015](https://arxiv.org/html/2606.11744#bib.bib18); Siren and Tzerpos, [2022](https://arxiv.org/html/2606.11744#bib.bib19)), and pairs them with Knowledge Tracing (KT), the task of predicting a student’s correctness on future items given their history of past responses (Piech et al., [2015](https://arxiv.org/html/2606.11744#bib.bib29); Lee et al., [2026](https://arxiv.org/html/2606.11744#bib.bib21)). More recent systems combine KT with RL to recommend the next item or learning path that maximises predicted gain, and some have begun integrating LLMs into the pipeline, for instance to label items with knowledge components from text alone (Ozyurt et al., [2025](https://arxiv.org/html/2606.11744#bib.bib16)).
These approaches assume a setting we do not have. They build on interaction logs from online learning platforms (e.g. Worden et al. ([2026](https://arxiv.org/html/2606.11744#bib.bib12)); Chang et al. ([2015](https://arxiv.org/html/2606.11744#bib.bib18))), which presumes the student is a returning user on a fixed item pool, and they reduce assessment to binary correctness on item responses, which discards the information carried in the content of a student’s reply (Liu et al., [2019](https://arxiv.org/html/2606.11744#bib.bib15)). Both assumptions break in our setting. Students arrive with no prior log, the curriculum is constructed at query time, and the only diagnostic signal available is the dialogue itself. These differences motivate a tutor that can construct curricular structure on demand and reason about student state from natural language rather than from logged item responses.
### 2.3 Foundations from the Learning Sciences
Three findings from educational psychology shape our system’s design. First, tutoring systems that require students to reason through intermediate steps, rather than produce final answers, approach the effectiveness of expert human tutors (VanLehn, [2011](https://arxiv.org/html/2606.11744#bib.bib35)), which motivates our use of Socratic dialogue over direct exposition. Second, the order in which content is sequenced determines whether students have the prerequisite knowledge to engage constructively with new material (Chi and Wylie, [2014](https://arxiv.org/html/2606.11744#bib.bib36)), which is why we treat curriculum sequencing as a core problem. Third, formats that require students to generate responses under uncertainty improve long-term retention compared to more passive instructional formats (Bjork and Bjork, [2011](https://arxiv.org/html/2606.11744#bib.bib38)), and evidence that structural sequencing decisions operate largely independently of conversational modality (Litman et al., [2006](https://arxiv.org/html/2606.11744#bib.bib37)) supports treating the two as separable components.
## 3 Problem Setup
#### Problem Setting\.
Effective tutoring requires resolving two distinct dimensions simultaneously\. The first is curricular: A complete understanding of any topic draws on subtopics that depend on one another in specific ways, and a tutor must enumerate those subtopics and the prerequisites between them\. The second is diagnostic: The tutor must assess what the student already knows\.
Consider a student who queries $q$ = "I want to learn more about tariffs and sanctions". An economics student may grasp comparative advantage and trade balances yet struggle to connect them to sanctions as a geopolitical instrument. A business owner familiar with tariff schedules and supply chain disruptions may lack the theoretical grounding to reason about their macroeconomic origins and effects. The two students share the same query but require different sequences of instruction.
An LLM tutor encountering an arbitrary student resolves the curricular dimension naturally, drawing on knowledge encoded during pretraining to enumerate the subtopics relevant to $q$ and the dependencies between them. The diagnostic dimension is harder. A student’s knowledge state is private and is revealed only through dialogue. A curriculum that ignores this proceeds in a fixed order, re-teaching material the student already knows and bypassing the work needed to identify genuine gaps.
We frame the tutor’s task as allocating a budget of dialogue turns across the subtopics in the curriculum, with more turns spent where the student has the most to learn and fewer spent where competence is already in place\. We use Socratic dialogue as the mode of interaction, we treat curriculum sequencing as a learned decision, and we separate the sequencing decision from the dialogue itself\.
#### Knowledge graph\.
Given a query $q$, a language model produces a directed acyclic graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ that represents the curriculum. Each node $v \in \mathcal{V}$ is a subtopic required to fully understand $q$, and each directed edge $(u, v) \in \mathcal{E}$ indicates that $u$ is a prerequisite of $v$. The graph has a single root $r \in \mathcal{V}$ corresponding to $q$ itself, and leaf nodes $\mathcal{L} \subset \mathcal{V}$ corresponding to foundational concepts with no prerequisites. In the running example, $r$ is the topic of Tariffs and Sanctions, an interior node might be Diplomacy, and a leaf might be International Relations. We write $\mathrm{pre}(v)$ for the direct prerequisites of $v$ and $\mathrm{anc}(v)$ for its full ancestor subgraph. Mastery of $v$ requires mastery of every node in $\mathrm{anc}(v)$. The graph is constructed once at the start of the episode and is fixed for the remainder of the interaction.
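As a minimal sketch of the notation above, the graph can be stored as a map from each node to its direct prerequisites, from which $\mathrm{anc}(v)$ follows by traversal. The edge list is a hypothetical toy curriculum: only the node names Tariffs and Sanctions, Diplomacy, International Relations and Trade Balances appear in the text, and how they connect here is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical toy curriculum; an edge (u, v) means "u is a prerequisite of v".
EDGES = [
    ("International Relations", "Diplomacy"),
    ("Diplomacy", "Tariffs and Sanctions"),
    ("Trade Balances", "Tariffs and Sanctions"),
]

def build_pre(edges):
    """Map each node v to its set of direct prerequisites pre(v)."""
    pre = defaultdict(set)
    for u, v in edges:
        pre[v].add(u)
        pre[u]  # touch u so that leaves appear with an empty prerequisite set
    return dict(pre)

def anc(pre, v):
    """Return every ancestor of v, i.e. all nodes reachable via prerequisite edges."""
    seen, stack = set(), list(pre[v])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(pre[u])
    return seen

pre = build_pre(EDGES)
leaves = {v for v, prereqs in pre.items() if not prereqs}
root_ancestors = anc(pre, "Tariffs and Sanctions")
```

Under this representation, the statement that mastering $v$ requires mastering every node in $\mathrm{anc}(v)$ becomes a subset check between `anc(pre, v)` and the set of mastered nodes.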
## 4 Method
We propose a tutoring system built from two complementary components that each address one of the demands the tutor faces\. A learned sequencing policy addresses the curricular problem, observing summary statistics of the student’s response history and the structure of the prerequisite graph to decide which subtopic to engage with next\. A language model addresses the dialogue problem, conducting the Socratic exchange at the selected subtopic, generating a question, engaging with the student’s free\-form reply, and producing a binary judgement of whether the reply demonstrated understanding\. That judgement is fed back to the sequencing policy as the signal it uses to choose the next subtopic\. The two components are developed independently and combined only at inference time, which allows us to train the sequencing policy in fast simulation against a student model without any task\-specific training of the language model’s weights\.
### 4.1 Student Model
The student model describes the student in relation to the generated curriculum, and is structured to mirror the response models used in KT, which predicts the probability that a student will answer a question on a knowledge component correctly given their history of past responses (Piech et al., [2015](https://arxiv.org/html/2606.11744#bib.bib29)). Rather than fitting parameters to observed students, we treat the response model as a generative simulator with a fixed parametric form and use it to train the sequencing policy.
Each node $v$ is associated with a latent threshold representing the number of interactions the student needs at $v$ before achieving mastery, a single parameter that represents both the intrinsic difficulty of the topic and the strength of the student’s existing foundations. After $n$ interactions at $v$, the probability of a positive response is

$$
P(y = 1 \mid n, v) =
\begin{cases}
\min\!\left(\dfrac{n}{k_v},\ 1\right) & \text{if all } u \in \mathrm{pre}(v) \text{ are mastered}, \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
$$

The probability grows linearly with $n$ and clips at one, reflecting the intuition that successful interactions accumulate toward mastery at a roughly constant rate. The hard zero when prerequisites are unmastered is the more substantive modelling choice. Rather than allowing prerequisite structure to emerge implicitly through learned parameters as KT models typically do, we make the constraint explicit so that learning at a node cannot proceed when the foundations for it are absent. The thresholds $k_v$ are hidden from the tutor and must be inferred from the student’s responses.
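Equation 1 is small enough to write out directly. The sketch below is one possible rendering, not the released simulator; the function names are hypothetical, and treating a zero threshold as "always positive" is an assumption made to avoid dividing by zero.

```python
import random

def p_positive(n, k_v, prereqs_mastered):
    """Equation 1: probability of a positive response after n interactions at v."""
    if not prereqs_mastered:
        return 0.0          # hard zero when any prerequisite is unmastered
    if k_v == 0:
        return 1.0          # assumption: a zero threshold means nothing left to learn
    return min(n / k_v, 1.0)

def sample_response(n, k_v, prereqs_mastered, rng=random):
    """Draw the binary response y from the probability above."""
    return 1 if rng.random() < p_positive(n, k_v, prereqs_mastered) else 0

halfway = p_positive(2, 4, True)      # linear growth: 2 of 4 interactions
clipped = p_positive(10, 4, True)     # clipped at one beyond the threshold
blocked = p_positive(3, 4, False)     # prerequisites missing
```

With $k_v = 4$, the probability rises by 0.25 per interaction until it saturates, and a node with an unmastered prerequisite never yields a positive response however many turns are spent on it.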
### 4.2 Tutoring as a POMDP
The setup above defines a sequential decision problem under uncertainty. The tutor chooses which node to teach at each step, observes a noisy response, and must update its belief about hidden mastery thresholds. This is naturally a partially observable Markov decision process (POMDP; Sondik ([1971](https://arxiv.org/html/2606.11744#bib.bib30))), and we adopt that formalism here. The full specification, including the precise observation vector and transition rules, is given in Appendix [C.1](https://arxiv.org/html/2606.11744#A3.SS1).
#### State\.
The true state at each step is the student’s mastery across all nodes in the knowledge graph, together with how much successful practice they have accumulated at each node. Formally, mastery $m_v \in \{0, 1\}$ at a given node transitions from unmastered to mastered ($m_v = 1$) once the student’s cumulative successful interactions $n_v$ meet a required threshold $k_v$, i.e. $m_v = \mathbb{1}[n_v \geq k_v]$. This threshold is never observed by the tutor, so even a complete record of the student’s interaction history leaves genuine uncertainty about whether any given node has been mastered.
#### Actions\.
At each step, the tutor selects a node $a_t \in \mathcal{V}$ to teach next. In the running example, an action is a choice between asking the student a question about supply and demand, trade balances, or sanctions as a geopolitical instrument.
#### Observations\.
The tutor’s primary observation is the student’s utterance, which provides indirect evidence of their knowledge state. The true mastery state is not directly accessible, so the tutor must infer it through dialogue, as a human teacher does. We model this by composing the observation from the student’s utterance together with summary statistics of the response history at each node (e.g. number of positive responses) and the graph structure $\mathcal{G}$. The graph structure allows the tutor to reason about how evidence at one node implicates its ancestors and descendants.
#### Rewards\.
The objective is mastery of the root $r$, which in turn requires mastering every node in $\mathcal{V}$. We give the tutor a large positive reward when the root is mastered and a small per-step penalty otherwise:

$$
\mathcal{R}(s_t, a_t) =
\begin{cases}
+10 & \text{if } m_r = 1, \\
-\delta & \text{otherwise,}
\end{cases}
\tag{2}
$$

where $\delta > 0$ encourages the tutor to reach root mastery in as few turns as possible. We set $\delta = 0.1$ in all experiments. Episodes terminate either when the root is mastered or when the step budget $T = 100$ is exhausted.
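To make the incentive in Equation 2 concrete, the sketch below computes the total return of an episode. It assumes an undiscounted sum and that the $+10$ is paid on the step at which the root becomes mastered; neither detail is stated in the text, and the function names are hypothetical.

```python
ROOT_REWARD = 10.0   # reward when the root is mastered (Equation 2)
DELTA = 0.1          # per-step penalty, the value used in all experiments
T_MAX = 100          # step budget

def reward(root_mastered):
    """Per-step reward from Equation 2."""
    return ROOT_REWARD if root_mastered else -DELTA

def episode_return(steps_to_root):
    """Undiscounted return; steps_to_root is None when the budget runs out."""
    if steps_to_root is None:
        return -DELTA * T_MAX
    # Every step before the last pays the penalty; the last step earns the bonus.
    return -DELTA * (steps_to_root - 1) + ROOT_REWARD

fast = episode_return(31)      # about 7.0
slow = episode_return(100)     # about 0.1
failed = episode_return(None)  # about -10.0
```

Under these assumptions a successful episode always has a positive return and a failed one always has a negative return, so the penalty only orders successful episodes by speed.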
### 4.3 RL Tutor
The RL tutor is a policy $\pi_\theta$ trained with PPO (Schulman et al., [2017](https://arxiv.org/html/2606.11744#bib.bib24)) on the POMDP described in the previous section. Training is performed entirely in simulation, with student responses $y_t$ generated from the binary stochastic model in Equation [1](https://arxiv.org/html/2606.11744#S4.E1) and no language model in the loop. PPO requires many environment interactions to converge, and rolling these out through a language model across long conversation horizons would be prohibitively expensive, so we treat the binary simulator as a sufficient proxy for the structural problem to support training and account for the resulting sim-to-real gap at evaluation time.
To learn a policy that generalises across topics and students rather than memorising a single curriculum, we apply Domain Randomisation (Tobin et al., [2017](https://arxiv.org/html/2606.11744#bib.bib23)). The core idea is to expose the policy to a wide distribution of environment configurations during training so that the resulting policy is robust to conditions not seen during training. At the start of each episode, three features are resampled:
1. **Topic and graph.** A topic is drawn from a pool spanning STEM and non-STEM domains, and a knowledge graph $\mathcal{G}$ is constructed for it. This varies the prerequisite structure the policy must traverse. The full list of topics can be found in Appendix [E](https://arxiv.org/html/2606.11744#A5).
2. **Difficulty thresholds.** For each node $v \in \mathcal{V}$, the threshold $k_v$ is sampled independently from a Poisson distribution with mean $\bar{k}$. The policy does not observe $k_v$ and must infer proximity to mastery from the response history.
3. **Prior knowledge.** A mastery initialisation parameter $\rho \in [0, 1]$ controls the student’s existing competence. Each node is independently marked as mastered (with $k_v = 0$ and $m_v = 1$) with probability $\rho$, producing students that range from complete novices to those with most of the curriculum already covered.
Item 1 exposes the policy to different topics and items 2 and 3 expose it to different students, requiring it to probe for prerequisites, interpret the response signal, and allocate turns where they yield the largest gains in mastery\.
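The student-side resampling in items 2 and 3 can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the authors' environment code: the function names are hypothetical, the Poisson draw uses Knuth's multiplication method, and a Poisson draw of zero is treated as an already-mastered node because $m_v = \mathbb{1}[n_v \geq k_v]$ holds trivially when $k_v = 0$.

```python
import math
import random

def sample_poisson(mean, rng):
    """Knuth's method for a Poisson draw; adequate for small means."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_student(nodes, k_bar, rho, rng):
    """Draw hidden thresholds k_v and initial mastery m_v for one episode."""
    k, m = {}, {}
    for v in nodes:
        # With probability rho the node starts mastered (k_v = 0, m_v = 1).
        k_v = 0 if rng.random() < rho else sample_poisson(k_bar, rng)
        k[v] = k_v
        m[v] = 1 if k_v == 0 else 0   # m_v = 1[n_v >= k_v] with n_v = 0
    return k, m

rng = random.Random(0)
nodes = ["International Relations", "Diplomacy", "Tariffs and Sanctions"]
novice_k, novice_m = sample_student(nodes, k_bar=4.0, rho=0.0, rng=rng)
expert_k, expert_m = sample_student(nodes, k_bar=4.0, rho=1.0, rng=rng)
```

The value `k_bar=4.0` is a placeholder, since the mean used in training is not given in this section.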
**Inference with LLMs.** At inference time, the trained policy $\pi_\theta$ is composed with a language model to produce a fully interactive tutoring system. The interaction proceeds turn by turn. Given the current observation $o_t$, the RL tutor policy selects the next node $a_t = \pi_\theta(o_t)$ to teach. The language model then carries out the actual instruction at node $a_t$, generating a Socratic question grounded in the selected subtopic and engaging with the student’s free-form reply.
Since the student, during inference, responds in natural language rather than emitting a binary signal, the response $y_t$ that the policy expects is no longer directly available. We obtain it from an LLM judge that reads the student’s reply and returns a positive or negative assessment. That assessment is fed into the observation update and used to advance the environment, after which the RL policy chooses the next node. For all our experiments, the LLM tutor and judge are the same model.
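The turn structure described above can be summarised as a short loop. Every callable in this sketch is a hypothetical stand-in, since the paper's interfaces are not given here: `policy` plays the role of $\pi_\theta$, `ask` and `judge` stand for the tutoring LLM and the LLM judge, `reply` stands for the student, and `update` folds $y_t$ into the next observation.

```python
def run_session(policy, ask, reply, judge, update, obs, max_turns=100):
    """One tutoring session over at most max_turns turns.

    Assumed signatures (all hypothetical):
      policy(obs) -> node, ask(node) -> question, reply(question) -> utterance,
      judge(question, utterance) -> 0 or 1, update(obs, node, y) -> (obs, done).
    """
    transcript = []
    for _ in range(max_turns):
        node = policy(obs)                 # RL policy picks the node a_t
        question = ask(node)               # tutor LLM poses a Socratic question
        utterance = reply(question)        # student answers in natural language
        y = judge(question, utterance)     # judge recovers the binary signal y_t
        transcript.append((node, y))
        obs, done = update(obs, node, y)   # advance the environment
        if done:
            break
    return transcript

# Toy stand-ins, purely to show the call pattern.
demo = run_session(
    policy=lambda obs: "Diplomacy",
    ask=lambda node: f"What role does {node} play here?",
    reply=lambda question: "I am not sure.",
    judge=lambda question, utterance: 0,
    update=lambda obs, node, y: (obs + [(node, y)], len(obs) >= 2),
    obs=[],
)
```

Keeping the policy and the language model behind separate callables mirrors the separation argued for in the text: the sequencing component only ever sees the observation and the binary signal, never the raw dialogue.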
## 5 Experiments
### 5.1 Experimental Setup
We evaluate on a held-out set of 10 topics spanning STEM and non-STEM domains (see Table [E](https://arxiv.org/html/2606.11744#A5.SSx2)), disjoint from those used during training. Each topic has a fixed knowledge graph $\mathcal{G}$ and per-node thresholds $\{k_v\}_{v \in \mathcal{V}}$, sampled once and held constant across all conditions so that performance differences reflect policy quality rather than problem difficulty. Each topic is evaluated over 30 episodes for non-LLM baselines, 5 for open-source LLMs, and 3 for closed-source models (Gemini, GPT-5). The smaller counts for closed-source models reflect API cost, and the resulting confidence intervals are correspondingly wider. We provide examples of the conversations in Figure [4(c)](https://arxiv.org/html/2606.11744#A1.F4.sf3).
**Baselines.** We organise methods along two factors that together define the design space of our experiments, summarised in Table [1](https://arxiv.org/html/2606.11744#S5.T1) and described below.
The *action scope* factor specifies what the method controls. Under BFS, the visitation order is fixed in advance by traversing $\mathcal{G}$ in breadth-first order from the leaves, and the policy’s only decision at each step is whether to continue on the current node or advance to the next one, representing methods that follow a fixed syllabus. Full-scope methods are unconstrained and may revisit or interleave nodes.
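One plausible reading of the BFS action scope is a breadth-first topological order (Kahn's algorithm with a FIFO queue) that starts at nodes without prerequisites and releases a node once all of its prerequisites have been visited. The sketch below follows that reading, which is an assumption, and pairs it with the Threshold pacing rule listed in Table 1. The function names and the toy graph are hypothetical.

```python
from collections import deque

def bfs_order(pre):
    """Breadth-first topological order over a map node -> set of prerequisites."""
    children = {v: [] for v in pre}
    remaining = {v: len(prereqs) for v, prereqs in pre.items()}
    for v, prereqs in pre.items():
        for u in prereqs:
            children[u].append(v)
    # Start from the leaves; sorting only makes the order deterministic.
    queue = deque(sorted(v for v, count in remaining.items() if count == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(children[u]):
            remaining[v] -= 1
            if remaining[v] == 0:      # all prerequisites of v have been visited
                queue.append(v)
    return order

def should_advance(responses, tau=3):
    """Threshold rule: advance once the last tau responses are all positive."""
    return len(responses) >= tau and all(r == 1 for r in responses[-tau:])

toy_pre = {"root": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}
syllabus = bfs_order(toy_pre)
```

A fixed syllabus of this kind can only decide when to advance, which is why it cannot skip material a student already knows or return to a node that turns out to be shaky.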
The *observation* factor specifies the signal the method receives, which is either the ground-truth mastery state (GT), the latent binary response $y_t$ (Binary), or a natural-language student utterance (NL). The Binary and NL conditions share the same underlying student engine and differ only in whether $y_t$ from Equation [1](https://arxiv.org/html/2606.11744#S4.E1) is exposed directly or rendered as a natural-language reply by Qwen2.5-14B-Instruct (Dinucu-Jianu et al., [2025](https://arxiv.org/html/2606.11744#bib.bib10)) and recovered by an LLM judge. Any performance gap between the two is therefore attributable to the language-and-judgement layer rather than to a different environment. Heuristic baselines and our sequencing policy alone are evaluated under Binary, while LLM-only methods and our full system are evaluated under NL.
LLM-based methods are evaluated on Qwen2.5-7B-Instruct (Qwen, [2025](https://arxiv.org/html/2606.11744#bib.bib25)), TutorRL (Dinucu-Jianu et al., [2025](https://arxiv.org/html/2606.11744#bib.bib10)), `gemini-3-flash-preview`, and `gpt-5-mini-2025-08-07`. ThinkingKT (Lee et al., [2026](https://arxiv.org/html/2606.11744#bib.bib21)) serves as a representative method from the KT literature, adapted for the current environment.
Table 1: Summary of all methods. *BFS* methods follow a fixed BFS curriculum and only decide when to advance; *Full* methods control both sequencing and pacing. Observations are either the ground-truth mastery signal (GT), the latent binary response $y_t$ (Binary), or natural-language student utterances (NL). Our proposed methods are in bold.

| Action Scope | Method | Obs. | Description |
| --- | --- | --- | --- |
| BFS | Oracle | GT | Invests exactly $k_v$ interactions using true $k_v$. |
| BFS | Threshold | Binary | Advances after $\tau$ consecutive positive responses. |
| BFS | Mean $k$ | Binary | Invests exactly $\bar{k}$ interactions per node. |
| Full | Random | Binary | Selects nodes uniformly at random. |
| Full | **PPO** | Binary | Trained policy selecting node order from POMDP observations. |
| Full | ThinkingKT | NL | KT-based method that uses an LLM for inference. |
| Full | LLM-Only | NL | LLM selects nodes and conducts dialogue via prompting. |
| Full | LLM Oracle | GT | LLM-Only with true $k_v$ exposed in the turn prompt. |
| Full | **PPO + LLM** | NL | PPO selects nodes; the LLM judges utterances to recover $y_t$ and delivers Socratic dialogue. |

**Metrics.** The primary metric is *success rate*, the proportion of episodes in which all nodes in $\mathcal{G}$ are mastered within the step budget $T = 100$. We also report *graph mastery*, the average proportion of nodes mastered across all episodes including failed ones, which captures partial progress when the budget is exhausted before full mastery. Finally, we report *average steps to mastery*, computed over all episodes, as a measure of efficiency.
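The three metrics reduce to simple averages over episode records. The sketch below is a hedged illustration: the record layout is hypothetical, and it assumes that a failed episode contributes the full budget $T = 100$ to the step average, which the text implies by computing steps over all episodes but does not state outright.

```python
def summarise(episodes):
    """Compute (success rate, graph mastery, average steps).

    episodes: list of (nodes_mastered, total_nodes, steps_used) tuples, where
    steps_used equals the budget for episodes that never reach full mastery.
    """
    n = len(episodes)
    success_rate = sum(mastered == total for mastered, total, _ in episodes) / n
    graph_mastery = sum(mastered / total for mastered, total, _ in episodes) / n
    average_steps = sum(steps for _, _, steps in episodes) / n
    return success_rate, graph_mastery, average_steps

# Two toy episodes on a four-node graph: one success in 40 steps, one failure.
toy_episodes = [(4, 4, 40), (2, 4, 100)]
success_rate, graph_mastery, average_steps = summarise(toy_episodes)
```

The toy example shows why graph mastery is reported alongside success rate: the failed episode still mastered half the graph, which a success-only metric would hide.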
### 5.2 Main Results
Table 2: Results across all methods. Rows shaded green are our proposed methods; rows shaded grey are LLM-based comparisons. Steps is $\Delta\%$ relative to Oracle (lower is better; Oracle raw mean = $31.00 \pm 0.71$ steps). Mastery is reported as a percentage. All CIs are 95% ($t$-distribution). †Methods using ground-truth mastery signals are not deployable.

| Scope | Obs. | Method | n | Success (%) ↑ | Steps Δ% ↓ | Mastery (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| *Upper bounds (non-deployable)* | | | | | | |
| BFS | GT† | Oracle | 300 | 100.0 ± 0.0 | 0.0 | 100.0 ± 0.0 |
| Full | GT† | Gemini-Flash Oracle | 30 | 100.0 ± 0.0 | +8.4 ± 4.0 | 100.0 ± 0.0 |
| Full | GT† | GPT-5-mini Oracle | 30 | 100.0 ± 0.0 | +3.4 ± 4.0 | 100.0 ± 0.0 |
| *Fixed BFS curriculum* | | | | | | |
| BFS | Binary | Mean $k$ | 300 | 0.0 ± 0.0 | +236.2 ± 7.9 | 55.6 ± 1.6 |
| BFS | Binary | Threshold ($\tau = 3$) | 300 | 41.7 ± 5.6 | +174.9 ± 9.2 | 84.6 ± 2.0 |
| *Full-task methods* | | | | | | |
| Full | Binary | Random | 300 | 28.3 ± 5.1 | +217.4 ± 7.3 | 90.6 ± 0.9 |
| Full | NL | ThinkingKT + Qwen2.5 | 50 | 0.0 ± 0.0 | +236.2 ± 19.9 | 48.1 ± 4.1 |
| Full | NL | ThinkingKT + TutorRL | 50 | 0.0 ± 0.0 | +236.2 ± 19.9 | 49.3 ± 3.9 |
| Full | NL | GPT-5-mini | 30 | 3.3 ± 6.8 | +234.2 ± 27.5 | 77.1 ± 3.4 |
| Full | NL | Gemini-Flash | 30 | 23.3 ± 16.1 | +223.3 ± 24.4 | 83.9 ± 5.6 |
| Full | Binary | PPO | 300 | 93.3 ± 2.8 | +72.4 ± 5.3 | 98.8 ± 0.7 |
| Full | NL | PPO + Qwen2.5 | 50 | **78.0 ± 11.9** | **+120.1 ± 23.1** | **95.4 ± 3.0** |
| Full | NL | PPO + TutorRL | 50 | **82.0 ± 11.0** | **+105.6 ± 24.3** | **98.0 ± 1.5** |

Table [2](https://arxiv.org/html/2606.11744#S5.T2) reports performance across all methods. We highlight four findings.
LLM-only policies fail on the joint task. LLM-only policies, including frontier models, reach lower success rates than both the Random policy and the Threshold heuristic, and ThinkingKT performs below random. This is not a failure to understand the environment, since the LLM oracle variants of the same frontier models reach near-optimal performance once given access to the ground-truth mastery signal. The failure lies in performing three tasks at once, namely sequencing the curriculum, conducting Socratic dialogue to deliver content, and inferring the student’s state from that dialogue, and replicates earlier findings that frontier models are no better than random on pedagogical tasks (Worden et al., [2026](https://arxiv.org/html/2606.11744#bib.bib12)). KT methods fare no better, since they were designed for settings with rich prior interaction logs and fixed item pools, neither of which our setting provides.
Separating sequencing from dialogue narrows the oracle gap. Within the Full and NL conditions, decomposing the task into a learned sequencing policy paired with a small LLM instructor outperforms every monolithic LLM-only method evaluated under the same conditions. Both PPO + TutorRL and PPO + Qwen2.5 substantially exceed the LLM-only and heuristic baselines, despite the LLM in each case being a 7B-parameter model. The deployable system closest to the oracle upper bound is one that decomposes the task rather than one that uses a more capable model.
Performance degrades modestly under natural language. Holding the sequencing policy fixed at PPO and varying the observation factor from Binary to NL, success rate drops from 93.3% to between 78.0% and 82.0%, depending on the dialogue LLM. The drop reflects noise added by the language-and-judgement layer, since both observation conditions share the same underlying engine. The remaining gap to the oracle is nonetheless considerably narrower than for any other deployable system, indicating that most of the difficulty in the joint task comes from the control problem rather than from the language layer.
Education-oriented models are slightly stronger dialogue partners. In both the LLM-only and PPO + LLM conditions, models with known exposure to education-related training slightly outperform general-purpose models at equivalent scale. Gemini-Flash outperforms GPT-5-mini, and TutorRL marginally outperforms Qwen2.5. The differences fall within confidence intervals and we make no causal claims, but the direction is consistent across both conditions, suggesting that Socratic question generation and student response interpretation benefit from education-focused training.
Finetuning of the dialogue LLM degrades performance. We present a targeted experiment in Appendix [A](https://arxiv.org/html/2606.11744#A1) showing that directly finetuning the LLM in our environment produces degraded responses and increased safety issues.
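The confidence intervals on the success rates in Table 2 can be sanity-checked with a short calculation. The sketch below assumes they are standard two-sided $t$-intervals on per-episode binary success outcomes; the function name, the hard-coded critical values, and the success counts (inferred from the reported percentages and $n$) are illustrative assumptions rather than details taken from the paper.

```python
import math

# Two-sided 95% critical values of Student's t for df = n - 1, hard-coded
# from standard tables for the sample sizes in Table 2 (assumption: no SciPy).
T_CRIT_95 = {30: 2.0452, 50: 2.0096, 300: 1.9679}

def success_ci_halfwidth(successes: int, n: int) -> float:
    """95% t-interval half-width, in percentage points, for a success rate."""
    p = successes / n
    # Sample variance of n Bernoulli outcomes with Bessel's correction.
    sample_var = p * (1.0 - p) * n / (n - 1)
    std_err = math.sqrt(sample_var / n)
    return 100.0 * T_CRIT_95[n] * std_err

# PPO (Full, Binary): 93.3% of 300 episodes, i.e. 280 successes (inferred).
print(round(success_ci_halfwidth(280, 300), 1))  # 2.8
# PPO + TutorRL (Full, NL): 82.0% of 50 episodes, i.e. 41 successes (inferred).
print(round(success_ci_halfwidth(41, 50), 1))    # 11.0
```

Under these assumptions the half-widths agree with the reported ±2.8 and ±11.0, which also shows why the rows with $n = 30$ or $n = 50$ carry much wider intervals than those with $n = 300$.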
### 5.3 Policy Behaviour Analysis

(a) Traversal distribution on a single sample episode.
(b) Node selections over time, averaged across the evaluation set (error bars: standard error).
Figure 2: Node traversal behavior across an episode. Nodes form a tree with the original query at the root (depth 0), subtopics as its immediate children, and foundational concepts at the leaves. (a) Gemini’s node visit distribution on one episode. (b) Node types selected by PPO and Gemini; Gemini’s rate of invalid actions grows as the episode progresses.

Figure [3](https://arxiv.org/html/2606.11744#S5.F3) illustrates the sequencing behavior of PPO across a sample episode. PPO adopts a localised, BFS-like traversal: it focuses on a single branch, exhausts the leaf nodes of a subtopic, and only then advances to the subtopic node itself before moving on. This yields a structured progression in which prerequisite concepts are consistently covered before their parent topic.
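The ordering described above (a subtopic's leaves first, then the subtopic itself, with the root query last) is what a post-order walk over the curriculum tree produces. The sketch below illustrates that ordering on a hypothetical toy tree; it is not the learned PPO policy, and the node names and the `prerequisites_first` helper are invented for illustration.

```python
# Hypothetical toy curriculum: root query -> subtopics -> foundational leaves.
children = {
    "root": ["subtopic_A", "subtopic_B"],
    "subtopic_A": ["leaf_A1", "leaf_A2"],
    "subtopic_B": ["leaf_B1"],
}

def prerequisites_first(node, children):
    """Return nodes in post-order: every descendant of a node precedes it."""
    order = []
    for child in children.get(node, []):
        order.extend(prerequisites_first(child, children))
    order.append(node)
    return order

print(prerequisites_first("root", children))
# ['leaf_A1', 'leaf_A2', 'subtopic_A', 'leaf_B1', 'subtopic_B', 'root']
```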
Figure[2\(a\)](https://arxiv.org/html/2606.11744#S5.F2.sf1)shows the depth of nodes visited over the course of a sample episode by Gemini, with depth increasing steadily as the session progresses, reflecting a movement from foundational concepts toward deeper material\. This emergent ordering is consistent with the traversal learned by PPO \(see Figure[5](https://arxiv.org/html/2606.11744#A2.F5)for detailed comparison on a sample episode\)\.
Figure 3: Graph traversal with PPO. Numbers in nodes represent the first timestep at which the node was visited.

Figure [2(b)](https://arxiv.org/html/2606.11744#S5.F2.sf2) compares node selections by PPO and Gemini across early ($t < 50$) and late ($t > 50$) timesteps. Early on, both focus on foundational nodes, with Gemini pacing through them slightly faster. Later, the two diverge: Gemini’s distribution degrades, increasingly selecting the root node (the original query) or producing unparseable actions, effectively reverting to unstructured LLM behavior. The PPO policy remains stable throughout, as it operates over a fixed observation vector rather than a growing transcript, supporting recent findings on multi-turn degradation (Laban et al., [2025](https://arxiv.org/html/2606.11744#bib.bib20)). Additional visualisations are provided in Appendix [B](https://arxiv.org/html/2606.11744#A2).
## 6 Conclusion and Limitations
We formulated open\-ended LLM tutoring as curriculum traversal over a query\-conditioned prerequisite graph, and showed that our two\-component tutoring system, with a lightweight PPO policy for sequencing and an LLM for Socratic dialogue, outperforms both heuristic baselines and frontier LLM tutors\. The main failure of LLM\-only tutors is not topical knowledge but the joint demand of sequencing, dialogue, and assessment over long horizons\. Promising next steps include extracting richer mastery signal from student utterances and distilling structural tutoring behaviour back into the language model\.
Limitations. The LLM-generated graphs in our experiments stayed within our cap of $N_{\max} = 20$ nodes, and whether the method scales to larger and finer-grained curricula, where the state and action spaces grow accordingly, remains open. All student responses are generated synthetically by an LLM-based student, and remain to be validated with human learners. Higher-fidelity student models (Doroudi et al., [2019](https://arxiv.org/html/2606.11744#bib.bib17); Tio et al., [2025](https://arxiv.org/html/2606.11744#bib.bib26)) are a natural avenue for strengthening the simulator in future work.
## References
- D\. Bent, K\. Handa, E\. Durmus, A\. Tamkin, M\. McCain, S\. Ritchie, R\. Donegan, J\. Martinez, and J\. Jones \(2025\)Anthropic education report: how educators use claude\.External Links:[Link](https://www.anthropic.com/news/anthropic-education-report-how-educators-use-claude)Cited by:[§1](https://arxiv.org/html/2606.11744#S1.p1.1)\.
- E\. L\. Bjork and R\. A\. Bjork \(2011\)Making things hard on yourself, but in a good way: creating desirable difficulties to enhance learning\.InPsychology and the Real World: Essays Illustrating Fundamental Contributions to Society,M\. A\. Gernsbacher, R\. W\. Pew, L\. M\. Hough, and J\. R\. Pomerantz \(Eds\.\),pp\. 56–64\.Cited by:[§2\.3](https://arxiv.org/html/2606.11744#S2.SS3.p1.1)\.
- H\. Chang, H\. Hsu, and K\. Chen \(2015\)Modeling exercise relationships in e\-learning: a unified approach\.\.InEDM,pp\. 532–535\.Cited by:[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p2.1)\.
- A\. Chatterji, T\. Cunningham, D\. J\. Deming, Z\. Hitzig, C\. Ong, C\. Y\. Shan, and K\. Wadman \(2025\)How people use chatgpt\.Technical reportNational Bureau of Economic Research\.Cited by:[§1](https://arxiv.org/html/2606.11744#S1.p1.1)\.
- M\. T\. H\. Chi and R\. Wylie \(2014\)The ICAP framework: linking cognitive engagement to active learning outcomes\.Educational Psychologist49\(4\),pp\. 219–243\.External Links:[Document](https://dx.doi.org/10.1080/00461520.2014.965823)Cited by:[§2\.3](https://arxiv.org/html/2606.11744#S2.SS3.p1.1)\.
- D\. Dinucu\-Jianu, J\. Macina, N\. Daheim, I\. Hakimi, I\. Gurevych, and M\. Sachan \(2025\)From problem\-solving to teaching problem\-solving: aligning llms with pedagogy using reinforcement learning\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 272–292\.Cited by:[§A\.5](https://arxiv.org/html/2606.11744#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.11744#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.11744#S2.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.11744#S5.SS1.p4.2),[§5\.1](https://arxiv.org/html/2606.11744#S5.SS1.p5.1)\.
- S\. Doroudi, V\. Aleven, and E\. Brunskill \(2019\)Where’s the reward? a review of reinforcement learning for instructional sequencing\.International Journal of Artificial Intelligence in Education29,pp\. 568–620\.Cited by:[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p1.1),[§6](https://arxiv.org/html/2606.11744#S6.p2.1)\.
- L\. Favero, J\. A\. Pérez\-Ortiz, T\. Käser, and N\. Oliver \(2024\)Enhancing critical thinking in education by means of a socratic chatbot\.pp\. 17–32\.Cited by:[§1](https://arxiv.org/html/2606.11744#S1.p3.1)\.
- G\. G\. Garcia and C\. Weilbach \(2023\)If the sources could talk: evaluating large language models for research assistance in history\.arXiv preprint arXiv:2310\.10808\.Cited by:[§1](https://arxiv.org/html/2606.11744#S1.p3.1)\.
- K\. Handa, D\. Bent, A\. Tamkin, M\. McCain, E\. Durmus, M\. Stern, M\. Schiraldi, S\. Huang, S\. Ritchie, S\. Syverud, K\. Jagadish, M\. Vo, M\. Bell, and D\. Ganguli \(2025\)Anthropic education report: how university students use claude\.External Links:[Link](https://www.anthropic.com/news/anthropic-education-report-how-university-students-use-claude)Cited by:[§1](https://arxiv.org/html/2606.11744#S1.p1.1)\.
- P. Laban, H. Hayashi, Y. Zhou, and J. Neville (2025) LLMs get lost in multi-turn conversation. In ICLR 2026. Cited by: [§5.3](https://arxiv.org/html/2606.11744#S5.SS3.p3.2).
- LearnLM Team \(2024\)LearnLM: improving gemini for learning\.arXiv preprint arXiv:2412\.16429\.Cited by:[§1](https://arxiv.org/html/2606.11744#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.11744#S2.SS1.p3.1)\.
- LearnLM Team \(2025\)Evaluating gemini in an arena for learning\.arXiv preprint arXiv:2505\.24477\.Cited by:[§A\.2](https://arxiv.org/html/2606.11744#A1.SS2.p1.1),[§2\.1](https://arxiv.org/html/2606.11744#S2.SS1.p3.1)\.
- U\. Lee, J\. Y\. Kim, R\. Ju, M\. Jung, and J\. Eo \(2026\)A training\-free large reasoning model\-based knowledge tracing framework for unified prediction and prescription\.arXiv preprint arXiv:2601\.01708\.Cited by:[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p1.1),[§5\.1](https://arxiv.org/html/2606.11744#S5.SS1.p5.1)\.
- D\. J\. Litman, C\. P\. Rosé, K\. Forbes\-Riley, K\. VanLehn, D\. Bhembe, and S\. Silliman \(2006\)Spoken versus typed human and computer dialogue tutoring\.International Journal of Artificial Intelligence in Education16\(2\),pp\. 145–170\.Cited by:[§2\.3](https://arxiv.org/html/2606.11744#S2.SS3.p1.1)\.
- J\. Liu, Z\. Huang, T\. Xiao, J\. Sha, J\. Wu, Q\. Liu, S\. Wang, and E\. Chen \(2024\)SocraticLM: exploring socratic personalized teaching with large language models\.Advances in Neural Information Processing Systems37,pp\. 85693–85721\.Cited by:[§A\.5](https://arxiv.org/html/2606.11744#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.11744#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.11744#S2.SS1.p1.1)\.
- Q\. Liu, S\. Tong, C\. Liu, H\. Zhao, E\. Chen, H\. Ma, and S\. Wang \(2019\)Exploiting cognitive structure for adaptive learning\.InProceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining,pp\. 627–635\.Cited by:[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p2.1)\.
- H\. Nam, O\. Gottesman, A\. Zhang, D\. Foster, E\. Brunskill, and L\. Ungar \(2025\)Efficient rl for optimizing conversation level outcomes with an llm\-based tutor\.arXiv preprint arXiv:2507\.16252\.Cited by:[§2\.1](https://arxiv.org/html/2606.11744#S2.SS1.p1.1)\.
- Y\. Ozyurt, T\. Almaci, S\. Feuerriegel, and M\. Sachan \(2025\)Personalized exercise recommendation with semantically\-grounded knowledge tracing\.arXiv preprint arXiv:2507\.11060\.Cited by:[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p1.1)\.
- C\. Piech, J\. Bassen, J\. Huang, S\. Ganguli, M\. Sahami, L\. J\. Guibas, and J\. Sohl\-Dickstein \(2015\)Deep knowledge tracing\.Advances in neural information processing systems28\.Cited by:[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.11744#S4.SS1.p1.1)\.
- Qwen \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§5\.1](https://arxiv.org/html/2606.11744#S5.SS1.p5.1)\.
- A\. Raffin, A\. Hill, A\. Gleave, A\. Kanervisto, M\. Ernestus, and N\. Dormann \(2021\)Stable\-baselines3: reliable reinforcement learning implementations\.Journal of Machine Learning Research22\(268\),pp\. 1–8\.External Links:[Link](http://jmlr.org/papers/v22/20-1364.html)Cited by:[Table 6](https://arxiv.org/html/2606.11744#A4.T6),[Table 6](https://arxiv.org/html/2606.11744#A4.T6.14.2)\.
- A\. Scarlatos, N\. Liu, J\. Lee, R\. Baraniuk, and A\. Lan \(2025\)Training llm\-based tutors to improve student learning outcomes in dialogues\.InInternational Conference on Artificial Intelligence in Education,pp\. 251–266\.Cited by:[§2\.1](https://arxiv.org/html/2606.11744#S2.SS1.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§4\.3](https://arxiv.org/html/2606.11744#S4.SS3.p1.2)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§A\.3](https://arxiv.org/html/2606.11744#A1.SS3.p1.2)\.
- K. Shridhar, J. Macina, M. El-Assady, T. Sinha, M. Kapur, and M. Sachan (2022) Automatic generation of socratic subquestions for teaching math word problems. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang (Eds.), Abu Dhabi, United Arab Emirates, pp. 4136–4149. External Links: [Link](https://aclanthology.org/2022.emnlp-main.277/), [Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.277) Cited by: [§1](https://arxiv.org/html/2606.11744#S1.p4.1).
- O\. Sidoti, E\. Park, and J\. Gottfried \(2025\)About a quarter of U\.S\. teens have used ChatGPT for schoolwork – double the share in 2023\.Pew Research Center\.External Links:[Link](https://www.pewresearch.org/short-reads/2025/01/15/about-a-quarter-of-us-teens-have-used-chatgpt-for-schoolwork-double-the-share-in-2023/)Cited by:[§1](https://arxiv.org/html/2606.11744#S1.p1.1)\.
- A\. Siren and V\. Tzerpos \(2022\)Automatic learning path creation using oer: a systematic literature mapping\.IEEE Transactions on Learning Technologies\.Cited by:[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p1.1)\.
- E\. J\. Sondik \(1971\)The optimal control of partially observable markov processes\.Stanford University\.Cited by:[§4\.2](https://arxiv.org/html/2606.11744#S4.SS2.p1.1)\.
- S\. Tio, D\. Li, and P\. Varakantham \(2025\)EduQate: generating adaptive curricula through rmabs in education settings\.Cited by:[§6](https://arxiv.org/html/2606.11744#S6.p2.1)\.
- J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: [§4.3](https://arxiv.org/html/2606.11744#S4.SS3.p2.1).
- M\. Towers, A\. Kwiatkowski, J\. Terry, J\. U\. Balis, G\. D\. Cola, T\. Deleu, M\. Goulão, A\. Kallinteris, M\. Krimmel, A\. KG, R\. Perez\-Vicente, A\. Pierré, S\. Schulhoff, J\. J\. Tai, H\. Tan, and O\. G\. Younis \(2025\)Gymnasium: a standard interface for reinforcement learning environments\.External Links:2407\.17032,[Link](https://arxiv.org/abs/2407.17032)Cited by:[2nd item](https://arxiv.org/html/2606.11744#S1.I1.i2.p1.1)\.
- K\. VanLehn \(2011\)The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems\.Educational Psychologist46\(4\),pp\. 197–221\.External Links:[Document](https://dx.doi.org/10.1080/00461520.2011.611369)Cited by:[§2\.3](https://arxiv.org/html/2606.11744#S2.SS3.p1.1)\.
- S\. Wei, M\. Zhang, X\. Lin, B\. Jiang, K\. Kuang, and Z\. Dai \(2026\)UCO: a multi\-turn interactive reinforcement learning method for adaptive teaching with large language models\.External Links:2511\.08873,[Link](https://arxiv.org/abs/2511.08873)Cited by:[§2\.1](https://arxiv.org/html/2606.11744#S2.SS1.p1.1)\.
- E\. Worden, C\. Heffernan, N\. Heffernan, and S\. Sonkar \(2026\)FoundationalASSIST: an educational dataset for foundational knowledge tracing and pedagogical grounding of llms\.arXiv preprint arXiv:2602\.00070\.Cited by:[§2\.1](https://arxiv.org/html/2606.11744#S2.SS1.p3.1),[§2\.2](https://arxiv.org/html/2606.11744#S2.SS2.p2.1),[§5\.2](https://arxiv.org/html/2606.11744#S5.SS2.p2.1)\.
- M\. Zerkouk, M\. Mihoubi, and B\. Chikhaoui \(2025\)A comprehensive review of ai\-based intelligent tutoring systems: applications and challenges\.arXiv preprint arXiv:2507\.18882\.Cited by:[§1](https://arxiv.org/html/2606.11744#S1.p3.1)\.
## Appendix A End-to-End RL for Socratic Qualities
Model: finetuned
Teacher: That’s great to hear! You’ve told me that you’re interested in learning about Pythagorean Theorem. […] Can you tell me, do you know how it relates to triangles?
Student: The theorem can help us find the perimeter of any triangle if we know two sides.
Teacher: That’s a great insight! You mentioned that you’re telling me the theorem can help us find the perimeter of any triangle if we know two sides. […] Do you think you can apply this theorem to right triangles specifically?
Student: I guess the theorem must work for most triangles, but maybe it works best if one angle is 90 degrees.
Teacher: That’s a great insight! You’ve already mentioned that you think the Pythagorean Theorem is […] How does this apply to finding the sides of a right triangle?
(a) Repetitive questioning; little progress.

Model: Qwen-2.5-7b
Teacher: Of course! […] Can you tell me what you think the Pythagorean Theorem is about?
Student: The theorem says that any triangle’s sides can be related with a squared plus b squared equals c squared.
Teacher: That’s a great start! […] Could you explain why $a^{2}+b^{2}=c^{2}$ applies specifically to right triangles?
Student: I think it works for triangles where one side is longer than the others, so their squares add up to the biggest side.
Teacher: That’s a good intuition! […] Could you remind me which side is considered the ‘longest’ in a right triangle?
(b) Focused, content-driven guidance.

Model: LearnLM
Teacher: I can absolutely help with that! […] What do you already know about right triangles?
Student: A right triangle is any triangle where all three sides are the same length and one angle is 90 degrees.
Teacher: You’re right that a right triangle has one 90-degree angle. […] Is that true for *all* right triangles?
Student: So, some right triangles might have all equal sides, but usually, at least two sides are different.
Teacher: That’s a great clarification! […] Do the sides have specific names – something special about the side opposite the 90-degree angle?
(c) Gentle correction; sustained engagement.

Figure 4: Comparison of three models tutoring the same student opening on the Pythagorean Theorem. Teacher turns abbreviated for space.

### A.1 Motivation
In a bid to optimize models for Socratic qualities, we ran a small scale experiment that uses end\-to\-end reinforcement learning finetuning to instill these qualities directly into a model\. The goal of this experiment is to test whether a generic instruction tuned model can be shaped into a Socratic tutor purely through reward signals, without relying on the curriculum knowledge graph that we use elsewhere in this work\. We view this as a natural baseline to compare against our main approach, since end\-to\-end RL is a common recipe in recent tutoring work and it lets us probe what behaviors emerge when the model is left to discover its own dialogic strategy\.
### A.2 Simulator Setup
We reuse the student simulator described in our main experiments \(see Section[5\.1](https://arxiv.org/html/2606.11744#S5.SS1)\), but we strip out the curriculum knowledge graph so that the model has to organize the conversation on its own\. The Socratic qualities we target are taken from the arena\-style evaluation introduced by the LearnLM team\[LearnLM Team,[2025](https://arxiv.org/html/2606.11744#bib.bib6)\], who define what a good tutor looks like across a small set of pedagogical dimensions\. We adopt their framing because it gives us a concrete and externally validated target for what we want the trained model to do\.
In this simplified simulator, we assume that the student has a single objective query, which corresponds to one node in the curriculum graph used in the main body, and that the student has an innate $k_v$ value associated with that node. The model has two broad moves available to it during a conversation. It can break the current concept down into a more foundational concept, or it can advance the conversation by using the prior dialogue as context to discuss a higher level concept. This mirrors the upward and downward navigation that a human tutor would naturally do when probing what a learner already knows.
### A.3 Training Procedure
We train the tutor model with Group Relative Policy Optimization (GRPO) \[Shao et al., [2024](https://arxiv.org/html/2606.11744#bib.bib41)\], using Qwen2.5-7B-Instruct as the policy that interacts with the student simulator. For the reward signal, we use Qwen2.5-14B-Instruct as a judge model that reads the full conversation and assigns a reward of $1.0$ for each of the following pedagogical qualities that the conversation exhibits, namely Manages Cognitive Load, Inspires Active Learning, Deepens Metacognition, Stimulates Curiosity, and Adapts to the Learner. The total reward for a conversation is the sum across these five qualities, which means a perfect conversation receives a reward of $5.0$.
We chose a judge based reward over a hand designed reward because the qualities we care about are hard to capture with simple heuristics\. A larger judge model also gives us a smoother training signal than the policy could provide for itself, and it follows the same recipe that recent works on RL with LLM judges have used\.
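To make the reward shape concrete, the following sketch sums the five per-quality scores and then normalises rewards within a sampled group, which is the group-relative advantage at the core of GRPO. The judge call itself is not modelled: the verdict dictionaries are hypothetical stand-ins for judge output, and the use of the population standard deviation and the small `eps` constant are assumptions made for this illustration.

```python
from statistics import mean, pstdev

QUALITIES = (
    "Manages Cognitive Load",
    "Inspires Active Learning",
    "Deepens Metacognition",
    "Stimulates Curiosity",
    "Adapts to the Learner",
)

def conversation_reward(judged: dict) -> float:
    """Add 1.0 for every quality the judge marked as exhibited (max 5.0)."""
    return float(sum(1 for q in QUALITIES if judged.get(q, False)))

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group (GRPO-style advantage)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical judge verdicts for a group of three sampled conversations.
group = [
    {q: True for q in QUALITIES},                    # all five qualities
    {q: (i < 3) for i, q in enumerate(QUALITIES)},   # first three only
    {q: (i < 1) for i, q in enumerate(QUALITIES)},   # first one only
]
rewards = [conversation_reward(verdict) for verdict in group]
print(rewards)  # [5.0, 3.0, 1.0]
print([round(a, 2) for a in group_relative_advantages(rewards)])  # [1.22, 0.0, -1.22]
```

Because the reward only counts which qualities the judge ticks, nothing in it checks factual correctness or variety of phrasing, which is consistent with the formulaic behaviour reported in the qualitative findings.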
### A.4 Qualitative Findings
We provide a qualitative visualization of the conversations generated by our finetuned model, the base Qwen2\.5\-7B\-Instruct, and LearnLM in Figure[4](https://arxiv.org/html/2606.11744#A1.F4)\. Looking at these conversations side by side makes it easy to see how each model handles the same student query and how their styles diverge\.
We discover that the finetuned Qwen2\.5\-7B has collapsed into reward hacking behavior\. In particular, the model leans heavily on a small set of repeated openers such as “That’s great to hear\! You mentioned that you’re…” across almost every turn\. This gives the tutor a very artificial quality and we deem it ill\-suited for the kind of natural conversation we would like to see between tutors and students\. In informal reads of the transcripts, the tutor feels formulaic rather than responsive, and we find that this does not provide an engaging experience for the student\. For these reasons we refrain from using this model in deployed systems\.
### A.5 Agreeable Inaccuracies as a Safety Concern
Table 3: Agreeable Inaccuracy rates across models, judged by GPT-4.1 and ordered by errors per 1,000 words.

A second issue cuts across more than just our finetuned model. Earlier works like SocraticLM \[Liu et al., [2024](https://arxiv.org/html/2606.11744#bib.bib9)\] and TutorRL \[Dinucu-Jianu et al., [2025](https://arxiv.org/html/2606.11744#bib.bib10)\], as well as our finetuned model in this case, are optimized to encourage the student and are punished for producing discouraging comments. As a concrete example, the reward favors phrases like “That’s a good try” over phrases like “That’s not correct.” While this push toward warmth is well intentioned, it creates a side effect that we think is worth flagging.
We observe an elevated rate of what we call Agreeable Inaccuracies, where the tutor affirms a factually wrong statement from the student rather than correcting it. A simple example would be the tutor responding with “That’s a good insight! You mentioned that ice freezes at 32°C!” even though the student has stated something that is not true (water freezes at 0°C, or 32°F). The encouragement objective ends up overriding the correctness objective, and the model takes the path of least resistance by agreeing.
To measure how often this happens, we use GPT\-4\.1 as a judge to read conversations from each of the models in question and to flag turns where the tutor either affirms or fails to correct a factual error from the student\. We present the results in Table[3](https://arxiv.org/html/2606.11744#A1.T3)\. The picture is mixed once we look at both error rate per 1,000 words and the absolute number of errors together, since a low rate over a long corpus can still translate into many incorrect affirmations in practice, while a high rate over a short corpus may reflect only a handful of slips\. For example, our Finetuned model has the lowest rate but still produces 40 total errors over its corpus, and SocraticLM has both a high rate and the largest absolute count at 83 errors, so we cannot cleanly claim that any single tutor model is safe on this axis\. We raise Agreeable Inaccuracies as a safety concern for deployed education systems, since a tutor that systematically confirms wrong answers can entrench misconceptions rather than fix them, and we flag a more careful study of this trade off as future work\.
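The distinction between rate and absolute count can be made concrete with a small calculation. The corpus sizes and error counts below are hypothetical values chosen only to illustrate the arithmetic; they are not the figures from Table 3.

```python
def errors_per_1000_words(num_errors: int, num_words: int) -> float:
    """Agreeable Inaccuracies normalised by corpus length."""
    return 1000.0 * num_errors / num_words

# Hypothetical corpora (NOT the paper's data):
long_corpus_rate = errors_per_1000_words(num_errors=40, num_words=80_000)
short_corpus_rate = errors_per_1000_words(num_errors=6, num_words=3_000)

print(long_corpus_rate)   # 0.5 per 1,000 words, yet 40 wrong affirmations in total
print(short_corpus_rate)  # 2.0 per 1,000 words, from only 6 slips
```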
## Appendix B Policy Behavior
## Appendix C POMDP Details
### C.1 POMDP Formulation
The tutoring problem is formulated as a partially observable Markov decision process (POMDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R}, \Omega, \gamma)$.
#### State Space $\mathcal{S}$.
The true state at step $t$ is $s_t = \{m_v^t, n_v^t\}_{v \in \mathcal{V}}$, where $m_v^t \in \{0,1\}$ denotes the latent mastery of node $v$ and $n_v^t \in \mathbb{Z}_{\geq 0}$ denotes the number of successful interactions completed at $v$. Mastery updates deterministically: $m_v^{t+1} = \mathbb{1}[n_v^{t+1} \geq k_v]$, so $m_v$ flips from 0 to 1 as soon as the cumulative successful interaction count meets the required investment, after which further interactions at $v$ yield no additional benefit. Mastery is nonetheless latent from the perspective of the agent, because $k_v$ is hidden: even if the agent knew $n_v^t$, it cannot determine whether $n_v^t \geq k_v$ without knowledge of $k_v$.
#### Action Space $\mathcal{A}$.
At each step, the agent selects a node $a_t \in \mathcal{V}$ to interact with. The graph is padded to a fixed size $|\mathcal{V}| = N_{\max} = 20$, so the action space is $\mathcal{A} = \mathrm{Discrete}(N_{\max})$. Invalid actions (padding nodes) are excluded via an action mask $\mathcal{M}_t \subseteq \mathcal{A}$, exposing only nodes present in $\mathcal{G}$.
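As a rough illustration of the masking step, the sketch below builds a boolean mask over the 20 padded action slots. It assumes, purely for illustration, that real nodes occupy the first indices and padding fills the rest; the function name `action_mask` is hypothetical and does not come from the paper.

```python
N_MAX = 20  # fixed padded graph size from the text

def action_mask(num_real_nodes: int, n_max: int = N_MAX) -> list:
    """True for slots holding a real node of the graph, False for padding."""
    if not 0 < num_real_nodes <= n_max:
        raise ValueError("graph must have between 1 and n_max nodes")
    # Assumption: real nodes are stored at indices 0 .. num_real_nodes - 1.
    return [index < num_real_nodes for index in range(n_max)]

mask = action_mask(7)
valid_actions = [a for a, allowed in enumerate(mask) if allowed]
print(valid_actions)  # [0, 1, 2, 3, 4, 5, 6]
```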
#### Observation Space $\mathcal{O}$.
The agent does not observe the true state $s_t$ directly. Instead, it receives an observation $o_t \in \mathbb{R}^{N_{\max}(5+N_{\max})}$ constructed from the adjacency matrix of $\mathcal{G}$ and a per-node feature vector. Concretely, the features for node $v$ are:

$$\phi_v = \bigl[\, n_v,\ c_v^{+},\ \hat{p}_v,\ \mathbb{1}[c_v^{+} > 0],\ \mathrm{neg}_v \,\bigr] \tag{3}$$

Table [4](https://arxiv.org/html/2606.11744#A3.T4) summarises each component. Together, these five features constitute a sufficient statistic for the agent’s belief over mastery at each node, summarising all interaction history relevant to estimating whether $n_v \geq k_v$. The full observation is $o_t = [\mathrm{vec}(A),\ \Phi_t] \in \mathbb{R}^{500}$, where $A \in \{0,1\}^{N_{\max} \times N_{\max}}$ is the padded adjacency matrix and $\Phi_t \in \mathbb{R}^{5 \times N_{\max}}$ is the stacked node feature matrix.

Table 4: Per-node observation features. Each feature is computed independently for every node $v \in \mathcal{V}$ and stacked into $\Phi_t \in \mathbb{R}^{5 \times N_{\max}}$.

| Feature | Description | Range | Role in Belief |
|---|---|---|---|
| $n_v$ | Total successful interaction count at $v$ | $\mathbb{Z}_{\geq 0}$ | Tracks investment so far |
| $c_v^{+}$ | Cumulative positive response count at $v$ | $\mathbb{Z}_{\geq 0}$ | Evidence of progression toward mastery |
| $\hat{p}_v = c_v^{+}/n_v$ | Empirical positive response rate at $v$ | $[0,1]$ | Estimate of proximity to $k_v$ |
| $\mathbb{1}[c_v^{+} > 0]$ | Indicator: at least one positive response seen | $\{0,1\}$ | Confirms node is reachable and engaged |
| $\mathrm{neg}_v$ | Consecutive negative response count at $v$ | $\mathbb{Z}_{\geq 0}$ | Signals stall or unmastered prerequisite |
| $A \in \{0,1\}^{N_{\max} \times N_{\max}}$ | Padded adjacency matrix encoding prerequisite structure of $\mathcal{G}$ | | |
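The dimension of the observation follows from flattening the $20 \times 20$ adjacency matrix (400 entries) and appending five features for each of the 20 node slots (100 entries), giving 500 in total. The sketch below assembles such a vector with plain Python lists. The node-major flattening order, the edge direction convention, the value used for $\hat{p}_v$ when $n_v = 0$, and the helper names are all assumptions made for illustration.

```python
N_MAX = 20  # padded graph size from the text

def node_features(n_v, c_pos, neg_streak):
    """Five per-node features in the order of Eq. (3)."""
    # Assumption: the rate is reported as 0.0 before any interaction (n_v = 0).
    p_hat = c_pos / n_v if n_v > 0 else 0.0
    return [float(n_v), float(c_pos), p_hat, float(c_pos > 0), float(neg_streak)]

def build_observation(adjacency, features):
    """Concatenate vec(A) (400 values) with the node features (100 values)."""
    flat_adj = [float(x) for row in adjacency for x in row]
    flat_feat = [x for phi in features for x in phi]
    return flat_adj + flat_feat

# Hypothetical 3-node graph (node 0 requires nodes 1 and 2), zero-padded to N_MAX.
adjacency = [[0] * N_MAX for _ in range(N_MAX)]
adjacency[0][1] = adjacency[0][2] = 1
features = [node_features(0, 0, 0) for _ in range(N_MAX)]
features[1] = node_features(n_v=4, c_pos=3, neg_streak=0)

obs = build_observation(adjacency, features)
print(len(obs))  # 500
```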
#### Transition Dynamics $\mathcal{T}$.
The transition $s_t \to s_{t+1}$ is determined by the selected node $a_t$ and the stochastic response $y_t$. When all prerequisites of $a_t$ are mastered, the interaction is attempted and $n_{a_t}$ increments by 1 on a positive response; otherwise the interaction fails immediately and $n_{a_t}$ is unchanged. Mastery satisfies a monotonicity property: $m_v^t = 1 \Rightarrow m_v^{t'} = 1$ for all $t' > t$.
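A minimal sketch of these dynamics is given below, with mastery computed as $\mathbb{1}[n_v \geq k_v]$. The class and method names are hypothetical, the response $y_t$ is passed in as a boolean rather than sampled, and checking only direct prerequisites is an assumption (it suffices here because a prerequisite can itself only accumulate successes once its own prerequisites are mastered).

```python
from dataclasses import dataclass, field

@dataclass
class TutoringState:
    prereqs: dict  # node -> list of direct prerequisite nodes
    k: dict        # node -> hidden number of successes required for mastery
    n: dict = field(default_factory=dict)  # node -> successes so far

    def mastered(self, v) -> bool:
        """m_v = 1[n_v >= k_v]; monotone because n_v never decreases."""
        return self.n.get(v, 0) >= self.k[v]

    def step(self, action, positive_response: bool) -> bool:
        """Apply one interaction; return True if it was attempted."""
        if not all(self.mastered(p) for p in self.prereqs.get(action, [])):
            return False  # fails immediately, n_action unchanged
        if positive_response:
            self.n[action] = self.n.get(action, 0) + 1
        return True

state = TutoringState(
    prereqs={"root": ["leaf"], "leaf": []},
    k={"root": 1, "leaf": 2},
)
print(state.step("root", True))  # False: prerequisite "leaf" is not yet mastered
state.step("leaf", True)
state.step("leaf", True)
print(state.mastered("leaf"))    # True: n_leaf = 2 >= k_leaf = 2
print(state.step("root", True), state.mastered("root"))  # True True
```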
#### Observation Function $\Omega$.
A positive response $y_t = 1$ at node $a_t$ is unambiguous: it indicates demonstrated understanding at $a_t$ and, by the prerequisite structure, confirms mastery of all $v \in \mathrm{anc}(a_t)$. Positive signals therefore propagate downward through the graph, updating the belief over the full ancestor subgraph. A negative response is ambiguous: it may reflect incomplete mastery at $a_t$ itself, or at any unmastered node in $\mathrm{anc}(a_t)$, and constitutes only weak evidence against the interacted node and its prerequisite subgraph.
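The asymmetry between the two signals can be sketched as a simple belief update over a set of nodes believed mastered. This is an illustrative simplification, not the paper's implementation: it assumes $\mathrm{anc}(a_t)$ denotes all direct and indirect prerequisites of $a_t$, treats a positive response as hard confirmation of those nodes, and leaves the belief untouched on a negative response. The function names are hypothetical.

```python
def ancestors(node, prereqs):
    """All direct and indirect prerequisites of `node` (assumed anc(node))."""
    seen, stack = set(), list(prereqs.get(node, []))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(prereqs.get(v, []))
    return seen

def update_belief(believed_mastered, action, positive, prereqs):
    """Positive response confirms every prerequisite; negative is ambiguous."""
    if positive:
        return believed_mastered | ancestors(action, prereqs)
    return set(believed_mastered)  # no hard update on a negative response

prereqs = {"root": ["sub"], "sub": ["leaf1", "leaf2"], "leaf1": [], "leaf2": []}
belief = update_belief(set(), "sub", True, prereqs)
print(sorted(belief))  # ['leaf1', 'leaf2']
```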
#### Reward $\mathcal{R}$.
The primary objective is mastery of the root node $r$. Since mastery of $r$ requires mastery of all $v \in \mathcal{V} \setminus \{r\}$, the reward is:

$$\mathcal{R}(s_t, a_t) = \begin{cases} +10 & \text{if } m_r^{t+1} = 1 \\ -\delta & \text{otherwise} \end{cases} \tag{4}$$

where $\delta > 0$ is a small per-step penalty that encourages the agent to achieve root mastery efficiently. The episode terminates either upon root mastery or upon exhausting the step budget $T = 100$.
#### Discount Factor $\gamma$.
We use $\gamma = 1$ (undiscounted), consistent with the episodic, budget-constrained nature of the task.
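To make the interaction between the reward in Eq. (4), the step budget, and $\gamma = 1$ concrete, the sketch below accumulates the undiscounted return of one episode. The value `delta=0.01` is an illustrative assumption (the text only states that $\delta$ is small and positive), and `episode_return` is a hypothetical helper.

```python
def reward(root_mastered_next, delta):
    """Eq. (4): +10 when the root is mastered after the step, else -delta."""
    return 10.0 if root_mastered_next else -delta


def episode_return(steps_to_mastery, delta=0.01, budget=100):
    """Undiscounted (gamma = 1) return of one episode.

    steps_to_mastery: step index (1-based) at which the root becomes mastered,
    or None if the budget runs out first. delta=0.01 is an assumed value.
    """
    total = 0.0
    for t in range(1, budget + 1):
        done = steps_to_mastery is not None and t == steps_to_mastery
        total += reward(done, delta)
        if done:
            break  # the episode terminates upon root mastery
    return total


fast = episode_return(31)       # 30 penalised steps, then +10
failed = episode_return(None)   # 100 penalised steps, no bonus
```

With $\gamma = 1$, an episode that reaches root mastery at step $k$ returns $10 - (k-1)\delta$, so under this reward the only pressure toward shorter episodes comes from the per-step penalty.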
## Appendix D Full Results and Hyperparameters
We present the full results and experiment hyperparameters in Tables [5](https://arxiv.org/html/2606.11744#A4.T5) and [6](https://arxiv.org/html/2606.11744#A4.T6), respectively. Experiments with the RL policy were run on a single RTX 3090, while heuristic methods were run with CPU resources only. LLM methods were run on a single H100 GPU, and closed-source models were run by querying their respective APIs.
Table 5: Full results across all methods including additional comparisons from Table [2](https://arxiv.org/html/2606.11744#S5.T2). Rows shaded green are our proposed methods; rows shaded grey are LLM-based comparisons. Steps is $\Delta\%$ relative to Oracle (lower is better; Oracle raw mean = $31.00 \pm 0.71$ steps). Mastery is reported as a percentage. All CIs are 95% ($t$-distribution). † Methods using ground-truth mastery signals are not deployable.

| Scope | Obs. | Method | n | Success (%) | Steps $\Delta\%$ | Mastery (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Upper bounds (non-deployable) | | | | | | |
| BFS | GT† | Oracle | 300 | 100.0 ± 0.0 | 0.0 | 100.0 ± 0.0 |
| Full | GT† | Gemini-Flash Oracle | 30 | 100.0 ± 0.0 | +8.4 ± 4.0 | 100.0 ± 0.0 |
| Full | GT† | GPT-5-mini Oracle | 30 | 100.0 ± 0.0 | +3.4 ± 4.0 | 100.0 ± 0.0 |
| Full | GT† | TutorRL Oracle | 300 | 0.0 ± 0.0 | +236.2 ± 7.9 | 32.6 ± 1.7 |
| Fixed BFS curriculum | | | | | | |
| BFS | Binary | Mean $k$ | 300 | 0.0 ± 0.0 | +236.2 ± 7.9 | 55.6 ± 1.6 |
| BFS | Binary | Threshold ($\tau = 1$) | 300 | 0.0 ± 0.0 | +236.2 ± 7.9 | 38.3 ± 1.5 |
| BFS | NL | Qwen2.5 (BFS) | 50 | 0.0 ± 0.0 | +236.2 ± 19.9 | 39.0 ± 2.6 |
| BFS | Binary | Threshold ($\tau = 3$) | 300 | 41.7 ± 5.6 | +174.9 ± 9.2 | 84.6 ± 2.0 |
| Full-task methods | | | | | | |
| Full | Binary | Random | 300 | 28.3 ± 5.1 | +217.4 ± 7.3 | 90.6 ± 0.9 |
| Full | Binary | PPO | 300 | 93.3 ± 2.8 | +72.4 ± 5.3 | 98.8 ± 0.7 |
| Full | NL | TutorRL | 300 | 0.0 ± 0.0 | +236.2 ± 7.9 | 30.5 ± 1.6 |
| Full | NL | ThinkingKT + Qwen2.5 | 50 | 0.0 ± 0.0 | +236.2 ± 19.9 | 48.1 ± 4.1 |
| Full | NL | ThinkingKT + TutorRL | 50 | 0.0 ± 0.0 | +236.2 ± 19.9 | 49.3 ± 3.9 |
| Full | NL | GPT-5-mini | 30 | 3.3 ± 6.8 | +234.2 ± 27.5 | 77.1 ± 3.4 |
| Full | NL | Gemini-Flash | 30 | 23.3 ± 16.1 | +223.3 ± 24.4 | 83.9 ± 5.6 |
| Full | NL | PPO + Qwen2.5 | 50 | **78.0 ± 11.9** | **+120.1 ± 23.1** | **95.4 ± 3.0** |
| Full | NL | PPO + TutorRL | 50 | **82.0 ± 11.0** | **+105.6 ± 24.3** | **98.0 ± 1.5** |

Table 6: Hyperparameters for PPO. PPO defaults are from Stable-Baselines3, Raffin et al. [[2021](https://arxiv.org/html/2606.11744#bib.bib39)]. Training hyperparameters were searched via Bayesian tuning.
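As a sketch of how a 95% $t$-distribution interval of the kind reported in Table 5 can be computed for a success rate, the example below treats each episode as a 0/1 outcome and uses the sample standard deviation. The function name `mean_ci95` is hypothetical, the critical value 2.045 is the standard two-sided 95% value of Student's $t$ with 29 degrees of freedom, and the assumption of exactly one success in 30 episodes is chosen only because it yields an interval of the same form as the `3.3 ± 6.8` entry; the source does not state the raw counts.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Only df = 29 (n = 30) is tabulated here; other sample sizes need their own entry.
T_CRIT_95 = {29: 2.045}


def mean_ci95(samples):
    """Return (mean, half_width) of a 95% t-interval for the sample mean."""
    n = len(samples)
    mean = statistics.fmean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    return mean, T_CRIT_95[n - 1] * standard_error


# Assumed data: 1 successful episode out of 30.
outcomes = [1] + [0] * 29
mean, half_width = mean_ci95(outcomes)
print(f"{100 * mean:.1f} ± {100 * half_width:.1f}")  # prints "3.3 ± 6.8"
```

An interval this wide relative to the mean extends below zero, which is a known limitation of symmetric $t$-intervals for proportions near 0 or 1 at small sample sizes.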
## Appendix E Topic Dataset
We list all topics used for training and evaluation below. Each topic defines the root node of a knowledge graph (DAG) generated by an LLM and constitutes one curriculum episode. We use Claude Sonnet 4.6 to generate the knowledge graphs for our experiments. Non-STEM topics are typeset in red; all remaining topics are drawn from science, technology, engineering, or mathematics.
### Training Topics \(80 topics\)
- Physics (18): Newton’s First Law of Motion; Newton’s Second Law of Motion; Gravity; Electromagnetic Radiation; Waves and Sound; Light and Optics; Electricity and Magnetism; Theory of Relativity; Kinetic Theory; Phase Changes; Simple Machines; Momentum; Friction; Buoyancy; Magnetism; Electric Circuits; Speed and Velocity
- Chemistry (9): Atoms; Elements; Molecules; Atomic Theory; Stoichiometry; Molecular Structure; pH Scale; Mixtures and Solutions; Osmosis and Diffusion
- Biology (24): Photosynthesis; Cell Division; The Circulatory System; The Respiratory System; Neurons; Food Webs; Food Chains; Symbiosis; Adaptation; Natural Selection; Immune System; Digestion; Reproduction; Heredity; Germ Theory; Mitosis and Meiosis; Fossil Record; Biodiversity; Conservation Biology; Protein Synthesis; The Nitrogen Cycle; The Carbon Cycle; Animal Behavior; Migration Patterns
- Earth Science (7): Climate Change; The Solar System; Renewable Energy; Erosion and Weathering; The Water Cycle; Volcanic Activity; Earthquakes
- Mathematics (3): Geometry; Trigonometry; Statistics and Probability
- History (7): The French Revolution; The Renaissance; The Roman Empire; The American Revolution; The Enlightenment; The Industrial Revolution; Colonialism
- Literature (9): Allegory; Personification; Irony; Narrative Structure; Symbolism; Point of View; Conflict and Resolution; Theme; Imagery
- Social Sciences (4): Supply and Demand; Communism; Psychoanalysis; Cultural Diffusion
### Evaluation Topics \(10 topics\)
The evaluation set is a held\-out stratified sample of 10 topics\.
## NeurIPS Paper Checklist
1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: [Yes]
   - Justification: We make a concerted effort to introduce the problem and present our proposed ideas through experimentation and empirical results.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: This is covered in Section 5.5.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [N/A]
   - Justification: We do not have theoretical results and instead focus on empirical results.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: We provide the code and hyperparameters in the supplementary materials and in the appendix. To the best of our abilities, we describe the process in the main body of the text.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [Yes]
   - Justification: Code, simulator, and base data are provided in the supplementary materials.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: Specific details are provided in the appendix and the experiment section. Since we are using RL, we also provide the simulator and environment details. The held-out test set is described in the appendix and provided in the materials.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [Yes]
   - Justification: We report confidence intervals (CIs) for our empirical results.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [Yes]
   - Justification: We provide details in the experiments appendix.
9. Code of ethics
   - Answer: [Yes]
   - Justification: We have followed the Code of Ethics to the best of our abilities.
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: [Yes]
    - Justification: The premise of our work is grounded in societal impact: we address the everyday usage of LLMs for learning.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
    - Answer: [N/A]
    - Justification: Our work is largely based on openly available resources, and as such is aligned with these materials.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: [Yes]
    - Justification: We acknowledge and cite all prior work that we build upon.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: [Yes]
    - Justification: We release the code for our work.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    - Answer: [N/A]
    - Justification: NA
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    - Answer: [N/A]
    - Justification: NA. We do not have human subjects experiments.
16. Declaration of LLM usage
    - Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
    - Answer: [Yes]
    - Justification: We describe how LLMs are used both as a tutor and as a student simulator in our methods and setup sections.