Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study
Summary
This paper proposes a learner model-based rubric to evaluate the adaptivity of Vision Language Models (VLMs) in mathematics education. Experiments show measurable differences in adaptivity across models and reveal that current VLMs struggle to produce consistent learner-adaptive instructional responses.
# Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study
Source: [https://arxiv.org/html/2605.16011](https://arxiv.org/html/2605.16011)
Jie Gao \(1,2\), Yongan Yu \(1,2\), Junzhu Su \(1\), Yiran Lin \(1\), Adam K\. Dubé \(1\), Jackie Chi Kit Cheung \(1,2,3\)

\(1\) McGill University; \(2\) Mila – Quebec AI Institute; \(3\) Canada CIFAR AI Chair

\{jie\.gao3, yongan\.yu, junzhu\.su, yiran\.lin\}@mail\.mcgill\.ca, adam\.dube@mcgill\.ca, cheungja@mila\.quebec
###### Abstract
Adaptive learning refers to educational technologies that track learners’ progress and adapt the instructional process to each learner’s performance\. It is increasingly recognized as critical for developing effective learning support tools\. Vision language models \(VLMs\) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction\. However, it is unknown whether VLMs can adapt to different learner profiles when providing mathematical instruction, and no systematic evaluation framework exists for this adaptivity in mathematics tutoring tasks\. To address this gap, we draw on the learner model from the adaptive learning framework \(Shute and Towle, [2018](https://arxiv.org/html/2605.16011#bib.bib25)\) and propose a learner model\-based rubric\. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity\. We also evaluate two additional dimensions of VLM responses: correctness \(of answers and solutions\) and quality \(of the response itself\)\. Our experimental results show measurable differences in adaptivity across models, and also reveal that current VLMs struggle to consistently produce learner model\-based instructional responses, especially when receiving limited learner information\.
Equal contribution: Jie Gao and Yongan Yu\. Corresponding authors: Adam K\. Dubé and Jackie Chi Kit Cheung\.
## 1 Introduction
As artificial intelligence sees growing use in education, there is a corresponding rise in large language model\-powered tools for providing mathematical instruction and pedagogical support\. Learners increasingly adopt VLMs as personal tutoring agents to answer mathematical problems and provide personalized solution feedback \(Baral et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib42); Cherian et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib43); Gupta et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib41); Liu et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib49)\)\. Personalization is both a defining characteristic of personal tutoring and a critical principle in education, as educators have historically adapted their instruction to provide personalized and differentiated learning support based on diverse learner backgrounds \(Bernacki et al\., [2021](https://arxiv.org/html/2605.16011#bib.bib36)\)\.
Figure 1: Overview of the learner model\-based adaptivity schema\. The framework delineates the interplay between learner profile and instructional scaffolding\.

To provide personalized learning that addresses individual learning needs, it is vital to understand learners’ performance before making instructional decisions\. Adaptive learning, as one of the most effective educational approaches, involves providing personalized learning instruction and feedback by dynamically adapting the learning process to different learners’ needs \(Imhof et al\., [2020](https://arxiv.org/html/2605.16011#bib.bib24); Shute and Towle, [2018](https://arxiv.org/html/2605.16011#bib.bib25); Vanzo et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib9); Zheng et al\., [2022](https://arxiv.org/html/2605.16011#bib.bib44)\)\. Its goal is to maximize the effectiveness of instruction by tracking learners’ prior academic background and learning performance and adjusting the content to suit different learners \(Shute et al\., [2000](https://arxiv.org/html/2605.16011#bib.bib26)\)\. A foundational component of adaptive learning is the learner model, which captures key aspects of a learner, such as learning performance and learning attitudes, in order to support individualized instruction\. Several studies have demonstrated that adaptive learning is superior to one\-size\-fits\-all teaching approaches \(Du Plooy et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib59); Wang et al\., [2023](https://arxiv.org/html/2605.16011#bib.bib39)\)\. Current VLMs primarily collect learner information from conversational interactions \(Kumar et al\., [2023](https://arxiv.org/html/2605.16011#bib.bib3)\)\. When learners do not explicitly share details about their learning performance, behaviors, or needs, it can be difficult to infer these learners’ profiles, which limits instructional decision\-making \(Song et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib60)\)\. Consequently, it remains unclear whether VLMs generate truly appropriate answers and solutions to support individual learners’ needs \(Sharma et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib2)\)\.
This paper addresses this critical research gap by developing an adaptive rubric grounded in the learner model of adaptive learning frameworks\. Our rubric incorporates core evaluation factors from learner models, including cognitive and motivational aspects\. The rubric aims to assess VLMs’ adaptive capabilities in mathematics tutoring tasks and their capacity to deliver pedagogical instruction that aligns with learners’ knowledge backgrounds\. Accordingly, our research question is: "To what extent can VLMs demonstrate adaptive instructional capabilities in visually grounded mathematics tutoring tasks?"
Our contributions are summarized as follows: \(1\) We propose a learner model\-based rubric to systematically evaluate the adaptivity of VLMs in mathematics tutoring tasks, grounded in cognitive, motivational, and complexity dimensions, as illustrated in Figure [1](https://arxiv.org/html/2605.16011#S1.F1)\. \(2\) We construct a human\-annotated evaluation dataset based on TIMSS mathematics items, incorporating diverse learner profiles and controlled contextual conditions to simulate realistic tutoring scenarios\. The model generation pipeline and rubric design are available at our GitHub repository: [https://github\.com/Michaelyya/Adaptive\-LLMs](https://github.com/Michaelyya/Adaptive-LLMs)\. \(3\) We conduct extensive experiments on five state\-of\-the\-art VLMs, providing first\-hand analysis of the challenges in generating learner\-aware instructional responses, particularly under limited learner information\.
## 2 Background & Related Work
### 2\.1 Adaptive Learning: The Learner Model
The learner model forms the foundation of adaptive learning by integrating assessment information and supporting inferences about learners’ proficiencies \(Shute and Towle, [2018](https://arxiv.org/html/2605.16011#bib.bib25)\)\. Personalization within adaptive learning systems begins with systematically appraising learner characteristics \(Zhou et al\., [2023](https://arxiv.org/html/2605.16011#bib.bib46)\)\. The systematic review of Bernacki et al\. \([2021](https://arxiv.org/html/2605.16011#bib.bib36)\) identified key learner characteristics across four dimensions: cognitive factors, beliefs and motivations, social and cultural factors, and emotions and affect\. These dimensions directly inform the structural design of our rubric\. For instance, understanding a learner’s prior knowledge \(a cognitive factor\) is essential before providing instruction\. Accordingly, our rubric includes the current knowledge criterion to evaluate how VLMs identify learners’ prior levels to avoid redundant or overly challenging content\. Similarly, Plass and Pawar \([2020](https://arxiv.org/html/2605.16011#bib.bib22)\) categorized variables that can be adapted in instructional responses into four domains: cognitive, motivational, affective, and sociocultural\. Within the motivational domain, self\-efficacy reflects a learner’s perceived capacity to perform domain\-specific tasks\. As a result, a learner may display a generally positive attitude toward learning while simultaneously holding low confidence in completing specific mathematics tasks \(Kaskens et al\., [2020](https://arxiv.org/html/2605.16011#bib.bib37); Wen and Dubé, [2022](https://arxiv.org/html/2605.16011#bib.bib38)\)\. Our work synthesizes insights from previous research and considers the VLMs’ conversational attributes in constructing the adaptive rubric\.
### 2\.2 The Role of \(V\)LMs in Education
In the domain of mathematics, \(V\)LMs have demonstrated remarkable proficiency as "solvers," achieving strong performance on both text\-based \(e\.g\., GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2605.16011#bib.bib16)\), MathQA \(Amini et al\., [2019](https://arxiv.org/html/2605.16011#bib.bib15)\)\) and vision\-based benchmarks \(e\.g\., MATH\-Vision \(Wang et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib11)\), MathVerse \(Zhang et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib13)\)\)\. However, a critical distinction exists between automated solvers and educational agents: solvers prioritize problem\-solving efficiency and final answer correctness \(Yan et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib21)\), whereas educational agents must provide pedagogically transparent guidance and scaffolding appropriate to the learner’s zone of proximal development \(Cai et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib12); Yu et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib10)\)\. Recent work suggests that LLMs can mimic tutor\-like behavior in interactive sessions \(Vanzo et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib9)\), and Scarlatos et al\. \([2025](https://arxiv.org/html/2605.16011#bib.bib20)\) further highlight their potential role in knowledge tracing\. Despite this progress, existing work tends to overlook the multimodal nature of authentic mathematics learning\. Real\-world mathematical tasks often require students to reason across modalities, integrating textual instructions with geometric diagrams, data visualizations, and function plots\. The emergence of VLMs, including GPT\-4V, LLaVA, and Qwen\-VL, enables joint visual\-textual reasoning \(Lu et al\., [2023](https://arxiv.org/html/2605.16011#bib.bib8)\)\. Current evaluation paradigms, however, largely mirror text\-only settings, emphasizing visual question\-answering accuracy rather than instructional reasoning quality \(Lee and Zhai, [2025](https://arxiv.org/html/2605.16011#bib.bib7)\)\. Such accuracy\-focused evaluations fail to capture a model’s ability to explain visual concepts or adapt guidance to a learner’s context, leaving limited empirical evidence on whether multimodal models can serve as adaptive tutors\.
Figure 2: Overview of the adaptive rubric–based evaluation pipeline\. Learner profiles and mathematics questions are combined under four test groups with increasing learner\-specific information, provided to VLMs to generate instructional responses, and annotated by experts using the proposed rubric\.
### 2\.3 Empirical Study in Adaptive Learning
The integration of VLMs into educational technologies represents a paradigm shift for intelligent tutoring systems \(Liu et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib19)\)\. While generative model\-based systems offer the potential for real\-time customization of instructional content based on student behavior \(Banjade et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib17); Lin et al\., [2023](https://arxiv.org/html/2605.16011#bib.bib18)\), their capacity to deliver contextually appropriate guidance remains questionable\. Notably, Borchers and Shou \([2025](https://arxiv.org/html/2605.16011#bib.bib6)\) employed a prompt variation framework to assess pedagogical adaptivity, revealing that even state\-of\-the\-art models frequently fail to adjust their feedback strategies when critical context \(e\.g\., student performance or knowledge components\) is introduced or removed\. This suggests a "context insensitivity" where models revert to generic tutoring scripts rather than responding to the specific state of the learner\. To address this gap, we aim to provide a rigorous, rubric\-based assessment of how well VLMs adhere to specific learner profiles\.
## 3 Rubric Design for Adaptivity Evaluation
Our rubric is designed to evaluate whether VLMs demonstrate adaptive instructional intelligence when tailoring support for learners with diverse mathematical proficiencies and backgrounds\. From theory to metric, the rubric is structured across three axes: Adaptivity, Correctness, and Quality\. This structure was selected to disentangle pedagogical intent from technical accuracy, ensuring that the rubric can isolate a model’s adaptivity from its underlying linguistic and mathematical capabilities\. Among these, Adaptivity constitutes the core evaluative focus, while Correctness and Quality serve as essential baseline controls to ensure mathematical rigor and linguistic clarity\. In this work, we operationalize a VLM response into two distinct components: the "Answer" \(the final result\) and the "Solution" \(the comprehensive instructional content\)\. The "Solution" encompasses all reasoning steps, feedback, and scaffolding strategies beyond the final answer\. The proposed rubric assesses whether the model meets the baseline standard of adaptive instructional behavior\. At a high level, the adaptivity dimension assesses whether model responses are appropriately aligned with both the instructional intent of the prompt and the learner’s background\. The correctness dimension evaluates whether the answers and solutions are mathematically valid\. The quality dimension examines overall VLM response quality\.
Adaptivity\. The adaptivity dimension includes a total of nine key questions \(Q1–Q9; see Appendix [A\.1](https://arxiv.org/html/2605.16011#A1.SS1)\)\. The evaluation elements of these questions were selected based on the core factors of adaptive learning environments \(Bernacki et al\., [2025](https://arxiv.org/html/2605.16011#bib.bib23); Plass and Pawar, [2020](https://arxiv.org/html/2605.16011#bib.bib22)\), while considering the interactive characteristics of VLMs\. The primary factors in this rubric include: the learner’s current mathematical knowledge, the learner’s math learning attitude, the learner’s self\-efficacy, the solution’s mathematical knowledge scope, the solution difficulty, and the solution complexity\. For instance, when a learner presents a negative math learning attitude, an adaptive response should offer positive feedback and instruction to help the learner address the math problem, rather than continuing the conversation in a negative tone or ignoring the attitude \(Wen and Dubé, [2022](https://arxiv.org/html/2605.16011#bib.bib38)\)\. In addition, to establish a rigorous and fine\-grained normative reference in our rubric, we integrated the Common Core State Standards \(CCSS\) as a key dimension\. Although TIMSS frameworks provide broad domain categories, CCSS provides a micro\-level taxonomy for evaluation and exhibits 90% conceptual alignment with TIMSS frameworks \(Schmidt and Houang, [2012](https://arxiv.org/html/2605.16011#bib.bib58)\)\. Solution complexity is assessed across Q6–Q9, which examine whether the VLM response helps the learner understand and explain the math question, provides relevant examples, offers additional similar math exercises for reinforcement, and delivers step\-by\-step, detailed instructions\.
Correctness\. This dimension assesses whether the VLM responses provide valid and correct instruction and feedback that address the specific math questions\. A response is judged correct only when both the answer and the solution are correct and consistent with each other \(Q10–Q12; see Appendix [A\.2](https://arxiv.org/html/2605.16011#A1.SS2)\)\. In this dimension, three questions were used to evaluate the correctness of the answer and its solution\. Notably, a completely correct solution is defined as one that includes both complete and valid mathematical reasoning and yields a correct final answer\. If the solution demonstrates accurate mathematical reasoning but ends with an incorrect final answer, it should be labeled as partially correct\.
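The labeling rule for solutions can be written as a small Python sketch\. The function name `label_solution` and its two boolean inputs are hypothetical, and the text above defines only two cases; any other combination is returned as `"unspecified"` here rather than guessed\.

```python
def label_solution(reasoning_valid: bool, final_answer_correct: bool) -> str:
    """Label a solution using only the two cases described in the text."""
    if reasoning_valid and final_answer_correct:
        # Complete, valid reasoning that reaches a correct final answer.
        return "completely correct"
    if reasoning_valid:
        # Accurate reasoning that ends with an incorrect final answer.
        return "partially correct"
    # Assumption: the text does not say how invalid reasoning is labeled.
    return "unspecified"
```

Under this sketch, `label_solution(True, False)` gives the partially correct label described above\.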
Quality\. This dimension focuses primarily on evaluating the overall quality of VLM generation\. We conducted fine\-grained, multi\-dimensional annotations and collected multiple crucial evaluation dimensions from existing benchmarks for summarization quality evaluation \(Jones et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib56); Lee et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib55); Tang et al\., [2024](https://arxiv.org/html/2605.16011#bib.bib57)\), including hallucinations, mathematical reasoning, faithfulness, coherence, consistency, grammatical correctness, formatting adherence, and length control\. The quality dimension includes a total of seven questions \(Q13–Q19; see Appendix [A\.3](https://arxiv.org/html/2605.16011#A1.SS3)\)\.
Figure 3: A case study of the adaptive evaluation process for a Grade 4 math problem\. The upper panel shows the VLM\-generated response with motivational support and step\-by\-step scaffolding, while the lower panels illustrate the annotation workflow, where Q1–Q9 map directly to the adaptivity rubric in Appendix [A\.1](https://arxiv.org/html/2605.16011#A1.SS1)\.
## 4 Rubric\-Guided Evaluation of VLMs
To empirically evaluate the adaptive capabilities of VLMs in mathematics tutoring tasks, we implement a structured evaluation pipeline grounded in the proposed learner model–based rubric\. As illustrated in Figure [2](https://arxiv.org/html/2605.16011#S2.F2), the pipeline constructs learner profiles, pairs them with mathematics questions under varying contextual conditions, elicits instructional responses from VLMs, and evaluates the resulting outputs through expert annotation\. This methodology adopts a quasi\-experimental design that simulates realistic tutoring interactions by systematically varying two key factors: learner profiles and test groups\. While we acknowledge that such simulated interactions using synthetic learner profiles are not the final answer to evaluating VLMs in the open\-ended complexity of real\-world classrooms, they represent a necessary and rigorous first step\. By leveraging a controlled synthetic environment, we can isolate specific adaptive behaviors and minimize the confounding variables inherent in live discourse between humans and AI\.
### 4\.1 Dataset and Model Selection
We draw mathematics items from the Trends in International Mathematics and Science Study \(TIMSS\) 2019 and 2023 with official permission from the International Association for the Evaluation of Educational Achievement \(IEA\) \(Fishbein et al\., [2021](https://arxiv.org/html/2605.16011#bib.bib4), [2025](https://arxiv.org/html/2605.16011#bib.bib1)\)\. From the full assessment pool, we construct a stratified dataset of $Q=10$ representative questions, with five questions sampled from Grade 4 and five from Grade 8\. To emphasize multimodal reasoning, we prioritize visually grounded items, resulting in 90% of the selected questions requiring graphical or diagrammatic interpretation\. We evaluate a set of $M=5$ VLMs spanning both proprietary and open\-source paradigms\. The detailed computational costs, model selection, and decoding configuration are provided in Appendix [B](https://arxiv.org/html/2605.16011#A2) and Appendix [C](https://arxiv.org/html/2605.16011#A3)\.
### 4\.2 Learner Profiles and Context Conditions
To simulate realistic tutoring scenarios, we construct learner profiles derived from authentic TIMSS statistics and formalize each learner as a tuple $\ell=(a,c,p,s)$, where $a$ denotes learning attitude, $c$ denotes self\-reported confidence, $p$ represents prior topic coverage, and $s$ corresponds to the learner’s TIMSS achievement score\. Across Grade 4 and Grade 8, we instantiate six profiles spanning three performance tiers \(high, intermediate, and low\)\. Building on this formulation, we study context sensitivity by defining four input context conditions $C_k$ with monotonically increasing learner\-specific information:

$$C_1=\{q\},\quad C_2=\{q,(a,c,p)\},\quad C_3=\{q,S,(a,c,p)\},\quad C_4=\{q,S,(a,c,p,s)\},$$

where $q$ denotes the mathematics question and $S$ denotes a system prompt specifying grade\-level learning objectives\.
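As a minimal sketch of how the four context conditions nest, the Python snippet below assembles the inputs visible to a model under each $C_k$\. The names `LearnerProfile` and `build_context`, the dictionary layout, and the example profile values are illustrative assumptions and are not taken from the authors' pipeline\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerProfile:
    """Learner tuple l = (a, c, p, s); the field names are hypothetical."""
    attitude: str        # a: learning attitude
    confidence: str      # c: self-reported confidence
    prior_coverage: str  # p: prior topic coverage
    timss_score: int     # s: TIMSS achievement score


def build_context(k, question, profile, system_prompt):
    """Return the inputs available under context condition C_k (k = 1..4)."""
    if k not in (1, 2, 3, 4):
        raise ValueError("k must be 1, 2, 3, or 4")
    context = {"question": question}  # C1 = {q}
    if k >= 2:  # C2 adds (a, c, p)
        context["profile"] = {
            "attitude": profile.attitude,
            "confidence": profile.confidence,
            "prior_coverage": profile.prior_coverage,
        }
    if k >= 3:  # C3 adds the system prompt S
        context["system_prompt"] = system_prompt
    if k == 4:  # C4 adds the achievement score s
        context["profile"]["timss_score"] = profile.timss_score
    return context


# Hypothetical example values, for illustration only.
example = LearnerProfile("negative", "low", "topic previously taught", 420)
contexts = {k: build_context(k, "q", example, "S") for k in (1, 2, 3, 4)}
```

Each condition only adds fields to the previous one, which mirrors the monotonically increasing learner information in the definition\.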
Table 1: Performance comparison of VLMs across dimensions\. Values represent the mean score percentage achieved across the rubric for each dimension\. Learner profiles are denoted by icons for High\-, Intermediate\-, and Low\-performance learners\. Bold denotes the best performance; underline denotes the second best\.
### 4\.3 Expert Annotation and Evaluation
Our experimental design follows a full factorial structure\. The total volume of model\-generated responses is calculated as:
$$N=\underbrace{|G|\times|P|}_{\text{Learner Profile}}\times\underbrace{|C|}_{\text{Context}}\times\underbrace{|Q|}_{\text{Task}}\times\underbrace{|M|}_{\text{Model}}=600,$$

where we account for $|G|=2$ grades, $|P|=3$ performance tiers, $|C|=4$ context conditions, $|Q|=5$ questions per grade, and $|M|=5$ models\. Data annotation was conducted by four domain experts trained in the Learning Sciences program\. Prior to annotation, each expert completed a one\-hour one\-on\-one training session that introduced the rubric and included discussion of at least three representative case studies to calibrate the scoring criteria\. Each response was evaluated using structured numerical ratings defined by the rubric \(see Appendix [A](https://arxiv.org/html/2605.16011#A1)\)\. The rubric items were implemented in a Google Form, and annotators rated each VLM response using this instrument\. The Adaptivity and Correctness scores are positively keyed, whereas the Quality score is derived from error\-related items\. Specifically, the Quality score is computed as $1-\frac{1}{7}\sum_{i=13}^{19}e_i$, where $e_i\in\{0,1\}$ indicates whether the corresponding error type \(Q13–Q19\) is present in the response\. All seven error categories are equally weighted in this aggregation, such that higher values consistently indicate better performance\. Statistical analyses are therefore conducted at the response level rather than the item level\. For completeness, we additionally report item\-level analysis in Appendix [D](https://arxiv.org/html/2605.16011#A4)\. A case study is shown in Figure [3](https://arxiv.org/html/2605.16011#S3.F3), where the color\-coded groupings represent the distinct dimensions of adaptivity\. These annotations demonstrate how qualitative text outputs are converted into binary numerical scores \(Yes=1, No=0\)\. To validate the reliability of the annotation process, a subset of 120 responses was independently annotated by core coders, yielding strong inter\-rater agreement with a Cohen’s Kappa of $\kappa=0.87$\. Detailed agreement metrics for each model are reported in Appendix [E](https://arxiv.org/html/2605.16011#A5)\.
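The response count and the scoring rules in this subsection can be checked with a short Python sketch\. The function names, the tuple layout of a design cell, and the two\-rater form of the kappa function are assumptions made for illustration; only the factor sizes, the binary Yes=1/No=0 coding, the Quality formula, and the use of Cohen's Kappa come from the text\.

```python
from collections import Counter
from itertools import product

GRADES = (4, 8)                              # |G| = 2
TIERS = ("high", "intermediate", "low")      # |P| = 3
CONTEXTS = ("C1", "C2", "C3", "C4")          # |C| = 4
QUESTIONS = (1, 2, 3, 4, 5)                  # |Q| = 5 per grade
MODELS = ("GPT-5", "GPT-o1", "Gemini-2.5-flash",
          "Llama3.2-11B-VL", "Qwen3-30B-VL")  # |M| = 5


def design_cells():
    """Enumerate every (grade, tier, context, question, model) combination."""
    return list(product(GRADES, TIERS, CONTEXTS, QUESTIONS, MODELS))


def percent_satisfied(ratings):
    """Percentage of binary rubric items (Yes=1, No=0) that are satisfied."""
    return 100.0 * sum(ratings) / len(ratings)


def quality_score(errors):
    """Quality = 1 - (1/7) * sum of the seven error indicators for Q13-Q19."""
    if len(errors) != 7 or any(e not in (0, 1) for e in errors):
        raise ValueError("expected seven binary error indicators")
    return 1.0 - sum(errors) / 7.0


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (assumed setup)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention used here when both raters use one label only
    return (observed - expected) / (1.0 - expected)
```

For example, `len(design_cells())` reproduces the 600 responses of the factorial design, and a response with one of the seven error types present receives a Quality score of 6/7\.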
| Category | Metric | GPT\-5 G1 | GPT\-5 G2 | GPT\-5 G3 | GPT\-5 G4 | GPT\-5 O | GPT\-o1 G1 | GPT\-o1 G2 | GPT\-o1 G3 | GPT\-o1 G4 | GPT\-o1 O | Gemini\-2\.5\-flash G1 | Gemini\-2\.5\-flash G2 | Gemini\-2\.5\-flash G3 | Gemini\-2\.5\-flash G4 | Gemini\-2\.5\-flash O | Llama3\.2\-11B\-VL G1 | Llama3\.2\-11B\-VL G2 | Llama3\.2\-11B\-VL G3 | Llama3\.2\-11B\-VL G4 | Llama3\.2\-11B\-VL O | Qwen3\-30B\-VL G1 | Qwen3\-30B\-VL G2 | Qwen3\-30B\-VL G3 | Qwen3\-30B\-VL G4 | Qwen3\-30B\-VL O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cognitive | CK \(Q1\) | 31 | 94 | 94 | 89 | 77 | 53 | 66 | 68 | 67 | 63 | 53 | 100 | 100 | 100 | 88 | 36 | 47 | 83 | 78 | 61 | 67 | 100 | 100 | 100 | 92 |
| Cognitive | CCS \(Q4\) | 61 | 94 | 94 | 89 | 85 | 83 | 97 | 100 | 97 | 94 | 83 | 94 | 100 | 100 | 94 | 72 | 92 | 72 | 86 | 81 | 94 | 97 | 94 | 97 | 96 |
| Cognitive | DL \(Q5\) | 44 | 94 | 86 | 89 | 78 | 61 | 86 | 86 | 89 | 81 | 72 | 74 | 89 | 97 | 83 | 67 | 72 | 58 | 75 | 68 | 92 | 100 | 86 | 86 | 91 |
| Motivational | LA \(Q2\) | 3 | 92 | 89 | 78 | 65 | 22 | 43 | 27 | 47 | 35 | 6 | 97 | 100 | 100 | 76 | 11 | 33 | 86 | 81 | 53 | 39 | 94 | 100 | 100 | 83 |
| Motivational | SE \(Q3\) | 0 | 53 | 47 | 53 | 38 | 8 | 11 | 8 | 3 | 8 | 0 | 100 | 97 | 97 | 74 | 3 | 11 | 61 | 53 | 32 | 17 | 94 | 92 | 97 | 75 |
| Complexity | Expl\. \(Q6\) | 6 | 75 | 81 | 61 | 56 | 17 | 83 | 51 | 42 | 48 | 64 | 100 | 97 | 97 | 90 | 69 | 86 | 72 | 86 | 78 | 92 | 100 | 89 | 97 | 94 |
| Complexity | Exam\. \(Q7\) | 0 | 8 | 11 | 6 | 6 | 6 | 14 | 0 | 0 | 5 | 3 | 12 | 8 | 8 | 8 | 3 | 8 | 8 | 3 | 6 | 8 | 8 | 3 | 0 | 5 |
| Complexity | Prac\. \(Q8\) | 0 | 42 | 44 | 22 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 8 | 4 | 0 | 0 | 0 | 0 | 0 |
| Complexity | SBS \(Q9\) | 17 | 86 | 86 | 86 | 69 | 58 | 86 | 78 | 89 | 78 | 92 | 100 | 97 | 94 | 96 | 72 | 64 | 58 | 78 | 68 | 100 | 100 | 100 | 100 | 100 |
| Average | | 19 | 68 | 68 | 63 | 55 | 33 | 60 | 51 | 52 | 49 | 36 | 77 | 78 | 78 | 76 | 37 | 50 | 61 | 64 | 55 | 45 | 86 | 85 | 86 | 81 |

Table 2: Detailed adaptivity dimension performance\. Values represent the percentage \(%\) of responses satisfying rubric criteria, reported by test group \(G1–G4\) and overall \(O\)\. Row metrics are: Cognitive \(CK: Current Knowledge, CCS: Content Scope, DL: Difficulty Level\), Motivational \(LA: Learning Attitude, SE: Self\-Efficacy\), and Complexity \(Expl\.: Explanation, Exam\.: Examples, Prac\.: Practice, SBS: Step\-by\-Step\)\. GPT\-5, GPT\-o1, and Gemini\-2\.5\-flash are proprietary models; Llama3\.2\-11B\-VL and Qwen3\-30B\-VL are open\-source models\.
## 5 Results
We examine the relationship between learner profiles and VLMs’ performance in adaptivity, correctness, and quality\. Our analysis investigates whether these dimensions exhibit systematic variation across learner profiles and model\-related factors \(VLMs and test group\), reflecting models’ sensitivity to learner differences and available contextual information\. Table [1](https://arxiv.org/html/2605.16011#S4.T1) summarizes overall performance patterns across VLMs and learner profiles\. The reported values represent the mean percentage of total possible points achieved within each dimension\. For example, a 70% Adaptivity score indicates the model satisfied 70% of the Q1–Q9 criteria\. To provide deeper insight into specific adaptive behaviors, Table [2](https://arxiv.org/html/2605.16011#S4.T2) disaggregates the Adaptivity dimension\. It reports detailed performance on individual sub\-metrics \(e\.g\., Cognitive, Motivational, Complexity\) across the four test groups \(G1–G4\), revealing how the availability of context influences specific instructional strategies\. To provide evidence for these relationships, we employ a series of two\-way analyses of variance \(ANOVA\), with learner profile and model\-related factors as independent variables\. Further details of the analysis are reported in Appendix [F](https://arxiv.org/html/2605.16011#A6) and Appendix [G](https://arxiv.org/html/2605.16011#A7)\.
### 5\.1 Experimental analysis
Adaptivity \(Q1–Q9\)\. We analyzed how adaptivity scores relate to learner profiles \(high\-, intermediate\-, and low\-performing\), VLMs, and test groups\. Overall, adaptivity scores differed significantly across VLMs and across test groups\. Adaptivity varied substantially across models, with a significant main effect of VLMs, $F(4,705)=41.81$, $p<.001$\. Tukey HSD post hoc comparisons indicated that Gemini\-2\.5\-Flash and Qwen3\-30B\-VL achieved significantly higher adaptivity scores than GPT\-5, GPT\-o1, and Llama3\.2\-11B\-VL, whereas differences between Gemini\-2\.5\-Flash and Qwen3\-30B\-VL were not statistically significant\. In contrast, adaptivity scores showed no significant association with learner profiles, $F(2,705)=1.72$, $p=.18$, and no significant interaction between learner profile and VLMs, $F(8,705)=1.63$, $p=.11$\. This uniformity across learner profiles suggests that these VLMs default to a "one\-size\-fits\-all" pedagogical style\. While they can provide answers and solutions, they lack nuanced scaffolding based on the specific needs of different learner profiles\. A complementary analysis examined adaptivity as a function of learner profile and test group \(Groups 1–4, reflecting increasing contextual information\)\. The results showed a significant relationship between contextual information and adaptivity: responses generated with access to additional context exhibited higher adaptivity scores\. This pattern was supported by a significant main effect of test group, $F(3,708)=92.73$, $p<.001$\. Tukey HSD comparisons revealed that Group 1 \(math question only\) scored significantly lower than Groups 2, 3, and 4\.
Correctness \(Q10–Q12\)\. Correctness scores showed a similar pattern, with variation driven primarily by model differences, demonstrating a highly significant main effect of VLMs, $F(4,705)=64.28$, $p<.001$\. Crucially, this model\-driven performance was robust and consistent across different learner performance levels\. Post hoc analyses indicated that Gemini\-2\.5\-Flash and GPT\-based models outperformed Llama3\.2\-11B\-VL, and that Gemini\-2\.5\-Flash also scored significantly higher than Qwen3\-30B\-VL\. When examining learner profile and VLMs, correctness scores exhibited no association with learner performance level, as reflected by a non\-significant main effect of learner profile, $F(2,705)=0.01$, $p=.99$, and no significant interaction, $F(8,705)=0.65$, $p=.74$\. The lack of significant variance suggests two possible explanations regarding model behavior: either the baseline user prompt \(Group 2\) is already sufficient for the VLMs to reach their performance ceiling on this task, or the models ignore the supplementary system prompts provided in Groups 3 and 4\.
Quality \(Q13–Q19\)\. Quality scores further reinforced the pattern observed for adaptivity and correctness: learner profiles did not systematically relate to model response quality, whereas substantial differences emerged across VLMs\. Quality differed significantly across models, as indicated by a significant main effect of VLMs, $F(4,705)=69.44$, $p<.001$\. Tukey HSD comparisons showed that Llama3\.2\-11B\-VL produced significantly higher error counts than all other models, while Gemini\-2\.5\-Flash achieved significantly lower error scores than GPT\-5, GPT\-o1, and Qwen3\-30B\-VL\.
Consistent with the overall analyses, distinct performance profiles emerged across the models\. Gemini\-2\.5\-Flash demonstrated strong and balanced performance across all metrics, while GPT\-o1 prioritized correctness and quality over adaptivity\. Qwen3\-30B\-VL obtained the highest average adaptivity score across learner profiles \($M=71\%$\)\. However, its correctness and quality were comparatively lower\. This is mainly due to overly long model generations that were more likely to be truncated or incomplete, which in turn produced formatting issues and partial non\-adherence to prompt instructions\. We also observed a correctness gap between Llama3\.2\-11B\-VL and the closed\-source models, with the latter consistently outperforming Llama3\.2\-11B\-VL\.
### 5\.2 Observations
This subsection summarizes key empirical observations derived from rubric\-based annotations of VLM responses\. These observations are grounded in aggregated rubric scores across learner profiles and context conditions, supplemented by qualitative inspection of representative model outputs\.
#### VLMs exhibit minimal behavioral adaptation to learner attitudes and self\-efficacy\.
As adaptivity constitutes the core dimension of the proposed rubric, we first examine how VLMs respond to learner\-specific motivational factors, particularly learning attitude and self\-efficacy \(Q3–Q5\)\. Across models, GPT\-5, GPT\-o1, and Llama3\.2\-11B\-Vision\-Instruct exhibit consistently low response rates in addressing these factors\. For example, GPT\-o1 addresses self\-efficacy concerns in only 8% of applicable cases, with a 0% response rate for Grade 4 intermediate\-performing learners\. In contrast, low\-performance learners receive substantially more motivational feedback, with an average response rate of 36\.8% when low confidence is explicitly indicated\. Among all evaluated models, Qwen3\-30B\-VL \(33\.1%\) and Gemini\-2\.5\-Flash \(32\.5%\) demonstrate comparatively stronger engagement with learners’ self\-efficacy\. Interestingly, some Qwen3\-30B\-VL responses include explicit encouragement \(e\.g\., “You got this\!”\), suggesting that the model partially adapts its instruction through motivational language\. These patterns suggest that models implicitly equate lower performance with greater motivational need, overlooking the importance of supporting confidence and learning attitudes among high\- and intermediate\-performing learners\.
#### Providing learners’ knowledge backgrounds to the VLM does not guarantee appropriate feedback\.
It is essential to understand learners’ current knowledge level before providing them with feedback\. However, such an understanding does not mean that the VLMs have the capabilities to provide appropriate differentiated feedback\. In several cases, responses to different learners are nearly identical, suggesting weak sensitivity to learner distinctions\. The VLMs frequently misunderstand how to deliver suitable content to specific groups\. Table[2](https://arxiv.org/html/2605.16011#S4.T2)illustrates that GPT\-5 generates longer and more detailed explanations for high\-performance learners than for intermediate\-performance learners, despite the latter typically requiring greater instructional scaffolding\. These findings indicate that access to learner background information alone does not guarantee appropriate or pedagogically aligned feedback\.
#### Multimodal misinterpretation and response quality failures\.
Beyond adaptivity, our rubric\-based evaluation surfaces notable correctness and quality issues, particularly for visually grounded mathematics problems\. Geometry items pose a significant challenge: for instance, a Grade 4 geometry question \(G4Q5\) required comparing the rectangles shown in four answer choices to identify the correct one\. Although the task structure is clear, only 10 out of 60 model responses produced correct answers, which points to limitations in visual–spatial reasoning rather than ambiguity in the task itself\. These errors are consistently reflected in low correctness scores across multiple models and learner profiles\. Moreover, a common failure mode involves conflating multiple\-choice option labels with their embedded numerical content\. When presented with options like "D\. 6kg", the correct response should identify the label \(D\) rather than the embedded value \(6kg\)\. However, models often reported the embedded value as the final answer and omitted the required option label\. Additionally, when questions integrate images, formulas, and text, models often extract misaligned or inaccurate information\. This manifests as hallucinations or inaccurate extraction in the solution, compromising mathematical reasoning and instructional reliability even when the final answer appears superficially plausible\.
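The label\-versus\-value failure described above can be made concrete with a small check\. The following is a minimal sketch, not the procedure used in the study \(which relies on human annotation\): it assumes options are formatted as a single letter A–D followed by a period or parenthesis, and the helper names `parse_option` and `answer_uses_label` are hypothetical\.

```python
import re

# Assumed option format: a label A-D, then "." or ")", then the option content.
OPTION_PATTERN = re.compile(r"^\s*([A-D])[.)]\s*(.+?)\s*$")


def parse_option(option_text: str) -> tuple:
    """Split an option such as 'D. 6kg' into its label and embedded value."""
    match = OPTION_PATTERN.match(option_text)
    if match is None:
        raise ValueError(f"Unrecognized option format: {option_text!r}")
    return match.group(1), match.group(2)


def answer_uses_label(final_answer: str, correct_option: str) -> bool:
    """Return True if the final answer names the option label as a standalone letter."""
    label, _value = parse_option(correct_option)
    # The label must not be part of a longer word (e.g. the 'D' in 'Definitely').
    standalone_label = rf"(?<![A-Za-z]){label}(?![A-Za-z])"
    return re.search(standalone_label, final_answer) is not None


print(parse_option("D. 6kg"))
print(answer_uses_label("The answer is D. 6kg", "D. 6kg"))
print(answer_uses_label("6kg", "D. 6kg"))
```

An answer that reports only the embedded value, such as "6kg", fails this check even though the value itself is right, which mirrors the failure mode noted in the annotations\.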
## 6Conclusion
This paper evaluates whether VLMs demonstrate adaptive capabilities in mathematics tutoring tasks\. To this end, we introduce a learner model–inspired rubric that provides a structured framework for assessing adaptivity beyond surface\-level personalization\. Applying this rubric to VLMs across learner profiles reveals that, while models exhibit some sensitivity to learner context, their adaptive behavior remains limited and inconsistent\. In particular, models often fail to differentiate instructional strategies across learners with distinct academic backgrounds, indicating a gap between generic tutoring responses and genuinely adaptive instruction\.
The proposed rubric offers a principled tool for systematically comparing adaptive capabilities across models\. Looking forward, future work may extend this rubric to distinguish baseline adaptivity from more sophisticated forms of instructional adaptation and to examine additional components of adaptive learning frameworks\. More broadly, this work highlights the need for evaluation methodologies that prioritize pedagogical alignment with learners, providing guidance for the development of more adaptive and educationally grounded multimodal models\.
## Limitations
While we have developed the adaptive rubric and accompanying worksheet to identify the VLMs’ adaptive capabilities, the rubric questions for adaptivity are not exhaustive\. Other adaptivity variables, including those from affective and socio\-cultural aspects, are also important for evaluation\. In addition, the limited number of math items used in this study may leave pedagogical adaptivity insufficiently covered\. Our study also simulated learners rather than testing with real students, so it cannot capture all the learning behaviors of real students\. Moreover, there might be other differences inherent in using an automated tutoring system compared to human tutors that are not addressed in the current adaptive learning framework\. For instance, human tutors may create more social pressure, which might be helpful or harmful for learning\. A further limitation is that the proposed rubric is designed as a baseline that evaluates whether responses show adaptive capabilities in mathematics tutoring tasks\. A response classified as adaptive in this study is therefore not necessarily a high\-quality, high\-level adaptive response\.
## Ethical Consideration
This study involves the evaluation of vision language models using synthetic and publicly available mathematical problem data\. No personally identifiable information or human subject data were used in model evaluation\. We obtained official permission from the International Association for the Evaluation of Educational Achievement \(IEA\) to use the Trends in International Mathematics and Science Study \(TIMSS\) 2019 and 2023 mathematics items from Grade 4 and Grade 8\. Thus, the data has no potential privacy or content safety concerns\. Additionally, the learner characteristics of learner profile data were extracted from the TIMSS 2019 and TIMSS 2023 public reports and datasets \(Fishbein et al\., [2021](https://arxiv.org/html/2605.16011#bib.bib4), [2025](https://arxiv.org/html/2605.16011#bib.bib1)\)\.
## References
- A\. Amini, S\. Gabriel, S\. Lin, R\. Koncel\-Kedziorski, Y\. Choi, and H\. Hajishirzi \(2019\)Mathqa: towards interpretable math word problem solving with operation\-based formalisms\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 \(long and short papers\),pp\. 2357–2367\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- S\. Banjade, H\. Patel, S\. Pokhrel,et al\.\(2024\)Empowering education by developing and evaluating generative ai\-powered tutoring system for enhanced student learning\.Journal of Artificial Intelligence and Capsule Networks6\(3\),pp\. 278–298\.Cited by:[§2\.3](https://arxiv.org/html/2605.16011#S2.SS3.p1.1)\.
- S\. Baral, L\. Lucy, R\. Knight, A\. Ng, L\. Soldaini, N\. T\. Heffernan, and K\. Lo \(2025\)DrawEduMath: evaluating vision language models with expert\-annotated students’ hand\-drawn math images\.arXiv preprint arXiv:2501\.14877\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p1.1)\.
- M\. L\. Bernacki, M\. J\. Greene, and N\. G\. Lobczowski \(2021\)A systematic review of research on personalized learning: personalized by whom, to what, how, and for what purpose \(s\)?\.Educational Psychology Review33\(4\),pp\. 1675–1715\.Cited by:[§A\.1](https://arxiv.org/html/2605.16011#A1.SS1.p2.pic1.2.2.2.1.1.1),[§1](https://arxiv.org/html/2605.16011#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.16011#S2.SS1.p1.1)\.
- M\. L\. Bernacki, C\. Walkington, A\. Emery, and L\. Zhang \(2025\)Handbook of personalized learning\.Taylor & Francis\.Cited by:[§3](https://arxiv.org/html/2605.16011#S3.p2.1)\.
- C\. Borchers and T\. Shou \(2025\)Can large language models match tutoring system adaptivity? a benchmarking study\.InInternational Conference on Artificial Intelligence in Education,pp\. 407–420\.Cited by:[§2\.3](https://arxiv.org/html/2605.16011#S2.SS3.p1.1)\.
- L\. Cai, M\. M\. Msafiri, and D\. Kangwa \(2025\)Exploring the impact of integrating ai tools in higher education using the zone of proximal development\.Education and Information Technologies30\(6\),pp\. 7191–7264\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- A\. Cherian, K\. Peng, S\. Lohit, J\. Matthiesen, K\. Smith, and J\. Tenenbaum \(2024\)Evaluating large vision\-and\-language models on children’s mathematical olympiads\.Advances in Neural Information Processing Systems37,pp\. 15779–15800\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- E\. Du Plooy, D\. Casteleijn, and D\. Franzsen \(2024\)Personalized adaptive learning in higher education: a scoping review of key characteristics and impact on academic performance and engagement\.Heliyon10\(21\)\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1)\.
- B\. Fishbein, P\. Foy, and L\. Yin \(2021\)TIMSS 2019 user guide for the international database\.2nd edition,Boston College, TIMSS & PIRLS International Study Center\.External Links:[Link](https://timssandpirls.bc.edu/timss2019/international-database/)Cited by:[§4\.1](https://arxiv.org/html/2605.16011#S4.SS1.p1.3),[Ethical Consideration](https://arxiv.org/html/2605.16011#Sx2.p1.1)\.
- B\. Fishbein, M\. Taneva, and K\. Kowolik \(2025\)TIMSS 2023 user guide for international database\.Boston College, TIMSS & PIRLS International Study Center\.Note:Retrieved from TIMSS 2023 International DatabaseExternal Links:[Link](https://timss2023.org/data)Cited by:[§4\.1](https://arxiv.org/html/2605.16011#S4.SS1.p1.3),[Ethical Consideration](https://arxiv.org/html/2605.16011#Sx2.p1.1)\.
- A\. Gupta, J\. Reddig, T\. Calo, D\. Weitekamp, and C\. J\. MacLellan \(2025\)Beyond final answers: evaluating large language models for math tutoring\.InInternational Conference on Artificial Intelligence in Education,pp\. 323–337\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p1.1)\.
- G\. Hackett and N\. E\. Betz \(1989\)An exploration of the mathematics self\-efficacy/mathematics performance correspondence\.Journal for research in Mathematics Education20\(3\),pp\. 261–273\.Cited by:[§A\.1](https://arxiv.org/html/2605.16011#A1.SS1.p3.pic1.2.2.2.1.1.2)\.
- C\. Imhof, P\. Bergamin, and S\. McGarrity \(2020\)Implementation of adaptive learning systems: current state and potential\.Online teaching and learning in higher education,pp\. 93–115\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1)\.
- J\. Jones, L\. Mo, E\. Fosler\-Lussier, and H\. Sun \(2024\)A multi\-aspect framework for counter narrative evaluation using large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\),pp\. 147–168\.Cited by:[§3](https://arxiv.org/html/2605.16011#S3.p4.1)\.
- J\. Kaskens, E\. Segers, S\. L\. Goei, J\. E\. Van Luit, and L\. Verhoeven \(2020\)Impact of children’s math self\-concept, math self\-efficacy, math anxiety, and teacher competencies on math development\.Teaching and teacher education94,pp\. 103096\.Cited by:[§2\.1](https://arxiv.org/html/2605.16011#S2.SS1.p1.1)\.
- N\. Kibrislioglu \(2015\)An investigation about 6th grade students’ attitudes towards mathematics\.Procedia\-Social and Behavioral Sciences186,pp\. 64–69\.Cited by:[§A\.1](https://arxiv.org/html/2605.16011#A1.SS1.p2.pic1.2.2.2.1.1.2)\.
- H\. Kumar, I\. Musabirov, M\. Reza, J\. Shi, X\. Wang, J\. J\. Williams, A\. Kuzminykh, and M\. Liut \(2023\)Impact of guidance and interaction strategies for llm use on learner performance and perception\.arXiv preprint arXiv:2310\.13712\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1)\.
- G\. Lee and X\. Zhai \(2025\)Realizing visual question answering for education: gpt\-4v as a multimodal ai\.TechTrends69\(2\),pp\. 271–287\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- Y\. Lee, T\. Yun, J\. Cai, H\. Su, and H\. Song \(2024\)UniSumEval: towards unified, fine\-grained, multi\-dimensional summarization evaluation for llms\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 3941–3960\.Cited by:[§3](https://arxiv.org/html/2605.16011#S3.p4.1)\.
- C\. Lin, A\. Y\. Huang, and O\. H\. Lu \(2023\)Artificial intelligence in intelligent tutoring systems toward sustainable education: a systematic review\.Smart learning environments10\(1\),pp\. 41\.Cited by:[§2\.3](https://arxiv.org/html/2605.16011#S2.SS3.p1.1)\.
- V\. Liu, E\. Latif, and X\. Zhai \(2025\)Advancing education through tutoring systems: a systematic literature review\.arXiv preprint arXiv:2503\.09748\.Cited by:[§2\.3](https://arxiv.org/html/2605.16011#S2.SS3.p1.1)\.
- Z\. Liu, S\. X\. Yin, G\. Lin, and N\. Chen \(2024\)Personality\-aware student simulation for conversational intelligent tutoring systems\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 626–642\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p1.1)\.
- P\. Lu, H\. Bansal, T\. Xia, J\. Liu, C\. Li, H\. Hajishirzi, H\. Cheng, K\. Chang, M\. Galley, and J\. Gao \(2023\)Mathvista: evaluating mathematical reasoning of foundation models in visual contexts\.arXiv preprint arXiv:2310\.02255\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- M\. D\. Marraffino, B\. L\. Schroeder, N\. W\. Fraulini, W\. L\. Van Buskirk, and C\. I\. Johnson \(2021\)Adapting training in real time: an empirical test of adaptive difficulty schedules\.Military Psychology33\(3\),pp\. 136–151\.Cited by:[§A\.1](https://arxiv.org/html/2605.16011#A1.SS1.p4.pic1.2.2.2.1.1.1),[§A\.1](https://arxiv.org/html/2605.16011#A1.SS1.p5.pic1.2.2.2.1.1.1)\.
- J\. L\. Plass and S\. Pawar \(2020\)Toward a taxonomy of adaptivity for learning\.Journal of Research on Technology in Education52\(3\),pp\. 275–300\.Cited by:[§A\.1](https://arxiv.org/html/2605.16011#A1.SS1.p1.pic1.2.2.2.1.1.1),[§A\.1](https://arxiv.org/html/2605.16011#A1.SS1.p3.pic1.2.2.2.1.1.1),[§2\.1](https://arxiv.org/html/2605.16011#S2.SS1.p1.1),[§3](https://arxiv.org/html/2605.16011#S3.p2.1)\.
- A\. Scarlatos, R\. S\. Baker, and A\. Lan \(2025\)Exploring knowledge tracing in tutor\-student dialogues using llms\.InProceedings of the 15th International Learning Analytics and Knowledge Conference,pp\. 249–259\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- W\. H\. Schmidt and R\. T\. Houang \(2012\)Curricular coherence and the common core state standards for mathematics\.Educational Researcher41\(8\),pp\. 294–308\.Cited by:[§3](https://arxiv.org/html/2605.16011#S3.p2.1)\.
- S\. Sharma, P\. Mittal, M\. Kumar, and V\. Bhardwaj \(2025\)The role of large language models in personalized learning: a systematic review of educational impact\.Discover Sustainability6\(1\),pp\. 1–24\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1)\.
- V\. Shute and B\. Towle \(2018\)Adaptive e\-learning\.InAptitude,pp\. 105–114\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.16011#S2.SS1.p1.1)\.
- V\. Shute, S\. Lajoie, and K\. Gluck \(2000\)Individualized and group approaches to training\.Training and retraining: A handbook for business, industry, government, and the military,pp\. 171–207\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1)\.
- C\. Song, S\. Shin, and K\. Shin \(2024\)Implementing the dynamic feedback\-driven learning optimization framework: a machine learning approach to personalize educational pathways\.Applied Sciences14\(2\),pp\. 916\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1)\.
- L\. Tang, I\. Shalyminov, A\. Wong, J\. Burnsky, J\. Vincent, Y\. Yang, S\. Singh, S\. Feng, H\. Song, H\. Su,et al\.\(2024\)TofuEval: evaluating hallucinations of llms on topic\-focused dialogue summarization\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 4455–4480\.Cited by:[§3](https://arxiv.org/html/2605.16011#S3.p4.1)\.
- A\. Vanzo, S\. P\. Chowdhury, and M\. Sachan \(2025\)GPT\-4 as a homework tutor can improve student engagement and learning outcomes\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 31119–31136\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- K\. Wang, J\. Pan, W\. Shi, Z\. Lu, H\. Ren, A\. Zhou, M\. Zhan, and H\. Li \(2024\)Measuring multimodal mathematical reasoning with math\-vision dataset\.Advances in Neural Information Processing Systems37,pp\. 95095–95169\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- S\. Wang, C\. Christensen, W\. Cui, R\. Tong, L\. Yarnall, L\. Shear, and M\. Feng \(2023\)When adaptive learning is effective learning: comparison of an adaptive learning system to teacher\-led instruction\.Interactive learning environments31\(2\),pp\. 793–803\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1)\.
- R\. Wen and A\. K\. Dubé \(2022\)A systematic review of secondary students’ attitudes towards mathematics and its relations with mathematics achievement\.Journal of Numerical Cognition8\(2\),pp\. 295–325\.Cited by:[§2\.1](https://arxiv.org/html/2605.16011#S2.SS1.p1.1),[§3](https://arxiv.org/html/2605.16011#S3.p2.1)\.
- Y\. Yan, J\. Su, J\. He, F\. Fu, X\. Zheng, Y\. Lyu, K\. Wang, S\. Wang, Q\. Wen, and X\. Hu \(2025\)A survey of mathematical reasoning in the era of multimodal large language model: benchmark, method & challenges\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 11798–11827\.External Links:[Link](https://aclanthology.org/2025.findings-acl.614/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.614),ISBN 979\-8\-89176\-256\-5Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- Y\. Yu, A\. Krantz, and N\. G\. Lobczowski \(2025\)From recall to reasoning: automated question generation for deeper math learning through large language models\.InInternational Conference on Artificial Intelligence in Education,pp\. 414–422\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- R\. Zhang, D\. Jiang, Y\. Zhang, H\. Lin, Z\. Guo, P\. Qiu, A\. Zhou, P\. Lu, K\. Chang, Y\. Qiao,et al\.\(2024\)Mathverse: does your multi\-modal llm truly see the diagrams in visual math problems?\.InEuropean Conference on Computer Vision,pp\. 169–186\.Cited by:[§2\.2](https://arxiv.org/html/2605.16011#S2.SS2.p1.1)\.
- L\. Zheng, M\. Long, L\. Zhong, and J\. F\. Gyasi \(2022\)The effectiveness of technology\-facilitated personalized learning on learning achievements and learning perceptions: a meta\-analysis\.Education and Information Technologies27\(8\),pp\. 11807–11830\.Cited by:[§1](https://arxiv.org/html/2605.16011#S1.p2.1)\.
- W\. Zhou, Q\. Li, and C\. Li \(2023\)Learning to predict persona information for dialogue personalization without explicit persona description\.InFindings of the Association for Computational Linguistics: ACL 2023,pp\. 2979–2991\.Cited by:[§2\.1](https://arxiv.org/html/2605.16011#S2.SS1.p1.1)\.
## Appendix ADetailed Rubric
This section comprehensively documents the rubric criteria, scoring hierarchies, and reference examples established in the evaluation framework\. To ensure reproducibility and transparency, each entry below specifies the exact text used to prompt evaluators, the theoretical dimensions underpinning the metric \(e\.g\., Cognitive Aspects vs\. Motivational Aspects\), and the precise logic required to assign values \(e\.g\., −1 vs\. 0 vs\. 1\)\. Additionally, we provide the canonical examples that define the boundary conditions for each score, presenting the material verbatim from the study’s design documents\.
### A\.1Adaptivity
Q1\. Does the response take the learner’s prior math knowledge into account?

Cognitive aspects \- Current knowledge \(Plass and Pawar, [2020](https://arxiv.org/html/2605.16011#bib.bib22)\)

Criteria: The response should match the learner’s stated math level\. A response fails when 1\) the content goes beyond the learner’s level, or 2\) the content is outside this grade’s scope\.

Scoring & Examples:
- 0 = No: Prompt: I just learned algebra in grade 8\. Could you teach me this question "$y=ab$\. Which is the value of y when a = 1 and b = 2" Response: Sure\! Think about a x b in a higher\-dimensional space…
- 1 = Yes

Q2\. Does this response try to build a positive attitude in the learner?

Motivational aspects \- Learning attitude \(Bernacki et al\., [2021](https://arxiv.org/html/2605.16011#bib.bib36)\)

Criteria: Definition of Learning Attitude in Math: Kibrislioglu \([2015](https://arxiv.org/html/2605.16011#bib.bib50)\) defines attitude towards mathematics as liking or disliking of the subject; a tendency to engage in or avoid mathematical activities; a belief that one is good or bad at mathematics; and a belief that mathematics is useful or useless\.

Scoring & Examples:
- \-1 = No, it responds negatively
- 0 = No
- 1 = Yes: Prompt: I don’t like math\. Can you help me understand 2 × 3? Response: "I can help you with that…\."

Q3\. Does this response intend to improve the student’s self\-efficacy for completing this mathematics task?

Motivational aspects \- Self\-efficacy \(Plass and Pawar, [2020](https://arxiv.org/html/2605.16011#bib.bib22)\)

Criteria: Math self\-efficacy is a situational or problem\-specific assessment of an individual’s confidence in her or his ability to successfully perform or accomplish a particular mathematical task or problem \(Hackett and Betz, [1989](https://arxiv.org/html/2605.16011#bib.bib48)\)\. The response should help the student feel capable of solving the math question\.

Scoring & Examples:
- \-1 = No, it responds negatively
- 0 = No
- 1 = Yes: Prompt: I am not confident in math learning\. Can you help me understand 2 × 3? Response: "Math can be tricky, but we can solve this together\."

Q4\. Does the solution meet the Common Core State Standards at this grade level?

Cognitive aspects \- Difficulty \(Marraffino et al\., [2021](https://arxiv.org/html/2605.16011#bib.bib45)\)

Criteria: The response aligns with the Common Core standard math knowledge scope and difficulty\.

Scoring & Examples:
- 0 = No: Not followed the Common Core Standard
- 1 = Yes: Prompt: I’m not sure I can solve 12 × 13\. Can you teach me? Response: Of course\! This is a Number and Operation question in Grade 4……?

Q5\. Is the solution at an appropriate difficulty level for the learner?

Cognitive aspects \- Difficulty \(Marraffino et al\., [2021](https://arxiv.org/html/2605.16011#bib.bib45)\)

Criteria: The response should match the learner’s skill level\. A response fails when the explanation is too advanced, too complex, or uses concepts the learner is not ready for\.

Scoring & Examples:
- 0 = No: Not provided a solution at an appropriate difficulty level
- 1 = Yes: Prompt: What is $2+2$? Response: It is 4 because adding two items to another two items makes four items\.

Q6\. Does this response explain the meaning of this math question to help the learner understand the problem?

Complexity

Criteria: The response should explain the meaning of the math question to help the learner know its logic and better know how to solve it\. A response fails when the response does not provide an explanation of the meaning of the math question\.

Scoring & Examples:
- 0 = No: Not explained
- 1 = Yes: Response: …This question means…?

Q7\. Does this response provide an example aligned with this math question’s topic?

Complexity

Criteria: The response should give an example that matches the math topic in the question\. A response fails when the example is unrelated, off\-topic, or does not help illustrate the math idea being asked about\.

Scoring & Examples:
- 0 = No: No example provided
- 1 = Yes: Response: …Here is an example…

Q8\. Does this response provide additional practice similar to this math question?

Complexity

Criteria: The response should give extra practice problems that match the math concept in the question\. A response fails when the response does not provide extra practice\.

Scoring & Examples:
- 0 = No: No additional practice provided
- 1 = Yes: Response: …Let’s practice…

Q9\. Does the response provide cues \(e\.g\., step\-by\-step, signals of importance\) to guide the learner?

Complexity

Criteria: The response should include helpful cues such as steps, hints, or pointing out important parts of the problem\. A response fails when it gives the answer with no guidance or offers no structure to help the learner follow the process\.

Scoring & Examples:
- 0 = No: No cues provided
- 1 = Yes: Response: …Let’s go step by step\. First…
### A\.2Correctness
Q10\. Does the response provide a correct answer to this math question?

The response provides a correct answer\.

Criteria: The final answer provided by the model must be mathematically correct\.

Scoring & Examples:
- 0 = No: Prompt: $2+2=?$ Response: 5
- 1 = Yes: Prompt: $2+2=?$ Response: 4

Q11\. Does the solution support and align with the answer?

Criteria: The key explanation and logic should clearly match the final answer\.

Scoring & Examples:
- 1 = No
- 0 = Yes: Prompt: Can you teach me how to solve 12 × 13? Response: The answer is 156\. This is a number & operation math question…

Q12\. Does the response provide a correct solution to this math question?

Criteria: The response should give the correct and accurate solution based on the math problem\.

Scoring & Examples:
- 0 = No: More than 50% incorrect
- 1 = Mostly correct: Minor error that doesn’t affect logic
- 2 = Completely correct
### A\.3Quality
Q13\. Does the response have hallucinations or inaccurately extracted information?

Criteria: The model responds with content that contains hallucinations and/or extracts inaccurate information from the provided prompt\.

Scoring & Examples:
- 1 = No
- 0 = Yes

Q14\. Does the response have failures in mathematical or logical reasoning?

Criteria: The model in its responses makes errors in arithmetic calculations, statistical analysis, or other quantitative conceptual or numerical operations\.

Scoring & Examples:
- 1 = No
- 0 = Yes

Q15\. Does the response fail to accurately and completely follow the prompt instructions?

Criteria: The model doesn’t follow explicit or clearly implied instructions in the model input\. This includes responding in the wrong language, ignoring the math exercise’s key information, or not adhering to the user’s requirement\.

Scoring & Examples:
- 1 = No
- 0 = Yes

Q16\. Is the response incoherent?

Criteria: The model’s reasoning contains logical fallacies, contradicts itself within the same response, or reaches conclusions that don’t follow from the premises\. It includes cases where the model presents information and then contradicts it later in the same response\.

Scoring & Examples:
- 1 = No
- 0 = Yes

Q17\. Does the response provide irrelevant information?

Criteria: The model in its responses includes information that doesn’t address the user’s query or goes beyond the scope of what was asked\. This includes discussing tangential topics, adding unnecessary background, or focusing on aspects not central to the query\.

Scoring & Examples:
- 1 = No
- 0 = Yes

Q18\. Does the response have formatting, grammatical, or vocabulary errors?

Criteria: This failure mode occurs when the model’s response contains errors in formatting, grammar, or word usage\.

Scoring & Examples:
- 1 = No
- 0 = Yes

Q19\. Is the response overly brief or lacking necessary detail?

Criteria: The model provides a response that lacks sufficient detail or explanation to adequately address the query, given the complexity of the topic and the apparent depth of information requested\.

Scoring & Examples:
- 1 = No
- 0 = Yes
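The score ranges in this appendix can be summarized as a small data structure\. The sketch below encodes the allowed values per item as listed above \(Q2 and Q3 take values in −1, 0, 1; Q12 in 0, 1, 2; all other items in 0, 1\) and groups items by dimension\. The min\-max conversion to a percentage is an assumption for illustration only: this section does not state the aggregation formula, and the sketch treats a higher coded value as better for every item\. The names `ALLOWED`, `DIMENSIONS`, and `dimension_percent` are hypothetical\.

```python
# Allowed coded values per rubric item, taken from the scoring lists in this appendix.
ALLOWED = {q: (0, 1) for q in range(1, 20)}
ALLOWED[2] = (-1, 0, 1)   # learning attitude
ALLOWED[3] = (-1, 0, 1)   # self-efficacy
ALLOWED[12] = (0, 1, 2)   # solution correctness

# Item groupings follow the appendix headings A.1, A.2, and A.3.
DIMENSIONS = {
    "adaptivity": range(1, 10),
    "correctness": range(10, 13),
    "quality": range(13, 20),
}


def dimension_percent(scores: dict, dimension: str) -> float:
    """Min-max normalize one response's item scores to a 0-100 scale.

    Assumption: higher coded values are better for every item.
    """
    items = DIMENSIONS[dimension]
    for q in items:
        if scores[q] not in ALLOWED[q]:
            raise ValueError(f"Q{q}: score {scores[q]} not in {ALLOWED[q]}")
    earned = sum(scores[q] - min(ALLOWED[q]) for q in items)
    possible = sum(max(ALLOWED[q]) - min(ALLOWED[q]) for q in items)
    return 100.0 * earned / possible


# Illustrative scores for a single hypothetical response (not study data).
example = {1: 1, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 0, 8: 0, 9: 1}
print(round(dimension_percent(example, "adaptivity"), 1))
```

Validating each score against its allowed set catches annotation entry errors before aggregation, which matters because the items do not share a common scale\.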
## Appendix BModel Sources and Computational Cost
The models evaluated in this paper are obtained from the following sources:
1. 1\.
2. 2\.
3. 3\.
For large proprietary models \(e\.g\., GPT\-5\), a one\-time evaluation on 600 samples incurs a cost of approximately $6 for generating explanations\. All open\-source model evaluations are conducted on a system equipped with two NVIDIA RTX 4090 GPUs \(32GB memory each\)\. Overall, the modest computational requirements demonstrate that our evaluation protocol is accessible to researchers with limited computational resources, while still enabling comprehensive assessment of state\-of\-the\-art models\.
## Appendix CDecoding configuration
To ensure comparability across models, we standardize the decoding parameters for all evaluated VLMs\. Specifically, we set the maximum generation length to 512 tokens and use a temperature of 0\.7 for all models\. Generation terminates when the model reaches the token limit or produces an end\-of\-sequence token\.
These settings are chosen to allow sufficient space for step\-by\-step explanations while preventing excessively long outputs\. The token limit is determined based on pilot annotations by human experts, which estimate the typical length required to provide a complete pedagogical explanation for the evaluated mathematics problems\. Importantly, the same configuration is applied across all models to ensure that differences in response quality reflect model behavior rather than decoding differences\.
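The shared decoding settings can be written down as a single configuration object\. This is a minimal sketch using only the two values stated above \(512 tokens, temperature 0\.7\); the class name `DecodingConfig`, its field names, and the helper `stop_reason` are hypothetical and do not correspond to any particular inference library\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodingConfig:
    """Decoding parameters applied identically to every evaluated model."""
    max_new_tokens: int = 512   # maximum generation length stated in the text
    temperature: float = 0.7    # sampling temperature stated in the text


def stop_reason(n_generated_tokens: int, saw_eos: bool, config: DecodingConfig) -> str:
    """Classify why generation ended: end-of-sequence token or token limit."""
    if saw_eos:
        return "eos"
    if n_generated_tokens >= config.max_new_tokens:
        return "length"   # output was cut off and may be incomplete
    return "running"


config = DecodingConfig()
print(stop_reason(512, saw_eos=False, config=config))
```

Recording the stop reason alongside each response would make it possible to separate truncated outputs from complete ones when interpreting quality scores\.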
## Appendix DItem\-Level Robustness Analysis
To examine whether specific evaluation dimensions drive the reported findings, we conduct an item\-level robustness analysis across the 19 evaluation questions used in our human annotation framework\. Each item captures a distinct aspect of instructional quality, including learner adaptation, motivational support, instructional scaffolding, correctness, and potential failure modes\.
For each item, we compute the mean percentage across the six learner profiles for each model\. Table[3](https://arxiv.org/html/2605.16011#A4.T3)reports the resulting item\-level averages\.
Table 3: Item\-level mean percentages across learner profiles \($N=6$\)\.

The results reveal consistent patterns across items\. Gemini and Qwen achieve the highest scores on most positive instructional dimensions, including prior\-knowledge adaptation \(Q1\), motivational support \(Q2–Q3\), explanation quality \(Q6\), and instructional guidance \(Q9\)\. In contrast, GPT\-o1 and Llama exhibit substantially higher failure rates in reasoning errors \(Q14\), instruction\-following failures \(Q15\), and incoherence \(Q16\)\.
Importantly, these patterns remain stable across evaluation items, suggesting that the reported model differences are not driven by a small subset of evaluation criteria\. Instead, the relative strengths and weaknesses of models appear consistently across multiple pedagogical dimensions\.
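The per-item averaging used in this appendix can be sketched as follows. This is an illustration under assumptions: the record layout `(model, profile, item, percentage)` and the function name `item_level_means` are hypothetical, and the numbers in the usage example are made up rather than taken from Table 3.

```python
from collections import defaultdict
from statistics import mean


def item_level_means(records):
    """Average percentages over learner profiles for each (model, item) pair.

    `records` is an iterable of (model, profile, item, percentage) tuples.
    """
    grouped = defaultdict(list)
    for model, _profile, item, percentage in records:
        grouped[(model, item)].append(percentage)
    return {key: mean(values) for key, values in grouped.items()}


if __name__ == "__main__":
    # Made-up toy data: two profiles instead of the six used in the study.
    toy = [
        ("model-a", "high", "Q1", 80),
        ("model-a", "low", "Q1", 90),
        ("model-b", "high", "Q1", 40),
        ("model-b", "low", "Q1", 60),
    ]
    print(item_level_means(toy))
```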
## Appendix E Human Annotation Details
To verify the reliability of our evaluation rubric, we conducted a human annotation study on a stratified subset of 120 model responses. Two expert annotators independently evaluated these instances, achieving an overall inter-rater agreement of 87.11%. As detailed in Table [4](https://arxiv.org/html/2605.16011#A5.T4), these results demonstrate strong consistency across different models and learner profiles, confirming the reproducibility of our scoring framework.
Table 4: Detailed inter-rater agreement metrics for each evaluated model on the annotated subset ($N = 120$). The overall average agreement across all models is 87.11%.
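The text does not state which agreement statistic underlies the 87.11% figure. Assuming it is simple percent agreement (the share of rubric judgments on which both annotators gave the same score), the computation can be sketched as below. The function name and the toy ratings are hypothetical.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of paired judgments on which two annotators agree exactly."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both annotators must rate the same set of judgments.")
    if not ratings_a:
        raise ValueError("At least one paired judgment is required.")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


if __name__ == "__main__":
    # Made-up binary rubric scores for four judgments.
    annotator_1 = [1, 0, 1, 1]
    annotator_2 = [1, 0, 0, 1]
    print(percent_agreement(annotator_1, annotator_2))
```

Percent agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's kappa would need the full rating distributions, which this appendix does not report.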
## Appendix F Statistical Analysis of Model Performance
$$
p(a_{j}, b_{j} \mid D_{t+1}) \;\propto\; \underbrace{p(R_{\text{new}} \mid a_{j}, b_{j}, \theta_{\text{new}})}_{\begin{subarray}{c}\text{Likelihood: how well new responses}\\ \text{fit the current item parameters}\end{subarray}} \;\cdot\; \underbrace{p(a_{j}, b_{j} \mid D_{t})}_{\begin{subarray}{c}\text{Prior: learned from}\\ \text{previous calibration data}\end{subarray}}
$$
We conducted a Repeated Measures ANOVA to statistically evaluate performance differences across VLMs and test items. As detailed in Table [5](https://arxiv.org/html/2605.16011#A6.T5), the analysis revealed a statistically significant main effect of LLM ($F = 41.39$, $p < .001$), confirming that the choice of model significantly affects performance quality.
Additionally, we observed a significant main effect of Item ($F = 721.61$, $p < .001$) and a significant Item × LLM interaction ($F = 21.75$, $p < .001$). This interaction indicates that the performance gap between models is not uniform; rather, it varies with the difficulty or type of the math question presented.
| Cases | Sum of Squares | df | Mean Square | F | p |
|---|---|---|---|---|---|
| **Within Subjects Effects** | | | | | |
| Item | 593.60ᵃ | 8ᵃ | 74.201ᵃ | 721.61ᵃ | < .001 |
| Item ∗ LLM | 71.56ᵃ | 32ᵃ | 2.236ᵃ | 21.75ᵃ | < .001 |
| Residuals | 588.17 | 5720 | 0.103 | | |
| **Between Subjects Effects** | | | | | |
| LLM | 61.25 | 4 | 15.312 | 41.39 | < .001 |
| Residuals | 264.49 | 715 | 0.370 | | |

ᵃ Mauchly's test indicates that the sphericity assumption is violated ($p < .05$).

Table 5: Repeated Measures ANOVA Results. Analysis of Within-Subjects Effects (Item, Interaction) and Between-Subjects Effects (LLM).

Figure [4](https://arxiv.org/html/2605.16011#A6.F4) visualizes these descriptive trends. While some items (e.g., Item 7) proved uniformly difficult for all models, causing performance to converge near zero, other items (e.g., Items 1 and 9) exhibit distinct stratification, where higher-performing models separate clearly from lower-performing ones.
Figure 4: Descriptive Plots of Model Performance by Item. The x-axis represents the math items, and the y-axis represents the normalized performance score. The divergence of lines illustrates the Item × LLM interaction effect.
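As a quick consistency check on Table 5, each F statistic is the effect's mean square divided by the residual mean square, where a mean square is a sum of squares over its degrees of freedom. The sketch below recomputes two of the reported values from the table's sums of squares. The helper names are hypothetical, and small discrepancies are expected because the table entries are rounded.

```python
def mean_square(sum_of_squares: float, df: int) -> float:
    """Mean square = sum of squares divided by degrees of freedom."""
    return sum_of_squares / df


def f_ratio(ss_effect: float, df_effect: int, ss_error: float, df_error: int) -> float:
    """F = MS_effect / MS_error."""
    return mean_square(ss_effect, df_effect) / mean_square(ss_error, df_error)


if __name__ == "__main__":
    # Between-subjects LLM effect, using the values reported in Table 5.
    f_llm = f_ratio(61.25, 4, 264.49, 715)
    # Within-subjects Item x LLM interaction, using the values reported in Table 5.
    f_interaction = f_ratio(71.56, 32, 588.17, 5720)
    print(f"LLM: F = {f_llm:.2f}; Item x LLM: F = {f_interaction:.2f}")
```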
## Appendix G Statistical Analysis by Test Group
We extended our analysis to evaluate performance differences between distinct Test Groups using a Repeated Measures ANOVA. As presented in Table [6](https://arxiv.org/html/2605.16011#A7.T6), we found a statistically significant main effect of TestGroup ($F(3, 716) = 92.73$, $p < .001$), with a substantial effect size ($\eta_p^2 = 0.280$). This indicates that the grouping factor accounts for 28% of the variance in between-subjects performance.
Furthermore, the Item × TestGroup interaction was significant ($F(24, 5728) = 19.13$, $p < .001$), suggesting that performance gaps between groups depend on the specific item being solved.
| Cases | SS | df | MS | F | p | $\eta_p^2$ | $\omega^2$ |
|---|---|---|---|---|---|---|---|
| **Within Subjects Effects** | | | | | | | |
| Item | 593.64ᵃ | 8ᵃ | 74.205ᵃ | 695.91ᵃ | < .001 | 0.493 | 0.412 |
| Item ∗ TestGroup | 48.95ᵃ | 24ᵃ | 2.040ᵃ | 19.13ᵃ | < .001 | 0.074 | 0.022 |
| Residuals | 610.78 | 5728 | 0.107 | | | | |
| **Between Subjects Effects** | | | | | | | |
| TestGroup | 91.14 | 3 | 30.381 | 92.73 | < .001 | 0.280 | 0.088 |
| Residuals | 234.59 | 716 | 0.328 | | | | |

ᵃ Mauchly's test indicates that the sphericity assumption is violated ($p < .05$).

Table 6: Repeated Measures ANOVA (Test Group). Comparison of Within-Subjects and Between-Subjects effects, including effect sizes ($\eta_p^2$ and $\omega^2$).

As illustrated in Figure [5](https://arxiv.org/html/2605.16011#A7.F5), while all groups struggled similarly on the most difficult items (e.g., Item 7), the lower-performing group (bottom line) showed significantly higher divergence on Items 1, 3, and 5 compared to the clustered higher-performing groups.
Figure 5: Descriptive Plots by Test Group. Performance stratification across 9 items. The x-axis represents the items, and the y-axis represents the normalized score. One group (open circles) consistently underperforms compared to the others.
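The partial eta squared values in Table 6 follow the standard definition $\eta_p^2 = SS_{\text{effect}} / (SS_{\text{effect}} + SS_{\text{error}})$, where the error term is the residual sum of squares from the same block of the table. A minimal sketch that recomputes the three reported values from the table's sums of squares is given below; the function name is hypothetical.

```python
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta squared: effect variance over effect-plus-error variance."""
    return ss_effect / (ss_effect + ss_error)


if __name__ == "__main__":
    # (effect sum of squares, matching residual sum of squares) from Table 6.
    effects = {
        "TestGroup": (91.14, 234.59),          # between-subjects residuals
        "Item": (593.64, 610.78),              # within-subjects residuals
        "Item x TestGroup": (48.95, 610.78),   # within-subjects residuals
    }
    for name, (ss_effect, ss_error) in effects.items():
        print(f"{name}: {partial_eta_squared(ss_effect, ss_error):.3f}")
```

For TestGroup this gives 91.14 / (91.14 + 234.59), which is the source of the 28% figure quoted in the text.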
## Appendix H Case Profiles
To qualitatively illustrate the adaptive capabilities of the evaluated models, we present a case study of a Grade 4 measurement problem. Figure [6](https://arxiv.org/html/2605.16011#A8.F6) displays the responses generated by the model for the same math question across three distinct learner profiles: High-performance, Intermediate, and Low-performance. This visual comparison demonstrates how the model adjusts its pedagogical strategy, ranging from direct calculation to scaffolded instruction, based on the learner's stated confidence and prior knowledge.
Figure 6: Case profiles for three learner profiles in G4Q5.
## Appendix I Example Responses Across Context Conditions
Table [7](https://arxiv.org/html/2605.16011#A9.T7) illustrates how an O1 model adapts its explanations across four context conditions (G1–G4) for a Grade 4 learner solving a balance scale problem.
| Condition | Prompt / Learner Profile | Model Response |
|---|---|---|
| G1 | Grade 4, low prior knowledge | A concise way to see this is: • From the first balance, the pineapple must weigh 1 kg + 2 kg = 3 kg. • The second balance shows that the 3 kg pineapple plus 2 kg equals the unlabeled weight, so that weight is 5 kg. • Finally, the third balance has that 5 kg weight plus 1 kg equals the watermelon's mass. Hence, the watermelon weighs 6 kg. |
| G2 | Grade 4, low prior knowledge + partial context | Here's one way to think this through step-by-step: 1) Top scale: 1 kg + 2 kg = pineapple → 3 kg. 2) Middle scale: pineapple (3 kg) + 2 kg = single heavy weight → 5 kg. 3) Bottom scale: 5 kg + 1 kg = watermelon → 6 kg. Tip: Writing equations for each scale helps solve the puzzle step by step. |
| G3 | Grade 4, low prior knowledge + full context | Let's solve it step by step: 1) Top scale: 1 kg + 2 kg = pineapple → 3 kg. 2) Middle scale: pineapple (3 kg) + 2 kg = anvil → 5 kg. 3) Bottom scale: anvil (5 kg) + 1 kg = watermelon → 6 kg. GOOD WAY: Assign a “mystery number” to each object and solve each scale sequentially. |
| G4 | Grade 4, low prior knowledge + full context + learner motivation | Let's work through each scale, step by step: 1) Top: 1 kg + 2 kg = pineapple → 3 kg. 2) Middle: pineapple (3 kg) + 2 kg = unlabeled weight → 5 kg. 3) Bottom: 5 kg + 1 kg = watermelon → 6 kg. Therefore, the watermelon weighs 6 kilograms. |

Table 7: O1 model responses across context conditions (G1–G4) for a Grade 4 learner on a balance scale problem. Increasing context leads to more step-by-step guidance and scaffolding.
## Appendix J Prompt Design Strategies
This section provides a comprehensive documentation of the evaluation rubric, presenting the specific criteria, scoring hierarchies, and illustrative examples utilized in the study\. To ensure full transparency and reproducibility, the content below reproduces the definitions exactly as they appear in the evaluation framework\. Each entry details the theoretical sub\-dimension \(e\.g\., Cognitive vs\. Motivational aspects\), the precise conditions required for each score \(e\.g\., the distinction between a score of 0 and 1\), and the reference examples that serve as ground truth for annotators\.
Table 8: User prompts. User prompts across learner profiles (high-performance, intermediate-performance, and low-performance) and grade levels (grade 4 and 8).
## Appendix K Learning Goals & Content Areas
Table 9: Content Areas & Learning Goals. TIMSS mathematical content areas and learning goals in Grade 4.

Table 10: Content Areas & Learning Goals. TIMSS mathematical content areas and learning goals in Grade 8.