Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

arXiv cs.AI 06/10/26, 04:00 AM Papers
Summary
This paper introduces EngVQA, a multimodal benchmark for evaluating engineering reasoning in vision-language models, along with an 8-stage automatic evaluation framework that enables fine-grained analysis of reasoning failures. It reveals substantial limitations in current VLMs' engineering reasoning capabilities.
arXiv:2606.10833v1 Announce Type: new Abstract: Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi-step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision-making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark for evaluating engineering reasoning across 5 engineering subjects containing 696 problems. We introduce an 8-stage automatic evaluation framework for assessing VLM-generated solutions. The framework independently evaluates each stage of the solution, enabling fine-grained analysis of reasoning failures. We benchmark multiple state-of-the-art open and closed source VLMs on our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10-point grading scale. Our results highlight the importance of process-oriented evaluation for reliable assessment of multimodal engineering reasoning systems.
Cached at: 06/10/26, 06:17 AM
# Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation
Source: [https://arxiv.org/html/2606.10833](https://arxiv.org/html/2606.10833)
Syed Wasiq\*, Syed Mohamad Tawseeq\*, Yashwant Pravinrao Bangde, and Debaditya Roy

syedwasiq12@kgpian\.iitkgp\.ac\.in, tawseeq@kgpian\.iitkgp\.ac\.in, yashwant@kgpian\.iitkgp\.ac\.in, debaditya@cse\.iitkgp\.ac\.in

Indian Institute of Technology Kharagpur

###### Abstract

Vision\-Language Models \(VLMs\) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored\. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi\-step reasoning\. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision\-making, where reasoning failures may produce physically invalid yet superficially plausible solutions\. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes\. We introduce EngVQA, a multimodal benchmark for evaluating engineering reasoning across 5 engineering subjects containing 696 problems\. We introduce an 8\-stage automatic evaluation framework for assessing VLM\-generated solutions\. The framework independently evaluates each stage of the solution, enabling fine\-grained analysis of reasoning failures\. We benchmark multiple state\-of\-the\-art open and closed source VLMs on our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities\. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0\.975 and a mean absolute error of 0\.67 on a 10\-point grading scale\. Our results highlight the importance of process\-oriented evaluation for reliable assessment of multimodal engineering reasoning systems\.

## 1 Introduction

Recent Vision\-Language Models \(VLMs\) such as GPT\-4V \(OpenAI, [2023](https://arxiv.org/html/2606.10833#bib.bib1)\) and Gemini Pro \(Gemini, [2025](https://arxiv.org/html/2606.10833#bib.bib2)\) have significantly advanced multimodal reasoning and visual understanding\. However, existing benchmarks including VQA \(Agrawal et al\., [2016](https://arxiv.org/html/2606.10833#bib.bib3)\), GQA \(Hudson and Manning, [2019](https://arxiv.org/html/2606.10833#bib.bib4)\), ScienceQA \(Lu et al\., [2022](https://arxiv.org/html/2606.10833#bib.bib5)\), and MathVista \(Lu et al\., [2024](https://arxiv.org/html/2606.10833#bib.bib6)\) primarily evaluate symbolic reasoning and final\-answer correctness, providing limited assessment of physically grounded engineering reasoning involving technical diagrams, governing equations, and multi\-stage analytical workflows\.

Recent benchmarks such as EngiBench \(Zhou et al\., [2026](https://arxiv.org/html/2606.10833#bib.bib7)\) and EEE\-Bench \(Li et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib8)\) move toward engineering\-oriented evaluation, yet important limitations remain\. EngiBench primarily focuses on textual engineering reasoning and capability\-level rubric evaluation without explicit stage\-wise decomposition of intermediate reasoning\. EEE\-Bench emphasizes multimodal understanding in electrical and electronics engineering, but still largely evaluates final\-answer correctness rather than structured engineering workflows\. Recent reasoning\-aware and process\-oriented evaluation frameworks including G\-Eval \(Liu et al\., [2023a](https://arxiv.org/html/2606.10833#bib.bib9)\), Prometheus \(Kim et al\., [2024](https://arxiv.org/html/2606.10833#bib.bib10)\), ProcessBench \(Zheng et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib11)\), and Thinking\-LLM\-as\-a\-Judge \(Saha et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib12)\) demonstrate that structured reasoning\-aware evaluation improves automated assessment reliability\.

To address these limitations, we introduce EngVQA, a benchmark of authentic engineering problems spanning 5 subjects: Fluid Mechanics, Heat and Mass Transfer, Dynamics, Mechanics of Materials, and Thermodynamics\. The benchmark requires joint reasoning over technical diagrams, physical principles, symbolic derivations, and multi\-step quantitative analysis\.

Table 1: Comparison of multimodal scientific and engineering reasoning benchmarks along dimensions of engineering realism, technical diagram understanding, process\-level evaluation, and physics\-aware reasoning constraints\.

Building on the benchmark, we propose EngJudge, an 8\-stage process\-oriented evaluation framework that decomposes engineering solutions into interpretable reasoning stages while modeling dependency\-aware error propagation across interrelated reasoning stages\. EngJudge independently evaluates localized reasoning stages, improving interpretability and reducing evaluator ambiguity compared to holistic evaluation approaches\.

To validate the reliability of EngJudge, we also conduct a human validation study with engineering students\. Our findings show that the framework’s automated scores align closely with human expert grading philosophies, demonstrating its potential as a reliable tool for structured, process\-oriented evaluation\. Table [1](https://arxiv.org/html/2606.10833#S1.T1) summarizes the differences between existing benchmarks and our approach\. Our contributions are as follows:

- We introduce EngVQA, a multimodal benchmark of 696 authentic engineering problems requiring reasoning over technical diagrams, physical principles, symbolic derivations, and multi\-step quantitative analysis\.
- We propose EngJudge, an 8\-stage process\-oriented evaluation framework that models dependency\-aware reasoning failures and achieves strong agreement with expert human evaluators\.
- We demonstrate that SOTA VLMs exhibit substantial weaknesses in engineering reasoning, particularly in diagram interpretation, equation selection, assumption validation, and physically consistent multi\-stage analysis\.

## 2 Related Work

##### Engineering and Scientific Reasoning Benchmarks

Recent benchmarks have explored scientific and engineering reasoning in large language and vision\-language models across diverse domains\. General multimodal reasoning benchmarks such as MMMU \(Yue et al\., [2024](https://arxiv.org/html/2606.10833#bib.bib13)\) evaluate broad\-domain visual reasoning across university\-level subjects, while scientific reasoning datasets such as SciBench \(Wang et al\., [2024](https://arxiv.org/html/2606.10833#bib.bib14)\), ScienceQA \(Lu et al\., [2022](https://arxiv.org/html/2606.10833#bib.bib5)\), and MathVista \(Lu et al\., [2024](https://arxiv.org/html/2606.10833#bib.bib6)\) focus primarily on scientific problem solving, symbolic reasoning, and final\-answer correctness\. More recent engineering\-oriented benchmarks have extended multimodal evaluation into STEM and applied engineering domains\. EngiBench \(Zhou et al\., [2026](https://arxiv.org/html/2606.10833#bib.bib7)\) evaluates engineering\-focused question answering tasks, while EEE\-Bench \(Li et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib8)\) introduces multimodal reasoning problems in electrical engineering involving circuit diagrams and technical schematics\. SeePhys \(Xiang et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib15)\) studies visually grounded physics reasoning, and CSVQA \(Jian et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib16)\) explores multimodal STEM reasoning in educational settings\. Existing benchmarks predominantly evaluate reasoning through final\-answer correctness or holistic solution\-level scoring\.

##### Stage\-wise Reasoning and Process Evaluation

Recent work has shown that final\-answer correctness alone is insufficient for evaluating reasoning quality in modern language and vision\-language models \(Lightman et al\., [2023](https://arxiv.org/html/2606.10833#bib.bib17); Golovneva et al\., [2023](https://arxiv.org/html/2606.10833#bib.bib18)\)\. As models increasingly generate long\-form chain\-of\-thought reasoning, evaluating intermediate reasoning behavior has become important for understanding logical consistency, factual correctness, and reasoning reliability\.

ROSCOE \(Golovneva et al\., [2023](https://arxiv.org/html/2606.10833#bib.bib18)\) introduces fine\-grained metrics for evaluating generated reasoning traces across semantic and logical consistency dimensions\. Lightman et al\. \([2023](https://arxiv.org/html/2606.10833#bib.bib17)\) demonstrate that process supervision can improve mathematical reasoning by rewarding intermediate reasoning correctness rather than relying solely on final answers\. ProcessBench \(Zheng et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib11)\) studies process\-level reasoning failures in mathematical settings through stage\-wise verification of intermediate reasoning chains\.

##### LLM\-as\-a\-Judge and Structured Evaluation

Recent work has increasingly explored the use of large language models as automated evaluators for reasoning and generation tasks \(Zheng et al\., [2023](https://arxiv.org/html/2606.10833#bib.bib19); Liu et al\., [2023b](https://arxiv.org/html/2606.10833#bib.bib20); Kim et al\., [2024](https://arxiv.org/html/2606.10833#bib.bib10)\)\. G\-Eval \(Liu et al\., [2023b](https://arxiv.org/html/2606.10833#bib.bib20)\) demonstrates that rubric\-guided evaluation can improve the reliability and interpretability of LLM\-based assessment, while Prometheus \(Kim et al\., [2024](https://arxiv.org/html/2606.10833#bib.bib10)\) explores fine\-grained rubric\-conditioned evaluation strategies for scalable automatic assessment\. More recent frameworks investigate structured reasoning strategies for evaluation itself\. Thinking\-LLM\-as\-a\-Judge \(Saha et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib12)\) proposes planning\-oriented judging strategies in which evaluators explicitly reason through structured evaluation plans before assigning scores\.

## 3 EngVQA Benchmark

Table 2: Subject\-wise statistics of EngVQA\. ATPQ: average topics per question; FM: Fluid Mechanics; HMT: Heat and Mass Transfer; MoM: Mechanics of Materials; Thermo: Thermodynamics; Dyn: Dynamics\.

### 3\.1 Benchmark Principles

We chose problem\-solution pairs that align with EngVQA’s benchmark design principles: \(1\) Diagram\-Grounded Analytical Reasoning: Problems require extracting geometry, boundary conditions, force directions, flow structure, material interfaces, and spatial constraints directly from technical figures such as free\-body diagrams, thermodynamic plots, flow schematics, stress distributions, and engineering layouts\. \(2\) Structured Multi\-Stage Reasoning: Solutions involve multiple interdependent reasoning stages including problem characterization, assumption formulation, visual interpretation, equation selection, symbolic derivation, algebraic computation, and physical validation\. \(3\) Physics\-Constrained Engineering Workflows: Problems require physically valid reasoning under domain\-specific engineering constraints\. Models must maintain consistency between visual interpretation, governing equations, simplifying assumptions, and final quantitative predictions throughout the solution process\.

### 3\.2 Benchmark Statistics

Table [2](https://arxiv.org/html/2606.10833#S3.T2) summarizes the subject composition and reasoning diversity of EngVQA\. The benchmark spans five foundational engineering subjects and contains 696 problems requiring multimodal reasoning over technical diagrams, governing equations, symbolic derivations, and physical constraints\. Rather than concentrating on a small set of narrow problem templates, the benchmark includes diverse question distributions across different topics\. The subject distribution reflects both breadth and diversity in engineering reasoning workflows\. Dynamics and Mechanics of Materials emphasize free\-body analysis, force interactions, and rigid\-body reasoning, while Thermodynamics and Heat & Mass Transfer involve physically constrained energy\-system analysis, property relationships, and transport phenomena\. Fluid Mechanics problems additionally require spatial reasoning over flow structures, pressure distributions, and conservation laws\. Across all subjects, technical diagrams play a central role in downstream analytical formulation, making visual interpretation a necessary component of successful problem solving rather than an auxiliary context\. The average number of topics per question measures reasoning density across the domains\. This metric indicates that solving a typical problem in our benchmark requires integrating multiple physical concepts at once, which increases problem difficulty\. The full list of topics per subject is given in Appendix [B](https://arxiv.org/html/2606.10833#A2)\.

## 4 EngJudge: Stage\-wise Evaluation

Evaluating engineering reasoning requires more than final\-answer correctness because failures often emerge within intermediate reasoning stages such as assumption formulation, diagram interpretation, equation selection, and computation\. To capture this process\-level behavior, we develop EngJudge, a stage\-wise evaluation framework motivated by an error analysis of gemini\-2\.0\-flash\-exp solutions across 3 subjects: Fluid Mechanics, Heat and Mass Transfer \(HMT\), and Mechanics of Materials \(MoM\)\.

Our analysis shows that engineering reasoning failures are highly multifaceted and rarely occur in isolation\. We additionally observe that visual interpretation forms a structurally distinct failure mode, while error correlations suggest that engineering reasoning is best modeled as a partially dependent process rather than a fully independent one \(Appendix [A](https://arxiv.org/html/2606.10833#A1)\)\. These findings motivate two key design choices in EngJudge: independent evaluation of localized reasoning stages and dependency propagation along empirically observed reasoning edges\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/x1.png)

Figure 1: A representative example problem with an LLM\-generated solution, showing how errors from incorrect assumptions propagate through the rest of the solution\.

Figure [1](https://arxiv.org/html/2606.10833#S4.F1) illustrates a failure case in a fluid mechanics problem\. Here, a simple step\-by\-step average metric would penalize the integration step for having incorrect numerical coefficients, failing to recognize that the math was internally coherent but corrupted by prior errors\. By decoupling these stages and tracking the failure chain \($\text{Assumptions} \to \text{Equation Selection} \to \text{Algebraic Accuracy} \to \text{Final Answer}$\), EngJudge pinpoints the root cause of failure in physical modeling rather than mathematical execution\.

### 4\.1 Evaluation Framework

![Refer to caption](https://arxiv.org/html/2606.10833v1/x2.png)

Figure 2: Overview of the proposed EngJudge evaluation framework\. \(A\) A VLM generates a structured step\-wise solution for an engineering problem containing both text and a technical diagram\. \(B\) The solution is decomposed into eight reasoning stages, each independently evaluated using rubric\-guided LLM\-as\-a\-judge prompts \(App\. [F\.3](https://arxiv.org/html/2606.10833#A6.SS3)\) with penalty\-based scoring and fatal\-error detection\. \(C\) Stage scores are then propagated through a dependency graph \(colors correspond to steps in B\) and \(D\) aggregated using meta\-evaluation checks to produce a final interpretable score on a 0–10 scale\.

We introduce a multi\-stage, penalty\-based automatic evaluation framework, EngJudge, for grading LLM\-generated solutions to graduate\-level engineering problems\. Rather than relying solely on holistic or single\-score evaluation \(Zheng et al\., [2023](https://arxiv.org/html/2606.10833#bib.bib19); Liu et al\., [2023b](https://arxiv.org/html/2606.10833#bib.bib20)\), our framework decomposes each solution into eight structured reasoning stages and evaluates them individually through an LLM\-as\-a\-judge\. The framework combines penalty\-based stage scoring, graph\-based dependency propagation, meta\-evaluation checks, and exam\-style partial credit \(Mertler, [2001](https://arxiv.org/html/2606.10833#bib.bib21)\) to produce a single interpretable score on a 0–10 scale\. The full pipeline is illustrated in Figure [2](https://arxiv.org/html/2606.10833#S4.F2)\. The evaluation pipeline consists of the following components:

##### Solution Parsing:

We prompt the LLM to generate solutions using a fixed 8\-stage reasoning structure designed to separate different components of engineering problem\-solving, consistent with chain\-of\-thought prompting conventions \(Wei et al\., [2022](https://arxiv.org/html/2606.10833#bib.bib22); Kojima et al\., [2022](https://arxiv.org/html/2606.10833#bib.bib23)\)\. Each stage is enclosed within explicit tags so that the generated solution can be parsed automatically \(see Figure [2A](https://arxiv.org/html/2606.10833#S4.F2)\)\.

\#\#\#\#\#\# STAGE\_NAME \#\#\#\#\#\# \[stage content\] \#\#\#\#\#\# END\_STAGE \#\#\#\#\#\#

A rule\-based parser extracts each tagged block and produces an ordered list of reasoning stages\. The required stages are: Problem Characterization, Assumptions, Visual Interpretation, Equation Selection, Logical Reasoning, Algebraic Accuracy, Physical Interpretation, and Final Answer\.
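The rule-based parsing step can be sketched as follows. This is a minimal illustration, not the authors' parser: the upper-case, underscore-separated tag spelling (e.g. `EQUATION_SELECTION`), the regular expression, and the function name `parse_stages` are assumptions made for the example.

```python
import re

# Stage names as listed in the text. The exact tag spelling (upper-case with
# underscores) is an assumption made for this sketch.
REQUIRED_STAGES = [
    "PROBLEM_CHARACTERIZATION",
    "ASSUMPTIONS",
    "VISUAL_INTERPRETATION",
    "EQUATION_SELECTION",
    "LOGICAL_REASONING",
    "ALGEBRAIC_ACCURACY",
    "PHYSICAL_INTERPRETATION",
    "FINAL_ANSWER",
]

# Matches: ###### NAME ###### <content> ###### END_STAGE ######
STAGE_PATTERN = re.compile(
    r"######\s*(?P<name>[A-Z_]+)\s*######(?P<body>.*?)######\s*END_STAGE\s*######",
    re.DOTALL,
)


def parse_stages(solution_text):
    """Return (ordered list of (name, content) pairs, list of missing stage names)."""
    found = [
        (m.group("name"), m.group("body").strip())
        for m in STAGE_PATTERN.finditer(solution_text)
        if m.group("name") in REQUIRED_STAGES
    ]
    present = {name for name, _ in found}
    missing = [stage for stage in REQUIRED_STAGES if stage not in present]
    return found, missing
```

Returning the missing stage names alongside the parsed blocks keeps the information needed later for missing-stage deductions.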

##### Stage\-wise Evaluation:

Each reasoning stage is evaluated independently using a dedicated LLM\-as\-a\-judge prompt \(see Appendix [F](https://arxiv.org/html/2606.10833#A6)\) tailored to that stage\. Every stage begins with a score of 10, and penalties are deducted for detected reasoning errors\. The penalties fall into the four severity categories shown in Table [3](https://arxiv.org/html/2606.10833#S4.T3)\.

Table 3: Penalty severity scale\.

The raw stage score $Y_t$ is computed as:

$$Y_t = \max\left(0,\ 10 - \sum_{i} p_i\right) \tag{1}$$

where $p_i$ is the penalty assigned to the $i^{\text{th}}$ identified error in stage $t$\.
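Equation (1) can be written as a one-line function. The sketch below is illustrative only: the penalty magnitudes in the usage note are hypothetical, since the severity values from Table 3 are not reproduced here.

```python
def raw_stage_score(penalties):
    """Equation (1): start from 10, subtract all penalties, floor at 0."""
    return max(0.0, 10.0 - sum(penalties))
```

For instance, with hypothetical penalties of 1 and 3 points, `raw_stage_score([1.0, 3.0])` leaves 6 of the 10 points, while penalties that sum to more than 10 floor the score at 0 rather than going negative.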

##### Fatal Errors:

Certain critical mistakes cap the maximum score for a stage regardless of the accumulated penalties\. For example, if the governing equation is fundamentally incorrect, the Equation Selection stage immediately receives a score of 0\. Similarly, dimensionally inconsistent formulations or physically impossible results apply fixed score caps \(see Appendix [C\.2](https://arxiv.org/html/2606.10833#A3.SS2)\)\.

For the Final Answer stage, predicted numerical values are compared against the ground\-truth solution using relative error\. Errors within 5% receive no penalty, errors between 5% and 10% receive a moderate penalty, and errors above 10% receive a major penalty\. Missing units, incomplete answers, or physically impossible results \(e\.g\., negative absolute temperature\) receive additional penalties or score caps\.
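The relative-error tiers and the score caps can be sketched as below. The function names, the treatment of the exact 5% and 10% boundaries, and the handling of a zero reference value are assumptions of this example. It returns a severity label instead of a numeric penalty because the penalty magnitudes and cap values are defined in the paper's table and appendix, not here.

```python
def final_answer_tier(predicted, reference):
    """Map the relative error of a numerical answer to a penalty tier label."""
    if reference == 0:
        # Relative error is undefined for a zero reference; how such cases are
        # handled is not stated in the text, so this sketch refuses them.
        raise ValueError("relative error is undefined for a zero reference")
    rel_err = abs(predicted - reference) / abs(reference)
    if rel_err <= 0.05:   # within 5%: no penalty
        return "none"
    if rel_err <= 0.10:   # between 5% and 10%: moderate penalty
        return "moderate"
    return "major"        # above 10%: major penalty


def apply_fatal_cap(score, cap):
    """A fatal error limits the stage score to `cap`, whatever the penalties were."""
    return min(score, cap)
```

A cap of 0 reproduces the Equation Selection example above: `apply_fatal_cap(8.0, 0.0)` gives 0 even though the penalty-based score was 8.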

##### Graph\-Based Dependency Propagation:

After computing the stage\-wise scores, we propagate upstream reasoning failures through a dependency graph\. This design is motivated by the error analysis discussed in Section [4](https://arxiv.org/html/2606.10833#S4), which showed that many engineering reasoning errors propagate across stages\.

We model these relationships using the dependency graph shown in Figure [2C](https://arxiv.org/html/2606.10833#S4.F2)\. Each stage score is adjusted using the scores of its parent stages\. Let $Y_t$ denote the raw score of stage $t$ obtained from the penalty\-based evaluation\. Let $P(t)$ denote the set of parent stages connected to stage $t$, and let $N_t$ be the number of parent stages \($N_t = \text{len}(P(t))$\)\. Let $S_t$ and $S_p$ denote the final propagated scores of stage $t$ and its parent stage\(s\) $p \in P(t)$, respectively\. For stages with no parents, i\.e\., Stage 1, we set $S_1 = Y_1$\. For all other stages, the dependency score is computed as:

$$S_t = Y_t \times \frac{1}{N_t} \sum_{p \in P(t)} \frac{S_p}{10} \tag{2}$$
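A small sketch of Equation (2) follows. The edge set in `PARENTS` is hypothetical: the actual dependency graph is the one drawn in Figure 2C, which is not reproduced in the text, so the edges here were chosen only to contain the Assumptions → Equation Selection → Algebraic Accuracy → Final Answer chain mentioned earlier. The two-letter stage codes follow the abbreviations used in the paper's plots.

```python
# Hypothetical dependency graph (child -> parents); NOT the graph from Figure 2C.
PARENTS = {
    "PC": [],            # Problem Characterization (Stage 1, no parents)
    "AS": ["PC"],        # Assumptions
    "VI": ["PC"],        # Visual Interpretation
    "ES": ["AS", "VI"],  # Equation Selection
    "LR": ["ES"],        # Logical Reasoning
    "AA": ["ES", "LR"],  # Algebraic Accuracy
    "PI": ["AA"],        # Physical Interpretation
    "FA": ["AA"],        # Final Answer
}
# Topological order: every stage appears after all of its parents.
ORDER = ["PC", "AS", "VI", "ES", "LR", "AA", "PI", "FA"]


def propagate(raw_scores):
    """Equation (2): scale each raw score Y_t by the mean of its parents' S_p / 10."""
    propagated = {}
    for stage in ORDER:
        parents = PARENTS[stage]
        if not parents:
            propagated[stage] = raw_scores[stage]  # S_1 = Y_1
        else:
            mean_parent = sum(propagated[p] for p in parents) / len(parents)
            propagated[stage] = raw_scores[stage] * mean_parent / 10.0
    return propagated
```

Under this hypothetical graph, a solution whose only flawed stage is Assumptions (raw score 5, all others 10) ends up with 7.5 for Equation Selection and every stage downstream of it, which is the cascading behavior the propagation rule is meant to capture.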

##### Score Aggregation and Final Penalty Gates:

The final score is computed from the dependency\-aware stage scores obtained after propagation\. Let $S_t$ denote the propagated score for stage $t$, and let $N$ denote the number of reasoning stages\. The base score is computed as:

$$S_{\text{base}} = \frac{1}{N} \sum_{t=1}^{N} S_t \tag{3}$$

Missing reasoning stages receive fixed deductions:

$$S_{\text{blend}} = \max(0,\, S_{\text{base}} - \text{MP}) \tag{4}$$

where MP denotes the total missing\-stage penalty\. The blended score is then adjusted using three meta\-evaluation checks applied to the complete solution:

- VERBOSITY penalizes excessive repetition, filler reasoning, and unnecessarily long solutions that do not contribute meaningful engineering reasoning \(Zheng et al\., [2023](https://arxiv.org/html/2606.10833#bib.bib19)\)\.
- COVERAGE checks whether all requested quantities, sub\-parts, assumptions, and reasoning steps are addressed in the final solution\.
- SANITY\_FAIL detects physically impossible or internally inconsistent outputs, such as negative absolute temperature, invalid units, or violations of conservation laws\.

The resulting penalties are applied multiplicatively during final score aggregation\. Final score is calculated as:

$$S_{\text{final}} = S_{\text{blend}} \cdot (1 - \mathtt{COVERAGE}) \cdot (1 - \mathtt{VERBOSITY}) \cdot 0.9^{\mathtt{SANITY\_FAIL}} \tag{5}$$
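Equations (3) to (5) chain together as in the sketch below. Two points are assumptions of this example, not statements from the paper: COVERAGE and VERBOSITY are treated as fractional penalties in the range 0 to 1, and SANITY\_FAIL is treated as a 0/1 indicator (or a count) used as the exponent of 0.9. The numbers in the usage note are made up.

```python
def final_score(propagated_scores, missing_penalty=0.0,
                coverage=0.0, verbosity=0.0, sanity_fail=0):
    """Combine propagated stage scores into a single 0-10 score."""
    # Equation (3): mean of the propagated stage scores.
    s_base = sum(propagated_scores) / len(propagated_scores)
    # Equation (4): subtract the total missing-stage penalty, floored at 0.
    s_blend = max(0.0, s_base - missing_penalty)
    # Equation (5): multiplicative meta-evaluation gates.
    return s_blend * (1.0 - coverage) * (1.0 - verbosity) * 0.9 ** sanity_fail
```

With eight propagated scores of 8, a missing-stage penalty of 1, a coverage penalty of 0.1, and one sanity failure, the score is 7 × 0.9 × 0.9 = 5.67. Because the gates multiply, each one removes a proportion of what remains, not a fixed number of points.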

## 5 Experiments

### 5\.1 Benchmark Construction Pipeline

Question PDFs are processed using a semi\-automated extraction pipeline combining Adobe PDF Extract API \([https://developer\.adobe\.com/document\-services/apis/pdf\-extract/](https://developer.adobe.com/document-services/apis/pdf-extract/)\), PyMuPDF \([https://pymupdf\.readthedocs\.io/en/latest/](https://pymupdf.readthedocs.io/en/latest/)\), and Gemini\-3\.1\-Flash\-Lite to recover layouts, figures, equations, and symbolic notation while preserving multimodal structure\. All extracted questions, diagrams, metadata, and solutions are manually verified against the reference material\.

### 5\.2 Experimental Settings

We evaluate two vision\-language models as solution generators: Qwen3\-VL\-8B and Gemini\-2\.5\-Flash, representing compact open\-weight and proprietary frontier models, respectively\. We use Chain\-of\-Thought \(CoT\) prompting \(Wei et al\., [2022](https://arxiv.org/html/2606.10833#bib.bib22)\) to trace step\-by\-step physical and mathematical reasoning \(see Appendix [F](https://arxiv.org/html/2606.10833#A6)\)\. We use Gemini\-3\.1\-Pro\-Preview as the primary evaluator due to its strong multimodal reasoning capabilities, and additionally employ Qwen3\-VL\-32B\-Instruct to assess cross\-model agreement and reduce potential self\-evaluation bias\.

##### Baseline Evaluator Setup

To quantify the benefits of EngJudge’s structured design, we implement two single\-pass baselines\. In both baseline configurations, the evaluator receives the entire problem statement, the VLM’s solution, and the ground\-truth reference, and predicts scores for all eight reasoning stages in a single LLM call \(see Appendix [F\.2](https://arxiv.org/html/2606.10833#A6.SS2)\)\. The first baseline \(SinglePass\) averages these raw stage scores \($Y_t$ in Equation [1](https://arxiv.org/html/2606.10833#S4.E1)\) directly\. The second baseline \(SinglePass \+ DP\) applies our directed acyclic graph dependency propagation \(DP\) rules to obtain dependency scores \($S_t$ in Equation [2](https://arxiv.org/html/2606.10833#S4.E2)\) before averaging\.

### 5\.3 Results

![Refer to caption](https://arxiv.org/html/2606.10833v1/x3.png)

Figure 3: Cross\-evaluator comparison across the five engineering subjects in EngVQA\. Each radar plot represents a generator–evaluator pair\. SinglePass uses a single LLM call and averages stage\-wise raw scores $Y_t$ \(Equation [1](https://arxiv.org/html/2606.10833#S4.E1)\), while SinglePass\+DP averages stage\-wise dependency scores $S_t$ \(Equation [2](https://arxiv.org/html/2606.10833#S4.E2)\)\. All scores range from 0–10\.

The cross\-evaluator performance comparison across the five engineering subjects is summarized in Figure [3](https://arxiv.org/html/2606.10833#S5.F3)\. Gemini\-2\.5\-Flash outperforms Qwen3\-VL\-8B in every evaluated subject\. Both models struggle most in Fluid Mechanics and Heat & Mass Transfer, the lowest\-scoring subjects\. We attribute this difficulty to the complex multi\-phase physics, differential boundary conditions, and spatial geometry integration required in these fields\. Conversely, performance is higher in Thermodynamics and Mechanics of Materials, where problems tend to rely more on algebraic path\-following and discrete\-state equations\. Nonetheless, no generator model exceeds a mean score of 4 in any subject under EngJudge, highlighting the difficulty of these engineering questions for current state\-of\-the\-art VLMs\.

We also observe a large performance discrepancy between the baseline and the EngJudge evaluation framework\. When evaluated by the Gemini\-3\.1\-Pro\-Preview \(SinglePass\) one\-shot judge, Gemini\-2\.5\-Flash receives a highly optimistic overall score of 8\.001\. However, when subjected to the structured, dependency\-aware grading of EngJudge, its score drops to 2\.869\. We attribute this gap to a leniency bias in one\-shot LLM judges, which often overlook mathematical errors, missing justifications, or misinterpreted visuals if the final text appears superficially plausible\.

Figure 4: Stage\-wise scores for SinglePass, SinglePass \+ DP, and EngJudge\. The x\-axis lists the stages \(PC, AS, VI, ES, LR, AA, PI, FA\) and the y\-axis shows the score \(0–10\)\.

We observe that the gap between SinglePass and SinglePass \+ DP is small and relatively constant \(ranging from 0\.11 to 0\.65 points\)\. This indicates that mathematical error propagation alone is insufficient to address the overestimation of LLM capabilities\. Instead, the large drop to the EngJudge curve \(a gap of 2\.5 to 4\.1 points across all stages\) is primarily driven by the framework’s granular penalty rubrics, fatal error caps, and global meta\-checks\.

We also observe that the curves reveal a dichotomy between problem setup and execution\. During the initial setup stages, scores remain relatively high across all configurations \(remaining above 5\.4 even under EngJudge\)\. However, once the model transitions to the execution stages, namely Logical Reasoning \(LR\), Algebraic Accuracy \(AA\), Physical Interpretation \(PI\), and the Final Answer \(FA\), performance drops significantly\. Under EngJudge, the score bottoms out at 2\.74 for AA, indicating that algebraic calculation and sequential execution represent the *primary cognitive bottlenecks* for the generator model\.

Table 4: Component ablation of EngJudge using Gemini 3\.1 Pro Preview as the evaluator\.

##### Component Ablation of EngJudge\.

Table [4](https://arxiv.org/html/2606.10833#S5.T4) quantifies the contribution of each framework component\. Removing dependency propagation based on our causal graph \(*w/o DP*\) substantially increases scores for both Gemini 2\.5 Flash and Qwen3\-VL\-8B, showing that dependency\-aware propagation is the primary source of grading rigor by penalizing downstream reasoning built on incorrect upstream formulations\. Removing meta\-evaluation checks further increases scores, indicating frequent violations of coverage and verbosity constraints in generated solutions\. In contrast, removing missing\-stage penalties has minimal impact on Gemini, suggesting strong adherence to the required structured reasoning format\.

We also conduct an empirical analysis of stage\-wise score correlations using the Pearson correlation matrix of raw scores across all stages \(Figure [5](https://arxiv.org/html/2606.10833#S5.F5)\)\. We observe strong correlations between adjacent reasoning stages, notably between Algebraic Accuracy \(AA\) and the Final Answer \(FA\) \($r = 0.69$\), and between Equation Selection \(ES\) and Logical Reasoning \(LR\) \($r = 0.53$\)\.

Conversely, correlations between early conceptual stages and final calculations are negligible \(e\.g\., Problem Characterization \(PC\) vs\. Final Answer yields $r = 0.06$\)\. These strong, localized correlations are consistent with error propagation between neighboring stages and support our dependency graph design: errors in prerequisite stages cascade to compromise downstream execution\. Detailed analysis of these correlations and their relationship in the dependency graph is provided in Appendix [D\.3\.2](https://arxiv.org/html/2606.10833#A4.SS3.SSS2)\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/x4.png)

Figure 5: Pearson correlation matrix between stage\-wise raw scores\.

## 6 Human Study for Validating EngJudge

To validate the automated evaluations produced by EngJudge \(Sec\. [4](https://arxiv.org/html/2606.10833#S4)\), we conducted a blinded human study to assess the alignment between EngJudge’s structured scoring and human expert intuition\. The evaluation panel consisted of 9 undergraduate and graduate students with strong backgrounds and practical experience in the chosen engineering subjects\.

Evaluators were presented with randomized engineering problems, generated solutions from Gemini 2\.5 Flash, and EngJudge\-assigned scores for each stage through a web interface \(see Appendix Figs\. [14](https://arxiv.org/html/2606.10833#A5.F14)\-[21](https://arxiv.org/html/2606.10833#A5.F21)\)\. They rated EngJudge’s stage\-wise and overall scores on 4 randomly selected questions\. EngJudge aligned closely with human judgment, achieving a Pearson correlation coefficient of 0\.975 and a Mean Absolute Error \(MAE\) of 0\.67 \(on a 10\-point scale\)\. When assessing EngJudge’s overall score, 66\.7% of the evaluators rated the overall system as “Good” or “Very Good” and 22\.3% rated it as average\.
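The two agreement statistics reported above can be computed as in the following sketch. The function names are hypothetical, and the sketch does not use the study's data or reproduce its exact aggregation procedure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(xs, ys):
    """Average absolute difference between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```

The two measures are complementary: Pearson correlation captures whether automated and human scores rise and fall together, while MAE captures how far apart they are on the 10-point scale, so a judge that is consistently one point too harsh would show high correlation but a nonzero MAE.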

In another setting, we asked human evaluators to compare, in a blind A/B format, EngJudge’s dependency\-aware score against a naive average obtained by simply averaging the scores assigned to the eight stages\. Evaluators explicitly favored the dependency score in 66\.7% of cases\. This supports our hypothesis that engineering reasoning evaluation requires hierarchical, cascading penalization rather than isolated stage\-wise grading\. A comprehensive breakdown of the methodology, graphical visualizations, and granular stage\-level metrics is provided in Appendix [E](https://arxiv.org/html/2606.10833#A5)\.

## 7 Conclusion

In this paper, we introduced EngVQA, a challenging benchmark of graduate\-level engineering physics problems spanning five core engineering disciplines\. We also introduced EngJudge, an evaluation framework that uses dependency\-aware score propagation to capture compounding physical and mathematical errors in engineering reasoning\. Evaluation of SOTA VLMs on EngJudge highlights reasoning collapses due to visual\-geometric and algebraic failures\. EngJudge shows excellent alignment with human expert judgment, and its dependency\-aware framework is preferred over averaging\. We believe EngJudge can be a robust diagnostic tool to guide the development of physically grounded multimodal reasoning agents for engineering applications and interactive tutoring systems\.

## Limitations

##### Data Contamination\.

We cannot fully rule out benchmark contamination from model pretraining data\. However, the benchmark emphasizes multimodal understanding and multi\-step engineering reasoning, making direct memorization alone insufficient\.

##### Computational Cost of Evaluation\.

Our framework requires 11 LLM judge calls per solution, making evaluation significantly more computationally expensive than grading with a single judge call\. Although KV caching can reduce redundant computation \(Pope et al\., [2022](https://arxiv.org/html/2606.10833#bib.bib24)\), evaluation cost may still limit scalability in large\-scale or resource\-constrained settings\.
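
To give a rough sense of scale, the sketch below counts judge calls for one full pass over the 696\-problem benchmark\. The helper name and the `num_models` parameter are hypothetical, and the count ignores retries or caching\.

```python
def total_judge_calls(num_problems, calls_per_solution=11, num_models=1):
    """Number of LLM judge calls needed to grade every solution once."""
    return num_problems * calls_per_solution * num_models

# 696 problems (benchmark size) with 11 judge calls per solution.
print(total_judge_calls(696))                 # one evaluated model
print(total_judge_calls(696, num_models=5))   # hypothetical: five evaluated models
```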

##### LLM Judge Reliability and Bias\.

EngJudge inherits known limitations of LLM\-as\-a\-judge evaluation, including potential biases and imperfect alignment with human judgment\. Although human validation demonstrates strong agreement, such biases cannot be fully eliminated \(Huang et al\., [2025](https://arxiv.org/html/2606.10833#bib.bib25)\)\.

## Ethics Statement

This work complies with ethical guidelines concerning data sourcing, human studies, and computational deployment\. The evaluation dataset comprises standard academic engineering physics problems\. We conducted the human study with volunteer graduate engineering students; participation was entirely voluntary, all data was fully anonymized, and no PII or demographic data was collected\. Prior to the study, participants were presented with explicit instructions explaining how their evaluation data would be used, and they provided informed consent before proceeding\. Furthermore, all evaluated models were accessed in accordance with their developer terms of service and license agreements, using local configurations on consumer\-grade hardware to minimize computational energy consumption and environmental footprint\.

## References

- OpenAI \[2023\] OpenAI\. Gpt\-4 technical report\. *arXiv preprint arXiv:2303\.08774*, 2023\.
- Gemini \[2025\] Team Gemini\. Gemini: A family of highly capable multimodal models, 2025\. URL [https://arxiv\.org/abs/2312\.11805](https://arxiv.org/abs/2312.11805)\.
- Agrawal et al\. \[2016\] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C\. Lawrence Zitnick, Dhruv Batra, and Devi Parikh\. Vqa: Visual question answering, 2016\. URL [https://arxiv\.org/abs/1505\.00468](https://arxiv.org/abs/1505.00468)\.
- Hudson and Manning \[2019\] Drew A\. Hudson and Christopher D\. Manning\. Gqa: A new dataset for real\-world visual reasoning and compositional question answering\. In *2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*, pages 6693–6702, 2019\. doi:10\.1109/CVPR\.2019\.00686\.
- Lu et al\. \[2022\] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai\-Wei Chang, Song\-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan\. Learn to explain: Multimodal reasoning via thought chains for science question answering\. In *The 36th Conference on Neural Information Processing Systems \(NeurIPS\)*, 2022\.
- Lu et al\. \[2024\] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai\-Wei Chang, Michel Galley, and Jianfeng Gao\. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024\. URL [https://arxiv\.org/abs/2310\.02255](https://arxiv.org/abs/2310.02255)\.
- Zhou et al\. \[2026\] Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, and Junhua Zhao\. Engibench: A benchmark for evaluating large language models on engineering problem solving, 2026\. URL [https://arxiv\.org/abs/2509\.17677](https://arxiv.org/abs/2509.17677)\.
- Li et al\. \[2025\] Ming Li, Jike Zhong, Tianle Chen, Yuxiang Lai, and Konstantinos Psounis\. Eee\-bench: A comprehensive multimodal electrical and electronics engineering benchmark\. In *2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*, pages 13337–13349, 2025\. doi:10\.1109/CVPR52734\.2025\.01245\.
- Liu et al\. \[2023a\] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu\. G\-eval: NLG evaluation using gpt\-4 with better human alignment\. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 2511–2522, Singapore, December 2023a\. Association for Computational Linguistics\. doi:10\.18653/v1/2023\.emnlp\-main\.153\. URL [https://aclanthology\.org/2023\.emnlp\-main\.153/](https://aclanthology.org/2023.emnlp-main.153/)\.
- Kim et al\. \[2024\] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo\. Prometheus: Inducing fine\-grained evaluation capability in language models, 2024\. URL [https://arxiv\.org/abs/2310\.08491](https://arxiv.org/abs/2310.08491)\.
- Zheng et al\. \[2025\] Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin\. Processbench: Identifying process errors in mathematical reasoning, 2025\. URL [https://arxiv\.org/abs/2412\.06559](https://arxiv.org/abs/2412.06559)\.
- Saha et al\. \[2025\] Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, and Tianlu Wang\. Learning to plan & reason for evaluation with thinking\-llm\-as\-a\-judge, 2025\. URL [https://arxiv\.org/abs/2501\.18099](https://arxiv.org/abs/2501.18099)\.
- Yue et al\. \[2024\] Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen\. Mmmu: A massive multi\-discipline multimodal understanding and reasoning benchmark for expert agi\. In *2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*, pages 9556–9567, 2024\. doi:10\.1109/CVPR52733\.2024\.00913\.
- Wang et al\. \[2024\] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R\. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang\. Scibench: evaluating college\-level scientific problem\-solving abilities of large language models\. In *Proceedings of the 41st International Conference on Machine Learning*, ICML’24\. JMLR\.org, 2024\.
- Xiang et al\. \[2025\] Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu\-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, and Xiaodan Liang\. Seephys: Does seeing help thinking? – benchmarking vision\-based physics reasoning, 2025\. URL [https://arxiv\.org/abs/2505\.19099](https://arxiv.org/abs/2505.19099)\.
- Jian et al\. \[2025\] Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, and Xuchen Song\. Csvqa: A chinese multimodal benchmark for evaluating stem reasoning capabilities of vlms, 2025\. URL [https://arxiv\.org/abs/2505\.24120](https://arxiv.org/abs/2505.24120)\.
- Lightman et al\. \[2023\] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\. Let’s verify step by step\. *ArXiv*, abs/2305\.20050, 2023\. URL [https://api\.semanticscholar\.org/CorpusID:258987659](https://api.semanticscholar.org/CorpusID:258987659)\.
- Golovneva et al\. \[2023\] Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel\-Zarandi, and Asli Celikyilmaz\. ROSCOE: A suite of metrics for scoring step\-by\-step reasoning\. In *International Conference on Learning Representations*, 2023\. URL [https://openreview\.net/forum?id=xYlJRpzZtsY](https://openreview.net/forum?id=xYlJRpzZtsY)\.
- Zheng et al\. \[2023\] Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P\. Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\. Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\. In *Proceedings of the 37th International Conference on Neural Information Processing Systems*, NIPS ’23, Red Hook, NY, USA, 2023\. Curran Associates Inc\.
- Liu et al\. \[2023b\] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu\. G\-eval: Nlg evaluation using gpt\-4 with better human alignment\. In *Proceedings of the 2023 conference on empirical methods in natural language processing*, pages 2511–2522, 2023b\.
- Mertler \[2001\] Craig A\. Mertler\. Designing scoring rubrics for your classroom\. *Practical Assessment, Research, and Evaluation*, 7\(25\), 2001\. URL [https://doi\.org/10\.7275/gcy8\-0w24](https://doi.org/10.7275/gcy8-0w24)\.
- Wei et al\. \[2022\] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H\. Chi, Quoc V\. Le, and Denny Zhou\. Chain\-of\-thought prompting elicits reasoning in large language models\. In *Proceedings of the 36th International Conference on Neural Information Processing Systems*, NIPS ’22, Red Hook, NY, USA, 2022\. Curran Associates Inc\. ISBN 9781713871088\.
- Kojima et al\. \[2022\] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa\. Large language models are zero\-shot reasoners\. In *Proceedings of the 36th International Conference on Neural Information Processing Systems*, NIPS ’22, Red Hook, NY, USA, 2022\. Curran Associates Inc\. ISBN 9781713871088\.
- Pope et al\. \[2022\] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean\. Efficiently scaling transformer inference, 2022\. URL [https://arxiv\.org/abs/2211\.05102](https://arxiv.org/abs/2211.05102)\.
- Huang et al\. \[2025\] Hui Huang, Xingyuan Bu, Hongli Zhou, Yingqi Qu, Jing Liu, Muyun Yang, Bing Xu, and Tiejun Zhao\. An empirical study of LLM\-as\-a\-judge for LLM evaluation: Fine\-tuned judge model is not a general substitute for GPT\-4\. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 5880–5895, Vienna, Austria, 2025\. Association for Computational Linguistics\. doi:10\.18653/v1/2025\.findings\-acl\.306\.
- Tversky and Kahneman \[1974\] Amos Tversky and Daniel Kahneman\. Judgment under uncertainty: Heuristics and biases\. *Science*, 185\(4157\):1124–1131, 1974\. doi:10\.1126/science\.185\.4157\.1124\.

## Appendix A Error Analysis

Before designing the evaluation framework, we conducted a systematic pilot error analysis across three engineering subjects: Fluid Mechanics, Heat and Mass Transfer \(HMT\), and Mechanics of Materials \(MOM\)\. Rather than imposing evaluation steps heuristically, we first analyzed how VLMs fail on engineering reasoning problems and whether those failures exhibit consistent structural patterns across domains\.

For each pilot subject, a frontier VLM \(Gemini\-2\.0\-flash\-exp\) was prompted to solve a subset of the problems\. A separate LLM judge then identified and categorized all reasoning errors relative to the ground\-truth solutions, producing subject\-specific error taxonomies grounded in engineering domain knowledge\. The errors and their categories were manually inspected as well\.

Unless otherwise noted, figures visualize representative pilot subjects; analogous trends were consistently observed across all other subjects\.

##### Key Finding 1: Errors Are Multi\-Faceted and Co\-occurring\.

Across all three pilot subjects, arithmetic\-related failures were among the most frequently observed error categories, occurring 49 times in Fluid Mechanics, 45 times in HMT, and 80 times in MOM \(Figure [6](https://arxiv.org/html/2606.10833#A1.F6)\)\. However, the secondary distribution of errors varied substantially across domains\.

Fluid Mechanics failures frequently involved wrong formula usage \(Bernoulli, continuity, Navier\-Stokes: 31 occurrences\) and boundary condition errors \(31 occurrences\)\. HMT failures showed a high rate of wrong or unjustified assumptions \(31 occurrences\), particularly misidentification of the steady\-state vs\. transient regime\. MOM failures were dominated by stress\-formula misuse \(44 occurrences\), free\-body setup errors, and geometric misinterpretation \(36 occurrences\)\.

These results indicate that engineering reasoning failures are not well\-described by a single generic failure mode\. Instead, different engineering domains exhibit distinct reasoning error signatures, motivating domain\-aware and step\-level evaluation\.

\[Bar charts: error\-category frequency for \(a\) Fluid Mechanics and \(b\) Mechanics of Materials; horizontal axis: Frequency\.\]

Figure 6: Most frequent error categories across representative engineering domains\. Arithmetic\-related failures are consistently common, while secondary error distributions remain domain\-specific\. Heat and Mass Transfer exhibited similar trends\.

##### Key Finding 2: Errors Cluster into Distinct Reasoning Steps\.

When we grouped the subject\-specific error types by the reasoning operation they implicate, a consistent structure emerged across all three subjects \(Table [5](https://arxiv.org/html/2606.10833#A1.T5)\)\.

Table 5: Mapping of observed error types to evaluation steps\. Each step captures a qualitatively distinct failure mode not detectable by inspecting other steps\.

##### Key Finding 3: Engineering Errors Exhibit Sparse Structured Dependencies\.

Figure [7](https://arxiv.org/html/2606.10833#A1.F7) shows error correlation matrices from the pilot analysis\. The matrices reveal that engineering reasoning failures are not fully independent\. Several dependencies consistently emerge across errors in different reasoning stages\.

The strongest and most consistent dependency appears between equation misuse and arithmetic errors \(r=0\.55–0\.75 across pilot subjects\), reflecting cascading failures where incorrect governing equations propagate into downstream computation\. Additional moderate dependencies were observed between visual interpretation and equation selection in diagram\-centric problems, as well as between assumption formulation and downstream logical reasoning\.

This sparse dependency structure has a direct methodological implication: engineering reasoning should not be evaluated using a single score, nor can it be fully factorized into completely independent steps\. Instead, evaluation must combine step\-level assessment with selective dependency\-aware propagation along empirically justified reasoning edges\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/01_error_correlation_matrix.png)

Figure 7: Error correlation matrix for Fluid Mechanics\. Most off\-diagonal entries are near zero, confirming step independence\. The exception, formula misapplication and arithmetic errors \(r=0\.75\), reflects a cascading dependency that directly motivates the propagation edge in the DAG\.
##### Key Finding 4: Visual Reasoning Is a Distinct and Persistent Failure Mode\.

We computed a visual error susceptibility score for each topic within each subject, defined as the fraction of errors due to misinterpretation of the diagram rather than to purely textual or computational reasoning \(Figure [8](https://arxiv.org/html/2606.10833#A1.F8)\)\. Several topics scored 0\.55 or higher: hydrostatic pressure distributions and buoyancy in Fluid Mechanics \(0\.56 each\); thermal resistance networks in HMT \(0\.57\); and stress resultants and equilibrium, torsion, and bending in MOM \(0\.59, 0\.55, 0\.55\)\. This led us to conclude that visual misinterpretation is a structurally independent failure mode requiring its own evaluation step\.

\[Bar chart: visual error susceptibility score per Fluid Mechanics topic; horizontal axis: Visual Error Susceptibility Score\.\]

Figure 8: Topics with highest visual error susceptibility scores \(Fluid Mechanics\)\. Scores above 0\.5 indicate that the majority of errors in that topic originate from diagram misinterpretation, confirming visual reasoning as a structurally distinct failure mode\.
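
The score defined above is a per\-topic fraction, which can be sketched as follows\. The input format \(one `(topic, is_visual)` pair per observed error\) and the example records are assumptions made for illustration\.

```python
from collections import defaultdict

def visual_susceptibility(errors):
    """Fraction of each topic's errors attributed to diagram misinterpretation.

    errors: iterable of (topic, is_visual) pairs, one per observed error.
    """
    total = defaultdict(int)
    visual = defaultdict(int)
    for topic, is_visual in errors:
        total[topic] += 1
        visual[topic] += int(is_visual)
    return {topic: visual[topic] / total[topic] for topic in total}

# Synthetic error records; not the pilot data.
errors = [
    ("Buoyancy", True), ("Buoyancy", True), ("Buoyancy", True), ("Buoyancy", False),
    ("Bernoulli's equation", True), ("Bernoulli's equation", False),
    ("Bernoulli's equation", False), ("Bernoulli's equation", False),
]

scores = visual_susceptibility(errors)
print(scores)
```
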
##### From Error Analysis to Evaluation Design\.

The pilot error analysis directly motivated the design of our evaluation framework\. Frequent propagation from equation mistakes to arithmetic errors motivated the use of dependency propagation, where upstream mistakes reduce the score of downstream steps\. Similarly, the large number of diagram\-related failures motivated a dedicated Visual Interpretation step, while the different reasoning patterns across subjects motivated the use of domain\-specific prompts\.

The resulting DAG structure \(Figure [9](https://arxiv.org/html/2606.10833#A1.F9)\) follows the natural flow of an engineering solution\. Problem Characterization and Assumptions appear first because they affect all later steps\. Visual Interpretation feeds into Equation Selection, since correct equations often depend on properly reading diagrams and boundary conditions\. Equation Selection and Logical Reasoning influence Algebraic Accuracy, because calculations based on incorrect equations or inconsistent logic usually produce incorrect numerical results\. Physical Interpretation runs alongside symbolic computation and contributes to the Final Answer, ensuring that the final result is also physically meaningful\.

The error correlation matrix \(Figure[7](https://arxiv.org/html/2606.10833#A1.F7)\) provides more insight into this propagation of errors\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/x5.png)

Figure 9: Dependency DAG structure used for trust propagation across reasoning steps\.

This cascade pattern is precisely what the dependency propagation formula captures: a correct downstream step built on a flawed upstream step should not receive full credit, because the apparent correctness is contingent on an invalid foundation\.
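
To make the cascade concrete, the sketch below propagates scores through a small DAG\. It is not the paper’s propagation formula, which is defined in the main text and not reproduced here: the damping rule, the `damping` value, and the function name are hypothetical stand\-ins\. The edges follow the prose description of the DAG; the Algebraic Accuracy to Final Answer edge is assumed, and Problem Characterization and Assumptions are omitted for brevity\.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each stage maps to its upstream (parent) stages.
PARENTS = {
    "Visual Interpretation": [],
    "Logical Reasoning": [],
    "Physical Interpretation": [],
    "Equation Selection": ["Visual Interpretation"],
    "Algebraic Accuracy": ["Equation Selection", "Logical Reasoning"],
    "Final Answer": ["Algebraic Accuracy", "Physical Interpretation"],  # first edge assumed
}

def propagate(raw, parents=PARENTS, damping=0.5):
    """Scale each raw score (0-10) by the trust in its weakest parent (illustrative rule)."""
    effective = {}
    # static_order() yields every stage after all of its parents.
    for stage in TopologicalSorter(parents).static_order():
        trust = min((effective[p] / 10 for p in parents[stage]), default=1.0)
        effective[stage] = raw[stage] * (1 - damping * (1 - trust))
    return effective

# Every stage looks perfect in isolation except Equation Selection.
raw = {stage: 10.0 for stage in PARENTS}
raw["Equation Selection"] = 4.0

effective = propagate(raw)
naive = sum(raw.values()) / len(raw)
aware = sum(effective.values()) / len(effective)
print(f"naive average = {naive:.2f}, dependency-aware average = {aware:.2f}")
```

Under this illustrative rule, the flawed equation choice lowers Algebraic Accuracy to about 7\.0 and Final Answer to about 8\.5 even though both were graded 10 in isolation, so the dependency\-aware average falls below the naive one\.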

## Appendix B Benchmark details

To ensure broad curriculum\-level coverage and reduce over\-specialization toward narrow reasoning patterns, EngVQA was curated across five major engineering domains spanning dynamics, thermodynamics, transport phenomena, structural mechanics, and physics\-constrained analytical reasoning\. The benchmark intentionally includes both foundational engineering principles and advanced multi\-topic analytical problems commonly encountered in undergraduate and graduate\-level engineering curricula\.

Many questions require simultaneous reasoning across multiple conceptual categories, including diagram interpretation, governing equation selection, physical constraint validation, symbolic derivation, and numerical computation\. Rather than evaluating isolated single\-step calculations, the benchmark emphasizes interconnected engineering workflows involving coupled physical processes, system\-level reasoning, and multi\-stage analytical dependencies\.

The benchmark additionally contains substantial cross\-topic overlap within each subject area\. For example, a single problem may simultaneously involve visual interpretation, conservation laws, material properties, boundary conditions, and multi\-stage quantitative reasoning\. This structure was intentionally designed to evaluate robustness across heterogeneous engineering concepts, mathematical formulations, and physical settings\.

The full list of representative topics covered within each engineering domain is summarized below\.

Dynamics Topic Coverage

- Particle Kinematics: Rectilinear motion with constant and variable acceleration; dependent motion of particles; projectile motion; normal–tangential coordinates; polar and cylindrical coordinates; relative motion; absolute and relative velocity/acceleration; curvilinear trajectories; coordinate transformations; rotating reference frames\.
- Particle Kinetics: Newton’s second law; equations of motion in Cartesian, normal–tangential, and cylindrical coordinates; work–energy methods; conservative and non\-conservative forces; impulse–momentum; angular impulse and momentum; impact analysis; central\-force motion; power and efficiency\.
- Rigid\-Body Dynamics: Plane motion; fixed\-axis rotation; rolling constraints; instantaneous centers; rigid\-body kinetics; mass moment of inertia; radius of gyration; gyroscopic motion; free and damped vibration; force and moment analysis; rigid\-body impact; connected rigid\-body systems; cable tension; planetary and orbital motion\.

Thermodynamics Topic Coverage

- Fundamental Thermodynamics: Open and closed systems; intensive and extensive properties; pressure and temperature; thermodynamic equilibrium; specific volume and density; work and heat transfer mechanisms; energy conservation; mechanical and thermal equilibrium; property relations; phase\-change processes; saturated mixtures; superheated vapor; subcooled liquid; property diagrams and tables\.
- Energy and First\-Law Analysis: Closed\-system and control\-volume energy balances; steady and unsteady flow; nozzles and diffusers; turbines and compressors; throttling valves; pumps and heat exchangers; polytropic processes; specific heats; ideal\-gas and real\-gas relations; compressibility effects; flow work and enthalpy\.
- Entropy and Exergy: Second\-law analysis; entropy generation; isentropic processes; reversible and irreversible processes; exergy destruction; thermal efficiency; COP of refrigeration and heat\-pump systems; availability and work potential; entropy balances\.
- Thermodynamic Cycles: Carnot cycle; Rankine cycle with regeneration and reheat; Brayton cycle and gas turbines; Otto and Diesel cycles; refrigeration cycles; combined cycles; jet propulsion systems\.

Fluid Mechanics Topic Coverage

- Fluid Statics and Properties: Hydrostatic pressure distributions; manometry; fluid properties; shear stress and viscosity; surface tension and capillarity; buoyancy and stability; atmospheric pressure variation; hydrostatic forces on submerged surfaces\.
- Fluid Dynamics and Conservation Laws: Conservation of mass and momentum; Bernoulli analysis; integral and differential control\-volume analysis; continuity equations; energy equations; velocity and acceleration fields; vorticity and circulation; dimensional analysis; Buckingham\-Π theorem; Reynolds, Mach, Froude, Weber, and Euler numbers\.
- Internal and External Flows: Laminar and turbulent pipe flow; entrance\-region development; boundary layers; Reynolds\-number effects; head losses and Moody diagram; velocity profiles; hydraulic jumps; flow separation; drag and lift; wake formation; control surfaces; pump and turbine performance; cavitation\.
- Compressible and Computational Flows: Compressible nozzle flows; isentropic flow; Fanno and Rayleigh flow; shock waves; potential flow theory; Navier–Stokes equations; Euler equations; CFD discretization and numerical convergence; flow visualization; transport phenomena and environmental flows\.

Heat & Mass Transfer Topic Coverage

- Heat Conduction: Fourier’s law; steady\-state and transient conduction; thermal conductivity; thermal resistance networks; lumped\-capacitance analysis; plane\-wall, cylindrical, and spherical conduction; composite walls; fins and extended surfaces; thermal contact resistance; two\-dimensional conduction; conduction shape factors; boundary conditions in conduction\.
- Convection and Boundary Layers: Forced and natural convection; convection heat\-transfer coefficients; velocity and thermal boundary layers; laminar and turbulent flow; Reynolds, Prandtl, Nusselt, Grashof, Rayleigh, and Biot numbers; internal and external flow convection; mixed convection; boiling and condensation; hydrodynamic and thermal entry lengths\.
- Thermal Radiation: Blackbody radiation; emissivity; Stefan–Boltzmann law; Planck’s law; Wien’s displacement law; radiation exchange; view factors; participating media radiation; surface energy balance\.
- Mass Transfer and Coupled Transport: Fick’s law; mass diffusivity; species conservation; evaporation; Raoult’s and Henry’s laws; heat exchangers; LMTD and NTU methods; thermal energy generation and storage; thermoelectric effects; coupled heat and mass\-transfer analogies\.

Mechanics of Materials Topic Coverage

- Stress and Deformation: Normal and shear stress; axial loading; stress–strain relations; Hooke’s law; Poisson’s ratio; thermal stresses; combined loading; pressure vessels; principal stresses; Mohr’s circle; strain transformations; failure theories\.
- Structural Analysis: Beam bending; torsion of circular and thin\-walled sections; free\-body diagrams; equilibrium analysis; stress resultants; shear and bending moment diagrams; neutral axis; centroids and moments of inertia; section modulus; shear center; buckling and elastic stability\.
- Advanced Material Behavior: Stress concentration effects; composite beams and transformed sections; unsymmetric bending; energy methods and Castigliano theorem; deflection analysis; impact loading and strain energy; plastic bending; thin\-walled structures; material yielding and fracture behavior\.

### B\.1 Topic Generation and Assignment Methodology

To systematically classify the conceptual coverage of the benchmark, we established a structured vocabulary of topics for each engineering subject and mapped them to individual problems\. This appendix details our topic extraction process, validation matching pipeline, and the resulting conceptual density statistics\.

#### B\.1\.1 Master Topic List Synthesis

Rather than relying on manually predefined keywords, we utilized an LLM\-driven topic synthesis pipeline to generate a realistic representation of university\-level engineering curricula\. For each subject, the text statements of all questions in the dataset were concatenated and fed into Gemini 2\.5 Flash\. The model was prompted to analyze the question corpus and synthesize a master vocabulary of 50–60 generalized topics per domain \(e\.g\., 53 for Thermodynamics and 54 for Dynamics\)\. These synthesized topics were structured to match standard course syllabi\.

#### B\.1\.2 Multimodal Topic Assignment and Validation

With the master topic list established, we ran a multimodal classification pipeline to map topics to each problem\. For each question:

1. The model was provided with the question statement, the subject’s master topic list, and the corresponding question diagram\.
2. The model selected all topics necessary to formulate or solve the problem\.
3. To ensure data integrity and prevent hallucinated labels, the model’s output was processed by an automated validation script\. The validator matched each output string against the master topic list using case\-insensitive transformations and fuzzy string matching \(with a similarity cutoff threshold of 0\.85\)\. Topics failing this validation check were discarded\.
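
The validation step in item 3 can be sketched as below\. The paper does not name the matching library, so the use of Python’s standard `difflib` \(and its particular similarity ratio\) is an assumption, as are the function name and the example topics\.

```python
from difflib import get_close_matches

def validate_topics(predicted, master_topics, cutoff=0.85):
    """Keep predicted topics that fuzzily match the master list; discard the rest."""
    # Case-insensitive comparison: map lowercased topics back to canonical names.
    lookup = {topic.lower(): topic for topic in master_topics}
    validated = []
    for raw in predicted:
        match = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
        if match and lookup[match[0]] not in validated:
            validated.append(lookup[match[0]])
    return validated

# Hypothetical master list and model output.
master = ["Bernoulli's equation", "Conservation of mass", "Hydrostatic pressure distributions"]
predicted = ["bernoulli's equation", "Conservation of mas", "Quantum tunneling"]

print(validate_topics(predicted, master))
```

Here the case variant and the one\-character typo are mapped back to canonical topics, while the label with no close counterpart in the master list is dropped\.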

#### B\.1\.3 Conceptual Density and Subject Variations

The average number of topics assigned per question varies significantly by subject:

- Dynamics \(3\.57 topics/Q\): Emphasizes modular problem\-solving frameworks where questions are typically isolated to specific kinematic or kinetic formulations \(e\.g\., pure projectile motion or pure impulse\-momentum analysis\)\.
- Fluid Mechanics \(5\.81 topics/Q\): Highlights moderately coupled phenomena such as pressure fields under hydrostatic conditions or head loss calculations in pipe networks\.
- Thermodynamics \(9\.28 topics/Q\) and Mechanics of Materials \(9\.46 topics/Q\): Require concurrent evaluation of material properties, load conditions, geometric invariants, and thermodynamic state variables\.
- Heat & Mass Transfer \(13\.89 topics/Q\): Exhibits the highest conceptual coupling because standard heat transfer scenarios inherently involve multiple parallel transport phenomena \(conduction, convection, and radiation occurring simultaneously with thermal boundaries and resistance properties\)\.

This multi\-concept mapping highlights that solving a typical problem in our benchmark requires the simultaneous integration and synthesis of multiple physical principles, making the benchmark fundamentally more challenging than standard single\-template datasets\.

## Appendix C Detailed Evaluation Framework

This appendix provides the detailed mathematical and algorithmic specifications for the EngJudge evaluation framework, expanding on the summary in Section [4\.1](https://arxiv.org/html/2606.10833#S4.SS1)\.

### C\.1 Penalty Scales

The automated evaluation framework utilizes a penalty\-based grading scheme designed to mirror the grading practices of human professors in graduate\-level engineering courses\. Rather than evaluating a solution in a binary \(correct/incorrect\) fashion, each of the eight mandatory reasoning stages begins with a base score of 10\. The automated grader identifies specific errors within the stage, classifies them into one of four severity levels, and subtracts the corresponding penalty points\.

The four severity levels are Minor \(2 points\), Moderate \(4 points\), Major \(7 points\), and Critical \(10 points\)\. The spacing of this scale \(2, 4, 7, 10\) is designed to ensure that minor arithmetic or formatting slips do not excessively penalize a student’s score, while severe conceptual errors or invalid physical formulations are heavily penalized, often resulting in an immediate zero for that stage\. This configuration provides a flexible yet robust mechanism for partial credit distribution\.
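
A minimal sketch of this penalty\-based scoring follows\. The base score and penalty values come from the text; clamping the result at zero when deductions exceed 10 is our assumption, and the function and key names are hypothetical\.

```python
BASE_SCORE = 10
PENALTY = {"minor": 2, "moderate": 4, "major": 7, "critical": 10}

def stage_score(severities):
    """Score for one reasoning stage after subtracting a penalty per detected error."""
    deduction = sum(PENALTY[level] for level in severities)
    # Assumption: a stage score cannot go below zero.
    return max(0, BASE_SCORE - deduction)

print(stage_score([]))                     # no errors detected
print(stage_score(["minor", "moderate"]))  # 10 - 2 - 4
print(stage_score(["critical"]))           # a single critical error zeroes the stage
```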

Table [6](https://arxiv.org/html/2606.10833#A3.T6) summarizes the criteria and provides concrete illustrative examples from the evaluation of the benchmark questions\.

Table 6: Detailed penalty scales, grading criteria, and concrete failure examples across the benchmark domains\.

At the end of the human study, evaluators were asked to rate the calibration of the penalty scheme \(the specific deductions of 2, 4, 7, and 10 points for minor, moderate, major, and critical errors\)\. The feedback from the 9 expert participants was distributed as follows:

- Well Calibrated: 55\.6% \(5 out of 9 evaluators\) expressed direct agreement that the penalty levels are well calibrated\.
- Minors Too Harsh: 22\.2% \(2 out of 9 evaluators\) felt that minor error penalties \(2 points\) were slightly too harsh\.
- Scale Too Coarse: 11\.1% \(1 out of 9 evaluators\) suggested a more granular scale\.
- Not Harsh Enough: 11\.1% \(1 out of 9 evaluators\) indicated that the penalties could be even stricter\.

### C\.2 Fatal Error Capping Mechanism

To prevent VLMs from receiving high scores on steps containing fundamental conceptual breakdowns, the EngJudge framework incorporates a *Fatal Error Capping* mechanism\. During the evaluation of a reasoning step, if the evaluator model detects any of the fatal error types defined in Table [7](https://arxiv.org/html/2606.10833#A3.T7), the score for that specific step is capped at the designated value, regardless of any correct algebraic manipulations or formatting compliance\. If multiple fatal errors are triggered within a single step, the step score is capped at the *lowest* applicable value\.

The physical rationale and definition for each fatal error cap are detailed below:

- Governing Equation Incorrect \(Score Cap = 3\): This error is triggered when the VLM selects a fundamental governing equation that is physically inapplicable to the problem domain\. A prime example is applying Bernoulli’s equation \(which assumes frictionless, inviscid flow\) to a high\-viscosity pipe flow system where shear stress dominates, or using linear momentum equations in a system where rotational torque governs\. Because selecting the incorrect governing relation invalidates the entire physical foundation of the solution, the step is capped at a low score of $3$, reflecting a failure in basic physical reasoning\.
- Physical Model Invalid \(Score Cap = 4\): This error occurs when the VLM formulates a physical representation or simplification that directly contradicts the physical realities of the problem\. Examples include assuming a transient process is at steady\-state, treating a highly compressible gas as an incompressible fluid, or modeling a multi\-dimensional flux as a one\-dimensional flow without justification\. While the model may choose the correct general equations, the invalid simplification renders the subsequent modeling physically incorrect, justifying a cap of $4$\.
- Dimensionally Inconsistent \(Score Cap = 2\): This represents the most severe algebraic and physical violation in engineering\. It is triggered when the model writes an equation where the units or physical dimensions of the left\-hand side do not match those of the right\-hand side \(e\.g\., adding a force term directly to a velocity term, or equating mass flow rate to pressure\)\. Because dimensional consistency is the most fundamental sanity check in physical sciences, any dimensional violation indicates a complete collapse of mathematical and physical consistency, resulting in the strictest cap of $2$\.
- Boundary Conditions Invalid \(Score Cap = 4\): In graduate\-level engineering physics, solving governing differential equations \(such as the Navier\-Stokes or heat diffusion equations\) depends entirely on specifying boundary and initial conditions\. This error is triggered when the model misapplies interface conditions, such as setting a free\-slip condition at a stationary wall, neglecting convective heat transfer at a boundary, or using absolute pressure in momentum balances where gauge pressure is required\. Since incorrect boundary conditions yield physically impossible solutions, the score is capped at $4$\.

By enforcing these limits, the framework ensures that a solution is not graded solely on surface\-level correctness or formatting compliance, but rather on its conceptual and physical integrity\.
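The capping rule can be sketched as follows, assuming the raw stage score has already been computed from the penalties\. The dictionary keys are hypothetical identifiers modeled on the `physical_model_invalid` type that appears in the few\-shot example of Appendix F; only the cap values themselves come from the text\.

```python
# Score caps per fatal error type (values from the list above).
# The key names are illustrative assumptions, not confirmed identifiers.
FATAL_ERROR_CAPS = {
    "governing_equation_incorrect": 3,
    "physical_model_invalid": 4,
    "dimensionally_inconsistent": 2,
    "boundary_conditions_invalid": 4,
}


def apply_fatal_caps(raw_score, fatal_errors):
    """Cap the raw stage score at the lowest cap among detected fatal errors."""
    if not fatal_errors:
        return raw_score
    lowest_cap = min(FATAL_ERROR_CAPS[name] for name in fatal_errors)
    # A cap only ever lowers the score; it never raises a score already below it.
    return min(raw_score, lowest_cap)


print(apply_fatal_caps(8, ["physical_model_invalid"]))  # capped at 4
print(apply_fatal_caps(8, ["physical_model_invalid",
                           "dimensionally_inconsistent"]))  # lowest cap: 2
print(apply_fatal_caps(3, ["physical_model_invalid"]))  # already below cap: 3
```

The last call mirrors the few\-shot example in Appendix F, where a raw score of 3 with a cap of 4 yields a capped score of 3\.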

Table 7: Fatal error caps\. The step score is capped at the lowest applicable value when a fatal error is detected\.

### C\.3 Meta\-Evaluation Checks and Deductions

#### C\.3\.1 Missing Stage Penalty \(MP\)

To ensure that models do not bypass intermediate cognitive stages \(e\.g\., jumping directly to a final answer without stating assumptions or selecting equations\), the rule\-based parser scans the output for the eight required markup tags\.

- Deduction Rule: For every missing reasoning stage, a flat penalty of $2.0$ points is added to the total missing penalty \($MP$\):

  $$MP = 2.0 \times \left|\{\text{absent required stages}\}\right| \tag{6}$$

  This penalty is subtracted directly from the base score \(the aggregated score from all the stages\) before applying meta\-evaluation multipliers: $S_{\text{blend}} = \max(0, S_{\text{base}} - MP)$\.
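The missing\-stage check can be approximated as shown below\. This sketch assumes that a stage counts as present when its opening tag appears anywhere in the output, and it tolerates optional whitespace inside the tag; the function names are hypothetical, and the parser used by EngJudge may differ\. The tag names are taken from the generation prompt in Appendix F\.

```python
import re

# The eight required stage tags, as listed in the generation prompt.
REQUIRED_STAGES = [
    "PROBLEM_CHARACTERIZATION", "ASSUMPTIONS", "VISUAL_INTERPRETATION",
    "EQUATION_SELECTION", "LOGICAL_REASONING", "ALGEBRAIC_ACCURACY",
    "PHYSICAL_INTERPRETATION", "FINAL_ANSWER",
]


def missing_stages(solution_text):
    """Return the required stages whose opening tag is absent (assumed rule)."""
    missing = []
    for name in REQUIRED_STAGES:
        # Six '#', the tag name, six '#', with optional spaces in between.
        pattern = r"#{6}\s*" + re.escape(name) + r"\s*#{6}"
        if not re.search(pattern, solution_text):
            missing.append(name)
    return missing


def missing_penalty(solution_text, per_stage=2.0):
    """Equation (6): a flat 2.0 points per absent required stage."""
    return per_stage * len(missing_stages(solution_text))


def blended_score(s_base, mp):
    """S_blend = max(0, S_base - MP)."""
    return max(0.0, s_base - mp)


sample = (
    "######ASSUMPTIONS######\nSteady state\n######END_STEP######\n"
    "######FINAL_ANSWER######\nQ = 42 W\n######END_STEP######"
)
print(missing_stages(sample))
print(missing_penalty(sample))
```

In this sample only two of the eight tags are present, so six stages are reported missing\.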

Beyond step\-level grading, EngJudge applies three solution\-level *meta\-evaluations* \(coverage, verbosity, and physical sanity\) to assess the overall quality, completeness, and physical grounding of the derivation\. These checks are implemented via LLM\-as\-a\-judge and evaluate the full solution rather than individual stages\.

#### C\.3\.2 Coverage Penalty \(COVERAGE\)

The coverage check verifies whether the model has addressed all sub\-questions \(e\.g\., part \(a\) and part \(b\)\) and calculated all requested engineering quantities\.

- Evaluation Rubric: The evaluator model scores the solution on a scale of $0$ to $10$\. A score of $10$ indicates complete coverage\. For each missing or unaddressed sub\-question or target quantity, the score is reduced\. This yields the coverage score \(CS\)\.
- Penalty Mapping: The coverage score \(CS\) is mapped to a fractional penalty $\mathtt{COVERAGE} \in [0,\,0.5]$ using:

  $$\mathtt{COVERAGE} = \min\left(0.5,\ (10 - \mathtt{CS}) \times 0.10\right) \tag{7}$$

  Each point below $10$ deducts $10\%$ from the final score, capped at a maximum deduction of $50\%$\.

#### C\.3\.3 Verbosity Penalty \(VERBOSITY\)

Large Language Models frequently generate excessive, redundant steps or boilerplate explanations \[Zheng et al\., [2023](https://arxiv.org/html/2606.10833#bib.bib19)\]\. The verbosity check penalizes solutions that contain excessive filler text, unnecessary restatements of the problem, or repetitive derivations\.

- Evaluation Rubric: The evaluator model rates the conciseness and technical directness of the solution from $0$ to $10$\. A score of $10$ represents a clean, direct derivation with minimal filler text\. This yields the verbosity score \(VS\)\.
- Penalty Mapping: The verbosity score \(VS\) maps to a fractional penalty $\mathtt{VERBOSITY} \in [0,\,0.5]$ using:

  $$\mathtt{VERBOSITY} = \min\left(0.5,\ (10 - \mathtt{VS}) \times 0.10\right) \tag{8}$$

  As with coverage, each point below $10$ deducts $10\%$, up to a maximum of $50\%$\. The penalty is capped at $0.5$ so that verbosity alone cannot bring the overall final score down to zero; the same reasoning applies to coverage\.

#### C\.3\.4 Physical Sanity Check \(SANITY\_FAIL\)

This is a critical global filter that checks high\-level reasoning and physical consistency\. Even if a model’s intermediate steps appear mathematically structured, the final numerical values might be physically impossible\. This check differs from the fatal error caps above, which apply to individual steps, because it examines the final solution and answer\.

- Evaluation Rubric: The evaluator model performs a binary check \($\mathtt{SANITY\_FAIL} \in \{0, 1\}$\) on whether the solution violates fundamental physical laws\. Examples of failure conditions include:
  - Violating the Second Law of Thermodynamics \(e\.g\., heat engines exceeding Carnot efficiency\)\.
  - Violating mass or energy conservation \(e\.g\., fluid outflow exceeding inflow in steady state\)\.
  - Generating impossible physical bounds \(e\.g\., negative absolute temperatures in Kelvin, efficiencies greater than 1\.0, or velocities exceeding the speed of light\)\.
- Penalty Mapping: If any physical sanity check fails, $\mathtt{SANITY\_FAIL} = 1$, which applies a global multiplicative factor of $0.9^{\mathtt{SANITY\_FAIL}} = 0.9$ \(a flat $10\%$ penalty\) to the final score\.

The final score is calculated as:

$$S_{\text{final}} = S_{\text{blend}} \cdot (1 - \mathtt{COVERAGE}) \cdot (1 - \mathtt{VERBOSITY}) \cdot 0.9^{\mathtt{SANITY\_FAIL}} \tag{9}$$
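Equations \(7\) to \(9\) combine as in the sketch below\. The function names are hypothetical, and the inputs \(the blended score, the coverage and verbosity scores from the evaluator, and the binary sanity flag\) are assumed to be available already\.

```python
def fractional_penalty(score_0_to_10, rate=0.10, cap=0.5):
    """Equations (7) and (8): 10% per point below 10, capped at 50%."""
    return min(cap, (10 - score_0_to_10) * rate)


def final_score(s_blend, coverage_score, verbosity_score, sanity_fail):
    """Equation (9): apply the three meta-evaluation multipliers to S_blend."""
    coverage = fractional_penalty(coverage_score)
    verbosity = fractional_penalty(verbosity_score)
    sanity_factor = 0.9 ** int(sanity_fail)  # 0.9 if the check failed, else 1.0
    return s_blend * (1 - coverage) * (1 - verbosity) * sanity_factor


# Illustrative inputs: S_blend = 6.0, CS = 8, VS = 9, sanity check failed.
# Multipliers: (1 - 0.2) * (1 - 0.1) * 0.9, so the result is about 3.888.
print(round(final_score(6.0, 8, 9, True), 3))
```

In the worst case \(CS = VS = 0 with a failed sanity check\) the combined multiplier is $0.5 \times 0.5 \times 0.9 = 0.225$, so the meta\-evaluations alone never reduce a nonzero blended score to zero\.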

## Appendix D Experiments

Table [8](https://arxiv.org/html/2606.10833#A4.T8) contains all the scores obtained by using different generators, evaluators, and evaluation techniques\.

Table 8: Cross\-evaluator performance comparison across engineering domains\. Scores are reported on a 0–10 scale\. Gray rows correspond to the proposed EngJudge framework\. FM = Fluid Mechanics, HMT = Heat & Mass Transfer, MoM = Mechanics of Materials, Thermo = Thermodynamics, Dyn = Dynamics, and CoT = Chain\-of\-Thought Wei et al\. \[[2022](https://arxiv.org/html/2606.10833#bib.bib22)\]\.

### D\.1 Plot against Baseline

\[Plot: x\-axis categories FM, HMT, MoM, Thermo, Dyn; y\-axis Score \(0 to 10\); series Baseline and EngJudge\. Data labels in extraction order: 7\.4, 7\.7, 7\.22, 9\.13, 7\.56, 2\.34, 2\.01, 3\.25, 3\.33, 2\.85\.\]

Figure 10: Comparison of baseline and EngJudge evaluation by Gemini 3\.1 Pro Preview\.

### D\.2 Step\-wise Scores

Table 9: Average step\-wise dependent \(Dep\) and independent \(Ind\) scores, evaluated with Gemini 3\.1 Pro Preview\.

### D\.3 Detailed Topological and Correlation Analyses

We provide extended quantitative details on the choice of dependency structures and the empirical correlations between reasoning steps\.

#### D\.3\.1 Influence of Dependency Topologies

To evaluate the choice of dependency mapping, we compare the proposed Directed Acyclic Graph against alternative topological structures in Table [10](https://arxiv.org/html/2606.10833#A4.T10)\. Excluding dependency propagation entirely \(*Independent/Flat*\) yields optimistic base scores of $7.19$ \(Gemini\) and $5.18$ \(Qwen\)\. Conversely, imposing a sequential chain \(*Strict Linear Chain*\), where each step $N$ depends entirely on step $N-1$, results in overly severe penalties \($4.29$ for Gemini, $2.93$ for Qwen\)\. This is because a linear topology assumes that a minor mistake in an early independent step \(e\.g\., a small assumption slip\) completely invalidates parallel, unaffected tracks \(such as reading dimensions from a diagram\)\. Our proposed *True DAG* models these parallel pathways accurately, penalizing dependent steps only when their specific prerequisites fail, resulting in balanced base scores of $5.03$ \(Gemini\) and $3.48$ \(Qwen\)\.
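The difference between the chain and DAG topologies can be illustrated by listing which steps are exposed to an early failure\. The sketch below is a toy example only: the edge list `toy_dag` is hypothetical and is not the dependency graph used by EngJudge, the abbreviations PC, VI, and LR are introduced here for illustration, and the code does not reproduce the score propagation rule itself\.

```python
def affected_steps(edges, failed_step):
    """Return every step reachable from `failed_step` (its descendants)."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    seen, stack = set(), [failed_step]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Stage order from the generation prompt, abbreviated for readability.
STEPS = ["PC", "AS", "VI", "ES", "LR", "AA", "PI", "FA"]

# Strict linear chain: each step depends on the one before it.
linear_chain = list(zip(STEPS, STEPS[1:]))

# Hypothetical DAG in which AS and VI are parallel inputs to ES.
toy_dag = [("PC", "ES"), ("AS", "ES"), ("VI", "ES"),
           ("ES", "LR"), ("LR", "AA"), ("LR", "PI"), ("AA", "FA")]

print(sorted(affected_steps(linear_chain, "AS")))
# ['AA', 'ES', 'FA', 'LR', 'PI', 'VI']
print(sorted(affected_steps(toy_dag, "AS")))
# ['AA', 'ES', 'FA', 'LR', 'PI']
```

Under the chain, an assumption slip \(AS\) also reaches visual interpretation \(VI\); in the toy DAG it does not, which is the parallel\-track effect described above\.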

Table 10: Ablation showing the effect of different dependency structures on the average base score \(excluding meta\-evaluation checks and missing step penalties\)\. Removing dependency propagation \(Independent/Flat\) yields substantially higher scores, while stricter dependency assumptions \(Linear Chain\) result in lower scores than the proposed DAG\-based topology\.

#### D\.3\.2 Correlation and Causal Dependency Analysis

To empirically validate the dependency relations modeled by our DAG, we analyze the Pearson correlation matrix between the raw step scores across all evaluations \(shown in Figure [5](https://arxiv.org/html/2606.10833#S5.F5) in the main paper\)\. The matrix reveals a clear, diagonal\-adjacent correlation structure, where strong correlations are localized to neighboring steps of the reasoning chain\.

Specifically, we observe strong correlations between Algebraic Accuracy and the Final Answer \($r = 0.69$\), and between Equation Selection and Logical Reasoning \($r = 0.53$\)\. These findings reflect the physical reality of error propagation: a failure in selecting the correct governing equation or executing mathematical derivations directly compromises downstream calculations and the final numerical output\. Conversely, correlations between early conceptual steps and final execution steps are negligible \(e\.g\., Problem Characterization vs\. Final Answer yields $r = 0.06$\)\. This suggests that a VLM’s ability to classify the physics domain of a problem is largely uncorrelated with its ability to execute the math required to solve it, reinforcing the need for step\-wise evaluation\.

##### Why Does Statistical Correlation Differ from Causal DAG Dependencies?

This raises a key design question: why are some pairs of steps with high statistical correlation in Figure [5](https://arxiv.org/html/2606.10833#S5.F5) not connected by direct dependency edges in our DAG? For instance, Physical Interpretation \(PI\) and Final Answer \(FA\) exhibit a high correlation of $r = 0.60$, and Algebraic Accuracy \(AA\) and Physical Interpretation \(PI\) show $r = 0.52$, yet neither pair has a direct edge in the evaluation graph\. This design choice is guided by three main principles:

1. Causality vs\. Correlation: The DAG is designed to enforce direct, physical causal prerequisites\. For example, a student can mathematically compute a correct final numerical answer \(FA\) via correct algebraic manipulation \(AA\) without necessarily understanding or explaining its physical meaning \(PI\)\. Because PI is not a strict mathematical prerequisite for calculating FA, drawing a dependency edge from PI to FA would be causally incorrect, despite their statistical correlation\.
2. Confounding by Downstream Position: Later steps in the reasoning chain \(such as AA, PI, and FA\) are strongly correlated because they are co\-dependent on the cumulative errors of early steps \(like AS and ES\)\. This shared dependency on common ancestors creates high statistical correlation \(confounding\) in the empirical data\. Adding redundant edges between these downstream variables in the DAG would lead to duplicate penalization for the same upstream error\.
3. Conceptual Independence in Rubrics: Conceptual reasoning \(such as qualitative physical interpretation\) and algebraic computation are graded as independent dimensions in standard engineering pedagogy\. A model may fail the algebra but perform a correct physical limit check, or vice versa\. The high empirical correlation \($r = 0.52$ between AA and PI\) is a reflection of general model capability \(i\.e\., stronger VLMs perform well on both, while weaker models fail on both\) rather than a causal dependency between the two skills\.

Thus, the correlation analysis does not by itself validate our dependency graph, but it is consistent with the dependencies the graph encodes\.
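The confounding argument can be checked with a small simulation\. The sketch below uses synthetic data only, not benchmark scores: two downstream quantities are each generated as a shared upstream signal plus independent noise, with no direct link between them\. The variable names and noise model are illustrative assumptions\.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000

# Shared upstream quality (a stand-in for early steps such as AS and ES).
upstream = [rng.gauss(0, 1) for _ in range(n)]
# Two downstream scores that depend on upstream but not on each other.
aa = [u + rng.gauss(0, 1) for u in upstream]
pi = [u + rng.gauss(0, 1) for u in upstream]

# Raw correlation is driven entirely by the common ancestor.
r_raw = pearson(aa, pi)

# After removing the upstream signal, only independent noise remains.
r_resid = pearson([a - u for a, u in zip(aa, upstream)],
                  [p - u for p, u in zip(pi, upstream)])

print(f"raw correlation:      {r_raw:.2f}")
print(f"residual correlation: {r_resid:.2f}")
```

With unit\-variance signal and noise, the raw correlation is close to 0\.5 in expectation while the residual correlation is close to zero, so a sizeable correlation between two downstream steps does not by itself imply a direct edge between them\.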

### D\.4 Qualitative Case Studies

This section details the primary failure modes observed in the generated solutions, categorized by the nature of the cognitive or mathematical lapse\. To illustrate these mechanisms, we present four distinct examples spanning four different engineering domains\.

#### D\.4\.1 Visual Interpretation Errors

Despite demonstrating strong general visual recognition, vision\-language models \(VLMs\) frequently fail to map two\-dimensional diagrams of three\-dimensional systems into correct mathematical formulations, or overlook subtle geometric features that alter the boundary conditions\.

- Omission of Physical Constraints \(Thermodynamics, Problem 5\-131\): When modeling a piston\-cylinder device under compression, the VLM failed to identify the physical stops inside the cylinder wall depicted in the schematic\. Consequently, it modeled the entire compression process as isobaric under the assumption that the piston was free to move indefinitely\. In reality, once the piston contacts the stops, the process transitions to a constant\-volume phase, rendering the subsequent state and boundary work calculations invalid\.

#### D\.4\.2 Physical Principles and Boundary Conditions Errors

A major source of low evaluation scores is the formulation of equations that violate the system’s boundary conditions, flow characteristics, or conservation laws\.

- Incorrect Boundary Pressure Formulation \(Fluid Mechanics, Problem 11\): In analyzing the force required to hold an exit nozzle plug in place, the model applied Bernoulli’s equation to calculate a non\-zero exit pressure \($p_2 \approx 364\text{ kPa}$\)\. Because the nozzle discharges directly to the atmosphere as an unconfined jet, the correct physical boundary condition is a uniform gauge pressure of zero \($p_2 = 0$\)\. Introducing a fictitious pressure force at the boundary led to a physically invalid linear momentum balance\.

#### D\.4\.3 Dimensionless Group and Correlation Errors

In domains involving convective transport and non\-dimensional scaling, models frequently apply empirical correlations outside their valid regimes or evaluate fluid properties at incorrect reference states\.

- Incorrect Reference Temperature Property Evaluations \(Heat & Mass Transfer, Problem 71\): During the forced convection cooling analysis of an electronic chip, the model evaluated the physical properties of the air stream \(density, dynamic viscosity, and thermal conductivity\) at the free\-stream temperature of $25^{\circ}\text{C}$ \(300 K\) instead of the film temperature \($T_f = (T_s + T_\infty)/2 \approx 308\text{ K}$\)\. This property mismatch introduced systematic errors into the Reynolds \($Re$\) and Prandtl \($Pr$\) numbers, which propagated through the Nusselt number correlation and yielded an incorrect convective heat transfer coefficient\.

#### D\.4\.4 Arithmetic and Unit Discrepancies

Even when the physical formulation is sound, final numerical answers are often invalidated by basic calculation slips, incorrect coordinate transformations, or lookup errors\.

- Kinematic and Component Formulation Slips \(Dynamics, Problem 21\-77\): When calculating the angular momentum of a precessing satellite, the model equated the satellite’s spin rate directly to the total angular velocity component along the symmetry axis \($\psi_z = \omega_s$\), ignoring the precession contribution and the coordinate rotation\. This algebraic process error resulted in a cascading numerical slip, underestimating the satellite’s angular momentum by approximately $45\%$\.

## Appendix E Detailed Human Study Results

This appendix provides a detailed quantitative breakdown of the human validation study referenced in Section [6](https://arxiv.org/html/2606.10833#S6)\. The study utilized a custom\-built, interactive evaluation dashboard that sequentially guided evaluators through randomly assigned problems per session\. Participants first performed a binary rating \(correct/incorrect\) on each generated step without seeing the framework’s evaluation, minimizing anchoring bias Tversky and Kahneman \[[1974](https://arxiv.org/html/2606.10833#bib.bib26)\]\. Subsequently, they were shown the framework’s dependency score and penalty justifications, and were asked to rate the severity on a 5\-point Likert scale \(ranging from “Much too low” to “Much too high”\)\.

### E\.1 Data Point Formulation and Rating Quantification

To perform continuous statistical analysis on qualitative feedback, we define a structured mapping to convert the qualitative human ratings into numerical scores\.

##### Data Point Definition

A single evaluation data point represents the rating of a single reasoning unit \(either a stage\-level evaluation or a meta\-evaluation check\) for a given problem instance by a human participant\. With 9 expert participants evaluating 4 randomized problems each, the total number of data points is:

$$N_{\text{total}} = N_{\text{step}} + N_{\text{meta}} = 285 + 108 = 393 \tag{10}$$

where $N_{\text{step}} = 285$ represents individual stage evaluations \(8 stages per problem, with occasional missing steps or parsing fallbacks omitted\), and $N_{\text{meta}} = 108$ represents meta\-evaluation ratings \(3 checks per problem: Verbosity, Coverage, and Physical Sanity\)\. With $9 \times 4 = 36$ problem instances, the meta\-evaluation count is $36 \times 3 = 108$, and the stage count is at most $36 \times 8 = 288$\.

##### Likert\-to\-Numerical Mapping

During the study, human experts did not assign raw scores directly; instead, they reviewed the framework’s automated score \($S_{\text{auto}} \in [0, 10]$\) and selected a qualitative rating on a 5\-point Likert scale to express their level of agreement\. To reconstruct the absolute human score \($S_{\text{human}}$\), we define an adjustment mapping \($\delta$\) for each Likert category:

$$
\begin{aligned}
\text{``Much too low''} &\implies \delta = +2 \\
\text{``Slightly too low''} &\implies \delta = +1 \\
\text{``About right''} &\implies \delta = 0 \\
\text{``Slightly too high''} &\implies \delta = -1 \\
\text{``Much too high''} &\implies \delta = -2
\end{aligned}
\tag{11}
$$

The equivalent human score is then reconstructed by adjusting the automated score and clipping the result to the valid $[0, 10]$ grading scale:

$$S_{\text{human}} = \max(0, \min(10, S_{\text{auto}} + \delta)) \tag{12}$$

This formulation ensures a human\-consistent numerical scale, where a rating of “About right” \($\delta = 0$\) corresponds to perfect agreement \($S_{\text{human}} = S_{\text{auto}}$\)\. Pearson correlation \($r$\) and Mean Absolute Error \(MAE\) are then calculated directly using the paired arrays of $S_{\text{auto}}$ and $S_{\text{human}}$ across the $393$ data points\.
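Equations \(11\) and \(12\) and the MAE computation can be sketched as follows\. The function names are hypothetical and the four sample ratings are made up for illustration; they are not data from the study\.

```python
# Equation (11): adjustment applied to the automated score per Likert rating.
LIKERT_DELTA = {
    "Much too low": +2,
    "Slightly too low": +1,
    "About right": 0,
    "Slightly too high": -1,
    "Much too high": -2,
}


def human_score(s_auto, rating):
    """Equation (12): shift the automated score by delta, clip to [0, 10]."""
    return max(0, min(10, s_auto + LIKERT_DELTA[rating]))


def mean_absolute_error(auto_scores, human_scores):
    """Average absolute difference between paired score arrays."""
    pairs = list(zip(auto_scores, human_scores))
    return sum(abs(a - h) for a, h in pairs) / len(pairs)


# Illustrative (made-up) ratings for four data points.
auto = [10, 7, 3, 0]
ratings = ["Much too low", "Slightly too high", "About right", "Much too high"]
human = [human_score(a, r) for a, r in zip(auto, ratings)]

print(human)                             # [10, 6, 3, 0]
print(mean_absolute_error(auto, human))  # 0.25
```

The first and last points show the clipping in Equation \(12\): a score of 10 cannot be raised and a score of 0 cannot be lowered\. Under this mapping each paired difference is at most 2 in absolute value\.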

### E\.2 Continuous Alignment Metrics

The continuous alignment between the framework’s automated scores and the synthetic human scores is visually depicted in Figure [11](https://arxiv.org/html/2606.10833#A5.F11) and quantitatively summarized in Table [11](https://arxiv.org/html/2606.10833#A5.T11)\.

Table 11: Quantitative alignment of our framework vs\. human expert evaluators\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/x6.png)

Figure 11: Alignment between synthetic human scores and automated framework scores \($N = 393$\)\. Points tightly clustering around the $y = x$ diagonal indicate exceptionally strong agreement\.

### E\.3 Meta\-Evaluation Breakdown

Beyond sequential steps, the framework assesses complete solutions through three meta\-checks\. The human alignment for these individual checks is highly consistent:

- Verbosity \($N = 36$\): $r = 0.983$, MAE = 0\.94\.
- Coverage \($N = 36$\): $r = 0.987$, MAE = 0\.36\.
- Physical Sanity \($N = 36$\): $r = 0.974$, MAE = 0\.39\.

### E\.4 Qualitative Agreement and Evaluator Preferences

![Refer to caption](https://arxiv.org/html/2606.10833v1/x7.png)

Figure 12: Distribution of overall framework ratings assigned by human expert evaluators\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/x8.png)

Figure 13: Evaluator preference for dependency\-based scoring vs\. a naive unweighted average\.

In addition to continuous correlations, the study measured exact categorical agreement\. For 48\.9% \(192/393\) of all evaluation points, human evaluators selected “About right”, indicating exact agreement with the numerical score assigned by the framework without any need for adjustment\.

When observing the overall framework ratings \(Figure [12](https://arxiv.org/html/2606.10833#A5.F12)\), 5 evaluators rated it “Good” \(55\.6%\) and 1 rated it “Very Good” \(11\.1%\)\. Furthermore, the 2\-to\-1 win rate \(66\.7%, Figure [13](https://arxiv.org/html/2606.10833#A5.F13)\) for the dependency scoring method over an unweighted average highlights a critical pedagogical reality in engineering: *a correct algebraic derivation is rendered meaningless if the fundamental assumptions or physical constraints formulated in prior steps are flawed*\. The human study confirms that EngJudge successfully captures this nuanced grading philosophy\.

Figure [14](https://arxiv.org/html/2606.10833#A5.F14) to Figure [21](https://arxiv.org/html/2606.10833#A5.F21) present representative snapshots of the web interface used during the human evaluation process across different engineering domains and problem instances\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_1.png)

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_2.png)

Figure 14: Introductory pages of the annotation webpage\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_4.png)

Figure 15: Users are asked to judge whether the LLM\-generated solution step is correct or not for a given question\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_5.png)

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_6.png)

Figure 16: Users are shown the ground truth solution and the step evaluation of our framework, and are required to rate it\. This continues for all 8 stages\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_7.png)

Figure 17: The 8th stage \(Final Answer\) and its evaluation are shown, and users are asked to rate it\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_8.png)

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_9.png)

Figure 18: The meta\-evaluation checks are shown, and users are asked to rate them\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_10.png)

Figure 19: After 11 steps, the final independent stage\-wise scores are shown\. Users are asked whether they feel that the average score is right\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_11.png)

Figure 20: Users are asked for their opinion of the dependency\-based final score, and whether it is better than the simple average\.

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_12.png)

![Refer to caption](https://arxiv.org/html/2606.10833v1/images/human_study_ss/human_study_ss_13.png)

Figure 21: After all 4 questions, some framework\-specific questions are asked at the end\.

## Appendix F Prompt Templates

### F\.1 Generation Prompt

This prompt is used to generate solutions to the questions in our benchmark\.

You are an expert \{domain\} professor solving a graduate\-level problem\.\\

Examine the question text and any diagram carefully\.

Structure your solution using the EXACT tagged format below\. ALL sections are MANDATORY\.

RULES:

\- Be concise\. Only state what is necessary\.

\- Show all arithmetic clearly\. Convert to SI units before substituting\.

\- Verify your final answer is physically reasonable\.

REQUIRED STRUCTURE \(follow this order\):

\#\#\#\#\#\#PROBLEM\_CHARACTERIZATION\#\#\#\#\#\#

State the problem type and what needs to be found\. One or two sentences\.

Example: "Steady\-state 1D radial conduction through a composite cylinder with convective outer BC\. Find Q/L and T\_2\."

\#\#\#\#\#\#END\_STEP\#\#\#\#\#\#

\#\#\#\#\#\#ASSUMPTIONS\#\#\#\#\#\#

List only the key assumptions \(2\-4 max\) with one\-line justifications\.

Example:

1\. Steady state \-\- no time\-dependent terms given

2\. Constant k \-\- temperature range is small

3\. 1D radial \-\- long pipe, neglect end effects

Do NOT over\-complicate\.

\#\#\#\#\#\#END\_STEP\#\#\#\#\#\#

\#\#\#\#\#\#VISUAL\_INTERPRETATION\#\#\#\#\#\#

Extract from the diagram: dimensions, boundary conditions, material properties\. Be brief and factual\.

If no diagram, state geometry from the problem text\.

\#\#\#\#\#\#END\_STEP\#\#\#\#\#\#

\#\#\#\#\#\#EQUATION\_SELECTION\#\#\#\#\#\#

Write the governing equations you will use and briefly state why each applies\.

\#\#\#\#\#\#END\_STEP\#\#\#\#\#\#

\#\#\#\#\#\#LOGICAL\_REASONING\#\#\#\#\#\#

Outline the mathematical and physical logic you will follow to arrive at the solution\.

\#\#\#\#\#\#END\_STEP\#\#\#\#\#\#

\#\#\#\#\#\#ALGEBRAIC\_ACCURACY\#\#\#\#\#\#

If derivation is needed, show it step by step\. Then, convert to SI units, substitute all values \\

into the equations, and compute step by step showing intermediate results\. Use 3\-4 significant figures\.

\#\#\#\#\#\#END\_STEP\#\#\#\#\#\#

\#\#\#\#\#\#PHYSICAL\_INTERPRETATION\#\#\#\#\#\#

In 2\-3 sentences: what does the result mean physically? Is the magnitude reasonable?

\#\#\#\#\#\#END\_STEP\#\#\#\#\#\#

\#\#\#\#\#\#FINAL\_ANSWER\#\#\#\#\#\#

State the final numerical answer\(s\) with units\. Label parts \(a\), \(b\), etc\. if needed\.

\#\#\#\#\#\#END\_STEP\#\#\#\#\#\#

IMPORTANT: Use EXACTLY the tag format \#\#\#\#\#\#TAG\_NAME\#\#\#\#\#\# and \#\#\#\#\#\#END\_STEP\#\#\#\#\#\#\.\\

Do not skip any section\. Keep the solution concise \-\- avoid unnecessary verbosity\.

Solve the following problem:

\{problem\_text\}

### F\.2 Baseline Evaluation Prompts

You are an expert engineering professor grading a student’s solution against a ground truth answer\.

You must provide a score out of 10 \(where 10 is perfect and 0 is entirely incorrect\) for each of the following 8 standard engineering reasoning steps:

1\. PROBLEM\_CHARACTERIZATION: Identifies the underlying physics, problem type, and governing principles\.

2\. ASSUMPTIONS: Makes valid assumptions based on problem information with physical justification\.

3\. VISUAL\_INTERPRETATION: Correctly interprets diagrams, FBDs, geometric information, and visual constraints\.

4\. EQUATION\_SELECTION: Verifies correct governing equation, justified simplifications, appropriate coordinate system, correct BCs\.

5\. LOGICAL\_REASONING: Ensures logical validity and meaningful contribution of each reasoning step\.

6\. ALGEBRAIC\_ACCURACY: Evaluates derivation, numerical substitutions, algebraic manipulations, and expressions\.

7\. PHYSICAL\_INTERPRETATION: Evaluates whether the model interprets the final result physically\.

8\. FINAL\_ANSWER: Compares predicted answer with ground truth using strict numerical error thresholds\.

Compare the student’s solution to the Ground Truth image provided\.

OUTPUT FORMAT:

Your response MUST be a valid JSON object matching this exact structure:

\{

"step\_evaluations": \[

\{

"step\_name": "PROBLEM\_CHARACTERIZATION",

"score": <int between 0 and 10\>,

"reasoning": "Brief justification for this score"

\},

\.\.\. \(do this for all 8 steps\)

\],

"overall\_reasoning": "A brief summary of the student’s overall performance\."

\}

\[INPUT: QUESTION IMAGE\(S\)\]

\[INPUT: GROUND TRUTH SOLUTION IMAGE\(S\)\]

QUESTION:

\[INPUT: QUESTION TEXT\]

STUDENT SOLUTION:

\[INPUT: GENERATED SOLUTION TEXT\]

### F\.3 EngJudge Evaluation Prompts

#### F\.3\.1 Problem Characterization Prompt

You are an extremely strict engineering exam grader evaluating the PROBLEM CHARACTERIZATION step of an LLM\-generated solution to a graduate\-level engineering problem\.

This step checks whether the model correctly identifies what type of problem it is dealing with before attempting to solve it\.

\*\*\*CRITICAL INSTRUCTION\*\*\*

When comparing against the GROUND TRUTH IMAGE, first EXTRACT AND FOCUS ONLY on the problem characterization, given problem statements, and top\-level governing principles\. Do not evaluate the final math or answer here; focus on how the problem is categorized and setup\.

\-\-\-

QUESTION:

\{question\_text\}

PROBLEM CHARACTERIZATION STEP:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

EVALUATION CRITERIA:

1\. PHYSICS DOMAIN & PROBLEM TYPE

\- Is the correct branch of engineering/physics identified \(e\.g\., heat transfer, fluid mechanics\)?

\- Is the specific sub\-topic correctly identified \(e\.g\., forced vs natural convection\)?

\- Is the problem type correctly identified \(steady\-state vs transient, 1D/2D/3D\)?

2\. GOVERNING PRINCIPLES

\- Are the relevant physical laws mentioned \(conservation of mass/energy/momentum\)?

\- Are the governing principles appropriate for this problem?

3\. KEY VARIABLES & GEOMETRY

\- Are the important given quantities correctly identified?

\- Is it clear what quantity needs to be found?

\- Is the physical configuration and geometry correctly understood \(pipe flow, cylinder, etc\.\)?

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

\| Severity \| Points \| Criterion 1 & 2: Physics Domain, Type, & Principles \| Criterion 3: Key Variables & Geometry \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|:\-\-\-\|

\| MINOR \| 2 \| Slightly imprecise terminology; missing a minor background detail\. \| Missed a non\-critical geometric parameter\. \|

\| MODERATE \| 4 \| Wrong sub\-classification \(e\.g\., calls natural convection forced convection\)\. \| Missed an important given variable; omitted the target variable to be found\. \|

\| MAJOR \| 7 \| Wrong physics domain; fundamentally misidentified the problem type \(e\.g\., transient instead of steady\)\. \| Misidentified the core geometry \(e\.g\., treated a sphere as a cylinder\)\. \|

\| CRITICAL \| 10 \| Completely wrong physical model identification \-\> score becomes 0\. \| Completely failed to extract any meaningful variables or setup\. \|

FATAL ERRORS:

\- Governing equation will be wrong due to mischaracterization \-\> cap at 3

\- Physical model identified is invalid for this problem \-\> cap at 4

SCORING: step\_score = max\(0, 10 \- total\_penalties\)

FEW\-SHOT EXAMPLE:

Input Segment

Question Segment:

A long cylindrical rod generates heat internally at a constant volumetric rate\.

The outer surface of the rod is maintained at a constant temperature\.

Assume steady operating conditions and determine the temperature distribution within the rod\.

Problem Characterization Provided:

Problem Type: 1D Transient Heat Conduction in a Cylinder

Reasoning for Evaluation:

The problem explicitly states steady operating conditions and continuous internal heat generation, which indicates a steady\-state heat conduction problem\.

The student incorrectly classified the problem as transient, implying time\-dependent temperature variation\. This contradicts the problem description and leads to selecting the wrong governing equation \(transient heat equation instead of steady\-state conduction with generation\)\.

Expected JSON Output

’’’

\{\{

"errors":\[

\{\{

"description":"Misidentified the problem as transient heat conduction despite the problem explicitly stating steady operating conditions\.",

"criterion":"1 & 2: Physics Domain, Type, & Principles",

"severity":"MAJOR",

"penalty":7

\}\}

\],

"fatal\_errors":\[

\{\{

"type":"physical\_model\_invalid",

"description":"Classifying a steady\-state heat generation problem as transient leads to selecting the wrong governing equation\.",

"score\_cap":4

\}\}

\],

"total\_penalties":7,

"raw\_score":3,

"capped\_score":3,

"justification":"The student fundamentally misclassified the problem as transient instead of steady\-state\."

\}\}

’’’

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"errors":\[

\{\{

"description":"<what the error is\>",

"criterion":"<Physics Domain, Type, & Principles \| Key Variables & Geometry\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"fatal\_errors":\[

\{\{

"type":"<governing\_equation\_incorrect\|physical\_model\_invalid\>",

"description":"<explanation\>",

"score\_cap":<3 or 4\>

\}\}

\],

"total\_penalties":<sum of all penalties\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<min\(raw\_score, lowest fatal\_error cap\) or raw\_score if no fatal errors\>,

"justification":"<detailed explanation explicitly comparing the solution’s characterization against the GROUND TRUTH REFERENCE and the original image\>"

\}\}
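The penalty-and-cap arithmetic that this prompt asks the grader to report can be re-checked outside the model. The sketch below is a minimal illustration, not the authors' implementation: the function name `step_score` and its signature are hypothetical, and only the rules stated in the prompt (subtract the summed penalties from 10, floor at 0, then apply the lowest fatal-error cap) come from the source.

```python
from typing import Iterable, Tuple


def step_score(penalties: Iterable[int], fatal_caps: Iterable[int] = ()) -> Tuple[int, int]:
    """Return (raw_score, capped_score) for one evaluation stage.

    Hypothetical helper: mirrors the SCORING line and the capped_score
    placeholder of the prompt, nothing more.
    """
    total = sum(penalties)
    raw = max(0, 10 - total)              # step_score = max(0, 10 - total_penalties)
    caps = list(fatal_caps)
    capped = min([raw, *caps]) if caps else raw  # lowest fatal-error cap wins
    return raw, capped


# Few-shot example of this prompt: one MAJOR penalty (7) and a fatal-error cap of 4.
print(step_score([7], [4]))
```

For the few-shot example above this gives a raw score of 3 and a capped score of 3, which agrees with the expected JSON.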

#### F\.3\.2 Assumptions Prompt

You are an extremely strict engineering exam grader evaluating the ASSUMPTIONS step of an LLM\-generated solution to a graduate\-level engineering problem\.

This step checks whether the model makes valid, justified, and complete assumptions before solving the problem\. Assumptions may be explicit \(stated\) or implicit \(required by chosen equations\)\.

\*\*\*CRITICAL INSTRUCTION\*\*\*

When comparing against the GROUND TRUTH IMAGE, first EXTRACT AND FOCUS ONLY on the specific list of assumptions, simplifications, and justifications provided\. Do not evaluate equations or math here; focus solely on the physical assumptions being made\.

\-\-\-

QUESTION:

\{question\_text\}

ASSUMPTIONS STEP:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

EVALUATION CRITERIA:

1\. VALIDITY & JUSTIFICATION

\- Is each assumption physically valid for this problem?

\- Is each assumption justified with a physical reason or standard practice?

\- Are any assumptions clearly wrong or too aggressive \(e\.g\., removing essential physics\)?

2\. COMPLETENESS & CONSISTENCY

\- Are all necessary assumptions stated \(steady\-state, 1D, incompressible, etc\.\)?

\- Are assumptions consistent with information given in the problem and diagrams?

\- Are any assumptions contradicted by the problem statement?

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

\| Severity \| Points \| Criterion 1: Validity & Justification \| Criterion 2: Completeness & Consistency \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|:\-\-\-\|

\| MINOR \| 2 \| Slightly imprecise justification for a valid assumption\. \| Missing a minor assumption that doesn’t strictly affect the math\. \|

\| MODERATE \| 4 \| Poorly justified simplification\. \| Missing an important assumption required for the chosen equations\. \|

\| MAJOR \| 7 \| Invalid assumption that changes the solution approach\. \| Assumption directly contradicted by given problem data\. \|

\| CRITICAL \| 10 \| Assumption that leads to completely wrong physics \(e\.g\., inviscid flow in viscosity\-dominated problem\)\. \| Systematically contradicted the entire setup\. \|

FATAL ERRORS:

\- Assumption leads to invalid physical model \-\> cap at 4

\- Assumption directly contradicts the problem statement \-\> cap at 4

SCORING: step\_score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Question Segment

A long cylindrical pipe carries hot water and loses heat to the surrounding air\.

The pipe wall has thermal conductivity k = 45 W/mK\.

The outer surface of the pipe is exposed to air with a convection coefficient h = 25 W/m2K\.

Assume steady operating conditions\.

Student Assumptions

Steady state

One\-dimensional radial heat conduction through the pipe wall

Negligible convection heat transfer from the pipe surface

Reasoning for Evaluation

The problem explicitly states that the pipe surface is exposed to air with a convection coefficient h = 25 W/m2K\.

The student assumed negligible convection heat transfer, which contradicts the physical mechanism specified in the problem\. This removes an essential heat transfer mode from the model\.

Expected JSON Output:

‘‘‘json

\{\{

"errors":\[

\{\{

"description":"Assumed isothermal walls, but the problem explicitly states a uniform heat flux condition\.",

"criterion":"2\. Completeness & Consistency",

"severity":"MAJOR",

"penalty":7

\}\}

\],

"fatal\_errors":\[

\{\{

"type":"boundary\_conditions\_invalid",

"description":"Assuming isothermal instead of isoflux changes the boundary condition type fundamentally\.",

"score\_cap":4

\}\}

\],

"total\_penalties":7,

"raw\_score":3,

"capped\_score":3,

"justification":"The student made a boundary condition assumption that directly contradicts the problem statement\."

\}\}

‘‘‘

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"errors":\[

\{\{

"description":"<what the error is\>",

"criterion":"<Validity & Justification \| Completeness & Consistency\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"fatal\_errors":\[

\{\{

"type":"<physical\_model\_invalid\|boundary\_conditions\_invalid\>",

"description":"<explanation\>",

"score\_cap":<4\>

\}\}

\],

"total\_penalties":<sum of all penalties\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<min\(raw\_score, lowest fatal\_error cap\) or raw\_score if no fatal errors\>,

"justification":"<detailed explanation explicitly comparing the solution’s assumptions against those in the GROUND TRUTH REFERENCE images\>"

\}\}
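The single-brace placeholders (`{question_text}`, `{step_content}`, `{ground_truth}`) combined with the doubled braces around the JSON blocks suggest that these prompts are filled in with Python-style string formatting. The source does not state this, so the sketch below should be read as an assumption: the template is a heavily shortened stand-in for the prompt above, and `build_prompt` is a hypothetical helper.

```python
# Shortened, hypothetical stand-in for the Assumptions prompt.
# Single braces are placeholders; doubled braces survive as literal JSON braces.
TEMPLATE = (
    "QUESTION:\n{question_text}\n\n"
    "ASSUMPTIONS STEP:\n{step_content}\n\n"
    "GROUND TRUTH REFERENCE:\n{ground_truth}\n\n"
    "OUTPUT FORMAT (JSON ONLY):\n"
    '{{\n  "errors": [],\n  "total_penalties": 0\n}}'
)


def build_prompt(question_text: str, step_content: str, ground_truth: str) -> str:
    """Fill the three placeholders; str.format turns '{{' and '}}' into '{' and '}'."""
    return TEMPLATE.format(
        question_text=question_text,
        step_content=step_content,
        ground_truth=ground_truth,
    )


prompt = build_prompt(
    "A long cylindrical pipe carries hot water ...",
    "Steady state; 1D radial conduction",
    "Steady state; 1D radial conduction; surface convection",
)
print(prompt)
```

Under this reading, the doubled braces in the listings are an escaping detail of the template and the grader model only ever sees single braces.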

#### F\.3\.3 Visual Interpretation Prompt

You are an extremely strict engineering exam grader evaluating the VISUAL INTERPRETATION step of an LLM\-generated solution to a graduate\-level engineering problem\.

This step checks whether the model correctly reads and interprets diagrams, figures, free\-body diagrams \(FBDs\), and geometric information from the problem\.

\*\*\*CRITICAL INSTRUCTION\*\*\*

When comparing against the GROUND TRUTH IMAGE, first EXTRACT AND FOCUS ONLY on how visual diagram elements are identified and described\. Do not evaluate algebra or final answers\.

\-\-\-

QUESTION:

\{question\_text\}

VISUAL INTERPRETATION STEP:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

EVALUATION CRITERIA:

1\. DIMENSIONS & GEOMETRY

\- Are all dimensions correctly read from the diagram \(lengths, radii, angles\)?

\- Are geometric relationships \(parallel, concentric\) correctly identified?

2\. BOUNDARY CONDITIONS & LOADING

\- Are applied forces, pressures, heat fluxes, or boundary temperatures correctly identified?

\- Are support conditions \(fixed, pinned, free\) correctly read?

\- Is flow direction or boundary layer type correctly noted from the visual?

3\. MATERIALS & COORDINATES

\- Are different materials or regions properly recognized?

\- Is the spatial orientation correctly understood?

\- Is the coordinate system consistent with the diagram?

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

\| Severity \| Points \| Criterion 1 & 3: Dimensions, Geometry & Coordinates \| Criterion 2: Boundary Conditions & Loading \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|:\-\-\-\|

\| MINOR \| 2 \| Imprecise geometric description; minor dimension misread that has little impact\. \| Missed a minor visual annotation\. \|

\| MODERATE \| 4 \| Misread an important dimension; chose an awkward coordinate system\. \| Missed a boundary condition explicitly shown in the diagram\. \|

\| MAJOR \| 7 \| Fundamentally misread geometry \(e\.g\., wrong shape or orientation\)\. \| Missed critical loading or flux shown in diagram\. \|

\| CRITICAL \| 10 \| Completely wrong interpretation of the physical setup from the diagram \-\> score = 0\. \| Completely inverted the loading/flow directions\. \|

FATAL ERRORS:

\- Misinterpretation leads to wrong physical model \-\> cap at 4

\- Dimensions or geometry so wrong that governing equation is inapplicable \-\> cap at 3

SCORING: step\_score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Input Segment:

From the diagram, we can see a flat plate of length L = 2 m\. Flow approaches from the left at U\_inf = 10 m/s\. The plate is heated starting from x = 0\.

Reasoning for Evaluation:

The provided diagram clearly shows an unheated starting length, where heating only begins at x = 0\.5 m\. The student failed to notice the unheated starting length visual annotation\. This is a missed boundary condition\.

Expected JSON Output:

‘‘‘json

\{\{

"errors":\[

\{\{

"description":"Failed to visually identify the unheated starting length from x = 0 to x = 0\.5 m shown in the diagram\.",

"criterion":"2\. Boundary Conditions & Loading",

"severity":"MODERATE",

"penalty":4

\}\}

\],

"fatal\_errors":\[\],

"total\_penalties":4,

"raw\_score":6,

"capped\_score":6,

"justification":"The student missed an important visual boundary condition \(unheated starting length\), which will affect the thermal boundary layer formulation\."

\}\}

‘‘‘

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"errors":\[

\{\{

"description":"<what the error is\>",

"criterion":"<Dimensions, Geometry & Coordinates \| Boundary Conditions & Loading\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"fatal\_errors":\[

\{\{

"type":"<physical\_model\_invalid\|governing\_equation\_incorrect\>",

"description":"<explanation\>",

"score\_cap":<3 or 4\>

\}\}

\],

"total\_penalties":<sum of all penalties\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<min\(raw\_score, lowest fatal\_error cap\) or raw\_score if no fatal errors\>,

"justification":"<detailed explanation explicitly comparing the solution’s visual reading of dimensions and geometry against the GROUND TRUTH REFERENCE images\>"

\}\}

#### F\.3\.4 Equation Selection Prompt

You are an extremely strict engineering exam grader evaluating the EQUATION SELECTION step of an LLM\-generated solution\.

This step must be evaluated with EXTREME strictness\. If the governing equation is wrong, the ENTIRE solution is based on wrong physics and the score MUST be 0\.

\*\*\*CRITICAL INSTRUCTION\*\*\*

When comparing against the GROUND TRUTH IMAGE, first EXTRACT AND FOCUS ONLY on the selection of governing equations and associated boundary differential math\. Do not evaluate algebra, derivations, or answers here\.

\-\-\-

QUESTION:

\{question\_text\}

EQUATION SELECTION STEP:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

STRICT EVALUATION CRITERIA:

1\. GOVERNING EQUATION \(Critical Axis\)

\- Is the correct governing equation chosen for this physical system?

\- Is it the right form \(differential vs integral, 1D vs 2D\)?

\- If the governing equation is fundamentally wrong \-\> score = 0 immediately\.

2\. BOUNDARY CONDITIONS & COORDINATES

\- Are the equations for boundary conditions correctly formulated?

\- Is the chosen coordinate system \(Cartesian, cylindrical, spherical\) appropriate?

\- Are vector quantities expressed correctly?

3\. JUSTIFICATION & SIMPLIFICATION

\- Are simplifying assumptions justified in the equation form \(e\.g\., dropping transient term\)?

\- Are there any invalid, applicability\-exceeded, or dimensionally inconsistent equations?

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

\| Severity \| Points \| Criterion 2 & 3: BCs, Coordinates, & Simplifications \| Criterion 1: Governing Equation \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|:\-\-\-\|

\| MINOR \| 2 \| Small notation issue; missing a minor supportive equation\. \| \(N/A\) \|

\| MODERATE \| 4 \| Chose a slightly sub\-optimal coordinate system; unjustified equation simplification\. \| Used a slightly incorrect form of the governing equation\. \|

\| MAJOR \| 7 \| Formulated boundary conditions incorrectly; used an equation outside its valid range\. \| \(N/A\) \|

\| CRITICAL \| 10 \| Dimensionally inconsistent boundary conditions\. \| Chose the fundamentally wrong governing equation \-\> score = 0\. \|

FATAL ERRORS:

\- Governing equation incorrect \-\> score = 0

\- Dimensionally inconsistent equations \-\> cap at 2

\- Boundary conditions that invalidate the formulation \-\> cap at 4

SCORING RULES:

\- If governing\_equation\_correct is false \-\> score = 0

\- Otherwise, step\_score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Input Segment:

Problem: Heat Generation in a cylindrical wire?

Governing Equation:

Heat equation in spherical coordinates with heat generation\.

1/r^2 d/dr\(r^2 dT/dr\) \+ q\_dot/k = 0

Reasoning for Evaluation:

The problem is about heat conduction in a long cylindrical wire\. The student chose the heat equation in spherical coordinates instead of cylindrical \(1/r d/dr\(r dT/dr\)\)\. This is a fundamentally wrong governing equation because the geometry radically changes the Laplacian operator\.

Expected JSON Output:

‘‘‘json

\{\{

"governing\_equation\_correct":false,

"errors":\[

\{\{

"description":"Selected spherical coordinate Laplacian for a cylindrical wire problem\.",

"criterion":"1\. Governing Equation",

"severity":"CRITICAL",

"penalty":10

\}\}

\],

"fatal\_errors":\[

\{\{

"type":"governing\_equation\_incorrect",

"description":"Using spherical coordinates for a cylinder fundamentally changes the differential equation, making it impossible to reach the correct solution\.",

"score\_cap":0

\}\}

\],

"total\_penalties":10,

"raw\_score":0,

"capped\_score":0,

"justification":"The student chose the wrong governing equation by using the spherical form for a cylindrical problem\."

\}\}

‘‘‘

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"governing\_equation\_correct":<true\|false\>,

"errors":\[

\{\{

"description":"<what the error is\>",

"criterion":"<1\. Governing Equation \| 2 & 3: BCs, Coordinates, & Simplifications\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"fatal\_errors":\[

\{\{

"type":"<governing\_equation\_incorrect\|dimensionally\_inconsistent\|boundary\_conditions\_invalid\>",

"description":"<explanation\>",

"score\_cap":<0, 2, or 4\>

\}\}

\],

"total\_penalties":<sum of all penalties\>,

"raw\_score":<0 if governing equation wrong, else max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<min\(raw\_score, lowest fatal\_error cap\) or raw\_score if no fatal errors\>,

"justification":"<detailed explanation explicitly comparing the solution’s equation selection against the GROUND TRUTH REFERENCE images\>"

\}\}
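Unlike the other stages, this prompt adds a hard gate: a wrong governing equation forces the score to 0 before any penalty arithmetic. A minimal sketch of that rule follows; the function `equation_selection_score` and its arguments are hypothetical names, and the logic only restates the SCORING RULES and fatal-error caps listed in the prompt.

```python
from typing import Iterable


def equation_selection_score(
    governing_equation_correct: bool,
    penalties: Iterable[int],
    fatal_caps: Iterable[int] = (),
) -> int:
    """Hypothetical re-computation of the capped score for the equation-selection stage."""
    if not governing_equation_correct:
        return 0                          # gate: wrong governing equation -> score = 0
    raw = max(0, 10 - sum(penalties))     # otherwise the usual penalty rule
    caps = list(fatal_caps)
    return min([raw, *caps]) if caps else raw


# Few-shot example of this prompt: spherical form chosen for a cylindrical wire.
print(equation_selection_score(False, [10], [0]))
# A correct equation with one MINOR issue and a dimensional-inconsistency cap of 2.
print(equation_selection_score(True, [2], [2]))
```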

#### F\.3\.5 Logical Reasoning Prompt

You are an extremely strict engineering exam grader evaluating the LOGICAL REASONING quality of an LLM\-generated solution step to a graduate\-level engineering problem\.

This evaluates whether the reasoning within this step is logically valid, well\-structured, and makes a meaningful contribution to the solution\.

\*\*\*CRITICAL INSTRUCTION\*\*\*

When comparing against the GROUND TRUTH IMAGE, first EXTRACT AND FOCUS ONLY on the logical flow, text explanations, and justification claims\. Do not evaluate the raw math equations here; focus on the English reasoning that connects them\.

\-\-\-

QUESTION:

\{question\_text\}

REASONING STEP BEING EVALUATED:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

EVALUATION CRITERIA:

1\. LOGICAL VALIDITY & COMPLETENESS

\- Does each claim follow logically from the previous one?

\- Are there any non\-sequiturs, circular arguments, or unjustified conclusions?

\- Are all necessary logical links present, or are there massive leaps?

\- Does the reasoning contribute meaningfully to solving the problem?

2\. PHYSICS CAUSALITY & PROPORTIONALITY

\- Is the direction of physical causation correct \(e\.g\., temperature gradient causes heat flow\)?

\- Are proportional relationships stated correctly?

\- Do the logical claims align with physical reality?

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

\| Severity \| Points \| Criterion 1: Validity & Completeness \| Criterion 2: Physical Causality \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|:\-\-\-\|

\| MINOR \| 2 \| Slightly imprecise text reasoning; minor gap in justification\. \| Slightly confusing physical explanation\. \|

\| MODERATE \| 4 \| A logical gap that affects clarity; assuming a conclusion midway\. \| Wrong proportionality statement \(e\.g\., says T increases with x when it decreases\)\. \|

\| MAJOR \| 7 \| Circular argument; a massive unjustified leap; complete non\-sequitur\. \| Incorrect cause\-and\-effect reasoning\. \|

\| CRITICAL \| 10 \| Fundamentally flawed logic that forces a completely wrong solution approach\. \| Claims that contradict the most basic laws of physics\. \|

SCORING: step\_score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Question Segment:

A metal rod has one end maintained at 400 K and the other at 300 K\.

Assume steady one\-dimensional heat conduction along the rod\.

Explain the direction of heat flow\.

Reasoning Step Being Evaluated:

Since the colder end of the rod has a lower temperature, heat will naturally flow from the cold end toward the hot end in order to equalize the temperature difference\.

Ground Truth Reference:

Heat conduction occurs from higher temperature to lower temperature according to Fourier’s law\.

Reasoning for Evaluation

The reasoning incorrectly states the direction of heat transfer\. Heat flows from hot to cold, not from cold to hot\. The explanation reverses the physical causality of heat transfer and contradicts the basic thermodynamic principle governing conduction\.

Expected JSON Output

\{\{

"errors":\[

\{\{

"description":"Claimed that heat flows from the colder region to the hotter region, reversing the correct direction of heat transfer\.",

"criterion":"2\. Physics Causality",

"severity":"MAJOR",

"penalty":7

\}\}

\],

"fatal\_errors":\[\],

"total\_penalties":7,

"raw\_score":3,

"capped\_score":3,

"justification":"The reasoning reverses the fundamental physical causality of heat transfer, incorrectly stating that heat flows from cold to hot\."

\}\}

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"errors":\[

\{\{

"description":"<what the error is\>",

"criterion":"<1\. Validity & Completeness \| 2\. Physics Causality\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"fatal\_errors":\[\],

"total\_penalties":<sum of all penalties\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<same as raw\_score\>,

"justification":"<detailed explanation explicitly comparing the solution’s logical flow and physics causality against the GROUND TRUTH REFERENCE images\>"

\}\}
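Because every stage asks for the same `errors` / `total_penalties` / `raw_score` fields, a grader report can be checked for internal consistency after parsing. The sketch below is one possible check, not something described in the source: `check_report` and `SEVERITY_POINTS` are hypothetical names, and the severity-to-points mapping is the one given in the rubric tables.

```python
import json
from typing import List

# Severity-to-penalty mapping as stated in the rubric tables.
SEVERITY_POINTS = {"MINOR": 2, "MODERATE": 4, "MAJOR": 7, "CRITICAL": 10}


def check_report(raw_json: str) -> List[str]:
    """Return a list of inconsistencies found in one grader report (hypothetical helper)."""
    report = json.loads(raw_json)
    errors = report.get("errors", [])
    problems = []
    for i, err in enumerate(errors):
        expected = SEVERITY_POINTS.get(err.get("severity"))
        if expected is None:
            problems.append(f"errors[{i}]: unknown severity")
        elif err.get("penalty") != expected:
            problems.append(f"errors[{i}]: penalty does not match severity")
    total = sum(err.get("penalty") or 0 for err in errors)
    if report.get("total_penalties") != total:
        problems.append("total_penalties is not the sum of the listed penalties")
    if report.get("raw_score") != max(0, 10 - total):
        problems.append("raw_score is not max(0, 10 - total_penalties)")
    return problems


# Numeric fields of the few-shot output above (text fields omitted for brevity).
example = (
    '{"errors": [{"severity": "MAJOR", "penalty": 7}], "fatal_errors": [], '
    '"total_penalties": 7, "raw_score": 3, "capped_score": 3}'
)
print(check_report(example))  # []
```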

#### F\.3\.6 Algebraic Accuracy Prompt

You are an extremely strict engineering exam grader evaluating the ALGEBRAIC ACCURACY step of an LLM\-generated solution to a graduate\-level engineering problem\.

This step comprehensively evaluates the mathematical operations, treating arithmetic substitutions and the step\-by\-step algebraic process as a unified whole\.

\*\*\*CRITICAL INSTRUCTION\*\*\*

When comparing against the GROUND TRUTH IMAGE, first EXTRACT AND FOCUS ONLY on the specific mathematical steps, algebraic manipulations, and numerical substitutions relevant to this step of the problem\. Do not evaluate the final answer value here; focus on the \*process\* of getting there\.

\-\-\-

QUESTION:

\{question\_text\}

ALGEBRAIC ACCURACY STEP:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

EVALUATION CRITERIA:

\- Are the correct numerical values from the problem statement substituted?

\- Are standard physical constants \(g, R, sigma, etc\.\) used with correct values?

\- Are unit conversions performed correctly before substitution?

\- Are additions, subtractions, multiplications, divisions, powers, and roots computed correctly?

\- Are signs correct and handled properly?

\- Are significant figures and rounding handled appropriately?

\- Are units dimensionally consistent after substitution?

\- Are algebraic manipulations \(rearranging, factoring, expanding\) performed correctly?

\- Are integrals evaluated correctly \(limits, technique\)?

\- Are boundary conditions applied correctly during derivation?

\- Are simplifications mathematically valid \(e\.g\., small\-angle approximations\)?

\- Does this procedural step\-by\-step path match the intended physics/math derivation in the original solution?

\- Are variable dependencies handled correctly?

\-\-\-

GRADING RUBRIC \(PENALTIES\):

\| Severity \| Points \| Description of Errors \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|

\| MINOR \| 2 \| Small rounding differences; insignificant arithmetic slip that doesn’t propagate much; significant figures and precision handling error; skipped a trivial algebraic step; minor notation inconsistency\. \|

\| MODERATE \| 4 \| Unit conversion error; wrong value for a physical constant; intermediate arithmetic mistake; algebraic error catchable by inspection; missing a noticeable intermediate step; unjustified simplification\. \|

\| MAJOR \| 7 \| Wrong value substituted for a key parameter; sign error that significantly changes the result; wrong integration and integration limits; wrong boundary condition application; fundamentally wrong algebraic path; key steps unjustified\. \|

\| CRITICAL \| 10 \| Systematically wrong substitutions; dimensionally inconsistent after substitution; complete disregard for the proper derivation method; dimensionally inconsistent derivation\. \|

FATAL ERRORS:

\- Operation creates a dimensionally inconsistent expression \-\> cap at 2

\- Wrong BC application invalidates the mathematical formulation \-\> cap at 4

SCORING: step\_score = max\(0, 10 \- total\_penalties\)

\-\-\-

EXAMPLE:

Input Segment:

Substitution: Q = \(50 W/mK\) \* \(2 m^2\) \* \(300 \- 350 K\) / 0\.1 m = 50 \* 2 \* 50 / 0\.1 = 50000 W

Reasoning for Evaluation:

The student swapped \(T1 \- T2\) as \(300 \- 350\) which is \-50, but then dropped the negative sign in the next step to get a positive 50\. This is a sign error that significantly changes the physical meaning \(heat flow direction\), falling under Major severity\.

The Q value should have a dimension W\-m, but output dimension is given as W\. This is dimensionally inconsistent as well\.

Expected JSON Output:

‘‘‘json

\{\{

"errors":\[

\{\{

"description":"Dropped negative sign during substitution: \(300 \- 350\) yielded \+50 instead of \-50 in the next simplification step\.",

"criterion":"Algebraic & Process Accuracy",

"severity":"MAJOR",

"penalty":7

\}\}

\],

"fatal\_errors":\[\],

"total\_penalties":7,

"raw\_score":3,

"capped\_score":3,

"justification":"The step contains a major sign error in intermediate arithmetic, changing the direction of heat flow\."

\}\}

‘‘‘

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"errors":\[

\{\{

"description":"<what the error is\>",

"criterion":"Algebraic & Process Accuracy",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"fatal\_errors":\[

\{\{

"type":"<dimensionally\_inconsistent\|boundary\_conditions\_invalid\>",

"description":"<explanation\>",

"score\_cap":<2 or 4\>

\}\}

\],

"total\_penalties":<sum of all penalties\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<min\(raw\_score, lowest fatal\_error cap\) or raw\_score if no fatal errors\>,

"justification":"<detailed explanation explicitly comparing the solution’s mathematical steps against the GROUND TRUTH REFERENCE images\>"

\}\}

#### F\.3\.7 Physical Interpretation Prompt

You are an extremely strict engineering exam grader evaluating the PHYSICAL INTERPRETATION step of an LLM\-generated solution to a graduate\-level engineering problem\.

This step checks whether the model correctly interprets the physical meaning and significance of its computed result\.

\*\*\*CRITICAL INSTRUCTION\*\*\*

When comparing against the GROUND TRUTH IMAGE, first EXTRACT AND FOCUS ONLY on the text block where the final result is analyzed or discussed\. Do not evaluate the algebraic arrival at that result here\.

\-\-\-

QUESTION:

\{question\_text\}

PHYSICAL INTERPRETATION STEP:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

EVALUATION CRITERIA:

1\. RESULT & TREND INTERPRETATION

\- Does the model explain what the numerical result physically means?

\- Does the model correctly identify how the result depends on key parameters?

\- Are physical trends \(increasing/decreasing with T, P, V\) correct?

2\. BENCHMARKS & LIMITING CASES

\- Does the model check whether the answer magnitude is physically reasonable?

\- Is it compared against known limiting cases \(e\.g\., as k \-\> inf\)?

\- Are engineering or practical implications discussed?

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

\| Severity \| Points \| Criterion 1: Result & Trend \| Criterion 2: Benchmarks & Limits \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|:\-\-\-\|

\| MINOR \| 2 \| Superficial interpretation; missing a minor engineering context\. \| Skipped a limiting case that is requested by the problem\. \|

\| MODERATE \| 4 \| Incorrect trend identified \(e\.g\., claimed heat loss decreases with larger area\)\. \| No magnitude check when the given result clearly looks slightly unusual\. \|

\| MAJOR \| 7 \| Fundamentally wrong physical interpretation of the result\. \| Interpreted an impossible limiting case as valid\. \|

\| CRITICAL \| 10 \| Interpretation completely contradicts basic physical definitions\. \| Actively defended a physically impossible result \(e\.g\., T < 0 K\) as perfectly normal\. \|

SCORING: step\_score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Input Segment:

The calculated Nusselt number of 0\.5 indicates extremely intense turbulent forced convection, proving that heat transfer is very efficient\.

Reasoning for Evaluation:

A Nusselt number of 0\.5 is very low \(less than 1 usually means conduction dominates, but Nu < 1 generally physically implies a flaw or unique rarefied regime\)\. Characterizing Nu = 0\.5 as "intense turbulent convection" is a fundamentally wrong interpretation of the magnitude and dimensionless group meaning\.

Expected JSON Output:

‘‘‘json

\{\{

"errors":\[

\{\{

"description":"Interpreted a very low Nusselt number \(0\.5\) as indicating intense turbulent convection, which is backwards\.",

"criterion":"1\. Result & Trend",

"severity":"MAJOR",

"penalty":7

\}\}

\],

"fatal\_errors":\[\],

"total\_penalties":7,

"raw\_score":3,

"capped\_score":3,

"justification":"The student fundamentally misunderstood the physical meaning and scale of the Nusselt number\."

\}\}

‘‘‘

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"errors":\[

\{\{

"description":"<what the error is\>",

"criterion":"<1\. Result & Trend \| 2\. Benchmarks & Limits\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"fatal\_errors":\[\],

"total\_penalties":<sum of all penalties\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<same as raw\_score\>,

"justification":"<explanation of physical interpretation quality\>"

\}\}

#### F\.3\.8 Final Answer Prompt

You are an extremely strict engineering exam grader evaluating the FINAL ANSWER of an LLM\-generated solution\.

The final answer acts as a GATEKEEPER\. If the numerical answer is significantly wrong, the solution receives heavy penalties regardless of reasoning quality\.

\*\*\*CRITICAL INSTRUCTION\*\*\*

When comparing against the GROUND TRUTH IMAGE, first EXTRACT AND FOCUS ONLY on the final boxed/stated numerical answers and their units\. Do \*not\* evaluate the math steps here\.

\-\-\-

QUESTION:

\{question\_text\}

FINAL ANSWER PROVIDED:

\{step\_content\}

GROUND TRUTH ANSWER:

\{ground\_truth\}

\-\-\-

EVALUATION CRITERIA:

1\. NUMERICAL CORRECTNESS \(Primary\)

\- Compare the predicted numerical value\(s\) with the ground truth\.

\- Calculate: error = \|predicted \- ground\_truth\| / \|ground\_truth\|

\- Multiple values? Evaluate each\. The total penalty should reflect the overall accuracy\.

\- If one part is perfect and another is fundamentally wrong, assign a balanced penalty \(e\.g\., 5\-7 points\) rather than an automatic 10\.

2\. UNITS & PRESENTATION

\- Are the correct SI or problem\-specified units provided?

\- Is it clearly stated with reasonable significant figures?

3\. COMPLETENESS & PHYSICAL POSSIBILITY

\- Are ALL parts/values asked for actually provided?

\- Is the result physically impossible? \(Negative absolute temp, negative density, etc\.\)

\- Sign errors that reverse the physical meaning are MAJOR\.

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

\| Severity \| Points \| Criterion 1: Numerical Correctness \| Criterion 2: Units & Presentation \| Criterion 3: Completeness & Possibility \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|:\-\-\-\|:\-\-\-\|

\| MINOR \| 2 \| \(N/A\) \| Minor presentation issue; off by one sig\-fig\. \| Missed a very minor secondary value\. \|

\| MODERATE \| 4 \| Error \> 5% and <= 10%\. \| Incorrect unit string or missing units\. \| \(N/A\) \|

\| MAJOR \| 7 \| Error \> 10%\. \| \(N/A\) \| Missed a primary requested answer \(e\.g\. found T, forgot Q\)\. \|

\| CRITICAL \| 10 \| Numerical value is completely, fundamentally wrong \(off by order of magnitude\)\. \| \(N/A\) \| Result is physically impossible \-\> cap score at 2\. \|

NUMERICAL ERROR BANDS:

correct: error <= 5%

moderate: 5% < error <= 10%

major: error \> 10%

FATAL ERRORS:

\- Physically impossible result \(e\.g\., negative mass\) \-\> cap score at 2

SCORING RULES:

\- If physically impossible: step\_score = min\(step\_score, 2\)

\- Otherwise step\_score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Input Segment

Question Segment:

A copper rod conducts heat between two surfaces\. The heat transfer rate through the rod is calculated to be 520 W\.

Also determine the temperature at the rod midpoint, which is 340 K\.

Final Answer Provided:

Q = 500 W

T\_mid = 340 K

Ground Truth Answer:

Q = 520 W

T\_mid = 340 K

Reasoning for Evaluation

The student predicted Q = 500 W while the correct value is 520 W\.

error = \|500 \- 520\| / 520 = 0\.0385

This corresponds to 3\.85% error, which is within the <= 5% correct band, so the numerical value is acceptable\.

The midpoint temperature 340 K exactly matches the ground truth\.

Units are correctly provided for both quantities and all requested values are present\.

Expected JSON Output

’’’

\{\{

"predicted\_values":\[

\{\{

"quantity":"Heat transfer rate \(Q\)",

"predicted":500,

"ground\_truth":520,

"unit\_predicted":"W",

"unit\_expected":"W",

"numerical\_error":0\.0385,

"error\_band":"correct"

\}\},

\{\{

"quantity":"Midpoint temperature \(T\_mid\)",

"predicted":340,

"ground\_truth":340,

"unit\_predicted":"K",

"unit\_expected":"K",

"numerical\_error":0\.0,

"error\_band":"correct"

\}\}

\],

"units\_correct":true,

"physically\_possible":true,

"physical\_impossibility\_reason":null,

"all\_parts\_answered":true,

"errors":\[\],

"total\_penalties":0,

"raw\_score":10,

"capped\_score":10,

"justification":"Both numerical values are correct within tolerance and units are properly provided\."

\}\}

’’’

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"predicted\_values":\[

\{\{

"quantity":"<what is being predicted\>",

"predicted":<numerical value or null if not found\>,

"ground\_truth":<numerical value or null if not provided\>,

"unit\_predicted":"<unit provided\>",

"unit\_expected":"<expected unit\>",

"numerical\_error":<calculated relative error or null\>,

"error\_band":"<correct\|moderate\|major\|not\_comparable\>"

\}\}

\],

"units\_correct":<true\|false\>,

"physically\_possible":<true\|false\>,

"physical\_impossibility\_reason":"<reason if physically impossible, else null\>",

"all\_parts\_answered":<true\|false\>,

"errors":\[

\{\{

"description":"<what the error is\>",

"category":"<Numerical Correctness \| Units & Presentation \| Completeness\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"total\_penalties":<sum\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<min\(raw\_score, 2\) if physically impossible, else raw\_score\>,

"justification":"<detailed explanation explicitly comparing the final numerical values and units against the GROUND TRUTH REFERENCE images\>"

\}\}
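The relative-error formula and the three numerical error bands in this prompt are simple enough to recompute directly. The following is a minimal sketch under the thresholds stated above (5% and 10%); the function names `relative_error` and `error_band` are hypothetical, and the case of a zero ground-truth value, which the prompt does not address, is deliberately left unhandled.

```python
def relative_error(predicted: float, ground_truth: float) -> float:
    """error = |predicted - ground_truth| / |ground_truth| (ground_truth must be non-zero)."""
    return abs(predicted - ground_truth) / abs(ground_truth)


def error_band(error: float) -> str:
    """Map a relative error to the bands listed under NUMERICAL ERROR BANDS."""
    if error <= 0.05:
        return "correct"
    if error <= 0.10:
        return "moderate"
    return "major"


# Few-shot example of this prompt: predicted Q = 500 W, ground truth Q = 520 W.
err = relative_error(500, 520)
print(round(err, 4), error_band(err))  # 0.0385 correct
```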

#### F\.3\.9 Meta\-Eval: Verbosity Prompt

You are an extremely strict engineering exam grader evaluating the VERBOSITY of the COMPLETE LLM\-generated solution to a graduate\-level engineering problem\.

This is a meta\-evaluation\. You are checking the ENTIRE solution to ensure it is concise, direct, and free of unnecessary filler or repetition\.

\-\-\-

QUESTION:

\{question\_text\}

COMPLETE SOLUTION:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

VERBOSITY CHECKS:

1\. REPETITION & RESTATEMENT

\- Does the solution unnecessarily restate the entire question before starting?

\- Does it repeat the same conclusion multiple times across different steps?

\- Are equations written out repeatedly without any new substitution or derivation?

\- CRITICAL: The solution is REQUIRED to follow a strict tagged format \(e\.g\., \#\#\#\#\#\#PROBLEM\_CHARACTERIZATION\#\#\#\#\#\#, \#\#\#\#\#\#ASSUMPTIONS\#\#\#\#\#\#\)\. Do NOT penalize the existence of these sections as "boilerplate" or "filler"\. They are mandatory\.

2\. FILLER TEXT & OVER\-EXPLANATION

\- Is there excessive conversational filler \("As we can clearly see\.\.\.", "It is important to note that\.\.\."\)?

\- Are trivial algebraic steps over\-explained in paragraphs of text?

\- Could the solution be significantly shorter without losing any technical rigor?

\- Are there any extra assumptions, algebraic steps which are not actually needed?

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

\| Severity \| Points \| Criterion 1: Repetition & Restatement \| Criterion 2: Filler Text & Over\-Explanation \|

\|:\-\-\-\|:\-\-\-:\|:\-\-\-\|:\-\-\-\|

\| MINOR \| 2 \| Small amount of repetitive text; stated the obvious once\. \| A few conversational filler phrases\. \|

\| MODERATE \| 4 \| Repeatedly restated the given parameters in multiple steps; repeated the final answer unnecessarily\. \| Over\-explained a simple algebraic manipulation with a paragraph; extra unnecessary assumptions, algebraic steps\. \|

\| MAJOR \| 7 \| Restated the entire multi\-paragraph question verbatim before solving; circular repetitive logic\. \| Heavy reliance on conversational filler; reads like an essay rather than an engineering solution\. \|

\| CRITICAL \| 10 \| Solution is overwhelmingly padded with redundant text, burying the actual mathematical steps entirely\. \| Almost entirely filler text with minimal technical content\. \|

SCORING: score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Input Segment:

Solution:

As we can see from the extremely detailed problem statement provided above, which states that we have a pipe of radius 0\.5 m and flow of 10 m/s, it is incredibly vital and important to first calculate the Reynolds number to see if the flow is laminar or turbulent\. Laminar flow is smooth, while turbulent flow is chaotic\. We must calculate this important parameter\.

Re = rho \* V \* D / mu = 1000 \* 10 \* 1 / 0\.001 = 10,000,000\.

Since 10,000,000 is much greater than the threshold of 2300, we can conclude with absolute certainty that the flow is indeed very turbulent, which means it is chaotic and well\-mixed\.

Reasoning for Evaluation:

The student used an extreme amount of conversational filler \("incredibly vital", "absolute certainty"\) and over\-explained basic definitions \("laminar flow is smooth", "well\-mixed"\)\. This padding buries the simple Re calculation\.

Expected JSON Output:

```json
{{
  "violations": [
    {{
      "description": "Excessive conversational filler, over-explanation of basic fluid mechanics definitions, and unnecessary restatement of given parameters.",
      "category": "Filler Text & Over-Explanation",
      "severity": "MAJOR",
      "penalty": 7,
      "location": "Overall solution"
    }}
  ],
  "total_penalties": 7,
  "raw_score": 3,
  "capped_score": 3,
  "justification": "The solution is heavily padded with conversational filler and introductory text that adds zero technical value."
}}
```

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"violations":\[

\{\{

"description":"<what the violation is\>",

"category":"<Repetition & Restatement\|Filler Text & Over\-Explanation\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>,

"location":"<which parts of the solution are involved\>"

\}\}

\],

"total\_penalties":<sum of all penalties\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<same as raw\_score\>,

"justification":"<explanation of verbosity assessment\>"

\}\}

If there are no violations, return an empty violations list and raw\_score of 10\.
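The doubled braces around the JSON in these prompts, together with single-brace fields such as `{question_text}`, are consistent with Python `str.format` templating, although the paper does not state how the prompts are filled. A minimal sketch under that assumption follows; `TEMPLATE` is a shortened, hypothetical stand-in for the prompt text and `build_prompt` is an illustrative name.

```python
# Hypothetical, shortened stand-in for a judge prompt; not the full prompt text.
TEMPLATE = (
    "QUESTION:\n{question_text}\n\n"
    "COMPLETE SOLUTION:\n{step_content}\n\n"
    "GROUND TRUTH REFERENCE:\n{ground_truth}\n\n"
    "OUTPUT FORMAT (JSON ONLY):\n"
    '{{"violations": [], "raw_score": 10}}'
)


def build_prompt(question_text, step_content, ground_truth):
    """Fill the three placeholders; doubled braces become literal braces."""
    return TEMPLATE.format(
        question_text=question_text,
        step_content=step_content,
        ground_truth=ground_truth,
    )


prompt = build_prompt("Find Re.", "Re = 1e7, turbulent.", "Re = 1e7")
print(prompt)
```

Braces that appear inside the substituted values are left untouched, because `str.format` only parses the template itself.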

#### F\.3\.10 Meta\-Eval: Coverage Prompt

You are an extremely strict engineering exam grader performing a COVERAGE CHECK on the COMPLETE LLM\-generated solution to a graduate\-level engineering problem\.

This is a meta\-evaluation\. You are checking the ENTIRE solution to verify that ALL parts of the question have been addressed\.

\-\-\-

QUESTION:

\{question\_text\}

COMPLETE SOLUTION:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

COVERAGE CHECKS:

1\. ALL ASKED QUANTITIES

\- Does the solution compute every primary and secondary quantity requested?

\- If the question asks for multiple values \(e\.g\., "Find T and Q"\), are ALL of them computed?

2\. SUB\-QUESTIONS & SCOPE

\- If the question has parts \(a\), \(b\), \(c\), are ALL parts answered?

\- Does the solution address the full physically described scope \(e\.g\., if there are two connected pipes, are both analyzed\)?

3\. RELEVANT ANALYSIS & RESULTS

\- Are requested diagrams/plots mentioned or described?

\- Are final numerical answers clearly provided rather than just symbolic equations?

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

| Severity | Points | Criterion 1: Asked Quantities | Criterion 2 & 3: Scope & Analysis |
| :--- | :---: | :--- | :--- |
| MINOR | 2 | Missing units on one secondary result\. | Missed a minor implicit detail \("comment on result"\)\. |
| MODERATE | 4 | Missed plotting/sketching entirely when requested\. | Addressed part \(b\) but only implicitly; missing units on multiple results\. |
| MAJOR | 7 | Forgot to compute one of the main requested variables \(e\.g\. found T, forgot Q\)\. | Entire sub\-part \(e\.g\. Part B\) of the question completely ignored\. |
| CRITICAL | 10 | Answered the wrong question entirely\. | Majority of what was asked is not addressed at all\. |

SCORING: score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Input Segment:

Solution computes the temperature distribution T\(x\) perfectly, but stops there\.

Problem Statement: "Determine the temperature distribution T\(x\) and calculate the total heat transfer rate Q at the surface x=0\."

Reasoning for Evaluation:

The problem explicitly asks for two things: T\(x\) and Q\. The student completely forgot to calculate Q\. This is a major coverage omission of a primary requested quantity\.

Expected JSON Output:

```json
{{
  "violations": [
    {{
      "description": "Failed to calculate the total heat transfer rate Q at the surface x=0.",
      "category": "1. Asked Quantities",
      "severity": "MAJOR",
      "penalty": 7,
      "location": "End of solution / overall"
    }}
  ],
  "total_penalties": 7,
  "raw_score": 3,
  "capped_score": 3,
  "justification": "The student perfectly solved the first half of the question but completely ignored the request for the heat transfer rate Q."
}}
```

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"violations":\[

\{\{

"description":"<what is missing\>",

"category":"<Asked Quantities\|Scope & Analysis\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>,

"location":"<where it should have been\>"

\}\}

\],

"all\_parts\_covered":<true\|false\>,

"total\_penalties":<sum of all penalties\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<same as raw\_score\>,

"justification":"<detailed explanation explicitly comparing the solution’s coverage against the required parts and quantities in the GROUND TRUTH REFERENCE images\>"

\}\}

If all parts are fully covered, return an empty violations list, all\_parts\_covered=true, and raw\_score of 10\.
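Because the meta-evaluation prompts ask the judge to report both per-violation penalties and derived totals, a response can be checked for internal consistency before its score is used. The sketch below illustrates one way to do that; `check_judge_output` is a hypothetical helper, and the severity-to-penalty mapping is taken from the rubric tables (2, 4, 7, 10).

```python
import json

# Severity-to-penalty mapping as listed in the rubric matrices.
PENALTY_BY_SEVERITY = {"MINOR": 2, "MODERATE": 4, "MAJOR": 7, "CRITICAL": 10}


def check_judge_output(raw_json):
    """Return a list of inconsistencies found in one judge response (hypothetical helper)."""
    data = json.loads(raw_json)
    problems = []
    violations = data.get("violations", [])
    for i, v in enumerate(violations):
        expected = PENALTY_BY_SEVERITY.get(v.get("severity"))
        if expected is None:
            problems.append(f"violation {i}: unknown severity {v.get('severity')!r}")
        elif v.get("penalty") != expected:
            problems.append(f"violation {i}: penalty {v.get('penalty')} != {expected}")
    # Derived fields must agree with the listed penalties.
    total = sum(v.get("penalty", 0) for v in violations)
    if data.get("total_penalties") != total:
        problems.append("total_penalties does not match the sum of penalties")
    if data.get("raw_score") != max(0, 10 - total):
        problems.append("raw_score does not equal max(0, 10 - total_penalties)")
    return problems
```

An empty list means the response is self-consistent; it says nothing about whether the judge's severity choices are themselves reasonable.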

#### F\.3\.11 Meta\-Eval: Physical Sanity Prompt

You are an extremely strict engineering exam grader performing a FINAL PHYSICAL SANITY CHECK on an LLM\-generated solution\.

This is a meta\-evaluation\. You must determine whether the predicted engineering quantities across the entire solution are PHYSICALLY REASONABLE\.

If any predicted quantity is physically impossible, the ENTIRE solution score will be heavily penalized, potentially capped\.

\-\-\-

QUESTION:

\{question\_text\}

COMPLETE SOLUTION:

\{step\_content\}

GROUND TRUTH REFERENCE:

\{ground\_truth\}

\-\-\-

PHYSICAL SANITY CHECKS:

1\. SIGN CHECKS & LIMITS

\- Density, Mass, Absolute Temperature \(\>0 K\), Thermal conductivity, Viscosity must be positive\.

\- Efficiency of heat engines must be <= Carnot efficiency\.

\- Heat cannot spontaneously flow from cold to hot\.

2\. MAGNITUDE REASONABLENESS

\- Velocities should not exceed speed of light for non\-relativistic problems\.

\- Stresses should not exceed ultimate strength of specified materials ridiculously\.

\- Pressures should be physically meaningful\.

3\. CONSERVATION LAWS

\- Mass, Energy, Momentum must be conserved\.

\-\-\-

GRADING RUBRIC MATRIX \(PENALTIES\):

| Severity | Points | Criterion 1: Signs & Limits | Criterion 2 & 3: Magnitudes & Conservation |
| :--- | :---: | :--- | :--- |
| MINOR | 2 | Small violation of an assumption boundary\. | Used gauge pressure instead of absolute in a minor non\-ideal gas calculation\. |
| MODERATE | 4 | T calculated as slightly negative Celsius when expected hot\. | Stress exceeds yield strength but student does not notice\. |
| MAJOR | 7 | Claimed heat flows from cold to hot; Efficiency \> Carnot\. | Mass is clearly not conserved by a factor of 2\. |
| CRITICAL | 10 | Negative absolute temperature \(K\); Negative mass or density\. | Energy created from nothing; efficiency \> 100%\. |

SCORING: score = max\(0, 10 \- total\_penalties\)

\-\-\-

FEW\-SHOT EXAMPLE:

Input Segment:

T2 = 300 K \- 500 K = \-200 K\.

The final temperature of the gas is \-200 Kelvin\.

Reasoning for Evaluation:

Absolute temperature \(Kelvin\) cannot be negative\. This is a fundamental physical impossibility and a critical sanity check failure\.

Expected JSON Output:

```json
{{
  "violations": [
    {{
      "quantity": "Final Temperature T2",
      "value": "-200 K",
      "reason": "Absolute temperature measured in Kelvin cannot be negative.",
      "category": "1. Signs & Limits",
      "severity": "CRITICAL",
      "penalty": 10
    }}
  ],
  "physically_reasonable": false,
  "sanity_check_passed": false,
  "total_penalties": 10,
  "raw_score": 0,
  "capped_score": 0,
  "justification": "The student calculated a negative absolute temperature, which is physically impossible."
}}
```

\-\-\-

OUTPUT FORMAT \(JSON ONLY\):

\{\{

"violations":\[

\{\{

"quantity":"<what quantity is problematic\>",

"value":"<the problematic value\>",

"reason":"<why it is physically impossible or unreasonable\>",

"category":"<Signs & Limits\|Magnitudes & Conservation\>",

"severity":"<MINOR\|MODERATE\|MAJOR\|CRITICAL\>",

"penalty":<2, 4, 7, or 10\>

\}\}

\],

"physically\_reasonable":<true\|false\>,

"sanity\_check\_passed":<true\|false\>,

"total\_penalties":<sum\>,

"raw\_score":<max\(0, 10 \- total\_penalties\)\>,

"capped\_score":<same as raw\_score\>,

"justification":"<detailed explanation explicitly comparing the solution’s physical results and trends against the GROUND TRUTH REFERENCE images and known physical limits\>"

\}\}

If there are no violations, return an empty violations list and set flags to true, raw\_score=10\.
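Two of the sign-and-limit checks named in this prompt have closed-form statements: absolute temperature must be positive, and heat-engine efficiency cannot exceed the Carnot bound $1 - T_c/T_h$. The sketch below expresses them as deterministic rules purely for illustration. In the framework itself these judgments are made by the LLM judge, so this is a swapped-in technique, and the function names and flag strings are assumptions.

```python
def carnot_efficiency(t_cold_k, t_hot_k):
    """Upper bound on heat-engine efficiency between two reservoirs (temperatures in kelvin)."""
    if t_cold_k <= 0 or t_hot_k <= 0:
        raise ValueError("absolute temperatures must be positive")
    if t_cold_k > t_hot_k:
        raise ValueError("cold reservoir must not be hotter than hot reservoir")
    return 1.0 - t_cold_k / t_hot_k


def sanity_flags(temperature_k=None, efficiency=None, t_cold_k=None, t_hot_k=None):
    """Return human-readable flags for the quantities that were supplied (hypothetical helper)."""
    flags = []
    if temperature_k is not None and temperature_k <= 0:
        flags.append("non-positive absolute temperature")
    if efficiency is not None:
        if efficiency > 1.0:
            flags.append("efficiency above 100%")
        elif t_cold_k is not None and t_hot_k is not None:
            if efficiency > carnot_efficiency(t_cold_k, t_hot_k):
                flags.append("efficiency above Carnot limit")
    return flags


print(sanity_flags(temperature_k=-200))
print(sanity_flags(efficiency=0.7, t_cold_k=300, t_hot_k=600))
```

Rules of this kind cover only quantities that can be extracted as numbers; magnitude reasonableness and conservation checks in the prompt still depend on problem context.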

## Appendix G Computational Cost and Infrastructure

The proposed EngJudge evaluation framework uses isolated, step\-wise prompts to evaluate solutions, which increases grading accuracy but introduces computational requirements that vary with the deployment configuration\. We benchmark the computational costs of two evaluation pipelines: an API\-driven setup using Google Vertex AI \(gemini\-3\.1\-pro\-preview\) and a local GPU\-based setup using Qwen models \(Qwen3\-VL\-32B and Qwen3\-VL\-8B\)\.

### G\.1 API\-Based Gemini Evaluation Cost

For our primary API benchmark, the evaluator \(gemini\-3\.1\-pro\-preview\) makes 11 separate API queries per question \(8 stage\-specific runs and 3 meta\-evaluation runs\)\. The cumulative evaluation required 10\.80 hours of continuous API execution time, consuming 36\.46M input tokens and generating 1\.77M output tokens in total\. Preprocessing costs include Adobe PDF Services API remote extraction \(5–15 seconds per document\) and local neural LaTeX OCR \(pix2tex\) equation inference \(2–3 seconds per image on a standard CUDA GPU\)\. A simplified breakdown of the average time and input token consumption per question under the Gemini evaluator is summarized in Table [12](https://arxiv.org/html/2606.10833#A7.T12)\.
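The totals above can be converted into per-question averages once the number of graded solutions is known. That number is not stated in this section, so the sketch below leaves it as a caller-supplied parameter (`n_graded`, a hypothetical name), and the value 1000 in the usage line is a placeholder rather than a figure from the paper.

```python
STAGE_CALLS = 8          # stage-specific judge runs per question (from the text)
META_CALLS = 3           # meta-evaluation runs per question (from the text)
TOTAL_HOURS = 10.80      # cumulative API execution time
INPUT_TOKENS = 36.46e6   # total input tokens
OUTPUT_TOKENS = 1.77e6   # total output tokens


def per_question_cost(n_graded):
    """Average cost per graded solution; n_graded must be supplied by the caller."""
    calls = STAGE_CALLS + META_CALLS
    return {
        "api_calls": calls,
        "seconds": TOTAL_HOURS * 3600 / n_graded,
        "input_tokens": INPUT_TOKENS / n_graded,
        "input_tokens_per_call": INPUT_TOKENS / (n_graded * calls),
        "output_tokens": OUTPUT_TOKENS / n_graded,
    }


# Placeholder count, for illustration only.
print(per_question_cost(1000))
```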

Table 12: Simplified computational cost breakdown per engineering subject using gemini\-3\.1\-pro\-preview as the judge model\.

### G\.2 Local GPU\-Based Qwen Evaluation Cost

In addition to the API\-based evaluator, we evaluate a local GPU\-driven pipeline\. Candidate solution generation is performed using Qwen3\-VL\-8B on an NVIDIA L40 GPU, and step\-wise evaluation is performed using Qwen3\-VL\-32B on an NVIDIA H100 GPU\.

The evaluation runtime for a 100\-sample test suite was 6 hours using Qwen3\-VL\-32B \(averaging 3\.6 minutes per question\), while the candidate generation runtime was 5 hours using Qwen3\-VL\-8B \(averaging 3\.0 minutes per question\)\. The local deployment consumed a total of 11 GPU\-hours \(6 H100 GPU\-hours and 5 L40 GPU\-hours\)\.

### G\.3 Evaluator Hyperparameters and Infrastructure

The step\-wise evaluation judge is primarily gemini\-3\.1\-pro\-preview hosted on Google Cloud Vertex AI; the local Qwen models \(Qwen3\-VL\-32B and Qwen3\-VL\-8B\) run via a local Hugging Face pipeline\. The generation configurations and environments used for the automated grading and generation steps are detailed below:

- Gemini Configurations:
  - Temperature: 0\.2\. A low temperature was selected to keep scoring behavior close to deterministic and reduce evaluator variance\.
  - Max Output Tokens: 8,192 tokens\.
  - Thinking/Reasoning Budgets: Disabled \(VERTEX\_ENABLE\_THINKING = false\) to evaluate direct grading performance\.
  - JSON Output Validation: An automated regular\-expression and bracket\-matching recovery parser is applied to clean and restore incomplete JSON responses resulting from context window limits\.
- Qwen Configurations:
  - Temperature: 0\.7\. The default pipeline temperature of 0\.7 was preserved for the Qwen models\. Unlike the low\-variance setting chosen for Gemini, the default was retained to maintain the models’ standard inference behavior\.
  - Max Output Tokens: 8,000 tokens\.
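The JSON output validation step is described only at a high level. The following is a minimal sketch of what a regular-expression and bracket-matching recovery pass could look like, assuming the typical failure is a response cut off mid-object. The function name `recover_json` and its exact repair strategy are illustrative assumptions, not the authors' parser.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker that judges often wrap around JSON


def recover_json(text):
    """Best-effort repair of a truncated JSON object; returns a dict or None."""
    text = text.replace(FENCE + "json", "").replace(FENCE, "").strip()
    start = text.find("{")
    if start == -1:
        return None
    text = text[start:]
    try:
        # raw_decode parses the first complete object and ignores trailing text.
        obj, _ = json.JSONDecoder().raw_decode(text)
        return obj
    except json.JSONDecodeError:
        pass

    # Bracket matching: track unclosed braces/brackets outside string literals.
    stack, in_string, escaped = [], False, False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()

    repaired = text + ('"' if in_string else "")   # close a dangling string
    repaired = re.sub(r"[,\s]+$", "", repaired)    # drop a trailing comma
    repaired += "".join(reversed(stack))           # close open containers
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None
```

A repair like this can only recover fields that were emitted before the cutoff. A response truncated before `raw_score` still needs the score to be recomputed from the recovered violations or the query to be retried.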

## Appendix H Use of LLMs

In this work, LLMs were used in two ways:

- Grammar checking and language polishing during paper writing\.
- Serving as the candidate solution generators and as the evaluation judges for rubric\-based scoring\.