IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

arXiv cs.CL Papers

Summary

Introduces IsoSci, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. The study finds that 91.3% of reasoning-mode gains are knowledge-dependent, challenging common assumptions about chain-of-thought reasoning.

arXiv:2607.01431v1 Announce Type: new Abstract: We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific knowledge, enabling controlled attribution of reasoning-mode gains. Across five model pairs spanning four model families, we find that 91.3% of reasoning-mode gains are knowledge-dependent rather than structure-invariant (63/69 gains; Wilson 95% CI [82.3%, 96.0%]), directly challenging the assumption that chain-of-thought reasoning improves short-horizon procedural scientific problem-solving. Reasoning toggles on highly capable models provide less than 5 percentage points accuracy gain across all domains, and a reasoning-specialized model (o3-mini) that outperforms its standard counterpart on GPQA Diamond (+19.2 percentage points) underperforms on ISOSCI (-24.7 percentage points), showing that benchmark choice determines conclusions about reasoning utility. We release ISOSCI at https://huggingface.co/datasets/isosci/isosci
Original Article
View Cached Full Text

Cached at: 07/03/26, 05:40 AM

# IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs
Source: [https://arxiv.org/html/2607.01431](https://arxiv.org/html/2607.01431)
Samir Abdaljalil Electrical and Computer Engineering Texas A&M University College Station, TX USA Erchin Serpedin Electrical and Computer Engineering Texas A&M University College Station, TX USA Hasan Kurban College of Science and Engineering Hamad Bin Khalifa University Doha, Qatar

###### Abstract

We introduceIsoSci, a benchmark of isomorphic cross\-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation\. Each pair shares identical logical structure but requires different domain\-specific knowledge, enabling controlled attribution of reasoning\-mode gains\. Across five model pairs spanning four model families, we find that91\.3% of reasoning\-mode gains are knowledge\-dependent rather than structure\-invariant\(63/69 gains; Wilson 95% CI \[82\.3%, 96\.0%\]\), directly challenging the assumption that chain\-of\-thought reasoning improves short\-horizon procedural scientific problem\-solving\. Reasoning toggles on highly capable models provide less than 5pp accuracy gain across all domains, and a reasoning\-specialized model \(o3\-mini\) that outperforms its standard counterpart on GPQA Diamond \(\+\+19\.2pp\) underperforms onIsoSci\(−\-24\.7pp\), showing that benchmark choice determines conclusions about reasoning utility\. We releaseIsoSciat[https://huggingface\.co/datasets/isosci/isosci](https://huggingface.co/datasets/isosci/isosci)

## 1Introduction

Recent advances in large language models have increasingly emphasized*reasoning*as a key driver of performance on complex tasks\[[28](https://arxiv.org/html/2607.01431#bib.bib28),[4](https://arxiv.org/html/2607.01431#bib.bib4),[12](https://arxiv.org/html/2607.01431#bib.bib12)\]\. Techniques such as chain\-of\-thought prompting\[[26](https://arxiv.org/html/2607.01431#bib.bib26)\], reasoning\-specific training\[[21](https://arxiv.org/html/2607.01431#bib.bib21),[24](https://arxiv.org/html/2607.01431#bib.bib24)\], and test\-time compute scaling\[[19](https://arxiv.org/html/2607.01431#bib.bib19)\]have shown substantial gains on benchmarks such as GPQA\[[18](https://arxiv.org/html/2607.01431#bib.bib18)\], SciBench\[[23](https://arxiv.org/html/2607.01431#bib.bib23)\], and MMLU\-STEM\[[8](https://arxiv.org/html/2607.01431#bib.bib8)\]\.

The problem is that these benchmarks conflate two distinct capabilities: retrieving the correct domain\-specific knowledge, and applying the appropriate reasoning procedure over that knowledge\. When a model fails a chemistry problem, it is unclear whether the failure reflects an inability to recall the relevant formula or an inability to execute the required reasoning steps\. Without disentangling these factors, a basic question goes unanswered:*do reasoning mechanisms improve reasoning itself, or do they primarily improve knowledge utilization?*

We introduceIsoSci, a benchmark of isomorphic cross\-domain science problem pairs designed to answer this question directly\. Each problem is paired with a structurally identical counterpart from a different scientific domain: both require the same sequence of logical and computational steps, but depend on entirely different domain knowledge\. If a model succeeds on one problem but fails on its isomorphic counterpart, the gap must be attributed to missing knowledge, not to reasoning ability\.

UsingIsoSci, we evaluate five model pairs across four model families, covering both traditional reasoning\-vs\-standard comparisons and toggle\-based comparisons \(same model, reasoning on vs\. off\)\. Across 8,408 evaluations spanning four scientific domains, we find that91\.3% of reasoning\-mode gains are knowledge\-dependent rather than structure\-invariant\(63/69 gains across all five pairs; Wilson 95% CI \[82\.3%, 96\.0%\]\), for short\-horizon procedural science problems\. Enabling reasoning has minimal effect on overall accuracy for high\-capability models \(below 5pp across domains\), and a reasoning\-specialized model \(o3\-mini\) that outperforms its standard counterpart on GPQA \(\+\+19\.2pp\) underperforms onIsoSci\(−\-24\.7pp\), showing that benchmark choice determines conclusions about reasoning model utility\.

These findings suggest that reasoning mechanisms function primarily as*extended knowledge retrieval*on short\-horizon science tasks, increasing the probability that relevant domain facts are surfaced during generation rather than improving logical procedure execution\.

#### Contributions\.

\(1\) A construction methodology for isomorphic cross\-domain science problem pairs that hold reasoning structure constant while varying domain knowledge, applicable at any scale or domain\. \(2\) Thepknowp\_\{\\text\{know\}\}metric \(Eq\.[5](https://arxiv.org/html/2607.01431#S4.E5)\), which decomposes reasoning\-mode gains into knowledge\-dependent and structure\-invariant components\. \(3\)IsoSci, a 144\-pair benchmark spanning four scientific domains under CC\-BY\-4\.0, with empirical findings on knowledge dependence of reasoning gains, toggle effects, and benchmark\-dependent model comparisons\.

## 2Related Work

#### Reasoning in large language models\.

Methods for eliciting multi\-step behavior include chain\-of\-thought prompting\[[1](https://arxiv.org/html/2607.01431#bib.bib1),[13](https://arxiv.org/html/2607.01431#bib.bib13),[25](https://arxiv.org/html/2607.01431#bib.bib25),[26](https://arxiv.org/html/2607.01431#bib.bib26)\]and test\-time compute scaling\[[2](https://arxiv.org/html/2607.01431#bib.bib2)\]\. Evaluations of these methods typically report end\-task accuracy improvements, treating different mechanisms as interchangeable, without analyzing how they alter the balance between intermediate computation, search, and reliance on memorized patterns\. For scientific reasoning, benchmarks such as MMLU\-STEM\[[8](https://arxiv.org/html/2607.01431#bib.bib8)\], SciBench\[[23](https://arxiv.org/html/2607.01431#bib.bib23)\], and GPQA\[[18](https://arxiv.org/html/2607.01431#bib.bib18)\]cover undergraduate to graduate\-level science questions across multiple formats and difficulty levels, and have been widely used to track progress across model generations\. Their evaluations are primarily aggregate, however, offering limited insight into the sources of model success or failure\.

#### Disentangling reasoning and knowledge\.

Isolating reasoning ability from knowledge in LLMs remains an open problem\[[11](https://arxiv.org/html/2607.01431#bib.bib11),[29](https://arxiv.org/html/2607.01431#bib.bib29),[7](https://arxiv.org/html/2607.01431#bib.bib7)\]\. Chain\-of\-thought analyses suggest intermediate steps function more as structured memory retrieval than logical inference\[[9](https://arxiv.org/html/2607.01431#bib.bib9),[10](https://arxiv.org/html/2607.01431#bib.bib10),[26](https://arxiv.org/html/2607.01431#bib.bib26)\], and benchmark performance is known to be sensitive to knowledge coverage\[[17](https://arxiv.org/html/2607.01431#bib.bib17)\]\. The closest concurrent work isThapa et al\.\[[20](https://arxiv.org/html/2607.01431#bib.bib20)\], who train a PubMedBERT classifier to label biomedical QA items as reasoning\-heavy or knowledge\-heavy, finding that only 32\.8% require multi\-step reasoning and that models consistently underperform on that subset\.IsoScidiffers in three respects: we construct matched pairs with structurally identical solution procedures by design rather than classifying existing items post\-hoc; our metricpknowp\_\{\\text\{know\}\}operates at the pair level and can isolate whether a gain transfers across domains, which stratum\-level accuracy cannot\.

#### Benchmark design and controlled evaluation\.

Recent work improves benchmark quality through adversarial filtering\[[18](https://arxiv.org/html/2607.01431#bib.bib18)\], domain stratification\[[8](https://arxiv.org/html/2607.01431#bib.bib8)\], and tolerance\-based grading\[[23](https://arxiv.org/html/2607.01431#bib.bib23)\], addressing memorization and grading fidelity\[[14](https://arxiv.org/html/2607.01431#bib.bib14),[20](https://arxiv.org/html/2607.01431#bib.bib20)\]\. These designs remain aggregate and do not control for solution procedure across items\.IsoSciextends this line by enforcing structural equivalence through isomorphic cross\-domain pairs, enabling comparisons where reasoning demands are held fixed and performance can be decomposed into knowledge\-dependent and structure\-invariant components\.

## 3TheIsoSciBenchmark and Evaluation Protocol

This section formalizes theIsoScibenchmark and thepknowp\_\{\\text\{know\}\}decoupling metric\. The benchmark holds the reasoning structure of a problem constant across a cross\-domain pair while varying the domain knowledge required, so that an accuracy gap between the two members attributes to knowledge rather than to reasoning\. The 144\-pair release is one instantiation of the methodology, which extends to any scientific domain or scale\.

### 3\.1Notation and Preliminaries

Let𝒟=\{phys,chem,bio,earth\}\\mathcal\{D\}=\\\{\\text\{phys\},\\text\{chem\},\\text\{bio\},\\text\{earth\}\\\}denote the four scientific*domains*\(physics, chemistry, biology, earth science\)\. Let𝒮=\{s1,…,s5\}\\mathcal\{S\}=\\\{s\_\{1\},\\dots,s\_\{5\}\\\}denote the five*structure types*listed below\. Let𝒳\\mathcal\{X\}denote the space of natural\-language problem statements and𝒴\\mathcal\{Y\}the space of admissible answers \(multiple\-choice letters, numerical values, or short text strings\)\. A*problem*is a tupleq=\(xq,aq,dq,sq\)∈𝒳×𝒴×𝒟×𝒮q=\(x\_\{q\},a\_\{q\},d\_\{q\},s\_\{q\}\)\\in\\mathcal\{X\}\\times\\mathcal\{Y\}\\times\\mathcal\{D\}\\times\\mathcal\{S\}with textxqx\_\{q\}, gold answeraqa\_\{q\}, domaindqd\_\{q\}, and structuresqs\_\{q\}; let𝒬\\mathcal\{Q\}denote the set of all such problems\. Forq∈𝒬q\\in\\mathcal\{Q\}, letK​\(q\)K\(q\)denote the set of*domain\-specific knowledge atoms*required to solveqq\(formulas, physical or chemical constants, named domain entities\)\. Letℳ⊂𝒟×𝒟\\mathcal\{M\}\\subset\\mathcal\{D\}\\times\\mathcal\{D\}denote the set of cross\-domain*mappings*considered, with\|ℳ\|=6\|\\mathcal\{M\}\|=6covering each unordered pair of distinct domains\.

Letℱ\\mathcal\{F\}denote the set of LLM configurations under evaluation; eachf∈ℱf\\in\\mathcal\{F\}is a \(stochastic\) mapping from a prompt to a generated string in𝒴∗\\mathcal\{Y\}^\{\*\}\. We define the*evaluation function*

E:ℱ×𝒬→\{0,1\},E​\(f,q\)=1​\{extract⁡\(f​\(prompt⁡\(xq\)\)\)≡aq\},E:\\mathcal\{F\}\\times\\mathcal\{Q\}\\to\\\{0,1\\\},\\qquad E\(f,q\)\\;=\\;\\mathbf\{1\}\\\!\\left\\\{\\operatorname\{extract\}\\\!\\bigl\(f\(\\operatorname\{prompt\}\(x\_\{q\}\)\)\\bigr\)\\equiv a\_\{q\}\\right\\\},\(1\)whereprompt⁡\(⋅\)\\operatorname\{prompt\}\(\\cdot\)wrapsxqx\_\{q\}in the zero\-shot chain\-of\-thought template \(Section[4\.3](https://arxiv.org/html/2607.01431#S4.SS3)\),extract⁡\(⋅\)\\operatorname\{extract\}\(\\cdot\)applies the cascade of Section[4\.3](https://arxiv.org/html/2607.01431#S4.SS3), and≡\\equivis exact match for letters or strings and±2%\\pm 2\\%relative tolerance for numerical answers\. We writeΠ⊂ℱ×ℱ\\Pi\\subset\\mathcal\{F\}\\times\\mathcal\{F\}for the set of evaluated*model pairs*; each\(R,S\)∈Π\(R,S\)\\in\\PihasRRin the reasoning configuration andSSin the standard configuration\.

### 3\.2Formal Definition of Isomorphic Pairs

###### Definition 1\(Isomorphic problem pair\)\.

Two problemsq,q′∈𝒬q,q^\{\\prime\}\\in\\mathcal\{Q\}form an*isomorphic pair*, writtenq≅q′q\\cong q^\{\\prime\}, if all of the following hold:

1. \(i\)dq≠dq′d\_\{q\}\\neq d\_\{q^\{\\prime\}\}\(different domains\);
2. \(ii\)sq=sq′s\_\{q\}=s\_\{q^\{\\prime\}\}\(same structure type\);
3. \(iii\)there exists a bijectionϕ:K​\(q\)→K​\(q′\)\\phi:K\(q\)\\to K\(q^\{\\prime\}\)such that the solution procedure ofq′q^\{\\prime\}is obtained from that ofqqby replacing eachk∈K​\(q\)k\\in K\(q\)withϕ​\(k\)\\phi\(k\);
4. \(iv\)K​\(q\)∩K​\(q′\)=∅K\(q\)\\cap K\(q^\{\\prime\}\)=\\emptyset\(knowledge sets are disjoint\)\.

#### Structure types \(𝒮\\mathcal\{S\}\)\.

IsoScirestricts attention to five short\-horizon \(3 to 5 reasoning steps\) structure types for which the bijectionϕ\\phiin Definition[1](https://arxiv.org/html/2607.01431#Thmdefinition1)is tractable to verify:

1. 1\.*Formula recall and substitution*: recall a domain law, substitute given values, compute \(e\.g\., ideal gas law, Beer\-Lambert law\)\.
2. 2\.*Unit conversion chain*: multi\-step unit tracking across a sequence of conversions\.
3. 3\.*Conservation law application*: identify and apply a conservation principle \(energy, mass, charge, momentum\)\.
4. 4\.*Proportional reasoning*: use ratio or scaling relationships to recover an unknown quantity\.
5. 5\.*Two\-step causal chain*: qualitative reasoning where causeAAimplies effectBBimplies effectCC, with no numerical computation\.

Table[1](https://arxiv.org/html/2607.01431#S3.T1)shows a representative pair with structures=s=formula\_recall\_and\_substituteand three solution steps under non\-overlapping knowledge sets\.

Table 1:Example isomorphic pair fromIsoSci\(physics to chemistry mapping\)\. Both problems share structure typeformula\_recall\_and\_substitutewith three solution steps\. Knowledge setsK​\(q\)K\(q\)andK​\(q′\)K\(q^\{\\prime\}\)are disjoint\.RoleProblemSourceqqA 2\.0 mol sample of ideal gas at 300 K occupies 49\.2 L\. What is the pressure in atm? \(requires:P​V=n​R​TPV=nRT;R=0\.0821R=0\.0821L⋅\\cdotatm/mol⋅\\cdotK\)Targetq′q^\{\\prime\}A solution of weak acid HA has concentrationC=0\.10C=0\.10M and acid dissociation constantKa=1\.8×10−5K\_\{a\}=1\.8\\times 10^\{\-5\}\. What is the pH? \(requires:pH=−log⁡Ka​C\\mathrm\{pH\}=\-\\log\\sqrt\{K\_\{a\}C\}\)Structuressrecall formula→\\tosubstitute values→\\tocomputeDomains\(dq,dq′\)\(d\_\{q\},d\_\{q^\{\\prime\}\}\)physics \(thermodynamics\)→\\tochemistry \(acid\-base\)

### 3\.3Dataset Construction

Construction proceeds in three stages, summarized as𝒬seed→generate𝒬cand→verify𝒬pass→balanceIsoSci\\mathcal\{Q\}^\{\\text\{seed\}\}\\xrightarrow\{\\text\{generate\}\}\\mathcal\{Q\}^\{\\text\{cand\}\}\\xrightarrow\{\\text\{verify\}\}\\mathcal\{Q\}^\{\\text\{pass\}\}\\xrightarrow\{\\text\{balance\}\}\\textsc\{IsoSci\}\{\}\.

#### Stage 1: Seed collection\.

The seed pool𝒬seed\\mathcal\{Q\}^\{\\text\{seed\}\}aggregates 2,315 items from GPQA Diamond\[[18](https://arxiv.org/html/2607.01431#bib.bib18)\]\(n=198n=198\), SciBench\[[23](https://arxiv.org/html/2607.01431#bib.bib23)\]\(n=585n=585\), and MMLU\-STEM\[[8](https://arxiv.org/html/2607.01431#bib.bib8)\]\(n=1,532n=1\{,\}532\)\. Earth\-science seeds, absent from these sources, are generated synthetically with claude\-sonnet\-4\-5 \(96 problems; prompt in Appendix[B\.4](https://arxiv.org/html/2607.01431#A2.SS4)\), giving 2,411 problems\. Token\-overlap deduplication \(Jaccard\>0\.40\\operatorname\{Jaccard\}\>0\.40\) yields\|𝒬seed\|=2,190\|\\mathcal\{Q\}^\{\\text\{seed\}\}\|=2\{,\}190unique problems \(physics 867, chemistry 639, biology 588, earth 96\)\.

#### Stage 2: Isomorphic partner generation\.

For each mapping\(d,d′\)∈ℳ\(d,d^\{\\prime\}\)\\in\\mathcal\{M\}, we sample up tonseed=25n\_\{\\text\{seed\}\}=25seeds from\{q∈𝒬seed:dq=d\}\\\{q\\in\\mathcal\{Q\}^\{\\text\{seed\}\}:d\_\{q\}=d\\\}and prompt claude\-sonnet\-4\-5 to generatencand=3n\_\{\\text\{cand\}\}=3candidate target problemsq′q^\{\\prime\}per seed satisfying conditions \(i\) to \(iv\) of Definition[1](https://arxiv.org/html/2607.01431#Thmdefinition1)under the source structuresqs\_\{q\}\(prompt in Appendix[B\.2](https://arxiv.org/html/2607.01431#A2.SS2)\)\. This yields\|𝒬cand\|=429\|\\mathcal\{Q\}^\{\\text\{cand\}\}\|=429candidate pairs acrossℳ\\mathcal\{M\}\.

#### Stage 3: Automated verification\.

A panel of three judges,𝒥=\{claude\-sonnet\-4\-5,GPT\-4o\-mini,DeepSeek\-V3\}\\mathcal\{J\}=\\\{\\text\{claude\-sonnet\-4\-5\},\\text\{GPT\-4o\-mini\},\\text\{DeepSeek\-V3\}\\\}, scores each candidate on four criteria𝒞=\{clogic,cindep,cdiff,cself\}\\mathcal\{C\}=\\\{c\_\{\\text\{logic\}\},c\_\{\\text\{indep\}\},c\_\{\\text\{diff\}\},c\_\{\\text\{self\}\}\\\}corresponding to logical equivalence, domain independence, difficulty parity, and self\-containment\. Letrj​\(c,q,q′\)∈\{1,2,3,4,5\}r\_\{j\}\(c,q,q^\{\\prime\}\)\\in\\\{1,2,3,4,5\\\}denote the rating from judgej∈𝒥j\\in\\mathcal\{J\}on criterionc∈𝒞c\\in\\mathcal\{C\}\. A pair\(q,q′\)\(q,q^\{\\prime\}\)is*accepted*into𝒬pass\\mathcal\{Q\}^\{\\text\{pass\}\}if and only if

minj∈𝒥⁡minc∈𝒞⁡rj​\(c,q,q′\)≥3\.5\.\\min\_\{j\\in\\mathcal\{J\}\}\\;\\min\_\{c\\in\\mathcal\{C\}\}\\;r\_\{j\}\(c,q,q^\{\\prime\}\)\\;\\geq\\;3\.5\.\(2\)Of 429 candidates, 217 satisfy \([2](https://arxiv.org/html/2607.01431#S3.E2)\) \(50\.6% pass rate\)\. After balancing acrossℳ\\mathcal\{M\}to a target of 22 to 25 pairs per mapping, the released benchmark isIsoSci=𝒬balpass\\textsc\{IsoSci\}\{\}=\\mathcal\{Q\}^\{\\text\{pass\}\}\_\{\\text\{bal\}\}with\|IsoSci\|=144\|\\textsc\{IsoSci\}\{\}\|=144pairs \(288 problems\)\. Distribution and overall acceptance rates are reported in Table[2](https://arxiv.org/html/2607.01431#S3.T2)\.

Table 2:IsoScidataset statistics\. Of 429 candidates, 217 satisfy the acceptance rule \([2](https://arxiv.org/html/2607.01431#S3.E2)\) \(50\.6%\); 144 are retained after balancing acrossℳ\\mathcal\{M\}\. The overall accept rate is the ratio of retained to candidate\.Domain mapping\(d,d′\)\(d,d^\{\\prime\}\)CandidatesRetainedOverall accept ratephysics→\\tochemistry752533\.3%physics→\\tobiology752533\.3%physics→\\toearth sci\.602541\.7%chemistry→\\tobiology752229\.3%chemistry→\\toearth sci\.692231\.9%biology→\\toearth sci\.752533\.3%Total42914433\.6%
#### Robustness of the verification rule\.

Because claude\-sonnet\-4\-5 contributes both to candidate generation in Stage 2 and to the judge panel𝒥\\mathcal\{J\}, a confound is possible: shared blind spots could inflate the accepted set𝒬pass\\mathcal\{Q\}^\{\\text\{pass\}\}\. Recomputing \([2](https://arxiv.org/html/2607.01431#S3.E2)\) over𝒥∖\{claude\-sonnet\-4\-5\}\\mathcal\{J\}\\setminus\\\{\\text\{claude\-sonnet\-4\-5\}\\\}rejects only1/1441/144released pairs, retaining 99\.3%; no pair accepted by the two\-judge panel is rejected by the three\-judge panel\.

#### Human expert audit\.

Two PhD\-level annotators independently rated a stratified sample of 50 candidates \(25 LLM\-accepted, 25 LLM\-rejected\) on𝒞\\mathcal\{C\}\. Inter\-annotator agreement was substantial \(κ=0\.714\\kappa=0\.714, exact agreement 84%\), comparable to LLM\-human agreement \(κ∈\{0\.686,0\.648\}\\kappa\\in\\\{0\.686,0\.648\\\}\)\. Against human consensus on the 42 pairs with full annotator agreement, the LLM ensemble attained precision0\.9410\.941, recall0\.8420\.842, andF1=0\.889F\_\{1\}=0\.889\. The single false positive involved formula overlap across domains; the three false negatives indicate a conservative bias that reduces dataset size without contaminating it\.

### 3\.4Comparison with Existing Science Benchmarks

Table[3](https://arxiv.org/html/2607.01431#S3.T3)positionsIsoScirelative to existing benchmarks\. The defining property, isolating reasoning from knowledge via isomorphic pairs, is absent from all prior work\. Recent benchmarks emphasizing multimodality \(PhysUniBench\[[22](https://arxiv.org/html/2607.01431#bib.bib22)\]\), symbolic correctness \(QuantumBench\[[15](https://arxiv.org/html/2607.01431#bib.bib15)\]\), or research\-grade derivation \(FrontierMath\[[5](https://arxiv.org/html/2607.01431#bib.bib5)\]\) target complementary dimensions of scientific reasoning\.IsoSciprovides a controlled diagnostic specifically for the knowledge\-versus\-reasoning attribution question, which the above benchmarks do not address by design\.

Table 3:Comparison ofIsoSciwith existing science benchmarks\. ✓ = supported; ✗ = not supported;∼\\sim= partial\.PropertyGPQASciBenchMMLU\-STEMIsoSciIsolates reasoning from knowledge✗✗✗✓Cross\-domain isomorphic pairs✗✗✗✓Domain\-stratified∼\\sim∼\\sim✓✓Free\-response format✗✓✗✓Graduate\-level difficulty✓∼\\sim✗∼\\simPublicly released✓✓✓✓NNproblems44869514,042288

## 4Experimental Setup

### 4\.1Model Configurations

We evaluate\|Π\|=5\|\\Pi\|=5paired configurationsΠ=\{\(Ri,Si\)\}i=15\\Pi=\\\{\(R\_\{i\},S\_\{i\}\)\\\}\_\{i=1\}^\{5\}in two tiers \(Table[4](https://arxiv.org/html/2607.01431#S4.T4)\)\. Three*main pairs*are run on all four benchmarks; two*supplementary pairs*are run onIsoScialone because of API latency and budget constraints\. A pair is*traditional*ifRiR\_\{i\}andSiS\_\{i\}are distinct trained models from the same family, and*toggle*ifRiR\_\{i\}andSiS\_\{i\}are the same model with the provider’s reasoning flag set toTrueandFalserespectively\. To avoid potential contamination, we exclude all Anthropic\-family models fromΠ\\Pibecause claude\-sonnet\-4\-5 was used in Stages 1 to 3\.

Table 4:Model pairs\(R,S\)∈Π\(R,S\)\\in\\Pi\.*Traditional*pairs compare a reasoning\-trained model against a standard\-trained counterpart\.*Toggle*pairs use the same model with reasoning enabled or disabled, eliminating architectural confounds\. Supplementary pairs were evaluated onIsoScialone\.PairReasoning configRRStandard configSSTypeo3 / 4oopenai/o3\-miniopenai/gpt\-4o\-miniTraditionalQwen3\-32Bthink\-on/offqwen/qwen3\-32breasoning=Trueqwen/qwen3\-32breasoning=FalseToggleGemini 2\.0 Flashthink\-on/offgoogle/gemini\-2\.0\-flash\-001reasoning=Truegoogle/gemini\-2\.0\-flash\-001reasoning=FalseToggleSupplementary pairs \(IsoScionly\)DeepSeek\-R1 / V3deepseek/deepseek\-r1\-0528deepseek/deepseek\-chat\-v3\-0324TraditionalQwQ / Qwen2\.5qwen/qwq\-32bqwen/qwen\-2\.5\-72b\-instructTraditionalThe pair o3\-mini versus GPT\-4o\-mini follows the canonical reasoning\-versus\-standard comparison\[[16](https://arxiv.org/html/2607.01431#bib.bib16)\]; because traditional pairs differ in pretraining and RLHF objectives in addition to reasoning training, causal claims about the reasoning mechanism are most cleanly supported by the toggle pairs\. Qwen3\-32B\[[27](https://arxiv.org/html/2607.01431#bib.bib27)\]and Gemini 2\.0 Flash\[[3](https://arxiv.org/html/2607.01431#bib.bib3)\]expose a reasoning toggle via thereasoning\.enabledparameter, holding all model weights and decoding parameters constant; including a mid\-capability \(Qwen3\-32B\) and a high\-capability model \(Gemini 2\.0 Flash, 91\.9% MMLU\-STEM\) tests whether the toggle effect is capability\-dependent\. Supplementary pairs DeepSeek\-R1 versus DeepSeek\-V3\[[6](https://arxiv.org/html/2607.01431#bib.bib6)\]and QwQ\-32B versus Qwen2\.5\-72B extend coverage of traditional pairs\.

### 4\.2Evaluation Benchmarks

Main pairs are evaluated onℬ=\{IsoSci,GPQA,MMLU\-STEM,SciBench\}\\mathcal\{B\}=\\\{\\textsc\{IsoSci\}\{\},\\text\{GPQA\},\\text\{MMLU\-STEM\},\\text\{SciBench\}\\\}with item counts 288, 198, 750, and 585 respectively \(a total of 1,821 items\)\.IsoScimixes formats inherited from its seed sources: 103 pairs \(71\.5%\) are 4\-way multiple choice \(MMLU\-STEM seeds\), 38 pairs \(26\.4%\) are free\-response numerical \(SciBench seeds\), and 3 pairs \(2\.1%\) are short answer\. Supplementary pairs are evaluated onIsoScialone\. The total evaluation volume is10,92610\{,\}926API calls, of which8,4088\{,\}408\(76\.9%76\.9\\%\) returned valid responses; exclusion analysis appears in Appendix[F](https://arxiv.org/html/2607.01431#A6)\.

### 4\.3Evaluation Protocol

#### Inference\.

For eachf∈ℱf\\in\\mathcal\{F\}andq∈⋃ℬq\\in\\bigcup\\mathcal\{B\}we sampley=f​\(prompt⁡\(xq\)\)y=f\(\\operatorname\{prompt\}\(x\_\{q\}\)\)once at temperature0with maximum output 8,192 tokens\. The prompt template is the zero\-shot chain\-of\-thought instruction:

Think step by step\. Show your reasoning clearly\. Provide your final answer at the end in the format:Final Answer:<your answer\>

For multiple\-choice items \(GPQA, MMLU\-STEM, and the MCQ subset ofIsoSci\), the four choices are appended toxqx\_\{q\}in a deterministically shuffled order \(seed equals item index\)\.

#### Answer extraction\.

The mapextract:𝒴∗→𝒴∪\{⊥\}\\operatorname\{extract\}:\\mathcal\{Y\}^\{\*\}\\to\\mathcal\{Y\}\\cup\\\{\\bot\\\}applies a five\-pattern cascade and returns the first match: \(1\) the substring after`\*\*Final Answer:\*\*`; \(2\) the contents of`\\boxed\{\.\.\.\}`; \(3\) the numeric content of the last display equation`$$\.\.\.$$`; \(4\) the substring afterThereforeorThe answer is; \(5\) the last numeric expression inyy\. For multiple\-choice items, an additional pass matches written\-out answer text against the choice options\. Full algorithm appears in Appendix[C](https://arxiv.org/html/2607.01431#A3)\.

#### Grading\.

ForIsoSci, the equality test≡\\equivin \([1](https://arxiv.org/html/2607.01431#S3.E1)\) is conditioned on the format inherited from the seed: letter match for MCQ items,±2%\\pm 2\\%relative tolerance for free\-response numerical items, and exact string match for short\-answer items\. GPQA and MMLU\-STEM use letter match; SciBench uses±2%\\pm 2\\%tolerance throughout\. The±2%\\pm 2\\%rule may be inappropriate for logarithmic quantities \(pH, pKaK\_\{a\}\); manual review of 50 SciBench chemistry responses identified 3 cases \(<<0\.6pp impact\) where rounding convention differences caused false negatives\.

#### Aggregation\.

Domain\-stratified deltas \(Table[7](https://arxiv.org/html/2607.01431#S5.T7)\) pool all records for a given domain acrossℬ\\mathcal\{B\}and weight each record equally; this means benchmarks contributing more domain\-relevant items contribute more to the aggregate \(for example, MMLU\-STEM contributes 250 biology items versus GPQA’s 19\)\. Per\-benchmark breakdowns appear in Appendix[A](https://arxiv.org/html/2607.01431#A1)\.IsoSci\-specific quantities \(pknowp\_\{\\text\{know\}\},Δacc\\Delta\_\{\\text\{acc\}\}in Table[6](https://arxiv.org/html/2607.01431#S5.T6)\) useIsoScialone\.

### 4\.4Reasoning\-Knowledge Decoupling Metric

For a pair\(R,S\)∈Π\(R,S\)\\in\\Piand a problemq∈𝒬q\\in\\mathcal\{Q\}, define the*gain indicator*

GR,S​\(q\)=1​\{E​\(R,q\)=1∧E​\(S,q\)=0\}\.G\_\{R,S\}\(q\)\\;=\\;\\mathbf\{1\}\\\!\\left\\\{E\(R,q\)=1\\;\\wedge\\;E\(S,q\)=0\\right\\\}\.\(3\)For each isomorphic pair\(q,q′\)∈IsoSci\(q,q^\{\\prime\}\)\\in\\textsc\{IsoSci\}\{\}and each\(R,S\)∈Π\(R,S\)\\in\\Pi, classify the pair by its joint gain pattern under \([3](https://arxiv.org/html/2607.01431#S4.E3)\) and aggregate overIsoSci:

ks\\displaystyle k\_\{s\}=∑\(q,q′\)∈IsoSciGR,S​\(q\)​\(1−GR,S​\(q′\)\),\\displaystyle=\\sum\_\{\(q,q^\{\\prime\}\)\\in\\textsc\{IsoSci\}\{\}\}G\_\{R,S\}\(q\)\\,\\bigl\(1\-G\_\{R,S\}\(q^\{\\prime\}\)\\bigr\),kt\\displaystyle k\_\{t\}=∑\(q,q′\)∈IsoSci\(1−GR,S​\(q\)\)​GR,S​\(q′\),\\displaystyle=\\sum\_\{\(q,q^\{\\prime\}\)\\in\\textsc\{IsoSci\}\{\}\}\\bigl\(1\-G\_\{R,S\}\(q\)\\bigr\)\\,G\_\{R,S\}\(q^\{\\prime\}\),kb\\displaystyle k\_\{b\}=∑\(q,q′\)∈IsoSciGR,S​\(q\)​GR,S​\(q′\)\.\\displaystyle=\\sum\_\{\(q,q^\{\\prime\}\)\\in\\textsc\{IsoSci\}\{\}\}G\_\{R,S\}\(q\)\\,G\_\{R,S\}\(q^\{\\prime\}\)\.\(4\)A reasoning\-mode gain on a pair is*knowledge\-dependent*if it occurs on exactly one member of the pair \(it contributes toksk\_\{s\}orktk\_\{t\}\): the reasoning configuration improves on one domain but not on the structurally identical other, indicating that the gain reflects domain\-specific knowledge rather than shared reasoning structure\. A gain is*structure\-invariant*if it occurs on both members \(it contributes tokbk\_\{b\}\)\. The*knowledge\-dependence ratio*is

pknow=ks\+ktks\+kt\+kb,p\_\{\\text\{know\}\}\\;=\\;\\frac\{k\_\{s\}\+k\_\{t\}\}\{k\_\{s\}\+k\_\{t\}\+k\_\{b\}\},\(5\)computed over thengain=ks\+kt\+kbn\_\{\\text\{gain\}\}=k\_\{s\}\+k\_\{t\}\+k\_\{b\}pairs on which any gain exists\. We report Wilson 95% confidence intervals onpknowp\_\{\\text\{know\}\}\(Table[6](https://arxiv.org/html/2607.01431#S5.T6)\)\. Becausekbk\_\{b\}requires improvement on*both*pair members,pknowp\_\{\\text\{know\}\}is a conservative upper bound on knowledge\-dependence: asymmetrically expressed reasoning gains, those improving the harder member only, contribute toks\+ktk\_\{s\}\+k\_\{t\}even when they reflect structural improvement\.

### 4\.5Assumptions

The methodology depends on the following assumptions, each anchored empirically where possible\.

1. A1\.Verifiable bijection\.For every released pair\(q,q′\)∈IsoSci\(q,q^\{\\prime\}\)\\in\\textsc\{IsoSci\}\{\}, the bijectionϕ\\phiin Definition[1](https://arxiv.org/html/2607.01431#Thmdefinition1)exists and the solution procedure is preserved underϕ\\phi\. Anchor: the acceptance rule \([2](https://arxiv.org/html/2607.01431#S3.E2)\) filters pairs that fail this property, attainingF1=0\.889F\_\{1\}=0\.889against human consensus on the audited subset\.
2. A2\.Disjoint knowledge\.K​\(q\)∩K​\(q′\)=∅K\(q\)\\cap K\(q^\{\\prime\}\)=\\emptysetfor every released pair\. Anchor: criterioncindepc\_\{\\text\{indep\}\}in \([2](https://arxiv.org/html/2607.01431#S3.E2)\) enforces this; the dominant LLM\-judge failure mode is formula overlap across domains, identified in 1 of 50 audited cases\.
3. A3\.Toggle isolation\.For toggle pairs,RRandSScorrespond to the same model weights with all decoding parameters held constant except the provider\-specific reasoning flag, so anyE​\(R,q\)−E​\(S,q\)E\(R,q\)\-E\(S,q\)difference attributes to the reasoning mechanism rather than to model identity\.
4. A4\.Pair\-level independence\.Pair outcomes\(GR,S​\(q\),GR,S​\(q′\)\)\(G\_\{R,S\}\(q\),G\_\{R,S\}\(q^\{\\prime\}\)\)are independent across distinct pairs inIsoSci, supporting Wilson confidence intervals onpknowp\_\{\\text\{know\}\}and bootstrap confidence intervals onΔacc\\Delta\_\{\\text\{acc\}\}\.
5. A5\.Coverage\.The seed pool𝒬seed\\mathcal\{Q\}^\{\\text\{seed\}\}is representative of*short\-horizon, information\-complete, procedural*scientific problems at the undergraduate to early\-graduate level\. long\-horizon derivation, hypothesis generation, and open\-ended synthesis are out of scope\.

### 4\.6Scope and Limitations

The methodology covers short\-horizon \(3 to 5 step\) procedural problems across four scientific domains and the five structure types of𝒮\\mathcal\{S\}\(Section[3\.2](https://arxiv.org/html/2607.01431#S3.SS2)\)\. Long\-horizon derivations, research\-grade symbolic computation, hypothesis generation, and open\-ended synthesis are out of scope\. Generation and verification rely on LLMs, with a residual false\-positive rate of approximately 6% against human consensus; convergence ofpknowp\_\{\\text\{know\}\}across five model pairs from four families provides empirical evidence that this does not change the direction of the finding\. For toggle pairs, the reasoning flag alters generation behavior but not model weights, so toggle comparisons test inference\-time reasoning rather than reasoning\-specialized training\. Asymmetric truncation \(approximately 23% of API calls; Appendix[F](https://arxiv.org/html/2607.01431#A6)\) imposes a conservative bias: a robustness check on the valid\-response subset shifts the pooledpknowp\_\{\\text\{know\}\}estimate by 0\.9pp \(Appendix[H](https://arxiv.org/html/2607.01431#A8)\)\. Additional limitations are discussed in Section[6](https://arxiv.org/html/2607.01431#S6.SS0.SSS0.Px1)\.

## 5Results

Table[5](https://arxiv.org/html/2607.01431#S5.T5)reports accuracy for all six model configurations across all four benchmarks\.

Table 5:Accuracy \(%\) for all model configurations across benchmarks\. For toggle pairs, “R” denotes reasoning\-on, “S” denotes reasoning\-off \(same underlying model\)\. 95% bootstrap CIs reported in Appendix[A](https://arxiv.org/html/2607.01431#A1)\.ModelModeIsoSciGPQAMMLU\-STEMSciBencho3\-miniR36\.856\.160\.041\.7GPT\-4o\-miniS61\.536\.979\.238\.5Qwen3\-32BR37\.229\.860\.828\.2Qwen3\-32BS36\.824\.257\.725\.3Gemini 2\.0 FlashR61\.859\.691\.953\.5Gemini 2\.0 FlashS62\.261\.191\.952\.1### 5\.1Reasoning\-Knowledge Decoupling onIsoSci

Table[6](https://arxiv.org/html/2607.01431#S5.T6)presents the coreIsoSciresult: the attribution of reasoning\-mode gains to knowledge\-dependent versus structure\-invariant improvements\. Thengainn\_\{\\text\{gain\}\}column reports the number of pairs on which any gain exists per model pair;pknowp\_\{\\text\{know\}\}is computed exclusively over this subset, and Wilson 95% CIs are reported\. The wide CIs \(e\.g\., \[67\.6, 100\.0\] for Gemini and DeepSeek\-R1\) reflect smallngainn\_\{\\text\{gain\}\}values, a direct consequence of near\-zero or negative overall deltas for several pairs\.

Under the symmetric definition \(Eq\.[5](https://arxiv.org/html/2607.01431#S4.E5)\), which counts gains on either pair member as knowledge\-dependent, the three main model pairs yield a pooled ratio of41/43=95\.3%41/43=95\.3\\%\(Wilson 95% CI \[84\.5%, 98\.7%\]\), with per\-pair values of 100%, 89\.5%, and 100%\. Two supplementary pairs evaluated onIsoScionly — DeepSeek\-R1 vs\. DeepSeek\-V3 and QwQ\-32B vs\. Qwen2\.5\-72B — are consistent with this pattern\. Including all five pairs, the pooledpknow=91\.3%p\_\{\\text\{know\}\}=91\.3\\%\(63/69 gains, CI \[82\.3%, 96\.0%\]\), with tighter CIs than the three\-pair estimate\. QwQ\-32B shows the most structure\-invariant gains \(kb=4k\_\{b\}=4\) and equal source and target deltas \(\+\+4\.2pp on both\), suggesting partial cross\-domain transfer; itspknow=77\.8%p\_\{\\text\{know\}\}=77\.8\\%is the lowest of the five pairs but remains well above chance\. Appendix[H](https://arxiv.org/html/2607.01431#A8)reports a robustness check on the valid\-response subset; the pooled three\-pair estimate shifts by only 0\.9pp \(94\.4%, CI \[81\.9%, 98\.5%\]\), confirming the finding is not driven by asymmetric truncation\. A source/target label\-swap permutation test \(Appendix[I](https://arxiv.org/html/2607.01431#A9)\) finds no significant directional imbalance for the main pairs \(p=0\.065p=0\.065–0\.1950\.195\), though we note a marginal result for QwQ\-32B \(p=0\.048p=0\.048\)\.

Table 6:Reasoning\-knowledge decoupling onIsoSci\.ksk\_\{s\}= source\-only gains;ktk\_\{t\}= target\-only gains;kbk\_\{b\}= structure\-invariant gains \(both members\)\.pknow=\(ks\+kt\)/\(ks\+kt\+kb\)p\_\{\\text\{know\}\}=\(k\_\{s\}\+k\_\{t\}\)/\(k\_\{s\}\+k\_\{t\}\+k\_\{b\}\)with Wilson 95% CI\. Supplementary pairs \(DeepSeek, QwQ\) were evaluated onIsoScionly\.Model pairksk\_\{s\}ktk\_\{t\}kbk\_\{b\}ngainn\_\{\\text\{gain\}\}pknowp\_\{\\text\{know\}\}95% CIo3\-mini / GPT\-4o\-mini511016100\.0%\[80\.6, 100\.0\]Qwen3\-32B think on/off12521989\.5%\[68\.6, 97\.1\]Gemini Flash think on/off7108100\.0%\[67\.6, 100\.0\]DeepSeek\-R1 / DeepSeek\-V36208100\.0%\[67\.6, 100\.0\]QwQ\-32B / Qwen2\.5\-72B11341877\.8%\[54\.8, 91\.0\]Pooled \(all 5 pairs\)412266991\.3%\[82\.3, 96\.0\]Across all five model pairs, knowledge\-dependent gains dominate structure\-invariant gains, withpknowp\_\{\\text\{know\}\}ranging from 77\.8% to 100% and a pooled estimate of 91\.3%\. The pattern is consistent: when the reasoning configuration improves on one member of a pair, it typically does not improve on the isomorphic counterpart despite an identical solution procedure\. The convergence across five model pairs spanning four model families \(OpenAI, Google, Qwen, DeepSeek\) and both comparison types mitigates concerns about LLM\-judge reliability in the construction pipeline and reduces the risk that the finding is model\-specific\.

#### Interpretation\.

TheIsoScidesign provides a direct test: if reasoning improvement were structural \(better at multi\-step inference regardless of domain\), gains would manifest equally on both members of an isomorphic pair\. The data rejects this for the problem types covered byIsoSci\. Extended reasoning appears to help models retrieve and apply domain facts more thoroughly, but does not improve the logical procedure applied once those facts are retrieved\. Whether this finding extends to long\-horizon or open\-ended scientific reasoning remains an open question\.

### 5\.2Reasoning Toggles Provide Minimal Benefit on Science

Table[7](https://arxiv.org/html/2607.01431#S5.T7)reports accuracy deltas \(Δacc\\Delta\_\{\\text\{acc\}\}= reasoning−\-standard\) stratified by domain across all benchmarks\. For the two toggle pairs—where architectural confounds are eliminated—gains are 0–4pp at most\. For Gemini \(91\.9% MMLU\-STEM\), the toggle makes essentially no difference across all four domains, with all 95% CIs including zero\. Qwen3 shows small positive gains that barely exclude zero \(physics:\+\+1\.1 to\+\+4\.7pp\)\. McNemar’s test confirms this: for both toggle pairs the discordant counts are small and nearly equal \(Qwen3:b=21b=21,c=20c=20; Gemini:b=8b=8,c=9c=9\), yielding McNemar statistic=0\.0=0\.0\(p=1\.0p=1\.0\) — no evidence of a systematic toggle effect in either direction\.

Table 7:Domain\-stratified accuracy deltas \(Δacc\\Delta\_\{\\text\{acc\}\}= reasoning−\-standard, pp\) averaged across all benchmarks\. 95% bootstrap CIs in parentheses\.Model pairPhysicsChemistryBiologyEarth Sci\.o3 / GPT\-4o−\-9\.6 \(−\-14\.1,−\-5\.1\)−\-6\.0 \(−\-10\.3,−\-1\.7\)−\-9\.4 \(−\-14\.1,−\-4\.7\)−\-25\.0 \(−\-35\.6,−\-14\.4\)Qwen3 on/off\+\+2\.9 \(\+\+1\.1,\+\+4\.7\)\+\+2\.6 \(\+\+0\.7,\+\+4\.5\)\+\+2\.9 \(\+\+0\.3,\+\+5\.5\)\+\+4\.2 \(−\-4\.2,\+\+12\.5\)Gemini 2\.0 on/off\+\+0\.6 \(−\-1\.8,\+\+3\.0\)\+\+0\.7 \(−\-2\.4,\+\+3\.8\)−\-0\.6 \(−\-4\.1,\+\+2\.9\)−\-4\.2 \(−\-15\.3,\+\+6\.9\)AverageΔ\\Delta−\-2\.0−\-0\.9−\-2\.4−\-8\.3
### 5\.3Benchmark Choice Determines Conclusions About Reasoning Models

Table 8:Benchmark\-dependent reversal for o3\-mini vs\. GPT\-4o\-mini\.o3\-mini wins on GPQA Diamond but loses onIsoSciand MMLU\-STEM, demonstrating that benchmark choice drives conclusions about reasoning utility\.Benchmarko3\-miniGPT\-4o\-miniΔ\\DeltaGPQA Diamond56\.136\.9\+\+19\.2IsoSci36\.861\.5−\-24\.7MMLU\-STEM60\.079\.2−\-19\.2SciBench41\.738\.5\+\+3\.2

The o3\-mini results across benchmarks reveal a striking pattern shown in Table[5\.3](https://arxiv.org/html/2607.01431#S5.SS3)\. On GPQA Diamond, hard graduate\-level questions requiring deep conceptual reasoning, o3\-mini outperforms GPT\-4o\-mini by\+\+19\.2pp overall\. OnIsoSci—structured problems with defined solution procedures requiring knowledge recall and substitution—o3\-mini underperforms GPT\-4o\-mini by−\-24\.7pp\. On MMLU\-STEM—broad knowledge recall—o3\-mini underperforms by−\-19\.2pp\.

This benchmark\-dependent reversal has a direct implication: conclusions about reasoning model utility on science depend entirely on which benchmark is used\. GPQA Diamond, with its emphasis on conceptual depth and multi\-step derivation, favors reasoning\-specialized models\.IsoSci, with its structured formula\-substitution problems, reveals that the same model is less capable of efficient knowledge retrieval\. Neither benchmark alone provides a complete picture; together they diagnose*where*and*how*a model’s science capability breaks down\.

## 6Conclusion

We introduceIsoSci, a benchmark of isomorphic cross\-domain science problems designed to disentangle reasoning from domain knowledge in LLM evaluation\. The core contributions are a construction methodology for isomorphic problem pairs applicable at any scale or domain, and thepknowp\_\{\\text\{know\}\}metric that decomposes reasoning\-mode gains into knowledge\-dependent and structure\-invariant components\. Across five model pairs and four model families, most reasoning\-mode gains are knowledge\-dependent, and enabling reasoning yields only marginal improvements on short\-horizon procedural science tasks\. The reversal for o3\-mini, strong on GPQA Diamond and weak onIsoSci, shows that no single benchmark provides a complete picture of scientific capability\. We hope the isomorphic\-pair methodology andpknowp\_\{\\text\{know\}\}metric support more precise evaluation of scientific reasoning and motivate larger instantiations covering longer\-horizon problem types\.

#### Limitations\.

\(1\) Dataset scope:With 144 pairs,IsoSciis smaller than prior benchmarks and does not support fine\-grained sub\-domain analysis; however, it suffices for the controlled pairwise comparisons underlying our claims\.\(2\) Model coverage:We evaluate a limited set of models; however, results are consistent across traditional and toggle\-based comparisons, suggesting the observed patterns are not model\-specific\.\(3\) Grading noise:Automated grading \(±2% tolerance with pattern\-based extraction\) may introduce minor errors, mainly from formatting or unit mismatches; manual checks indicate this affects<3%<3\\%of cases and does not change overall conclusions\.\(4\) Toggle interpretation:Thereasoning\.enabledflag changes generation behavior but not model weights; it remains the cleanest available method for isolating reasoning at inference time\.

## References

- Abdaljalil et al\. \[2025\]Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, and Erchin Serpedin\.Theorem\-of\-thought: A multi\-agent framework for abductive, deductive, and inductive reasoning in language models\.In Yuji Zhang, Canyu Chen, Sha Li, Mor Geva, Chi Han, Xiaozhi Wang, Shangbin Feng, Silin Gao, Isabelle Augenstein, Mohit Bansal, Manling Li, and Heng Ji, editors,*Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models \(KnowFM\)*, pages 111–119, Vienna, Austria, August 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-283\-1\.doi:10\.18653/v1/2025\.knowllm\-1\.10\.URL[https://aclanthology\.org/2025\.knowllm\-1\.10/](https://aclanthology.org/2025.knowllm-1.10/)\.
- Chen et al\. \[2025\]Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, and Shaul Druckmann\.Rethinking fine\-tuning when scaling test\-time compute: Limiting confidence improves mathematical reasoning\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=jvVQeSMeGM](https://openreview.net/forum?id=jvVQeSMeGM)\.
- Deepmind \[2025\]Deepmind\.Gemini 2\.0 flash model card, 2025\.URL[http://storage\.googleapis\.com/deepmind\-media/Model\-Cards/Gemini\-2\-0\-Flash\-Model\-Card\.pdf](http://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-0-Flash-Model-Card.pdf)\.
- Dziri et al\. \[2023\]Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D\. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi\.Faith and fate: Limits of transformers on compositionality\.In*Thirty\-seventh Conference on Neural Information Processing Systems*, 2023\.URL[https://openreview\.net/forum?id=Fkckkr3ya8](https://openreview.net/forum?id=Fkckkr3ya8)\.
- Glazer et al\. \[2025\]Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean\-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon\.Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, 2025\.URL[https://arxiv\.org/abs/2411\.04872](https://arxiv.org/abs/2411.04872)\.
- Guo et al\. \[2025\]Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, and Xiao et al\. Bi\.Deepseek\-r1 incentivizes reasoning in llms through reinforcement learning\.*Nature*, 645\(8081\):633–638, sept 2025\.ISSN 1476\-4687\.doi:10\.1038/s41586\-025\-09422\-z\.URL[http://dx\.doi\.org/10\.1038/s41586\-025\-09422\-z](http://dx.doi.org/10.1038/s41586-025-09422-z)\.
- Hazra et al\. \[2025\]Risha Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, and Luc De Raedt\.Have large language models learned to reason? a characterization via 3\-SAT\.In*Second Conference on Language Modeling*, 2025\.URL[https://openreview\.net/forum?id=MPTlWIVSMU](https://openreview.net/forum?id=MPTlWIVSMU)\.
- Hendrycks et al\. \[2021\]Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt\.Measuring massive multitask language understanding\.In*International Conference on Learning Representations*, 2021\.URL[https://openreview\.net/forum?id=d7KBjmI3GmQ](https://openreview.net/forum?id=d7KBjmI3GmQ)\.
- Hong et al\. \[2025\]Yihuai Hong, Meng Cao, Dian Zhou, Lei Yu, and Zhijing Jin\.The reasoning\-memorization interplay in language models is mediated by a single direction\.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,*Findings of the Association for Computational Linguistics: ACL 2025*, pages 21565–21585, Vienna, Austria, July 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-256\-5\.doi:10\.18653/v1/2025\.findings\-acl\.1111\.URL[https://aclanthology\.org/2025\.findings\-acl\.1111/](https://aclanthology.org/2025.findings-acl.1111/)\.
- Jin et al\. \[2025\]Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, and Yongfeng Zhang\.Disentangling memory and reasoning ability in large language models\.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 1681–1701, Vienna, Austria, July 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-251\-0\.doi:10\.18653/v1/2025\.acl\-long\.84\.URL[https://aclanthology\.org/2025\.acl\-long\.84/](https://aclanthology.org/2025.acl-long.84/)\.
- Kartac et al\. \[2026\]Ivan Kartac, Mateusz Lango, and Ondrej Dušek\.Reasoning gets harder for llms inside a dialogue, 2026\.URL[https://arxiv\.org/abs/2603\.20133](https://arxiv.org/abs/2603.20133)\.
- Lewkowycz et al\. \[2022\]Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman\-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur\-Ari, and Vedant Misra\.Solving quantitative reasoning problems with language models\.In Alice H\. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,*Advances in Neural Information Processing Systems*, 2022\.URL[https://openreview\.net/forum?id=IFXTZERXdM7](https://openreview.net/forum?id=IFXTZERXdM7)\.
- Li et al\. \[2024\]Wanhua Li, Zibin Meng, Jiawei Zhou, Donglai Wei, Chuang Gan, and Hanspeter Pfister\.SocialGPT: Prompting LLMs for social relation reasoning via greedy segment optimization\.In*The Thirty\-eighth Annual Conference on Neural Information Processing Systems*, 2024\.URL[https://openreview\.net/forum?id=xcF2VbyZts](https://openreview.net/forum?id=xcF2VbyZts)\.
- Li et al\. \[2026\]Yuangang Li, Justin Tian Jin Chen, Ethan Yu, David Hong, and Iftekhar Ahmed\.Beyond output correctness: Benchmarking and evaluating large language model reasoning in coding tasks, 2026\.URL[https://arxiv\.org/abs/2604\.12379](https://arxiv.org/abs/2604.12379)\.
- Minami et al\. \[2025\]Shunya Minami, Tatsuya Ishigaki, Ikko Hamamura, Taku Mikuriya, Youmi Ma, Naoaki Okazaki, Hiroya Takamura, Yohichi Suzuki, and Tadashi Kadowaki\.Quantumbench: A benchmark for quantum problem solving, 2025\.URL[https://arxiv\.org/abs/2511\.00092](https://arxiv.org/abs/2511.00092)\.
- OpenAI \[2025\]OpenAI\.Openai o3 and o4\-mini system card, 2025\.URL[https://openai\.com/index/o3\-o4\-mini\-system\-card/](https://openai.com/index/o3-o4-mini-system-card/)\.
- Razeghi et al\. \[2022\]Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh\.Impact of pretraining term frequencies on few\-shot numerical reasoning\.In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,*Findings of the Association for Computational Linguistics: EMNLP 2022*, pages 840–854, Abu Dhabi, United Arab Emirates, December 2022\. Association for Computational Linguistics\.doi:10\.18653/v1/2022\.findings\-emnlp\.59\.URL[https://aclanthology\.org/2022\.findings\-emnlp\.59/](https://aclanthology.org/2022.findings-emnlp.59/)\.
- Rein et al\. \[2024\]David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R\. Bowman\.GPQA: A graduate\-level google\-proof q&a benchmark\.In*First Conference on Language Modeling*, 2024\.URL[https://openreview\.net/forum?id=Ti67584b98](https://openreview.net/forum?id=Ti67584b98)\.
- Snell et al\. \[2025\]Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar\.Scaling LLM test\-time compute optimally can be more effective than scaling parameters for reasoning\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=4FWAwZtd2n](https://openreview.net/forum?id=4FWAwZtd2n)\.
- Thapa et al\. \[2026\]Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison G Zhang, Angela Zhang, Eric Wu, Haotian Ye, and James Zou\.Reasoning or knowledge: Stratified evaluation of biomedical LLMs\.In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,*Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 2450–2483, Rabat, Morocco, March 2026\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-380\-7\.doi:10\.18653/v1/2026\.eacl\-long\.111\.URL[https://aclanthology\.org/2026\.eacl\-long\.111/](https://aclanthology.org/2026.eacl-long.111/)\.
- Wang \[2025\]Jun Wang\.A tutorial on llm reasoning: Relevant methods behind chatgpt o1, 2025\.URL[https://arxiv\.org/abs/2502\.10867](https://arxiv.org/abs/2502.10867)\.
- Wang et al\. \[2026\]Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, and Xinzhu Ma\.Physunibench: A multi\-modal physics reasoning benchmark at undergraduate level, 2026\.URL[https://arxiv\.org/abs/2506\.17667](https://arxiv.org/abs/2506.17667)\.
- Wang et al\. \[2024a\]Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R\. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang\.SciBench: Evaluating College\-Level Scientific Problem\-Solving Abilities of Large Language Models\.In*Proceedings of the Forty\-First International Conference on Machine Learning*, 2024a\.
- Wang et al\. \[2024b\]Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi Yuan, William Yang Wang, and Alessandro Sordoni\.Guiding language model reasoning with planning tokens\.In*First Conference on Language Modeling*, 2024b\.URL[https://openreview\.net/forum?id=wi9IffRhVM](https://openreview.net/forum?id=wi9IffRhVM)\.
- Wang et al\. \[2023\]Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H\. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\.Self\-consistency improves chain of thought reasoning in language models\.In*The Eleventh International Conference on Learning Representations*, 2023\.URL[https://openreview\.net/forum?id=1PL1NIMMrw](https://openreview.net/forum?id=1PL1NIMMrw)\.
- Wei et al\. \[2022\]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou\.Chain\-of\-thought prompting elicits reasoning in large language models\.In S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh, editors,*Advances in Neural Information Processing Systems*, volume 35, pages 24824–24837\. Curran Associates, Inc\., 2022\.URL[https://proceedings\.neurips\.cc/paper\_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4\-Paper\-Conference\.pdf](https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf)\.
- Yang et al\. \[2025\]An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu\.Qwen3 technical report, 2025\.URL[https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- Yu et al\. \[2025\]Jiahao Yu, Zelei Cheng, Xian Wu, and Xinyu Xing\.GPO: Learning from critical steps to improve LLM reasoning\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=c6RDAutyNE](https://openreview.net/forum?id=c6RDAutyNE)\.
- Zhou et al\. \[2025\]Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, and Bo Han\.From passive to active reasoning: Can large language models ask the right questions under incomplete information?In*Forty\-second International Conference on Machine Learning*, 2025\.URL[https://openreview\.net/forum?id=LCaTpVuvpj](https://openreview.net/forum?id=LCaTpVuvpj)\.

## Appendix AFull Accuracy Results with Confidence Intervals

Table[9](https://arxiv.org/html/2607.01431#A1.T9)reports complete accuracy results for all six model configurations across all four benchmarks and four scientific domains, with 95% bootstrap confidence intervals \(1,000 samples\)\.

Table 9:Full accuracy results \(%\) with 95% bootstrap CIs\. “R” = reasoning mode; “S” = standard mode\.nn= number of problems evaluated per cell\. “—” indicates that the source benchmark contains no questions in that scientific domain: GPQA Diamond has no earth science questions; MMLU\-STEM as downloaded contains no earth science subset; SciBench covers only physics and chemistry, with no biology or earth science problems\. Earth science coverage is provided exclusively byIsoSci, which includes 72 earth science problems per model configuration sourced from our synthetic generation pipeline \(Section[3\.3](https://arxiv.org/html/2607.01431#S3.SS3)\)\.ModelModePhysicsChemistryBiologyEarth Sci\.IsoSci\(nn=75 / 69 / 72 / 72\)o3\-miniR52\.0 \[40\.0, 64\.0\]30\.4 \[20\.3, 42\.0\]34\.7 \[25\.0, 45\.8\]29\.2 \[19\.4, 40\.3\]GPT\-4o\-miniS69\.3 \[58\.7, 80\.0\]62\.3 \[50\.7, 73\.9\]59\.7 \[48\.6, 70\.8\]54\.2 \[41\.7, 65\.3\]Qwen3\-32BR41\.3 \[29\.3, 53\.3\]21\.7 \[13\.0, 31\.9\]45\.8 \[34\.7, 57\.0\]38\.9 \[27\.8, 50\.0\]Qwen3\-32BS36\.0 \[25\.3, 46\.7\]26\.1 \[15\.9, 37\.7\]50\.0 \[38\.9, 62\.5\]34\.7 \[25\.0, 45\.8\]Gemini FlashR65\.3 \[53\.3, 76\.0\]62\.3 \[50\.7, 72\.5\]58\.3 \[47\.2, 69\.4\]61\.1 \[50\.0, 72\.2\]Gemini FlashS65\.3 \[54\.7, 76\.0\]59\.4 \[47\.8, 71\.0\]58\.3 \[47\.2, 69\.4\]65\.3 \[54\.2, 76\.4\]GPQA Diamond \(nn=86 / 93 / 19 / 0\)o3\-miniR65\.1 \[54\.7, 74\.4\]47\.3 \[37\.6, 57\.0\]57\.9 \[36\.8, 78\.9\]—GPT\-4o\-miniS44\.2 \[33\.7, 54\.7\]30\.1 \[21\.5, 39\.8\]36\.8 \[15\.8, 57\.9\]—Qwen3\-32BR46\.5 \[36\.0, 57\.0\]10\.8 \[5\.4, 18\.3\]47\.4 \[26\.2, 68\.4\]—Qwen3\-32BS30\.2 \[20\.9, 39\.5\]17\.2 \[9\.7, 24\.7\]31\.6 \[10\.5, 52\.6\]—Gemini FlashR81\.4 \[73\.3, 89\.5\]37\.6 \[28\.0, 47\.3\]68\.4 \[47\.4, 89\.5\]—Gemini FlashS81\.4 \[73\.3, 89\.5\]39\.8 \[30\.1, 50\.5\]73\.7 \[52\.6, 89\.5\]—MMLU\-STEM \(nn=250 / 250 / 250 / 0\)o3\-miniR43\.6 \[38\.0, 49\.6\]51\.6 \[45\.6, 57\.6\]84\.8 \[80\.4, 88\.8\]—GPT\-4o\-miniS71\.6 \[65\.6, 76\.4\]74\.0 \[68\.4, 79\.6\]92\.0 \[88\.8, 95\.2\]—Qwen3\-32BR45\.6 \[39\.6, 51\.6\]51\.2 \[44\.8, 57\.2\]85\.6 \[81\.2, 90\.0\]—Qwen3\-32BS44\.4 \[38\.4, 50\.8\]47\.2 \[41\.6, 53\.2\]81\.6 \[76\.8, 86\.8\]—Gemini FlashR91\.2 \[87\.2, 94\.4\]89\.2 \[84\.8, 92\.8\]95\.2 \[92\.4, 97\.6\]—Gemini FlashS91\.2 \[87\.6, 94\.4\]88\.8 \[84\.8, 92\.4\]95\.6 \[92\.8, 98\.0\]—SciBench \(nn=236 / 349 / 0 / 0\)o3\-miniR47\.0 \[41\.1, 53\.4\]38\.1 \[32\.9, 43\.3\]——GPT\-4o\-miniS45\.8 \[39\.4, 52\.1\]33\.5 \[28\.7, 38\.7\]——Qwen3\-32BR29\.7 \[23\.7, 36\.0\]27\.2 \[22\.6, 32\.4\]——Qwen3\-32BS30\.5 \[25\.0, 36\.4\]21\.8 \[17\.8, 26\.4\]——Gemini FlashR52\.5 \[46\.2, 59\.3\]54\.2 \[49\.0, 59\.3\]——Gemini FlashS50\.8 \[44\.5, 57\.2\]53\.0 \[48\.1, 58\.5\]——
## Appendix BEvaluation Prompts

### B\.1Zero\-Shot Chain\-of\-Thought Evaluation Prompt

All models receive the following system\-level instruction prepended to each question:

Evaluation prompt \(all benchmarks\)Think step by step\. Show your reasoning clearly\. Provide your final answer at the end in the format: \*\*Final Answer:\*\* <your answer\>

For GPQA Diamond \(MCQ\), the question is formatted as:

GPQA question format\{question\_text\} A\) \{option\_A\} B\) \{option\_B\} C\) \{option\_C\} D\) \{option\_D\}

Answer choices are shuffled using a deterministic seed equal to the question index, ensuring reproducibility\. The correct answer letter varies per question\.

### B\.2Isomorphic Partner Generation Prompt

The following prompt was used to generate isomorphic partner problems \(Stage 2\):

Partner generation promptSystem:You are an expert in multiple scientific disciplines with deep knowledge of physics, chemistry, biology, and earth science\. Your task is to create ISOMORPHIC science problems — problems that share identical logical and mathematical structure but require different domain knowledge\. ‘‘Isomorphic’’ means: same number and type of reasoning steps; same mathematical operations; same solution procedure; but different domain facts, constants, formulas, and named entities\. You must respond with valid JSON only\.User:I have a source problem from \{source\_domain\} with the following structure:SOURCE PROBLEM: \{question\}CORRECT ANSWER: \{answer\}REASONING STRUCTURE: \- Structure type: \{structure\_type\} \- Key formula/principle used: \{formula\} \- Solution steps: \{steps\}Your task: Generate \{n\} isomorphic partner problems in \{target\_domain\}\.Each partner must: \(1\) use a DIFFERENT formula/principle from \{target\_domain\}; \(2\) have IDENTICAL logical structure: \{structure\_type\}; \(3\) have the same number of solution steps \(\{n\_steps\} steps\); \(4\) be solvable at undergraduate level; \(5\) be completely self\-contained; \(6\) have a unique, unambiguous correct answer\.Return a JSON array with fields: question, answer, formula\_used, solution\_steps, domain\_knowledge\_required, isomorphism\_justification, sub\_topic\.

### B\.3LLM Judge Verification Prompt

The following prompt was used for automated pair verification \(Stage 3\):

LLM judge verification promptSystem:You are an expert science educator and benchmark quality reviewer\. Evaluate pairs of science problems for structural isomorphism\. Score each criterion from 1 to 5\. Respond with valid JSON only\.User:Evaluate this isomorphic problem pair:=== SOURCE PROBLEM \(\{source\_domain\}\) === \{source\_question\} ANSWER: \{source\_answer\}=== TARGET PROBLEM \(\{target\_domain\}\) === \{target\_question\} ANSWER: \{target\_answer\}=== CLAIMED ISOMORPHISM === Structure type: \{structure\_type\} Justification: \{justification\}Score this pair on 4 criteria \(1–5 each\): 1\. LOGICAL EQUIVALENCE \(1–5\): same reasoning procedure? 2\. DOMAIN INDEPENDENCE \(1–5\): knowledge required is non\-overlapping? 3\. DIFFICULTY PARITY \(1–5\): equally challenging at undergraduate level? 4\. SELF\-CONTAINMENT \(1–5\): fully specified with all needed information?Return JSON: \{‘‘logical\_equivalence’’: int, ‘‘domain\_independence’’: int, ‘‘difficulty\_parity’’: int, ‘‘self\_containment’’: int, ‘‘answer\_seems\_correct’’: bool, ‘‘rejection\_reason’’: str or null\}

### B\.4Earth\-Science Seed Problem Generation Prompt

The following prompt was used to generate the 96 synthetic earth\-science seed problems in Stage 1 \(Section[3\.3](https://arxiv.org/html/2607.01431#S3.SS3)\)\. The same prompt template was used for any domain where benchmark coverage was insufficient; in practice only earth science required synthetic generation\. Temperature was set to 0\.7 to encourage topical diversity\.

Seed generation prompt \(earth science\)System:You are an expert science educator creating evaluation problems\. Your task is to generate clear, well\-defined science problems suitable for a benchmark dataset\. Each problem must: \(1\) be solvable at college or advanced undergraduate level; \(2\) have a single unambiguous correct answer; \(3\) require a clear, identifiable reasoning procedure; \(4\) be self\-contained \(all needed information is in the problem\)\. Respond only with valid JSON \-\-\- no preamble or explanation\.User:Generate \{n\} distinct \{structure\_type\} problems in earth science\.Structure type definition: \- formula\_recall\_and\_substitute: student must recall a specific law/formula, substitute given values, compute result \- unit\_conversion\_chain: multi\-step unit conversion requiring tracking of units throughout \- conservation\_law\_application: identify and apply a conservation principle \(energy, mass, charge, etc\.\) \- proportional\_reasoning: use ratios or scaling relationships to find an unknown quantity \- two\_step\_causal\_chain: qualitative reasoning where A leads to B leads to C \(no computation required\)Requirements: \- All numerical values must be given in the problem \- Difficulty: college undergraduate \- Vary the sub\-topics within earth science \- Each problem must be solvable in 3\-\-5 reasoning stepsReturn a JSON array of objects, each with: \{‘‘question’’: ‘‘full problem text’’, ‘‘answer’’: ‘‘correct answer with units if applicable’’, ‘‘solution\_steps’’: \[‘‘step 1’’, ‘‘step 2’’, \.\.\.\], ‘‘formula\_used’’: ‘‘name of the key formula or principle’’, ‘‘sub\_topic’’: ‘‘specific topic within earth science’’, ‘‘estimated\_steps’’: <integer 3\-\-5\>\}

## Appendix CAnswer Extraction Algorithm

Algorithm[1](https://arxiv.org/html/2607.01431#alg1)describes the answer extraction procedure applied to all model responses\.

Algorithm 1Answer extraction from model response1:Response text

rr; answer choices

𝒞\\mathcal\{C\}\(MCQ only\)

2:Extracted answer string

aa
3:if

rris emptythenreturn“”

4:endif

5:

a←a\\leftarrowRegexSearch\(

rr,\*\*Final Answer:\*\* \(\.\+\)\)

6:if

a≠a\\neqnullthengotopost\-process

7:endif

8:

a←a\\leftarrowRegexSearch\(

rr,\\boxed\{\(\.\+\)\}\)

9:if

a≠a\\neqnullthengotopost\-process

10:endif

11:

a←a\\leftarrowRegexSearch\(

rr,$$ number $$\)

12:if

a≠a\\neqnullthengotopost\-process

13:endif

14:

a←a\\leftarrowRegexSearch\(

rr,Therefore/The answer is \(\.\+\)\)

15:if

a≠a\\neqnullthengotopost\-process

16:endif

17:forline

ℓ\\ellin reversed\(lines\(

rr\)\)do

18:if

ℓ\\ellcontains a numberthen

a←a\\leftarrowextracted number;gotopost\-process

19:endif

20:endfor

21:

a←a\\leftarrowlast non\-empty line of

rr
22:

23:Post\-processing\(MCQ only, applied to all paths above\):

24:if

𝒞≠∅\\mathcal\{C\}\\neq\\emptysetand

aais not a single letter A–Dthen

25:if

aamatches any choice text in

𝒞\\mathcal\{C\}then

a←a\\leftarrowmatching letter

26:endif

27:endif

28:return

aa

## Appendix DExample Dataset Pairs

We present representative examples fromIsoSci, illustrating the structure of isomorphic problem pairs across domains\. Each pair consists of a source and target problem with identical reasoning structure but distinct domain knowledge\.

Example 1: Conservation Law \(Physics→\\toChemistry\)Structure:conservation law \+ proportional reasoningSource problem \(Physics\):A liquid flows at a constant flow rate through a pipe with circular cross\-sections of varying diameters\. At one point in the pipe, the diameter is2​cm2\\ \\text\{cm\}and the flow speed is18​m/s18\\ \\text\{m/s\}\. What is the flow speed at another point in this pipe, where the diameter is3​cm3\\ \\text\{cm\}?A\)4​m/s4\\ \\text\{m/s\} B\)6​m/s6\\ \\text\{m/s\} C\)8​m/s8\\ \\text\{m/s\} D\)12​m/s12\\ \\text\{m/s\}Answer: C\)8​m/s8\\ \\text\{m/s\}Target problem \(Chemistry\):A gas diffuses through a porous membrane at a constant molar flow rate\. At one location in the membrane, the cross\-sectional area is4\.0​cm24\.0\\ \\text\{cm\}^\{2\}and the diffusion flux is0\.12​mol/\(cm2⋅s\)0\.12\\ \\text\{mol\}/\(\\text\{cm\}^\{2\}\\cdot\\text\{s\}\)\. What is the diffusion flux at another location where the cross\-sectional area is6\.0​cm26\.0\\ \\text\{cm\}^\{2\}?A\)0\.04​mol/\(cm2⋅s\)0\.04\\ \\text\{mol\}/\(\\text\{cm\}^\{2\}\\cdot\\text\{s\}\) B\)0\.06​mol/\(cm2⋅s\)0\.06\\ \\text\{mol\}/\(\\text\{cm\}^\{2\}\\cdot\\text\{s\}\) C\)0\.08​mol/\(cm2⋅s\)0\.08\\ \\text\{mol\}/\(\\text\{cm\}^\{2\}\\cdot\\text\{s\}\) D\)0\.18​mol/\(cm2⋅s\)0\.18\\ \\text\{mol\}/\(\\text\{cm\}^\{2\}\\cdot\\text\{s\}\)Answer: C\)0\.08​mol/\(cm2⋅s\)0\.08\\ \\text\{mol\}/\(\\text\{cm\}^\{2\}\\cdot\\text\{s\}\)

Example 2: Statistical Inference \(Physics→\\toChemistry\)Structure:CLT \+ standardization \+ probability lookupSource problem \(Physics\):LetXXequal the maximal oxygen intake of a human on a treadmill, measured in milliliters of oxygen per minute per kilogram of body weight\. Assume that, for a particular population, the mean ofXXisμ=54\.030\\mu=54\.030and the standard deviation isσ=5\.8\\sigma=5\.8\. LetX¯\\bar\{X\}be the sample mean of a random sample of sizen=47n=47\.FindP​\(52\.761≤X¯≤54\.453\)P\(52\.761\\leq\\bar\{X\}\\leq 54\.453\), approximately\.Answer:0\.62470\.6247Target problem \(Chemistry\):LetYYequal the molar concentration of a sodium chloride solution, measured in moles per liter\. Assume that, for a particular preparation method, the mean ofYYisμ=0\.850​M\\mu=0\.850\\ \\text\{M\}and the standard deviation isσ=0\.042​M\\sigma=0\.042\\ \\text\{M\}\. LetY¯\\bar\{Y\}be the sample mean of a random sample of sizen=36n=36\.FindP​\(0\.836≤Y¯≤0\.859\)P\(0\.836\\leq\\bar\{Y\}\\leq 0\.859\), approximately\.Answer:0\.62470\.6247

## Appendix EDataset Release and Reproducibility

#### Dataset and Croissant metadata\.

IsoSciis released on HuggingFace at[https://huggingface\.co/datasets/isosci/isosci](https://huggingface.co/datasets/isosci/isosci)\(anonymized for review\)\. The repository includes train \(80 pairs\) and test \(64 pairs\) splits stratified by domain mapping, provided for downstream studies that require held\-out evaluation sets, such as fine\-tuning or few\-shot prompting experiments\. All results in this paper use the full 144\-pair set without train/test separation\. In addition to full pair metadata, and a Croissant\-compliantcroissant\.jsonmetadata file at the repository root\. The Croissant file was validated locally using themlcroissantpackage prior to submission\.

#### Evaluation code\.

The full pipeline is released at[https://anonymous\.4open\.science/r/isosci\-603C/](https://anonymous.4open.science/r/isosci-603C/)\. The pipeline is implemented in Python 3\.9\+ and requires only standard scientific libraries plus therequestspackage for API calls\. All random seeds are fixed for reproducibility\.

#### Computational cost\.

Table[10](https://arxiv.org/html/2607.01431#A5.T10)reports approximate API costs for replicating our evaluation\.

Table 10:Approximate API cost for full replication\. Estimates are based on publicly listed OpenRouter pricing at the time of evaluation and may vary over time\.StageAPI callsEst\. cost \(USD\)Dataset construction \(Stages 1–3\)∼\\sim800$80–120Model evaluation \(Stage 4\)∼\\sim10,926$400–800Total∼\\sim11,726$480–920
#### Model versions\.

Table[11](https://arxiv.org/html/2607.01431#A5.T11)lists exact model identifiers used in this study, accessed via the OpenRouter API\.

Table 11:Exact model identifiers used in evaluation\.ModelOpenRouter identifiero3\-mini \(reasoning\)openai/o3\-miniGPT\-4o\-mini \(standard\)openai/gpt\-4o\-mini\-2024\-07\-18Qwen3\-32B thinking=ONqwen/qwen3\-32b:nitro\+reasoning\.enabled=trueQwen3\-32B thinking=OFFqwen/qwen3\-32b:nitro\+reasoning\.enabled=falseGemini 2\.0 Flash thinking=ONgoogle/gemini\-2\.0\-flash\-001\+reasoning\.enabled=trueGemini 2\.0 Flash thinking=OFFgoogle/gemini\-2\.0\-flash\-001\+reasoning\.enabled=falseDataset construction only \(not evaluated\)Claude claude\-sonnet\-4\-5 \(generation\)anthropic/claude\-sonnet\-4\-5GPT\-4o\-mini \(judge\)openai/gpt\-4o\-mini\-2024\-07\-18DeepSeek\-V3 \(judge\)deepseek/deepseek\-v3
#### Toggle implementation\.

For both Qwen3\-32B and Gemini 2\.0 Flash, we pass"reasoning": \{"enabled": true/false\}as a top\-level field in the OpenRouter API request body\. Whenenabled=false, the model generates a direct response without a visible<think\>\.\.\.</think\>block; whenenabled=true, the model prefixes its response with an explicit reasoning chain before the final answer\. Temperature is 0 in both conditions and no other parameters are modified\. Manual inspection of 20 randomly sampled response pairs confirmed that the toggle controls the presence of a visible<think\>block but that both conditions generate comparably long final responses on structured scientific problems\. We note that this toggle suppresses visible chain\-of\-thought generation but does not modify model weights; it is possible that models internally perform multi\-step reasoning in standard mode without surfacing it, in which case our comparisons measure the effect of*visible*extended reasoning\.

## Appendix FAPI Exclusion Analysis

Of 10,926 API calls \(66configurations×\\times\(288\+198\+585\+750\)\(288\+198\+585\+750\)items\), 8,408 \(76\.9%\) returned valid responses used in analysis\. The remaining 2,518 \(23\.1%\) were excluded due to output token limit truncation \(≈\\approx18%, concentrated in reasoning\-on configurations on free\-response benchmarks\), API timeout \(≈\\approx3%\), and format errors \(≈\\approx2%\)\. Reasoning\-on configurations show higher exclusion rates on SciBench \(28% vs\. 11% for standard mode\), consistent with longer reasoning chains hitting the 8,192 token limit\. This asymmetric exclusion could in principle suppress observed reasoning\-mode gains; we note it as a conservative bias — if anything it understates reasoning\-mode accuracy, making the null finding for toggle pairs more rather than less credible\. Evaluations affected by an earlier 2,048 token limit were rerun after the limit was increased\.

## Appendix GStatistical Tests

#### McNemar’s test\.

We used McNemar’s test with continuity correction to assess whether reasoning and standard configurations produce significantly different correct/incorrect patterns on pairedIsoSciitems \(n=288n=288per comparison\)\. The test statistic is\(\|b−c\|−1\)2/\(b\+c\)\(\|b\-c\|\-1\)^\{2\}/\(b\+c\)wherebb= items correct under reasoning only andcc= items correct under standard only\. Results are reported in Table[12](https://arxiv.org/html/2607.01431#A7.T12)\.

For o3\-mini vs\. GPT\-4o\-mini, the test confirms a highly significant difference \(b=16b=16,c=87c=87, stat=47\.57=47\.57,p<0\.001p<0\.001\): the standard model \(GPT\-4o\-mini\) outperforms on substantially more items than the reasoning model does\.

For the two toggle pairs, the discordant counts are small and nearly equal: Qwen3\-32B \(b=21b=21,c=20c=20\) and Gemini 2\.0 Flash \(b=8b=8,c=9c=9\)\. The continuity correction yields a statistic of 0\.00 in both cases \(p=1\.0p=1\.0\), indicating no evidence of a systematic toggle effect\. We note that the accuracies are not literally identical — Qwen3\-32B shows 37\.2% vs\. 36\.8% and Gemini shows 61\.8% vs\. 62\.2% — but the discordant pairs are balanced in both directions, meaning gains and losses from enabling reasoning cancel out almost exactly across the 288 items\.

Table 12:McNemar’s test results onIsoScipaired items \(continuity\-corrected\)\.bb= reasoning correct, standard wrong;cc= reasoning wrong, standard correct\.n=288n=288paired items per comparison\.ComparisonbbccStatisticpp\-valueInterpretationo3\-mini vs\. GPT\-4o\-mini168747\.57<<0\.001Significant differenceQwen3\-32B think\-on vs\. off21200\.001\.000No evidence of toggle effectGemini Flash think\-on vs\. off890\.001\.000No evidence of toggle effect
#### Bootstrap confidence intervals\.

All reported CIs use 1,000 bootstrap samples with the percentile method\. The random seed is fixed at 42 for all bootstrap calculations\.

#### Binomial test on source/target gain asymmetry\.

As a complement topknowp\_\{\\text\{know\}\}, we test whether source\-only gains \(ksk\_\{s\}\) and target\-only gains \(ktk\_\{t\}\) occur with equal probability under the null hypothesisH0:Pr⁡\[source\-only\]=0\.5H\_\{0\}:\\Pr\[\\text\{source\-only\}\]=0\.5\. This tests directional asymmetry in where the knowledge bottleneck falls, which is distinct from the primarypknowp\_\{\\text\{know\}\}finding\.

Across the three main model pairs \(ks=24k\_\{s\}=24,kt=17k\_\{t\}=17,n=41n=41asymmetric gains\), the binomial test yieldsp=0\.349p=0\.349: no significant directional asymmetry within the main pairs\. Pooled across all five model pairs \(ks=41k\_\{s\}=41,kt=22k\_\{t\}=22,n=63n=63\), the test yieldsp=0\.023p=0\.023, indicating that source\-only gains outnumber target\-only gains at conventional significance\. This asymmetry is driven by the supplementary pairs \(DeepSeek\-R1 and QwQ\-32B\) and should be interpreted with caution: it suggests that source problems may be slightly harder or more knowledge\-discriminating than target problems on average, but it does not affect the primary finding that knowledge\-dependent gains \(ks\+ktk\_\{s\}\+k\_\{t\}\) dominate structure\-invariant gains \(kbk\_\{b\}\) across all five pairs\.

## Appendix HRobustness Check:pknowp\_\{\\text\{know\}\}on Valid\-Response Subset

A potential concern is that asymmetric response truncation could biaspknowp\_\{\\text\{know\}\}if truncated responses are systematically correct or incorrect\. We recomputepknowp\_\{\\text\{know\}\}under the symmetric definition \(Eq\.[5](https://arxiv.org/html/2607.01431#S4.E5)\) restricted to items where both configurations produced valid, non\-truncated responses \(non\-empty response, non\-empty extracted answer, no API error\)\.

Table[13](https://arxiv.org/html/2607.01431#A8.T13)reports results\. Qwen3\-32B shows the largest exclusion rate \(88 invalid reasoning responses, 86 invalid standard responses out of 288\), leaving 194 items \(67\.4%\)\. The restricted accuracy for Qwen3\-32B rises from 37% to 54%, confirming that invalid responses are concentrated on harder items\.

Despite this, the pooled restrictedpknow=94\.4%p\_\{\\text\{know\}\}=94\.4\\%\(34/36 gains, Wilson 95% CI \[81\.9%, 98\.5%\]\) is within 3pp of the full\-set estimate of 95\.3% \(43 main pairs, CI \[84\.5%, 98\.7%\]\), and the CIs overlap substantially\. The finding is robust to exclusion of invalid responses\.

Table 13:Robustness check:pknowp\_\{\\text\{know\}\}on the valid\-response subset under the symmetric definition\. “Restrictednn” = items retained after excluding truncated or empty responses\.Model pairRestr\.nnExcl\. RExcl\. Sngainn\_\{\\text\{gain\}\}pknowp\_\{\\text\{know\}\}95% CIo3\-mini / GPT\-4o\-mini27612012100\.0%\[75\.8, 100\.0\]Qwen3\-32B think on/off19488861687\.5%\[64\.0, 96\.5\]Gemini Flash think on/off288008100\.0%\[67\.6, 100\.0\]Pooled758——3694\.4%\[81\.9, 98\.5\]Full\-set \(Table[6](https://arxiv.org/html/2607.01431#S5.T6)\)864——4395\.3%\[84\.5, 98\.7\]The Qwen3\-32B restrictedngain=16n\_\{\\text\{gain\}\}=16is smaller than the full\-set value of 19, as expected: restricting to valid responses removes some pairs where a gain existed, shrinking the denominator\. Thepknowp\_\{\\text\{know\}\}estimate changes from 89\.5% to 87\.5%, well within the overlapping confidence intervals\. The direction of the finding is unchanged across all three model pairs and both the full\-set and restricted analyses\.

## Appendix ISource/Target Label Permutation Test

A potential concern is that the asymmetry between source\-only gains \(ksk\_\{s\}\) and target\-only gains \(ktk\_\{t\}\) reflects a systematic difficulty imbalance between source and target problems rather than domain knowledge asymmetry\. To test this, we performed a label\-swap permutation test: for each pair, we randomly swapped the source and target labels with probability 0\.5, recomputed\|ks−kt\|\|k\_\{s\}\-k\_\{t\}\|on the shuffled data, and repeated for 1,000 iterations\. The empiricalpp\-value is the fraction of permutations yielding\|ks−kt\|≥\|k\_\{s\}\-k\_\{t\}\|\\geqthe observed value\.

Results are reported in Table[14](https://arxiv.org/html/2607.01431#A9.T14)\. The observed source/target imbalance is not statistically significant at any conventional threshold \(p=0\.065p=0\.065–0\.1950\.195per model pair\)\. We cannot rule out that some portion of theksk\_\{s\}/ktk\_\{t\}asymmetry reflects difficulty differences between source and target problems rather than directional knowledge asymmetry\.

Crucially, this test addresses a secondary question about the*direction*of knowledge\-dependent gains, not the primary finding\. The main claim rests onpknow=\(ks\+kt\)/\(ks\+kt\+kb\)=95\.3%p\_\{\\text\{know\}\}=\(k\_\{s\}\+k\_\{t\}\)/\(k\_\{s\}\+k\_\{t\}\+k\_\{b\}\)=95\.3\\%, which measures whether knowledge\-dependent gains \(ks\+kt=41k\_\{s\}\+k\_\{t\}=41\) dominate structure\-invariant gains \(kb=2k\_\{b\}=2\)\. This ratio is unaffected by the source/target label assignment: swapping labels convertsksk\_\{s\}gains intoktk\_\{t\}gains and vice versa, but leavesks\+ktk\_\{s\}\+k\_\{t\}unchanged\. The permutation test therefore has no bearing on the primary finding\.

Table 14:IsoSciSource/target label\-swap permutation test\. The test statistic is\|ks−kt\|\|k\_\{s\}\-k\_\{t\}\|; the null distribution is generated by randomly swapping source and target labels within pairs \(1,000 iterations, seed 42\)\.pp\-values above 0\.05 indicate the observed imbalance is consistent with random label assignment\. This test addresses label\-direction asymmetry, not the primarypknowp\_\{\\text\{know\}\}finding\.Model pairksk\_\{s\}ktk\_\{t\}\|ks−kt\|\|k\_\{s\}\-k\_\{t\}\|Perm\.pp\-valueo3\-mini / GPT\-4o\-mini51160\.195Qwen3\-32B think on/off12570\.136Gemini Flash think on/off7160\.065

Similar Articles

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

arXiv cs.AI

SciR is a new controllable benchmark for evaluating LLMs on scientific reasoning including deduction, induction, and causal abduction, with parametric control over extraction and inference difficulty. Tests show both axes degrade performance across models, with reasoning models like DeepSeek-R1 outperforming instruct models on inference.

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

arXiv cs.CL

Introduces mmPISA-bench, a compact multilingual reasoning benchmark derived from PISA, and evaluates proprietary LLMs across 43 languages, finding that they reason effectively with some performance variations, and that machine-translated questions do not degrade accuracy.