Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

arXiv cs.AI 05/19/26, 04:00 AM Papers
llm-reasoning scientific-logicality physics reasoning-improvement dataset training-methodology
Summary
This paper introduces a methodology to enrich scientific logicality in LLM reasoning, including assessment criteria and data sampling methods, and demonstrates its effectiveness on physics problems using multiple backbone LLMs.
arXiv:2605.17104v1 Announce Type: new
Cached at: 05/19/26, 06:39 AM
# Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics
Source: [https://arxiv.org/html/2605.17104](https://arxiv.org/html/2605.17104)
###### Abstract

With the continuous advancement of reasoning abilities in Large Language Models (LLMs), their application to scientific reasoning tasks has gained significant research attention. Current research primarily emphasizes boosting LLMs' performance on scientific QA benchmarks by training on larger, more comprehensive datasets with extended reasoning chains. However, these approaches neglect the essence of the scientific reasoning process: logicality, which is the rational foundation to ensure the validity of reasoning steps leading to reliable conclusions. In this work, we make the first systematic investigation into the internal logicality underlying LLM scientific reasoning, and develop a scientific logicality-enriched methodology, including a set of assessment criteria and data sampling methods for logicality-guided training, to improve the logical faithfulness as well as task performance. Further, we take physics, characterized by its diverse logical structures and formalisms, as an exemplar discipline to practise the above methodology. For data construction, we extract scientific problems from academic literature and sample a high-quality dataset exhibiting strong logicality. Experiments based on three different backbone LLMs reveal that: 1) the training data we constructed can effectively improve the scientific logicality in LLM reasoning; and 2) the enriched scientific logicality plays a critical role in solving scientific problems. Code is available at [https://github.com/ScienceOne-AI/PhysLogic](https://github.com/ScienceOne-AI/PhysLogic).

scientific logicality, logicality assessment, LLM physics reasoning

![Refer to caption](https://arxiv.org/html/2605.17104v1/x1.png)

Figure 1: Comparison of the scientific reasoning paradigms of DeepSeek-R1 and a professional (human): the LLM lacks the scientific logicality possessed by human experts.

## 1 Introduction

With the continuous advancement of Large Language Models (LLMs), significant research efforts and progress have been made to apply them to solving scientific problems across disciplines such as mathematics, physics, and chemistry, aiming to enhance the efficiency of academic research and education (Zhang et al., [2024b](https://arxiv.org/html/2605.17104#bib.bib39); Zheng et al., [2025b](https://arxiv.org/html/2605.17104#bib.bib12)). For complex problem solving, early work focuses on strategies at inference time and designs structured procedures that guide LLMs to reason step by step (Wei et al., [2022](https://arxiv.org/html/2605.17104#bib.bib8); Wang et al., [2023](https://arxiv.org/html/2605.17104#bib.bib7)). More recently, reasoning models such as DeepSeek R1 and OpenAI o1 adopt a training-time paradigm that instills sophisticated reasoning abilities during learning (Guo et al., [2025](https://arxiv.org/html/2605.17104#bib.bib9); Jaech et al., [2024](https://arxiv.org/html/2605.17104#bib.bib10)), yielding strong performance across disciplinary reasoning tasks (Hu et al., [2025](https://arxiv.org/html/2605.17104#bib.bib13)). Building on this paradigm, a number of studies have constructed training corpora containing long and complex scientific reasoning traces to train LLMs (Yuan et al., [2025](https://arxiv.org/html/2605.17104#bib.bib17); Fan et al., [2025](https://arxiv.org/html/2605.17104#bib.bib21); Zhang et al., [2024a](https://arxiv.org/html/2605.17104#bib.bib18); Lu et al., [2025](https://arxiv.org/html/2605.17104#bib.bib20)). Meanwhile, many benchmarks have been built to evaluate models' scientific problem-solving capability by formulating question-answer (QA) tasks in diverse formats (Rein et al., [2024](https://arxiv.org/html/2605.17104#bib.bib23); Wang et al., [2024](https://arxiv.org/html/2605.17104#bib.bib22)).

However, these studies narrowly cast scientific reasoning as an end-to-end natural language processing task and neglect the essence of the scientific reasoning process: logicality, which encompasses a set of interrelated concepts, methods and principles, and forms the rational foundation that ensures the validity of reasoning steps and the reliability of conclusions (Popper, [2005](https://arxiv.org/html/2605.17104#bib.bib16); Díaz et al., [2023](https://arxiv.org/html/2605.17104#bib.bib40)). Figure [1](https://arxiv.org/html/2605.17104#S0.F1) illustrates the reasoning paradigms of DeepSeek-R1 and a professional human in answering a scientific question. Humans typically follow a series of interconnected logical steps, including *problem formalization*, *model generation*, *evidence generation*, *evidence evaluation* and *drawing conclusions* (corresponding to the epistemic activities in Fischer et al. ([2014](https://arxiv.org/html/2605.17104#bib.bib41))). A related study reveals that each scientific discipline has its own paradigm of reasoning, that is, a way of solving problems that is generally held in common by the community of those practising in the discipline (Dowden, [1993](https://arxiv.org/html/2605.17104#bib.bib45)). In contrast, the reasoning traces generated by current reasoning LLMs are often an ad hoc aggregation of recall, review, and self-reflection steps with lengthy iterations and relatively weak logical coherence between them.

In this paper, we conduct the first systematic investigation into the internal logicality underlying LLM scientific reasoning\. First, we design a set of criteria with three dimensions: logical fidelity, causal connection, and inferential progress to assess scientific logicality during the reasoning process; then we design two SFT data sampling methods, based on distillation and reasoning style transfer, respectively, to enhance scientific logicality in LLM reasoning\. To practice the above methodology, we choose physics as an exemplar discipline, whose reasoning paradigm spans formal derivation and computation in formal sciences \(e\.g\., pure math\), as well as real\-world modeling and experimental methodology in natural sciences\. More concretely, we construct a set of high\-quality QA datasets extracted from the core logical derivations of physics papers, from which we sample 80k SFT instances and 864 benchmark examples\. Both in\-domain and out\-of\-domain experiments are conducted to examine the effect of SFT on enhancing LLMs’ scientific reasoning logicality and their final task performances\.

The contributions of our work are summarized as follows:

1. We make the first exploration of the logicality in LLM scientific reasoning, and design logicality-centric assessment criteria and data sampling methods to improve LLMs' scientific reasoning process and performance. Empirical studies involving third-party verification fully demonstrate the validity of the criteria.
2. We construct a high-quality QA dataset extracted from physics papers and, on this basis, build the PhysLogic benchmark, the first of its kind for systematically evaluating the logicality of LLM physics reasoning, together with two distinct logicality-enriched training datasets.
3. We conduct extensive experiments, and the results on both the PhysLogic benchmark and three representative public benchmarks show that our constructed training dataset can effectively improve LLM logicality in physics reasoning and the final task performances.

## 2 Methodology

![Refer to caption](https://arxiv.org/html/2605.17104v1/x2.png)

Figure 2: Assessment criteria for the scientific reasoning of LLMs, encompassing three dimensions: Logical Fidelity, Causal Connection, and Inferential Progress.

Scientific Reasoning is regarded as the cognitive processes required to use the scientific method, consisting of a series of steps (Díaz et al., [2023](https://arxiv.org/html/2605.17104#bib.bib40)), which are aligned to the epistemic definition of Fischer et al. ([2014](https://arxiv.org/html/2605.17104#bib.bib41)). Thus, solving a scientific problem involves distinct reasoning steps, which we term *logical nexuses* (for the term, see [https://www.merriam-webster.com/dictionary/nexus](https://www.merriam-webster.com/dictionary/nexus); for specific examples, please refer to the data examples in Appendix [J](https://arxiv.org/html/2605.17104#A10)), denoted as $\mathcal{N}=\{\nu_1,\cdots,\nu_n\}$, where $n$ is the number of nexuses. Based on Fischer et al. ([2014](https://arxiv.org/html/2605.17104#bib.bib41)), logical nexuses (characterized by epistemic activities) might differ substantially in their relative weights in a discipline. The weights corresponding to $\mathcal{N}$ are denoted as $\mathcal{W}=\{w_1,\cdots,w_n\}$. The reasoning process of a problem solver is represented by a sequence of sentences $\mathcal{R}=\{r_1,\cdots,r_m\}$. Specifically, to ensure that each segment is semantically independent and complete while maintaining computational efficiency, we adopt a rule-based sentence-level segmentation scheme, a design choice that is widely adopted in prior work (Lightman et al., [2024](https://arxiv.org/html/2605.17104#bib.bib3); Sun et al., [2025](https://arxiv.org/html/2605.17104#bib.bib4); Macar et al., [2025](https://arxiv.org/html/2605.17104#bib.bib5)). To enable quantitative assessment, we first encode these textual steps into vector representations. Using a sentence encoder, we transform the ground-truth nexuses $\mathcal{N}$ into embeddings $V_{\mathcal{N}}=\{v_{\nu_1},\cdots,v_{\nu_n}\}$ and the reasoning steps $\mathcal{R}$ into embeddings $V_{\mathcal{R}}=\{v_{r_1},\cdots,v_{r_m}\}$. In this chapter, we first propose multi-dimensional assessment criteria that use the nexus embeddings $V_{\mathcal{N}}$ as the ground truth to assess the scientific logicality of the reasoning process embeddings $V_{\mathcal{R}}$. Furthermore, given a dataset of scientific problems, where each entry comprises a QA pair, $\mathcal{N}$, and $\mathcal{W}$, we design two distinct logic-aware data sampling methods for SFT.
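The rule-based sentence-level segmentation can be illustrated with a minimal sketch. The paper does not list its rules, so the boundary pattern below (split after `.`, `!` or `?` when followed by whitespace and an uppercase letter, digit, or opening bracket) and the function name `segment_reasoning` are assumptions made purely for illustration.

```python
import re

# Assumed rule (not from the paper): a sentence ends at ., ! or ? when the
# next non-space character is an uppercase letter, a digit, or an opening
# bracket. Abbreviations such as "e.g. The" would be split incorrectly.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9(\[])")


def segment_reasoning(text):
    """Split a reasoning trace into sentence-level steps r_1..r_m."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


steps = segment_reasoning(
    "Apply energy conservation. The kinetic term is m v^2 / 2! "
    "Is friction negligible? Yes."
)
for index, step in enumerate(steps, start=1):
    print(index, step)
```

Each resulting step would then be passed to the sentence encoder to obtain one embedding per step.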

![Refer to caption](https://arxiv.org/html/2605.17104v1/x3.png)

Figure 3: A pipeline to construct scientific QA data from academic papers, along with three SFT data sampling methods: a baseline and two comparative methods enriched with scientific logic.

### 2.1 Assessment for Scientific Logicality in LLM Reasoning

As shown in Figure[2](https://arxiv.org/html/2605.17104#S2.F2), we designed criteria with three complementary dimensions to assess the scientific logicality of an LLM’s reasoning process:

Logical Fidelity $\mathcal{F}$. This metric quantifies the content alignment between the reasoning process under evaluation and the logical nexuses. We assess logical fidelity by aligning ground-truth logical nexus embeddings ($V_{\mathcal{N}}$) with the model's reasoning step embeddings ($V_{\mathcal{R}}$). First, we compute a cosine similarity matrix $M\in\mathbb{R}^{n\times m}$ between the two sets of embeddings. A greedy matching algorithm then identifies the optimal set of one-to-one pairs $\mathcal{C}$ by selecting matches that exceed a predefined similarity threshold $\tau$. Finally, we represent Logical Fidelity using the Logic F-Score ($\mathcal{F}$), which is the harmonic mean of the alignment's Logic Precision ($\pi$, which describes the proportion of the model's reasoning steps that are logically valid) and Logic Recall ($\rho$, which describes the proportion of logical nexuses that are covered by the model's reasoning):

$$\rho=\frac{\sum_{(i,j)\in\mathcal{C}}w_i\cdot M_{ij}}{\sum_{k=1}^{n}w_k},\quad\pi=\frac{|\mathcal{C}|}{m},\quad\mathcal{F}=2\cdot\frac{\pi\cdot\rho}{\pi+\rho}$$

where $w_i$ is the importance weight of nexus $\nu_i$, $n=|\mathcal{N}|$, and $m=|\mathcal{R}|$. An $\mathcal{F}$ score of 1 indicates a perfect match with the logical nexuses, and higher values reflect a greater degree of content-level consistency between the model's reasoning and the logical nexuses.
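A minimal pure-Python sketch of the Logical Fidelity computation follows. It assumes embeddings are plain lists of floats, that "greedy" means taking candidate pairs in descending order of similarity, and a placeholder threshold `tau=0.5`; the paper does not give the value of $\tau$ in this section, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(nexus_vecs, step_vecs):
    """M[i][j] = cosine similarity between nexus i and reasoning step j."""
    return [[cosine(nv, rv) for rv in step_vecs] for nv in nexus_vecs]


def greedy_match(M, tau):
    """One-to-one pairs (i, j) with M[i][j] > tau, taken in descending similarity."""
    candidates = sorted(
        ((M[i][j], i, j)
         for i in range(len(M))
         for j in range(len(M[i]))
         if M[i][j] > tau),
        reverse=True,
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return pairs


def logical_fidelity(M, weights, tau=0.5):
    """Return (precision pi, recall rho, F-score) for similarity matrix M."""
    m = len(M[0])  # number of reasoning steps
    pairs = greedy_match(M, tau)
    recall = sum(weights[i] * M[i][j] for i, j in pairs) / sum(weights)
    precision = len(pairs) / m
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

For example, a single nexus of weight 2 matched by one of three steps at similarity 0.9 gives $\pi=1/3$ and $\rho=0.9$, so unmatched steps lower precision while weak matches lower recall.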

Causal Connection $\mathcal{O}$. This dimension considers whether the LLM preserves the correct ordering between pairs of logical nexuses that have an inherent causal or derivational direction. When the model touches on both nexuses during reasoning, we examine whether the order it presents is consistent with the ground truth. This consistency is determined based on the relative distribution of semantic similarities. Specifically, for each nexus $\nu_i$, we compute its Positional Centroid $P_i$, its semantic center of mass within the model's reasoning process $\mathcal{R}$. The score $\mathcal{O}$ is the weighted proportion of nexus pairs that maintain their correct relative temporal order:

$$P_i=\frac{\sum_{j=1}^{m}j\cdot M_{ij}}{\sum_{j=1}^{m}M_{ij}},\quad\mathcal{O}=\frac{\sum_{i<k\text{ s.t. }P_i<P_k}(w_i+w_k)}{\sum_{i<k}(w_i+w_k)}$$

An $\mathcal{O}$ score of 1 indicates a perfectly ordered sequence, while a score near 0.5 suggests random ordering.
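The two formulas above can be sketched directly in Python. The sketch follows the equations as printed, so it scores every pair $i<k$ and assumes the nexuses are listed in their ground-truth order. The fallback to the midpoint position when a row of $M$ sums to zero is an assumption added only to avoid division by zero, and the function names are hypothetical.

```python
def positional_centroids(M):
    """P_i: similarity-weighted mean step position (1-based) of each nexus."""
    m = len(M[0])
    centroids = []
    for row in M:
        mass = sum(row)
        if mass > 0:
            centroids.append(sum((j + 1) * s for j, s in enumerate(row)) / mass)
        else:
            # Assumption: with no similarity mass, place the nexus mid-trace.
            centroids.append((m + 1) / 2)
    return centroids


def causal_connection(M, weights):
    """Weighted share of nexus pairs (i < k) whose centroids satisfy P_i < P_k."""
    P = positional_centroids(M)
    kept = total = 0.0
    for i in range(len(P)):
        for k in range(i + 1, len(P)):
            pair_weight = weights[i] + weights[k]
            total += pair_weight
            if P[i] < P[k]:
                kept += pair_weight
    return kept / total if total else 0.0
```

With three nexuses of weights 1, 2, 3 whose centroids fall at positions 1, 3, 2, the pairs (1,2) and (1,3) are ordered correctly and (2,3) is not, giving $\mathcal{O}=(3+4)/(3+4+5)=7/12$.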

Inferential Progress $\mathcal{P}$. This metric assesses whether the reasoning exhibits overall forward logical progression. For example, if the LLM repeatedly circles back to previously covered propositions or oscillates between them without making forward progress, the score on this dimension decreases. Specifically, it assesses reasoning efficiency by identifying non-productive patterns like conceptual loops. It analyzes the conceptual trajectory of the reasoning process, which is represented by a sequence of Similarity Vectors $[\vec{S_1},\dots,\vec{S_m}]$. Each vector $\vec{S_j}$ captures the similarity of a reasoning step $r_j$ to all $n$ ground-truth nexuses:

$$\vec{S_j}=\begin{bmatrix}M_{1j}&M_{2j}&\dots&M_{nj}\end{bmatrix}^{T}$$

The final score $\mathcal{P}$ is the average conceptual novelty across the reasoning path, where the novelty of each step is one minus its maximum cosine similarity to any preceding step's vector:

$$\mathcal{P}=\frac{1}{m-1}\sum_{j=2}^{m}\left(1-\max_{1\leq k<j}\left(\frac{\vec{S_j}\cdot\vec{S_k}}{\|\vec{S_j}\|\|\vec{S_k}\|}\right)\right)$$

A score close to 1 signifies a highly efficient, forward-progressing reasoning path, whereas a low score indicates significant conceptual repetition.
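As a rough illustration, the score can be computed from the columns of the similarity matrix $M$. The sketch below is pure Python and treats an all-zero similarity vector as having cosine 0 with everything, which is an assumption; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def inferential_progress(M):
    """Average novelty of steps 2..m, where M is the n x m similarity matrix."""
    n, m = len(M), len(M[0])
    if m < 2:
        raise ValueError("at least two reasoning steps are required")
    # S[j] is the similarity vector of step j: column j of M.
    S = [[M[i][j] for i in range(n)] for j in range(m)]
    novelty = 0.0
    for j in range(1, m):
        novelty += 1.0 - max(cosine(S[j], S[k]) for k in range(j))
    return novelty / (m - 1)
```

A trace whose third step returns to the same nexus profile as its first step, for instance columns `[1, 0]`, `[0, 1]`, `[1, 0]`, scores 0.5: the second step is fully novel and the third is a pure repetition.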

Table 1: Statistics for the final 80k instruction-tuning dataset

| Task Type | Data Number | $Q$ Tokens | $A$ Tokens | $\mathcal{N}$ Length | $\mathcal{N}$ Tokens | $\mathcal{R}$ Tokens | $\mathcal{R}_{\text{RST}}$ Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MCP | 12587 | 158.46 | 337.61 | 8.42 | 20.98 | 7971.93 | 902.82 |
| Comp. (E) | 17634 | 222.22 | 345.37 | 9.38 | 21.09 | 10757.04 | 1025.93 |
| Comp. (N) | 48005 | 219.86 | 355.41 | 8.90 | 22.86 | 9920.15 | 1137.26 |
| Proof | 1774 | 216.11 | 417.68 | 9.17 | 22.41 | 10518.05 | 1167.08 |

\* MCP: Multiple Choice Problem; Comp\. \(E\): Expression Computation; Comp\. \(N\): Numeric Computation; Proof: Proof\-based Problem\.

Table 2: Comparison of our proposed PhysLogic benchmark with existing science benchmarks that include physics. Ours is the first benchmark to incorporate multiple, complementary dimensions for assessing the logicality of the reasoning process. The last three columns (Steps, Order, Progress) fall under Reasoning Verification.

| Benchmark | Discipline | Difficulty Levels | Question Types | Answer Verification | Steps | Order | Progress |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPQA (Rein et al., [2024](https://arxiv.org/html/2605.17104#bib.bib23)) | General | Grad. | MCQ | ✓ | ✗ | ✗ | ✗ |
| SciBench (Wang et al., [2024](https://arxiv.org/html/2605.17104#bib.bib22)) | General | UG | Comp. | ✓ | ✗ | ✗ | ✗ |
| UGPhysics (Xu et al., [2025](https://arxiv.org/html/2605.17104#bib.bib25)) | Physics | UG | MCQ, Comp. | ✓ | ✗ | ✗ | ✗ |
| PHYSICS (Feng et al., [2025](https://arxiv.org/html/2605.17104#bib.bib24)) | Physics | Grad. | Comp. | ✓ | ✗ | ✗ | ✗ |
| PhysReason (Zhang et al., [2025](https://arxiv.org/html/2605.17104#bib.bib26)) | Physics | HS | Comp. | ✓ | ✓ | ✗ | ✗ |
| PRISM-PHYSICS (Zhao et al., [2025](https://arxiv.org/html/2605.17104#bib.bib48)) | Physics | UG, Grad. | Comp. | ✓ | ✓ | ✓ | ✗ |
| PhysLogic | Physics | HS, UG, Grad. | MCQ, Comp., Proof | ✓ | ✓ | ✓ | ✓ |

\* HS \(High School\), UG \(Undergraduate\), Grad\. \(Graduate\), MCQ \(Multiple\-Choice Question\), Comp\. \(Computational\), Proof \(Proof\-based\)\.

### 2.2 Scientific Logicality-Guided Data Sampling for SFT

To enhance the logicality of LLMs for scientific reasoning, we propose two logic-aware data sampling methods for SFT. These methods are designed for datasets where each entry consists of a question $Q$, an answer $A$, and a set of logical nexuses $\mathcal{N}$ with corresponding weights $\mathcal{W}$. Both approaches are illustrated in Figure [3](https://arxiv.org/html/2605.17104#S2.F3).

Reasoning Style Transfer. This method uses a powerful reasoning LLM ($\mathcal{L}$) in a style transfer task to generate a fluent reasoning path from the discrete logical nexuses. The model is prompted with the question $Q$ and the weighted nexuses ($\mathcal{N}$, $\mathcal{W}$) to synthesize a cohesive, narrative-style reasoning process. This effectively translates the structured logic into a natural thinking format. The operation is formalized as:

$$R'=\mathcal{L}(Q,\mathcal{N},\mathcal{W})$$

where $R'$ is the synthesized reasoning. The final data entry for SFT is constructed as the tuple $\{Q,R',A\}$, pairing the synthesized reasoning with the original question and answer. The reasoning process $R'$ is explicitly demarcated by "*think*" tags.

Logical-Distillation. This strategy distills high-quality data by filtering the native reasoning of a powerful LLM ($\mathcal{L}$), using the ground-truth nexuses as an indirect supervisory signal.

First, the LLM is prompted with a question $Q$ to generate a native reasoning path $R'$ and answer $A'$:

$$(R',A')=\mathcal{L}(Q)$$

The generated reasoning $R'$ is segmented into discrete steps $\mathcal{R}$. We then assess this sequence against the ground-truth nexuses $\mathcal{N}$ using our evaluation suite to obtain scores for logic precision ($\pi$), logic recall ($\rho$), Causal Connection ($\mathcal{O}$), and Inferential Progress ($\mathcal{P}$).

To make the metrics comparable, we z-normalize each raw metric $X$ using the mean $\mu_X$ and standard deviation $\sigma_X$ computed over the full dataset $D_{\text{full}}$, and then apply a sigmoid to obtain a bounded score $\tilde{X}\in(0,1)$, i.e., $\tilde{X}=\mathrm{sigmoid}\big((X-\mu_X)/\sigma_X\big)$. We then compute the final logical score $\mathcal{S}$ as a weighted combination of logical fidelity and two auxiliary criteria (we report the weight settings and sensitivity analysis in Appendix [G](https://arxiv.org/html/2605.17104#A7)). Specifically, logical fidelity is defined as the harmonic mean of the normalized precision $\tilde{\pi}$ and recall $\tilde{\rho}$:

$$\mathcal{S}=\delta_{\mathcal{F}}\cdot\frac{2\tilde{\pi}\tilde{\rho}}{\tilde{\pi}+\tilde{\rho}}+\delta_{\mathcal{O}}\cdot\tilde{\mathcal{O}}+\delta_{\mathcal{P}}\cdot\tilde{\mathcal{P}}\,.$$

The final data entry for SFT is constructed as the tuple $\{Q,R',A'\}$. From $D_{\text{full}}$, we sample a subset $D$ by selecting instances in the top-$\kappa$ percentile according to $\mathcal{S}$:

$$D=\mathrm{Top}_{\kappa}(D_{\text{full}},\mathrm{key}=\mathcal{S})\,.$$

For comparison, we also use the full dataset $D_{\text{full}}$ as a baseline, directly distilling on the entire question dataset.
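The scoring and selection steps can be sketched as follows. The $\delta$ weights below (0.5, 0.25, 0.25) are placeholders, since the actual settings are deferred to Appendix G; the use of the population standard deviation, the record keys, and `kappa=0.5` are likewise assumptions. The last of these is chosen only because the Logic-Distill set (40k) is half the size of the 80k pool.

```python
import math
from statistics import mean, pstdev


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def normalize(values):
    """Z-normalize over the full dataset, then squash into (0, 1)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.5] * len(values)  # assumption: constant metric maps to 0.5
    return [sigmoid((v - mu) / sigma) for v in values]


def logic_scores(records, d_f=0.5, d_o=0.25, d_p=0.25):
    """Score S per record; d_f, d_o, d_p are placeholder delta weights."""
    pi = normalize([r["pi"] for r in records])
    rho = normalize([r["rho"] for r in records])
    order = normalize([r["O"] for r in records])
    progress = normalize([r["P"] for r in records])
    scores = []
    for p, r, o, g in zip(pi, rho, order, progress):
        fidelity = 2 * p * r / (p + r)  # harmonic mean; p, r > 0 after sigmoid
        scores.append(d_f * fidelity + d_o * o + d_p * g)
    return scores


def top_kappa(records, scores, kappa=0.5):
    """Keep the top kappa fraction of records, ranked by descending score."""
    keep = math.ceil(kappa * len(records))
    ranked = sorted(range(len(records)), key=lambda i: scores[i], reverse=True)
    return [records[i] for i in ranked[:keep]]
```

Normalizing precision and recall before taking the harmonic mean means the fidelity term reflects where an instance sits relative to the rest of $D_{\text{full}}$ rather than its raw values.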

## 3 Data Foundation

Dataset Construction. We instantiate our methodology in physics and build both training and evaluation data directly from research papers, which naturally encode rigorous deductive chains. We first collect 380,678 physics papers from arXiv and peer-reviewed journals, then use DeepSeek-R1 ([https://api-docs.deepseek.com](https://api-docs.deepseek.com/)) (Guo et al., [2025](https://arxiv.org/html/2605.17104#bib.bib9)) (hereinafter "R1"; prompts in Appendix [I.5](https://arxiv.org/html/2605.17104#A9.SS5)) to retain theory-centric works and filter out reviews, empirical studies, and tool papers, yielding 118,039 papers. For each retained paper, we run a multi-turn dialogue with R1 to construct scientific problems: R1 generates a question $Q$ of specified type and difficulty from the derivations (with a 15:85 ratio of multiple-choice to open-ended questions), produces the solution in the form of a reasoning trajectory $R$ and, when applicable, a final answer $A$, and then extracts core logical nexuses $\mathcal{N}$ together with importance weights $\mathcal{W}$. We treat $(\mathcal{N},\mathcal{W})$ as the logical gold standard for the problem. During the distillation process, to ensure that the distilled answer $A'$ matches the original answer $A$, we apply rejection sampling with a maximum of 5 retries.
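The rejection-sampling step at the end of this paragraph might look like the sketch below. The `generate` callable stands in for a call to the teacher model and is hypothetical, as is the string-normalization check used to compare answers; the paper does not say how answer equivalence is decided here. "A maximum of 5 retries" is read as one initial attempt plus up to five further attempts, which is also an interpretation.

```python
def normalize_answer(answer):
    """Assumed comparison rule: case- and whitespace-insensitive string match."""
    return " ".join(str(answer).strip().lower().split())


def distill_with_rejection(question, reference_answer, generate, max_retries=5):
    """Sample (reasoning, answer) until the answer matches the reference.

    `generate(question)` is a hypothetical callable returning a
    (reasoning, answer) pair. Returns (reasoning, answer, retries_used),
    or None if every attempt is rejected.
    """
    target = normalize_answer(reference_answer)
    for attempt in range(1 + max_retries):
        reasoning, answer = generate(question)
        if normalize_answer(answer) == target:
            return reasoning, answer, attempt
    return None


# Usage with a stub generator that succeeds on its third sample.
samples = iter([("r1", "3 m/s"), ("r2", "5 m/s"), ("r3", " 4 M/S ")])
result = distill_with_rejection("Q", "4 m/s", lambda q: next(samples))
print(result)
```

Questions for which every attempt is rejected return `None` in this sketch; how the authors handle such cases is not stated.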

Quality Control. To guarantee the quality of the synthesized data, we designed three quality control methods: rule-based filtering, LLM-based filtering, and human evaluation. The implementation details and specific results of the quality control are presented in Appendix [E.2](https://arxiv.org/html/2605.17104#A5.SS2).

Benchmark Construction. Leveraging the multi-dimensional assessment methodology for scientific logicality, we introduce PhysLogic, the first comprehensive benchmark for logical reasoning in physics. Specifically, we selected a total of 864 papers from nine distinct physics subfields. For each subfield, we curated a balanced set of 96 questions spanning four difficulty levels (High School, Undergraduate, Master's, and PhD) and four problem types. To ensure a comprehensive and balanced evaluation, each difficulty-type combination comprises 6 distinct problems. The innovative aspects of our benchmark, compared to existing work, are summarized in Table [2](https://arxiv.org/html/2605.17104#S2.T2).
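The benchmark size follows from the balanced grid described above, as the short check below shows. The subfield names are placeholders because the nine subfields are not listed in this section, and mapping the "four problem types" to the four task types of Table 1 is an assumption.

```python
from itertools import product

SUBFIELDS = [f"subfield_{i}" for i in range(1, 10)]  # placeholder names
DIFFICULTIES = ["High School", "Undergraduate", "Master's", "PhD"]
PROBLEM_TYPES = ["MCP", "Comp. (E)", "Comp. (N)", "Proof"]  # assumed mapping
PROBLEMS_PER_CELL = 6

# One cell per (subfield, difficulty, type) combination.
cells = list(product(SUBFIELDS, DIFFICULTIES, PROBLEM_TYPES))
per_subfield = len(DIFFICULTIES) * len(PROBLEM_TYPES) * PROBLEMS_PER_CELL
total = len(cells) * PROBLEMS_PER_CELL

print(per_subfield, total)
```

This gives 4 × 4 × 6 = 96 questions per subfield and 9 × 96 = 864 questions overall, matching the counts stated in the text.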

SFT Data Construction. Beyond the 864 instances reserved for our benchmark, we randomly sampled 80k entries to generate data for SFT. Following the sampling methods in Section [2.2](https://arxiv.org/html/2605.17104#S2.SS2), we constructed two instruction-tuning datasets with high logicality: 80k samples for Reasoning Style Transfer (RST) and 40k for Distillation with Logic Supervision (Logic-Distill). In addition, an 80k-sample baseline dataset was created using the direct reasoning outputs of R1 (Direct-Distill).

Dataset Statistics. We conducted a statistical analysis of the content and distribution of the constructed dataset. Table [1](https://arxiv.org/html/2605.17104#S2.T1) summarizes the statistics for the 80k training dataset, categorized by four tasks. It details the number of instances per task, the average token counts for questions, reasoning traces and answers, and the average number and length of logical nexuses. Additional statistics and visualizations for the dataset can be found in Appendix [E.1](https://arxiv.org/html/2605.17104#A5.SS1).

## 4 Experiments

In this section, we empirically evaluate our proposed methodology and aim to answer three key questions. We examine Q1 via two empirical studies, Q2 via in-domain experiments on our proposed PhysLogic benchmark, and Q3 via out-of-domain experiments on public physics QA benchmarks.

- Q1: Do our proposed metrics genuinely capture the logicality of reasoning?
- Q2: Can our proposed logicality-based SFT data sampling method enhance the scientific logicality of LLMs in physics reasoning?
- Q3: Does the improved scientific logicality really contribute to better task performance?

### 4.1 Training and Evaluation Setup

Model Training. From the constructed dataset, we sample three SFT subsets: (1) Direct-Distill (80k), (2) Reasoning Style Transfer (RST, 80k), and (3) Logic-Distill (40k). The backbone LLMs include 1) a reasoning LLM: DeepSeek-R1-Distill-Qwen-7B (hereinafter referred to as "R1-7B") (Guo et al., [2025](https://arxiv.org/html/2605.17104#bib.bib9)); 2) a chat LLM: Qwen2.5-7B-Instruct (hereinafter referred to as "Qwen2.5-7B") (Yang et al., [2025](https://arxiv.org/html/2605.17104#bib.bib32)); and 3) a base LLM: Llama-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2605.17104#bib.bib35)).

During the training period, we employed the efficient LlamaFactory ([https://github.com/hiyouga/LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)) framework to perform full-parameter fine-tuning on the model. To ensure training stability and efficiency, we meticulously configured a series of hyperparameters. Specifically, the learning rate was set to $5.0\times 10^{-6}$, paired with a cosine learning rate scheduler for dynamic adjustments, and a warmup ratio of $0.03$. Given computational resource constraints, we set the per-device train batch size to $1$ and used $2$ gradient accumulation steps, achieving an effective batch size of $2$. Additionally, to handle long text sequences, the model's maximum sequence length (cutoff length) was extended to $32768$.

For optimization, we adopted several advanced techniques to enhance training efficiency and reduce memory consumption. BF16 mixed-precision was enabled throughout the training process, and the DeepSpeed ZeRO Stage 3 ([https://github.com/deepspeedai/DeepSpeed](https://github.com/deepspeedai/DeepSpeed)) optimization strategy was integrated. Furthermore, we applied the FlashAttention-2 ([https://github.com/Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention)) mechanism to accelerate the computation of the attention module and enabled gradient checkpointing to further conserve memory.

To ensure the reproducibility of our experiments, the global random seed was fixed to $42$. The entire training process was conducted for $2$ epochs. The training for each model was conducted on 8 NVIDIA H100 Tensor Core GPUs.

Table 3: Consistency between third-party ratings and our logicality metrics, measured by Pearson/Spearman correlations (all correlations are statistically significant, $p<0.001$).

| Rater | Pearson $\mathcal{F}$ | Pearson $\mathcal{O}$ | Pearson $\mathcal{P}$ | Pearson Avg. | Spearman $\mathcal{F}$ | Spearman $\mathcal{O}$ | Spearman $\mathcal{P}$ | Spearman Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human Expert | 0.8021 | 0.7114 | 0.6948 | 0.7453 | 0.8157 | 0.7249 | 0.7092 | 0.7798 |
| LLM Judge | 0.8263 | 0.7436 | 0.7471 | 0.7860 | 0.8404 | 0.7582 | 0.7717 | 0.8303 |

Table 4: Means and medians of logicality metrics on correct/incorrect samples.

| | $\mathcal{F}$ mean | $\mathcal{F}$ median | $\mathcal{O}$ mean | $\mathcal{O}$ median | $\mathcal{P}$ mean | $\mathcal{P}$ median | Avg. mean | Avg. median |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Correct | 52.3 | 53.6 | 73.1 | 76.9 | 6.37 | 5.12 | 43.9 | 43.7 |
| Incorrect | 46.0 | 50.0 | 67.5 | 72.1 | 5.55 | 4.78 | 39.7 | 39.3 |

Table 5: In-domain experimental results on our proposed PhysLogic benchmark. Column groups correspond to the three backbones: Llama (Llama-3.1-8B-base), Qwen (Qwen2.5-7B-Instruct), and R1-7B (DeepSeek-R1-Distill-Qwen-7B).

| SFT Data | Llama $\mathcal{F}$ | Llama $\mathcal{O}$ | Llama $\mathcal{P}$ | Llama Avg. | Llama Acc | Qwen $\mathcal{F}$ | Qwen $\mathcal{O}$ | Qwen $\mathcal{P}$ | Qwen Avg. | Qwen Acc | R1-7B $\mathcal{F}$ | R1-7B $\mathcal{O}$ | R1-7B $\mathcal{P}$ | R1-7B Avg. | R1-7B Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NaturalReasoning | 53.56 | 70.49 | 2.96 | 42.34 | 23.61 | 52.43 | 70.76 | 2.97 | 42.05 | 35.88 | 48.08 | 59.85 | 7.63 | 38.52 | 35.65 |
| MegaScience | 53.71 | 68.10 | 5.25 | 42.35 | 31.02 | 54.63 | 69.25 | 5.49 | 43.12 | 39.81 | 39.96 | 71.08 | 3.93 | 38.32 | 44.44 |
| Sci-Instruct | 50.76 | 61.59 | 4.47 | 38.94 | 13.89 | 45.68 | 63.11 | 4.35 | 37.71 | 15.74 | 51.03 | 61.52 | 7.04 | 39.86 | 32.87 |
| SCP-116k | 46.60 | 63.57 | 4.39 | 38.19 | 25.00 | 47.09 | 63.91 | 4.28 | 38.43 | 37.27 | 53.66 | 67.48 | 6.89 | 42.68 | 46.30 |
| Ours (RST) | 58.70 | 73.69 | 5.43 | 45.94 | 44.67 | 58.89 | 71.16 | 5.12 | 45.06 | 42.82 | 56.68 | 73.54 | 7.71 | 45.98 | 47.45 |

For each matrix, the best results are highlighted in bold and the second-best results are underlined.

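Table 6: Comparative results on public physics benchmarks between our constructed data and baselines. Column groups correspond to the three backbones: Llama (Llama-3.1-8B), Qwen (Qwen2.5-7B-Instruct), and R1-7B (DeepSeek-R1-Distill-Qwen-7B).

| SFT Dataset | Data Scale | Llama GPQA | Llama SciBench | Llama PhysR. | Llama Avg. | Qwen GPQA | Qwen SciBench | Qwen PhysR. | Qwen Avg. | R1-7B GPQA | R1-7B SciBench | R1-7B PhysR. | R1-7B Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NaturalReasoning | 80k | 7.36 | 17.09 | 14.48 | 12.98 | 23.26 | 30.05 | 17.81 | 23.71 | 36.82 | 26.94 | 20.33 | 28.03 |
| MegaScience | 80k | 27.02 | 15.02 | 16.33 | 19.46 | 36.82 | 32.12 | 19.22 | 29.39 | 53.49 | 50.94 | 26.80 | 43.74 |
| Sci-Instruct | 80k | 19.77 | 7.25 | 13.72 | 13.58 | 30.62 | 9.84 | 17.19 | 19.22 | 34.50 | 7.14 | 20.15 | 20.60 |
| SCP-116k | 80k | 43.02 | 32.64 | 29.57 | 35.08 | 41.47 | 43.52 | 19.17 | 34.72 | 56.59 | 52.33 | 33.09 | 47.34 |
| Ours (Direct-Distill) | 80k | 46.90 | 37.82 | 29.21 | 37.98 | 37.98 | 48.70 | 22.18 | 36.29 | 51.55 | 50.77 | 30.50 | 44.27 |
| Ours (Logic-Distill) | 40k | 39.53 | 35.75 | 30.13 | 35.14 | 43.02 | 49.22 | 42.88 | 45.04 | 53.49 | 53.71 | 53.05 | 53.42 |
| Ours (RST) | 80k | 40.70 | 27.46 | 24.77 | 30.98 | 47.67 | 38.86 | 37.71 | 41.41 | 60.46 | 55.95 | 40.36 | 52.26 |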

\* For each backbone, the best results are highlighted in bold and the second-best results are underlined.

Baselines. Besides the Direct-Distill baseline, we also benchmark against four public physics-QA datasets: NaturalReasoning (Yuan et al., [2025](https://arxiv.org/html/2605.17104#bib.bib17)), MegaScience (Fan et al., [2025](https://arxiv.org/html/2605.17104#bib.bib21)), SCP-116k (Lu et al., [2025](https://arxiv.org/html/2605.17104#bib.bib20)), and Sci-Instruct (Zhang et al., [2024a](https://arxiv.org/html/2605.17104#bib.bib18)). For a fair comparison, we only use the physics-related subsets from these datasets and sample an equal amount of data (80k) for each training set. Details of the baseline datasets are provided in Appendix [F.1](https://arxiv.org/html/2605.17104#A6.SS1).

Evaluation Metrics\. In\-domain, we evaluate on PhysLogic using Logical Fidelity \($\mathcal{F}$\), Causal Connection \($\mathcal{O}$\), Inferential Progress \($\mathcal{P}$\), and final\-answer accuracy on multiple\-choice and numerical problems \(Acc\)\. Because the "Logic\-Distill" data are sampled using PhysLogic’s own metrics, evaluating it in\-domain would be methodologically circular; the in\-domain comparison therefore includes only "RST" among our methods, to keep the evaluation objective\. Out\-of\-domain, we report final accuracy on three public benchmarks with distinct formats: \(1\) multiple choice, the physics subset of GPQA\-Diamond \(GPQA\) \(Rein et al\., [2024](https://arxiv.org/html/2605.17104#bib.bib23)\); \(2\) numerical calculation, the physics subset of SciBench \(SciBench\) \(Wang et al\., [2024](https://arxiv.org/html/2605.17104#bib.bib22)\); and \(3\) reasoning, PhysReason \(PhysR\.\) \(Zhang et al\., [2025](https://arxiv.org/html/2605.17104#bib.bib26)\)\. All scores are average percentages over three independent runs\. For logicality evaluation, the sentence encoder is all\-MiniLM\-L6\-v2 \([https://huggingface\.co/sentence\-transformers/all\-MiniLM\-L6\-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)\)\. The prompts used for the inference phase and for the judge LLMs across all four benchmarks are provided in Appendix [I\.4](https://arxiv.org/html/2605.17104#A9.SS4)\. More details are in Appendix [F\.2](https://arxiv.org/html/2605.17104#A6.SS2)\.

### 4\.2 Validity of Logicality Metrics

To answer Q1: *"Do our proposed metrics genuinely capture the logicality of reasoning?"*, we design two empirical studies that test the validity of the three proposed metrics $\mathcal{F}$, $\mathcal{O}$, and $\mathcal{P}$ before the main experiments\.

Study 1: Consistency with third\-party evaluation indicators\. We randomly sample 200 instances from our benchmark\. A human physics expert and ChatGPT\-5 each assign 1–10 logicality scores to the reasoning processes produced by R1\-7B\. The scoring rubric is reported in Appendix [H](https://arxiv.org/html/2605.17104#A8)\. We then compute Pearson and Spearman correlation coefficients \(Pearson, [1895](https://arxiv.org/html/2605.17104#bib.bib2); Spearman, [1904](https://arxiv.org/html/2605.17104#bib.bib1)\) between the human/LLM judge ratings and each component metric \($\mathcal{F}$, $\mathcal{O}$, $\mathcal{P}$\) as well as their average \(Avg\.\)\. As shown in Table [3](https://arxiv.org/html/2605.17104#S4.T3), all three metrics exhibit positive and significant correlations with both third\-party indicators \(all $p<0.001$\), suggesting that our automatic logicality metrics align well with human and LLM judgments\.
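The two coefficients used in Study 1 can be sketched in a few lines of standard-library Python. This is an illustration only: the scores and ratings below are hypothetical, and the helper names (`pearson`, `average_ranks`, `spearman`) are ours, not the authors' evaluation code. Spearman's coefficient is computed here as the Pearson correlation of the rank vectors, with tied values sharing their mean rank.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: automatic metric scores and 1-10 ratings for six traces.
metric_scores = [41.2, 55.0, 38.7, 62.3, 47.9, 58.1]
human_ratings = [4, 7, 3, 9, 6, 7]

r = pearson(metric_scores, human_ratings)
rho = spearman(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson measures linear agreement between the metric and the ratings, while Spearman only requires that the two orderings agree, which is why reporting both is informative when ratings are on a coarse 1–10 scale.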

Study 2: Relation between logicality and task performance\. We further examine whether higher logicality scores are associated with better task performance\. For Qwen2\.5\-7B, R1\-7B, and GPT\-5, we compute $\mathcal{F}$, $\mathcal{O}$, and $\mathcal{P}$, as well as their average \(Avg\.\), separately for correctly and incorrectly answered samples\. Table [4](https://arxiv.org/html/2605.17104#S4.T4) reports the means and medians of the two groups\. Across all three dimensions and Avg\., correct samples obtain significantly higher scores than incorrect ones \($p<0.001$\), showing that higher logicality is closely associated with better reasoning performance\.
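The group comparison in Study 2 can be illustrated as follows. The paper does not name the significance test behind its $p$-values, so the one-sided permutation test below is our assumption, chosen because it needs no distributional assumptions and only the standard library. The per-sample scores are hypothetical.

```python
import random
from statistics import mean, median


def permutation_p_value(high, low, n_perm=2000, seed=0):
    """One-sided permutation test for mean(high) > mean(low)."""
    rng = random.Random(seed)
    observed = mean(high) - mean(low)
    pooled = list(high) + list(low)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(high)]) - mean(pooled[len(high):])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-sample logicality scores, split by answer correctness.
correct = [52.1, 49.8, 55.3, 47.6, 58.0, 51.2, 53.9, 50.4]
incorrect = [44.0, 47.1, 41.5, 45.9, 43.2, 46.3, 40.8, 42.7]

for name, group in (("correct", correct), ("incorrect", incorrect)):
    print(f"{name}: mean={mean(group):.2f}, median={median(group):.2f}")

p = permutation_p_value(correct, incorrect)
print(f"one-sided permutation p-value = {p:.4f}")
```

Reporting both mean and median, as Table 4 does, guards against a few extreme traces driving the difference between the two groups.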

### 4\.3 In\-Domain Experiment

In this section, we systematically evaluate the scientific logicality of existing LLMs on physics reasoning and examine whether our supervised fine\-tuning \(SFT\) data can improve this capability\. We compare four baseline datasets with our RST\-sampled dataset by fine\-tuning three backbones on 80k training examples from each dataset and evaluating scientific logicality\. As shown in Table [5](https://arxiv.org/html/2605.17104#S4.T5), SFT on RST\-sampled trajectories consistently yields the highest average scientific logicality across all three backbones, outperforming the second\-best baseline by 3\.59, 1\.94, and 3\.30 points in average logicality, respectively; it also improves final\-answer accuracy by 13\.65, 3\.01, and 1\.15 points over the second\-best results\. We further evaluate a broader set of open\- and closed\-source LLMs on the PhysLogic benchmark\. Figures [5](https://arxiv.org/html/2605.17104#A4.F5), [6](https://arxiv.org/html/2605.17104#A4.F6), [7](https://arxiv.org/html/2605.17104#A4.F7), and [8](https://arxiv.org/html/2605.17104#A4.F8) in Appendix [C](https://arxiv.org/html/2605.17104#A3) provide a direct comparison of scientific logicality across models, showing that fine\-tuning a 7B model on only our 80k training set surpasses comparable 14B and even 32B models overall, and achieves higher average logicality than all closed\-source LLMs\. Together, these results provide an affirmative answer to Q2: our RST\-based SFT data effectively enhances LLM scientific logicality in physics reasoning\.
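The margins quoted above follow directly from the Avg. and Acc columns of Table 5. The short check below recomputes them as absolute differences in points between "Ours (RST)" and the best baseline; the dictionary layout and the function name `margin_over_runner_up` are ours, while the numbers are copied from the table.

```python
# (Avg. logicality, accuracy) per SFT dataset, copied from Table 5.
TABLE5 = {
    "Llama-3.1-8B-base": {
        "NaturalReasoning": (42.34, 23.61),
        "MegaScience": (42.35, 31.02),
        "Sci-Instruct": (38.94, 13.89),
        "SCP-116k": (38.19, 25.00),
        "Ours (RST)": (45.94, 44.67),
    },
    "Qwen2.5-7B-Instruct": {
        "NaturalReasoning": (42.05, 35.88),
        "MegaScience": (43.12, 39.81),
        "Sci-Instruct": (37.71, 15.74),
        "SCP-116k": (38.43, 37.27),
        "Ours (RST)": (45.06, 42.82),
    },
    "DeepSeek-R1-Distill-Qwen-7B": {
        "NaturalReasoning": (38.52, 35.65),
        "MegaScience": (38.32, 44.44),
        "Sci-Instruct": (39.86, 32.87),
        "SCP-116k": (42.68, 46.30),
        "Ours (RST)": (45.98, 47.45),
    },
}


def margin_over_runner_up(rows, column, ours="Ours (RST)"):
    """Our score minus the best baseline score in one column, in points."""
    best_baseline = max(v[column] for name, v in rows.items() if name != ours)
    return round(rows[ours][column] - best_baseline, 2)


for backbone, rows in TABLE5.items():
    logic = margin_over_runner_up(rows, column=0)
    acc = margin_over_runner_up(rows, column=1)
    print(f"{backbone}: +{logic:.2f} logicality, +{acc:.2f} accuracy")
```

Note that the runner-up differs by column and backbone (for example MegaScience on Llama, SCP-116k on the R1 backbone), so the margins are taken against the strongest baseline in each case rather than against one fixed dataset.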

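Table 7: Ablation study on different backbones and settings\.

| Backbone | Setting | Logicality | Answer |
|---|---|---|---|
| Llama-3.1-8B | Logic-Distill | 45.50 | 36.54 |
| Llama-3.1-8B | w/o $\mathcal{F}$ | 43.90 (-1.60) | 33.85 (-2.69) |
| Llama-3.1-8B | w/o $\mathcal{O}$ | 44.05 (-1.85) | 31.31 (-5.23) |
| Llama-3.1-8B | w/o $\mathcal{P}$ | 44.06 (-1.44) | 33.72 (-2.82) |
| Llama-3.1-8B | Random | 43.67 (-1.83) | 31.93 (-4.61) |
| Qwen-2.5-7B-Instruct | Logic-Distill | 42.78 | 44.02 |
| Qwen-2.5-7B-Instruct | w/o $\mathcal{F}$ | 40.03 (-2.75) | 40.59 (-3.43) |
| Qwen-2.5-7B-Instruct | w/o $\mathcal{O}$ | 38.35 (-4.43) | 38.25 (-5.77) |
| Qwen-2.5-7B-Instruct | w/o $\mathcal{P}$ | 40.77 (-2.01) | 38.72 (-5.30) |
| Qwen-2.5-7B-Instruct | Random | 41.73 (-1.03) | 28.77 (-15.25) |
| DeepSeek-R1-Distill-Qwen-7B | Logic-Distill | 44.14 | 49.73 |
| DeepSeek-R1-Distill-Qwen-7B | w/o $\mathcal{F}$ | 41.58 (-2.56) | 45.08 (-4.65) |
| DeepSeek-R1-Distill-Qwen-7B | w/o $\mathcal{O}$ | 41.38 (-2.76) | 38.68 (-11.05) |
| DeepSeek-R1-Distill-Qwen-7B | w/o $\mathcal{P}$ | 41.69 (-2.45) | 45.18 (-4.55) |
| DeepSeek-R1-Distill-Qwen-7B | Random | 40.24 (-3.90) | 41.78 (-7.95) |

\* The best results are highlighted in bold\. In the ablation rows, the parenthesized deltas following each score denote the change with respect to "Logic\-Distill"\.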

![Refer to caption](https://arxiv.org/html/2605.17104v1/x4.png)

Figure 4: Scaling law curves for scientific logicality and task performance of models trained on four SFT datasets at varying data scales\.

### 4\.4 Out\-of\-Domain Experiment

Table [6](https://arxiv.org/html/2605.17104#S4.T6) reports the evaluation results on three public benchmarks\. With Qwen2\.5\-7B and R1\-7B as backbones, our proposed "Logic\-Distill" and "RST" methods outperform the other baselines\. Notably, "Logic\-Distill", which incorporates finer\-grained computational steps, performs best: despite using only half of the training data of the other baselines, it outperforms the strongest baseline by 8\.75 and 6\.08 points in average accuracy on Qwen2\.5\-7B and R1\-7B, respectively\. "RST" ranks second, exceeding the best baseline by 5\.12 and 4\.92 points on the two backbones\. On the Llama\-3\.1\-8B base model, "Direct\-Distill" outperforms "Logic\-Distill", which we attribute to more pronounced scaling\-law effects on base models, as the former is trained on a larger dataset of 80k reasoning instances\. However, a comparison at an equivalent amount of training data \(see Section 4\.6, "Scaling Law"\) shows that "Logic\-Distill" is better and more efficient\. These findings provide a conclusive answer to Q3: training with higher\-logicality reasoning data also positively impacts the final performance of LLMs on various public physics QA tasks\.

### 4\.5 Ablation Study

We ablate our three logicality dimensions \($\mathcal{F}$, $\mathcal{O}$, $\mathcal{P}$\) by removing each one individually from the "Logic\-Distill" sampling process and re\-weighting the other two to $0.5$ each\. As shown in Table [7](https://arxiv.org/html/2605.17104#S4.T7), removing any single dimension degrades both scientific logicality and task performance\. The ablation of Causal Connection \($\mathcal{O}$\), which assesses reasoning order, causes the largest drop in answer accuracy on all three backbones, which we attribute to its role in filtering out hallucinations\.
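The re-weighting used in the ablation can be pictured with a small sketch. Everything beyond "drop one dimension and weight the other two at 0.5" is an assumption made for illustration: the equal default weights of 1/3, the weighted-sum form of the score, the `keep_fraction` selection rule, and the candidate values are hypothetical and are not taken from the paper's sampling procedure.

```python
DIMENSIONS = ("F", "O", "P")

# Assumed default: equal weights over the three dimensions.
FULL_WEIGHTS = {dim: 1 / 3 for dim in DIMENSIONS}


def ablated_weights(dropped):
    """Zero out one dimension and give the other two weight 0.5 each."""
    return {dim: (0.0 if dim == dropped else 0.5) for dim in DIMENSIONS}


def combined_score(candidate, weights):
    """Weighted sum of a candidate's three logicality scores."""
    return sum(weights[dim] * candidate[dim] for dim in DIMENSIONS)


def select_top(candidates, weights, keep_fraction):
    """Keep the highest-scoring fraction of candidate trajectories."""
    ranked = sorted(candidates, key=lambda c: combined_score(c, weights), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Hypothetical candidate trajectories with per-dimension scores.
candidates = [
    {"id": "a", "F": 60, "O": 50, "P": 6},
    {"id": "b", "F": 50, "O": 80, "P": 5},
    {"id": "c", "F": 55, "O": 70, "P": 4},
    {"id": "d", "F": 40, "O": 60, "P": 3},
]

full = [c["id"] for c in select_top(candidates, FULL_WEIGHTS, 0.5)]
without_o = [c["id"] for c in select_top(candidates, ablated_weights("O"), 0.5)]
print("full:", full)          # full: ['b', 'c']
print("w/o O:", without_o)    # w/o O: ['a', 'c']
```

In this toy example, dropping $\mathcal{O}$ lets a trajectory with weak causal ordering (`a`) replace one with strong ordering (`b`) in the selected set, which is the kind of shift the ablation is designed to expose.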

### 4\.6 Scaling Law

We study SFT effects across backbones by varying data size for "MegaScience", "Direct\-Distill", "Logic\-Distill", and "RST" and plotting scaling\-law curves of mean logicality and accuracy \(Figure[4](https://arxiv.org/html/2605.17104#S4.F4)\)\. On Qwen and DeepSeek, performance typically dips before rising, likely due to a reasoning paradigm mismatch between SFT traces and the native pretraining data that hurts small\-data SFT\. The growth trends for our "Logic\-Distill" and "RST" are the most pronounced\. Moreover, the comparison between "Logic\-Distill" and "Direct\-Distill" at equivalent data volumes clearly demonstrates that for SFT with scientific reasoning data, enhancing scientific logicality is a more effective strategy than simply increasing the data scale\.

### 4\.7 Supplementary Experiments

In addition to the experiments described above, we conducted more extensive analyses and studies, including scalability to larger backbones \(Appendix[D\.1](https://arxiv.org/html/2605.17104#A4.SS1)\), OOD performance on mathematical benchmarks \(Appendix[D\.2](https://arxiv.org/html/2605.17104#A4.SS2)\), different matching strategies \(Appendix[D\.3](https://arxiv.org/html/2605.17104#A4.SS3)\), the logicality score of the training set \(Appendix[D\.4](https://arxiv.org/html/2605.17104#A4.SS4)\), and the ablation on sampling percentiles in "Logic\-Distill" \(Appendix[D\.5](https://arxiv.org/html/2605.17104#A4.SS5)\)\. Furthermore, to intuitively illustrate the evaluative role of our three logicality metrics, we provide case studies in Appendix[K](https://arxiv.org/html/2605.17104#A11)\.

## 5 Related Work

Datasets and benchmarks for LLM physics reasoning\. Supervision typically derives from corpus extraction or LLM\-based synthesis\. NaturalReasoning and SCP\-116K extract QA from research corpora and textbooks, with explicit step traces in the latter \(Yuan et al\., [2025](https://arxiv.org/html/2605.17104#bib.bib17); Lu et al\., [2025](https://arxiv.org/html/2605.17104#bib.bib20)\); Sci\-Instruct synthesizes and self\-revises data \(Zhang et al\., [2024a](https://arxiv.org/html/2605.17104#bib.bib18)\); MegaScience aggregates public datasets with difficulty\-aware filtering \(Fan et al\., [2025](https://arxiv.org/html/2605.17104#bib.bib21)\)\. Evaluation includes cross\-disciplinary suites: GPQA \(MCQ\) and SciBench \(open\-ended numerical problems\) \(Rein et al\., [2024](https://arxiv.org/html/2605.17104#bib.bib23); Wang et al\., [2024](https://arxiv.org/html/2605.17104#bib.bib22)\); and physics\-specific sets: UGPhysics \(undergraduate exercises with rule\-based scoring\) and PhysReason \(competition problems with step\-level verification\) \(Xu et al\., [2025](https://arxiv.org/html/2605.17104#bib.bib25); Zhang et al\., [2025](https://arxiv.org/html/2605.17104#bib.bib26)\)\.

Process\-oriented evaluation of LLM reasoning\. These methods assess step validity rather than only final answers\. REVEAL labels relevance, attribution, and logical correctness \(Jacovi et al\., [2024](https://arxiv.org/html/2605.17104#bib.bib43)\); ProcessBench and PRMBench evaluate step\-error detection and PRM robustness \(Zheng et al\., [2025a](https://arxiv.org/html/2605.17104#bib.bib44); Song et al\., [2025](https://arxiv.org/html/2605.17104#bib.bib46)\); PRISM\-Physics uses a topological structure to verify the problem\-solving process \(Zhao et al\., [2025](https://arxiv.org/html/2605.17104#bib.bib48)\), but it only touches upon one aspect of logicality; VerifyBench extends verification to physics, chemistry, and biology \(Li et al\., [2026](https://arxiv.org/html/2605.17104#bib.bib47)\)\.

## 6 Conclusion

This work pioneers a systematic study of scientific logicality in LLM reasoning\. We introduce assessment criteria to quantify this logicality and propose two SFT data sampling strategies to improve it\. Taking physics as an exemplar discipline, we construct a dedicated dataset and benchmark to practise our methodology\. Empirical studies involving third\-party verification support the validity of the proposed logicality metrics, and comprehensive experiments verify the effectiveness of our methodology for both scientific logicality and task performance in physics\.

## Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grants \#72293575, \#72225011, and \#72434005\.

## Impact Statement

This work studies scientific logicality in LLM reasoning and proposes assessment criteria and logic\-aware data sampling methods for physics reasoning\. By shifting evaluation and training from final\-answer accuracy alone toward logical fidelity, causal connection, and inferential progress, our benchmark and datasets may support more reliable process\-level evaluation of scientific reasoning models and contribute to AI\-assisted scientific education, research assistance, and model development\. This process\-level perspective also helps identify where a model’s scientific reasoning fails, rather than only whether its final answer is wrong\.

However, improving process\-level logicality does not guarantee factual correctness, complete scientific validity, or safe deployment\. Our metrics assess reasoning traces through their alignment with extracted logical nexuses, causal ordering, and inferential progression, and should therefore be interpreted as diagnostic tools rather than complete certificates of scientific correctness\. Models trained or evaluated using our framework may still produce plausible but incorrect derivations, especially outside the physics settings and benchmarks studied in this paper\. We therefore encourage users to combine our metrics with final\-answer evaluation, expert inspection, and domain\-specific robustness tests, and to use such systems as decision\-support tools rather than substitutes for domain experts\.

Finally, our dataset and benchmark are constructed from scholarly articles in public repositories, including arXiv and established physics journals\. To reduce ethical and privacy risks, we remove metadata such as author names and affiliations, constrain generation to central scientific problems, and apply rule\-based and LLM\-based filters together with human evaluation for quality control\. In our public release, we will provide only synthesized question\-answer pairs and logical nexuses, without information intended to identify specific source papers\.

## References

- R\. L\. Brennan and D\. J\. Prediger \(1981\) Coefficient kappa: some uses, misuses, and alternatives\. Educational and psychological measurement 41\(3\), pp\. 687–699\. Cited by: [§E\.2](https://arxiv.org/html/2605.17104#A5.SS2.p6.1)\.
- C\. Díaz, B\. Dorner, H\. Hussmann, and J\. Strijbos \(2023\) Conceptual review on scientific reasoning and scientific thinking\. Current Psychology 42\(6\), pp\. 4313–4325\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p2.1), [§2](https://arxiv.org/html/2605.17104#S2.p1.13)\.
- B\. H\. Dowden \(1993\) Logical reasoning\. Wadsworth Pub\. Co\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p2.1)\.
- R\. Fan, Z\. Wang, and P\. Liu \(2025\) MegaScience: pushing the frontiers of post\-training datasets for science reasoning\. arXiv preprint arXiv:2507\.16812\. Cited by: [Table 22](https://arxiv.org/html/2605.17104#A5.T22.4.3.1.2.1.2.1), [§1](https://arxiv.org/html/2605.17104#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p5.1), [§5](https://arxiv.org/html/2605.17104#S5.p1.1)\.
- K\. Feng, Y\. Zhao, Y\. Liu, T\. Yang, C\. Zhao, J\. Sous, and A\. Cohan \(2025\) PHYSICS: benchmarking foundation models on university\-level physics problem solving\. In Findings of ACL, pp\. 11717–11743\. Cited by: [Table 2](https://arxiv.org/html/2605.17104#S2.T2.5.6.1)\.
- F\. Fischer, I\. Kollar, S\. Ufer, B\. Sodian, H\. Hussmann, R\. Pekrun, B\. Neuhaus, B\. Dorner, S\. Pankofer, M\. Fischer, et al\. \(2014\) Scientific reasoning and argumentation: advancing an interdisciplinary research agenda in education\. Frontline Learning Research 2\(3\), pp\. 28–45\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p2.1), [§2](https://arxiv.org/html/2605.17104#S2.p1.13)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The Llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning\. Nature 645\(8081\), pp\. 633–638\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1), [§3](https://arxiv.org/html/2605.17104#S3.p1.8), [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p1.1)\.
- M\. Hu, C\. Ma, W\. Li, W\. Xu, J\. Wu, J\. Hu, T\. Li, G\. Zhuang, J\. Liu, Y\. Lu, et al\. \(2025\) A survey of scientific large language models: from data foundations to agent frontiers\. arXiv preprint arXiv:2508\.21148\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1)\.
- A\. Jacovi, Y\. Bitton, B\. Bohnet, J\. Herzig, O\. Honovich, M\. Tseng, M\. Collins, R\. Aharoni, and M\. Geva \(2024\) A chain\-of\-thought is as strong as its weakest link: a benchmark for verifiers of reasoning chains\. In Proceedings of ACL, pp\. 4615–4634\. Cited by: [§5](https://arxiv.org/html/2605.17104#S5.p2.1)\.
- A\. Jaech, A\. Kalai, A\. Lerer, A\. Richardson, A\. El\-Kishky, A\. Low, A\. Helyar, A\. Madry, A\. Beutel, A\. Carney, et al\. \(2024\) OpenAI o1 system card\. arXiv preprint arXiv:2412\.16720\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1)\.
- X\. Li, X\. Li, S\. Hu, Y\. Guo, and W\. Zhang \(2026\) VerifyBench: a systematic benchmark for evaluating reasoning verifiers across domains\. In Proceedings of AAAI, Vol\. 40, pp\. 31796–31804\. Cited by: [§5](https://arxiv.org/html/2605.17104#S5.p2.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\) Let’s verify step by step\. In Proceedings of ICLR\. Cited by: [§2](https://arxiv.org/html/2605.17104#S2.p1.13)\.
- D\. Lu, X\. Tan, R\. Xu, T\. Yao, C\. Qu, W\. Chu, Y\. Xu, and Y\. Qi \(2025\) SCP\-116K: a high\-quality problem\-solution dataset and a generalized pipeline for automated extraction in the higher education science domain\. arXiv preprint arXiv:2501\.15587\. Cited by: [Table 22](https://arxiv.org/html/2605.17104#A5.T22.4.5.1.2.1.2.1), [§1](https://arxiv.org/html/2605.17104#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p5.1), [§5](https://arxiv.org/html/2605.17104#S5.p1.1)\.
- U\. Macar, P\. C\. Bogdan, S\. Rajamanoharan, and N\. Nanda \(2025\) Thought Branches: interpreting LLM reasoning requires resampling\. In Proceedings of NeurIPS Workshop on Mechanistic Interpretability\. Cited by: [§2](https://arxiv.org/html/2605.17104#S2.p1.13)\.
- K\. Pearson \(1895\) Note on regression and inheritance in the case of two parents\. Proceedings of the Royal Society of London 58, pp\. 240–242\. Cited by: [§4\.2](https://arxiv.org/html/2605.17104#S4.SS2.p2.5)\.
- K\. Popper \(2005\) The logic of scientific discovery\. Routledge\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p2.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\) GPQA: a graduate\-level google\-proof Q&A benchmark\. In First Conference on Language Modeling\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1), [Table 2](https://arxiv.org/html/2605.17104#S2.T2.5.3.1), [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p6.3), [§5](https://arxiv.org/html/2605.17104#S5.p1.1)\.
- M\. Song, Z\. Su, X\. Qu, J\. Zhou, and Y\. Cheng \(2025\) PRMBench: a fine\-grained and challenging benchmark for process\-level reward models\. In Proceedings of ACL, pp\. 25299–25346\. Cited by: [§5](https://arxiv.org/html/2605.17104#S5.p2.1)\.
- C\. Spearman \(1904\) The proof and measurement of association between two things\. The American Journal of Psychology 15\(1\), pp\. 72–101\. Cited by: [§4\.2](https://arxiv.org/html/2605.17104#S4.SS2.p2.5)\.
- H\. Sun, Y\. Sakai, H\. Sakajo, S\. Ozaki, K\. Hayashi, H\. Kamigaito, and T\. Watanabe \(2025\) LoCt\-Instruct: an automatic pipeline for constructing datasets of logical continuous instructions\. In Proceedings of EMNLP, pp\. 34199–34218\. Cited by: [§2](https://arxiv.org/html/2605.17104#S2.p1.13)\.
- L\. Wang, W\. Xu, Y\. Lan, Z\. Hu, Y\. Lan, R\. K\. Lee, and E\. Lim \(2023\) Plan\-and\-solve prompting: improving zero\-shot chain\-of\-thought reasoning by large language models\. In Proceedings of ACL, pp\. 2609–2634\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1)\.
- X\. Wang, Z\. Hu, P\. Lu, Y\. Zhu, J\. Zhang, S\. Subramaniam, A\. R\. Loomba, S\. Zhang, Y\. Sun, and W\. Wang \(2024\) SciBench: evaluating college\-level scientific problem\-solving abilities of large language models\. In Proceedings of ICML, pp\. 50622–50649\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1), [Table 2](https://arxiv.org/html/2605.17104#S2.T2.5.4.1), [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p6.3), [§5](https://arxiv.org/html/2605.17104#S5.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Proceedings of NeurIPS, Vol\. 35, pp\. 24824–24837\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1)\.
- X\. Xu, Q\. Xu, T\. Xiao, T\. Chen, Y\. Yan, J\. Zhang, S\. Diao, C\. Yang, and Y\. Wang \(2025\) UGPhysics: a comprehensive benchmark for undergraduate physics reasoning with large language models\. In Proceedings of ICML, pp\. 69849–69877\. Cited by: [Table 2](https://arxiv.org/html/2605.17104#S2.T2.5.5.1), [§5](https://arxiv.org/html/2605.17104#S5.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\. Cited by: [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p1.1)\.
- W\. Yuan, J\. Yu, S\. Jiang, K\. Padthe, Y\. Li, D\. Wang, I\. Kulikov, K\. Cho, Y\. Tian, J\. Weston, and X\. Li \(2025\) NaturalReasoning: reasoning in the wild with 2\.8m challenging questions\. In Proceedings of NeurIPS, Vol\. 38\. Cited by: [Table 22](https://arxiv.org/html/2605.17104#A5.T22.4.2.1.2.1.3.1), [§1](https://arxiv.org/html/2605.17104#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p5.1), [§5](https://arxiv.org/html/2605.17104#S5.p1.1)\.
- D\. Zhang, Z\. Hu, S\. Zhoubian, Z\. Du, K\. Yang, Z\. Wang, Y\. Yue, Y\. Dong, and J\. Tang \(2024a\) SciInstruct: a self\-reflective instruction annotated dataset for training scientific language models\. In Proceedings of NeurIPS, Vol\. 37, pp\. 1443–1473\. Cited by: [Table 22](https://arxiv.org/html/2605.17104#A5.T22.4.4.1.2.1.2.1), [§1](https://arxiv.org/html/2605.17104#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p5.1), [§5](https://arxiv.org/html/2605.17104#S5.p1.1)\.
- X\. Zhang, Y\. Dong, Y\. Wu, J\. Huang, C\. Jia, B\. Fernando, M\. Z\. Shou, L\. Zhang, and J\. Liu \(2025\) PhysReason: a comprehensive benchmark towards physics\-based reasoning\. In Proceedings of ACL, pp\. 16593–16615\. Cited by: [Table 2](https://arxiv.org/html/2605.17104#S2.T2.5.7.1), [§4\.1](https://arxiv.org/html/2605.17104#S4.SS1.p6.3), [§5](https://arxiv.org/html/2605.17104#S5.p1.1)\.
- Y\. Zhang, X\. Chen, B\. Jin, S\. Wang, S\. Ji, W\. Wang, and J\. Han \(2024b\) A comprehensive survey of scientific large language models and their applications in scientific discovery\. In Proceedings of EMNLP, pp\. 8783–8817\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1)\.
- W\. Zhao, Q\. Ma, J\. Shi, S\. Wu, J\. Han, Y\. Xiao, S\. Chen, X\. Luo, L\. Schmidt, and J\. Zou \(2025\) PRISM\-Physics: causal DAG\-based process evaluation for physics reasoning\. In Proceedings of NeurIPS Workshop on Mathematical Reasoning and AI\. Cited by: [Table 2](https://arxiv.org/html/2605.17104#S2.T2.5.8.1), [§5](https://arxiv.org/html/2605.17104#S5.p2.1)\.
- C\. Zheng, Z\. Zhang, B\. Zhang, R\. Lin, K\. Lu, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025a\) ProcessBench: identifying process errors in mathematical reasoning\. In Proceedings of ACL, pp\. 1009–1024\. Cited by: [§5](https://arxiv.org/html/2605.17104#S5.p2.1)\.
- T\. Zheng, Z\. Deng, H\. T\. Tsang, W\. Wang, J\. Bai, Z\. Wang, and Y\. Song \(2025b\) From automation to autonomy: a survey on large language models in scientific discovery\. In Proceedings of EMNLP, pp\. 17744–17761\. Cited by: [§1](https://arxiv.org/html/2605.17104#S1.p1.1)\.

Appendices


## Appendix A Reproducibility Statement

To ensure the reproducibility of the research results in this work, we provide the following details:

- Training Details: Section [4\.1](https://arxiv.org/html/2605.17104#S4.SS1) provides the detailed parameters and hardware specifications for the model training process\.
- Evaluation Details: Appendix [F](https://arxiv.org/html/2605.17104#A6) presents details of the four public baseline datasets, the specific implementation methods for the benchmarking process, and the model deployment details\.
- Parameter Sensitivity Analysis: Appendix [G](https://arxiv.org/html/2605.17104#A7) details the parameters involved in our proposed method, along with a sensitivity analysis of these parameters\.
- Complete Prompts: Appendix [I](https://arxiv.org/html/2605.17104#A9) provides all the prompts used to query the LLMs throughout the entire workflow of this work\.

## Appendix B Complete Experimental Results

Due to space limitations, the main text cannot present all the detailed results of the scaling law experiments and the ablation study\. This appendix therefore includes Tables [8](https://arxiv.org/html/2605.17104#A4.T8), [9](https://arxiv.org/html/2605.17104#A4.T9), and [10](https://arxiv.org/html/2605.17104#A4.T10), which report the complete results of the scaling law experiments on the three backbone LLMs, and Table [11](https://arxiv.org/html/2605.17104#A4.T11), which reports the complete results of the ablation study\.

## Appendix C Performances of Various LLMs on Our PhysLogic Benchmark

We evaluate 25 LLMs on our PhysLogic benchmark: open\-source models \(in blue\), closed\-source models \(in purple\), and models fine\-tuned on our constructed data \(in yellow\)\. Figures [5](https://arxiv.org/html/2605.17104#A4.F5), [6](https://arxiv.org/html/2605.17104#A4.F6), and [7](https://arxiv.org/html/2605.17104#A4.F7) report the three logicality scores, and Figure [8](https://arxiv.org/html/2605.17104#A4.F8) reports the final accuracy\.

## Appendix D Supplementary Experiments

### D\.1 Scalability to larger backbones

To further examine whether the effectiveness of our method scales to larger language models, we conduct additional experiments using Qwen\-2\.5\-14B\-Instruct as the backbone\. We compare Logic\-Distill \(LD\) and RST with two baselines, MegaScience and Direct\-Distill \(DD\), and evaluate them on PhysLogic, GPQA, and SciBench\. The results are summarized in Table[12](https://arxiv.org/html/2605.17104#A4.T12)\.

As shown in Table[12](https://arxiv.org/html/2605.17104#A4.T12), both LD and RST consistently improve the average logicality score over the baselines\. In terms of final\-answer accuracy, LD achieves the best average performance, improving from 38\.41 with MegaScience and 39\.76 with DD to 45\.28\. RST obtains a comparable average accuracy of 45\.26, while achieving the highest average logicality score\. These results indicate that the benefits of our logic\-guided training strategy are not limited to smaller backbones, but can also transfer effectively to larger\-scale LLMs\.

Table 8:Complete results of scaling law experiment based on Llama\-3\-8B\.DatasetScalePublic BenchmarksPhysLogicGPQA\-D\(Phy\.\)SciBench\(Phy\.\)PhysReasonAverageℱ\\mathcal\{F\}𝒪\\mathcal\{O\}𝒫\\mathcal\{P\}AverageLogicalityAnswerScoreMegaScience20k18\.2111\.9211\.5913\.9153\.6868\.465\.0442\.3925\.6940k21\.3213\.9914\.6616\.6654\.0367\.387\.0742\.8330\.5660k25\.9713\.4715\.9018\.4553\.6368\.094\.9242\.2128\.7080k27\.0215\.0216\.3319\.4653\.7168\.105\.2542\.3531\.02Ours \(Direct\-Distill\)20k33\.7224\.8723\.8427\.4849\.0969\.284\.0140\.7931\.6440k39\.1532\.3016\.7029\.3856\.0070\.314\.7043\.6739\.5860k39\.9237\.3024\.7133\.9856\.6572\.054\.9644\.5539\.1280k46\.9037\.8229\.2137\.9856\.9772\.485\.0044\.8243\.52Ours \(Logic\-Distill\)10k32\.1723\.3123\.2326\.2447\.1367\.283\.9739\.4624\.5420k36\.0527\.4627\.7330\.4150\.6970\.204\.3241\.7432\.1830k37\.2130\.5728\.8432\.2154\.1172\.524\.5443\.7236\.5740k39\.5335\.7530\.1335\.1457\.3874\.284\.8445\.5040\.74Ours \(RST\)20k18\.2210\.8815\.9015\.0052\.1866\.694\.8541\.2423\.6140k25\.9719\.6720\.1521\.9353\.9067\.054\.2841\.7428\.2460k31\.0122\.2822\.3725\.2254\.0468\.984\.5542\.5234\.4980k40\.7027\.4624\.7730\.9858\.7073\.695\.4345\.9444\.67Table 9:Complete results of scaling law experiment based on Qwen\-2\.5\-7B\-Instruct\.DatasetScalePublic BenchmarksPhysLogicGPQA\-D\(Phy\.\)SciBench\(Phy\.\)PhysReasonAverageℱ\\mathcal\{F\}𝒪\\mathcal\{O\}𝒫\\mathcal\{P\}AverageLogicalityAnswerScorebackbone025\.9737\.3016\.6426\.6457\.6566\.736\.4443\.6134\.26MegaScience20k27\.9125\.3923\.6625\.6551\.4966\.496\.3341\.4435\.8840k33\.7227\.9824\.9628\.8952\.2867\.106\.6342\.0037\.0460k34\.1129\.0224\.5829\.2455\.3768\.065\.5943\.0138\.1980k36\.8232\.1219\.2229\.3954\.6369\.255\.4943\.1239\.81Ours \(Direct\-Distill\)20k24\.4219\.1717\.9220\.5047\.9367\.044\.3939\.7936\.9640k27\.5130\.0518\.5525\.3751\.6469\.304\.2441\.7338\.9760k37\.2145\.6020\.8334\.5553\.0970\.854\.3642\.7740\.2880k37\.9848\.7022\.1836\.2953\.9270\.354\.6142\.9640\.51Ours 
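| Dataset | Scale | GPQA-D (Phy.) | SciBench (Phy.) | PhysReason | Public Benchmarks Average | PhysLogic $\mathcal{F}$ | PhysLogic $\mathcal{O}$ | PhysLogic $\mathcal{P}$ | PhysLogic Average Logicality | PhysLogic Answer Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (Logic-Distill) | 10k | 18.60 | 30.22 | 33.39 | 27.40 | 45.29 | 66.98 | 4.31 | 38.86 | 35.42 |
| (Logic-Distill) | 20k | 29.07 | 34.89 | 36.30 | 33.42 | 48.07 | 67.87 | 4.22 | 40.05 | 37.58 |
| (Logic-Distill) | 30k | 40.71 | 42.49 | 42.14 | 41.78 | 48.23 | 68.66 | 4.52 | 40.47 | 38.66 |
| (Logic-Distill) | 40k | 43.02 | 49.22 | 42.88 | 45.04 | 53.98 | 69.39 | 4.97 | 42.78 | 40.97 |
| Ours (RST) | 20k | 26.74 | 26.42 | 27.73 | 26.96 | 53.68 | 68.39 | 4.86 | 42.31 | 34.95 |
| Ours (RST) | 40k | 36.05 | 36.79 | 34.57 | 35.80 | 54.96 | 69.71 | 5.18 | 43.28 | 37.50 |
| Ours (RST) | 60k | 41.86 | 38.34 | 35.61 | 38.60 | 55.48 | 71.59 | 4.98 | 44.02 | 40.97 |
| Ours (RST) | 80k | 47.67 | 38.86 | 37.71 | 41.41 | 58.89 | 71.16 | 5.12 | 45.06 | 42.82 |

Table 10: Complete results of scaling law experiment based on DeepSeek-R1-Distill-Qwen-7B.

| Dataset | Scale | GPQA-D (Phy.) | SciBench (Phy.) | PhysReason | Public Benchmarks Average | PhysLogic $\mathcal{F}$ | PhysLogic $\mathcal{O}$ | PhysLogic $\mathcal{P}$ | PhysLogic Average Logicality | PhysLogic Answer Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| backbone | 0 | 66.28 | 60.1 | 28.1 | 51.49 | 47.91 | 66.78 | 10.9 | 41.86 | 40.51 |
| MegaScience | 20k | 45.35 | 51.3 | 32.08 | 42.91 | 42.3 | 69.49 | 4.57 | 38.79 | 45.37 |
| MegaScience | 40k | 52.33 | 50.78 | 31.88 | 45.00 | 40.88 | 69.85 | 4.45 | 38.39 | 42.59 |
| MegaScience | 60k | 52.71 | 50.78 | 30.75 | 44.75 | 41.42 | 70.27 | 4.31 | 38.67 | 44.44 |
| MegaScience | 80k | 53.49 | 50.94 | 26.8 | 43.74 | 39.96 | 71.08 | 3.93 | 38.32 | 44.44 |
| Ours (Direct-Distill) | 20k | 44.19 | 48.7 | 37.03 | 43.31 | 49.52 | 68 | 4.54 | 40.69 | 30.71 |
| Ours (Direct-Distill) | 40k | 48.45 | 51.3 | 32.16 | 43.97 | 50.06 | 66.1 | 4.56 | 40.24 | 35.19 |
| Ours (Direct-Distill) | 60k | 51.16 | 51.3 | 32.47 | 44.98 | 51.11 | 67.06 | 4.68 | 40.95 | 35.42 |
| Ours (Direct-Distill) | 80k | 51.55 | 50.77 | 30.5 | 44.27 | 51.31 | 68.82 | 4.53 | 41.55 | 36.11 |
| Ours (Logic-Distill) | 10k | 43.02 | 43.52 | 45.66 | 44.07 | 48.93 | 66.51 | 4.43 | 39.96 | 33.1 |
| Ours (Logic-Distill) | 20k | 44.19 | 50.25 | 48.73 | 47.72 | 50.28 | 67.08 | 4.47 | 40.61 | 32.18 |
| Ours (Logic-Distill) | 30k | 48.84 | 54.4 | 51.2 | 51.48 | 52.2 | 68.23 | 4.77 | 41.73 | 34.95 |
| Ours (Logic-Distill) | 40k | 53.49 | 53.71 | 53.05 | 53.42 | 55.33 | 71.9 | 5.19 | 44.14 | 38.66 |
| Ours (RST) | 20k | 44.96 | 51.81 | 32.97 | 43.25 | 54.38 | 68.4 | 9.91 | 44.23 | 41.67 |
| Ours (RST) | 40k | 50 | 50.78 | 35.67 | 45.48 | 54.83 | 69.08 | 8.97 | 44.29 | 45.14 |
| Ours (RST) | 60k | 57.36 | 55.44 | 34.38 | 49.06 | 56.58 | 71.4 | 7.14 | 45.04 | 46.99 |
| Ours (RST) | 80k | 60.46 | 55.95 | 40.36 | 52.26 | 56.68 | 73.54 | 7.71 | 45.98 | 47.45 |

Table 11: Complete results of ablation study.

| Backbone | Setting | GPQA-D (Phy.) | SciBench (Phy.) | PhysReason | Public Benchmarks Average | PhysLogic $\mathcal{F}$ | PhysLogic $\mathcal{O}$ | PhysLogic $\mathcal{P}$ | PhysLogic Average Logicality | PhysLogic Answer Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | Logic-Distill | 39.53 | 35.75 | 30.13 | 35.14 | 57.38 | 74.28 | 4.84 | 45.50 | 40.74 |
| Llama-3.1-8B | w/o F | 46.51 | 30.05 | 20.64 | 32.40 | 54.66 | 72.37 | 4.94 | 43.99 | 38.19 |
| Llama-3.1-8B | w/o O | 38.37 | 28.5 | 20.39 | 29.09 | 54.8 | 72.63 | 4.73 | 44.05 | 37.96 |
| Llama-3.1-8B | w/o P | 38.37 | 34.71 | 24.77 | 32.62 | 55.1 | 72.34 | 4.75 | 44.06 | 37.04 |
| Llama-3.1-8B | random | 39.15 | 32.3 | 16.7 | 29.38 | 56 | 70.31 | 4.7 | 43.67 | 39.58 |
| Qwen2.5-7B-Instruct | Logic-Distill | 43.02 | 49.22 | 42.88 | 45.04 | 53.98 | 69.39 | 4.97 | 42.78 | 40.97 |
| Qwen2.5-7B-Instruct | w/o F | 41.86 | 45.08 | 38.15 | 41.70 | 48.64 | 67.41 | 4.04 | 40.03 | 37.27 |
| Qwen2.5-7B-Instruct | w/o O | 39.53 | 42.66 | 36.1 | 39.43 | 45.12 | 65.16 | 4.78 | 38.35 | 34.72 |
| Qwen2.5-7B-Instruct | w/o P | 41.09 | 46.98 | 29.08 | 39.05 | 48.96 | 69.22 | 4.13 | 40.77 | 37.73 |
| Qwen2.5-7B-Instruct | random | 27.51 | 30.05 | 18.55 | 25.37 | 51.64 | 69.3 | 4.24 | 41.73 | 38.97 |
| DeepSeek-R1-Distill-Qwen-7B | Logic-Distill | 53.49 | 53.71 | 53.05 | 53.42 | 55.33 | 71.9 | 5.19 | 44.14 | 38.66 |
| DeepSeek-R1-Distill-Qwen-7B | w/o F | 53.1 | 46.62 | 46.58 | 48.77 | 52.76 | 67.39 | 4.6 | 41.58 | 34.03 |
| DeepSeek-R1-Distill-Qwen-7B | w/o O | 45.35 | 49.22 | 26.1 | 40.22 | 51.54 | 67.72 | 4.89 | 41.38 | 34.03 |
| DeepSeek-R1-Distill-Qwen-7B | w/o P | 51.94 | 51.3 | 40.42 | 47.89 | 52.16 | 68.04 | 4.87 | 41.69 | 37.04 |
| DeepSeek-R1-Distill-Qwen-7B | random | 48.45 | 51.3 | 32.16 | 43.97 | 50.06 | 66.1 | 4.56 | 40.24 | 35.19 |

![Refer to caption](https://arxiv.org/html/2605.17104v1/x5.png)

Figure 5: Visualization of logical fidelity score

![Refer to caption](https://arxiv.org/html/2605.17104v1/x6.png)

Figure 6: Visualization of causal connection score

![Refer to caption](https://arxiv.org/html/2605.17104v1/x7.png)

Figure 7: Visualization of inferential progress score

![Refer to caption](https://arxiv.org/html/2605.17104v1/x8.png)

Figure 8: Visualization of final answer accuracy

Table 12: Scalability results using Qwen-2.5-14B-Instruct as the backbone. We report average logicality and final-answer accuracy on PhysLogic, GPQA, and SciBench.

| Setting | Avg. Logicality | GPQA | SciBench | PhysLogic | Avg. Accuracy |
| --- | --- | --- | --- | --- | --- |
| MegaScience | 43.34 | 33.72 | 41.45 | 40.05 | 38.41 |
| Direct-Distill | 44.60 | 29.07 | 47.15 | 43.06 | 39.76 |
| Logic-Distill | 47.30 | 38.37 | 50.26 | 47.22 | 45.28 |
| RST | 49.85 | 43.02 | 47.15 | 45.60 | 45.26 |

Table 13: Out-of-domain results on math benchmarks using Qwen-2.5-7B-Instruct as the backbone. We compare our methods with MegaScience and Direct-Distill baselines.

| Setting | MATH-500 | GSM8K | AIME2025 | AMC | Avg. |
| --- | --- | --- | --- | --- | --- |
| MegaScience | 72.40 | 86.73 | 6.79 | 40.31 | 51.56 |
| Direct-Distill | 71.20 | 84.38 | 6.25 | 41.19 | 50.76 |
| Logic-Distill | 76.00 | 90.45 | 8.54 | 43.15 | 54.54 |
| RST | 74.60 | 88.70 | 8.13 | 40.81 | 53.06 |

Table 14: $\mathcal{F}$ scores obtained with greedy matching and dynamic-programming matching, and their relative deviations.

| LLM | Greedy matching | Dynamic-programming matching | Relative deviation |
| --- | --- | --- | --- |
| GPT-5 | 54.06 | 55.37 | 2.42% |
| Qwen2.5-7B-Instruct | 57.65 | 59.09 | 2.50% |
| DeepSeek-R1-Distill-Qwen-7B | 47.91 | 48.94 | 2.15% |

Table 15: Average per-instance processing time (in seconds) for greedy matching and dynamic-programming matching.

| Matching method | Per-instance time (s) |
| --- | --- |
| Greedy | 0.1366 |
| Dynamic programming | 1.2403 |

Table 16: Logicality scores of the constructed training datasets.

| Dataset | $\mathcal{F}$ | $\mathcal{O}$ | $\mathcal{P}$ |
| --- | --- | --- | --- |
| Direct-Distill | 57.70 | 73.28 | 5.29 |
| RST | 63.49 | 77.64 | 7.03 |
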
### D.2 Out-of-domain evaluation on math benchmarks

Although our constructed training set is purely physics\-oriented, it still yields non\-trivial improvements on mathematical reasoning tasks\. In particular, Logic\-Distill achieves the best performance on all four benchmarks, improving the average score from 51\.56 with MegaScience and 50\.76 with Direct\-Distill to 54\.54\. RST also achieves a higher average score than both baselines\. These results provide further evidence that our logic\-guided training strategy exhibits meaningful cross\-domain generalization beyond physics\.

### D.3 Sensitivity to the matching strategy for logical fidelity

In our main experiments, we adopt a greedy matching strategy to compute logical fidelity $\mathcal{F}$. This choice is primarily motivated by computational efficiency, since the metric must be evaluated many times over long reasoning trajectories, across multiple datasets and models. More complex global alignment methods would substantially increase the runtime of the evaluation pipeline.

To assess whether our conclusions are sensitive to this design choice, we additionally implement a global alignment method based on dynamic programming and recompute the logical fidelity scores for all evaluated models on the same set of reasoning processes. Table [14](https://arxiv.org/html/2605.17104#A4.T14) reports the $\mathcal{F}$ scores obtained with the two matching methods, together with their relative deviations. Table [15](https://arxiv.org/html/2605.17104#A4.T15) further compares the average per-instance processing time of the two matching strategies.

Empirically, the dynamic-programming-based scores are highly consistent with those from greedy matching, with relative deviations below 3% and stable model rankings and performance trends under both strategies. At the same time, dynamic programming is nearly an order of magnitude slower than greedy matching (1.2403 s versus 0.1366 s per instance, roughly 9×). These results indicate that our findings are not sensitive to the specific matching strategy, and that the proposed greedy matching provides a robust and efficient choice for computing logical fidelity.
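
The figures quoted in this subsection can be re-derived from Tables 14 and 15. The sketch below is only an arithmetic check: it assumes the relative deviation is defined as the absolute score difference divided by the greedy score, which is not stated in the text but reproduces the reported percentages.

```python
# F scores from Table 14 as (greedy, dynamic programming).
f_scores = {
    "GPT-5": (54.06, 55.37),
    "Qwen2.5-7B-Instruct": (57.65, 59.09),
    "DeepSeek-R1-Distill-Qwen-7B": (47.91, 48.94),
}


def relative_deviation(greedy: float, dp: float) -> float:
    """Relative deviation in percent, taking greedy as the reference (assumed definition)."""
    return abs(dp - greedy) / greedy * 100.0


deviations = {
    model: round(relative_deviation(greedy, dp), 2)
    for model, (greedy, dp) in f_scores.items()
}

# Per-instance times from Table 15 (seconds).
greedy_time, dp_time = 0.1366, 1.2403
slowdown = dp_time / greedy_time

for model, dev in deviations.items():
    print(f"{model}: {dev:.2f}%")
print(f"DP / greedy runtime ratio: {slowdown:.2f}x")
```

All three deviations stay below 3%, and the runtime ratio is a little over 9, consistent with "nearly an order of magnitude".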

### D.4 Logicality of the constructed training datasets

To further analyze the properties of our constructed datasets, we compute the three logicality metrics $\mathcal{F}$, $\mathcal{O}$, and $\mathcal{P}$ on the training data of the Direct-Distill and RST settings. Publicly available training corpora are not included in this comparison because they only contain question-answer pairs and do not provide annotated logical nexuses. The results are summarized in Table [16](https://arxiv.org/html/2605.17104#A4.T16).

We observe that the RST training data consistently achieves higher scores on all three metrics, indicating that it contains reasoning trajectories that are more closely aligned with the expert logical structure\. This analysis provides additional evidence that our logic\-guided data construction procedure yields training signals with stronger inherent logicality\.

### D.5 Ablation on sampling percentiles in "Logic-Distill"

Table 17: Ablation study (in-domain) under different sampling percentiles in Logic-Distill (Qwen-2.5-7B-Instruct, all settings trained on 20k examples).

| Data sampled rate of Logic-Distill | $\mathcal{F}$ | $\mathcal{O}$ | $\mathcal{P}$ | Acc |
| --- | --- | --- | --- | --- |
| 25% | 50.01 | 69.34 | 4.78 | 41.59 |
| 50% | 48.07 | 67.87 | 4.22 | 37.58 |
| 75% | 47.67 | 67.48 | 4.03 | 37.81 |
| 100% (Direct-Distill) | 47.93 | 67.04 | 4.39 | 36.96 |

Table 18: Ablation study (out-of-domain) under different sampling percentiles in Logic-Distill (Qwen-2.5-7B-Instruct, all settings trained on 20k examples).

| Data sampled rate of Logic-Distill | GPQA-D | SciBench | PhysReason |
| --- | --- | --- | --- |
| 25% | 32.56 | 35.92 | 40.30 |
| 50% | 29.07 | 34.89 | 36.30 |
| 75% | 26.74 | 24.35 | 27.79 |
| 100% (Direct-Distill) | 24.42 | 19.17 | 17.92 |

![Refer to caption](https://arxiv.org/html/2605.17104v1/figs/statistics_1.png)

(a) Subfield distribution of the filtered papers

![Refer to caption](https://arxiv.org/html/2605.17104v1/figs/statistics_2.png)

(b) Distribution of question quadruplets in the full constructed QA set

Figure 9: Visualization of the distribution of the constructed dataset

To further examine the effectiveness of our logicality scores, we conduct an ablation study over sampling percentiles in the Logic-Distill setting. Using Qwen2.5-7B-Instruct as the backbone, we form three additional Logic-Distill variants by selecting the top 25%, 50%, and 75% of training examples ranked by their Logic-Distill scores. For fair comparison, all variants (including the 100% Direct-Distill baseline) are downsampled to 20k training examples, and evaluated on both in-domain and out-of-domain benchmarks.

Table [17](https://arxiv.org/html/2605.17104#A4.T17) reports in-domain results on PhysLogic, and Table [18](https://arxiv.org/html/2605.17104#A4.T18) reports out-of-domain results on GPQA-D, SciBench, and PhysReason. As the sampling threshold is relaxed from top 25% to 100%, performance degrades overall in both the logicality metrics ($\mathcal{F}$, $\mathcal{O}$, $\mathcal{P}$) and task accuracy: the top-25% variant is best on every metric, the out-of-domain scores fall monotonically, and the in-domain scores show only small fluctuations among the 50%, 75%, and 100% settings. This indicates that higher Logic-Distill scores correspond to more valuable training signals. All comparisons above use the same 20k-example training setting to control for data scale; nevertheless, data scale also matters for absolute performance (e.g., 40k top-50% generally outperforms 20k top-25%), motivating our choice of the 50% threshold in the main experiments as a trade-off between logicality and data scale.
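
The percentile ablation can be sketched as a two-stage selection: rank by score, keep the top fraction, then downsample to a fixed budget. The function name `sample_top_percentile`, the fixed random seed, and the toy scores below are illustrative assumptions rather than the released pipeline.

```python
import random


def sample_top_percentile(examples, scores, percentile, n_samples, seed=0):
    """Keep the top `percentile` fraction of examples by score, then downsample.

    `percentile` is a fraction in (0, 1]; 1.0 keeps everything, which mirrors
    the 100% (Direct-Distill) baseline in the ablation.
    """
    if not 0 < percentile <= 1:
        raise ValueError("percentile must be in (0, 1]")
    # Rank example indices from highest to lowest score.
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    keep = max(1, int(len(ranked) * percentile))
    pool = [examples[i] for i in ranked[:keep]]
    if len(pool) <= n_samples:
        return pool
    # Uniform downsampling so every variant has the same training size.
    return random.Random(seed).sample(pool, n_samples)


# Toy data: 100 examples whose score grows with their index.
examples = [f"q{i}" for i in range(100)]
scores = [i / 100 for i in range(100)]
subset = sample_top_percentile(examples, scores, percentile=0.25, n_samples=10)
print(sorted(subset))
```

Holding `n_samples` fixed across percentiles is what lets the comparison isolate data quality from data scale.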

## Appendix E More Details on Dataset Construction

Table 19: The amount of data filtered out at each step of the quality control process.

| Stage | Filtering Step | Quantity |
| --- | --- | --- |
| | Initially Collected Papers | 380678 |
| Rule-based Filtering | Paper Topic Filtering | 262639 |
| Rule-based Filtering | Forbidden Keywords | 1764 |
| Rule-based Filtering | Incorrect Formats | 354 |
| Rule-based Filtering | Deduplication | 243 |
| LLM-based Filtering | Forbidden Keywords | 3258 |
| LLM-based Filtering | Data Quality | 1439 |
| | Final Remaining Data | 110981 |

### E.1 Visualization of Data Statistics

Figure [9(a)](https://arxiv.org/html/2605.17104#A4.F9.sf1) illustrates the distribution of the filtered papers across physics subfields, adopting the classification system of the nine major categories for physics on arXiv (astro-ph: *astrophysics*, cond-mat: *condensed matter*, gr-qc: *general relativity & quantum cosmology*, hep: *high energy physics*, math-ph: *mathematical physics*, nlin: *nonlinear sciences*, nucl: *nuclear physics*, physics: *classical physics*, and quant-ph: *quantum physics*; a paper can belong to multiple subfields at the same time). Figure [9(b)](https://arxiv.org/html/2605.17104#A4.F9.sf2) presents the distribution of the initial four words within the question sentences. The wide range of question lengths and formats highlights the diversity of our constructed dataset.

### E.2 Details of Quality Control

This section provides a detailed introduction to the quality control process, which includes rule\-based data filtering, LLM\-based quality filtering, and human\-based data quality inspection\. Specifically:

Table 20: Human evaluation results on 200 sampled data points. All scores are percentages (%).

| Rater | RP | QQ | AQ | NQ | Average |
| --- | --- | --- | --- | --- | --- |
| Rater 1 | 100.0 | 99.0 | 94.5 | 93.0 | 96.63 |
| Rater 2 | 98.5 | 96.0 | 88.0 | 88.0 | 92.6 |

\* RP: Relevance to Paper; QQ: Question Quality; AQ: Answer Quality; NQ: Nexus Quality.

Table 21: Consistency scores between the two raters.

| Metric | RP | QQ | AQ | NQ | Average |
| --- | --- | --- | --- | --- | --- |
| Percentage Agreement (%) | 98.5 | 96.0 | 88.0 | 88.0 | 92.6 |
| Brennan-Prediger | 0.97 | 0.92 | 0.76 | 0.76 | 0.85 |

\* RP: Relevance to Paper; QQ: Question Quality; AQ: Answer Quality; NQ: Nexus Quality. Brennan-Prediger: used to measure inter-annotator agreement. A value approaching 1.0 indicates near-perfect agreement between raters after correcting for chance.

Table 22: Detailed information of the datasets used for training in the experiment.

| Dataset | Data Source | Generation Method | Disciplines | Discipline Labelled | Total Volume | Physics Volume | Sample Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NaturalReasoning (Yuan et al., [2025](https://arxiv.org/html/2605.17104#bib.bib17)) | Pre-training corpora | LLM-based Synthesis | Physics, Computer Science, Math, Economics, Social Sciences | No | 1145824 | - | 0.07 |
| MegaScience (Fan et al., [2025](https://arxiv.org/html/2605.17104#bib.bib21)) | University textbooks & public datasets | Corpus Extraction | Medicine, Biology, Chemistry, Computer Science, Physics, Math, Economics | Yes | 1253230 | 41410 | 1.93 |
| Sci-Instruct (Zhang et al., [2024a](https://arxiv.org/html/2605.17104#bib.bib18)) | Unlabeled scientific questions | LLM-based Synthesis | Physics, Chemistry, Math | Yes | 254051 | 123869 | 0.65 |
| SCP-116k (Lu et al., [2025](https://arxiv.org/html/2605.17104#bib.bib20)) | Academic documents | Corpus Extraction | Physics, Chemistry, Biology | Yes | 274166 | 162192 | 0.49 |
| Ours | Academic literatures | LLM-based Synthesis | Physics | Yes | 110981 | 110981 | 0.72 |

Table 23: Hyperparameter settings for model inference.

| `max_tokens` | `temperature` | `top_p` | `n` |
| --- | --- | --- | --- |
| 65536 | 0.6 | 0.95 | 8 |

Rule-based data filtering includes:

- Paper-topic filtering: We retain only papers with rigorous logical deduction, excluding reviews, tool-development papers, and empirical studies.
- Forbidden-keyword filtering: Since the synthesis prompt forbids paper-specific details (e.g., experiments or data), we remove samples whose questions, answers, or logical nexuses contain terms such as “paper,” “experimental results,” or “author.”
- Format filtering: We discard samples with invalid answer or nexus formats, including malformed multiple-choice items, missing required final-answer formats, or incorrectly formatted logical nexuses.
- Deduplication: We apply MinHash LSH and remove near-duplicate questions with Jaccard similarity $> 0.8$.
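
The deduplication criterion can be illustrated with an exact Jaccard computation. This is a simplified stand-in: the pipeline uses MinHash LSH, which approximates the same similarity at scale, whereas the sketch below compares every pair directly. The word-level tokenization, the `shingles` helper, and the example questions are assumptions made for illustration.

```python
def shingles(text, k=1):
    """Set of k-word shingles; k=1 (word sets) is an assumed, simple choice."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(questions, threshold=0.8):
    """Keep a question only if its similarity to every kept one is <= threshold."""
    kept, kept_sets = [], []
    for question in questions:
        current = shingles(question)
        if all(jaccard(current, previous) <= threshold for previous in kept_sets):
            kept.append(question)
            kept_sets.append(current)
    return kept


q1 = "Derive the lower bound of kappa in terms of theta beta and d"
q2 = q1 + " please"  # near-duplicate: Jaccard = 12/13, above the 0.8 threshold
q3 = "Compute the phase shift of the seventh harmonic"
unique_questions = deduplicate([q1, q2, q3])
print(unique_questions)
```

The pairwise loop is quadratic in the number of questions, which is the cost that locality-sensitive hashing is designed to avoid on a corpus of this size.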

LLM-based data filtering includes:

- Filtering of forbidden keywords: With the same objective as the rule-based forbidden-keyword filtering above, this step filters out data containing specific content from the paper.
- Filtering for data quality: This step filters out data with incomplete information, incorrect question types, or overly simplistic reasoning steps.

The prompt for the LLM-based evaluation is provided in Appendix [I.5](https://arxiv.org/html/2605.17104#A9.SS5).

Human-based data quality inspection: We randomly sampled 200 data points from the generated dataset, and two Ph.D. students scored each data point against the following four dimensions:

- Relevance to Paper (RP): Is the question related to the core research or derivation process of the paper?
- Question Quality (QQ): Is the question complete, free of missing information and formatting errors, and does it avoid giving the answer away?
- Answer Quality (AQ): Is the answer correct?
- Nexus Quality (NQ): Does the logical nexus correctly describe the derivation process for this question based on the derivations in the paper?

Table [19](https://arxiv.org/html/2605.17104#A5.T19) shows the amount of data filtered out by each step of the rule-based and LLM-based filtering. Table [20](https://arxiv.org/html/2605.17104#A5.T20) presents the average data quality scores for the 200 sampled items across the four dimensions as assessed by the two human raters. Table [21](https://arxiv.org/html/2605.17104#A5.T21) reports the percentage agreement and Brennan-Prediger score (Brennan and Prediger, [1981](https://arxiv.org/html/2605.17104#bib.bib42)) between the two raters.
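
The Brennan-Prediger coefficient corrects observed agreement $P_o$ for a uniform chance level $1/q$ over $q$ rating categories: $(P_o - 1/q)/(1 - 1/q)$. The sketch below assumes each dimension is a binary pass/fail judgement ($q = 2$); that is an assumption, although it reproduces the values in Table 21 from the percentage agreements.

```python
def brennan_prediger(observed_agreement, num_categories=2):
    """Brennan-Prediger coefficient with a uniform chance level of 1/q.

    `num_categories=2` (binary ratings) is an assumption about the rating scheme.
    """
    chance = 1.0 / num_categories
    return (observed_agreement - chance) / (1.0 - chance)


# Percentage agreement from Table 21, expressed as fractions.
agreement = {"RP": 0.985, "QQ": 0.96, "AQ": 0.88, "NQ": 0.88}
bp_scores = {dim: round(brennan_prediger(p), 2) for dim, p in agreement.items()}
bp_average = round(sum(brennan_prediger(p) for p in agreement.values()) / len(agreement), 2)

print(bp_scores)
print("Average:", bp_average)
```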

## Appendix F Implementation Details

### F.1 Details on Training Dataset

Table 24: Detailed information about the evaluated Closed-Source LLMs

| Model | Version | LLM type |
| --- | --- | --- |
| [gpt-5](https://platform.openai.com/docs/models/gpt-5) | - | reasoning |
| [gpt-5-mini](https://platform.openai.com/docs/models/gpt-5-mini) | - | reasoning |
| [gpt-5-nano](https://platform.openai.com/docs/models/gpt-5-nano) | - | reasoning |
| [o4-mini](https://platform.openai.com/docs/models/o4-mini) | - | reasoning |
| [doubao-seed-1.6-thinking](https://www.volcengine.com/docs/82379/1330310) | 250615 | reasoning |
| [claude-3.7-sonnet](https://www.anthropic.com/news/claude-3-7-sonnet) | 20250219 | reasoning |
| [gemini-2.5-flash](https://ai.google.dev/gemini-api/docs/changelog) | preview-04-17 | reasoning |
| [grok-4-fast-reasoning](https://docs.x.ai/docs/models/grok-4-fast-reasoning) | - | reasoning |
| [yi-large](https://platform.01.ai/) | - | chat |

Table 25: Detailed information about the evaluated Open-Source LLMs

| Model | Version | LLM type | Parameters (B) |
| --- | --- | --- | --- |
| [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) | - | chat | 671 (37B act.) |
| [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | - | reasoning | 671 (37B act.) |
| [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) | - | reasoning | 14 |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | - | reasoning | 7 |
| [GLM-4.5](https://docs.z.ai/guides/llm/glm-4.5) | - | reasoning | 355 (32B act.) |
| [Kimi-K2](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | 0905 | chat | 1000 (32B act.) |
| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | - | chat | 8 |
| [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | - | chat | 32 |
| [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) | - | chat | 14 |
| [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | - | chat | 7 |

### F.2 Implementation Details on Benchmarking

This section details the evaluation setup on three public benchmarks and the PhysLogic benchmark to ensure the reproducibility of our results.

GPQA: The evaluation is conducted using a public third-party framework, ScienceEval ([https://github.com/ScienceOne-AI/ScienceEval](https://github.com/ScienceOne-AI/ScienceEval)). We test 86 multiple-choice physics questions from the diamond subset. Answer correctness is determined using the framework’s rule-based method.

SciBench: We also employ the ScienceEval framework to evaluate 193 computational physics problems. Answer correctness is verified through a combination of rule-based methods and a mathematical validation library.

PhysReason: We utilize a public third-party framework, Evalscope ([https://github.com/modelscope/evalscope](https://github.com/modelscope/evalscope)), for evaluation. We selected plain-text physics problems and decomposed multi-part questions into individual items to facilitate assessment by LLMs. The evaluation uses the framework’s custom question-answering pipeline, and answer correctness is determined via the LLM-as-a-judge approach, with deepseek-v3-0324 ([https://api-docs.deepseek.com/news/news250325](https://api-docs.deepseek.com/news/news250325)) serving as the judge LLM.

PhysLogic: The complete benchmark, comprising 864 problems, along with the full code for inference, answer assessment, and logicality evaluation, is provided in the supplementary materials. We observed that the judge LLM exhibited significant variability when evaluating proof and expression-derivation problems. Therefore, to ensure objective and robust answer assessment, we limited our final answer evaluation to the 216 multiple-choice and 216 numerical computation questions. Multiple-choice questions are judged using a rule-based method, while computational questions are assessed using a hybrid of mathematical validation and an LLM judge.

### F.3 Details on LLM Deployment

Model deployment during the evaluation process is facilitated by the lmdeploy framework ([https://github.com/InternLM/lmdeploy](https://github.com/InternLM/lmdeploy)). The specific hyperparameters used for inference are detailed in Table [23](https://arxiv.org/html/2605.17104#A5.T23). Because some closed-source LLMs do not support some of the parameters set in the main experiment (see Appendix [F.2](https://arxiv.org/html/2605.17104#A6.SS2)), the experiments on closed-source models fix only the temperature at $0.6$ and leave the remaining parameters unset. Tables [24](https://arxiv.org/html/2605.17104#A6.T24) and [25](https://arxiv.org/html/2605.17104#A6.T25) summarize the detailed information of the LLMs used in the experiments.

Table 26: Sensitivity analysis for the weights of the three logicality dimensions ($\delta_{\mathcal{F}}$, $\delta_{\mathcal{O}}$, $\delta_{\mathcal{P}}$).

| $\delta_{\mathcal{F}}$ | $\delta_{\mathcal{O}}$ | $\delta_{\mathcal{P}}$ | Llama-3.1-8B Logicality | Llama-3.1-8B Answer | Qwen-2.5-7B-Instruct Logicality | Qwen-2.5-7B-Instruct Answer | DeepSeek-R1-Distill-Qwen-7B Logicality | DeepSeek-R1-Distill-Qwen-7B Answer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.5 | 0.5 | **43.90** | 33.85 | 40.03 | 40.59 | 41.58 | 45.08 |
| 0.5 | 0 | 0.5 | 44.05 | **31.31** | **38.35** | **38.25** | **41.38** | **38.68** |
| 0.5 | 0.5 | 0 | 44.06 | 33.72 | 40.77 | 38.72 | 41.69 | 45.18 |

\* The worst-performing result for each metric is highlighted in bold.

![Refer to caption](https://arxiv.org/html/2605.17104v1/x9.png)

Figure 10: Logical fidelity of various models vs. similarity threshold $\tau$

## Appendix G Parameter Sensitivity Analysis

### G.1 Analysis on Logicality Dimension Weights

In the Distillation with Logic Supervision process, the score of a sample ($\mathcal{S}$) is calculated as a weighted sum of logical fidelity ($\mathcal{F}$), causal connection ($\mathcal{O}$), and inferential progress ($\mathcal{P}$):

$$\mathcal{S}=\delta_{\mathcal{F}}\cdot\left(2\cdot\frac{\text{Norm}(\pi)\cdot\text{Norm}(\rho)}{\text{Norm}(\pi)+\text{Norm}(\rho)}\right)+\delta_{\mathcal{O}}\cdot\text{Norm}(\mathcal{O})+\delta_{\mathcal{P}}\cdot\text{Norm}(\mathcal{P})$$

We performed a sensitivity analysis to set the final weights for our three logicality dimensions ($\delta_{\mathcal{F}}$, $\delta_{\mathcal{O}}$, and $\delta_{\mathcal{P}}$). In this analysis, we individually removed the influence of each dimension by setting its respective weight to $0$ while keeping the other two equal, and then sampled a 40k dataset for training (this experimental setup is the same as that in the ablation study, Section [4.5](https://arxiv.org/html/2605.17104#S4.SS5)). The results in Table [26](https://arxiv.org/html/2605.17104#A6.T26) show that removing Causal Connection ($\delta_{\mathcal{O}}$) leads to the most significant performance degradation. We attribute this to the fact that errors in the causal sequence of reasoning are the most critical logical flaws. Therefore, we assigned $\delta_{\mathcal{O}}$ the highest final weight, with the final configuration set to ($\delta_{\mathcal{F}}=0.25$, $\delta_{\mathcal{O}}=0.50$, $\delta_{\mathcal{P}}=0.25$).
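
The sample score above can be sketched in a few lines. The fidelity term is a harmonic mean of the normalized $\pi$ and $\rho$, and the default weights are the final configuration (0.25, 0.50, 0.25). The normalization operator Norm is not specified in this appendix, so min-max scaling over the candidate pool is an assumption here, as are the helper names and the toy values.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] over the pool (assumed form of Norm)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def harmonic_mean(a, b):
    """2ab / (a + b), defined as 0 when both inputs are 0."""
    return 0.0 if a + b == 0 else 2.0 * a * b / (a + b)


def sample_scores(pi, rho, causal, progress, w_f=0.25, w_o=0.50, w_p=0.25):
    """Weighted logicality score S for each sample in a pool."""
    n_pi, n_rho, n_o, n_p = (min_max_normalize(x) for x in (pi, rho, causal, progress))
    return [
        w_f * harmonic_mean(a, b) + w_o * c + w_p * d
        for a, b, c, d in zip(n_pi, n_rho, n_o, n_p)
    ]


# Toy pool of three samples (hypothetical raw metric values).
scores = sample_scores(
    pi=[0.2, 0.5, 0.8],
    rho=[0.4, 0.6, 1.0],
    causal=[60.0, 70.0, 80.0],
    progress=[3.0, 5.0, 7.0],
)
print([round(s, 3) for s in scores])
```

Setting one weight to 0 and the other two to 0.5 reproduces the configurations compared in Table 26.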

### G.2 Analysis on Similarity Threshold in Logical Fidelity

In the calculation of Logical Fidelity, we employ a similarity threshold, $\tau$, within the greedy matching algorithm. In our main experiments, this threshold was set to $0.3$. However, $\tau$ is a critical hyperparameter that could influence the evaluation results. To strengthen the credibility and reliability of our evaluation, we examined the effect of varying the similarity threshold. Specifically, we set the threshold $\tau$ to values of $0.1, 0.2, \dots, 0.9$. We then compared the Logical Fidelity of models trained on our sampled dataset against those trained on a baseline dataset, with this evaluation being conducted across all three backbones.

As shown in Figure [10](https://arxiv.org/html/2605.17104#A6.F10), the logical fidelity of all tested LLMs decreases as the similarity threshold increases. Our proposed RST and Logic-Distill data sampling methods maintain superior performance over the baselines across the entire range of threshold values.
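
Why fidelity can only fall as $\tau$ rises can be seen in a generic thresholded greedy matcher. The sketch below is not the paper's algorithm: the matching order, the similarity function, and the toy similarity matrix are all assumptions. It pairs generated steps with reference nexuses in descending order of similarity and discards pairs below $\tau$.

```python
def greedy_match(similarity, tau):
    """Greedily pair rows (generated steps) with columns (reference nexuses).

    Candidate pairs with similarity below `tau` are discarded; the rest are
    taken from most to least similar, using each row and column at most once.
    """
    candidates = sorted(
        (
            (score, i, j)
            for i, row in enumerate(similarity)
            for j, score in enumerate(row)
            if score >= tau
        ),
        reverse=True,
    )
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return matches


# Hypothetical similarity matrix: 3 generated steps x 3 reference nexuses.
similarity = [
    [0.90, 0.20, 0.10],
    [0.30, 0.60, 0.20],
    [0.10, 0.25, 0.35],
]
taus = [0.1, 0.3, 0.5, 0.7, 0.95]
match_counts = [len(greedy_match(similarity, tau)) for tau in taus]
print(dict(zip(taus, match_counts)))
```

Raising the threshold only truncates the sorted candidate list, so the set of matches shrinks or stays the same, and any score built on the number of matched steps is non-increasing in $\tau$.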

## Appendix H Scoring Rubric for Human Experts and LLM Judge

The following shows the scoring criteria used in Section [4.2](https://arxiv.org/html/2605.17104#S4.SS2) for human experts and LLM judges to assess the logicality of the reasoning process.

`Scoring rubric for human experts and LLM judge.`

`Appendix I Prompt Design I\.1 Prompts of Self QA Below is the prompt to generate the question: System prompt for question generation \(Non\-MCP Question\) System prompt for question generation \(MCP Question\) Below is the prompt to generate the answer: System prompt for answer generation I\.2 Prompts of Inference Below is the prompt for a strong LLM to answer the question directly: Prompt for answer question directly \(Non\-MCP Question\) Prompt for answer question directly \(MCP Question\) Below is the prompt for strong LLM to transfer the logical nexuses into a continuous reasoning process: Prompt for style transfer I\.3 Prompts of Logical Nexuses Extraction Below is the prompt to extract logical nexuses from the paper: System prompt to extract logical nexuses I\.4 Prompts of Benchmarking Below is the prompt for GPQA benchmark’s evaluation: GPQA prompt Below is the prompt for SciBench benchmark’s evaluation: SciBench prompt Below is the prompt for PhysReason benchmark’s evaluation: PhysReason’s inference prompt Below is the prompt for judging by LLM during PhysReason evaluating\. PhysReason’s judging prompt Below is the prompt for our proposed PhysLogic benchmark’s evaluation: PhysLogic’s inference prompt Below is the prompt for LLM judgment during PhysLogic evaluation\. PhysLogic’s judging prompt I\.5 Prompts of Quality Control Below is the prompt for paper topic filtering: System prompt for paper topic filtering Below is the prompt for filtering out data samples with forbidden keywords: System prompt for forbidden keywords filtering Below is the prompt for filtering out data samples with incomplete information, incorrect question types, or overly simplistic reasoning steps\. 
System prompt for low\-quality datas filtering Appendix J Data Examples J\.1 Multiple Choice Problem Below is an example of a multiple choice problem: • Difficulty: High School • Subdomain: general relativity & quantum cosmology Question \(Multiple Choice Problem\) Consider a constant\-density star \(Schwarzschild star\) with compactness C≡M/RC\\equiv M/R approaching the black hole limit C→1/2C\\rightarrow 1/2 from below\. Define y1=1−2Cy\_\{1\}=\\sqrt\{1\-2C\} and the coordinate transformation x=1−y=1−1−\(r/α\)2x=1\-y=1\-\\sqrt\{1\-\(r/\\alpha\)^\{2\}\} where α=R3/2/RS\\alpha=R^\{3/2\}/\\sqrt\{R\_\{S\}\} and RS=2MR\_\{S\}=2M\. The interior pressure becomes negative when C\>4/9C\>4/9, creating a singular surface at x0=−κx\_\{0\}=\-\\kappa where κ=3y1−1\\kappa=3y\_\{1\}\-1\. The tidal Love number k2k\_\{2\} quantifies the quadrupolar tidal deformability of this star as measured at its surface r=Rr=R \(x=x1x=x\_\{1\}\)\. When deriving the behavior of k2k\_\{2\} near the black hole limit \(C→1/2\+C\\rightarrow 1/2^\{\+\}, κ→0−\\kappa\\rightarrow 0^\{\-\}\), what functional dependence on compactness δC=C−1/2\\delta C=C\-1/2 does k2k\_\{2\} exhibit? A\. k2∝δCk\_\{2\}\\propto\\delta C B\. k2∝1/log⁡\(\|δC\|\)k\_\{2\}\\propto 1/\\log\(\|\\delta C\|\) C\. k2∝exp⁡\(−b/\|δC\|\)k\_\{2\}\\propto\\exp\\left\(\-b/\|\\delta C\|\\right\) D\. k2∝exp⁡\(−b\|δC\|\)k\_\{2\}\\propto\\exp\\left\(\-b\|\\delta C\|\\right\) Answer \(Multiple Choice Problem\) To answer this question, we must derive the behavior of the tidal Love number k2k\_\{2\} for a Schwarzschild star as the compactness C≡M/RC\\equiv M/R approaches the black hole limit C→1/2−C\\to 1/2^\{\-\}\. The compactness deviation is δC=C−1/2<0\\delta C=C\-1/2<0, and we focus on the magnitude \|δC\|=1/2−C\|\\delta C\|=1/2\-C\. 
The key parameter is y1=1−2Cy\_\{1\}=\\sqrt\{1\-2C\}, which relates to the coordinate transformation x=1−yx=1\-y, with y=1−\(r/α\)2y=\\sqrt\{1\-\(r/\\alpha\)^\{2\}\} and α=R3/2/RS\\alpha=R^\{3/2\}/\\sqrt\{R\_\{S\}\}, where RS=2MR\_\{S\}=2M\. The compactness C\>4/9C\>4/9 \(beyond Buchdahl\) leads to negative pressure in the interior, with a singular surface at x0=−κx\_\{0\}=\-\\kappa, where κ=3y1−1\\kappa=3y\_\{1\}\-1\. As C→1/2−C\\to 1/2^\{\-\}, y1→0\+y\_\{1\}\\to 0^\{\+\}, and κ→−1\+\\kappa\\to\-1^\{\+\}\. Thus, x0=−κ→1−x\_\{0\}=\-\\kappa\\to 1^\{\-\} and the surface x1=1−y1→1−x\_\{1\}=1\-y\_\{1\}\\to 1^\{\-\}\. The difference \|x1−x0\|=2y1=21−2C∝\|δC\|\|x\_\{1\}\-x\_\{0\}\|=2y\_\{1\}=2\\sqrt\{1\-2C\}\\propto\\sqrt\{\|\\delta C\|\}, so \|x1−x0\|\|x\_\{1\}\-x\_\{0\}\| scales with \|δC\|\\sqrt\{\|\\delta C\|\}\. The tidal Love number k2k\_\{2\} is derived from the perturbation equation for the logarithmic derivative h\(r\)h\(r\) \(or H\(r\)H\(r\)\), governed by a Riccati equation in the xx\-coordinate, with coefficients depending on the metric and pressure\. The pressure singularity at x0x\_\{0\} introduces essential singular behavior in the perturbation solution\. Integrating the equation from x=0x=0 to x=x1x=x\_\{1\} with boundary conditions h≈2h\\approx 2 at small rr and continuity at r=Rr=R, the solution near x0x\_\{0\} involves Frobenius expansion or WKB\-like methods due to the singularity\. Analysis shows that as \|x1−x0\|∝\|δC\|→0\|x\_\{1\}\-x\_\{0\}\|\\propto\\sqrt\{\|\\delta C\|\}\\to 0, the wavefunction solution exhibits exponential suppression proportional to exp⁡\(−b/\|δC\|\)\\exp\(\-b/\\sqrt\{\|\\delta C\|\}\) for some constant b\>0b\>0\. To match the boundary condition at x1x\_\{1\}, the value hRh\_\{R\} inherited from this solution gives k2k\_\{2\} through the formula k2∝\[2C\(hR−1\)−hR\+2\]/denominatork\_\{2\}\\propto\[2C\(h\_\{R\}\-1\)\-h\_\{R\}\+2\]/\\text\{denominator\}, simplified via the exponential decay\. 
Thus, the dominant behavior is k2∝exp⁡\(−b/\|δC\|\)k\_\{2\}\\propto\\exp\(\-b/\|\\delta C\|\) for a constant b\>0b\>0\. This is contrasted with ECO logarithmic scaling \(k2∝1/log⁡\|δC\|k\_\{2\}\\propto 1/\\log\|\\delta C\|\) and linear scaling \(k2∝\|δC\|k\_\{2\}\\propto\|\\delta C\|\), both excluded as they do not match the derived exponential form\. Option D \(exp⁡\(−b\|δC\|\)\\exp\(\-b\|\\delta C\|\)\) is invalid as it approaches 1, not 0, when \|δC\|→0\|\\delta C\|\\to 0\. The answer is therefore C\. C Logical Nexus \(Multiple Choice Problem\) 1\. Define the deviation δC=C−1/2\\delta C=C\-1/2 and identify y1=1−2C≈\|2δC\|y\_\{1\}=\\sqrt\{1\-2C\}\\approx\\sqrt\{\|2\\delta C\|\} near the BH limit\. \(10 points\) 2\. Express κ=3y1−1≈−1\+32\|δC\|\\kappa=3y\_\{1\}\-1\\approx\-1\+3\\sqrt\{2\|\\delta C\|\} as C→1/2−C\\to 1/2^\{\-\}, with κ→−1\+\\kappa\\to\-1^\{\+\}\. \(10 points\) 3\. Apply coordinate transformation x=1−yx=1\-y to find the singular surface x0=−κ≈1−32\|δC\|x\_\{0\}=\-\\kappa\\approx 1\-3\\sqrt\{2\|\\delta C\|\} and surface coordinate x1=1−y1≈1−2\|δC\|x\_\{1\}=1\-y\_\{1\}\\approx 1\-\\sqrt\{2\|\\delta C\|\}\. \(15 points\) 4\. Show \|x1−x0\|=2y1≈22\|δC\|∝\|δC\|→0\|x\_\{1\}\-x\_\{0\}\|=2y\_\{1\}\\approx 2\\sqrt\{2\|\\delta C\|\}\\propto\\sqrt\{\|\\delta C\|\}\\to 0 as δC→0−\\delta C\\to 0^\{\-\}\. \(15 points\) 5\. Formulate the tidal perturbation as a Riccati equation in the xx\-coordinate, noting coefficient singularities at x0x\_\{0\} due to pressure divergence\. \(15 points\) 6\. Derive the solution’s exponential suppression near x0x\_\{0\}: ∝exp⁡\(−b/\|δC\|\)\\propto\\exp\(\-b/\\sqrt\{\|\\delta C\|\}\) for b\>0b\>0, using WKB\-like asymptotics or Frobenius analysis\. \(20 points\) 7\. Evaluate hh at the surface \(x1x\_\{1\}\) and substitute into the k2k\_\{2\} formula to confirm k2∝exp⁡\(−b/\|δC\|\)k\_\{2\}\\propto\\exp\(\-b/\|\\delta C\|\), rejecting options A, B, and D\. 
(15 points)

J.2 Expression Computation

Below is an example of an expression computation problem:

• Difficulty: PhD student
• Subdomain: mathematical physics

Question (Expression Computation Problem)

In the context of the random loop model on a $d$-dimensional hypercubic lattice, consider the nearest-neighbor connection probability at the same time, denoted as $\kappa$. This probability satisfies the inequality:

$$1-\kappa\leq\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}\sqrt{\kappa}+\frac{\theta}{4d\beta}$$

where:

• $\theta\in\mathbb{N}$, $\theta\geq 2$ is the number of loop colors,
• $u\in[0,1/2]$ is the loop model parameter,
• $\beta>0$ is the inverse temperature satisfying $\beta\geq\theta/(4d)$,
• $d\in\mathbb{N}$ is the lattice dimension,
• $I_{(1,-1)}^{u,d}$ is a constant defined as:

$$I_{(1,-1)}^{u,d}=\sup_{\alpha\in[0,1]}\int_{[0,2\pi]^{d}}\frac{d^{d}k}{(2\pi)^{d}}\sqrt{u\alpha+(1-u)(1-\alpha)\frac{\epsilon(k+\pi)}{\epsilon(k)}}\left(1-\frac{1}{d}\sum_{j=1}^{d}\cos(k_{j})\right)_{+}$$

with $\epsilon(k)=2\sum_{j=1}^{d}(1-\cos k_{j})$ and $k+\pi=(k_{1}+\pi,\dots,k_{d}+\pi)$.

Derive an expression for the lower bound of $\kappa$ in terms of the given parameters $\theta$, $I_{(1,-1)}^{u,d}$, $\beta$, and $d$.

Answer (Expression Computation Problem)

The inequality provided is:

$$1-\kappa\leq\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}\sqrt{\kappa}+\frac{\theta}{4d\beta}$$

where $\kappa=\mathbb{P}_{\beta,\theta,u}(E_{0,e,0})$. We solve for a lower bound on $\kappa$ as follows:

1. Rearrange terms to isolate the square root:

$$1-\frac{\theta}{4d\beta}-\kappa\leq\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}\sqrt{\kappa}$$

2. Define substitutions: Let $x=\sqrt{\kappa}$ (so $x\geq 0$), and set:

$$A=\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}},\quad B=1-\frac{\theta}{4d\beta}$$

The inequality becomes:

$$B-x^{2}\leq Ax$$

3. Form a quadratic inequality: Rearrange to:

$$x^{2}+Ax-B\geq 0$$

This quadratic inequality holds when $x\geq\frac{-A+\sqrt{A^{2}+4B}}{2}$ (considering $x\geq 0$ and the quadratic's positive root).

4. Substitute back: Since $\kappa=x^{2}$, the lower bound is:

$$\kappa\geq\left(\frac{-A+\sqrt{A^{2}+4B}}{2}\right)^{2}$$

Replacing $A$ and $B$:

$$\kappa\geq\left(\frac{-\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}+\sqrt{\left(\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}\right)^{2}+4\left(1-\frac{\theta}{4d\beta}\right)}}{2}\right)^{2}$$

5. Simplify the expression: The term inside the square root simplifies as:

$$\left(\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}\right)^{2}+4-\frac{\theta}{d\beta}=\frac{\theta^{2}(I_{(1,-1)}^{u,d})^{2}}{2}+4-\frac{\theta}{d\beta}$$

Thus:

$$\kappa\geq\left(\frac{\sqrt{\frac{\theta^{2}(I_{(1,-1)}^{u,d})^{2}}{2}+4-\frac{\theta}{d\beta}}-\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}}{2}\right)^{2}$$

This is the lower bound for the nearest-neighbor connection probability $\kappa$.

$$\kappa\geq\left(\frac{\sqrt{\frac{\theta^{2}(I_{(1,-1)}^{u,d})^{2}}{2}+4-\frac{\theta}{d\beta}}-\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}}{2}\right)^{2}$$

Logical Nexus (Expression Computation Problem)

1. Rearrange the given inequality to isolate constant and $\kappa$ terms: move $\frac{\theta}{4d\beta}$ to the left and $\kappa$ to the right, yielding $1-\kappa-\frac{\theta}{4d\beta}\leq\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}\sqrt{\kappa}$. (10 points)
2. Substitute $x=\sqrt{\kappa}$ and define constants: $A=\theta\frac{I_{(1,-1)}^{u,d}}{\sqrt{2}}$ and $B=1-\frac{\theta}{4d\beta}$, transforming the inequality to $B-x^{2}\leq Ax$. (20 points)
3. Rearrange the substituted inequality into standard quadratic form: $x^{2}+Ax-B\geq 0$. (10 points)
4. Solve the quadratic inequality by identifying the relevant root for $x\geq 0$: $x\geq\frac{-A+\sqrt{A^{2}+4B}}{2}$. (30 points)
5. Substitute $\kappa=x^{2}$ back into the solution, yielding $\kappa\geq\left(\frac{-A+\sqrt{A^{2}+4B}}{2}\right)^{2}$. (10 points)
6. Replace $A$ and $B$ with their expressions and simplify the square root term to $\sqrt{\frac{\theta^{2}(I_{(1,-1)}^{u,d})^{2}}{2}+4-\frac{\theta}{d\beta}}$. (10 points)
7. Write the final expression for the lower bound of $\kappa$ using the simplified terms. (10 points)

J.3 Numeric Computation

Below is an example of a numeric computation problem:

• Difficulty: Master's student
• Subdomain: classical physics, condensed matter

Question (Numeric Computation Problem)

In high-harmonic interferometry studies of solids, a phase shift $\Delta\phi$ in the harmonic radiation can arise from excitation-induced bandgap changes.
The dipole phase for a harmonic of order $N$ is given by $\phi=N(\omega_{0}t_{r}+\pi/2)-S(t_{r})$, where $S(t_{r})=\int_{t_{i}}^{t_{r}}\Delta\varepsilon(k(\tau))\,d\tau$ is the semi-classical action in atomic units, $t_{i}$ and $t_{r}$ are the excitation and recombination times, and $\Delta\varepsilon$ is the energy difference between bands. Suppose an excitation uniformly increases $\Delta\varepsilon$ by a constant $\Delta E_{g}$ (in hartree) across all $k$. Using the semi-classical model, derive the expression for $\Delta\phi=\phi^{\prime}-\phi$, where $\phi^{\prime}$ is the phase after the perturbation. In an experiment, the phase shift of the 7th harmonic ($N=7$) is measured as $-1.2$ radians, with a characteristic excursion time of $1.5$ femtoseconds. Using the conversion factors $1\text{ fs}=10^{-15}\text{ s}$, $1\text{ a.u. time}=2.4188\times 10^{-17}\text{ s}$, and $1\text{ hartree}=27.211\text{ eV}$, calculate $\Delta E_{g}$ in eV.

Answer (Numeric Computation Problem)

The phase shift $\Delta\phi$ is related to the bandgap change $\Delta E_{g}$ and the characteristic excursion time $\Delta t$ by the formula $\Delta\phi=-\Delta E_{g}\Delta t$, where $\Delta\phi$ is in radians, $\Delta E_{g}$ is in hartree, and $\Delta t$ is in atomic units of time. Given $\Delta\phi=-1.2$ radians and $\Delta t=1.5$ femtoseconds, first convert $\Delta t$ to atomic units using the provided conversions. 1 femtosecond (fs) = $10^{-15}$ seconds, and 1 atomic unit (a.u.) of time = $2.4188\times 10^{-17}$ seconds. Thus:

$$\Delta t_{\text{au}}=\frac{1.5\times 10^{-15}}{2.4188\times 10^{-17}}=\frac{1.5}{2.4188}\times 10^{2}\approx 62.0142\text{ a.u. time}.$$

Solve for $\Delta E_{g}$ in hartree:

$$-1.2=-\Delta E_{g}\times 62.0142\implies\Delta E_{g}=\frac{1.2}{62.0142}\approx 0.0193504\text{ hartree}.$$

Convert $\Delta E_{g}$ to electron volts using 1 hartree = 27.211 eV:

$$\Delta E_{g}\text{ (in eV)}=0.0193504\times 27.211\approx 0.526544\text{ eV}.$$

Rounding to three decimal places: 0.526544 rounds to 0.527 eV, as the fourth decimal place is 5 (followed by 4), requiring rounding up. The answer is therefore 0.527.

Logical Nexus (Numeric Computation Problem)

1. Recognize that a uniform increase in $\Delta E_{g}$ modifies the semi-classical action to $S^{\prime}(t_{r})=S(t_{r})+\Delta E_{g}(t_{r}-t_{i})$. (10 points)
2. Express the perturbed dipole phase as $\phi^{\prime}=N(\omega_{0}t_{r}+\pi/2)-[S(t_{r})+\Delta E_{g}(t_{r}-t_{i})]$. (10 points)
3. Formulate the phase shift $\Delta\phi=\phi^{\prime}-\phi=-[S^{\prime}(t_{r})-S(t_{r})]=-\Delta E_{g}(t_{r}-t_{i})$. (15 points)
4. Identify $\Delta t=t_{r}-t_{i}$ as the characteristic excursion time to obtain $\Delta\phi=-\Delta E_{g}\Delta t$. (10 points)
5. Convert the given characteristic excursion time of $1.5\,\text{fs}$ to atomic units using $1\,\text{fs}=10^{-15}\,\text{s}$ and $1\,\text{a.u. time}=2.4188\times 10^{-17}\,\text{s}$: $\Delta t_{\text{au}}=(1.5\times 10^{-15})/(2.4188\times 10^{-17})\approx 62.014\,\text{a.u. time}$. (15 points)
6. Apply the derived relationship $\Delta\phi=-\Delta E_{g}\Delta t$ with the $N=7$ harmonic phase shift $\Delta\phi=-1.2\,\text{rad}$: $-1.2=-\Delta E_{g}\times 62.014$. (10 points)
7. Solve for $\Delta E_{g}$ in hartree: $\Delta E_{g}=1.2/62.014\approx 0.019350\,\text{hartree}$. (10 points)
8. Convert $\Delta E_{g}$ from hartree to eV using $1\,\text{hartree}=27.211\,\text{eV}$: $\Delta E_{g,\text{eV}}=0.019350\times 27.211\approx 0.52653\,\text{eV}$. (10 points)
9. Round the result to three decimal places ($0.527\,\text{eV}$) based on significant figures from input values. (10 points)

J.4 Proof-based Problem

Below is an example of a proof-based problem:

• Difficulty: Undergraduate
• Subdomain: nuclear physics, astrophysics, high energy physics

Question (Proof-Based Problem)

Consider a hybrid neutron star described by a first-order phase transition from a hadronic matter phase to color-superconducting quark matter at a critical baryon chemical potential $\mu_{c}$. The hadronic phase equation of state is denoted $P_{h}(\mu)$ and the quark phase $P_{q}(\mu)$, satisfying mechanical equilibrium at transition: $P_{h}(\mu_{c})=P_{q}(\mu_{c})$. The corresponding energy densities are $\varepsilon_{h}(\mu_{c})$ and $\varepsilon_{q}(\mu_{c})$, with a discontinuity $\Delta\varepsilon=\varepsilon_{q}(\mu_{c})-\varepsilon_{h}(\mu_{c})$.
The transition pressure is $P_{c}=P_{h}(\mu_{c})$. Assume the equation of state satisfies causality ($0\leq dP/d\varepsilon\leq c^{2}=1$ in natural units) and thermodynamic consistency. Prove that for $\Delta\varepsilon>\dfrac{\varepsilon_{h}(\mu_{c})+3P_{c}}{2}$, the phase transition induces a gravitational instability in static spherical neutron stars governed by the Tolman-Oppenheimer-Volkoff equations. Use the conservation of stress-energy $\nabla_{\mu}T^{\mu\nu}=0$ and the Oppenheimer-Volkoff equation for pressure gradients, explicitly showing how the causality constraint and energy discontinuity lead to a divergent pressure gradient incompatible with stellar equilibrium.

Answer (Proof-Based Problem)

To prove that a gravitational instability arises when $\Delta\varepsilon>\dfrac{\varepsilon_{h}(\mu_{c})+3P_{c}}{2}$, we use the Tolman-Oppenheimer-Volkoff (TOV) equations for static spherical symmetry, which describe hydrostatic equilibrium in general relativity. The metric is:

$$ds^{2}=-e^{2\Phi(r)}dt^{2}+e^{2\Lambda(r)}dr^{2}+r^{2}d\Omega^{2}$$

where $d\Omega^{2}$ is the angular metric, and $e^{-2\Lambda(r)}=1-\dfrac{2m(r)}{r}$.
The TOV equations are:

$$\frac{dP}{dr}=-\frac{[\varepsilon(r)+P(r)][m(r)+4\pi r^{3}P(r)]}{r^{2}\left(1-\dfrac{2m(r)}{r}\right)}\quad(1)$$

$$\frac{dm}{dr}=4\pi r^{2}\varepsilon(r)\quad(2)$$

At the phase transition radius $r_{c}$, pressure is continuous ($P(r_{c})=P_{c}$), but energy density jumps from $\varepsilon_{h}(\mu_{c})$ to $\varepsilon_{q}(\mu_{c})=\varepsilon_{h}(\mu_{c})+\Delta\varepsilon$. The mass function $m(r)$ is continuous at $r_{c}$, but its derivative is discontinuous due to $\Delta\varepsilon$. Using (2):

$$\frac{dm}{dr}\bigg|_{r_{c}^{\pm}}=4\pi r_{c}^{2}\varepsilon(r_{c}^{\pm})$$

where $\varepsilon(r_{c}^{+})=\varepsilon_{h}$ and $\varepsilon(r_{c}^{-})=\varepsilon_{q}$.
The pressure gradient at $r_{c}$ from the hadronic side ($r\to r_{c}^{+}$) and quark side ($r\to r_{c}^{-}$) is derived from (1):

$$\frac{dP}{dr}\bigg|_{r_{c}^{+}}=-\frac{[\varepsilon_{h}+P_{c}][m(r_{c})+4\pi r_{c}^{3}P_{c}]}{r_{c}^{2}\left(1-\dfrac{2m(r_{c})}{r_{c}}\right)}\quad(3)$$

$$\frac{dP}{dr}\bigg|_{r_{c}^{-}}=-\frac{[\varepsilon_{q}+P_{c}][m(r_{c})+4\pi r_{c}^{3}P_{c}]}{r_{c}^{2}\left(1-\dfrac{2m(r_{c})}{r_{c}}\right)}\quad(4)$$

Define $Q\equiv m(r_{c})+4\pi r_{c}^{3}P_{c}>0$ and $G\equiv r_{c}^{2}\left(1-\dfrac{2m(r_{c})}{r_{c}}\right)>0$ (since $2m(r_{c})/r_{c}<1$ for equilibrium). The difference in pressure gradients is:

$$\frac{dP}{dr}\bigg|_{r_{c}^{-}}-\frac{dP}{dr}\bigg|_{r_{c}^{+}}=-\frac{Q}{G}\Delta\varepsilon$$

As $\Delta\varepsilon>0$ and $Q/G>0$, $\frac{dP}{dr}\bigg|_{r_{c}^{-}}<\frac{dP}{dr}\bigg|_{r_{c}^{+}}<0$, meaning the quark-core gradient is steeper. Assuming constant energy density $\varepsilon_{q}$ in a thin quark core near $r_{c}$, (4) simplifies at any $r<r_{c}$ to:

$$\frac{dP}{dr}=-K(\varepsilon_{q}+P)\quad(5)$$

where $K\equiv\dfrac{Q}{G}>0$ is constant near $r_{c}$.
Solve (5) for $P(r)$:

$$\int_{P_{0}}^{P}\frac{dP^{\prime}}{\varepsilon_{q}+P^{\prime}}=-K\int_{0}^{r}dr^{\prime}$$

$$\ln\left(\frac{\varepsilon_{q}+P}{\varepsilon_{q}+P_{0}}\right)=-Kr$$

where $P_{0}$ is central pressure at $r=0$. Thus:

$$\varepsilon_{q}+P=(\varepsilon_{q}+P_{0})e^{-Kr}\quad(6)$$

At $r=r_{c}$, $P=P_{c}$:

$$\varepsilon_{q}+P_{c}=(\varepsilon_{q}+P_{0})e^{-Kr_{c}}\quad(7)$$

Rearranging for $P_{0}$:

$$P_{0}=(\varepsilon_{q}+P_{c})e^{Kr_{c}}-\varepsilon_{q}\quad(8)$$

The quark core has mass $m(r_{c})=\dfrac{4\pi}{3}\varepsilon_{q}r_{c}^{3}$. Using $\dfrac{2m(r_{c})}{r_{c}}=\dfrac{8\pi\varepsilon_{q}r_{c}^{2}}{3}$ in $G$ and $Q=m(r_{c})+4\pi r_{c}^{3}P_{c}$ yields:

$$K=\frac{m(r_{c})+4\pi r_{c}^{3}P_{c}}{r_{c}^{2}\left(1-\dfrac{8\pi\varepsilon_{q}r_{c}^{2}}{3}\right)}\quad(9)$$

The term $\left(1-\dfrac{8\pi\varepsilon_{q}r_{c}^{2}}{3}\right)$ in $K$ causes $P_{0}\to\infty$ in (8) when:

$$1-\dfrac{8\pi\varepsilon_{q}r_{c}^{2}}{3}\to 0^{+}\implies\dfrac{8\pi\varepsilon_{q}r_{c}^{2}}{3}=1$$

Equation (7) then requires:

$$\lim_{Kr_{c}\to\infty}e^{Kr_{c}}\quad\text{finite}\implies\varepsilon_{q}+P_{c}=0$$

which is unphysical ($\varepsilon_{q},P_{c}>0$). For $\Delta\varepsilon>\dfrac{\varepsilon_{h}+3P_{c}}{2}$, we show the numerator in (4) dominates.
At the limit of stability, (8) holds, but instability occurs when $P_{0}$ is unbounded. Consider the critical case where the denominator of (9) approaches zero from above. Substitute $m(r_{c})$ into (4) at $r_{c}^{-}$:

$$\frac{dP}{dr}\bigg|_{r_{c}^{-}}\to-\infty\quad\text{as}\quad 1-\frac{8\pi\varepsilon_{q}r_{c}^{2}}{3}\to 0^{+}$$

This divergence violates equilibrium if $\Delta\varepsilon$ is large enough. To find the threshold, equate the unstable limit's gradient to the hadronic phase. Using (3) and the critical point:

$$\frac{8\pi\varepsilon_{q}r_{c}^{2}}{3}=1\implies\varepsilon_{q}=\frac{3}{8\pi r_{c}^{2}}$$

At $\Delta\varepsilon=\dfrac{\varepsilon_{h}+3P_{c}}{2}$, the two gradients yield consistent finite central pressures. If $\Delta\varepsilon>\dfrac{\varepsilon_{h}+3P_{c}}{2}$, then:

$$\frac{dP}{dr}\bigg|_{r_{c}^{-}}<-\frac{[\varepsilon_{q}+P_{c}]Q}{G}$$

with $Q/G\to\infty$, so $\frac{dP}{dr}\bigg|_{r_{c}^{-}}\to-\infty$, leading to $P_{0}\to\infty$ or negative pressure in (8), violating physicality and equilibrium. Thus, gravitational instability occurs.

Logical Nexus (Proof-Based Problem)

1. Set up the TOV equations for static spherical symmetry, including the pressure gradient equation and mass continuity equation. (10 points)
2. Apply mechanical equilibrium at the phase transition radius $r_{c}$: $P_{h}(\mu_{c})=P_{q}(\mu_{c})=P_{c}$, with a discontinuity in energy density $\Delta\varepsilon=\varepsilon_{q}(\mu_{c})-\varepsilon_{h}(\mu_{c})$. (10 points)
3. Derive the pressure gradients just below ($r_{c}^{-}$) and above ($r_{c}^{+}$) the transition using the TOV equation, showing $\frac{dP}{dr}\big|_{r_{c}^{-}}=-\frac{[\varepsilon_{q}+P_{c}]Q}{G}$ and $\frac{dP}{dr}\big|_{r_{c}^{+}}=-\frac{[\varepsilon_{h}+P_{c}]Q}{G}$, where $Q=m(r_{c})+4\pi r_{c}^{3}P_{c}>0$ and $G=r_{c}^{2}\left(1-\frac{2m(r_{c})}{r_{c}}\right)>0$. (20 points)
4. Recognize that $\frac{dP}{dr}\big|_{r_{c}^{-}}<\frac{dP}{dr}\big|_{r_{c}^{+}}<0$ due to $\varepsilon_{q}>\varepsilon_{h}$ ($\Delta\varepsilon>0$) and $Q/G>0$, indicating a steeper gradient in the quark phase. (10 points)
5. Assume constant $\varepsilon_{q}$ in a thin quark core near $r_{c}$ and solve the simplified pressure equation $\frac{dP}{dr}=-K(\varepsilon_{q}+P)$ with $K=Q/G$, yielding $P(r)=(\varepsilon_{q}+P_{0})e^{-Kr}-\varepsilon_{q}$, where $P_{0}$ is central pressure. (15 points)
6. Express $K$ in terms of $\varepsilon_{q}$ and $r_{c}$ using $m(r_{c})=\frac{4\pi}{3}\varepsilon_{q}r_{c}^{3}$ from mass continuity, resulting in $K=\frac{\frac{4\pi}{3}\varepsilon_{q}r_{c}^{3}+4\pi r_{c}^{3}P_{c}}{r_{c}^{2}\left(1-\frac{8\pi\varepsilon_{q}r_{c}^{2}}{3}\right)}$. (10 points)
7. Identify that $K\to\infty$ when $\frac{8\pi\varepsilon_{q}r_{c}^{2}}{3}\to 1^{-}$, causing $\frac{dP}{dr}\big|_{r_{c}^{-}}\to-\infty$ and violating equilibrium, as $P_{0}\to\infty$ or becomes unphysical. (10 points)
8. Enforce causality ($0\leq\frac{dP}{d\varepsilon}\leq 1$) to ensure this divergence condition is reached only when $\varepsilon_{q}$ satisfies $\varepsilon_{q}=\frac{3}{8\pi r_{c}^{2}}$ at criticality. (5 points)
9. Substitute the critical $\varepsilon_{q}$ into the gradient expressions and equate the instability threshold to the discontinuity condition, demonstrating $\Delta\varepsilon>\frac{\varepsilon_{h}+3P_{c}}{2}$ implies divergent pressure gradients incompatible with equilibrium. (10 points)

Appendix K Case Studies

To illustrate the evaluative role of our three logicality metrics more intuitively, we provide below a high-scoring case and three low-scoring cases, one for each dimension. Due to space constraints, and to make the logicality of the reasoning process easier to follow, we summarize the LLM's reasoning into a sequence of reasoning steps.

Question

In the study of topological defects in particle packings on a spherical surface, the number of excess disclination pairs $N_{d}$ follows the scaling law $N_{d}=\alpha\sqrt{N}$, where $N$ is the number of particles and $\alpha$ is a dimensionless constant specific to the lattice type (hexagonal or square). For a hexagonal lattice with $N=3600$ particles, a simulation yields $N_{d}=80$. For a square lattice with $N=4900$ particles, a simulation yields $N_{d}=245$. The theoretical prediction for the ratio of $\alpha_{\text{Hex}}/\alpha_{\text{Sq}}$ is given by:

$$\frac{\alpha_{\text{Hex}}}{\alpha_{\text{Sq}}}=\frac{3^{1/4}}{2}\cdot\frac{\beta_{\text{Hex}}}{\beta_{\text{Sq}}}$$

where $\beta_{\text{Hex}}$ and $\beta_{\text{Sq}}$ are constants from a generic lattice model, and $\beta_{\text{Hex}}/\beta_{\text{Sq}}=0.544$.
Calculate the percentage error of the experimentally determined ratio αHex/αSq\\alpha\_\{\\text\{Hex\}\}/\\alpha\_\{\\text\{Sq\}\} relative to the theoretical prediction\. Provide your answer as a percentage to three significant figures\. Logical Nexuses 1\. Calculate αHex\\alpha\_\{\\text\{Hex\}\} for the hexagonal lattice using Nd=αNN\_\{d\}=\\alpha\\sqrt\{N\} with N=3600N=3600 and Nd=80N\_\{d\}=80: 3600=60\\sqrt\{3600\}=60, so αHex=80/60=4/3≈1\.333\\alpha\_\{\\text\{Hex\}\}=80/60=4/3\\approx 1\.333\. 2\. Calculate αSq\\alpha\_\{\\text\{Sq\}\} for the square lattice using Nd=αNN\_\{d\}=\\alpha\\sqrt\{N\} with N=4900N=4900 and Nd=245N\_\{d\}=245: 4900=70\\sqrt\{4900\}=70, so αSq=245/70=7/2=3\.500\\alpha\_\{\\text\{Sq\}\}=245/70=7/2=3\.500\. 3\. Determine the experimental ratio: αHex/αSq=\(4/3\)/\(7/2\)=8/21≈0\.381\\alpha\_\{\\text\{Hex\}\}/\\alpha\_\{\\text\{Sq\}\}=\(4/3\)/\(7/2\)=8/21\\approx 0\.381\. 4\. Compute the theoretical ratio: αHex/αSq=\(31/4/2\)⋅\(βHex/βSq\)=\(31/4/2\)×0\.544\\alpha\_\{\\text\{Hex\}\}/\\alpha\_\{\\text\{Sq\}\}=\(3^\{1/4\}/2\)\\cdot\(\\beta\_\{\\text\{Hex\}\}/\\beta\_\{\\text\{Sq\}\}\)=\(3^\{1/4\}/2\)\\times 0\.544\. First, 31/2=3≈1\.7323^\{1/2\}=\\sqrt\{3\}\\approx 1\.732, then 31/4=1\.732≈1\.3163^\{1/4\}=\\sqrt\{1\.732\}\\approx 1\.316\. 5\. Complete the theoretical ratio calculation: \(1\.316/2\)×0\.544≈0\.658×0\.544=0\.358\(1\.316/2\)\\times 0\.544\\approx 0\.658\\times 0\.544=0\.358\. 6\. Find the absolute difference: \|0\.381−0\.358\|=0\.023\\left\|0\.381\-0\.358\\right\|=0\.023\. 7\. Calculate the relative error: 0\.023/0\.358≈0\.0640\.023/0\.358\\approx 0\.064\. 8\. Convert to a percentage: 0\.064189×100%=6\.419%≈6\.42%0\.064189\\times 100\\%=6\.419\\%\\approx 6\.42\\%\. Good Case 1\. 
Recognize that the main task is to extract the coefficients αHex\\alpha\_\{\\text\{Hex\}\} and αSq\\alpha\_\{\\text\{Sq\}\} from the simulations, form their ratio, and compare it with the theoretical prediction in order to obtain a percentage error\. 2\. Start from the scaling relation for both lattices, Nd=αNN\_\{d\}=\\alpha\\sqrt\{N\}, and rearrange it to isolate the lattice\-dependent parameter as α=Nd/N\\alpha=N\_\{d\}/\\sqrt\{N\} for each case\. 3\. For the hexagonal lattice, compute the square root of the particle number: NHex=3600=60\\sqrt\{N\_\{\\text\{Hex\}\}\}=\\sqrt\{3600\}=60, which will be used to normalize the defect count\. 4\. Use this value to estimate the hexagonal coefficient as αHex=80/60≈1\.33\\alpha\_\{\\text\{Hex\}\}=80/60\\approx 1\.33, keeping a few significant figures to avoid premature rounding\. 5\. For the square lattice, similarly evaluate NSq=4900=70\\sqrt\{N\_\{\\text\{Sq\}\}\}=\\sqrt\{4900\}=70, providing the appropriate scale for the defect number in that geometry\. 6\. Obtain the square\-lattice coefficient via αSq=245/70=3\.50\\alpha\_\{\\text\{Sq\}\}=245/70=3\.50, again retaining sufficient precision for the later ratio calculation\. 7\. Form the experimentally determined ratio of the two lattice coefficients as rexp=αHexαSq≈1\.333\.50≈0\.381,r\_\{\\text\{exp\}\}=\\frac\{\\alpha\_\{\\text\{Hex\}\}\}\{\\alpha\_\{\\text\{Sq\}\}\}\\approx\\frac\{1\.33\}\{3\.50\}\\approx 0\.381, which quantifies the relative magnitude of excess disclinations between the two lattices\. 8\. Turn to the theoretical expression, αHexαSq=31/42βHexβSq,\\frac\{\\alpha\_\{\\text\{Hex\}\}\}\{\\alpha\_\{\\text\{Sq\}\}\}=\\frac\{3^\{1/4\}\}\{2\}\\,\\frac\{\\beta\_\{\\text\{Hex\}\}\}\{\\beta\_\{\\text\{Sq\}\}\}, and use the given ratio βHex/βSq=0\.544\\beta\_\{\\text\{Hex\}\}/\\beta\_\{\\text\{Sq\}\}=0\.544 as input to the prediction\. 9\. 
Estimate the factor 31/43^\{1/4\} numerically \(for instance via 31/4≈1\.3163^\{1/4\}\\approx 1\.316\), then divide by 2 and multiply by 0\.5440\.544 to obtain the theoretical ratio rth≈0\.358r\_\{\\text\{th\}\}\\approx 0\.358\. 10\. Quantify the discrepancy between simulation and theory by computing the absolute difference Δr=\|rexp−rth\|≈\|0\.381−0\.358\|≈0\.023\\Delta r=\\lvert r\_\{\\text\{exp\}\}\-r\_\{\\text\{th\}\}\\rvert\\approx\\lvert 0\.381\-0\.358\\rvert\\approx 0\.023\. 11\. Convert this discrepancy into a relative error by normalizing with respect to the theoretical value: ε=Δr/rth≈0\.023/0\.358≈0\.064\\varepsilon=\\Delta r/r\_\{\\text\{th\}\}\\approx 0\.023/0\.358\\approx 0\.064\. 12\. Express the relative error as a percentage by multiplying by 100%100\\%, giving a final percentage error of approximately 6\.42%6\.42\\% when rounded to three significant figures, and conclude that the simulated ratio is within a few percent of the theoretical prediction\. Bad Case 1: Low logical fidelity 1\. Note that the problem asks for a comparison between an experimentally inferred ratio αHex/αSq\\alpha\_\{\\text\{Hex\}\}/\\alpha\_\{\\text\{Sq\}\} and its theoretical counterpart, but decide to adopt a simplified treatment of the scaling law in order to "make the algebra easier"\. 2\. Take the original relation Nd=αNN\_\{d\}=\\alpha\\sqrt\{N\} and, assuming that for large NN the dependence on N\\sqrt\{N\} can be approximated as linear in NN, replace it by an effective rule α≈Nd/N\\alpha\\approx N\_\{d\}/N for estimating the lattice parameters\. 3\. Apply this simplified formula to the hexagonal lattice to obtain αHex≈Nd,Hex/NHex=80/3600≈0\.0222\\alpha\_\{\\text\{Hex\}\}\\approx N\_\{d,\\text\{Hex\}\}/N\_\{\\text\{Hex\}\}=80/3600\\approx 0\.0222, treating this as the effective coefficient\. 4\. 
Use the same approximation for the square lattice, giving αSq≈Nd,Sq/NSq=245/4900≈0\.0500\\alpha\_\{\\text\{Sq\}\}\\approx N\_\{d,\\text\{Sq\}\}/N\_\{\\text\{Sq\}\}=245/4900\\approx 0\.0500, thereby defining a second effective coefficient\. 5\. Form the experimental ratio directly from these approximate coefficients: rexp≈αHexαSq≈0\.02220\.0500≈0\.444,r\_\{\\text\{exp\}\}\\approx\\frac\{\\alpha\_\{\\text\{Hex\}\}\}\{\\alpha\_\{\\text\{Sq\}\}\}\\approx\\frac\{0\.0222\}\{0\.0500\}\\approx 0\.444, assuming this still captures the essential trend between the two lattices\. 6\. Turn to the theoretical formula αHexαSq=31/42βHexβSq,\\frac\{\\alpha\_\{\\text\{Hex\}\}\}\{\\alpha\_\{\\text\{Sq\}\}\}=\\frac\{3^\{1/4\}\}\{2\}\\,\\frac\{\\beta\_\{\\text\{Hex\}\}\}\{\\beta\_\{\\text\{Sq\}\}\}, but, for simplicity, interpret the factor 31/43^\{1/4\} as if it were just 3\\sqrt\{3\}, arguing that the precise exponent will not dramatically change the outcome\. 7\. Approximate 3≈1\.73\\sqrt\{3\}\\approx 1\.73 and thus take 31/4≈1\.733^\{1/4\}\\approx 1\.73, ignoring the distinction between the square root and the fourth root in the numerical evaluation\. 8\. Divide this value by 2 to find the prefactor 31/4/2≈1\.73/2≈0\.8663^\{1/4\}/2\\approx 1\.73/2\\approx 0\.866, which is then used in place of the exact value\. 9\. Multiply the prefactor by the given β\\beta\-ratio to obtain the theoretical prediction: rth≈0\.866×0\.544≈0\.471,r\_\{\\text\{th\}\}\\approx 0\.866\\times 0\.544\\approx 0\.471, and regard this as the model’s expected ratio\. 10\. Compare the approximate experimental ratio and the theoretical one by computing the absolute difference Δr=\|0\.444−0\.471\|≈0\.027\\Delta r=\\lvert 0\.444\-0\.471\\rvert\\approx 0\.027, treating this as the deviation between simulation and theory\. 11\. 
Evaluate the relative error with respect to the theoretical value as ε=Δr/rth≈0\.027/0\.471≈0\.057\\varepsilon=\\Delta r/r\_\{\\text\{th\}\}\\approx 0\.027/0\.471\\approx 0\.057, which is then interpreted as the fractional discrepancy\. 12\. Convert this fractional discrepancy into a percentage error via ε×100%≈5\.7%\\varepsilon\\times 100\\%\\approx 5\.7\\%, concluding \(incorrectly\) that the simulations and theory agree at roughly the few\-percent level despite the inconsistent use of the scaling law and the exponent in the theoretical expression\. Bad Case 2: Low causal connection 1\. Begin by identifying the target quantity as the percentage error between the experimentally inferred ratio αHex/αSq\\alpha\_\{\\text\{Hex\}\}/\\alpha\_\{\\text\{Sq\}\} and the theoretical prediction, and write down the general expression percent error=\|rexp−rth\|rth×100%,\\text\{percent error\}=\\frac\{\\lvert r\_\{\\text\{exp\}\}\-r\_\{\\text\{th\}\}\\rvert\}\{r\_\{\\text\{th\}\}\}\\times 100\\%, where rexpr\_\{\\text\{exp\}\} and rthr\_\{\\text\{th\}\} denote the experimental and theoretical ratios, respectively\. 2\. Before actually computing either ratio, reason qualitatively that both lattices obey the same scaling law Nd=αNN\_\{d\}=\\alpha\\sqrt\{N\} and that all given numerical factors \(defect counts, particle numbers, and β\\beta\-ratios\) are of order unity, and therefore anticipate that the percentage error should be relatively small, plausibly well below 10%10\\%\. 3\. Treat this qualitative expectation of a “small” error as a provisional conclusion and aim to verify it by working out rexpr\_\{\\text\{exp\}\} and rthr\_\{\\text\{th\}\} more explicitly, rather than deriving the size of the error purely from detailed calculation\. 4\. 
Turn first to the theoretical side and recall that the model predicts $\frac{\alpha_{\text{Hex}}}{\alpha_{\text{Sq}}}=\frac{3^{1/4}}{2}\,\frac{\beta_{\text{Hex}}}{\beta_{\text{Sq}}}$, with the given input $\beta_{\text{Hex}}/\beta_{\text{Sq}}=0.544$, so that once $3^{1/4}$ is evaluated, the theoretical ratio $r_{\text{th}}$ can be obtained.
5. Estimate $3^{1/4}$ numerically (for instance by recalling that it lies between $1$ and $\sqrt{2}$ and taking $3^{1/4}\approx 1.32$ as a reasonable approximation), and then compute the theoretical ratio as $r_{\text{th}}\approx\frac{1.32}{2}\times 0.544\approx 0.36$, which provides a concrete value against which to compare the experimental result.
6. Only after having a numerical estimate for $r_{\text{th}}$, go back to the simulation data and use the scaling law $N_{d}=\alpha\sqrt{N}$ to extract the coefficient for the hexagonal lattice as $\alpha_{\text{Hex}}=\frac{N_{d,\text{Hex}}}{\sqrt{N_{\text{Hex}}}}=\frac{80}{\sqrt{3600}}=\frac{80}{60}\approx 1.33$.
7. Apply the same procedure to the square lattice, computing $\alpha_{\text{Sq}}=\frac{N_{d,\text{Sq}}}{\sqrt{N_{\text{Sq}}}}=\frac{245}{\sqrt{4900}}=\frac{245}{70}=3.50$, thereby obtaining the second coefficient needed for the experimental ratio.
8. Form the experimental ratio only at this stage, using the two coefficients, $r_{\text{exp}}=\frac{\alpha_{\text{Hex}}}{\alpha_{\text{Sq}}}\approx\frac{1.33}{3.50}\approx 0.38$, and note that this value is numerically close to the theoretical estimate $r_{\text{th}}\approx 0.36$ found earlier.
9. Substitute these values into the percentage error formula, $\text{percent error}=\frac{\lvert 0.38-0.36\rvert}{0.36}\times 100\%$, but focus mainly on the fact that the numerator is small compared with the denominator, rather than computing the fraction precisely.
10. Argue that since the difference $\lvert 0.38-0.36\rvert$ is roughly of the order $10^{-2}$ while $0.36$ is of order $10^{-1}$, the resulting percentage error must be on the order of a few percent, which is broadly consistent with the initial expectation that the error would be well below $10\%$.
11. On this basis, conclude that the simulations and the theoretical prediction agree to within a small percentage error and accept the qualitative estimate (“a few percent, comfortably under $10\%$”) as sufficiently accurate, without revisiting the earlier provisional assumption or computing the exact percentage value.
12. Note that, although all individual computations (for $\alpha_{\text{Hex}}$, $\alpha_{\text{Sq}}$, $r_{\text{exp}}$, and $r_{\text{th}}$) are consistent with the underlying physics, the logical order of reasoning is inverted: a conclusion about the error size is adopted before the essential quantities are actually derived and is then merely checked, rather than being logically deduced from the detailed calculations.

Bad Case 3: Low inferential progress
1. Start by recognizing that the goal is to compare the experimentally inferred ratio $\alpha_{\text{Hex}}/\alpha_{\text{Sq}}$ with its theoretical prediction, and then compute the percentage error using the usual form $\bigl|r_{\text{exp}}-r_{\text{th}}\bigr|/r_{\text{th}}\times 100\%$.
2. Use the scaling law $N_{d}=\alpha\sqrt{N}$ to extract the coefficients for each lattice from the simulation data, noting that $\alpha=N_{d}/\sqrt{N}$ follows directly from rearranging the relation.
3. For the hexagonal lattice, compute $\sqrt{N_{\text{Hex}}}=\sqrt{3600}=60$ and obtain $\alpha_{\text{Hex}}=80/60\approx 1.33$ as the simulation-based coefficient.
4. For the square lattice, compute $\sqrt{N_{\text{Sq}}}=\sqrt{4900}=70$ and obtain $\alpha_{\text{Sq}}=245/70=3.50$ as the corresponding coefficient.
5. Form the experimental ratio $r_{\text{exp}}=\alpha_{\text{Hex}}/\alpha_{\text{Sq}}\approx 1.33/3.50\approx 0.38$, and note that it seems to be a number modestly smaller than $0.5$, which will later be compared against the theoretical value.
6. Turn to the theoretical prediction $r_{\text{th}}=\frac{\alpha_{\text{Hex}}}{\alpha_{\text{Sq}}}=\frac{3^{1/4}}{2}\,\frac{\beta_{\text{Hex}}}{\beta_{\text{Sq}}}$, and decide that the crucial difficulty is obtaining a “sufficiently accurate” value for $3^{1/4}$ before proceeding any further.
7. Begin by approximating $3^{1/4}$ via nested square roots, writing $3^{1/4}=\sqrt{\sqrt{3}}$, then estimate $\sqrt{3}\approx 1.7$ and $\sqrt{1.7}\approx 1.30$, but immediately worry that this may not be accurate enough for a precise percentage error.
8. Attempt to refine the value using a Taylor or binomial expansion around $x=1$, expressing $3^{1/4}=(1+2)^{1/4}$ and sketching the series for $(1+x)^{1/4}$, but then realize that actually working out several terms numerically by hand is cumbersome and error-prone.
9. Attempt to refine the value using a Taylor or binomial expansion …
10. Attempt to refine the value using a Taylor or binomial expansion …
11. Attempt to refine the value using a Taylor or binomial expansion …
12. …
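
The calculation that both bad cases above approach but never finish can be carried out directly. The following is a minimal Python sketch, assuming only the values quoted in the steps above ($N_{d,\text{Hex}}=80$, $N_{\text{Hex}}=3600$, $N_{d,\text{Sq}}=245$, $N_{\text{Sq}}=4900$, $\beta_{\text{Hex}}/\beta_{\text{Sq}}=0.544$). The function names are illustrative and are not taken from the paper or its code release.

```python
import math

def scaling_coefficient(n_defects: float, n_sites: float) -> float:
    """Coefficient alpha from the scaling law N_d = alpha * sqrt(N)."""
    return n_defects / math.sqrt(n_sites)

def theoretical_ratio(beta_ratio: float) -> float:
    """Predicted alpha_Hex / alpha_Sq = (3^(1/4) / 2) * (beta_Hex / beta_Sq)."""
    return 3 ** 0.25 / 2 * beta_ratio

def percent_error(r_exp: float, r_th: float) -> float:
    """Relative deviation |r_exp - r_th| / r_th, expressed in percent."""
    return abs(r_exp - r_th) / r_th * 100

# Values quoted in the worked steps (assumed to be the problem's inputs).
alpha_hex = scaling_coefficient(80, 3600)   # 80 / 60
alpha_sq = scaling_coefficient(245, 4900)   # 245 / 70
r_exp = alpha_hex / alpha_sq
r_th = theoretical_ratio(0.544)
err = percent_error(r_exp, r_th)

print(f"alpha_Hex = {alpha_hex:.3f}, alpha_Sq = {alpha_sq:.3f}")
print(f"r_exp = {r_exp:.3f}, r_th = {r_th:.3f}, percent error = {err:.1f}%")
```

With unrounded intermediate values this gives a percentage error of roughly 6.4%, whereas the rounded figures 0.38 and 0.36 used in the steps give about 5.6%. The answer is therefore sensitive to rounding, which is why stopping at a qualitative "a few percent" or looping on the precision of $3^{1/4}$ leaves the question unanswered.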