When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning

arXiv cs.AI Papers

Summary

This paper introduces SCALAR, a structured critic-actor loop framework, to evaluate how different interaction patterns between AI agents improve reasoning in theoretical physics problems.

arXiv:2605.06772v1 Announce Type: new Abstract: As large language models (LLMs) show increasing promise on research-level physics reasoning tasks and agentic AI becomes more common, a practical question emerges: How does the interaction between researchers and agents affect the results? We study this using SCALAR (Structured Critic--Actor Loop for AI Reasoning), an Actor--Critic--Judge pipeline applied to quantum field theory and string theory problems. The Actor proposes solutions, the Critic provides iterative feedback, and an independent Judge evaluates the transcript against reference solutions. We vary the Actor persona, the Critic feedback strategy, and the Actor model family and scale. Multi-turn dialogue improves over single-shot attempts throughout, but both the mechanism of improvement and the value of different prompting choices depend strongly on the Actor--Critic pairing. Increasing the scale within one model family (e.g. from the 8B-parameter DeepSeek-R1 variant to DeepSeek-R1 70B) improves some easier-problem behavior, but does not remove the hardest bottleneck we observe. Critic feedback strategy matters most clearly in the asymmetric Actor--Critic setting (e.g., a lightweight Haiku Actor guided by a stronger Sonnet Critic), where constructive feedback improves mean-score outcomes. In same-family Actor--Critic settings, strategy effects are weaker: lenient feedback is sometimes favored, while strict and adversarial feedback are not beneficial. Taken together, SCALAR provides a controlled testbed for evaluating which interaction structures help or hinder AI-driven scientific discovery.

Cached at: 05/11/26, 07:06 AM

# When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic–Actor Loop for Agentic Reasoning
Source: [https://arxiv.org/html/2605.06772](https://arxiv.org/html/2605.06772)
Constantinos Papageorgakis, Alexander G\. Stapleton, Sokratis Trifinopoulos

###### Abstract

As large language models \(LLMs\) show increasing promise on research\-level physics reasoning tasks and agentic AI becomes more common, a practical question emerges: *How does the interaction between researchers and agents affect the results?* We study this using SCALAR \(Structured Critic–Actor Loop for AI Reasoning\), an Actor–Critic–Judge pipeline applied to quantum field theory and string theory problems\. The Actor proposes solutions, the Critic provides iterative feedback, and an independent Judge evaluates the transcript against reference solutions\. We vary the Actor persona, the Critic feedback strategy, and the Actor model family and scale\. Multi\-turn dialogue improves over single\-shot attempts throughout, but both the mechanism of improvement and the value of different prompting choices depend strongly on the Actor–Critic pairing\. Increasing the scale within one model family \(*e\.g\.*, from the 8B\-parameter DeepSeek\-R1 variant to DeepSeek\-R1 70B\) improves some easier\-problem behavior, but does not remove the hardest bottleneck we observe\. Critic feedback strategy matters most clearly in the asymmetric Actor–Critic setting \(*e\.g\.*, a lightweight Haiku Actor guided by a stronger Sonnet Critic\), where constructive feedback improves mean\-score outcomes\. In same\-family Actor–Critic settings, strategy effects are weaker: lenient feedback is sometimes favored, while strict and adversarial feedback are not beneficial\. Taken together, SCALAR provides a controlled testbed for evaluating which interaction structures help or hinder AI\-driven scientific discovery\.

LLM agents, physics reasoning, multi\-agent systems, prompting strategies

CCTP\-2026\-7 CERN\-TH\-2026\-097 ITCP\-2026\-7 QMUL\-PH\-26\-15

## 1 Introduction

Large language models \(LLMs\) and LLM\-based agents are a new type of interlocutor in the dialogue that drives the scientific process\. They can reason, make decisions \(rather than merely perform algorithmic computations\), and even exhibit greater adaptability to iterative prompts than to one\-shot queries\. This behavior is closer to that of a human collaborator than that of any previous computational tool\. Early evidence of new contributions to theoretical physics \(Guevara et al\., [2026](https://arxiv.org/html/2605.06772#bib.bib19); Schwartz, [2026](https://arxiv.org/html/2605.06772#bib.bib47); Shih, [2026b](https://arxiv.org/html/2605.06772#bib.bib46), [a](https://arxiv.org/html/2605.06772#bib.bib44); Lu et al\., [2025](https://arxiv.org/html/2605.06772#bib.bib22)\), mathematical discovery \(Romera\-Paredes and others, [2024](https://arxiv.org/html/2605.06772#bib.bib11); Novikov et al\., [2025](https://arxiv.org/html/2605.06772#bib.bib27)\), and agentic scientific workflows in high\-energy physics \(Plehn et al\., [2026](https://arxiv.org/html/2605.06772#bib.bib28); Agrawal et al\., [2026](https://arxiv.org/html/2605.06772#bib.bib29)\) is very encouraging\. However, how physicists should structure this collaboration is an open question: in general, multi\-turn interactions exhibit sticky error states and capability degradations \(Liang et al\., [2024](https://arxiv.org/html/2605.06772#bib.bib51); Laban and others, [2025](https://arxiv.org/html/2605.06772#bib.bib33); Zhang and others, [2025](https://arxiv.org/html/2605.06772#bib.bib37)\), although structured multi\-agent pooling can reduce hallucinations \(Till and others, [2025](https://arxiv.org/html/2605.06772#bib.bib43)\)\.

We probe these dynamics with SCALAR, a deliberately pedagogical *Actor–Critic–Judge* pipeline: one LLM agent \(comparable to a student\) plays the *Actor* attempting a graduate\-level quantum\-field\-theory \(QFT\) or string theory problem\. The *Critic* LLM \(analogous to a teaching assistant\) then delivers formative feedback mid\-task, before the final agent playing the *Judge* \(teacher\) sets the standard against which the work is ultimately evaluated; for prior discussions of LLMs as Judges see, e\.g\., \(Zheng et al\., [2023](https://arxiv.org/html/2605.06772#bib.bib58)\)\. Such a pedagogical interpretation completes a scaffolding loop \(Wood et al\., [1976](https://arxiv.org/html/2605.06772#bib.bib5); Vygotsky, [1978](https://arxiv.org/html/2605.06772#bib.bib4)\)\. Multi\-agent approaches to LLM reasoning have shown promise, from debate frameworks \(Du et al\., [2024](https://arxiv.org/html/2605.06772#bib.bib14); Estornell et al\., [2024](https://arxiv.org/html/2605.06772#bib.bib34)\) to specialized refinement agents for physics \(Jaiswal and others, [2024](https://arxiv.org/html/2605.06772#bib.bib32)\) and interpretable AI\-scientist collaboration \(Xu et al\., [2025](https://arxiv.org/html/2605.06772#bib.bib35)\)\. Pre\-prompting strategies have also been shown to make measurable differences to the perceived quality of the LLM’s output \(Kim et al\., [2024](https://arxiv.org/html/2605.06772#bib.bib50)\), and recent work on instructional distraction shows that models can be sensitive to how task instructions are embedded in surrounding text \(Hwang et al\., [2025](https://arxiv.org/html/2605.06772#bib.bib30)\)\. To our knowledge, however, no prior work has systematically studied which *interaction strategies* between human and AI lead to the best outcomes in the field of theoretical physics\.

Our motivation is threefold\. First, physicists are already consulting LLMs for their day\-to\-day calculations, so we need to evaluate the whole interaction, not only the first answer: how reliably these tools converge, how they respond to challenge, and where they fail\. This is a prerequisite to calibrating that usage in a regime not assessed by single\-turn benchmarks \(Chung and others, [2025](https://arxiv.org/html/2605.06772#bib.bib21); Gao and others, [2025](https://arxiv.org/html/2605.06772#bib.bib31); Zhang et al\., [2025](https://arxiv.org/html/2605.06772#bib.bib41)\)\. Second, theoretical research is moving towards workflows in which the physicist supervises a collection of AI agents, rather than interacting with one model at a time\. In our automated benchmark this supervisory role is stylized as an external Judge, while moment\-to\-moment Critic feedback is delegated to an AI teaching assistant\. In open\-ended use, the physicist may instead occupy part of the Critic role directly by probing claims, supplying consistency checks, and deciding when the exchange has met the required standard\. Understanding which scaffolding styles help which AI actors reach correct solutions is therefore a prerequisite for making either supervisory mode productive\. Third, SCALAR gives us a controlled testbed in which to hypothesis\-test widely repeated prompt\-engineering claims \(*e\.g\.*, the report that “assigning the model a persona” can swing performance by tens of percentage points; Gupta and others, [2024](https://arxiv.org/html/2605.06772#bib.bib36)\) on current\-generation models and on reasoning\-heavy scientific tasks\.

To study these questions we introduce independent axes of variation for each party\. The Actor is varied through an *Actor persona*, defined as the combination of an *expertise level* \(novice, expert, or unspecified default\) and a *reasoning style* \(meticulous, physical, skeptical, or left unspecified\)\. This is the kind of Actor pre\-prompting a physicist might use when asking a model to approach a problem in a particular way\. The Critic is varied across a range of *Critic feedback strategies* \(from lenient and pedagogical to strict and adversarial, plus an unspecified default\), capturing how the assistant intervenes\. During the dialogue, the Judge enters only as a reference\-backed evaluator of correctness, silent with respect to the Actor–Critic exchange\. Stored transcripts can then be re\-scored by additional Judges, letting us separate interaction effects from Judge\-specific scoring effects\. This defines the role of the Judge as an authority that sets the standard of the exchange without actively participating in it\. Each persona–strategy configuration is sampled several times on graduate\-level QFT and string theory problems, allowing us to estimate trends across repeated dialogues rather than relying on single\-run anecdotes\. In addition to endpoint scores and convergence rates, we use per\-turn *score\-update curves* as a compact diagnostic of when Critic feedback continues to move the Actor and when a dialogue appears to enter a low\-drift regime\.

While SCALAR is set up in a theoretical physics context, it can be straightforwardly extended to other domains\. The lessons we are attempting to extract about LLM pre\-prompting and interaction can eventually inform the optimization of multi\-agent setups, where the agent persona and skill set play an important role\. More generally, we view our analysis as a step towards more efficient AI\-assisted open\-ended research; the interaction patterns identified here provide a vocabulary for that future work\.

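\[Figure 1 flowchart: Problem and Persona/Strategy Setup → Actor \(form solution\) → Critic \(feedback \+ error checks\) → Judge \(score \+ verdict\) → Pass? If yes, save run and end\. If no, Early stop? If yes, save run and end; if no, iterate through the feedback loop back to the Actor\.\]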

Figure 1: The SCALAR Actor–Critic–Judge pipeline\. The Actor and the Critic engage in iterative dialogue, while an independent evaluator \(Judge\) scores the Actor’s current solution against a ground truth\. The Actor’s output is shaped by the Actor persona; the Critic’s feedback is shaped by the Critic feedback strategy\.

## 2 Methods

### 2\.1 Roles, Feedback Strategies and Pipeline

We begin the discussion of our methods by describing the SCALAR Actor personas and Critic feedback strategies, both of which are implemented through pre\-prompting\. The Actor persona is factored into two orthogonal dimensions\. The first sets the expertise level: *expert* \(“you are an expert in theoretical physics”\), *novice* \(“you are a student learning QFT”\), or *default* \(no expertise instruction\)\. The second covers different reasoning styles, which shape the approach the Actor takes in the calculations: *meticulous* \(emphasizing careful algebra and cross\-checks\), *physical* \(prioritizing physical intuition and limiting cases\), *skeptical* \(questioning assumptions at each step\), or *default* with no style instruction\. The full set of combinations yields $3\times 4=12$ Actor personas\.

The Critic’s feedback strategy controls the tone of the feedback provided to the Actor: *adversarial* \(aggressively challenging claims\), *strict* \(precise error flagging\), *pedagogical* \(Socratic questioning\), *lenient* \(gentle suggestions accepting partial progress\), and *default* with no stylistic emphasis\. The full prompt texts are given in [Appendix A](https://arxiv.org/html/2605.06772#A1)\. The LLM assigned to each role is an additional degree of freedom and our model choices are described in [Section 2\.3](https://arxiv.org/html/2605.06772#S2.SS3)\.
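
The factorial design described above can be enumerated in a few lines. The following is a minimal sketch, assuming the personas and strategies are represented by plain string labels; the variable names are illustrative and are not taken from the SCALAR code.

```python
from itertools import product

# Labels follow the text; the string representation is an assumption.
EXPERTISE = ["expert", "novice", "default"]
STYLES = ["meticulous", "physical", "skeptical", "default"]
STRATEGIES = ["adversarial", "strict", "pedagogical", "lenient", "default"]

# An Actor persona is an (expertise level, reasoning style) pair.
personas = list(product(EXPERTISE, STYLES))

# A configuration pairs one Actor persona with one Critic feedback strategy.
configurations = list(product(personas, STRATEGIES))

print(len(personas))        # 12
print(len(configurations))  # 60
```

Crossing the 12 personas with the 5 Critic feedback strategies gives the 60 persona–strategy cells that the experiments sweep over.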

With the pre\-prompting fixed, SCALAR proceeds as follows \([Figure 1](https://arxiv.org/html/2605.06772#S1.F1)\)\. Given a problem statement and persona–strategy configuration, the Actor produces an initial solution attempt\. The Critic, who has access to a reference solution but is instructed not to disclose it, then reviews this attempt, flags errors, and delivers structured feedback\. The Judge scores the Actor’s work against the reference solution and issues a pass/fail verdict\. If the Actor passes, or if an early\-stopping criterion is met \(iteration limit or score stagnation\), the run is saved and terminated\. Otherwise the Critic’s feedback is passed back to the Actor for a further attempt and the loop repeats until the stopping criteria are met\. For analysis, the recorded state after each Actor turn consists of the fixed experimental settings and the dialogue record available at that point\. Thus the recorded dialogue state is the natural Markov state of the generated dialogue: conditional on that state and the fixed configuration, the next role call is generated without invoking any additional recorded dialogue history\. In [Section 3](https://arxiv.org/html/2605.06772#S3), we analyze the Judge score as a scalar projection of this evolving Markov state\.
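
The control flow of the loop can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function `run_dialogue`, the `RunRecord` type, the callable signatures, and the stagnation threshold `stagnation_tol` are all assumptions introduced here, and the three roles are replaced by toy stand-ins so the block runs without any model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class RunRecord:
    """Judge scores recorded after each Actor turn, plus the final verdict."""
    scores: List[float] = field(default_factory=list)
    passed: bool = False


def run_dialogue(
    actor: Callable[[str, Optional[str]], str],
    critic: Callable[[str], str],
    judge: Callable[[str], Tuple[float, bool]],
    problem: str,
    max_iterations: int = 4,      # iteration limit quoted in Section 2.3
    stagnation_tol: float = 1.0,  # hypothetical stagnation threshold
) -> RunRecord:
    record = RunRecord()
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        solution = actor(problem, feedback)  # Actor attempts the problem
        feedback = critic(solution)          # Critic reviews the attempt
        score, passed = judge(solution)      # Judge scores; never seen by Actor
        record.scores.append(score)
        if passed:
            record.passed = True
            break
        stalled = (len(record.scores) >= 2
                   and abs(record.scores[-1] - record.scores[-2]) < stagnation_tol)
        if stalled:                          # early stop on score stagnation
            break
    return record


# Toy stand-ins for the three roles (not real model calls).
def toy_actor(problem: str, feedback: Optional[str]) -> str:
    attempt = 0 if feedback is None else int(feedback) + 1
    return str(attempt)


def toy_critic(solution: str) -> str:
    return solution  # echoes the attempt index as "feedback"


def toy_judge(solution: str) -> Tuple[float, bool]:
    score = 60.0 + 15.0 * int(solution)
    return score, score >= 80.0


record = run_dialogue(toy_actor, toy_critic, toy_judge, "toy problem")
print(record.scores, record.passed)  # [60.0, 75.0, 90.0] True
```

Note that the Judge's score only controls termination in this sketch; it is never passed to the Actor, which mirrors the statement that the Judge stays outside the Actor–Critic exchange.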

### 2\.2 Evaluation and Metrics

At each iteration $t$ the Judge scores the Actor’s current solution in six dimensions totaling $100$ points: correctness \($50$\), mathematical rigor \($10$\), logical flow \($10$\), justification quality \($10$\), completeness \($10$\), and physical consistency \($10$\)\. Let $s_t\in[0,100]$ denote the total score at turn $t$\. Here lowercase $t$ indexes turns, while uppercase $T_i$ denotes the number of scored Actor states in run $i$\. A run with $T$ scored Actor states produces a sequence $s_0, s_1, \ldots, s_{T-1}$\. The Judge operates outside the Actor–Critic loop: its scores do not feed back into the dialogue, so the same transcripts can be re\-scored by different Judge LLMs to separate dialogue\-level effects from judge\-specific ones\. Below, a subscript $i$ denotes a run, and angle brackets denote arithmetic averages over the indicated set of runs\.

We report three evaluation metrics: two score\-based quantities measured in points out of $100$, and one rate that we report as a percentage of runs:

- Mean per\-turn score: $\bar{s}_i = T_i^{-1}\sum_{t=0}^{T_i-1} s_{i,t} \in [0,100]$\. This is the average per\-turn score across the whole dialogue\. When we quote a group mean $\bar{s}$, we mean the arithmetic mean of these run\-level quantities\. When we instead quote a *final score*, we say so explicitly and mean $s_{i,T_i-1}$ averaged over runs\.
- Gain: $g_i = s_{i,T_i-1} - s_{i,0} \in [-100,+100]$\. This is the endpoint improvement in score points out of $100$\. $g_i>0$ means the dialogue made the solution better\.
- Convergence rate: Let $r_i\in\{0,1\}$ denote whether the run converged\. A run is counted as converged when at least one iteration of the dialogue produces an Actor solution satisfying all three criteria: correctness $\geq 40$ \(*i\.e\.*, $\geq 80\%$ of the $50$\-point correctness rubric\), total Actor score $\geq 80$, and final\-answer equivalence with the reference, under the scoring Judge\. For runs whose original loop was driven by the same Judge that scores them, the passing iteration is also the terminal iteration; for re\-scored transcripts the loop length is fixed by the original Judge and the rescoring Judge can mark a non\-terminal iteration as passing\. For any group of runs $G$, the convergence rate is $R_G=\langle r_i\rangle_{i\in G}$ and is reported as a percentage\. See [Appendix D](https://arxiv.org/html/2605.06772#A4) for the formal rule\.
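
The three metrics above translate directly into code. The sketch below assumes each scored Actor state is stored as a small record with a correctness sub-score, a total score, and a final-answer equivalence flag; the `TurnScore` type and the example numbers are hypothetical and only illustrate the definitions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class TurnScore:
    correctness: float       # out of 50
    total: float             # out of 100
    answer_equivalent: bool  # final answer matches the reference


def mean_per_turn_score(run: List[TurnScore]) -> float:
    """Average total score over all scored Actor states of one run."""
    return mean(turn.total for turn in run)


def gain(run: List[TurnScore]) -> float:
    """Endpoint improvement: last total score minus first total score."""
    return run[-1].total - run[0].total


def converged(run: List[TurnScore]) -> bool:
    """True if any iteration meets all three convergence criteria."""
    return any(
        t.correctness >= 40 and t.total >= 80 and t.answer_equivalent
        for t in run
    )


def convergence_rate(runs: List[List[TurnScore]]) -> float:
    """Percentage of runs in the group that converged."""
    return 100.0 * mean(1.0 if converged(run) else 0.0 for run in runs)


# Synthetic runs, for illustration only.
rescued = [TurnScore(25, 55, False), TurnScore(38, 74, False), TurnScore(45, 86, True)]
stuck = [TurnScore(30, 60, False), TurnScore(30, 62, False)]

print(round(mean_per_turn_score(rescued), 2))  # 71.67
print(gain(rescued))                           # 31
print(convergence_rate([rescued, stuck]))      # 50.0
```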

For cross\-problem Critic feedback strategy comparisons we also use *problem\-normalized contrasts*\. Let $m(i)$, $p(i)$, and $c(i)$ denote run $i$’s Actor model setting \(Haiku, DS8B, or DS70B\), problem, and Critic feedback strategy\. For Actor model setting $m$ and Critic feedback strategy $c$, define

$$
\begin{aligned}
D_{\bar{s}}(m,c) &= \left\langle \bar{s}_i - \langle \bar{s}\rangle_{m,p(i)} \right\rangle_{m(i)=m,\,c(i)=c},\\
D_{R}(m,c) &= 100\left\langle r_i - \langle r\rangle_{m,p(i)} \right\rangle_{m(i)=m,\,c(i)=c}.
\end{aligned}
$$

Thus $D_{\bar{s}}$ is measured in score points and $D_R$ in percentage points\. These quantities are descriptive contrasts, not new raw scores: they ask whether a Critic feedback strategy sits above or below the local Actor–problem baseline\. Here “local” means the baseline for the same Actor model setting and the same problem, after pooling over personas and Critic feedback strategies; the contrast therefore removes the much larger differences in baseline problem difficulty before comparing Critic feedback strategies\.
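
A minimal sketch of the problem-normalized contrast, assuming each run is stored as a dictionary with `model`, `problem`, `strategy`, `s_bar`, and `r` fields. The field names, the helper `normalized_contrast`, and the toy values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def normalized_contrast(runs, model, strategy, key="s_bar", scale=1.0):
    """Mean of (run value - local baseline) over runs with this model and strategy.

    The local baseline is the mean of `key` over all runs sharing the same
    Actor model setting and problem, pooling personas and strategies.
    Use key="s_bar", scale=1 for D_sbar and key="r", scale=100 for D_R.
    """
    by_cell = defaultdict(list)
    for run in runs:
        by_cell[(run["model"], run["problem"])].append(run[key])
    baseline = {cell: mean(values) for cell, values in by_cell.items()}

    centered = [
        run[key] - baseline[(run["model"], run["problem"])]
        for run in runs
        if run["model"] == model and run["strategy"] == strategy
    ]
    return scale * mean(centered)


# Synthetic runs: one easy problem (P1) and one hard problem (P2).
runs = [
    {"model": "A", "problem": "P1", "strategy": "lenient", "s_bar": 80, "r": 1},
    {"model": "A", "problem": "P1", "strategy": "strict", "s_bar": 60, "r": 1},
    {"model": "A", "problem": "P2", "strategy": "lenient", "s_bar": 50, "r": 1},
    {"model": "A", "problem": "P2", "strategy": "strict", "s_bar": 30, "r": 0},
]

d_s = normalized_contrast(runs, "A", "lenient")                      # score points
d_r = normalized_contrast(runs, "A", "lenient", key="r", scale=100)  # percentage points
print(d_s, d_r)  # 10 25.0
```

In the toy data the two problems differ by 30 points in baseline score, yet the lenient strategy sits 10 points above the local baseline on both, which is the kind of within-problem shift the contrast is meant to isolate.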

### 2\.3 Problems and Models

We test SCALAR on three graduate\-level QFT and string theory problems drawn from standard textbooks\. *Peskin 2\.3* \(Peskin and Schroeder, [1995](https://arxiv.org/html/2605.06772#bib.bib2)\) requires computing the Feynman propagator at spacelike separation, yielding the modified Bessel function result $\frac{m}{4\pi^{2}r}K_{1}(mr)$\. *Peskin 4\.2* \(Peskin and Schroeder, [1995](https://arxiv.org/html/2605.06772#bib.bib2)\) asks for the lowest\-order lifetime of a heavy scalar particle decaying into two lighter scalars\. *Polchinski 2\.7* \(Polchinski, [1998](https://arxiv.org/html/2605.06772#bib.bib3)\) requires deriving Operator Product Expansion \(OPE\) coefficients in the free boson CFT, a core calculation in conformal field theory\. Each of these questions was specifically chosen to examine different facets of physics reasoning\. For example, Peskin 2\.3 and Polchinski 2\.7 are conceptually straightforward, longer calculations, requiring care with algebraic manipulations, but less physical reasoning and intuition\. Conversely, Peskin 4\.2 is a technically shorter question designed to test conceptual knowledge of decay rates in QFT\. For this work the problem set is intentionally chosen to be small, allowing us to deploy many strategies with modest compute\. The tradeoff is that the results should be read as a controlled case study of interaction structure rather than as a broad benchmark of theoretical\-physics reasoning\.

We study three Actor model settings and introduce the abbreviations used below\. The first uses DeepSeek\-R1 70B \(DeepSeek\-AI, [2025](https://arxiv.org/html/2605.06772#bib.bib9)\) \(70B parameters; hereafter DS70B\) as both the Actor and Critic, with QwQ\-32B \(Qwen Team, [2024](https://arxiv.org/html/2605.06772#bib.bib17)\) \(32B parameters; hereafter QWQ\) as the primary Judge; each of the $12\times 5=60$ persona–strategy configurations is run with 15 repeats \(5 per problem\) at temperature $T=0.7$ for up to 4 iterations, yielding 900 total runs\. The second uses DeepSeek\-R1\-0528\-Qwen3\-8B \(8B parameters; recorded in the logs as deepseek\-r1; hereafter DS8B\) with the same DS70B Critic and QWQ Judge, again yielding 900 runs\. The third uses Claude Haiku 4\.5 \(Anthropic, [2024](https://arxiv.org/html/2605.06772#bib.bib10)\) \(parameter count not public; hereafter Haiku\) as the Actor and Claude Sonnet 4\.6 \(parameter count not public; hereafter Sonnet\) as both Critic and primary Judge, with 5 repeats per persona–strategy cell on Peskin 2\.3 \($n=300$ runs\), and partial coverage on Peskin 4\.2 \($n=83$\) and Polchinski 2\.7 \($n=51$\)\.
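
As a quick consistency check on the run counts quoted above, the arithmetic can be written out explicitly. The Haiku total is simply the sum of the three per-problem counts and is not a figure stated by the authors.

$$
\begin{aligned}
\text{persona–strategy configurations} &= 12 \times 5 = 60,\\
\text{runs per DeepSeek-family setting} &= 60 \times 15 = 900,\\
\text{runs per fully covered problem} &= 60 \times 5 = 300,\\
\text{Haiku runs across all three problems} &= 300 + 83 + 51 = 434.
\end{aligned}
$$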

For Haiku, the reduced\-coverage runs served as exploratory screens\. They showed that Haiku converges on only $3\%$ of Polchinski 2\.7 cases under the Sonnet Judge, so we did not extend this problem to the full sweep\. In what follows, we therefore report detailed Haiku Critic feedback strategy statistics primarily on the two Peskin & Schroeder problems, while retaining the available Polchinski runs for exploratory per\-problem QWQ comparisons\.

To separate effects rooted in the dialogue itself from those specific to a particular Judge \(see, *e\.g\.*, Tan et al\., [2025](https://arxiv.org/html/2605.06772#bib.bib53) for a discussion on Judge reliability\) we additionally re\-score transcripts with independent Judges\. For the Haiku runs Sonnet is the primary Judge; we use closed\-model Opus on a small sample as a calibration check on Sonnet/QWQ agreement\. The Haiku transcripts are re\-scored by QWQ and by DS70B, and the two DeepSeek\-family Actor model settings are re\-scored by DS70B alongside the QWQ primary Judge\. Throughout the main results we report QWQ as the common scalable Judge across all three Actor model settings and treat DS70B re\-scoring as a sensitivity check\.

Because the tasks are drawn from theoretical physics, human domain expertise also provides an additional validation layer: we inspect representative outlier transcripts and ambiguous Judge decisions directly, checking whether apparent failures reflect genuine physics errors, grading artifacts, or limitations of the reference comparison\. In practice, we find that the QWQ Judge produces scores that are broadly consistent with manual inspection of representative cases and closer to the closed\-model audit set than DS70B re\-scoring, suggesting that it can act as a reliable primary scalable grader for the purposes of this study\.

The analyzed corpus is compute\-dense: across Actor, Critic, primary Judge, and re\-scoring calls, it contains roughly $9.4\times 10^{7}$ estimated token\-equivalents reconstructed from archived prompt/response text, including about $6.2\times 10^{7}$ tokens of judging\. Exact provider\-side token counters were not preserved, so these estimates are a scale\-of\-compute summary rather than billing totals\. This complements broader single\-turn studies such as TPBench, which spans 57 theoretical\-physics problems and 10 models with five attempts per model/problem \(Chung and others, [2025](https://arxiv.org/html/2605.06772#bib.bib21)\); SCALAR instead spends inference budget on factorial persona–strategy sweeps, iterative refinement, and Judge sensitivity over repeated transcripts\. This comparison should not be read as a claim that unconstrained multi\-agent dialogue is generically superior to single\-turn evaluation: the SCALAR Critic is reference\-conditioned, seeing the solution while being instructed not to disclose it, so the loop is closer to supervised tutoring than to open\-ended autonomous discovery\.

## 3 Results

We summarize performance using the evaluation metrics defined in [Section 2\.2](https://arxiv.org/html/2605.06772#S2.SS2): the mean per\-turn score $\bar{s}$, the gain $g$, and the convergence rate $R$\. When comparing strategies across problems, we additionally use the descriptive problem\-normalized contrasts $D_{\bar{s}}$ and $D_R$\. We refer to one complete Actor–Critic–Judge dialogue trajectory as a *run*; each run is treated as one independent observation\. For statistical comparisons, we use: Wilcoxon signed\-rank to investigate the sign and significance of paired turn\-$0$/turn\-$(T-1)$ gain differences; Kruskal–Wallis to detect differences across the five Critic feedback strategies; and Mann–Whitney for further pairwise Critic feedback strategy comparisons\. These procedures use the raw run\-level evaluation quantities, not the problem\-normalized contrasts\. Definitions and procedure details are given in [Appendix D](https://arxiv.org/html/2605.06772#A4)\.
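
The three tests named above are available in SciPy. The following sketch shows how they might be called, assuming run-level scores are held in plain Python lists; all numbers are synthetic placeholders and do not reproduce any result from the paper, and the exact options the authors used (alternatives, corrections) are not specified here.

```python
from scipy.stats import kruskal, mannwhitneyu, wilcoxon

# Synthetic paired scores: first (turn 0) and final Actor state of each run.
first = [60, 55, 70, 65, 50, 72, 58, 63]
final = [80, 70, 91, 75, 75, 78, 81, 77]

# Wilcoxon signed-rank test on the paired first/final scores.
gain_test = wilcoxon(final, first)

# Synthetic mean per-turn scores grouped by Critic feedback strategy.
by_strategy = {
    "adversarial": [61, 58, 66, 70, 63],
    "strict": [64, 59, 68, 71, 62],
    "pedagogical": [72, 69, 75, 78, 70],
    "lenient": [74, 71, 77, 73, 76],
    "default": [67, 65, 73, 79, 60],
}

# Kruskal-Wallis: any difference across the five strategies?
kw_test = kruskal(*by_strategy.values())

# Mann-Whitney: one pairwise follow-up comparison.
mw_test = mannwhitneyu(
    by_strategy["lenient"], by_strategy["strict"], alternative="two-sided"
)

print(gain_test.pvalue, kw_test.pvalue, mw_test.pvalue)
```

All three are rank-based tests, so they make no normality assumption about the bounded 0–100 Judge scores.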

![Refer to caption](https://arxiv.org/html/2605.06772v1/x1.png)

Figure 2: Per\-problem convergence for the three Actor model settings\. Haiku uses the Sonnet Critic; DS8B and DS70B both use the DS70B Critic\. Haiku has reduced coverage on Peskin 4\.2 and Polchinski 2\.7 \($n=83$ and $n=51$, respectively\), while the other shown cells use $n=300$ runs\.

![Refer to caption](https://arxiv.org/html/2605.06772v1/x2.png)

Figure 3: Run\-fate breakdown by problem for the DS70B Actor under the QWQ Judge\. Bars show single\-shot passes, multi\-turn passes after one/two/three Critic turns, and runs that terminate without passing\. The large single\-shot component on Peskin 4\.2 reflects cases where the Actor passes before Critic feedback is needed\.

#### Structured feedback improves first attempts\.

For all three Actor model settings, the final solution is substantially better than the Actor’s first attempt under the common QWQ scoring \([Figure 2](https://arxiv.org/html/2605.06772#S3.F2)\)\. On the two Peskin problems, Haiku with a Sonnet Critic reaches high QWQ convergence \($87.7\%$ on Peskin 2\.3 and $96.4\%$ on Peskin 4\.2\), while the reduced Polchinski screen remains much harder \(in the exploratory Polchinski 2\.7 Haiku screen, QWQ convergence is $27.5\%$, $n=51$\)\. The available re\-scoring checks preserve the sign of the multi\-turn gain, so this conclusion is not an artifact of the common QWQ scale\.

For DS70B on all three problems, QWQ convergence reaches $65.7\%$ \($n=900$\), with mean final score $80.7$ and mean gain $g=+13.4$\. DS8B reaches lower QWQ convergence, $52.4\%$ \($n=900$\), with mean final score $76.6$ and mean gain $g=+13.3$\. This within\-family comparison is useful because the Critic is held fixed: DS8B resembles DS70B more than Haiku in its Critic feedback strategy sensitivity, even though its absolute performance is lower under QWQ\. Under DS70B re\-scoring the within\-family ordering becomes judge\-sensitive, so the common figures use QWQ rather than treating the DS8B/DS70B scale ordering as a judge\-independent claim\.

The DS70B turn\-$0$ score under QWQ averages $67.3$ and climbs to $80.6$ by the end of the Critic loop, closing roughly $40\%$ of the gap to saturation despite the stronger single\-shot baseline\. This improvement is not a product of prompt optimization alone: it requires the iterative structure, consistent with structured multi\-agent refinement reducing errors \(Till and others, [2025](https://arxiv.org/html/2605.06772#bib.bib43)\) and contrasting with the failure of naive sequential strategies \(Gao and others, [2025](https://arxiv.org/html/2605.06772#bib.bib31)\) and with prior findings that single\-turn prompt optimization yields no significant gain \(Chung and others, [2025](https://arxiv.org/html/2605.06772#bib.bib21)\)\.
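
The "roughly 40% of the gap" figure can be checked directly, assuming that saturation means the maximum Judge score of 100:

$$
\frac{s_{\text{final}} - s_{0}}{100 - s_{0}} = \frac{80.6 - 67.3}{100 - 67.3} = \frac{13.3}{32.7} \approx 0.41 .
$$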

The three problems span a clear difficulty spectrum for DS70B, which manifests as a shift in the mixture of run\-fate categories \([Figure 3](https://arxiv.org/html/2605.06772#S3.F3)\)\. QWQ convergence is $90.7\%$ on Peskin 4\.2, $64.7\%$ on Peskin 2\.3, and $41.7\%$ on Polchinski 2\.7, with $n=300$ runs per problem\. DS8B does not follow the same problem ordering: QWQ convergence is $69.0\%$ on Peskin 2\.3, $54.3\%$ on Peskin 4\.2, and $34.0\%$ on Polchinski 2\.7, with $n=300$ runs per problem \([Figure 2](https://arxiv.org/html/2605.06772#S3.F2)\)\. Polchinski 2\.7 remains the hardest problem across all three Actor model settings, but DS8B finds Peskin 2\.3 easier than Peskin 4\.2, the opposite of what DS70B and Haiku do\. The same reversal appears in the score\-update curves discussed next: for DS8B, Peskin 4\.2 stops improving around intermediate scores, whereas Peskin 2\.3 continues to improve until higher scores\.

![Refer to caption](https://arxiv.org/html/2605.06772v1/x3.png)

Figure 4: Empirical score\-update curves for the two DeepSeek\-family Actor scales under the common QWQ scoring\. Points estimate the projected score\-update field $v(s)=\mathbb{E}[\Delta s_t\mid s_t\simeq s]$ by averaging next\-turn updates $\Delta s_t=s_{t+1}-s_t$ in bins of the current Actor score $s_t$ for the indicated Actor model and problem, pooling over personas and Critic feedback strategies; shaded bands are bootstrap $95\%$ confidence intervals\. Only runs with at least one observed update contribute, so immediate single\-shot passes are not part of these curves\. Markers on the zero line indicate estimated fixed points $s^{\ast}$, meaning zeros of the projected score\-update field\.

#### Score\-update curves describe dialogue dynamics\.

The per\-problem convergence rates say where the Actor–Critic loop ends, but not how later Critic turns move the solution\. Because SCALAR records the fixed configuration and dialogue state after each turn \([Section 2\.1](https://arxiv.org/html/2605.06772#S2.SS1)\), that recorded state is the natural Markov state of the generated dialogue\. The Markovian interpretation first becomes interesting when we project that state to a single observable: the Judge score\. For the DeepSeek\-family comparison in [Figure 4](https://arxiv.org/html/2605.06772#S3.F4), we estimate the projected *score\-update field*

$$
v(s)=\mathbb{E}\!\left[s_{t+1}-s_{t}\,\middle|\,s_{t}\simeq s\right].
$$

The average is over all observed next\-turn transitions in the same score bin for a fixed Actor model and problem, pooling over personas and Critic strategies\. A zero of this curve is a *projected fixed point*: the next Critic turn no longer improves the Actor on average in the observed ensemble near that score\. Throughout the rest of the paper, “fixed point” refers to this projected score\-level object, not to a complete fixed point of the full recorded dialogue state\. This is the quantity that separates “already easy,” “still improvable,” and “apparently stuck” regimes that are compressed together by endpoint gain\. In Appendix [E](https://arxiv.org/html/2605.06772#A5), the gain distributions show the same point from a different angle: endpoint gain mixes first\-shot successes, rescued runs, and stuck trajectories rather than isolating a clean strategy effect\.
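
A binned estimate of the score-update field and its zero crossings could be sketched as below. The bin width, the linear interpolation used to locate a zero, and the toy score sequences are assumptions for illustration; the paper's own estimator and binning are described in its appendix and may differ.

```python
from collections import defaultdict
from statistics import mean


def score_update_field(runs, bin_width=10):
    """Average next-turn update s_{t+1} - s_t in bins of the current score s_t.

    `runs` is a list of per-run score sequences. Runs with a single score
    contribute no transitions, matching the exclusion of single-shot passes.
    Returns {bin_center: mean update}.
    """
    deltas = defaultdict(list)
    for scores in runs:
        for s_now, s_next in zip(scores, scores[1:]):
            # Clamp so that a score of exactly 100 falls in the top bin.
            start = min(int(s_now // bin_width) * bin_width, 100 - bin_width)
            deltas[start + bin_width / 2].append(s_next - s_now)
    return {center: mean(values) for center, values in sorted(deltas.items())}


def projected_fixed_points(field):
    """Scores where the binned field crosses from positive to non-positive."""
    points = sorted(field.items())
    zeros = []
    for (s0, v0), (s1, v1) in zip(points, points[1:]):
        if v0 > 0 >= v1:
            # Linear interpolation between adjacent bin centers.
            zeros.append(s0 + (s1 - s0) * v0 / (v0 - v1))
    return zeros


# Synthetic score sequences, for illustration only.
toy_runs = [[45, 55, 62], [52, 61, 63], [64, 62], [66, 63], [85]]

field = score_update_field(toy_runs)
zeros = projected_fixed_points(field)
print(field)  # {45.0: 10, 55.0: 8, 65.0: -1}
print(zeros)  # one crossing between the 55 and 65 bins
```

In the toy data the update is positive at low scores and slightly negative in the 60–70 bin, so the interpolated zero marks the score level at which further Critic turns stop helping on average.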

Recent work has imported dynamical\-systems language into LLM behavior: Carson \([2025](https://arxiv.org/html/2605.06772#bib.bib39)\) model sentence\-level reasoning trajectories of open\-source LLMs as a switching linear dynamical system on a latent manifold, Sarfati and others \([2025](https://arxiv.org/html/2605.06772#bib.bib40)\) characterize token\-level activation paths as approximately linear\-drift stochastic processes, and Wang and others \([2025](https://arxiv.org/html/2605.06772#bib.bib42)\) study attractor cycles under repeated paraphrasing\. SCALAR puts related language to work in a different setting: the underlying state is a multi\-agent dialogue, the driving field is supplied by a reference\-conditioned Critic, and the observable is an externally judged physics score\. The result is a score projection of a Markov transcript process\. Appendix [E](https://arxiv.org/html/2605.06772#A5) states the corresponding projection and transition\-kernel definitions, and reports residual checks that support the score projection as a useful coarse\-grained diagnostic while leaving predictive kernel modeling to larger studies\.

Under this view, the within-family comparison reveals different regimes. On Peskin 2.3 both DeepSeek-family Actors continue to improve until they reach high scores. On Peskin 4.2, DS8B develops a mid-score fixed point near $s_{t}\simeq 63$, whereas DS70B remains in a positive-drift regime through most of the observed range and only crosses zero near high scores; we therefore do not mark a clean mid-score fixed point for DS70B on this problem. On Polchinski 2.7, however, both DS8B and DS70B have a similar fixed-point region near $s_{t}\simeq 63$. In this sense, increasing parameter count within the same model family changes the easier-problem dynamics but does not remove the hard-problem bottleneck.

#### Critic feedback strategy is model dependent\.

Feedback style should matter most when the Critic is actively moving the Actor within a dialogue, rather than when a run passes immediately or remains stuck\. This is what we observe for Haiku, but not as a stable single\-model effect for either DeepSeek Actor scale\.

![Refer to caption](https://arxiv.org/html/2605.06772v1/x4.png)

Figure 5: Problem-normalized Critic feedback strategy contrasts under QWQ scoring. Haiku values use the two Peskin problems ($n=383$); DS8B and DS70B values use all three problems ($n=900$ each). For each Actor model setting and problem, we subtract the local baseline before averaging by Critic feedback strategy, so the figure asks which Critic feedback strategies sit above or below the corresponding model–problem baseline rather than which problems are easiest. Vertical bars show descriptive run-level bootstrap $95\%$ confidence intervals after problem-centering. $D_{\bar{s}}$ is measured in score points on the $0$–$100$ Judge scale; $D_{R}$ is measured in percentage points of convergence.

Under the common QWQ scoring, Haiku is more sensitive to Critic feedback strategy than either DeepSeek Actor scale ([Figure 5](https://arxiv.org/html/2605.06772#S3.F5)). The normalization is needed because raw mean score and raw convergence are dominated by problem difficulty; the contrasts instead isolate within-problem Critic feedback strategy shifts. Under this view, Haiku places pedagogical, lenient, and default feedback above its local baseline in $D_{\bar{s}}$, with strict and adversarial below it. This visual pattern matches the raw $\bar{s}$ Critic feedback strategy test for Haiku (Kruskal–Wallis $p=0.012$), while the corresponding convergence-rate test is not significant ($p=0.69$). Thus the Haiku Critic feedback strategy signal is clearest in mean per-turn score and should not be read as a strict ranking of the five strategies.
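The problem-centering step can be sketched as follows. This is a minimal illustration that assumes the local baseline is the mean score over all runs for a given Actor model and problem (the text does not spell out the baseline, so this is a guess) and that uses a percentile bootstrap over runs. The function names and toy numbers are hypothetical.

```python
import random
from statistics import mean


def center_by_problem(runs):
    """runs: iterable of (problem, strategy, score) for one Actor model.

    Returns {strategy: [score - problem_mean, ...]}.
    """
    runs = list(runs)
    by_problem = {}
    for problem, _, score in runs:
        by_problem.setdefault(problem, []).append(score)
    baseline = {p: mean(v) for p, v in by_problem.items()}  # assumed local baseline
    centered = {}
    for problem, strategy, score in runs:
        centered.setdefault(strategy, []).append(score - baseline[problem])
    return centered


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, resampling runs with replacement."""
    rng = random.Random(seed)
    means = sorted(mean(rng.choices(values, k=len(values))) for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]


# Toy data (invented): one easy and one hard problem, two strategies.
toy = [("easy", "lenient", 90), ("easy", "strict", 80),
       ("hard", "lenient", 60), ("hard", "strict", 40)]
centered = center_by_problem(toy)
contrast = {s: mean(v) for s, v in centered.items()}
intervals = {s: bootstrap_ci(v) for s, v in centered.items()}
```

In the toy data the raw strategy means (75 versus 60) mostly reflect the 35-point gap between the easy and hard problems; after centering, the contrasts are 7.5 and -7.5 score points, which is the within-problem shift the figure is meant to show.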

For DS70B alone the five Critic feedback strategies are statistically indistinguishable. Across the $900$-run sweep, the raw QWQ $\bar{s}$ means by Critic feedback strategy span only a few score points, and the omnibus Critic-feedback-strategy tests are null (Kruskal–Wallis $p=0.61$ for $\bar{s}$ and $p=0.17$ for $R$ under QWQ). DS70B re-scoring gives the same null conclusion. DS8B is similar: under QWQ, the omnibus tests find no reliable Critic feedback strategy effect on $\bar{s}$ ($p=0.10$) or convergence rate $R$ ($p=0.22$), and DS70B re-scoring agrees. Descriptively, however, both DeepSeek-family Actors put lenient feedback first under QWQ. Naively pooling the two DeepSeek Actor scales yields a weak lenient advantage for raw $\bar{s}$ and convergence rate. We therefore report this as a QWQ-conditioned tendency, not as the same kind of robust Critic feedback strategy effect seen for Haiku. Across all three Actor model settings, we find no stable evidence that adversarial or strict feedback is the best Critic feedback strategy.
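For readers who want to see what the omnibus test computes, here is a standard-library sketch of the Kruskal–Wallis $H$ statistic with average ranks and tie correction. The chi-square tail uses a closed form that is valid only for an even number of degrees of freedom, which covers the five-strategy case (four degrees of freedom). The helper names and toy groups are illustrative, and the paper's own analysis code may differ.

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    pooled = sorted((x, g) for g, xs in enumerate(groups) for x in xs)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        r * r / len(xs) for r, xs in zip(rank_sums, groups)
    ) - 3.0 * (n_total + 1)
    # Undefined (division by zero) if every observation is identical.
    return h / (1.0 - tie_term / (n_total ** 3 - n_total))


def chi2_sf_even_df(x, df):
    """Chi-square survival function; this closed form holds for even df only."""
    if df % 2:
        raise ValueError("closed form implemented for even df only")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# Toy groups (invented), fully separated in rank.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
p = chi2_sf_even_df(h, df=len(groups) - 1)
```

The chi-square approximation is asymptotic, so for very small groups like the toy example the resulting $p$-value is only indicative.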

#### Actor persona prompting has negligible effect\.

Actor persona prompts are even weaker as a design variable. For DS70B under the common QWQ scoring, the $12$ Actor-persona means span only $\bar{s}\in[69.4,74.1]$ on the full $900$-run sweep, a roughly $5$-point range across $12$ configurations, no larger than what sampling variation alone would produce if all configurations had identical effects (Kruskal–Wallis $p=0.99$). Varying the Actor persona along expertise and reasoning-style axes has no measurable effect on DS70B outcomes. DS8B gives the same broad message: the few low-$p$ axis tests are not stable across evaluation metrics or re-scoring checks, while the expertise axis itself is null.

Appendix [A](https://arxiv.org/html/2605.06772#A1) shows the full $60$-cell persona $\times$ strategy heatmaps for the balanced DeepSeek-family sweeps, supporting the same conclusion: no single Actor persona is consistently bright across problems, models, and Critic strategies.

For Haiku under QWQ scoring, the expertise axis also does not robustly beat the unspecified default once we restrict to the two Peskin & Schroeder problems ($\bar{s}$: default $75.5$ vs expert $74.1$). The available re-scoring checks do not turn this into a stable dialogue-level expertise effect. Taken together with the Critic feedback strategy analysis above, the lesson is that among the prompt axes we tested only Critic feedback strategy reliably moves Haiku mean-score results, while neither DeepSeek Actor scale shows a robust single-model persona or Critic feedback strategy effect. Actor persona prompting is therefore not a reliable design variable in either pairing.

## 4 Discussion and Outlook

### 4.1 Practical Takeaways

Two observations from our experiments may be useful when deploying SCALAR for graduate\-level physics reasoning\.

*The Actor–Critic pairing is the key design variable.* Critic feedback strategy matters, but not universally: it must be tuned for the Actor–Critic system rather than chosen as a context-free rule. The same multi-turn structure improves all three Actor model settings, but the mechanism differs: Haiku improves smoothly within the dialogue, whereas the DeepSeek-family averages combine first-shot successes, rescued runs, and failures that remain stuck. For a physicist using a frontier assistant, this gives three practical lessons:

1. Use dialogue rather than only a one-shot request.² Across all Actor model settings, multi-turn refinement improves over the initial solution, especially when later turns target concrete physics checks such as dimensions, limits, symmetries, known special cases, and missing factors.
2. Do not expect Actor persona prompts to carry the result. Pre-prompting current reasoning Actors with stronger personas, such as “be an expert theorist,” is not a reliable design variable in our experiments.
3. Treat Critic feedback strategy as the prompt lever most worth testing, but do not assume that harsher criticism is better. Among the axes studied here, Critic feedback strategy is where we see the clearest model-dependent movement in mean score, yet strict and adversarial feedback are never stably best in our experiments. A better starting point is constructive feedback that preserves correct partial work, targets the missing physics check, and only then tightens the standard; in the DeepSeek-family settings, the lenient signal is suggestive but remains QWQ-conditioned rather than a robust prescription.

² This recommendation is hardly revolutionary, but it is often the missing step in practice: a single flawed response is still sometimes treated by physicists as a verdict on the model, rather than as the first turn of an interaction.
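The multi-turn structure behind these lessons can be sketched as a small loop. This is a schematic under stated assumptions, not the SCALAR implementation: the `actor`, `critic`, and `judge` callables stand in for LLM calls, the turn budget and pass threshold are invented defaults, and stopping as soon as the Judge's score passes is an assumption about the protocol.

```python
def run_dialogue(problem, reference, actor, critic, judge, max_turns=5, pass_score=90.0):
    """Run one Actor-Critic dialogue and return (transcript, per_turn_scores)."""
    transcript, scores, feedback = [], [], None
    for _ in range(max_turns):
        solution = actor(problem, feedback)              # Actor is not given the reference
        transcript.append(("actor", solution))
        scores.append(judge(problem, transcript, reference))
        if scores[-1] >= pass_score:                     # assumed stopping rule
            break
        feedback = critic(problem, solution, reference)  # reference-conditioned Critic
        transcript.append(("critic", feedback))
    return transcript, scores


# Stub agents standing in for LLM calls (purely illustrative).
def toy_actor(problem, feedback):
    return "draft" if feedback is None else "draft with corrected prefactor"


def toy_critic(problem, solution, reference):
    return "Check the numerical prefactor against a limiting case."


def toy_judge(problem, transcript, reference):
    return 95.0 if "corrected" in transcript[-1][1] else 60.0


transcript, scores = run_dialogue(
    "toy problem", "toy reference", toy_actor, toy_critic, toy_judge
)
```

Here the first attempt scores 60, one round of targeted feedback lifts it to 95, and the loop stops. Swapping in a different `critic` callable is the code-level analogue of changing the Critic feedback strategy while holding the Actor fixed.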

Furthermore, recent results also support the importance of asymmetric Actor–Critic pairing (Jiang et al., [2026](https://arxiv.org/html/2605.06772#bib.bib54)). This also suggests that prompt-engineering folklore deserves re-testing on current reasoning-tuned models, especially in dialogue rather than only in single-turn settings.

*Dialogue dynamics are regime dependent.* Recent controlled studies of reasoning models report complexity-dependent regimes, including abrupt performance collapse beyond critical task difficulty and phase-transition-like behavior in logical reasoning benchmarks (Shojaee et al., [2025](https://arxiv.org/html/2605.06772#bib.bib23); Hazra et al., [2025](https://arxiv.org/html/2605.06772#bib.bib25); Zhang et al., [2026](https://arxiv.org/html/2605.06772#bib.bib26)). Our three physics problems are not a controlled complexity ladder, so we do not claim a phase transition.

Instead, the score-update analysis in [Section 3](https://arxiv.org/html/2605.06772#S3) gives a conservative way to describe dialogue regimes: easy instances often pass before Critic feedback, intermediate instances can be moved by structured feedback, and hard instances can remain stuck despite additional dialogue. The DS8B/DS70B comparison is a first controlled scale probe within one model family: increasing parameters changes the easy/intermediate regimes, but does not by itself remove the Polchinski bottleneck. A denser DS-R1 scale ladder could test whether such fixed-point regions disappear at problem-dependent critical scales; with only two scales here, this remains motivated future work rather than a scaling claim.

In Appendix [C](https://arxiv.org/html/2605.06772#A3), [Figure 8](https://arxiv.org/html/2605.06772#A3.F8) also identifies justification quality as the most persistent rubric gap. This suggests a natural next step for SCALAR: treat Actor–Critic exchanges as controlled dynamical systems, vary model scale, Critic strength, turn budget, and problem difficulty, and test whether empirical transition kernels predict held-out convergence. A richer state could also include Critic scores, turning the one-dimensional Actor-score projection into a joint description of feedback quality and Actor uptake.

This connects to the “agentic gap” critique of static text-only reasoning evaluations (Khan et al., [2025](https://arxiv.org/html/2605.06772#bib.bib24)): scaffolding can rescue some failures, but our hard-problem results show that scaffolding itself has boundaries.

### 4.2 Toward AI-Assisted Scientific Discovery

AI-assisted discovery in fundamental physics is moving from speculation to an operational research question (Lu et al., [2025](https://arxiv.org/html/2605.06772#bib.bib22)). Recent Gemini case studies give an open-ended perspective: frontier models can be useful in scientific workflows when embedded in human-supervised loops involving decomposition, critique, code generation, and external validation (Woodruff et al., [2026](https://arxiv.org/html/2605.06772#bib.bib60)). SCALAR takes the complementary controlled direction: by returning to known-reference problems, it makes one part of that workflow measurable, namely how Critic feedback changes the trajectory of an Actor’s reasoning.

At the model level, one might expect that fine-tuning models on domain-specific high-energy theory corpora, as in the FeynTune study (Richmond et al., [2026](https://arxiv.org/html/2605.06772#bib.bib45)), would substantially improve research-style reasoning in high-energy theory. However, the results of that study indicate that targeted fine-tuning at the scales currently accessible to small academic collaborations does not close the gap with frontier general-purpose models. This motivates the SCALAR approach: rather than trying to compete with frontier capability at the model level, scaffold a frontier model with a structured Critic and obtain additional gains at the *interaction* level. We test this idea on graduate-level QFT and string-theory calculations in a controlled benchmark.

However, in the present setup, the Critic is reference\-conditioned: it has access to reference solutions and therefore guides the Actor with privileged knowledge of the answer\. The Judge, by contrast, represents the external\-validation part of the physicist’s supervisory role: it evaluates the Actor–Critic exchange from outside the dialogue and determines whether the final reasoning meets the required standard\. This is appropriate for benchmarking, but it means that the grading system still presupposes a known solution\.

The longer-term vision is to remove this dependence on reference solutions.³ For open problems, the researcher would instead move into the Critic role, supplying consistency checks and domain intuition. The Judge would then be replaced by an external arbiter (experimental measurements, simulation outputs, theorem-prover checks, or mathematical consistency conditions), so that the dialogue is validated against available ground truth rather than against a textbook solution. [Appendix C](https://arxiv.org/html/2605.06772#A3) explains which parts of the rubric would survive, and which would need replacement, in such a no-reference setting. In this setting, the three-agent analogy with the scientific method would become literal rather than stylized. Understanding how to configure Actor and Critic so that their dialogue reliably converges on problems whose answers are known is therefore a prerequisite for the more ambitious goal of collaborating on problems whose answers are not.

³ This connects to a broader program in AI-assisted science: models can already learn useful structure from scientific data, but extracting reliable human-usable insight from that structure often remains labor intensive (Schmidt and Lipson, [2009](https://arxiv.org/html/2605.06772#bib.bib63); Udrescu and Tegmark, [2020](https://arxiv.org/html/2605.06772#bib.bib64); Cranmer et al., [2020](https://arxiv.org/html/2605.06772#bib.bib65); Kitouni et al., [2024](https://arxiv.org/html/2605.06772#bib.bib62); Richardson et al., [2025](https://arxiv.org/html/2605.06772#bib.bib61)). Agentic workflows could eventually help automate parts of this interpretive loop.

## 5 Author contribution

All authors contributed equally\.

## Data availability

## Acknowledgments

The authors thank Tilman Plehn and Jesse Thaler for their comments on the manuscript. CP was partially supported by the Science and Technology Facilities Council (STFC) Consolidated Grant ST/X00063X/1 “Amplitudes, Strings & Duality.” AGS acknowledges support from Pierre Andurand. This research utilized the Apocrita HPC facility, supported by QMUL Research-IT (King et al., [2017](https://arxiv.org/html/2605.06772#bib.bib49)). ST is supported by the Swiss National Science Foundation project number P5R5PT_222350, and acknowledges CERN TH Department for hospitality while this research was being carried out. This work is also supported by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/).

## References

- P. Agrawal, N. Craig, A. Madden, and I. Valenzuela Lombera (2026). The FERMIACC: agents for particle theory. arXiv preprint arXiv:2603.22538.
- Anthropic (2024). The Claude Model Card and System Prompt. [https://www.anthropic.com](https://www.anthropic.com/)
- T. Carson (2025). A statistical physics of language model reasoning. arXiv preprint.
- D. Chung et al. (2025). Theoretical physics benchmark (TPBench): a dataset and study of AI reasoning capabilities in theoretical physics. arXiv preprint arXiv:2502.15815.
- M. Cranmer, A. Sanchez-Gonzalez, P. Battaglia, R. Xu, K. Cranmer, D. Spergel, and S. Ho (2020). Discovering symbolic models from deep learning with inductive biases. In Advances in Neural Information Processing Systems, Vol. 33, pp. 17429–17442.
- DeepSeek-AI (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch (2024). Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.
- A. Estornell, J. Ton, Y. Yao, and Y. Liu (2024). ACC-Collab: an actor-critic approach to multi-agent LLM collaboration. arXiv preprint arXiv:2411.00053.
- Z. Gao et al. (2025). Test-time scaling techniques in theoretical physics: a comparison of methods on the TPBench dataset. arXiv preprint.
- A. Guevara, A. Lupsasca, D. Skinner, A. Strominger, and K. Weil (2026). Single-minus gluon tree amplitudes are nonzero. arXiv preprint arXiv:2602.12176.
- J. Gupta et al. (2024). Persona is a double-edged sword: enhancing the zero-shot reasoning by ensembling the role-playing and neutral prompts. arXiv preprint arXiv:2408.08631.
- R. Hazra, G. Venturato, P. Zuidberg Dos Martires, and L. De Raedt (2025). Have large language models learned to reason? a characterization via 3-SAT phase transition. arXiv preprint arXiv:2504.03930.
- Y. Hwang, Y. Kim, J. Koo, T. Kang, H. Bae, and K. Jung (2025). LLMs can be easily confused by instructional distractions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 19483–19496. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.957), [Link](https://aclanthology.org/2025.acl-long.957/)
- R. Jaiswal et al. (2024). Improving physics reasoning in large language models using mixture of refinement agents. arXiv preprint arXiv:2412.00821.
- S. Jiang, Z. Zhang, Y. Zhang, S. Yang, W. Xia, and S. Soatto (2026). Asymmetric actor-critic for multi-turn LLM agents. External Links: 2604.00304, [Link](https://arxiv.org/abs/2604.00304)
- S. Khan, S. Madhavan, and K. Natarajan (2025). A comment on “the illusion of thinking”: reframing the reasoning cliff as an agentic gap. arXiv preprint arXiv:2506.18957.
- J. Kim, N. Yang, and K. Jung (2024). Persona is a double-edged sword: mitigating the negative impact of role-playing prompts in zero-shot reasoning tasks. External Links: 2408.08631, [Link](https://arxiv.org/abs/2408.08631)
- T. King, S. Butcher, and L. Zalewski (2017). Apocrita - high performance computing cluster for Queen Mary University of London. External Links: [Document](https://dx.doi.org/10.5281/zenodo.438045), [Link](https://doi.org/10.5281/zenodo.438045)
- O. Kitouni, N. Nolte, V. S. Pérez-Díaz, S. Trifinopoulos, and M. Williams (2024). From neurons to neutrons: a case study in interpretability. In Proceedings of the 41st International Conference on Machine Learning, pp. 24726–24748. External Links: [Link](https://proceedings.mlr.press/v235/kitouni24a.html)
- W. H. Kruskal and W. A. Wallis (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47(260), pp. 583–621. External Links: [Document](https://dx.doi.org/10.1080/01621459.1952.10483441)
- P. Laban et al. (2025). LLMs get lost in multi-turn conversation. arXiv preprint.
- Z. Liang, D. Yu, W. Yu, W. Yao, Z. Zhang, X. Zhang, and D. Yu (2024). MathChat: benchmarking mathematical reasoning and instruction following in multi-turn interactions. External Links: 2405.19444, [Link](https://arxiv.org/abs/2405.19444)
- S. Lu, Z. Jin, T. J. Zhang, P. Kos, J. I. Cirac, and B. Schölkopf (2025). Can theoretical physics research benefit from language agents? arXiv preprint arXiv:2506.06214.
- H. B. Mann and D. R. Whitney (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18(1), pp. 50–60. External Links: [Document](https://dx.doi.org/10.1214/aoms/1177730491)
- A. Novikov, N. Vu, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog (2025). AlphaEvolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.
- M. E. Peskin and D. V. Schroeder (1995). An introduction to quantum field theory. Addison-Wesley.
- T. Plehn, D. Schiller, and N. Schmal (2026). MadAgents. arXiv preprint arXiv:2601.21015.
- J. Polchinski (1998). String theory. Vol. 1, Cambridge University Press.
- Qwen Team (2024). QwQ: Reflect Deeply on the Boundaries of the Unknown. [https://qwenlm.github.io/blog/qwq-32b-preview/](https://qwenlm.github.io/blog/qwq-32b-preview/)
- K. A. Richardson, S. Trifinopoulos, and M. Williams (2025). The DNA of nuclear models: how AI predicts nuclear masses. arXiv preprint arXiv:2508.08370. External Links: [Link](https://arxiv.org/abs/2508.08370)
- P. Richmond, C. Papageorgakis, V. Niarchos, B. Chowdhury, and P. Agarwal (2026). FeynTune: large language models for high-energy theory. Mach. Learn. Sci. Tech. 7(2), pp. 025012. External Links: 2508.03716, [Document](https://dx.doi.org/10.1088/2632-2153/ae47bb)
- B. Romera-Paredes et al. (2024). Mathematical discoveries from program search with large language models. Nature 625, pp. 468–475.
- R. Sarfati et al. (2025). Lines of thought in large language models. In ICLR.
- M. Schmidt and H. Lipson (2009). Distilling free-form natural laws from experimental data. Science 324(5923), pp. 81–85.
- M. D. Schwartz (2026). Resummation of the C-Parameter Sudakov Shoulder Using Effective Field Theory. arXiv preprint arXiv:2601.02484.
- D. Shih (2026a). Learning to Unscramble Feynman Loop Integrals with SAILIR. arXiv preprint arXiv:2604.05034.
- D. Shih (2026b). Learning to Unscramble: Simplifying Symbolic Expressions via Self-Supervised Oracle Trajectories. arXiv preprint arXiv:2603.11164.
- P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, and M. Farajtabar (2025). The illusion of thinking: understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941.
- S. Tan, S. Zhuang, K. Montgomery, W. Y. Tang, A. Cuadron, C. Wang, R. A. Popa, and I. Stoica (2025). JudgeBench: a benchmark for evaluating LLM-based judges. External Links: 2410.12784, [Link](https://arxiv.org/abs/2410.12784)
- R. Till et al. (2025). Multi-model consistency improves hallucination detection and mitigation in large language models. arXiv preprint.
- S. Udrescu and M. Tegmark (2020). AI Feynman: a physics-inspired method for symbolic regression. Science Advances 6(16), pp. eaay2631.
- L. S. Vygotsky (1978). Mind in society: the development of higher psychological processes. Harvard University Press, Cambridge, MA. Edited by Michael Cole, Vera John-Steiner, Sylvia Scribner, and Ellen Souberman.
- Y. Wang et al. (2025). Unveiling attractor cycles in large language models: a dynamical systems view of successive paraphrasing. arXiv preprint.
- F. Wilcoxon (1945). Individual comparisons by ranking methods. Biometrics Bulletin 1(6), pp. 80–83. External Links: [Document](https://dx.doi.org/10.2307/3001968)
- D. Wood, J. S. Bruner, and G. Ross (1976). The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry 17(2), pp. 89–100. External Links: [Document](https://dx.doi.org/10.1111/j.1469-7610.1976.tb00381.x)
- D. P. Woodruff, V. Cohen-Addad, L. Jain, J. Mao, S. Zuo, et al. (2026). Accelerating scientific research with Gemini: case studies and common techniques. arXiv preprint arXiv:2602.03837. External Links: [Link](https://arxiv.org/abs/2602.03837)
- Y. Xu, H. Kimlee, Y. Xiao, and D. Luo (2025). Advancing AI-scientist understanding: multi-agent LLMs with interpretable physics reasoning. arXiv preprint arXiv:2504.01911. ICML 2025 Workshop on MAS.
- X. Zhang, Y. Zhang, Z. Chen, J. Yu, W. Yang, and Z. Song (2026). Logical phase transitions: understanding collapse in LLM logical reasoning. arXiv preprint arXiv:2601.02902.
- X. Zhang, Y. Dong, et al. (2025). PhysReason: a comprehensive benchmark towards physics-based reasoning. arXiv preprint arXiv:2502.12054.
- Y. Zhang et al. (2025). TurnBench-MS: a benchmark for evaluating multi-turn, multi-step reasoning in large language models. arXiv preprint.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36, pp. 46595–46623. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf)

## Appendix A Prompt Texts

In SCALAR, Actor persona pre-prompting is factored into expertise $\times$ reasoning style. Representative summaries of each component follow; the full prompts used in experiments are provided in the code repository.

#### Expertise level\.

- *Expert*: “You are an expert in theoretical physics with deep knowledge of QFT and mathematical methods.”
- *Novice*: “You are a graduate student learning QFT, working through problems to build understanding.”
- *Default (d)*: no expertise instruction.

#### Reasoning style\.

- *Meticulous*: “Check every algebraic step carefully. Verify signs, prefactors, and limits before proceeding.”
- *Physical*: “Prioritize physical intuition. Use dimensional analysis, limiting cases, and symmetry to guide your calculation.”
- *Skeptical*: “Question every assumption. If a step seems unjustified, flag it and consider alternatives.”
- *Default (d)*: no style instruction.

#### Feedback strategies\.

- *Adversarial*: “Aggressively challenge every claim. Demand explicit justification for each step.”
- *Strict*: “Flag every error precisely with the correct form.”
- *Pedagogical*: “Guide through Socratic questioning. Ask leading questions rather than stating errors directly.”
- *Lenient*: “Focus on what the solver got right. Offer gentle suggestions. Accept partial progress.”
- *Default (d)*: no additional style instructions.

The heatmaps in [Figures 6](https://arxiv.org/html/2605.06772#A1.F6) and [7](https://arxiv.org/html/2605.06772#A1.F7) show the balanced persona $\times$ strategy grids for the two DeepSeek-family sweeps. The block structure separates default, expert, and novice Actor personas. The visual pattern is irregular rather than block-diagonal: individual cells can be bright, but there is no stable persona formula that dominates across problems.

![Refer to caption](https://arxiv.org/html/2605.06772v1/x5.png)

Figure 6: DS8B $\bar{s}$ heatmaps under QWQ scoring, separated by problem. Rows are Actor personas, columns are Critic feedback strategies, and cell color gives the mean per-turn score.

![Refer to caption](https://arxiv.org/html/2605.06772v1/x6.png)

Figure 7: DS70B $\bar{s}$ heatmaps under QWQ scoring, separated by problem. Rows are Actor personas, columns are Critic feedback strategies, and cell color gives the mean per-turn score.

## Appendix B List of Physics Problems

As outlined in [Section 2.3](https://arxiv.org/html/2605.06772#S2.SS3), in this work we consider three problems chosen to probe various qualities of each model-prompt pair. Each problem is reproduced below.

#### Peskin 2.3 (Peskin and Schroeder, [1995](https://arxiv.org/html/2605.06772#bib.bib2))

Evaluate the function

$$\langle 0|\phi(x)\phi(y)|0\rangle=D(x-y)=\int\frac{d^{3}p}{(2\pi)^{3}}\frac{1}{E_{\vec{p}}}e^{-ip\cdot(x-y)}$$

for $(x-y)$ spacelike so that $(x-y)^{2}=-r^{2}$, explicitly in terms of Bessel functions.

#### Polchinski 2.7 (Polchinski, [1998](https://arxiv.org/html/2605.06772#bib.bib3))

Consider the CFT of free scalars $X^{\mu}$.

1. By computing the relevant OPEs, confirm the following weights

   $$\begin{aligned}
   X^{\mu} &\quad (0,0)\\
   \partial X^{\mu} &\quad (1,0)\\
   \bar{\partial}X^{\mu} &\quad (0,1)\\
   \partial^{2}X^{\mu} &\quad (2,0)\\
   {:}e^{ik\cdot X}{:} &\quad \left(\frac{\alpha^{\prime}k^{2}}{4},\frac{\alpha^{\prime}k^{2}}{4}\right)
   \end{aligned}$$

   and determine which operators are tensors.
2. Do this for the same operators in the linear dilaton theory.

#### Peskin 4.2 (Peskin and Schroeder, [1995](https://arxiv.org/html/2605.06772#bib.bib2))

Consider the following Lagrangian, involving two real scalar fields $\Phi$ and $\phi$:

$$\mathcal{L}=\tfrac{1}{2}(\partial_{\mu}\Phi)^{2}-\tfrac{1}{2}M^{2}\Phi^{2}+\tfrac{1}{2}(\partial_{\mu}\phi)^{2}-\tfrac{1}{2}m^{2}\phi^{2}-\mu\Phi\phi\phi.$$

The last term is an interaction that allows a $\Phi$ particle to decay into two $\phi$'s, provided that $M>2m$. Assuming that this condition is met, calculate the lifetime of the $\Phi$ to lowest order in $\mu$.

## Appendix C Scoring Criteria

We evaluate each candidate solution using an *LLM Judge* that compares the submitted solution against the problem statement and, when available, a reference solution. The Judge produces a structured verdict consisting of a binary pass/fail decision, fine-grained sub-scores, error flags, and qualitative summaries.

#### Inputs\.

For each evaluation instance, the Judge receives:

1. the problem statement $P$,
2. the Actor-generated solution $A$,
3. Critic feedback $C$ embedded in the evaluated text, and
4. a reference solution $R$ (when available).

#### Final\-result equivalence check\.

Before assigning scores, the Judge performs an explicit *final-result comparison* between the Actor's claimed final answer and the reference answer. Let

$$\mathrm{Eq}(A,R)\in\{0,1\}$$

denote whether the Actor's final result is mathematically equivalent to the reference. Equivalence is defined up to:

- algebraic rearrangement,
- equivalent notation,
- trivial reordering of factors.

Non\-equivalence is declared if the two results differ in sign, numerical prefactor, functional form, missing or extra terms, or complex phase factors\. If the Actor provides no final formula or only an incomplete formula, then

$$\mathrm{Eq}(A,R)=0.$$

#### Actor score\.

The Actor is evaluated along six dimensions:

$$S_{\mathrm{actor}}=S_{c}+S_{r}+S_{l}+S_{j}+S_{m}+S_{p},\qquad S_{\mathrm{actor}}\in[0,100],$$

where:

$$\begin{aligned}
S_{c} &\in[0,50] &&\text{correctness},\\
S_{r} &\in[0,10] &&\text{mathematical rigor},\\
S_{l} &\in[0,10] &&\text{logical flow},\\
S_{j} &\in[0,10] &&\text{quality of justification},\\
S_{m} &\in[0,10] &&\text{completeness},\\
S_{p} &\in[0,10] &&\text{physical consistency}.
\end{aligned}$$

##### Correctness ($S_{c}$).

Correctness is based on the error density

$$\rho=\frac{N_{\mathrm{errors}}}{N_{\mathrm{steps}}}.$$

The rubric is:

- $S_{c}=50$: $\rho=0$, all equations are mathematically sound, and the final result is within $1\%$ of the expected result;
- $S_{c}\in[42,49]$: $\rho<0.1$, only minor computational errors;
- $S_{c}\in[27,41]$: $0.1\leq\rho<0.3$, some correct intermediate reasoning but substantial gaps;
- $S_{c}\in[0,26]$: $\rho\geq 0.3$, indicating fundamental conceptual or methodological errors.
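The banding above can be read as a lookup from error density to a score range. The sketch below is illustrative only: in the paper the Judge is an LLM applying this rubric in natural language, and the function name `correctness_band` is hypothetical. It returns the band implied by $\rho$ alone and does not check the extra conditions attached to the top score.

```python
def correctness_band(n_errors: int, n_steps: int) -> tuple[int, int]:
    """Return the (low, high) S_c band implied by rho = n_errors / n_steps.

    Hypothetical helper: it encodes only the rho thresholds of the rubric.
    The top score additionally requires sound equations and a final result
    within 1% of the expected one, which this sketch does not verify.
    """
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    rho = n_errors / n_steps
    if rho == 0:
        return (50, 50)
    if rho < 0.1:
        return (42, 49)
    if rho < 0.3:
        return (27, 41)
    return (0, 26)


# One error in a 20-step derivation gives rho = 0.05, the "minor errors" band.
band = correctness_band(1, 20)
```

The boundaries are half-open exactly as in the rubric, so $\rho=0.1$ falls in the $[27,41]$ band and $\rho=0.3$ in the $[0,26]$ band.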

##### Mathematical rigor ($S_{r}$).

Rigor is measured by the justification ratio

$$J=\frac{N_{\mathrm{justified}}}{N_{\mathrm{statements}}}.$$

The rubric is:

- $S_{r}=10$: $J=1$, all statements justified;
- $S_{r}\in[7,9]$: $J\geq 0.8$, mostly rigorous;
- $S_{r}\in[4,6]$: $0.5\leq J<0.8$, partial justification;
- $S_{r}\in[0,3]$: $J<0.5$, predominantly unjustified reasoning.

##### Logical flow ($S_{l}$).

Logical coherence is assessed using the dependency structure of the derivation:

- $S_{l}=10$: the derivation forms a clear directed acyclic graph (DAG), meaning a non-circular dependency structure in which each step follows from prior steps;
- $S_{l}\in[7,9]$: mostly sequential with minor organizational gaps;
- $S_{l}\in[4,6]$: unclear dependencies or partially disconnected reasoning;
- $S_{l}\in[0,3]$: non-sequential reasoning, circularity, or disconnected argument structure.

##### Quality of justification ($S_{j}$).

Explanation depth is measured by the average reasoning-chain length

$$R=\mathrm{avg}(\text{explanation steps per result}).$$

The rubric is:

- $S_{j}=10$: $R\geq 3$;
- $S_{j}\in[7,9]$: $2\leq R<3$;
- $S_{j}\in[4,6]$: $1\leq R<2$;
- $S_{j}\in[0,3]$: $R<1$.

##### Completeness ($S_{m}$).

Coverage is measured by

$$M=\frac{N_{\mathrm{addressed\ requirements}}}{N_{\mathrm{total\ requirements}}}.$$

The rubric is:

- $S_{m}=10$: $M=1$;
- $S_{m}\in[7,9]$: $M\geq 0.8$;
- $S_{m}\in[4,6]$: $0.5\leq M<0.8$;
- $S_{m}\in[0,3]$: $M<0.5$.

##### Physical consistency ($S_{p}$).

Physical validity is assessed via dimensional analysis, unit propagation, and limiting-case behavior:

- $S_{p}=10$: dimensionally consistent, correct limits, physically plausible;
- $S_{p}\in[7,9]$: generally consistent with minor issues;
- $S_{p}\in[4,6]$: some unit or limiting-behavior errors;
- $S_{p}\in[0,3]$: unphysical or dimensionally inconsistent results.

#### Critic score\.

When Critic feedback is included, it is scored separately:

$$S_{\mathrm{critic}}=C_{a}+C_{d}+C_{f}+C_{t}+C_{c}+C_{h}+C_{v}+C_{o},\qquad S_{\mathrm{critic}}\in[0,100],$$

with:

$$\begin{aligned}
C_{a} &\in[0,20] &&\text{accuracy of identification},\\
C_{d} &\in[0,15] &&\text{depth of analysis},\\
C_{f} &\in[0,15] &&\text{constructive feedback quality},\\
C_{t} &\in[0,15] &&\text{technical understanding},\\
C_{c} &\in[0,10] &&\text{clarity of communication},\\
C_{h} &\in[0,10] &&\text{comprehensiveness},\\
C_{v} &\in[0,10] &&\text{pedagogical value},\\
C_{o} &\in[0,5] &&\text{objectivity and fairness}.
\end{aligned}$$

This score is recorded but does not feed the pipeline decisions or the present analysis. In the dynamical language of [Appendix E](https://arxiv.org/html/2605.06772#A5), it could become a second observable alongside the Actor score, separating poor feedback from useful feedback that the Actor fails to take up. We leave this two-observable analysis for follow-up work.

##### Critic accuracy ($C_{a}$).

Error-detection quality is quantified using precision and recall,

$$P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN}.$$

High scores require both high precision and high recall.
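As a small worked example of these two ratios, the sketch below assumes the usual reading of the counts: $TP$ are genuine errors the Critic flagged, $FP$ are flagged steps that were actually correct, and $FN$ are genuine errors the Critic missed. The function name and the zero-denominator convention are assumptions made here, not part of the rubric.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision P = TP / (TP + FP) and recall R = TP / (TP + FN).

    Assumption: an empty denominator is reported as 0.0; the rubric does
    not say how the Judge treats that case.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Critic flags 4 steps, 3 of them real errors, and misses 2 real errors.
p, r = precision_recall(tp=3, fp=1, fn=2)
```

Here the Critic is fairly precise (three of four flags are real) but recovers only three of the five real errors, which is the kind of imbalance the "both high" requirement penalizes.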

##### Critic depth ($C_{d}$).

Depth of analysis is measured by

$$D=\frac{N_{\mathrm{examined\ steps}}}{N_{\mathrm{total\ steps}}}.$$

##### Constructiveness ($C_{f}$).

Actionability is measured by

$$A=\frac{N_{\mathrm{actionable\ suggestions}}}{N_{\mathrm{total\ suggestions}}}.$$

##### Technical understanding ($C_{t}$).

Conceptual accuracy is measured by

$$T=\frac{N_{\mathrm{correct\ concepts}}}{N_{\mathrm{referenced\ concepts}}}.$$

##### Comprehensiveness ($C_{h}$).

Coverage of relevant issues is measured by

$$H=\frac{N_{\mathrm{addressed\ aspects}}}{N_{\mathrm{problem\ aspects}}}.$$

#### Pass criterion\.

A single Actor turn is marked as passing only if all of the following hold:

$$\mathrm{Pass}=\Bigl(\mathrm{Eq}(A,R)=1\Bigr)\wedge\Bigl(S_{\mathrm{actor}}\geq 80\Bigr)\wedge\Bigl(S_{c}\geq 40\Bigr).$$

In implementation, a non-equivalent final result always forces $\mathrm{Pass}=0$. A run is then counted as converged if at least one of its iterations satisfies this turn-level pass rule under the scoring Judge ([Section 2.2](https://arxiv.org/html/2605.06772#S2.SS2)).
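The pass rule and the run-level convergence rule are simple enough to write down directly. The following is a minimal sketch, not the SCALAR implementation: the `TurnVerdict` container and its field names are hypothetical stand-ins for whatever the Judge's JSON actually carries.

```python
from dataclasses import dataclass


@dataclass
class TurnVerdict:
    """Hypothetical per-turn summary of the Judge output."""
    final_equivalent: bool  # Eq(A, R) == 1
    actor_score: float      # S_actor in [0, 100]
    correctness: float      # S_c in [0, 50]


def turn_passes(v: TurnVerdict) -> bool:
    """All three conditions must hold; non-equivalence alone forces a fail."""
    return v.final_equivalent and v.actor_score >= 80 and v.correctness >= 40


def run_converged(verdicts: list[TurnVerdict]) -> bool:
    """A run converges if at least one of its turns passes."""
    return any(turn_passes(v) for v in verdicts)


run = [
    TurnVerdict(final_equivalent=False, actor_score=88, correctness=45),  # wrong answer
    TurnVerdict(final_equivalent=True, actor_score=76, correctness=44),   # score too low
    TurnVerdict(final_equivalent=True, actor_score=84, correctness=42),   # passes
]
converged = run_converged(run)
```

The first turn shows why the equivalence gate matters: a well-written derivation with a high rubric score still fails if its final result does not match the reference.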

#### Reference\-free use\.

For known textbook problems, correctness and convergence rely on the reference answer through $\mathrm{Eq}(A,R)$. In a future open-problem setting this part of the rubric would have to be replaced by external validation: symbolic checks, limiting cases, simulations, experiments, theorem-prover output, or expert consistency review. The process criteria (mathematical rigor, logical flow, justification quality, completeness relative to the stated task, and physical consistency) remain meaningful without a reference solution. By contrast, “correctness”, Critic error-identification accuracy, and the present pass rule are reference-dependent and would become measures of constraint satisfaction rather than answer matching. In the dynamical-systems language of [Appendix E](https://arxiv.org/html/2605.06772#A5), removing the reference solution would keep the score projection but alter the drift, since the Critic's directional pull is currently reference-conditioned.

#### Error taxonomy\.

The Judge also records categorical errors\. At the solution level, major issues are classified into:

`COMPUTATIONAL_ERROR`, `CONCEPTUAL_ERROR`, `METHODOLOGICAL_ERROR`, `DIMENSIONAL_ERROR`, `BOUNDARY_CONDITION_ERROR`, `CONVERGENCE_ERROR`, and `APPROXIMATION_ERROR`.

Most categories are self-explanatory. The less obvious labels are: `METHODOLOGICAL_ERROR`, for using an inappropriate setup or solution method; `BOUNDARY_CONDITION_ERROR`, for missing or misapplied boundary/normalization conditions; `CONVERGENCE_ERROR`, for invalid limiting, integral, or series-convergence reasoning; and `APPROXIMATION_ERROR`, for unjustified expansions or dropped terms.

#### Binary error flags\.

The Judge records the following binary diagnostic flags:

`SIGN_ERROR`, `MISSING_TERM`, `PRODUCT_RULE_ERROR`, `CHAIN_RULE_ERROR`, `BOUNDARY_CONDITION_MISAPPLIED`, `UNIT_MISMATCH`, `ALGEBRA_SIMPLIFY_FAIL`, `LIMIT_ERROR`, `NORMALIZATION_ERROR`, `INCOMPLETE_EXPR`, `TEST_FAIL`, `TENSOR_INDEX_ERROR`, `DIMENSIONAL_CONSISTENCY`, `SYMMETRY_PRESERVATION`, `COMBINATORIAL_FACTORS`, `REGULARIZATIONAL_CONSISTENCY`, `MISSING_JUSTIFICATION_NON_TRIVIAL_STEPS`, `LIMITING_BEHAVIOR_FAIL`, `POSITIVITY_AND_REALITY_CONSTRAINTS_VIOLATION`, `VIOLATION_OF_WARD_IDENTITIES`.

Most labels are self-explanatory. The less standard ones correspond to failed problem-specific checks: `TEST_FAIL` marks failure of an explicit validation test, `COMBINATORIAL_FACTORS` covers missing symmetry or counting factors, and `VIOLATION_OF_WARD_IDENTITIES` marks failure of gauge/current-conservation constraints. The names are legacy JSON keys, so some positive-sounding labels are interpreted as diagnostic indicators when set to true.

#### Missing\-content labels\.

The Judge may also assign missing\-content labels to the Actor solution:

`UNJUSTIFIED_CLAIMS`, `INCORRECT_RESULT`, `INCOMPLETE_DERIVATION`, `MISSING_BOUNDARY_CONDITIONS`, `DIMENSIONAL_INCONSISTENCY`, and `CONVERGENCE_FAILURE`.

#### Progress signal\.

To assess whether Critic intervention changed the solution appreciably, the Judge also records an Actor–Critic progress indicator:

$$\mathrm{Progress}=\mathbf{1}\!\left(\left|S_{\mathrm{current}}-S_{\mathrm{previous}}\right|>\delta\right),$$

where $\delta$ is a threshold, typically set to $5$ points.
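Because the indicator uses an absolute difference, it fires on large drops as well as large gains, and the inequality is strict. A minimal sketch, with a hypothetical function name and the default $\delta=5$ taken from the text:

```python
def progress(current: float, previous: float, delta: float = 5.0) -> int:
    """Indicator 1(|S_current - S_previous| > delta).

    Returns 1 when the score moved by more than delta in either direction.
    """
    return int(abs(current - previous) > delta)


gained = progress(current=70, previous=62)     # moved up by 8
stalled = progress(current=70, previous=66)    # moved up by 4
regressed = progress(current=60, previous=70)  # moved down by 10
```

A change of exactly $\delta$ points does not count, so `progress(65, 60)` evaluates to `0` under this reading.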

#### Implementation note\.

The Judge is required to return a strict JSON object containing the above fields\. If the returned structure is invalid or unparsable, the system defaults to a failing verdict and records the malformed response for debugging\.
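The fallback behavior described here might look like the sketch below. It is an assumption-laden illustration: the field names `pass` and `actor_score`, the shape of the failing verdict, and the use of an in-memory list as the debug log are all hypothetical, since the paper does not give the JSON schema.

```python
import json

# Hypothetical minimal failing verdict; the real schema has more fields.
FAILING_VERDICT = {"pass": False, "actor_score": 0}


def parse_verdict(raw: str, malformed_log: list[str]) -> dict:
    """Parse a Judge response, defaulting to a failing verdict on bad output."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        malformed_log.append(raw)  # keep the raw text for debugging
        return dict(FAILING_VERDICT)
    # Valid JSON that is not an object, or lacks the verdict key, is also invalid.
    if not isinstance(verdict, dict) or "pass" not in verdict:
        malformed_log.append(raw)
        return dict(FAILING_VERDICT)
    return verdict


log: list[str] = []
ok = parse_verdict('{"pass": true, "actor_score": 91}', log)
bad = parse_verdict("The solution looks correct to me.", log)
```

Defaulting to a fail is the conservative choice: a malformed response can lower the measured convergence rate but cannot create a spurious pass.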

The component-score diagnostic in [Figure 8](https://arxiv.org/html/2605.06772#A3.F8) decomposes the Actor score into the six rubric criteria defined in [Section 2.2](https://arxiv.org/html/2605.06772#S2.SS2). Each line is normalized to the maximum value of that criterion, so the plot shows which parts of the Judge score move during the dialogue rather than how many raw rubric points each criterion contributes. This is not an additional aggregate evaluation metric used in the statistical tests; it is a decomposition of the score that $\bar{s}$ summarizes. The figure is descriptive: because SCALAR stops runs after a passing verdict, later turns average over the subset of runs still active at that turn.

![Refer to caption](https://arxiv.org/html/2605.06772v1/x7.png)

Figure 8: Actor score-component evolution under QWQ scoring. Each component is normalized to its rubric maximum before averaging. Haiku uses the two Peskin problems; DS8B and DS70B use all three problems. This diagnostic decomposes the Judge score and is not an additional aggregate evaluation metric.

## Appendix D Evaluation Metrics and Statistical Procedures

This appendix supplements [Section 2.2](https://arxiv.org/html/2605.06772#S2.SS2) with details not in the main text: the precise pass rule used in the convergence count, edge cases for runs that terminate early, and the specific statistical test implementations. The scoring rubric itself (the six Judge dimensions, pass criterion, and error flags) is described in [Appendix C](https://arxiv.org/html/2605.06772#A3).

#### Iteration range and early stopping\.

The iteration cap is $T_{\max}=4$ for the DeepSeek-family runs and $T_{\max}=5$ for the Haiku runs; a run terminates earlier at $T<T_{\max}$ when the Judge used during generation issues a passing verdict or when the score-stagnation trigger fires. The reported evaluation metrics use the actual $T$ of each run, so $\bar{s}_{i}$ and $g_{i}$ are comparable across runs of different length. Runs that terminate after a single iteration ($T=1$) have $g_{i}=0$ by construction.
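To make the two run-level quantities concrete, the sketch below assumes that $\bar{s}_{i}$ is the mean of the per-turn Judge scores over the $T$ turns actually played and that $g_{i}$ is the last score minus the first, matching the gain definition used for the Wilcoxon test in this appendix. The function names are hypothetical.

```python
from statistics import fmean


def mean_score(scores: list[float]) -> float:
    """Mean per-turn score over the turns the run actually used."""
    return fmean(scores)


def endpoint_gain(scores: list[float]) -> float:
    """Last-turn score minus first-turn score; zero for single-turn runs."""
    return scores[-1] - scores[0]


three_turn_run = [40.0, 55.0, 70.0]  # stopped at T = 3, below the cap
single_turn_run = [86.0]             # passed on the first attempt

s_bar = mean_score(three_turn_run)
gain = endpoint_gain(three_turn_run)
```

Dividing by the actual run length rather than by $T_{\max}$ is what keeps early-stopped runs comparable with full-length ones; a first-turn pass contributes a high $\bar{s}_{i}$ and a gain of exactly zero.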

#### Pass rule behind the convergence rate\.

The turn-level pass rule from [Appendix C](https://arxiv.org/html/2605.06772#A3) requires final-answer equivalence with the reference, total Actor score $\geq 80$, and correctness component $\geq 40$; a non-equivalent final answer forces $\mathrm{Pass}=0$ regardless of the other components. A run is counted as converged if any iteration of the dialogue satisfies this rule under the scoring Judge. For runs whose original loop was driven by the same Judge that scores them, this is equivalent to a final-turn pass. For re-scored transcripts, the loop length is fixed by the original Judge, so the scoring Judge can mark a non-terminal iteration as the convergence event.

#### Statistical tests\.

Each run contributes one independent observation for each response variable. All tests are two-sided, non-parametric (no Gaussianity assumption on the score distributions), and use the default `scipy.stats` implementations, with the classical tests cited below for provenance.

Unless explicitly stated otherwise, the statistical response variables are the raw run-level evaluation quantities $\bar{s}_{i}$, $g_{i}$, and $r_{i}$. The problem-normalized contrasts $D_{\bar{s}}$ and $D_{R}$ are descriptive summaries used in [Figure 5](https://arxiv.org/html/2605.06772#S3.F5); they are not inputs to the $p$-values reported in the Results.

Throughout the paper we use the conventional $p<0.05$ threshold to flag “statistically significant” results: the test statistic is at least as extreme as the $95$th percentile of its null distribution. Equivalently, under the null hypothesis (no real effect), a result this surprising or more would occur less than $1$ in $20$ times. Smaller $p$-values carry stronger evidence against the null. In the main text, each quoted $p$-value belongs to the statistical test named in the same sentence, usually a Kruskal–Wallis omnibus test across Critic strategies or Actor personas. When a re-scoring check is mentioned without a new $p$-value, it is used only as a sensitivity check on the direction of the conclusion, not as a separate headline test.

- Wilcoxon signed-rank on paired $(s_{0},s_{T-1})$ scores (`scipy.stats.wilcoxon`; Wilcoxon, [1945](https://arxiv.org/html/2605.06772#bib.bib6)). A paired-sample test that asks whether the median of the within-run gains $g_{i}=s_{i,T_{i}-1}-s_{i,0}$ differs from zero; we use it to assess whether the multi-turn dialogue induces a non-zero gain over the population of runs.
- Kruskal–Wallis $H$-test on $\bar{s}$ across the five Critic strategies (`scipy.stats.kruskal`; Kruskal and Wallis, [1952](https://arxiv.org/html/2605.06772#bib.bib8)). A rank-based multi-group test that asks whether samples from $\geq 2$ groups come from a common distribution; we use it to assess whether the Critic feedback strategy label has any overall effect on $\bar{s}$, applied separately for each Actor–Critic pairing.
- Mann–Whitney $U$-test for specific pairwise Critic feedback strategy comparisons (`scipy.stats.mannwhitneyu`; Mann and Whitney, [1947](https://arxiv.org/html/2605.06772#bib.bib7)). A two-sample rank-based test that asks whether values from one group tend to be larger than values from another; we use it to assess whether a particular pair of Critic feedback strategies (*e.g.*, pedagogical vs. strict) differ. We report uncorrected $p$-values; where multiple comparisons contribute to the same claim we flag this alongside the relevant result. We do not use rank correlations between five Critic feedback strategy means as evidential tests, because with only five ranks they are too coarse to support an independent statistical claim.
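The three tests can be called as in the sketch below. All numbers are made up for illustration and are not results from the paper; the group keys `strategy_1` to `strategy_5` are placeholders for the five Critic strategies, and `scipy` must be installed.

```python
from scipy import stats

# Made-up first-turn and last-turn scores for eight runs (illustration only).
s_first = [41, 55, 38, 62, 47, 50, 33, 58]
s_last = [52, 61, 46, 60, 59, 66, 42, 71]

# Paired test on (s_0, s_{T-1}): is the median within-run gain nonzero?
w = stats.wilcoxon(s_first, s_last, alternative="two-sided")

# Made-up mean per-turn scores grouped by Critic strategy (placeholder keys).
groups = {
    "strategy_1": [61, 58, 66, 70, 63],
    "strategy_2": [55, 60, 52, 57, 59],
    "strategy_3": [48, 51, 45, 53, 50],
    "strategy_4": [62, 49, 56, 67, 54],
    "strategy_5": [44, 47, 52, 41, 46],
}

# Omnibus test: do the five groups share a common distribution?
h = stats.kruskal(*groups.values())

# Pairwise follow-up between two chosen strategies, uncorrected.
u = stats.mannwhitneyu(
    groups["strategy_1"], groups["strategy_3"], alternative="two-sided"
)

for name, result in [("wilcoxon", w), ("kruskal", h), ("mannwhitneyu", u)]:
    print(f"{name}: statistic={result.statistic:.3f}, p={result.pvalue:.4f}")
```

The omnibus test is run first and pairwise comparisons are read as follow-ups; since the pairwise $p$-values are uncorrected, several of them feeding one claim should be interpreted with the multiple-comparison caveat stated above.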

#### Re\-scoring protocol\.

Re-scoring Judges score the stored Actor–Critic transcripts using the same Judge prompt ([Appendix A](https://arxiv.org/html/2605.06772#A1)) but a different Judge LLM; the Actor and Critic are not re-run, and the original stopping decisions are preserved. The Haiku transcripts are re-scored by QWQ and by DS70B in addition to the primary Sonnet scoring, and the DS70B and DS8B transcripts are re-scored by DS70B in addition to the primary QWQ scoring. This supports the cross-Judge consistency checks reported in [Section 3](https://arxiv.org/html/2605.06772#S3).

![Refer to caption](https://arxiv.org/html/2605.06772v1/x8.png)

Figure 9: Endpoint gain distributions under QWQ scoring. Haiku uses the two Peskin problems ($n=383$), while DS8B and DS70B use all three problems ($n=900$ each). Dashed vertical lines mark the median gain.

## Appendix E Score-Update Field and Markov Projection

The score-update curves in [Figure 4](https://arxiv.org/html/2605.06772#S3.F4) estimate a one-step score drift; they are not inputs to the hypothesis tests in the Results. Let $X_{t}$ denote the full SCALAR state after Actor turn $t$: problem, persona, Critic feedback strategy, transcript, scoring state, and model-sampling rule. Conditional on the experimental cell, the algorithm induces a one-step transition

$$p(X_{t+1}\mid X_{t},X_{t-1},\ldots)=p(X_{t+1}\mid X_{t}).$$

The scalar Judge score is a projection $s_{t}=S(X_{t})$, and the useful question is how much of the dialogue dynamics remains visible in this one-dimensional coordinate. For run $i$, let $s_{i,t}$ be the Judge score after Actor turn $t$ and define $\Delta s_{i,t}=s_{i,t+1}-s_{i,t}$. For a score bin $B_{b}$, we estimate the empirical drift

$$\hat{v}_{b}=\frac{1}{|\mathcal{T}_{b}|}\sum_{(i,t)\in\mathcal{T}_{b}}\Delta s_{i,t},\qquad\mathcal{T}_{b}=\{(i,t):s_{i,t}\in B_{b}\}.$$

The shaded bands in [Figure 4](https://arxiv.org/html/2605.06772#S3.F4) are bootstrap confidence intervals for $\hat{v}_{b}$ over runs. When the fitted curve crosses zero, we denote the crossing by $s^{\ast}$ and call it a projected fixed point. This means only that the observed next-turn update vanishes on average near that score; it is not necessarily a complete fixed point of the full recorded dialogue state.
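The drift estimator is a binned average of one-step score changes, which the sketch below computes with the standard library. Equal-width bins of 10 points are an assumption made here (the paper does not state its binning), the function name is hypothetical, the scores are made up, and the bootstrap bands are omitted.

```python
from collections import defaultdict
from statistics import fmean


def empirical_drift(runs: list[list[float]], bin_width: float = 10.0) -> dict[int, float]:
    """Map bin index b to the mean of s[t+1] - s[t] over all (i, t) with s[i][t] in bin b.

    Bin b covers scores in [b * bin_width, (b + 1) * bin_width).
    """
    deltas: dict[int, list[float]] = defaultdict(list)
    for scores in runs:
        # Each consecutive pair of turns in a run contributes one transition.
        for s_now, s_next in zip(scores, scores[1:]):
            deltas[int(s_now // bin_width)].append(s_next - s_now)
    return {b: fmean(d) for b, d in sorted(deltas.items())}


# Three made-up runs of different length (early stopping shortens some).
runs = [[35.0, 50.0, 62.0], [38.0, 44.0], [55.0, 58.0, 57.0]]
drift = empirical_drift(runs)
```

In this toy data the 30s bin has a larger average update than the 50s bin, the kind of shrinking drift that would eventually produce a zero crossing $s^{\ast}$ at higher scores.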

One can also estimate the local fluctuation scale

$$\hat{D}_{b}=\frac{1}{2}\operatorname{Var}_{(i,t)\in\mathcal{T}_{b}}\!\left(\Delta s_{i,t}\right),$$

the empirical analogue of diffusion for the projected score. Here stochasticity comes both from nonzero-temperature LLM sampling and from projecting many transcript states to the same score. We do not fit a continuous Fokker–Planck or Smoluchowski model here: the trajectories are short, early stopping censors the active ensemble, and passing depends on final-answer equivalence rather than on a score threshold alone. A future temperature sweep could separate drift-dominated improvement from noise-assisted escape out of low-drift regions.

Discarding the transcript gives a score\-only Markov closure,

$$p(s_{t+1}\mid s_{t},s_{t-1},\ldots)\approx p(s_{t+1}\mid s_{t}),$$

after conditioning on the experimental cell. This does not follow from the transcript-level Markov property: two transcripts with the same score can contain different physics errors. As an exploratory DS70B/QWQ check, we regressed the residual $\Delta s_{t}-\hat{v}(s_{t})$ on the previous score $s_{t-1}$ within each problem; the fitted memory terms were small compared with their standard errors. Thus the present data are consistent with the score-only closure, while longer held-out trajectories are needed for predictive validation.

The predictive version would fit a discrete transition kernel over score bins with an absorbing pass state $\mathcal{P}=\{s\geq 80\ \text{and final answer equivalent}\}$. On held-out trajectories this kernel could predict score histograms, convergence rates, or mean remaining turns; in the absorbing-chain idealization, the non-pass bins obey $\tau=(I-K_{BB})^{-1}\mathbf{1}$, where $K_{BB}$ is the kernel restricted to transitions among non-pass bins and $\tau$ is the vector of mean remaining turns starting from each bin.
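For a small kernel the absorbing-chain formula can be evaluated by solving $(I-K_{BB})\tau=\mathbf{1}$ directly. The sketch below uses a made-up two-bin kernel and a plain Gauss-Jordan solve; the function name is hypothetical, and the idealization ignores the iteration cap $T_{\max}$.

```python
def expected_remaining_turns(k_bb: list[list[float]]) -> list[float]:
    """Solve (I - K_BB) tau = 1 by Gauss-Jordan elimination with partial pivoting.

    k_bb[i][j] is the probability of moving from non-pass bin i to non-pass
    bin j in one turn; the missing row mass is the probability of passing.
    """
    n = len(k_bb)
    # Augmented matrix [I - K_BB | 1].
    a = [
        [(1.0 if i == j else 0.0) - k_bb[i][j] for j in range(n)] + [1.0]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("I - K_BB is singular; the pass state may be unreachable")
        a[col], a[pivot] = a[pivot], a[col]
        scale = a[col][col]
        a[col] = [x / scale for x in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [row[-1] for row in a]


# Made-up kernel: a low-score bin and a mid-score bin; leftover mass passes.
k_bb = [[0.5, 0.3],
        [0.2, 0.4]]
tau = expected_remaining_turns(k_bb)
```

Each entry satisfies the one-step recursion $\tau_{i}=1+\sum_{j}K_{ij}\tau_{j}$, which is a quick consistency check on any fitted kernel before comparing its predictions with held-out trajectories.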
