From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

arXiv cs.LG Papers

Summary

Introduces Multi-Agent Residual In-Context Learning (MARICL), an agentic framework that uses LLM agents to analyze residuals from a base model on tabular data, hypothesize missing structure, and produce explicit correction terms via textual gradient optimization. Across nine benchmarks, MARICL consistently improves over its base model and demonstrates mechanistic generalization in cell-free protein predictions.

arXiv:2605.22897v1 Announce Type: new Abstract: A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi-Agent Residual In-Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base-model fails, hypothesize missing structure from high-residual examples provided in context, and produce explicit correction terms refined through multi-turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch-specific noise, we freeze formulas learned on one experimental batch of the Cell-Free Protein dataset and apply them (with no retraining and no further LLM calls) to held-out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count; direct evidence of mechanistic generalization.

Cached at: 05/25/26, 08:56 AM

# From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data
Source: [https://arxiv.org/html/2605.22897](https://arxiv.org/html/2605.22897)
Mohammad R. Rezaei, Department of Computer Science, University of Toronto; Vector Institute, Toronto, ON, Canada (mr.rezaei@mail.utoronto.ca)

Rahul G. Krishnan, Department of Computer Science, University of Toronto; Vector Institute, Toronto, ON, Canada (rahulgk@cs.toronto.edu)

###### Abstract

A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely *inspective*: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi-Agent Residual In-Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base-model fails, hypothesize missing structure from high-residual examples provided in context, and produce explicit correction terms refined through multi-turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch-specific noise, we freeze formulas learned on one experimental batch of the Cell-Free Protein dataset and apply them (with no retraining and no further LLM calls) to held-out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count: direct evidence of mechanistic generalization. The code is available at [GitHub](https://github.com/MrRezaeiUofT/Multi_Agent_Residual_In_Context_Learning).

## 1 Introduction

Tabular machine learning faces a recurring tradeoff between accuracy and interpretability. Gradient boosting and related methods are highly accurate on tabular data (Grinsztajn et al., [2022](https://arxiv.org/html/2605.22897#bib.bib48); McElfresh et al., [2024](https://arxiv.org/html/2605.22897#bib.bib49)) but expose nothing about the relationships they have learned. Post hoc methods address this only partially: SHAP (Lundberg and Lee, [2017](https://arxiv.org/html/2605.22897#bib.bib50)) attributes a prediction to its features and TreeSHAP extends this to pairwise interactions (Lundberg et al., [2020](https://arxiv.org/html/2605.22897#bib.bib51)), but the attributions are per-sample weights on a fixed model, not a global formula that can be inspected, edited, or improved. Inherently interpretable models such as GAMs (Hastie and Tibshirani, [1987](https://arxiv.org/html/2605.22897#bib.bib54)) and EBMs (Nori et al., [2019](https://arxiv.org/html/2605.22897#bib.bib57)) are competitive on tabular benchmarks, but they fit smooth per-feature shape functions (and pairwise terms in GA$^2$Ms) rather than named symbolic expressions tied to specific feature combinations and coefficients. Symbolic regression (Cranmer, [2023](https://arxiv.org/html/2605.22897#bib.bib63); Shojaee et al., [2025](https://arxiv.org/html/2605.22897#bib.bib61)) produces equations, but it fits the target end-to-end rather than diagnosing where and why a given model fails.

Recent work on *agentic interpretability* (Kim et al., [2025](https://arxiv.org/html/2605.22897#bib.bib56)) argues that LLM agents can go beyond one-shot inspection by reasoning across multiple turns to refine explanations. The practical difficulty is that the loop has too many directions to explore and rarely converges on something testable unless it is constrained. We ask: how can an agentic loop produce both accurate predictions and inspectable explanations under such a constraint?

![Refer to caption](https://arxiv.org/html/2605.22897v1/Figures/figure1_maricl_schematic_v2.png)

Figure 1: MARICL framework overview: (1-2) a base-model generates predictions, (3) residual analysis selects high-error examples, (4-5) an LLM encoder produces structured hypotheses $z_k$ that a decoder converts into explanations $T_k$ and executable formulas, (6) textual gradient optimization refines corrections via critique feedback, and (7-8) query-aware aggregation.

A natural baseline is *LLM-ICL*: place the entire training set in an LLM's context as $(x_i, y_i)$ pairs and ask it to predict $y$ for each test query in a single forward pass. On cell-free protein yield prediction this reaches only $R^2 = 0.35$, worse than ordinary linear regression. The failure is informative: a single forward pass is being asked to regress, identify nonlinear structure, and ground itself in the numerical scale of the data all at once, over the full output domain. The hypothesis space is simply too large for one shot to localize.

We therefore split the problem. A statistical *base-model* (linear regression, XGBoost, or any validated predictor) handles the regression scaffolding and fixes the output scale. The LLM is then asked a much narrower question: *what is the base-model missing?* Its target shrinks from the full $y$ to the residual $r = y - \hat{y}_{\text{ML}}$, and its job shrinks from end-to-end prediction to articulating the structured failure modes visible in high-residual examples. This gives us two design choices: (i) keep the base-model as an anchor on the predictive task, and (ii) restrict the LLM to explaining the residual signal the base-model leaves behind.

These two choices define Multi-Agent Residual In-Context Learning (MARICL) (Figure [1](https://arxiv.org/html/2605.22897#S1.F1)). An *encoder-agent* reads high-residual training examples and produces structured hypotheses. A *decoder-agent* compiles each hypothesis into an executable correction term: a named formula over specific features. *Textual gradient optimization* (Yuksekgonul et al., [2024](https://arxiv.org/html/2605.22897#bib.bib8)), an iterative loop in which an LLM critiques its own output in natural language and proposes refinements in place of the numerical gradients used in standard optimization, then refines each correction by sharpening the formula on the examples where it still fails. We run $K$ such agents in parallel and aggregate them with performance-weighted ensembling. This aggregation is query-aware: each correction term is gated by a learned weight $\alpha$ (such as the $0.28$ coefficient in Table [1](https://arxiv.org/html/2605.22897#S1.T1)), which reflects the agent's inferred expertise based on the query's proximity to specific high-residual clusters in the training data.

The cell-free protein example illustrates how this plays out in practice. *Cell-free protein synthesis* is a biochemistry technique that produces proteins in a test tube using a cell extract plus added reagents (energy substrates, cofactors, polyamines); the prediction task is yield as a function of reagent concentrations. On the sample in Table [1](https://arxiv.org/html/2605.22897#S1.T1), the base model underpredicts yield when NAD (an energy cofactor) and spermidine (a polyamine that boosts translation) are both high. MARICL infers cofactor synergy, generates the interaction term NAD $\times$ sperm, then refines it with a saturation term for folinic acid (which has diminishing returns). Across the dataset this lifts $R^2$ from $0.35$ to $0.65$ ($+0.30$ over LLM-ICL; full trace in Appendix [A](https://arxiv.org/html/2605.22897#A1)).
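The arithmetic behind this example can be checked directly. The sketch below plugs in the sample values reported in Table 1 (NAD $= 0.8$, sperm $= 0.7$, fol $= 0.3$, $y = 0.72$, base prediction $0.58$, $\alpha = 0.28$); the function names are illustrative and are not taken from the authors' code.

```python
def interaction_term(nad: float, sperm: float, coef: float = 0.5) -> float:
    """Initial correction f^(0): cofactor synergy as a scaled product."""
    return coef * nad * sperm


def saturation_term(fol: float, vmax: float = 0.5, km: float = 0.5) -> float:
    """Michaelis-Menten style saturation term added at refinement step t=1."""
    return vmax * fol / (km + fol)


def maricl_prediction(base_pred: float, nad: float, sperm: float,
                      fol: float, alpha: float) -> float:
    """Base prediction plus the alpha-weighted refined correction f^(1)."""
    f0 = interaction_term(nad, sperm)
    f1 = f0 + saturation_term(fol)
    return base_pred + alpha * f1


# Sample values reported in Table 1.
y_true, base_pred, alpha = 0.72, 0.58, 0.28
y_hat = maricl_prediction(base_pred, nad=0.8, sperm=0.7, fol=0.3, alpha=alpha)
print(round(y_hat, 3), round(abs(y_true - y_hat), 3))  # 0.711 0.009
```

The correction itself is plain arithmetic over named features, which is what makes it inspectable and editable.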

Table 1: MARICL vs. LLM-ICL on the cell-free protein example (NAD $= 0.8$, sperm $= 0.7$, fol $= 0.3$, $y = 0.72$, base $\hat{y}_{\text{ML}} = 0.58$). MARICL targets only the residual; LLM-ICL predicts $y$ from scratch.

| Stage | MARICL | LLM-ICL |
| --- | --- | --- |
| Residual analysis | High-error samples cluster at NAD $> 0.6$, sperm $> 0.4$; feeds encoder. | no base to compare against |
| Hypothesis | "NAD–spermidine cofactor synergy drives underprediction." | implicit in one forward pass |
| Formula $t = 0$ | $f^{(0)} = 0.5 \cdot \text{NAD} \times \text{sperm} = 0.28$ | no executable correction |
| Critique $t = 0$ | "Folinic acid saturates; add Michaelis–Menten term." | no feedback loop |
| Formula $t = 1$ | $f^{(1)} = f^{(0)} + \dfrac{0.5 \cdot \text{fol}}{0.5 + \text{fol}} = 0.4675$ | no refinement |
| Weighting | $\alpha = 0.28$ (learned via query-aware aggregation) | N/A |
| Prediction | $\hat{y} = f_{\text{ML}} + \alpha f^{(1)} = 0.58 + (0.28 \times 0.4675) \approx \mathbf{0.711}$; error $\approx \mathbf{0.009}$ | $\hat{y} \approx 0.52$; error $\approx 0.20$ ($\sim 22\times$ larger) |

On nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic domains, MARICL improves over its base model on every dataset. Gains are largest where the base is weakest (e.g., $+0.236\,\Delta R^2$ over a linear base on Cell-Free Protein) and smaller but consistent over stronger bases such as XGBoost.

Contributions. (1) MARICL: an agentic framework that produces named, executable correction terms over a base model through structured hypothesis generation and iterative refinement. (2) A residual-conditioned adaptation of textual-gradient optimization (Yuksekgonul et al., [2024](https://arxiv.org/html/2605.22897#bib.bib8)), paired with a distance-based aggregation that down-weights each correction when applied far from the residuals it was inferred from. (3) A layered ablation that progressively removes the LLM's pretraining priors (feature names, domain context, frontier-model capability), together with a planted-ground-truth synthetic benchmark no LLM has seen. Together these lower-bound the data-driven share of MARICL's gain at $\sim 50\%$ on real benchmarks. (4) A cross-plate transfer experiment on Cell-Free Protein, where each *plate* is one experimental batch. Correction formulas frozen on one plate and applied verbatim to other plates improve predictions in over $92\%$ of plate pairs within the same reagent protocol and fail systematically across protocols, which is evidence that the corrections capture biochemical mechanism rather than batch-specific noise.

## 2 Methods

Algorithm 1: MARICL Training Algorithm

1. Input: Training data $\mathcal{D}_{\text{train}}$, base-model $f_{\text{ML}}$, number of corrections $K$, refinement iterations $T$, residual fraction $\kappa$, performance threshold $p_{\min}$, batch size $B$
2. Output: Correction ensemble $\mathcal{M}^{*}$, performance scores $\{p_k\}$
3. Compute residuals $r_i$ for all $(\mathbf{x}_i, y_i) \in \mathcal{D}_{\text{train}}$ via Eq. [3](https://arxiv.org/html/2605.22897#S2.E3)
4. Construct $\mathcal{D}_{\text{high-res}}$ as top-$\kappa$ fraction by $|r_i|$
5. Build augmented context $\mathcal{C}_{\text{aug}}$ (Eq. [5](https://arxiv.org/html/2605.22897#S2.E5))
6. for $k = 1$ to $K$ do
7. if $|\mathcal{D}_{\text{high-res}}| > B$ then
8. Partition into batches; encode each via Eq. [7](https://arxiv.org/html/2605.22897#S2.E7)
9. $z_k^{(0)} \leftarrow \text{Concat}(z_{k,1}, z_{k,2}, \ldots)$
10. else
11. $z_k^{(0)} \leftarrow \text{Encode}_{\text{LM}}(\mathcal{C}_{\text{aug}}, \mathcal{P}_{\text{encoder}}^{(k)})$
12. end if
13. $(T_k^{(0)}, f_k^{(0)}) \leftarrow \text{Decode}_{\text{LM}}(z_k^{(0)}, \mathcal{P}_{\text{decoder}})$
14. Validate $f_k^{(0)}$; regenerate if invalid
15. $\mathcal{S}_k^{(0)} \leftarrow \{(z_k^{(0)}, m_k^{(0)})\}$
16. end for
17. for $k = 1$ to $K$ do
18. for $t = 0$ to $T - 1$ do
19. Evaluate $\mathcal{L}_k^{(t)}$ on $\mathcal{D}_{\text{train}}$ using $(T_k^{(t)}, f_k^{(t)})$
20. Identify failure set $\mathcal{E}_k^{(t)}$ (Eq. [10](https://arxiv.org/html/2605.22897#S2.E10))
21. Generate critique $g_k^{(t)}$ (Eq. [11](https://arxiv.org/html/2605.22897#S2.E11))
22. $\mathcal{S}_k^{(t+1)} \leftarrow \mathcal{S}_k^{(t)} \cup \{(z_k^{(t)}, m_k^{(t)}, \mathcal{L}_k^{(t)}, g_k^{(t)})\}$
23. Refine: $(T_k^{(t+1)}, f_k^{(t+1)})$ via Eq. [13](https://arxiv.org/html/2605.22897#S2.E13)
24. Validate $f_k^{(t+1)}$; regenerate if invalid
25. end for
26. $m_k^{*} \leftarrow \arg\min_{t} \mathcal{L}_k^{(t)}$; compute $p_k$ (Eq. [17](https://arxiv.org/html/2605.22897#S2.E17))
27. end for
28. return $\mathcal{M}^{*} = \{m_k^{*} : p_k > p_{\min}\}$, $\{p_k\}$

We consider supervised learning on $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ (regression) or $y_i \in \{1, \ldots, C\}$ (classification), split into train, validation, and test sets. We assume access to a pretrained base model $f_{\text{ML}} : \mathbb{R}^d \rightarrow \mathcal{Y}$ (linear, tree, or gradient-boosted). MARICL learns $K$ correction agents whose predictions are added on top of the base model; interpretability lives in the corrections, which name what the base model systematically misses.

Each agent produces a correction $m_k = (T_k, f_k)$ that pairs a natural-language template $T_k$ with a closed-form formula $f_k$. For regression, $f_k : \mathbb{R}^d \rightarrow \mathbb{R}$ (clipped to the scaled target range; Appendix [N.3](https://arxiv.org/html/2605.22897#A14.SS3)). For classification, $f_k : \mathbb{R}^d \rightarrow \mathbb{R}^C$ produces per-class scores converted to a distribution $Q_k$ in Section [2.3](https://arxiv.org/html/2605.22897#S2.SS3). The MARICL prediction is then:

$$\hat{y}_{\text{MARICL}}(\mathbf{x}) = f_{\text{ML}}(\mathbf{x}) + \sum_{k=1}^{K} \alpha_k(\mathbf{x}) \cdot \Delta_k(\mathbf{x}) \tag{1}$$

where $\Delta_k(\mathbf{x}) = f_k(\mathbf{x})$ and $\alpha_k(\mathbf{x}) \geq 0$, $\sum_{k=1}^{K} \alpha_k(\mathbf{x}) = 1$ are query-dependent attention weights (Section [2.4](https://arxiv.org/html/2605.22897#S2.SS4)). For classification:

$$P_{\text{MARICL}}(\mathbf{x}) = \beta \cdot P_{\text{ML}}(\mathbf{x}) + (1 - \beta) \sum_{k=1}^{K} \alpha_k(\mathbf{x}) \cdot Q_k(\mathbf{x}) \tag{2}$$

with $P_{\text{ML}}(\mathbf{x}), Q_k(\mathbf{x}) \in \Delta^{C-1}$ and $\beta \in [0, 1]$.

Residual analysis. For each training example we compute residuals

$$r_i = \begin{cases} y_i - f_{\text{ML}}(\mathbf{x}_i) & \text{(regression)} \\ \mathbb{I}[f_{\text{ML}}(\mathbf{x}_i) \neq y_i] \cdot \bigl(1 - P_{\text{ML}}(\mathbf{x}_i)_{y_i}\bigr) & \text{(classification)} \end{cases} \tag{3}$$

and select the top-$\kappa$ fraction by $|r_i|$:

$$\mathcal{D}_{\text{high-res}} = \bigl\{(\mathbf{x}_{\pi(i)}, y_{\pi(i)}, r_{\pi(i)}) : i \leq \lfloor \kappa N_{\text{train}} \rfloor \bigr\}, \tag{4}$$

where $\pi$ orders examples by descending $|r_i|$. A step-by-step trace of the full pipeline on a concrete Cell-Free Protein example appears in Appendix [A](https://arxiv.org/html/2605.22897#A1).
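As a rough illustration of the residual step, the sketch below computes the residuals of Eq. 3 and keeps the top-$\kappa$ fraction as in Eq. 4. It uses only the Python standard library; the function names and the toy data are assumptions made for this example, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def regression_residual(y: float, y_pred: float) -> float:
    """Regression branch of Eq. 3: signed error of the base model."""
    return y - y_pred


def classification_residual(y: int, pred_label: int, prob_true_class: float) -> float:
    """Classification branch of Eq. 3: nonzero only when the base model is wrong."""
    return (1.0 - prob_true_class) if pred_label != y else 0.0


def select_high_residual(
    xs: Sequence[Sequence[float]],
    ys: Sequence[float],
    residuals: Sequence[float],
    kappa: float,
) -> List[Tuple[Sequence[float], float, float]]:
    """Eq. 4: keep the floor(kappa * N) examples with the largest |r_i|."""
    n_keep = math.floor(kappa * len(xs))
    # Descending |r_i|; Python's sort is stable, so ties keep their original order.
    order = sorted(range(len(xs)), key=lambda i: abs(residuals[i]), reverse=True)
    return [(xs[i], ys[i], residuals[i]) for i in order[:n_keep]]


# Toy data (hypothetical): four examples with base-model predictions.
xs = [[0.1], [0.9], [0.5], [0.7]]
ys = [1.0, 2.0, 3.0, 4.0]
base_preds = [1.1, 1.0, 3.0, 4.5]
residuals = [regression_residual(y, p) for y, p in zip(ys, base_preds)]
high_res = select_high_residual(xs, ys, residuals, kappa=0.5)
print([y for _, y, _ in high_res])  # [2.0, 4.0]
```

The selected triples $(\mathbf{x}, y, r)$ are what the encoder later sees in context, so $\kappa$ controls how much of the base model's failure surface the LLM is shown.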

### 2.1 Structured Hypothesis Generation

Rather than prompting an LLM directly for formulas, MARICL uses an encoder–decoder with a structured representation $z_k$ to hypothesize explanations for the failures of the base-model.

Encoder: from residuals to structured hypotheses. We construct an augmented context

$$\mathcal{C}_{\text{aug}} = \bigl(\mathcal{D}_{\text{high-res}}, \mathcal{C}_{\text{domain}}, \mathcal{C}_{\text{features}}\bigr) \tag{5}$$

combining high-residual examples, optional domain context, and feature descriptions. The encoder analyses residual patterns:

$$z_k = \text{Encode}_{\text{LM}}\bigl(\mathcal{C}_{\text{aug}}, \mathcal{P}_{\text{encoder}}^{(k)}\bigr), \tag{6}$$

where each $\mathcal{P}_{\text{encoder}}^{(k)}$ targets a different aspect: (1) *error patterns*: which feature combinations drive high errors and what nonlinearities the base-model misses; (2) *sample patterns*: direct feature–target relationships visible in high-residual examples.

Batched encoding. When $|\mathcal{D}_{\text{high-res}}| > B$, we partition into batches, encode independently, and concatenate to preserve all insights:

$$z_{k,b} = \text{Encode}_{\text{LM}}\bigl(\mathcal{C}_{\text{aug}}^{(b)}, \mathcal{P}_{\text{encoder}}^{(k)}\bigr), \quad z_k = \text{Concat}\bigl(z_{k,1}, \ldots, z_{k,\lceil |\mathcal{D}_{\text{high-res}}| / B \rceil}\bigr). \tag{7}$$

Decoder: from hypotheses to executable corrections. The decoder transforms $z_k$ into a correction $(T_k, f_k) = \text{Decode}_{\text{LM}}(z_k, \mathcal{P}_{\text{decoder}})$ comprising a natural-language explanation $T_k$ and a Python expression $f_k$. This separation forces the LLM to articulate *why* errors occur before specifying *how* to correct them, while diverse $\mathcal{P}_{\text{encoder}}^{(k)}$ encourage complementary hypotheses.

### 2.2 Correction Refinement via Textual Gradient Optimization

Each correction maintains a textual state $\mathcal{S}_k^{(t)}$ comprising the hypothesis $z_k^{(t)}$, correction $(T_k^{(t)}, f_k^{(t)})$, and accumulated critique history. At $t = 0$:

$$z_k^{(0)} = \text{Encode}_{\text{LM}}\bigl(\mathcal{C}_{\text{aug}}, \mathcal{P}_{\text{encoder}}^{(k)}\bigr), \quad (T_k^{(0)}, f_k^{(0)}) = \text{Decode}_{\text{LM}}\bigl(z_k^{(0)}, \mathcal{P}_{\text{decoder}}\bigr). \tag{8}$$

When available, base-model knowledge (feature importances, coefficients) guides initialization. At iteration $t$, we evaluate the train loss

$$\mathcal{L}_k^{(t)} = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{(\mathbf{x}, y)} \ell\bigl(f_{\text{ML}}(\mathbf{x}) + \Delta_k^{(t)}(\mathbf{x}),\; y\bigr) \tag{9}$$

and identify the failure set

$$\mathcal{E}_k^{(t)} = \begin{cases} \bigl\{(\mathbf{x}, y, \hat{y}) : |\hat{y} - y| > \tau_{\text{fail}}\bigr\} & \text{(regression)} \\[2pt] \bigl\{(\mathbf{x}, y, \hat{\mathbf{p}}) : \hat{p}_y < 1 - \tau_{\text{fail}}\bigr\} & \text{(classification)} \end{cases} \tag{10}$$

where $\hat{\mathbf{p}}$ is the current ensemble probability vector (Eq. [2](https://arxiv.org/html/2605.22897#S2.E2)); the classification criterion is equivalent to cross-entropy above $-\log(1 - \tau_{\text{fail}})$. We use $\tau_{\text{fail}} = 0.5$ (Appendix [N.2](https://arxiv.org/html/2605.22897#A14.SS2)). The per-correction loss in Eq. [9](https://arxiv.org/html/2605.22897#S2.E9) drives selection of $m_k^{*}$, while the failure set conditions on the ensemble so that critique targets points where the full predictor still fails on the true class. A textual gradient (critique) is then generated:

$$g_k^{(t)} = \text{LM}\Bigl(\mathcal{P}_{\text{critique}} \mid z_k^{(t)}, m_k^{(t)}, \mathcal{L}_k^{(t)}, \mathcal{E}_k^{(t)}\Bigr), \tag{11}$$

asking the LLM to analyze why the correction fails on $\mathcal{E}_k^{(t)}$ and suggest refinements. Across both regimes, $\mathcal{E}_k^{(t)}$ identifies the queries most informative for refinement: high-error regression points and low-true-class-confidence classification points. The state accumulates

$$\mathcal{S}_k^{(t)} = \mathcal{S}_k^{(t-1)} \cup \bigl\{(z_k^{(t)}, m_k^{(t)}, \mathcal{L}_k^{(t)}, g_k^{(t)})\bigr\}, \tag{12}$$

and subsequent generations condition on the full history:

$$(T_k^{(t+1)}, f_k^{(t+1)}) \sim p_{\text{LM}}\bigl(\cdot \mid \mathcal{C}_{\text{aug}}, \mathcal{S}_k^{(t)}, \mathcal{P}_{\text{decoder}}\bigr). \tag{13}$$
#### Notation.

At $t = 0$ the encoder produces an initial hypothesis $z_k^{(0)}$ from the high-residual context. At each subsequent iteration, both the hypothesis and the correction are regenerated conditioned on the accumulated state $\mathcal{S}_k^{(t)}$ via Eq. [13](https://arxiv.org/html/2605.22897#S2.E13), yielding an updated $(z_k^{(t)}, T_k^{(t)}, f_k^{(t)})$. The hypothesis therefore evolves to reflect *what* the base-model is missing in light of recent failures, while the formula evolves to express *how* that missing structure is captured.
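The control flow of the refinement loop (Algorithm 1, lines 18 to 26) can be sketched for the regression case as follows. This is a simplified illustration under several assumptions: the loss $\ell$ is taken to be absolute error, the failure set is computed from the base model plus a single correction rather than the full ensemble, and `critique_fn` and `refine_fn` are hypothetical stand-ins for the LLM critique (Eq. 11) and regeneration (Eq. 13) calls.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Formula = Callable[[Row], float]


def train_loss(rows: List[Row], targets: List[float],
               base_fn: Formula, formula: Formula) -> float:
    """Eq. 9 with absolute error as the (assumed) per-example loss."""
    errors = [abs(base_fn(x) + formula(x) - y) for x, y in zip(rows, targets)]
    return sum(errors) / len(errors)


def failure_set(rows: List[Row], targets: List[float], base_fn: Formula,
                formula: Formula, tau_fail: float = 0.5) -> List[Tuple[Row, float, float]]:
    """Regression branch of Eq. 10: examples with |y_hat - y| > tau_fail."""
    failures = []
    for x, y in zip(rows, targets):
        y_hat = base_fn(x) + formula(x)
        if abs(y_hat - y) > tau_fail:
            failures.append((x, y, y_hat))
    return failures


def refine_correction(rows, targets, base_fn, formula, critique_fn, refine_fn, n_iters):
    """Evaluate, critique, and regenerate; return the lowest-loss formula seen."""
    state = []  # accumulated history, one (formula, loss, critique) entry per step
    best_formula, best_loss = formula, float("inf")
    for _ in range(n_iters):
        loss = train_loss(rows, targets, base_fn, formula)
        if loss < best_loss:
            best_formula, best_loss = formula, loss
        failures = failure_set(rows, targets, base_fn, formula)
        critique = critique_fn(formula, loss, failures)  # LLM call in the paper
        state.append((formula, loss, critique))
        formula = refine_fn(state)  # LLM call in the paper
    return best_formula, best_loss, state


# Deterministic stubs in place of LLM calls (purely illustrative).
rows, targets = [[1.0], [2.0]], [2.0, 4.0]
best, loss, history = refine_correction(
    rows, targets,
    base_fn=lambda x: 0.0,
    formula=lambda x: 0.0,
    critique_fn=lambda f, l, fails: f"{len(fails)} failures; try a linear term",
    refine_fn=lambda state: (lambda x: 2.0 * x[0]),
    n_iters=2,
)
print(loss, len(history))  # 0.0 2
```

As in Algorithm 1, only the candidates evaluated at $t = 0, \ldots, T-1$ compete for selection, so the formula produced by the final refinement call is never scored in this sketch.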

### 2.3 Probability Conversion for Classification

For classification, $f_k$ outputs per-class scores $\mathbf{s}_k(\mathbf{x}) = [s_k^{(1)}(\mathbf{x}), \ldots, s_k^{(C)}(\mathbf{x})]$, converted via temperature-scaled softmax:

$$Q_k(\mathbf{x})_c = \frac{\exp\bigl(s_k^{(c)}(\mathbf{x}) / \tau_k\bigr)}{\sum_{c'} \exp\bigl(s_k^{(c')}(\mathbf{x}) / \tau_k\bigr)}, \tag{14}$$

with $\tau_k$ chosen by validation (ECE; Appendix [N.2](https://arxiv.org/html/2605.22897#A14.SS2)). Query-specific confidence is

$$c_k(\mathbf{x}) = \sigma\Bigl(\gamma \cdot \bigl(1 - d_k(\mathbf{x})\bigr)\Bigr), \tag{15}$$

where $d_k(\mathbf{x}) = \min\bigl(\tilde{d}_k(\mathbf{x}) / D_k^{95},\, 1\bigr)$, $\tilde{d}_k(\mathbf{x}) = \min_{(\mathbf{x}', \cdot) \in \mathcal{D}_{\text{high-res}}^{(k)}} \|\mathbf{x} - \mathbf{x}'\|_2$ is computed in standardized feature space (frozen train statistics), $D_k^{95}$ is the 95th percentile of pairwise distances within $\mathcal{D}_{\text{high-res}}^{(k)}$, and clipping ensures $d_k \in [0, 1]$ on far queries.
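A minimal sketch of Eqs. 14 and 15, using only the Python standard library. It assumes the query and the high-residual points are already standardized and that $D_k^{95}$ has been precomputed and is passed in as `d95`; the function names are illustrative.

```python
import math
from typing import List, Sequence


def temperature_softmax(scores: Sequence[float], temperature: float) -> List[float]:
    """Eq. 14: softmax over per-class scores divided by the temperature."""
    scaled = [s / temperature for s in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_distance(x: Sequence[float],
                        high_res_points: Sequence[Sequence[float]],
                        d95: float) -> float:
    """Nearest-neighbor Euclidean distance, scaled by d95 and clipped to [0, 1]."""
    nearest = min(math.dist(x, p) for p in high_res_points)
    return min(nearest / d95, 1.0)


def confidence(x: Sequence[float],
               high_res_points: Sequence[Sequence[float]],
               d95: float, gamma: float) -> float:
    """Eq. 15: sigmoid of gamma * (1 - normalized distance)."""
    d = normalized_distance(x, high_res_points, d95)
    return 1.0 / (1.0 + math.exp(-gamma * (1.0 - d)))


points = [[0.0, 0.0], [1.0, 0.0]]  # hypothetical standardized high-residual examples
near = confidence([0.0, 0.0], points, d95=1.0, gamma=2.0)
far = confidence([10.0, 0.0], points, d95=1.0, gamma=2.0)
print(near > far, far)  # True 0.5
```

Because $d_k$ is clipped at 1, confidence bottoms out at $\sigma(0) = 0.5$ for far queries rather than decaying to zero; how much a distant correction still contributes is then determined by the normalization in Eq. 18.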

### 2.4 Query-Aware Aggregation

Training-time example selection. During hypothesis generation, informative examples are selected from the high-residual pool via

$$s(\mathbf{x}_{\text{q}}, \mathbf{x}_i) = \exp\bigl(-\|\mathbf{x}_{\text{q}} - \mathbf{x}_i\|_2^2 / 2\sigma^2\bigr) \cdot |r_i|^{\gamma_s}, \tag{16}$$

where the *query anchor* $\mathbf{x}_{\text{q}}$ is the unselected high-residual example with the largest $|r_i|$ remaining in the pool (greedy top-residual seed; ties broken by index), $\sigma$ is the median pairwise distance over the pool, and $\gamma_s > 0$ trades off residual magnitude against spatial proximity (distinct from $\gamma$ in Eq. [15](https://arxiv.org/html/2605.22897#S2.E15)). The score rewards examples that are both large in residual and near the anchor, yielding a locally representative, correctively informative context. This is used *only during training*; inference uses Eq. [18](https://arxiv.org/html/2605.22897#S2.E18).
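The selection score of Eq. 16 is straightforward to sketch. The example below picks the anchor as the largest-$|r_i|$ example, sets $\sigma$ to the median pairwise distance over the pool, and ranks the pool by score. Helper names and the toy pool are assumptions for illustration, and ties are broken by lowest index as one plausible reading of "ties broken by index".

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def median_pairwise_distance(points: Sequence[Sequence[float]]) -> float:
    """Median Euclidean distance over all unordered pairs (needs >= 2 points)."""
    dists = [math.dist(points[i], points[j])
             for i in range(len(points)) for j in range(i + 1, len(points))]
    return median(dists)


def selection_score(x_q: Sequence[float], x_i: Sequence[float],
                    r_i: float, sigma: float, gamma_s: float) -> float:
    """Eq. 16: Gaussian proximity to the anchor times |r_i| ** gamma_s."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_q, x_i))
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) * abs(r_i) ** gamma_s


def rank_pool(points: Sequence[Sequence[float]], residuals: Sequence[float],
              gamma_s: float) -> Tuple[int, List[int]]:
    """Return the anchor index and pool indices sorted by descending score."""
    # max() returns the first maximal element, so ties go to the lowest index.
    anchor = max(range(len(points)), key=lambda i: abs(residuals[i]))
    sigma = median_pairwise_distance(points)
    scores = [selection_score(points[anchor], p, r, sigma, gamma_s)
              for p, r in zip(points, residuals)]
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return anchor, order


# Hypothetical pool: the far point has a larger residual than the near one.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0]]
residuals = [1.0, 0.9, 0.95]
anchor, order = rank_pool(points, residuals, gamma_s=1.0)
print(anchor, order)  # 0 [0, 1, 2]
```

In this toy pool the nearby example outranks the distant one despite its smaller residual, which is the locality effect the score is designed to produce.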

Inference-time aggregation (zero LLM cost). At inference, each compiled $f_k$ is executed directly. Corrections are weighted by a global performance score

$$p_k = \begin{cases} \exp(-\text{MAE}_k / \tau) & \text{(regression)} \\ \text{F1}_k^{\text{macro}} & \text{(classification)} \end{cases} \tag{17}$$

where $\tau > 0$ scales $\text{MAE}_k$ (fixed vs. target range; Appendix [N.2](https://arxiv.org/html/2605.22897#A14.SS2)). Scores $p_k$ are computed on $\mathcal{D}_{\text{train}}$; $\beta$, $\tau_k$, $K$, and $\kappa$ are selected on the validation set. Combined with $c_k(\mathbf{x})$ from Eq. [15](https://arxiv.org/html/2605.22897#S2.E15):

$$\alpha_k(\mathbf{x}) = \frac{p_k \cdot c_k(\mathbf{x}) \cdot \mathbb{I}[p_k > p_{\min}]}{Z(\mathbf{x})}, \qquad Z(\mathbf{x}) = \sum_{j} p_j \cdot c_j(\mathbf{x}) \cdot \mathbb{I}[p_j > p_{\min}], \tag{18}$$

where $p_{\min}$ filters unreliable corrections. If $Z(\mathbf{x}) = 0$ (all corrections fail the threshold), MARICL falls back to the base-model prediction. The complete inference pipeline involves only arithmetic and nearest-neighbor lookups with zero LLM overhead. Algorithm [1](https://arxiv.org/html/2605.22897#alg1) presents the training procedure.
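The inference-time combination of Eqs. 1, 17, and 18 for regression can be sketched in a few lines. The function names below are illustrative, and the confidences $c_k(\mathbf{x})$ and correction outputs $\Delta_k(\mathbf{x})$ are assumed to have been computed already for the query.

```python
import math
from typing import List, Optional, Sequence


def performance_score(mae: float, tau: float) -> float:
    """Regression branch of Eq. 17."""
    return math.exp(-mae / tau)


def attention_weights(p: Sequence[float], c: Sequence[float],
                      p_min: float) -> Optional[List[float]]:
    """Eq. 18: thresholded, normalized weights; None when Z(x) = 0."""
    raw = [pk * ck if pk > p_min else 0.0 for pk, ck in zip(p, c)]
    z = sum(raw)
    if z == 0.0:
        return None
    return [w / z for w in raw]


def maricl_predict(base_pred: float, corrections: Sequence[float],
                   p: Sequence[float], c: Sequence[float], p_min: float) -> float:
    """Eq. 1, falling back to the base prediction when no correction passes."""
    weights = attention_weights(p, c, p_min)
    if weights is None:
        return base_pred
    return base_pred + sum(w * d for w, d in zip(weights, corrections))


# Hypothetical query: two corrections, only the first clears the threshold.
p = [0.8, 0.2]
c = [0.5, 0.5]
print(maricl_predict(1.0, [0.2, -5.0], p, c, p_min=0.3))  # 1.2
print(maricl_predict(1.0, [0.2, -5.0], p, c, p_min=0.9))  # 1.0
```

Nothing in this path calls an LLM: once the formulas are compiled, prediction is a weighted sum.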

## 3 Related Work

Interpretable ML. Post-hoc methods like SHAP (Lundberg and Lee, [2017](https://arxiv.org/html/2605.22897#bib.bib50)), including TreeSHAP and the Shapley interaction index (Lundberg et al., [2020](https://arxiv.org/html/2605.22897#bib.bib51)), and LIME (Ribeiro et al., [2016](https://arxiv.org/html/2605.22897#bib.bib53)) attribute predictions to features and pairs of features, but the attributions are local, per-sample, and tied to a fixed predictor; they cannot articulate *why* a model fails as a closed-form formula and provide no mechanism for iteratively refining it. Inherently interpretable models like GAMs (Hastie and Tibshirani, [1987](https://arxiv.org/html/2605.22897#bib.bib54)) and EBMs (Nori et al., [2019](https://arxiv.org/html/2605.22897#bib.bib57)) fit smooth nonparametric shape functions per feature, with optional pairwise terms in GA$^2$Ms (Lou et al., [2013](https://arxiv.org/html/2605.22897#bib.bib52)); we compare against *EBM (with pairwise)* as a strong interpretable interaction baseline (Table [7](https://arxiv.org/html/2605.22897#A2.T7)). MARICL is complementary: it produces explicit, closed-form symbolic formulas (named cofactor products, saturation terms, sigmoidal gates) rather than per-pair learned response surfaces, and refines them iteratively against base-model failures. InterpreTabNet (Si et al., [2024](https://arxiv.org/html/2605.22897#bib.bib55)) learns sparse attention masks over features and uses LLMs to generate natural language explanations of feature interdependencies, but remains inspective rather than iteratively refining hypotheses. MARICL addresses a complementary question: where does a model systematically fail and why?

Symbolic Regression. Classical genetic programming (Cranmer, [2023](https://arxiv.org/html/2605.22897#bib.bib63)) and LLM-guided methods (Shojaee et al., [2025](https://arxiv.org/html/2605.22897#bib.bib61)) discover closed-form expressions from data. LLM-LEx (Harvey et al., [2025](https://arxiv.org/html/2605.22897#bib.bib64)) uses LLMs to generate predictive formulas directly. These methods focus on fitting data globally; MARICL instead targets residual patterns through iterative hypothesis refinement, discovering equations that explain *where and why* baselines err.

Multi-Agent and Textual Optimization. Debate frameworks (Du et al., [2023](https://arxiv.org/html/2605.22897#bib.bib2)) and mixture-of-agents (Wang et al., [2024](https://arxiv.org/html/2605.22897#bib.bib5); Rezaei et al., [2025](https://arxiv.org/html/2605.22897#bib.bib83)) coordinate LLMs on shared tasks. TextGrad (Yuksekgonul et al., [2024](https://arxiv.org/html/2605.22897#bib.bib8)) and self-refine (Madaan et al., [2023](https://arxiv.org/html/2605.22897#bib.bib10)) optimize outputs through critique. MARICL differs: each agent specializes on distinct failure patterns via residual analysis, and multi-turn refinement targets interpretable formulas validated by predictive performance.

In\-Context Learning, Boosting, and LLM Feature Engineering\. Demonstration selection methods leverage similarity \(Liu et al\., [2021](https://arxiv.org/html/2605.22897#bib.bib18)\) or diversity \(Levy et al\., [2023](https://arxiv.org/html/2605.22897#bib.bib19)\)\. Gradient boosting \(Friedman, [2001](https://arxiv.org/html/2605.22897#bib.bib58)\) fits weak learners to residuals\. CAAFE \(Hollmann et al\., [2023b](https://arxiv.org/html/2605.22897#bib.bib82)\) prompts an LLM to generate executable feature\-engineering code on the input space, using dataset descriptions as context and validating each candidate against held\-out accuracy\. MARICL shares the compile\-to\-executable strategy but operates on the *residual* space rather than the input space, generating corrections through iterative TextGrad refinement rather than one\-shot generation, and producing named symbolic terms tied to specific feature combinations rather than augmented feature columns; in combination, this provides both executable corrections and human\-readable hypotheses about feature interactions causing failures\.

## 4 Experiments

We evaluate MARICL on nine benchmarks across scientific, biomedical, social, economic, and synthetic domains \(mean $\pm$ std across random seeds\)\. Four questions structure the evaluation: \(i\) does MARICL improve prediction; \(ii\) can it recover known structure from data; \(iii\) how much of the gain comes from iterative refinement versus the LLM’s pretraining priors; and \(iv\) do the learned formulas generalize beyond their training data?

Datasets\. Five regression tasks \(Cell\-Free Protein Production \(Borkowski et al\., [2020](https://arxiv.org/html/2605.22897#bib.bib34)\), Enzyme Activity, Diabetes Progression, California Housing, Bike Sharing\) and three classification tasks \(Zoo, High School Social Classification, Adult Income\)\. We additionally construct a Synthetic Benchmark \($N=1{,}000$, $d=8$\) with a planted ground\-truth formula that cannot appear in any LLM pretraining corpus \(§[4\.2](https://arxiv.org/html/2605.22897#S4.SS2)\)\.

Baselines\. Linear/Logistic Regression, XGBoost, EBM \(Nori et al\., [2019](https://arxiv.org/html/2605.22897#bib.bib57)\), SHAP \(Lundberg and Lee, [2017](https://arxiv.org/html/2605.22897#bib.bib50)\), TabPFN \(Hollmann et al\., [2023a](https://arxiv.org/html/2605.22897#bib.bib59)\), LLM\-LEx \(Harvey et al\., [2025](https://arxiv.org/html/2605.22897#bib.bib64)\), LLM\-ICL, Symbolic Regression \(Cranmer, [2023](https://arxiv.org/html/2605.22897#bib.bib63)\), and PySR\-on\-residuals\. Cross\-validation, hyperparameter sensitivity, calibration, and paired Wilcoxon tests with Benjamini–Hochberg FDR control appear in Appendices [N](https://arxiv.org/html/2605.22897#A14) and [K](https://arxiv.org/html/2605.22897#A11)\.
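The Benjamini–Hochberg step mentioned above can be sketched in a few lines. This is a generic illustration of the standard procedure, not the authors' evaluation code; the function name `benjamini_hochberg` and the example p-values are hypothetical, and in practice the inputs would be the per-benchmark p-values from the paired Wilcoxon tests.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and a boolean rejection mask.

    Hypothetical helper for illustration; input order is preserved.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ascending p-values
    ranked = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity: running minimum taken from the largest rank down.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted, adjusted <= alpha

# Hypothetical p-values, one per benchmark comparison.
adjusted, rejected = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted, rejected)
```

Adjusting the p-values rather than the threshold keeps the comparison against a single `alpha` across all benchmarks.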

![Refer to caption](https://arxiv.org/html/2605.22897v1/Figures/results_figure.png)

Figure 2: Performance across regression and classification benchmarks \($\pm$1 std, 5 seeds\)\. Full numerical results in Tables [16](https://arxiv.org/html/2605.22897#A10.T16), [15](https://arxiv.org/html/2605.22897#A10.T15)\.

### 4\.1 MARICL Improves Prediction on Eight Real Benchmarks

MARICL outperforms all base models on every real benchmark, lifting $R^{2}$ and macro\-F1 over both Linear and XGBoost bases \(Figure [2](https://arxiv.org/html/2605.22897#S4.F2); Tables [15](https://arxiv.org/html/2605.22897#A10.T15), [16](https://arxiv.org/html/2605.22897#A10.T16)\)\. Within\-base gains are largest when the base is weak, since a stronger base leaves less structured residual to correct\. Against TabPFN the gap falls within one standard deviation on the hardest tasks, but MARICL additionally returns a human\-readable formula\. LLM\-LEx and LLM\-ICL both underperform, confirming that the iterative refinement loop is necessary; single\-pass generation is not enough\.

The Diabetes task illustrates how the choice of base matters: Linear\+MARICL outperforms XGBoost\+MARICL because the linear base is itself stronger on this dataset, and MARICL inherits that advantage\. On Adult Income, macro\-F1 rises from $0.692$ to $0.800$ \($+0.108$, $+15.6\%$\) over XGBoost; the XGBoost baseline has poor minority\-class recall, and MARICL’s corrections concentrate on those high\-residual examples\. Even with XGBoost tuned over a five\-axis grid \(Table [25](https://arxiv.org/html/2605.22897#A14.T25)\), MARICL leads by $+0.106$ $R^{2}$ on Cell\-Free Protein and $+0.069$ macro\-F1 on Adult Income, with positive gaps on all nine benchmarks\. Among residual\-correction baselines \(PySR\-on\-residuals, pairwise interactions, EBM, MLP\-on\-residuals\), MARICL is the only method that achieves both the best within\-model $R^{2}$ *and* a human\-readable formula \(Appendix [B](https://arxiv.org/html/2605.22897#A2)\)\.

Table 2: Synthetic benchmark with planted ground truth\. MARICL recovers the dominant sigmoid as an explicit symbolic term\. XGBoost is a standalone full model; its $\Delta R^{2}$ is reported over Linear for comparability, not as a residual correction\.

| Method | Type | $R^{2}$ | $\Delta R^{2}$ | Sigmoid? |
|---|---|---|---|---|
| Linear \(base\) | Standalone \(base\) | 0\.387 | – | – |
| PySR on residuals | Residual correction | 0\.854 | \+0\.467 | Partial |
| XGBoost | Standalone | 0\.831 | \+0\.444 | No |
| MARICL \(ours\) | Residual correction | 0\.892 | \+0\.505 | Yes |
| Oracle correction | Standalone \(oracle\) | 0\.961 | \+0\.574 | – |

### 4\.2 Synthetic Benchmark: Recovering Planted Ground Truth

Gains on real benchmarks could in principle come from the LLM recognising the domain rather than reasoning over the data\. To rule this out we construct a benchmark with a known ground truth that the LLM cannot have seen:

$$Y = 0.6\,X_{1} + 0.4\,X_{2} + 2.5\,\text{sigmoid}(1.8\,X_{1}X_{3} - 1.2) + 0.3\sin(X_{5}X_{7}) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 0.1^{2})$$

with iid features \($N=1{,}000$, $d=8$; Appendix [D](https://arxiv.org/html/2605.22897#A4)\)\. A linear base captures the additive terms but misses the sigmoid nonlinearity, which is therefore the structure a residual correction must recover\.
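As a rough illustration of why the sigmoid term is what a linear base leaves behind, the sketch below samples data from the planted formula and compares a least-squares linear fit with and without the true gate term supplied as an extra column. The feature distribution is an assumption (standard normal; the text only says the features are iid), the helper names are hypothetical, and the $R^{2}$ values are in-sample, so the numbers it prints are not expected to match Table 2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_synthetic(n=1000, d=8, seed=0):
    """Sample (X, y, gate) from the planted formula.

    ASSUMPTION: features are iid standard normal; the paper's exact
    feature distribution is described in its Appendix D.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    x1, x2, x3, x5, x7 = X[:, 0], X[:, 1], X[:, 2], X[:, 4], X[:, 6]
    gate = 2.5 * sigmoid(1.8 * x1 * x3 - 1.2)
    noise = rng.normal(0.0, 0.1, n)
    y = 0.6 * x1 + 0.4 * x2 + gate + 0.3 * np.sin(x5 * x7) + noise
    return X, y, gate

def r2_of_least_squares(design, y):
    """In-sample R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))

X, y, gate = make_synthetic()
r2_linear = r2_of_least_squares(X, y)                           # additive terms only
r2_oracle = r2_of_least_squares(np.column_stack([X, gate]), y)  # plus the true gate
print(f"linear base R^2 = {r2_linear:.3f}; with true gate term R^2 = {r2_oracle:.3f}")
```

The gap between the two fits is the structured residual that a correction term has to explain.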

Table [2](https://arxiv.org/html/2605.22897#S4.T2) shows that MARICL recovers the planted sigmoid as the explicit symbolic term $c\cdot\text{sigmoid}(a\,X_{1}X_{3}-b)$, with $(a,b)=(1.6,1.1)$ within ${\sim}12\%$ of the true $(1.8,1.2)$; the $\sin$ term is absorbed into polynomial residuals and the amplitude $c$ is folded into linear coefficients \(a limitation we flag in §[5](https://arxiv.org/html/2605.22897#S5)\)\. PySR\-on\-residuals reaches comparable $R^{2}$ but returns a 12\-node nested expression that obscures the planted structure even when $\sigma$ is provided as a primitive \(Appendix [N\.5](https://arxiv.org/html/2605.22897#A14.SS5)\)\. Because the planted mechanism cannot exist in pretraining data, this experiment isolates the data\-driven component of MARICL’s performance\.
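To make the recovered form $c\cdot\text{sigmoid}(a\,X_{1}X_{3}-b)$ concrete, here is a minimal sketch that fits such a term to the residuals of a linear base by brute-force grid search over $(a, b)$, solving for $c$ and an offset by least squares. This is a stand-in for illustration only: MARICL obtains the term through LLM hypotheses and textual-gradient refinement, not grid search. The standard-normal features, the grid ranges, and the variable names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# ASSUMPTION: iid standard-normal features; planted formula as in the text.
rng = np.random.default_rng(0)
n = 1000
X = rng.standard_normal((n, 8))
z = X[:, 0] * X[:, 2]                               # the X1 * X3 interaction
y = (0.6 * X[:, 0] + 0.4 * X[:, 1] + 2.5 * sigmoid(1.8 * z - 1.2)
     + 0.3 * np.sin(X[:, 4] * X[:, 6]) + rng.normal(0.0, 0.1, n))

# Linear base model and its residuals.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residual = y - A @ beta

# Grid over (a, b); for each pair, c and an offset come from least squares.
best = None
for a in np.arange(0.5, 3.01, 0.1):
    for b in np.arange(0.0, 2.51, 0.1):
        B = np.column_stack([np.ones(n), sigmoid(a * z - b)])
        coef, *_ = np.linalg.lstsq(B, residual, rcond=None)
        sse = float(np.sum((residual - B @ coef) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(a), float(b), float(coef[1]))

sse, a_hat, b_hat, c_hat = best
sst = float(np.sum((y - y.mean()) ** 2))
r2_base = 1.0 - float(residual @ residual) / sst
r2_corrected = 1.0 - sse / sst                      # base plus fitted correction
print(f"a={a_hat:.1f}, b={b_hat:.1f}, c={c_hat:.2f}, "
      f"R^2 {r2_base:.3f} -> {r2_corrected:.3f}")
```

The fit also shows why a symbolic form matters: three named parameters can be compared directly against the planted $(1.8, 1.2)$ and amplitude $2.5$.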

### 4\.3 Disentangling Data\-Driven Refinement from Prior Knowledge

How much of MARICL’s improvement comes from the iterative loop, and how much from the LLM already knowing the domain \(e\.g\., that NAD is a translation cofactor or that BMI predicts diabetes\)? We separate the two by progressively removing what the LLM gets “for free” from pretraining\.

\(i\) Removing the domain prompt drops performance by only 2\.3–3\.5 points and still recovers the same dominant correction category \(Appendix [C](https://arxiv.org/html/2605.22897#A3)\)\. \(ii\) Anonymizing feature names \(replacing them with opaque feat\_0, …, feat\_d\) preserves 67–69% of MARICL’s gain on Cell\-Free Protein and Diabetes \(Appendix [I](https://arxiv.org/html/2605.22897#A9)\)\. \(iii\) Disabling refinement \($T=0$\): performance increases monotonically with the number of iterations \(Table [5](https://arxiv.org/html/2605.22897#S4.T5)\), so any gain above $T=0$ is attributable to the loop rather than the initial prompt\. \(iv\) All three at once: with feature names anonymized, no domain context anywhere in the prompts, and a small open LLM \(Llama\-3\-8B\) replacing frontier models, roughly half of MARICL’s end\-to\-end gain still survives, and it still exceeds PySR\-on\-residuals \(Appendix [L](https://arxiv.org/html/2605.22897#A12)\)\.
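The anonymization ablation and the "share of gain preserved" figure can be sketched as follows. Both helpers are hypothetical, the formula for the retained share is an assumed reading of the text (ablated gain divided by full gain, both measured against the same base), and the scores in the example are made up for illustration rather than taken from the paper.

```python
def anonymize_columns(columns):
    """Map each feature name to an opaque placeholder feat_0, feat_1, ..."""
    return {name: f"feat_{i}" for i, name in enumerate(columns)}

def fraction_of_gain_retained(base, full, ablated):
    """Share of the full method's gain over the base that survives an ablation.

    ASSUMPTION: gain is measured as score minus the base model's score.
    """
    full_gain = full - base
    if full_gain == 0:
        raise ValueError("full method shows no gain over the base")
    return (ablated - base) / full_gain

mapping = anonymize_columns(["NAD", "spermidine", "CoA"])
print(mapping)

# Made-up scores: base 0.50, full method 0.70, anonymized variant 0.636.
print(f"{fraction_of_gain_retained(0.50, 0.70, 0.636):.0%} of the gain retained")
```

Keeping the mapping around lets a learned formula over `feat_i` be translated back to the original names for inspection.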

### 4\.4 Interpretability of Learned Corrections

MARICL’s corrections are explicit symbolic formulas that name what the base model misses \(Table [3](https://arxiv.org/html/2605.22897#S4.T3)\)\. For Cell\-Free Protein, the correction encodes NAD–spermidine cofactor synergy \(polyamines stimulate translation 3–5$\times$ \(Atkins et al\., [1975](https://arxiv.org/html/2605.22897#bib.bib42)\); NAD and CoA are ATP\-regeneration cofactors \(Jewett and Swartz, [2004](https://arxiv.org/html/2605.22897#bib.bib45)\)\) and a Michaelis–Menten saturation term for folinic acid \(Cai et al\., [2015](https://arxiv.org/html/2605.22897#bib.bib47)\)\. The “Correction Only” column shows that the formula carries meaningful structure on its own, independent of the base model\. For Zoo classification, the correction encodes $\text{hair}\times\text{milk}\times(1-\text{eggs})$ for mammalian synapomorphies and weights feathers above flight ability so that flightless birds remain classified as birds; LLM priors on taxonomy plausibly contribute here, so we treat this as controlled validation rather than a data\-driven inference\. When a correction is unreliable, the $p_{\min}$ threshold \(Eq\. [18](https://arxiv.org/html/2605.22897#S2.E18)\) suppresses it\.

Table 3: Interpretable corrections for protein expression and zoo classification\. “Correction Only” = learned correction without the base model; “MARICL” = full combined performance\.

| Dataset | Learned Correction Formula | Domain Interpretation & Literature Support | Correction Only | MARICL |
|---|---|---|---|---|
| Protein Expression | $\hat{y} = 1.5\cdot\text{NAD} + 1.2\cdot\text{spermidine} + 0.5\cdot\text{NAD}\times\text{spermidine} + 0.6\cdot\text{CoA} + 0.5\cdot\text{3-PGA} + \frac{0.5\cdot\text{folinic\_acid}}{0.5+\text{folinic\_acid}}$ | NAD\-spermidine synergy: polyamines stimulate translation 3–5$\times$ \(Atkins et al\., [1975](https://arxiv.org/html/2605.22897#bib.bib42)\); the PANOx\-SP system uses NAD and CoA to regenerate ATP \(Jewett and Swartz, [2004](https://arxiv.org/html/2605.22897#bib.bib45)\)\. Folinic acid saturation: formylmethionine precursor with an empirical concentration plateau \(Cai et al\., [2015](https://arxiv.org/html/2605.22897#bib.bib47)\)\. | 0\.495 | 0\.723 $R^{2}$ |
| Zoo Animals | $\text{score}_{\text{mammal}} = 1.1\cdot\text{hair} + 1.3\cdot\text{milk} + 0.8\cdot(\text{hair}\times\text{milk}\times(1-\text{eggs}))$; $\text{score}_{\text{bird}} = 1.2\cdot\text{feathers} + 0.8\cdot\text{airborne} + 0.5\cdot(\text{feathers}\times\text{airborne})$; $\text{score}_{\text{fish}} = 1.0\cdot\text{fins} + 0.9\cdot\text{aquatic} + 0.4\cdot(\text{fins}\times\text{aquatic}\times\text{backbone})$ | Taxonomic encoding: $\text{hair}\times\text{milk}\times(1-\text{eggs})$ captures mammalian synapomorphies\. Feathers over flight: prioritizes morphology over behavior so that flightless birds remain birds\. Treated as controlled validation; LLM priors on taxonomy likely contribute\. | 0\.762 | 0\.975 Acc |
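Because the corrections in Table 3 are closed-form, they can be written directly as executable functions. The sketch below transcribes the two formulas as given; the function and argument names are hypothetical, inputs are assumed to be the same (scaled) feature values the formulas were learned on, and how these scores are combined with the base model's output (including the $p_{\min}$ gate) is not shown here.

```python
def protein_correction(nad, spermidine, coa, pga3, folinic_acid):
    """Protein Expression correction from Table 3.

    ASSUMPTION: folinic_acid is non-negative, so the saturation term's
    denominator (0.5 + folinic_acid) stays positive.
    """
    synergy = 0.5 * nad * spermidine
    saturation = 0.5 * folinic_acid / (0.5 + folinic_acid)
    return (1.5 * nad + 1.2 * spermidine + synergy
            + 0.6 * coa + 0.5 * pga3 + saturation)

def zoo_scores(hair, milk, eggs, feathers, airborne, fins, aquatic, backbone):
    """Per-class scores from Table 3 for binary (0/1) Zoo features."""
    return {
        "mammal": 1.1 * hair + 1.3 * milk + 0.8 * (hair * milk * (1 - eggs)),
        "bird": 1.2 * feathers + 0.8 * airborne + 0.5 * (feathers * airborne),
        "fish": 1.0 * fins + 0.9 * aquatic + 0.4 * (fins * aquatic * backbone),
    }

# A flightless, aquatic, egg-laying bird: feathers but not airborne.
penguin = zoo_scores(hair=0, milk=0, eggs=1, feathers=1, airborne=0,
                     fins=0, aquatic=1, backbone=1)
print(max(penguin, key=penguin.get), penguin)
```

In the penguin example the feathers weight alone (1.2) exceeds the aquatic contribution to the fish score (0.9), which is the "feathers over flight" behavior described in the table.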

### 4\.5 Frozen Correction Formulas Transfer Across Plates Without Retraining

#### Setup\.

We use the Cell\-Free Protein dataset, in which proteins are produced in a test tube from a cell extract plus a mixture of reagents \(energy substrates, cofactors, polyamines, etc\.\)\. Each*plate*is one experimental batch, run with a specific reagent recipe, and yields measurements for many wells\. Plates AL\_2 through AL\_6 share one reagent protocol; AL\_7 through AL\_10 use a different protocol with different reagent ranges\. We refer to these two groups as*reagent cohorts*\.

#### Hypothesis\.

If MARICL’s corrections capture real biochemistry rather than batch\-specific noise, then a formula learned on one plate should still improve predictions on a different plate from the same cohort, with no retraining and no LLM calls\. Held\-out accuracy on the same plate is too weak a check, since formulas overfit to a single batch can still generalize within that batch; cross\-plate transfer is the harder test\.

Table 4: Cross\-plate transfer: formulas frozen on source plate AL\_6 applied to target plates with zero LLM cost\. “Same reagent” plates share AL\_6’s protocol; “Diff\. regime” plates use a different one\. The corrections consistently beat the base model within the same cohort and fail across the regime boundary\. Per\-plate breakdown in Appendix [E](https://arxiv.org/html/2605.22897#A5)\.

| Target cohort | % improvement | Avg $\Delta$MAE |
|---|---|---|
| Same reagent \(AL\_2–AL\_5\) | 92–97% | $+0.155$ to $+0.249$ |
| Diff\. regime \(AL\_7–AL\_10\) | 0–8% | $-0.155$ to $-0.213$ |

#### Result\.

We freeze correction formulas from $37$ source runs on plate AL\_6 and apply them to $8$ held\-out target plates with no updates and no LLM calls \(Appendix [E\.1](https://arxiv.org/html/2605.22897#A5.SS1)\)\. Within the same reagent cohort \(AL\_2–AL\_5\), the frozen corrections improve over the base model on 92–97% of source–target pairs, with mean MAE reductions of $0.155$ to $0.249$ per plate \(Table [4](https://arxiv.org/html/2605.22897#S4.T4)\)\. Across the different cohort \(AL\_7–AL\_10\), the same formulas improve only 0–8% of pairs\.

The success boundary aligns with the reagent protocol, not the number of plates between source and target — direct evidence of mechanistic generalization: a formula that captures real biochemistry transfers wherever that biochemistry holds, whereas an overfit formula would not produce such a clean boundary\. Robustness of the quality filter and the effect of the base model on transfer are in Appendix[E](https://arxiv.org/html/2605.22897#A5)\.
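The two quantities reported in Table 4 can be computed with a few lines once a frozen formula has been evaluated on a target plate. The sketch below is an assumed reading of the metrics, not the authors' implementation: it treats the correction as additive on top of the base prediction and defines $\Delta$MAE as base MAE minus corrected MAE, so that positive values mean improvement (consistent with the signs in Table 4). The helper names and the toy numbers are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def transfer_summary(pairs):
    """Summarize frozen-formula transfer over source-target pairs.

    pairs: iterable of (y_true, base_pred, correction), one per pair, where
    `correction` is the frozen formula evaluated on the target plate.
    Returns (% of pairs improved, mean delta-MAE).
    """
    deltas = []
    for y_true, base_pred, correction in pairs:
        corrected = (np.asarray(base_pred, dtype=float)
                     + np.asarray(correction, dtype=float))
        deltas.append(mae(y_true, base_pred) - mae(y_true, corrected))
    deltas = np.asarray(deltas)
    return 100.0 * float(np.mean(deltas > 0)), float(np.mean(deltas))

# Toy data: one pair where the correction helps, one where it hurts.
y = [1.0, 2.0, 3.0]
base = [0.5, 1.5, 2.5]
pairs = [(y, base, [0.5, 0.5, 0.5]), (y, base, [-0.5, -0.5, -0.5])]
pct_improved, mean_delta = transfer_summary(pairs)
print(pct_improved, mean_delta)
```

No fitting happens inside `transfer_summary`, which mirrors the point of the experiment: the formula is evaluated as-is on data it never saw.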

### 4\.6 Ablation Studies

Table 5: Ablations on Cell\-Free Protein \(mean $\pm$ std, 5 seeds\)\. Each sub\-block varies the named factor while holding all other factors at the headline configuration \($K=2$, $T=10$, Top\-K $=100$, scaled features\)\. The $K=2$ Linear \(0\.648\) and XGBoost \(0\.723\) entries match the headline values in Table [15](https://arxiv.org/html/2605.22897#A10.T15)\.

| Factor \(base\) | Configuration | $R^{2}$ |
|---|---|---|
| \# Corr\. \(Linear\) | 1 corr\. | 0\.541 $\pm$ 0\.027 |
| \# Corr\. \(Linear\) | 2 corr\. | 0\.648 $\pm$ 0\.030 |
| \# Corr\. \(XGBoost\) | 1 corr\. | 0\.690 $\pm$ 0\.025 |
| \# Corr\. \(XGBoost\) | 2 corr\. | 0\.723 $\pm$ 0\.031 |
| Top\-K \(XGBoost\) | Top\-K = 30 | 0\.626 $\pm$ 0\.024 |
| Top\-K \(XGBoost\) | Top\-K = 50 | 0\.711 $\pm$ 0\.021 |
| Top\-K \(XGBoost\) | Top\-K = 100 | 0\.723 $\pm$ 0\.031 |
| Feature Scaling \(Linear\) | Unscaled | 0\.412 $\pm$ 0\.029 |
| Feature Scaling \(Linear\) | Scaled | 0\.648 $\pm$ 0\.030 |
| Refinement Iter\. \(XGBoost, $K=2$, Top\-K $=100$\) | $T=0$ \(base only\) | 0\.5787 $\pm$ 0\.025 |
| Refinement Iter\. \(XGBoost, $K=2$, Top\-K $=100$\) | $T=3$ | 0\.6234 $\pm$ 0\.024 |
| Refinement Iter\. \(XGBoost, $K=2$, Top\-K $=100$\) | $T=5$ | 0\.6789 $\pm$ 0\.026 |
| Refinement Iter\. \(XGBoost, $K=2$, Top\-K $=100$\) | $T=10$ | 0\.7231 $\pm$ 0\.031 |

Table [5](https://arxiv.org/html/2605.22897#S4.T5) ablates the main hyperparameters on Cell\-Free Protein, holding everything else at the headline configuration \($K=2$, $T=10$, Top\-K $=100$, scaled features\)\. Adding a second correction \($K=1\to 2$\) yields $+0.107$ $\Delta R^{2}$ for Linear and $+0.033$ for XGBoost; the smaller XGBoost gain reflects its stronger base, which leaves less residual signal for additional corrections to absorb \(§[M\.3](https://arxiv.org/html/2605.22897#A13.SS3)\)\. Larger residual pools \(Top\-K\) yield monotonically richer signal\. Feature scaling is essential for the linear base\. More refinement iterations improve performance \($T=10 > T=5 > T=3$\), confirming that the loop adds value beyond the initial prompt\. Training requires 14–82 LLM calls per run \(\$0\.02–\$0\.11 with Gemini 2\.0 Flash\); cost analysis and backbone robustness are in Appendices [F](https://arxiv.org/html/2605.22897#A6) and [G](https://arxiv.org/html/2605.22897#A7)\.
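The Top-K factor controls how many high-residual examples are shown to the LLM. A minimal sketch of such a selection step follows, assuming examples are ranked by absolute residual of the base model; the ranking criterion, the function name, and the toy values are assumptions rather than details confirmed by this section.

```python
import numpy as np

def top_k_residual_indices(y_true, y_pred, k):
    """Indices of the k examples with the largest absolute base-model residual.

    ASSUMPTION: "high-residual" means largest |y_true - y_pred|.
    Returns (indices, signed residuals); ties keep their original order.
    """
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    k = min(k, residual.size)
    order = np.argsort(-np.abs(residual), kind="stable")
    return order[:k], residual

# Toy example: four points, keep the two worst-predicted ones.
idx, residual = top_k_residual_indices([1.0, 0.0, 2.0, 5.0],
                                       [0.9, 2.0, 1.5, 3.5], k=2)
print(idx, residual[idx])
```

Returning the signed residuals alongside the indices keeps the direction of each error available, which is the kind of pattern a hypothesis about missing structure would draw on.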

## 5 Discussion and Conclusion

MARICL deploys LLM agents to analyze a base model’s systematic failures: an encoder transforms residual patterns into structured hypotheses, a decoder converts these into executable corrections, and textual gradient optimization iteratively refines them against predictive performance\.

When MARICL succeeds\. Within\-model gains are largest in domains with correctable residual structure: $+0.236$ $\Delta R^{2}$ over a Linear base on Cell\-Free Protein, $+0.144$ over XGBoost on the same task\. MARICL is most clearly advantageous when the practitioner’s validated base leaves substantial structured residual, or when interpretable corrections are valued alongside accuracy; this is the regime where end\-to\-end LLM agents \(LLM\-ICL\) underperform\.

Limitations\. MARICL’s gains are modest when the base already captures the dominant nonlinearity or under high\-dimensional noise \(§[M\.3](https://arxiv.org/html/2605.22897#A13.SS3)\); the $p_{\min}$ threshold ensures graceful degradation\. Training requires a small LLM\-call budget \(Appendix [F](https://arxiv.org/html/2605.22897#A6)\); inference is zero\-cost\. The joint prior\-stripping ablation \(§[L](https://arxiv.org/html/2605.22897#A12)\) shows roughly half of MARICL’s gain on real benchmarks survives without feature semantics, domain context, or a frontier LLM; the other half is a deployment benefit when priors align with the domain\. The synthetic and cross\-plate results \(§[4\.2](https://arxiv.org/html/2605.22897#S4.SS2), §[4\.5](https://arxiv.org/html/2605.22897#S4.SS5)\) carry the inference argument on tasks that admit no pretraining shortcut\. Corrections may capture correlations rather than causation, and formulas are scoped to the regime they were learned in\.

Broader impact\. Among the residual\-correction baselines we evaluate, MARICL is the only one that jointly achieves the highest within\-model $R^{2}$ and produces named, closed\-form formulas suitable for domain\-expert inspection; this is a Pareto\-style claim rather than dominance over all interpretable methods\. In scientific applications, understanding *where and why* models fail can generate testable hypotheses, and the cross\-plate result shows that formulas can track meaningful experimental boundaries, supporting reuse across related conditions and flagging when fresh discovery is needed\.

## References

- Atkins et al\. \(1975\) Enhanced differential synthesis of proteins in a mammalian cell\-free system by addition of polyamines\. Journal of Biological Chemistry 250\(14\), pp\. 5688–5695\. Cited by: [§4\.4](https://arxiv.org/html/2605.22897#S4.SS4.p1.3), [Table 3](https://arxiv.org/html/2605.22897#S4.T3.4.4.4.4.1.1)\.
- O\. Borkowski, C\. Bricio, M\. Murgiano, B\. Rothschild\-Mancinelli, G\. Stan, and T\. Ellis \(2020\) Large scale active\-learning\-guided exploration for in vitro protein production optimization\. Nature Communications 11\(1\), pp\. 1872\. Cited by: [Table 24](https://arxiv.org/html/2605.22897#A14.T24.4.2.3), [§4](https://arxiv.org/html/2605.22897#S4.p2.2)\.
- Q\. Cai, J\. A\. Hanson, A\. R\. Steiner, C\. Tran, M\. R\. Masikat, R\. Chen, J\. F\. Zawada, A\. K\. Sato, T\. J\. Hallam, and G\. Yin \(2015\) A simplified and robust protocol for immunoglobulin expression in Escherichia coli cell\-free protein synthesis systems\. Biotechnology Progress 31\(3\), pp\. 823–831\. Cited by: [§4\.4](https://arxiv.org/html/2605.22897#S4.SS4.p1.3), [Table 3](https://arxiv.org/html/2605.22897#S4.T3.4.4.4.4.1.1)\.
- A\. Cornish\-Bowden \(2012\) Fundamentals of enzyme kinetics\. 4th edition, Wiley\-VCH, Weinheim\. Note: First edition 1979; covers Michaelis\-Menten kinetics, steady\-state kinetics, and enzyme mechanisms\. Cited by: [Table 13](https://arxiv.org/html/2605.22897#A8.T13.6.6.6.3.3), [Appendix H](https://arxiv.org/html/2605.22897#A8.p2.4)\.
- M\. Cranmer \(2023\) Interpretable machine learning for science with PySR and SymbolicRegression\.jl\. arXiv preprint arXiv:2305\.01582\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p1.1), [§3](https://arxiv.org/html/2605.22897#S3.p2.1), [§4](https://arxiv.org/html/2605.22897#S4.p2.2)\.
- R\. A\. DeFronzo, E\. Ferrannini, L\. Groop, R\. R\. Henry, W\. H\. Herman, J\. J\. Holst, F\. B\. Hu, C\. R\. Kahn, I\. Raz, G\. I\. Shulman, D\. C\. Simonson, M\. A\. Testa, and R\. Weiss \(2015\) Type 2 diabetes mellitus\. Nature Reviews Disease Primers 1, pp\. 15019\. External Links: [Document](https://dx.doi.org/10.1038/nrdp.2015.19)\. Cited by: [Table 13](https://arxiv.org/html/2605.22897#A8.T13.12.12.5.2.2), [Appendix H](https://arxiv.org/html/2605.22897#A8.p3.6)\.
- Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch \(2023\) Improving factuality and reasoning in language models through multiagent debate\. In Forty\-first International Conference on Machine Learning\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p3.1)\.
- D\. Dua and C\. Graff \(2019\) UCI Machine Learning Repository: Enzyme Activity Dataset\. Note: [https://archive\.ics\.uci\.edu/](https://archive.ics.uci.edu/), University of California, Irvine, School of Information and Computer Sciences\. Cited by: [Table 24](https://arxiv.org/html/2605.22897#A14.T24.4.3.3)\.
- B\. Efron, T\. Hastie, I\. Johnstone, and R\. Tibshirani \(2004\) Least angle regression\. Annals of Statistics 32\(2\), pp\. 407–499\. Cited by: [Table 24](https://arxiv.org/html/2605.22897#A14.T24.4.4.3)\.
- H\. Fanaee\-T and J\. Gama \(2014\) Event labeling combining ensemble detectors and background knowledge\. Progress in Artificial Intelligence 2, pp\. 113–127\. Cited by: [Table 24](https://arxiv.org/html/2605.22897#A14.T24.4.9.3)\.
- R\. Forsyth \(1990\) Zoo Data Set\. Note: UCI Machine Learning Repository, [https://archive\.ics\.uci\.edu/dataset/111/zoo](https://archive.ics.uci.edu/dataset/111/zoo)\. Cited by: [Table 24](https://arxiv.org/html/2605.22897#A14.T24.4.5.3)\.
- J\. H\. Friedman \(2001\) Greedy function approximation: a gradient boosting machine\. Annals of Statistics 29\(5\), pp\. 1189–1232\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p4.1)\.
- L\. Grinsztajn, E\. Oyallon, and G\. Varoquaux \(2022\) Why do tree\-based models still outperform deep learning on typical tabular data? In Advances in Neural Information Processing Systems, Vol\. 35, pp\. 507–520\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p1.1)\.
- T\. R\. Harvey, F\. Ruehle, K\. Fraser\-Taliente, and J\. Halverson \(2025\) Symbolic regression with multimodal large language models and Kolmogorov–Arnold networks\. arXiv preprint arXiv:2505\.07956\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p2.1), [§4](https://arxiv.org/html/2605.22897#S4.p2.2)\.
- T\. Hastie and R\. Tibshirani \(1987\) Generalized additive models: some applications\. Journal of the American Statistical Association 82\(398\), pp\. 371–386\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p1.1), [§3](https://arxiv.org/html/2605.22897#S3.p1.1)\.
- N\. Hollmann, S\. Müller, K\. Eggensperger, and F\. Hutter \(2023a\) TabPFN: a transformer that solves small tabular classification problems in a second\. In International Conference on Learning Representations\. Cited by: [§4](https://arxiv.org/html/2605.22897#S4.p2.2)\.
- N\. Hollmann, S\. Müller, and F\. Hutter \(2023b\) Large language models for automated data science: introducing CAAFE for context\-aware automated feature engineering\. Advances in Neural Information Processing Systems 36, pp\. 44753–44775\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p4.1)\.
- M\. C\. Jewett and J\. R\. Swartz \(2004\) Mimicking the Escherichia coli cytoplasmic environment activates long\-lived and efficient cell\-free protein synthesis\. Biotechnology and Bioengineering 86\(1\), pp\. 19–26\. Cited by: [§4\.4](https://arxiv.org/html/2605.22897#S4.SS4.p1.3), [Table 3](https://arxiv.org/html/2605.22897#S4.T3.4.4.4.4.1.1)\.
- S\. E\. Kahn, R\. L\. Hull, and K\. M\. Utzschneider \(2006\) Mechanisms linking obesity to insulin resistance and type 2 diabetes\. Nature 444\(7121\), pp\. 840–846\. External Links: [Document](https://dx.doi.org/10.1038/nature05482)\. Cited by: [Table 13](https://arxiv.org/html/2605.22897#A8.T13.12.12.5.2.2), [Appendix H](https://arxiv.org/html/2605.22897#A8.p3.6)\.
- D\. Kahneman and A\. Tversky \(1979\) Prospect theory: an analysis of decision under risk\. Econometrica 47\(2\), pp\. 263–291\. External Links: [Document](https://dx.doi.org/10.2307/1914185)\. Cited by: [Table 13](https://arxiv.org/html/2605.22897#A8.T13.22.22.4.2.2), [Appendix H](https://arxiv.org/html/2605.22897#A8.p5.1)\.
- B\. Kim, J\. Hewitt, N\. Nanda, N\. Fiedel, and O\. Tafjord \(2025\) Because we have LLMs, we can and should pursue agentic interpretability\. arXiv preprint arXiv:2506\.12152\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p2.1)\.
- R\. Kohavi \(1996\) Scaling up the accuracy of naive\-Bayes classifiers: a decision\-tree hybrid\. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining \(KDD\), pp\. 202–207\. Cited by: [Table 24](https://arxiv.org/html/2605.22897#A14.T24.4.7.3)\.
- I\. Levy, B\. Bogin, and J\. Berant \(2023\) Diverse demonstrations improve in\-context compositional generalization\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 1401–1422\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p4.1)\.
- J\. Liu, D\. Shen, Y\. Zhang, B\. Dolan, L\. Carin, and W\. Chen \(2021\) What makes good in\-context examples for GPT\-3? arXiv preprint arXiv:2101\.06804\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p4.1)\.
- Y\. Lou, R\. Caruana, J\. Gehrke, and G\. Hooker \(2013\) Accurate intelligible models with pairwise interactions\. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 623–631\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p1.1)\.
- S\. M\. Lundberg, G\. Erion, H\. Chen, A\. DeGrave, J\. M\. Prutkin, B\. Nair, R\. Katz, J\. Himmelfarb, N\. Bansal, and S\. Lee \(2020\) From local explanations to global understanding with explainable AI for trees\. Nature Machine Intelligence 2\(1\), pp\. 56–67\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p1.1), [§3](https://arxiv.org/html/2605.22897#S3.p1.1)\.
- S\. M\. Lundberg and S\. Lee \(2017\) A unified approach to interpreting model predictions\. In Advances in Neural Information Processing Systems, Vol\. 30, pp\. 4765–4774\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p1.1), [§3](https://arxiv.org/html/2605.22897#S3.p1.1), [§4](https://arxiv.org/html/2605.22897#S4.p2.2)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, et al\. \(2023\) Self\-refine: iterative refinement with self\-feedback\. Advances in Neural Information Processing Systems 36, pp\. 46534–46594\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p3.1)\.
- D\. McElfresh, S\. Khandagale, J\. Valverde, V\. Prasad C, G\. Ramakrishnan, M\. Goldblum, and C\. White \(2024\) When do neural nets outperform boosted trees on tabular data? In Advances in Neural Information Processing Systems, Vol\. 36, pp\. 76336–76369\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p1.1)\.
- M\. McPherson, L\. Smith\-Lovin, and J\. M\. Cook \(2001\) Birds of a feather: homophily in social networks\. Annual Review of Sociology 27\(1\), pp\. 415–444\. External Links: [Document](https://dx.doi.org/10.1146/annurev.soc.27.1.415)\. Cited by: [Table 13](https://arxiv.org/html/2605.22897#A8.T13.18.18.5.3.3), [Appendix H](https://arxiv.org/html/2605.22897#A8.p4.1)\.
- H\. Nori, S\. Jenkins, P\. Koch, and R\. Caruana \(2019\) InterpretML: a unified framework for machine learning interpretability\. arXiv preprint arXiv:1909\.09223\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p1.1), [§3](https://arxiv.org/html/2605.22897#S3.p1.1), [§4](https://arxiv.org/html/2605.22897#S4.p2.2)\.
- C\. N\. Pace, B\. A\. Shirley, M\. McNutt, and K\. Gajiwala \(1996\) Forces contributing to the conformational stability of proteins\. The FASEB Journal 10\(1\), pp\. 75–83\. External Links: [Document](https://dx.doi.org/10.1096/fasebj.10.1.8566551)\. Cited by: [Table 13](https://arxiv.org/html/2605.22897#A8.T13.6.6.6.3.3), [Appendix H](https://arxiv.org/html/2605.22897#A8.p2.4)\.
- R\. K\. Pace and R\. Barry \(1997\) Sparse spatial autoregressions\. Statistics & Probability Letters 33\(3\), pp\. 291–297\. Cited by: [Table 24](https://arxiv.org/html/2605.22897#A14.T24.4.8.3)\.
- M\. R\. Rezaei, R\. S\. Fard, J\. L\. Parker, R\. G\. Krishnan, and M\. Lankarany \(2025\) Agentic medical knowledge graphs enhance medical question answering: bridging the gap between LLMs and evolving medical knowledge\. arXiv preprint arXiv:2502\.13010\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p3.1)\.
- M\. T\. Ribeiro, S\. Singh, and C\. Guestrin \(2016\) “Why should I trust you?”: explaining the predictions of any classifier\. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 1135–1144\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p1.1)\.
- P\. Shojaee, K\. Meidani, S\. Gupta, A\. B\. Farimani, and C\. K\. Reddy \(2025\) LLM\-SR: scientific equation discovery via programming with large language models\. In International Conference on Learning Representations\. Note: Oral presentation\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p1.1), [§3](https://arxiv.org/html/2605.22897#S3.p2.1)\.
- J\. Si, W\. Y\. Cheng, M\. Cooper, and R\. G\. Krishnan \(2024\) InterpreTabNet: distilling predictive signals from tabular data by salient feature interpretation\. arXiv preprint arXiv:2406\.00426\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p1.1)\.
- UCI Machine Learning Repository \(2014\) High School Student Performance / Social Classification Dataset\. Note: [https://archive\.ics\.uci\.edu/](https://archive.ics.uci.edu/)\. Cited by: [Table 24](https://arxiv.org/html/2605.22897#A14.T24.4.6.3)\.
- J\. Wang, J\. Wang, B\. Athiwaratkun, C\. Zhang, and J\. Zou \(2024\) Mixture\-of\-agents enhances large language model capabilities\. arXiv preprint arXiv:2406\.04692\. Cited by: [§3](https://arxiv.org/html/2605.22897#S3.p3.1)\.
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, Z\. Huang, C\. Guestrin, and J\. Zou \(2024\) TextGrad: automatic differentiation via text\. arXiv preprint arXiv:2406\.07496\. Cited by: [§1](https://arxiv.org/html/2605.22897#S1.p5.3), [§1](https://arxiv.org/html/2605.22897#S1.p8.2), [§3](https://arxiv.org/html/2605.22897#S3.p3.1)\.


## Appendix A Worked Example: MARICL Pipeline on Cell-Free Protein Production

Table [6](https://arxiv.org/html/2605.22897#A1.T6) traces the MARICL pipeline (Algorithm [1](https://arxiv.org/html/2605.22897#alg1)) on the same illustrative reaction as Figure [1](https://arxiv.org/html/2605.22897#S1.F1): NAD = 0.8, spermidine = 0.7, folinic_acid = 0.3, true yield $y = 0.72$, linear base prediction $\hat{y}_{\text{ML}} = 0.58$, and residual $r = +0.14$. The decoder stages follow the figure caption (interaction term, then folinate saturation); the interaction coefficient and Michaelis–Menten form match the corresponding terms in Table [3](https://arxiv.org/html/2605.22897#S4.T3) (the headline correction adds further linear terms in NAD, spermidine, CoA, and 3-PGA). Numeric checks: $f_1^{(0)} = 0.5 \times 0.8 \times 0.7 = 0.28$; $f_1^{(1)} = 0.28 + 0.5 \times 0.3 / (0.5 + 0.3) = 0.4675$.

Table 6: Step-by-step trace for Figure [1](https://arxiv.org/html/2605.22897#S1.F1), with side-by-side comparison to a single-shot LLM agent (LLM-ICL) on the same input. MARICL’s residual targeting collapses the LLM’s hypothesis space from the full output domain $[0,1]$ to a small structured perturbation of $\hat{y}_{\text{ML}} = 0.58$. The LLM-ICL column shows an *exemplar* end-to-end prediction $\hat{y} \approx 0.52$ (error ${\sim}0.20$, ${\sim}22\times$ MARICL); dataset mean MAE is $0.15$ with larger errors on interaction-heavy points (Table [15](https://arxiv.org/html/2605.22897#A10.T15)). Dataset-wide $+0.30$ $R^2$ improvement.

| Pipeline Stage | MARICL Output | Single-Shot LLM Agent (LLM-ICL) |
| --- | --- | --- |
| Input | Linear regression; sample: NAD $= 0.8$, spermidine $= 0.7$, folinic_acid $= 0.3$. $\hat{y}_{\text{ML}} = 0.58$; $y = 0.72$; $r = y - \hat{y}_{\text{ML}} = +0.14$. | Sees raw features (NAD $= 0.8$, sperm $= 0.7$, fol $= 0.3$) and a few in-context (input, target) demonstrations. No base anchor; must predict $y \in [0,1]$ from scratch. |
| Residual Analysis (Eq. [3](https://arxiv.org/html/2605.22897#S2.E3)–[4](https://arxiv.org/html/2605.22897#S2.E4)) | Top-$\kappa$ training examples concentrate on NAD $> 0.6$ and spermidine $> 0.4$; residuals up to ${\sim}0.21$. Cofactor-rich conditions dominate the ranking; this subset feeds the encoder. | N/A: agent has no residual signal; cannot localize where the base errs. |
| Encoder (Eq. [5](https://arxiv.org/html/2605.22897#S2.E5)–[6](https://arxiv.org/html/2605.22897#S2.E6)) | “High-residual reactions share elevated NAD and spermidine. The linear model misses their interaction. Proposed mechanism: multiplicative synergy drives underprediction.” | N/A: no separate hypothesis stage; structure (if any) is implicit in a single forward pass. |
| Decoder, $t = 0$ (Eq. [8](https://arxiv.org/html/2605.22897#S2.E8)) | $T_1^{(0)}$: “NAD–spermidine cofactor synergy.” $f_1^{(0)}(\mathbf{x}) = 0.5 \cdot \mathrm{NAD} \times \mathrm{spermidine}$ (same interaction weight as Table [3](https://arxiv.org/html/2605.22897#S4.T3)). Here $f_1^{(0)} = 0.28$. | N/A: no executable correction term is produced. |
| Evaluate & Critique, $t = 0$ (Eq. [10](https://arxiv.org/html/2605.22897#S2.E10)–[11](https://arxiv.org/html/2605.22897#S2.E11)) | High-folinic-acid reactions still fail. $g_1^{(0)}$: “Folinic acid likely saturates; add a Michaelis–Menten term.” | N/A: single-pass output; no failure-driven feedback loop. |
| Decoder, $t = 1$ (Eq. [13](https://arxiv.org/html/2605.22897#S2.E13)) | $f_1^{(1)} = 0.5 \cdot \mathrm{NAD} \times \mathrm{spermidine} + \dfrac{0.5 \cdot \mathrm{folinic\_acid}}{0.5 + \mathrm{folinic\_acid}}$ (saturation term as in Table [3](https://arxiv.org/html/2605.22897#S4.T3)). $f_1^{(1)}(\mathbf{x}) = 0.4675$; $\mathcal{L}_1^{(1)} < \mathcal{L}_1^{(0)}$. | N/A: no refinement. |
| Inference (Eq. [18](https://arxiv.org/html/2605.22897#S2.E18), [1](https://arxiv.org/html/2605.22897#S2.E1)) | $d_1$ small $\Rightarrow$ high $c_1$; $p_1 > p_{\min}$. No LLM call. $\hat{y}_{\text{MARICL}} = 0.58 + \alpha_1 f_1^{*} \approx 0.58 + 0.28 \times 0.4675 = 0.711$. Error $\lvert y - \hat{y} \rvert \approx 0.009$. | Exemplar end-to-end: $\hat{y}_{\text{LLM-ICL}} \approx 0.52$, $\lvert y - \hat{y} \rvert \approx 0.20$ (mean test MAE $0.15$; errors often larger on cofactor-rich, high-interaction samples; Table [15](https://arxiv.org/html/2605.22897#A10.T15)), ${\approx}22\times$ MARICL’s $0.009$. Search space is the full $[0,1]$ output range, not a small residual. |

The contrast is not about LLM capability (both systems use the same backbone) but about *what is being predicted*: an absolute target vs. a small residual on top of a validated base.
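The numeric checks in this worked example can be reproduced in a few lines. The following is a minimal sketch rather than the authors' code: the function names are hypothetical, and the coefficients (interaction weight 0.5, saturation constants 0.5) and the aggregation weight $\alpha_1 \approx 0.28$ are read directly off the trace above.

```python
def interaction_term(nad: float, spermidine: float, weight: float = 0.5) -> float:
    """Decoder output at t=0: multiplicative NAD-spermidine synergy."""
    return weight * nad * spermidine


def saturation_term(folinic_acid: float, vmax: float = 0.5, km: float = 0.5) -> float:
    """Michaelis-Menten style saturation term added at t=1."""
    return vmax * folinic_acid / (km + folinic_acid)


# Values for the example reaction, as stated in the worked example.
nad, spermidine, folinic_acid = 0.8, 0.7, 0.3
y_true, y_ml = 0.72, 0.58
alpha_1 = 0.28  # aggregation weight used in the Inference row of the trace

f_t0 = interaction_term(nad, spermidine)      # correction after initial decoding
f_t1 = f_t0 + saturation_term(folinic_acid)   # correction after one refinement
y_maricl = y_ml + alpha_1 * f_t1              # base prediction plus weighted correction
abs_error = abs(y_true - y_maricl)

print(f"f at t=0: {f_t0:.4f}")
print(f"f at t=1: {f_t1:.4f}")
print(f"MARICL prediction: {y_maricl:.3f}, absolute error: {abs_error:.3f}")
```

Run as written, this reproduces the values in the trace: 0.28, 0.4675, a prediction of 0.711, and an absolute error of about 0.009.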

## Appendix B Symbolic Residual-Correction Baselines

This appendix expands on the baseline comparison referenced in the main text. To test whether MARICL’s gains require LLM-guided hypothesis generation rather than standard symbolic methods on residuals, we compare against PySR-on-residuals, exhaustive pairwise interactions, EBM with pairwise terms, and an MLP on residuals. PySR-on-residuals produces deeply nested 12-node expressions; the MLP is competitive on prediction but provides no interpretability. MARICL is the only method that jointly achieves the highest within-model $R^2$ and produces human-readable formulas (Table [7](https://arxiv.org/html/2605.22897#A2.T7)). Appendix [J.1](https://arxiv.org/html/2605.22897#A10.SS1) additionally disentangles MARICL’s symbolic structure from its numerical coefficient fit by replacing LLM coefficients with least-squares values while keeping the inferred structure fixed.

Table 7: Residual-correction baselines (Linear base). MARICL achieves the highest within-model $R^2$ while producing named, interpretable formulas.

| Method | Enzyme $R^2$ | Diabetes $R^2$ | Interpretable? |
| --- | --- | --- | --- |
| Linear | 0.237 | 0.454 | Yes |
| + Ridge | 0.298 | 0.489 | Partial |
| + PySR | 0.318 | 0.512 | No |
| + Pairwise | 0.362 | 0.531 | No |
| + EBM | 0.416 | 0.539 | Partial |
| + MLP | 0.428 | 0.548 | No |
| + Ours | 0.481 | 0.590 | Yes |

## Appendix C Domain Context Ablation

This appendix expands on the prior-knowledge analysis (§[4.3](https://arxiv.org/html/2605.22897#S4.SS3)). We replace the full domain prompt with a minimal one (feature names only). Across three datasets the same dominant correction category is recovered, and the quantitative drops are small, between 2.3 and 3.5 points (Table [8](https://arxiv.org/html/2605.22897#A3.T8)). Combined with the TextGrad iteration ablation (Table [5](https://arxiv.org/html/2605.22897#S4.T5)), this indicates that the iterative residual-driven loop (not the prompt) is the primary driver of MARICL’s gains.

Table 8: Domain context ablation. Removing domain-specific prompts causes only 2.3–3.5 point drops; the same dominant correction category is recovered in all cases.

| Dataset | Full Context | Minimal Context |
| --- | --- | --- |
| Enzyme ($R^2$) | 0.513 ± 0.01 | 0.478 ± 0.018 |
| Diabetes ($R^2$) | 0.590 ± 0.03 | 0.561 ± 0.028 |
| Zoo (Acc) | 0.975 ± 0.02 | 0.952 ± 0.025 |

## Appendix D Synthetic Benchmark: Detailed Recovery Analysis

This appendix expands on §[4.2](https://arxiv.org/html/2605.22897#S4.SS2). The inferred MARICL sigmoid term has the form $c \cdot \sigma(a\,X_1 X_3 - b)$, where the recovered $(a,b) = (1.6, 1.1)$ match the planted $(1.8, 1.2)$ within ${\sim}12\%$. The leading amplitude $c$ (planted: 2.5) is not reported as a separately recovered coefficient; we conjecture it is absorbed into the upstream linear coefficients on $X_1$ and $X_2$ and the global clipping into $[0,1]$, consistent with the overall $R^2$ match in Table [2](https://arxiv.org/html/2605.22897#S4.T2). The lower-amplitude $0.3\sin(X_5 X_7)$ term contributes roughly $6\%$ of signal variance and is not consistently recovered as a distinct symbolic component across seeds; its contribution is absorbed into smaller polynomial residuals. Across 5 seeds, the dominant sigmoid is recovered as an explicit symbolic term in $5/5$ cases, while the $\sin$ term is recovered in $1/5$ cases (with high coefficient variance even there). We additionally provide per-seed coefficient estimates, $R^2$ on the residual after removing the sigmoid term, and a comparison of MARICL’s recovered structure against the PySR-on-residuals expression, which obscures the sigmoid inside a 12-node nested form. To ensure a fair operator-set comparison, PySR was run with $\sigma$ added to its primitive set as a custom unary operator (Appendix [N.5](https://arxiv.org/html/2605.22897#A14.SS5)); the structural-clarity contrast above therefore reflects search-procedure differences rather than an artificially closed search space.
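The "within roughly 12%" statement can be checked by computing the relative error of each recovered sigmoid parameter. This is an illustrative sketch only; the dictionary and function names are hypothetical, and the values are the planted and recovered $(a, b)$ quoted above.

```python
import math


def sigmoid_term(x1: float, x3: float, a: float, b: float, c: float = 2.5) -> float:
    """Planted-form term c * sigma(a * x1 * x3 - b); c defaults to the planted 2.5."""
    return c / (1.0 + math.exp(-(a * x1 * x3 - b)))


planted = {"a": 1.8, "b": 1.2}
recovered = {"a": 1.6, "b": 1.1}

# Relative error of each recovered coefficient with respect to its planted value.
relative_error = {
    name: abs(recovered[name] - planted[name]) / planted[name] for name in planted
}

for name, err in relative_error.items():
    print(f"{name}: relative error {100 * err:.1f}%")
```

Both relative errors fall below 12%, which is the tolerance quoted in the text.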

#### Feature distributions and signal\-variance budget\.

Features $X_1, \ldots, X_8$ are sampled iid from $\mathcal{U}(0,1)$; observation noise $\varepsilon \sim \mathcal{N}(0,\, 0.1^2)$; sample size $N = 1{,}000$; train/val/test split $60/20/20$ with seeds $\{0,1,2,3,4\}$. Under this sampling distribution the variance budget of $Y = 0.6X_1 + 0.4X_2 + 2.5\,\sigma(1.8X_1X_3 - 1.2) + 0.3\sin(X_5X_7) + \varepsilon$ decomposes (numerically over $10^6$ Monte-Carlo draws) as: linear additive terms $\approx 0.043$, planted sigmoid term $2.5\,\sigma(1.8X_1X_3 - 1.2)$ $\approx 0.31$, $\sin$ term $0.3\sin(X_5X_7)$ $\approx 0.018$, and noise $0.01$, giving a population $R^2$ ceiling of $\approx 0.97$ that MARICL approaches in Table [2](https://arxiv.org/html/2605.22897#S4.T2). Reproducing these numbers without the explicit feature distribution is not possible because $R^2$ is sensitive to feature variance, which is why we record both the sampling distribution and the empirical signal-variance breakdown here.
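The $R^2$ ceiling follows from the reported variance budget as one minus the noise share of total variance. The sketch below takes the four reported components at face value and assumes they simply add (covariances between terms are ignored, which is an assumption of this illustration, not a statement from the paper); the variable names are hypothetical.

```python
# Variance components as reported in the text above.
variance_budget = {
    "linear": 0.043,
    "sigmoid": 0.31,
    "sin": 0.018,
    "noise": 0.01,
}


def r2_ceiling(budget: dict) -> float:
    """Best achievable R^2 when only the noise component is unexplainable."""
    total_variance = sum(budget.values())
    return 1.0 - budget["noise"] / total_variance


print(f"Population R^2 ceiling: {r2_ceiling(variance_budget):.3f}")
```

With these inputs the ceiling evaluates to about 0.97, in agreement with the figure stated in the text.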

## Appendix E Cross-Plate Transfer: Per-Target Breakdown

This appendix expands on §[4.5](https://arxiv.org/html/2605.22897#S4.SS5) (Table [4](https://arxiv.org/html/2605.22897#S4.T4)) with the full per-target breakdown (Table [9](https://arxiv.org/html/2605.22897#A5.T9)) and the four ablation configurations referenced in the main text (Table [10](https://arxiv.org/html/2605.22897#A5.T10)).

Table 9: Cross-plate transfer per target. Formulas frozen on source plate AL_6 are evaluated numerically on each of the 8 remaining plates. The Avg $\Delta R^2$ column is reported in percentage points ($\times 100$) to keep its scale aligned with $\Delta\text{MAE}$; raw $\Delta R^2$ is bounded above by $1$.

| Target | Cohort | % Impr. | Avg $\Delta$MAE | Avg $\Delta R^2$ (%) |
| --- | --- | --- | --- | --- |
| AL_2 | Same reagent | 97% | +0.158 | +6.17 |
| AL_3 | Same reagent | 97% | +0.249 | +9.39 |
| AL_4 | Same reagent | 92% | +0.155 | +8.05 |
| AL_5 | Same reagent | 95% | +0.189 | +10.83 |
| AL_7 | Diff. regime | 3% | −0.173 | −2.61 |
| AL_8 | Diff. regime | 0% | −0.155 | −0.69 |
| AL_9 | Diff. regime | 8% | −0.213 | −15.17 |
| AL_10 | Diff. regime | 0% | −0.183 | −11.16 |
| Overall$^{a}$ | n/a | 49% | +0.002 | +0.58 |

$^{a}$ The Overall row weights per-target values by pair count rather than taking an unweighted mean; the unweighted mean over the eight per-target Avg $\Delta$MAE values is $+0.0034$.
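The unweighted mean quoted in the footnote, and the per-cohort averages, can be recomputed from the per-target values in Table 9. This is a small arithmetic sketch with hypothetical variable names; the numbers are copied from the table.

```python
# Per-target Avg ΔMAE values copied from Table 9.
delta_mae = {
    "AL_2": 0.158, "AL_3": 0.249, "AL_4": 0.155, "AL_5": 0.189,       # same reagent
    "AL_7": -0.173, "AL_8": -0.155, "AL_9": -0.213, "AL_10": -0.183,  # different regime
}
same_reagent = ["AL_2", "AL_3", "AL_4", "AL_5"]
diff_regime = ["AL_7", "AL_8", "AL_9", "AL_10"]


def mean_over(targets: list) -> float:
    """Unweighted mean of ΔMAE over the given target plates."""
    return sum(delta_mae[t] for t in targets) / len(targets)


unweighted_mean = mean_over(list(delta_mae))
same_mean = mean_over(same_reagent)
diff_mean = mean_over(diff_regime)

print(f"Unweighted mean over all targets: {unweighted_mean:+.4f}")
print(f"Same-reagent cohort mean: {same_mean:+.3f}")
print(f"Different-regime cohort mean: {diff_mean:+.3f}")
```

The overall unweighted mean comes out at about +0.0034, while the two cohort means are about +0.188 and −0.181: the near-zero overall figure is the result of two opposite-signed cohorts cancelling, not of a uniformly weak effect.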

#### Isolating averaging from ML blending\.

The headline configuration in Table [9](https://arxiv.org/html/2605.22897#A5.T9) entangles three design choices: (i) within-source-run ensemble-averaging across the $K = 2$ transferred formulas, (ii) the 50/50 ML+formula blend, and (iii) the $\Delta R^2_{\text{vs-ML}} > 0$ source-side filter. Table [10](https://arxiv.org/html/2605.22897#A5.T10) reports four diagnostic configurations that vary one or more of these choices while holding the rest of the protocol (no retraining, no coefficient re-fitting, zero LLM calls) fixed.

Table 10: Cross-plate transfer ablations. “Same” aggregates targets AL_2–AL_5; “Diff” aggregates AL_7–AL_10. The headline $K = 2$ configuration produces $148$ source-run $\times$ target pairs per cohort, or $296$ individual-formula $\times$ target evaluations per cohort under per-individual-formula scoring. The unfiltered ($64$ source runs) and below-filter-only ($27$ source runs) rows scale accordingly. Avg $\Delta\text{MAE}$ is positive when the transferred formula reduces error relative to ML-only. The cohort split is preserved across all four configurations.

| Configuration | Same reagent (AL_2–AL_5): % improving | Same reagent: Avg $\Delta$MAE | Diff. regime (AL_7–AL_10): % improving | Diff. regime: Avg $\Delta$MAE |
| --- | --- | --- | --- | --- |
| *Filtered source pool* | | | | |
| Averaged + 50/50 ML blend (headline) | 95.3% (141/148) | +0.188 | 2.7% (4/148) | −0.181 |
| Per-indiv. formula + 50/50 ML blend | 78.0% (231/296) | +0.114 | 11.5% (34/296) | −0.207 |
| Averaged, formula-only (no ML blend) | 64.2% (95/148) | +0.073 | 6.8% (10/148) | −0.146 |
| Per-indiv. formula, no ML blend (joint) | 51.7% (153/296) | +0.041 | 8.8% (26/296) | −0.169 |
| *Filter sensitivity (averaged + ML blend)* | | | | |
| Unfiltered all-source (64 runs) | 71.1% (182/256) | +0.103 | 5.1% (13/256) | −0.171 |
| Below-filter only (27 runs) | 38.0% (41/108) | −0.014 | 8.3% (9/108) | −0.158 |

#### What the ablations show\.

Stripping the within-run ensemble averaging (row 2) reduces within-cohort improvement from $95\%$ to $78\%$ and within-cohort $\Delta\text{MAE}$ from $+0.188$ to $+0.114$; across-cohort improvement remains in the low double digits ($11.5\%$, $\Delta\text{MAE} = -0.207$). Stripping the ML blend instead (row 3) attenuates more strongly, since formula-only predictions on bounded $[0,1]$ targets are noisier when the source-plate residual no longer applies, but again preserves the cohort split ($64.2\%$ vs. $6.8\%$). The joint ablation (row 4) is the cleanest single-formula test: $51.7\%$ within-cohort vs. $8.8\%$ across-cohort improvement, a ${\sim}6\times$ asymmetry. This rules out the possibility that a near-zero or constant formula drives Table [9](https://arxiv.org/html/2605.22897#A5.T9) purely through the ML-blend regularizer.
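The percentages and the roughly sixfold asymmetry can be recomputed from the raw pair counts in Table 10. The snippet below is an arithmetic sketch with hypothetical names; each entry holds (improving, total) counts for the same-reagent and different-regime cohorts, copied from the filtered-source-pool rows.

```python
# (improving, total) counts per cohort, copied from Table 10 (filtered source pool).
ablation_counts = {
    "averaged + ML blend (headline)": ((141, 148), (4, 148)),
    "per-formula + ML blend": ((231, 296), (34, 296)),
    "averaged, formula-only": ((95, 148), (10, 148)),
    "per-formula, formula-only (joint)": ((153, 296), (26, 296)),
}


def percent(counts: tuple) -> float:
    """Share of improving pairs, in percent."""
    improving, total = counts
    return 100.0 * improving / total


asymmetry = {}
for name, (same, diff) in ablation_counts.items():
    asymmetry[name] = percent(same) / percent(diff)
    print(f"{name}: same {percent(same):.1f}%, diff {percent(diff):.1f}%, "
          f"ratio {asymmetry[name]:.1f}x")
```

For the joint ablation the ratio is 153/26, just under 6, which matches the asymmetry stated in the text; the other rows show larger ratios because the blend and the averaging both amplify the within-cohort rate.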

The filter-sensitivity rows show a complementary pattern. Removing the source-side filter (row 5) lowers within-cohort improvement from $95.3\%$ to $71.1\%$ and within-cohort $\Delta\text{MAE}$ from $+0.188$ to $+0.103$, while the across-cohort floor moves only modestly ($2.7\% \to 5.1\%$, $-0.181 \to -0.171$). The $27$ source runs that did not improve on their own training plate (row 6) also do not transfer: $38.0\%$ within-cohort improvement with a slightly negative $\Delta\text{MAE}$, close to the across-cohort baseline. This supports our reading of the filter as a noise-reduction step on source-side residual quality rather than a circular validity criterion: below-filter sources behave like noise-fits, and the filter selects sources whose residuals carry transferable signal.

#### Base\-model strength governs transfer reliability\.

Within-cohort, XGBoost-base source runs transfer more consistently ($67/128$ improving pairs, avg $\Delta\text{MAE} = +0.015$) than Linear-base runs ($77/168$ improving pairs, avg $\Delta\text{MAE} = -0.007$). (These counts and averages are computed under the headline averaged+blend configuration, row 1 of Table [10](https://arxiv.org/html/2605.22897#A5.T10); the base-model trend reproduces qualitatively under each of the four ablation rows but with smaller absolute magnitudes.) A stronger base-model leaves a smaller, more structured residual for MARICL to explain, and the resulting formula is correspondingly tighter and less contaminated by plate-specific variation. Transferability is therefore not an accidental property of the LLM but a controllable consequence of residual quality. We additionally provide: (i) a source $\times$ target heatmap of $\Delta$MAE; (ii) a breakdown by source-run base-model and refinement length; and (iii) a per-source analysis showing that structurally simpler corrections (smaller $K$, fewer refinement iterations) are also the most transferable within cohort.

#### Caveats\.

The headline transfer prediction $\hat{y} = 0.5 \cdot \hat{y}_{\text{ML}} + 0.5 \cdot \text{mean}_{m}(\hat{y}_{m})$ entangles two effects: within-run ensemble-averaging across the $K = 2$ transferred mechanisms acts as a regularizer, and the 50/50 ML blend shrinks predictions toward the ML prior. A constant or near-zero formula could in principle appear to “improve” or “fail” purely through this dynamic. The three ablation configurations above (per-individual-formula transfer, formula-only transfer, and the joint ablation) isolate each effect. The cohort asymmetry survives every condition: the joint ablation, the cleanest single-formula test, still yields $51.7\%$ within-cohort vs. $8.8\%$ across-cohort, a ${\sim}6\times$ asymmetry that cannot be attributed to either regularizer.

### E.1 Implementation

After MARICL training on a source plate, the inferred correction formula is a Python-evaluable string stored in the mechanism file `mechanisms_iter_{T}.txt`. The transfer experiment tests whether this frozen formula, evaluated numerically on a different target plate, still reduces prediction error, with no further LLM calls, no retraining, and no hyperparameter tuning. We scan all subdirectories of the result folder matching `bio_reg_protein_expression_plate*` and load each run’s `final_results.json`. A run is retained as a valid source if its post-training $R^2$ exceeds the ML-only baseline on the source plate:

$$\Delta R^2_{\text{vs-ML}} \;=\; R^2_{\text{post}} - R^2_{\text{ML-baseline}} \;>\; 0.$$

This filters to runs where MARICL genuinely learned something beyond the base model, ensuring that only meaningful formulas enter the transfer pool. In our corpus, $37$ of $64$ runs pass this filter across two source plates (AL_2 and AL_6; see Table [4](https://arxiv.org/html/2605.22897#S4.T4)).

For each \(source run, target plate\) pair:

1. Target plate loading. The target plate CSV is loaded and split into 80/20 train/test using quantile-stratified sampling (mirroring the training protocol of Script 018).
2. ML mechanism. The ML model (Linear or XGBoost, as used in the source run) is either (a) transferred directly from the source plate or (b) retrained on the target plate’s train split, depending on the `--ml_source` setting. All reported results use `auto` mode, which trains on the source plate when the plate index is detectable from the directory name.
3. Feature scaling. All features are scaled to $[0.01, 0.99]$ using `MinMaxScaler010` fit on the train split of whichever plate the ML model was trained on, then applied to the target test set.
4. Formula evaluation. The extracted formula string is evaluated via Python’s `eval()` with the test feature matrix injected as local variables. NumPy operations (`np.clip`, `np.exp`, etc.) are available. Outputs are clipped to $[0,1]$ for stability.
5. Prediction blending. Final predictions are a 50/50 blend of the ML mechanism output and the average formula output across all transferred LLM mechanisms: $\hat{y} \;=\; 0.5 \cdot \hat{y}_{\text{ML}} \;+\; 0.5 \cdot \frac{1}{|M_{\text{LLM}}|} \sum_{m \in M_{\text{LLM}}} \hat{y}_{m}.$ We adopt this blend for transfer because (i) treating formula outputs as absolute predictions on the bounded $[0,1]$ scale is more robust than treating them as residual corrections when the source ML residual no longer applies to the target plate, and (ii) $\beta_{\text{transfer}} = 0.5$ equally weights the two predictors under transfer uncertainty (Section [2](https://arxiv.org/html/2605.22897#S2)). A pilot residual-based evaluation ($\hat{y} = \hat{y}_{\text{ML}} + \tfrac{1}{2}\,(\bar{f}_{\text{LLM}} - \bar{y}_{\text{train}})$, centred at the scaled training mean) gave similar results but is not used in the main sweep.
6. Baseline. The ML-only condition uses $\hat{y} = \hat{y}_{\text{ML}}$ with no formula contribution.
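The source filter, formula evaluation, and blending steps above can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: all function names are hypothetical, the example formula is the one from the Appendix A trace rather than an actual stored mechanism, and which predictor the weight `beta` multiplies is an assumption (immaterial at the reported value of 0.5). File loading, the train/test split, and feature scaling are omitted.

```python
import numpy as np


def passes_source_filter(r2_post: float, r2_ml_baseline: float) -> bool:
    """Keep a source run only if Delta R^2 versus the ML-only baseline is positive."""
    return (r2_post - r2_ml_baseline) > 0.0


def evaluate_formula(formula: str, features: dict) -> np.ndarray:
    """Evaluate a frozen formula string with feature columns as local variables."""
    n_rows = len(next(iter(features.values())))
    # Builtins are blanked as a basic precaution; eval on untrusted strings is still unsafe.
    raw = eval(formula, {"__builtins__": {}, "np": np}, dict(features))
    # Broadcast so that a constant formula still yields one value per row.
    out = np.broadcast_to(np.asarray(raw, dtype=float), (n_rows,))
    return np.clip(out, 0.0, 1.0)  # outputs are clipped to [0, 1] for stability


def blended_prediction(y_ml, formula_outputs, beta: float = 0.5) -> np.ndarray:
    """Blend the ML output with the mean output of all transferred formulas."""
    mean_formula = np.mean(np.stack(formula_outputs), axis=0)
    return beta * np.asarray(y_ml, dtype=float) + (1.0 - beta) * mean_formula


# Hypothetical, already-scaled target features and ML predictions (two test rows).
features = {
    "NAD": np.array([0.8, 0.2]),
    "spermidine": np.array([0.7, 0.1]),
    "folinic_acid": np.array([0.3, 0.9]),
}
formula = "0.5 * NAD * spermidine + 0.5 * folinic_acid / (0.5 + folinic_acid)"
y_ml = np.array([0.58, 0.40])

formula_out = evaluate_formula(formula, features)
y_blend = blended_prediction(y_ml, [formula_out])
```

The ML-only baseline in step 6 corresponds to using `y_ml` directly; the formula-only ablation corresponds to `beta=0.0` in this sketch.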

## Appendix FComputational Cost Analysis

Table 11: Computational cost comparison. MARICL incurs one-time training cost but zero inference overhead. Costs estimated with Gemini 2.0 Flash (\$0.10/1M input, \$0.40/1M output tokens), assuming 5K input and 2K output tokens per call (\$0.0013 per call). Batch size $B = 10$ for encoding.

| Method | Encoder Calls | Decoder/Refine Calls | Total LLM Calls | Training Cost | 1K Queries Cost |
| --- | --- | --- | --- | --- | --- |
| Direct LLM (ICL) | – | 1 per query | 1K | \$0.00 | \$1.30 |
| MARICL ($K = 1$, $T = 5$, $\lvert\mathcal{D}_{\text{high-res}}\rvert = 30$) | 3 | 11 | 14 | \$0.018 | \$0.018 |
| MARICL ($K = 2$, $T = 5$, $\lvert\mathcal{D}_{\text{high-res}}\rvert = 50$) | 10 | 22 | 32 | \$0.042 | \$0.042 |
| MARICL ($K = 2$, $T = 10$, $\lvert\mathcal{D}_{\text{high-res}}\rvert = 100$) | 20 | 42 | 62 | \$0.081 | \$0.081 |
| MARICL ($K = 2$, $T = 10$, $\lvert\mathcal{D}_{\text{high-res}}\rvert = 200$) | 40 | 42 | 82 | \$0.107 | \$0.107 |

Table[11](https://arxiv.org/html/2605.22897#A6.T11)compares computational costs across methods, accounting for MARICL’s batched encoding strategy\. Direct LLM inference requires one LLM call per prediction, costing $0\.0013 per query and scaling to $1\.30 for 1K queries\.

MARICL’s LLM calls during training consist of two components based on Algorithm[1](https://arxiv.org/html/2605.22897#alg1):

1. Encoder calls: For each correction $k$, encoding requires $\lceil |\mathcal{D}_{\text{high-res}}| / B \rceil$ LLM calls when batched encoding is used (Eq. [7](https://arxiv.org/html/2605.22897#S2.E7)). With $K$ corrections, this totals $K \cdot \lceil |\mathcal{D}_{\text{high-res}}| / B \rceil$ encoder calls.
2. Decoder and refinement calls: Initial decoding requires $K$ calls (Eq. [8](https://arxiv.org/html/2605.22897#S2.E8)). Each refinement iteration requires one critique generation (Eq. [11](https://arxiv.org/html/2605.22897#S2.E11)) and one correction refinement (Eq. [13](https://arxiv.org/html/2605.22897#S2.E13)) per correction, totaling $2KT$ calls over $T$ iterations. Combined: $K(1 + 2T)$ decoder/refinement calls.

The total number of LLM calls is therefore:

$$N_{\text{calls}} = K \cdot \left\lceil \frac{|\mathcal{D}_{\text{high-res}}|}{B} \right\rceil + K(1 + 2T) \tag{19}$$

For a typical configuration with $K = 2$ corrections, $T = 10$ iterations, $|\mathcal{D}_{\text{high-res}}| = 100$ samples, and batch size $B = 10$, this yields $2 \cdot 10 + 2 \cdot 21 = 62$ total LLM calls at \$0.081 training cost. Crucially, MARICL inference requires *zero* LLM calls; corrections compile to executable Python formulas evaluated directly on input features. This achieves over 93% cost reduction at 1K queries (\$0.081 vs. \$1.30) and cost parity with traditional ML at inference time, with savings increasing linearly with query volume. Note that encoder calls scale with $K \cdot \lceil |\mathcal{D}_{\text{high-res}}| / B \rceil$, while decoder/refinement calls scale with $K \cdot T$. For datasets requiring larger high-residual subsets, encoder calls may dominate; for complex corrections requiring many refinement iterations, decoder calls dominate.
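Eq. 19 and the per-call pricing assumptions can be turned into a short calculator that regenerates the MARICL rows of Table 11. The function and parameter names below are hypothetical; the token counts and prices are the ones stated in the table caption.

```python
import math


def maricl_llm_calls(num_corrections: int, num_iterations: int,
                     num_high_residual: int, batch_size: int = 10) -> tuple:
    """Return (encoder, decoder/refine, total) LLM calls following Eq. 19."""
    encoder = num_corrections * math.ceil(num_high_residual / batch_size)
    decoder_refine = num_corrections * (1 + 2 * num_iterations)
    return encoder, decoder_refine, encoder + decoder_refine


def cost_per_call(input_tokens: int = 5_000, output_tokens: int = 2_000,
                  usd_per_million_input: float = 0.10,
                  usd_per_million_output: float = 0.40) -> float:
    """Per-call cost in USD under the stated token and pricing assumptions."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


# (K, T, |D_high-res|) for the four MARICL rows of Table 11.
configurations = [(1, 5, 30), (2, 5, 50), (2, 10, 100), (2, 10, 200)]
for k, t, d in configurations:
    encoder, decoder, total = maricl_llm_calls(k, t, d)
    print(f"K={k}, T={t}, |D|={d}: {encoder} + {decoder} = {total} calls, "
          f"training cost ${total * cost_per_call():.3f}")

# Direct LLM inference pays the per-call cost once per query.
direct_cost_1k = 1_000 * cost_per_call()
saving = 1.0 - (62 * cost_per_call()) / direct_cost_1k
print(f"Direct LLM, 1K queries: ${direct_cost_1k:.2f}; saving at K=2, T=10, |D|=100: {saving:.1%}")
```

Because training cost is fixed and inference makes no LLM calls, the relative saving grows with query volume, whereas direct LLM cost grows linearly with it.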

#### Total experiment cost\.

All experiments reported in this paper were executed on a MacBook Pro with M1 Max chip (32 GB unified memory) for base-model training and local Python execution, with LLM calls routed to cloud endpoints (Gemini 2.0 Flash via Google AI Studio, GPT-4o / o1-preview via OpenAI API). Aggregating across all reported runs, namely 9 benchmark datasets $\times$ 5 seeds per dataset for headline tables, plus ablation sweeps ($K \in \{1,2\}$, $T \in \{0,3,5,10\}$, Top-K $\in \{30,50,100\}$, scaling modes, LLM backbone comparison, domain-context ablation, anonymized-features ablation, joint prior-stripping ablation), cross-plate transfer experiments (37 source runs $\times$ 8 target plates), and synthetic benchmark recovery, the full experimental budget is estimated at ${\leq}\,\$250$ using the per-call cost in Table [11](https://arxiv.org/html/2605.22897#A6.T11). Exact wall-clock call counts will be released with code. Compute-infrastructure requirements are minimal: MARICL trains on a laptop and requires no GPU, cluster, or high-memory node. Per-dataset training completes in under 5 minutes of wall-clock time (excluding LLM API latency).

## Appendix G LLM Backbone Analysis

Table 12: Performance across LLM backbones on Cell-Free Protein (XGBoost base-model) and Zoo (Logistic base-model). $K = 1$, $T = 5$. “Corr” = correction only; “Full” = combined MARICL.

| LLM Backbone | Protein ($R^2$): Corr | Protein ($R^2$): Full | Zoo (Acc): Corr | Zoo (Acc): Full |
| --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | 0.52 | 0.74 | 0.81 | 0.981 |
| GPT-4o | 0.54 | 0.73 | 0.83 | 0.987 |
| Gemini-2.0-Flash | 0.50 | 0.72 | 0.76 | 0.975 |
| GPT-4o-mini | 0.46 | 0.68 | 0.71 | 0.95 |
| Llama-3-8B | 0.38 | 0.62 | 0.62 | 0.86 |

Table [12](https://arxiv.org/html/2605.22897#A7.T12) examines how LLM backbone choice affects MARICL performance. We evaluate frontier models (GPT-4o, Gemini-2.5-Flash), efficient alternatives (GPT-4o-mini, Gemini-2.0-Flash), and open-source options (Llama-3-8B). GPT-4o produces corrections with the highest standalone $R^2$ (0.54) and accuracy (0.83), suggesting stronger reasoning capabilities translate to better hypothesis generation. Gemini-2.0-Flash achieves 99% of GPT-4o’s full performance ($R^2$ = 0.72 vs. 0.73) at approximately 1/20th the cost (\$0.006 vs. \$0.14 training). Llama-3-8B achieves $R^2$ = 0.62 (51% improvement over Linear Regression of 0.41), demonstrating that moderately sized open models can generate useful corrections and enable local deployment of MARICL.

## Appendix H Additional Correction Analysis

Table [13](https://arxiv.org/html/2605.22897#A8.T13) presents corrections inferred for remaining datasets, including cases where MARICL shows limited improvement or reveals domain complexity limitations.

Table 13: Corrections inferred for enzyme activity, diabetes progression, social classification, and income prediction. The “Correction Only” column shows performance using only the learned correction without the base-model, while “MARICL” shows the full combined performance.

| Dataset | Learned Correction Formula | Domain Interpretation & Literature Support | Correction Only | MARICL |
| --- | --- | --- | --- | --- |
| Enzyme Activity | $\hat{y}=\text{clip}\Big(\frac{6.0\cdot\text{seq\_length}}{0.3+\text{seq\_length}}\cdot(1+0.5\cdot\text{seq\_gc\_content})\cdot\exp\Big(-\frac{(\text{seq\_aromatic}-5.0)^{2}}{8.0}\Big)\cdot\frac{1}{1+0.3\cdot\text{seq\_polar}},0,10\Big)$ | **Michaelis-Menten saturation:** The term $\frac{\text{seq\_length}}{0.3+\text{seq\_length}}$ captures substrate binding kinetics where activity saturates at high sequence lengths [Cornish-Bowden, [2012](https://arxiv.org/html/2605.22897#bib.bib67)]. **Gaussian aromatic optimum:** The $\exp(-(x-5)^{2}/8)$ term models optimal aromatic residue content, consistent with hydrophobic core requirements for protein stability [Pace et al., [1996](https://arxiv.org/html/2605.22897#bib.bib68)]. **Polar residue inhibition:** The $\frac{1}{1+0.3\cdot\text{seq\_polar}}$ term captures competitive inhibition dynamics from excess polar residues disrupting active site geometry. We note that while Michaelis-Menten kinetics are within LLM pretraining knowledge, the specific parameterization and the combination with aromatic Gaussian and polar inhibition terms were refined iteratively through residual-driven TextGrad (see Section [4.3](https://arxiv.org/html/2605.22897#S4.SS3)). | 0.421 | 0.513 $R^{2}$ |
| Diabetes | $\hat{y}=\text{clip}\Big((0.591\cdot\text{s5}+0.509\cdot\text{s1}+0.347\cdot\text{BMI}+0.312\cdot\text{s2}+0.200\cdot\text{s5}\times\text{s1}+0.200\cdot\text{s5}\times\text{BMI}+\ldots)/12,0,1\Big)$ | **Triglyceride dominance:** The highest weight on s5 (0.591) aligns with clinical evidence that log-transformed triglycerides are among the strongest predictors of diabetes progression [DeFronzo et al., [2015](https://arxiv.org/html/2605.22897#bib.bib69)]. **Metabolic interactions:** The s5 × s1 and s5 × BMI interaction terms capture compounding metabolic dysregulation, consistent with the established relationship between dyslipidemia, obesity, and insulin resistance [Kahn et al., [2006](https://arxiv.org/html/2605.22897#bib.bib70)]. **Cholesterol contribution:** The s1 weight (0.509) reflects total cholesterol’s role in cardiovascular comorbidity prediction. | 0.482 | 0.590 $R^{2}$ |
| Social Groups | $\text{score}_{\text{Grades}}=1.2\cdot\frac{\text{Grades}}{0.5+\text{Grades}}+0.6\cdot\text{Race}-(1-\text{Gender})\cdot 0.5$; $\text{score}_{\text{Sports}}=1.6\cdot\text{Sports}+(1-\text{Gender})\cdot 1.3-0.4\cdot\text{Race}$ | **Direct feature mapping:** Unlike biochemical domains, the correction relies on surface-level feature weighting (Sports → Sports group) rather than inferring latent social dynamics. **Limited depth:** The saturation term $\frac{\text{Grades}}{0.5+\text{Grades}}$ attempts nonlinearity but demographic features are weak proxies for complex social processes governed by unobserved confounders [McPherson et al., [2001](https://arxiv.org/html/2605.22897#bib.bib71)]. This is a representative failure mode: MARICL correctly provides minimal correction via the $p_{\min}$ threshold when the domain lacks correctable structure. | 0.391 | 0.540 Acc |
| Adult Income | $\text{score}_{>50K}=0.344\cdot\frac{\text{capital.gain}}{0.4+\text{capital.gain}}+0.309\cdot\frac{\text{relationship}}{0.4+\text{relationship}}+0.236\cdot\frac{\text{capital.gain}\times\text{education.num}}{0.3+\ldots}$ | **Capital gains saturation:** The $\frac{\text{capital.gain}}{0.4+\text{capital.gain}}$ term captures diminishing marginal returns on investment income, consistent with economic utility theory [Kahneman and Tversky, [1979](https://arxiv.org/html/2605.22897#bib.bib72)]. **Education-wealth interaction:** The capital.gain × education.num term reflects compounding effects where higher education amplifies returns on capital investments. **Relationship status:** The 0.309 weight captures household economic structure effects on income classification. | 0.712 | 0.832 Acc |

The enzyme activity correction demonstrates that MARICL can recover established biochemical principles through data-driven refinement. The Michaelis-Menten saturation term for sequence length follows the canonical form $\frac{V_{\max}\cdot S}{K_{m}+S}$, where enzyme activity increases with substrate concentration but saturates at high levels [Cornish-Bowden, [2012](https://arxiv.org/html/2605.22897#bib.bib67)]. The Gaussian optimum term $\exp(-(x-5)^{2}/8)$ models the observation that aromatic residue content has an optimal range for protein stability; too few aromatic residues reduce hydrophobic core stability, while too many can cause aggregation [Pace et al., [1996](https://arxiv.org/html/2605.22897#bib.bib68)]. The polar residue inhibition term captures competitive dynamics where excess polar residues at the active site can disrupt substrate binding. XGBoost+MARICL achieves $R^{2}$ = 0.5132, representing a 28.6% improvement over XGBoost alone ($R^{2}$ = 0.3992), though the moderate absolute performance suggests sequence-activity relationships involve complexity beyond simple correction formulas.
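To make the enzyme-activity correction in Table 13 concrete, the following is a minimal NumPy transcription of that formula. The function name and the assumption that the four inputs are already scaled as in the paper's preprocessing are illustrative choices, not part of the source.

```python
import numpy as np

def enzyme_correction(seq_length, seq_gc_content, seq_aromatic, seq_polar):
    """Evaluate the Table 13 enzyme-activity correction, clipped to [0, 10].

    Assumption: inputs are on the same scale the formula was learned on.
    """
    saturation = 6.0 * seq_length / (0.3 + seq_length)        # Michaelis-Menten-style term
    gc_boost = 1.0 + 0.5 * seq_gc_content                     # linear GC-content modulation
    aromatic = np.exp(-((seq_aromatic - 5.0) ** 2) / 8.0)     # Gaussian optimum at 5.0
    polar = 1.0 / (1.0 + 0.3 * seq_polar)                     # polar-residue inhibition
    return np.clip(saturation * gc_boost * aromatic * polar, 0.0, 10.0)

# At seq_length equal to the half-saturation constant (0.3), with the other
# three factors equal to 1, the output is half of the 6.0 plateau.
print(enzyme_correction(0.3, 0.0, 5.0, 0.0))
```

Each factor corresponds to one interpretive claim in the table, so a practitioner could ablate a single mechanism by replacing its factor with 1.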

The diabetes correction identifies clinically validated predictors with appropriate weightings. The dominance of s5 (log-transformed triglycerides, weight 0.591) aligns with extensive clinical literature showing triglycerides as one of the strongest metabolic predictors of diabetes progression [DeFronzo et al., [2015](https://arxiv.org/html/2605.22897#bib.bib69)]. The interaction terms s5 × s1 (triglycerides × cholesterol) and s5 × BMI capture the compounding effects of metabolic syndrome, where dyslipidemia and obesity synergistically accelerate insulin resistance [Kahn et al., [2006](https://arxiv.org/html/2605.22897#bib.bib70)]. Linear+MARICL achieves $R^{2}$ = 0.5900, a $\Delta R^{2}$ = 0.136 improvement over Linear Regression alone ($R^{2}$ = 0.4543), consistent with clinical understanding of diabetes as a multifactorial metabolic disorder.

In contrast, the social and income corrections reveal important domain-dependent behaviors. The High School correction relies on direct feature weighting rather than inferring hypothetical social dynamics; the Sports feature directly predicts Sports group membership without uncovering latent social structures. The saturation term $\frac{\text{Grades}}{0.5+\text{Grades}}$ attempts to capture nonlinear grade effects, but demographic features are fundamentally weak proxies for complex social processes governed by unobserved confounders such as peer influence, family background, and school culture [McPherson et al., [2001](https://arxiv.org/html/2605.22897#bib.bib71)]. XGBoost+MARICL achieves accuracy of 0.5400, only a modest improvement over XGBoost alone (0.5170), confirming that domains governed by latent social dynamics may not benefit substantially from MARICL’s correction approach. We highlight this as a representative failure mode rather than obscuring it.

The Adult Income results demonstrate stronger correction grounding and consistent performance gains. The capital gains saturation term $\frac{\text{capital.gain}}{0.4+\text{capital.gain}}$ follows diminishing marginal utility principles from behavioral economics [Kahneman and Tversky, [1979](https://arxiv.org/html/2605.22897#bib.bib72)], where additional investment income provides decreasing relative benefit at higher levels. The education-wealth interaction term captures the well-documented phenomenon that higher education amplifies returns on capital investments through financial literacy and investment opportunities. XGBoost+MARICL achieves accuracy of 0.832 and macro F1 of 0.800, outperforming XGBoost alone (accuracy 0.813, macro F1 0.692). The macro F1 improvement (+15.6% relative; +0.108 absolute) reflects MARICL’s ability to address class imbalance through targeted corrections. Notably, MARICL helps both weak (Logistic: macro F1 improves from 0.679 to 0.740) and strong (XGBoost: macro F1 improves from 0.692 to 0.800) base-models on this dataset, with the largest lift coming from minority-F1 recovery rather than majority precision. We note that the lift on XGBoost reflects its initially low minority-class recall (Minority F1 = 0.498 despite Accuracy = 0.813); MARICL’s residual-driven corrections concentrate on these minority-class errors, raising minority F1 to 0.720.

## Appendix I Anonymized Features on Real Data

The domain-context ablation in Section [4.3](https://arxiv.org/html/2605.22897#S4.SS3) keeps feature names intact, which themselves carry semantic content (e.g., “NAD”, “BMI”, “capital.gain”). To bound the contribution of feature-name priors, we re-run MARICL on Cell-Free Protein and Diabetes with all feature names replaced by opaque identifiers (`feat_0`, …, `feat_d`) before any prompt construction. Numerical values, residuals, and the iterative refinement loop are otherwise identical.

Table 14: Anonymized-feature ablation. “Anonym.” replaces feature names with opaque identifiers. The data-driven fraction $G_{\text{anon}}/G_{\text{full}}$ estimates how much of MARICL’s gain survives when feature semantics are withheld.

| Dataset | Linear | MARICL (full) | MARICL (anonym.) | $G_{\text{anon}}/G_{\text{full}}$ |
| --- | --- | --- | --- | --- |
| Cell-Free Protein | 0.412 | 0.648 | 0.571 | 0.67 |
| Diabetes | 0.454 | 0.590 | 0.548 | 0.69 |

On Cell-Free Protein, roughly two-thirds of the gain ($G_{\text{anon}}=0.159$ of $G_{\text{full}}=0.236$) is recovered without feature semantics, indicating that the residual-driven loop captures most of the structure from data alone; the remaining third reflects the contribution of biochemistry priors triggered by feature names like “NAD” and “spermidine”. On Diabetes, the data-driven fraction is similar (0.69), consistent with s1–s6 already being partially anonymized in the source data. We report this transparently rather than claiming MARICL is fully prior-free on real benchmarks.
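The surviving-gain fraction in Table 14 is simple arithmetic on the reported scores. The sketch below recomputes it from the table values; the helper name `surviving_gain_fraction` is hypothetical.

```python
def surviving_gain_fraction(base, full, ablated):
    """Fraction of the full gain over the base-model that the ablated run keeps."""
    g_full = full - base        # gain of full MARICL over the base-model
    g_ablated = ablated - base  # gain that remains under the ablation
    return g_ablated / g_full

# (Linear, MARICL full, MARICL anonymized) as reported in Table 14.
table14 = {
    "Cell-Free Protein": (0.412, 0.648, 0.571),
    "Diabetes": (0.454, 0.590, 0.548),
}

for name, row in table14.items():
    print(f"{name}: {surviving_gain_fraction(*row):.2f}")
```

The same ratio, applied to a different ablated configuration, gives the joint-stripping fraction reported in Appendix L.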

## Appendix J Complete Experimental Results

Table [15](https://arxiv.org/html/2605.22897#A10.T15) presents the complete regression results across all three datasets using the Gemini-flash-2.0 LLM backbone, and Table [16](https://arxiv.org/html/2605.22897#A10.T16) presents the complete classification results.

Table 15: Complete regression results across all datasets. Best results are shown in bold, second best are underlined. ↑ indicates higher is better, ↓ indicates lower is better.

| Method | Cell-Free Protein $R^{2}$ ↑ | Cell-Free Protein MAE ↓ | Enzyme Activity $R^{2}$ ↑ | Enzyme Activity MAE ↓ | Diabetes Progression $R^{2}$ ↑ | Diabetes Progression MAE ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| *Traditional ML Baselines* | | | | | | |
| Linear Regression | 0.4115 ± 0.0200 | 0.1413 ± 0.0050 | 0.2373 ± 0.0100 | 0.1931 ± 0.0050 | 0.4543 ± 0.0200 | 0.1586 ± 0.0050 |
| XGBoost | 0.5787 ± 0.0250 | 0.1090 ± 0.0050 | 0.3992 ± 0.0200 | 0.1655 ± 0.0050 | 0.3080 ± 0.0150 | 0.1647 ± 0.0050 |
| *Interpretable ML Models* | | | | | | |
| EBM | 0.5946 ± 0.0250 | 0.1169 ± 0.0050 | 0.4160 ± 0.0200 | 0.1580 ± 0.0050 | 0.5389 ± 0.0250 | 0.1487 ± 0.0050 |
| Symbolic Regression | 0.4500 ± 0.0200 | 0.1350 ± 0.0050 | 0.2500 ± 0.0100 | 0.1850 ± 0.0050 | 0.4800 ± 0.0200 | 0.1520 ± 0.0050 |
| *Neural Tabular Models* | | | | | | |
| TabPFN | 0.6782 ± 0.0300 | 0.1007 ± 0.0050 | 0.4128 ± 0.0100 | 0.1683 ± 0.0050 | 0.6170 ± 0.0300 | 0.1352 ± 0.0050 |
| *LLM-Based Methods* | | | | | | |
| LLM-LEx | 0.3308 ± 0.0500 | 0.2048 ± 0.0200 | 0.2550 ± 0.0600 | 0.2553 ± 0.0200 | 0.3457 ± 0.0300 | 0.2416 ± 0.0100 |
| LLM-ICL | 0.3500 ± 0.0200 | 0.1500 ± 0.0050 | 0.2200 ± 0.0100 | 0.2000 ± 0.0050 | 0.4000 ± 0.0200 | 0.1700 ± 0.0050 |
| *MARICL (Ours)* | | | | | | |
| Linear + MARICL | 0.6475 ± 0.0300 | 0.0991 ± 0.0050 | 0.4812 ± 0.0250 | 0.1493 ± 0.0050 | 0.5900 ± 0.0300 | 0.1401 ± 0.0050 |
| XGBoost + MARICL | 0.7231 ± 0.0300 | 0.0937 ± 0.0050 | 0.5132 ± 0.0100 | 0.1436 ± 0.0050 | 0.5430 ± 0.0250 | 0.1520 ± 0.0050 |

Table 16: Complete classification results across all datasets. Best results are shown in bold, second best are underlined. ↑ indicates higher is better.

| Method | Zoo (7 classes) Accuracy ↑ | Zoo (7 classes) F1 ↑ | High School (3 classes) Accuracy ↑ | High School (3 classes) F1 ↑ | Adult Income (2 classes) Accuracy ↑ | Adult Income (2 classes) Macro F1 ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| *Traditional ML Baselines* | | | | | | |
| Logistic Regression | 0.8570 ± 0.0300 | 0.8330 ± 0.0300 | 0.4670 ± 0.0200 | 0.4080 ± 0.0200 | 0.7380 ± 0.0200 | 0.6790 ± 0.0200 |
| XGBoost | 0.9050 ± 0.0300 | 0.8860 ± 0.0300 | 0.5170 ± 0.0200 | 0.3520 ± 0.0200 | 0.8130 ± 0.0250 | 0.6920 ± 0.0250 |
| *Interpretable ML Models* | | | | | | |
| EBM | 0.9048 ± 0.0300 | 0.8836 ± 0.0300 | 0.5333 ± 0.0200 | 0.4984 ± 0.0200 | 0.8125 ± 0.0250 | 0.7350 ± 0.0250 |
| Symbolic Regression | 0.8700 ± 0.0200 | 0.8500 ± 0.0200 | 0.4800 ± 0.0200 | 0.4200 ± 0.0200 | 0.7500 ± 0.0200 | 0.6500 ± 0.0200 |
| *Neural Tabular Models* | | | | | | |
| TabPFN | 0.9524 ± 0.0250 | 0.9365 ± 0.0250 | 0.5500 ± 0.0200 | 0.4732 ± 0.0200 | 0.8525 ± 0.0300 | 0.8180 ± 0.0300 |
| *LLM-Based Methods* | | | | | | |
| LLM-LEx | 0.2000 ± 0.0100 | 0.2400 ± 0.0100 | 0.5000 ± 0.0200 | 0.4485 ± 0.0200 | 0.6000 ± 0.0200 | 0.5128 ± 0.0200 |
| LLM-ICL | 0.8000 ± 0.0200 | 0.7800 ± 0.0200 | 0.4500 ± 0.0200 | 0.3800 ± 0.0200 | 0.7000 ± 0.0200 | 0.6000 ± 0.0200 |
| *MARICL (Ours)* | | | | | | |
| Logistic + MARICL | 0.9750 ± 0.0200 | 0.9700 ± 0.0200 | 0.5210 ± 0.0200 | 0.4830 ± 0.0200 | 0.8130 ± 0.0250 | 0.7400 ± 0.0250 |
| XGBoost + MARICL | 0.9520 ± 0.0250 | 0.9490 ± 0.0250 | 0.5400 ± 0.0200 | 0.5100 ± 0.0200 | 0.8324 ± 0.0250 | 0.8000 ± 0.0250 |

### J.1 Decomposing Structural and Numerical Contributions

MARICL’s correction $f_{k}$ is a symbolic expression with LLM-generated coefficients. Two distinct contributions could explain its performance: the inferred *structure* (which features, which interactions, which nonlinearities) and the *numerical fit* (the coefficient values within that structure). We isolate each by refitting coefficients via ordinary least squares on the training set while holding the structure fixed, and conversely by randomizing coefficients within the inferred structure.

Table 17: Structure-vs-coefficient decomposition (Linear, $R^{2}$). “LSQ-refit” replaces LLM coefficients with least-squares values; “Random-coef” samples coefficients uniformly in $[-1,1]$ within MARICL’s structure (mean over 20 random draws).

| Configuration | Cell-Free | Diabetes | Enzyme |
| --- | --- | --- | --- |
| Linear Base-model | 0.412 | 0.454 | 0.237 |
| MARICL (full) | 0.648 | 0.590 | 0.481 |
| MARICL + LSQ-refit | 0.661 | 0.602 | 0.529 |
| MARICL + Random-coef | 0.498 | 0.512 | 0.341 |

The decomposition shows that the inferred structure is the dominant contribution: random coefficients within MARICL’s structure already recover 36–44% of the gain over the baseline. LSQ refitting adds a further 0.012–0.048 $R^{2}$, indicating that LLM-generated coefficients are sound but not numerically optimal within their own structure. Practitioners who require the last fraction of accuracy can post-hoc refit; those who prefer fully LLM-generated formulas (e.g., for faithful representation of the agent’s reasoning) lose only modest performance. We treat structural inference as MARICL’s primary contribution and numerical fitting as a complementary post-processing step.
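The two manipulations behind Table 17 can be sketched on synthetic data. The example below assumes a correction that is linear in its coefficients once the structural terms are fixed (nonlinear constants such as the saturation offset are held constant); the two terms, the synthetic residuals, and all function names are hypothetical, and the paper's actual refitting code may differ.

```python
import numpy as np

def design_matrix(X, terms):
    """Stack each structural term as one column; the structure itself stays fixed."""
    return np.column_stack([term(X) for term in terms])

def r2(y, y_hat):
    """Coefficient of determination."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical inferred structure: one saturation term and one interaction term.
terms = [
    lambda X: X[:, 0] / (0.3 + X[:, 0]),
    lambda X: X[:, 1] * X[:, 2],
]

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
residual = 0.8 * terms[0](X) - 0.4 * terms[1](X)  # synthetic base-model residuals

Phi = design_matrix(X, terms)

# LSQ-refit: keep the structure, replace the coefficients with least-squares values.
lsq_coef, *_ = np.linalg.lstsq(Phi, residual, rcond=None)
lsq_r2 = r2(residual, Phi @ lsq_coef)

# Random-coef: keep the structure, draw coefficients uniformly from [-1, 1],
# and average R^2 over 20 draws as in the table caption.
random_r2 = np.mean([
    r2(residual, Phi @ rng.uniform(-1.0, 1.0, size=Phi.shape[1]))
    for _ in range(20)
])

print("LSQ coefficients:", lsq_coef)
print("LSQ-refit R^2:", lsq_r2, "Random-coef mean R^2:", random_r2)
```

Because the synthetic residuals are noise-free, the refit recovers the generating coefficients; on real data the gap between the two configurations is what separates structural from numerical contributions.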

## Appendix K Statistical Significance

For each MARICL variant we compute paired Wilcoxon signed-rank tests against the strongest non-MARICL baseline on each dataset, using per-(seed, fold) paired metric values across 5 seeds × 5 cross-validation folds ($n=25$ paired observations per dataset; the 5-fold CV protocol matches Table [22](https://arxiv.org/html/2605.22897#A14.T22)). At $n=25$ the smallest possible exact two-sided $p$-value is $\approx 6\times 10^{-8}$, so the raw $p$-values reported below are well within the attainable range of the test. We apply Benjamini–Hochberg correction at FDR $0.05$ across the 9 dataset-level comparisons, computing $q_{(i)}=\min_{j\geq i}\tfrac{m}{j}\,p_{(j)}$ with $m=9$. Table [18](https://arxiv.org/html/2605.22897#A11.T18) reports raw and corrected $p$-values. Results are reported for the better-performing MARICL variant on each dataset (Linear+MARICL or XGBoost+MARICL).

Table 18: Paired Wilcoxon signed-rank tests, MARICL (best variant) vs. strongest non-MARICL baseline. $\Delta$ denotes the metric improvement (positive favors MARICL). $p_{\text{BH}}$ is the Benjamini–Hochberg-corrected $p$-value at FDR $0.05$. Bold $\Delta$ indicates corrected $p<0.05$.

| Dataset | Strongest baseline | Metric | $\Delta$ | $p$ (raw) | $p_{\text{BH}}$ |
| --- | --- | --- | --- | --- | --- |
| Cell-Free Protein | TabPFN | $R^{2}$ | +0.045 | 0.031 | 0.056 |
| Enzyme Activity | EBM | $R^{2}$ | **+0.097** | 0.008 | 0.045 |
| Diabetes | TabPFN | $R^{2}$ | −0.027 | 0.094 | 0.106 |
| Cal. Housing | XGBoost | $R^{2}$ | **+0.029** | 0.016 | 0.045 |
| Bike Sharing | XGBoost | $R^{2}$ | **+0.023** | 0.020 | 0.045 |
| Zoo | TabPFN | Accuracy | +0.023 | 0.039 | 0.059 |
| High School | TabPFN | Accuracy | −0.010 | 0.156 | 0.156 |
| Adult Income | TabPFN | Accuracy | −0.020 | 0.078 | 0.100 |
| Synthetic | PySR-on-resid. | $R^{2}$ | **+0.038** | 0.012 | 0.045 |

#### Reading guide.

Four of nine MARICL improvements survive Benjamini–Hochberg correction at FDR $0.05$ (Enzyme Activity, Synthetic, California Housing, Bike Sharing). Two further datasets are borderline at the $0.05$ threshold (Cell-Free Protein $q=0.056$; Zoo $q=0.059$); both would pass FDR control at any threshold $\geq 0.06$. The three remaining comparisons (Diabetes, High School, Adult Income) correspond to datasets where TabPFN is competitive with or stronger than MARICL on raw accuracy. On these datasets, MARICL’s contribution is the interpretable closed-form correction formula rather than a statistically significant improvement in predictive accuracy. We highlight this honestly rather than reporting only the headline metric: MARICL is most clearly advantageous on domains with correctable mechanistic structure (Enzyme Activity, Synthetic, the TabArena benchmarks, and, borderline, Cell-Free Protein and Zoo), and ties or slightly trails the strongest tabular foundation model on benchmarks where the predictive signal is dominated by features the base-model already exploits.
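The Benjamini–Hochberg step-up adjustment stated above, $q_{(i)}=\min_{j\geq i}\tfrac{m}{j}\,p_{(j)}$, can be checked against the raw $p$-values of Table 18. The sketch below is a plain NumPy implementation of that formula; the function name is hypothetical, and the Wilcoxon tests themselves are not recomputed because the per-fold metrics are not given here.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return BH-adjusted values q_(i) = min_{j >= i} (m / j) * p_(j), in input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of ascending raw p-values
    scaled = p[order] * m / np.arange(1, m + 1)    # (m / j) * p_(j) for rank j
    # Running minimum from the largest rank downwards enforces monotonicity.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

# Raw p-values as reported in Table 18.
raw_p = {
    "Cell-Free Protein": 0.031, "Enzyme Activity": 0.008, "Diabetes": 0.094,
    "Cal. Housing": 0.016, "Bike Sharing": 0.020, "Zoo": 0.039,
    "High School": 0.156, "Adult Income": 0.078, "Synthetic": 0.012,
}
q = benjamini_hochberg(list(raw_p.values()))
for name, q_value in zip(raw_p, q):
    print(f"{name}: q = {q_value:.4f}")

# Smallest attainable exact two-sided p-value with n = 25 pairs: 2 / 2**25.
print(2 / 2**25)
```

Up to rounding, the adjusted values should agree with the $p_{\text{BH}}$ column, including the four comparisons that share the value 0.045 because the running minimum is attained at rank 4.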

## Appendix L Joint Prior-Stripping Ablation

The single-channel ablations in Appendices [C](https://arxiv.org/html/2605.22897#A3) withhold one source of LLM prior knowledge, but cannot bound the joint contribution: feature names alone may cue domain priors that survive context removal, and a strong backbone may compensate for either. We therefore report the combined ablation, which represents the most prior-stripped configuration achievable without altering the algorithm: feature names replaced with opaque identifiers (`feat_0`, …, `feat_d`), no domain context in any prompt, and Llama-3-8B as the LLM backbone in place of frontier models.

Table 19: Joint prior-stripping ablation (Linear base-model, $K=2$, $T=10$). $G_{\text{joint}}/G_{\text{full}}$ is the fraction of MARICL’s gain that survives all three ablations simultaneously. PySR-on-residuals included as the strongest fully prior-free baseline.

| Dataset | Linear | PySR-res | Joint-strip | Full | $G_{\text{joint}}/G_{\text{full}}$ |
| --- | --- | --- | --- | --- | --- |
| Cell-Free | 0.412 | 0.471 | 0.527 | 0.648 | 0.49 |
| Diabetes | 0.454 | 0.512 | 0.521 | 0.590 | 0.49 |
| Enzyme | 0.237 | 0.318 | 0.358 | 0.481 | 0.50 |

Approximately half of MARICL’s end-to-end gain survives joint prior-stripping ($G_{\text{joint}}/G_{\text{full}}\approx 0.49$–$0.50$ across three datasets). The surviving half characterizes the residual-driven loop in isolation: a refinement procedure that operates on data alone and produces gains comparable to or exceeding PySR-on-residuals (Table [19](https://arxiv.org/html/2605.22897#A12.T19)), the strongest fully prior-free symbolic baseline. The remaining half characterizes what MARICL extracts when its priors align with the domain: a deployment benefit available wherever the LLM’s pretraining is informative for the task at hand.

We frame these as complementary rather than competing contributions\. The data\-driven half is what makes MARICL applicable in prior\-poor settings \(anonymized features, novel domains, on\-prem open\-source backbones\) where prior\-dependent methods would not function\. The prior\-aligned half is what makes MARICL more useful than a symbolic regressor on real scientific benchmarks where domain knowledge is freely available in pretraining corpora and not exploiting it would be wasteful\. The inference argument is established separately through the synthetic benchmark \(§[4\.2](https://arxiv.org/html/2605.22897#S4.SS2)\), where priors are absent by construction, and the cross\-plate transfer experiment \(§[4\.5](https://arxiv.org/html/2605.22897#S4.SS5)\), where mechanism is testable through cohort\-aligned generalization; the present ablation does not aim to establish discovery but to make explicit what each channel contributes\.

## Appendix M Validation Protocol and Generalization Safeguards

Because textual gradient optimization uses training performance to generate critiques and select the best iteration, we formally clarify the role of each data split and provide both structural and empirical safeguards against overfitting\.

#### Strict data partitioning\.

All datasets are split into training, validation, and test sets\. Residual computation and high\-residual selection use training data only \(Eq\.[4](https://arxiv.org/html/2605.22897#S2.E4)\)\. The held\-out test set is never accessed until final evaluation\. No test\-set information influences any design decision, hyperparameter choice, or correction refinement\.

#### Capacity bound of the hypothesis space\.

Corrections are constrained to 3–8 component symbolic formulas built from a restricted operation set (addition, multiplication, clipping, sigmoid, rational saturation $x/(K+x)$, Gaussian $\exp(-(x-\mu)^{2}/\sigma^{2})$; see Appendix [N.3](https://arxiv.org/html/2605.22897#A14.SS3)). With $d$ features and at most 8 components, the effective number of structural degrees of freedom is on the order of $\mathcal{O}(d\cdot 8)$, far below the validation set sizes (88–6,512 samples). This structural constraint makes memorization of examples infeasible, independent of how many refinement iterations are run.

#### Lossy critique channel\.

Unlike numerical optimization where gradients transmit per\-example information, textual critiques are lossy natural language summaries \(e\.g\., “overcorrects for high\-value samples”\)\. The LLM receives aggregate error pattern descriptions, not raw examples or numerical gradients\. This information bottleneck further limits the effective bandwidth available for overfitting\.

#### Analogy to standard model selection\.

Selecting the best refinement iteration by its measured performance is analogous to standard model-selection procedures such as neural architecture search, early stopping, AutoML, and pruning. The key distinction from gradient-based overfitting is that each refinement step produces a discrete, structurally constrained candidate rather than incrementally adjusting continuous parameters toward the validation surface.

### M.1 Extended Benchmarks: TabArena

To evaluate MARICL at larger scale and on datasets with less transparent feature semantics, we include two widely used TabArena regression benchmarks (Table [20](https://arxiv.org/html/2605.22897#A13.T20)).

Table 20: TabArena benchmark results. MARICL provides consistent improvements across both base models on larger-scale datasets with less explicit feature semantics.

| Task ($N$ / $d$) | Linear | Linear+MARICL | XGBoost | XGB+MARICL |
| --- | --- | --- | --- | --- |
| California (20,640 / 8) | 0.576 | 0.648 (+0.072) | 0.832 | 0.861 (+0.029) |
| Bike Sharing (17,389 / 12) | 0.391 | 0.493 (+0.102) | 0.894 | 0.917 (+0.023) |

On California Housing, MARICL inferred a location-density interaction (latitude × longitude × median_income saturation) capturing known geographic price gradients, interpretable to urban planners. On Bike Sharing, MARICL identified temperature-humidity interaction terms with hour-of-day modulation. These datasets are 100–200× larger than our smallest benchmarks and have less semantically transparent features than Zoo or Enzyme, confirming that MARICL scales beyond small, domain-rich settings.

### M.2 Stability of Learned Corrections

A critical question is whether MARICL’s corrections are stable across independent runs or are artifacts of LLM stochasticity. We assess stability across three axes using 5 seeds.

Table 21: Correction stability across 5 random seeds. High School shows least stability, consistent with weak domain structure.

| Dataset | Correction Form | Recovered ($N$/5) | Coeff. CV (%) |
| --- | --- | --- | --- |
| Diabetes | s5 × BMI + s5 × s1 interaction | 4/5 | 11.7 |
| Zoo | hair × milk × (1 − eggs) | 5/5 | 4.2 |
| Enzyme | Michaelis-Menten + aromatic Gaussian | 4/5 | 13.8 |
| Adult Income | capital.gain saturation + edu. interaction | 5/5 | 8.1 |
| High School | Sports → Sports direct mapping | 3/5 | 22.4 |

**Random seeds** (Table [21](https://arxiv.org/html/2605.22897#A13.T21)): Dominant correction forms are recovered in 4–5 of 5 independent runs on all datasets except High School (3/5), which has the weakest mechanistic structure. Coefficient variation is below 14% for well-structured domains.

**Data splits:** The top-ranked correction’s structural form remained consistent in 4+ of 5 folds on Zoo, Adult Income, and Diabetes.

**Prompt paraphrasing:** Three semantically equivalent prompt variants (original, shorter, restructured) produce consistent correction categories with performance variance < 0.02 $R^{2}$. Low cross-seed consistency on High School (CV = 22.4%) is itself a diagnostic signal for insufficient domain structure, not a failure of the method.
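The “Coeff. CV (%)” column of Table 21 is a coefficient of variation across seeds. The text does not specify the exact estimator, so the sketch below assumes the sample standard deviation (`ddof=1`) divided by the absolute mean for a single coefficient; the per-seed values are hypothetical.

```python
import numpy as np

def coefficient_cv_percent(coefficients):
    """Coefficient of variation (sample std / |mean|) in percent across seeds."""
    values = np.asarray(coefficients, dtype=float)
    return 100.0 * np.std(values, ddof=1) / abs(np.mean(values))

# Hypothetical values of one correction coefficient recovered under five seeds.
seed_values = [0.55, 0.60, 0.62, 0.58, 0.65]
print(f"CV = {coefficient_cv_percent(seed_values):.1f}%")
```

With several coefficients per formula, one plausible aggregate is the mean of the per-coefficient values, though that choice is likewise an assumption.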

### M.3 Failure Modes

We consolidate failure modes to characterize when MARICL provides limited benefit:

**Base model already captures nonlinear structure.** Adding LLM-generated corrections to a stronger base yields strictly smaller within-model gains: $+0.236\,\Delta R^{2}$ for Linear vs. $+0.144\,\Delta R^{2}$ for XGBoost on Cell-Free Protein, and $+0.107$ vs. $+0.033\,\Delta R^{2}$ when going from $K=1$ to $K=2$ corrections under the headline ablation configuration (Table [5](https://arxiv.org/html/2605.22897#S4.T5)). When the base already captures the dominant nonlinearity, the residual signal available for correction shrinks and the relative cost of any LLM-induced noise grows; the limiting case is a strong base with no exploitable residual.

**Unobserved confounders.** High School gains only +0.023 Acc; demographic features are weak proxies for social dynamics governed by peer influence and family background. MARICL correctly identifies this limitation via the $p_{\min}$ threshold rather than hallucinating spurious corrections.

**High dimensionality with noise features.** On synthetic data with 100 features (10 relevant, 90 noise), performance drops from $R^{2}$ = 0.71 to 0.68, demonstrating that irrelevant features introduce noise into hypothesis generation.

**LLM stochasticity on weakly structured domains.** Correction recovery varies directly with domain quality: Zoo 5/5 runs, Adult Income 5/5, but High School only 3/5 (Table [21](https://arxiv.org/html/2605.22897#A13.T21)). Low cross-seed consistency is itself a diagnostic signal for insufficient domain structure.

## Appendix N Extended Technical Details

This appendix provides additional details on the MARICL framework, including implementation specifics, experimental protocols, and extended analyses that complement the main text\.

### N.1 Training Protocol and Generalization

The iterative refinement process in MARICL involves generating textual critiques and selecting corrections based on performance. A key design principle is that textual gradients operate fundamentally differently from numerical gradients: they are natural language descriptions of error patterns that must generalize through the LLM’s reasoning capabilities, rather than continuous parameters susceptible to overfitting through gradient descent. Appendix [M](https://arxiv.org/html/2605.22897#A13) provides a formal treatment of the validation protocol and generalization safeguards; here we provide additional empirical detail.

#### Strict Data Partitioning\.

We enforce complete separation between data splits: the training set provides residuals for correction learning and guides critique generation and correction selection, and the test set remains untouched until final evaluation\. No information from the test set influences any design decision, hyperparameter choice, or correction refinement\.

#### Limited Refinement Cycles\.

We cap refinement at $T=10$ iterations across all datasets, selecting the best-performing iteration via early stopping. This fixed budget prevents indefinite optimization against the training set. In practice, corrections achieve optimal performance at iteration $t\in[2,5]$, with later iterations showing diminishing returns or slight degradation.

#### Correction Complexity Constraints\.

Generated formulas are constrained to interpretable forms including products, ratios, and saturation terms with typically 3–8 components\. This implicit regularization limits the hypothesis space and prevents memorization of data examples\. The decoder prompt explicitly requests “simple, interpretable formulas using domain\-meaningful transformations\.”

To verify these design choices yield robust generalization, we conducted 5-fold cross-validation on three representative datasets. Table [22](https://arxiv.org/html/2605.22897#A14.T22) demonstrates that test performance closely tracks validation performance, with mean absolute differences below 0.02 across all metrics.

Table 22: Cross-validation results demonstrating validation-test consistency (mean ± std across 5 folds).

| Dataset | Val $R^{2}$ / Acc | Test $R^{2}$ / Acc | $\lvert\Delta\rvert$ |
| --- | --- | --- | --- |
| Cell-Free Protein | 0.718 ± 0.03 | 0.705 ± 0.04 | 0.013 |
| Diabetes | 0.584 ± 0.02 | 0.571 ± 0.03 | 0.013 |
| Adult Income | 0.836 ± 0.02 | 0.824 ± 0.02 | 0.012 |

### N.2 Hyperparameter Configuration and Sensitivity

Table [23](https://arxiv.org/html/2605.22897#A14.T23) summarizes the complete hyperparameter configuration used throughout our experiments. Parameters fall into two categories: those selected via grid search on validation data ($K$, $\kappa$, $\beta$ for *classification* only) and those fixed based on preliminary analysis ($p_{\min}$, $\gamma$, $B$, regression $\tau$).

Table 23: Hyperparameter settings with selection methodology and sensitivity characteristics.

| Parameter | Value | Selection | Sensitivity |
| --- | --- | --- | --- |
| $K$ (corrections) | 2 | Grid $\{1,2,3,4\}$ | Low ($\pm 5\%$ across range) |
| $\kappa$ (residual fraction) | 0.3 | Grid $\{0.2,0.3,0.4,0.5\}$ | Medium ($\pm 8\%$ across range) |
| $T$ (iterations) | 10 | Early stopping | Low (optimal at $t\in[2,7]$) |
| $B$ (batch size) | 10 | Context window | Low |
| $p_{\min}$ (threshold) | 0.1 | Fixed | Low (filters $<5\%$ corrections) |
| $\beta$ (class. blend) | 0.3–0.7 | Val. grid $\{0.3,0.5,0.7\}$ | Medium (dataset dependent) |
| $\tau$ (reg. $p_{k}$ scale) | $0.2\times$ target range | Fixed ($\tau=0.2$, $y\in[0,1]$) | Low |
| $\tau_{k}$ (softmax temp.) | 0.5–3.0 | Val. grid; min. ECE | Medium |
| $\gamma$ (confidence scale) | 2.0 | Fixed | Low |
| $\tau_{\text{fail}}$ (failure threshold) | 0.5 | Fixed | Low |

#### Train vs. validation roles.

The global scores $p_{k}$ in Eq. [17](https://arxiv.org/html/2605.22897#S2.E17) use $\text{MAE}_{k}$ or macro-F1 evaluated on the *training* split after $m_{k}^{*}$ is selected (same split as Eq. [9](https://arxiv.org/html/2605.22897#S2.E9); Algorithm [1](https://arxiv.org/html/2605.22897#alg1)); these scores are then frozen for inference. Hyperparameters that appear in probability outputs or the classification ensemble ($\beta$ in Eq. [2](https://arxiv.org/html/2605.22897#S2.E2), $\tau_{k}$ in Eq. [14](https://arxiv.org/html/2605.22897#S2.E14), and the headline choices $K$, $\kappa$) are selected by *validation*-set search (ECE for $\tau_{k}$, validation loss for $\beta$). The test set is held out from all of these decisions.

#### Classification $\beta$ vs. regression transfer $\beta_{\text{transfer}}$.

Eq. [2](https://arxiv.org/html/2605.22897#S2.E2) requires $\beta\in[0,1]$; our search uses $\{0.3,0.5,0.7\}$ (*not* $\beta=3$). A value of $3$ can arise only as a softmax temperature candidate $\tau_{k}$ in the calibration grid, never as a mixing weight. Cross-plate *regression* transfer instead uses a fixed $\beta_{\text{transfer}}=\tfrac{1}{2}$ (Section [2](https://arxiv.org/html/2605.22897#S2)), independent of the classification grid; Table [10](https://arxiv.org/html/2605.22897#A5.T10) reports sensitivity to dropping this blend.

#### Joint Sensitivity Analysis\.

Figure [3](https://arxiv.org/html/2605.22897#A14.F3) presents joint sensitivity results for $K$ and $\kappa$ on the Cell-Free Protein dataset. Performance is robust across a wide range: for $K\in[1,3]$ and $\kappa\in[0.2,0.4]$, $R^{2}$ stays between 0.698 and 0.723. The combination $K=2$, $\kappa=0.3$ offers a good balance between correction diversity and focus on systematic errors.

| | $\kappa=0.2$ | $\kappa=0.3$ | $\kappa=0.4$ | $\kappa=0.5$ |
| --- | --- | --- | --- | --- |
| $K=1$ | 0.698 | 0.705 | 0.701 | 0.688 |
| $K=2$ | 0.712 | 0.723 | 0.718 | 0.704 |
| $K=3$ | 0.708 | 0.716 | 0.711 | 0.695 |
| $K=4$ | 0.695 | 0.702 | 0.698 | 0.682 |

Figure 3: Joint sensitivity of $K$ (corrections) and $\kappa$ (residual fraction) on Cell-Free Protein ($R^{2}$). Performance is stable across $K\in[1,3]$ and $\kappa\in[0.2,0.4]$.
#### Practitioner Guidance\.

Based on our experiments, we recommend: (1) Start with $K=2$ corrections, increasing only if performance plateaus; (2) Set $\kappa=0.3$ as default, adjusting downward for noisy data or upward for systematic model failures; (3) Use early stopping with $T=10$ maximum iterations; (4) For classification, tune $\beta$ on validation data, starting with $\beta=0.5$.

### N.3 Formula Generation and Numerical Stability

Generated Python formulas undergo multi\-stage validation to ensure correctness and numerical stability\.

#### Sandboxed Execution\.

Formulas execute in a restricted environment with explicitly allowed and blocked operations:

- Allowed: Arithmetic ($+, -, *, /$), `np.clip`, `np.exp`, `np.log1p`, `np.maximum`, `np.minimum`, `np.abs`, sigmoid, feature access
- Blocked: File I/O, imports, system calls, loops, recursion, `eval`, `exec`
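
One way to enforce such an allowlist is a static check on the expression's syntax tree before anything is executed. The appendix does not describe how the sandbox is implemented, so the sketch below is an assumption: it uses Python's `ast` module, and the function name `check_formula` and the bare `sigmoid` helper name are hypothetical. It is a static filter, not a full security boundary.

```python
import ast

ALLOWED_NP_FUNCS = {"clip", "exp", "log1p", "maximum", "minimum", "abs"}
ALLOWED_BARE_FUNCS = {"sigmoid"}
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Call, ast.Name, ast.Attribute,
    ast.Constant, ast.Load, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub, ast.UAdd,
)

def check_formula(expr, feature_names):
    """Return (ok, reason) for a single-expression formula string."""
    try:
        # mode="eval" accepts one expression only, so statements such as
        # imports and loops fail to parse.
        tree = ast.parse(expr, mode="eval")
    except SyntaxError as err:
        return False, f"syntax error: {err.msg}"
    approved = set()  # ids of nodes that form an allowlisted call target
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in ALLOWED_BARE_FUNCS:
                approved.add(id(func))
            elif (isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name)
                  and func.value.id == "np" and func.attr in ALLOWED_NP_FUNCS):
                approved.update({id(func), id(func.value)})
            else:
                return False, "call to a function outside the allowlist"
    for node in ast.walk(tree):
        if id(node) in approved:
            continue
        if not isinstance(node, ALLOWED_NODES):
            return False, f"disallowed syntax: {type(node).__name__}"
        if isinstance(node, ast.Attribute):
            return False, "attribute access is only allowed for approved np functions"
        if isinstance(node, ast.Name) and node.id not in feature_names:
            return False, f"unknown name: {node.id}"
        if isinstance(node, ast.Constant) and (
            isinstance(node.value, bool) or not isinstance(node.value, (int, float))
        ):
            return False, "only numeric constants are allowed"
    return True, "ok"
```

An allowlist rejects anything it does not recognise, so comprehensions, exponentiation via `**`, and string constants are refused without each needing a dedicated rule.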

#### Numerical Stability Constraints\.

The decoder prompt includes explicit guidance:

> Ensure numerical stability: (1) use `clip(x, min, max)` for bounded outputs; (2) add small constants to denominators (e.g., `x / (0.1 + y)`); (3) prefer `log1p(x)` over `log(x)` for values near zero; (4) bound exponential arguments to prevent overflow.
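
To make the four rules concrete, here is a hypothetical scalar correction term that applies each of them. The functional form, coefficients, and feature names `x` and `y` are invented for illustration, and `abs(y)` in the denominator is an added assumption so that the guard also holds for negative `y`.

```python
import math

def clip(value, lo, hi):
    """Scalar analogue of np.clip."""
    return max(lo, min(hi, value))

def stable_correction(x, y):
    """Hypothetical correction term applying the four stability rules."""
    ratio = x / (0.1 + abs(y))               # rule 2: guarded denominator
    damp = math.log1p(abs(x))                # rule 3: log1p instead of log near zero
    gate = math.exp(clip(-y, -20.0, 20.0))   # rule 4: bounded exponent
    return clip(0.05 * ratio * damp / (1.0 + gate), -1.0, 1.0)  # rule 1: bounded output
```

Without the clip inside `math.exp`, an input such as `y = -1000` would raise `OverflowError` in pure Python.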

#### Post\-Execution Validation\.

After execution, outputs are checked for:

- NaN or Inf values $\rightarrow$ formula rejected, re-generated with error message
- Values outside domain bounds $\rightarrow$ clipped with warning logged
- Type mismatches $\rightarrow$ formula rejected
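
The three checks can be sketched as a single validation pass. The exception name `FormulaRejected`, the function `validate_outputs`, and the logger name are hypothetical; the re-generation loop that would catch the exception is not shown.

```python
import logging
import math

logger = logging.getLogger("formula_validation")

class FormulaRejected(Exception):
    """Raised when a formula's outputs fail validation and it must be regenerated."""

def validate_outputs(outputs, lower, upper):
    """Check types and finiteness (reject), then clip to the domain bounds (warn)."""
    checked = []
    for value in outputs:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise FormulaRejected(f"type mismatch: {type(value).__name__}")
        if not math.isfinite(value):
            raise FormulaRejected("NaN or Inf in outputs")
        checked.append(float(value))
    n_out = sum(1 for v in checked if v < lower or v > upper)
    if n_out:
        logger.warning("clipping %d value(s) outside [%s, %s]", n_out, lower, upper)
    return [min(max(v, lower), upper) for v in checked]
```

The message carried by `FormulaRejected` is the kind of text that could be passed back to the decoder as the error message for re-generation.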

#### Validation Statistics\.

Across all experiments: 94\.2% of generated formulas passed validation on first attempt, 5\.1% required one re\-generation for syntax errors, 0\.7% required re\-generation for numerical issues, and 0% of final corrections produced invalid outputs on test data\.

### N.4 Prompt Templates

For reproducibility, we reproduce the schema or text of every prompt used in MARICL. All prompts are templated; placeholders in {curly braces} are filled at runtime with the corresponding artifacts (high-residual examples, feature descriptions, prior critiques, etc.). No prompt is used at inference time: compiled formulas are executed directly in the sandbox of Appendix [N.3](https://arxiv.org/html/2605.22897#A14.SS3).

#### Encoder prompts $\mathcal{P}_{\text{encoder}}^{(k)}$.

The encoder analyses the augmented context $\mathcal{C}_{\text{aug}}$ and emits a structured hypothesis $z_k$. We use $K$ templates that differ only in which residual aspect they emphasise (error patterns vs. direct sample patterns).

> \[ENCODER PROMPT \#1 --- ERROR PATTERNS\]
>
> You are analysing systematic prediction errors of a base model.
>
> Feature descriptions: {features}
>
> Domain context: {domain_context}
>
> High-residual examples (input, target, base-prediction, residual): {high_residual_table}
>
> Identify the feature combinations and nonlinearities that the base model appears to miss. Output a structured hypothesis $z_k$ with fields:
>
> - hypothesised_pattern: ...
> - implicated_features: \[...\]
> - functional_form_guess: ...
> - rationale: ...

> \[ENCODER PROMPT \#2 --- SAMPLE PATTERNS\] (implementation: `MARICLPipeline._encode_latent_z_batched` in `maicl_lib_v2.py`, template `encoder_with_data` in `maicl_config.yaml`)
>
> 2\. SAMPLE PATTERNS (direct pattern learning perspective):
>
> - What direct relationships do you see between features and target values in these samples?
> - How do feature values relate to target values? (e.g., "When age is high and BMI is high, target is typically high")
> - What prediction rules would work based on the sample patterns themselves?
> - What feature combinations consistently lead to high/low target values?
> - Learn the TRUE UNDERLYING PATTERN from the samples, not just how to fix errors.
>
> IMPORTANT: Focus on learning to PREDICT based on sample patterns, not just learning to fix specific errors. Your mechanism should capture the true underlying relationships visible in the samples.
>
> \[... full `encoder_with_data` template output requirements from `maicl_config.yaml` ...\]
>
> Latent mechanism z:

#### Decoder prompt $\mathcal{P}_{\text{decoder}}$ (Eq. [8](https://arxiv.org/html/2605.22897#S2.E8)).

The decoder converts $z_k$ into the dual representation $(T_k, f_k)$.

> \[DECODER PROMPT\]
>
> Given the structured hypothesis below, produce (1) a natural-language explanation $T_k$ in 2--3 sentences, and (2) an executable Python expression $f_k$ using only the operators in {allowed_operators}. Ensure numerical stability: clip bounded outputs, add small constants to denominators, prefer log1p over log near zero, and bound exponential arguments. Do not use loops, imports, or file I/O.
>
> Hypothesis: {z_k}
>
> Return JSON: {"T_k": "...", "f_k": "..."}.
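
Because the decoder is asked to return JSON with the keys `T_k` and `f_k`, its reply can be parsed and checked before the formula reaches the sandbox. The sketch below is one possible parser; the function name `parse_decoder_reply` and its error handling are assumptions, not part of the released pipeline.

```python
import json

def parse_decoder_reply(reply):
    """Parse the decoder's JSON reply into (T_k, f_k), rejecting malformed output."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"decoder reply is not valid JSON: {err.msg}") from err
    if not isinstance(payload, dict):
        raise ValueError("decoder reply must be a JSON object")
    missing = {"T_k", "f_k"} - payload.keys()
    if missing:
        raise ValueError(f"decoder reply is missing keys: {sorted(missing)}")
    explanation, formula = payload["T_k"], payload["f_k"]
    if not (isinstance(explanation, str) and isinstance(formula, str)):
        raise ValueError("T_k and f_k must both be strings")
    return explanation.strip(), formula.strip()
```

A `ValueError` here could be treated like a syntax failure and trigger re-generation.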

#### Critique prompt $\mathcal{P}_{\text{critique}}$ (Eq. [11](https://arxiv.org/html/2605.22897#S2.E11)).

At iteration $t$, the critique prompt generates a textual gradient $g_k^{(t)}$ from the current state $S_k^{(t)}$ and the failure set $\mathcal{E}_k^{(t)}$.

> \[CRITIQUE PROMPT\]
>
> Current hypothesis: {z_k}
>
> Current correction: {T_k, f_k}
>
> Training loss: {L_k}
>
> Worst remaining failures (input, target, current prediction, error): {failure_set_table}
>
> Diagnose why the correction fails on these examples. Distinguish structural mismatch (wrong functional form) from coefficient mismatch (right form, wrong scale). Suggest a concrete refinement to $f_k$ that addresses the dominant failure mode without harming low-error examples.

#### Refinement prompt \(Eq\.[13](https://arxiv.org/html/2605.22897#S2.E13)\)\.

The refined correction at iteration $t+1$ is sampled conditional on the full textual state $S_k^{(t)}$ (which includes the critique $g_k^{(t)}$).

> \[REFINEMENT PROMPT\] (implementation: `MARICLPipeline.decoder` in `maicl_lib_v2.py`, template `decoder_default` in `maicl_config.yaml`) At refinement time (iteration $t+1$), the {latent_z} placeholder is replaced by the accumulated state $\mathcal{S}_k^{(t)}$ rather than $z_k^{(0)}$ alone.
>
> You are a mechanism decoder that converts a latent mechanism representation into an interpretable, executable mechanism.
>
> GOAL: Produce mechanisms that are BOTH high-performing and highly interpretable, without unnecessary verbosity.
>
> HARD REQUIREMENTS (do not skip):
>
> 1. The mechanism must be descriptive and textual (not just math).
> 2. The mechanism must explicitly include nonlinearities AND interactions.
> 3. The mechanism must introduce intermediate combinatory concepts (named, explained) that capture nonlinear interactions.
> 4. The final formula must be executable and appear as a SINGLE LINE starting with "Formula:" so it can be extracted programmatically.
>
> WHAT TO INCLUDE:
>
> - Named intermediate concepts (1--3): define them briefly, but do NOT rely on them in the final Formula line. Instead, inline/expand them in the final Formula expression.
> - Nonlinear transforms: at least one (saturation/log/soft-threshold/inverse-U).
> - Interaction terms: at least one (prefer nonlinear interaction form `x*y/(K+x*y)`).
> - One short "why this helps" sentence referencing typical residual patterns (extremes / interactions / thresholds).
>
> STYLE:
>
> - Prefer causal language ("limits", "enhances", "inhibits", "gates", "synergizes") over purely statistical phrasing.
> - Keep constants realistic and stable (avoid huge coefficients).
> - Avoid full code blocks or function definitions; use compact math expressions.
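
The prompt requires a single line starting with "Formula:" so that the expression can be extracted programmatically. A minimal extraction sketch follows; the regular expression and the function name `extract_formula` are assumptions, and the strict "exactly one line" rule is a design choice made here rather than something the appendix specifies.

```python
import re

# One line that starts with "Formula:" (leading spaces or tabs tolerated).
FORMULA_LINE = re.compile(r"^[ \t]*Formula:[ \t]*(.+?)[ \t]*$", re.MULTILINE)

def extract_formula(reply):
    """Return the expression on the reply's single 'Formula:' line."""
    matches = FORMULA_LINE.findall(reply)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one 'Formula:' line, found {len(matches)}")
    return matches[0]
```

Rejecting replies with zero or several "Formula:" lines avoids silently picking the wrong expression when the model restates an earlier draft.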

#### Note on prediction\-execution prompts\.

Earlier exploratory versions of MARICL used an LLM-based execution path (prompting the model to evaluate $f_k$ on a query and return {y_hat, confidence}). This path is deprecated in the final framework: all reported numbers use direct sandboxed Python evaluation (Appendix [N.3](https://arxiv.org/html/2605.22897#A14.SS3)), so inference is zero-LLM-cost as stated in Section [2.4](https://arxiv.org/html/2605.22897#S2.SS4) and Appendix [F](https://arxiv.org/html/2605.22897#A6). We retain no execution prompt in the released pipeline.

### N.5 Data Preprocessing and Reproducibility

#### Feature Preprocessing\.

All numerical features are standardized to zero mean and unit variance using training set statistics, with identical transformations applied to validation and test sets\. Categorical features are one\-hot encoded\.
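
The key point is that the statistics are fitted on the training split only and then reused unchanged. A minimal per-column sketch is shown below; the helper names are hypothetical, and the use of the population standard deviation (as in scikit-learn's `StandardScaler`) and the fallback of 1.0 for constant columns are assumptions.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_column):
    """Compute mean and standard deviation from the training split only."""
    mean = fmean(train_column)
    std = pstdev(train_column)
    # Guard against constant columns so the transform never divides by zero.
    return mean, (std if std > 0 else 1.0)

def standardize(column, mean, std):
    """Apply the training-set statistics to any split (train, validation, or test)."""
    return [(v - mean) / std for v in column]
```

Calling `standardize` on the validation and test columns with the training `mean` and `std` keeps information from those splits out of the preprocessing step.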

#### Target Scaling\.

For regression tasks: Cell-Free Protein and Enzyme Activity targets are scaled to $[0,1]$ via min-max normalization (reported MAE reflects scaled targets); Diabetes targets retain their original scale as a disease progression measure ($R^2$ is scale-invariant).

#### Dataset Sources and Licenses\.

Table[24](https://arxiv.org/html/2605.22897#A14.T24)lists the provenance and license of every dataset used in this paper\. All datasets are publicly redistributable under their stated terms; no proprietary or restricted data is used\.

Table 24: Dataset sources, citations, and licenses.

| Dataset | Source | Citation | License |
|---|---|---|---|
| Cell-Free Protein | Nature Comm. supp. | Borkowski et al. \[[2020](https://arxiv.org/html/2605.22897#bib.bib34)\] | CC BY 4.0 |
| Enzyme Activity | UCI ML Repository | Dua, D. and Graff, C. \[[2019](https://arxiv.org/html/2605.22897#bib.bib75)\] | CC BY 4.0 |
| Diabetes | scikit-learn / Efron et al. | Efron et al. \[[2004](https://arxiv.org/html/2605.22897#bib.bib76)\] | Public |
| Zoo | UCI ML Repository | Forsyth \[[1990](https://arxiv.org/html/2605.22897#bib.bib77)\] | CC BY 4.0 |
| High School | UCI ML Repository | UCI Machine Learning Repository \[[2014](https://arxiv.org/html/2605.22897#bib.bib78)\] | CC BY 4.0 |
| Adult Income | UCI ML Repository | Kohavi \[[1996](https://arxiv.org/html/2605.22897#bib.bib79)\] | CC BY 4.0 |
| California Housing | TabArena / sklearn | Pace and Barry \[[1997](https://arxiv.org/html/2605.22897#bib.bib80)\] | CC0 |
| Bike Sharing | TabArena / UCI | Fanaee-T and Gama \[[2014](https://arxiv.org/html/2605.22897#bib.bib81)\] | CC BY 4.0 |
| Synthetic | – | – | MIT |

#### Code and Library Licenses\.

All baselines are implemented using publicly released libraries under permissive licenses: scikit\-learn \(BSD\-3\), XGBoost \(Apache 2\.0\), InterpretML/EBM \(MIT\), TabPFN \(Apache 2\.0 / model weights under prior\-data\-release terms; see PriorLabs repository\), PySR \(Apache 2\.0\), and SymbolicRegression\.jl \(Apache 2\.0\)\. LLM backbones are accessed via the official APIs of their providers \(Google Gemini, OpenAI\) under their respective terms of service; the open\-source Llama\-3\-8B is used under the Meta Llama 3 Community License\. Our own MARICL code will be released under the MIT license upon publication\.

#### Baseline Implementations\.

All baselines use scikit\-learn or official packages\.

- Linear/Logistic Regression: `sklearn` with default L2 regularization.
- XGBoost (tuned): v1.7, 5-fold CV grid search on the training split, selecting by validation $R^2$ (regression) or macro F1 (classification). Grid: `n_estimators` $\in \{100, 200, 400, 600\}$, `max_depth` $\in \{3, 4, 6, 8\}$, `learning_rate` $\in \{0.03, 0.05, 0.1\}$, `reg_lambda` $\in \{0, 1, 2, 5\}$, `reg_alpha` $\in \{0, 0.1, 0.5\}$. For Adult Income we additionally sweep `scale_pos_weight` $\in \{1, 2, 3, 4\}$ to address class imbalance. Selected configurations and resulting metrics are reported in Table [25](https://arxiv.org/html/2605.22897#A14.T25).
- EBM: `interpret` v0.4 with default settings.
- Symbolic Regression: PySR v0.11, `niterations=140`, operators $\{+, -, *, /, \exp, \log, \sqrt{\cdot}, \sigma\}$, where $\sigma(x) = 1/(1+\exp(-x))$ is provided as a custom unary operator via PySR's `unary_operators` interface so that the synthetic benchmark's planted nonlinearity is inside the search space; the same $\sigma$ primitive is available to MARICL through its sandbox (Appendix [N.3](https://arxiv.org/html/2605.22897#A14.SS3)).
- PySR on residuals: same PySR configuration applied to $r_i = y_i - f_{\text{ML}}(\mathbf{x}_i)$.
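
To give a sense of the size of the XGBoost search described in the list above, the sketch below enumerates the stated grid with `itertools.product`. The hyperparameter values come from the text; the helper `enumerate_grid` is hypothetical, and no model is trained here.

```python
from itertools import product

# Hyperparameter grid as listed for the tuned XGBoost baseline.
grid = {
    "n_estimators": [100, 200, 400, 600],
    "max_depth": [3, 4, 6, 8],
    "learning_rate": [0.03, 0.05, 0.1],
    "reg_lambda": [0, 1, 2, 5],
    "reg_alpha": [0, 0.1, 0.5],
}

def enumerate_grid(grid):
    """Expand a dict of candidate lists into a list of configuration dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

configs = enumerate_grid(grid)
# Adult Income additionally sweeps scale_pos_weight.
adult_configs = enumerate_grid({**grid, "scale_pos_weight": [1, 2, 3, 4]})
```

The base grid has 576 configurations (2,304 for Adult Income). If the search is exhaustive, 5-fold CV therefore implies 2,880 model fits per dataset and seed for the base grid.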

Table 25: Tuned XGBoost vs. default XGBoost vs. XGBoost+MARICL across all benchmarks (mean $\pm$ std, 5 seeds; selection by 5-fold CV on the train split, reported on test). The headline MARICL+XGBoost vs. XGBoost gap narrows under tuning but persists across all nine benchmarks; on Cell-Free Protein and Adult Income the gap narrows from $+0.144\,R^2$ and $+0.108$ macro F1 to $+0.106$ and $+0.069$ respectively.

Selected configurations (best CV winner per dataset): Cell-Free n=400, d=4, lr=0.05, $\lambda$=1, $\alpha$=0.1; Enzyme n=200, d=4, lr=0.05, $\lambda$=2, $\alpha$=0; Diabetes n=100, d=3, lr=0.05, $\lambda$=2, $\alpha$=0.1; California n=600, d=8, lr=0.05, $\lambda$=1, $\alpha$=0; Bike n=600, d=8, lr=0.03, $\lambda$=1, $\alpha$=0; Zoo n=200, d=4, lr=0.1, $\lambda$=1, $\alpha$=0; High School n=200, d=3, lr=0.1, $\lambda$=2, $\alpha$=0.1; Adult n=400, d=6, lr=0.05, $\lambda$=2, $\alpha$=0.1, scale_pos_weight=3.

| Dataset | Metric | XGB (default) | XGB (tuned) | XGB+MARICL | $\Delta$ vs. tuned |
|---|---|---|---|---|---|
| Cell-Free Protein | $R^2$ | $0.579 \pm 0.025$ | $0.617 \pm 0.022$ | $0.723 \pm 0.030$ | $+0.106$ |
| Enzyme Activity | $R^2$ | $0.399 \pm 0.020$ | $0.428 \pm 0.019$ | $0.513 \pm 0.010$ | $+0.085$ |
| Diabetes | $R^2$ | $0.308 \pm 0.015$ | $0.354 \pm 0.024$ | $0.543 \pm 0.025$ | $+0.189$ |
| California Housing | $R^2$ | $0.832$ | $0.844 \pm 0.006$ | $0.861$ | $+0.017$ |
| Bike Sharing | $R^2$ | $0.894$ | $0.903 \pm 0.005$ | $0.917$ | $+0.014$ |
| Zoo | Acc | $0.905 \pm 0.030$ | $0.914 \pm 0.025$ | $0.952 \pm 0.025$ | $+0.038$ |
| High School | Acc | $0.517 \pm 0.020$ | $0.527 \pm 0.020$ | $0.540 \pm 0.020$ | $+0.013$ |
| Adult Income | Acc | $0.813 \pm 0.025$ | $0.823 \pm 0.006$ | $0.832 \pm 0.025$ | $+0.009$ |
| Adult Income | macro F1 | $0.692 \pm 0.025$ | $0.731 \pm 0.014$ | $0.800 \pm 0.025$ | $+0.069$ |
