Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

arXiv cs.AI 05/11/26, 04:00 AM Papers
Summary
This paper introduces DoLQ, a multi-agent framework that uses Large Language Models to perform both qualitative and quantitative evaluations for discovering ordinary differential equations from observational data.
arXiv:2605.07323v1 Announce Type: new Abstract: Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real-world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM-based qualitative and quantitative evaluation. DoLQ employs a multi-agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi-dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://github.com/Bon99yun/DoLQ.
Original Article
View Cached Full Text
Cached at: 05/11/26, 07:17 AM
# Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation
Source: [https://arxiv.org/html/2605.07323](https://arxiv.org/html/2605.07323)
###### Abstract

Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning\. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real\-world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility\. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM\-based qualitative and quantitative evaluation\. DoLQ employs a multi\-agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search\. Experiments on multi\-dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations\. Our code is available at[https://github\.com/Bon99yun/DoLQ](https://github.com/Bon99yun/DoLQ)\.

Symbolic Regression, Ordinary Differential Equations, Large Language Models, Scientific Discovery

## 1Introduction

To understand complex dynamical systems, scientists have long pursued concise mathematical laws\(Wigner and others,[1990](https://arxiv.org/html/2605.07323#bib.bib1)\)\. Differential equations are essential tools for describing the dynamic behavior of systems across various fields including physics, chemistry, and biology, and when the governing equations are known, it becomes possible to predict and control the future states of the system\. However, in many real\-world systems, the form of governing equations is not clearly known, and one must infer them from observational data\. Consequently, there is a growing need for methodologies capable of deriving interpretable, explicit mathematical formulas directly from data\(Rudin,[2019](https://arxiv.org/html/2605.07323#bib.bib2)\)\.

![Refer to caption](https://arxiv.org/html/2605.07323v1/x1.png)Figure 1:Similar MSE does not imply correct discovery: even with a low MSE, the identified equation may differ from the true one, suggesting that quantitative metrics alone are insufficient and that qualitative evaluation is therefore necessary\.![Refer to caption](https://arxiv.org/html/2605.07323v1/x2.png)Figure 2:Overview of the DoLQ framework for LLM\-based ODE discovery\. The framework operates through an iterative loop among three components: \(1\) the Sampler Agent proposes candidate terms with physical justifications based on the system description and Scientist Agent’s feedback; \(2\) the Parameter Optimizer makes functions and fits their parameters; and \(3\) the Scientist Agent evaluates each term through qualitative semantic assessment and quantitative iterative comparison, determining actions that guide subsequent iterations\.Sparse Identification of Nonlinear Dynamics \(SINDy\)\(Bruntonet al\.,[2016](https://arxiv.org/html/2605.07323#bib.bib5)\)addresses this equation discovery problem by constructing a predefined library of basis functions and identifying system dynamics via sparse regression\. While efficient, SINDy struggles with selecting appropriate libraries for systems with unknown functional forms\. Alternatively, Symbolic Regression \(SR\) approaches seek to find a symbolic functionffthat maps input variables𝒙\\boldsymbol\{x\}to observed outputsyy, satisfyingy≈f\(𝒙\)y\\approx f\(\\boldsymbol\{x\}\), from given data\(Koza,[1992](https://arxiv.org/html/2605.07323#bib.bib3); Schmidt and Lipson,[2009](https://arxiv.org/html/2605.07323#bib.bib4)\)\. SR extends to dynamical system discovery by identifying the right\-hand side𝒇\\boldsymbol\{f\}of an ordinary differential equation \(ODE\), formulated as𝒙˙=𝒇\(t,𝒙\)\\dot\{\\boldsymbol\{x\}\}=\\boldsymbol\{f\}\(t,\\boldsymbol\{x\}\), which governs the system\. SR\-based methods overcome the library constraint through autonomous discovery of differential equations\. For instance, ODEformer\(d’Ascoli and others,[2024](https://arxiv.org/html/2605.07323#bib.bib6)\)leverages the Transformer architecture to generate ODEs from observational data, enabling the discovery of diverse equation structures\. However, relying solely on numerical data, these methods often fail to verify the physical meaning or consistency of the generated terms\.

More recently, LLMs have entered the SR landscape\. These approaches replace the traditional evolutionary algorithms used for equation generation with LLMs\(Shojaeeet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib7); Grayeliet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib8)\)\. By leveraging the extensive prior knowledge and hypothesis generation capabilities of LLMs, these studies have outperformed traditional approaches\. However, their evaluation of candidate equations still relies primarily on quantitative criteria such as mean squared error \(MSE\) or model complexity\. As Figure[1](https://arxiv.org/html/2605.07323#S1.F1)illustrates, equations with similarly low numerical error can nevertheless imply different physical mechanisms\. This highlights the limitation of purely numerical evaluation and motivates the need for qualitative assessment\. Since differential equations in science are often grounded in natural phenomena and human intuition, the physical plausibility of discovered terms becomes essential\.

To address these limitations, we proposeDoLQ, a novel framework thatDiscoversOrdinary differential equations withLLM\-basedQualitative andQuantitative evaluation\. DoLQ is a collaborative framework composed of three key components: a Sampler Agent, a Parameter Optimizer, and a Scientist Agent\. The Sampler Agent leverages the semantic reasoning of LLMs to propose candidate terms grounded in qualitative physical justification\. The Parameter Optimizer determines the optimal numerical parameters for proposed terms\. Crucially, the Scientist Agent enriches the search with qualitative evaluation by assessing whether candidate equations are semantically and physically plausible rather than relying on numerical fit alone\. By combining physical reasoning with numerical validation, DoLQ can identify governing laws for multi\-dimensional ODEs beyond simple polynomial forms across diverse dynamical systems\.

Our main contributions are summarized as follows:

- •Integrated qualitative\-quantitative framework:DoLQ explicitly embeds qualitative physical reasoning into the search loop, guiding the discovery process toward equations that satisfy both numerical accuracy and physical interpretability while maintaining compact equation structures\.
- •Superior performance on complex systems:We demonstrate that DoLQ achieves the highest success rates in identifying correct governing equations, particularly excelling where existing methods struggle to capture the correct dynamics\.
- •Impact of qualitative evaluation:We demonstrate the effectiveness of qualitative reasoning in filtering out physically implausible terms, validating that semantic assessment plays a crucial role in guiding the discovery process toward correct governing equations\.

## 2Problem Formulation

We consider a data\-driven discovery task where the input consists of two parts: numerical observations and a semantic description\. Let𝒟=\{\(ti,𝒙\(ti\)\)\}i=0N−1\\mathcal\{D\}=\\\{\(t\_\{i\},\\boldsymbol\{x\}\(t\_\{i\}\)\)\\\}\_\{i=0\}^\{N\-1\}be the time\-series state observations, where𝒙\(ti\)∈ℝd\\boldsymbol\{x\}\(t\_\{i\}\)\\in\\mathbb\{R\}^\{d\}represents the system state at timetit\_\{i\}\. Additionally, a system description𝒯\\mathcal\{T\}is provided, which represents the domain\-specific prior knowledge that a practitioner would possess when approaching the discovery task\.

Our goal is to discover an interpretable system of ODEs𝒙˙\(t\)=𝒇\(t,𝒙\)\\dot\{\\boldsymbol\{x\}\}\(t\)=\\boldsymbol\{f\}\(t,\\boldsymbol\{x\}\), where𝒇=\[f0,…,fd−1\]T\\boldsymbol\{f\}=\[f\_\{0\},\\dots,f\_\{d\-1\}\]^\{T\}is composed of functional terms proposed by an LLM that synthesizes the domain knowledge from𝒯\\mathcal\{T\}with numerical patterns in𝒟\\mathcal\{D\}\. The objective is to identify𝒇\\boldsymbol\{f\}that minimizes the Mean Squared Error \(MSE\)\. For thejj\-th dimension, the integral MSE is defined as

∑i=0N−1\(xj\(ti\)−\(xj\(t0\)\+∫t0tifj\(s,𝒙\)𝑑s\)\)2\.\\sum\_\{i=0\}^\{N\-1\}\\left\(x\_\{j\}\(t\_\{i\}\)\-\\left\(x\_\{j\}\(t\_\{0\}\)\+\\int\_\{t\_\{0\}\}^\{t\_\{i\}\}f\_\{j\}\(s,\\boldsymbol\{x\}\)\\,ds\\right\)\\right\)^\{2\}\.\(1\)Here, the integral represents the state trajectory obtained via numerical integration of the ODE system starting from the initial condition𝒙\(t0\)\\boldsymbol\{x\}\(t\_\{0\}\)\.

Specifically,𝒯\\mathcal\{T\}comprises facts known before data collection and without reference to mathematical formulas, such as physical principles and qualitative assessments\. While providing this context,𝒯\\mathcal\{T\}strictly prohibits specifying mathematical structures; for example, describing “air resistance” is permitted, whereas explicit terms like−cx2\-cx^\{2\}orsin⁡\(x\)\\sin\(x\)are not\. This ensures the model leverages physical intuition without knowledge of the ground\-truth symbolic forms\.

## 3Methodology

With the problem formulation in place, we now describe the DoLQ framework and its components in detail\. DoLQ is an iterative framework consisting of a Sampler Agent for generating symbolic candidates, a Parameter Optimizer for estimating coefficients, and a Scientist Agent for qualitative and quantitative evaluation \(Figure[2](https://arxiv.org/html/2605.07323#S1.F2)\)\. This cycle progressively refines the discovered equations by leveraging both domain knowledge and numerical patterns, guiding the search towards physically consistent and accurate models\.

### 3\.1Sampler Agent

The Sampler Agent is an LLM\-based component that proposes candidate terms for each dimension of the system of ODEs, along with natural language justifications grounded in the system description and feedback from the Scientist Agent\. See Appendix[K\.1](https://arxiv.org/html/2605.07323#A11.SS1)for detailed prompt examples\.

Sampler prompt\.The input prompt consists of three components: \(1a\) the role specification stating the SR task and the system description𝒯\\mathcal\{T\}providing domain\-specific context and physical principles; \(1b\) Scientist Agent guidance, including accumulated knowledge synthesized from previous iterations, term\-by\-term evaluation results \(keeporhold\), and the set of removed term skeletons marked as ineffective; \(1c\) technical constraints including output format requirements and illustrative examples\.

Sampler LLM response\.The Sampler Agent generates ODE candidates in a structured format, pairing each symbolic term with a natural language justification grounded in the system description\. Crucially, each term is formatted as an executable Python snippet utilizing state variablesxand parameter placeholdersparams\[\.\.\.\]\(e\.g\.,params\[0\] \* x\)\. As illustrated in Figure[2](https://arxiv.org/html/2605.07323#S1.F2), the output is organized into multiple candidate hypotheses\. For each hypothesis, a distinct set of terms is generated for every dimension of the system\. Each term is accompanied by its physical justification and reasoning, allowing the framework to explore diverse structural assumptions while maintaining the ability to evaluate and ablate distinct physical mechanisms\.

### 3\.2Parameter optimizer

The Parameter Optimizer takes term lists proposed by the Sampler Agent and refines them into accurate differential equations through two steps: constructing executable functions and optimizing their parameters\.

Function construction\.For each set of terms, we construct an executable skeleton function by instantiating the symbolic terms with trainable parameters\. This function represents a candidate differential equationfj\(t,𝒙;𝜽\)f\_\{j\}\(t,\\boldsymbol\{x\};\\boldsymbol\{\\theta\}\)for thejj\-th dimension, where𝜽\\boldsymbol\{\\theta\}denotes the parameters to be optimized\. See Appendix[B\.4](https://arxiv.org/html/2605.07323#A2.SS4)for detailed examples\.

Hybrid optimization\.We find the optimal parameters𝜽\\boldsymbol\{\\theta\}that minimize the residual MSE:

∑i=0N−1\(x˙j\(ti\)−fj\(ti,𝒙;𝜽\)\)2,\\sum\_\{i=0\}^\{N\-1\}\\left\(\\dot\{x\}\_\{j\}\(t\_\{i\}\)\-f\_\{j\}\(t\_\{i\},\\boldsymbol\{x\};\\boldsymbol\{\\theta\}\)\\right\)^\{2\},\(2\)wherex˙j\(ti\)\\dot\{x\}\_\{j\}\(t\_\{i\}\)denotes the numerically estimated derivative at timetit\_\{i\}, computed via finite differences from𝒟\\mathcal\{D\}\.

We adopt residual MSE instead of integral MSE \(Eq\.[1](https://arxiv.org/html/2605.07323#S2.E1)\) for two reasons: \(1\) integral MSE requires costly numerical integration at each optimization step, and \(2\) in multi\-dimensional systems, errors in one dimension cause trajectory divergence during integration, resulting in poor scores even for correctly identified dimensions\. Residual MSE evaluates each dimension independently, enabling proper credit assignment\. We therefore use residual MSE throughout optimization\.

While many SR studies utilize BFGS\(Fletcher,[2013](https://arxiv.org/html/2605.07323#bib.bib43)\), it is known to be sensitive to initialization and often fails to converge if the starting values are not well\-posed \(see Section[5\.2](https://arxiv.org/html/2605.07323#S5.SS2)\)\. Therefore, we employ a hybrid strategy that first applies differential evolution\(Storn and Price,[1997](https://arxiv.org/html/2605.07323#bib.bib11)\)to identify a promising parameter region, followed by BFGS optimization for refinement\. In practice, we evaluate all three strategies—BFGS alone, differential evolution alone, and the hybrid method—selecting the candidate with the lowest MSE\.

### 3\.3Scientist Agent

The Scientist Agent re\-evaluates the optimized candidate equations𝒇\\boldsymbol\{f\}received from the Parameter Optimizer\. It assesses each candidate term from two complementary perspectives: qualitative evaluation based on the system description and quantitative contribution measured through iterative comparison\. Upon completing these evaluations, the agent synthesizes both results to provide feedback and recommended actions to the Sampler Agent\.

#### 3\.3\.1Quantitative evaluation

For quantitative evaluation, the Scientist Agent assesses which terms in the optimized equation are truly necessary\. We perform a simple ablation test\. For each term in the equation, we temporarily remove it by setting its coefficient to zero and recalculate the residual MSE\. If the error significantly increases without the term, it is contributing meaningfully to the fit\. If the error stays the same or decreases, the term is unnecessary or harmful and should be removed\.

Based on this ablation analysis, terms are classified into three categories:good\(terms whose removal significantly increases error, indicating positive contribution\),neutral\(terms whose removal has negligible impact on error\), andbad\(terms whose removal decreases error, indicating negative impact\)\. To make the ablation criterion concrete, consider the illustrative example shown in the middle panel of Figure[2](https://arxiv.org/html/2605.07323#S1.F2)\. In that example,np\.sin\(x1\)is classified asgoodbecause removing it increases the MSE by 78\.2%, whereasx1\*\*4is classified asbadbecause its removal decreases the overall error, indicating that the term was inflating the fit\. See Appendix[B](https://arxiv.org/html/2605.07323#A2)for details\.

#### 3\.3\.2Qualitative evaluation

For qualitative evaluation, the Scientist LLM assesses how well each term aligns with the physical or mathematical meaning described in the system description\. See Appendix[K\.2](https://arxiv.org/html/2605.07323#A11.SS2)for details\.

Scientist prompt\.The input prompt to the Scientist Agent comprises four components: \(2a\) context information including the current iteration progress and system description, enabling the agent to understand the search trajectory and physical background; \(2b\) accumulated insights from previous evaluations and constraints listing previously removed term structures to prevent redundant proposals in subsequent iterations; \(2c\) experimental results presenting the Global Best equation, Previous Attempt, and Current Attempt, each with per\-dimension MSE values, optimized coefficients, and the Sampler’s physical justifications for each proposed term, allowing the agent to assess performance improvements and evaluate the physical plausibility of new terms relative to established baselines; \(2d\) evaluation instructions specifying semantic quality grading criteria and requirements for physical reasoning, ensuring consistent evaluation standards across iterations\.

Scientist LLM response\.The Scientist Agent produces a structured response containing two components\. The first component, term evaluations, provides dimension\-specific assessments where each term receives a semantic quality grade paired with physical reasoning\. Terms are assignedgoodif they clearly align with physical principles described in the system description,neutralif they are partially relevant but lack a clear physical basis, orbadif they lack physical meaning or contradict the system description\. The second component, updated insight, synthesizes the current analysis into accumulated knowledge, identifying captured physical mechanisms and aspects requiring further investigation\. This structured format enables systematic tracking of modeling progress\.

#### 3\.3\.3Feedback synthesis\.

The final action for each term is determined by combining both evaluations: \(1\) if semantic quality isbad, immediatelyremoveregardless of performance impact; \(2\) if both semantic quality and performance impact aregood, assignkeep; \(3\) all other combinations are assignedhold\. Terms receivingholdfor two consecutive iterations are converted toremove\. Removed term skeletons, where parameters are replaced with placeholder symbols, are recorded to prevent re\-proposal\. To avoid permanently discarding potentially valid terms, a probabilistic forgetting mechanism periodically clears these entries, allowing re\-exploration of previously rejected candidates\. The detailed algorithm and decision logic are provided in Appendix[B\.3](https://arxiv.org/html/2605.07323#A2.SS3)\.

Table 1:Benchmark ODEs and their temporal domains for ID and ID\-Ext evaluation\.Table 2:Quantitative performance comparison measured by NMSE\. We report the dimension\-averaged results\. Bold with underline indicates the best, bold indicates the second best\. NaN indicates solver failure or numerical instability\.

## 4Experiments

### 4\.1Benchmark and dataset

To evaluate our method, we select seven ODE problems from ODEbench\(d’Ascoli and others,[2024](https://arxiv.org/html/2605.07323#bib.bib6)\), a comprehensive benchmark for ODE discovery, with their physical descriptions and domain knowledge supplemented by referencing Wikipedia and related literature\. For four\-dimensional systems, we additionally construct benchmark problems following the same approach\. Thus, we use a total of eight ODE datasets\. We focus on four representative systems for detailed analysis in the main experiments\. We use three systems for quantitative performance comparison as shown in Table[2](https://arxiv.org/html/2605.07323#S3.T2): SIR\(2D\), CDIMA\(2D\), and Glider\(4D\)\. Additionally, we use Glider\(2D\) for structural equation analysis as shown in Figure[3](https://arxiv.org/html/2605.07323#S4.F3)\. Details on the complete benchmark suite are provided in Appendix[A](https://arxiv.org/html/2605.07323#A1)\.

Benchmark problems\.Table[1](https://arxiv.org/html/2605.07323#S3.T1)shows four representative systems from the benchmark suite\. We categorize problems by their functional complexity: systems with polynomial terms versus those with rational, trigonometric, or exponential terms\. Table[2](https://arxiv.org/html/2605.07323#S3.T2)presents quantitative evaluation on three systems: SIR\(2D\) models disease spread, CDIMA\(2D\) models chemical kinetics with nonlinear saturation terms, and Glider\(4D\) captures flight dynamics\. Figure[3](https://arxiv.org/html/2605.07323#S4.F3)shows structural equation comparison on a dimensionless Glider\(2D\) variant for clearer visualization\.

Extended in\-domain evaluation\.To assess the generalization capability of discovered equations, we build upon the evaluation framework ofShojaeeet al\.\([2025](https://arxiv.org/html/2605.07323#bib.bib7)\), which proposes in\-domain \(ID\) and out\-of\-domain \(OOD\) test sets\. While OOD evaluation is ideal for testing generalization, it is impractical in our setting because integral MSE\-based validation requires initial conditions that may be unavailable in the extended regime\. To address this limitation, we introduce extended\-ID \(ID\-Ext\) test sets that extend the temporal or state\-space coverage while retaining observed initial conditions, enabling rigorous testing of whether discovered equations maintain predictive accuracy over longer horizons\.

Data generation\.State trajectories for both ID and ID\-Ext regimes were generated using the fourth\-order Runge–Kutta method\. The time derivativesx˙\\dot\{x\}in Eq\. \([2](https://arxiv.org/html/2605.07323#S3.E2)\) were obtained through numerical differentiation based on the finite difference method\. For each ODE, we prepared textual descriptions by referencing relevant research literature, with equation names and explicit formula structures removed to prevent direct inference of the symbolic form\. The ID regime is used for model training and selection, while the ID\-Ext regime serves to evaluate extrapolation capability under extended temporal or spatial ranges\. The time domain ranges for ID and ID\-Ext in Table[1](https://arxiv.org/html/2605.07323#S3.T1)were selected based on the temporal intervals where the data distribution exhibited the most significant differences, ensuring rigorous evaluation of extrapolation capability\.

![Refer to caption](https://arxiv.org/html/2605.07323v1/x3.png)Figure 3:Equation comparison on the 2D dimensionless Glider system\. Ground truth terms are highlighted in green\. Group \(b\) shows models with reasonable NMSE: DoLQ achieves integral NMSE<10−3<10^\{\-3\}across all dimensions, and LLM\-SR exhibits the lowest NMSE among the remaining methods\. Group \(c\) shows models with higher NMSE\. Each model’s discovered equation terms and their corresponding optimized coefficients are displayed\.
### 4\.2Baseline models and experimental setup

We evaluate DoLQ against representative LLM\-based SR methods: ICSR\(Merleret al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib10)\), LASR\(Grayeliet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib8)\), LLM\-SR\(Shojaeeet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib7)\), and EDL\(Duet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib9)\)\. We selected these baselines based on their capability to incorporate textual descriptions of the system𝒯\\mathcal\{T\}via prompts and their ability to generate symbolic ODE expressions\. Notably, among these baselines, LLM\-SR and EDL were originally proposed and evaluated specifically for symbolic discovery of ordinary differential equations\. The remaining methods, LASR and ICSR, were primarily designed for general symbolic regression tasks; nevertheless, we adapt them to our problem setting to enable a fair comparison on ODE discovery\.

Most existing SR methods target scalar\-valued functions \(f:ℝd→ℝf:\\mathbb\{R\}^\{d\}\\rightarrow\\mathbb\{R\}\), whereas our task requires discovering vector\-valued functions \(f:ℝd→ℝdf:\\mathbb\{R\}^\{d\}\\rightarrow\\mathbb\{R\}^\{d\}\) for ODE systems\. To ensure fair comparison, we adapted the baselines to support vector\-valued outputs by modifying their prompts and providing the same system description𝒯\\mathcal\{T\}\. All experiments were conducted using Gemini 2\.5 Flash Lite\(Teamet al\.,[2023](https://arxiv.org/html/2605.07323#bib.bib41)\)\. To ensure equivalent LLM call conditions for a fair comparison, we configured the search budget as follows\. For DoLQ, we perform 100 iterations\. In contrast, baseline models LLM\-SR, EDL, and ICSR are configured to perform 100 iterations per dimension, with LASR set to a comparable expected number of iterations per dimension\. A similar budget strategy was maintained for 4D systems; further details are provided in Appendix[B](https://arxiv.org/html/2605.07323#A2)\.

### 4\.3Performance comparison

We use normalized mean squared error \(NMSE\) to enable fair comparison across different problems with varying scales\. We evaluate performance using two metrics: integral NMSE and residual NMSE, with the results summarized in Table[2](https://arxiv.org/html/2605.07323#S3.T2)\. For the SIR\(2D\) system, since the equation form is readily inferable from the description, most models achieve low error rates, successfully identifying the governing dynamics\. Notably, many models identify the correct mathematical structure practically from the first iteration, demonstrating that existing methods are capable when the problem structure is relatively straightforward\.

However, the CDIMA\(2D\) system presents a distinct challenge due to its complex nonlinear saturation terms\. In this case, a significant divergence emerges between residual NMSE and integral NMSE\. Baseline models often exhibit low residual NMSE yet high integral NMSE, indicating a failure to capture the true underlying physics\. Only DoLQ consistently achieves superior performance with stable integral NMSE scores, demonstrating its effectiveness in discovering equations with complex functional forms\.

For the Glider\(4D\) system, both LLM\-SR and DoLQ successfully identify the correct governing equations with low NMSE\.

![Refer to caption](https://arxiv.org/html/2605.07323v1/x4.png)Figure 4:Success scores measured across eight benchmark ODEs\. NMSE test: success is achieved if the integral NMSE across all dimensions is<10−3<10^\{\-3\}\. Term test: success is achieved if the discovered equation matches the ground truth structure after excluding terms with negligible impact\.![Refer to caption](https://arxiv.org/html/2605.07323v1/x5.png)Figure 5:Cumulative token usage on the 4D Glider system\. Stars \(\*\) indicate that LLM\-SR and DoLQ successfully identified the governing equation\. NMSE values reach their minimum on the Glider system\.
### 4\.4Structural comparison

Beyond the three representative systems in Table[1](https://arxiv.org/html/2605.07323#S3.T1), we further analyze the learning outcomes on an additional ODE problem: a 2D dimensionless variant of the Glider system\. This simplified version retains the essential nonlinear trigonometric dynamics while enabling clearer visualization and interpretation of the discovered terms compared to the full 4D system\.

Qualitative examination reveals critical differences in the discovered physical laws\. In Figure[3](https://arxiv.org/html/2605.07323#S4.F3), we categorize the results on this system into groups with reasonable NMSE \(b\) and those without \(c\)\. Both DoLQ and LLM\-SR successfully identify all ground truth terms, whereas other baselines capture only partial dynamics\. However, LLM\-SR often proposes equation skeletons containing numerous unnecessary terms\. In contrast, DoLQ proposes significantly fewer terms while successfully including all ground truth components\. This efficiency stems from the Scientist Agent’s rigorous quantitative and qualitative assessment; beneficial terms receive positive evaluations, while detrimental terms are explicitly identified and removed, preventing their recurrence in subsequent iterations\. Consequently, DoLQ achieves superior performance with a more focused and efficient search process\.

### 4\.5Overall comparison and discussion

As shown in Figure[4](https://arxiv.org/html/2605.07323#S4.F4), DoLQ achieves the highest success rate across all evaluation criteria, outperforming baseline methods on both NMSE test and Term test across diverse problem types\.

Analyzing baseline characteristics reveals distinct limitations\. LASR demonstrates reasonable performance on problems with simple polynomial terms, producing clean equations with minimal unnecessary components\. However, the evolutionary algorithm’s search capability is fundamentally limited, failing to discover governing equations for complex systems involving non\-polynomial functional forms\. LLM\-SR achieves moderate NMSE performance \(4/8 success rate\), yet shows limited success in capturing the correct equation structure\. This discrepancy arises from proposing excessive candidate terms, which not only complicates parameter optimization but also generates unnecessarily verbose prompts\. As evidenced by Figure[5](https://arxiv.org/html/2605.07323#S4.F5), LLM\-SR consumes substantially more tokens than DoLQ, with detailed statistics provided in Appendix[C\.4](https://arxiv.org/html/2605.07323#A3.SS4)\.

DoLQ addresses these fundamental limitations through the Scientist Agent’s evaluation framework\. By rigorously assessing both qualitative evaluation and quantitative evaluation of each proposed term, the Scientist Agent effectively filters out unnecessary candidates early in the search process\. This prevents the bloated equation skeletons that plague LLM\-SR while maintaining the capacity to discover complex functional forms beyond the reach of evolutionary methods\. The result is a framework that achieves superior accuracy across both simple and complex systems, consistently identifying the correct governing equations where baseline methods fail\. DoLQ’s multi\-agent architecture not only discovers physically consistent governing equations with higher success rates but also enables more interpretable results through compact equation structures\. By combining qualitative reasoning with quantitative validation, DoLQ establishes a new standard for reliable and accurate ODE discovery in dynamical systems, demonstrating robust performance across diverse problem types while maintaining computational efficiency as an additional benefit\.

## 5Component validation

### 5\.1Necessity of the Scientist Agent

To demonstrate the critical role of qualitative reasoning, we conduct an ablation study comparing DoLQ against a baseline variant labeled “w/o Scientist,” where the Scientist Agent is removed from the loop\. In this configuration, the Sampler Agent directly receives the equation skeletons, optimized coefficients, and NMSE scores from the Parameter Optimizer as part of the prompt for the next iteration, bypassing the qualitative evaluation step\. As illustrated in Figure[6](https://arxiv.org/html/2605.07323#S5.F6), the full DoLQ framework containing the Scientist Agent exhibits significantly faster convergence than the baseline\. An analysis of the reasoning logs confirms that the Scientist Agent plays a pivotal role in the discovery process; its feedback effectively filters out physically implausible terms early in the search, guiding the model toward the correct governing equations more efficiently\.

Unless otherwise noted, all methods were executed three times under the same configuration, and we report the best run; Figure[6](https://arxiv.org/html/2605.07323#S5.F6)shows one representative convergence trace generated under this protocol rather than an average over runs\. Specifically, the plotted DoLQ trace corresponds to the best of these three runs\.

![Refer to caption](https://arxiv.org/html/2605.07323v1/x6.png)Figure 6:Impact of the Scientist Agent on search convergence for the Glider\(2D\) problem\. DoLQ with the Scientist Agent discovers the correct equation at iteration 27, while the baseline without the Scientist Agent finds it at iteration 62, demonstrating that qualitative feedback accelerates convergence\.![Refer to caption](https://arxiv.org/html/2605.07323v1/x7.png)Figure 7:Action frequencies for the top\-ranked terms in each target dimension during DoLQ execution on the Glider\(2D\) problem\. Highlighted terms appear in the ground\-truth equation and consistently receive highkeepfrequency, indicating that the Scientist Agent effectively preserves physically meaningful terms\.To further understand how the Scientist Agent operates within DoLQ, we analyze its decision patterns during the discovery process\. Figure[7](https://arxiv.org/html/2605.07323#S5.F7)reveals that ground truth terms consistently receive highkeepfrequencies, demonstrating the Scientist Agent’s effectiveness in identifying physically relevant terms\. Notably, even when these terms occasionally receiveremoverecommendations, they reappear frequently across iterations due to the probabilistic sampling mechanism that considers keep/remove probabilities, global best information, and previous iteration history, enabling robust convergence toward correct equations\.

### 5\.2Effectiveness of hybrid parameter optimization

Our choice of combining differential evolution with BFGS is motivated by the need for robust parameter estimation in the rugged loss landscapes of symbolic regression\. While the LLM may propose a structurally correct equation skeleton, standard gradient\-based optimization like BFGS often fails to identify the optimal parameters due to its sensitivity to initialization and susceptibility to local minima\. As shown in Figure[8](https://arxiv.org/html/2605.07323#S5.F8), even when the correct functional form is provided, BFGS yields suboptimal parameter estimation, whereas our hybrid approach—combining the global search of differential evolution with the local refinement of BFGS—successfully locates the global optimum\. Crucially, the successful optimization by differential evolution verifies that the skeleton proposed by the LLM is indeed structurally similar to the ground truth\. Relying solely on BFGS could lead to the incorrect rejection of a reasonable skeleton due to optimization failure, concealing the true quality of the proposed structure\. We provide further analysis on the adoption frequency of each optimizer across different problem complexities in Appendix[D](https://arxiv.org/html/2605.07323#A4)\.

![Refer to caption](https://arxiv.org/html/2605.07323v1/x8.png)Figure 8:BFGS alone can fail even with a structurally correct skeleton, whereas the hybrid optimizer successfully recovers the correct solution\. This illustrates why hybrid parameter optimization is necessary\.

## 6Related work

Symbolic regression\.SR aims to recover governing equations of dynamical systems in the form of explicit mathematical formulas, evolving from evolutionary algorithms\(Cranmer,[2023](https://arxiv.org/html/2605.07323#bib.bib45)\)to incorporate physical constraints\(Udrescu and Tegmark,[2020](https://arxiv.org/html/2605.07323#bib.bib18)\), reinforcement learning\(Petersenet al\.,[2021](https://arxiv.org/html/2605.07323#bib.bib19); Mundhenket al\.,[2021](https://arxiv.org/html/2605.07323#bib.bib21); Xuet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib22)\), and neural sequence models\(Biggioet al\.,[2021](https://arxiv.org/html/2605.07323#bib.bib23); Kamiennyet al\.,[2022](https://arxiv.org/html/2605.07323#bib.bib24); Valipouret al\.,[2021](https://arxiv.org/html/2605.07323#bib.bib25); Shojaeeet al\.,[2023](https://arxiv.org/html/2605.07323#bib.bib26); Yuet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib27); Xianget al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib28); Huanget al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib29)\)\. For dynamical systems, recent methods combine neural networks with genetic programming\(Gao and others,[2023](https://arxiv.org/html/2605.07323#bib.bib32)\)and Transformer\-based architectures\(d’Ascoli and others,[2024](https://arxiv.org/html/2605.07323#bib.bib6)\)\. Recent advances include symbolic\-numeric pre\-training\(Meidaniet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib30)\), interactive offline reinforcement learning\(Tianet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib31)\), and grammar\-based discovery\(Omejcet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib33)\)\. However, most existing methods prioritize point\-wise or derivative\-based fitting metrics without validating long\-horizon trajectory consistency\. While functional SR has been extensively studied with established benchmarks\(Zhanget al\.,[2012](https://arxiv.org/html/2605.07323#bib.bib46)\), ODE\-specific benchmarks\(Matsubaraet al\.,[2022](https://arxiv.org/html/2605.07323#bib.bib47)\)remain relatively limited, requiring further research\(Oliveiraet al\.,[2018](https://arxiv.org/html/2605.07323#bib.bib20)\)\.

LLMs for scientific discovery\.LLMs leverage encoded scientific knowledge for equation discovery, optimizing skeletons via evolutionary search\(Shojaeeet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib7)\), self\-improvement\(Duet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib9)\), in\-context learning\(Merleret al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib10)\), concept libraries\(Grayeliet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib8)\), and symbol\-numeric pre\-training\(Meidaniet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib30)\)\. Recent work shows LLMs can interpret discovered expressions\(Guoet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib37)\)and diagnose errors through multi\-agent reasoning\(Parket al\.,[2026](https://arxiv.org/html/2605.07323#bib.bib38)\), while formal verification enables structured validation\(Sunet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib39)\)\. Beyond SR, LLMs translate natural language to formal mathematics through autoformalization and reasoning\(Sorocoet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib36)\)and automate research pipelines\(Luet al\.,[2024](https://arxiv.org/html/2605.07323#bib.bib34)\)\. However, evaluation still relies on derivative matching rather than trajectory integration, and LLMs’ semantic reasoning capacity for validation remains underutilized\.

Modeling and evaluation of dynamical systems\.Deep learning approaches for differential equation modeling\(Czarneckiet al\.,[2017](https://arxiv.org/html/2605.07323#bib.bib35); Dupontet al\.,[2019](https://arxiv.org/html/2605.07323#bib.bib13); Greydanuset al\.,[2019](https://arxiv.org/html/2605.07323#bib.bib16); Raissiet al\.,[2019](https://arxiv.org/html/2605.07323#bib.bib14); Rubanovaet al\.,[2019](https://arxiv.org/html/2605.07323#bib.bib12); Cranmeret al\.,[2019](https://arxiv.org/html/2605.07323#bib.bib17); Karniadakiset al\.,[2021](https://arxiv.org/html/2605.07323#bib.bib15); Linotet al\.,[2023](https://arxiv.org/html/2605.07323#bib.bib40); Leeet al\.,[2025](https://arxiv.org/html/2605.07323#bib.bib44)\)achieve robust predictions but lack interpretability\. In contrast, SR discovers interpretable governing equations in closed form\. Our approach integrates LLM\-based physical reasoning into symbolic discovery to improve equation accuracy\.

## 7Conclusion

This paper presents DoLQ, a multi\-agent framework integrating qualitative and quantitative evaluation for discovering governing ODEs\. By employing LLMs to validate physical plausibility, DoLQ demonstrates substantial improvements in discovery success rates and equation interpretability, particularly for systems with complex functional forms\. The systematic integration of semantic reasoning with numerical validation enables consistent identification of correct governing equations across diverse problem types, achieving the highest performance on comprehensive benchmarks\. Several limitations warrant further investigation: parameter optimization depends on numerical differentiation, introducing sensitivity to measurement noise, and reasoning fixation occasionally causes premature convergence to implausible hypotheses\. Future research will address these through integral\-based parameter estimation and diversified hypothesis generation with refined feedback mechanisms\. While ODEs benefit from established numerical solvers, extension to PDEs presents considerable difficulty due to the absence of universal solution methods\. We plan to investigate advances in PDE solver research to enable reliable equation discovery in spatiotemporal dynamical systems\.

## Impact Statement

This work advances automated scientific discovery by enabling interpretable equation learning from observational data\. The enhanced ability to discover governing equations can accelerate research in physics, biology, and engineering, reducing dependence on domain expertise for model development\. However, practitioners should remain cautious of potential risks: automated discovery systems may generate spurious correlations or physically implausible models if deployed without proper validation\. We emphasize the importance of human oversight in applying discovered equations to real\-world systems, particularly in safety\-critical domains\.

## Acknowledgments

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation \(IITP\) grant funded by the Korea government \(MSIT\) \[RS\-2021\-II211341, Artificial Intelligence Graduate School Program \(Chung\-Ang University\)\]\. This work was supported by the National Research Foundation of Korea \(NRF\) grant funded by the Korea government \(MSIT\) \(RS\-2025\-02303239 and RS\-2026\-25497362\)\.

## References

- L\. Biggio, T\. Bendinelli, A\. Neitz, A\. Lucchi, and G\. Parascandolo \(2021\)Neural symbolic regression that scales\.InProceedings of the 38th International Conference on Machine Learning,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 936–945\.External Links:[Link](https://proceedings.mlr.press/v139/biggio21a.html)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- S\. L\. Brunton, J\. L\. Proctor, and J\. N\. Kutz \(2016\)Discovering governing equations from data by sparse identification of nonlinear dynamical systems\.Proceedings of the National Academy of Sciences113\(15\),pp\. 3932–3937\.External Links:[Document](https://dx.doi.org/10.1073/pnas.1517384113),[Link](https://www.pnas.org/doi/abs/10.1073/pnas.1517384113),https://www\.pnas\.org/doi/pdf/10\.1073/pnas\.1517384113Cited by:[§1](https://arxiv.org/html/2605.07323#S1.p2.6)\.
- M\. Cranmer, S\. Greydanus, S\. Hoyer, P\. Battaglia, D\. Spergel, and S\. Ho \(2019\)Lagrangian neural networks\.InICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations,External Links:[Link](https://openreview.net/forum?id=iE8tFa4Nq)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- M\. Cranmer \(2023\)Interpretable Machine Learning for Science with PySR and SymbolicRegression\.jl\.arXiv\.Note:arXiv:2305\.01582 \[astro\-ph, physics:physics\]External Links:[Link](http://arxiv.org/abs/2305.01582),[Document](https://dx.doi.org/10.48550/arXiv.2305.01582)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- W\. M\. Czarnecki, S\. Osindero, M\. Jaderberg, G\. Swirszcz, and R\. Pascanu \(2017\)Sobolev training for neural networks\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/758a06618c69880a6cee5314ee42d52f-Paper.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- S\. d’Ascoliet al\.\(2024\)ODEFormer: symbolic regression of dynamical systems with transformers\.InProceedings of the 12th International Conference on Learning Representations \(ICLR\),Vienna, Austria\.Note:ICLR 2024External Links:[Link](https://openreview.net/forum?id=TzoHLiGVMo)Cited by:[§A\.1](https://arxiv.org/html/2605.07323#A1.SS1.p1.1),[§1](https://arxiv.org/html/2605.07323#S1.p2.6),[§4\.1](https://arxiv.org/html/2605.07323#S4.SS1.p1.1),[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- M\. Du, Y\. Chen, Z\. Wang, L\. Nie, and D\. Zhang \(2024\)Large language models for automatic equation discovery of nonlinear dynamics\.Physics of Fluids36\(9\),pp\. 097121\.External Links:ISSN 1070\-6631,[Document](https://dx.doi.org/10.1063/5.0224297),[Link](https://doi.org/10.1063/5.0224297),https://pubs\.aip\.org/aip/pof/article\-pdf/doi/10\.1063/5\.0224297/20152367/097121\_1\_5\.0224297\.pdfCited by:[§4\.2](https://arxiv.org/html/2605.07323#S4.SS2.p1.1),[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- E\. Dupont, A\. Doucet, and Y\. W\. Teh \(2019\)Augmented neural odes\.InAdvances in Neural Information Processing Systems,Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/21be9a4bd4f81549a9d1d241981cec3c-Paper.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- R\. Fletcher \(2013\)Practical methods of optimization\.John Wiley & Sons\.Cited by:[§3\.2](https://arxiv.org/html/2605.07323#S3.SS2.p5.1)\.
- E\. Gaoet al\.\(2023\)Probabilistic grammars for modeling dynamical systems from coarse, noisy, and partial data\.Research Square\.Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- A\. Grayeli, A\. Sehgal, O\. Costilla\-Reyes, M\. Cranmer, and S\. Chaudhuri \(2024\)Symbolic regression with a learned concept library\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 44678–44709\.External Links:[Document](https://dx.doi.org/10.52202/079017-1419),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/4ec3ddc465c6d650c9c419fb91f1c00a-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2605.07323#S1.p3.1),[§4\.2](https://arxiv.org/html/2605.07323#S4.SS2.p1.1),[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- S\. Greydanus, M\. Dzamba, and J\. Yosinski \(2019\)Hamiltonian neural networks\.InAdvances in Neural Information Processing Systems,Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/26cd8ecadce0d4efd6cc8a8725cbd1f8-Paper.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- Z\. Guo, S\. Wang, Y\. Tian, J\. Yang, H\. Yu, X\. Na, L\. Kovács, L\. Li, P\. A\. Ioannou, and F\. Wang \(2025\)SR\-llm: an incremental symbolic regression framework driven by llm\-based retrieval\-augmented generation\.Proceedings of the National Academy of Sciences122\(52\),pp\. e2516995122\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2516995122),[Link](https://www.pnas.org/doi/abs/10.1073/pnas.2516995122),https://www\.pnas\.org/doi/pdf/10\.1073/pnas\.2516995122Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- Z\. Huang, D\. Z\. Huang, T\. Xiao, D\. Ma, Z\. Ming, H\. Shi, and Y\. Wen \(2025\)Improving monte carlo tree search for symbolic regression\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=Wic0OgYsgy)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- P\. Kamienny, S\. d’Ascoli, G\. Lample, and F\. Charton \(2022\)End\-to\-end symbolic regression with transformers\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 10269–10281\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/42eb37cdbefd7abae0835f4b67548c39-Paper-Conference.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- G\. E\. Karniadakis, I\. G\. Kevrekidis, L\. Lu,et al\.\(2021\)Physics\-informed machine learning\.Nature Reviews Physics3,pp\. 422–440\.Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- J\. R\. Koza \(1992\)Genetic programming: on the programming of computers by means of natural selection\.MIT Press,55 Hayward St, Cambridge, MA, United States\.External Links:ISBN 978\-0\-262\-11170\-6Cited by:[§1](https://arxiv.org/html/2605.07323#S1.p2.6)\.
- J\. Y\. Lee, S\. Ko, and Y\. Hong \(2025\)Finite element operator network for solving elliptic\-type parametric pdes\.SIAM Journal on Scientific Computing47\(2\)\.Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- A\. J\. Linot, J\. W\. Burby, Q\. Tang, P\. Balaprakash, M\. D\. Graham, and R\. Maulik \(2023\)Stabilized neural ordinary differential equations for long\-time forecasting of dynamical systems\.Journal of Computational Physics474,pp\. 111838\.External Links:ISSN 0021\-9991,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jcp.2022.111838),[Link](https://www.sciencedirect.com/science/article/pii/S0021999122009019)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha \(2024\)The ai scientist: towards fully automated open\-ended scientific discovery\.External Links:2408\.06292,[Link](https://arxiv.org/abs/2408.06292)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- Y\. Matsubara, N\. Chiba, R\. Igarashi, and Y\. Ushiku \(2022\)SRSD: Rethinking datasets of symbolic regression for scientific discovery\.InNeurIPS 2022 AI for Science: Progress and Promises,Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- K\. Meidani, P\. Shojaee, C\. K\. Reddy, and A\. B\. Farimani \(2024\)SNIP: bridging mathematical symbolic and numeric realms with unified pre\-training\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=KZSEgJGPxu)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1),[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- M\. Merler, K\. Haitsiukevich, N\. Dainese, and P\. Marttinen \(2024\)In\-context symbolic regression: leveraging large language models for function discovery\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 4: Student Research Workshop\),X\. Fu and E\. Fleisig \(Eds\.\),Bangkok, Thailand,pp\. 427–444\.External Links:[Link](https://aclanthology.org/2024.acl-srw.49/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-srw.49),ISBN 979\-8\-89176\-097\-4Cited by:[§4\.2](https://arxiv.org/html/2605.07323#S4.SS2.p1.1),[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- T\. Mundhenk, M\. Landajuela, R\. Glatt, C\. P\. Santiago, D\. faissol, and B\. K\. Petersen \(2021\)Symbolic regression via deep reinforcement learning enhanced genetic programming seeding\.InAdvances in Neural Information Processing Systems,M\. Ranzato, A\. Beygelzimer, Y\. Dauphin, P\.S\. Liang, and J\. W\. Vaughan \(Eds\.\),Vol\.34,pp\. 24912–24923\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2021/file/d073bb8d0c47f317dd39de9c9f004e9d-Paper.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- L\. O\. V\. Oliveira, J\. F\. B\. Martins, L\. F\. Miranda, and G\. L\. Pappa \(2018\)Analysing symbolic regression benchmarks under a meta\-learning approach\.InProceedings of the Genetic and Evolutionary Computation Conference Companion,pp\. 1342–1349\.Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- N\. Omejc, B\. Gec, J\. Brence,et al\.\(2024\)Probabilistic grammars for modeling dynamical systems from coarse, noisy, and partial data\.Machine Learning113\(10\),pp\. 7689–7721\.External Links:[Document](https://dx.doi.org/10.1007/s10994-024-06522-1),[Link](https://doi.org/10.1007/s10994-024-06522-1)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- D\. Park, H\. Moon, and S\. Ryu \(2026\)A self\-correcting multi\-agent llm framework for language\-based physics simulation and explanation\.npj Artificial Intelligence2\(1\),pp\. 10\.External Links:[Document](https://dx.doi.org/10.1038/s44387-025-00057-z),[Link](https://doi.org/10.1038/s44387-025-00057-z),ISSN 3005\-1460Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- B\. K\. Petersen, M\. L\. Larma, T\. N\. Mundhenk, C\. P\. Santiago, S\. K\. Kim, and J\. T\. Kim \(2021\)Deep symbolic regression: recovering mathematical expressions from data via risk\-seeking policy gradients\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=m5Qsh0kBQG)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- M\. Raissi, P\. Perdikaris, and G\.E\. Karniadakis \(2019\)Physics\-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations\.Journal of Computational Physics378,pp\. 686–707\.External Links:ISSN 0021\-9991,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jcp.2018.10.045),[Link](https://www.sciencedirect.com/science/article/pii/S0021999118307125)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- Y\. Rubanova, R\. T\. Q\. Chen, and D\. K\. Duvenaud \(2019\)Latent ordinary differential equations for irregularly\-sampled time series\.InAdvances in Neural Information Processing Systems,Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/42a6845a557bef704ad8ac9cb4461d43-Paper.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p3.1)\.
- C\. Rudin \(2019\)Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead\.Nature machine intelligence1\(5\),pp\. 206–215\.Cited by:[§1](https://arxiv.org/html/2605.07323#S1.p1.1)\.
- M\. Schmidt and H\. Lipson \(2009\)Distilling free\-form natural laws from experimental data\.Science324\(5923\),pp\. 81–85\.External Links:[Document](https://dx.doi.org/10.1126/science.1165893),[Link](https://www.science.org/doi/abs/10.1126/science.1165893),https://www\.science\.org/doi/pdf/10\.1126/science\.1165893Cited by:[§1](https://arxiv.org/html/2605.07323#S1.p2.6)\.
- P\. Shojaee, K\. Meidani, A\. Barati Farimani, and C\. Reddy \(2023\)Transformer\-based planning for symbolic regression\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 45907–45919\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/8ffb4e3118280a66b192b6f06e0e2596-Paper-Conference.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- P\. Shojaee, K\. Meidani, S\. Gupta, A\. B\. Farimani, and C\. K\. Reddy \(2025\)LLM\-SR: scientific equation discovery via programming with large language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=m2nmp8P5in)Cited by:[§1](https://arxiv.org/html/2605.07323#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.07323#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2605.07323#S4.SS2.p1.1),[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- M\. Soroco, J\. Song, M\. Xia, K\. Emond, W\. Sun, and W\. Chen \(2025\)PDE\-controller: LLMs for autoformalization and reasoning of PDEs\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=7epYTVsWEI)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- R\. Storn and K\. Price \(1997\)Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces\.Journal of Global Optimization11\(4\),pp\. 341–359\.External Links:[Document](https://dx.doi.org/10.1023/A%3A1008202821328),ISSN 1573\-2916,[Link](https://doi.org/10.1023/A:1008202821328)Cited by:[§3\.2](https://arxiv.org/html/2605.07323#S3.SS2.p5.1)\.
- J\. Sun, Y\. Tang, A\. Li, C\. J\. Maddison, and K\. S\. Meel \(2025\)Enumerate\-conjecture\-prove: formally solving answer\-construction problems in math competitions\.External Links:2505\.18492,[Link](https://arxiv.org/abs/2505.18492)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p2.1)\.
- G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican,et al\.\(2023\)Gemini: a family of highly capable multimodal models\.arXiv preprint arXiv:2312\.11805\.Cited by:[§B\.2](https://arxiv.org/html/2605.07323#A2.SS2.p1.1),[§4\.2](https://arxiv.org/html/2605.07323#S4.SS2.p2.3)\.
- Y\. Tian, W\. Zhou, M\. Viscione, H\. Dong, D\. S\. Kammer, and O\. Fink \(2025\)Interactive symbolic regression with co\-design mechanism through offline reinforcement learning\.Nature Communications16\(1\),pp\. 3930\.External Links:[Document](https://dx.doi.org/10.1038/s41467-025-59288-y),[Link](https://doi.org/10.1038/s41467-025-59288-y)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- S\. Udrescu and M\. Tegmark \(2020\)AI feynman: a physics\-inspired method for symbolic regression\.Science Advances6\(16\),pp\. eaay2631\.External Links:[Document](https://dx.doi.org/10.1126/sciadv.aay2631),[Link](https://www.science.org/doi/abs/10.1126/sciadv.aay2631),https://www\.science\.org/doi/pdf/10\.1126/sciadv\.aay2631Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- M\. Valipour, B\. You, M\. Panju, and A\. Ghodsi \(2021\)SymbolicGPT: a generative transformer model for symbolic regression\.External Links:2106\.14131,[Link](https://arxiv.org/abs/2106.14131)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- E\. P\. Wigneret al\.\(1990\)The unreasonable effectiveness of mathematics in the natural sciences\.Mathematics and science13,pp\. 1–14\.Cited by:[§1](https://arxiv.org/html/2605.07323#S1.p1.1)\.
- Z\. Xiang, K\. Ashen, X\. Qian, and X\. Qian \(2025\)Graph\-based symbolic regression with invariance and constraint encoding\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=JYB6wFcbky)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- Y\. Xu, Y\. Liu, and H\. Sun \(2024\)Reinforcement symbolic regression machine\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 54416–54440\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/ef3a55fa15aa5fe39b7a2617b3a5d06e-Paper-Conference.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- Z\. Yu, J\. Ding, Y\. Li, and D\. Jin \(2025\)Symbolic regression via mdlformer\-guided search: from minimizing prediction error to minimizing description length\.InInternational Conference on Learning Representations,Y\. Yue, A\. Garg, N\. Peng, F\. Sha, and R\. Yu \(Eds\.\),Vol\.2025,pp\. 65306–65333\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2025/file/a402493de088886740b5939f666a6e56-Paper-Conference.pdf)Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.
- Y\. Zhang, P\. M\. Duc, O\. Corcho, and J\. Calbimonte \(2012\)SRBench: a streaming RDF/SPARQL benchmark\.InInternational Semantic Web Conference,pp\. 641–657\.Cited by:[§6](https://arxiv.org/html/2605.07323#S6.p1.1)\.

## Appendix ADataset details

### A\.1Benchmark problems and ground truth equations

Table[3](https://arxiv.org/html/2605.07323#A1.T3)provides the complete specifications for all eight benchmark ODE systems used in our experiments\. Problems ID 1\-7 are sourced from ODEbench\(d’Ascoli and others,[2024](https://arxiv.org/html/2605.07323#bib.bib6)\), while ID 8 is a 4D variant of the Glider problem, constructed to evaluate performance on four\-dimensional systems with non\-polynomial terms\. For each benchmark problem, we list its identifier \(ID 1\-8\), descriptive title, ground truth governing equations in mathematical form, and the temporal domains used for both ID and ID\-Ext evaluation regimes\. These specifications serve as the reference for comparing the equations discovered by DoLQ and baseline methods, as detailed in TableLABEL:tab:equations\_appendix\.

Table 3:Benchmark ODEs and their ground truth governing equations and temporal domains used for in\-domain \(ID\) and extended in\-domain \(ID\-Ext\) evaluations\.
### A\.2Natural language descriptions for benchmark problems

Table[4](https://arxiv.org/html/2605.07323#A1.T4)provides the natural language descriptions \(𝒯\\mathcal\{T\}\) for each benchmark ODE system\. These descriptions contain domain\-specific context and physical principles that are provided as input to the LLM\-based agents during the equation discovery process\.

Table 4:Natural language descriptions for benchmark problems\.To reduce equation\-level leakage, we removed equation names, explicit symbolic forms, and direct function\-form clues from all descriptions before prompting\. This design does not prove the absence of memorization in a strict causal sense, but it prevents the prompt from containing the answer in symbolic form and forces discovery to depend on consistency between the description, numerical optimization, and iterative term evaluation\.

### A\.3Data generation procedure

We generate synthetic trajectory data for all benchmark systems using numerical integration\. The data generation process consists of three main stages: ODE solving, derivative computation, and dataset partitioning\.

Numerical integration\.State trajectories are generated using the Runge–Kutta method\. For each ODE system specified in Table[3](https://arxiv.org/html/2605.07323#A1.T3), we integrate the ground truth equations over the temporal domains indicated in the table\.

Derivative computation\.We compute time derivativesx˙\(t\)\\dot\{x\}\(t\)using two complementary methods to validate numerical consistency\. First, we evaluate the derivatives directly by substituting the interpolated state values into the ground truth ODE functions:x˙\(ti\)=f\(ti,x\(ti\)\)\\dot\{x\}\(t\_\{i\}\)=f\(t\_\{i\},x\(t\_\{i\}\)\), whereffrepresents the right\-hand side of the ODE system\. This provides the analytically correct derivative values denoted asx˙gt\\dot\{x\}\_\{\\text\{gt\}\}\. Second, we compute numerical derivatives using NumPy’s gradient function with second\-order accurate edge handling \(edge\_order=2\), which applies the central difference schemex˙\(ti\)≈\(x\(ti\+1\)−x\(ti−1\)\)/\(2Δt\)\\dot\{x\}\(t\_\{i\}\)\\approx\(x\(t\_\{i\+1\}\)\-x\(t\_\{i\-1\}\)\)/\(2\\Delta t\)for interior points and forward/backward differences at boundaries\. These numerically differentiated values, denoted asx˙t\\dot\{x\}\_\{t\}, serve as the target derivative values for residual\-based optimization \(Eq\.[2](https://arxiv.org/html/2605.07323#S3.E2)in the main text\)\. For evaluation, we measure Residual NMSE by comparing discovered equations’ predictions againstx˙gt\\dot\{x\}\_\{\\text\{gt\}\}, ensuring consistency with the ground truth dynamics\.

Dataset partitioning\.Generated trajectories are partitioned into two distinct temporal regimes for model evaluation\. The ID regime spans fromtstartt\_\{\\text\{start\}\}totendt\_\{\\text\{end\}\}as specified in Table[3](https://arxiv.org/html/2605.07323#A1.T3), representing the temporal range used for model training and in\-distribution testing\. The ID\-Ext regime extends fromtstartt\_\{\\text\{start\}\}totood\_endt\_\{\\text\{ood\\\_end\}\}, combining both the ID temporal range and an extended temporal domain for OOD evaluation\. Specifically, for the extended portion beyondtendt\_\{\\text\{end\}\}, we use the state value attendt\_\{\\text\{end\}\}as the initial condition and continue the integration up totood\_endt\_\{\\text\{ood\\\_end\}\}, creating a continuous trajectory that tests the model’s extrapolation capability over longer time horizons\. All experiments use noise\-free data \(σ=0\\sigma=0\) with 1000 uniformly sampled time points for each regime\. Initial conditions for each system are predetermined and specified in the benchmark configuration rather than randomly sampled, ensuring reproducibility and consistency across all experimental runs\.

### A\.4State trajectory visualization

We visualize the state trajectories for the benchmark ODE problems generated using the procedure described in the previous subsection\. As shown in Figure[9](https://arxiv.org/html/2605.07323#A1.F9), we present the temporal evolution of state variables for all eight benchmark problems \(ID 1\-8\), where each subfigure displays both the ID training regime \(solid lines\) and the ID\-Ext evaluation regime \(extended temporal domain\)\. The visualization demonstrates the diversity of dynamical behaviors in our benchmark suite, ranging from simple polynomial dynamics in the SIR model to complex nonlinear oscillations in systems like CDIMA and the Glider systems\.

![Refer to caption](https://arxiv.org/html/2605.07323v1/images_paper/id_ext_plot.png)Figure 9:State trajectories for benchmark ODEs in ID and ID\-Ext regimes\. The figures display: ID 1: SIR infection model; ID 2: Glider \(dimensionless\); ID 3: Reduced model for chlorine dioxide\-iodine\-malonic acid reaction; ID 4: Isothermal autocatalytic reaction model; ID 5: Interacting bar magnets; ID 6: Binocular rivalry model; ID 7: Oscillator death model; ID 8: Glider \(physical units\)\.

## Appendix BImplementation details

### B\.1DoLQ hyperparameters

The specific hyperparameters used for the DoLQ framework in our experiments are detailed in Table[5](https://arxiv.org/html/2605.07323#A2.T5)\. These settings were chosen to balance exploration capability with computational efficiency\.

Table 5:Hyperparameters for DoLQ\. Values are derived from the experimental configuration specific to the results reported\.CategoryParameterValueDescriptionLLM ConfigurationSampler Modelgemini\-2\.5\-flash\-liteModel for generating candidate termsScientist Modelgemini\-2\.5\-flash\-liteModel for reasoning and evaluationSampler Temperature0\.90\.9High temperature for diverse generationScientist Temperature0\.60\.6Lower temperature for stable reasoningMax Tokens \(Sampler\)30003000Constraint on generation lengthMax Tokens \(Scientist\)30003000Constraint on evaluation lengthSearch & OptimizationEvolution Iterations100100Total number of search cyclesMax Parameters88Maximum trainable coefficients per termDifferential evolution tolerance10−510^\{\-5\}Convergence threshold for differential evolutionBFGS Tolerance10−910^\{\-9\}Convergence threshold for BFGSSystem SettingsDifferential evolution strategybest1binStandard differential evolution strategyDifferential evolution population size2020Number of candidates in the differential evolution poolForget Probability0\.010\.01Chance to revive a banned termTimeout240240sMaximum execution time per iteration

In all experiments, the Sampler Agent produces 3 candidate hypotheses per iteration, and each equation in a candidate is constrained to contain at most 10 terms\. We do not impose a separate sparsity regularizer; instead, sparsity is controlled operationally through this term cap together with the Scientist Agent’s keep/hold/remove feedback and the subsequent ablation\-based filtering\. Outputs exceeding the term cap are discarded before optimization\. If candidate evaluation produces overflow, underflow, division\-by\-zero, NaN, or∞\\infty, we assign the corresponding objective value to∞\\inftyand discard that candidate from further consideration in the current iteration\.

### B\.2Experimental settings for baseline models

To ensure a fair comparison with our proposed framework, DoLQ, we incorporated the system description𝒯\\mathcal\{T\}\(providing domain\-specific context and physical principles\) into the prompts for each baseline method \(LASR, LLM\-SR, EDL, ICSR\), adapted to their respective prompt structures\. All baseline methods utilize Gemini 2\.5 Flash Lite\(Teamet al\.,[2023](https://arxiv.org/html/2605.07323#bib.bib41)\)as the underlying LLM\. For experimental parameters not explicitly mentioned in the following tables, we followed the default settings of each respective baseline method\. Specifically for ICSR, the binary and unary operators listed in Table[9](https://arxiv.org/html/2605.07323#A2.T9)were explicitly provided in the prompt to define the permissible function space\.

Table 6:Experimental settings for LASR\.
Table 7:Experimental settings for LLM\-SR\.
Table 8:Experimental settings for EDL\.
Table 9:Experimental settings for ICSR\.

The hyperparameter configurations for each baseline method are summarized in Tables[6](https://arxiv.org/html/2605.07323#A2.T6),[7](https://arxiv.org/html/2605.07323#A2.T7),[8](https://arxiv.org/html/2605.07323#A2.T8), and[9](https://arxiv.org/html/2605.07323#A2.T9)\. Table[6](https://arxiv.org/html/2605.07323#A2.T6)shows the LASR settings, which employ a hybrid evolutionary approach with LLM\-guided mutations at a probability of 0\.02, running 31 iterations with 6 populations for both 2D and 4D systems\. Table[7](https://arxiv.org/html/2605.07323#A2.T7)details the LLM\-SR configuration, which utilizes an island\-based evolutionary strategy with 3 islands and adaptive reset mechanisms \(reset period of 483 for 2D and 304 for 4D\)\. Table[8](https://arxiv.org/html/2605.07323#A2.T8)presents the EDL setup, configured with 100 epochs and usingR2R^\{2\}as the fitness metric with 10 samples per generation\. Finally, Table[9](https://arxiv.org/html/2605.07323#A2.T9)specifies the ICSR parameters, which performs 100 iterations of in\-context symbolic regression with the operator set explicitly provided in the prompt\.

### B\.3Algorithm for term retention strategy

This algorithm details the “simple ablation test” mentioned in Section[3\.3\.1](https://arxiv.org/html/2605.07323#S3.SS3.SSS1), classifying terms based on their impact on residual MSE\.

Algorithm 1: Term Contribution Analysis via Ablation TestInput:Equationf\(𝐱;θ\)f\(\\mathbf\{x\};\\mathbf\{\\theta\}\), Parametersθ=\{c1,…,ck\}\\mathbf\{\\theta\}=\\\{c\_\{1\},\\dots,c\_\{k\}\\\}, Data𝒟\\mathcal\{D\}, Thresholdδ=0\.05\\delta=0\.05 Output:Classification per term\(Good/Bad/Neutral\)1:Calculate Baseline Error: MSEbase←Evaluate\(f\(𝐱;θ\),𝒟\)MSE\_\{base\}\\leftarrow\\text\{Evaluate\}\(f\(\\mathbf\{x\};\\mathbf\{\\theta\}\),\\mathcal\{D\}\)2:For each termi∈\{1,…,k\}i\\in\\\{1,\\dots,k\\\}: ctemp←θ\.copy\(\)c\_\{temp\}\\leftarrow\\mathbf\{\\theta\}\.\\text\{copy\}\(\) ctemp\[i\]←0c\_\{temp\}\[i\]\\leftarrow 0// Temporarily remove termiiMSEablated←Evaluate\(f\(𝐱;ctemp\),𝒟\)MSE\_\{ablated\}\\leftarrow\\text\{Evaluate\}\(f\(\\mathbf\{x\};c\_\{temp\}\),\\mathcal\{D\}\)Δ←\(MSEablated−MSEbase\)/\(MSEbase\+ϵ\)\\Delta\\leftarrow\(MSE\_\{ablated\}\-MSE\_\{base\}\)/\(MSE\_\{base\}\+\\epsilon\)// Calculate Change RateIf Δ\>δ\\Delta\>\\delta: Classification←\\leftarrowGood\(Positive Impact\) Else IfΔ<−δ\\Delta<\-\\delta: Classification←\\leftarrowBad\(Negative Impact\) Else: Classification←\\leftarrowNeutral3:ReturnClassifications

##### Decision logic for term retention

In the ablation test, we do not re\-optimize the remaining coefficients after zeroing a term; instead, we reuse the previously optimized coefficient vector and set only the selected coefficient to zero\. We adopted this approximation because a pilot comparison with full re\-optimization changed only a very small number of keep/remove decisions while substantially increasing runtime\. All optimization and ablation steps are performed on the given training trajectory rather than on a separate validation set\.

The final decision for each term is determined by combining the qualitative evaluation from the Scientist LLM \(Figure[15](https://arxiv.org/html/2605.07323#A11.F15)\) and the quantitative evaluation from the Simple Ablation Test \(Algorithm[B\.3](https://arxiv.org/html/2605.07323#A2.SS3)\)\. The decision matrix, including the accumulated hold mechanism, is presented in Table[10](https://arxiv.org/html/2605.07323#A2.T10)\. Specifically, we enforce a “2\-strike” rule where terms maintained in theHoldstate for two consecutive iterations are definitively removed \(Remove\) if they fail to improve\.

Table 10:Term Retention Decision Matrix\. It summarizes the final action determination based on the dual evaluation results and the accumulated hold mechanism\.Semantic Quality
\(Scientist LLM\)Quantitative Impact
\(Ablation Test\)Previous State
\(Iterationt−1t\-1\)Final Action
\(Iterationtt\)BadAnyAnyRemoveGoodGood\(Positive\)AnyKeepGoodNeutral/BadNone / KeepHold\(1st\)Hold \(1st strike\)Hold\(2nd\)Hold \(2nd strike\)RemoveNeutralAnyNone / KeepHold\(1st\)Hold \(1st strike\)Hold\(2nd\)Hold \(2nd strike\)Remove

### B\.4Function construction

As shown in Figure[13](https://arxiv.org/html/2605.07323#A11.F13), the Sampler agent generates a list of candidate terms\. To enable numerical optimization of the coefficients, we transform these symbolic terms into an executable Python function\. Figure[10](https://arxiv.org/html/2605.07323#A2.F10)illustrates this process, where each term is assigned a learnable parameter \(e\.g\.,params\[i\]\) and combined into a formatted function body\.

Function Construction ExampleInput \(Term List\)Output \(Python Code\)x0\_t= \["x0","np\.sin\(x1\)","x0\*x0"\]⇒\\Rightarrowdefx0\_t\(x0, x1, x2, x3, params\):
importnumpyasnp
return\(params\[0\]\*x0\) \+\\\\backslash
\(params\[1\]\*np\.sin\(x1\)\) \+\\\\backslash
\(params\[2\]\*\(x0\*x0\)\) \+\\\\backslash
params\[3\]\*1Figure 10:Visual representation of the function construction process\. A list of symbolic terms \(left\) is converted into an executable Python function \(right\) with learnable coefficients \(params\) and a bias term, enabling numerical optimization\.

## Appendix CComprehensive experimental results

### C\.1Discovered governing equations

In this section, we present the explicit mathematical forms of the governing equations discovered by each methodology for the benchmark ODE systems, detailing the structure and the optimized high\-precision coefficients\. These results are systematically organized in TableLABEL:tab:equations\_appendix, which provides a comprehensive comparison across all eight benchmark problems \(ID 1\-8 corresponding to the systems specified in Table[3](https://arxiv.org/html/2605.07323#A1.T3)\)\. For each problem, the table lists the discovered equations from DoLQ and all baseline methods \(EDL, ICSR, LASR, LLM\-SR\), showing the complete equation structure with optimized coefficients for every dimension\. This detailed presentation facilitates direct structural comparison between the equations recovered by our framework and those from baseline methods, enabling qualitative assessment of which methods successfully capture the ground truth terms versus those that propose spurious or incomplete dynamics\.

Table 11:Governing equations discovered by each methodology\.IDMethodDim\.DiscoveredEquationsDiscoveredEquations1DoLQ\(ours\)x˙0\\dot\{x\}\_\{0\}\(−0\.4000164923132323∗\(x0∗x1\)\)\+\(1\.1797068684934541e−06∗\(−x0∗x1∗x1\)\)\+0\.00014926603696994873∗1\(\-0\.4000164923132323\*\(x0\*x1\)\)\+\(1\.1797068684934541e\-06\*\(\-x0\*x1\*x1\)\)\+0\.00014926603696994873\*1x˙1\\dot\{x\}\_\{1\}\(0\.3999858850733014∗\(x0∗x1\)\)\+\(0\.3133017263306753∗\(−x1\)\)\+\(−0\.00010236009691644724∗\(x1∗x1\)\)−0\.000729280144169156∗1\(0\.3999858850733014\*\(x0\*x1\)\)\+\(0\.3133017263306753\*\(\-x1\)\)\+\(\-0\.00010236009691644724\*\(x1\*x1\)\)\-0\.000729280144169156\*1EDLx˙0\\dot\{x\}\_\{0\}−0\.4∗x0∗x1\-0\.4\*x0\*x1x˙1\\dot\{x\}\_\{1\}−0\.1128∗2∗x02\+0\.0051∗2∗x13\+0\.0064∗x0∗x1\+0\.4826∗x12−0\.2024∗x0∗sqrt\(x1\)\-0\.1128\*2\*x0^\{2\}\+0\.0051\*2\*x1^\{3\}\+0\.0064\*x0\*x1\+0\.4826\*x1^\{2\}\-0\.2024\*x0\*sqrt\(x1\)ICSRx˙0\\dot\{x\}\_\{0\}−0\.4∗x0∗x1\+0\.0001\-0\.4\*x0\*x1\+0\.0001x˙1\\dot\{x\}\_\{1\}0\.4∗x0∗x1−0\.314∗x1−0\.00020\.4\*x0\*x1\-0\.314\*x1\-0\.0002LASRx˙0\\dot\{x\}\_\{0\}x1∗\(x0∗−0\.4000072832475115\)x1\*\(x0\*\-0\.4000072832475115\)x˙1\\dot\{x\}\_\{1\}x1∗\(\(x0∗0\.40001272551202366\)\+−0\.314017558209762\)x1\*\(\(x0\*0\.40001272551202366\)\+\-0\.314017558209762\)LLM\-SRx˙0\\dot\{x\}\_\{0\}\(−0\.5113286722182488∗\(x0∗x1\)/1\.2782811394075673\)−\(−1\.6210813251309898e−05∗x1\)\(\-0\.5113286722182488\*\(x0\*x1\)/1\.2782811394075673\)\-\(\-1\.6210813251309898e\-05\*x1\)x˙1\\dot\{x\}\_\{1\}\(0\.7940229825361336/\(1\+−0\.0007829235554072554∗\(x0\+x1\)\)\)∗\(1/\(1\+exp\(−0\.0023009370812258584∗\(\(x1/\(x0\+x1\)\)−−0\.7319833222420157\)\)\)\)∗x0∗x1−0\.31384689979086494∗x1\(0\.7940229825361336/\(1\+\-0\.0007829235554072554\*\(x0\+x1\)\)\)\*\(1/\(1\+exp\(\-0\.0023009370812258584\*\(\(x1/\(x0\+x1\)\)\-\-0\.7319833222420157\)\)\)\)\*x0\*x1\-0\.31384689979086494\*x12DoLQ\(ours\)x˙0\\dot\{x\}\_\{0\}\(−0\.9999342734666715∗\(sin\(x1\)\)\)\+\(−0\.19999685243662685∗\(x02\)\)\+\(−3\.159527567024422e−05∗\(x0∗sin\(x1\)\)\)\+8\.291723665240451e−06∗1\(\-0\.9999342734666715\*\(sin\(x1\)\)\)\+\(\-0\.19999685243662685\*\(x0^\{2\}\)\)\+\(\-3\.159527567024422e\-05\*\(x0\*sin\(x1\)\)\)\+8\.291723665240451e\-06\*1x˙1\\dot\{x\}\_\{1\}\(−0\.9996732126004018∗\(cos\(x1\)/x0\)\)\+\(−6\.162357405925185e−05∗\(x0∗sin\(x1\)\)\)\+\(1\.0002613154943063∗\(x0\)\)−0\.0006655465341315331∗1\(\-0\.9996732126004018\*\(cos\(x1\)/x0\)\)\+\(\-6\.162357405925185e\-05\*\(x0\*sin\(x1\)\)\)\+\(1\.0002613154943063\*\(x0\)\)\-0\.0006655465341315331\*1EDLx˙0\\dot\{x\}\_\{0\}−0\.2∗sin\(x1\)−1\.0∗x02\-0\.2\*sin\(x1\)\-1\.0\*x0^\{2\}x˙1\\dot\{x\}\_\{1\}1\.2906∗x1∗log\(abs\(x0\)\)\+3\.7156∗x1\+0\.4329∗x02∗x1−1\.2452∗x0\+1\.55∗x1∗cos\(x0\)1\.2906\*x1\*log\(abs\(x0\)\)\+3\.7156\*x1\+0\.4329\*x0^\{2\}\*x1\-1\.2452\*x0\+1\.55\*x1\*cos\(x0\)ICSRx˙0\\dot\{x\}\_\{0\}−0\.2∗x02−1\.0∗sin\(1\.0∗x1\)−0\.0001\-0\.2\*x0^\{2\}\-1\.0\*sin\(1\.0\*x1\)\-0\.0001x˙1\\dot\{x\}\_\{1\}−0\.5662/\(x01\.894∗x10\.5626\)\+0\.8685∗x0−0\.1773∗x1\+1\.0748\-0\.5662/\(x0^\{1\.894\}\*x1^\{0\.5626\}\)\+0\.8685\*x0\-0\.1773\*x1\+1\.0748LASRx˙0\\dot\{x\}\_\{0\}\(x0/−1\.7671421572704085\)−sin\(x1\)\(x0/\-1\.7671421572704085\)\-sin\(x1\)x˙1\\dot\{x\}\_\{1\}x0−\(cos\(x1\)/x0\)x0\-\(cos\(x1\)/x0\)LLM\-SRx˙0\\dot\{x\}\_\{0\}\(−1\.387691981374681∗0\.9997683841467897∗sin\(x1\)\+\(−0\.27884783934870216∗square\(x0\)−\(−0\.00023524831049328578\)∗power\(x0,3\)\)\+\(\(−0\.0005339077881395173\)∗x0∗cos\(x1\)\+\(−0\.0009606900297937713\)∗x0∗sin\(x1\)\)\+\(\(1\.1025607057532369e−05\)∗x0∗square\(x1\)\+\(0\.0001392070749430701\)∗square\(x0\)∗x1\+\(1\.0871856110837581e−05\)∗x0∗square\(cos\(x1\)\)\+\(0\.0003566962804791643\)∗square\(x0\)∗sin\(x1\)\)\)/1\.387691981374681\(\-1\.387691981374681\*0\.9997683841467897\*sin\(x1\)\+\(\-0\.27884783934870216\*square\(x0\)\-\(\-0\.00023524831049328578\)\*power\(x0,3\)\)\+\(\(\-0\.0005339077881395173\)\*x0\*cos\(x1\)\+\(\-0\.0009606900297937713\)\*x0\*sin\(x1\)\)\+\(\(1\.1025607057532369e\-05\)\*x0\*square\(x1\)\+\(0\.0001392070749430701\)\*square\(x0\)\*x1\+\(1\.0871856110837581e\-05\)\*x0\*square\(cos\(x1\)\)\+\(0\.0003566962804791643\)\*square\(x0\)\*sin\(x1\)\)\)/1\.387691981374681x˙1\\dot\{x\}\_\{1\}\(\(1\.00016068611199∗x02\)\+\(−0\.0003203621386719613∗x0\)\+\(−0\.00026883955217234126\)−0\.9994481859297301∗cos\(x1\)\)/maximum\(x0,1e−6\)\(\(1\.00016068611199\*x0^\{2\}\)\+\(\-0\.0003203621386719613\*x0\)\+\(\-0\.00026883955217234126\)\-0\.9994481859297301\*cos\(x1\)\)/maximum\(x0,1e\-6\)3DoLQ\(ours\)x˙0\\dot\{x\}\_\{0\}\(−0\.999924337295147∗\(x0\)\)\+\(0\.0013713847096808362∗\(x0/\(1\.0\+x02\)\)\)\+\(4\.0000051777750265∗\(−x1∗x0/\(1\.0\+x02\)\)\)\+8\.899331099033029∗1\(\-0\.999924337295147\*\(x0\)\)\+\(0\.0013713847096808362\*\(x0/\(1\.0\+x0^\{2\}\)\)\)\+\(4\.0000051777750265\*\(\-x1\*x0/\(1\.0\+x0^\{2\}\)\)\)\+8\.899331099033029\*1x˙1\\dot\{x\}\_\{1\}\(1\.3999732914208531∗\(−x0∗x1/\(1\.0\+x02\)\)\)\+\(−4\.4947665548779136e−05∗\(−x02/\(1\.0\+x02\)\)\)\+\(−1\.3999589124979535∗\(−x0\)\)\+4\.897208257542713e−07∗1\(1\.3999732914208531\*\(\-x0\*x1/\(1\.0\+x0^\{2\}\)\)\)\+\(\-4\.4947665548779136e\-05\*\(\-x0^\{2\}/\(1\.0\+x0^\{2\}\)\)\)\+\(\-1\.3999589124979535\*\(\-x0\)\)\+4\.897208257542713e\-07\*1EDLx˙0\\dot\{x\}\_\{0\}−7\.159∗x13\+10\.3164∗x0−4\.0771∗cos\(x1\)2−0\.3057∗sqrt\(x0\)\+0\.0662∗x1\-7\.159\*x1^\{3\}\+10\.3164\*x0\-4\.0771\*cos\(x1\)^\{2\}\-0\.3057\*sqrt\(x0\)\+0\.0662\*x1x˙1\\dot\{x\}\_\{1\}−0\.7501∗sqrt\(x0\)\+0\.2166∗x12\+10\.8516∗x02−9\.7161∗x1−0\.3873∗cos\(x0\)3\-0\.7501\*sqrt\(x0\)\+0\.2166\*x1^\{2\}\+10\.8516\*x0^\{2\}\-9\.7161\*x1\-0\.3873\*cos\(x0\)^\{3\}ICSRx˙0\\dot\{x\}\_\{0\}−4\.0∗x0∗x1/\(x02\+0\.9999\)−1\.0002∗x0∗tanh\(2\.0678∗x1\)\+8\.9004\-4\.0\*x0\*x1/\(x0^\{2\}\+0\.9999\)\-1\.0002\*x0\*tanh\(2\.0678\*x1\)\+8\.9004x˙1\\dot\{x\}\_\{1\}−17\.0218∗x0∗tanh\(2\.5653∗x0\)\+18\.7118∗x0−3\.2262∗x1/\(x0\+3\.8981\)−0\.6359\-17\.0218\*x0\*tanh\(2\.5653\*x0\)\+18\.7118\*x0\-3\.2262\*x1/\(x0\+3\.8981\)\-0\.6359LASRx˙0\\dot\{x\}\_\{0\}\(\(sin\(x0\)\+−0\.7225424108770416\)∗−1\.3716248441728514\)\+tan\(sinh\(sin\(x1\+−1\.271652676453637\)\)\)\(\(sin\(x0\)\+\-0\.7225424108770416\)\*\-1\.3716248441728514\)\+tan\(sinh\(sin\(x1\+\-1\.271652676453637\)\)\)x˙1\\dot\{x\}\_\{1\}log\(x0\)log\(x0\)LLM\-SRx˙0\\dot\{x\}\_\{0\}\(7\.2413552436477575∗\(x0/\(−0\.09692412820488534\+x0\)\)\)∗\(1−\(x11\.6726750850832244\)/\(4\.2671124403329941\.6726750850832244\+x11\.6726750850832244\)\)\+\(−1\.6934497454181765∗x0∗x1∗\(\(x0−2\.2251470718338093\)/\(1\.2897199381103897−2\.2251470718338093\+x0−2\.2251470718338093\)\)\)\(7\.2413552436477575\*\(x0/\(\-0\.09692412820488534\+x0\)\)\)\*\(1\-\(x1^\{1\.6726750850832244\}\)/\(4\.267112440332994^\{1\.6726750850832244\}\+x1^\{1\.6726750850832244\}\)\)\+\(\-1\.6934497454181765\*x0\*x1\*\(\(x0^\{\-2\.2251470718338093\}\)/\(1\.2897199381103897^\{\-2\.2251470718338093\}\+x0^\{\-2\.2251470718338093\}\)\)\)x˙1\\dot\{x\}\_\{1\}\(0\.15305571645674615∗\(x0/\(1\.4598732465999258\+x0\)\)\)∗\(1−17\.709388951838832∗x1\)−\(−2\.096703476296624∗\(\(x01\.9666706388834827\)/\(2\.0872911189094361\.9666706388834827\+x01\.9666706388834827\)\)∗x1\)−\(−1\.3593840857406783∗x0\)\(0\.15305571645674615\*\(x0/\(1\.4598732465999258\+x0\)\)\)\*\(1\-17\.709388951838832\*x1\)\-\(\-2\.096703476296624\*\(\(x0^\{1\.9666706388834827\}\)/\(2\.087291118909436^\{1\.9666706388834827\}\+x0^\{1\.9666706388834827\}\)\)\*x1\)\-\(\-1\.3593840857406783\*x0\)4DoLQ\(ours\)x˙0\\dot\{x\}\_\{0\}\(0\.4973058292805835∗\(−x0\)\)\+\(0\.21746562709380776∗\(−x0∗x1\)\)\+\(0\.6344054886410723∗\(−x1∗x1\)\)\+0\.5264411613474004∗1\(0\.4973058292805835\*\(\-x0\)\)\+\(0\.21746562709380776\*\(\-x0\*x1\)\)\+\(0\.6344054886410723\*\(\-x1\*x1\)\)\+0\.5264411613474004\*1x˙1\\dot\{x\}\_\{1\}\(0\.41858642320557227∗\(x0∗x13\)\)\+\(−1\.6764582281044875∗\(x1/\(1\+x1\)\)\)\+\(1\.5117151384013812∗\(tanh\(x1\)\)\)\+0\.028354488818915925∗1\(0\.41858642320557227\*\(x0\*x1^\{3\}\)\)\+\(\-1\.6764582281044875\*\(x1/\(1\+x1\)\)\)\+\(1\.5117151384013812\*\(tanh\(x1\)\)\)\+0\.028354488818915925\*1EDLx˙0\\dot\{x\}\_\{0\}−0\.1566∗sin\(x1\)2−1\.404∗x02\+0\.5305∗x1\-0\.1566\*sin\(x1\)^\{2\}\-1\.404\*x0^\{2\}\+0\.5305\*x1x˙1\\dot\{x\}\_\{1\}0\.8039∗x0∗x1\+0\.0724∗sin\(x1\)20\.8039\*x0\*x1\+0\.0724\*sin\(x1\)^\{2\}ICSRx˙0\\dot\{x\}\_\{0\}2\.4649∗x0∗cos\(0\.9101∗x1\)−2\.9665∗x0\+0\.50332\.4649\*x0\*cos\(0\.9101\*x1\)\-2\.9665\*x0\+0\.5033x˙1\\dot\{x\}\_\{1\}0\.0692∗x0\+0\.6177∗x1−0\.16790\.0692\*x0\+0\.6177\*x1\-0\.1679LASRx˙0\\dot\{x\}\_\{0\}abs\(sinh\(tanh\(sin\(tan\(x1\+\(\(x00\.22334073977903218\)\+\(x0/x0\)\)\)\+x0\)\)\)\+\(x0∗0\.3514105571067361\)\)∗−0\.35812758021702057abs\(sinh\(tanh\(sin\(tan\(x1\+\(\(x0^\{0\.22334073977903218\}\)\+\(x0/x0\)\)\)\+x0\)\)\)\+\(x0\*0\.3514105571067361\)\)\*\-0\.35812758021702057x˙1\\dot\{x\}\_\{1\}\(\(x0∗x1\)−0\.019991627704281027\)∗x1\(\(x0\*x1\)\-0\.019991627704281027\)\*x1LLM\-SRx˙0\\dot\{x\}\_\{0\}\(0\.7575404568207408∗\(x01\.0100603887529287\)∗\(x10\.6955603195414775\)\)\+\(−0\.6206768205830506∗x0\)\+0\.39548215725015795\+\(−1\.6012562268622113∗x1\)\+\(0\.5863974471340245∗\(x10\.698948044530037\)\)\(0\.7575404568207408\*\(x0^\{1\.0100603887529287\}\)\*\(x1^\{0\.6955603195414775\}\)\)\+\(\-0\.6206768205830506\*x0\)\+0\.39548215725015795\+\(\-1\.6012562268622113\*x1\)\+\(0\.5863974471340245\*\(x1^\{0\.698948044530037\}\)\)x˙1\\dot\{x\}\_\{1\}\(0\.007339212327735855∗x0\)\+\(0\.9624797409788882∗x12\)\+\(0\.7873542337819555∗x13\)−\(−0\.06037485229609352∗x1\)−\(−0\.06037485265191514∗x1\)−\(0\.09930729898681726∗x0∗x1\)−\(1\.2126452269944556∗x13\)\+\(0\.0073391823827276045∗x0\)\(0\.007339212327735855\*x0\)\+\(0\.9624797409788882\*x1^\{2\}\)\+\(0\.7873542337819555\*x1^\{3\}\)\-\(\-0\.06037485229609352\*x1\)\-\(\-0\.06037485265191514\*x1\)\-\(0\.09930729898681726\*x0\*x1\)\-\(1\.2126452269944556\*x1^\{3\}\)\+\(0\.0073391823827276045\*x0\)5DoLQ\(ours\)x˙0\\dot\{x\}\_\{0\}\(−1\.0015488038824854∗\(sin\(x0\)\)\)\+\(0\.33102674079192346∗\(sin\(x0−x1\)\)\)\+\(−0\.0018115741826902861∗\(cos\(x0\)\)\)\+0\.0017490917848575025∗1\(\-1\.0015488038824854\*\(sin\(x0\)\)\)\+\(0\.33102674079192346\*\(sin\(x0\-x1\)\)\)\+\(\-0\.0018115741826902861\*\(cos\(x0\)\)\)\+0\.0017490917848575025\*1x˙1\\dot\{x\}\_\{1\}\(−0\.9981457993630254∗\(sin\(x1\)\)\)\+\(0\.3300067495528232∗\(sin\(x1−x0\)\)\)\+\(−0\.012578074617818194∗\(cos\(x1\)\)\)\+0\.012716069769122665∗1\(\-0\.9981457993630254\*\(sin\(x1\)\)\)\+\(0\.3300067495528232\*\(sin\(x1\-x0\)\)\)\+\(\-0\.012578074617818194\*\(cos\(x1\)\)\)\+0\.012716069769122665\*1EDLx˙0\\dot\{x\}\_\{0\}−0\.3375∗sin\(x0\)−0\.684∗x1\-0\.3375\*sin\(x0\)\-0\.684\*x1x˙1\\dot\{x\}\_\{1\}−0\.2973∗x1−0\.6309∗x0\-0\.2973\*x1\-0\.6309\*x0ICSRx˙0\\dot\{x\}\_\{0\}5\.0117∗x0∗sin\(x1\)−2\.2897∗x1−0\.26395\.0117\*x0\*sin\(x1\)\-2\.2897\*x1\-0\.2639x˙1\\dot\{x\}\_\{1\}−0\.2981∗x0−0\.6191∗x1\+0\.0019\-0\.2981\*x0\-0\.6191\*x1\+0\.0019LASRx˙0\\dot\{x\}\_\{0\}\(x0∗−0\.66306757779156\)\+0\.04301494511373006\(x0\*\-0\.66306757779156\)\+0\.04301494511373006x˙1\\dot\{x\}\_\{1\}x0∗\(\(x0/−1\.192434344729412\)\+0\.25749851031674253\)x0\*\(\(x0/\-1\.192434344729412\)\+0\.25749851031674253\)LLM\-SRx˙0\\dot\{x\}\_\{0\}−1\.000043264361751∗sin\(x0\)\+−0\.3300718799861972∗sin\(x1−x0\)\+−1\.568587192043395e−05∗cos\(x1\+x0\)\-1\.000043264361751\*sin\(x0\)\+\-0\.3300718799861972\*sin\(x1\-x0\)\+\-1\.568587192043395e\-05\*cos\(x1\+x0\)x˙1\\dot\{x\}\_\{1\}−0\.9938291044117465∗x1\+−0\.3292639827583344∗sin\(x0−x1\)−\(−2\.656564954764913e−06/\(x02\+x12\+1e−9\)1\.5\)\+−0\.0003613964745632571∗x0∗sin\(x1\)\-0\.9938291044117465\*x1\+\-0\.3292639827583344\*sin\(x0\-x1\)\-\(\-2\.656564954764913e\-06/\(x0^\{2\}\+x1^\{2\}\+1e\-9\)^\{1\.5\}\)\+\-0\.0003613964745632571\*x0\*sin\(x1\)6DoLQ\(ours\)x˙0\\dot\{x\}\_\{0\}\(1\.3286098112316618∗\(−x0\)\)\+\(1\.178235675777727∗\(−x1\)\)\+\(−0\.6010789351853646∗\(−power\(x0,3\)\)\)\+0\.9285456156774677∗1\(1\.3286098112316618\*\(\-x0\)\)\+\(1\.178235675777727\*\(\-x1\)\)\+\(\-0\.6010789351853646\*\(\-power\(x0,3\)\)\)\+0\.9285456156774677\*1x˙1\\dot\{x\}\_\{1\}\(0\.23753405260404614∗\(−x1\)\)\+\(−0\.7898637222390873∗\(−x0\)\)\+\(−0\.8360904268130745∗\(tanh\(x1\)\)\)\+\(2\.290634305077196∗\(−tanh\(x0\)\)\)\+0\.9338877404981648∗1\(0\.23753405260404614\*\(\-x1\)\)\+\(\-0\.7898637222390873\*\(\-x0\)\)\+\(\-0\.8360904268130745\*\(tanh\(x1\)\)\)\+\(2\.290634305077196\*\(\-tanh\(x0\)\)\)\+0\.9338877404981648\*1EDLx˙0\\dot\{x\}\_\{0\}−1\.1878∗x1−0\.467∗log\(sin\(x0\)\)\-1\.1878\*x1\-0\.467\*log\(sin\(x0\)\)x˙1\\dot\{x\}\_\{1\}−1\.1299∗log\(sin\(x0\)\)−0\.4331∗x1\-1\.1299\*log\(sin\(x0\)\)\-0\.4331\*x1ICSRx˙0\\dot\{x\}\_\{0\}−1\.8485∗x1−0\.1948∗sin\(x0\)\+0\.7327\-1\.8485\*x1\-0\.1948\*sin\(x0\)\+0\.7327x˙1\\dot\{x\}\_\{1\}0\.2393∗x0−2\.1268∗x1\+0\.64310\.2393\*x0\-2\.1268\*x1\+0\.6431LASRx˙0\\dot\{x\}\_\{0\}cos\(exp\(tan\(x0\)\)\)cos\(exp\(tan\(x0\)\)\)x˙1\\dot\{x\}\_\{1\}tanh\(0\.14631324378767832/x1\)−x0tanh\(0\.14631324378767832/x1\)\-x0LLM\-SRx˙0\\dot\{x\}\_\{0\}\(\(−2\.249593853293885∗\(x0/\(1\.1957556897109163\+x0\)\)\+1\.0667454050960814\)−\(3\.6741828960616694∗\(x1/\(2\.3342324749372967\+x1\)\)\)\)\(\(\-2\.249593853293885\*\(x0/\(1\.1957556897109163\+x0\)\)\+1\.0667454050960814\)\-\(3\.6741828960616694\*\(x1/\(2\.3342324749372967\+x1\)\)\)\)x˙1\\dot\{x\}\_\{1\}x1∗\(−2\.9587039457791064∗\(1−\(x1/0\.5623115553288086\)\)−\(1\.966298296973735∗x0∗1\.966298237331697\)\)\+0\.9508865699097846x1\*\(\-2\.9587039457791064\*\(1\-\(x1/0\.5623115553288086\)\)\-\(1\.966298296973735\*x0\*1\.966298237331697\)\)\+0\.95088656990978467DoLQ\(ours\)x˙0\\dot\{x\}\_\{0\}\(−1\.1655257292555141∗\(sin\(x1−x0\)\)\)\+\(1\.22475958770386∗\(cos\(x0\)\)\)\+\(−0\.1090013166633977∗\(x0∗sin\(x0\)\)\)\+0\.817067905629052∗1\(\-1\.1655257292555141\*\(sin\(x1\-x0\)\)\)\+\(1\.22475958770386\*\(cos\(x0\)\)\)\+\(\-0\.1090013166633977\*\(x0\*sin\(x0\)\)\)\+0\.817067905629052\*1x˙1\\dot\{x\}\_\{1\}\(−2\.2117156170501504∗\(x1\)\)\+\(−1\.291170751585586∗\(x0\)\)\+\(−4\.676614135451172∗\(cos\(x1\)\)\)\+\(1\.8750516943971047∗\(sin\(x1−x0\)\)\)\+10\.469917369126234∗1\(\-2\.2117156170501504\*\(x1\)\)\+\(\-1\.291170751585586\*\(x0\)\)\+\(\-4\.676614135451172\*\(cos\(x1\)\)\)\+\(1\.8750516943971047\*\(sin\(x1\-x0\)\)\)\+10\.469917369126234\*1EDLx˙0\\dot\{x\}\_\{0\}−0\.817∗x12−4\.3577∗x03∗sin\(x0\)\+2\.7509∗x1\+1\.5327∗x0∗sqrt\(x1\)\+0\.0552∗x0∗cos\(x0\)\-0\.817\*x1^\{2\}\-4\.3577\*x0^\{3\}\*sin\(x0\)\+2\.7509\*x1\+1\.5327\*x0\*sqrt\(x1\)\+0\.0552\*x0\*cos\(x0\)x˙1\\dot\{x\}\_\{1\}−1\.7872∗x0\+0\.3915∗sqrt\(x1\)−0\.7313∗cos\(x0\)\+0\.1632∗x02∗x1\+0\.1673∗x13\-1\.7872\*x0\+0\.3915\*sqrt\(x1\)\-0\.7313\*cos\(x0\)\+0\.1632\*x0^\{2\}\*x1\+0\.1673\*x1^\{3\}ICSRx˙0\\dot\{x\}\_\{0\}−0\.2806∗x0\+0\.6592∗x1∗cos\(1\.1398∗x0\)\+2\.0662\-0\.2806\*x0\+0\.6592\*x1\*cos\(1\.1398\*x0\)\+2\.0662x˙1\\dot\{x\}\_\{1\}−0\.2806∗x0\+0\.6592∗x1∗cos\(1\.1398∗x0\)\+1\.6062\-0\.2806\*x0\+0\.6592\*x1\*cos\(1\.1398\*x0\)\+1\.6062LASRx˙0\\dot\{x\}\_\{0\}\(cos\(x0\)∗sin\(x1\)\)\+1\.4320121866725906\(cos\(x0\)\*sin\(x1\)\)\+1\.4320121866725906x˙1\\dot\{x\}\_\{1\}\(cos\(x0\)∗sin\(x1\)\)\+0\.9720121867249\(cos\(x0\)\*sin\(x1\)\)\+0\.9720121867249LLM\-SRx˙0\\dot\{x\}\_\{0\}\(−\(−1\.231629475981958\)∗x0−2\.158066362365913∗x02−\(−0\.3037486836573964\)∗x03\)\+\(1\.3169135770056555∗x1\+2\.287561908522607∗x12\+−1\.2108364457957292∗x13\)\+\(−0\.8950111650180933∗\(x1−x0\)\+1\.609022512581006∗\(x1−x0\)2\+0\.3125475185298518∗\(x1−x0\)3\)\(\-\(\-1\.231629475981958\)\*x0\-2\.158066362365913\*x0^\{2\}\-\(\-0\.3037486836573964\)\*x0^\{3\}\)\+\(1\.3169135770056555\*x1\+2\.287561908522607\*x1^\{2\}\+\-1\.2108364457957292\*x1^\{3\}\)\+\(\-0\.8950111650180933\*\(x1\-x0\)\+1\.609022512581006\*\(x1\-x0\)^\{2\}\+0\.3125475185298518\*\(x1\-x0\)^\{3\}\)x˙1\\dot\{x\}\_\{1\}\(0\.14605034843290507∗x1\)\+\(0\.3954492324348049∗x0\)\+\(1\.249712176091898∗\(x0−x1\)\)\+\(−0\.2869847813397451∗x13\)\+\(0\.03189090914482745∗x03\)\+\(−0\.08031202424973191∗x12\)\+\(−0\.5242674943108351∗x02\)\+\(0\.017802263147943736∗x04\)\+\(−0\.15007455215067977∗x14\)\+\(−0\.09580817521497335∗\(x0−x1\)3\)\(0\.14605034843290507\*x1\)\+\(0\.3954492324348049\*x0\)\+\(1\.249712176091898\*\(x0\-x1\)\)\+\(\-0\.2869847813397451\*x1^\{3\}\)\+\(0\.03189090914482745\*x0^\{3\}\)\+\(\-0\.08031202424973191\*x1^\{2\}\)\+\(\-0\.5242674943108351\*x0^\{2\}\)\+\(0\.017802263147943736\*x0^\{4\}\)\+\(\-0\.15007455215067977\*x1^\{4\}\)\+\(\-0\.09580817521497335\*\(x0\-x1\)^\{3\}\)8DoLQ\(ours\)x˙0\\dot\{x\}\_\{0\}\(−9\.809127338404517∗\(sin\(x1\)\)\)\+\(−9\.411027230248375e−05∗\(cos\(x1\)\)\)\+\(−0\.030620460686854464∗\(x0∗x0\)\)−4\.9118855130869645e−05∗1\(\-9\.809127338404517\*\(sin\(x1\)\)\)\+\(\-9\.411027230248375e\-05\*\(cos\(x1\)\)\)\+\(\-0\.030620460686854464\*\(x0\*x0\)\)\-4\.9118855130869645e\-05\*1x˙1\\dot\{x\}\_\{1\}\(0\.9990536032782628∗\(−9\.81/x0∗cos\(x1\)\)\)\+\(0\.6130165352306062∗\(x0\)\)\+\(−0\.0032944410806144153∗\(cos\(x1\)\)\)−0\.0023916253783730504∗1\(0\.9990536032782628\*\(\-9\.81/x0\*cos\(x1\)\)\)\+\(0\.6130165352306062\*\(x0\)\)\+\(\-0\.0032944410806144153\*\(cos\(x1\)\)\)\-0\.0023916253783730504\*1x˙2\\dot\{x\}\_\{2\}\(1\.0000423141055985∗\(x0∗cos\(x1\)\)\)\+\(−1\.5764831178241946e−05∗\(x0∗x0∗cos\(x1\)\)\)\+0\.00014446800549605261∗1\(1\.0000423141055985\*\(x0\*cos\(x1\)\)\)\+\(\-1\.5764831178241946e\-05\*\(x0\*x0\*cos\(x1\)\)\)\+0\.00014446800549605261\*1x˙3\\dot\{x\}\_\{3\}\(1\.0000402047694594∗\(x0∗sin\(x1\)\)\)\+\(−2\.003183912960774e−05∗\(x0∗x0∗sin\(x1\)\)\)\+\(−6\.303122694043895e−06∗\(9\.81∗cos\(x1\)\)\)−4\.353532868248261e−05∗1\(1\.0000402047694594\*\(x0\*sin\(x1\)\)\)\+\(\-2\.003183912960774e\-05\*\(x0\*x0\*sin\(x1\)\)\)\+\(\-6\.303122694043895e\-06\*\(9\.81\*cos\(x1\)\)\)\-4\.353532868248261e\-05\*1EDLx˙0\\dot\{x\}\_\{0\}−0\.0306∗sin\(x1\)−9\.8091∗x02\-0\.0306\*sin\(x1\)\-9\.8091\*x0^\{2\}x˙1\\dot\{x\}\_\{1\}−0\.9504∗x03−0\.2996∗x12−0\.0006∗x3\+1\.3296∗sqrt\(abs\(x1\)\)−175\.3583∗x0/2\-0\.9504\*x0^\{3\}\-0\.2996\*x1^\{2\}\-0\.0006\*x3\+1\.3296\*sqrt\(abs\(x1\)\)\-175\.3583\*x0/2x˙2\\dot\{x\}\_\{2\}0\.9999∗x0∗cos\(x1\)0\.9999\*x0\*cos\(x1\)x˙3\\dot\{x\}\_\{3\}0\.9999∗x0∗sin\(x1\)0\.9999\*x0\*sin\(x1\)ICSRx˙0\\dot\{x\}\_\{0\}−0\.2813∗x0\+0\.0015∗x1∗x2−0\.0968∗x3∗sin\(1\.0007∗x1\)\+0\.4387\-0\.2813\*x0\+0\.0015\*x1\*x2\-0\.0968\*x3\*sin\(1\.0007\*x1\)\+0\.4387x˙1\\dot\{x\}\_\{1\}−0\.0605∗x0−6\.8291∗x1∗exp\(−2\.7884∗x0\)−0\.0042∗x2∗x3\+5\.306\-0\.0605\*x0\-6\.8291\*x1\*exp\(\-2\.7884\*x0\)\-0\.0042\*x2\*x3\+5\.306x˙2\\dot\{x\}\_\{2\}0\.9998∗x0∗cos\(x1\)\+0\.02170\.9998\*x0\*cos\(x1\)\+0\.0217x˙3\\dot\{x\}\_\{3\}0\.9999∗x0∗sin\(x1\)\+0\.00020\.9999\*x0\*sin\(x1\)\+0\.0002LASRx˙0\\dot\{x\}\_\{0\}sin\(x1\)∗−9\.813480254368727sin\(x1\)\*\-9\.813480254368727x˙1\\dot\{x\}\_\{1\}\(\(log\(x0\)−\(\(x2−5\.963031610926088\)∗0\.24925925160994267\)\)∗8\.24239810331568\)/x0\(\(log\(x0\)\-\(\(x2\-5\.963031610926088\)\*0\.24925925160994267\)\)\*8\.24239810331568\)/x0x˙2\\dot\{x\}\_\{2\}cos\(x1\)∗x0cos\(x1\)\*x0x˙3\\dot\{x\}\_\{3\}sin\(x1\)∗x0sin\(x1\)\*x0LLM\-SRx˙0\\dot\{x\}\_\{0\}\(−9\.807899351331447∗sin\(x1\)\)\+−\(\(0\.4145037567137417\+1\.5751662511703768∗\(−0\.0001796563617646803\+−8\.696424882717339e−05∗x1\)2\)∗0\.07394440250282637∗x02\)\+\(\(−0\.0001796563617646803\+−8\.696424882717339e−05∗x1\)∗0\.07394440250282637∗x02∗sin\(x1\)\)\+\(−−0\.00026688377456063225∗x0\)\+\(3\.1382118431267775e−05∗x2\)\+\(−7\.49115792450938e−06∗x3\)\+\(0\.00024089007421140555∗x0∗cos\(x1\)∗sin\(x1\)\)\(\-9\.807899351331447\*sin\(x1\)\)\+\-\(\(0\.4145037567137417\+1\.5751662511703768\*\(\-0\.0001796563617646803\+\-8\.696424882717339e\-05\*x1\)^\{2\}\)\*0\.07394440250282637\*x0^\{2\}\)\+\(\(\-0\.0001796563617646803\+\-8\.696424882717339e\-05\*x1\)\*0\.07394440250282637\*x0^\{2\}\*sin\(x1\)\)\+\(\-\-0\.00026688377456063225\*x0\)\+\(3\.1382118431267775e\-05\*x2\)\+\(\-7\.49115792450938e\-06\*x3\)\+\(0\.00024089007421140555\*x0\*cos\(x1\)\*sin\(x1\)\)x˙1\\dot\{x\}\_\{1\}\(\(0\.4312345753897334∗x0\)\+\(−7\.190875450891451∗x02\)\+\(115\.22337211688654∗cos\(x1\)\)\)/\(−11\.719230243958577∗x0\+−0\.023905720212081497\+1e−9\)\+\(0\.0004499464651382155∗x2\)\+\(0\.0002873146658744879∗x3\)\+\(3\.203979516768474e−05∗x0∗x1\)\(\(0\.4312345753897334\*x0\)\+\(\-7\.190875450891451\*x0^\{2\}\)\+\(115\.22337211688654\*cos\(x1\)\)\)/\(\-11\.719230243958577\*x0\+\-0\.023905720212081497\+1e\-9\)\+\(0\.0004499464651382155\*x2\)\+\(0\.0002873146658744879\*x3\)\+\(3\.203979516768474e\-05\*x0\*x1\)x˙2\\dot\{x\}\_\{2\}\(−\(−8\.313688998942643e−05\)∗sin\(x1\)\)\+\(0\.00012926385963329574∗−0\.0002110727122192323∗x3∗x02∗sin\(x1\)\)\+\(−0\.00012926385963329574∗0\.05∗x02∗cos\(x1\)\)\+\(−0\.00012926385963329574∗1\.5157577205870516∗\(−0\.0002110727122192323∗x3\)2∗x02∗cos\(x1\)\)\+\(−1\.0∗exp\(−\(\(x3−1\.0\)/1\.0\)2\)∗x02∗cos\(x1\)\)\+\(−−0\.9999727429727798∗\(x0−−4\.820794712947023e−05\)∗cos\(x1\)\)\+\(1\.7829345690512744e−08∗x3∗x1∗x0\)\(\-\(\-8\.313688998942643e\-05\)\*sin\(x1\)\)\+\(0\.00012926385963329574\*\-0\.0002110727122192323\*x3\*x0^\{2\}\*sin\(x1\)\)\+\(\-0\.00012926385963329574\*0\.05\*x0^\{2\}\*cos\(x1\)\)\+\(\-0\.00012926385963329574\*1\.5157577205870516\*\(\-0\.0002110727122192323\*x3\)^\{2\}\*x0^\{2\}\*cos\(x1\)\)\+\(\-1\.0\*exp\(\-\(\(x3\-1\.0\)/1\.0\)^\{2\}\)\*x0^\{2\}\*cos\(x1\)\)\+\(\-\-0\.9999727429727798\*\(x0\-\-4\.820794712947023e\-05\)\*cos\(x1\)\)\+\(1\.7829345690512744e\-08\*x3\*x1\*x0\)x˙3\\dot\{x\}\_\{3\}\(0\.00013369546648802163∗x0∗cos\(x1\)\)\+\(−6\.586738720708613e−06∗x02∗sin\(x1\)\)\+\(−4\.30440238524879e−06∗x03∗cos\(x1\)\)\+\(−2\.158359661264577e−05∗x2∗x0∗cos\(x1\)\)\+\(0\.00011240049424943671∗x0\)\+\(−0\.0001459955779485213∗x0∗sin\(x1\)2\)\+\(1\.2061925694746197e−05∗x02∗cos\(x1\)\)\+\(−3\.152838841683395e−05∗x0∗sin\(x1\)∗cos\(x1\)\)\+2\.7038601280307466e−05\+\(0\.9999723594741682∗x0∗sin\(x1\)\)\(0\.00013369546648802163\*x0\*cos\(x1\)\)\+\(\-6\.586738720708613e\-06\*x0^\{2\}\*sin\(x1\)\)\+\(\-4\.30440238524879e\-06\*x0^\{3\}\*cos\(x1\)\)\+\(\-2\.158359661264577e\-05\*x2\*x0\*cos\(x1\)\)\+\(0\.00011240049424943671\*x0\)\+\(\-0\.0001459955779485213\*x0\*sin\(x1\)^\{2\}\)\+\(1\.2061925694746197e\-05\*x0^\{2\}\*cos\(x1\)\)\+\(\-3\.152838841683395e\-05\*x0\*sin\(x1\)\*cos\(x1\)\)\+2\.7038601280307466e\-05\+\(0\.9999723594741682\*x0\*sin\(x1\)\)
### C\.2Quantitative results

To quantitatively assess the fidelity of the discovered models, we employ the Normalized Mean Squared Error \(NMSE\) metric, evaluating the deviation between the ground truth dynamics and the identified governing equations for each state variable dimension\. TableLABEL:tab:nmse\_appendixpresents these NMSE values across all benchmark problems \(ID 1\-8\) for both DoLQ and baseline methods\. The table is structured to show four evaluation metrics for each dimension: ID NMSE \(Residual\) and ID NMSE \(Integral\) measure performance on the training domain, while ID\-Ext NMSE \(Residual\) and ID\-Ext NMSE \(Integral\) assess generalization to extended temporal regimes\. Residual NMSE evaluates point\-wise derivative matching, whereas Integral NMSE measures trajectory prediction accuracy through numerical integration from initial conditions\. This comprehensive quantitative comparison enables systematic analysis of both fitting quality and extrapolation capability across different problem complexities and functional forms\.

Letx˙^j,i\\hat\{\\dot\{x\}\}\_\{j,i\}denote the derivative predicted by the discovered equation at time indexii, and letx~j,i\\tilde\{x\}\_\{j,i\}denote the trajectory obtained by numerically integrating the discovered ODE from the initial condition\. The two scale\-normalized metrics are defined as follows:

ResidualNMSEj=∑i\(x˙j,i−x˙^j,i\)2∑ix˙j,i2\+ϵ\\mathrm\{Residual\\ NMSE\}\_\{j\}=\\frac\{\\sum\_\{i\}\(\\dot\{x\}\_\{j,i\}\-\\hat\{\\dot\{x\}\}\_\{j,i\}\)^\{2\}\}\{\\sum\_\{i\}\\dot\{x\}\_\{j,i\}^\{2\}\+\\epsilon\}IntegralNMSEj=∑i\(xj,i−x~j,i\)2∑ixj,i2\+ϵ\\mathrm\{Integral\\ NMSE\}\_\{j\}=\\frac\{\\sum\_\{i\}\(x\_\{j,i\}\-\\tilde\{x\}\_\{j,i\}\)^\{2\}\}\{\\sum\_\{i\}x\_\{j,i\}^\{2\}\+\\epsilon\}Residual MSE in Eq\. \([2](https://arxiv.org/html/2605.07323#S3.E2)\) is used inside the search loop for parameter optimization and term ablation, whereas residual and integral NMSE are used only for final scale\-normalized evaluation\.

Table 12:NMSE comparison for each dimension\.IDMethodDimID NMSE \(Residual\)ID NMSE \(Integral\)ID\-Ext NMSE \(Residual\)ID\-Ext NMSE \(Integral\)ValueValueValueValue1DoLQ\(ours\)x0x\_\{0\}2\.13×10−82\.13\\times 10^\{\-8\}3\.45×10−93\.45\\times 10^\{\-9\}1\.04×10−81\.04\\times 10^\{\-8\}2\.36×10−92\.36\\times 10^\{\-9\}x1x\_\{1\}1\.28×10−81\.28\\times 10^\{\-8\}4\.52×10−94\.52\\times 10^\{\-9\}2\.07×10−82\.07\\times 10^\{\-8\}5\.03×10−85\.03\\times 10^\{\-8\}EDLx0x\_\{0\}2\.39×10−82\.39\\times 10^\{\-8\}—9\.16×10−99\.16\\times 10^\{\-9\}—x1x\_\{1\}3\.71×1013\.71\\times 10^\{1\}—2\.80×1012\.80\\times 10^\{1\}—ICSRx0x\_\{0\}2\.92×10−82\.92\\times 10^\{\-8\}1\.89×10−91\.89\\times 10^\{\-9\}1\.22×10−81\.22\\times 10^\{\-8\}1\.32×10−91\.32\\times 10^\{\-9\}x1x\_\{1\}2\.69×10−82\.69\\times 10^\{\-8\}7\.36×10−97\.36\\times 10^\{\-9\}1\.62×10−81\.62\\times 10^\{\-8\}2\.24×10−82\.24\\times 10^\{\-8\}LASRx0x\_\{0\}2\.22×10−82\.22\\times 10^\{\-8\}5\.62×10−95\.62\\times 10^\{\-9\}8\.56×10−98\.56\\times 10^\{\-9\}3\.42×10−93\.42\\times 10^\{\-9\}x1x\_\{1\}1\.34×10−81\.34\\times 10^\{\-8\}8\.77×10−98\.77\\times 10^\{\-9\}6\.64×10−96\.64\\times 10^\{\-9\}1\.38×10−81\.38\\times 10^\{\-8\}LLM\-SRx0x\_\{0\}2\.12×10−82\.12\\times 10^\{\-8\}3\.70×10−93\.70\\times 10^\{\-9\}8\.59×10−98\.59\\times 10^\{\-9\}2\.21×10−92\.21\\times 10^\{\-9\}x1x\_\{1\}1\.19×10−81\.19\\times 10^\{\-8\}4\.58×10−94\.58\\times 10^\{\-9\}2\.37×10−82\.37\\times 10^\{\-8\}5\.36×10−85\.36\\times 10^\{\-8\}2DoLQ\(ours\)x0x\_\{0\}4\.93×10−94\.93\\times 10^\{\-9\}1\.42×10−81\.42\\times 10^\{\-8\}5\.83×10−95\.83\\times 10^\{\-9\}2\.19×10−82\.19\\times 10^\{\-8\}x1x\_\{1\}3\.51×10−83\.51\\times 10^\{\-8\}4\.69×10−84\.69\\times 10^\{\-8\}3\.64×10−83\.64\\times 10^\{\-8\}5\.06×10−85\.06\\times 10^\{\-8\}EDLx0x\_\{0\}1\.43×1011\.43\\times 10^\{1\}1\.02×1001\.02\\times 10^\{0\}1\.19×1011\.19\\times 10^\{1\}1\.43×1001\.43\\times 10^\{0\}x1x\_\{1\}1\.64×1021\.64\\times 10^\{2\}—3\.19×1023\.19\\times 10^\{2\}—ICSRx0x\_\{0\}1\.21×10−81\.21\\times 10^\{\-8\}0\.0001101\.51×10−81\.51\\times 10^\{\-8\}0\.0618x1x\_\{1\}0\.0052940\.0007630\.07720\.0222LASRx0x\_\{0\}0\.11270\.06530\.14780\.0721x1x\_\{1\}5\.61×10−85\.61\\times 10^\{\-8\}0\.11165\.33×10−85\.33\\times 10^\{\-8\}0\.1431LLM\-SRx0x\_\{0\}1\.41×1001\.41\\times 10^\{0\}9\.32×1009\.32\\times 10^\{0\}1\.15×1001\.15\\times 10^\{0\}1\.80×1011\.80\\times 10^\{1\}x1x\_\{1\}1\.19×1001\.19\\times 10^\{0\}8\.20×1008\.20\\times 10^\{0\}1\.12×1001\.12\\times 10^\{0\}1\.59×1011\.59\\times 10^\{1\}3DoLQ\(ours\)x0x\_\{0\}4\.22×10−84\.22\\times 10^\{\-8\}3\.79×10−83\.79\\times 10^\{\-8\}3\.65×10−83\.65\\times 10^\{\-8\}1\.16×10−71\.16\\times 10^\{\-7\}x1x\_\{1\}3\.52×10−93\.52\\times 10^\{\-9\}1\.58×10−81\.58\\times 10^\{\-8\}1\.06×10−81\.06\\times 10^\{\-8\}1\.14×10−71\.14\\times 10^\{\-7\}EDLx0x\_\{0\}2\.14×1052\.14\\times 10^\{5\}—2\.65×1052\.65\\times 10^\{5\}—x1x\_\{1\}4\.31×1024\.31\\times 10^\{2\}—4\.15×1024\.15\\times 10^\{2\}—ICSRx0x\_\{0\}5\.11×10−85\.11\\times 10^\{\-8\}4\.03×10−54\.03\\times 10^\{\-5\}5\.11×10−85\.11\\times 10^\{\-8\}0\.000210x1x\_\{1\}0\.0010318\.09×10−58\.09\\times 10^\{\-5\}0\.0007560\.000303LASRx0x\_\{0\}0\.06881\.66×1011\.66\\times 10^\{1\}0\.07583\.08×1013\.08\\times 10^\{1\}x1x\_\{1\}0\.51296\.14×1006\.14\\times 10^\{0\}0\.50476\.62×1016\.62\\times 10^\{1\}LLM\-SRx0x\_\{0\}0\.0052550\.0049340\.0061840\.007153x1x\_\{1\}8\.93×10−58\.93\\times 10^\{\-5\}0\.0052359\.19×10−59\.19\\times 10^\{\-5\}0\.01164DoLQ\(ours\)x0x\_\{0\}1\.92×10−61\.92\\times 10^\{\-6\}3\.27×10−93\.27\\times 10^\{\-9\}0\.26750\.000975x1x\_\{1\}9\.95×10−89\.95\\times 10^\{\-8\}1\.71×10−91\.71\\times 10^\{\-9\}0\.0013865\.48×10−55\.48\\times 10^\{\-5\}EDLx0x\_\{0\}3\.05×1033\.05\\times 10^\{3\}2\.94×1012\.94\\times 10^\{1\}1\.03×1031\.03\\times 10^\{3\}5\.83×1005\.83\\times 10^\{0\}x1x\_\{1\}7\.11×1017\.11\\times 10^\{1\}9\.47×1009\.47\\times 10^\{0\}1\.92×1001\.92\\times 10^\{0\}2\.35×1002\.35\\times 10^\{0\}ICSRx0x\_\{0\}7\.54×10−67\.54\\times 10^\{\-6\}1\.35×10−61\.35\\times 10^\{\-6\}0\.01200\.001100x1x\_\{1\}0\.0001541\.88×10−51\.88\\times 10^\{\-5\}0\.02670\.007869LASRx0x\_\{0\}0\.0001596\.88×10−76\.88\\times 10^\{\-7\}1\.05×1001\.05\\times 10^\{0\}0\.007504x1x\_\{1\}1\.19×10−71\.19\\times 10^\{\-7\}2\.29×10−82\.29\\times 10^\{\-8\}6\.71×10−86\.71\\times 10^\{\-8\}0\.000871LLM\-SRx0x\_\{0\}6\.08×10−66\.08\\times 10^\{\-6\}2\.45×10−82\.45\\times 10^\{\-8\}0\.57900\.004555x1x\_\{1\}7\.52×10−77\.52\\times 10^\{\-7\}2\.13×10−82\.13\\times 10^\{\-8\}0\.0078900\.0007845DoLQ\(ours\)x0x\_\{0\}2\.43×10−72\.43\\times 10^\{\-7\}5\.61×10−105\.61\\times 10^\{\-10\}1\.27×10−71\.27\\times 10^\{\-7\}2\.87×10−92\.87\\times 10^\{\-9\}x1x\_\{1\}4\.85×10−74\.85\\times 10^\{\-7\}6\.86×10−86\.86\\times 10^\{\-8\}4\.56×10−74\.56\\times 10^\{\-7\}1\.97×10−71\.97\\times 10^\{\-7\}EDLx0x\_\{0\}5\.86×1005\.86\\times 10^\{0\}7\.16×1007\.16\\times 10^\{0\}2\.68×1002\.68\\times 10^\{0\}2\.61×1012\.61\\times 10^\{1\}x1x\_\{1\}2\.08×1012\.08\\times 10^\{1\}6\.26×1026\.26\\times 10^\{2\}1\.56×1011\.56\\times 10^\{1\}8\.86×1028\.86\\times 10^\{2\}ICSRx0x\_\{0\}0\.0026104\.30×10−54\.30\\times 10^\{\-5\}0\.23130\.0435x1x\_\{1\}2\.58×10−62\.58\\times 10^\{\-6\}1\.89×10−51\.89\\times 10^\{\-5\}5\.54×10−55\.54\\times 10^\{\-5\}0\.0310LASRx0x\_\{0\}0\.0003461\.33×10−51\.33\\times 10^\{\-5\}0\.01170\.001680x1x\_\{1\}0\.0090380\.01140\.03390\.0892LLM\-SRx0x\_\{0\}2\.48×10−72\.48\\times 10^\{\-7\}6\.19×10−106\.19\\times 10^\{\-10\}1\.17×10−71\.17\\times 10^\{\-7\}4\.77×10−84\.77\\times 10^\{\-8\}x1x\_\{1\}4\.91×10−74\.91\\times 10^\{\-7\}6\.40×10−86\.40\\times 10^\{\-8\}0\.0002190\.0001036DoLQ\(ours\)x0x\_\{0\}1\.10×10−61\.10\\times 10^\{\-6\}3\.28×10−83\.28\\times 10^\{\-8\}2\.33×10−62\.33\\times 10^\{\-6\}7\.31×10−87\.31\\times 10^\{\-8\}x1x\_\{1\}1\.07×10−61\.07\\times 10^\{\-6\}3\.71×10−83\.71\\times 10^\{\-8\}7\.50×10−67\.50\\times 10^\{\-6\}4\.55×10−64\.55\\times 10^\{\-6\}EDLx0x\_\{0\}1\.11×10−51\.11\\times 10^\{\-5\}—1\.19×10−51\.19\\times 10^\{\-5\}—x1x\_\{1\}5\.01×1015\.01\\times 10^\{1\}—7\.15×1017\.15\\times 10^\{1\}—ICSRx0x\_\{0\}0\.0002981\.76×10−51\.76\\times 10^\{\-5\}0\.0048467\.17×10−57\.17\\times 10^\{\-5\}x1x\_\{1\}0\.0004263\.68×10−53\.68\\times 10^\{\-5\}0\.01700\.006141LASRx0x\_\{0\}0\.03660\.01610\.02810\.0158x1x\_\{1\}0\.01850\.01800\.02620\.0620LLM\-SRx0x\_\{0\}7\.03×10−67\.03\\times 10^\{\-6\}6\.29×10−76\.29\\times 10^\{\-7\}5\.25×10−65\.25\\times 10^\{\-6\}7\.12×10−57\.12\\times 10^\{\-5\}x1x\_\{1\}2\.47×10−62\.47\\times 10^\{\-6\}4\.72×10−74\.72\\times 10^\{\-7\}0\.0001660\.0002287DoLQ\(ours\)x0x\_\{0\}0\.0004673\.62×10−53\.62\\times 10^\{\-5\}6\.53×1006\.53\\times 10^\{0\}0\.3966x1x\_\{1\}0\.0002390\.0001253\.70×1023\.70\\times 10^\{2\}0\.5569EDLx0x\_\{0\}3\.30×1053\.30\\times 10^\{5\}—1\.56×1071\.56\\times 10^\{7\}—x1x\_\{1\}3\.93×1013\.93\\times 10^\{1\}—2\.10×1042\.10\\times 10^\{4\}—ICSRx0x\_\{0\}0\.0005102\.55×10−52\.55\\times 10^\{\-5\}1\.30×1011\.30\\times 10^\{1\}0\.4600x1x\_\{1\}0\.0005100\.0001161\.30×1011\.30\\times 10^\{1\}1\.01×1001\.01\\times 10^\{0\}LASRx0x\_\{0\}1\.20×10−71\.20\\times 10^\{\-7\}3\.14×10−83\.14\\times 10^\{\-8\}4\.15×10−74\.15\\times 10^\{\-7\}1\.79×10−81\.79\\times 10^\{\-8\}x1x\_\{1\}1\.20×10−71\.20\\times 10^\{\-7\}1\.43×10−71\.43\\times 10^\{\-7\}4\.15×10−74\.15\\times 10^\{\-7\}3\.93×10−83\.93\\times 10^\{\-8\}LLM\-SRx0x\_\{0\}9\.64×10−69\.64\\times 10^\{\-6\}8\.79×10−88\.79\\times 10^\{\-8\}1\.67×1031\.67\\times 10^\{3\}—x1x\_\{1\}9\.78×10−69\.78\\times 10^\{\-6\}3\.77×10−73\.77\\times 10^\{\-7\}5\.84×1035\.84\\times 10^\{3\}—8DoLQ\(ours\)x0x\_\{0\}7\.23×10−87\.23\\times 10^\{\-8\}1\.10×10−61\.10\\times 10^\{\-6\}6\.82×10−86\.82\\times 10^\{\-8\}3\.94×10−63\.94\\times 10^\{\-6\}x1x\_\{1\}4\.78×10−64\.78\\times 10^\{\-6\}2\.01×10−72\.01\\times 10^\{\-7\}3\.20×10−63\.20\\times 10^\{\-6\}3\.38×10−73\.38\\times 10^\{\-7\}x2x\_\{2\}1\.23×10−81\.23\\times 10^\{\-8\}3\.40×10−63\.40\\times 10^\{\-6\}1\.68×10−81\.68\\times 10^\{\-8\}6\.47×10−66\.47\\times 10^\{\-6\}x3x\_\{3\}9\.38×10−99\.38\\times 10^\{\-9\}1\.43×10−61\.43\\times 10^\{\-6\}3\.13×10−83\.13\\times 10^\{\-8\}1\.01×10−51\.01\\times 10^\{\-5\}EDLx0x\_\{0\}2\.69×1032\.69\\times 10^\{3\}—2\.60×1032\.60\\times 10^\{3\}—x1x\_\{1\}5\.67×1045\.67\\times 10^\{4\}—5\.20×1045\.20\\times 10^\{4\}—x2x\_\{2\}1\.61×10−81\.61\\times 10^\{\-8\}—2\.91×10−82\.91\\times 10^\{\-8\}—x3x\_\{3\}1\.08×10−81\.08\\times 10^\{\-8\}—3\.22×10−83\.22\\times 10^\{\-8\}—ICSRx0x\_\{0\}0\.0006781\.01×1001\.01\\times 10^\{0\}0\.0009452\.85×1002\.85\\times 10^\{0\}x1x\_\{1\}0\.10141\.14×1001\.14\\times 10^\{0\}1\.02×1001\.02\\times 10^\{0\}2\.74×1002\.74\\times 10^\{0\}x2x\_\{2\}3\.72×10−53\.72\\times 10^\{\-5\}0\.16056\.03×10−56\.03\\times 10^\{\-5\}0\.4885x3x\_\{3\}1\.52×10−81\.52\\times 10^\{\-8\}2\.32×1002\.32\\times 10^\{0\}3\.65×10−83\.65\\times 10^\{\-8\}6\.67×1006\.67\\times 10^\{0\}LASRx0x\_\{0\}0\.02644\.24×1004\.24\\times 10^\{0\}0\.02551\.64×1021\.64\\times 10^\{2\}x1x\_\{1\}0\.07351\.31×1001\.31\\times 10^\{0\}1\.95×1001\.95\\times 10^\{0\}3\.39×1003\.39\\times 10^\{0\}x2x\_\{2\}1\.84×10−81\.84\\times 10^\{\-8\}1\.84×1001\.84\\times 10^\{0\}2\.05×10−82\.05\\times 10^\{\-8\}0\.8254x3x\_\{3\}1\.51×10−81\.51\\times 10^\{\-8\}6\.70×1006\.70\\times 10^\{0\}3\.40×10−83\.40\\times 10^\{\-8\}3\.85×1033\.85\\times 10^\{3\}LLM\-SRx0x\_\{0\}6\.89×10−86\.89\\times 10^\{\-8\}7\.58×10−77\.58\\times 10^\{\-7\}7\.08×10−87\.08\\times 10^\{\-8\}1\.01×10−51\.01\\times 10^\{\-5\}x1x\_\{1\}3\.89×10−63\.89\\times 10^\{\-6\}1\.92×10−71\.92\\times 10^\{\-7\}3\.24×10−63\.24\\times 10^\{\-6\}5\.18×10−75\.18\\times 10^\{\-7\}x2x\_\{2\}1\.29×10−81\.29\\times 10^\{\-8\}2\.81×10−62\.81\\times 10^\{\-6\}1\.71×10−81\.71\\times 10^\{\-8\}2\.65×10−62\.65\\times 10^\{\-6\}x3x\_\{3\}9\.91×10−99\.91\\times 10^\{\-9\}7\.05×10−77\.05\\times 10^\{\-7\}8\.81×10−88\.81\\times 10^\{\-8\}5\.76×10−65\.76\\times 10^\{\-6\}
### C\.3Scores

In this section, we summarize the comprehensive evaluation scores for each benchmark problem\. Figure[4](https://arxiv.org/html/2605.07323#S4.F4)and Table[13](https://arxiv.org/html/2605.07323#A3.T13)present a binary success/failure assessment across all eight problems \(P1\-P8, corresponding to benchmark IDs 1\-8 from Table[3](https://arxiv.org/html/2605.07323#A1.T3)\) for each method\. The evaluation is based on two distinct criteria: the NMSE test row indicates whether the normalized mean squared error satisfies the designated convergence threshold \(integral NMSE<10−3<10^\{\-3\}across all dimensions\), and the Term test row verifies whether the discovered equation accurately recovers the correct symbolic terms of the ground truth dynamics after excluding terms with negligible impact\. A blue checkmark \(✓\) denotes success for that problem, while an empty cell indicates failure\. The rightmost Total column sums the number of successful problems for each method and evaluation criterion, providing an aggregate view of discovery performance across the benchmark suite\.

As prior ODE\-discovery baselines do not report structural recovery under a unified false\-positive/false\-negative \(FP/FN\) protocol, precision/recall metrics are omitted from the main comparison\. Instead, the Term test evaluates whether the recovered equation matches the ground\-truth term set after excluding negligible terms, which serves as a strict structure\-recovery criterion on benchmarks with known governing equations\.

Table 13:Detailed scores for each problem by evaluation type and method\.
### C\.4Token usage details

This section presents a detailed comparison of token usage between the proposed framework and baseline models\. The token consumption data in the following table is reported in thousands \(K\) of tokens, distinguishing between input tokens \(In\) and output tokens \(Out\)\. For each benchmark problem \(ID 1\-8\), we track two metrics: “Found” columns show the cumulative tokens consumed up to the point when the method successfully discovered a reasonable equation \(if applicable\), while “Total” columns report the tokens used throughout the entire experimental run until completion or timeout\. This distinction enables analysis of both the efficiency of successful discovery and the overall computational cost\. All values are reported directly from the experiment logs collected during our experiments\. Note that for DoLQ, the reported token counts represent the combined usage of both the Sampler Agent and the Scientist Agent, and no reasoning tokens were generated during our experiments\.

Table 14:Token usage until discovery\.IDMethodFound \(K\)Total \(K\)InOutInOut1DoLQ\(ours\)20\.210\.122696\.8EDL4\.996\.96551439ICSR24324\.824324\.8LASR357387357387LLM\-SR3384433384432DoLQ\(ours\)69\.640\.0289156EDL14942\.5465118ICSR26239\.026239\.0LASR426426426426LLM\-SR5336027038163DoLQ\(ours\)188110282166EDL129302296296ICSR30944\.130944\.1LASR483447483447LLM\-SR6526928138134DoLQ\(ours\)44\.626\.4201109EDL1\.492\.97193239ICSR26035\.126035\.1LASR401404401404LLM\-SR781775781775
IDMethodFound \(K\)Total \(K\)InOutInOut5DoLQ\(ours\)20\.412\.4223106EDL0\.610\.82186222ICSR25631\.125631\.1LASR368420368420LLM\-SR6247846247846DoLQ\(ours\)10\.38\.08239136EDL0\.410\.53190233ICSR25529\.425529\.4LASR345394345394LLM\-SR7307017307017DoLQ\(ours\)18389\.4233114EDL26\.929\.8279287ICSR25633\.925633\.9LASR383407383407LLM\-SR7588547588548DoLQ\(ours\)17368\.7784264EDL73\.622\.9989265ICSR77272\.977272\.9LASR809930809930LLM\-SR1238137918521997

## Appendix DAnalysis of optimizer adoption

To further analyze the behavior of the hybrid optimizer, we investigate the adoption frequency of BFGS and differential evolution across different problem types\. As shown in Figure[11](https://arxiv.org/html/2605.07323#A4.F11), the selection rate of these methods varies significantly depending on the mathematical complexity of the system\. For systems with simpler functional forms, BFGS is frequently sufficient to find the optimal parameters due to the relatively smooth loss landscapes\. Conversely, for systems with complex functional forms \(e\.g\., Glider\), differential evolution is adopted more frequently\. This demonstrates that the global search capability of differential evolution is critical for escaping local minima and accurately identifying parameters in rugged loss landscapes where local gradient\-based methods often fail\.

Moreover, the left panel of Figure[11](https://arxiv.org/html/2605.07323#A4.F11)reveals a critical advantage of the hybrid optimizer\. When examiningx˙0\\dot\{x\}\_\{0\}with the ground truth skeleton structure, we observe that differential evolution achieves lower MSE compared to using BFGS alone\. This is particularly significant because assigning better scores to correct skeleton structures provides a crucial advantage during the symbolic regression search process, enabling more effective exploration and convergence to the true underlying dynamics\.

![Refer to caption](https://arxiv.org/html/2605.07323v1/x9.png)Figure 11:Adoption frequency of BFGS and differential evolution across different ODE systems\. Differential evolution is more frequently selected for systems with intricate functional forms where the loss landscape is more rugged, while BFGS is often sufficient for simpler systems\.
## Appendix EQuantitative results under shifted initial conditions

To evaluate robustness against initial\-state variations, the discovered equations are re\-assessed using an alternative initial condition \(ic\_1\), maintaining the original ID and ID\-Ext temporal regimes and NMSE metrics\. The results are summarized in Table[15](https://arxiv.org/html/2605.07323#A5.T15)\.

Table 15:Quantitative performance comparison measured by NMSE under shifted initial conditions\. We report dimension\-averaged results for the same discovered equations, re\-evaluated from an alternative initial condition \(ic\_1\) rather than the default initial state, demonstrating robustness and generalization across initial conditions\. Bold with underline indicates the best, bold indicates the second best\. NaN indicates solver failure or numerical instability\.
## Appendix FAdditional comparison on the 2D dimensionless Glider system \(ID 2\)

A detailed analysis of the 2D dimensionless Glider system \(ID 2\) is provided to supplement the main evaluation\. Table[16](https://arxiv.org/html/2605.07323#A6.T16)reports per\-dimension NMSE values, Table[17](https://arxiv.org/html/2605.07323#A6.T17)summarizes token consumption, and Table[18](https://arxiv.org/html/2605.07323#A6.T18)lists the discovering equations across varying LLM backbones, facilitating a comprehensive comparison of accuracy, efficiency, and symbolic structure\.

Table 16:Per\-dimension NMSE on the 2D dimensionless Glider system \(ID 2\) when DoLQ is executed with different LLM models\. All settings are identical to Table[5](https://arxiv.org/html/2605.07323#A2.T5), with only the model changed\. The table reports residual and integral NMSE on the ID and ID\-Ext evaluation regimes for each state dimension\.Table 17:Token usage on the 2D dimensionless Glider system \(ID 2\) when DoLQ is executed with different LLM models\. All settings are identical to Table[5](https://arxiv.org/html/2605.07323#A2.T5), with only the model changed\. We report the total input and output tokens consumed during symbolic ODE discovery\.Table 18:Final equations discovered on the 2D dimensionless Glider system \(ID 2\) when DoLQ is executed with different LLM models\. All settings are identical to Table[5](https://arxiv.org/html/2605.07323#A2.T5), with only the model changed\. For each model, we list the identified governing equations forx˙0\\dot\{x\}\_\{0\}andx˙1\\dot\{x\}\_\{1\}\.
## Appendix GRobustness analyses

### G\.1Sigma\-noise robustness

This section reports a sigma\-noise robustness comparison on the 2D dimensionless Glider system \(ID 2\)\. DoLQ stores one system\-level run per sigma, whereas ICSR, LASR, and LLM\-SR store per\-dimension outputs\. To ensure a fair comparison, all final discovered equations are re\-evaluated under a common protocol\. Following the paper definition, the ID regime corresponds to the trajectory overtstart∼tendt\_\{\\text\{start\}\}\\sim t\_\{\\text\{end\}\}, while the ID\-Ext regime corresponds to the full trajectory overtstart∼tood\_endt\_\{\\text\{start\}\}\\sim t\_\{\\text\{ood\\\_end\}\}\. For each noise levelσ\\sigmaand state dimension, NMSE is averaged across initial conditions\. Tables[19](https://arxiv.org/html/2605.07323#A7.T19),[20](https://arxiv.org/html/2605.07323#A7.T20), and[21](https://arxiv.org/html/2605.07323#A7.T21)jointly summarize the averaged metrics, per\-dimension metrics, and final discovered equations under these noise levels\.

The results indicate that DoLQ does not intrinsically filter observation noise; its absolute accuracy deteriorates asσ\\sigmaincreases, despite remaining competitive with baseline methods\. Robust equation discovery under noisy trajectories therefore remains a clear limitation and an important direction for future work\.

Table 19:Average NMSE measured across different sigma\-noise levels for the dimensionless Glider system \(ID 2\)\. DoLQ stores system\-level runs, whereas ICSR, LASR, EDL, and LLM\-SR store method\-specific outputs; all values are re\-evaluated under the same ID and ID\-Ext regimes\. Cases where optimization exceeded the 5\-minute time limit or the loss became∞\\inftyare reported as NaN\.Table 20:Per\-dimension NMSE measured across different sigma\-noise levels for the dimensionless Glider system \(ID 2\)\. All values are recomputed from the final discovered equations using the same ID and ID\-Ext evaluation protocol\. Cases where optimization exceeded the 5\-minute time limit or the loss became∞\\inftyare reported as NaN\.Table 21:Final discovered equations under different noise levels for the dimensionless Glider system \(ID 2\)\. A dash \(\-\) indicates that no valid equation was available for reporting\.Different LLM backbones can converge to slightly different symbolic forms due to disparities in mathematical inductive bias and because early term proposals continually alter subsequent search trajectories in iterative mechanisms like DoLQ\. Several distinct LLM backbones—Gemini 2\.5 Flash\-Lite, GPT\-4o mini, DeepSeek\-V3\.2, and Grok 4\.1 Fast—were evaluated to assess DoLQ’s robustness across providers\. Empirically, Gemini 2\.5 Flash\-Lite, DeepSeek\-V3\.2, and Grok 4\.1 Fast successfully recovered the Glider system dynamics with only minor discrepancies in extraneous terms or coefficients\. Conversely, GPT\-4o mini frequently yielded qualitatively mismatched structures and inferior integral NMSE\. For established benchmarks, discovery performance is evaluated primarily by NMSE and secondarily by term\-level agreement with the ground truth; for entirely unknown systems, deeper scientific validation would be essential to confirm the physical validity of the inferred structures\.

### G\.2Robustness to noisy descriptions

To assess sensitivity to description noise, DoLQ was evaluated on the Glider\(2D\) system with modified natural\-language descriptions but fixed numerical trajectories\. When provided with a partially mismatched description, DoLQ retained certain meaningful components, such assin⁡\(x1\)\\sin\(x\_\{1\}\)and1/x01/x\_\{0\}, yet failed to completely recover the ground\-truth equations; average residual NMSE degraded to1\.19×10−21\.19\\times 10^\{\-2\}\(ID\) and7\.37×10−27\.37\\times 10^\{\-2\}\(ID\-Ext\)\. Under a fully mismatched description, the core dynamics ofx1x\_\{1\}\(x0−cos⁡\(x1\)/x0x\_\{0\}\-\\cos\(x\_\{1\}\)/x\_\{0\}\) were completely lost, and the residual NMSE further increased to3\.15×10−23\.15\\times 10^\{\-2\}\(ID\) and1\.83×10−11\.83\\times 10^\{\-1\}\(ID\-Ext\)\. This suggests that DoLQ inherently trusts the provided qualitative description and lacks an intrinsic mechanism to correct erroneous prior information\.

## Appendix HReasoning\-trace analysis

### H\.1Hallucination audit of reasoning traces

We audited one representative Glider\(2D\) run at the log level and found that the most common hallucination pattern is premature mechanistic storytelling\. Representative exploratory terms includedx0cos⁡\(x1\)x\_\{0\}\\cos\(x\_\{1\}\),log⁡\(x0\)\\log\(x\_\{0\}\), andtanh⁡\(x0\)\\tanh\(x\_\{0\}\)\. These terms were sometimes assigned confident physical narratives before sufficient numerical evidence had accumulated\. This effect appeared most clearly when the prompts required explicit semantic justification for exploratory terms whose physical meaning was only weakly grounded by the system description\. In our audit, these narratives affected search direction but not final acceptance, because every candidate still had to pass coefficient optimization, residual evaluation, and iterative comparison against competing candidates\. The over\-interpreted terms were not retained in the final discovered equation\.

## Appendix ITime complexity analysis

The computational cost of DoLQ is analyzed by decoupling LLM token processing from numerical parameter optimization\. As these operations utilize fundamentally different resources, merging them into a single asymptotic bound would obfuscate the system’s true computational load\. Table[22](https://arxiv.org/html/2605.07323#A9.T22)outlines the notation adopted for this complexity analysis\.

Table 22:Notation used in the time complexity analysis of DoLQ\.##### LLM cost\.

At each DoLQ cycle, the Sampler Agent and Scientist Agent are each invoked once\. If their total input and output token counts are denoted byLsampL\_\{\\mathrm\{samp\}\}andLsciL\_\{\\mathrm\{sci\}\}, respectively, then the token\-processing cost per cycle is

O\(Lsamp\+Lsci\)\.O\(L\_\{\\mathrm\{samp\}\}\+L\_\{\\mathrm\{sci\}\}\)\.\(3\)The Scientist Agent can reduce practical runtime by filtering poor structures early, but this affects the average search trajectory rather than the worst\-case asymptotic form of a single LLM call\.

##### Parameter optimization cost\.

For thejj\-th dimension, one evaluation of the residual objective in Eq\. \([2](https://arxiv.org/html/2605.07323#S3.E2)\) overNNtime points costsO\(Nkj\)O\(Nk\_\{j\}\), assuming each symbolic primitive in a candidate term is evaluated in constant time\. Differential evolution therefore incurs

O\(G⋅NP⋅N⋅kj\)\.O\(G\\cdot NP\\cdot N\\cdot k\_\{j\}\)\.\(4\)Using dense BFGS refinement adds repeated residual evaluations together with quasi\-Newton updates overpjp\_\{j\}trainable parameters, yielding

O\(Ij\(Nkj\+pj2\)\)\.O\\left\(I\_\{j\}\(Nk\_\{j\}\+p\_\{j\}^\{2\}\)\\right\)\.\(5\)Since DoLQ evaluates BFGS alone, differential evolution alone, and the hybrid differential\-evolution\-plus\-BFGS strategy before selecting the best result, the constant factor increases, but the asymptotic optimization cost per dimension remains

O\(\(G⋅NP\+Ij\)Nkj\+Ijpj2\)\.O\\left\(\(G\\cdot NP\+I\_\{j\}\)Nk\_\{j\}\+I\_\{j\}p\_\{j\}^\{2\}\\right\)\.\(6\)

##### Scientist quantitative evaluation cost\.

The quantitative evaluation in Algorithm[B\.3](https://arxiv.org/html/2605.07323#A2.SS3)performs one baseline residual evaluation and then one additional residual evaluation per term after temporarily zeroing its coefficient\. For thejj\-th dimension, this requireskj\+1k\_\{j\}\+1residual evaluations, so the total cost is

O\(\(kj\+1\)Nkj\)=O\(Nkj2\)\.O\\left\(\(k\_\{j\}\+1\)Nk\_\{j\}\\right\)=O\(Nk\_\{j\}^\{2\}\)\.\(7\)This term is separate from the LLM reasoning cost and must be counted explicitly in the overall complexity\.

##### Lightweight bookkeeping\.

Function construction, JSON parsing, and keep/hold/remove bookkeeping are linear in the number of proposed terms, i\.e\.,

O\(∑j=1dsyskj\),O\\left\(\\sum\_\{j=1\}^\{d\_\{\\mathrm\{sys\}\}\}k\_\{j\}\\right\),\(8\)which is dominated by parameter optimization and ablation\.

##### Overall complexity\.

For one DoLQ cycle that evaluatesHHcandidate hypotheses, the numerical computation scales as

O\(H∑j=1dsys\[\(G⋅NP\+Ij\)Nkj\+Ijpj2\+Nkj2\]\),O\\left\(H\\sum\_\{j=1\}^\{d\_\{\\mathrm\{sys\}\}\}\\left\[\(G\\cdot NP\+I\_\{j\}\)Nk\_\{j\}\+I\_\{j\}p\_\{j\}^\{2\}\+Nk\_\{j\}^\{2\}\\right\]\\right\),\(9\)while the LLM token\-processing cost scales as

O\(Lsamp\+Lsci\)\.O\(L\_\{\\mathrm\{samp\}\}\+L\_\{\\mathrm\{sci\}\}\)\.\(10\)OverTTsearch cycles, the total complexity becomes

O\(T\(Lsamp\+Lsci\)\)O\\left\(T\(L\_\{\\mathrm\{samp\}\}\+L\_\{\\mathrm\{sci\}\}\)\\right\)\(11\)for token processing and

O\(TH∑j=1dsys\[\(G⋅NP\+Ij\)Nkj\+Ijpj2\+Nkj2\]\)O\\left\(TH\\sum\_\{j=1\}^\{d\_\{\\mathrm\{sys\}\}\}\\left\[\(G\\cdot NP\+I\_\{j\}\)Nk\_\{j\}\+I\_\{j\}p\_\{j\}^\{2\}\+Nk\_\{j\}^\{2\}\\right\]\\right\)\(12\)for numerical computation\.

##### Practical implication\.

The main numerical advantage of DoLQ comes from using residual MSE during optimization\. Unlike integral\-based objectives such as Eq\. \([1](https://arxiv.org/html/2605.07323#S2.E1)\), Eq\. \([2](https://arxiv.org/html/2605.07323#S3.E2)\) does not require solving the ODE at every optimizer step, so the inner loop is reduced to direct residual evaluation on sampled data\. Integral NMSE remains useful for final evaluation, but it is not part of the optimization loop analyzed above\. Similarly, early filtering by the Scientist Agent improves practical efficiency by reducing the average number of poor candidates and unnecessary terms, rather than by changing the worst\-case asymptotic order\.

## Appendix JEfficiency analysis

To ensure an equitable baseline comparison, LLM\-SR’s runtime metrics are aggregated across all target dimensions, reflecting its dimension\-wise execution paradigm compared to DoLQ’s joint system\-level discovery\. Evaluations were performed in identical environments \(Intel Core i7\-12700K, 20 logical cores, 94 GiB RAM\)\. Table[23](https://arxiv.org/html/2605.07323#A10.T23)details the empirical CPU\-time, wall\-clock duration, total elapsed time, and peak memory usage\.

Table 23:CPU, runtime, and peak\-memory comparison of DoLQ and LLM\-SR\. Total CPU time is the summed CPU time, total wall time is the wall\-clock runtime, end\-to\-end elapsed time is the full elapsed time, and peak memory is the maximum resident memory usage in GiB\.Given LLM\-SR’s decoupled execution, direct comparison necessitates aggregating its CPU and temporal metrics while extracting the highest peak memory recorded across per\-dimension runs\. The maximum value accurately captures the instantaneous memory ceiling rather than an artificial sum\. Consequently, while DoLQ does not uniformly outperform baselines in raw CPU computation time, it exhibits systematically lower wall\-clock duration, total elapsed time, and peak memory footprints\.

Floating\-point operations \(FLOPs\) are intentionally omitted because the hybrid pipeline integrates remote LLM inference with local numerical optimization, rendering hardware\-agnostic FLOP estimation both ill\-defined and challenging to reproduce\. Efficiency is thus benchmarked primarily through computation time and memory allocation\.

## Appendix KPrompts and responses

### K\.1Sampler agent prompt and response

We present example prompts and outputs for the Sampler and Scientist agents in Figure[12](https://arxiv.org/html/2605.07323#A11.F12)through Figure[15](https://arxiv.org/html/2605.07323#A11.F15)\. The labeled sections in these figures explicitly correspond to the components illustrated in the framework overview \(Figure[2](https://arxiv.org/html/2605.07323#S1.F2)\), detailing the specific instructions and constraints provided to each agent during the discovery process\.

Sampler Prompt\(1a\)You are a helpful assistant tasked with discovering mathematical term structures for scientific systems\. Complete the ’term\_list’ below, considering the physical meaning and relationships of inputs\.\# System DescriptionGlider flight experiment\. State variables: forward velocity x0, path angle x1, horizontal position x2, vertical altitude x3\. \[\.\.\.\. omitted\.\.\.\.\]\#\#\# SCIENTIST AGENT GUIDANCEThe Scientist agent has analyzed previous experiments and provides the following guidance:\#\#\#\# Accumulated Knowledge \(Theory\)\- The kinematic terms for horizontal and vertical position \(x2\_t and x3\_t\) show strong physical relevance, particularly components involving x0 and trigonometric functions of x1\. The damping term in x0\_t and the velocity\-dependent term in x1\_t demonstrate good physical grounding\. However, constant terms lack clear physical justification\.\(1b\)\#\#\#\# Term\-by\-Term Evaluation \(Previous Attempt Analysis\)Evaluation results for each term\. keep = retain, hold = hold/modify, remove = eliminate:x0\_t:\- C\*\(x0\) :KEEP\- C\*\(np\.sin\(x1\)\) :KEEP\- C\*x0\*\*3 :REMOVEx1\_t:\- C\*\(np\.cos\(x1\)\) :KEEP\- C\*\(x0\) :HOLD\- C\*x1\*\*3 :REMOVEx2\_t:\- C\*\(x0\*np\.cos\(x1\)\) :KEEPx3\_t:\- C\*\(x0\*np\.sin\(x1\)\) :KEEP\- C\*x0\*x1\*x2 :REMOVE\#\#\#\# Removed Terms List \(Ban List\)The following term structures have negatively impacted performance\.Do NOT propose them again:x0\_t: C\*\(9\.81\), C\*\(np\.cos\(x1\)\), C\*\(x0\*x1\)x1\_t: C\*\(1\), C\*\(np\.sin\(x1\)\), C\*\(x0\*x1\), C\*\(x3\)x2\_t: C\*\(9\.81\), C\*\(x1\)x3\_t: C\*\(x0\), C\*\(x1\), C\*\(x3\)\(1c\)Goal:Reflect the Scientist’s insights and guidance in the equation structure\.\[Required Conditions \(Violation Will Cause Errors\)\]1\. You can use: import numpy as np2\. Target System Context: Input variables are x0, x1, x2, x3\.\- This system is 4\-dimensional\- Variables x4 and above do not exist\.3\. Term Format: Propose terms WITHOUT coefficients\. The system will automatically attach trainable parameters\.\- Correct: "x0", "np\.sin\(x0\)", "x0\*x1"\- Incorrect: "params\[0\]\*x0", "C\*x0", "0\.5\*x0"4\. Term Complexity: You MAY use internal constants if they have physical meaning \(e\.g\., frequency, phase\)\.\- Example: "np\.sin\(2\*x0\)" is allowed and encouraged if the factor 2 is significant\.\- Note: The system will still attach an outer trainable parameter \(e\.g\., params\[0\]\*np\.sin\(2\*x0\)\)\.5\. Symbolic Constants: Do NOT use symbolic constants like ’g’, ’k’, ’m’\. Use numerical values\.\- Correct: "9\.81\*x0" \(if g=9\.81 is known\), "np\.pi\*x0"\- Incorrect: "g\*x0" \(will cause NameError\)6\. No duplicates: Equations identical to previous attempts are forbidden\. Structural modifications are required\.7\. Reasoning required: When proposing each term, provide a physical/mathematical reasoning based on the system description \(desc\)\.\[Example \(4D System\)\]x0\_t: \["x0", "x1\*x2", "x3"\]x1\_t: \["x0", "np\.sin\(x1\)"\]x2\_t: \["x0\*x1"\]x3\_t: \["\-9\.81", "x0"\]Figure 12:The Sampler Prompt\. The labeled sections\(1a\)\-\(1c\)correspond to the Sampler components in Figure[2](https://arxiv.org/html/2605.07323#S1.F2):\(1a\)System Description & Insight,\(1b\)Feedback & Constraints, and\(1c\)Action & Instruction\.Sampler LLM Response\{
"ode\_pairs": \[
\{
"x0\_t": \["x0", "np\.sin\(x1\)", "x0\*x0"\],
"x1\_t": \["np\.cos\(x1\)", "x0", "x0\*x0"\],
"x2\_t": \["x0\*np\.cos\(x1\)", "x0"\],
"x3\_t": \["x0\*np\.sin\(x1\)", "\-9\.81", "x2\*x2"\]
\},
\[\.\.\.\. other pairs omitted\.\.\.\.\]
\]
\}Figure 13:Example response from the Sampler LLM showing the structured JSON response containing candidate ODE terms for each dimension\.
### K\.2Scientist agent prompt and response

Figure[14](https://arxiv.org/html/2605.07323#A11.F14)and Figure[15](https://arxiv.org/html/2605.07323#A11.F15)present a representative Scientist\-agent prompt/response pair, showing how accumulated insights, experiment summaries, and evaluation instructions are organized before the model returns term\-level judgments and actions\.

Scientist Prompt\(2a\)You are a senior scientist specializing in ODE discovery\. Your role is to evaluate proposed mathematical terms and provide guidance to improve the term\_list in the next iteration\.Progress:Currently on iteration 50 of 100 totalSystem Description:Glider flight experiment\. State variables: forward velocity x0, path angle x1, horizontal position x2, vertical altitude x3\. \[\.\.\.\. omitted\.\.\.\.\]\(2b\)Accumulated Insights:The kinematic terms for x2\_t and x3\_t show strong physical relevance, particularly components involving x0 and trigonometric functions of x1\. The damping term in x0\_t and the velocity\-dependent term in x1\_t demonstrate good physical grounding\. \[\.\.\.\. omitted\.\.\.\.\]\(2c\)Previous Experiment Results:\[Global Best\] \(x0\_t: 2\.187e\-02, x1\_t: 6\.520e\-01, x2\_t: 2\.846e\-07, x3\_t: 2\.005e\-07\)x0\_t = \-0\.2688\*\(x0\) \- 9\.8336\*\(np\.sin\(x1\)\) \- 0\.0279\*\(x0\*np\.cos\(x1\)\) \+ 0\.4777\[\.\.\.\. other dimensions omitted\.\.\.\.\]Current Attempt:\(x0\_t: 2\.761e\-02, x1\_t: 3\.555e\+00, x2\_t: 2\.846e\-07, x3\_t: 2\.086e\-07\)x0\_t = \-0\.2982\*\(x0\) \- 9\.8333\*\(np\.sin\(x1\)\)\[\.\.\.\. other dimensions omitted\.\.\.\.\]\(2d\)Task:Perform term\-by\-term evaluation\.For all terms in the current attempt, evaluate the following 2 aspects:1\.Semantic Quality:\-good: Clearly aligns with the physical/mathematical meaning \(Max 3 per function\)\-neutral: Has some relevance but not essential\-bad: Unrelated to or contradicts the system2\.Reasoning:Explain your evaluation in 1\-2 sentences\. Focus on physical meaning\.Important Notes:\- Use coefficient analysis: Terms with coefficients near 0 are removal candidates\.Figure 14:The Scientist Prompt\. The labeled sections\(2a\)\-\(2d\)correspond to the Scientist components in Figure[2](https://arxiv.org/html/2605.07323#S1.F2):\(2a\)Role & Description,\(2b\)Accumulated Insights,\(2c\)Experiment Results, and\(2d\)Evaluation Task\.Scientist LLM Response\{
"x0\_t": \[
\{"term": "C\*\(x0\)", "semantic\_quality": "good", "action": "keep"\},
\{"term": "C\*\(np\.sin\(x1\)\)", "semantic\_quality": "good", "action": "keep"\},
\{"term": "C\*\(x0\*x0\)", "semantic\_quality": "neutral", "action": "hold"\}
\],
\[\.\.\.\. x1\_t, x2\_t, x3\_t omitted\.\.\.\.\]
\}Figure 15:Example response from the Scientist LLM showing term\-by\-term evaluation with semantic quality assessment and action recommendations\.
### K\.3Initial sampler prompt at iteration 1

Figure[16](https://arxiv.org/html/2605.07323#A11.F16)shows the initialization\-time Sampler prompt used before any prior evaluations are available, making explicit what context is given when accumulated knowledge, term\-level feedback, and removed\-term history are still empty\.

Initial Sampler Prompt \(Iteration 1\)You are a helpful assistant tasked with discovering mathematical term structures for scientific systems\. Complete the term\_list below, considering physical meaning and relationships of inputs\.\# System Description
The glider is viewed as an idealized system whose motion is expected to arise from the interplay of gravity and aerodynamic forces, with speed and flight path angle evolving from assumed initial conditions\.\#\#\# SCIENTIST AGENT GUIDANCE
The Scientist agent has analyzed previous experiments and provides the following guidance:\#\#\#\# Accumulated Knowledge \(Theory\)None\. This is the first iteration, so no accumulated insight had been produced yet\.\#\#\#\# Term\-by\-Term Evaluation \(Previous Attempt Analysis\)Evaluation results for each term\. keep = retain, hold = hold/modify, remove = eliminate:None\. There was no previous attempt because this prompt was given before any sampler output had been evaluated\.\#\#\#\# Removed Terms List \(Ban List\)The following term structures have negatively impacted performance\.Do NOT propose them again:None\. At iteration 1, no term skeleton had yet been removed\.Goal:Reflect the Scientist’s insights and guidance in the equation structure\.\[Required Conditions \(Violation Will Cause Errors\)\]
1\. You can use: import numpy as np
2\. Target System Context: Input variables are x0, x1\.
\-\- This system is 2\-dimensional
\-\- Variables x2 and above do not exist\.
3\. Term Format: Propose terms WITHOUT coefficients\. The system will automatically attach trainable parameters\.
\-\- Correct: "x0", "np\.sin\(x0\)", "x0\*x1"
\-\- Incorrect: "params\[0\]\*x0", "C\*x0", "0\.5\*x0"
4\. Term Complexity: You MAY use internal constants if they have physical meaning \(e\.g\., frequency, phase\)\.
\-\- Example: "np\.sin\(2\*x0\)" is allowed and encouraged if the factor 2 is significant\.
\-\- Note: The system will still attach an outer trainable parameter \(e\.g\., params\[0\]\*np\.sin\(2\*x0\)\)\.
5\. Symbolic Constants: Do NOT use symbolic constants like ’g’, ’k’, ’m’\. Use numerical values\.
\-\- Correct: "9\.81\*x0" \(if g=9\.81 is known\), "np\.pi\*x0"
\-\- Incorrect: "g\*x0" \(will cause NameError\)
6\. No duplicates: Equations identical to previous attempts are forbidden\. Structural modifications are required\.
7\. Reasoning required: When proposing each term, provide a physical/mathematical reasoning based on the system description \(desc\)\.\[Example \(2D System\)\]
x0\_t: \["x0", "x1"\]
x1\_t: \["x0", "np\.sin\(x1\)"\]Figure 16:Initial prompt to the Sampler agent at iteration 1, where accumulated knowledge, term\-level evaluation, and removed\-term history are empty at initialization\.
Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

Similar Articles

Uncertainty Quantification for Large Language Diffusion Models

Discovering Reinforcement Learning Interfaces with Large Language Models

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

Submit Feedback

Similar Articles

Uncertainty Quantification for Large Language Diffusion Models
Discovering Reinforcement Learning Interfaces with Large Language Models
CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs
ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale
DEI: Diversity in Evolutionary Inference for Quality-Diversity Search