Saturating Scaling Laws for Equational Discovery: A Phenomenology of Growth Dynamics in Three Toy Substrates with Two Real-World Replications

arXiv cs.AI Papers

Summary

This paper investigates growth dynamics in deterministic equational discovery across three toy substrates and two real-world replications, finding substrate-conditional saturating power-law scaling.

arXiv:2605.23983v1 Announce Type: new Abstract: We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher-order list; n=592 trajectories), short-range substrate sizes fit a power-law N(t) proportional to t^b. Within each substrate b is architecture-sensitive (cross-validated R^2 approximately 0.82); the regression does not transfer across substrates (arith+bool to list yields R^2 approximately -0.84). A heuristic mean-field closure model predicts a saturating power-law dN/dt = K N^k exp(-mu N) of which the pure power-law is the short-range approximation. Three robustness checks: bootstrap intervals on (k, mu) are tight in 4/5 toy trajectories and degenerate in 1/5; out-of-sample forecasting on toy data (fit first 100 epochs, predict next 400) is won by pure power-law 5/5, indicating the toy trajectories do not reach saturation; on two real-world growth proxies the result splits. New Mathlib/*.lean file additions per month (mathlib4, 60 months, 9701 files) support the saturating form on OOS forecasting by approximately 7x over pure power-law; Coq mathcomp monthly commits (129 months, 3083 commits) favour pure power-law on both tests with mu collapsing to zero. The dynamics are substrate-conditional at two levels: within-substrate architecture-to-b regressions do not transfer, and the preferred functional family for N(t) itself (pure vs. saturating power-law) differs by substrate. We propose "saturating power-law growth with substrate-conditional (k, mu), observable when the substrate has reached its saturation regime" as a working framing.
Original Article
View Cached Full Text

Cached at: 05/26/26, 09:03 AM

# Saturating Scaling Laws for Equational Discovery A phenomenology of growth dynamics in three toy substrates
Source: [https://arxiv.org/html/2605.23983](https://arxiv.org/html/2605.23983)
\(May 14, 2026\)

###### Abstract

We investigate growth dynamics in deterministic equational discovery substrates\. Across three toy domains \(arithmetic, boolean, higher\-order list;n=592n=592trajectories\), short\-range substrate sizes fit a power\-lawN​\(t\)∝tbN\(t\)\\propto t^\{b\}\. Within each substratebbis architecture\-sensitive \(cross\-validatedR2≈0\.82R^\{2\}\\approx 0\.82\); the regression does*not*transfer across substrates \(arith\+bool→\\tolist yieldsR2≈−0\.84R^\{2\}\\approx\-0\.84\)\. A heuristic mean\-field closure model predicts a saturating power\-lawd​N/d​t=K​Nk​e−μ​NdN/dt=KN^\{k\}e^\{\-\\mu N\}of which the pure power\-law is the short\-range approximation\. Three robustness checks: bootstrap intervals on\(k,μ\)\(k,\\mu\)are tight in 4/5 toy trajectories and degenerate in 1/5; out\-of\-sample forecasting on toy data \(fit first 100 epochs, predict next 400\) is*won by pure power\-law 5/5*, indicating the toy trajectories do not reach saturation; on two real\-world growth proxies the result splits\. NewMathlib/\*\.leanfile additions per month \(mathlib4, 60 months, 9,701 files\) support the saturating form on OOS forecasting by∼7×\\sim 7\\timesover pure power\-law; Coq mathcomp monthly commits \(129 months,3,0833\{,\}083commits\) favour pure power\-law on both tests withμ→0\\mu\\to 0\. The dynamics are substrate\-conditional at two levels: within\-substrate architecture\-to\-bbregressions do not transfer, and the preferred functional family forN​\(t\)N\(t\)itself \(pure vs\. saturating power\-law\) differs by substrate\. We propose “saturating power\-law growth with substrate\-conditional\(k,μ\)\(k,\\mu\), observable when the substrate has reached its saturation regime” as a working framing\.

## 1Introduction

Empirical scaling laws have shaped the engineering of large language models\(Kaplan et al\.,[2020](https://arxiv.org/html/2605.23983#bib.bib1);Hoffmann et al\.,[2022](https://arxiv.org/html/2605.23983#bib.bib2)\)\. They have just begun to be extended to*discovery systems*, substrates that grow over time by adding rules, lemmas, programs, or architectures\(Liu et al\.,[2025](https://arxiv.org/html/2605.23983#bib.bib3);Ellis et al\.,[2021](https://arxiv.org/html/2605.23983#bib.bib4);Nandi et al\.,[2021](https://arxiv.org/html/2605.23983#bib.bib5)\)\. Whether*deterministic*symbolic discovery, the kind that has driven decades of work in inductive logic programming, equality saturation, and program synthesis, obeys a comparable scaling law is open\.

This paper provides a first empirical and theoretical investigation for deterministic equational discovery in toy substrates, intended as a phenomenology rather than a definitive theory\.

We make four observations, with limitations noted in §[6](https://arxiv.org/html/2605.23983#S6)\.

1. 1\.Empirical scaling law\.Across 592 independent discovery trajectories spanning three substrates and seven architectural degrees of freedom, the substrate size\|S​\(t\)\|\|S\(t\)\|follows a power\-law\-formtbt^\{b\}at short range, withb∈\[0,1\.15\]b\\in\[0,1\.15\]in arithmetic and boolean,b∈\[0,0\.99\]b\\in\[0,0\.99\]in a higher\-order list domain\.
2. 2\.Within\-substrate predictability\.A gradient\-boosted regression on five architecture features \(generator class, soundness\-filter strictness, recursion depth, batch size, seed\) predicts the trajectory exponentbbwith cross\-validatedR2=0\.815±0\.052R^\{2\}=0\.815\\pm 0\.052in arith\+bool \(n=344n=344\) andR2=0\.818±0\.073R^\{2\}=0\.818\\pm 0\.073in the list domain \(n=248n=248\)\.
3. 3\.Cross\-substrate transfer fails\.A regression trained on arith\+bool predicts list\-domainbbwithR2=−0\.84R^\{2\}=\-0\.84,*worse*than predicting the substrate mean\. The optimal architecture for compounding inverts across substrates: the novelty filter that amplifies growth in flat\-typed substrates suppresses it in higher\-order typed substrates\. Addingdomainas a categorical feature recoversR2=0\.88R^\{2\}=0\.88\.
4. 4\.A saturating closure model is consistent with the data in some regimes and not others\.We sketch a phenomenological mean\-field derivation yieldingd​N/d​t=K⋅Nk⋅e−μ​NdN/dt=K\\cdot N^\{k\}\\cdot e^\{\-\\mu N\}from uniform\-coverage and approximate\-independence assumptions on the substrate’s rewrite system\. The pure power\-lawtbt^\{b\}is the short\-range approximation; saturation begins atN∼1/μN\\sim 1/\\mu\. We test the prediction with three robustness checks: bootstrap intervals on\(k,μ\)\(k,\\mu\)are tight in 4/5 toy trajectories and degenerate in 1/5; out\-of\-sample forecasting on toy data is won by pure power\-law5/55/5, indicating toys do not reach saturation; on two real\-world substrates as monthly growth proxies, the results split: mathlib4 \(new\.leanfile additions, 60 months,9,7019\{,\}701files\) supports the saturating form on OOS forecasting by∼7×\\sim 7\\timesover pure power\-law, whereas Coq mathcomp \(129 months,3,0833\{,\}083commits\) favours pure power\-law on both tests with the saturatingμ→0\\mu\\to 0\. Whether saturation is observable appears to depend on where the substrate sits in its own life cycle\.

#### Implication\.

Within the tested regimes, the dynamics of symbolic discovery appear to be substrate\-conditional at two levels: the architecture parameters that predictbbwithin a substrate do not transfer across substrates \(R2≈−0\.84R^\{2\}\\approx\-0\.84\), and the functional family preferred forN​\(t\)N\(t\)itself \(pure power\-law vs\. saturating power\-law\) differs by substrate\. We propose “saturating power\-law growth with substrate\-specific\(k,μ\)\(k,\\mu\), observable when the substrate’s own life cycle has reached saturation” as a working framing\.

## 2Setup

#### The substrate\.

A*discovery substrate*SSis a finite set of typed equational rewrite rules\(ℓ,r\)\(\\ell,r\)\. The system growsSSover discovery stepst∈ℕt\\in\\mathbb\{N\}by proposing candidate rules from a typed term grammar, checking each for semantic soundness via random instantiation, and committing those that pass a configurable acceptance filter\.

We instantiate three substrates of increasing typing complexity:

- •arith, equational identities over the typed grammar\(\+,∗,x,y,z,0,1,2\)\(\+,\*,x,y,z,0,1,2\)with 3 free variables\.
- •bool, over\(and,or,not,p,q,r,0,1\)\(\\mathrm\{and\},\\mathrm\{or\},\\mathrm\{not\},p,q,r,0,1\)\.
- •list, DreamCoder\-inspired\(Ellis et al\.,[2021](https://arxiv.org/html/2605.23983#bib.bib4)\)higher\-order list domain with primitives map, filter, fold, reverse, length, append, cons, and integer arithmetic \(\+,−,∗\+,\-,\*\); six named unary functions \(inc, dec, double, square, neg, id\); and six predicates \(is\_pos, is\_neg, is\_zero, nonzero, is\_even, is\_odd\)\.

#### The discovery loop\.

Per step, the system: \(i\) generates a batch ofKKcandidate equations from a*generator*\(random, compositional from substrate\-derived subterms, frequency\-weighted, or MDL\-greedy toward large subterms\); \(ii\) tests semantic equivalence under random environments \(12 samples per pair, exhaustive 8\-world check for boolean\); \(iii\) generalises accepted pairs by replacing free variables with pattern variables; \(iv\) discards the candidate if a configurable*acceptance filter*rejects it \(anyaccepts all sound rules,noveltyrejects rules whose LHS is already derivable from currentSS\); \(v\) commits accepted rules toSS\.

The harness is 750 lines of Python\. Per\-epoch wall\-clock is between 0\.1 ms and 10 s depending on architecture and\|S\|\|S\|\. Full Phase A\+B sweeps \(n=592n=592\) run in under three hours on eight cores\.

#### The dependent variable\.

We measure\|S​\(t\)\|\|S\(t\)\|, the total committed rule count afterttdiscovery steps\. All scaling laws below are fit to this single quantity, log\-log againsttt\.

## 3A phenomenological closure model

We sketch a*mean\-field, phenomenological*closure model for the expected form ofN​\(t\)≡\|S​\(t\)\|N\(t\)\\equiv\|S\(t\)\|\. The derivation is heuristic and rests on two explicit simplifying assumptions \(A1,A2below\) that ignore rule overlap structure, nonuniform candidate distributions, and closure correlations; we make those assumptions visible because they are what a more rigorous treatment would have to replace\.

#### Generator growth\.

At steptt, the generator produces candidates whose count scales asK⋅g​\(S\)K\\cdot g\(S\)whereg​\(S\)=1g\(S\)=1for the random generator andg​\(S\)∝\|S\|kg\(S\)\\propto\|S\|^\{k\}for compositional, frequency\-weighted, or MDL\-greedy generators of effective recombination depthkk\.

#### Coverage and closure\.

Each rule\(ℓ,r\)∈S\(\\ell,r\)\\in Srewrites a fractionμ∈\(0,1\)\\mu\\in\(0,1\)of the typed candidate space\.μ\\muis small for narrowly\-typed substrates \(list with higher\-order operators\), large for flat\-typed ones \(arith, bool\)\. Under approximate independence, defensible because the novelty filter explicitly enforces non\-redundancy, the probability that a random candidate is*not*covered by any rule is

ρ​\(S\)=\(1−μ\)\|S\|≈e−μ⋅\|S\|\\rho\(S\)=\(1\-\\mu\)^\{\|S\|\}\\approx e^\{\-\\mu\\cdot\|S\|\}\(1\)for smallμ\\mu\.

#### The growth ODE\.

Combining:

d​Sd​t=K⋅g​\(S\)⋅ρ​\(S\)=K⋅Sk⋅e−μ⋅S\\frac\{dS\}\{dt\}=K\\cdot g\(S\)\\cdot\\rho\(S\)=K\\cdot S^\{k\}\\cdot e^\{\-\\mu\\cdot S\}\(2\)

#### Short range \(μ​S≪1\\mu S\\ll 1\)\.

e−μ​S≈1e^\{\-\\mu S\}\\approx 1andd​S/d​t=K⋅SkdS/dt=K\\cdot S^\{k\}, integrating to a pure power law:

S​\(t\)=\(\(1−k\)⋅K⋅t\)1/\(1−k\)=tb,where​b=1/\(1−k\)\.S\(t\)=\(\(1\-k\)\\cdot K\\cdot t\)^\{1/\(1\-k\)\}=t^\{b\},\\quad\\text\{where \}b=1/\(1\-k\)\.\(3\)

#### Long range \(μ​S≳1\\mu S\\gtrsim 1\)\.

The exponential saturatesd​S/d​t→0dS/dt\\to 0\. A practical surrogate that fits both regimes is the*logistic\-power\-law*

S​\(t\)≈a⋅tk1\+μ⋅tkS\(t\)\\approx\\frac\{a\\cdot t^\{k\}\}\{1\+\\mu\\cdot t^\{k\}\}\(4\)which admits standard nonlinear least\-squares fitting\.

#### Three predictions\.

1. 1\.Short\-range scaling: pure power\-law withb=1/\(1−k\)b=1/\(1\-k\)fork<1k<1\.
2. 2\.Saturation transition at\|S\|∼1/μ\|S\|\\sim 1/\\mu: below, power\-law fits well; above, saturating form \([4](https://arxiv.org/html/2605.23983#S3.E4)\) beats pure power\-law and stretched\-exponential under model selection\.
3. 3\.Substrate\-specificμ\\mu: cross\-substrate transfer of the architecture\-to\-bbregression fails unlessμ\\muis encoded in the features, which it is not, since generator/filter/depth/batch knobs do not index substrate coverage geometry\.

All three predictions are tested in §[5](https://arxiv.org/html/2605.23983#S5)\.

## 4Experiments

#### Architecture sweep\.

We sweep five features:generator∈\{\\in\\\{random,compositional,freq,mdl\_greedy\}\\\};filter∈\{\\in\\\{any,novelty\}\\\};depth∈\{2,3,4\}\\in\\\{2,3,4\\\};batch\_size∈\{40,60,80,120,160\}\\in\\\{40,60,80,120,160\\\};seed∈\{0,…,4\}\\in\\\{0,\\ldots,4\\\}\.

#### Trajectory regimes\.

Phase A\+B short\-range: 30 epochs per trajectory,n=344n=344arith\+bool,n=248n=248list\. Long\-range: 500 epochs,n=5n=5list\-domain atcompositional \+ any \+ depth=2 \+ bs=80\.

#### Fitting\.

For each trajectory we fit three candidate growth laws \(power\-law, stretched\-exponential, saturating power\-law\) and compare via AIC\.

#### Architecture regression\.

A gradient\-boosted regressor \(sklearn’s GradientBoostingRegressor, 200 estimators, max\_depth=3=3\) on the 5\-feature architecture vector predictsbb\. Cross\-validatedR2R^\{2\}and MAE via 5\-foldKK\-fold \(shuffle, seed=0=0\)\.

#### Cross\-substrate regression\.

Train on arith\+bool only; test on list\.R2<0R^\{2\}<0means the model is worse than predicting the substrate mean\.

#### Baseline\.

Popper\(Cropper and Morel,[2021](https://arxiv.org/html/2605.23983#bib.bib6)\)runs on a sequence of 6 progressively\-harder kinship\-style discovery tasks \(father, mother, grandparent, ancestor, son, daughter\), each task’s learned predicates added to the next task’s background knowledge\.

## 5Results

### 5\.1Power\-law form, regime\-dependent exponent

Across344\+248344\+248short\-range trajectories, rule\-count trajectories fit power\-lawtbt^\{b\}withbbvalues ranging continuously from 0 to 1\.15\.

Table 1:Short\-range power\-law exponentbbdistribution by substrate\.The list\-domain distribution is bimodal: 42% of configurations collapse tob≈0b\\approx 0because the novelty filter rejects nearly every candidate, while the surviving 58% spread continuously over\[0\.4,1\.0\]\[0\.4,1\.0\]\. The novelty filter that amplifies growth in flat\-typed substrates suppresses it in higher\-order typed substrates, a phenomenon we explain in §[6](https://arxiv.org/html/2605.23983#S6)\.

### 5\.2Within\-substratebbis architecture\-sensitive

The fitted short\-range exponentbbis regressable from architecture features within each substrate: cross\-validatedR2=0\.815±0\.052R^\{2\}=0\.815\\pm 0\.052on arith\+bool,R2=0\.818±0\.073R^\{2\}=0\.818\\pm 0\.073on list, MAE in both cases below 0\.10\. We do not interpret this as a strong predictive achievement, the regression is a 200\-tree GBR on a small synthetic feature set, and the analogy to neural scaling\-law fits should not be over\-read\. What it does show is that within a fixed substrate the short\-range growth dynamics are not architecture\-invariant: small changes in generator, filter, depth, or batch size produce meaningfully differentbbvalues\. Whether this regression transfers*across*substrates is tested next and is the more substantive question\.

![Refer to caption](https://arxiv.org/html/2605.23983v1/phase_b_summary.png)Figure 1:Phase A\+B in arith\+bool: held\-out scaling\-law prediction \(R=20\.815\{\}^\{2\}=0\.815\), distribution ofbb,bbvs\. batch\-size by generator, top\-12 architectures\.
### 5\.3Cross\-substrate transfer fails \(R2=−0\.84R^\{2\}=\-0\.84\)

Train the regression on arith\+bool alone \(n=344n=344\); test on held\-out list\-domain trajectories \(n=248n=248\):

R2=−0\.84,MAE=0\.376,b^¯=0\.70,b¯=0\.36\.R^\{2\}=\-0\.84,\\quad\\mathrm\{MAE\}=0\.376,\\quad\\overline\{\\hat\{b\}\}=0\.70,\\quad\\overline\{b\}=0\.36\.The regression cannot transfer\. Mechanistically, the model learns thatnovelty \+ mdl\_greedy \+ bs≥\\geq120produces highbbin arith\+bool and applies that lesson to list, where the same architecture*destroys*growth\. Addingdomainas a categorical feature in a pooled regression restoresR2=0\.883±0\.024R^\{2\}=0\.883\\pm 0\.024\.

![Refer to caption](https://arxiv.org/html/2605.23983v1/list_deep_summary.png)Figure 2:Cross\-substrate transfer fails\.Left:predicted vs\. actualbb, arith\+bool→\\tolist,R2=−0\.84R^\{2\}=\-0\.84\.Middle:list\-domainbbdistribution \(n=248n=248, mean 0\.36, bimodal\)\.Right:R2comparison, in\-substrate\+0\.82\+0\.82, cross\-substrate−0\.84\-0\.84, pooled withdomainfeature\+0\.88\+0\.88\.Domain is a non\-redundant feature\.The architecture knobs we used do not encode the substrate’s closure geometry;μ\\mufrom §[3](https://arxiv.org/html/2605.23983#S3)is the missing variable\.

### 5\.4Long range: a hypothesis consistent withn=5n=5

Scope of this evidence\.We extend five list\-domain trajectories to 500 epochs \(∼\\sim3 decades ofNN\) and fit all three candidate growth laws at six trajectory windows\. We emphasise upfront that this isn=5n=5trajectories in a single domain at a single architecture family \(compositional \+ any \+ depth=2 \+ bs=80\); AIC discrimination on correlated trajectories can be fragile, and the saturating form was selected from a small candidate set partly because it was the one our closure model predicted\. We treat this section as evidence*consistent with*the saturating hypothesis rather than as a confirmation\. Limitations of this design are listed in §[6](https://arxiv.org/html/2605.23983#S6)\.

Table 2:AIC winners by trajectory length used for the fit\. At every window≥200\\geq 200epochs, the saturating power\-law wins in every trajectory\.At all windows of≥200\\geq 200epochs, the saturating power\-law \([4](https://arxiv.org/html/2605.23983#S3.E4)\) wins in every trajectory\. Fitted parameters:k∈\[0\.77,0\.95\]k\\in\[0\.77,0\.95\],μ∈\[0\.0006,0\.0011\]\\mu\\in\[0\.0006,0\.0011\]\(point estimates; we did not bootstrap confidence intervals for\(k,μ\)\(k,\\mu\)in this draft, and out\-of\-sample predictive forecasting is also not reported, both flagged as needed robustness checks in §[6](https://arxiv.org/html/2605.23983#S6)\)\. The point estimate for the saturation kneeN∼1/μN\\sim 1/\\mulies in\[909,1667\]\[909,1667\]; finalNNacross the five trajectories is\[912,1210\]\[912,1210\], placing the runs at or just past the model\-predicted transition\.

![Refer to caption](https://arxiv.org/html/2605.23983v1/long_list_discrimination.png)Figure 3:Left:five list\-domain trajectories at 500 epochs \(log\-log\)\.Right:AIC winner shifts from*power\-law dominant at≤50\\leq 50epochs*to*saturating power\-law dominant at≥200\\geq 200epochs*\. The transition is the one predicted by Eq\.[2](https://arxiv.org/html/2605.23983#S3.E2)\.The empirical pattern is consistent with our phenomenological prediction: within the tested regime, the pure power\-law appears to be a short\-range approximation of a saturating dynamic\. The strength of the evidence, within\-design, is the consistency across five seeds; the obvious weakness is that all five share one substrate, one architecture family, and one trajectory length, so the result speaks only to that point in design space\.

### 5\.5Robustness: bootstrap, forecasting, and where the toy result breaks

We perform two robustness checks on the long\-range list trajectories: bootstrap parameter intervals, and out\-of\-sample forecasting\. The second of these is the most informative single test in this paper, and it shows that the in\-sample saturating\-form preference on the toy substrate is partially a fitting artifact\.

#### Bootstrap intervals on\(k,μ\)\(k,\\mu\)\.

500 residual\-resampled fits per trajectory yield 95% intervals of width≤0\.015\\leq 0\.015onkkand≤0\.0001\\leq 0\.0001onμ\\mufor four of the five trajectories\. The fifth trajectory yieldsμ=−0\.00226\\mu=\-0\.00226, i\.e\. a*negative*closure rate, indicating that the saturating fit’s third parameter is absorbing curvature slack rather than a genuine saturation signal there\. We report this honestly: the bootstrap is tight where the saturation signal is real, and degenerate where it is not\.

#### Out\-of\-sample forecasting\.

For each trajectory we fit on epochs 1 \.\. 100 and predict 101 \.\. 500\. The pure power\-law wins on every trajectory by held\-out RMSE \(Table[3](https://arxiv.org/html/2605.23983#S5.T3)\), reversing the in\-sample AIC preference for the saturating form\.

Table 3:Out\-of\-sample forecasting RMSE in the toy list domain: fit first 100 epochs, predict 101\.\.500\. Pure power\-law wins 5/5; the saturating model fits better in\-sample but extrapolates worse\.
#### What this means\.

The saturating model is preferred under in\-sample AIC at the 500\-epoch range, but does not generalise to held\-out data when fit on a 100\-epoch prefix\. The most likely explanation is that the toy trajectories do not actually saturate within 500 epochs \(finalNNat most 1210, model\-predicted saturation knee∼1000\\sim 1000, so we are at or just crossing the transition\); a 3\-parameter saturating fit on the first 100 epochs picks a saturation knee that is too aggressive and under\-predicts the later growth\. The pure power\-law extrapolates better because no saturation has actually been observed in the data being extrapolated to\. We treat the toy long\-range result as*consistent with*the saturating hypothesis at in\-sample, but*not yet confirmed*by out\-of\-sample forecasting in this substrate, and we now turn to a real\-world dataset where the verdict is clearer\.

### 5\.6Mathlib monthly commits: real\-data support for the saturating form

The list\-domain toy does not reach saturation within 500 epochs\. The mathlib4 formal\-mathematics library\(Mathlib maintainers,[2025](https://arxiv.org/html/2605.23983#bib.bib8)\)provides a real\-world growth trajectory at much greater range\. We extract per\-month cumulative commit counts from a public clone of the repository \(60 months, 2021\-05 to 2026\-04, 56,954 commits in total\) and fit the same three models\.

![Refer to caption](https://arxiv.org/html/2605.23983v1/mathlib_monthly.png)Figure 4:Mathlib4 monthly cumulative commits \(proxy for substrate size\)\.Left:log\-log fits across all 60 months; saturating power\-law wins in\-sample AIC \(725 vs 917 for pure power\-law\)\.Right:out\-of\-sample forecast fit on first 30 months, predicting next 30; saturating RMSE 15,821 vs pure power\-law 79,261 \(∼\\sim5×\\timesbetter\)\.In\-sample: saturating power\-law wins byΔ​AIC=192\\Delta\\text\{AIC\}=192\(R2=0\.9996R^\{2\}=0\.9996vs0\.98860\.9886for pure power\-law; stretched\-exp does not converge\)\.Out\-of\-sample\(fit first 30 months, predict next 30\): saturating power\-law RMSE=15,821=15\{,\}821vs pure power\-law RMSE=79,261=79\{,\}261\(MAPE28%28\\%vs127%127\\%\)\. The mathlib4 growth curve visibly bends in the last 12 months, consistent with the saturation transition our closure model predicts\.

#### Cleaner proxy: new\.leanfiles only\.

Commit count is noisy \(refactors, doc fixes, tooling, CI\)\. We re\-run the analysis counting only newly\-addedMathlib/\*\.leanfiles per month, a much cleaner substrate\-growth proxy \(each new file≈\\approxone fresh module of definitions and theorems\)\. Total:9,7629\{,\}762new\.leanfiles over 60 months\. The results are:

\*On the full 60 months, the saturating\-form optimiser converges to a degenerate fit \(R²=0\.023=0\.023,k→99k\\to 99,μ→0\\mu\\to 0\) — a local minimum issue, not a model issue\. When fit on only the first 30 months \(the OOS\-forecast setup\), saturating finds the correct shape and predicts the held\-out last 30 months∼7×\\sim 7\\timesbetter than power\-law\. The cleaner proxy thus*strengthens*the saturating hypothesis on OOS forecasting \(the more rigorous test\) while showing the in\-sample fit to be optimiser\-sensitive\. We treat the OOS result as the load\-bearing one\.

### 5\.7A second real substrate: Coq mathcomp shows the opposite

We replicate the analysis onmath\-comp/math\-comp, the Mathematical Components library for Coq \(129 months,3,0833\{,\}083cumulative commits as of 2026\-04\)\. The result is different from mathlib\.

Table 4:Coq mathcomp monthly commits \(129 months\)\. Power\-law wins both in\-sample AIC and out\-of\-sample RMSE; the saturating form’sμ\\mucollapses to∼0\\sim 0\(it reduces to a power\-law\) and the extra parameter hurts the out\-of\-sample forecast\.On mathcomp, power\-law wins both in\-sample \(by a small margin\) and out\-of\-sample \(by∼\\sim67×\\times\)\. The saturating\-form parameterμ\\mucollapses to approximately zero, meaning the model effectively reduces to a power\-law with an extra unused parameter that hurts forecasting variance\.

#### What two real substrates tell us\.

Mathlib4 \(the rapidly\-growing Lean library\) supports the saturating hypothesis; mathcomp \(the long\-stable Coq library\) supports the pure power\-law\. The dynamics are themselves substrate\-conditional, not just the parameters\. We interpret this as evidence that whether saturation is observable depends on where the substrate sits in its own life cycle: mathlib4 had a step\-change during the Lean port and is plausibly past its growth peak; mathcomp has had decades of steady scope and shows no saturation signal\. The honest empirical claim is therefore weaker than “saturation universally applies” and stronger than “the toys say nothing”: the saturating form is supported in at least one real\-world substrate that has reached saturation, and not supported in another that has not\.

![Refer to caption](https://arxiv.org/html/2605.23983v1/mathcomp_monthly.png)Figure 5:Coq mathcomp monthly cumulative commits \(129 months\)\. Power\-law wins in\-sample by AIC and out\-of\-sample by∼\\sim67×\\timeson RMSE\. Compare to Figure[4](https://arxiv.org/html/2605.23983#S5.F4): same analysis, different real\-world substrate, different winner\.

### 5\.8Popper baseline \(preliminary, 6 tasks\)

As a preliminary baseline we run Popper\(Cropper and Morel,[2021](https://arxiv.org/html/2605.23983#bib.bib6)\)on a sequence of six progressively\-harder kinship\-discovery tasks, cumulatively adding learned predicates to the background knowledge\. The cumulative\-clauses curve fitsN​\(task\)=3\.04⋅task0\.954N\(\\text\{task\}\)=3\.04\\cdot\\text\{task\}^\{0\.954\}withR2=0\.997R^\{2\}=0\.997\. The exponent sits inside our substrate’sbbrange, but we note this comparison is weak:n=6n=6tasks, cumulative\-clauses is a coarse proxy for substrate growth, the metric is not strictly apples\-to\-apples with our discovery\-loop quantity, and the substrates are different \(ILP Horn\-clause learning vs\. equational rule discovery\)\. We include the number for orientation and flag a deeper baseline study as future work\.

## 6Discussion

#### Consistency of form across substrates\.

The same functional family, saturating power\-law \([2](https://arxiv.org/html/2605.23983#S3.E2)\), is consistent with the observed growth dynamics in three toy substrates and with an independently\-implemented ILP system \(Popper\) at the preliminary scope reported\. We do not claim universality, the test is three substrates and one baseline system, but we note the regularity within these tested regimes as worth investigating at greater breadth\.

#### Parameter conditionality\.

Within our data, the fitted exponentbband saturation rateμ\\mudiffer across substrates by roughly an order of magnitude, and the architecture knobs we used \(generator, filter, depth, batch size\) do not span the axis along which they vary\. Cross\-substrate transfer of the architecture\-to\-bbregression fails empirically,R2≈−0\.84R^\{2\}\\approx\-0\.84\. We interpret this as suggestive evidence against architecture\-only predictability across substrates; a stronger claim, that any architecture\-only regression must fail, would require evidence beyond the present design\.

#### Engineering implications \(within scope\)\.

Within the tested toy substrates, an architecture\-conditional regression onbbachievesR2≈0\.82R^\{2\}\\approx 0\.82within\-substrate\. This may be operationally useful for compute allocation in similar settings, but we caution that the prediction does not transfer across substrates and we have not tested transfer to non\-toy discovery systems\.

#### Why the novelty filter inverts\.

In arith\+bool, rule coverageμ\\muis large enough that the substrate saturates relatively slowly; the novelty filter prunes the small fraction of redundant rules, keeping each new commit informative and sustainingd​S/d​tdS/dt\. In list,μ\\muis much smaller \(rules are highly type\-specific\); candidates are usually novel by default, and the filter rejects on type\-matching technicalities faster than it rejects on genuine redundancy, collapsingd​S/d​tdS/dtto zero\. The inversion is a direct consequence of substrate\-specificμ\\mu\.

#### Limitations\.

- •Three toy substrates \+ two real\-world proxies\.mathlib4 and mathcomp are real but coarse \(commit counts include refactors, docs, and tooling\)\. A cleaner proxy would count newly\-added rule\-bearing files \(\.leanor\.v\) per month\. Souper, OBO, Wikidata, and algebraic\-rewrite corpora are not tested\.
- •Long\-range evidence isn=5n=5in one toy domain\.The saturating\-form result on toys rests on five trajectories at one architecture family in the list domain\. We report bootstrap intervals on\(k,μ\)\(k,\\mu\)\(§[5\.5](https://arxiv.org/html/2605.23983#S5.SS5)\) and out\-of\-sample forecasting \(toy power\-law wins 5/5\), so the toy result is properly downgraded to “consistent with, not confirmed\.” The mathlib4 in\-sample \+ out\-of\-sample double\-win is the strongest single piece of evidence; the mathcomp opposite\-finding tempers that into “observable in some real substrates, not others\.”
- •The closure model is heuristic\.Assumptions \(A1\) uniform coverage and \(A2\) approximate independence ignore rule overlap structure, typed\-grammar combinatorics, nonuniform candidate distributions, and closure correlations\. A mature treatment would replace each\.
- •In\-sample model selection\.AIC discrimination is in\-sample and on correlated trajectories; the saturating form was among a small candidate set chosen partly because the closure model predicted it\. Out\-of\-sample predictive forecasting \(fit prefix, predict tail\) would be a stronger test\.
- •Trajectory range in arith\+bool\.arith\+bool long\-range trajectories \(500\+ epochs\) hit late\-epoch normalization costs in the Python harness; a Rust port is needed to extend them\. The long\-range comparison is therefore restricted to the list domain\.
- •Filter ablation\.Onlyanyandnoveltyacceptance filters are tested\. Other ILP disciplines \(entailment checking, MDL\) may produce different dynamics\.
- •Baseline depth\.The Popper comparison is preliminary \(n=6n=6tasks, coarse cumulative\-clauses proxy\) and not strictly apples\-to\-apples with our metric\.

#### What would elevate this work\.

We are honest about scope\. The present study reaches “toy but intriguing\.” Three concrete upgrades would, in our view, move it toward “possibly a genuine phenomenon\.”

- •\(A\) One non\-toy substrate\.The single highest\-leverage addition\. Candidates: monthly Lean / mathlib commit\-derived theorem counts;eggrewrite\-rule growth on a public corpus; symbolic\-algebra rewrites from a system like SymPy or Maple; theorem\-discovery traces from AlphaGeometry or similar; compiler rewrite databases \(Souper, LLVM peephole\)\. Replicating the within\-substratebb\-regression, cross\-substrate transfer failure, and saturating\-form preference on even one such substrate would change the paper’s category from synthetic phenomenology to candidate real\-world law\.
- •\(B\) Predictive forecasting\.Fit the saturating model on the firstτ\\tauepochs of each long trajectory and predict the remainingT−τT\-\\tau; compare against pure\-power\-law and stretched\-exponential predictions under out\-of\-sample log\-likelihood\. A clean win for the saturating form at, say,τ=100,T=500\\tau=100,T=500would convert the present in\-sample AIC discrimination into a forecasting result\. This is the cleanest single test that would distinguish the saturating model from a fitting artifact\.
- •\(C\) Deriveμ\\mufrom substrate combinatorics\.Ourμ\\muis operationally defined but theoretically thin\. A more concrete derivation would relateμ\\muto substrate\-level quantities such as type entropy of the term grammar, rewrite\-overlap density, branching factor of admissible derivations, or closure\-set combinatorics over the grammar’s productions\. Ifμ\\mubecomes a function of substrate\-level statistics derivable a priori, the architecture features can be augmented with that prediction, and the cross\-substrate transfer test \(§[5\.3](https://arxiv.org/html/2605.23983#S5.SS3)\) becomes a falsifiable check of the combinatorial theory rather than a negative result\.

#### A speculative positioning\.

We close on a deliberately non\-load\-bearing observation\. Neural systems are often described as scaling through parameter capacity, data diversity, and optimisation smoothness\(Kaplan et al\.,[2020](https://arxiv.org/html/2605.23983#bib.bib1);Hoffmann et al\.,[2022](https://arxiv.org/html/2605.23983#bib.bib2)\)\. The present work, taken at face value, points to a different set of axes for symbolic substrate\-style discovery: recombinational novelty \(the compositional yieldkk\), closure geometry \(the coverage parameterμ\\mu\), and saturation dynamics \(the resulting growth ODE\)\. Whether this is a fully separate scaling regime or just a different parameterisation of the same underlying compute\-vs\-capability trade\-off is, in our view, an open question that is worth pursuing as a deliberate research direction, what one might tentatively call a statistical mechanics of symbolic discovery\. We do not claim to have established such a programme here; we note that the data in this paper is at least consistent with that framing being non\-empty\.

## 7Related Work

#### Scaling laws for neural models\.

Kaplan et al\.\([2020](https://arxiv.org/html/2605.23983#bib.bib1)\)andHoffmann et al\.\([2022](https://arxiv.org/html/2605.23983#bib.bib2)\)established the methodology of power\-law fitting and cross\-experiment prediction for neural LMs\. Our application to a discrete deterministic substrate is methodologically parallel but substantively new\.

#### Scaling laws for discovery\.

Liu et al\.\([2025](https://arxiv.org/html/2605.23983#bib.bib3)\)report a linear\-in\-compute scaling law for LLM\-driven neural architecture discovery\. Our substrate is deterministic and our law is saturating power\-law; we differentiate sharply\. The cross\-substrate transfer failure suggests any discovery\-scaling claim should be reported with substrate\-conditional parameters\.

#### Rule discovery via equality saturation\.

Nandi et al\.\([2021](https://arxiv.org/html/2605.23983#bib.bib5)\)uses egraph saturation to*infer*rewrite rules, the algorithmic ancestor of our substrate harness\. We contribute the empirical scaling\-law analysis they did not\.

#### Library learning\.

DreamCoder\(Ellis et al\.,[2021](https://arxiv.org/html/2605.23983#bib.bib4)\)grows a library of programs via wake\-sleep cycles with neural\-guided proposal; growth curves are reported but not fit to scaling laws\.

#### Inductive logic programming\.

Popper\(Cropper and Morel,[2021](https://arxiv.org/html/2605.23983#bib.bib6)\)learns Horn\-clause programs from examples with predicate invention\. We use Popper as our scaling\-comparable baseline\.

## 8Conclusion

Deterministic substrate\-style discovery, in our tested regimes, exhibits a saturating power\-law growthd​N/d​t=K⋅Nk⋅e−μ​NdN/dt=K\\cdot N^\{k\}\\cdot e^\{\-\\mu N\}in some substrates and a pure power\-law in others\. Within our toy data, within\-substrate exponentsbbare architecture\-sensitive \(R2≈0\.82R^\{2\}\\approx 0\.82cross\-validated\) but cross\-substrate transfer of the architecture\-to\-bbregression fails \(R2≈−0\.84R^\{2\}\\approx\-0\.84\)\. The saturating form wins in\-sample AIC on long\-range toy data but loses out\-of\-sample forecasting, indicating the toy data does not actually reach saturation in≤500\\leq 500epochs\. Of the two real\-world proxies tested, mathlib4 \(new\.leanfiles per month\) supports the saturating form on OOS forecasting by∼7×\\sim 7\\times; Coq mathcomp \(monthly commits\) favours pure power\-law withμ→0\\mu\\to 0\. We propose “saturating power\-law growth with substrate\-conditional\(k,μ\)\(k,\\mu\), observable when the substrate has reached its saturation regime” as a working framing, conditional on the limitations in §[6](https://arxiv.org/html/2605.23983#S6)\. Three concrete upgrades would change the paper’s category: \(A\) replication on a non\-toy substrate at trajectory granularity rather than commit granularity, \(B\) a derivation ofμ\\mufrom grammar statistics that makes cross\-substrate transfer falsifiable, and \(C\) tested generalisation to a neural\-guided discovery substrate \(DreamCoder, AlphaProof\) where the parameterμ\\muwould have to be defined operationally for a stochastic proposal mechanism\. We treat the present paper as a starting phenomenology rather than a definitive scaling law\.

## Appendix AMathlib retrospective

The mathlib formal\-mathematics library has grown from∼\\sim170 K LOC in 2020\(The mathlib Community,[2020](https://arxiv.org/html/2605.23983#bib.bib7)\)to∼\\sim2\.1 M LOC in late 2025\(Mathlib maintainers,[2025](https://arxiv.org/html/2605.23983#bib.bib8)\), with intermediate∼\\sim1 M LOC during the Lean3→\\toLean4 port in 2023\. Fitting a power\-law on these three datapoints yields

\|S​\(t\)\|≈4957⋅t2\.87,R2=0\.988\|S\(t\)\|\\approx 4957\\cdot t^\{2\.87\},\\quad R^\{2\}=0\.988withttin years since founding \(2017\)\.We report this as a suggestive precedent only\.Three datapoints is statistically vacuous; LOC is a proxy for substrate size that conflates rule count with proof length; the Lean3→\\toLean4 migration period inflates the middle point\. The result is consistent with our saturating\-power\-law framework in the compounding regime, but does not constitute a rigorous test\. Pulling monthly commit\-history data for a proper mathlib growth analysis requires git\-log infrastructure beyond the present harness\.

## Appendix BAlgorithms

#### Notation\.

SS, current substrate \(set of rewrite rules\);pool\\mathrm\{pool\}, multiset of multi\-arg subterms harvested from accepted rules;freq\\mathrm\{freq\}, usage counter overpool\\mathrm\{pool\};GG, candidate generator \(one ofrandom,compositional,freq,mdl\_greedy\);FF, acceptance filter \(one ofany,novelty\);dd, max recursion depth;KK, batch size;TT, total discovery epochs\.

Algorithm 1Equational substrate discovery1:functionDiscover\(

𝒟,G,F,d,K,T\\mathcal\{D\},G,F,d,K,T\)

2:

S←InitialRules​\(𝒟\)S\\leftarrow\\textsc\{InitialRules\}\(\\mathcal\{D\}\)
3:

pool←∅;freq←∅;sizes←\[\]\\mathrm\{pool\}\\leftarrow\\emptyset;\\ \\mathrm\{freq\}\\leftarrow\\emptyset;\\ \\mathrm\{sizes\}\\leftarrow\[\\,\]
4:for

t←1t\\leftarrow 1to

TTdo

5:

cands←\[G\(pool,freq,d\)for1\.\.K\]\\mathrm\{cands\}\\leftarrow\[G\(\\mathrm\{pool\},\\mathrm\{freq\},d\)\\text\{ for \}1\.\.K\]
6:

nfs←\[Normalize​\(c,S\)​for​c∈cands\]\\mathrm\{nfs\}\\leftarrow\[\\textsc\{Normalize\}\(c,S\)\\text\{ for \}c\\in\\mathrm\{cands\}\]
7:

𝒢←GroupBy​\(cands,nfs\)\\mathcal\{G\}\\leftarrow\\textsc\{GroupBy\}\(\\mathrm\{cands\},\\mathrm\{nfs\}\)⊳\\trianglerightgroup by normal form

8:for

g∈𝒢g\\in\\mathcal\{G\}with

\|g\|≥2\|g\|\\geq 2do

9:

ℓ′←arg⁡maxx∈g⁡\|x\|;r′←arg⁡minx∈g⁡\|x\|\\ell^\{\\prime\}\\leftarrow\\arg\\max\_\{x\\in g\}\|x\|;\\ r^\{\\prime\}\\leftarrow\\arg\\min\_\{x\\in g\}\|x\|
10:if

Sound​\(ℓ′,r′\)\\textsc\{Sound\}\(\\ell^\{\\prime\},r^\{\\prime\}\)and

\|ℓ′\|\>\|r′\|\|\\ell^\{\\prime\}\|\>\|r^\{\\prime\}\|then

11:

\(ℓ,r\)←Generalize​\(ℓ′,r′\)\(\\ell,r\)\\leftarrow\\textsc\{Generalize\}\(\\ell^\{\\prime\},r^\{\\prime\}\)⊳\\trianglerightfree vars↦\\mapstopattern vars

12:if

F\.Passes​\(ℓ,r,S\)F\.\\textsc\{Passes\}\(\\ell,r,S\)then

13:

S←S∪\{\(ℓ,r\)\}S\\leftarrow S\\cup\\\{\(\\ell,r\)\\\}
14:

pool←pool∪Subterms​\(ℓ′\)∪Subterms​\(r′\)\\mathrm\{pool\}\\leftarrow\\mathrm\{pool\}\\cup\\textsc\{Subterms\}\(\\ell^\{\\prime\}\)\\cup\\textsc\{Subterms\}\(r^\{\\prime\}\)
15:

freq←Increment​\(freq,pool\)\\mathrm\{freq\}\\leftarrow\\textsc\{Increment\}\(\\mathrm\{freq\},\\mathrm\{pool\}\)
16:endif

17:endif

18:endfor

19:

sizes\.Append​\(\|S\|\)\\mathrm\{sizes\}\.\\textsc\{Append\}\(\|S\|\)
20:endfor

21:return

S,sizesS,\\mathrm\{sizes\}
22:endfunction

Sound\(ℓ,r\)\(\\ell,r\)evaluates both terms onnsn\_\{s\}random environments \(12 for arith/list, exhaustive 8 worlds for bool\) and returns true iff outputs match exactly\.Generalizereplaces free variables\{x,y,z,p,q,r,x​s,y​s\}\\\{x,y,z,p,q,r,xs,ys\\\}with pattern variables\{A,B,…\}\\\{A,B,\\ldots\\\}via a shared substitution map so thatℓ\\ellandrrremain semantically linked\.Normalizeapplies rules in hit\-count\-descending order until a fixpoint or 48\-step cap\.

Algorithm 2Scaling\-law fit, within\-substrate prediction, cross\-substrate test1:functionAnalyse\(

\{𝒯i\}\\\{\\mathcal\{T\}\_\{i\}\\\}\)⊳\\triangleright𝒯i\\mathcal\{T\}\_\{i\}: trajectory\(t,\|S​\(t\)\|\)\(t,\|S\(t\)\|\)with metadata𝐱i\\mathbf\{x\}\_\{i\}

2:foreach

𝒯i\\mathcal\{T\}\_\{i\}do

3:

a^i,b^i←arg⁡mina,b​∑t\(log⁡\|S​\(t\)\|i−log⁡\(a​tb\)\)2\\hat\{a\}\_\{i\},\\hat\{b\}\_\{i\}\\leftarrow\\arg\\min\_\{a,b\}\\sum\_\{t\}\\big\(\\log\|S\(t\)\|\_\{i\}\-\\log\(at^\{b\}\)\\big\)^\{2\}
4:

AICi\(pl\),AICi\(str\),AICi\(sat\)←FitAndAIC​\(𝒯i,Eqs\.[3](https://arxiv.org/html/2605.23983#S3.E3)–[4](https://arxiv.org/html/2605.23983#S3.E4)\)\\mathrm\{AIC\}^\{\(\\mathrm\{pl\}\)\}\_\{i\},\\mathrm\{AIC\}^\{\(\\mathrm\{str\}\)\}\_\{i\},\\mathrm\{AIC\}^\{\(\\mathrm\{sat\}\)\}\_\{i\}\\leftarrow\\textsc\{FitAndAIC\}\(\\mathcal\{T\}\_\{i\},\\text\{Eqs\.\\,\\ref\{eq:powerlaw\}\-\-\\ref\{eq:saturating\}\}\)
5:endfor

6:

7:*In\-substrate*: 5\-fold CV on

\(𝐱i,b^i\)\(\\mathbf\{x\}\_\{i\},\\hat\{b\}\_\{i\}\)per substrate

⇒Rwithin2\\Rightarrow R^\{2\}\_\{\\mathrm\{within\}\}
8:*Cross\-substrate*: train on substrates

𝒜∪ℬ\\mathcal\{A\}\\cup\\mathcal\{B\}, test on

𝒞⇒Rcross2\\mathcal\{C\}\\Rightarrow R^\{2\}\_\{\\mathrm\{cross\}\}
9:*Pooled*: train on all data with categoricaldomainfeature added

10:then 5\-fold CV

⇒Rpool2\\Rightarrow R^\{2\}\_\{\\mathrm\{pool\}\}
11:return

\{Rwithin2,Rcross2,Rpool2\}\\\{R^\{2\}\_\{\\mathrm\{within\}\},R^\{2\}\_\{\\mathrm\{cross\}\},R^\{2\}\_\{\\mathrm\{pool\}\}\\\}, AIC winner counts per window

12:endfunction

#### Implementation notes\.

Algorithm[1](https://arxiv.org/html/2605.23983#alg1)is∼\\sim300 lines Python with hit\-count rule ordering and lazy generalization\. Per\-epoch wall\-clock ranges from0\.10\.1ms \(list,\|S\|<50\|S\|<50\) to1010s \(arithmdl\_greedy\+noveltyat\|S\|\>500\|S\|\>500\)\. Algorithm[2](https://arxiv.org/html/2605.23983#alg2)usesscipy\.optimize\.curve\_fitfor nonlinear least\-squares \(stretched\-exp, saturating\) and closed\-form OLS on log\-log \(power\-law\); architecture regression issklearn\.ensemble\.GradientBoostingRegressorwith200200estimators andmax\_depth=3=3\. Architecture features used:generator\(4\-way one\-hot\),filter\(2\-way one\-hot\),depth\(int\),batch\_size\(int\)\. Total compute for all reported experiments:∼\\sim45 min wall\-clock on eight cores\.

## Appendix CFunctional\-form discrimination at short range

For completeness we report AIC/BIC discrimination across all 213 Phase A trajectories with≥8\\geq 8epochs and≥10\\geq 10rules at short range\. Power\-law wins 16% \(AIC\) / 22% \(BIC\); stretched\-exponential wins 49% / 41%; linear wins 22% / 28%; log\-normal wins 13% / 9%\. Restricting to the compounding regime \(mdl\_greedy \+ novelty \+ depth=3\), power\-law wins 43%, consistent with the prediction that the compounding regime is where the short\-range power\-law approximation is closest\. The long\-range result in §[5\.4](https://arxiv.org/html/2605.23983#S5.SS4)\(saturating power\-law wins 5/5 at≥200\\geq 200epochs\) is the resolution of the short\-range ambiguity in our toy data\.

## Appendix DCross\-substrate\(k,μ\)\(k,\\mu\)on short\-range toy data

We fit the saturating power\-law to every short\-range trajectory and report the per\-substrate\(k,μ\)\(k,\\mu\)distributions \(after filtering to stable fits,\|k\|<5\|k\|<5and\|μ\|<0\.1\|\\mu\|<0\.1, which retains 75%, 89%, and 63% of trajectories in arith, bool, and list respectively\)\.

Thekkdistributions are tightly clustered \(∼0\.88\\sim 0\.88\.\.0\.950\.95across substrates\); theμ\\mudistributions are statistically indistinguishable from zero\. We interpret this as the expected consequence of fitting a saturating form on short trajectories where the saturation knee is not reached:μ\\muis not identified, and the saturating fit becomes a re\-parameterised power\-law\. This is consistent with the OOS\-forecasting result in §[5\.5](https://arxiv.org/html/2605.23983#S5.SS5)that pure power\-law wins extrapolation on toy data\.

## Appendix EToward a more rigorousμ\\mu

Our closure model definesμ\\muas “the expected coverage fraction of a typical rule in the candidate space” under uniform\-coverage and independence assumptions\. We did not deriveμ\\mufrom grammar statistics in this paper, but here we record what such a derivation would have to address\.

LetΓ=\(Σ,𝒯,≤d\)\\Gamma=\(\\Sigma,\\mathcal\{T\},\\leq d\)be the typed term grammar of the substrate, withΣ=\{fi\}\\Sigma=\\\{f\_\{i\}\\\}the set of function symbols of arityaia\_\{i\}, and let𝒯d\\mathcal\{T\}\_\{d\}denote the set of well\-typed terms up to depthdd\. The size of𝒯d\\mathcal\{T\}\_\{d\}obeys the generating\-function recurrence

\|𝒯d\|=∑i\|𝒯d−1\|ai\+\(leaf count\),\|\\mathcal\{T\}\_\{d\}\|=\\sum\_\{i\}\|\\mathcal\{T\}\_\{d\-1\}\|^\{a\_\{i\}\}\+\\text\{\(leaf count\)\},which gives\|𝒯d\|=O​\(camaxd\)\|\\mathcal\{T\}\_\{d\}\|=O\(c^\{a\_\{\\max\}^\{d\}\}\)forccthe maximum fan\-out\.

A rule\(ℓ,r\)∈S\(\\ell,r\)\\in Sdefines a coverage setCov​\(ℓ\)=\{t∈𝒯d:∃σ\.t=σ​\(ℓ\)\}\\mathrm\{Cov\}\(\\ell\)=\\\{t\\in\\mathcal\{T\}\_\{d\}:\\exists\\sigma\.\\,t=\\sigma\(\\ell\)\\\}, whereσ\\sigmaranges over substitutions of pattern variables\. Ifℓ\\ellcontainsvvpattern variables and has\|ℓ\|\|\\ell\|size, then\|Cov​\(ℓ\)\|≈\|𝒯d−\|ℓ\|\|v\|\\mathrm\{Cov\}\(\\ell\)\|\\approx\|\\mathcal\{T\}\_\{d\-\|\\ell\|\}\|^\{v\}to leading order indd\. The expected coverage fraction, averaged over the distribution of rules actually committed, is

μ=𝔼\(ℓ,r\)∼p​\(S\)​\[\|Cov​\(ℓ\)\|\|𝒯d\|\]=𝔼ℓ​\[camaxd−\|ℓ\|​v\]⋅c−amaxd\.\\mu=\\mathbb\{E\}\_\{\(\\ell,r\)\\sim p\(S\)\}\\\!\\\!\\left\[\\frac\{\|\\mathrm\{Cov\}\(\\ell\)\|\}\{\|\\mathcal\{T\}\_\{d\}\|\}\\right\]\\;=\\;\\mathbb\{E\}\_\{\\ell\}\\\!\\left\[c^\{a\_\{\\max\}^\{d\-\|\\ell\|\}v\}\\right\]\\cdot c^\{\-a\_\{\\max\}^\{d\}\}\.This is dominated by the average rule complexity\|ℓ\|\|\\ell\|and variable countvv, both of which are properties of the substrate’s type system\. Substrates with more constrained types \(higher\-order list with explicit typing\) yield smaller averagevvand thus smallerμ\\mu; flat\-typed substrates \(arith over \(\+,\*\)\) yield largervvand largerμ\\mu\. This is qualitatively consistent with our empirical finding that the novelty filter suppresses growth in the list domain and amplifies it in arith\+bool, but the quantitative connection requires \(a\) measuring the actual distributionp​\(ℓ,r∣S\)p\(\\ell,r\\mid S\)from a substrate’s history, \(b\) estimating\|Cov\|\|\\mathrm\{Cov\}\|for each rule by direct combinatorial counting in𝒯d\\mathcal\{T\}\_\{d\}, and \(c\) verifying the independence assumption \(A2\) by computing pairwise overlap ofCov​\(ℓi\)∩Cov​\(ℓj\)\\mathrm\{Cov\}\(\\ell\_\{i\}\)\\cap\\mathrm\{Cov\}\(\\ell\_\{j\}\)for committed rules\. None of these are done in this paper; we note them as the natural follow\-up that would convert the heuristic in §[3](https://arxiv.org/html/2605.23983#S3)into a derivation ofμ\\mufrom substrate statistics\.

## References

- Kaplan et al\. \(2020\)Kaplan, J\., McCandlish, S\., Henighan, T\., Brown, T\. B\., Chess, B\., Child, R\., Gray, S\., Radford, A\., Wu, J\., Amodei, D\.Scaling laws for neural language models\.*arXiv:2001\.08361*, 2020\.
- Hoffmann et al\. \(2022\)Hoffmann, J\., Borgeaud, S\., Mensch, A\., et al\.Training compute\-optimal large language models\.*arXiv:2203\.15556*, 2022\.
- Liu et al\. \(2025\)Liu, Y\., Nan, Y\., Xu, W\., Hu, X\., Ye, L\., Qin, Z\., Liu, P\.AlphaGo moment for model architecture discovery\.*arXiv:2507\.18074*, 2025\.
- Ellis et al\. \(2021\)Ellis, K\., Wong, C\., Nye, M\., Sablé\-Meyer, M\., Morales, L\., Hewitt, L\., Cary, L\., Solar\-Lezama, A\., Tenenbaum, J\.DreamCoder: bootstrapping inductive program synthesis with wake\-sleep library learning\.In*PLDI*, 2021\.
- Nandi et al\. \(2021\)Nandi, C\., Willsey, M\., Zhu, A\., Wang, Y\. R\., Saiki, B\., Anderson, A\., Schulz, A\., Grossman, D\., Tatlock, Z\.Rewrite rule inference using equality saturation\.In*OOPSLA*, 2021\.
- Cropper and Morel \(2021\)Cropper, A\., Morel, R\.Learning programs by learning from failures\.In*IJCAI*, 2021\.
- The mathlib Community \(2020\)The mathlib Community\.The Lean mathematical library\.*arXiv:1910\.09336*, 2020\.
- Mathlib maintainers \(2025\)Mathlib maintainers\.Growing mathlib\.*arXiv:2508\.21593*, 2025\.

Similar Articles

Evaluation-driven Scaling for Scientific Discovery

Hugging Face Daily Papers

SimpleTES framework scales evaluation-driven discovery loops across 21 scientific problems, yielding 2× speedups on LASSO, 24.5% quantum gate reductions, and new Erdos constructions while enabling trajectory-level model post-training.

On the Smallness of the Large Language Models Scaling Exponents

arXiv cs.AI

The paper discusses the small scaling exponents of large language models, arguing that they indicate an unsustainable regime in terms of energy resources. It also examines the 'pedestal effect' and draws analogies with fluid turbulence to comment on data smoothness.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Hugging Face Daily Papers

This paper introduces AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies for LLMs by formulating it as controller synthesis. It demonstrates improved accuracy-cost tradeoffs on mathematical reasoning benchmarks with minimal computational overhead.