Staged Factorial Screening for Budget-Constrained Micro-Pretraining


Summary

This paper proposes a staged factorial screening workflow for budget-constrained micro-pretraining, demonstrating that short designed experiments can identify stable hyperparameter penalty directions and support a screen-then-refine strategy.

arXiv:2606.05186v1 Announce Type: new Abstract: Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.

# Staged Factorial Screening for Budget-Constrained Micro-Pretraining
Source: [https://arxiv.org/html/2606.05186](https://arxiv.org/html/2606.05186)
(2026-04-27)

###### Abstract

Budget\-constrained micro\-pretraining is common in automated research loops because many candidate recipes must be triaged on shared accelerators before larger search budgets are spent\. Best\-so\-far trajectories can find better recipes, but they do not identify which factors drive early performance differences\. We test whether a staged fractional\-factorial workflow can recover stable early effect structure under strict wall\-clock budgets\.

On a fixed autoresearch-derived single-GPU training loop we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 h. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Rerun-complete D-fixed follow-up analysis keeps interactions present but smaller in absolute magnitude than the main penalties. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12 h and 24 h three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive.

We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 h on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.

## 1\. Introduction

Automated training workflows make it easy to launch many short runs, but they do not make early\-training behavior easier to interpret\. A best\-so\-far trajectory can locate a better recipe without identifying whether the gain came from depth, width, batch size, learning rate, or a specific interaction among them\. If the goal is early structural understanding rather than only incumbent improvement, experimental design matters\.

This paper studies that design problem in the budget\-constrained micro\-pretraining regime\. We ask whether a compact staged factorial screen can identify stable early penalty directions under strict wall\-clock budgets, and whether the resulting signals remain strong enough to support a practical screen\-then\-refine workflow\. The focus is not long\-horizon convergence or full cross\-hardware equivalence\. The focus is what can be learned early, cheaply, and reproducibly on a runnable training loop before larger search budgets are committed, with a bounded later cross\-host check used to test whether the strongest anchor\-level signal remains visible outside the original runtime path\.

We test two bounded hypotheses on the main host. H1: the dominant early main-effect penalties from total batch and model size relax materially as wall-clock budget increases from 2 to 10 minutes. H2: after screening and local bridge refinement, the reduced centered-width region remains informative at 60 minutes and later anchor-level continuations by preserving separation from the predeclared control on the main host and staying competitive with or better than the bounded greedy incumbent.

Our evidence supports a narrow methodological conclusion. Under 2- to 10-minute budgets on the main host, batch and model-size penalties dominate the early effect structure and relax substantially as budget increases. Full seeded reruns of the 5- and 10-minute screens show that D, A, B, and C retain non-zero estimates after within-budget BH-FDR correction while E does not. Rerun-complete follow-up analysis shows that interactions remain real but secondary in absolute size. A targeted multi-seed confirmation layer shows that budget and condition structure dominate seed identity in the selected anchor subset. A matched-cost random baseline shows that strong incumbents can also be found by chance in this small space, but mostly by landing in the same low-penalty region identified by the screen. A later 60-minute bridge package over four predeclared anchors shows that the reduced-space bridge region remains operationally informative at a longer horizon, subject to an unresolved capacity confound inside that anchor set. Still later 12 h and 24 h three-anchor continuations on Windows and Linux keep the bridge anchor at the lowest sample mean on both hosts while again showing that the rest of the anchor hierarchy remains host-sensitive. Taken together, these results support a staged workflow for this regime: screen early, confirm anchor regions, then refine locally inside the reduced space.

## 2\. Contributions

This paper makes three contributions\.

1. It presents a staged short-horizon screening methodology for micro-pretraining: factorial screen, focused confirmation, and local refinement inside a reduced space.
2. It provides host-bounded empirical evidence that early recipe effects are strongly budget-dependent, with main penalties from total batch and model size relaxing substantially between 2 and 10 minutes.
3. It demonstrates a bounded empirical pattern in this regime: D, A, B, and C retain seeded short-budget effects while E does not; random search reaches competitive incumbents without factor attribution; and the bridge-centered anchor has the lowest sample mean through small dual-host 12 h and 24 h continuations even though non-bridge ordering remains host-sensitive.

## 3\. Related Work

Autoresearch\-style systems focus on fast experiment throughput and best\-so\-far progress rather than designed attribution \[1, 2\]\.

Classical hyperparameter search establishes the baseline comparison set\. Random search is a strong default in high\-dimensional hyperparameter spaces \[9\], and later work on functional ANOVA and random\-forest surrogates estimates which hyperparameters matter after search data have already been collected \[10\]\. Those papers motivate our emphasis on attribution, but they do not study compact staged screening inside a fixed short\-horizon training loop\.

Budgeted hyperparameter optimization methods such as Hyperband and BOHB allocate resources adaptively across configurations to improve anytime incumbent quality under finite budgets \[11, 12\]\. Bayesian optimization provides another major optimizer\-centered line of comparison \[13\]\. Population\-based training and asynchronous successive halving are also natural adaptive alternatives for distributed hyperparameter selection \[20, 21\]\. Their target is efficient optimizer performance over collections of tasks\. Our target is different: direct factor readout on one runnable host before a larger automated search budget is committed\.

Classical design\-of\-experiments provides the screening logic behind this paper\. Fractional\-factorial designs trade aliasing against coverage in order to expose large main effects and a controlled subset of interactions early \[14, 15, 16\]\. Response\-surface methodology and central\-composite\-style follow\-ups are natural DOE extensions once a promising region is found \[17, 18\]\. We apply the screening logic to micro\-pretraining, then add a confirmation layer and a longer\-horizon bridge package tailored to the short\-budget training workflow\.

Large\-scale scaling and optimizer studies provide important context but ask different questions: cross\-scale trends, compute\-optimal training rules, transfer rules, and architecture\-aware defaults \[3, 4, 19\], optimizer behavior under larger training horizons \[5, 6, 7, 8\], or learning\-rate schedule diagnostics such as cyclical learning rates and super\-convergence \[22, 23\]\. The gap we address is narrower\. We ask whether a staged designed screen can recover useful factor structure early enough to reduce the search space before heavier automated search begins on the same host\-bounded training loop\.

## 4\. Experimental Setup

### 4\.1 Platform and Baseline

The main study runs execute on a remote Windows host with one NVIDIA A100 40GB GPU. The harness is an autoresearch seed baseline adapted to this runtime path (SDPA fallback and disabled `torch.compile`). We treat this as the primary measured environment, not an implementation footnote. A later bounded replication package is run on a separate Linux host with one NVIDIA L40S GPU in order to test whether the bridge-centered anchor result remains visible after a host, operating-system, and accelerator change. The Linux package is intentionally not a like-for-like second-A100 replication; it is a bounded portability check on the available independent Linux accelerator, so positive transfer supports directional robustness rather than matched-hardware equivalence.

The original screening, greedy, bridge, and D-fixed rerun blocks were executed under an earlier fixed-seed harness. For the later confirmation layers we added explicit `RUN_SEED` support and ran both a targeted 90-run anchor subset across 2, 5, and 10 minutes and full 16-condition seeded screens at 5 and 10 minutes (160 rows total). The paper therefore contains both legacy fixed-seed blocks and later multi-seed packages: seed variation is quantified directly on selected anchors across all three budgets and on the full screening design at 5 and 10 minutes.

The frozen runner uses locally cached training shards from `karpathy/climbmix-400b-shuffle`, with `shard_06542.parquet` pinned as the validation shard. Tokenization uses a rustbpe-trained, tiktoken-compatible BPE with vocabulary 8192. All runs use context length 2048 and report final `val_bpb` over `40 * 524288` validation tokens from the pinned shard. Appendix A records the dataset URL, shard identifier, tokenizer artifacts, source snapshot, and reproducibility bundle contents.

Model families are parameterized by `DEPTH` and `ASPECT_RATIO` on a Transformer-style causal language-model runner [24]. For a given depth, the nominal width is `DEPTH * ASPECT_RATIO`, rounded up to the next multiple of `HEAD_DIM=128`; the attention head count is then `n_embd / 128`, and the attention window pattern is fixed to `SSSL` with the last layer forced to full context. Optimization is also frozen except for the experimental factors: matrix-valued transformer weights use Muon; embeddings, unembedding, and scalar parameters use AdamW; AdamW betas are (0.8, 0.95); warmup ratio is fixed to 0.0; and the Windows fallback path uses device batch size 32 with gradient accumulation to realize the configured `TOTAL_BATCH_SIZE`.
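To make the width and batch arithmetic concrete, the following is a minimal sketch rather than the runner's actual code. The function names are hypothetical, "rounded up" is read as leaving exact multiples unchanged, and the gradient-accumulation helper assumes that `TOTAL_BATCH_SIZE` counts tokens per optimizer step, which the text does not state explicitly.

```python
HEAD_DIM = 128          # from the text
DEVICE_BATCH_SIZE = 32  # Windows fallback path, from the text
CONTEXT_LENGTH = 2048   # from the setup description


def model_width(depth: int, aspect_ratio: int, head_dim: int = HEAD_DIM) -> int:
    """Nominal width DEPTH * ASPECT_RATIO, rounded up to a multiple of head_dim."""
    nominal = depth * aspect_ratio
    return -(-nominal // head_dim) * head_dim  # integer ceiling division


def n_heads(n_embd: int, head_dim: int = HEAD_DIM) -> int:
    """Attention head count n_embd / head_dim."""
    return n_embd // head_dim


def grad_accum_steps(total_batch_size: int) -> int:
    """Micro-steps per optimizer step.

    ASSUMPTION: total_batch_size is measured in tokens, so one micro-step
    processes DEVICE_BATCH_SIZE * CONTEXT_LENGTH tokens.
    """
    tokens_per_micro_step = DEVICE_BATCH_SIZE * CONTEXT_LENGTH
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError("total batch size must be a multiple of one micro-step")
    return total_batch_size // tokens_per_micro_step


if __name__ == "__main__":
    for depth in (6, 8):
        for aspect_ratio in (48, 64, 72):
            width = model_width(depth, aspect_ratio)
            print(depth, aspect_ratio, width, n_heads(width))
    for total in (262144, 524288):
        print(total, grad_accum_steps(total))
```

Under these assumptions, several distinct `(DEPTH, ASPECT_RATIO)` pairs can round to the same width, which is worth keeping in mind when reading width effects.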

Primary outcome is final validation bits per byte \(val\_bpb, lower is better\), computed as per\-token cross\-entropy in nats on the pinned validation stream divided by UTF\-8 target bytes and converted to bits per byte; special tokens are excluded from both sums\.
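The metric definition can be sketched as follows. This is an illustrative reading of the sentence above, not the runner's implementation: the function name and argument layout are hypothetical, and the inputs are assumed to be per-token negative log-likelihoods in nats together with the UTF-8 byte length of each target token.

```python
import math
from typing import Sequence


def bits_per_byte(
    token_nll_nats: Sequence[float],
    token_target_bytes: Sequence[int],
    is_special: Sequence[bool],
) -> float:
    """Summed cross-entropy (nats) over summed target bytes, converted to bits.

    Special tokens are dropped from both the numerator and the denominator.
    """
    total_nats = 0.0
    total_bytes = 0
    for nll, n_bytes, special in zip(token_nll_nats, token_target_bytes, is_special):
        if special:
            continue
        total_nats += nll
        total_bytes += n_bytes
    if total_bytes == 0:
        raise ValueError("no non-special target bytes")
    return total_nats / (math.log(2) * total_bytes)  # nats -> bits


if __name__ == "__main__":
    # Hypothetical toy stream: two ordinary tokens and one special token.
    print(bits_per_byte([math.log(2), math.log(2), 9.0], [1, 1, 4], [False, False, True]))
```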

### 4\.2 Factors

Five pilot factors:

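| Code | Factor | Low | High |
| --- | --- | --- | --- |
| A | `DEPTH` | 6 | 8 |
| B | `ASPECT_RATIO` | 48 | 72 |
| C | `MATRIX_LR` | 0.03 | 0.05 |
| D | `TOTAL_BATCH_SIZE` | 262144 | 524288 |
| E | `WARMDOWN_RATIO` | 0.25 | 0.50 |
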
### 4\.3 Screening Design Definition

The pilot screen is a regular $2^{5-1}$ fractional-factorial design with generator `E = A*B*C*D`, equivalently defining relation `I = A*B*C*D*E`. This is a resolution-V design. Main effects are therefore aliased only with four-factor terms, while each two-factor term is aliased with a complementary three-factor term. For example, `A` is aliased with `BCDE`, `B` with `ACDE`, and `A:B` with `CDE`.

We use the pilot screen strictly as a main-effect screening design. The paper does not claim that the initial 16-run pilot by itself identifies unaliased two-factor interactions. Interaction discussion is moved to the separate D-fixed follow-up package, where the design and model are tailored to the reduced factor set.
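The design definition can be checked with a short sketch. It enumerates the 16 runs implied by the generator `E = A*B*C*D` and verifies the stated aliasing. The row order produced here is arbitrary and is not claimed to match the paper's condition numbering (`c01` to `c16`).

```python
from itertools import product
from math import prod

FACTORS = ("A", "B", "C", "D", "E")


def half_fraction() -> list[dict[str, int]]:
    """Enumerate the 2^(5-1) design with generator E = A*B*C*D (coded -1/+1)."""
    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        runs.append({"A": a, "B": b, "C": c, "D": d, "E": a * b * c * d})
    return runs


def column(runs: list[dict[str, int]], term: str) -> list[int]:
    """Contrast column for a term such as 'A', 'AB', or 'BCDE'."""
    return [prod(run[f] for f in term) for run in runs]


design = half_fraction()

if __name__ == "__main__":
    print(len(design), "runs")
    print("A aliased with BCDE:", column(design, "A") == column(design, "BCDE"))
    print("A:B aliased with CDE:", column(design, "AB") == column(design, "CDE"))
```

Because every main-effect column is balanced and orthogonal to the others, each main effect can be estimated from all 16 runs at once, which is the property the screen relies on.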

### 4\.4 Statistical Methods

All modeled factors use coded levels {-1, +1}. Under this coding, a high-minus-low effect is `2*beta`, where `beta` is the fitted coefficient in the corresponding linear model.

Condition-level means and 95% confidence intervals are reported from repeated runs within each cell using Student-t intervals based on the sample standard deviation and the within-cell replicate count. These intervals appear in the seeded condition summaries, the 60-minute bridge package, the Linux cross-host anchor package, and the later 12 h and 24 h anchor continuations.
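The `2*beta` relationship can be illustrated on a balanced design. The sketch below uses a purely synthetic response (the coefficients 0.02 and 0.01 are made up for illustration and are not results from the paper) and relies on the fact that, for an orthogonal ±1 design with an intercept, the least-squares coefficient reduces to `sum(x * y) / n`.

```python
from itertools import product

# 2^(5-1) design with generator E = A*B*C*D; columns are (A, B, C, D, E).
design = [(a, b, c, d, a * b * c * d) for a, b, c, d in product((-1, 1), repeat=4)]


def synthetic_val_bpb(run: tuple[int, ...]) -> float:
    """HYPOTHETICAL response: only A and D matter, with made-up coefficients."""
    a, _, _, d, _ = run
    return 1.10 + 0.02 * a + 0.01 * d


y = [synthetic_val_bpb(run) for run in design]


def beta(j: int) -> float:
    """OLS coefficient for column j in an orthogonal +/-1 design."""
    return sum(run[j] * yi for run, yi in zip(design, y)) / len(design)


def high_minus_low(j: int) -> float:
    """Mean response at the high level minus mean response at the low level."""
    high = [yi for run, yi in zip(design, y) if run[j] == 1]
    low = [yi for run, yi in zip(design, y) if run[j] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


if __name__ == "__main__":
    for j, name in enumerate("ABCDE"):
        print(name, round(high_minus_low(j), 6), round(2 * beta(j), 6))
```

In a balanced seeded design an additive seed term is orthogonal to the factor columns, so adding it should leave these point estimates unchanged while changing the residual degrees of freedom.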

For the seeded full\-screen reruns, we fit separate budget\-specific ordinary least\-squares models

val\_bpb ~ A \+ B \+ C \+ D \+ E \+ seed\_factor

at 5 and 10 minutes. We report high-minus-low effects, two-sided coefficient p-values, and 95% Wald intervals using the residual degrees of freedom from each fitted model. For these seeded full-screen main effects, we additionally apply Benjamini-Hochberg false-discovery-rate control within each predeclared five-effect family at a fixed budget. At both 5 and 10 minutes, D, A, B, and C survive BH-FDR at q = 0.05, while E does not. In this paper, "retain a non-zero estimate" for the seeded full screens means that the reported 95% interval excludes zero and the corresponding main effect also survives that within-budget BH correction.

As a sensitivity check on the correction family, applying BH-FDR once across the combined 10 seeded full-screen main-effect tests (5 factors x 2 budgets) gives the same qualitative retention set: D, A, B, and C survive at both budgets, while E does not. We retain the within-budget presentation because the models and decision questions are budget-specific, but the conclusion does not depend on that split.
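The correction step can be reproduced from the rounded p-values reported in Section 5.1. The sketch below is a plain-Python Benjamini-Hochberg adjustment, not the authors' analysis script. Because the inputs are rounded, the adjusted values agree with the reported BH q column only to rounding.

```python
def bh_adjust(pvalues: dict[str, float]) -> dict[str, float]:
    """Benjamini-Hochberg adjusted p-values (step-up, monotone from the largest p)."""
    m = len(pvalues)
    ordered = sorted(pvalues.items(), key=lambda kv: kv[1])
    adjusted = {}
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p-value
        name, p = ordered[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[name] = running_min
    return adjusted


def survivors(pvalues: dict[str, float], q: float = 0.05) -> list[str]:
    """Names whose BH-adjusted p-value is at or below q."""
    return sorted(k for k, v in bh_adjust(pvalues).items() if v <= q)


# Rounded p-values as reported for the seeded full screens.
P_5MIN = {"A": 1.98e-11, "B": 3.57e-08, "C": 0.0232, "D": 9.01e-18, "E": 0.454}
P_10MIN = {"A": 8.86e-14, "B": 0.00319, "C": 0.0219, "D": 9.03e-28, "E": 0.501}

# Sensitivity check: one family of 10 tests across both budgets.
P_COMBINED = {f"{k}@5": v for k, v in P_5MIN.items()}
P_COMBINED.update({f"{k}@10": v for k, v in P_10MIN.items()})

if __name__ == "__main__":
    print("5 min :", survivors(P_5MIN))
    print("10 min:", survivors(P_10MIN))
    print("pooled:", survivors(P_COMBINED))
```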

For the targeted seed\-confirmation subset, we fit the fixed\-effects model

val\_bpb ~ C\(budget\_factor\) \* C\(condition\_factor\) \+ C\(seed\_factor\)

and summarize variance shares with Type-II ANOVA `eta_sq = SS_term / SS_total`. These `eta_sq` values are descriptive effect-size summaries, not variance-component estimates from a random-effects model.

For the D\-fixed follow\-up, we analyze the legacy fixed\-seed regime separately from the later explicit\-seed packages\. The pooled follow\-up model is

val\_bpb ~ budget10 \+ A \+ B \+ C \+ E \+ A:B \+ A:C \+ B:C \+ A:E \+ B:E \+ C:E \+ budget10:\(A \+ B \+ C \+ E \+ A:B \+ A:C \+ B:C \+ A:E \+ B:E \+ C:E\)

using only the original and rerun D\-fixed blocks from that same regime\. The reported intervals are model\-based Wald intervals from the fitted covariance matrix and should be read as rerun\-variability summaries within the fixed\-seed regime, not as independent\-seed inference\.

Because this block lacks independent seed variation, its intervals are expected to understate the variability a broad independently seeded interaction study would see\. We therefore use the D\-fixed block only to diagnose interaction structure inside the legacy regime, not as a multiplicity\-corrected decision layer\.

The pairwise dominance counts (100/100, 16/16, and similar) are descriptive cross-seed win counts over all left-right seed products, with denominator `n_left * n_right`. Because each seed value participates in multiple pairings, we do not treat those denominators as independent Bernoulli trials, and we do not attach p-values or confidence intervals to them.

We report raw two-sided p-values for transparency. Outside the seeded full-screen main-effect families, these p-values are descriptive diagnostics rather than multiplicity-corrected decision rules.
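A minimal sketch of how such a win count is formed, assuming "win" means a strictly lower `val_bpb` on the left side. The function name and the example values are hypothetical placeholders, not measurements from the paper.

```python
from itertools import product
from typing import Sequence


def win_count(left: Sequence[float], right: Sequence[float]) -> tuple[int, int]:
    """Count pairings where the left run has strictly lower val_bpb.

    The denominator is len(left) * len(right): every left seed is compared
    with every right seed, so pairings share runs and are not independent.
    """
    wins = sum(1 for l, r in product(left, right) if l < r)
    return wins, len(left) * len(right)


if __name__ == "__main__":
    # HYPOTHETICAL per-seed values for two anchors with four seeds each.
    anchor_left = [0.9741, 0.9752, 0.9738, 0.9749]
    anchor_right = [0.9815, 0.9822, 0.9809, 0.9827]
    wins, total = win_count(anchor_left, anchor_right)
    print(f"{wins}/{total}")
```

With four seeds per anchor the denominator is 16, and with ten runs per side it is 100, which matches the denominators quoted in the text.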

### 4\.5 Run Accounting and Reproducibility

Table 1 enumerates the completed packages that feed the present manuscript. Using this accounting, the paper currently draws on 613 completed runs: 569 on the Windows A100 host and 44 on the Linux L40S host.

| package | host | regime | budgets | rows |
| --- | --- | --- | --- | --- |
| pilot screening | Windows A100 | legacy fixed-seed | 2, 5, 10 min | 48 |
| extreme-condition replication | Windows A100 | legacy fixed-seed | 2, 10 min | 20 |
| D-fixed follow-up original + rerun | Windows A100 | legacy fixed-seed | 5, 10 min | 64 |
| same-host greedy baselines | Windows A100 | legacy fixed-seed | 5, 10 min | 70 |
| centered-width bridge probes | Windows A100 | legacy fixed-seed | 2, 5, 10 min | 9 |
| targeted seed confirmation | Windows A100 | explicit seeded reruns | 2, 5, 10 min | 90 |
| full seeded screens | Windows A100 | explicit seeded reruns | 5, 10 min | 160 |
| random-search baseline | Windows A100 | matched-cost independent draws | 10 min | 80 |
| 60-minute bridge package | Windows A100 | explicit seeded reruns | 60 min | 16 |
| Linux cross-host anchors | Linux L40S | explicit seeded reruns | 10, 60 min | 32 |
| Windows 12h anchors | Windows A100 | explicit seeded reruns | 12 h | 6 |
| Linux 12h anchors | Linux L40S | explicit seeded reruns | 12 h | 6 |
| Windows 24h anchors | Windows A100 | explicit seeded reruns | 24 h | 6 |
| Linux 24h anchors | Linux L40S | explicit seeded reruns | 24 h | 6 |
| total | | | | 613 |

Every package is defined by an explicit matrix or result table with factor codings, decoded hyperparameters, time budget, and seed identifiers where applicable. The manuscript therefore distinguishes data-generation regimes at the package level instead of treating all completed runs as one exchangeable sample. We freeze the upstream baseline commit and the exact local source snapshot used for this branch in Appendix A, together with the package-level matrix files and machine-readable summaries that define the experimental configurations. The later dual-host 24 h packages are treated as bounded three-anchor hardening results rather than as broad reruns of the full design space.
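The totals in Table 1 can be re-derived from the per-package row counts. The snippet below is a bookkeeping check only, with the row counts transcribed from the table.

```python
# (package, host, rows) transcribed from Table 1.
PACKAGES = [
    ("pilot screening", "Windows A100", 48),
    ("extreme-condition replication", "Windows A100", 20),
    ("D-fixed follow-up original + rerun", "Windows A100", 64),
    ("same-host greedy baselines", "Windows A100", 70),
    ("centered-width bridge probes", "Windows A100", 9),
    ("targeted seed confirmation", "Windows A100", 90),
    ("full seeded screens", "Windows A100", 160),
    ("random-search baseline", "Windows A100", 80),
    ("60-minute bridge package", "Windows A100", 16),
    ("Linux cross-host anchors", "Linux L40S", 32),
    ("Windows 12h anchors", "Windows A100", 6),
    ("Linux 12h anchors", "Linux L40S", 6),
    ("Windows 24h anchors", "Windows A100", 6),
    ("Linux 24h anchors", "Linux L40S", 6),
]


def rows_by_host(packages: list[tuple[str, str, int]]) -> dict[str, int]:
    """Sum completed rows per host."""
    totals: dict[str, int] = {}
    for _, host, rows in packages:
        totals[host] = totals.get(host, 0) + rows
    return totals


host_totals = rows_by_host(PACKAGES)
grand_total = sum(host_totals.values())

if __name__ == "__main__":
    print(host_totals, grand_total)
```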

### 4\.6 Experimental Sequence

The study follows ten stages:

1. Pilot screening across 2-, 5-, and 10-minute budgets to estimate the dominant early penalties.
2. Reduced follow-up reruns to test whether interaction structure remains after fixing a major penalty source.
3. Bounded greedy comparison to contrast best-so-far search with the structural readout from designed screening.
4. Matched-cost random-search baseline over five independent 16-draw batches to test whether competitive incumbents appear routinely without structured attribution.
5. Targeted seed confirmation over extremes and bridge points to test whether the main ordering holds across independent seeds in a compact anchor subset.
6. Full seeded-screen reruns over all 16 screening conditions at 5 and 10 minutes to attach seed uncertainty to the main-effect estimates and condition ordering.
7. Longer-horizon bridge package over four predeclared anchors at 60 minutes to test whether the reduced-space ranking remains informative beyond the short screening horizon.
8. Cross-host Linux anchor replication over the same four anchors at 10 and 60 minutes to test whether the bridge-centered signal remains visible outside the original Windows A100 path.
9. Dual-host 12 h anchor continuations over `bridge`, `greedy`, and `control` to test whether the bridge keeps the lowest sample mean in these small descriptive three-anchor continuations on both hosts.
10. Dual-host 24 h anchor continuations over the same three anchors to test whether that bridge-centered sample-mean result persists at a full-day horizon on both hosts.

## 5\. Results

### 5\.1 Budget Curves from Pilot and Seeded Full Screens

Main-effect estimates (high minus low) on `val_bpb`. The 2-minute column is the original single-seed pilot estimate and is used descriptively. The 5- and 10-minute columns are seeded fixed-effects estimates from the full 16-condition reruns (5 seeds per condition).

Pilot 2-minute estimates:

| factor | effect |
| --- | --- |
| A | +0.1112 |
| B | +0.1109 |
| C | -0.0597 |
| D | +0.1847 |
| E | +0.0448 |

Seeded 5-minute estimates:

| factor | effect | 95% CI | p | BH q |
| --- | --- | --- | --- | --- |
| A | +0.0568 | [+0.0426, +0.0710] | 1.98e-11 | 4.94e-11 |
| B | +0.0441 | [+0.0299, +0.0583] | 3.57e-08 | 5.94e-08 |
| C | -0.0165 | [-0.0307, -0.0023] | 0.0232 | 0.0290 |
| D | +0.0818 | [+0.0676, +0.0960] | 9.01e-18 | 4.50e-17 |
| E | +0.0054 | [-0.0088, +0.0196] | 0.454 | 0.454 |

Seeded 10-minute estimates:

| factor | effect | 95% CI | p | BH q |
| --- | --- | --- | --- | --- |
| A | +0.0180 | [+0.0141, +0.0218] | 8.86e-14 | 2.21e-13 |
| B | +0.0059 | [+0.0021, +0.0098] | 0.00319 | 0.00532 |
| C | -0.0045 | [-0.0084, -0.0007] | 0.0219 | 0.0274 |
| D | +0.0346 | [+0.0308, +0.0385] | 9.03e-28 | 4.52e-27 |
| E | -0.0013 | [-0.0052, +0.0026] | 0.501 | 0.501 |

Interpretation:

- D remains the largest penalty at 5 and 10 minutes, with a strong but incomplete relaxation as budget increases.
- A remains the second-largest penalty at both seeded budgets and is clearly non-zero.
- B retains a non-zero estimate after within-budget correction at both seeded budgets, but by 10 minutes it is small relative to A and D.
- C remains beneficial at both seeded budgets, again with clear relaxation.
- E is not distinguishable from zero at either seeded budget, so the earlier single-seed signal does not survive broad reruns.

In practical metric terms, the seeded 10-minute batch-size penalty (D, +0.0346 `val_bpb`) is about six times the width penalty (B, +0.0059) and about eight times the learning-rate benefit magnitude (|C| = 0.0045). These comparisons are within-metric effect-size context only: they help rank recipe choices on validation compression, but they do not establish downstream task impact.

Figure 1 collects those main-effect estimates across budgets and makes the relaxation pattern visible at a glance: the largest short-budget penalties shrink materially by 5 and 10 minutes, while E collapses toward zero under seeded reruns.

![Refer to caption](https://arxiv.org/html/2606.05186v1/x1.png)

Figure 1: Main-effect penalties relax with budget. The 2-minute points come from the original pilot; the 5- and 10-minute points include seeded full-screen error bars.

### 5\.2 Replication at 2 and 10 Minutes

2-minute extreme-condition replication (top 2 + bottom 2 from 2-minute pilot):

| source condition | n | mean `val_bpb` | sd |
| --- | --- | --- | --- |
| 1 | 2 | 1.317674 | 0.000163 |
| 3 | 2 | 1.274105 | 0.000267 |
| 14 | 2 | 1.719449 | 0.000733 |
| 16 | 2 | 1.675783 | 0.000358 |

10-minute extreme-condition replication (top 4 + bottom 2 from 10-minute pilot):

| source condition | n | mean `val_bpb` | sd |
| --- | --- | --- | --- |
| 1 | 2 | 1.093624 | 0.000218 |
| 3 | 2 | 1.095242 | 0.000131 |
| 5 | 2 | 1.092018 | 0.000128 |
| 7 | 2 | 1.090640 | 0.000668 |
| 14 | 2 | 1.152082 | 0.000310 |
| 16 | 2 | 1.142647 | 0.000116 |

Interpretation:

- Replications preserve strong separation between best and worst regions.
- Variance is low, and top-vs-bottom separation is consistent within this sampled condition set.

### 5\.3 D\-Fixed Follow\-Up Reruns at 5 vs 10 Minutes

We analyze the D-fixed follow-up separately because it belongs to the legacy fixed-seed regime. Within that regime, we combine the original and full-rerun D-fixed ABCE blocks at 5 and 10 minutes (64 rows total; 32 per budget) in a pooled within-host model over A, B, C, E, the six two-way terms, and budget interactions. The intervals below quantify rerun variability within that fixed-seed regime rather than independent-seed uncertainty.

| effect | 5 min effect [95% CI] | 10 min effect [95% CI] |
| --- | --- | --- |
| A | +0.0227 [+0.0214, +0.0240] | +0.0084 [+0.0071, +0.0097] |
| B | +0.0210 [+0.0197, +0.0223] | +0.0023 [+0.0010, +0.0036] |
| C | -0.0085 [-0.0098, -0.0072] | -0.0013 [-0.0026, 0.0000] |
| E | +0.0014 [+0.0001, +0.0027] | -0.0028 [-0.0041, -0.0015] |
| A:B | +0.0101 [+0.0088, +0.0114] | +0.0049 [+0.0036, +0.0062] |
| A:C | -0.0054 [-0.0067, -0.0041] | -0.0012 [-0.0024, +0.0001] |
| B:C | -0.0024 [-0.0037, -0.0011] | -0.0003 [-0.0016, +0.0010] |
| A:E | +0.0017 [+0.0004, +0.0030] | -0.0005 [-0.0018, +0.0008] |
| B:E | +0.0023 [+0.0010, +0.0036] | -0.0005 [-0.0018, +0.0008] |
| C:E | -0.0018 [-0.0031, -0.0005] | -0.0009 [-0.0022, +0.0004] |

Interpretation:

- Main effects shrink notably with budget, especially A and B.
- A:B remains positive at both budgets, but its absolute effect declines from +0.0101 to +0.0049; the pooled budget interaction for A:B is negative (delta = -0.0052, p = 1.97e-06).
- Relative salience still rises because the main effects shrink faster: |A:B| / |A| increases from 0.444 at 5 min to 0.585 at 10 min.
- Interactions remain secondary in absolute size even after the rerun-complete analysis.

### 5\.4 Same\-Host Greedy Baselines

Greedy summaries:

- 5-minute greedy (50 runs): 6 accepted updates, best 1.152694 at step 12.
- 10-minute greedy (20 runs): 9 accepted updates, best 1.090276 at step 15.
- 10-minute greedy first hits `A=-1, D=-1` at step 7.

Interpretation:

- Greedy improves incumbents effectively, especially early.
- These trajectories show search behavior, but by themselves they do not identify which factors produced the gain; that attribution still comes from the designed comparisons.

### 5\.5 Matched\-Cost Random Search Baseline

We ran a matched-cost random baseline over the full 32-condition recipe space: five independent batches of 16 random draws each at 10 minutes (80 rows total). This uses the same per-run budget as the original 16-run designed screen, but without structured coverage or factor balancing.

Benchmarks from the original 10-minute single-seed comparisons:

- pilot screened best: 1.090745
- greedy best: 1.090276

Batch-best summary from the five random batches:

| statistic | value |
| --- | --- |
| mean batch best | 1.090615 |
| median batch best | 1.090450 |
| min batch best | 1.089706 |
| max batch best | 1.092171 |
| batches at or better than pilot screened best | 3 / 5 |
| batches at or better than greedy best | 2 / 5 |

Per-batch winners:

| batch | best condition | recipe | best `val_bpb` |
| --- | --- | --- | --- |
| 1 | `full32_c10` | `A-1_B+1_C-1_D-1_E+1` | 1.090450 |
| 2 | `full32_c06` | `A-1_B-1_C+1_D-1_E+1` | 1.092171 |
| 3 | `full32_c14` | `A-1_B+1_C+1_D-1_E+1` | 1.089975 |
| 4 | `full32_c14` | `A-1_B+1_C+1_D-1_E+1` | 1.090771 |
| 5 | `full32_c10` | `A-1_B+1_C-1_D-1_E+1` | 1.089706 |

Interpretation:

- In this small 32-condition space, matched-cost random search can reach incumbents that are competitive with and sometimes slightly better than the original screened-best and greedy-best single-seed references.
- This comparison is intentionally narrow: each random batch has 16 draws at the 10-minute budget and is matched to a single-budget 16-condition screen, not to the full multi-stage 613-run evidence program.
- The random baseline should therefore be read as an incumbent-quality stress test for the original 10-minute screen, not as a cost-matched replacement for the full screen-confirm-refine workflow.
- That result rules out a simple superiority claim based only on incumbent quality at 10 minutes. It does not test seeded mean-vs-mean superiority against the later rerun packages.
- The random winners are not dispersed across the space. All five batch winners share `A=-1` and `D=-1`, and four of five also use `B=+1`. Random search succeeds mainly when it stumbles into the same low-penalty region that the designed screen isolates structurally.
- The methodological value of the staged screen therefore remains attribution and disciplined refinement, not a guarantee that no random batch can hit a strong incumbent.
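The batch-best summary and the shared-region observation can both be recomputed from the five per-batch winners listed above. The sketch below transcribes those rows and the two single-seed benchmarks. The recipe parser is a hypothetical helper written for this naming pattern.

```python
from statistics import mean, median

PILOT_SCREENED_BEST = 1.090745
GREEDY_BEST = 1.090276

# (batch, condition, recipe, best val_bpb) transcribed from the per-batch winners.
WINNERS = [
    (1, "full32_c10", "A-1_B+1_C-1_D-1_E+1", 1.090450),
    (2, "full32_c06", "A-1_B-1_C+1_D-1_E+1", 1.092171),
    (3, "full32_c14", "A-1_B+1_C+1_D-1_E+1", 1.089975),
    (4, "full32_c14", "A-1_B+1_C+1_D-1_E+1", 1.090771),
    (5, "full32_c10", "A-1_B+1_C-1_D-1_E+1", 1.089706),
]


def parse_recipe(recipe: str) -> dict[str, int]:
    """Turn 'A-1_B+1_...' into {'A': -1, 'B': 1, ...} (hypothetical helper)."""
    return {token[0]: int(token[1:]) for token in recipe.split("_")}


bests = [row[3] for row in WINNERS]
levels = [parse_recipe(row[2]) for row in WINNERS]

summary = {
    "mean": round(mean(bests), 6),
    "median": median(bests),
    "min": min(bests),
    "max": max(bests),
    "at_or_better_than_pilot": sum(b <= PILOT_SCREENED_BEST for b in bests),
    "at_or_better_than_greedy": sum(b <= GREEDY_BEST for b in bests),
}
shared_low_penalty = all(lv["A"] == -1 and lv["D"] == -1 for lv in levels)
wide_count = sum(lv["B"] == 1 for lv in levels)

if __name__ == "__main__":
    print(summary)
    print("all winners at A=-1, D=-1:", shared_low_penalty, "| B=+1 count:", wide_count)
```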

### 5\.6 Center\-Point Bridge Runs

`ASPECT_RATIO=64` bridge runs:

| time budget (s) | n | mean `val_bpb` | sd |
| --- | --- | --- | --- |
| 120 | 3 | 1.284982 | 0.000465 |
| 300 | 3 | 1.152658 | 0.000230 |
| 600 | 3 | 1.092224 | 0.000168 |

Interpretation:

- Center-point mean performance improves as budget increases from 120 s to 600 s.
- These runs improve comparability between greedy trajectories (which explored 64) and two-level pilot screens (48/72).

### 5\.7 Targeted Seed Confirmation Across Budgets

We ran a targeted seed-confirmation layer over six anchor conditions: two prior best conditions, two prior worst conditions, and two bridge conditions at `ASPECT_RATIO=64`. This yields 90 rows total (6 conditions x 3 budgets x 5 seeds). This targeted subset complements the later full seeded-screen reruns by covering the 2-minute regime and the centered-width bridge conditions directly.

In a pooled fixed\-effects model over budget, condition, and seed, the dominant variance sources are budget and condition structure, not seed identity:

| term | `eta_sq` |
| --- | --- |
| budget | 0.6035 |
| condition | 0.2723 |
| budget:condition | 0.1079 |
| seed | 0.0044 |
| residual | 0.0118 |

Descriptive cross-seed win counts over the seeded subset (`n_left * n_right` pairings, not independent-trial denominators):

| budget | best<worst | bridge<worst | best<bridge |
| --- | --- | --- | --- |
| 2 min | 100/100 | 100/100 | 78/100 |
| 5 min | 100/100 | 100/100 | 62/100 |
| 10 min | 100/100 | 100/100 | 25/100 |

Local refinement signals from the same seeded subset:

| budget | `bridge_d8 - bridge_d6` | `bridge_d6 - best_best` |
| --- | --- | --- |
| 2 min | +0.1167 | +0.0225 |
| 5 min | +0.0191 | -0.0019 |
| 10 min | -0.0005 | -0.0013 |

Interpretation:

- Worst regions remain clearly separated from both best and bridge regions under explicit seed variation at all three budgets.
- Seed effects are statistically non-zero in the targeted subset, but they are small relative to budget and condition structure.
- The `ASPECT_RATIO=64` bridges are operationally important: the depth-6 bridge overtakes the screened best extremes on mean `val_bpb` at 5 min, and both bridges do so by 10 min.
- At centered width, the depth penalty is large at 2 min, smaller at 5 min, and near-tied by 10 min.

Figure 2 visualizes the same point from a variance\-allocation perspective: the seeded subset is driven primarily by budget and condition structure, not by seed identity\.

![Refer to caption](https://arxiv.org/html/2606.05186v1/x2.png)

Figure 2: In the seeded confirmation subset, variance is dominated by budget and condition structure rather than seed identity.

### 5\.8 Full Seeded\-Screen Confirmation at 5 and 10 Minutes

We then reran the full 16-condition screening design at 5 and 10 minutes with five independent seeds per condition (160 rows total). This is the main statistical upgrade relative to the earlier draft because it removes the earlier weakness that full-screen main effects at those budgets were single-seed estimates.

Condition ordering in the seeded full screens:

For 5 min:

- •best mean: c07 = 1\.160271 \[1\.152515, 1\.168028\]
- •second mean: c03 = 1\.163509 \[1\.160124, 1\.166894\]
- •worst mean: c14 = 1\.371836 \[1\.307309, 1\.436364\]

For 10 min:

- •best mean: c07 = 1\.088859 \[1\.085720, 1\.091999\]
- •second mean: c05 = 1\.094040 \[1\.090097, 1\.097983\]
- •worst mean: c14 = 1\.164241 \[1\.147242, 1\.181241\]

Interpretation:

- •The best region from the original screens holds under the seeded reruns: c07 remains the best mean condition at both 5 and 10 minutes\.
- •The worst region also holds: c14 and c16 remain clearly worst at both budgets\.
- •The seeded reruns confirm that the large penalties are not artifacts of a narrow anchor subset\. At 5 and 10 minutes the full\-screen evidence now shows D, A, B, and C retaining non\-zero estimates after within\-budget BH correction, while E is effectively null\.
- •This changes the status of the 10\-minute story from "suggestive point estimates" to "corrected main\-effect evidence with explicit seed uncertainty\."
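The within\-budget Benjamini\-Hochberg \(BH\) step can be sketched as follows\. This is a generic implementation of the standard step\-up rule, not the paper's code; the p\-values, the alpha level of 0\.05, and the choice of the five main effects as the correction family are assumptions made purely for illustration\.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per hypothesis using the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for five main effects within one budget (illustration only).
p = {"A": 0.001, "B": 0.012, "C": 0.020, "D": 0.0002, "E": 0.40}
flags = dict(zip(p, benjamini_hochberg(list(p.values()))))
print(flags)
```

Applying the correction separately per budget, as the text describes, would mean calling the function once on the 5\-minute family and once on the 10\-minute family rather than pooling both\.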

### 5\.9 60\-Minute Bridge Package

We ran a 60\-minute bridge package over four predeclared anchor conditions with four independent seeds each \(16 rows total\): the best screened 10\-minute extreme, the best seeded bridge condition at ASPECT\_RATIO=64, the best 10\-minute greedy incumbent, and a predeclared control outside the reduced low\-penalty region\.

Role\-to\-condition mapping:

- •bridge\_best: best\_bridge\_10min\_d8\_ar64
- •greedy\_winner: greedy\_winner\_10min\_s15
- •screened\_best: best\_screened\_10min\_c07
- •control: predeclared\_control\_c10

| role | 10 min mean | 60 min mean | 95% CI \(60 min mean\) |
| --- | --- | --- | --- |
| bridge\_best | 1\.096292 | 0\.974511 | \[0\.972148, 0\.976874\] |
| greedy\_winner | 1\.090276 | 0\.981837 | \[0\.979531, 0\.984143\] |
| screened\_best | 1\.088859 | 0\.984299 | \[0\.983023, 0\.985574\] |
| control | 1\.140270 | 1\.001604 | \[0\.999867, 1\.003341\] |

Descriptive cross\-seed win counts in the same package:

- •bridge\_best < greedy: 16/16
- •screened\_best < greedy: 1/16
- •bridge\_best < control: 16/16
- •greedy < control: 16/16
- •bridge\_best \- greedy mean gap: \-0\.007326 val\_bpb \(standard error of the difference 0\.001038, about 7\.1 standard errors\)
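As a worked check of the last item, the gap and its size in standard errors follow directly from the reported 60\-minute means for the bridge and greedy anchors and the reported standard error of the difference\. The symbols below are introduced here for illustration only\.

```latex
\Delta = \bar{y}_{\text{bridge}} - \bar{y}_{\text{greedy}}
       = 0.974511 - 0.981837 = -0.007326,
\qquad
\frac{|\Delta|}{\mathrm{SE}(\Delta)} = \frac{0.007326}{0.001038} \approx 7.06
```

The ratio rounds to the "about 7\.1 standard errors" quoted above\.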

Figure 3 shows the anchor\-level seed trajectories for this package and makes the longer\-horizon crossover visually explicit: the centered\-width bridge becomes the best mean 60\-minute anchor while the predeclared control remains worst\.

![Refer to caption](https://arxiv.org/html/2606.05186v1/x3.png)

Figure 3: The 60\-minute bridge package keeps the reduced\-space bridge best and the predeclared control worst\. Points show seed runs; black markers show means with 95% confidence intervals\.

Interpretation:

- •The predeclared control remains worst at 60 min, so the reduced\-space story does not collapse at the longer horizon\.
- •The best centered\-width bridge condition becomes the best mean 60\-minute performer and beats the greedy winner in all 16 descriptive cross\-seed pairings in this package\.
- •The best screened extreme remains the best mean condition at 10 minutes in the full seeded\-screen reruns, but it is overtaken by the centered\-width bridge at 60 minutes\. This is exactly the kind of transition the staged workflow is meant to expose\.
- •The 60\-minute winner is also the larger bridge model \(depth=8, about 50\.3M parameters\), while the screened best extreme and greedy winner are both depth=6 configurations \(39\.8M parameters\)\. This is consistent with the paper's budget\-relaxation story, but this anchor package does not by itself separate a workflow effect from a later\-horizon capacity effect inside the bridge region\.
- •The 60\-minute package is therefore consistent with both local workflow refinement and simple capacity relaxation; it should not be read as isolating a workflow\-only effect\.
- •The best screened extreme remains clearly better than the control, but no longer matches the bridge or greedy winner\. Within this anchor package, that pattern is consistent with keeping a local bridge\-refinement step rather than freezing the original screened extreme as the final answer\.
- •This is still an anchor\-set continuation check, not a claim of global long\-horizon optimality\.

For scale, the 60\-minute bridge\-vs\-greedy mean gap is 0\.007326 val\_bpb, about 21% of the seeded 10\-minute D main\-effect penalty and about 41% of the seeded 10\-minute A penalty\. This makes the bridge gap operationally visible inside the paper's metric, while still leaving downstream\-transfer value untested\.

### 5\.10 Cross\-Host Linux Anchor Replication

We then reran the same four\-anchor package on a Linux host with one NVIDIA L40S GPU at 10 and 60 minutes with four seeds per anchor \(32 rows total\)\. This is not a like\-for\-like A100 replication\. It is a bounded cross\-host check asking whether the strongest operational signal from the original host remains visible after a change in operating system, runtime path, and accelerator class\.

Linux condition ordering:

For 10 min:

- •best mean: best\_bridge\_10min\_d8\_ar64 = 1\.147247 \[1\.144626, 1\.149868\]
- •second mean: predeclared\_control\_c10 = 1\.156941 \[1\.155960, 1\.157922\]
- •worst mean: best\_screened\_10min\_c07 = 1\.175117 \[1\.166961, 1\.183273\]

For 60 min:

- •best mean: best\_bridge\_10min\_d8\_ar64 = 1\.003547 \[1\.002316, 1\.004777\]
- •second mean: predeclared\_control\_c10 = 1\.011303 \[1\.009919, 1\.012686\]
- •worst mean: best\_screened\_10min\_c07 = 1\.036137 \[1\.033837, 1\.038437\]

Descriptive cross\-seed win counts on Linux:

- •bridge\_best < greedy: 16/16 at 10 min, 16/16 at 60 min
- •bridge\_best < control: 16/16 at 10 min, 16/16 at 60 min
- •greedy < control: 0/16 at 10 min, 0/16 at 60 min

Interpretation:

- •The strongest part of the Windows story remains after the host change: the bridge has the lowest sample mean and continues to beat the greedy anchor at both budgets\.
- •The full original anchor ordering does not survive unchanged\. On Linux, the predeclared control does not remain worst; it ranks second at both budgets, while the screened extreme becomes worst\.
- •The cross\-host evidence is therefore mixed rather than fully confirmatory\. It supports bridge\-centered continuation evidence, but it does not support a stronger claim that the same full anchor hierarchy or bad\-control separation is host\-invariant\.

### 5\.11 Twelve\-Hour Three\-Anchor Continuations on Windows and Linux

To test whether the strongest bridge\-centered signal remains visible beyond 60 minutes, we then ran the same three\-anchor continuation \(bridge, greedy, control\) at 12h on both hosts with two seeds per anchor \(12 rows total across the two hosts\)\. These later anchor packages are small descriptive sample\-mean checks, not powered inferential studies\. Their Student\-t intervals use the within\-cell replicate count \(n=2, hence df=1\) and are shown only as coarse dispersion summaries\.
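To make concrete why an n=2 interval is only a coarse dispersion summary, the sketch below computes a 95% Student\-t interval for a pair of seed results\. It is a minimal illustration under stated assumptions: the two input values are hypothetical, the helper name `t_interval` is invented here, and the critical values are the standard two\-sided 95% table entries hard\-coded for small df to avoid an external dependency\.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t for small df (standard table values).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def t_interval(values):
    """Return (mean, lower, upper) for a 95% Student-t interval on the mean."""
    n = len(values)
    half_width = T_CRIT_95[n - 1] * stdev(values) / sqrt(n)
    m = mean(values)
    return m, m - half_width, m + half_width


# Hypothetical pair of seed results (illustration only, not taken from the paper).
m, lo, hi = t_interval([0.9300, 0.9320])
print(f"mean={m:.6f} interval=[{lo:.6f}, {hi:.6f}]")
```

With two replicates the standard error equals half the absolute difference between the two runs, so the half\-width is about 6\.35 times that difference\. A small change in either seed result therefore moves the interval substantially\.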

Windows 12h condition ordering:

- •best mean: best\_bridge\_10min\_d8\_ar64 = 0\.926420 \[0\.914934, 0\.937906\]
- •second mean: greedy\_winner\_10min\_s15 = 0\.941248 \[0\.935511, 0\.946984\]
- •worst mean: predeclared\_control\_c10 = 0\.954340 \[0\.948615, 0\.960064\]

Linux 12h condition ordering:

- •best mean: best\_bridge\_10min\_d8\_ar64 = 0\.937047 \[0\.934620, 0\.939474\]
- •second mean: predeclared\_control\_c10 = 0\.956076 \[0\.953052, 0\.959100\]
- •worst mean: greedy\_winner\_10min\_s15 = 0\.961561 \[0\.949668, 0\.973454\]

Descriptive cross\-seed win counts at 12h:

- •Windows:
  - –bridge\_best < greedy: 4/4
  - –bridge\_best < control: 4/4
  - –greedy < control: 4/4
- •Linux:
  - –bridge\_best < greedy: 4/4
  - –bridge\_best < control: 4/4
  - –greedy < control: 0/4

Interpretation:

- •The strongest operational signal remains visible at 12h on both hosts: the bridge has the lowest sample mean and continues to beat the greedy anchor in descriptive pairwise counts\.
- •Windows 12h preserves the full bridge < greedy < control ordering seen in the main\-host 60\-minute story, so the same\-host long\-horizon reading is materially stronger than it was at 60 minutes alone\.
- •Linux 12h again preserves the bridge\-centered advantage but not the rest of the hierarchy\. The control does not remain worst; it ranks above the greedy anchor\.
- •The combined 12h read therefore strengthens the paper only in a bounded descriptive sense: the bridge has the lowest sample mean in both small three\-anchor packages, while the rest of the ordering remains host\-sensitive\.

### 5\.12 Twenty\-Four\-Hour Three\-Anchor Continuations on Windows and Linux

We then extended the same three\-anchor continuation \(bridge, greedy, control\) to 24h on both hosts with two seeds per anchor \(12 rows total across the two hosts\)\. These full\-day packages remain small and descriptive; they test whether the bridge keeps the lowest sample mean at the longest horizon in this paper, not whether the ranking is established with high\-power long\-horizon inference\.

Windows 24h condition ordering:

- •best mean: best\_bridge\_10min\_d8\_ar64 = 0\.923374 \[0\.910319, 0\.936430\]
- •second mean: greedy\_winner\_10min\_s15 = 0\.938600 \[0\.933867, 0\.943334\]
- •worst mean: predeclared\_control\_c10 = 0\.950929 \[0\.943623, 0\.958235\]

Linux 24h condition ordering:

- •best mean: best\_bridge\_10min\_d8\_ar64 = 0\.930297 \[0\.926377, 0\.934216\]
- •second mean: predeclared\_control\_c10 = 0\.952079 \[0\.950758, 0\.953400\]
- •worst mean: greedy\_winner\_10min\_s15 = 0\.954650 \[0\.941715, 0\.967585\]

Descriptive cross\-seed win counts at 24h:

- •Windows:
  - –bridge\_best < greedy: 4/4
  - –bridge\_best < control: 4/4
  - –greedy < control: 4/4
- •Linux:
  - –bridge\_best < greedy: 4/4
  - –bridge\_best < control: 4/4
  - –greedy < control: 0/4

Interpretation:

- •The bridge\-refined anchor has the lowest sample mean at 24h on both hosts and beats both the greedy anchor and the predeclared control in all descriptive cross\-seed pairings\.
- •Windows 24h preserves the same bridge < greedy < control ordering already seen at 12h, so the same\-host long\-horizon read remains internally consistent through a full\-day horizon\.
- •Linux 24h again preserves the bridge\-centered advantage but not the rest of the hierarchy\. The control does not remain worst; it ranks above the greedy anchor\.
- •The 24h read therefore strengthens the paper in one bounded descriptive way: the bridge has the lowest sample mean in both small three\-anchor packages through 24h, while the non\-bridge ordering remains host\-sensitive\.

Figure 4 combines the four long\-horizon anchor packages and makes the bounded cross\-host pattern visible at a glance: Windows preserves bridge < greedy < control at both horizons, while Linux preserves the bridge as best but flips the non\-bridge ordering\.

![Refer to caption](https://arxiv.org/html/2606.05186v1/x4.png)

Figure 4: Dual\-host long\-horizon anchor packages at 12h and 24h\. In all four panels, the bridge anchor has the lowest sample mean\. Windows preserves bridge < greedy < control at both horizons, while Linux ranks control < greedy among the two non\-bridge anchors\. Points show seed runs; black markers show means with 95% confidence intervals\.
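The host\-sensitivity reading can be reproduced directly from the sample means reported in the 12h and 24h condition orderings\. The sketch below only re\-sorts those reported means; the short anchor labels and variable names are shorthand introduced here, and no seed\-level data or inference is involved\.

```python
# Sample means of val_bpb as reported in the 12h and 24h condition orderings.
means = {
    ("windows", "12h"): {"bridge": 0.926420, "greedy": 0.941248, "control": 0.954340},
    ("linux", "12h"): {"bridge": 0.937047, "greedy": 0.961561, "control": 0.956076},
    ("windows", "24h"): {"bridge": 0.923374, "greedy": 0.938600, "control": 0.950929},
    ("linux", "24h"): {"bridge": 0.930297, "greedy": 0.954650, "control": 0.952079},
}


def ordering(package):
    """Anchors sorted from lowest (best) to highest (worst) sample mean."""
    return sorted(package, key=package.get)


orders = {key: ordering(pkg) for key, pkg in means.items()}

# The bridge is first in every package, but the full ordering differs by host.
bridge_best_everywhere = all(order[0] == "bridge" for order in orders.values())
full_order_invariant = len({tuple(order) for order in orders.values()}) == 1

for key, order in orders.items():
    print(key, " < ".join(order))
print("bridge best everywhere:", bridge_best_everywhere)
print("full ordering host-invariant:", full_order_invariant)
```

This separates the two claims the text keeps apart: the position of the bridge is stable across the four packages, while the ordering of the remaining two anchors is not\.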

## 6\. Discussion

The evidence supports a bounded methodological conclusion\. In this host regime, short\-budget training behavior is dominated by a small set of main penalties, especially total batch and model size\. Those penalties relax substantially as budget increases from 2 to 10 minutes\. The new full seeded\-screen reruns sharpen that claim materially\. At 5 and 10 minutes, the seeded fixed\-effects models show D, A, B, and C retaining non\-zero estimates once the full 16\-condition design is rerun across five independent seeds per condition and the within\-budget BH correction is applied, while E does not\. That is the cleanest answer to the earlier concern that the 10\-minute effects might be at the noise floor: one of them \(E\) is, but the others are not, and the full seeded reruns now separate the real effects from the null one\.
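As a rough illustration of how a 16\-condition two\-level screen can attribute penalties to individual factors, the sketch below builds a half fraction of a five\-factor design and estimates main effects as simple high\-minus\-low contrasts\. Several things here are assumptions and not taken from the paper: the generator E = A\*B\*C\*D, the \-1/\+1 coding, the synthetic additive response, and the use of plain contrasts in place of the seeded fixed\-effects models described above\.

```python
from itertools import product
from statistics import mean

FACTORS = ["A", "B", "C", "D", "E"]


def half_fraction():
    """16-run two-level design: full factorial in A-D with E = A*B*C*D (assumed generator)."""
    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        runs.append({"A": a, "B": b, "C": c, "D": d, "E": a * b * c * d})
    return runs


def main_effects(runs, responses):
    """Main effect of each factor: mean response at +1 minus mean response at -1."""
    effects = {}
    for f in FACTORS:
        high = mean(y for run, y in zip(runs, responses) if run[f] == 1)
        low = mean(y for run, y in zip(runs, responses) if run[f] == -1)
        effects[f] = high - low
    return effects


runs = half_fraction()
# Synthetic responses from a hypothetical additive model (illustration only):
# the high levels of D and A add a penalty, and E has no effect.
responses = [1.10 + 0.02 * r["D"] + 0.01 * r["A"] for r in runs]
effects = main_effects(runs, responses)

for factor, effect in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{factor}: {effect:+.4f}")
```

Because the design columns are balanced against each other, each contrast recovers its own factor's contribution in this noise\-free toy case; with real seeded runs the same contrasts would carry seed uncertainty and call for a multiplicity correction\.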

The targeted seeded anchor package remains useful because it covers the 2\-minute regime and the centered\-width bridge conditions directly\. In that subset, budget and condition structure dominate seed identity, worst regions remain clearly separated from best and bridge regions, and the bridge ordering shifts with budget exactly where the operational refinement story says it should\. The later 60\-minute bridge package then sharpens the longer\-horizon interpretation\. The best screened extreme does not remain best at the longer horizon; the centered\-width bridge does\. That is important\. The recommendation is therefore not "screen once and stop\." The recommendation is "screen, confirm, then refine locally\."

The matched\-cost random baseline sharpens the same point from the other side\. In this 32\-condition space, random search can hit a strong incumbent\. What it does not do is explain why that incumbent is good or why the good incumbents cluster in the same low\-penalty region\. The practical value of the design is interpretability and search\-space reduction, not a broad claim that designed screening universally outperforms greedy or random search\. A greedy trajectory can still locate a strong incumbent, and a random batch can sometimes do so as well\. What the present study shows is narrower: best\-so\-far trajectories alone do not provide direct factor attribution, whereas a compact staged screen can identify high\-penalty directions and expose a plausible refinement region before larger search budgets are spent\.

The 60\-minute bridge package should also be read narrowly: it shows that the staged workflow surfaces a reduced region in which a larger centered\-width bridge becomes the best anchor, but it does not fully disentangle workflow value from the later\-horizon capacity advantage of that bridge model\.

The new Linux anchor package changes the paper's external\-validity story, but only partially\. At 10 minutes, 60 minutes, 12h, and now 24h, the strongest anchor\-level result remains visible cleanly: on Linux, the bridge has the lowest sample mean and continues to beat the greedy anchor\. That weakens the earlier single\-host caveat in a meaningful way because the bridge\-centered result is no longer confined to the original Windows A100 runtime path or to the shorter continuation horizons\. At the same time, the Linux packages do not preserve the full original anchor ordering\. The predeclared control does not remain worst at 10 minutes, 60 minutes, 12h, or 24h, and the rest of the ranking remains host\-sensitive\. This means the paper should not claim full cross\-host replication or host\-invariant preservation of the complete ranking structure\. The correct reading is narrower: the bridge\-centered sample\-mean result is directionally stable across hosts through 24h, while other anchor relationships are host\-sensitive\.

This workflow should therefore be read as a bounded method for short\-horizon experimentation with limited longer\-horizon support on a small anchor set and cross\-host support for the bridge\-centered refinement step through 24h\. It does not establish global long\-horizon optimality, hardware\-invariant ranking structure, or scaling\-law generality\. Those require additional experiments beyond the present bounded regime\.

## 7\. Limitations

1. Cross\-host evidence is still partial rather than complete\. The paper now includes bounded Linux L40S anchor packages at 10 minutes, 60 minutes, 12h, and 24h, but the original full anchor ordering does not survive unchanged across hosts and the results do not support a hardware\-invariant claim\.
2. Focus on final val\_bpb, not downstream task transfer metrics\. No downstream evaluation is included, so practitioner recommendations are provisional for validation\-compression performance in this runner rather than demonstrated task\-transfer guidance\.
3. The 60\-minute, 12h, and 24h packages are still anchor\-set continuation checks over a small predeclared anchor set; the 12h and 24h packages have only n=2 seeds per anchor and should be read as coarse descriptive sample\-mean hardening, not high\-power long\-horizon inference\.
4. Search baselines are still bounded to same\-host greedy and a single\-budget random\-search stress test; the paper does not include adaptive multi\-fidelity baselines such as Hyperband, BOHB, ASHA, or population\-based training\.
5. The branch now includes full 16\-condition seeded screens at 5 and 10 minutes plus a targeted seeded anchor subset across all three budgets, but the full 2\-minute screen, greedy baselines, and the complete D\-fixed grid were not rerun across broad seed distributions\.
6. Reproducibility is now package\-explicit and archive\-frozen at the snapshot level, but still not ideal from a software\-distribution perspective: the study depends on a frozen local source snapshot plus matrix artifacts rather than a single tagged public repository for the entire paper workspace\.
7. The work is methodological and small\-scale, with no human\-subject data or deployed system evaluation, but it still reduces the cost of iterative recipe search for language\-model training\. As with other training\-efficiency methods, that can have dual\-use value if repurposed for broader capability\-seeking automation\. We therefore present the contribution as an auditable bounded workflow result rather than a turnkey optimization system\.

## 8\. Conclusion

This paper started from an interaction\-atlas hypothesis and ends with a narrower claim: under strict short budgets on the main host, training\-recipe behavior is dominated by budget\-sensitive penalties\. Three\-budget screens identify the large penalties early\. Full seeded\-screen reruns at 5 and 10 minutes show that D, A, B, and C retain non\-zero estimates after within\-budget correction while E does not\. The matched\-cost random baseline shows that strong incumbents can be reached by chance in this small space, but mostly inside the same low\-penalty region\. Rerun\-complete D\-fixed follow\-ups show that interactions remain real but secondary in absolute size inside the legacy fixed\-seed regime\. The targeted multi\-seed subset shows that worst regions remain worse across seeds while centered\-width bridges become better refinement targets at 5 and 10 minutes\. The later 60\-minute bridge package shows that the best centered\-width bridge is the strongest anchor condition at a longer horizon\. Finally, the later 12h and 24h continuations show that the bridge has the lowest sample mean in these small descriptive continuations on both hosts, even though the full anchor ordering is not preserved across hosts\.

The operational recommendation is therefore staged: screen to eliminate high\-penalty directions, confirm the anchor regions, then refine locally inside the reduced space before spending larger automated search budgets\. The 60\-minute bridge result matters precisely because it shows that the local refinement step can overtake both the original screened extreme and the bounded greedy incumbent once the budget is long enough for the depth penalty to relax\. The later dual\-host continuation packages matter because they show that the bridge keeps the lowest sample mean in these small descriptive three\-anchor continuations on both hosts through 24h\. The Linux replication matters because it shows that this bridge\-centered recommendation is not confined to a single runtime path, even though the rest of the anchor hierarchy still appears host\-sensitive\.

## Appendix A\. Reproducibility Snapshot

The upstream autoresearch baseline used for this study was frozen from a local clone at commit:

- •228791fb499afffb54b46200aca536f79142f117

The runner\-relevant local source snapshot used to materialize the experiment engines in this paper workspace is archived as:

- •autoresearch\-seed\.zip
- •SHA256: D15B7F68F9BDFF6E06D58FB9E2692152D8B44C34CC34652435C1F94675ADCADC
- •public contents manifest: autoresearch\_seed\_snapshot\_manifest\_2026\-04\-27\.json
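A reader who obtains the archive can check it against the digest listed above with a few lines of standard\-library Python\. This is a minimal sketch: the helper names are invented here, and the local file path passed to `matches_snapshot` is an assumption about where the archive has been saved\.

```python
import hashlib
from pathlib import Path

# Digest as listed in this appendix for the source snapshot archive.
EXPECTED_SHA256 = "D15B7F68F9BDFF6E06D58FB9E2692152D8B44C34CC34652435C1F94675ADCADC"


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the uppercase hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def matches_snapshot(path, expected=EXPECTED_SHA256):
    """True when the file's digest equals the expected digest (case-insensitive)."""
    return sha256_of(path) == expected.upper()


if __name__ == "__main__":
    archive = Path("autoresearch-seed.zip")  # hypothetical local path
    if archive.exists():
        print("snapshot digest matches:", matches_snapshot(archive))
    else:
        print("archive not found at", archive)
```

Reading in chunks keeps memory use flat regardless of archive size, and comparing in a single case avoids false mismatches between uppercase and lowercase hex output\.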

The local control and parsing artifacts that define the execution path for this branch are:

- •materialize\_condition\.py
- •parse\_train\_summary\.py
- •patch\_prepare\_time\_budget\.py
- •train\_windows\_fallback\.py
- •remote\-env\.ps1
- •remote\-env\-linux\.ps1

The package\-level configuration dumps for the major result blocks are stored as explicit matrices and base\-condition tables in the paper workspace\. The principal ones are:

- •pilot\_16run\_design\_matrix\.csv
- •replication\_12run\_matrix\.csv
- •followup\_abce\_dlow\_16run\_matrix\.csv
- •followup\_abce\_dlow\_16run\_matrix\_10min\.csv
- •center\_points\_bridge\_9run\_matrix\.csv
- •seed\_confirmation\_base\_conditions\_6row\.csv
- •full\_screen\_seeded\_base\_conditions\_16row\.csv
- •bridge\_60min\_base\_conditions\_4row\.csv
- •crosshost\_linux\_anchor\_base\_conditions\_4row\.csv
- •long\_horizon\_anchor\_base\_conditions\_3row\.csv
- •long\_horizon\_anchor\_12h\_6run\.csv
- •long\_horizon\_anchor\_24h\_base\_conditions\_3row\.csv
- •long\_horizon\_anchor\_24h\_6run\.csv
- •table1\_run\_manifest\_2026\-04\-27\.csv
- •table1\_run\_manifest\_2026\-04\-27\.json

The CSV manifest contains one package row per Table 1 block\. Aggregate totals live in the companion JSON manifest so that the CSV cannot be naively double\-counted\.

The fixed data/tokenizer identifiers used by the runner are:

- •
- •pinned validation shard: shard\_06542\.parquet
- •tokenizer artifacts expected by the runner: tokenizer\.pkl and token\_bytes\.pt
- •tokenizer vocabulary size: 8192

Together, the source snapshot, generator code, dataset and tokenizer identifiers, package matrices, and curated CSV/JSON summaries provide the configuration\-level audit trail for every package reported in Table 1\.

## References

\[3\]*Predictable Scale: Part I, Step Law – Optimal Hyperparameter Scaling Law in Large Language Model Pretraining*\. arXiv:2503\.04715\.[https://arxiv\.org/abs/2503\.04715](https://arxiv.org/abs/2503.04715)

\[5\]*Rethinking Language Model Scaling under Transferable Hypersphere Optimization*\. arXiv:2603\.28743\.[https://arxiv\.org/abs/2603\.28743](https://arxiv.org/abs/2603.28743)

\[7\]*Hyperparameter Transfer Enables Consistent Gains of Matrix\-Preconditioned Optimizers Across Scales*\. arXiv:2512\.05620\.[https://arxiv\.org/abs/2512\.05620](https://arxiv.org/abs/2512.05620)

\[10\] Frank Hutter, Holger H\. Hoos, and Kevin Leyton\-Brown\.*An Efficient Approach for Assessing Hyperparameter Importance*\. Proceedings of the 31st International Conference on Machine Learning, PMLR 32, 2014\.[https://proceedings\.mlr\.press/v32/hutter14\.html](https://proceedings.mlr.press/v32/hutter14.html)

\[11\] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar\.*Hyperband: A Novel Bandit\-Based Approach to Hyperparameter Optimization*\. Journal of Machine Learning Research, 18\(185\):1\-52, 2018\.[https://jmlr\.org/beta/papers/v18/16\-558\.html](https://jmlr.org/beta/papers/v18/16-558.html)

\[12\] Stefan Falkner, Aaron Klein, and Frank Hutter\.*BOHB: Robust and Efficient Hyperparameter Optimization at Scale*\. Proceedings of the 35th International Conference on Machine Learning, PMLR 80, 2018\.[https://proceedings\.mlr\.press/v80/falkner18a\.html](https://proceedings.mlr.press/v80/falkner18a.html)

\[14\] Douglas C\. Montgomery\.*Design and Analysis of Experiments*\. Wiley, 10th edition, 2019\.

\[15\] George E\. P\. Box, J\. Stuart Hunter, and William G\. Hunter\.*Statistics for Experimenters: Design, Innovation, and Discovery*\. Wiley, 2nd edition, 2005\.

\[16\] C\. F\. Jeff Wu and Michael Hamada\.*Experiments: Planning, Analysis, and Optimization*\. Wiley, 2nd edition, 2009\.

\[17\] George E\. P\. Box and K\. B\. Wilson\.*On the Experimental Attainment of Optimum Conditions*\. Journal of the Royal Statistical Society, Series B, 13\(1\):1\-45, 1951\.

\[18\] Raymond H\. Myers, Douglas C\. Montgomery, and Christine M\. Anderson\-Cook\.*Response Surface Methodology: Process and Product Optimization Using Designed Experiments*\. Wiley, 4th edition, 2016\.

\[19\] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W\. Rae, Oriol Vinyals, and Laurent Sifre\.*Training Compute\-Optimal Large Language Models*\. arXiv:2203\.15556, 2022\.[https://arxiv\.org/abs/2203\.15556](https://arxiv.org/abs/2203.15556)

\[20\] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M\. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu\.*Population Based Training of Neural Networks*\. arXiv:1711\.09846, 2017\.[https://arxiv\.org/abs/1711\.09846](https://arxiv.org/abs/1711.09846)

\[21\] Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar\.*A System for Massively Parallel Hyperparameter Tuning*\. arXiv:1810\.05934, 2018\.[https://arxiv\.org/abs/1810\.05934](https://arxiv.org/abs/1810.05934)

\[22\] Leslie N\. Smith\.*Cyclical Learning Rates for Training Neural Networks*\. IEEE Winter Conference on Applications of Computer Vision, 2017\. arXiv:1506\.01186\.[https://arxiv\.org/abs/1506\.01186](https://arxiv.org/abs/1506.01186)

\[23\] Leslie N\. Smith and Nicholay Topin\.*Super\-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates*\. Proceedings of SPIE 11006, Artificial Intelligence and Machine Learning for Multi\-Domain Operations Applications, 2019\. arXiv:1708\.07120\.[https://arxiv\.org/abs/1708\.07120](https://arxiv.org/abs/1708.07120)

\[24\] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N\. Gomez, Lukasz Kaiser, and Illia Polosukhin\.*Attention Is All You Need*\. arXiv:1706\.03762, 2017\.[https://arxiv\.org/abs/1706\.03762](https://arxiv.org/abs/1706.03762)
