Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining
Summary
This paper studies a staged promotion protocol for micro-pretraining, using escalating budgets from minutes to hours to filter configurations. It finds that early screens are useful but unstable, and that a staged approach can retain a long-horizon reference while identifying alternatives that fail continuation thresholds.
# Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining
Source: [https://arxiv.org/html/2606.11387](https://arxiv.org/html/2606.11387)
\(2026\-06\-09\)
###### Abstract
Short pretraining runs make many candidate recipes affordable, but they can also over-promote configurations that only look strong at tiny budgets. We study this tradeoff as a bounded case study in staged promotion for a fixed single-GPU micro-pretraining runner. Here, "micro-pretraining" means a single-node, single-GPU experimental runner with staged wall-clock budgets, not that every run is sub-minute. Starting from twelve candidate configurations derived from a prior public screening study, we run budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours on two heterogeneous host blocks: a Windows A100 path and a Linux L40S path.
The early screens are useful but unstable: at 5 minutes, the best Windows and Linux conditions differ, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence rather than within-seed learning-curve estimates. A replicated 60-minute gate then retains the bridge reference derived from *Staged Factorial Screening for Budget-Constrained Micro-Pretraining* [1], which ranks first in all four host-seed cells at that gate. In the final 12-hour confirmation package, the bridge reference ranks first in all four host-seed cells across seeds 46 and 47; the greedy comparator ranks second but does not meet the frozen 0.010 `val_bpb` near-equivalence rule; and the d8/ar48 (depth-8, aspect-48) cheaper sentinel ranks third and does not meet the frozen 0.020 `val_bpb` mean-gap cheaper-architecture rule.
The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates to the same confirmation would spend 192 GPU-hours; continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter two numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not evidence that the protocol outperforms adaptive hyperparameter optimization or that the largest model's advantage is capacity-normalized.
## 1\. Introduction
Small pretraining experiments are often used as filters before heavier training budgets are spent\. The operational reason is simple: if a short run can reject a bad recipe, the saved accelerator time can be used elsewhere\. The scientific risk is also simple: a short run can rank configurations in an order that does not survive more time, another seed, or another host\.
In this paper, micro-pretraining means a single-node, single-GPU experimental runner used to make small-budget pretraining decisions. It does not mean that all stages are tiny in wall-clock time: the protocol deliberately escalates from minute-scale screens to 12-hour confirmation runs.
This paper studies that tradeoff as a promotion problem\. We do not ask whether a short run can prove a configuration is globally best\. We ask whether a small, auditable, two\-worker promotion schedule can keep an observed long\-horizon reference in the candidate set while identifying plausible alternatives that do not meet frozen continuation thresholds\.
The study starts from a twelve\-condition matrix derived from the prior staged factorial micro\-pretraining screening campaign reported in \[1\]\. The bridge reference is the best reference condition carried forward from that prior screening campaign; this paper tests whether staged promotion retains and challenges it, not whether a naive search discovers it from scratch\. The candidates include that bridge reference, a greedy comparator, a high\-penalty control, several smaller or cheaper variants, and local variants around the bridge region\. We then run a multi\-fidelity schedule on two heterogeneous host blocks: a Windows A100 host and a Linux L40S host\. Early budgets are intentionally cheap\. Later budgets are spent only after written pre\-analysis gates\.
Operationally, a gate observes a candidate set $S_t$, a fixed wall-clock budget $b_t$, blocked host measurements, and predeclared thresholds, then chooses a smaller set $S_{t+1}$ before the next budget is spent. The object of study is not only the final score, but whether the sequence of gates avoids over-pruning while reducing expensive continuations.
The final result is narrower than a general optimizer claim but useful for constrained experimentation. Early rankings are unstable enough that a hard prune at 5 or 10 minutes would be risky. However, carrying reference and host-sensitive candidates through a replicated 60-minute gate keeps the eventual 12-hour top-ranked condition in the promoted set. Because seed ranges change across stages, this is retention evidence under an operational promotion schedule, not causal evidence that budget duration alone changed the ordering. The 12-hour package then closes the cheaper-model branch for this study: the bridge reference ranks first in all four host-seed cells, while the cheaper sentinel and greedy comparator do not meet the frozen thresholds. No 24-hour continuation is launched after this result.
The main contribution is therefore methodological discipline rather than a new architecture\. A staged promotion rule can reduce long\-horizon spending if it is used conservatively: broad cheap screens, replicated intermediate gates, frozen thresholds, and explicit stopping when plausible branches fail\.
## 2\. Contributions
This paper makes four contributions\.
1. It documents a fully staged promotion protocol for this fixed micro-pretraining runner: smoke test, cheap screen, replicated cheap screen, replicated 60-minute confirmation, and two-seed 12-hour confirmation, with frozen gates and auditable budget accounting.
2. It shows that the early screen is not stable enough for aggressive pruning: the 5- and 10-minute reads are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at 10 minutes. Because later stages use different seed ranges, this is operational promotion evidence rather than a within-seed duration effect.
3. It shows that a conservative promotion rule can still retain the long-horizon reference derived from *Staged Factorial Screening for Budget-Constrained Micro-Pretraining* [1] in the promoted set: the bridge reference passes every gate and ranks first in all four 12-hour host-seed cells under this fixed wall-clock protocol, a comparison that is capacity-confounded because the bridge reference is also the largest final condition.
4. It provides budget accounting for the stopping decision: the executed 12-hour confirmation uses 144 GPU-hours, compared with 192 GPU-hours for continuing all four 60-minute candidates and 432 GPU-hours for continuing all nine 10-minute candidates. These comparison budgets are accounting counterfactuals, not observed outcomes for skipped continuations.
## 3\. Related Work
Hyperparameter optimization under finite budgets is the closest methodological context\. Hyperband formulates hyperparameter optimization as adaptive resource allocation with early stopping over randomly sampled configurations \[2\]\. ASHA extends successive\-halving\-style promotion to massively parallel settings \[3\]\. BOHB combines model\-based search with bandit\-style budget allocation \[4\]\. These methods motivate the idea that not every configuration should receive the largest budget\. This paper does not claim to outperform those methods\. Instead, it studies a small, manually auditable promotion protocol for a fixed runner, two workers, and a narrow candidate matrix\.
The distinction is operational\. Hyperband, ASHA, BOHB, and Bayesian optimization are automated search procedures for broader optimization problems\. The protocol studied here is a practitioner\-in\-the\-loop decision record: it freezes gates, preserves reference and control value, and explains why particular continuations are stopped\. It is complementary to automated HPO rather than a replacement for it\.
Reporting practices are also central. Dodge et al. argue that final test scores alone are insufficient and recommend showing validation performance as a function of compute budget [5]. Our figures follow that principle: the paper reports stage trajectories, host-seed cells, and budget counterfactuals rather than only the final 12-hour winner.
Small\-scale pretraining decisions are increasingly important because full pretraining comparisons are expensive\. DataDecide studies how well small experiments predict larger pretraining choices across many corpora and scales \[6\]\. Optimizer\-comparison work has also emphasized that rankings can flip with training scale, tuning effort, and evaluation timing \[7\]\. Those warnings are directly relevant here: we treat early screens as candidate\-generation mechanisms, not as proof of long\-horizon quality\.
## 4\. Experimental Setup
### 4\.1 Runner And Hosts
All experiments use a fixed micro-pretraining runner derived from the prior screening branch reported in [1] and instrumented for the staged-promotion study. The runner reports final validation bits per byte (`val_bpb`, lower is better), parameter count, peak VRAM, total tokens, training seconds, and final checkpoint path. We use `val_bpb` rather than perplexity because it is a direct compression-style validation loss for the fixed byte/token stream and remains comparable across wall-clock-limited runs where models process different token counts.

$$\mathrm{val\_bpb} = -\frac{1}{N \log 2} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}),$$

where $\log$ denotes the natural logarithm, $N$ is the number of evaluated validation bytes/tokens under the fixed tokenizer stream, and lower values indicate better compression of the validation stream. The primary experiment path uses two heterogeneous host blocks:

| host block | operating path | accelerator |
| --- | --- | --- |
| Windows | Windows runner path | NVIDIA A100 40GB |
| Linux | Linux runner path | NVIDIA L40S |

The hosts are not treated as identical replicas. They are blocked observations used to test whether the same promotion decision remains directionally visible after changing operating path and accelerator. We use heterogeneous hardware to test whether promotion decisions survive the most conservative host change available in this environment, while acknowledging that the A100-L40S difference conflates architecture, operating system, driver stack, and filesystem path. The descriptive standard deviations reported in Section 5.4 should be read against this composite-block caveat.
The frozen runner uses locally cached training shards from `karpathy/climbmix-400b-shuffle`, with `shard_06542.parquet` pinned as the validation shard. Tokenization uses a rustbpe-trained, tiktoken-compatible BPE with vocabulary 8192. All runs use context length 2048 and report final `val_bpb` over `40 * 524288` validation tokens from the pinned shard. Appendix A records the dataset URL, shard identifier, tokenizer artifacts, source snapshot, and reproducibility bundle contents.
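As a minimal sketch of the `val_bpb` definition in this section, the following computes bits per evaluated unit from natural-log probabilities. The function name and the toy inputs are illustrative assumptions; the runner's actual evaluation code is not shown in this paper.

```python
import math


def val_bpb(log_probs):
    """Bits per evaluated unit from natural-log probabilities log p(x_i | x_<i)."""
    n = len(log_probs)
    if n == 0:
        raise ValueError("need at least one evaluated position")
    # Dividing by log(2) converts nats to bits; the minus sign makes the loss positive.
    return -sum(log_probs) / (n * math.log(2))


# Toy check (not study data): probability 1/2 at every position costs one bit per unit.
toy_log_probs = [math.log(0.5)] * 8
print(val_bpb(toy_log_probs))

# Evaluation length stated in the text: 40 * 524288 validation tokens.
N_EVAL = 40 * 524288
print(N_EVAL)
```

Because the divisor is the number of evaluated positions, the value stays comparable between runs that trained on different token counts, which is the property the text relies on.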
The runner fixes the run seed with Python and PyTorch seed calls and records the seed in each summary. It does not claim bitwise deterministic replay across GPU architectures: deterministic PyTorch algorithms are not enabled, `torch.set_float32_matmul_precision("high")` is used, and CUDA/cuDNN kernels may differ between the A100 and L40S paths. The seed design therefore supports repeated operational reads, not exact binary replay.
### 4\.2 Candidate Matrix
The starting matrix contains twelve conditions. The prior *Staged Factorial Screening for Budget-Constrained Micro-Pretraining* study [1] screened depth, aspect-ratio, and learning-rate settings and produced the bridge reference, a greedy comparator, and a high-penalty control. The remaining conditions fill local variants and smaller-model cells around that region to test whether staged promotion retains or rejects plausible alternatives. The roles are predeclared so that promotion can retain reference and control value instead of only selecting the current best short-budget row. The short labels are used in later compact result tables. Exact condition identifiers are internal reproducibility IDs and are carried in the ancillary matrix.

| label | condition id | role | depth | aspect | matrix lr | batch |
| --- | --- | --- | --- | --- | --- | --- |
| bridge | p06_bridge_best | reference best | 8 | 64 | 0.05 | 262144 |
| greedy | p06_greedy_winner | search comparator | 6 | 72 | 0.03 | 262144 |
| control | p06_control | high-penalty control | 8 | 48 | 0.03 | 524288 |
| c03 | p06_best_c03 | small reference | 6 | 48 | 0.05 | 262144 |
| c01 | p06_best_c01 | small reference | 6 | 48 | 0.03 | 262144 |
| bridge-d6 | p06_bridge_d6_ar64 | shallow bridge | 6 | 64 | 0.05 | 262144 |
| d4/ar48 | p06_small_d4_ar48_lr05 | aggressive small | 4 | 48 | 0.05 | 262144 |
| d4/ar64 | p06_small_d4_ar64_lr05 | centered small | 4 | 64 | 0.05 | 262144 |
| d4/ar72 | p06_small_d4_ar72_lr03 | shallow wide | 4 | 72 | 0.03 | 262144 |
| d6/ar64 | p06_d6_ar64_lr03 | local variant | 6 | 64 | 0.03 | 262144 |
| d8/ar48 | p06_d8_ar48_lr05 | local variant | 8 | 48 | 0.05 | 262144 |
| d4/highbatch | p06_small_highbatch_d4_ar64 | cheap high-batch control | 4 | 64 | 0.05 | 524288 |
### 4\.3 Promotion Schedule
The experiment uses staged wall\-clock budgets\. Each gate is written before the next expensive stage\.

| stage | candidates | seeds | hosts | budget per condition | budgeted GPU-hours | purpose |
| --- | --- | --- | --- | --- | --- | --- |
| Stage 0 | 3 | 1 | 2 | 2 min | 0.2 | instrumentation smoke test |
| Stage 1A | 12 | 1 | 2 | 5 min | 2.0 | cheap early screen |
| Stage 1B | 12 | 1 | 2 | 10 min | 4.0 | first longer cheap screen |
| Stage 1C | 9 | 1 | 2 | 10 min | 3.0 | seed-43 top-9 replication |
| Stage 2A | 4 | 1 | 2 | 60 min | 8.0 | seed-44 confirmation |
| Stage 2B | 4 | 1 | 2 | 60 min | 8.0 | seed-45 confirmation |
| Stage 3 | 3 | 1 | 2 | 12 h | 72.0 | seed-46 long-horizon test |
| Stage 3B | 3 | 1 | 2 | 12 h | 72.0 | seed-47 confirmation |

The final 12-hour branch therefore spends 144 GPU-hours. The observed recorded training time across all remote result summaries is 169.2 GPU-hours, including earlier screens and confirmation stages. The unrounded internal accounting value is 169.214, computed from per-run training seconds. These are training-time accounting numbers; they do not include queueing, launch overhead, or human supervision time.
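The budgeted GPU-hours column is the product of candidates, seeds, hosts, and budget per condition. A short check of that arithmetic follows, with the rows copied from the schedule table; the variable and function names are illustrative and are not taken from the runner.

```python
# (stage, candidates, seeds, hosts, minutes per condition), copied from the schedule table.
SCHEDULE = [
    ("Stage 0", 3, 1, 2, 2),
    ("Stage 1A", 12, 1, 2, 5),
    ("Stage 1B", 12, 1, 2, 10),
    ("Stage 1C", 9, 1, 2, 10),
    ("Stage 2A", 4, 1, 2, 60),
    ("Stage 2B", 4, 1, 2, 60),
    ("Stage 3", 3, 1, 2, 12 * 60),
    ("Stage 3B", 3, 1, 2, 12 * 60),
]


def budgeted_gpu_hours(candidates, seeds, hosts, minutes):
    """Nominal GPU-hours for one stage: every run occupies one GPU for `minutes`."""
    return candidates * seeds * hosts * minutes / 60


per_stage = {name: budgeted_gpu_hours(c, s, h, m) for name, c, s, h, m in SCHEDULE}
# Sum in integer minutes first so the total is not affected by float accumulation.
total_minutes = sum(c * s * h * m for _, c, s, h, m in SCHEDULE)
total_hours = total_minutes / 60
final_branch_hours = per_stage["Stage 3"] + per_stage["Stage 3B"]

for name, hours in per_stage.items():
    print(f"{name}: {hours:.1f} GPU-hours")
print(f"12-hour branch: {final_branch_hours:.1f}, all stages: {total_hours:.1f}")
```

Under this nominal accounting the two 12-hour stages contribute 144 GPU-hours and all stages together come to 169.2 GPU-hours. The unrounded 169.214 figure is derived from recorded training seconds rather than from these nominal budgets.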
### 4\.4 Frozen Decision Rules
The first cheap screens were used conservatively. After the 5- and 10-minute seed-42 screens, host rank agreement was still low, so the next action was not a 60-minute jump. Instead, we repeated the top-9 subset at 10 minutes with seed 43.
After the replicated 10-minute screen, four conditions were promoted to 60 minutes:

| condition | reason |
| --- | --- |
| d8/ar48 | best robust absolute performer at replicated 10 minutes |
| d6/ar64 | cheap sentinel within the short-budget tolerance |
| bridge | predeclared bridge reference |
| greedy | predeclared greedy comparator |

After the replicated 60-minute gate, three conditions were promoted to 12 hours:

| condition | reason |
| --- | --- |
| bridge | ranked first in every 60-minute host-seed cell |
| greedy | greedy comparator and near-best on Windows |
| d8/ar48 | best cheaper-architecture sentinel |

The frozen Stage 3B rule was:

| condition | pass criterion |
| --- | --- |
| bridge | remains best on both hosts and both 12-hour seeds |
| greedy | within 0.010 `val_bpb` of bridge in all 12-hour host-seed cells |
| d8/ar48 | within 0.020 `val_bpb` of bridge on mean 12-hour gap |

The aggregation asymmetry is intentional but should be read as a policy choice, not a statistical discovery. The greedy comparator is a same-class comparator, so the rule required near-equivalence in every host-seed cell. The d8/ar48 condition is a cheaper-architecture sentinel, so the rule allowed a wider mean-gap tolerance to ask whether a smaller branch was "good enough" on average. The 0.010 and 0.020 `val_bpb` thresholds are predeclared policy bands for this fixed runner and were used as stopping rules, not as general significance thresholds; the sensitivity table lets readers assess how robust the decisions are to nearby alternatives.
The public ancillary bundle includes both the written pre-analysis records and a threshold-sensitivity table. At 12 hours, greedy would pass under a looser 0.020 mean-gap convention but fails the declared all-cell 0.010 max-gap rule; d8/ar48 fails both mean-gap and max-gap checks through 0.030. At 60 minutes, both greedy and d8/ar48 pass a 0.020 mean-gap screen but fail a corresponding max-gap check, consistent with their retention for 12-hour confirmation under predeclared roles rather than acceptance as final. The paper therefore treats the thresholds as frozen stopping rules rather than general superiority tests.
### 4\.5 Analysis Policy
The primary endpoint is final `val_bpb` at the end of the allocated wall-clock budget. Because the long-horizon read has only two seeds per final condition and two heterogeneous host blocks, we do not report p-values for the 12-hour comparison. We report ranks, gaps from the bridge reference, token counts, parameter counts, peak VRAM, and budget accounting.
Hosts are treated as blocks\. Seeds are treated as repeated initialization/data\-order settings within the fixed runner\. The paper uses descriptive thresholds because those thresholds were the promotion criteria, not because the final confirmation package has enough independent observations for broad inference\.
## 5\. Results
### 5\.1 Stage 0: Instrumentation Smoke Test
Stage 0 verified that the instrumented runner produced parseable summaries and final metrics on both hosts\.

| condition | mean `val_bpb` | Windows | Linux | params M | tokens M |
| --- | --- | --- | --- | --- | --- |
| d4/ar64 | 1.274661 | 1.265988 | 1.283334 | 11.534472 | 82.444288 |
| greedy | 1.467645 | 1.377154 | 1.558135 | 39.846284 | 28.835840 |
| bridge | 1.569651 | 1.478259 | 1.661044 | 50.332176 | 20.316160 |

This stage is not used for scientific ranking or screening. It is a pure instrumentation check. It shows why fixed-time comparisons need token accounting: smaller models can process many more tokens inside the same wall-clock budget.
### 5\.2 Stage 1: Cheap Screens Are Useful But Unstable
At 5 minutes, the best Windows and Linux conditions differ. The Windows winner is bridge-d6; the Linux winner is d8/ar48. The cross-host rank agreement is low, so an aggressive prune at this point would be risky.
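The paper does not name the statistic behind "low" rank agreement. As one possible reading, Spearman's rank correlation can be computed from the Windows and Linux ranks in the 5-minute table of this section. The choice of statistic and the helper name are assumptions made here for illustration.

```python
# (Windows rank, Linux rank) per condition at the 5-minute screen, from the Stage 1 table.
RANKS_5MIN = {
    "bridge-d6": (1, 4),
    "d8/ar48": (5, 1),
    "c03": (2, 7),
    "d4/ar48": (8, 2),
    "d6/ar64": (4, 6),
    "d4/ar64": (9, 3),
    "c01": (3, 9),
    "d4/ar72": (6, 10),
    "d4/highbatch": (11, 5),
    "bridge": (10, 8),
    "greedy": (7, 11),
    "control": (12, 12),
}


def spearman_rho(rank_pairs):
    """Spearman correlation for two tie-free rankings of the same n items."""
    n = len(rank_pairs)
    squared_diffs = sum((a - b) ** 2 for a, b in rank_pairs)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


rho = spearman_rho(list(RANKS_5MIN.values()))
print(f"5-minute cross-host Spearman rho: {rho:.3f}")
```

Under this assumed statistic the correlation is about 0.18, which is consistent with the description of low cross-host agreement. The closed-form expression used here is only valid because neither host's ranking contains ties.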

| condition | Windows rank | Linux rank | mean `val_bpb` |
| --- | --- | --- | --- |
| bridge-d6 | 1 | 4 | 1.185989 |
| d8/ar48 | 5 | 1 | 1.180862 |
| c03 | 2 | 7 | 1.191919 |
| d4/ar48 | 8 | 2 | 1.190858 |
| d6/ar64 | 4 | 6 | 1.192230 |
| d4/ar64 | 9 | 3 | 1.191005 |
| c01 | 3 | 9 | 1.198454 |
| d4/ar72 | 6 | 10 | 1.203480 |
| d4/highbatch | 11 | 5 | 1.216565 |
| bridge | 10 | 8 | 1.211893 |
| greedy | 7 | 11 | 1.214838 |
| control | 12 | 12 | 1.264903 |

The replicated 10-minute top-9 screen gives a more useful but still host-sensitive read:

| condition | mean `val_bpb` | mean rank | best-worst rank | gap from best |
| --- | --- | --- | --- | --- |
| d8/ar48 | 1.110701 | 3.25 | 1-6 | 0.000000 |
| d6/ar64 | 1.122445 | 3.25 | 3-4 | 0.011744 |
| bridge-d6 | 1.122590 | 3.25 | 2-5 | 0.011889 |
| bridge | 1.122643 | 4.50 | 2-7 | 0.011942 |
| c01 | 1.126101 | 5.00 | 3-7 | 0.015400 |
| greedy | 1.129024 | 5.00 | 1-9 | 0.018323 |
| c03 | 1.127013 | 6.00 | 5-7 | 0.016313 |
| control | 1.143036 | 6.25 | 4-8 | 0.032335 |
| d4/ar48 | 1.152692 | 8.50 | 8-9 | 0.041991 |

The final 12-hour top-ranked condition, the bridge reference, is not the mean-best condition at this gate. The mean-best condition at this gate is d8/ar48 (33.0M parameters), which later ranks third in the final 12-hour confirmation package. The bridge reference remains in the promoted set because the promotion rule includes predeclared reference/control value rather than only short-budget rank. Because subsequent stages use new seeds, this observation should be read as a conservative promotion warning, not as a within-seed learning-curve reversal.
Figure 1 summarizes the staged funnel\. It shows the key design choice: spend cheap budgets broadly, then narrow before the expensive12\-hour branch\.
Figure 1: Staged promotion funnel. The design narrows candidate count before spending 12-hour budgets.
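One narrowing step of the funnel can be sketched as a function from a scored candidate set to a smaller promoted set. The rule below (keep every protected role, plus the `top_k` lowest mean `val_bpb`) is a hypothetical simplification written for illustration. The study's actual gates are written pre-analysis records with role-based reasons, and `promote`, `protected`, and `top_k` are names introduced here, not taken from the runner. The scores are the mean values from the replicated 10-minute table.

```python
# Mean val_bpb at the replicated 10-minute gate, copied from the Stage 1 table.
MEAN_10MIN = {
    "d8/ar48": 1.110701,
    "d6/ar64": 1.122445,
    "bridge-d6": 1.122590,
    "bridge": 1.122643,
    "c01": 1.126101,
    "greedy": 1.129024,
    "c03": 1.127013,
    "control": 1.143036,
    "d4/ar48": 1.152692,
}


def promote(scores, protected, top_k):
    """Hypothetical gate: keep protected roles plus the top_k lowest scores."""
    ranked = sorted(scores, key=scores.get)  # lower val_bpb is better
    keep = set(protected) | set(ranked[:top_k])
    return [label for label in ranked if label in keep]


# Conservative gate: protect the predeclared reference and comparator.
conservative = promote(MEAN_10MIN, protected={"bridge", "greedy"}, top_k=2)
# Naive gate: promote only the short-budget leader.
naive = promote(MEAN_10MIN, protected=set(), top_k=1)

print("conservative:", conservative)
print("naive:", naive)
print("bridge survives naive gate:", "bridge" in naive)
```

With these illustrative settings the conservative call returns the same four labels that the study promoted to 60 minutes, while the naive call drops the bridge reference. That agreement depends on the chosen `top_k` and should not be read as a reconstruction of the written gate.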
### 5\.3 Stage 2: Replicated 60\-Minute Gate Retains The Bridge
The replicated 60-minute gate gives a stronger promotion read under seeds 44 and 45. The bridge reference ranks first in all four host-seed cells. Because these are not the same seeds used at 10 minutes, the result should be read as stage-gate evidence for retention, not as proof that duration alone changed the ordering.

| condition | mean `val_bpb` | sd `val_bpb` | mean gap | max gap | params M | tokens M |
| --- | --- | --- | --- | --- | --- | --- |
| bridge | 0.987940 | 0.017719 | 0.000000 | 0.000000 | 50.332176 | 609.812480 |
| greedy | 1.004475 | 0.029165 | 0.016535 | 0.026927 | 39.846284 | 790.626304 |
| d8/ar48 | 1.007118 | 0.010141 | 0.019178 | 0.026238 | 33.030544 | 850.001920 |
| d6/ar64 | 1.025509 | 0.018875 | 0.037569 | 0.039608 | 26.345772 | 1081.081856 |

This gate drops the weakest of the four promoted conditions and sends three final conditions to 12 hours: bridge, greedy comparator, and the best cheaper-architecture sentinel.
Figure 2 shows the gap trajectory across the operational gate sequence. The short-budget leader does not remain competitive at 12 hours. The bridge reference separates in the later gate reads, with the seed-budget confound noted above.
Figure 2: Early promise separates from 12-hour survival. Lines show mean `val_bpb` gap from the stage best across the replicated 10-minute, replicated 60-minute, and two-seed 12-hour reads; stages use different seed ranges, so the trajectory is operational promotion evidence rather than a within-seed learning curve.
### 5\.4 Stage 3: 12\-Hour Confirmation Retains The Bridge Reference
The final confirmation package runs three conditions at 12 hours on both hosts with seeds 46 and 47. The bridge ranks first in all four host-seed cells.

| condition | first-rank cells | mean `val_bpb` | sd `val_bpb` | mean gap | max gap | params M | tokens M |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bridge | 4/4 | 0.931915 | 0.006400 | 0.000000 | 0.000000 | 50.332176 | 7275.347968 |
| greedy | 0/4 | 0.951208 | 0.010326 | 0.019294 | 0.022927 | 39.846284 | 9428.336640 |
| d8/ar48 | 0/4 | 0.964701 | 0.004437 | 0.032786 | 0.036890 | 33.030544 | 10090.119168 |

The standard deviations are descriptive across the four host-seed cells. Given the composite host block, this variation combines seed effects with host-block effects and is not a seed-only dispersion estimate.
The frozen thresholds are not met by either alternative. The greedy comparator fails the 0.010 `val_bpb` all-cell near-equivalence rule because its maximum gap is 0.022927 and its mean gap is 0.019294. The d8/ar48 cheaper sentinel fails the cheaper-architecture rule because its mean gap is 0.032786, above the 0.020 threshold.
The threshold sensitivity is:

| check | observed gap | declared rule | decision | sensitivity note |
| --- | --- | --- | --- | --- |
| greedy max gap | 0.022927 | max gap <= 0.010 | fail | would pass only at a looser 0.030 max-gap rule |
| greedy mean gap | 0.019294 | diagnostic only | not primary | would pass a looser 0.020 mean-gap screen |
| d8/ar48 mean gap | 0.032786 | mean gap <= 0.020 | fail | fails through 0.030 |
| d8/ar48 max gap | 0.036890 | diagnostic only | fail | fails through 0.030 |

This result is also capacity-confounded. The bridge reference has 50.3M parameters, compared with 39.8M for the greedy comparator and 33.0M for d8/ar48. In the 12-hour final set, parameter count and final `val_bpb` move in the same direction while the smaller models process more tokens, so the result should not be read as a capacity-normalized recipe comparison. It shows that the promoted bridge reference remained the strongest observed condition under this wall-clock protocol, not that the promotion rule separated workflow quality from model size.
The paired host\-seed values are:

| host | seed | bridge | greedy | d8/ar48 |
| --- | --- | --- | --- | --- |
| Windows | 46 | 0.926470 | 0.941986 | 0.959303 |
| Linux | 46 | 0.937029 | 0.959956 | 0.966803 |
| Windows | 47 | 0.926290 | 0.942552 | 0.963180 |
| Linux | 47 | 0.937869 | 0.960339 | 0.969518 |

Figure 3 shows the same result as a host-seed confirmation plot.
Figure 3: 12-hour seed-stability check across two hosts. The bridge condition ranks first in all four host-seed cells.
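The frozen Stage 3B checks can be recomputed from the paired host-seed values in this section. The sketch below is illustrative rather than the study's analysis script. The helper names are assumptions, and the dispersion is taken as the sample standard deviation over the four cells, an assumption that agrees with the reported `sd` column to about five decimal places.

```python
from statistics import mean, stdev

# Final val_bpb per (host, seed) cell, copied from the paired host-seed table.
CELLS = {
    ("Windows", 46): {"bridge": 0.926470, "greedy": 0.941986, "d8/ar48": 0.959303},
    ("Linux", 46): {"bridge": 0.937029, "greedy": 0.959956, "d8/ar48": 0.966803},
    ("Windows", 47): {"bridge": 0.926290, "greedy": 0.942552, "d8/ar48": 0.963180},
    ("Linux", 47): {"bridge": 0.937869, "greedy": 0.960339, "d8/ar48": 0.969518},
}


def gaps(condition, reference="bridge"):
    """Per-cell val_bpb gap from the reference (positive means worse)."""
    return [cell[condition] - cell[reference] for cell in CELLS.values()]


def summarize(condition):
    values = [cell[condition] for cell in CELLS.values()]
    cell_gaps = gaps(condition)
    return {
        "mean": mean(values),
        "sd": stdev(values),  # sample standard deviation across the four cells
        "mean_gap": mean(cell_gaps),
        "max_gap": max(cell_gaps),
    }


# Frozen Stage 3B rules as stated in Section 4.4.
bridge_pass = all(min(cell, key=cell.get) == "bridge" for cell in CELLS.values())
greedy_pass = max(gaps("greedy")) <= 0.010      # all-cell near-equivalence
d8_pass = mean(gaps("d8/ar48")) <= 0.020        # mean-gap cheaper-architecture rule

for name in ("bridge", "greedy", "d8/ar48"):
    print(name, {k: round(v, 6) for k, v in summarize(name).items()})
print("bridge:", bridge_pass, "greedy:", greedy_pass, "d8/ar48:", d8_pass)
```

The all-cell rule is evaluated on the maximum gap because a single cell outside the band is enough to fail it, whereas the cheaper-architecture rule only looks at the average.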
### 5\.5 Cost\-Quality Frontier
The final result is not a cheap-model victory. The smaller candidates process more tokens within the same wall-clock budget, but they do not match the bridge in final `val_bpb`.
Figure 4 makes that tradeoff explicit. Bubble area tracks tokens processed. The 33.0M-parameter d8/ar48 candidate processes about 10.09B tokens across the four host-seed cells, compared with about 7.28B for the 50.3M-parameter bridge, but its final mean `val_bpb` is worse by 0.032786.
Figure 4: Cost-quality frontier at 12 hours. Smaller models process more tokens but remain worse in final `val_bpb` in this final set.
### 5\.6 Budget Counterfactual
The primary cost result is the stopping decision. The final branch spends 144 GPU-hours on three conditions across two seeds and two hosts. Continuing all four 60-minute candidates would spend 192 GPU-hours. Continuing all nine replicated 10-minute candidates would spend 432 GPU-hours.

| scenario | 12-hour confirmation GPU-hours | added GPU-hours vs executed branch |
| --- | --- | --- |
| executed three-condition branch | 144 | 0 |
| continue all four 60-minute candidates | 192 | 48 |
| continue all nine 10-minute candidates | 432 | 288 |

These numbers are accounting counterfactuals, not observed outcomes for the skipped continuations. They show how much long-horizon budget the promotion rule avoided spending after plausible branches did not meet the frozen thresholds. They should not be read as evidence that the skipped continuations could not have won or failed to improve.
Figure 5: Promotion gates reduced long-horizon spend under frozen stopping rules. The executed branch uses 144 GPU-hours; continuing all top-9 candidates would spend 432 GPU-hours as an accounting counterfactual.
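The counterfactual budgets are products of condition count, seeds, hosts, and hours per run. A minimal check of the figures in this section follows; the names are illustrative.

```python
HOURS_PER_RUN = 12
SEEDS = 2   # seeds 46 and 47
HOSTS = 2   # Windows A100 and Linux L40S blocks


def confirmation_gpu_hours(n_conditions):
    """GPU-hours to run n_conditions through the two-seed, two-host 12-hour stage."""
    return n_conditions * SEEDS * HOSTS * HOURS_PER_RUN


executed = confirmation_gpu_hours(3)
scenarios = {
    "executed three-condition branch": 3,
    "continue all four 60-minute candidates": 4,
    "continue all nine 10-minute candidates": 9,
}

for label, n in scenarios.items():
    total = confirmation_gpu_hours(n)
    print(f"{label}: {total} GPU-hours (+{total - executed} vs executed)")
```

This only reproduces the accounting. It says nothing about how the unrun continuations would have scored.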
## 6\. Discussion
The result supports a conservative use of short experiments. The early screens are valuable because they expose promising regions and obvious failures, but they are not stable enough for final claims. The bridge condition is not the mean-best condition at the replicated 10-minute gate. If the procedure had selected only the short-budget winner, the final 12-hour reference would have been dropped.
Within this seed-confounded staged design, the 60-minute gate is the earliest stage where the bridge reference ranks first in all host-seed cells under the seeds used at that stage. It is still much cheaper than a 12-hour continuation, and it gives a better basis for spending the later budget on a smaller anchor set, while not proving that duration alone caused the ordering change.
The failed cheaper branch is scientifically useful. The d8/ar48 candidate is the clearest example of why cheap screens should be treated as candidate-generation mechanisms rather than final evidence. It is the best absolute performer at the replicated 10-minute gate, remains close enough at 60 minutes to justify a 12-hour check, and then ranks third in the final confirmation package. Its narrow aspect ratio and smaller parameter count allow more tokens under the fixed wall-clock budget, but that throughput advantage does not translate into lower final `val_bpb` in this final set. The result does not show that smaller models cannot win. It shows that this particular cheaper branch did not meet the frozen "good enough" criterion, so the study stops instead of extending it to 24 hours.
This is also why the paper should not be framed as a replacement for adaptive HPO systems\. Hyperband, ASHA, BOHB, and Bayesian optimization address broader optimizer problems\. This paper addresses a narrower question: can a small, transparent promotion schedule stop additional long\-horizon spending in a fixed experimental branch after declared criteria are not met? In this case, yes\.
Among the five candidates evaluated at the replicated 10-minute gate but not promoted to 60 minutes, all had higher `val_bpb` than d8/ar48 at that gate, and d8/ar48 itself did not meet the cheaper-architecture threshold at 12 hours. This transitivity argument does not prove the dropped conditions would have failed, but it bounds the plausibility gap under the observed gate sequence.
## 7\. Limitations
The long\-horizon confirmation has only two seeds per final condition\. The result is a descriptive blocked comparison, not a high\-powered inferential study\.
Seeds and budgets are partially confounded across stages. The 10-minute, 60-minute, and 12-hour reads use different seed ranges, so ranking changes across stages cannot be attributed to training duration alone. The stage comparisons are operational promotion evidence, not within-seed learning-curve estimates.
The hosts are heterogeneous\. Windows A100 and Linux L40S should be read as host blocks, not matched hardware replications\.
The 12-hour comparison is also capacity-confounded. The bridge reference is the largest of the three final conditions, and the paper does not include equal-parameter or equal-token controls that would separate model capacity from promotion-rule effects.
The run order was sequential. Thermal, cache, filesystem, or host-state effects could influence timing and throughput. The main endpoint is final `val_bpb`, but order should still be disclosed in any public artifact package.
The budget counterfactuals are accounting comparisons. They do not prove that uncontinued candidates would fail at 12 hours.
The candidate matrix is inherited from a prior branch and local variants around it\. The result does not establish global optimality for the configuration space\.
The paper does not run Hyperband, ASHA, BOHB, or Bayesian optimization baselines\. Those methods are related work and framing context, not defeated comparators\.
The public package uses curated summaries and source snapshots rather than host-local execution folders. Scrubbed copies of the copied host summaries are included under `anc/remote-results/`, with path-bearing columns removed. Internal `remote-plans/`, raw run directories, checkpoint paths, and launcher scripts contain environment-specific paths and are excluded from the arXiv upload bundle.
## 8\. Conclusion
In this fixed micro-pretraining runner and twelve-condition candidate matrix, a conservative staged promotion rule retained the observed long-horizon bridge reference in the promoted set across two 12-hour seeds and two heterogeneous host blocks; that final comparison remains capacity-confounded because the bridge reference is also the largest final condition. The same frozen thresholds were not met by a greedy comparator or a plausible cheaper-architecture branch, so the study stopped before spending an additional 24-hour or continue-all budget.
The paper's main lesson is disciplined stopping. Small experiments add value when they are used to decide what not to run next, provided the promotion rule retains reference conditions, repeats unstable cheap screens, and avoids turning short-budget leaders into unsupported long-horizon claims. The portable part is the discipline: predeclared reference/control roles, replicated cheap screens, frozen stopping rules, explicit stopping, and transparent sensitivity checks. The runner-specific parts are the absolute 0.010 and 0.020 thresholds, the all-cell versus mean-gap threshold asymmetry, the A100/L40S host block, and the 169.2 GPU-hour accounting.
## Appendix A\. Reproducibility Snapshot
The public arXiv package uses a curated `anc/` directory rather than host-local execution folders. Start with:
- `anc/README.txt` for the package overview.
- `anc/MANIFEST.json` for the package inventory.
- `anc/table_manifest.json` to map manuscript tables to data.
- `anc/figure_manifest.json` to map figures to source artifacts.
The internal project workspace is organized around four artifact groups:
- Candidate definition and gates: the starting matrix plus frozen pre-analysis and stage-decision records.
- Observed run summaries: copied host summaries and derived stage-analysis tables, with path-bearing columns removed from the public bundle.
- Reproducibility scripts: the figure generator, analysis scripts, and instrumented runner snapshot.
- Public figures and manifests: manuscript figures plus machine-readable mappings from figures and tables back to their source artifacts.
The public package includes source snapshots, scrubbed matrices, scrubbed remote\-result summaries, scrubbed analysis summaries, figure\-generation scripts, and separate figure/table manifests that map derived public outputs to source artifacts or declared setup records\. The gate records are under anc/preanalysis/; threshold sensitivity is under anc/analysis/p06\_threshold\_sensitivity\_2026\-05\-02\.\*; and host\-result summaries are under anc/remote\-results/\.
See Section 4\.1 for the fixed runner, dataset, validation shard, tokenizer, context length, and validation\-token count\.
The runner source snapshot records the seed calls and CUDA environment fields captured by the summary writer\. Deterministic PyTorch algorithms are not enabled in the public runner snapshot, so exact bitwise replay across host blocks is not claimed\.
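For readers who want stricter replay on a single host, the sketch below shows what opting into PyTorch's deterministic mode could look like. It is not part of the public runner snapshot, which leaves deterministic algorithms disabled; the helper name `configure_determinism` is hypothetical, and the torch import is guarded so the function still seeds Python's `random` module when PyTorch is absent. Even with these settings, PyTorch does not guarantee identical results across different devices or platforms, so this would not by itself give bitwise agreement between the A100 and L40S blocks.

```python
import os
import random


def configure_determinism(seed):
    """Seed RNGs and, if PyTorch is installed, request deterministic algorithms."""
    settings = {"seed": seed, "torch": False}
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return settings  # PyTorch not available; only Python's RNG was seeded.

    # cuBLAS needs this workspace setting for deterministic behavior on CUDA >= 10.2;
    # it must be set before the first cuBLAS call in the process.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    torch.manual_seed(seed)                   # seeds the CPU and CUDA generators
    torch.use_deterministic_algorithms(True)  # raise an error on nondeterministic ops
    torch.backends.cudnn.benchmark = False    # avoid autotuned kernel selection
    settings["torch"] = True
    return settings
```

Deterministic mode can slow training and makes some operations raise errors instead of running, which is a plausible reason for a wall-clock-budgeted runner to leave it off.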
The generated cache binaries are not bundled in this paper workspace\. The public source bundle provides the source snapshot, generator code, dataset/shard identifiers, tokenizer artifact names, scrubbed host summaries, and derived analysis tables\. Exact binary cache hashes are therefore unavailable unless recovered from the training hosts\.
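If the cache binaries are still present on a training host, their hashes could be recovered with a small script along these lines. This is a sketch under stated assumptions: the `*.bin` pattern and the helper names are hypothetical, since the package description does not state the cache file naming.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large binaries are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_cache_dir(cache_dir, pattern="*.bin"):
    """Map each matching file (path relative to cache_dir) to its SHA-256 hex digest."""
    root = Path(cache_dir)
    return {p.relative_to(root).as_posix(): sha256_file(p)
            for p in sorted(root.rglob(pattern))}
```

Writing the resulting mapping to a JSON file next to the host summaries would let later readers verify that a regenerated cache matches the one used for the reported runs.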
The local internal folder contains remote execution plans and copied summaries with environment\-specific paths\. These are useful for provenance, but the public arXiv package uses curated summaries and excludes those host\-local operational files\. The bundle also excludes Python caches, raw run work directories, host\-local launch scripts, checkpoint binaries, and absolute private host paths\. The included scrubbed summaries are sufficient to re\-derive the reported tables and figures without the private execution folders\.
## References
\[1\] Felipe Chavarro Polania\. Staged Factorial Screening for Budget\-Constrained Micro\-Pretraining\. arXiv:2606\.05186, 2026\.[https://arxiv\.org/abs/2606\.05186](https://arxiv.org/abs/2606.05186)
\[2\] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar\. Hyperband: A Novel Bandit\-Based Approach to Hyperparameter Optimization\. Journal of Machine Learning Research, 18\(185\):1\-52, 2018\.[https://jmlr\.org/papers/v18/16\-558\.html](https://jmlr.org/papers/v18/16-558.html)
\[4\] Stefan Falkner, Aaron Klein, and Frank Hutter\. BOHB: Robust and Efficient Hyperparameter Optimization at Scale\. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1437\-1446, 2018\.[https://proceedings\.mlr\.press/v80/falkner18a\.html](https://proceedings.mlr.press/v80/falkner18a.html)
\[5\] Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A\. Smith\. Show Your Work: Improved Reporting of Experimental Results\. Proceedings of EMNLP\-IJCNLP, pages 2185\-2194, 2019\.[https://aclanthology\.org/D19\-1224/](https://aclanthology.org/D19-1224/)
\[6\] Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D\. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A\. Smith, Pang Wei Koh, and Jesse Dodge\. DataDecide: How to Predict Best Pretraining Data with Small Experiments\. arXiv:2504\.11393, 2025\.[https://arxiv\.org/abs/2504\.11393](https://arxiv.org/abs/2504.11393)
\[7\] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang\. Fantastic Pretraining Optimizers and Where to Find Them\. arXiv:2509\.02046, 2025\.[https://arxiv\.org/abs/2509\.02046](https://arxiv.org/abs/2509.02046)