From Forecasting Leaderboards to Deployment Decisions: A Fail-Closed Certification Protocol
Summary
This paper introduces a fail-closed certification protocol to determine when a forecasting leaderboard winner can be reliably used as deployment-ready top-1 advice, given a fixed decision interface and deployed utility. It presents a locked native audit that prevents overclaiming by blocking apparent forecast/deployment winner inversions.
Source: [https://arxiv.org/html/2606.24996](https://arxiv.org/html/2606.24996)
Geumyoung Kim, Chungbuk National University, goldzero@chungbuk.ac.kr
Abstract
Forecasting leaderboards rank models by predictive quality, but their winners are often read as deployment-ready top-1 advice. That reading can fail when forecasts are passed through a fixed decision interface, such as an alert threshold, a top-$k$ budget, or a switching-cost policy. We study when a forecast-side winner can be certified as deployment-actionable for a specified interface and deployed utility. We introduce a fail-closed certification protocol whose gates are sufficient evidential conditions for a strong claim: a friction-caused, non-tie, statistically supported, and recurrent deployment-side reversal. Traffic-Hourly provides a certified anchor: winners agree at zero friction, but positive switching friction makes the forecast winner deployed-suboptimal. A locked native audit tests overclaiming: across 22 verified candidates and 362 full-grid cells, 155 apparent forecast/deployment winner inversions are blocked before certification. The contribution is not a new forecaster, metric, or universal utility, but a conservative protocol for deciding when forecasting leaderboard winners should be read as deployment-actionable top-1 advice.
GitHub: [github.com/GamGomYang/forecast-actionability](https://github.com/GamGomYang/forecast-actionability)
# 1 Introduction
Forecasting leaderboards are usually designed to answer a predictive question: which model forecasts best under a chosen accuracy, calibration, or probabilistic scoring metric? In practice, however, a leaderboard winner is often used more strongly, as if it were also the safest model-selection recommendation for downstream deployment. This stronger interpretation can fail once forecasts are passed through a fixed decision interface. An alerting system may issue warnings when predicted risk exceeds a threshold, select the top-$k$ alerts under a fixed budget, or penalize frequent action switches. A highly reactive forecaster can improve one-step predictive accuracy while inducing costly switching after the interface is applied. In such cases, the forecast-side winner need not be the deployed-side winner.
Figure 1: Fail-closed certification for forecasting leaderboards. (A) A forecast-side winner becomes deployment-actionable top-1 advice only after a fixed interface, deployed-utility evaluation, and pre-specified certification gates. (B) In the locked native audit, 155 apparent forecast/deployment winner inversions are routed through the gates and zero receive certified promotion, illustrating overclaim prevention.
This paper asks an audit question: when is a forecasting leaderboard winner sufficiently supported as deployment-actionable top-1 advice? The protocol does not replace the forecasting metric, retrain the forecasters, or propose a universal utility function. It audits a specified leaderboard cell together with a fixed forecast-to-decision interface and a deployed utility. The *forecast-side winner* is the model ranked first by the forecasting metric, and the *deployed-side winner* is the model with the highest deployed utility after the same interface is applied to every forecaster. Formally, for forecaster $m$, let $s_m$ denote its forecast-side score and $u_m(\kappa)$ denote the deployed utility obtained after applying the fixed interface $g$ under friction level $\kappa$. The forecast-side winner is $m_F = \arg\min_m s_m$ for a forecasting loss, whereas the deployed-side winner is $m_D(\kappa) = \arg\max_m u_m(\kappa)$. A case is deployment-actionable only if the evidence supports reading $m_F$ as reliable top-1 deployment advice under this specified interface.
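The two winner definitions can be made concrete with a minimal Python sketch. The model names and all numbers below are hypothetical illustrations, not values from the paper; the only parts taken from the text are the argmin over a forecasting loss and the argmax over deployed utility at each friction level.

```python
from typing import Dict


def forecast_winner(scores: Dict[str, float]) -> str:
    """m_F = argmin_m s_m, assuming the forecast-side score is a loss."""
    return min(scores, key=scores.get)


def deployed_winner(utilities: Dict[str, float]) -> str:
    """m_D(kappa) = argmax_m u_m(kappa) for one friction level."""
    return max(utilities, key=utilities.get)


# Hypothetical leaderboard cell (illustrative numbers only).
scores = {"reactive_short": 0.21, "lagged_smoother": 0.24, "calibrated": 0.25}
utility_by_kappa = {
    0.0: {"reactive_short": 10.0, "lagged_smoother": 9.5, "calibrated": 9.2},
    0.5: {"reactive_short": 6.0, "lagged_smoother": 8.1, "calibrated": 7.9},
}

m_f = forecast_winner(scores)
winners = {kappa: deployed_winner(u) for kappa, u in utility_by_kappa.items()}
# An apparent inversion at a friction level means m_F != m_D(kappa).
apparent_inversion = {kappa: w != m_f for kappa, w in winners.items()}
```

In this toy cell the winners agree at zero friction and diverge at positive friction, which is the pattern the later gates then have to confirm before anything is certified.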
The certification protocol is fail-closed: ambiguous cases are not promoted to headline failures. If the zero-friction baseline is not aligned, if the winner changes under a tie audit, or if the deployed shortfall lacks conservative uncertainty support, the row is routed to a diagnostic or review-needed outcome rather than certified as a deployment-facing selection failure. Each certification gate corresponds to a competing explanation for an apparent forecast/deployment winner mismatch: objective mismatch, absence of a positive-friction reversal, tie instability, statistical uncertainty, or insufficient recurrence/support. The report card is the user-facing output; the fail-closed certification protocol is the evidential rule that determines which rows, if any, may be promoted. The evaluated object is therefore not a standalone forecasting model, dataset, or scoring rule, but a leaderboard cell together with a specified forecast-to-decision interface and a pre-specified decision about whether its top-1 forecast winner is deployment-actionable. Appendix [A](https://arxiv.org/html/2606.24996#A1) gives the full label vocabulary and first-failing-gate rule.
The contribution is threefold: \(i\) a formalization of the deployment\-facing interpretation problem for forecasting leaderboards, where the model that ranks first by forecast quality should not automatically be treated as the best deployment recommendation after a fixed interface; \(ii\) a fail\-closed certification protocol that separates certified deployment\-facing selection failures from objective mismatch, tie sensitivity, uncertainty limitation, low\-support evidence, and no\-detected\-failure cases; and \(iii\) an empirical demonstration of both sides of the protocol, with Traffic\-Hourly certifying a clean failure when all gates pass and the locked native audit blocking 155 apparent forecast/deployment winner inversions before promotion without sufficient evidence\.
## 2 Results: Two Roles for the Certification Protocol
### 2.1 Evidence Roles: Anchor Certification vs. Overclaim Prevention
We use the fail\-closed certification protocol in two distinct evidence roles\. The anchor suite asks whether the gates can certify a clean deployment\-facing selection failure when the required assumptions hold\. The locked native audit asks the complementary question: whether apparent forecast/deployment winner mismatches are prevented from becoming headline claims when the assumptions fail\.
These roles are fixed before interpretation\. Traffic\-Hourly is the primary certified anchor; Event\-micro is caveated support; the locked native audit is an overclaim\-prevention audit; NOAA is frozen appendix confirmation; and inventory is a bounded operational check\. This separation prevents mixed cases from being upgraded into primary positive evidence\.
Table 1: Clean anchor cases: Traffic-Hourly passes all certification gates under positive friction. The table reports whether the forecast-side winner becomes deployed-suboptimal after the fixed interface is evaluated.
*Note.* Mean shortfall is the paired deployed-utility loss from choosing the forecast-side winner instead of the deployed-side winner. Subopt. seeds counts the seeds where the forecast-side winner is deployed-suboptimal. Abbreviations: R-short = Reactive short, R-sharp = Reactive sharp, L-smooth = Lagged smoother, Calib. = Calibrated. Traffic family sweeps are in Appendix Table [A5](https://arxiv.org/html/2606.24996#A3.T5).
### 2.2 Clean Anchor: Traffic-Hourly Can Be Certified
Traffic-Hourly is a forecasting-native alert-selection setting: forecasters produce hourly risk scores, a fixed-budget interface selects the top-$k$ alerts, and deployed utility rewards correct alert allocation while penalizing action switching through the friction parameter $\kappa$. Table [1](https://arxiv.org/html/2606.24996#S2.T1) isolates clean anchor cases from the broader native audit. In Traffic-Hourly, forecast and deployed winners agree at zero friction; under positive switching friction, Reactive short remains forecast-best but the deployed winner shifts to Lagged smoother or Calibrated/Lagged alternatives. In the representative Top-$k$ rows with $k=249$, the forecast-selected model is deployed-suboptimal in 100/100 seeds at both $\kappa=0.5$ and $\kappa=1.0$. Appendix Table [A5](https://arxiv.org/html/2606.24996#A3.T5) reports the within-family breadth check over five Top-$k$ budgets and two relative-rank variants.
The mechanism is intentionally transparent. The reactive model wins the forecast-side score because it tracks short-run variation, but this same reactivity induces costly switching after the fixed interface is applied. Once $\kappa$ is positive, smoother alternatives can become deployment-optimal even though they do not win the forecast-side leaderboard.
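The switching mechanism can be illustrated with a small sketch. The utility form used here (correct alerts minus $\kappa$ times the number of alert-slot changes between consecutive steps) is an assumption chosen for illustration; the paper's exact deployed utility is defined in its configuration files and may differ. The function names and toy arrays are likewise hypothetical.

```python
import numpy as np


def top_k_alerts(risk, k):
    """Fixed-budget interface: flag the k highest-risk sites at each time step."""
    risk = np.asarray(risk, dtype=float)
    order = np.argsort(-risk, axis=1, kind="stable")[:, :k]
    mask = np.zeros(risk.shape, dtype=bool)
    np.put_along_axis(mask, order, True, axis=1)
    return mask


def deployed_utility(risk, events, k, kappa):
    """Assumed utility: hits minus kappa times alert changes between steps."""
    alerts = top_k_alerts(risk, k)
    events = np.asarray(events, dtype=bool)
    hits = np.logical_and(alerts, events).sum()
    switches = np.logical_xor(alerts[1:], alerts[:-1]).sum()
    return float(hits - kappa * switches)


# Toy data: the event alternates between site 0 and site 1 (illustrative only).
events = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
reactive = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
smooth = [[0.6, 0.4, 0.0]] * 4

u_reactive = {kappa: deployed_utility(reactive, events, 1, kappa) for kappa in (0.0, 0.5)}
u_smooth = {kappa: deployed_utility(smooth, events, 1, kappa) for kappa in (0.0, 0.5)}
```

Under this assumed utility the reactive forecaster is better at $\kappa=0$ (4 hits versus 2) but worse at $\kappa=0.5$, because its six alert changes cost more than the two extra hits are worth.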
Event\-micro is retained as caveated support rather than a second primary anchor: its positive\-friction rows are stable, but its near\-zero row is less clean than Traffic\-Hourly\. It therefore supports the anchor pattern without being upgraded to primary evidence\.
We use Traffic-Hourly as a mechanistic anchor rather than as a prevalence estimate. The certification protocol certifies it as a deployment-facing top-1 selection failure: the zero-friction row rules out pure objective mismatch, and the positive-friction rows satisfy the inversion, stability, uncertainty, and support requirements. Appendix checks cover Traffic-Hourly budgets, relative-rank variants, rolling-split proxies, and $\epsilon$-tie audits. Inventory remains a limited operational corroboration check rather than a third primary anchor.
### 2.3 Locked Native Audit: Apparent Inversions Are Not Enough
The pre\-specified native audit tests the opposite risk: whether apparent forecast/deployment winner inversions would be promoted without sufficient evidence if reported naively\. Here native means the locked real\-data candidate grid used by the audit rather than the appendix\-only NOAA check; locked means the candidate set and gate rules are fixed before assigning report\-card labels\. This audit is deliberately not used to estimate how often deployment\-facing selection failures occur; it tests whether the certification protocol resists promoting naive winner mismatches when the certification assumptions are not met\. The certification protocol is applied to 22 verified real\-data native candidates and 362 native full\-grid cells\. Among these cells, 155 show an apparent forecast/deployment winner inversion\. This audit is therefore adversarial to overclaiming rather than favorable to promotion: it starts from many apparent forecast/deployment winner mismatches and asks whether any survive the same evidential requirements imposed on the positive anchor\.
The native audit therefore serves as an overclaim-prevention check rather than a prevalence estimate: apparent forecast/deployment winner inversions are not promoted unless all fail-closed gates are satisfied. The zero-friction/objective gate blocks 96 rows, the tie-stability gate blocks 2, the bootstrap-CI gate blocks 32, and the recurrence/support gate blocks the remaining 25. No apparent forecast/deployment winner inversion in the locked native audit passes all gates, so the audit yields zero certified promotions and zero failures promoted without sufficient evidence among the 155 apparent inversions. The 59 rows that remain after the zero-friction/objective gate are retained as review-needed rather than certified evidence.
This is the intended conservative behavior. The locked audit does not prove that deployment risk is absent; rather, it shows that winner mismatch alone is insufficient evidence for a deployment-facing selection-failure claim. Ambiguous rows remain visible as review-needed diagnostics instead of being converted into headline failures. Figure [1](https://arxiv.org/html/2606.24996#S1.F1)B visualizes the gate flow, and Table [2](https://arxiv.org/html/2606.24996#S2.T2) reports the exact ledger.
Table 2: Locked native overclaim-prevention ledger. No apparent inversion clears every gate.

| Gate / outcome | Remaining | Routed |
| --- | --- | --- |
| All native cells | 362 | – |
| Apparent inversion screen | 155 | 207 |
| Zero-friction/objective gate | 59 | 96 |
| Tie-stability gate | 57 | 2 |
| Bootstrap-CI gate | 25 | 32 |
| Recurrence/support gate | 0 | 25 |
| Final certified promotions | 0 | – |
| Rows retained as review-needed | 59 | – |
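The ledger arithmetic can be checked directly. The short sketch below uses only the counts reported in this section; the variable names are illustrative.

```python
# Counts taken from the locked native audit ledger.
cells_total = 362
apparent_inversions = 155
routed_by_gate = [
    ("zero-friction/objective", 96),
    ("tie-stability", 2),
    ("bootstrap-CI", 32),
    ("recurrence/support", 25),
]

screened_out = cells_total - apparent_inversions  # cells with no apparent inversion

remaining = apparent_inversions
ledger = []
for gate, routed in routed_by_gate:
    remaining -= routed
    ledger.append((gate, remaining, routed))

certified = remaining
# Rows blocked after the zero-friction/objective gate stay visible as review-needed.
review_needed = sum(routed for _, routed in routed_by_gate[1:])
```

The running remainder reproduces the Remaining column (59, 57, 25, 0), and the rows routed by the last three gates sum to the 59 review-needed rows.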
### 2.4 Power and False-Promotion Diagnostics
A certification-power diagnostic explains why low-support native rows are routed to review-needed: when already-certified Traffic-Hourly and Event-micro anchors are downsampled, no row is promoted at $n \leq 20$, and promotion appears only around $n = 30$ or above.
This diagnostic explains why the support gate is not a cosmetic restriction\. Under the same promotion logic, even already\-certified anchor rows are not promoted when downsampled to very small support\. Promotion appears only once the effective support is large enough for the direction and recurrence checks to become reliable\. Thus, routing low\-support native rows to review\-needed is a deliberate fail\-closed behavior rather than a failed detection\.
Appendix [D.1](https://arxiv.org/html/2606.24996#A4.SS1) reports the negative-control audit: 400 randomized winner assignments create many apparent forecast/deployment winner inversions, but no run clears the gates.
Finally, the frozen NOAA check is retained as appendix confirmation: it supports the high-friction pattern but remains mixed at lower friction, so it is not promoted to primary evidence (Appendix [E.1](https://arxiv.org/html/2606.24996#A5.SS1)).
These diagnostics support the interpretation that non\-certification in the locked native audit is not merely a failure to find positive cases\. The same rules also refuse to promote known positive anchors when support is too small, and they refuse to promote randomized negative controls despite many apparent forecast/deployment winner inversions\.
### 2.5 Why the Certification Gates Are Not Arbitrary
The certification gates are not intended as necessary conditions for all possible deployment failures\. They are sufficient conditions for promoting a strong paper\-facing claim: a friction\-caused, non\-tie, statistically supported, and recurrent deployment\-side reversal\. This distinction is central to the fail\-closed design\. A row that fails a gate is not declared safe; it is routed to the first applicable diagnostic label\.
The gates are ordered to remove competing explanations for an apparent forecast/deployment winner mismatch. The zero-friction gate removes pure objective mismatch: if the forecast-side winner and deployed-side winner already disagree at $\kappa = 0$, then a positive-friction mismatch cannot be attributed to deployment friction alone. The positive-friction inversion gate then checks whether the forecast-side winner actually becomes deployed-suboptimal after the fixed interface and friction are applied. The tie-stability gate removes cases in which the reported winner identity is an artifact of an $\epsilon$-level ranking ambiguity. The conservative confidence-interval gate removes cases in which the paired deployed-utility shortfall is not statistically supported. Finally, the recurrence/support gate removes isolated seed, split, or grid artifacts that do not recur with sufficient pre-specified support.
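One way to picture the tie-stability gate is as a check on whether the forecast-side winner sits inside the set of models whose deployed utility is within $\epsilon$ of the best. This is a sketch under that assumption; the paper's actual $\epsilon$-tie audit is defined in its frozen configuration and may be stricter. Function names and numbers are hypothetical.

```python
from typing import Dict, Set


def epsilon_tie_set(utilities: Dict[str, float], eps: float) -> Set[str]:
    """Models whose deployed utility is within eps of the best (assumed tie rule)."""
    best = max(utilities.values())
    return {m for m, u in utilities.items() if best - u <= eps}


def inversion_is_tie_stable(m_f: str, utilities: Dict[str, float], eps: float) -> bool:
    """The inversion survives only if m_F is outside the eps-tie set of winners."""
    return m_f not in epsilon_tie_set(utilities, eps)


# Hypothetical deployed utilities at one positive friction level.
utilities = {"reactive_short": 8.0, "lagged_smoother": 8.1}
stable_small_eps = inversion_is_tie_stable("reactive_short", utilities, eps=0.05)
stable_large_eps = inversion_is_tie_stable("reactive_short", utilities, eps=0.2)
```

With a tolerance of 0.05 the 0.1 gap counts as a real reversal; with a tolerance of 0.2 the two models are tied and the row would be routed to a tie-sensitive label instead of being promoted.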
This rationale is decision\-theoretic rather than model\-training based\. Predict\-then\-optimize and decision\-focused learning show that predictive quality and downstream decision quality need not be aligned, but those methods typically modify the training objective or optimize through a downstream problem\. Our setting is post\-hoc: forecasters, the leaderboard score, the forecast\-to\-decision interface, the deployed utility, and the friction grid are fixed before certification\. The protocol therefore asks a narrower question: whether the top\-1 forecast\-side selection remains defensible as deployment\-facing advice under that fixed evaluation object\.
##### Proposition 1\. Sufficient conditions for certifying a deployment\-facing selection failure\.
Consider a fixed leaderboard cell with a fixed forecaster set, forecast-side score, forecast-to-decision interface, deployed utility, and friction level $\kappa$. Let $m_F$ denote the forecast-side winner and $m_D(\kappa)$ denote the deployed-side winner after applying the fixed interface. A certified deployment-facing selection failure is promoted only if the following conditions hold:
1. (i) $m_F = m_D(0)$, so the forecast-side winner is also deployment-optimal before friction is introduced;
2. (ii) $m_F \neq m_D(\kappa)$ for a positive friction level;
3. (iii) the reversal is not explained by an $\epsilon$-tie or unstable winner identity;
4. (iv) the paired deployed-utility shortfall from selecting $m_F$ instead of $m_D(\kappa)$ has a conservative confidence interval with positive lower bound; and
5. (v) the reversal recurs with sufficient support across the pre-specified replicate or split structure.
Under these conditions, the report card certifies a sufficient, not necessary, claim: the forecast\-side leaderboard winner is not supported as deployment\-actionable top\-1 advice for the specified interface and utility\. If any condition fails, the protocol does not infer deployment safety; it routes the row to the first applicable diagnostic label\.
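Condition (iv) can be sketched as a paired bootstrap over replicates. The percentile bootstrap, the resample count, and the 95% level below are assumptions made for illustration; the paper states only that the gate uses a conservative bootstrap confidence interval whose lower bound must be positive.

```python
import numpy as np


def paired_shortfall_ci(u_deployed_winner, u_forecast_winner,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired shortfall across replicates."""
    d = (np.asarray(u_deployed_winner, dtype=float)
         - np.asarray(u_forecast_winner, dtype=float))
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, d.size, size=(n_boot, d.size))  # resample replicates
    boot_means = d[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(d.mean()), float(lo), float(hi)


def ci_gate_passes(lower_bound: float) -> bool:
    """Fail-closed: anything other than a strictly positive bound fails."""
    return bool(lower_bound > 0)


# Hypothetical per-seed utilities: a consistent shortfall versus pure noise.
consistent = paired_shortfall_ci([1.0, 1.2, 0.8, 1.1, 0.9] * 4, [0.0] * 20)
noisy = paired_shortfall_ci([1.0, -1.0] * 10, [0.0] * 20)
```

A shortfall that is positive in every replicate clears the gate, while a shortfall that averages to zero does not, even though individual replicates show large apparent gaps.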
The locked native audit illustrates why the gates are needed\. Starting from apparent forecast/deployment winner inversions alone would produce many apparent deployment warnings\. After applying the same fail\-closed certification sequence, however, no apparent native forecast/deployment winner inversion is certified as deployment\-facing evidence\. Thus, the audit does not show that deployment risk is absent; it shows that winner mismatch alone is insufficient for a deployment\-facing selection\-failure claim\.
## 3 Discussion and Related Work
##### Implications for forecasting leaderboards\.
Predictive rank and deployment\-facing selection advice should be reported separately\. A leaderboard can correctly identify the best forecaster under a forecast metric while still leaving open whether that winner remains best after a fixed decision interface\. The report card adds an actionability layer to leaderboard cells likely to be used as deployment advice: it certifies only aligned zero\-friction behavior, stable positive\-friction inversion, uncertainty support, and sufficient recurrence, while objective\-mismatch warnings and review\-needed rows remain visible but not headline claims\.
##### Decision\-level forecast evaluation\.
Dynamic forecasting benchmarks such as ForecastBench and Prophet Arena evaluate forecasting capability on unresolved future questions and reduce contamination risk Karger et al. ([2025](https://arxiv.org/html/2606.24996#bib.bib1)); Yang et al. ([2025](https://arxiv.org/html/2606.24996#bib.bib2)); time-series archives, foundation models, and benchmarks, including the Monash archive, GIFT-Eval, decoder-only TSFMs, Chronos, Moirai, text-conditioned forecasting, and TSFM evaluation pipelines, broaden the forecasters that may appear on future leaderboards Godahewa et al. ([2021](https://arxiv.org/html/2606.24996#bib.bib4)); Aksu et al. ([2024](https://arxiv.org/html/2606.24996#bib.bib3)); Das et al. ([2024](https://arxiv.org/html/2606.24996#bib.bib5)); Ansari et al. ([2024](https://arxiv.org/html/2606.24996#bib.bib6)); Woo et al. ([2024](https://arxiv.org/html/2606.24996#bib.bib7)); Williams et al. ([2025](https://arxiv.org/html/2606.24996#bib.bib8)); Li et al. ([2025](https://arxiv.org/html/2606.24996#bib.bib9)); Goktas et al. ([2026](https://arxiv.org/html/2606.24996#bib.bib10)); Zhao et al. ([2025](https://arxiv.org/html/2606.24996#bib.bib11)). Proper scoring rules remain central for probabilistic forecast evaluation Gneiting and Raftery ([2007](https://arxiv.org/html/2606.24996#bib.bib17)); Murphy ([1993](https://arxiv.org/html/2606.24996#bib.bib18)). Recent decision-level forecast evaluation work argues that forecasts should be evaluated not only by statistical accuracy but also by their value for downstream decisions. Weather and air-quality forecasting studies show that forecast-level model rankings can differ from decision-level rankings when the evaluation object includes a concrete decision task Raeth and Ludwig ([2025](https://arxiv.org/html/2606.24996#bib.bib19)); Berlinghieri et al. ([2024](https://arxiv.org/html/2606.24996#bib.bib20)).
Our setting is complementary: rather than designing a new decision\-specific forecasting benchmark, we ask whether a given forecast\-side leaderboard winner can be certified as deployment\-actionable top\-1 advice under a fixed interface and deployed utility\.
##### Decision\-focused learning and predict\-then\-optimize\.
Predict-then-optimize and decision-focused learning show that minimizing prediction error alone may be misaligned with downstream decision quality Donti et al. ([2017](https://arxiv.org/html/2606.24996#bib.bib21)); Elmachtoub and Grigas ([2022](https://arxiv.org/html/2606.24996#bib.bib22)); Mandi et al. ([2024](https://arxiv.org/html/2606.24996#bib.bib23)). These methods typically modify training objectives or optimize through a downstream decision problem. Our protocol is post-hoc: it keeps the forecasters, forecast metric, fixed decision interface, deployed utility, and friction grid unchanged, and audits whether the top-1 forecast-side selection remains defensible as deployment-facing advice.
##### Benchmark uncertainty and rank instability\.
A separate line of benchmark work emphasizes that model rankings can be unstable under data sampling, initialization, hyperparameter choices, task aggregation, and statistical uncertainty Bouthillier et al. ([2021](https://arxiv.org/html/2606.24996#bib.bib12)); Longjohn et al. ([2025](https://arxiv.org/html/2606.24996#bib.bib13)); Shchur et al. ([2025](https://arxiv.org/html/2606.24996#bib.bib14)); Brigato et al. ([2025](https://arxiv.org/html/2606.24996#bib.bib15)); Neuhof and Benjamini ([2026](https://arxiv.org/html/2606.24996#bib.bib16)). This motivates the fail-closed use of tie audits, conservative confidence intervals, and recurrence/support requirements. In our setting, these checks are not cosmetic robustness tests; they determine whether an apparent forecast/deployment winner inversion is strong enough to be certified as a deployment-facing selection failure.
##### Positioning\.
Unlike decision\-focused learning and predict\-then\-optimize, we do not train through the downstream decision\. Unlike benchmark papers that primarily improve aggregate forecasting comparisons, we audit whether a particular forecast\-side winner should be interpreted as deployment\-facing top\-1 advice after a fixed interface\. The gate sequence is therefore a certification layer on top of an existing leaderboard cell, not a replacement for forecasting evaluation\.
##### Limitations and scope\.
This paper should be read as an initial certification\-protocol study, not as a comprehensive estimate of how often forecasting leaderboards fail as deployment advice\. The clean certified evidence is concentrated in Traffic\-Hourly\. Event\-micro provides caveated support, NOAA is retained as frozen appendix confirmation, and inventory is used only as a bounded operational check\. Keeping these roles separate is important: mixed or lower\-support cases are not upgraded into primary evidence\.
The locked native audit has a similarly limited role\. It is an overclaim\-prevention audit, not evidence that deployment risk is absent\. Its purpose is to show how apparent forecast/deployment winner inversions are blocked before certification when they are explained by objective mismatch, tie sensitivity, uncertainty limitation, or insufficient recurrence/support\.
Finally, the protocol assumes fixed forecast\-to\-decision interfaces and fixed deployed utilities\. It does not learn a new decision policy, propose a universal utility function, or replace forecast\-side metrics\. A natural next step is to evaluate the same fail\-closed certification layer across larger pre\-registered forecasting suites, adaptive decision interfaces, and online model\-switching settings\.
## References
- Karger et al. (2025) E. Karger, H. Bastani, C. Yueh-Han, Z. Jacobs, D. Halawi, F. Zhang, and P. E. Tetlock. ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities. In *ICLR*, 2025. arXiv:2409.19839.
- Yang et al. (2025) Q. Yang, S. Mahns, S. Li, A. Gu, J. Wu, and H. Xu. LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena. In *ICLR*, 2026. arXiv:2510.17638.
- Aksu et al. (2024) T. Aksu, G. Woo, J. Liu, X. Liu, C. Liu, S. Savarese, C. Xiong, and D. Sahoo. GIFT-Eval: A Benchmark for General Time Series Forecasting Model Evaluation. NeurIPS Workshop on Time Series in the Age of Large Models, 2024. Extended version: arXiv:2410.10393.
- Godahewa et al. (2021) R. Godahewa, C. Bergmeir, G. I. Webb, R. J. Hyndman, and P. Montero-Manso. Monash Time Series Forecasting Archive. In *NeurIPS Datasets and Benchmarks*, 2021. arXiv:2105.06643.
- Das et al. (2024) A. Das, W. Kong, R. Sen, and Y. Zhou. A decoder-only foundation model for time-series forecasting. In *ICML*, 2024. arXiv:2310.10688.
- Ansari et al. (2024) A. F. Ansari et al. Chronos: Learning the language of time series. arXiv:2403.07815, 2024.
- Woo et al. (2024) G. Woo et al. Unified training of universal time series forecasting transformers. arXiv:2402.02592, 2024.
- Williams et al. (2025) A. R. Williams, A. Ashok, E. Marcotte, V. Zantedeschi, J. Subramanian, R. Riachi, J. Requeima, A. Lacoste, I. Rish, N. Chapados, and A. Drouin. Context is Key: A Benchmark for Forecasting with Essential Textual Information. In *ICML*, PMLR 267, pages 66887–66944, 2025.
- Li et al. (2025) Z. Li, X. Qiu, P. Chen, Y. Wang, H. Cheng, Y. Shu, J. Hu, C. Guo, A. Zhou, C. S. Jensen, and B. Yang. TSFM-Bench: A Comprehensive and Unified Benchmark of Foundation Models for Time Series Forecasting. In *KDD*, pages 5595–5606, 2025. doi:10.1145/3711896.3737442; arXiv:2410.11802.
- Goktas et al. (2026) D. Goktas, G. Riano-Briceno, A. Abdullah, A. Nair, C. Shen, B. de Lucio, A. Magnusson, F. Mashrur, A. Abdulla, S. Sen, M. Thippireddy, G. Schwartz, and A. Greenwald. TempusBench: An Evaluation Framework for Time-Series Forecasting. arXiv:2604.11529, 2026.
- Zhao et al. (2025) Z. Zhao, J. Ni, S. Xu, H. Liu, W. Jin, and B. A. Prakash. TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness. arXiv:2506.06482, 2025.
- Bouthillier et al. (2021) X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Sepah, E. Raff, K. Madan, V. Voleti, S. E. Kahou, V. Michalski, T. Arbel, C. Pal, G. Varoquaux, and P. Vincent. Accounting for Variance in Machine Learning Benchmarks. In *MLSys*, 2021. arXiv:2103.03098.
- Longjohn et al. (2025) R. Longjohn, G. Gopalan, and E. Casleton. Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks. arXiv:2501.04234, 2025.
- Shchur et al. (2025) O. Shchur, A. F. Ansari, C. Turkmen, L. Stella, N. Erickson, P. Guerron, M. Bohlke-Schneider, and Y. Wang. fev-bench: A Realistic Benchmark for Time Series Forecasting. arXiv:2509.26468, 2025.
- Brigato et al. (2025) L. Brigato, R. Morand, K. J. Strommen, M. Panagiotou, M. Schmidt, and S. Mougiakakou. Position: There are no Champions in Long-Term Time Series Forecasting. arXiv:2502.14045, 2025.
- Neuhof and Benjamini (2026) B. Neuhof and Y. Benjamini. Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation. arXiv:2606.08679, 2026.
- Gneiting and Raftery (2007) T. Gneiting and A. E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. *Journal of the American Statistical Association*, 102(477):359–378, 2007.
- Murphy (1993) A. H. Murphy. What Is a Good Forecast? An Essay on the Nature of Goodness in Weather Forecasting. *Weather and Forecasting*, 8(2):281–293, 1993.
- Raeth and Ludwig (2025) K. Raeth and N. Ludwig. Evaluating Weather Forecasts from a Decision Maker’s Perspective. arXiv:2512.14779, 2025.
- Berlinghieri et al. (2024) R. Berlinghieri, D. R. Burt, P. Giani, A. M. Fiore, and T. Broderick. A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making. arXiv:2409.05866, 2024.
- Donti et al. (2017) P. L. Donti, B. Amos, and J. Zico Kolter. Task-based End-to-end Model Learning in Stochastic Optimization. In *NeurIPS*, 2017. arXiv:1703.04529.
- Elmachtoub and Grigas (2022) A. N. Elmachtoub and P. Grigas. Smart “Predict, then Optimize”. *Management Science*, 68(1):9–26, 2022. Earlier version: arXiv:1710.08005.
- Mandi et al. (2024) J. Mandi, J. Kotary, S. Berden, M. Mulamba, V. Bucarey, T. Guns, and F. Fioretto. Decision-Focused Learning: Foundations, State of the Art, Benchmark and Future Opportunities. *Journal of Artificial Intelligence Research*, 80:1623–1701, 2024. Earlier version: arXiv:2307.13565. doi:10.1613/jair.1.15320.
## Appendix A Report-Card Output and Gate Procedure
##### Appendix roadmap\.
Appendix A formalizes the report\-card labels and first\-failing\-gate rule used by the fail\-closed certification protocol\. Appendix B fixes the evidence roles used in the paper\. Appendix C reports positive\-anchor robustness, Appendix D reports overclaim\-prevention and false\-promotion controls, and Appendix E collects closing checks and reproducibility notes\. The appendix is therefore organized by evidential role rather than by experiment chronology\.
### A.1 Label Vocabulary
Table [A1](https://arxiv.org/html/2606.24996#A1.T1) gives the compact label vocabulary used by the report card. A certified deployment-facing selection failure requires zero-friction alignment, positive-friction winner inversion, stable tie behavior, conservative uncertainty support, and sufficient recurrence/support. Review-needed collects rows that show some warning signal but do not clear every certification gate. We use verified to mean that the cell passed the data-availability and fixed-interface replay checks required by the audit; it does not imply that the cell is certified as a selection failure.
Table A1: Report-card labels. The action column states how each outcome is used in the paper-facing evidence summary.
### A.2 Gate Rationale
The certification gates are used as sufficient evidential conditions, not as necessary conditions for all possible deployment failures\. Each gate removes a specific competing explanation for an apparent forecast/deployment winner mismatch\.
Table A2: Gate rationale for the fail-closed certification protocol.
This table also clarifies the interpretation of non-certification. A non-certified row is not evidence of deployment safety. It is evidence that the stronger claim of a friction-caused, non-tie, statistically supported, recurrent deployment-side reversal has not been established under the specified protocol.
### A\.3 First\-Failing\-Gate Audit Procedure
Procedure [1](https://arxiv.org/html/2606.24996#alg1) states the report\-card label assignment rule for one fixed\-interface cell\. This procedure is not intended as a learned algorithm\. It is a reproducible audit rule: the first failed evidential condition determines the non\-certified diagnostic label, and only rows passing all gates are promoted\.
**Procedure 1** First\-failing\-gate audit rule for certification\-protocol labeling

1. Identify the forecast\-side winner $m_F$ under the forecast metric\.
2. Identify the deployed\-side winner $m_D(\kappa)$ after applying the fixed interface\.
3. **if** data or replay checks fail **then**
4. **return** Excluded From Evidence
5. **end if**
6. **if** the zero\-friction row is missing or $m_F \neq m_D(0)$ beyond the pre\-specified tolerance **then**
7. **return** Objective\-Mismatch Warning
8. **end if**
9. **if** $m_F = m_D(\kappa)$ at the tested positive\-friction setting **then**
10. **return** No Detected Deployment\-Facing Selection Failure
11. **end if**
12. **if** winner identity or suboptimality changes under the $\epsilon$\-tie audit **then**
13. **return** Tie\-Sensitive / Review\-Needed
14. **end if**
15. **if** the paired deployed shortfall CI is not conservatively positive **then**
16. **return** Uncertainty\-Limited / Review\-Needed
17. **end if**
18. **if** recurrence or support thresholds are not met **then**
19. **return** Review\-Needed
20. **end if**
21. **return** Certified Deployment\-Facing Selection Failure
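The first\-failing\-gate rule of Procedure 1 can be sketched in a few lines of Python\. This is a minimal illustration, not the released implementation: the `CellEvidence` container, its field names, and the function `label_cell` are hypothetical\. Two simplifications are assumptions of the sketch: the zero\-friction gate compares winner identities directly instead of applying the pre\-specified tolerance, and "conservatively positive" is read as a strictly positive lower interval bound\.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CellEvidence:
    """Evidence for one fixed-interface cell (all field names are illustrative)."""
    checks_ok: bool                      # data-availability and replay checks passed
    forecast_winner: str                 # m_F
    deployed_winner_zero: Optional[str]  # m_D(0); None if the zero-friction row is missing
    deployed_winner_kappa: str           # m_D(kappa) at the tested positive friction
    tie_stable: bool                     # winner identity and suboptimality survive the epsilon-tie audit
    shortfall_ci_lower: float            # lower bound of the paired deployed-shortfall CI
    recurrence_ok: bool                  # recurrence and support thresholds met


def label_cell(e: CellEvidence) -> str:
    """Return the report-card label set by the first failing gate."""
    if not e.checks_ok:
        return "Excluded From Evidence"
    # Assumption: identity comparison stands in for the tolerance-based check.
    if e.deployed_winner_zero is None or e.forecast_winner != e.deployed_winner_zero:
        return "Objective-Mismatch Warning"
    if e.forecast_winner == e.deployed_winner_kappa:
        return "No Detected Deployment-Facing Selection Failure"
    if not e.tie_stable:
        return "Tie-Sensitive / Review-Needed"
    # Written as "not (x > 0)" so that a NaN bound also fails closed.
    if not e.shortfall_ci_lower > 0.0:
        return "Uncertainty-Limited / Review-Needed"
    if not e.recurrence_ok:
        return "Review-Needed"
    return "Certified Deployment-Facing Selection Failure"
```

Because each gate returns as soon as it fails, a cell receives exactly one label, and the certified label is reachable only when every earlier condition has passed\.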
## Appendix B Data, Interfaces, and Candidate Families
The certification protocol is applied only after forecasts, fixed interfaces, deployed utilities, and replay checks are materialized\. Table [A3](https://arxiv.org/html/2606.24996#A2.T3) summarizes the evidence roles used in this paper; it is a scope map, not an expansion of the experimental claims\.
Table A3: Data and interface scope\. Evidence roles are kept separate: Traffic\-Hourly is the primary anchor, Event\-micro is caveated support, NOAA is frozen appendix confirmation, and inventory is a bounded operational check\.

The native candidate audit uses naive\-last, moving\-average short/long, reactive\-short, ridge\-lag, and bridge aliases for the anchor families reactive\-sharp, calibrated baseline, and lagged smoother\. The fixed\-interface families are threshold alert, top\-$k$ budget, hysteresis threshold, and capacity allocation\. Alert\-style grids use $\kappa \in \{0, 0.05, 0.10, 0.25, 0.50, 1.00\}$; allocation grids use $\kappa \in \{0, 0.001, 0.005, 0.010, 0.050\}$\. The certification\-gate values and anchor robustness gates are defined in frozen configuration files\.
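As a rough illustration of how the friction grids above could be written down as a frozen configuration, the sketch below enumerates \(interface, $\kappa$\) pairs\. The dictionary layout and the names `INTERFACE_GRIDS` and `grid_cells` are hypothetical, and the assignment of the threshold, top\-$k$, and hysteresis families to the alert\-style grid is an assumption of this sketch; the paper's own configuration files may be organized differently\.

```python
# Friction grids as stated in the text.
ALERT_KAPPAS = (0.0, 0.05, 0.10, 0.25, 0.50, 1.00)
ALLOCATION_KAPPAS = (0.0, 0.001, 0.005, 0.010, 0.050)

# Assumption: which interface family uses which grid is inferred, not quoted.
INTERFACE_GRIDS = {
    "threshold_alert": ALERT_KAPPAS,
    "top_k_budget": ALERT_KAPPAS,
    "hysteresis_threshold": ALERT_KAPPAS,
    "capacity_allocation": ALLOCATION_KAPPAS,
}


def grid_cells(interface_grids):
    """Enumerate (interface, kappa) pairs for one candidate."""
    return [(name, kappa)
            for name, grid in interface_grids.items()
            for kappa in grid]


cells = grid_cells(INTERFACE_GRIDS)
# Every grid includes kappa = 0, which the zero-friction gate needs.
positive_friction_cells = [c for c in cells if c[1] > 0]
```

Keeping the $\kappa = 0$ row in every grid matters for the protocol, since a missing zero\-friction row is itself a gate failure\.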
## Appendix C Positive Anchor Robustness
### C\.1 Traffic\-Hourly Forecasting\-Native Checks
#### C\.1\.1 Fixed\-Budget Top\-$k$ Alert
Table A4: Traffic\-Hourly Top\-$k$ primary\-anchor check under a fixed\-budget interface\. Agreement is strongest at zero friction and weakens at higher friction, where the deployed winner shifts away from the forecast\-side winner\.

In the selected Top\-$k$ setting, Reactive short is the forecast\-side winner and the deployed winner at zero friction\. At frictions $0.5$ and $1.0$, the deployed winner shifts to Lagged smoother even though Reactive short remains Brier\-best and continues to win on the forecast side\. The qualitative reading is the same as in the main text: under a fixed interface, the more reactive forecast\-optimal family need not remain deployment\-optimal once switching is penalized\.
#### C\.1\.2 Strict Extended Interface Matrix
Table A5: Traffic\-Hourly extended interface matrix\. All positive\-friction rows retain the certification verdict across five Top\-$k$ budgets and two relative\-rank variants under unchanged gates\.

Table [A5](https://arxiv.org/html/2606.24996#A3.T5) shows that the Traffic\-Hourly certification pattern is stable across the tested Top\-$k$ budgets and relative\-rank variants, supporting its role as the primary mechanistic anchor rather than a single selected cell\.
#### C\.1\.3 Relative\-Rank Traffic Variant
Table A6: Traffic\-Hourly relative\-rank appendix variant for the primary anchor\. The same qualitative separation between forecast\-side and deployed\-side selection appears under a relative\-rank fixed\-interface variant\.

As an additional Traffic variant, the relative\-rank target shows the same strict certification pattern under a different forecasting\-native label construction\. Reactive short remains the forecast\-side winner, while Calibrated baseline becomes the deployed winner once switching costs are introduced\. The compact table shows $m=0.10$; Table [A5](https://arxiv.org/html/2606.24996#A3.T5) confirms that the alternate $m=0.15$ variant also passes at both positive\-friction levels\. This breadth check stays within the existing clean Traffic\-Hourly anchor rather than introducing a new domain claim\.
### C\.2 Event\-Micro Robustness
Event\-micro is retained as caveated supporting evidence\. We construct a minimal forecasting\-native binary\-event setting evaluated over 100 seeds, in which each forecaster emits event probabilities, a fixed thresholding rule maps probabilities to actions, and switching friction penalizes action changes\. The main text uses Brier as the canonical forecast\-side criterion\.

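| $\kappa$ | Forecast | Deployed | Agree\. | Shortfall |
| --- | --- | --- | --- | --- |
| 0\.00 | Sharp | Sharp | 0\.62 | 0\.003 |
| 0\.05 | Sharp | Sharp | 0\.60 | 0\.002 |
| 0\.10 | Sharp | Sharp | 0\.58 | 0\.002 |
| 0\.25 | Sharp | Calibrated | 0\.51 | 0\.003 |
| 0\.50 | Sharp | Calibrated | 0\.31 | 0\.011 |
| 1\.00 | Sharp | Smoother | 0\.01 | 0\.057 |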
Figure A1: Event\-micro caveated\-support summary\. Positive\-friction rows show stable winner drift, but the weaker zero\-friction behavior keeps Event\-micro in a supporting rather than primary\-anchor role\.

Log\-loss reranking, alternate\-threshold, and hysteresis\-interface checks preserve the same qualitative winner\-drift pattern, with smaller magnitudes under hysteresis\. The compact interval table below records the interface\-aware deployed\-shortfall evidence without expanding the appendix into a full experiment log\.
Table A7: Interface\-aware bootstrap intervals for the Event\-micro caveated\-support deployed shortfall\. Positive intervals remain visible under both interfaces, with smaller shortfalls under hysteresis\.
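The Event\-micro mechanism \(probabilities, a fixed threshold, and a switching penalty\) can be illustrated with a small deterministic toy\. Everything in this sketch is an assumption made for illustration: the eight\-step event sequence, the two forecasters, the threshold `tau = 0.5`, and in particular the utility, which counts correct actions and subtracts $\kappa$ per action change\. It is not the paper's utility definition or data\.

```python
def brier(probs, events):
    """Mean squared error between event probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, events)) / len(events)


def threshold_actions(probs, tau=0.5):
    """Fixed interface: act (1) whenever the probability reaches tau."""
    return [1 if p >= tau else 0 for p in probs]


def deployed_utility(probs, events, kappa, tau=0.5):
    """Toy utility (assumed): hits minus kappa per action switch, per step."""
    actions = threshold_actions(probs, tau)
    hits = sum(a == y for a, y in zip(actions, events))
    switches = sum(a != b for a, b in zip(actions, actions[1:]))
    return (hits - kappa * switches) / len(events)


# Hypothetical data: a sharp forecaster that tracks every event and a
# smoother that emits a constant probability.
events = [1, 0, 1, 0, 1, 1, 0, 1]
forecasters = {
    "sharp": [0.9, 0.1, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9],
    "smoother": [0.6] * 8,
}

forecast_winner = min(forecasters, key=lambda m: brier(forecasters[m], events))
deployed_winner = {
    kappa: max(forecasters,
               key=lambda m: deployed_utility(forecasters[m], events, kappa))
    for kappa in (0.0, 1.0)
}
```

In this toy the sharp forecaster is Brier\-best and also wins at $\kappa = 0$, but its six action switches cost more than its extra hits at $\kappa = 1$, so the constant smoother becomes the deployed winner\. That is the zero\-friction agreement followed by positive\-friction inversion that the gates look for\.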
### C\.3 Certification Power, Rolling Splits, and Tie Sensitivity
These checks reuse the fixed anchor outputs and apply split\-proxy gating over seed/replicate partitions, rather than expanding the broader real\-data screen\. Traffic\-Hourly Top\-$k$ passes the zero\-friction, positive\-recurrence, bootstrap, and tie gates in all three split proxies\. Event\-micro remains supportive but carries the caveat stated in the main text: two split proxies pass the zero\-friction gate, while one split is labeled objective\-mismatch at zero friction and is therefore not promoted\.
Table A8: Downsampling diagnostic for certified anchor rows\. Promotion appears only once support reaches roughly $n=30$, illustrating why low\-support native rows are routed to review\-needed\.

Table A9: Anchor\-first rolling\-split gate summary\. Traffic\-Hourly Top\-$k$ passes all split\-proxy gates, while Event\-micro remains caveated because one split does not clear the zero\-friction gate\.

The uncertainty and $\epsilon$\-tie checks are used only to prevent promotion without sufficient evidence\. Rows that lack conservative interval support or have unstable winner identity are routed to review\-needed rather than counted as certified selection failures\. The tested relative tie thresholds, $\epsilon_{\mathrm{rel}} \in \{0, 0.001, 0.005\}$, do not change the deployed\-suboptimal conclusion or winner identity for the primary Traffic\-Hourly and Event\-micro positive\-friction rows\. Inventory tie checks remain part of the bounded inventory check and are not used to upgrade inventory to primary evidence\.
Takeaway: the Traffic\-Hourly anchor remains stable across the tested interface variants, while Event\-micro remains supportive but caveated\.
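The uncertainty and tie gates discussed in this subsection can be sketched as two small helpers\. The sketch is hedged in several ways: the percentile bootstrap is a generic stand\-in for the paper's conservative interval, the relative\-tie rule \(difference at most $\epsilon_{\mathrm{rel}}$ times the larger utility magnitude\) is an assumed definition, and the function names are hypothetical\. Only the grid $\{0, 0.001, 0.005\}$ is taken from the text\.

```python
import random
from statistics import mean


def paired_bootstrap_lower(shortfalls, n_boot=2000, alpha=0.05, seed=0):
    """Lower percentile bound for the mean paired deployed shortfall.

    `shortfalls` holds per-replicate differences (deployed winner's utility
    minus the forecast winner's utility). Assumption: a plain percentile
    bootstrap, used here only as an illustration.
    """
    rng = random.Random(seed)
    n = len(shortfalls)
    boot_means = sorted(mean(rng.choices(shortfalls, k=n)) for _ in range(n_boot))
    return boot_means[int((alpha / 2) * n_boot)]


def is_relative_tie(u_a, u_b, eps_rel):
    """Assumed tie rule: utilities within eps_rel of the larger magnitude."""
    return abs(u_a - u_b) <= eps_rel * max(abs(u_a), abs(u_b))


def tie_stable(u_deployed_best, u_forecast_winner, eps_grid=(0.0, 0.001, 0.005)):
    """True if the shortfall is not a tie at any tested relative threshold."""
    return not any(is_relative_tie(u_deployed_best, u_forecast_winner, eps)
                   for eps in eps_grid)
```

A row would pass both gates in this sketch only if `paired_bootstrap_lower(...) > 0` and `tie_stable(...)` is true; failing either one routes the row to review\-needed rather than to certification\.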
## Appendix D Native Overclaim\-Prevention Controls
The locked native audit contains 22 verified candidates and 362 full\-grid cells across bike share, electricity load, EPA air quality, retail demand, NYC taxi demand, weather alerts, and web traffic\. This scope is used only for overclaim prevention: the audit asks whether apparent forecast/deployment winner mismatches survive the same certification gates required of the positive anchors\.
Table [2](https://arxiv.org/html/2606.24996#S2.T2) in the main text can be read as an empirical gate\-ablation ledger\. Apparent forecast/deployment winner inversion alone would yield 155 warnings, but each gate removes a distinct competing explanation: objective mismatch, tie instability, uncertainty limitation, or insufficient recurrence/support\. The final zero\-promotion outcome is therefore not a null result; it is evidence that the protocol does not promote winner\-identity mismatch into a certified deployment\-facing claim without sufficient support\.
### D\.1 Negative\-Control False\-Promotion Audit
As a false\-promotion check, we randomize winner identity in two ways: shuffling forecast winners and permuting deployed winners\. These controls deliberately create many apparent forecast/deployment winner inversions while preserving the same gate logic used by the certification protocol\. The purpose is to test whether apparent forecast/deployment winner mismatch alone can pass the certification gates; in 400 randomized runs, no run produces a gate\-passing promotion\.
Table A10: Negative\-control audit\. Randomized winners create many apparent forecast/deployment winner inversions; none clears all gates\.

Takeaway: many apparent native forecast/deployment winner inversions are visible, but none survives the full certification sequence\.
## Appendix E Appendix\-Only Checks and Reproducibility
### E\.1 Frozen NOAA Check
The frozen NOAA check is retained as appendix confirmation\. It supports the high\-friction pattern but remains mixed at lower friction, so it is not promoted to primary evidence\.
Table A11: Frozen NOAA visibility check\. High friction confirms in two of three splits; lower friction remains mixed\.
### E\.2 Inventory Corroboration
Inventory is retained as a bounded operational check: the replenishment interface partly buffers forecast\-side differences at zero and low friction, while higher friction favors smoother order trajectories, so the mixed pattern is not promoted to primary\-anchor evidence\.
Table A12: Inventory check\. Mixed zero/low\-friction behavior keeps inventory in the bounded\-corroboration role\.