Verifiable Rewards for Calibrated Probabilistic Forecasting

Summary

The paper proposes a verifiable label-free reward for training calibrated probabilistic forecasters using reinforcement learning, avoiding the calibration degradation that occurs when rewarding single outcomes. Applied to NFL win probability, a 7B model trained with this reward achieves calibration comparable to the betting market.

arXiv:2607.00164v1 (announce type: new)

Cached at: 07/02/26, 05:36 AM

# Verifiable Rewards for Calibrated Probabilistic Forecasting
Source: [https://arxiv.org/html/2607.00164](https://arxiv.org/html/2607.00164)
###### Abstract

Reinforcement learning with verifiable rewards can in principle train calibrated probabilistic forecasters, since a proper scoring rule such as the Brier score is computed from outcomes alone and is minimized in expectation by the true probability. In practice it degrades calibration, and existing remedies address epistemic uncertainty, where a model’s confidence accompanies a verifiably correct or incorrect answer. We study aleatoric forecasting, where the forecast itself is the output and the label is one stochastic outcome, taking NFL in-game win probability as a testbed with the betting market as a reference. Rewarding the realized per-play outcome fails, because the single outcome is a noisy target and the policy gradient corrupts the chain of thought. We introduce a verifiable, label-free reward, a state-conditioned empirical win rate estimated from past outcomes, that removes the label noise, and we keep the gradient off the reasoning, by direct prediction or a gradient mask, so it cannot be corrupted. Trained with this reward alone, without human labels or supervised fine-tuning, a 7B model reaches the calibration of the betting market by direct prediction and is better calibrated than a zero-shot frontier model. That frontier model and a tabular estimator reach the same Brier score as this model, identifying the market’s small remaining edge as live in-game information beyond their shared inputs. Masking the gradient, rather than dropping the chain of thought, preserves reasoning from which the forecast follows, which ordinary chain-of-thought training corrupts.

## 1 Introduction

[Figure 1 diagram. Input: game state $x$ (score, time, spread). Naive: reward the outcome and train the whole completion; the chain of thought ends in P = 88%, scored against a single outcome $y \in \{0,1\}$ (one noisy Bernoulli draw) with $r = 1-(p-y)^2$; the policy gradient covers the whole completion, the reasoning is corrupted, and the forecast is overconfident. Ours: reward the rate and train the answer span; the chain of thought ends in P = 61%, scored against the empirical rate $\hat{p}(x)$ (win rate over similar states) with $r = 1-(p-\hat{p})^2$; the gradient falls on the answer span only, the reasoning is preserved, and the forecast is calibrated to the market.]

Figure 1: The same game state, two ways to train a 7B model against outcomes. Rewarding the single realized outcome and updating the whole completion drives the chain of thought to extreme, overconfident numbers. Rewarding a state-conditioned empirical win rate $\hat{p}(x)$ and confining the gradient to the answer span leaves the reasoning intact and calibrates the forecaster to the betting market, using no human labels and no supervised fine-tuning.

Reinforcement learning with verifiable rewards (RLVR) post-trains a language model against a reward computed from observed outcomes [[1](https://arxiv.org/html/2607.00164#bib.bib1), [2](https://arxiv.org/html/2607.00164#bib.bib2)]. Calibrated probabilistic prediction is a natural target for it: a proper scoring rule such as the Brier score is computable from outcomes alone and is minimized in expectation by the true probability [[3](https://arxiv.org/html/2607.00164#bib.bib3), [4](https://arxiv.org/html/2607.00164#bib.bib4)], so optimizing it should produce forecasts whose probabilities match observed frequencies.

In practice, reinforcement learning does the opposite, degrading calibration and leaving models overconfident, whether the reward is human feedback [[5](https://arxiv.org/html/2607.00164#bib.bib5)] or a verifiable correctness signal [[6](https://arxiv.org/html/2607.00164#bib.bib6), [7](https://arxiv.org/html/2607.00164#bib.bib7)]. The work that corrects this addresses epistemic uncertainty, where the model answers a question that has a correct answer and the goal is a calibrated confidence that the answer is right [[8](https://arxiv.org/html/2607.00164#bib.bib8), [6](https://arxiv.org/html/2607.00164#bib.bib6)].

A second kind of uncertainty is left untreated. It is aleatoric: the output is itself a probability, and the label is a single realized outcome of a stochastic event, with no answer that can be called correct. NFL in-game win probability is a clear instance. At any point in a game the forecaster states the probability that the team in possession wins; the realized result is one Bernoulli draw from a rate that no model observes directly; and the betting market provides a strong, independent estimate of that rate. Whether RLVR can train a calibrated forecaster in this regime, and what makes it fail, has not been studied.

Reinforcement learning against the per-play Brier score decalibrates such a forecaster through two mechanisms. The label for a play is a single realized outcome, one draw from the rate being estimated, so a reward scored against it has high variance and pulls the policy toward whichever result occurred. When the model reasons in language before answering, a second problem appears: optimizing the final probability rewrites the reasoning into incoherent arguments. We remove the variance by rewarding a state-conditioned empirical win rate estimated from past outcomes, a verifiable and label-free target, and we remove the corruption by keeping the policy gradient off the reasoning, through direct prediction or a gradient mask over the answer tokens ([Fig. 1](https://arxiv.org/html/2607.00164#S1.F1)). We train Qwen2.5-7B-Instruct without supervised fine-tuning and evaluate on held-out seasons against the betting market, a frontier model, and the empirical-rate table.

Trained by direct prediction, the model matches the market’s calibration on held-out games, at an expected calibration error (ECE) of 0.029 against the market’s 0.027, and it is better calibrated than a zero-shot frontier model, with no human labels and no access to the market. On the Brier score it matches that frontier model and a tabular estimator, the limit that the public game state allows; the market’s small remaining edge is the live in-game information none of them can see. Masking the gradient instead of discarding the reasoning keeps the chain of thought faithful, lowering the rate at which the stated probability does not follow from it from 22.4% to 4.4%, at a small cost in sharpness.

#### Contributions.

1. Our reward, a state-conditioned empirical win rate estimated from realized outcomes, is verifiable and label-free, and replaces the single realized outcome, whose variance decalibrates per-play Brier training; on held-out games it scores within 0.007 Brier of the betting market ([Section 4.1](https://arxiv.org/html/2607.00164#S4.SS1)).
2. Reinforcement learning decalibrates the forecaster through the gradient on the chain of thought, not through the reward: with the rate target, training the full completion drives held-out ECE from 0.19 to 0.30, whereas confining the gradient to the answer, by direct prediction or an answer-span mask, holds it near the market’s ([Section 4.2](https://arxiv.org/html/2607.00164#S4.SS2)).
3. With the chain of thought dropped, the model reaches the information ceiling of the public game state: it matches the market’s calibration (0.029 against 0.027 ECE), is better calibrated than a zero-shot frontier model, and reaches the same Brier score as that frontier model and a tabular rate estimator, leaving the market’s edge as live in-game information ([Section 6](https://arxiv.org/html/2607.00164#S6)).

## 2 Related work

Reinforcement learning improves accuracy but degrades calibration, leaving models overconfident. Leng et al. [[5](https://arxiv.org/html/2607.00164#bib.bib5)] attribute this, under human feedback, to a reward model that prefers confident answers. Ma et al. [[6](https://arxiv.org/html/2607.00164#bib.bib6)] identify a conflict between the accuracy and calibration gradients under verifiable rewards, and decouple the two with a masked-gradient update. Bereket and Leskovec [[7](https://arxiv.org/html/2607.00164#bib.bib7)] attribute the overconfidence on binary stochastic outcomes to the standard-deviation normalization in the group-relative advantage, which they remove. These analyses take the realized outcome as the target. With a denoised conditional-rate target we find instead that the normalization is necessary, and that the overconfidence is driven by the gradient on the reasoning rather than by the advantage.

A second class of methods rewards a proper scoring rule on a verbalized confidence. Damani et al. [[8](https://arxiv.org/html/2607.00164#bib.bib8)] add a Brier-score term to a correctness reward, Bani-Harouni et al. [[9](https://arxiv.org/html/2607.00164#bib.bib9)] use the logarithmic score, and Band et al. [[10](https://arxiv.org/html/2607.00164#bib.bib10)] reward a downstream reader’s accuracy. These score a confidence in an answer that is right or wrong. We score the forecast itself against a state-conditioned empirical rate, where the label is a single stochastic outcome and correctness is undefined.

Language models are also trained and evaluated as event forecasters. Halawi et al. [[11](https://arxiv.org/html/2607.00164#bib.bib11)] approach human-crowd accuracy through retrieval and aggregation, and Pratt et al. [[12](https://arxiv.org/html/2607.00164#bib.bib12)] find that prompting strategies do not yield calibrated forecasts. Turtel et al. [[13](https://arxiv.org/html/2607.00164#bib.bib13)] reward realized outcomes on prediction-market and news questions, and Turtel et al. [[14](https://arxiv.org/html/2607.00164#bib.bib14)] rank self-generated forecasts against outcomes. Both improve calibration while keeping the chain of thought and rewarding the outcome, whereas we reward the conditional rate and mask the gradient off the reasoning. Paleka et al. [[15](https://arxiv.org/html/2607.00164#bib.bib15)] and Karger et al. [[16](https://arxiv.org/html/2607.00164#bib.bib16)] document temporal leakage in such evaluations, which our season-disjoint test on resolved games avoids.

Our training is a choice of credit assignment within group-relative optimization [[1](https://arxiv.org/html/2607.00164#bib.bib1)]. DAPO [[17](https://arxiv.org/html/2607.00164#bib.bib17)] and GPG [[18](https://arxiv.org/html/2607.00164#bib.bib18)] remove the KL penalty and reference model, as we do. Zhang et al. [[19](https://arxiv.org/html/2607.00164#bib.bib19)] analyze the instability of the KL estimator when the gradient concentrates on a few tokens. Wang et al. [[20](https://arxiv.org/html/2607.00164#bib.bib20)] restrict updates to high-entropy tokens, and Wang et al. [[21](https://arxiv.org/html/2607.00164#bib.bib21)] and Tan et al. [[22](https://arxiv.org/html/2607.00164#bib.bib22)] treat reasoning and answer tokens differently. Our answer-span mask restricts the gradient by token role.

Win-probability estimation for the NFL ranges from random forests [[23](https://arxiv.org/html/2607.00164#bib.bib23)] to the boosted models of the nflverse pipeline [[24](https://arxiv.org/html/2607.00164#bib.bib24), [25](https://arxiv.org/html/2607.00164#bib.bib25)], which we use as static baselines. Brill et al. [[26](https://arxiv.org/html/2607.00164#bib.bib26)] show that play-by-play data bounds the accuracy of any such model, consistent with our reading of the market’s edge as structural. Polson and Stern [[27](https://arxiv.org/html/2607.00164#bib.bib27)] treat the point spread as an implied volatility. The betting market is known to predict outcomes more accurately than statistical models or individual bettors [[28](https://arxiv.org/html/2607.00164#bib.bib28), [29](https://arxiv.org/html/2607.00164#bib.bib29), [30](https://arxiv.org/html/2607.00164#bib.bib30), [31](https://arxiv.org/html/2607.00164#bib.bib31)]. We convert market odds to probabilities following Štrumbelj [[32](https://arxiv.org/html/2607.00164#bib.bib32)] and use them only as a reference.

## 3 Setting and background

We forecast in-game win probability in American football. The state at a given play comprises the score margin, the quarter and time remaining, down and distance, field position, the team in possession, and the public pregame point spread. The forecaster returns a single probability that the team in possession wins. We draw states from regular-season National Football League games and partition them by season, training on 2015 through 2022, selecting on 2023, and testing on 2024, so that no game appears in more than one split. The only market signal in the prompt is the pregame point spread, which is fixed before kickoff and public; the live win probability the market quotes during the game is withheld from both the model and the reward, and enters only at evaluation.

The output is a probability, and the label is a single realized outcome of a stochastic event. This separates the task from the calibration problems usually studied in reinforcement learning, where the model answers a question that has a correct answer and the quantity of interest is its confidence that the answer is right [[8](https://arxiv.org/html/2607.00164#bib.bib8), [6](https://arxiv.org/html/2607.00164#bib.bib6)]. That uncertainty is epistemic, and a stronger model can in principle reduce it. The uncertainty here is aleatoric: the outcome is random given the state, and no forecaster can remove it. The target of calibration is therefore the conditional win rate $\eta(x) = \Pr(\text{win} \mid x)$, not the correctness of an answer.

We measure a forecaster by how closely its probabilities match observed frequencies. The Brier score, the mean squared error $(p-y)^2$ between a forecast $p$ and an outcome $y \in \{0,1\}$ [[3](https://arxiv.org/html/2607.00164#bib.bib3)], is strictly proper: among all functions of the state, its expectation is minimized uniquely by $\eta(x)$ [[4](https://arxiv.org/html/2607.00164#bib.bib4)]. It decomposes into reliability, resolution, and uncertainty [[33](https://arxiv.org/html/2607.00164#bib.bib33)]. Reliability is calibration, the agreement between stated probabilities and realized frequencies; resolution is sharpness, the spread of forecasts across states of differing outcome rate; uncertainty is fixed by the base rate. Because two forecasters can be equally calibrated yet differ in Brier score through resolution alone, we report calibration on its own, through the expected and maximum calibration error and reliability diagrams [[34](https://arxiv.org/html/2607.00164#bib.bib34), [35](https://arxiv.org/html/2607.00164#bib.bib35), [36](https://arxiv.org/html/2607.00164#bib.bib36)]. We compare forecasters with a paired bootstrap over plays [[37](https://arxiv.org/html/2607.00164#bib.bib37)]. Calibration error also tests whether training found $\eta$ rather than memorizing the training outcomes, since a model that memorized them would be miscalibrated on held-out games.
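As a minimal sketch of these metrics, the following computes the Brier score and the expected and maximum calibration error. The function names are illustrative, and the default of ten equal-width bins follows the experimental setup described later in the paper; the authors' own implementation is not shown in the source.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecasts p and binary outcomes y."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_errors(probs, outcomes, n_bins=10):
    """Expected (ECE) and maximum (MCE) calibration error over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        win_rate = sum(y for _, y in members) / len(members)
        gap = abs(mean_p - win_rate)
        ece += len(members) / len(probs) * gap  # weight by bin occupancy
        mce = max(mce, gap)
    return ece, mce
```

ECE weights each bin's gap by the share of plays it holds, whereas MCE reports the single worst bin, so a forecaster can score well on one and poorly on the other.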

We take the betting market as the ceiling: it quotes a live win probability for each state and predicts outcomes more accurately than statistical models built from game features [[28](https://arxiv.org/html/2607.00164#bib.bib28), [29](https://arxiv.org/html/2607.00164#bib.bib29)]. We convert its odds to a probability [[32](https://arxiv.org/html/2607.00164#bib.bib32)]. The gap between a forecaster and the market then measures the live in-game information that the public state lacks.
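To illustrate what converting odds to a probability involves, here is a minimal sketch using proportional normalization of decimal odds. This is an assumption chosen for simplicity: the source only cites [32] for its conversion and does not spell out the procedure, which may differ from the one below. The function name is hypothetical.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for mutually exclusive outcomes to probabilities.

    Assumption: proportional normalization, which spreads the bookmaker's
    margin evenly across outcomes. The conversion cited in the paper may
    use a different method.
    """
    raw = [1.0 / odds for odds in decimal_odds]  # inverse odds sum to more than 1
    overround = sum(raw)
    return [r / overround for r in raw]


# Two-way market: the favorite at 1.80, the underdog at 2.10.
favorite, underdog = implied_probabilities([1.80, 2.10])
```

Raw inverse odds sum to more than one because of the bookmaker's margin, so some normalization step is needed before the result can be compared with a forecast.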

## 4 Method

We post-train the forecaster with group-relative reinforcement learning, computing the reward from a state-conditioned estimate of the win rate rather than from the realized outcome, and restricting the policy gradient to the tokens that carry the answer rather than to the whole completion.

### 4.1 Conditional-rate reward

The ideal reward would score each forecast against the true rate $\eta(x)$, but $\eta(x)$ is never observed. The only label for a play is its realized outcome $y$, a single draw from that rate. A reward built on the draw, $r = 1-(p-y)^2$, is unbiased but noisy, since two identical favorites, one that wins and one that loses, produce equal and opposite gradients. The noise does not average out with more plays at a fixed state, because it is intrinsic to a binary label, and in the small groups that GRPO compares [[1](https://arxiv.org/html/2607.00164#bib.bib1)] it overwhelms the per-state signal.
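The variance argument can be made explicit with a short calculation. This is a standard identity for the Brier score, written here as a sketch under the assumption that $y \sim \mathrm{Bernoulli}(\eta)$ with $\eta = \eta(x)$ and that the forecast $p$ is held fixed; it does not appear in the source.

```latex
\begin{align*}
\mathbb{E}_y\bigl[(p-y)^2\bigr] &= (p-\eta)^2 + \eta(1-\eta), \\
\frac{\partial}{\partial p}\bigl[1-(p-y)^2\bigr] &= -2\,(p-y), \\
\mathbb{E}_y\bigl[-2\,(p-y)\bigr] &= -2\,(p-\eta), \qquad
\operatorname{Var}_y\bigl[-2\,(p-y)\bigr] = 4\,\eta(1-\eta).
\end{align*}
```

The expected gradient points toward $\eta$, which is why the outcome reward is unbiased, but a single draw carries variance $4\,\eta(1-\eta)$, largest for close games. Scoring against a fixed estimate of the rate removes that term for a given forecast, at the cost of whatever error the estimate itself has.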

We score each forecast against an estimate of the rate instead. We bin the training plays by score margin, time remaining, and pregame point spread, and set each bin to the fraction of its plays the possession team won. Bins with few plays are shrunk toward the overall rate by an empirical-Bayes prior, so a sparse bin is not governed by its few plays [[38](https://arxiv.org/html/2607.00164#bib.bib38)]. The estimate $\hat{p}(x)$ uses only realized outcomes and the public pregame line, never the live market probability, which we hold out for evaluation. The reward is

$$r \;=\; 1-\bigl(p-\hat{p}(x)\bigr)^{2}. \tag{1}$$

The target uses no human labels. It comes from the same outcomes a binary reward would use, averaged over states to yield a low-variance estimate of $\eta(x)$. On held-out 2024 games it scores a Brier of 0.143, close to the market’s 0.136, and it supplies a dense reward at every state. We call $\hat{p}$ the teacher, because a model trained on [Eq. 1](https://arxiv.org/html/2607.00164#S4.E1) can be no better calibrated than $\hat{p}$ itself.
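The binned estimate and the reward of Eq. 1 can be sketched as follows. The bin widths, the pseudo-count `PRIOR_STRENGTH`, and all function names are hypothetical, since the source does not state them; a fixed pseudo-count stands in here for the empirical-Bayes prior, which in a full treatment would itself be estimated from the data.

```python
from collections import defaultdict

PRIOR_STRENGTH = 20.0  # hypothetical pseudo-count; not specified in the source


def bin_key(margin, seconds_left, spread):
    """Hypothetical bins: 3-point margin, 5-minute clock, 3-point spread."""
    return (int(margin // 3), int(seconds_left // 300), int(spread // 3))


def fit_rate_table(plays):
    """plays: iterable of (margin, seconds_left, spread, won) from training seasons."""
    wins, counts = defaultdict(float), defaultdict(int)
    total_wins, total = 0.0, 0
    for margin, seconds_left, spread, won in plays:
        key = bin_key(margin, seconds_left, spread)
        wins[key] += won
        counts[key] += 1
        total_wins += won
        total += 1
    base_rate = total_wins / total
    # Shrink each bin toward the overall rate; sparse bins move the most.
    table = {
        key: (wins[key] + PRIOR_STRENGTH * base_rate) / (counts[key] + PRIOR_STRENGTH)
        for key in counts
    }
    return table, base_rate


def rate_reward(p, state, table, base_rate):
    """Eq. 1: r = 1 - (p - p_hat(x))^2. Unseen bins fall back to the base rate."""
    p_hat = table.get(bin_key(*state), base_rate)
    return 1.0 - (p - p_hat) ** 2
```

Because the table is built only from training-season outcomes and the pregame spread, the reward can be computed for any state without consulting the realized result of that play or the live market.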

An ablation under identical training isolates the target as the cause ([Table 1](https://arxiv.org/html/2607.00164#S4.T1)). Rewarding the realized outcome reaches a Brier of 0.166, but its calibration error drifts to 0.10 as the policy fits individual outcomes. An equal blend of outcome and rate is worse on both, 0.181 and 0.121, because it reintroduces the outcome’s noise. The rate alone is best on both, at 0.154 and 0.050.

Table 1: Reward-target ablation under identical training (direct prediction, [Section 4.2](https://arxiv.org/html/2607.00164#S4.SS2); in-training held-out, $n = 128$). The realized-outcome reward is higher-variance and its calibration drifts; blending the outcome back in is worse on both axes; the conditional rate $\hat{p}$ is the most accurate and the most stable.

### 4.2 Decoupling reasoning from the gradient

The rate target alone is not enough, because a chain of thought before the answer lets the same reward make calibration worse. Applied to the full completion, the reward of [Eq. 1](https://arxiv.org/html/2607.00164#S4.E1) drives the held-out Brier score from the base model’s 0.25 to 0.34 and the calibration error from 0.19 to 0.30 ([Fig. 2](https://arxiv.org/html/2607.00164#S4.F2)). The policy gradient passes through the reasoning tokens, so raising the reward on the final probability rewrites the reasoning before it into pseudo-quantitative arguments for ever more extreme numbers. This is the accuracy-calibration gradient conflict reported for verifiable-reward training [[6](https://arxiv.org/html/2607.00164#bib.bib6)].

Two changes keep the reward and stop the gradient from reaching the reasoning. The first, the direct model, drops the chain of thought and emits the probability directly. With no reasoning tokens to corrupt, the reward calibrates the model, and the held-out Brier score reaches the teacher’s level within fifty steps and stays there. The second, the masked model, keeps the chain of thought but masks the gradient. We narrow the completion mask, which selects the tokens that receive advantage, to the final “Probability: NN%” span, so the reasoning is still sampled and still conditions the answer but gets no gradient. Masking the gradient needs the KL penalty turned off, because on the few tokens of the answer span the gradient is concentrated enough to make the penalty’s estimator overflow and the loss diverge, and the masked reasoning no longer needs the reference-model anchor. With the mask in place, the masked model calibrates as well as the direct model.
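The answer-span mask can be illustrated with a small sketch. It works on a list of token strings rather than a real tokenizer, and both function names are hypothetical; it is not the authors' implementation and not an API of any training library. It shows only the idea that reasoning tokens are sampled but contribute nothing to the loss.

```python
import re

# The answer format quoted in the text: "Probability: NN%".
ANSWER_RE = re.compile(r"Probability:\s*\d{1,3}%")


def answer_span_mask(tokens):
    """Return a 0/1 mask selecting tokens that overlap the final answer span."""
    text = "".join(tokens)
    matches = list(ANSWER_RE.finditer(text))
    if not matches:
        return [0] * len(tokens)  # no parsable answer: no token receives gradient
    start, end = matches[-1].span()

    mask, pos = [], 0
    for tok in tokens:
        tok_start, tok_end = pos, pos + len(tok)
        mask.append(1 if tok_start < end and tok_end > start else 0)
        pos = tok_end
    return mask


def masked_policy_loss(logprobs, advantage, mask):
    """Token-level policy-gradient surrogate averaged over unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return -sum(m * advantage * lp for m, lp in zip(mask, logprobs)) / kept
```

With a real tokenizer the same overlap test would be run on character offsets per token; the reasoning tokens still condition the answer during sampling, but their log-probabilities are multiplied by zero in the loss.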

![Refer to caption](https://arxiv.org/html/2607.00164v1/x1.png)

Figure 2: Held-out calibration during training (fixed $n = 128$). Applied to the full completion, the reward of [Eq. 1](https://arxiv.org/html/2607.00164#S4.E1) drives a chain-of-thought policy away from calibration on both Brier (a) and ECE (b). Removing the reasoning from the gradient, by direct prediction or by masking the gradient to the answer span, recovers calibration to the teacher’s level. The full-completion and masked runs share the same reward, prompt, and base model, and differ only in the gradient mask.

The masked model is slightly less accurate than the direct model. On held-out 2024 games the direct model reaches a Brier of 0.144 and the masked model 0.152, a gap of 0.008 by a paired bootstrap over plays (95% CI [0.006, 0.010]). The gap is small, and far below the base-to-teacher gap that the reward closes. Both trained models trail the market on Brier, but the shortfall is in resolution, not calibration: the direct model’s calibration error of 0.029 matches the market’s 0.027, and its Brier is higher only because the static prompt cannot see the live information the market uses to separate states more sharply ([Section 6](https://arxiv.org/html/2607.00164#S6)).

Table 2: Paired bootstrap over held-out 2024 plays ($n = 5185$, $10^4$ resamples). Masking the gradient to the answer span costs a small but significant amount of Brier relative to discarding the reasoning; both decoupled models trail the market by a similar, significant margin that [Section 6](https://arxiv.org/html/2607.00164#S6) attributes to resolution rather than calibration.

The masked model keeps its reasoning coherent. A blinded judge read each completion and decided whether the stated probability follows from the reasoning, and the rate of inconsistent completions falls from 22.4% for the base model to 4.4% after masked training. Full-completion reinforcement learning leaves the base model’s inconsistency rate unchanged, so the gain comes from the masked objective and not from reinforcement learning in general, and a control that applies the masked prompt without training accounts for only part of it. The base model writes arithmetic that does not sum and then reports an unrelated number, or rates a clear favorite below an even chance, while the masked model reads the live state and states a probability that follows from it ([Fig. 4](https://arxiv.org/html/2607.00164#S4.F4)). A probability lookup cannot do this, so the masked model is the better choice when a forecast must be explained, at the cost of slightly lower sharpness than direct prediction.

![Refer to caption](https://arxiv.org/html/2607.00164v1/x2.png)

Figure 3: Rate at which the stated probability does not follow from the model’s own chain of thought, on a blinded-judge sample of held-out plays (95% bootstrap intervals). Full-completion chain-of-thought reinforcement learning leaves the base rate unchanged; masked training reduces it sharply, beyond what the masked prompt format alone accounts for.

Example 1. Q4, 2:21 left. WAS trailing by 12, 2nd & 6 at the NYG 8-yard line. Pregame line: WAS favored by 7.5. Empirical rate $\hat{p} = 0.07$; WAS lost.

Base (CoT): *“[…] Touchdown and Extra Point: 0.8 × 0.95 = 0.76 or 76%. Touchdown and Two-Point Conversion: 0.8 × 0.4 = 0.32 or 32%. Total Probability: 0.76 + 0.32 = 1.08, but since we can’t exceed 100%, we cap it at 100%. […] approximately 32%.”* Stated probability: 0.32

Masked-CoT (ours): *“[…] a 12-point deficit with only 2 minutes and 21 seconds left […] makes a comeback extremely challenging. Probability: 8%.”* Stated probability: 0.08

Example 2. Q4, 7:57 left. CHI leading by 3, 3rd & 6 at the CAR 42-yard line. Pregame line: CHI favored by 3. Empirical rate $\hat{p} = 0.80$; CHI won.

Base (CoT): *“[…] Base Probability (Favored by 3): 60–70%. Adjustment for Poor Performance: −10 to −20%. Final Probability: 40–50%.”* (a leading, favored team, with the stated value at the bottom of its own adjusted range) Stated probability: 0.40

Masked-CoT (ours): *“[…] CHI is in a strong position to maintain their lead, as they are favored and have the ball […] ample time to either score a touchdown or at least a field goal. Probability: 85%.”* Stated probability: 0.85

Figure 4: Reasoning before and after gradient-masked CoT training, on identical game states (blinded-judge sample, 2023). In the base completions the stated probability does not follow from the text: an addition that sums to 108% reports one of its branches, and a leading favorite is rated below an even chance. The masked-CoT model reads the live state and states a probability consistent with it, close to the empirical rate $\hat{p}$. Inconsistency of this kind falls from 22.4% of base completions to 4.4% after masked training ([Fig. 3](https://arxiv.org/html/2607.00164#S4.F3)).

## 5 Experimental setup

States are drawn from National Football League regular-season play-by-play for the 2015 through 2024 seasons, using the public nflfastR data [[24](https://arxiv.org/html/2607.00164#bib.bib24), [25](https://arxiv.org/html/2607.00164#bib.bib25)]. Each play is one training example: the game state of [Section 3](https://arxiv.org/html/2607.00164#S3), paired with the realized outcome of whether the team in possession won. We split by season into 40,246 training states for 2015 through 2022, 5,241 for selection in 2023, and 5,185 for test in 2024, so that no game appears in two splits. The empirical-rate target is estimated only from training-season outcomes, and no prompt contains the live market win probability; we verified that it appears in none of the 40,246 training prompts.

The base model is Qwen2.5-7B-Instruct, a dense non-thinking model, used with no supervised fine-tuning; the only training is reinforcement learning against the reward of [Eq. 1](https://arxiv.org/html/2607.00164#S4.E1). We use group-relative policy optimization [[1](https://arxiv.org/html/2607.00164#bib.bib1)] as implemented in the TRL library, with vLLM rollouts colocated in the training process, on a single NVIDIA L40S (48 GB) GPU. The policy is adapted with LoRA (rank 16, $\alpha = 32$, dropout 0.05) on the attention and feed-forward projections, in bfloat16; at this scale one card holds the policy, the reference model, and the rollout engine without quantization. Each step samples eight completions per state at temperature 0.9, uses a token-level loss [[17](https://arxiv.org/html/2607.00164#bib.bib17)], and performs a single on-policy update. We disable TRL’s vLLM importance-sampling correction, which otherwise masks the loss under the small per-token mismatch between vLLM and training log-probabilities; with single-pass updates this recovers standard group-relative optimization. Training runs for 250 steps with a 20-step warmup, saving every 50; we select the checkpoint by Brier score on the full 2023 split and report it on 2024.
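For readers unfamiliar with group-relative optimization, the following sketch shows how the eight rewards sampled for one state become advantages. It is a generic illustration, not TRL's code: the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Center and scale the rewards of one group of completions for one state.

    Assumptions: population standard deviation and eps=1e-4; a given
    library may use the sample standard deviation or a different eps.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight sampled forecasts for one state, scored against a rate of 0.61.
p_hat = 0.61
forecasts = [0.88, 0.75, 0.61, 0.55, 0.40, 0.95, 0.65, 0.30]
rewards = [1.0 - (p - p_hat) ** 2 for p in forecasts]
advantages = group_relative_advantages(rewards)
```

Forecasts close to the target rate receive positive advantages and distant ones negative, so the update depends only on the ranking and spread within the group, not on the absolute reward level.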

The two models differ as in [Section 4.2](https://arxiv.org/html/2607.00164#S4.SS2): the direct model emits the probability in at most 48 tokens, while the masked model emits up to 640 tokens of reasoning and applies the gradient only to the final probability span. Both accumulate gradients over sixteen micro-batches and keep reward scaling on. The direct model uses a learning rate of $2 \times 10^{-5}$ and a KL coefficient of 0.01; the masked model uses $3 \times 10^{-5}$ and a KL coefficient of zero. These values come from a learning-rate, batch-size, and KL sweep reported in [Appendix B](https://arxiv.org/html/2607.00164#A2).

We compare against the untrained Qwen2.5-7B-Instruct, prompted with and without a chain of thought; a frontier model, DeepSeek-V4, prompted zero-shot through its API; the empirical-rate teacher $\hat{p}$; the nflverse win-probability model and a gradient-boosted model on the standard feature set ([Section 6](https://arxiv.org/html/2607.00164#S6)); and the betting market, converted from odds to a win probability [[32](https://arxiv.org/html/2607.00164#bib.bib32)] and used as a near-ceiling reference. We report the Brier score, the expected and maximum calibration error over ten equal-width bins, accuracy against the realized winner, and the Murphy decomposition. Confidence intervals are nonparametric bootstraps over plays with $10^4$ resamples, and two forecasters are compared on the same plays with a paired bootstrap of the per-play Brier difference. Reliability diagrams use the same ten bins. Checkpoint selection during training uses a fixed random sample of 128 held-out 2023 states, scored greedily at each save point; all reported numbers use the full splits.
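The paired comparison described here can be sketched as below. The percentile interval and the function name are assumptions; the source specifies only a nonparametric paired bootstrap of the per-play Brier difference with $10^4$ resamples.

```python
import random


def paired_bootstrap_brier_diff(probs_a, probs_b, outcomes, n_resamples=10_000, seed=0):
    """Mean per-play Brier difference (A minus B) with a 95% percentile interval.

    Plays are resampled with replacement, and both forecasters are always
    scored on the same resampled plays, which is what makes the test paired.
    """
    diffs = [(a - y) ** 2 - (b - y) ** 2 for a, b, y in zip(probs_a, probs_b, outcomes)]
    n = len(diffs)
    rng = random.Random(seed)

    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()

    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, (lo, hi)
```

An interval that excludes zero indicates that one forecaster's Brier score is reliably higher on these plays; resampling the differences rather than each forecaster separately removes the play-to-play variation the two share.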

## 6 Results

Reinforcement learning against the conditional-rate reward calibrates the forecaster to the level of the betting market. On the held-out 2024 season the direct model reaches an ECE of 0.029, against the market’s 0.027 and the untrained model’s 0.057. Its maximum calibration error, 0.060, is below the market’s 0.082, and its reliability curve lies on the diagonal where the untrained model’s does not ([Fig. 5](https://arxiv.org/html/2607.00164#S6.F5)). Its Brier score of 0.144 improves on the untrained 0.206 and trails the market’s 0.136 ([Table 3](https://arxiv.org/html/2607.00164#S6.T3)). The masked chain-of-thought model is as well calibrated, at an ECE of 0.030, with a slightly higher Brier score of 0.152.

Table 3: Held\-out 2024 \($n=5185$\)\. Brier with 95% bootstrap interval, expected and maximum calibration error, accuracy, and Murphy resolution\. The trained models reach the teacher and frontier\-model tier, the direct model matches the market on calibration, and both trail the market on Brier through resolution rather than reliability\.

![Refer to caption](https://arxiv.org/html/2607.00164v1/x3.png)

Figure 5: Reliability on held\-out 2024; points on the diagonal are perfectly calibrated and their area is proportional to the plays in the bin\. The trained forecasters follow the diagonal as closely as the market, while the untrained chain\-of\-thought model is overconfident\.

The direct model matches the market on calibration but trails it on Brier by 0\.008\. By the Murphy decomposition this gap is resolution, not reliability: the model is as well calibrated as the market but less sharp, 0\.106 against 0\.115\. It assigns the right probability to the states it can describe, but separates fewer of them\. The missing sharpness comes from information the prompt does not contain, and no static forecaster does better\.
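The Murphy decomposition writes the Brier score as reliability minus resolution plus uncertainty\. A minimal sketch follows, assuming the same ten equal\-width bins as the calibration metrics; the identity is exact only when all forecasts in a bin are equal, and the function name is illustrative\.

```python
def murphy_decomposition(probs, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty) over equal-width bins."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    bins = {}
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(idx, []).append((p, y))
    reliability = 0.0
    resolution = 0.0
    for members in bins.values():
        weight = len(members) / n
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        reliability += weight * (mean_p - freq) ** 2    # lower is better
        resolution += weight * (freq - base_rate) ** 2  # higher is sharper
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


# Toy forecasts with two distinct values, so the identity holds exactly.
probs = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
rel, res, unc = murphy_decomposition(probs, outcomes)
brier_score = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

Two forecasters scored on the same plays share the uncertainty term, so a Brier gap between them splits into a reliability difference and a resolution difference\.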

The frontier model and the teacher reach the same Brier score as the direct model, 0\.144 and 0\.143 against 0\.144, and a paired bootstrap separates none of the three\. All three trail the market by 0\.008, significant in each case \([fig\. 6](https://arxiv.org/html/2607.00164#S6.F6)\), and the direct model is better calibrated than the frontier model, 0\.029 against 0\.043\. No richer static model improves on the coarse rate\. The tuned nflverse model and a gradient\-boosted model on the full feature set score 0\.156 and 0\.158 against 0\.143 \([table 4](https://arxiv.org/html/2607.00164#S6.T4)\), since the extra features add play\-level noise without game\-level signal\. Brill et al\. \[[26](https://arxiv.org/html/2607.00164#bib.bib26)\] reach the same bound analytically, showing that the correlated structure of play\-by\-play data limits the accuracy of any such model\. The market’s 0\.008 edge is therefore live in\-game information that no static input contains\.

![Refer to caption](https://arxiv.org/html/2607.00164v1/x4.png)

Figure 6: Held\-out 2024 Brier with 95% bootstrap intervals\. A label\-free 7B model trained by direct prediction, a frontier model prompted zero\-shot, and the empirical\-rate teacher converge near 0\.143 and trail the market by the same significant margin, placing the residual gap in the available information rather than in model capacity\.

Table 4: Held\-out 2024 Brier and ECE for static\-feature forecasters\. Two models with more features than the coarse buckets are worse on Brier; the coarse empirical rate is the best static estimator, and the direct model reaches its level\.

On the test season the direct model is better calibrated than the teacher it was trained on\. Its ECE is 0\.029 against the teacher’s 0\.044, and its maximum calibration error 0\.060 against 0\.099, at the same Brier score\. The policy smooths the bucketed target into a calibrated function of the state\.

The decalibration of full\-completion chain\-of\-thought training \([section 4\.2](https://arxiv.org/html/2607.00164#S4.SS2)\) does not come from misreading the game state\. A blinded judge scored five state\-reading errors \(possession, score, spread, clock, and pregame anchoring\) and found each in one to ten percent of completions across the base, full\-completion, and masked models, so all of them read the game correctly\. The full\-completion models fail instead at the step from those facts to a number\. They write pseudo\-quantitative arguments whose probabilities exceed one and report a figure their own text does not support \([fig\. 4](https://arxiv.org/html/2607.00164#S4.F4)\)\. The error is an inconsistency between the reasoning and the answer \([fig\. 3](https://arxiv.org/html/2607.00164#S4.F3)\), not a factual mistake\. Masking the gradient leaves the reading of the state untouched and keeps the number consistent with it, which restores calibration\.

## 7 Conclusion

We have shown that reinforcement learning can calibrate a probabilistic forecaster to the level of the betting market using only realized game outcomes, without human labels or supervised fine\-tuning\. Verifiable\-reward training has used the realized label as the reward, which is well behaved when that label is deterministic but injects irreducible variance when the outcome is stochastic\. We reward an aggregate of those same labels, a state\-conditioned empirical rate, which is equally verifiable and far less noisy, and which turns a proper scoring rule into a usable training signal for events that have no correct answer\. This carries verifiable\-reward training from tasks with a right answer to forecasting, where the answer is itself a probability\.
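The variance argument can be made explicit with a short derivation, offered as a sketch\. It assumes the reward is a negative squared error \(the Brier form\) between the forecast $p$ and its target; $q$ denotes the unknown true win probability of a state, $y$ one realized outcome, and $\hat{p}$ the empirical rate\. The symbols $p$, $q$, and $y$ are introduced here for illustration\.

$$
\mathbb{E}_{y \sim \mathrm{Bernoulli}(q)}\big[(p - y)^2\big] = p^2 - 2pq + q = (p - q)^2 + q(1 - q)
$$

The term $q(1-q)$ does not depend on the forecast, so the realized\-outcome loss and the loss $(p - \hat{p})^2$ share the same minimizer whenever $\hat{p}$ is close to $q$\. The difference is that a single draw of $y$ scores either $(p-1)^2$ or $p^2$, and reaches the expectation only by averaging over many plays, whereas $(p - \hat{p})^2$ is deterministic given the state\.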

Even with the denoised target, taking the gradient through a chain of thought decalibrates the model, because optimizing the final probability rewrites the reasoning that produced it\. This refines earlier accounts: the accuracy\-calibration gradient conflict \[[6](https://arxiv.org/html/2607.00164#bib.bib6)\] and the overconfidence attributed to the group\-standard\-deviation normalization of the advantage \[[7](https://arxiv.org/html/2607.00164#bib.bib7)\] both take the realized outcome as the target, whereas with a denoised target that normalization is instead necessary and the problem that remains is the gradient’s reach into the reasoning\. Confining the gradient to the answer removes that problem, either by dropping the reasoning or by masking the gradient on it, and the masked variant keeps a chain of thought from which the forecast follows\.

Three unrelated estimators converge on the same Brier score: a reinforcement\-trained small model, a frontier model, and a tabular rate\. The convergence is strong evidence that the score is the limit set by the information in the public game state, not a property of any one of them\. It is also a practical test of whether a forecaster has exhausted its inputs: once unrelated methods agree, further accuracy must come from new information, not a larger model\. A market, where one exists, supplies the independent reference that makes the test sharp\.

The method reaches a strong reference only where the public state already carries most of the predictive signal, and where outcomes are dense and resolved quickly enough to estimate a reliable empirical rate; sparse or long\-horizon events would weaken both the reward and the comparison against a market\. Within these limits the empirical rate could be enriched with more state features or replaced by a learned rate model to raise the ceiling it imposes, and the same reward and gradient mask should extend to other aleatoric domains where calibrated probabilities matter and outcomes are eventually observed, from weather to elections to clinical prognosis\. Calibrated forecasting can be learned from outcomes alone, and a small open model can match a market once its reward and its gradient are aligned with the quantity it is asked to forecast, while still exposing the reasoning behind each prediction\.

## References

- Shao et al\. \[2024\] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models\. *arXiv preprint arXiv:2402\.03300*, 2024\.
- Guo et al\. \[2025\] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, et al\. DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning\. *Nature*, 645\(8081\):633–638, 2025\. arXiv:2501\.12948\.
- Brier \[1950\] Glenn W\. Brier\. Verification of forecasts expressed in terms of probability\. *Monthly Weather Review*, 78\(1\):1–3, 1950\.
- Gneiting and Raftery \[2007\] Tilmann Gneiting and Adrian E\. Raftery\. Strictly proper scoring rules, prediction, and estimation\. *Journal of the American Statistical Association*, 102\(477\):359–378, 2007\.
- Leng et al\. \[2024\] Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang\. Taming overconfidence in LLMs: Reward calibration in RLHF\. *arXiv preprint arXiv:2410\.09724*, 2024\.
- Ma et al\. \[2026\] Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, and Le Sun\. Decoupling reasoning and confidence: Resurrecting calibration in reinforcement learning from verifiable rewards\. In *Proceedings of the 43rd International Conference on Machine Learning \(ICML\)*, 2026\. arXiv:2603\.09117\.
- Bereket and Leskovec \[2025\] Michael Bereket and Jure Leskovec\. Uncalibrated reasoning: GRPO induces overconfidence for stochastic outcomes\. *arXiv preprint arXiv:2508\.11800*, 2025\.
- Damani et al\. \[2025\] Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, and Jacob Andreas\. Beyond binary rewards: Training LMs to reason about their uncertainty\. *arXiv preprint arXiv:2507\.16806*, 2025\.
- Bani\-Harouni et al\. \[2025\] David Bani\-Harouni, Chantal Pellegrini, Paul Stangel, Ege Özsoy, Kamilia Zaripova, Nassir Navab, and Matthias Keicher\. Rewarding doubt: A reinforcement learning approach to calibrated confidence expression of large language models\. *arXiv preprint arXiv:2503\.02623*, 2025\.
- Band et al\. \[2024\] Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto\. Linguistic calibration of long\-form generations\. In *Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, 2024\. arXiv:2404\.00474\.
- Halawi et al\. \[2024\] Danny Halawi, Fred Zhang, Chen Yueh\-Han, and Jacob Steinhardt\. Approaching human\-level forecasting with language models\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024\. arXiv:2402\.18563\.
- Pratt et al\. \[2024\] Sarah Pratt, Seth Blumberg, Pietro Kreitlon Carolino, and Meredith Ringel Morris\. Can language models use forecasting strategies? *arXiv preprint arXiv:2406\.04446*, 2024\.
- Turtel et al\. \[2025a\] Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, and Philipp Schoenegger\. Outcome\-based reinforcement learning to predict the future\. *arXiv preprint arXiv:2505\.17989*, 2025a\.
- Turtel et al\. \[2025b\] Benjamin Turtel, Danny Franklin, and Philipp Schoenegger\. LLMs can teach themselves to better predict the future\. *arXiv preprint arXiv:2502\.05253*, 2025b\.
- Paleka et al\. \[2025\] Daniel Paleka, Shashwat Goel, Jonas Geiping, and Florian Tramèr\. Pitfalls in evaluating language model forecasters\. *arXiv preprint arXiv:2506\.00723*, 2025\.
- Karger et al\. \[2024\] Ezra Karger, Houtan Bastani, Chen Yueh\-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, and Philip E\. Tetlock\. ForecastBench: A dynamic benchmark of AI forecasting capabilities\. *arXiv preprint arXiv:2409\.19839*, 2024\.
- Yu et al\. \[2025\] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, et al\. DAPO: An open\-source LLM reinforcement learning system at scale\. *arXiv preprint arXiv:2503\.14476*, 2025\.
- Chu et al\. \[2026\] Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang\. GPG: A simple and strong reinforcement learning baseline for model reasoning\. In *International Conference on Learning Representations \(ICLR\)*, 2026\. arXiv:2504\.02546\.
- Zhang et al\. \[2026\] Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, and Andrew Chi\-Chih Yao\. On the design of KL\-regularized policy gradient algorithms for LLM reasoning\. In *International Conference on Learning Representations \(ICLR\)*, 2026\. arXiv:2505\.17508\.
- Wang et al\. \[2025a\] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin\. Beyond the 80/20 rule: High\-entropy minority tokens drive effective reinforcement learning for LLM reasoning\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2025a\. arXiv:2506\.01939\.
- Wang et al\. \[2025b\] Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou, and Ling Pan\. Stabilizing knowledge, promoting reasoning: Dual\-token constraints for RLVR\. *arXiv preprint arXiv:2507\.15778*, 2025b\.
- Tan et al\. \[2025\] Hongze Tan, Zihan Wang, Jianfei Pan, Jinghao Lin, Hao Wang, Yifan Wu, Tao Chen, Zhihang Zheng, Zhihao Tang, and Haihua Yang\. GTPO and GRPO\-S: Token and sequence\-level reward shaping with policy entropy\. *arXiv preprint arXiv:2508\.04349*, 2025\.
- Lock and Nettleton \[2014\] Dennis Lock and Dan Nettleton\. Using random forests to estimate win probability before each play of an NFL game\. *Journal of Quantitative Analysis in Sports*, 10\(2\):197–205, 2014\.
- Yurko et al\. \[2019\] Ronald Yurko, Samuel Ventura, and Maksim Horowitz\. nflWAR: A reproducible method for offensive player evaluation in football\. *Journal of Quantitative Analysis in Sports*, 15\(3\), 2019\. arXiv:1802\.00998\.
- Carl and Baldwin \[2024\] Sebastian Carl and Ben Baldwin\. *nflfastR: Functions to Efficiently Access NFL Play by Play Data*, 2024\. URL [https://www\.nflfastr\.com/](https://www.nflfastr.com/)\. R package\.
- Brill et al\. \[2025\] Ryan S\. Brill, Ronald Yurko, and Abraham J\. Wyner\. Exploring the difficulty of estimating win probability: A simulation study\. *Journal of Quantitative Analysis in Sports*, 2025\. arXiv:2406\.16171\.
- Polson and Stern \[2015\] Nicholas G\. Polson and Hal S\. Stern\. The implied volatility of a sports game\. *Journal of Quantitative Analysis in Sports*, 11\(3\):145–153, 2015\.
- Boulier and Stekler \[2003\] Bryan L\. Boulier and H\. O\. Stekler\. Predicting the outcomes of national football league games\. *International Journal of Forecasting*, 19\(2\):257–270, 2003\.
- Levitt \[2004\] Steven D\. Levitt\. Why are gambling markets organised so differently from financial markets? *The Economic Journal*, 114\(495\):223–246, 2004\.
- Franck et al\. \[2010\] Egon Franck, Erwin Verbeek, and Stephan Nüesch\. Prediction accuracy of different market structures: Bookmakers versus a betting exchange\. *International Journal of Forecasting*, 26\(3\):448–459, 2010\.
- Cox et al\. \[2021\] Justin Cox, Adam L\. Schwartz, Bonnie F\. Van Ness, and Robert A\. Van Ness\. The predictive power of college football spreads: Regular season versus bowl games\. *Journal of Sports Economics*, 22\(3\):251–273, 2021\.
- Štrumbelj \[2014\] Erik Štrumbelj\. On determining probability forecasts from betting odds\. *International Journal of Forecasting*, 30\(4\):934–943, 2014\.
- Murphy \[1973\] Allan H\. Murphy\. A new vector partition of the probability score\. *Journal of Applied Meteorology*, 12\(4\):595–600, 1973\.
- Guo et al\. \[2017\] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q\. Weinberger\. On calibration of modern neural networks\. In *Proceedings of the 34th International Conference on Machine Learning \(ICML\)*, 2017\. arXiv:1706\.04599\.
- Bröcker and Smith \[2007\] Jochen Bröcker and Leonard A\. Smith\. Increasing the reliability of reliability diagrams\. *Weather and Forecasting*, 22\(3\):651–661, 2007\.
- Dimitriadis et al\. \[2021\] Timo Dimitriadis, Tilmann Gneiting, and Alexander I\. Jordan\. Stable reliability diagrams for probabilistic classifiers\. *Proceedings of the National Academy of Sciences*, 118\(8\):e2016191118, 2021\.
- Efron \[1979\] Bradley Efron\. Bootstrap methods: Another look at the jackknife\. *The Annals of Statistics*, 7\(1\):1–26, 1979\.
- Efron and Morris \[1975\] Bradley Efron and Carl Morris\. Data analysis using Stein’s estimator and its generalizations\. *Journal of the American Statistical Association*, 70\(350\):311–319, 1975\.

## Appendix A The empirical\-rate target

The reward target $\hat{p}(x)$ is a state\-conditioned empirical win rate estimated from training\-season outcomes\. Each play is placed in a bucket defined by three game\-level features: the score margin of the team in possession, in fourteen bins with edges at $\pm 1$, $\pm 4$, $\pm 7$, $\pm 10$, $\pm 14$, and $\pm 21$ points; the time remaining, in seven bins with edges at 2, 5, 10, 15, 30, and 45 minutes; and the public pregame point spread, in nine bins with edges at $\pm 0.5$, $\pm 3$, $\pm 7$, and $\pm 10$ points\. A bucket’s raw value is the fraction of its training plays whose possession team won\. Sparse buckets are shrunk toward their coarser parents by a hierarchical empirical\-Bayes rule: writing $n$ and $w$ for the count and wins in a bucket and $\hat{p}_{\text{parent}}$ for its parent estimate, the bucket value is $(w + M \hat{p}_{\text{parent}})/(n + M)$ with pseudocount $M=25$, applied from the global rate downward\. The estimate for an evaluation play is read from the training\-season table alone, so no test outcome enters its own target\.
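The shrinkage rule can be sketched as follows\. The formula and the pseudocount $M=25$ come from the text; the function names and the representation of the hierarchy as a coarsest\-to\-finest list of \(wins, count\) pairs are assumptions, since the text does not say which features each parent level drops\.

```python
def shrink(wins, count, parent_rate, pseudocount=25.0):
    """Empirical-Bayes estimate (w + M * p_parent) / (n + M) for one bucket."""
    return (wins + pseudocount * parent_rate) / (count + pseudocount)


def backoff_rate(global_rate, levels, pseudocount=25.0):
    """Apply the shrinkage from the global rate downward.

    levels: (wins, count) pairs ordered from the coarsest parent bucket to the
            finest bucket that contains the play (hypothetical layout).
    """
    rate = global_rate
    for wins, count in levels:
        rate = shrink(wins, count, rate, pseudocount)
    return rate


# A well-populated parent bucket (60 wins in 100 plays) and a sparse child
# bucket (3 wins in 3 plays): the child is pulled strongly toward its parent.
estimate = backoff_rate(0.5, [(60, 100), (3, 3)])
```

An empty bucket returns its parent's estimate unchanged, and a bucket with many plays is dominated by its own raw rate\.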

Finer state features do not raise the target\. A fine table that adds field position and down under the same hierarchical backoff matches the coarse table on game\-level calibration: on the 2023 split the coarse target scores 0\.153 Brier and 0\.012 ECE against the fine table’s 0\.153 and 0\.010, both close to the market’s 0\.150 and 0\.019 \([table 5](https://arxiv.org/html/2607.00164#A1.T5)\)\. Field position and down shape the current drive rather than the game outcome, so the reward uses the coarse score\-by\-time\-by\-spread table\.

Table 5: Teacher\-granularity ablation on held\-out 2023\. Adding field position and down to the coarse buckets does not improve game\-level calibration, so the reward target stays coarse\.

## Appendix B Training and hyperparameter selection

Every checkpoint is selected by Brier score on a fixed random sample of 128 held\-out 2023 states, scored greedily at each 50\-step save, and the two models are tuned separately\. The numbers below are this in\-training selection metric, with one setting varied at a time around a reference configuration; the reported test numbers use the full splits\.

For the direct model the learning rate matters most\. Raising it from $5\times 10^{-6}$ to $1\times 10^{-5}$ lowers the in\-training Brier from 0\.160 to 0\.154 at equal calibration error, and $2\times 10^{-5}$ sharpens further at a higher calibration error; on the larger batch this higher rate is the optimum\. Effective batch size, set by gradient accumulation, improves Brier up to four games per step and then saturates\. Three negatives are decisive \([table 6](https://arxiv.org/html/2607.00164#A2.T6)\): turning off reward scaling collapses resolution, blending the realized outcome back into the target at $\lambda=0.5$ reintroduces its variance, and lowering the KL coefficient below 0\.01 worsens the calibration error\. Temperature is neutral between 0\.7 and 1\.0, and the selected configuration varies by 0\.004 Brier across three seeds\. The reported direct model uses a learning rate of $2\times 10^{-5}$, gradient accumulation over sixteen micro\-batches, a KL coefficient of 0\.01, and reward scaling on\.

Table 6: Direct\-model ablation, in\-training selection metric \(greedy, fixed $n=128$ held\-out 2023, best\-Brier checkpoint\)\. Each row changes one setting from the reference\. The learning rate and the effective batch size trade sharpness against calibration; reward scaling, the rate target, and a non\-zero KL coefficient are decisive negatives when removed\.

The masked model differs in two settings\. The KL anchor must be removed: with the gradient concentrated on the few answer tokens, a coefficient of 0\.01 makes the $k_3$ KL estimator overflow and the loss diverge, and $1\times 10^{-3}$ degrades calibration over training, while $\beta=0$ is stable because the answer\-span mask already protects the reasoning the reference model would otherwise anchor\. The learning rate follows an inverted\-U with a higher optimum than for the direct model \([table 7](https://arxiv.org/html/2607.00164#A2.T7)\): held\-out Brier improves from $5\times 10^{-6}$ to a peak at $3\times 10^{-5}$ and collapses at $4\times 10^{-5}$, where the unanchored update drives probabilities to extremes\. The reported masked model uses a learning rate of $3\times 10^{-5}$ and $\beta=0$\.
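For context, the sketch below writes out the per\-token $k_3$ estimator in its commonly used form, $k_3 = r - 1 - \log r$ with $r$ the ratio of reference to policy probability\. That the paper uses exactly this form is an assumption, and the numbers are hypothetical; the example only shows that the estimator grows exponentially in the log\-ratio, which is one plausible route to the overflow reported above\.

```python
import math


def k3_estimate(logp_policy, logp_ref):
    """Per-token k3 estimate of KL(policy || reference) for a sampled token.

    With r = p_ref / p_policy, k3 = r - 1 - log(r). It is non-negative and
    zero when the two models agree on the token.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - 1.0 - log_ratio


same = k3_estimate(-1.0, -1.0)      # identical log-probabilities
mild = k3_estimate(-1.0, -1.5)      # small disagreement
extreme = k3_estimate(-30.0, -1.0)  # policy has driven a token far below the reference
```

Because of the exponential term, a handful of answer tokens with very large log\-ratios can dominate the penalty even at a small coefficient\.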

Table 7: Masked\-model ablation, held\-out 2024 \($n=1000$ subset, best checkpoint\)\. The KL anchor must be removed, and the learning rate follows an inverted\-U with an optimum at $3\times 10^{-5}$, above which the unanchored, concentrated gradient drives the maximum calibration error past 0\.5\.

## Appendix C Blinded\-judge protocol for reasoning quality

To measure reasoning quality we collect, for each model, its chain of thought on 250 held\-out 2023 plays stratified across game states, with the ground\-truth game facts taken from each play’s features\. A separate model judges every completion blind to its source and decides, with a quote from the completion supporting each judgment, whether the chain of thought misreads any of five game facts \(possession, score, spread, clock, and the pregame favorite\) and whether the stated probability follows from the reasoning\. We audited the judge by reading every flagged case; each flag carries a specific checkable reason, and the judge flags the masked model alongside the others\.

State\-reading errors are low and flat across the base model, the two full\-completion reinforcement runs, and the masked model\. Possession and score are misread in about one percent of completions, the pregame favorite in two to seven percent, the spread in three to six percent, and the clock in seven to ten percent, the last a base\-model weakness that no training variant changes\. Reinforcement learning does not corrupt the reading of the game\.

The models separate instead on whether the stated probability follows from the reasoning \([table 8](https://arxiv.org/html/2607.00164#A3.T8)\)\. The base model is inconsistent on 22\.4% of completions, and full\-completion reinforcement learning does not improve it; the failures are broken probability arithmetic, drive\-level and game\-level probabilities conflated, and a final number the text does not support\. The masked model is inconsistent on 4\.4%\. The masked prompt format accounts for much of the reduction, since a base model read under the masked prompt is inconsistent on 6\.8%, and masked training lowers it further\. At 250 plays the format and training intervals overlap on this checkpoint, while a separate control on the reported checkpoint isolates a significant training effect\.
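The overlap at 250 plays can be checked with a rough interval calculation\. The sketch assumes each rate is measured on 250 completions, so the counts 56, 17, and 11 are reconstructed from the reported percentages, and it uses the Wilson score interval as one common choice; the text does not state which interval the authors used\.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def overlaps(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Counts reconstructed from 22.4%, 6.8%, and 4.4% of 250 completions.
base = wilson_interval(56, 250)
masked_prompt_only = wilson_interval(17, 250)
masked_trained = wilson_interval(11, 250)
```

Under these assumptions the intervals for the masked prompt alone and for masked training overlap, while both sit well below the base model's interval\.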

Table 8: Rate at which the stated probability does not follow from the chain of thought, blinded judge over 250 held\-out 2023 plays\. Full\-completion reinforcement learning does not improve on the base rate; the masked prompt format reduces it sharply and masked training reduces it further\.

## Appendix D Data\-integrity audit

The prompt exposes one market\-derived quantity, the public pregame point spread, and never the live in\-game win probability that serves as the evaluation reference\. We scanned all 40,246 training prompts and the 2023 and 2024 evaluation prompts for the market win probability and found no occurrence\. The empirical\-rate target is built only from realized outcomes and the pregame spread, and the target for an evaluation play is computed from the training\-season table alone, so neither the market probability nor any test outcome enters training\. The splits are disjoint by season, training on 2015 through 2022, selecting on 2023, and testing on 2024, so no game appears in more than one split\. Every reported number is recomputed from the per\-play predictions by a single deterministic script, so the tables and figures reproduce from the released predictions without a GPU\.
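The season\-disjoint split lends itself to a mechanical check\. The sketch below uses the season boundaries given in the text; the record layout of \(game id, season\) pairs and the function names are hypothetical and do not describe the released script\.

```python
def split_for_season(season):
    """Map a season to its split: train 2015-2022, select 2023, test 2024."""
    if 2015 <= season <= 2022:
        return "train"
    if season == 2023:
        return "select"
    if season == 2024:
        return "test"
    raise ValueError(f"season {season} is outside the study period")


def games_in_multiple_splits(plays):
    """Return the ids of games whose plays fall in more than one split.

    plays: iterable of (game_id, season) pairs, one per play (hypothetical layout).
    """
    splits_by_game = {}
    for game_id, season in plays:
        splits_by_game.setdefault(game_id, set()).add(split_for_season(season))
    return sorted(g for g, splits in splits_by_game.items() if len(splits) > 1)


# Hypothetical records: the second list mislabels one play of game "g1".
clean = [("g1", 2022), ("g1", 2022), ("g2", 2023), ("g3", 2024)]
leaky = [("g1", 2022), ("g1", 2024), ("g2", 2023)]
```

An empty result means every game belongs to exactly one split\.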
