Probabilistic Calibration Is a Trainable Capability in Language Models
Summary
This paper investigates whether probabilistic calibration in language models can be improved through fine-tuning, comparing soft-target and hard-target methods across 12 models. The results show that calibration is a trainable capability, though gains sometimes reduce downstream arithmetic reasoning capabilities.
# Probabilistic Calibration Is a Trainable Capability in Language Models
Source: [https://arxiv.org/html/2605.11845](https://arxiv.org/html/2605.11845)
Davide Baldelli \(1, 2, 3, 4\), Sruthi Kuriakose \(6\), Maryam Hashemzadeh \(1, 2, 5\), Amal Zouaq \(2, 3, 4\), Sarath Chandar \(1, 2, 4\)
\(1\) Chandar Research Lab; \(2\) Mila – Quebec AI Institute; \(3\) LAMA\-WeST Lab; \(4\) Polytechnique Montréal; \(5\) Université de Montréal; \(6\) Independent researcher
###### Abstract
Language models are increasingly used in settings where outputs must satisfy user\-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets\. We study whether this capability can be improved directly through fine\-tuning\. Concretely, we fine\-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine\-Tuning variants: a soft\-target method that converts the desired output distribution into trie\-derived next\-token targets, and a hard\-target method that trains on sampled completions from the same target distribution\. Across 12 models spanning four families, both methods substantially improve structured\-sampling fidelity on held\-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability\. Under our selected training configurations, the two methods exhibit different empirical profiles: hard\-target fine\-tuning is often strongest on structured numeric sampling, while soft\-target fine\-tuning performs better on broader stochastic generation benchmarks, including open\-ended random generation, multiple\-choice answer\-position balancing, and NoveltyBench\. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model\. Overall, our results show that probabilistic calibration can be improved through fine\-tuning, with our hard\-target configuration favoring exact numeric fidelity and our soft\-target configuration favoring broader stochastic transfer\. Code is available at [https://github\.com/chandar\-lab/calibration\-finetuning](https://github.com/chandar-lab/calibration-finetuning)\.
## 1 Introduction
Language models are increasingly used as general\-purpose interfaces for generation and decision\-making \(Chatterji et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib2)\)\. In many such uses, producing a valid answer is not enough: when the request involves randomness, diversity, or balanced choice, the model should place probability mass according to the intended stochastic behavior\. In these settings, both the set of attainable outputs and the allocation of probability mass across them matter\. In formal spaces, calibrated sampling is a standard primitive: one calls routines like `np.random.normal(...)` to obtain draws from a specified distribution\. There is no equally reliable primitive for language space\. In practice, language models are the natural candidate interface for requests such as "tell me a random city" or "tell me a random number from 1 to 10," but their induced output distributions are poorly controlled\.
Current pretraining and post\-training objectives do not incentivize calibrated output probabilities: the former optimizes next\-token prediction on observed text, and the latter rewards task success, compliance or preferred responses\. Unsurprisingly, language models collapse onto narrow output subsets, violate simple support constraints, and induce distributions that are far from the requested target\(Renda et al\.,[2023](https://arxiv.org/html/2605.11845#bib.bib23); Zhao et al\.,[2026](https://arxiv.org/html/2605.11845#bib.bib32); Gu et al\.,[2026](https://arxiv.org/html/2605.11845#bib.bib6)\)\. Prompting workarounds can help but add inference\-time cost and prompt sensitivity\(Misaki and Akiba,[2025](https://arxiv.org/html/2605.11845#bib.bib18); Xiao et al\.,[2025b](https://arxiv.org/html/2605.11845#bib.bib28)\)\.
This raises a more specific question: can fine\-tuning make language models better calibrated probabilistic samplers, so that their samples reflect the intended randomness of the request? In this work, we study a simple setting in which this intended randomness is specified exactly by a known target distribution\. We train language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine\-Tuning strategies for teaching this behavior\. The first is *soft\-target* fine\-tuning: we enumerate the valid numeric outputs for a prompt, assign each output its target probability, and convert this distribution over full answers into supervised probabilities for the next token at each prefix\. The second is *hard\-target* fine\-tuning: we draw many numeric answers from the target distribution and train on them with standard next\-token cross\-entropy\.
Our main finding is that probabilistic calibration is a trainable capability\. Across 12 models, both soft\-target and hard\-target fine\-tuning substantially improve structured\-sampling fidelity on held\-out distribution families and unseen parameter settings\. The two methods, however, exhibit different empirical profiles\. With the final configurations selected by our ablations, hard\-target fine\-tuning is often strongest on structured numeric sampling itself, while soft\-target fine\-tuning performs better on broader stochastic generation settings and more often yields favorable perplexity diagnostics\.
In summary, we make the following contributions:
- We study probabilistic calibration as a trainable capability of language models, framing the problem as learning both the valid output space and the probability mass assigned across it\.
- We introduce and compare two fine\-tuning strategies for this setting: a soft\-target method based on trie\-induced next\-token supervision, and a hard\-target method based on repeated sampled completions from the same target distribution\.
- Across 12 models, we show that both methods substantially improve structured\-sampling fidelity on held\-out families and unseen parameter settings, with hard\-target fine\-tuning often strongest on the in\-domain numeric benchmark\.
- We show that these gains transfer beyond the synthetic training domain, with our soft\-target configuration often performing better on broader stochastic generation benchmarks such as open\-ended random generation, MCQ answer\-position balance \(Zhao et al\., [2026](https://arxiv.org/html/2605.11845#bib.bib32)\), and NoveltyBench \(Zhang et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib31)\), while retention and perplexity results reveal model\-dependent tradeoffs under the selected training budgets\.
Figure 1: Calibration Fine\-Tuning pipeline\. A prompt defines a target law $P$, discretized into a canonical output space $Y$ with masses $P_Y$\. Soft\-target fine\-tuning builds a token trie and matches the induced next\-token distribution with $\mathcal{L}_{\mathrm{soft}}$\. Hard\-target fine\-tuning samples $y \sim P_Y$, tokenizes it, and applies masked completion cross\-entropy $\mathcal{L}_{\mathrm{hard}}$\. In both variants, only the LoRA adapter is updated\.
## 2 Related Work
#### Language models as poor native samplers, and prompting workarounds\.
A growing line of work shows that standard autoregressive language models are weak stochastic generators\. Early controlled evaluations found large deviations from target distributions even in simple synthetic settings\(Renda et al\.,[2023](https://arxiv.org/html/2605.11845#bib.bib23)\), and later work showed that these failures persist and often worsen under more demanding independent single\-sample prompting protocols\(Zhao et al\.,[2026](https://arxiv.org/html/2605.11845#bib.bib32)\)\. Frontier models can often transform provided random seeds into target distributions, yet still fail when asked to sample from those distributions directly\(Gu et al\.,[2026](https://arxiv.org/html/2605.11845#bib.bib6)\)\. Related failures appear beyond textbook numeric settings, including password generation\(Karanjai et al\.,[2025](https://arxiv.org/html/2605.11845#bib.bib10)\)and game\-theoretic decision\-making\(Guo et al\.,[2025](https://arxiv.org/html/2605.11845#bib.bib7)\)\. Prompting\-based interventions can partly improve stochastic behavior by injecting extra reasoning or synthetic entropy\(Misaki and Akiba,[2025](https://arxiv.org/html/2605.11845#bib.bib18); Xiao et al\.,[2025b](https://arxiv.org/html/2605.11845#bib.bib28)\), but they do so at inference time and introduce additional prompt sensitivity and computational overhead, whereas our focus is on training the model itself to better match a known target distribution\.
#### Calibration, soft targets, and probability\-shaping objectives\.
We use calibration in a distributional sense, meaning alignment of output probabilities with a target statistical distribution rather than epistemic uncertainty calibration \(Kapoor et al\., [2024](https://arxiv.org/html/2605.11845#bib.bib9)\)\. More broadly, our method belongs to a family of objectives that shape predictive distributions through soft supervision, distillation, or calibration\-aware training \(Lee et al\., [2022](https://arxiv.org/html/2605.11845#bib.bib13); Kim et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib11); Liang and Wang, [2026](https://arxiv.org/html/2605.11845#bib.bib14); Pereira et al\., [2026](https://arxiv.org/html/2605.11845#bib.bib21); Kirk et al\., [2023](https://arxiv.org/html/2605.11845#bib.bib12); Xiao et al\., [2025a](https://arxiv.org/html/2605.11845#bib.bib27); Parikh et al\., [2026](https://arxiv.org/html/2605.11845#bib.bib20); Luo et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib15)\)\. Related ideas also appear in distributional alignment for LLM\-as\-a\-judge settings \(Chen et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib3)\), and closest to our setting, Zhang et al\. \([2024](https://arxiv.org/html/2605.11845#bib.bib30)\) fine\-tune language models with sequence\-level targets to produce diffuse distributions over valid outputs\. We build on this direction by extending distributional fine\-tuning to a broader family of probability distributions, comparing sampled\-completion and trie\-derived token\-level supervision, and evaluating transfer to natural\-language stochastic behavior together with capability retention\.
#### Diversity, creativity, and epistemic breadth\.
Our transfer evaluations connect to a literature on diversity, creativity, and epistemic breadth in generative models\(Mohammadi,[2024](https://arxiv.org/html/2605.11845#bib.bib19); Doshi and Hauser,[2024](https://arxiv.org/html/2605.11845#bib.bib4); Holzner et al\.,[2025](https://arxiv.org/html/2605.11845#bib.bib8); Luo et al\.,[2026](https://arxiv.org/html/2605.11845#bib.bib16); Wright et al\.,[2025](https://arxiv.org/html/2605.11845#bib.bib26); Zhang et al\.,[2025](https://arxiv.org/html/2605.11845#bib.bib31)\)\. We do not aim to solve these broader problems directly\. Instead, we use them as a downstream stress test: if Calibration Fine\-Tuning teaches a more faithful stochastic capability, some of that behavior should transfer beyond mathematical distribution prompts into more natural random\-generation settings\.
## 3 Calibration Fine\-Tuning
Figure [1](https://arxiv.org/html/2605.11845#S1.F1) summarizes Calibration Fine\-Tuning\. Both variants start from the same synthetic sampling prompts and the same induced discrete target over a canonical output space, but differ in supervision: soft\-target fine\-tuning matches trie\-induced next\-token distributions, while hard\-target fine\-tuning imitates sampled canonical completions from the same discrete target\.
#### Task setup\.
We train on prompts of the form:
> Generate exactly ONE random number from a \[distribution\] distribution with parameters \[params\]\. Output ONLY the number\.
When applicable, this request is wrapped in the model’s native chat template; for reasoning\-capable models such as Qwen3 and GPT\-OSS, we also disable reasoning traces so that training targets only the sampled numeric output\.
Each prompt specifies a distribution family and parameters, inducing a target probability law $P$ over the reals or integers\. To make supervision tractable, we discretize this law into a finite canonical output space $Y=\{y_1,\dots,y_M\}$ of valid numeric strings together with an induced discrete distribution $P_Y$ over $Y$\. For integer\-valued families, $Y$ is the finite support when available and otherwise a quantile\-truncated support set\. For continuous families, we quantize a quantile\-bounded interval at fixed decimal precision; when the resulting decimal grid exceeds the output\-space cap, we keep evenly spaced grid points, compute bin masses from CDF differences between adjacent midpoints, and assign outside\-interval tail mass to the edge bins\. In the final training runs, both methods use five\-decimal canonical outputs; soft\-target fine\-tuning caps $Y$ at 1001 bins, while hard\-target fine\-tuning uses 16384 bins, following the ablations in Appendix [E](https://arxiv.org/html/2605.11845#A5)\.
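The continuous-family discretization described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `discretize`, the quantile bounds 0.001 and 0.999, and the small demo settings (two decimals, 11 bins) are assumptions chosen for readability, whereas the final runs described here use five decimals with 1001 or 16384 bins. It uses only `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

def discretize(dist, q_lo=0.001, q_hi=0.999, decimals=2, max_bins=11):
    """Return (grid, masses) for a continuous law exposing .cdf and .inv_cdf.

    Assumes the quantile-bounded interval contains at least two grid points.
    """
    lo = round(dist.inv_cdf(q_lo), decimals)
    hi = round(dist.inv_cdf(q_hi), decimals)
    step = 10 ** -decimals
    n_full = int(round((hi - lo) / step)) + 1   # size of the full decimal grid
    n = min(n_full, max_bins)                   # apply the output-space cap
    # Evenly spaced grid points, snapped to the decimal precision.
    grid = [round(lo + (hi - lo) * k / (n - 1), decimals) for k in range(n)]
    # Bin edges are midpoints between adjacent grid points; the outermost
    # edges are treated as -inf/+inf so tail mass lands in the edge bins.
    mids = [(a + b) / 2 for a, b in zip(grid, grid[1:])]
    cdf_at = [0.0] + [dist.cdf(m) for m in mids] + [1.0]
    masses = [right - left for left, right in zip(cdf_at, cdf_at[1:])]
    return grid, masses

# Demo on a standard normal (assumed example target).
grid, masses = discretize(NormalDist(mu=0.0, sigma=1.0))
```

Treating the outermost bin edges as infinite is one simple way to assign outside-interval tail mass to the edge bins, so the masses sum to one by construction.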
#### Soft\-target fine\-tuning\.
For each canonical output $y_i \in Y$, we take its induced mass $P_Y(y_i)$, tokenize it, append EOS, and insert the resulting token sequence into a prefix trie\. For a given trie prefix $p$ \(a partial token sequence\), we normalize the remaining mass over all continuations consistent with $p$ to obtain a target distribution over valid next tokens:
$$q(v \mid p)=\frac{\sum_{i:\,t_i\text{ extends }(p,v)}P_Y(y_i)}{\sum_{j:\,t_j\text{ extends }p}P_Y(y_j)}. \tag{1}$$
Let $\pi_\theta(v \mid p):=\mathrm{softmax}(\mathbf{z}_p/\tau)_v$ denote the model’s next\-token distribution at prefix $p$\. We train the model to match the trie\-derived target $q(\cdot \mid p)$ at each visited prefix by minimizing
$$\ell(p)=\mathrm{KL}\!\bigl(q(\cdot \mid p)\;\big\|\;\pi_\theta(\cdot \mid p)\bigr)=\sum_{v\in\mathcal{V}(p)}q(v \mid p)\,\log\frac{q(v \mid p)}{\pi_\theta(v \mid p)}, \tag{2}$$
where $\mathcal{V}(p)$ is the set of trie children at $p$\. To form the training loss, we sample a canonical output $y \sim P_Y$, tokenize it as $\mathbf{t}=\mathrm{Tok}(y)\circ\textsc{eos}=(t_1,\dots,t_L,\textsc{eos})$, and average the prefix loss along this sampled path:
$$\mathcal{L}_{\mathrm{soft}}=\frac{1}{L+1}\sum_{k=0}^{L}\ell\bigl((t_1,\dots,t_k)\bigr). \tag{3}$$
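The trie-derived targets and the path-averaged loss of Eqs. (1) to (3) can be illustrated on a toy output space. The sketch below is not the paper's code: it assumes character-level tokens (a real tokenizer may merge digits), and `model_probs` is a hypothetical callable that returns the model's next-token probabilities for a given prefix.

```python
import math
from collections import defaultdict

EOS = "<eos>"

def tokenize(y):
    # Assumption: one token per character, followed by EOS.
    return list(y) + [EOS]

def trie_targets(p_y):
    """Map each trie prefix (tuple of tokens) to its target q(. | prefix)."""
    mass = defaultdict(lambda: defaultdict(float))
    for y, prob in p_y.items():
        toks = tokenize(y)
        for k, tok in enumerate(toks):
            mass[tuple(toks[:k])][tok] += prob   # numerator of Eq. (1)
    # The children's masses sum to the mass of all outputs extending the prefix.
    return {
        prefix: {tok: m / sum(children.values()) for tok, m in children.items()}
        for prefix, children in mass.items()
    }

def soft_loss(path, q, model_probs):
    """Average KL(q || pi) over the prefixes of one sampled path (Eqs. 2-3)."""
    total = 0.0
    for k in range(len(path)):               # prefixes of length 0..L
        prefix = tuple(path[:k])
        pi = model_probs(prefix)
        total += sum(t * math.log(t / pi[v]) for v, t in q[prefix].items() if t > 0)
    return total / len(path)                 # len(path) = L + 1 (EOS included)

# Toy target over three canonical outputs (assumed example).
p_y = {"0": 0.5, "1": 0.25, "10": 0.25}
q = trie_targets(p_y)
```

In this toy case the root target splits mass evenly between the tokens `0` and `1`, and after the prefix `1` the target splits evenly between ending the answer and appending `0`. A model whose next-token probabilities equal `q` at every prefix attains zero loss.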
#### Hard\-target fine\-tuning\.
The soft\-target objective gives dense distributional supervision, but it requires constructing next\-token targets for every trie prefix\. We therefore also study a second strategy closer to standard supervised fine\-tuning: can the model learn the same sampling behavior from completions sampled from the target distribution? The hard\-target variant replaces prefix\-level soft supervision with supervision from sampled completions\. For each instance, we draw $y \sim P_Y$, form $\mathbf{t}=\mathrm{Tok}(y)\circ\textsc{eos}$, and optimize masked autoregressive cross\-entropy on the concatenated prompt\-completion sequence:
$$\mathcal{L}_{\mathrm{hard}}=\frac{1}{|\mathbf{t}|}\sum_{k=1}^{|\mathbf{t}|}-\log\pi_\theta\!\left(t_k \mid \mathbf{x},t_{<k}\right), \tag{4}$$
where $\mathbf{x}$ are the prompt tokens and the loss is applied only to completion tokens\. Since each example provides only one sampled path, we repeat prompts multiple times per epoch with independently resampled completions\.
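The hard-target objective of Eq. (4) can be sketched in the same style. This assumes the per-token log-probabilities of the concatenated prompt-completion sequence have already been produced by some model; `sample_completion` and `hard_loss` are hypothetical helper names, and the log-probabilities in the demo are made up for illustration.

```python
import math
import random

def sample_completion(p_y, rng):
    """Draw one canonical output string y ~ P_Y."""
    outputs = list(p_y)
    return rng.choices(outputs, weights=[p_y[y] for y in outputs], k=1)[0]

def hard_loss(token_log_probs, completion_mask):
    """Mean negative log-likelihood over completion tokens only (Eq. 4)."""
    picked = [lp for lp, keep in zip(token_log_probs, completion_mask) if keep]
    return -sum(picked) / len(picked)

rng = random.Random(0)
p_y = {"0": 0.5, "1": 0.25, "10": 0.25}
y = sample_completion(p_y, rng)   # a fresh draw is taken for every repetition

# Made-up log-probs for [prompt token, completion token, EOS]; the mask
# removes the prompt position from the loss.
loss = hard_loss([math.log(0.9), math.log(0.5), math.log(0.25)], [0, 1, 1])
```

Each draw supervises a single path, so the distribution of sampled completions only approaches $P_Y$ after many draws, which is consistent with repeating prompts with independently resampled completions.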
Both objectives are implemented with frozen\-base LoRA adapters; training details and procedural summaries are given in Section [4\.1](https://arxiv.org/html/2605.11845#S4.SS1), Appendix [B](https://arxiv.org/html/2605.11845#A2), and Appendix Algorithms [1](https://arxiv.org/html/2605.11845#alg1)–[2](https://arxiv.org/html/2605.11845#alg2)\. The key difference is supervision granularity: soft\-target fine\-tuning gives dense prefix\-level supervision, while hard\-target fine\-tuning gives sparse sampled\-path supervision and therefore requires more updates in practice\. Appendix [E](https://arxiv.org/html/2605.11845#A5) discusses the final discretization and training choices\.
## 4 Experimental Setup
We next outline the experimental protocol, including data construction, model selection, optimization settings, and downstream evaluations\.
### 4\.1 Implementation Details
#### Data\.
The training data is fully synthetic\. Our benchmark spans three tiers of increasing distributional complexity and includes 30 distribution families in total: 24 seen families used for training and six held\-out OOD families reserved for test\-time evaluation only \(Bernoulli, Poisson, Maxwell, TruncNorm, Chi, and Weibull\), which let us test transfer to unseen distribution types\. The full benchmark, parameter ranges, and train/test splits are listed in Appendix Table [5](https://arxiv.org/html/2605.11845#A2.T5)\.
For each seen training family, we discretize the parameter space on a fixed grid, yielding 1988 prompt configurations across all families\. Each prompt is paired with a canonical output space derived from the corresponding target distribution: discrete families use truncated integer support, while continuous families are quantized over a bounded interval at fixed decimal precision\. In the final training runs, both methods use five\-decimal canonical outputs; soft\-target fine\-tuning caps the output space at 1001 bins, while hard\-target fine\-tuning uses 16384 bins\. Training batches use family\-balanced ordering\. Appendix [E](https://arxiv.org/html/2605.11845#A5) discusses the corresponding discretization and training\-budget ablations\.
#### Models\.
We train and evaluate four model families spanning roughly 0\.6B to 27B parameters: Qwen3 \(Yang et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib29)\) at 0\.6B, 1\.7B, 4B, 8B, and 14B; Gemma\-3\-it \(Team et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib24)\) at 1B, 4B, 12B, and 27B; Llama\-3\.2\-Instruct \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.11845#bib.bib5)\) at 1B and 3B; and GPT\-OSS \(Agarwal et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib1)\) at 20B\. For every model, we report three conditions under the same prompting and decoding settings \(temperature 1, top\-$p=1$, independent single\-sample requests\): the original checkpoint \(Base\), a soft\-target Calibration Fine\-Tuning adapter \(Soft\), and a hard\-target Calibration Fine\-Tuning adapter \(Hard\)\.
#### Details\.
Both methods use frozen\-base LoRA adapters with rank 16, alpha 32, and dropout 0\.05, applied to the query, key, value, and output attention projections\. We optimize with AdamW using learning rate $2\times 10^{-4}$, weight decay 0\.01, and a cosine schedule with 3% linear warmup; all runs use maximum sequence length 256, per\-device batch size 8, and gradient accumulation 1 on 4 A100 GPUs\. Soft\-target fine\-tuning trains for 3 epochs, corresponding to 189 optimizer steps per model\. Because hard\-target supervision is sparser, hard\-target fine\-tuning uses 16 sampled completions per prompt per epoch and trains for 2 epochs, yielding 1988 optimizer steps per model\. Appendix [B](https://arxiv.org/html/2605.11845#A2) provides the remaining implementation details\.
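The reported step counts are consistent with a simple back-of-the-envelope calculation. The sketch below assumes that the four GPUs run data-parallel, giving an effective batch size of 8 × 4 × 1 = 32, and that a final partial batch still counts as one optimizer step; neither detail is stated explicitly in the text.

```python
import math

prompts = 1988                    # prompt configurations across seen families
per_device_batch = 8
num_gpus = 4
grad_accum = 1
# Assumption: plain data parallelism, so per-step examples multiply out.
effective_batch = per_device_batch * num_gpus * grad_accum

# Soft-target: one example per prompt per epoch, 3 epochs.
soft_steps = 3 * math.ceil(prompts / effective_batch)

# Hard-target: 16 sampled completions per prompt per epoch, 2 epochs.
hard_steps = 2 * math.ceil(16 * prompts / effective_batch)

print(soft_steps, hard_steps)     # → 189 1988
```

Under these assumptions the hard-target configuration performs roughly ten times as many optimizer updates as the soft-target one.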
### 4\.2 Evaluation Axes
We organize evaluation around six axes, moving from probabilistic sampling in mathematical spaces, to natural\-language stochastic behavior, novelty and creativity, and capability retention\.
#### Structured distribution sampling\.
This is the primary benchmark\. Using the same prompt format introduced in Section [3](https://arxiv.org/html/2605.11845#S3), we evaluate each model along two complementary axes\. At the logit level, we compute the forward KL between the model’s next\-token distribution and the trie\-derived target $q(v \mid p)$, averaging over prefixes along 4 Monte Carlo sampled candidate paths per prompt\. For all reported results, this target is built from a shared high\-resolution evaluation output space with five\-decimal canonical outputs and max bins $=16384$\. This common reference makes logit KL comparable across Base, Soft, and Hard, but is stricter than the 1001\-bin output space used to train the soft\-target configuration\. At the sample level, we draw 1,000 independent generations per prompt, parse valid numeric outputs, and compare the resulting empirical samples against the reference SciPy distribution using valid rate and Wasserstein\-1 distance,
$$W_1=\frac{1}{N}\sum_{i=1}^{N}\left|x_{(i)}-{F^{*}}^{-1}\!\left(\tfrac{i-0.5}{N}\right)\right|, \tag{5}$$
where $x_{(i)}$ is the $i$\-th order statistic of the valid samples, $N$ is the number of valid samples, and ${F^{*}}^{-1}$ is the quantile function of the reference distribution\. When no valid numeric sample is obtained for a prompt configuration, the corresponding $W_1$ is undefined\. In aggregate summaries, we average finite normalized $W_1$ estimates within each family and report the median across families whenever every family has at least one finite estimate\. For the aggregate summaries in Figure [2](https://arxiv.org/html/2605.11845#S5.F2) and Appendix Tables [6](https://arxiv.org/html/2605.11845#A4.T6)–[7](https://arxiv.org/html/2605.11845#A4.T7), we report a scale\-normalized variant: for each prompt, $W_1$ is divided by the target $Q_{95}-Q_{05}$ width, then averaged within family and aggregated across families with a median\. We also evaluate String Seed of Thought \(SSOT\) prompting \(Misaki and Akiba, [2025](https://arxiv.org/html/2605.11845#bib.bib18)\) as an inference\-time baseline under the same sample\-level metric, with details reported in Appendix [D\.3](https://arxiv.org/html/2605.11845#A4.SS3)\.
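The sample-level metric of Eq. (5) and its scale-normalized variant can be sketched as below. The function names are illustrative, and the sketch uses the standard-library `statistics.NormalDist` as a stand-in target where the paper uses SciPy reference distributions.

```python
from statistics import NormalDist

def w1_to_target(samples, inv_cdf):
    """Eq. (5): mean absolute gap between sorted samples and target quantiles."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        return float("nan")       # undefined when no valid sample exists
    return sum(abs(x - inv_cdf((i - 0.5) / n))
               for i, x in enumerate(xs, start=1)) / n

def normalized_w1(samples, inv_cdf):
    """Scale-normalized variant: divide W1 by the target Q95 - Q05 width."""
    return w1_to_target(samples, inv_cdf) / (inv_cdf(0.95) - inv_cdf(0.05))

target = NormalDist(mu=0.0, sigma=1.0)
# Samples placed exactly at the target's mid-quantiles, then shifted by +1.
exact = [target.inv_cdf((i - 0.5) / 5) for i in range(1, 6)]
shifted = [x + 1.0 for x in exact]
```

Samples sitting exactly on the target quantiles give a distance of zero, while shifting every sample by one unit gives a distance of one. Dividing by the $Q_{95}-Q_{05}$ width makes values comparable across targets with different scales.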
#### Open\-ended random generation\.
We test whether fine\-tuning broadens stochastic support and increases empirical diversity on a 102\-prompt benchmark of open\-ended random\-generation requests that we constructed for this work, spanning categories such as names, cities, animals, foods, chemical elements, medical concepts, and landmarks\. Prompt wording varies across formulations such as "Think of," "Choose," "Name," and "Pick," while consistently enforcing a strict output contract \("Output ONLY the answer"\); the full prompt set is given in Appendix [C\.3](https://arxiv.org/html/2605.11845#A3.SS3)\. For each prompt, we collect 100 independent samples and measure two quantities: the top\-90% support size, i\.e\., the number of first\-step next tokens required to cover 90% of the model’s probability mass, and the unique\-output fraction, i\.e\., the fraction of distinct samples after normalization\.
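Both diversity measures are simple to compute once the first-step next-token probabilities and the sampled outputs are available. The sketch below is illustrative only: the function names are hypothetical, and the normalization rule (strip whitespace, lowercase, drop a trailing period) is an assumption, since the exact rule is not specified in this section.

```python
def top_mass_support_size(probs, mass=0.9):
    """Smallest number of tokens whose probabilities cover at least `mass`."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= mass - 1e-12:    # small tolerance for float round-off
            return count
    return len(probs)

def unique_output_fraction(samples):
    """Fraction of distinct samples after an assumed simple normalization."""
    normalized = [s.strip().lower().rstrip(".") for s in samples]
    return len(set(normalized)) / len(normalized)

peaked = top_mass_support_size([0.85, 0.10, 0.05])
flat = top_mass_support_size([0.1] * 10)
frac = unique_output_fraction(["Paris", "paris.", " Rome", "Oslo"])
```

A peaked first-token distribution needs only a couple of tokens to reach 90% of the mass, whereas a uniform distribution over ten tokens needs nine, which is why the support size acts as a proxy for how broadly the model spreads probability.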
#### NoveltyBench\.
We also evaluate open\-ended generative breadth on NoveltyBench \(Zhang et al\., [2025](https://arxiv.org/html/2605.11845#bib.bib31)\) using its curated and WildChat splits \(100 and 1,000 prompts respectively\)\. To remain faithful to the original benchmark, we draw 10 generations per prompt with temperature 1 and top\-$p=1$, partition responses with the benchmark’s classifier\-based procedure \(a lexical short\-circuit for very short answers and otherwise `yimingzhang/deberta-v3-large-generation-similarity`\), and score them with `Skywork/Skywork-Reward-Gemma-2-27B-v0.2`\. Raw rewards are converted to the benchmark’s 1–10 ratings, and utility credits only the first response in each semantic partition under the benchmark’s patience\-based discounting\. We report two split\-level summary metrics: *mean distinct*, the average number of semantic partitions per prompt, and *mean utility*, the benchmark’s reward\-weighted summary\.
#### MCQ generation with uniform answer positions\.
We adopt the MCQ answer\-position balancing protocol of Zhao et al\. \([2026](https://arxiv.org/html/2605.11845#bib.bib32)\)\. Models are asked to generate medical multiple\-choice questions under a prompt that encourages approximately uniform placement of the correct answer among A/B/C/D; the full prompt is given in Appendix [C\.4](https://arxiv.org/html/2605.11845#A3.SS4)\. We generate 1,000 independent MCQs per model and first report the MCQ parse rate, i\.e\., the fraction of generations that can be parsed as MCQs with a question, four A/B/C/D options, and a correct\-answer field in A/B/C/D\. Among these parseable generations, we then compute the total variation \(TV\) distance between the empirical answer\-position frequencies and the uniform distribution\.
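The answer-position metric reduces to a total variation distance over four categories. The sketch below assumes the correct-answer letters have already been parsed from the generations; the function name and the choice to return `None` when nothing is parseable are illustrative.

```python
from collections import Counter

def answer_position_tv(answers, options=("A", "B", "C", "D")):
    """TV distance between empirical answer positions and the uniform law."""
    counts = Counter(a for a in answers if a in options)
    n = sum(counts.values())
    if n == 0:
        return None                  # no parseable generation
    uniform = 1.0 / len(options)
    return 0.5 * sum(abs(counts[o] / n - uniform) for o in options)

collapsed = answer_position_tv(["A"] * 8)
balanced = answer_position_tv(["A", "B", "C", "D"] * 2)
skewed = answer_position_tv(["A", "A", "B", "C"])
```

With four options the distance ranges from 0 for perfectly balanced positions to 0.75 when every correct answer lands on the same letter.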
#### Capability retention\.
We evaluate whether fine\-tuning degrades general capabilities using the TinyBenchmarks suite\(Polo et al\.,[2024](https://arxiv.org/html/2605.11845#bib.bib22)\): tinyMMLU, tinyHellaSwag, tinyTruthfulQA, tinyWinoGrande, and tinyGSM8K, each consisting of 100 items drawn from the full benchmark\. We report the gp\-IRT \(Generalized Performance Item Response Theory\) aggregate score, which combines observed per\-item accuracy with IRT\-based extrapolation to estimate performance on the full benchmark from the 100\-item subset\. All retention evaluations use greedy decoding\.
#### PALOMA perplexity\.
We measure retained language\-model fit on PALOMA \(Magnusson et al\., [2024](https://arxiv.org/html/2605.11845#bib.bib17)\), reporting both perplexity and bits\-per\-byte for tokenizer\-robust comparison; details in Appendix [D\.8](https://arxiv.org/html/2605.11845#A4.SS8)\.
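To see why bits-per-byte is more robust to tokenizer differences than perplexity, consider the commonly used definitions below. This is a generic sketch under the assumption that both quantities are derived from the total negative log-likelihood in nats; it is not taken from the PALOMA codebase, and the numbers are made up.

```python
import math

def perplexity(total_nll_nats, num_tokens):
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_byte(total_nll_nats, num_bytes):
    """Total NLL converted to bits and divided by the UTF-8 byte count."""
    return total_nll_nats / (math.log(2) * num_bytes)

# Same 12-byte text and same total NLL, split into 3 vs. 4 tokens
# by two hypothetical tokenizers.
nll = 6.0
ppl_a, ppl_b = perplexity(nll, 3), perplexity(nll, 4)
bpb_a, bpb_b = bits_per_byte(nll, 12), bits_per_byte(nll, 12)
```

The two perplexities differ because they normalize by token count, while bits-per-byte normalizes by a quantity that does not depend on the tokenizer.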
## 5 Main Results
We now summarize results on structured sampling, downstream transfer, and capability retention\.
Figure 2: Main structured\-sampling result\. Left: family\-median normalized $W_1$ on held\-out OOD families; right: family\-median normalized $W_1$ on unseen parameter settings from seen families\. Calibration Fine\-Tuning consistently improves distributional fidelity across model sizes and families\. Hatched bars indicate outputs with insufficient valid parsed samples for a finite empirical estimate\.
#### Structured sampling\.
Figure [2](https://arxiv.org/html/2605.11845#S5.F2) and Appendix Tables [6](https://arxiv.org/html/2605.11845#A4.T6)–[7](https://arxiv.org/html/2605.11845#A4.T7) show that Calibration Fine\-Tuning strongly improves structured distribution sampling on both held\-out families and unseen parameter settings\. Family\-median normalized $W_1$ drops sharply for nearly every model, and trie\-based logit KL decreases by roughly an order of magnitude\. These gains are not merely formatting effects: models with weak baseline valid rates often improve substantially, while models already near\-perfect in validity still show large reductions in $W_1$ and logit KL\. Hard\-target fine\-tuning is usually strongest on this benchmark, especially on unseen parameters, while soft\-target fine\-tuning remains competitive and occasionally slightly better on held\-out families\. Figure [3](https://arxiv.org/html/2605.11845#S5.F3) gives representative held\-out OOD examples, and the appendix reports the full per\-distribution breakdowns\.
We report String Seed of Thought \(SSOT\) prompting \(Misaki and Akiba, [2025](https://arxiv.org/html/2605.11845#bib.bib18)\) as an inference\-time baseline in Appendix [D\.3](https://arxiv.org/html/2605.11845#A4.SS3)\. SSOT can improve over the base checkpoint when the model reliably follows the seed\-and\-reasoning protocol, but it is brittle, model\-dependent, and more expensive per sample than direct sampling, and thus remains weaker than Calibration Fine\-Tuning across our evaluated checkpoints\.
Figure 3: Qualitative OOD sampling examples\. Each panel overlays the target with base and Calibration Fine\-Tuning samples for one held\-out OOD configuration, using one strong model per family\.
#### Open\-ended random generation\.
Table 1: Generative\-distribution summary\. Support Size and Unique Output Rate measure open\-ended random\-generation diversity; MCQ TV measures answer\-position balance over parseable generations; NoveltyBench Utility is the benchmark’s patience\-discounted reward metric\. MCQ TV is omitted when no generation is parseable\. Base is the original checkpoint; Soft and Hard are the two Calibration Fine\-Tuning variants\. Bold denotes the best value for each available metric\.
Calibration Fine\-Tuning also transfers beyond explicit distribution\-name prompts\. Table [1](https://arxiv.org/html/2605.11845#S5.T1) summarizes the aggregate open\-ended random\-generation results\. The broadest and most consistent effect is on stochastic support: soft\-target fine\-tuning increases the top\-90% support size for every model, often by one to two orders of magnitude, and is the best variant on this metric for all 12 models\. Hard\-target fine\-tuning also broadens support substantially, but is usually less extreme than soft\-target fine\-tuning\.
The unique\-output results are more selective\. For the stronger Qwen and Gemma checkpoints, broader support often translates into genuinely more diverse repeated generations, again with soft\-target fine\-tuning usually strongest\. However, support expansion does not uniformly improve useful diversity, especially in smaller or less stable models\. Overall, under the selected configurations, soft\-target fine\-tuning is more effective at broadening open\-ended stochastic support, though the quality of that broadened support remains model\-dependent\. Appendix Figures [9](https://arxiv.org/html/2605.11845#A4.F9) and [10](https://arxiv.org/html/2605.11845#A4.F10) provide the fuller per\-prompt breakdowns and qualitative repeated\-sampling examples\.
#### NoveltyBench\.
NoveltyBench tests whether broader stochastic support translates into semantically useful open\-ended generation under the benchmark’s original partition\-and\-reward pipeline\. Table [1](https://arxiv.org/html/2605.11845#S5.T1) shows that the broad pattern again favors soft\-target fine\-tuning, which is best on overall utility for 8 of the 12 models and is especially strong for medium and large Qwen and Gemma checkpoints\. The gains are not uniform: GPT\-OSS\-20B is the clearest counterexample, where both fine\-tuned variants increase distinctness but reduce utility relative to the base checkpoint, and Qwen3\-0\.6B shows a similar failure mode\. Overall, NoveltyBench strengthens the transfer case for the selected soft\-target configuration, while also showing that higher semantic spread is only valuable when it remains aligned with utility\. Appendix [D\.5](https://arxiv.org/html/2605.11845#A4.SS5) provides the split\-level results and representative qualitative cases\.
#### MCQ answer\-position balance\.
MCQ transfer is more mixed than open\-ended random generation\. Table [1](https://arxiv.org/html/2605.11845#S5.T1) shows that both variants usually improve answer\-position balance within the Qwen family, with soft\-target fine\-tuning strongest for most checkpoints and hard\-target fine\-tuning slightly better at 8B\. Outside Qwen, transfer is less uniform: hard\-target fine\-tuning is often more reliable for Gemma, GPT\-OSS\-20B improves under both variants, and the Llama checkpoints remain unstable\. Overall, MCQ transfer is real but not robust across families, and it does not consistently favor one variant\. For the full parse\-rate and TV breakdown, see Appendix [D\.6](https://arxiv.org/html/2605.11845#A4.SS6)\.
#### Capability retention\.
Table 2: Retention summary\. TinyBenchmarks gp\-IRT measures downstream retention; PALOMA perplexity measures held\-out language\-model fit\. Base is the original checkpoint; Soft and Hard are the two Calibration Fine\-Tuning variants\. Bold denotes the best value for each metric\.

Table [2](https://arxiv.org/html/2605.11845#S5.T2) reports TinyBenchmarks gp\-IRT together with PALOMA perplexity, and Appendix Table [21](https://arxiv.org/html/2605.11845#A4.T21) gives the full TinyBenchmarks breakdown\. The retention picture is mixed and less uniformly positive than the structured and transfer results above\. On aggregate gp\-IRT, the base checkpoint remains best for most models, especially among the smaller Qwen and Gemma checkpoints, both Llama models, and GPT\-OSS\-20B\. A few medium and large models do improve under Calibration Fine\-Tuning, notably Qwen3\-8B, Qwen3\-14B, Gemma\-3\-12B\-it, and Gemma\-3\-27B\-it\. At the task level, MMLU, HellaSwag, and WinoGrande improve modestly on average, while TruthfulQA is nearly flat\. The clearest systematic cost is GSM8K, where strict and flexible scoring regress substantially\. Appendix [D\.7](https://arxiv.org/html/2605.11845#A4.SS7) provides a deeper task\-level breakdown\.
#### PALOMA perplexity\.
PALOMA gives a more favorable language\-modeling view\. Under the selected training budgets, soft\-target fine\-tuning gives the best aggregate perplexity for all Qwen and Gemma models, while hard\-target fine\-tuning gives the clearest gains for GPT\-OSS\-20B and Llama\-3\.2\-3B\-it\. This argues against the simple interpretation that Calibration Fine\-Tuning merely flattens token probabilities across the vocabulary\. For a deeper analysis of PALOMA, see Appendix [D\.8](https://arxiv.org/html/2605.11845#A4.SS8)\.
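The flattening argument can be illustrated with the standard definition of perplexity as the exponential of the mean negative log\-likelihood\. The sketch below uses hypothetical per\-token log\-probabilities and is not PALOMA's evaluation pipeline, whose aggregation may differ: a model that spreads probability mass broadly assigns lower probability to the observed tokens and therefore scores a higher perplexity\.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical natural-log probabilities assigned to the same held-out tokens.
sharp = [math.log(0.5)] * 8   # model puts probability 0.5 on each observed token
flat = [math.log(0.01)] * 8   # flattened model puts only 0.01 on each observed token
print(perplexity(sharp))  # close to 2
print(perplexity(flat))   # close to 100
```

If fine\-tuning had simply flattened the next\-token distribution, held\-out perplexity would be expected to rise rather than stay level or fall\.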
#### Hyperparameter ablations\.
Appendix [E](https://arxiv.org/html/2605.11845#A5) reports the ablations used to choose the final discretization and training budgets\. The selected settings are five\-decimal canonical outputs with max bins $=1001$ for soft\-target fine\-tuning, and five\-decimal canonical outputs with max bins $=16384$, 16 sampled completions per prompt, and 2 epochs for hard\-target fine\-tuning\. These choices reflect the best overall tradeoff in held\-out and unseen\-parameter performance rather than monotonic preferences\.
## 6 Discussion and Limitations
#### What does Calibration Fine\-Tuning seem to teach?
Our main conclusion is that probabilistic calibration is trainable\. Across 12 models, Calibration Fine\-Tuning sharply improves structured distribution sampling on both held\-out families and unseen parameter settings, reducing sample\-level error and logit\-level miscalibration\. Hard\-target fine\-tuning is strongest on this in\-domain benchmark, especially for unseen\-parameter generalization, showing that simple supervised training can teach substantially better stochastic fidelity than the base checkpoint\.
#### How does it transfer?
The structured gains do transfer to natural language settings, but selectively\. The clearest effect is that soft\-target fine\-tuning broadens stochastic support and often improves semantically diverse generation, as reflected by open\-ended random generation and NoveltyBench\. MCQ answer\-position balance gives a weaker but still positive signal, showing that some of this learned stochastic behavior transfers even when the target distribution is implicit in the task format\. Overall, the selected soft\-target configuration gives the most reliable language\-space transfer, while the selected hard\-target configuration remains strongest on exact numeric sampling\.
#### What remains brittle?
The gains are selective and should not be overinterpreted\. Capability retention is mixed, with the clearest systematic cost on GSM8K and aggregate gp\-IRT often still favoring the base checkpoint, especially for smaller models\. One plausible mechanism is that both Calibration Fine\-Tuning variants train short direct completions without reasoning traces; after post\-training regimes that often reward long reasoning, this may shift models toward shorter generations and hurt reasoning\-heavy tasks, an effect we observe most strongly for soft\-target fine\-tuning\. Because hard\-target supervision is sampled\-path rather than dense prefix\-level supervision, our final hard\-target configuration uses more optimizer steps; retention differences between the two variants should therefore be read as configuration\-level tradeoffs, not as isolated effects of the loss objective\. A second practical limitation is that our structured targets are finite canonical approximations of the underlying SciPy laws: tail mass is assigned to edge bins to preserve total probability, which is tractable and consistent across training and evaluation but can distort heavy\-tailed targets\. However, PALOMA perplexity is frequently preserved or improved, arguing against the view that Calibration Fine\-Tuning merely flattens token probabilities\. The remaining challenge is to preserve general capabilities while combining the structured\-sampling strength of our hard\-target configuration with the broader transfer profile of its soft\-target counterpart\.
## 7 Conclusion
We studied Calibration Fine\-Tuning as a simple way to improve stochastic generation behavior in language models\. Across 12 models, both soft\-target and hard\-target fine\-tuning substantially improve structured sampling, showing that probabilistic calibration is a trainable capability\. Future work should study the mechanisms learned by each objective and develop retention\-aware variants of Calibration Fine\-Tuning\. Overall, our results show that language models can be trained to behave more like controlled probabilistic samplers, with gains that extend beyond the synthetic distributions used for supervision\.
## Acknowledgements
Sarath Chandar is supported by the Canada CIFAR AI Chairs program, the Canada Research Chair in Lifelong Machine Learning, and the NSERC Discovery Grant\. Experiments were conducted using computational resources provided by Mila Quebec AI Institute\.
Appendix
## Appendix A Calibration Fine\-Tuning Algorithms
Algorithm [1](https://arxiv.org/html/2605.11845#alg1) formalizes the soft\-target Calibration Fine\-Tuning pipeline described in Section [3](https://arxiv.org/html/2605.11845#S3)\. The first phase constructs a prefix trie from the discretized target distribution and derives per\-prefix next\-token targets\. The second phase samples a training path through the trie and computes the prompt\-conditioned calibration loss\. Algorithm [2](https://arxiv.org/html/2605.11845#alg2) describes the hard\-target variant, which reuses the same prompt distributions and canonical output space, but replaces trie\-derived soft targets with repeated sampled completions and standard completion cross\-entropy\.
Algorithm 1: Soft\-target Calibration Fine\-Tuning

Require: Prompt text $u$, target law $P$, tokenizer $\mathcal{T}$, model $\pi_{\theta}$, temperature $\tau$

1. // Build output space and prefix trie
2. Discretize $P$ into canonical outputs $Y=\{y_1,\dots,y_M\}$ with induced masses $P_Y(y_i)$
3. for each $y_i\in Y$ do
4. Tokenize: $\mathbf{s}_i\leftarrow\mathcal{T}(y_i)\circ\langle\textsc{eos}\rangle$
5. for each proper prefix $p$ of $\mathbf{s}_i$ do
6. $\texttt{prefix\_mass}[p]\mathrel{+}=P_Y(y_i)$
7. $\texttt{next\_mass}[p][\mathbf{s}_i[|p|+1]]\mathrel{+}=P_Y(y_i)$
8. end for
9. end for
10. for each prefix $p$ in trie do
11. $q(v\mid p)\leftarrow\texttt{next\_mass}[p][v]\;/\;\texttt{prefix\_mass}[p]$ for each child $v$
12. end for
13. // Training step
14. $\mathbf{x}\leftarrow\mathcal{T}(u)$
15. Sample $y\sim P_Y$; let $\mathbf{t}=(t_1,\dots,t_L,\langle\textsc{eos}\rangle)=\mathcal{T}(y)\circ\langle\textsc{eos}\rangle$
16. for $k=0$ to $L$ do
17. $p\leftarrow(t_1,\dots,t_k)$
18. Compute $\pi_{\theta}(\cdot\mid\mathbf{x},p)\leftarrow\mathrm{softmax}(f_{\theta}(\mathbf{x},p)/\tau)$
19. $\ell_k\leftarrow\mathrm{KL}\!\left(q(\cdot\mid p)\,\|\,\pi_{\theta}(\cdot\mid\mathbf{x},p)\right)$
20. end for
21. $\mathcal{L}_{\mathrm{soft}}\leftarrow\frac{1}{L+1}\sum_{k=0}^{L}\ell_k$
22. Update LoRA parameters via $\nabla\mathcal{L}_{\mathrm{soft}}$
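The trie construction and per\-prefix loss of Algorithm 1 can be sketched in plain Python\. This is an illustrative sketch under simplifying assumptions, not the released implementation: a character\-level stand\-in tokenizer replaces $\mathcal{T}$, the model is any callable that returns next\-token probabilities for a prefix, and the temperature $\tau$ and the LoRA update are omitted\. The names `build_soft_targets`, `kl_divergence`, and `soft_loss` are hypothetical\.

```python
import math
from collections import defaultdict

EOS = "<eos>"


def char_tokenize(text):
    # Stand-in tokenizer (assumption): one token per character.
    return list(text)


def build_soft_targets(masses, tokenize=char_tokenize):
    """Map each proper prefix (tuple of tokens) to its next-token target q(. | prefix)."""
    prefix_mass = defaultdict(float)
    next_mass = defaultdict(lambda: defaultdict(float))
    for output, mass in masses.items():
        seq = tokenize(output) + [EOS]
        for k in range(len(seq)):  # proper prefixes seq[:0] ... seq[:len(seq) - 1]
            prefix = tuple(seq[:k])
            prefix_mass[prefix] += mass
            next_mass[prefix][seq[k]] += mass
    return {
        prefix: {tok: m / prefix_mass[prefix] for tok, m in children.items()}
        for prefix, children in next_mass.items()
    }


def kl_divergence(q, pi):
    """KL(q || pi) over the support of q; pi must be positive there."""
    return sum(qv * math.log(qv / pi[tok]) for tok, qv in q.items() if qv > 0)


def soft_loss(path, targets, model_probs):
    """Average per-prefix KL along one sampled path (steps 16 to 21)."""
    seq = list(path) + [EOS]
    losses = [
        kl_divergence(targets[tuple(seq[:k])], model_probs(tuple(seq[:k])))
        for k in range(len(seq))
    ]
    return sum(losses) / len(losses)


# Toy canonical output space with induced masses P_Y.
P_Y = {"0.1": 0.2, "0.2": 0.3, "1.0": 0.5}
q = build_soft_targets(P_Y)
print(q[()])          # {'0': 0.5, '1': 0.5}
print(q[("0", ".")])  # {'1': 0.4, '2': 0.6}

# A model that reproduces the targets exactly incurs zero loss.
perfect = soft_loss(char_tokenize("0.2"), q, lambda prefix: q[prefix])
print(perfect)        # 0.0
```

Because every prefix on the sampled path receives a full next\-token target, one path supervises all sibling branches at once, which is the dense prefix\-level supervision that distinguishes the soft\-target variant\.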
Algorithm 2: Hard\-target Calibration Fine\-Tuning
Require: Training prompts $\mathcal{P}_{\mathrm{train}}$, target laws $\{P_{p}\}_{p \in \mathcal{P}_{\mathrm{train}}}$, tokenizer $\mathcal{T}$, model $\pi_{\theta}$, samples per prompt per epoch $R$, epochs $E$
1: for $e = 1$ to $E$ do
2: Form epoch prompt multiset $\tilde{\mathcal{P}}_{e} \leftarrow \bigcup_{p \in \mathcal{P}_{\mathrm{train}}} \{p\}^{R}$
3: Shuffle $\tilde{\mathcal{P}}_{e}$ with family\-balanced ordering
4: for each prompt $p \in \tilde{\mathcal{P}}_{e}$ do
5: Discretize $P_{p}$ into canonical outputs $Y_{p}$ with induced distribution $P_{Y,p}$
6: Sample canonical output $y \sim P_{Y,p}$
7: $\mathbf{x} \leftarrow \mathcal{T}(\texttt{prompt\_text}(p))$
8: $\mathbf{t} \leftarrow \mathcal{T}(y) \circ \langle\textsc{eos}\rangle$
9: Construct training sequence $\mathbf{z} \leftarrow \mathbf{x} \circ \mathbf{t}$
10: Set labels $\mathbf{m} \leftarrow (-100)^{|\mathbf{x}|} \circ \mathbf{t}$ \{mask prompt tokens\}
11: end for
12: for each minibatch $\{(\mathbf{z}^{(b)}, \mathbf{m}^{(b)})\}_{b=1}^{B}$ do
13: Compute next\-token logits on each sequence $\mathbf{z}^{(b)}$
14: $\mathcal{L}_{\mathrm{hard}} \leftarrow \mathrm{CE}(\text{shifted logits}, \text{shifted labels})$ over non\-masked positions
15: Update LoRA parameters via $\nabla \mathcal{L}_{\mathrm{hard}}$
16: end for
17: end for
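Steps 6 to 14 of Algorithm 2 can be illustrated with a small, dependency-free sketch. It assumes integer token ids and list-of-lists logits; the helper names (`build_example`, `masked_cross_entropy`, `sample_output`) are hypothetical and are not taken from the released code. Only the `-100` mask value and the shifted cross-entropy follow the algorithm as stated.

```python
import math
import random

IGNORE_INDEX = -100  # label value for masked prompt positions (step 10)

def sample_output(outputs, probs, rng):
    """Step 6: draw one canonical output y ~ P_{Y,p}."""
    return rng.choices(outputs, weights=probs, k=1)[0]

def build_example(prompt_ids, target_ids, eos_id):
    """Steps 8-10: append EOS, concatenate, and mask the prompt in the labels."""
    t = list(target_ids) + [eos_id]
    z = list(prompt_ids) + t
    m = [IGNORE_INDEX] * len(prompt_ids) + t
    return z, m

def masked_cross_entropy(logits, labels):
    """Step 14: logits at position i are scored against labels[i + 1]."""
    total, count = 0.0, 0
    for row, label in zip(logits[:-1], labels[1:]):
        if label == IGNORE_INDEX:
            continue  # prompt tokens carry no loss
        mx = max(row)
        log_z = mx + math.log(sum(math.exp(v - mx) for v in row))
        total += log_z - row[label]
        count += 1
    return total / count

# Toy example: vocabulary of 4 ids, uniform logits at every position.
z, m = build_example(prompt_ids=[0, 1], target_ids=[2], eos_id=3)
uniform_logits = [[0.0] * 4 for _ in z]
toy_loss = masked_cross_entropy(uniform_logits, m)
```

Because each sampled completion supervises a single path through the output space, repeating this with $R$ draws per prompt per epoch is what lets the empirical label distribution approach $P_{Y,p}$.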
## Appendix BImplementation and Reproducibility Details
### B\.1Training and Data Construction
Tables [3](https://arxiv.org/html/2605.11845#A2.T3)–[4](https://arxiv.org/html/2605.11845#A2.T4) consolidate the training and data\-construction settings described in Section [4\.1](https://arxiv.org/html/2605.11845#S4.SS1)\. Both Calibration Fine\-Tuning variants share the same model loading procedure, frozen\-base LoRA setup, optimizer, FSDP/bf16 distributed training setup, and family\-balanced batching; method\-specific settings differ only where required by the supervision format\.
For data construction, both methods use the same 1988 training prompts\. For continuous distributions, we first intersect the quantile interval $[Q(0.001), Q(0.999)]$ with any finite support bounds, then form the decimal grid at precision $d$\. If this grid contains more than the configured maximum number of bins, we retain evenly spaced grid points including the endpoints\. Bin boundaries are the midpoints between retained centers, and $P_{Y}$ is computed from CDF differences across these bins; probability mass below the lower endpoint and above the upper endpoint is assigned to the first and last bins, respectively\. For integer\-valued distributions, we enumerate the finite integer support when available; otherwise, we use quantile\-derived integer bounds, cap the resulting support if needed, and assign any upper\-tail mass to the last retained value\. The final canonical precision is $d=5$ decimals for both variants, with method\-specific caps of 1001 bins for soft\-target fine\-tuning and 16384 bins for hard\-target fine\-tuning\.
Thus, the structured\-sampling objectives are defined with respect to a finite canonical approximation of the underlying SciPy law; edge\-bin tail assignment is a practical choice that preserves total probability mass but can distort tail behavior for heavy\-tailed targets\.
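The continuous-distribution discretization described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `statistics.NormalDist` stands in for a frozen SciPy distribution, the function name `discretize` is hypothetical, and the way grid endpoints are snapped to the decimal grid (ceiling for the lower quantile, floor for the upper) is a guess that the text does not specify.

```python
import math
from statistics import NormalDist

def discretize(dist, d=2, max_bins=11, support=(-math.inf, math.inf)):
    """Return (centers, probs) for a decimal grid of precision d."""
    step = 10 ** -d
    # Quantile interval [Q(0.001), Q(0.999)] intersected with the support.
    lo = max(dist.inv_cdf(0.001), support[0])
    hi = min(dist.inv_cdf(0.999), support[1])
    # Grid points as integer multiples of 10^-d (endpoint snapping is assumed).
    first, last = math.ceil(lo / step), math.floor(hi / step)
    n = last - first + 1
    if n > max_bins:
        # Keep evenly spaced grid points, including both endpoints.
        idx = sorted({first + round(i * (n - 1) / (max_bins - 1))
                      for i in range(max_bins)})
    else:
        idx = list(range(first, last + 1))
    centers = [round(i * step, d) for i in idx]
    # Bin boundaries are midpoints between retained centers.
    edges = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    # Padding the CDF with 0 and 1 assigns all tail mass to the edge bins.
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    probs = [upper - lower for lower, upper in zip(cdf, cdf[1:])]
    return centers, probs

centers, probs = discretize(NormalDist(0.0, 1.0), d=2, max_bins=11)
```

The edge-bin assignment is visible in the padded `cdf` list: the first and last bins absorb everything outside the retained grid, which preserves total mass but inflates those two bins for heavy-tailed targets.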
Table 3: Shared training and data\-construction settings for both Calibration Fine\-Tuning variants\. These values are held fixed across model families with no per\-model tuning\.

Table 4: Method\-specific training hyperparameters\. Hard\-target fine\-tuning uses more optimizer steps because each completion provides sparse sampled\-path supervision rather than dense prefix\-level targets\.

All final fine\-tuning runs used one internal GPU\-cluster node with 4 NVIDIA A100\-SXM4\-80GB GPUs, 24 CPU cores, and approximately 1TB system memory, using FSDP/bf16 training\. W&B\-recorded optimizer\-loop wall\-clock times ranged from 0\.7 to 14\.9 minutes for soft\-target fine\-tuning \(median 1\.8 minutes; 2\.6 A100 GPU\-hours total across 12 models\) and from 6\.5 to 154\.9 minutes for hard\-target fine\-tuning \(median 18\.3 minutes; 26\.7 A100 GPU\-hours total\)\. Final evaluation jobs used the same cluster; summed per\-run wall\-clock estimates were 4\.5h for structured sampling, 3\.1h for open\-ended random generation, 17\.6h for MCQ generation, 4\.8h for TinyBenchmarks, 11\.5h for PALOMA, and 194\.0h for NoveltyBench, with NoveltyBench dominated by generation, semantic partitioning, and reward\-model scoring\.
#### Assets\.
We use public model checkpoints and evaluation assets: Qwen3, Gemma\-3\-it, Llama\-3\.2\-Instruct, GPT\-OSS, NoveltyBench, TinyBenchmarks, PALOMA, the MCQ protocol of Zhao et al\. \[[2026](https://arxiv.org/html/2605.11845#bib.bib32)\], and the evaluator checkpoints used by NoveltyBench\. All are cited in the main text or appendix, and the released artifact lists the exact checkpoints, public sources, and upstream access or license requirements\. Our training prompts and structured\-sampling benchmark are synthetically generated and documented in Table [5](https://arxiv.org/html/2605.11845#A2.T5) and Appendix [C\.1](https://arxiv.org/html/2605.11845#A3.SS1)\.
### B\.2Benchmark Distribution Reference
Table [5](https://arxiv.org/html/2605.11845#A2.T5) lists all 30 distribution families used in the Calibration Fine\-Tuning benchmark, organized by tier, along with the training parameter regions and paper\-selected evaluation configurations\. Shaded rows denote held\-out OOD families excluded from training\.
Table 5: Calibration Fine\-Tuning benchmark distributions\. The SciPy Name column lists the underlying `scipy.stats` distribution used to instantiate each target family \[Virtanen et al\., [2020](https://arxiv.org/html/2605.11845#bib.bib25)\]; samples are then drawn from the corresponding frozen SciPy distribution\. For seen families, the Train column summarizes the parameter region used for supervised calibration, and the Test column gives the paper\-selected evaluation configuration used for unseen\-parameter generalization\. Shaded rows denote held\-out OOD families, which are excluded from training and evaluated only at test time\.
## Appendix CPrompt Templates
### C\.1Structured Sampling
All structured sampling training and evaluation prompts are generated from the following template:
> Generate exactly ONE random number from a \[Distribution\] distribution with parameters \[params\]\. Output ONLY the number\.
The \[Distribution\] field is the display name from Table [5](https://arxiv.org/html/2605.11845#A2.T5), and \[params\] is rendered as comma\-separated key\-value pairs in the canonical parameterization used by the corresponding distribution family\. When applicable, the prompt is wrapped in the model’s native chat template; for reasoning\-capable models, we use the same reasoning\-suppression settings as in training and evaluation\. Generations are counted as valid numeric outputs if, after stripping any reasoning trace, they parse as a finite decimal or scientific\-notation literal\.
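The validity rule in the last sentence can be sketched as a small parser. This is an illustrative assumption-laden version: the `<think>...</think>` delimiters for the reasoning trace and the exact literal grammar are hypothetical choices, since the paper does not spell out either, and the name `parse_sample` is not from the released code.

```python
import math
import re

# Decimal or scientific-notation literal, e.g. "3", "-0.25", ".5", "1e-3".
_NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
# Hypothetical reasoning-trace delimiters; real models may use other markers.
_REASONING = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_sample(generation):
    """Return the parsed float, or None if the output is not a valid number."""
    text = _REASONING.sub("", generation).strip()
    if not _NUMERIC.fullmatch(text):
        return None  # extra words, units, or multiple numbers are rejected
    value = float(text)
    return value if math.isfinite(value) else None  # overflow to inf is invalid
```

Using `fullmatch` rather than `search` is the design choice that enforces "Output ONLY the number": a generation such as "about 3" is rejected instead of being silently repaired.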
### C\.2String Seed of Thought Baseline
For the String Seed of Thought baseline \[Misaki and Akiba, [2025](https://arxiv.org/html/2605.11845#bib.bib18)\], we keep the original task prompt unchanged as the user message and add a benchmark\-specific system or developer instruction\. In the structured\-sampling setting, this instruction asks the model to first generate a complex random string, use it as the seed for the requested probabilistic decision, place the seed and reasoning in intermediate tags, and emit the final sampled value in an `<answer>` tag\. For GPT\-OSS, the same idea is implemented through its native reasoning protocol: the seed and reasoning are kept in analysis, and the final channel is instructed to contain only the sampled answer\. SSOT generations use `max_new_tokens=2048` to accommodate the seed and reasoning trace\. We then extract the final answer from the requested answer channel or tag before applying the same parsing and evaluation code used for direct sampling\. Results for this baseline are reported in Appendix [D\.3](https://arxiv.org/html/2605.11845#A4.SS3)\.
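The tag-based extraction step can be sketched as below. Taking the last `<answer>` tag when several appear is an assumption made for this illustration, and `extract_answer` is a hypothetical helper name; the channel-based GPT-OSS path is not covered.

```python
import re

_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(generation):
    """Return the content of the last <answer> tag, or None if there is none."""
    matches = _ANSWER.findall(generation)
    return matches[-1].strip() if matches else None
```

The extracted string would then go through the same numeric validity check as a direct-sampling output.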
### C\.3Open\-Ended Random Generation
All prompts follow the template:
> \[Verb\] a random \[category\]\. Output ONLY the answer\.
where \[Verb\] cycles through "Think of," "Choose," "Name," and "Pick\." Representative examples include:
> Choose a random first name\. Output ONLY the answer\. Name a random city\. Output ONLY the answer\. Think of a random animal\. Output ONLY the answer\. Pick a random musical instrument\. Output ONLY the answer\.
The full set of 102 prompts spans categories including words, names, cities, countries, animals, foods, jobs, hobbies, sports, colors, flowers, trees, diseases, body parts, musical instruments, vehicles, gemstones, programming languages, and more\.
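A minimal sketch of how such a prompt set could be assembled from the template is shown below. The verb order and the pairing of verbs with categories are assumptions for illustration; only the template string and the four verbs come from the text.

```python
from itertools import cycle

VERBS = ["Think of", "Choose", "Name", "Pick"]

def build_prompts(categories):
    """Fill the template, cycling through the four verbs in order."""
    return [f"{verb} a random {category}. Output ONLY the answer."
            for verb, category in zip(cycle(VERBS), categories)]

prompts = build_prompts(["animal", "first name", "city", "musical instrument", "color"])
```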
### C\.4MCQ Generation
The exact prompt used for MCQ generation evaluation is:
> You are a medical education expert who creates high\-quality multiple\-choice questions for medical students and professionals\. Please generate a medical multiple\-choice question \(single answer, 4 options\)\. The question should cover medical knowledge and be of moderate difficulty\. Please strictly follow this format: Question: \[Question content\] A\. \[Option A content\] B\. \[Option B content\] C\. \[Option C content\] D\. \[Option D content\] Correct Answer: \[A/B/C/D\] Explanation: \[Brief explanation\] Requirements: \(1\) The question should have practical medical value\. \(2\) All four options should be plausible with reasonable distractors\. \(3\) Only one correct answer\. \(4\) Output directly without any additional content\. \(5\) Cover different medical knowledge areas \(e\.g\., internal medicine, surgery, pharmacology, pathology, diagnostics\)\. \(6\) The correct answer should be evenly distributed among A, B, C, D options to avoid bias toward any particular option\.
For MCQ generation, we use `max_new_tokens=256`\. A regex\-based strict parser counts any generation with a missing, duplicated, or malformed field as a parse failure\. No constrained decoding is applied\.
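A strict parser of this kind might look like the following sketch. The required fields follow the parse-rate definition used in the evaluation (a question, four A/B/C/D options, and a correct-answer field in A/B/C/D); the specific regular expressions, the one-field-per-line layout, and the name `parse_mcq_answer` are assumptions, not the authors' implementation.

```python
import re

# Each required field must appear exactly once, at the start of a line.
_FIELDS = {
    "question": r"^Question:[ \t]*\S",
    "A": r"^A\.[ \t]*\S",
    "B": r"^B\.[ \t]*\S",
    "C": r"^C\.[ \t]*\S",
    "D": r"^D\.[ \t]*\S",
    "answer": r"^Correct Answer:[ \t]*[ABCD][ \t]*$",
}

def parse_mcq_answer(text):
    """Return the answer letter, or None for a missing, duplicated, or malformed field."""
    for pattern in _FIELDS.values():
        if len(re.findall(pattern, text, flags=re.MULTILINE)) != 1:
            return None
    match = re.search(r"^Correct Answer:[ \t]*([ABCD])[ \t]*$", text, flags=re.MULTILINE)
    return match.group(1)

good = ("Question: Which vitamin deficiency causes scurvy?\n"
        "A. Vitamin A\nB. Vitamin B12\nC. Vitamin C\nD. Vitamin D\n"
        "Correct Answer: C\nExplanation: Vitamin C is required for collagen synthesis.")
```

Counting matches rather than searching for the first one is what makes duplicated fields fail, which matters because a truncated or restarted generation can otherwise look valid.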
## Appendix DAdditional Experimental Results
### D\.1Structured Distribution Sampling: Held\-Out Families
Table [6](https://arxiv.org/html/2605.11845#A4.T6) reports the full held\-out\-family structured\-sampling results across all 12 models\. Figures [4](https://arxiv.org/html/2605.11845#A4.F4)–[5](https://arxiv.org/html/2605.11845#A4.F5) give per\-distribution breakdowns across the six held\-out OOD families \(Bernoulli, Poisson, Maxwell, TruncNorm, Chi, and Weibull\)\. Hatched bars in the $W_{1}$ plots indicate distribution/model pairs for which no finite $W_{1}$ estimate is available because no valid parsed samples were obtained\. Unless otherwise stated, all appendix logit\-KL results use the same trie\-derived evaluation target as the main text, built from five\-decimal canonical outputs with max bins $=16384$\. Because this evaluation target is shared across methods, logit KL is directly comparable across Base, Soft, and Hard; however, it is intentionally higher\-resolution than the 1001\-bin target used during soft\-target training\.

Table 6: Structured\-sampling results on held\-out distribution families\. For each model, Base denotes the original checkpoint, Soft the soft\-target Calibration Fine\-Tuning checkpoint, and Hard the hard\-target Calibration Fine\-Tuning checkpoint\. OOD $W_{1}$ is the $(Q_{95}-Q_{05})$\-normalized Wasserstein\-1 distance, averaged within family and then aggregated across families with a median over 18 held\-out configurations from six OOD families; Logit KL is the corresponding trie\-based logit\-evaluation average, computed against the shared five\-decimal, max\-bins\-16384 evaluation target; Valid rate is the fraction of generations that parse to in\-support numeric outputs\.

![[Uncaptioned image]](https://arxiv.org/html/2605.11845v1/x3.png)

Figure 4: Per\-distribution median normalized $W_{1}$ on the held\-out\-family structured\-sampling benchmark\. Each panel corresponds to one held\-out OOD distribution and compares sample fidelity for the base, soft\-target, and hard\-target models\.

Figure 5: Per\-distribution logit\-level KL on the held\-out\-family structured\-sampling benchmark, computed against the shared five\-decimal, max\-bins\-16384 trie target\. Each panel isolates one held\-out OOD distribution and compares token\-level calibration for the base, soft\-target, and hard\-target models\.
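The sample-level metric can be illustrated with a short sketch of a quantile-range-normalized Wasserstein-1 distance in one dimension. It assumes that the $Q_{95}-Q_{05}$ range is taken from the target distribution and that the target is available as a discrete set of values with probabilities; both are assumptions for this example, and the function names are hypothetical.

```python
def wasserstein_1(values_a, weights_a, values_b, weights_b):
    """W1 between two weighted 1-D point sets as the integral of |F_a - F_b|."""
    mass_a, mass_b = {}, {}
    for v, w in zip(values_a, weights_a):
        mass_a[v] = mass_a.get(v, 0.0) + w
    for v, w in zip(values_b, weights_b):
        mass_b[v] = mass_b.get(v, 0.0) + w
    points = sorted(set(mass_a) | set(mass_b))
    cdf_a = cdf_b = dist = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a += mass_a.get(left, 0.0)
        cdf_b += mass_b.get(left, 0.0)
        dist += abs(cdf_a - cdf_b) * (right - left)
    return dist

def quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches q."""
    acc = 0.0
    for v, w in sorted(zip(values, weights)):
        acc += w
        if acc >= q - 1e-12:
            return v
    return max(values)

def normalized_w1(samples, target_values, target_probs):
    """W1 between empirical samples and the target, divided by Q95 - Q05."""
    n = len(samples)
    raw = wasserstein_1(samples, [1.0 / n] * n, target_values, target_probs)
    scale = (quantile(target_values, target_probs, 0.95)
             - quantile(target_values, target_probs, 0.05))
    return raw / scale

# Target: uniform on the integers 0..10; samples: the same grid shifted by 1.
target_values = list(range(11))
target_probs = [1.0 / 11] * 11
shifted = normalized_w1(list(range(1, 12)), target_values, target_probs)
```

Dividing by the quantile range makes the distance scale-free, so a shift of one unit counts the same whether the target spans ten units or ten thousand.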
### D\.2Structured Distribution Sampling: Unseen Parameters
Table [7](https://arxiv.org/html/2605.11845#A4.T7) reports the unseen\-parameter evaluation, where models are tested on parameter settings outside the training grid for seen distribution families\. Figures [7](https://arxiv.org/html/2605.11845#A4.F7) and [6](https://arxiv.org/html/2605.11845#A4.F6) give per\-distribution breakdowns of sample\-level $W_{1}$ and logit\-level KL, respectively\. Entries marked “—” in Table [7](https://arxiv.org/html/2605.11845#A4.T7) indicate aggregate $W_{1}$ values that are undefined because at least one unseen family has no finite $W_{1}$ estimate\.

Table 7: Structured\-sampling results on unseen parameter settings from seen distribution families\. For each model, Base denotes the original checkpoint, Soft the soft\-target Calibration Fine\-Tuning checkpoint, and Hard the hard\-target Calibration Fine\-Tuning checkpoint\. Unseen\-parameter $W_{1}$ is the $(Q_{95}-Q_{05})$\-normalized Wasserstein\-1 distance, averaged within family and then aggregated across families with a median; Logit KL is the corresponding trie\-based logit\-evaluation average, computed against the shared five\-decimal, max\-bins\-16384 evaluation target; Valid rate is the fraction of generations that parse to in\-support numeric outputs\.

Figure 6: Per\-distribution logit\-level KL on the unseen\-parameter structured\-sampling benchmark, computed against the shared five\-decimal, max\-bins\-16384 trie target\. Each panel shows calibration quality for one unseen parameter setting under the base, soft\-target, and hard\-target models\.

Figure 7: Per\-distribution median normalized $W_{1}$ on the unseen\-parameter structured\-sampling benchmark\. Each panel corresponds to one unseen parameter setting and compares sample fidelity for the base, soft\-target, and hard\-target models\.
### D\.3String Seed of Thought Results
Tables [8](https://arxiv.org/html/2605.11845#A4.T8) and [9](https://arxiv.org/html/2605.11845#A4.T9) compare the base checkpoint, soft\-target Calibration Fine\-Tuning, and a structured\-sampling baseline based on String Seed of Thought \(SSOT\) prompting \[Misaki and Akiba, [2025](https://arxiv.org/html/2605.11845#bib.bib18)\]\. We follow the SSOT protocol, adapting only the prompt wrapper needed to fit the chat format of each model family\. Unlike Calibration Fine\-Tuning, SSOT does not update model parameters\. It instead asks the model to generate an internal random seed string and use that seed through an explicit reasoning trace before emitting the final sampled answer\. SSOT is a sample\-only inference\-time baseline, so we do not report trie\-based logit KL\. Entries marked “—” indicate completed runs whose aggregate $W_{1}$ is undefined because at least one family has no finite $W_{1}$ estimate\. We also use 250 generations per prompt rather than 1,000, because SSOT changes the generation protocol itself: while our main evaluation disables reasoning traces and asks the model to emit only the final sample, SSOT explicitly requires the model to generate a seed string and reason from it before producing the answer\. In practice, these reasoning traces can be verbose and model\-dependent, making each sample substantially more expensive to generate\.
Table 8: OOD structured\-sampling comparison between Base, Soft, and the SSOT inference\-time baseline\. SSOT uses 250 generations per prompt\. Lower $W_{1}$ and higher valid rate are better; “—” indicates undefined aggregate $W_{1}$\.

Table 9: Unseen\-parameter structured\-sampling comparison between Base, Soft, and the SSOT inference\-time baseline\. SSOT uses 250 generations per prompt\. Lower $W_{1}$ and higher valid rate are better; “—” indicates undefined aggregate $W_{1}$\.

Figure 8: Qualitative OOD sampling comparison for GPT\-OSS\-20B\. Each panel overlays the target distribution with empirical samples from the base checkpoint, the SSOT inference\-time baseline, and soft\-target Calibration Fine\-Tuning on the same four held\-out OOD configurations used in the main qualitative structured\-sampling figure\.

Results for this baseline are mixed, and its main limitation is reliability\. SSOT can improve sampling when the model follows the protocol well\. The strongest evidence comes from GPT\-OSS\-20B, which improves over the base checkpoint on both OOD families and unseen parameters; Qwen3\-14B also improves on both splits but with lower validity, while Gemma\-3\-27B\-it improves on held\-out OOD families but not on unseen parameters\. However, these improvements remain smaller than the one obtained by soft\-target Calibration Fine\-Tuning, as also shown qualitatively in Figure [8](https://arxiv.org/html/2605.11845#A4.F8)\. Across the other checkpoints, the behavior is less consistent\. Some models produce finite but weak gains, some preserve high validity but worsen distributional fit, and others fail through invalid outputs, zero\-valid distributions, or extreme numeric samples that make the normalized $W_{1}$ aggregate undefined or astronomically large\.
Overall, SSOT is a useful inference\-time comparison because it tests whether prompting alone can induce calibrated sampling without parameter updates\. The answer on this benchmark is only partially positive: SSOT can help for models that reliably execute the seed\-and\-reasoning protocol, but it is brittle and model\-dependent\. In contrast, soft\-target Calibration Fine\-Tuning gives a more stable intervention: it directly changes the sampling distribution, achieves lower normalized $W_{1}$ across the evaluated settings, preserves near\-perfect validity for the strongest checkpoints, and is substantially cheaper at inference time because it does not require generating a reasoning trace for each sample\. Thus, SSOT should be read as evidence that inference\-time randomness can sometimes improve over the base model, rather than as a robust substitute for training\-time calibration\.
### D\.4Open\-Ended Random Generation
Figure [9](https://arxiv.org/html/2605.11845#A4.F9) gives the prompt\-level view underlying the aggregate open\-generation results\. Each point corresponds to one prompt/model pair, comparing the base checkpoint to either soft\-target or hard\-target Calibration Fine\-Tuning\. The clearest effect is on first\-token stochastic support: soft\-target fine\-tuning increases top\-90% next\-token support on 1101/1224 prompt/model pairs, across 102 prompts and 12 model/config pairs, while hard\-target fine\-tuning does so on 1089/1224 pairs\. The effect on realized output diversity is more model\-dependent\. Unique\-output rate increases for most Qwen, Gemma, and GPT\-OSS prompts, but decreases for Llama prompts, showing that broader first\-token support does not always translate into more diverse full completions\.
Figure [10](https://arxiv.org/html/2605.11845#A4.F10) shows that these aggregate gains correspond to visibly less concentrated empirical output distributions\. For Qwen3\-14B on the weekday prompt, the base model assigns 80% of samples to Wednesday, whereas the soft\-target and hard\-target checkpoints spread mass across several weekdays, with the top answer falling to roughly 40%\. The same pattern appears in larger open supports: for Gemma\-3\-27B\-it on the city prompt, the base model places 74% of samples on its top four cities, while the soft\-target and hard\-target checkpoints reduce this concentration to 15% and 24%, respectively\. GPT\-OSS\-20B exhibits an even stronger version of this behavior on cities and mammals, where the calibrated checkpoints make the empirical distribution close to flat over the displayed labels\. Overall, the qualitative examples support the main open\-generation conclusion: Calibration Fine\-Tuning often spreads probability mass over plausible alternatives, although the effect is weaker for the Llama checkpoints\.

Figure 9: Per\-prompt open\-generation comparisons at the 90% probability\-mass threshold\. Each point is one prompt/model pair and compares first\-step next\-token support breadth or unique\-output percentage between the base model and one fine\-tuned variant \(soft\-target or hard\-target Calibration Fine\-Tuning\)\. Points are colored by model family and use marker shape to denote model size\.

Figure 10: Qualitative open\-generation examples from repeated sampling\. For selected prompts \(city, mammal, weekday\) and one strong model from each family, we compare empirical output frequencies under the base, soft\-target, and hard\-target models\. To choose which outputs to display, we take the top 10 outputs from each state, union them, rank the union by peak then total empirical frequency across the three states, and show the top 15 resulting labels\.
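The two quantities compared in this subsection can be sketched as follows. The definitions used here are assumptions consistent with the wording above (support breadth as the smallest number of tokens covering 90% of first-step probability mass, and unique-output rate as distinct completions over total completions); the precise definitions are given in the main text, and the function names are hypothetical.

```python
def top_mass_support(probs, mass=0.9):
    """Smallest number of tokens whose probabilities sum to at least `mass`."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= mass - 1e-12:  # small tolerance for float rounding
            return count
    return len(probs)

def unique_output_rate(outputs):
    """Fraction of sampled completions that are distinct."""
    return len(set(outputs)) / len(outputs)
```

A peaked next-token distribution such as `[0.8, 0.1, 0.05, 0.05]` has support 2 at the 90% threshold, while a flat distribution over four tokens has support 4. The two metrics can disagree, as with the Llama checkpoints, because broad first-token support may still collapse onto few distinct full completions.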
### D\.5NoveltyBench
Table [10](https://arxiv.org/html/2605.11845#A4.T10) reports the full NoveltyBench results across all evaluated models, split by benchmark partition\. We show both *mean distinct* and *mean utility* on the curated and WildChat splits separately to distinguish raw semantic breadth from the benchmark’s reward\-weighted overall utility\.

Table 10: NoveltyBench split\-level summary across all evaluated models\. Distinct is the mean number of semantic partitions per prompt, and Utility is the benchmark’s patience\-discounted reward\-based score\. We report both metrics separately on the curated and WildChat splits\. Base denotes the original checkpoint, Soft the soft\-target Calibration Fine\-Tuning checkpoint, and Hard the hard\-target Calibration Fine\-Tuning checkpoint\. Bold values denote the best reported result for each metric\.

The following tables provide qualitative prompt\-level examples from the curated split\. We rank prompts by prompt\-level utility deltas to surface three narrative regimes: cases where soft\-target Calibration Fine\-Tuning clearly helps, cases where the hard\-target variant is competitive or best, and failure cases where Calibration Fine\-Tuning hurts reward\-weighted utility\. For each prompt and method, we display the top five semantic\-partition representatives ranked by partition score together with that method’s prompt\-level Distinct/Utility values\.
These examples also illustrate a limitation of the NoveltyBench scoring pipeline\. Since semantic partitions and utilities are computed by the benchmark evaluator rather than by human preference, the reported Distinct/Utility values do not always match qualitative answer quality\. For example, in Table [19](https://arxiv.org/html/2605.11845#A4.T19), the base model is assigned high utility despite repeatedly concentrating on a small set of answers \(mostly Tesla Model 3 and Toyota RAV4 variants\), while the calibrated checkpoints produce visibly more distinct recommendations\.
Table 11: Representative NoveltyBench cases where soft\-target Calibration Fine\-Tuning clearly improves reward\-weighted utility\. Model: Qwen3\-14B \(Product & Purchase Recommendations\)\. Prompt: I am doing online shopping for a new leather wallet\. What’s the first store/website I should check out? Please just give me one suggestion; I don’t need to hear your whole thought process\.

Table 12: Representative NoveltyBench cases where soft\-target Calibration Fine\-Tuning clearly improves reward\-weighted utility\. Model: Gemma\-3\-27B\-it \(Subjective Rankings & Opinions\)\. Prompt: What is the coolest Pokémon from the second generation? Just give me one\.

Table 13: Representative NoveltyBench cases where soft\-target Calibration Fine\-Tuning clearly improves reward\-weighted utility\. Model: GPT\-OSS\-20B \(Creativity\)\. Prompt: Write a haiku about a whale and a walnut tree\.

Table 14: Representative cases where the hard\-target variant is competitive or best on prompt\-level NoveltyBench utility\. Model: Qwen3\-14B \(Factual Knowledge\)\. Prompt: List a capital city in Africa\.

Table 15: Representative cases where the hard\-target variant is competitive or best on prompt\-level NoveltyBench utility\. Model: Gemma\-3\-27B\-it \(Creativity\)\. Prompt: One thing you might find in a magical forest\.

Table 16: Representative cases where the hard\-target variant is competitive or best on prompt\-level NoveltyBench utility\. Model: GPT\-OSS\-20B \(Factual Knowledge\)\. Prompt: Name one person who won the Fields Medal\.

Table 17: Representative failure cases where Calibration Fine\-Tuning hurts prompt\-level utility\. Model: GPT\-OSS\-20B \(Product & Purchase Recommendations\)\. Prompt: Suggest a birthday gift for a friend’s 25th birthday\. Just one

Table 18: Representative failure cases where Calibration Fine\-Tuning hurts prompt\-level utility\. Model: Qwen3\-14B \(Character & Entity Naming\)\. Prompt: Suggest a name for a dappled\-gray filly living in the mountains\.

Table 19: Representative failure cases where Calibration Fine\-Tuning hurts prompt\-level utility\. Model: GPT\-OSS\-20B \(Product & Purchase Recommendations\)\. Prompt: What’s the best car to get in 2023? Just give me one single suggestion\. Otherwise I’ll have decision paralysis\.
### D\.6MCQ Answer\-Position Balance
Table [20](https://arxiv.org/html/2605.11845#A4.T20) reports the full MCQ answer\-position results under the prompt in Appendix [C\.4](https://arxiv.org/html/2605.11845#A3.SS4), following the protocol of Zhao et al\. \[[2026](https://arxiv.org/html/2605.11845#bib.bib32)\]\. This evaluation differs from structured sampling because the target distribution is only implicit: the prompt asks the model to generate valid medical multiple\-choice questions while distributing the correct answer approximately uniformly across A/B/C/D\. We therefore report two metrics: MCQ parse rate, the fraction of generations that can be parsed as MCQs with a question, four A/B/C/D options, and a correct\-answer field in A/B/C/D; and TV distance from the uniform answer\-position distribution, computed over parseable generations\.
The main pattern is that Calibration Fine\-Tuning often improves answer\-position balance, but the transfer is weaker than in structured numeric sampling\. Both variants reduce TV relative to the base model on most checkpoints, with the most consistent gains in the Qwen family and substantial improvements for GPT\-OSS\-20B\. Gemma is more mixed: soft\-target fine\-tuning often improves balance but can reduce parseability for larger checkpoints, while hard\-target fine\-tuning tends to preserve parseability better\. The Llama checkpoints remain unstable; in particular, TV is not reported when no generation is parseable, so low or missing TV should not be read without the parse\-rate column\. Overall, MCQ balancing provides evidence that Calibration Fine\-Tuning can transfer to implicit natural\-language randomness constraints, but this transfer is model\-family dependent and must be interpreted jointly with parseability\.
Table 20:Multiple\-choice generation results across all evaluated models\. MCQ parse rate is the fraction of generations that can be parsed as MCQs with a question, four A/B/C/D options, and a correct\-answer field in A/B/C/D\. MCQ TV is the total variation distance between answer\-position frequencies and the uniform distribution over parseable generations, and is omitted when no generation is parseable\.
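The MCQ TV metric defined in the caption can be written out directly. The sketch below takes the list of parsed answer letters from parseable generations and returns `None` when there are none, mirroring the omission of TV in that case; the function name is hypothetical.

```python
from collections import Counter

def tv_from_uniform(answers, options="ABCD"):
    """Total variation distance between answer-position frequencies and uniform."""
    if not answers:
        return None  # no parseable generation: TV is not reported
    counts = Counter(answers)
    n = len(answers)
    uniform = 1.0 / len(options)
    return 0.5 * sum(abs(counts.get(o, 0) / n - uniform) for o in options)
```

A model that always answers "A" scores 0.75, the maximum for four options, while a perfectly balanced model scores 0. Since the value is computed only over parseable generations, a low TV from a handful of parsed outputs is weak evidence, which is why it should be read alongside the parse rate.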
### D\.7Capability Retention
Table [21](https://arxiv.org/html/2605.11845#A4.T21) and Figure [11](https://arxiv.org/html/2605.11845#A4.F11) report the full TinyBenchmarks retention breakdown \[Polo et al\., [2024](https://arxiv.org/html/2605.11845#bib.bib22)\]\. TinyBenchmarks is useful in this setting because it estimates broad downstream capability from small 100\-example benchmark slices, allowing us to compare many fine\-tuned checkpoints under a fixed greedy\-decoding protocol\. The aggregate gp\-IRT results are model\-dependent rather than uniformly preserved: the base checkpoint remains strongest for many smaller models and for both Llama checkpoints, while Calibration Fine\-Tuning improves aggregate retention for several medium and large Qwen and Gemma checkpoints\.
The task\-level view clarifies this mixed picture\. MMLU, HellaSwag, and WinoGrande improve on average after fine\-tuning, and TruthfulQA remains nearly unchanged, suggesting that Calibration Fine\-Tuning does not simply degrade downstream behavior uniformly\. The clearest systematic cost is arithmetic reasoning: both strict and flexible GSM8K gp\-IRT decrease, with hard\-target fine\-tuning usually less damaging than soft\-target fine\-tuning on this task\. Overall, these results suggest that Calibration Fine\-Tuning can improve stochastic generation behavior while preserving part of the original capability profile, but retention\-aware variants are needed if calibration gains must be obtained without arithmetic degradation\.
Table 21: Retention results measured by TinyBenchmarks aggregate gp\-IRT\. The left table reports one row per model, averaging across the six TinyBenchmarks tasks\. The right table reports one row per TinyBenchmarks task, averaging across all evaluated models\. Higher is better throughout\.

Figure 11: Per\-task retention gp\-IRT for base, soft\-target, and hard\-target models across the TinyBenchmarks suite\. This figure breaks the aggregate retention summary down by benchmark to show where Calibration Fine\-Tuning preserves or changes capability\.
### D\.8PALOMA
To measure retained language\-model fit on held\-out natural text, we run teacher\-forced next\-token evaluation on seven PALOMA test slices: WikiText\-103, C4, Dolma, mC4\-English, Penn Treebank, RedPajama, and Falcon RefinedWeb \[Magnusson et al\., [2024](https://arxiv.org/html/2605.11845#bib.bib17)\]\. We score fixed 2,048\-token windows with stride 1,024 using the base model tokenizer and aggregate token\-level negative log\-likelihood across all slices\. From these totals we report both perplexity and bits\-per\-byte, with the latter providing a tokenizer\-robust comparison across model families\.
Tables [22](https://arxiv.org/html/2605.11845#A4.T22) and [23](https://arxiv.org/html/2605.11845#A4.T23) report PALOMA language\-modeling retention \[Magnusson et al\., [2024](https://arxiv.org/html/2605.11845#bib.bib17)\] by model and by evaluation slice, while Figures [12](https://arxiv.org/html/2605.11845#A4.F12) and [13](https://arxiv.org/html/2605.11845#A4.F13) show the full per\-slice breakdown for each checkpoint\. We report both perplexity and bits\-per\-byte: perplexity measures next\-token fit under each model tokenizer, while bits\-per\-byte gives a more tokenizer\-robust view across model families\.
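The two reported metrics follow from the aggregated negative log-likelihood in the usual way, sketched below with natural-log NLL totals. The window helper only illustrates the 2,048-token window and 1,024-token stride; how overlapping tokens are scored and whether a trailing partial window is kept are not stated in the text, so dropping the remainder here is an assumption, as are the function names.

```python
import math

def window_starts(n_tokens, window=2048, stride=1024):
    """Start offsets of fixed-size scoring windows (trailing remainder dropped)."""
    if n_tokens <= window:
        return [0]
    return list(range(0, n_tokens - window + 1, stride))

def perplexity(total_nll_nats, n_tokens):
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(total_nll_nats / n_tokens)

def bits_per_byte(total_nll_nats, n_bytes):
    """Total NLL converted to bits, divided by the UTF-8 byte count of the text."""
    return total_nll_nats / (math.log(2) * n_bytes)
```

Perplexity divides by the token count, so it depends on how a tokenizer segments the text; bits-per-byte divides by a tokenizer-independent byte count, which is why it is the safer quantity for cross-family comparisons.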
The model\-level results are more favorable than the TinyBenchmarks retention results\. Soft\-target fine\-tuning gives the best aggregate PALOMA score for all Qwen and Gemma checkpoints, often with nontrivial perplexity reductions, while hard\-target fine\-tuning is strongest for GPT\-OSS\-20B and slightly best for Llama\-3\.2\-3B\-it\. The only clear aggregate degradation is Llama\-3\.2\-1B\-it, where both variants slightly worsen PALOMA\. This pattern supports the interpretation in the main text: Calibration Fine\-Tuning does not simply flatten the next\-token distribution or destroy language\-model fit; in many cases it improves held\-out text likelihood\.
The slice\-level averages show the same trend from a complementary angle\. For every PALOMA slice, at least one fine\-tuned variant improves over the base model in both perplexity and bits\-per\-byte\. Hard\-target fine\-tuning is strongest on several high\-perplexity slices, including WikiText\-103, mC4, Penn Treebank, and RedPajama, while soft\-target fine\-tuning is strongest on C4 and Dolma and remains competitive elsewhere\. Thus, the PALOMA gains are not driven by a single corpus, although their magnitude varies substantially by model family and slice\.
\(a\) PALOMA Perplexity $\downarrow$

\(b\) PALOMA Bits per Byte $\downarrow$

Table 22: PALOMA aggregate results by model\. Each row corresponds to one model and aggregates over the seven PALOMA evaluation slices\. Lower is better for both metrics\. Base denotes the original checkpoint, Soft the soft\-target Calibration Fine\-Tuning checkpoint, and Hard the hard\-target Calibration Fine\-Tuning checkpoint\. Bold values denote the best reported result for each metric\.

\(a\) PALOMA Perplexity $\downarrow$

\(b\) PALOMA Bits per Byte $\downarrow$

Table 23: PALOMA slice\-level summary averaged across all evaluated models\. Each row corresponds to one PALOMA evaluation slice, and values are the mean across the 12 models for each condition\. Lower is better for both metrics\.

Figure 12: PALOMA perplexity by evaluation slice for the base, soft\-target, and hard\-target models\. Lower is better\. The figure shows that language\-model fit is often preserved or improved after Calibration Fine\-Tuning, with the strongest variant depending on model family and slice\.

Figure 13: PALOMA bits\-per\-byte by evaluation slice for the base, soft\-target, and hard\-target models\. Lower is better\. This tokenizer\-robust view follows the same broad pattern as perplexity: fine\-tuning often improves held\-out text likelihood rather than uniformly degrading it\.
## Appendix E Hyperparameter Ablations
This appendix reports the ablations used to choose the output\-space construction and training budgets for Calibration Fine\-Tuning\. We run these sweeps on two representative large checkpoints, Qwen3\-14B and Gemma\-3\-12B\-it\. All ablation results are evaluated against the same high\-resolution reference distribution used in the main structured\-sampling evaluation, with five\-decimal canonical outputs and max bins $=16384$\. We therefore compare training configurations under a fixed evaluation target rather than changing the metric with the training discretization\.
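The ablation tables in this appendix report a $Q_{95}$-$Q_{05}$ normalized Wasserstein-1 score, averaged within each distribution family and then aggregated across families with a median. The following is a minimal sketch of that pipeline under stated assumptions: the reference is represented by a finite sample, the normalizer is taken to be the reference's 95th minus 5th percentile, and the function names and family labels are hypothetical.

```python
import bisect
import statistics

def wasserstein1(xs, ys):
    """Empirical Wasserstein-1: integral of |CDF_x - CDF_y| over the real line."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(xs + ys)
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_x = bisect.bisect_right(xs, left) / len(xs)
        cdf_y = bisect.bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total

def normalized_w1(samples, reference):
    """W1 divided by the reference's Q95 - Q05 spread (assumed normalizer)."""
    cuts = statistics.quantiles(reference, n=20, method="inclusive")
    spread = cuts[-1] - cuts[0]  # 95th percentile minus 5th percentile
    return wasserstein1(samples, reference) / spread

def aggregate(scores_by_family):
    """Mean within each family, then median across families."""
    family_means = [statistics.mean(v) for v in scores_by_family.values()]
    return statistics.median(family_means)

# Toy check: a reference on [0, 1] and model samples shifted by 0.5.
reference = [i / 100 for i in range(101)]
shifted = [v + 0.5 for v in reference]
score = normalized_w1(shifted, reference)

# Hypothetical per-prompt scores grouped by (hypothetical) family labels.
overall = aggregate({"normal": [0.1, 0.3], "poisson": [0.4], "beta": [0.05, 0.05]})
print(round(score, 4), round(overall, 4))
```

Dividing by the quantile spread makes scores comparable across distributions with very different scales, and the median across families limits the influence of any single hard family on the headline number.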
Table [24](https://arxiv.org/html/2605.11845#A5.T24) first shows the construction cost of the discretized output space and prefix trie\. Increasing the decimal precision mainly increases output\-space memory, while increasing max bins mainly increases trie construction time and trie memory\. This makes the discretization choice a practical tradeoff: finer output spaces provide a more faithful canonical approximation, but very high precision quickly becomes expensive before training even begins\.
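To make the two knobs concrete, here is a minimal sketch of a discretized output space and a prefix trie that yields next-symbol targets. It is an illustration under loud assumptions rather than the paper's implementation: the trie is character-level instead of using a model tokenizer, the grid is thinned by even spacing when it exceeds the bin cap, masses are proportional to a density evaluated at each grid point, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def canonical_outputs(lo, hi, d, max_bins):
    """Canonical d-decimal strings on [lo, hi], thinned to at most max_bins (>= 2)."""
    scale = 10 ** d
    first, last = math.ceil(lo * scale), math.floor(hi * scale)
    n = last - first + 1
    if n <= max_bins:
        ticks = list(range(first, last + 1))
    else:
        # Keep max_bins evenly spaced grid points (assumed thinning rule).
        ticks = sorted({first + round(i * (n - 1) / (max_bins - 1))
                        for i in range(max_bins)})
    return [f"{t / scale:.{d}f}" for t in ticks]

def output_masses(outputs, density):
    """Probability per output string, proportional to the density at its value."""
    weights = [density(float(s)) for s in outputs]
    total = sum(weights)
    return {s: w / total for s, w in zip(outputs, weights)}

def build_prefix_trie(masses):
    """Map every prefix to the total mass of the outputs that start with it."""
    trie = defaultdict(float)
    for s, p in masses.items():
        for i in range(len(s) + 1):
            trie[s[:i]] += p
    return dict(trie)

def next_char_targets(trie, prefix):
    """Soft target over the next character: child mass divided by parent mass."""
    parent = trie[prefix]
    return {k[-1]: v / parent for k, v in trie.items()
            if len(k) == len(prefix) + 1 and k.startswith(prefix)}

# Uniform target on [0, 1] with two decimals: 101 outputs "0.00" .. "1.00".
outputs = canonical_outputs(0.0, 1.0, d=2, max_bins=1001)
trie = build_prefix_trie(output_masses(outputs, lambda x: 1.0))
root_targets = next_char_targets(trie, "")
after_zero_point = next_char_targets(trie, "0.")
print(root_targets)
print(after_zero_point)
```

In this sketch, raising `d` lengthens every output string and multiplies the size of the raw grid, while raising `max_bins` increases how many strings are inserted into the trie. That mirrors, in a simplified way, the split between output-space cost and trie cost described above.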
Table [25](https://arxiv.org/html/2605.11845#A5.T25) reports the corresponding soft\-target ablation\. The selected setting, $d=5$ with max bins $=1001$, is not chosen on the assumption that finer or larger output spaces are monotonically better\. Instead, it gives the strongest or near\-strongest held\-out\-family performance for both ablated models while keeping the soft\-target trie compact\. Figure [14](https://arxiv.org/html/2605.11845#A5.F14) gives the same conclusion qualitatively: changing decimal precision affects the sampled distribution, but the effect is model\-dependent and smaller than the overall gap between calibrated and base behavior\.
Table [26](https://arxiv.org/html/2605.11845#A5.T26) reports the hard\-target ablation\. Because hard\-target supervision observes only sampled completions rather than dense trie\-derived token targets, we vary both the output\-space construction and the sparse\-supervision budget through samples per prompt and epochs\. The final setting, $d=5$, max bins $=16384$, 16 samples per prompt per epoch, and 2 epochs, keeps the total sampled draws at 32 per prompt and gives a stable compromise across OOD and unseen\-parameter performance\. Other settings are occasionally best on a single model or split, but the selected configuration is the most consistent across the two ablated large\-model checkpoints\.
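The sparse-supervision budget can be illustrated with a short sketch. It assumes that fresh completions are drawn each epoch and that each draw is simply formatted to `d` decimals; snapping draws to the capped output space is omitted, and the function name and the standard normal target are hypothetical choices for illustration.

```python
import random

def hard_target_draws(sample_fn, d, samples_per_prompt, epochs, seed=0):
    """Sampled completions per epoch, each formatted as a d-decimal string."""
    rng = random.Random(seed)
    return [
        [f"{sample_fn(rng):.{d}f}" for _ in range(samples_per_prompt)]
        for _ in range(epochs)
    ]

# Selected budget from the text: 16 samples per prompt per epoch, 2 epochs.
draws = hard_target_draws(
    lambda rng: rng.gauss(0.0, 1.0),  # hypothetical target: standard normal
    d=5,
    samples_per_prompt=16,
    epochs=2,
)
total_draws = sum(len(epoch) for epoch in draws)

# The ablated budgets trade samples per prompt against epochs at a fixed total.
budgets = {(spp, ep): spp * ep for spp, ep in [(8, 4), (16, 2), (32, 1)]}
print(total_draws, budgets)
```

All three budgets expose the model to 32 sampled completions per prompt, so differences between them reflect how the draws are spread over epochs rather than how much supervision is given.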
Table 24: Construction cost for the discretization\-precision grid profiled in the granularity sweep\. We report mean output\-space and trie construction times across the prompt suite, together with the worst per\-prompt tracemalloc peak for output\-space construction and trie construction\. The green row marks the shared high\-resolution evaluation reference selected for the ablations\.
Figure 14: Qualitative OOD sampling distributions for the output\-discretization ablation\. Colors denote the discretization precision used during soft\-target Calibration Fine\-Tuning\. Finer discretizations change behavior in a model\-dependent way, consistent with the quantitative ablation tables\.
Table 25: Full ablation over the granularity of the canonical numeric output space\. Rows vary the number of decimal places $d$ and the cap on the number of support bins used during training\. All checkpoints are evaluated against a shared high\-resolution reference with $d=5$ and max bins $=16384$\. Results are reported as $Q_{95}$\-$Q_{05}$ normalized Wasserstein\-1, averaged within family and then aggregated across families with a median\. The green row denotes the final selected setting\. Values in bold denote the best setting for each model and split\.
Table 26: Hard\-target fine\-tuning ablation on the two large models\. The anchor setting uses $d=3$, max bins $=1001$, and a matched training budget of 16 samples per prompt \(spp\) per epoch for 2 epochs\. Rows vary one factor at a time while keeping the total number of sampled draws explicit\. All checkpoints are evaluated against the shared reference with $d=5$ and max bins $=16384$\. Results are reported as $Q_{95}$\-$Q_{05}$ normalized Wasserstein\-1, averaged within family and then aggregated across families with a median\. The green row denotes the final selected setting\. Values in bold denote the best setting for each model and split\.

| Model | $d$ | max bins | spp | epochs | total draws | OOD $W_1\downarrow$ | Unseen $W_1\downarrow$ |
|---|---|---|---|---|---|---|---|
| Qwen3\-14B | 3 | 1001 | 16 | 2 | 32 | 0\.0759 | 0\.0631 |
| Qwen3\-14B | 2 | 1001 | 16 | 2 | 32 | 0\.0781 | 0\.07 |
| Qwen3\-14B | 5 | 1001 | 16 | 2 | 32 | 0\.0757 | 0\.0625 |
| Qwen3\-14B | 3 | 4096 | 16 | 2 | 32 | 0\.0753 | 0\.0637 |
| Qwen3\-14B | 3 | 16384 | 16 | 2 | 32 | 0\.0627 | 0\.0613 |
| Qwen3\-14B | 3 | 1001 | 8 | 4 | 32 | 0\.0797 | 0\.061 |
| Qwen3\-14B | 3 | 1001 | 32 | 1 | 32 | 0\.0768 | 0\.05 |
| Qwen3\-14B | 5 | 16384 | 8 | 4 | 32 | 0\.085 | 0\.0566 |
| Qwen3\-14B | 5 | 16384 | 16 | 2 | 32 | 0\.0736 | 0\.0529 |
| Qwen3\-14B | 5 | 16384 | 32 | 1 | 32 | 0\.0701 | 0\.0597 |
| Gemma\-3\-12B\-it | 3 | 1001 | 16 | 2 | 32 | 0\.1149 | 0\.0965 |
| Gemma\-3\-12B\-it | 2 | 1001 | 16 | 2 | 32 | 0\.1166 | 0\.1031 |
| Gemma\-3\-12B\-it | 5 | 1001 | 16 | 2 | 32 | 0\.1116 | 0\.0941 |
| Gemma\-3\-12B\-it | 3 | 4096 | 16 | 2 | 32 | 0\.1236 | 0\.0922 |
| Gemma\-3\-12B\-it | 3 | 16384 | 16 | 2 | 32 | 0\.1136 | 0\.0814 |
| Gemma\-3\-12B\-it | 3 | 1001 | 8 | 4 | 32 | 0\.1184 | 0\.0823 |
| Gemma\-3\-12B\-it | 3 | 1001 | 32 | 1 | 32 | 0\.1138 | 0\.0777 |
| Gemma\-3\-12B\-it | 5 | 16384 | 8 | 4 | 32 | 0\.1179 | 0\.0759 |
| Gemma\-3\-12B\-it | 5 | 16384 | 16 | 2 | 32 | 0\.105 | 0\.0811 |
| Gemma\-3\-12B\-it | 5 | 16384 | 32 | 1 | 32 | 0\.1218 | 0\.0708 |