
# Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
Source: [https://arxiv.org/html/2605.11128](https://arxiv.org/html/2605.11128)
Amin Banayeeanzade♠, Qingchuan Yang♠, Dhruv Tarsadiya♠, Fatemeh Bahrani♠, Leonardo Blas♠, Alfy Samuel♣, Robin Jia♠, Meisam Razaviyayn♠, Sai Praneeth Karimireddy♠
♠University of Southern California, ♣Capital One
\{banayeea,qcyang,razaviya,karimire\}@usc.edu
Demo: [https://diversitycalibration.github.io](https://diversitycalibration.github.io/)

###### Abstract

Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity–diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, *order calibration*: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, *shape calibration*: probability mass is overly concentrated on a few valid continuations, with a heavy tail of mixed valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.11128v1/x1.png)
![Refer to caption](https://arxiv.org/html/2605.11128v1/x2.png)

Figure 1: (Left) The token distribution of a generation step from Qwen3.5-35B-A3B. The distribution is very sharp in the front, followed by a heavy tail with mixed valid and invalid tokens. As a result, (Right) many valid tokens are unlikely to appear in the output under any temperature sampling. †Tokens are subsampled non-uniformly for enhanced visualization.

Diversity in generation is essential for a wide range of applications, including synthetic data generation [[3](https://arxiv.org/html/2605.11128#bib.bib21)], creative writing [[49](https://arxiv.org/html/2605.11128#bib.bib2)], recommendation systems [[6](https://arxiv.org/html/2605.11128#bib.bib22)], coding [[47](https://arxiv.org/html/2605.11128#bib.bib23)], and exploration for scientific discovery [[39](https://arxiv.org/html/2605.11128#bib.bib20)]. However, contemporary large language models (LLMs) often exhibit a notable lack of diversity [[18](https://arxiv.org/html/2605.11128#bib.bib3), [30](https://arxiv.org/html/2605.11128#bib.bib24), [36](https://arxiv.org/html/2605.11128#bib.bib25), [44](https://arxiv.org/html/2605.11128#bib.bib27)]. For instance, 59% of stories in a GPT-4-generated dataset begin with "Once upon a time", and GPT-5.5 repeatedly outputs "Valparaíso, Chile" when asked to name a random city in the world (see Appendix [G](https://arxiv.org/html/2605.11128#A7)).

These examples illustrate a broader failure mode: when generation is overly concentrated on a small set of high-probability outputs, many valid alternatives may be systematically undersampled. Recent work has made substantial progress in measuring this lack of diversity [[49](https://arxiv.org/html/2605.11128#bib.bib2)], but these benchmarks neither explain the source of the collapse nor provide diagnostic tools to systematically trace the problem.

Although standard sampling methods are intended to resolve the problem, they instead reveal its difficulty: temperature sampling flattens the distribution, but often shifts probability mass toward invalid or nonsensical continuations before sufficient diversity is recovered [[35](https://arxiv.org/html/2605.11128#bib.bib11)]. Top-token filtering methods such as top-k and min-p [[28](https://arxiv.org/html/2605.11128#bib.bib6)] truncate the ranked distribution, but they either drop many valid alternatives or include invalid tokens. This suggests that the bottleneck is not merely the decoding heuristic, but the properties of the LLM distribution itself. We therefore ask:

What are the distributional properties of LLMs that constrain their ability to generate outputs that are both valid and diverse?

To formalize this view, we study the inference-time distribution of an LLM through the lens of the *validity–diversity trade-off*. Rather than only measuring diversity on completed generations, we analyze the decoding process itself and how the model's next-token distribution allocates probability mass across generations. This perspective reveals two distinct failure modes.

First, LLMs fail in *order calibration*: valid tokens are not reliably ranked above invalid ones. In Figure [1](https://arxiv.org/html/2605.11128#S1.F1), we show that many valid alternative tokens (blue) appear farther down the ranked distribution and are interleaved with invalid ones (red). When this occurs, any top-token filtering rule faces an unavoidable trade-off: expanding the cutoff recovers more valid continuations but also admits more invalid ones, while tightening it preserves validity but excludes valid alternatives.

Second, LLMs fail in *shape calibration*: probability mass is concentrated on a small number of valid continuations (green), much smaller probability is assigned to other valid tokens (blue), and a heavy tail contains many invalid tokens (red). In the right panel of Figure [1](https://arxiv.org/html/2605.11128#S1.F1), we show that increasing temperature shifts probability mass away from the head, but much of this mass flows into the invalid tail rather than recovering rare valid alternatives. We formalize these effects and show that the resulting validity–diversity loss compounds when generating longer sequences.

Across 14 language models spanning multiple families and scales, we find that these calibration failures constitute the primary bottleneck to diversity, and that addressing them unlocks broader diversity in model outputs. Our findings also have implications for model training and design, suggesting directions for mitigating these bottlenecks at their source. Finally, our analysis challenges implicit assumptions underlying common sampling strategies and provides guidance for more principled benchmarking and evaluation of decoding methods.

#### Contributions.

Building on this framework, our contributions are:

1. We introduce a framework for analyzing the validity–diversity trade-off at both the token and sequence levels.
2. We introduce *order calibration* and *shape calibration* as two complementary distributional bottlenecks. We theoretically and empirically demonstrate that local failures compound over sequence length.
3. We develop controlled empirical diagnostics for probing these bottlenecks, including settings with exactly known valid sets and oracle cutoff baselines, and show that no decoding method that relies on top-token filtering can effectively recover diversity.

## 2 Related Work

#### Diversity and mode collapse.

Limited output diversity has emerged as a central failure mode of modern language models [[44](https://arxiv.org/html/2605.11128#bib.bib27), [10](https://arxiv.org/html/2605.11128#bib.bib34), [12](https://arxiv.org/html/2605.11128#bib.bib33), [46](https://arxiv.org/html/2605.11128#bib.bib51)]. Homogeneity of generation appears both within a single model across repeated samples [[36](https://arxiv.org/html/2605.11128#bib.bib25)] and across different models on the same prompts [[18](https://arxiv.org/html/2605.11128#bib.bib3)]. Even frontier models remain substantially less diverse than humans [[49](https://arxiv.org/html/2605.11128#bib.bib2)], especially models with extensive alignment and post-training [[45](https://arxiv.org/html/2605.11128#bib.bib19), [43](https://arxiv.org/html/2605.11128#bib.bib37), [22](https://arxiv.org/html/2605.11128#bib.bib38), [20](https://arxiv.org/html/2605.11128#bib.bib39), [30](https://arxiv.org/html/2605.11128#bib.bib24)]. Together, these works motivate studying diversity as a first-class property of language generation.

#### Evaluation.

Diversity is inherently multi-dimensional, and recent work has moved beyond narrow lexical metrics toward broader assessments of open-ended generation [[49](https://arxiv.org/html/2605.11128#bib.bib2), [18](https://arxiv.org/html/2605.11128#bib.bib3), [21](https://arxiv.org/html/2605.11128#bib.bib35)]. Attempts to improve diversity can lead to text degeneration [[15](https://arxiv.org/html/2605.11128#bib.bib1)], so temperature must be carefully tuned [[38](https://arxiv.org/html/2605.11128#bib.bib40), [50](https://arxiv.org/html/2605.11128#bib.bib42)]. Therefore, diversity should not be assessed in isolation from quality [[34](https://arxiv.org/html/2605.11128#bib.bib36)], and raw diversity is misleading when many outputs are low-quality [[35](https://arxiv.org/html/2605.11128#bib.bib11), [42](https://arxiv.org/html/2605.11128#bib.bib10)]. Following this perspective, we treat diversity as meaningful only insofar as it broadens the space of valid, useful outputs, and we use this lens to diagnose when existing decoding rules fail to do so.

#### Improving diversity.

A long line of work has sought to improve diversity through prompting [[48](https://arxiv.org/html/2605.11128#bib.bib9), [29](https://arxiv.org/html/2605.11128#bib.bib41), [41](https://arxiv.org/html/2605.11128#bib.bib52)], training [[25](https://arxiv.org/html/2605.11128#bib.bib45), [16](https://arxiv.org/html/2605.11128#bib.bib44), [26](https://arxiv.org/html/2605.11128#bib.bib46), [7](https://arxiv.org/html/2605.11128#bib.bib47)], base–aligned model collaboration [[42](https://arxiv.org/html/2605.11128#bib.bib10), [32](https://arxiv.org/html/2605.11128#bib.bib43)], and inference-time interventions [[40](https://arxiv.org/html/2605.11128#bib.bib7), [37](https://arxiv.org/html/2605.11128#bib.bib8)]. Closer to our work, inference-time stochastic methods such as temperature scaling [[1](https://arxiv.org/html/2605.11128#bib.bib15)], top-k [[8](https://arxiv.org/html/2605.11128#bib.bib30)], top-p [[15](https://arxiv.org/html/2605.11128#bib.bib1)], and min-p [[28](https://arxiv.org/html/2605.11128#bib.bib6)] sampling modify the support or sharpness of the next-token distribution. Subsequent methods make truncation more adaptive [[13](https://arxiv.org/html/2605.11128#bib.bib4), [27](https://arxiv.org/html/2605.11128#bib.bib5), [28](https://arxiv.org/html/2605.11128#bib.bib6), [33](https://arxiv.org/html/2605.11128#bib.bib48), [51](https://arxiv.org/html/2605.11128#bib.bib50)]. Our work complements this literature: rather than proposing another decoding strategy, we examine why existing sampling rules often fail to recover meaningful diversity.
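As a concrete reference point for the truncation rules discussed above, here is a minimal sketch of top-k, top-p (nucleus), and min-p filtering over an explicit probability vector. This is our own illustration, not any library's API:

```python
def top_k_filter(probs, k):
    """top-k: keep the k highest-probability tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}

def top_p_filter(probs, p):
    """top-p: keep the smallest top-ranked set whose cumulative mass
    reaches p, then renormalize."""
    keep, total = [], 0.0
    for i in sorted(range(len(probs)), key=lambda j: -probs[j]):
        keep.append(i)
        total += probs[i]
        if total >= p:
            break
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}

def min_p_filter(probs, p_base):
    """min-p: keep tokens whose probability is at least p_base times
    the top probability, then renormalize."""
    cutoff = p_base * max(probs)
    keep = [i for i, q in enumerate(probs) if q >= cutoff]
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}

probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # a toy sorted next-token distribution
```

Each rule modifies the support or sharpness of the distribution, but none can separate valid from invalid tokens when the two are interleaved in the ranking, which is the failure mode analyzed in the next sections.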

## 3 Preliminaries

We consider an auto-regressive LLM with vocabulary $\mathcal{V}$. Given a prompt $x\in\mathcal{V}^{*}$ and a generated prefix $y_{<t}=(y_1,\dots,y_{t-1})\in\mathcal{V}^{t-1}$, the model defines a conditional distribution $p(\cdot\mid y_{<t},x)$ over the next token. The probability of a complete output $y=(y_1,\dots,y_d)$ of length $d$ is $p(y\mid x)=\prod_{t=1}^{d}p(y_t\mid y_{<t},x)$. When the task is fixed and unambiguous, we omit $x$ from our notation.

We use $V\subseteq\mathcal{V}^{*}$ to represent the set of all valid responses to $x$. With slight abuse of notation, we say $y_{<t}\in V$ if there exists a continuation $w\in\mathcal{V}^{*}$ such that the concatenation $y_{<t}\circ w\in V$.

###### Definition 3.1 (Validity and Diversity).

For a prompt $x$, let $Y\sim p(\cdot\mid x)$ denote the model's distribution over complete responses.

1. (Validity) We define validity as the total probability mass that the model assigns to $V$: $\mathrm{Val}(p) := p(Y\in V\mid x) = \sum_{y\in V} p(y\mid x)$.
2. (Diversity) We assume that all valid responses are equally preferred, and hence define diversity as the normalized effective support size of the model distribution restricted to the valid set: $\mathrm{Div}(p) := e^{H(\tilde{p})}/|V|$, where $\tilde{p} = p(Y\mid Y\in V, x)$ and $H$ is the Shannon entropy.
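The two quantities can be computed directly for a toy model distribution. The following sketch (our own illustration; the example distribution and valid set are made up) implements Definition 3.1:

```python
import math

def validity_diversity(p, valid):
    """Val(p) and Div(p) from Definition 3.1 for a toy distribution.

    p:     dict mapping complete response -> probability (sums to 1)
    valid: the valid set V
    """
    val = sum(q for y, q in p.items() if y in valid)          # Val(p) = p(Y in V)
    # Restrict to the valid set and renormalize: p~ = p(Y | Y in V).
    tilde = {y: q / val for y, q in p.items() if y in valid and q > 0}
    entropy = -sum(q * math.log(q) for q in tilde.values())   # Shannon entropy H(p~)
    return val, math.exp(entropy) / len(valid)                # Div = e^H / |V|

# A model that concentrates 90% of its valid mass on one response:
p = {"a": 0.72, "b": 0.04, "c": 0.04, "junk": 0.20}
val, div = validity_diversity(p, valid={"a", "b", "c"})
```

Here validity is 0.8, but diversity is only about 0.49: although three responses are valid, the effective support of the restricted distribution covers roughly one and a half of them.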

Intuitively, validity captures the probability mass that the model assigns to valid responses, and diversity then quantifies the effective coverage of valid outputs under the distribution restricted to the valid set [[14](https://arxiv.org/html/2605.11128#bib.bib17), [19](https://arxiv.org/html/2605.11128#bib.bib18)]. In related work, Yang et al. [[45](https://arxiv.org/html/2605.11128#bib.bib19)] define the exponential of entropy as a token-level measure of the effective number of plausible next steps during generation. Both validity and diversity take values in the interval $[0,1]$, with higher values indicating better performance. Moreover, we define:

###### Definition 3.2 (Valid Continuations and Valid Tokens).

Given a context $y_{<t}\in\mathcal{V}^{t-1}$, the number of valid continuations is defined as the number of sequences in $V$ that begin with the prefix $y_{<t}$,

$$N(y_{<t}) := \bigl|\{z\in\mathcal{V}^{*} : y_{<t}\circ z \in V\}\bigr|,$$

and the set of valid tokens is accordingly defined as the tokens that lead to at least one valid continuation,

$$G(y_{<t}) := \{v\in\mathcal{V} : N(y_{<t}\circ v) > 0\}.$$

A decoding rule achieves high validity if it assigns high probability only to valid tokens at each time step, and it achieves high diversity if it explores many distinct tokens in $G$ rather than concentrating on only a few of them. However, LLMs, regardless of the decoding strategy employed, often exhibit a pronounced validity–diversity trade-off. In this work, we identify two primary sources of this phenomenon arising from properties of the model distribution. In §[4](https://arxiv.org/html/2605.11128#S4), we first introduce *order calibration* and its implications for top-token filtering methods. Next, in §[5](https://arxiv.org/html/2605.11128#S5), we identify the *shape calibration* issue and show that together, these effects constitute the primary sources of the observed validity–diversity trade-off.

## 4 Order Calibration Fails: Valid tokens are not ranked first

Modern decoding strategies implicitly assume that valid tokens are concentrated near the top of the ranked distribution and that simple statistics of the distribution (e.g., cumulative mass or relative probability gaps) can reliably identify and retain these tokens. Under this view, diversity is increased by expanding the retained set, while validity is preserved by truncating low-probability regions.

In this section, we show that LLM token distributions systematically violate these assumptions. Valid tokens are not confined to the head but are frequently interspersed with invalid tokens throughout the tail (see Figure [1](https://arxiv.org/html/2605.11128#S1.F1)), and the relationship between rank and validity is neither monotone nor stable across contexts. As a result, any decoding rule based solely on rank-based filtering faces an inherent limitation: it must inevitably trade off between excluding valid tokens and admitting invalid ones. Even small imperfections in separating valid and invalid tokens at each step compound multiplicatively over long generations, leading to a sharp degradation in reachable valid outputs.

#### Cutoff strategies.

We abstract all top-token filtering methods as cutoff strategies. Let $S$ denote a cutoff strategy. Given a prefix $y_{<t}$, it first sorts the tokens by their conditional probabilities and then selects a cutoff index, retaining all tokens up to that index and discarding the rest. Importantly, our framing strictly contains any adaptive top-token filtering method, since $S$ is not a predetermined rule; we allow it to be any arbitrary cutoff strategy, potentially depending on the prefix $y_{<t}$ and the model distribution $p(\cdot\mid y_{<t})$ at each step. Let $S_t(y_{<t})$ denote the set of retained tokens. An ideal strategy $S$ would include as many valid tokens as possible (high recall) while excluding all invalid tokens (high precision). Therefore, we define the following to measure the quality of a strategy:

###### Definition 4.1 (Precision/Recall).

Let $V\subseteq\mathcal{V}^{d}$ denote the set of valid sequences, $G(y_{<t})$ the set of valid next tokens, $N(y_{<t})$ the number of valid continuations, and $S_t(y_{<t})$ the set of retained tokens for a prefix $y_{<t}$.

1. (Local Precision) We define local precision as the fraction of retained tokens that are valid: $\mathrm{Prec}_t(S; y_{<t}) := \frac{|S_t(y_{<t}) \cap G(y_{<t})|}{|S_t(y_{<t})|}$.
2. (Local Recall) We define local recall as the fraction of valid continuations that remain reachable after truncation: $\mathrm{Rec}_t(S; y_{<t}) := \frac{\sum_{v\in S_t(y_{<t})\cap G(y_{<t})} N(y_{<t}\circ v)}{N(y_{<t})}$.

Moreover, let $Q_S$ denote the sequence distribution induced by uniformly sampling from the retained sets. Then,

3. (Sequence-Level Precision) We define sequence precision as the probability of generating a valid sequence: $\mathrm{Prec}_{\mathrm{seq}}(S) := Q_S(Y\in V)$.
4. (Sequence-Level Recall) We define sequence recall as the fraction of valid sequences that remain reachable: $\mathrm{Rec}_{\mathrm{seq}}(S) := \frac{|\{y\in V : y_t\in S_t(y_{<t})\ \forall t\}|}{|V|}$.
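Given per-rank validity labels and continuation counts, the local quantities in Definition 4.1 reduce to a few lines. A minimal sketch for a rank-k cutoff, with toy inputs of our own construction:

```python
def local_precision_recall(ranked_valid, ranked_counts, k):
    """Local precision/recall of a rank-k cutoff (Definition 4.1).

    ranked_valid:  bool per rank, True iff the token at that rank is in G(y_<t)
    ranked_counts: int per rank, N(y_<t . v) for the token at that rank
                   (0 for invalid tokens)
    k:             the cutoff strategy retains the top-k ranked tokens
    """
    prec = sum(ranked_valid[:k]) / k                    # valid retained / retained
    rec = sum(ranked_counts[:k]) / sum(ranked_counts)   # reachable valid continuations
    return prec, rec

# Valid (True) and invalid (False) tokens interleaved in the ranking:
valid = [True, False, True, False, False, True]
counts = [4,    0,     2,    0,     0,     2]
curve = [local_precision_recall(valid, counts, k) for k in range(1, 7)]
```

Because valid and invalid ranks interleave, no single cutoff attains both precision 1 and recall 1 here: k = 1 gives (1.0, 0.5), while full recall requires k = 6 at precision 0.5.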

#### Relation to validity and diversity.

Sequence precision coincides exactly with validity. Sequence recall captures a complementary notion of diversity: it measures how much of the valid output space remains accessible under the decoding rule. While our definition of diversity is entropy-based, recall provides a notion of coverage. In particular, if recall is small, then the decoder can only explore a small subset of valid outputs, regardless of how probability is distributed within it.

### 4.1 A controlled testbed for order calibration

Our goal is to empirically measure the precision–recall trade-off introduced by any cutoff strategy. A central challenge in practice is that the valid set $V$ is intractable to characterize, as it grows exponentially with sequence length, especially in open-ended tasks such as storytelling. To address this, we propose a practical procedure to approximate the precision/recall metrics. The detailed setup, models, and prompts are found in Appendix [B.1](https://arxiv.org/html/2605.11128#A2.SS1).

![Refer to caption](https://arxiv.org/html/2605.11128v1/x3.png)
![Refer to caption](https://arxiv.org/html/2605.11128v1/x4.png)

Figure 2: (Left) We sweep the logits and cutoff thresholds at each conditional distribution to enumerate retained tokens up to a certain depth, followed by greedy decoding from each leaf. A judge model then evaluates the validity of the generated sequences, allowing us to attribute a validity label to each token. We then measure the number of valid/invalid tokens that were retained/dropped by the cutoff strategy to obtain local precision and recall. (Right) The frontier precision–recall trade-offs at different depths obtained by sweeping cutoff strategies. The trade-off degrades as depth increases.

#### Measuring token validity.

Given a fixed prefix $y_{<t}$, we query the LLM to obtain a sorted list of all next-token candidates $\{v_1, \dots, v_{|\mathcal{V}|}\}$. For each token $v_j$, we construct the extended prefix $y_{<t}\circ v_j$ and then perform greedy decoding to completion. This approximates the model's most likely continuation conditioned on having selected $v_j$ (see Figure [2](https://arxiv.org/html/2605.11128#S4.F2), middle panel). We then evaluate the resulting sequence using an LLM-as-a-judge [[11](https://arxiv.org/html/2605.11128#bib.bib29)], scoring grammatical, semantic, and overall validity. By thresholding this score, we obtain a binary validity label for the token $v_j$. Repeating this process for all tokens yields an approximate assessment of token-level validity given the prefix, and allows us to compute the precision–recall trade-off as a function of the cutoff strategy in the next sections.
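The labeling loop can be sketched as follows. The three callables are hypothetical stand-ins for the LLM ranking, greedy completion, and judge calls, not a real API:

```python
def label_token_validity(prefix, rank_tokens, complete, judge, threshold=0.5):
    """For each candidate next token: extend the prefix, greedily
    complete, and threshold the judge's score into a binary label."""
    labels = {}
    for tok in rank_tokens(prefix):
        sequence = complete(prefix + [tok])          # greedy completion stand-in
        labels[tok] = judge(sequence) >= threshold   # LLM-as-a-judge stand-in
    return labels

# Toy stand-ins: tokens "a" and "b" lead to valid completions, "#" does not.
rank_tokens = lambda prefix: ["a", "#", "b"]
complete = lambda seq: seq
judge = lambda seq: 0.0 if "#" in seq else 1.0
labels = label_token_validity([], rank_tokens, complete, judge)
```

In the real procedure the completion and judge calls dominate the cost, which is why the multi-step version below the fold subsamples tokens at each depth.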

Appendix [A.1](https://arxiv.org/html/2605.11128#A1.SS1) provides additional details, including the evaluation rubric. Appendix [A.2](https://arxiv.org/html/2605.11128#A1.SS2) validates the reliability of the LLM judge against human annotations and examines the impact of judge model choice. Moreover, Appendix [A.3](https://arxiv.org/html/2605.11128#A1.SS3) compares greedy completions against sampling and shows that the results are robust to this choice.

![Refer to caption](https://arxiv.org/html/2605.11128v1/x5.png)

Figure 3: Local precision–recall trade-off when sweeping the cutoff in a single generation step.

#### Single-step precision–recall trade-off.

The extracted labels allow us to measure the precision–recall trade-off at a single decoding step and observe how valid tokens are distributed in the conditional. In Figure [3](https://arxiv.org/html/2605.11128#S4.F3), we sweep the cutoff from the first index up to the token at rank 80k on a story generation task with Qwen3.5-35B. Plotting the precision and recall at each cutoff, we observe that precision drops sharply at the front, while recall improves slowly, even at high token indices. This indicates a strong precision–recall trade-off: order calibration is severely violated even on this single conditional. We provide qualitative examples in Appendix [B.2](https://arxiv.org/html/2605.11128#A2.SS2).

#### Multi-step precision–recall trade-off.

We extend our single-step methodology to a more realistic setup that captures multi-step effects. Specifically, we extend the procedure to depth $d > 1$: instead of immediately decoding greedily, we recursively expand tokens at each successive step, constructing a tree of continuations (Figure [2](https://arxiv.org/html/2605.11128#S4.F2), left). After expanding all nodes up to $d$ tokens at each branch, we complete each leaf via greedy decoding. By evaluating the resulting sequences with a judge, we obtain a validity label for each token along the tree: a token is valid if at least one sequence containing it is valid. Since the number of sequences grows exponentially with depth and each depth requires many LLM calls, we sweep up to depth 3 and subsample tokens at each depth, as detailed in Appendix [B](https://arxiv.org/html/2605.11128#A2).

Given the validity label of every token up to step $d$, we construct the precision–recall trade-off curves by sweeping over all possible cutoff strategies and computing the local trade-off for every node using Definition [4.1](https://arxiv.org/html/2605.11128#S4.Thmtheorem1). Note that our framework includes all top-token filtering strategies, as it allows any node to adjust its own cutoff arbitrarily. Each cutoff strategy gives a single point in the precision–recall plane; we take the Pareto frontier of all strategies as a representative of the best achievable trade-off.
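Since each cutoff strategy contributes one (precision, recall) point, the frontier keeps only the non-dominated points. A minimal sketch of this step (the example points are placeholders, not measured values):

```python
def pareto_frontier(points):
    """Keep a (precision, recall) point iff no other, different point is
    at least as good in both coordinates."""
    return sorted(p for p in points
                  if not any(q != p and q[0] >= p[0] and q[1] >= p[1]
                             for q in points))

points = [(1.0, 0.5), (0.9, 0.4), (0.66, 0.75), (0.5, 1.0)]
frontier = pareto_frontier(points)  # (0.9, 0.4) is dominated by (1.0, 0.5)
```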

#### Local precision–recall trade-off worsens with depth.

We perform the above procedure across 10 seeds, each with a random query from NoveltyBench [[49](https://arxiv.org/html/2605.11128#bib.bib2)] and a random prefix $y_{<t}$. For each seed, we compute the Pareto-optimal precision–recall curve at every node in the generation tree as described above. Figure [2](https://arxiv.org/html/2605.11128#S4.F2) (right) summarizes the results. To analyze the effect of horizon length, we group nodes by their depth in the generation tree and report, for each depth, the maximum, minimum, and average Pareto frontiers. We observe the following:

- Even the optimal cutoff strategy exhibits a non-negligible local precision–recall trade-off.
- The trade-off worsens as depth increases (from $d=1$ to $d=2$, and from $d=2$ to $d=3$).
- This degradation appears not only in the average but also in the maximum frontier.

Overall, these results show that the precision–recall trade\-off induced by cutoffs worsens with decoding depth\. This provides empirical evidence that order calibration is frequently violated: valid continuations are not reliably ranked above invalid ones, and this misalignment compounds over longer horizons\.

### 4.2 Local order failures compound into sequence-level collapse

These measurements are at the level of local precision and recall. At the sequence level, the trade-off is even stronger: small but constant local imperfections lead to a dramatic collapse in global diversity, since the errors incurred at each depth compound multiplicatively, as we show in the following:

###### Theorem 4.2 (Compounding effect of decoding steps).

Suppose that at least $m$ decoding positions exhibit a constant local precision–recall trade-off: at each such position, high local precision must discard a constant fraction of valid continuations. Then there exist constants $c, C > 0$ such that any cutoff strategy $S$ satisfying

$$\mathrm{Prec}_{\mathrm{seq}}(S) \geq 1-\delta$$

must satisfy

$$\mathrm{Rec}_{\mathrm{seq}}(S) \leq (1-\delta)^{-C} e^{-cm}.$$

![Refer to caption](https://arxiv.org/html/2605.11128v1/x6.png)

Figure 4: Precision–recall trade-offs across Qwen-3, Llama-3, and Olmo-3 over 9 sizes and training stages. Evaluations are averaged over 3 random positions and queries. (Top) Average area under the precision–recall frontier. (Bottom) Average recall at precision 0.8.

Table 1: Semantic and lexical diversity of cutoff strategies. A higher Embedding Diversity score corresponds to higher semantic diversity; a lower Self-BLEU score corresponds to higher lexical diversity.
#### Interpretation.

The theorem formalizes a compounding effect. High sequence-level validity leaves only a small total budget for local precision errors. Hence, at most $O(\log(1/(1-\delta)))$ decoding steps can tolerate such trade-offs; the remaining steps must use near-perfect cutoffs. If valid and invalid tokens are interleaved, such cutoffs necessarily discard a constant fraction of valid continuations. These losses compound multiplicatively, so the reachable valid set shrinks exponentially. In this sense, any cutoff rule can maintain validity only by sacrificing broad coverage of valid outputs, thereby proving a hardness result for any decoding strategy that relies on top-token filtering. The details of the proof are in Appendix [E](https://arxiv.org/html/2605.11128#A5).
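The exponential shrinkage is easy to see with a back-of-the-envelope calculation (a numerical illustration under an assumed per-step recall, not the proof):

```python
# If m positions each force a high-precision cutoff to drop a constant
# fraction 1 - r of valid continuations, the fraction of valid sequences
# that remains reachable decays like r**m = exp(-c*m), with c = -log(r).
def reachable_fraction(r, m):
    return r ** m

r = 0.9  # assumed constant local recall at each of the m positions
decay = [reachable_fraction(r, m) for m in (1, 10, 50)]
```

Even a seemingly mild 10% per-step loss leaves under 1% of valid sequences reachable after 50 such positions.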

### 4.3 Scaling and the diversity gap

Sections [4.1](https://arxiv.org/html/2605.11128#S4.SS1) and [4.2](https://arxiv.org/html/2605.11128#S4.SS2) establish local precision–recall as a meaningful diagnostic for order calibration. We now use this diagnostic to quantify the practical importance of order calibration.

#### Model Size.

Figure [4](https://arxiv.org/html/2605.11128#S4.F4.18) summarizes the local precision–recall trade-offs across model families, scales, and training stages (see Appendix [B.3](https://arxiv.org/html/2605.11128#A2.SS3)). The average AUC of the precision–recall frontier exhibits a mild upward trend with model size. However, the improvement is modest and far from eliminating the trade-off. In particular, when precision is fixed at 0.8, recall remains low and non-monotonic across model sizes. This shows that larger models do not reliably recover more valid continuations under a high-precision constraint. Therefore, while scale slightly improves order calibration, it does not by itself resolve the failure.
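For reference, an AUC summary of this kind can be computed from frontier points with the trapezoid rule. This is a sketch over placeholder points; the paper's exact aggregation may differ:

```python
def frontier_auc(points):
    """Area under a precision-recall frontier via the trapezoid rule.
    `points` are (recall, precision) pairs spanning recall in [0, 1]."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

auc = frontier_auc([(0.0, 1.0), (0.5, 0.8), (1.0, 0.2)])
```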

#### Oracle filtering.

To quantify how much diversity is lost specifically due to order miscalibration, we introduce an oracle validity filter that samples only from tokens labeled valid by our diagnostic procedure. We apply the oracle filter only during the first two decoding steps and then continue generation normally. For the other strategies, we sweep a grid of temperatures and cutoff parameters and report the best diversity among settings with validity of at least 0.8 (see Appendix [B.4](https://arxiv.org/html/2605.11128#A2.SS4)).

Table [1](https://arxiv.org/html/2605.11128#S4.T1) shows that even this limited oracle intervention yields a clear improvement: oracle filtering achieves the highest embedding diversity and the lowest Self-BLEU. This indicates that the valid tokens excluded by rank-based cutoffs yield meaningfully different generations. The order calibration gap therefore has direct output-level consequences: valid alternatives are present in the model distribution, but standard rank-based sampling rules fail to reliably expose them.

## 5 Shape Calibration Fails: Sharp, heavy-tailed distributions limit diversity

In this section, we focus on a second, complementary bottleneck, namely shape calibration. While the LLM conditional distribution varies across tasks, prior work suggests that next-token distributions are typically head-heavy and long-tailed, with most probability mass concentrated in a relatively small nucleus and a large, unreliable tail [[13](https://arxiv.org/html/2605.11128#bib.bib4), [15](https://arxiv.org/html/2605.11128#bib.bib1)]. In Appendix [I](https://arxiv.org/html/2605.11128#A9), we randomly sample conditional next-token distributions from a diverse set of tasks and examine their sorted logits. We show that the logits' behavior can be consistently described by a linear decay in the head, followed by a heavy tail that decays logarithmically. After applying softmax, this translates to an exponential (geometric) decay in the head, i.e., $p(v_k \mid y_{<t}) \propto \exp(-\lambda k/T)$, and a *Zipf-like* [[4](https://arxiv.org/html/2605.11128#bib.bib16)] behavior in the tail, $p(v_k \mid y_{<t}) \propto k^{-\lambda/T}$, where $k$ denotes the rank of token $v_k$ in the sorted vocabulary, $T$ is the temperature, and $\lambda$ controls the sharpness of the distribution.
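The geometric-head/Zipf-tail shape described above can be sketched numerically. The vocabulary size, head size, and $\lambda$ below are illustrative choices, not fitted values from the paper:

```python
import math

def head_tail_logits(vocab_size, head_size, lam):
    """Idealized sorted logits: linear decay over the head (ranks < head_size),
    logarithmic decay in the tail, joined continuously. After softmax this
    yields a geometric head p(v_k) ~ exp(-lam*k) and a Zipf-like tail
    p(v_k) ~ k**(-lam)."""
    offset = -lam * head_size + lam * math.log(head_size)
    return [-lam * k if k < head_size else offset - lam * math.log(k)
            for k in range(vocab_size)]

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax(head_tail_logits(vocab_size=1000, head_size=10, lam=1.0))
# Geometric head: the ratio of successive probabilities is constant, exp(-lam).
ratio = probs[1] / probs[0]
# Zipf-like tail: p_k * k**lam is roughly constant across tail ranks.
tail_const = [probs[k] * k for k in (100, 200, 400)]
```

This makes the two regimes checkable directly: constant successive ratios in the head, and rank-times-probability approximately constant in the tail.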

These two properties together cause the shape-calibration issue: the distribution is sharply concentrated on a small portion of the head, even in tasks where we expect an exactly uniform distribution over valid tokens. Temperature scaling is therefore often used to flatten the head of the distribution, increasing the probability of valid regions outside the head. However, temperature scaling comes with a necessary caveat: although each individual invalid token retains only a small probability, the accumulated probability of invalid tokens grows quickly as the temperature rises, leading to an unwanted validity–diversity trade-off.
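The caveat above can be sketched with a toy geometric head over a 1000-token vocabulary. The assumption that exactly the top 10 tokens are valid is ours, purely for illustration:

```python
import math

def softmax(logits, T):
    m = max(logits)
    e = [math.exp((z - m) / T) for z in logits]
    s = sum(e)
    return [x / s for x in e]

# Sharp geometric logits; suppose (hypothetically) only the top 10 tokens
# out of 1000 are valid continuations.
logits = [-1.0 * k for k in range(1000)]
n_valid = 10

def invalid_mass(T):
    """Total probability assigned to the invalid tail at temperature T."""
    p = softmax(logits, T)
    return sum(p[n_valid:])

masses = {T: invalid_mass(T) for T in (0.5, 1.0, 2.0, 5.0)}
```

Each invalid token individually stays small, but the accumulated invalid mass grows monotonically with temperature, which is the mechanism behind the validity–diversity trade-off.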

While understanding how distributional miscalibration arises from the LLM training and design pipeline is an important question [[9](https://arxiv.org/html/2605.11128#bib.bib28), [5](https://arxiv.org/html/2605.11128#bib.bib49)], we focus on its implications for the validity–diversity trade-off. We emphasize that the exact geometric form is not intended as a literal empirical claim, but as an analytically convenient proxy for heavy-tailed distributions.

### 5.1 How severely does distribution shape affect the validity–diversity trade-off?

Although temperature is the most basic way of injecting randomness into decoding, in practice higher temperatures often fail to recover a broad set of valid outputs. Our goal in this subsection is to characterize this limitation by attributing the failure to distribution sharpness. In fact, we show that even if order miscalibration disappears entirely, sharp LLM conditionals still induce severe validity–diversity trade-offs. For simplicity of the proof, we impose the following assumption:

###### Assumption 5.1 (Invariant valid branching).

Assume that all valid sequences for a task have a fixed length $d$. For every valid prefix $y_{<t}$, the number of valid next-token choices depends only on the position $t$, not on the particular prefix. That is, there exist integers $v_1, \ldots, v_d$ such that for every valid prefix $y_{<t}$,

$$|G(y_{<t})| = v_t.$$
Furthermore, we define the branching length as the number of positions at which there is more than one valid continuation:

$$m := \left|\{t \in [d] : v_t \geq 2\}\right|.$$

Assumption [5.1](https://arxiv.org/html/2605.11128#S5.Thmtheorem1) fixes the generation length and removes prefix-level heterogeneity, allowing the diversity loss to be expressed in terms of the effective branching length $m$.

###### Theorem 5.2 (Validity–Diversity trade-off).

Consider a length-$d$ generation task with a valid set $V$. Suppose that, at each valid prefix, the model's ranked next-token distribution is geometrically decaying, and suppose the valid next tokens occupy the top ranks. Then any temperature-scaled distribution satisfying

$$\mathrm{Val}(p) \geq 1 - \epsilon$$
also satisfies

$$\mathrm{Div}(p) \leq e^{-m c(\epsilon)}$$
for some positive constant $c(\epsilon) > 0$, where $c(\epsilon) \to \ln 2$ as $\epsilon \to 0$.

#### Interpretation\.

The theorem shows that temperature scaling pays a local diversity price at every branching step. To achieve high validity, the distribution must be sharp enough that invalid tokens receive little mass. At the same time, this sharpness makes the conditional distribution over valid tokens non-uniform, concentrating probability on the highest-ranked valid continuations. These per-step entropy losses add up across the sequence; after exponentiating entropy to obtain diversity, they yield an exponential decay in $m$. Moreover, the rate $c(\epsilon)$ increases as the validity requirement becomes stricter. In the high-validity regime, the bound approaches $2^{-m}$, and can be stronger when many steps contain several valid continuations. We provide the formal proof in Appendix [F](https://arxiv.org/html/2605.11128#A6).
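A numeric sketch of the compounding mechanism in the interpretation above, using exponentiated Shannon entropy as the diversity measure. The per-step distribution, branching factor, and sharpness parameter are illustrative, not the paper's formal construction:

```python
import math

def geometric_step(v, lam):
    """Per-step distribution over v valid tokens, geometrically decaying
    with sharpness lam (invalid mass is ignored in this illustration)."""
    w = [math.exp(-lam * k) for k in range(v)]
    s = sum(w)
    return [x / s for x in w]

def normalized_diversity(p):
    """Exponentiated Shannon entropy divided by the support size, so a
    uniform distribution scores exactly 1."""
    H = -sum(q * math.log(q) for q in p if q > 0)
    return math.exp(H) / len(p)

# With v = 2 valid tokens per branching step, each step contributes a
# constant per-step diversity c < 1; across m independent branching
# steps, sequence-level diversity decays like c**m.
step = normalized_diversity(geometric_step(v=2, lam=2.0))
seq_div = {m: step ** m for m in (1, 3, 6)}
```

The exponential decay in the number of branching steps `m` is visible directly: doubling `m` squares the diversity loss.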

![Refer to caption](https://arxiv.org/html/2605.11128v1/x7.png)
![Refer to caption](https://arxiv.org/html/2605.11128v1/x8.png)

Figure 5: Effects of temperature scaling in random number generation on Olmo-3-7B-Instruct. (Left) Validity–diversity trade-offs for unconstrained and constrained random-number generation tasks across sequence lengths $d$. Longer generations exhibit a stronger validity–diversity trade-off. (Right) Valid sequence probabilities for the length-3 unconstrained random-number generation task.

### 5.2 Empirical Investigation

#### Controlled random generation testbed\.

Investigating the validity–diversity trade-off as a function of distribution shape requires us to compute the probability of every valid sequence, and sweeping all conditionals in open-ended generation is infeasible. To address this, we propose two tasks with known valid sets as controlled testbeds: random-number generation and naming a random US state. We consider two variants of the random-number generation task in Figure [5](https://arxiv.org/html/2605.11128#S5.F5). In the *unconstrained* setting, the model is asked to generate a length-$d$ sequence of i.i.d. digits, where each digit lies in $\{0, \dots, 9\}$; thus every sequence in $\{0, \dots, 9\}^d$ is valid, and by independence we expect the model to impose a uniform distribution on each conditional. In the *constrained* setting, the model is asked to generate a length-$d$ sequence of digits whose sum equals a specified target $N$. Appendix [C](https://arxiv.org/html/2605.11128#A3) provides further details, including exact prompts and our experimental setup.

These tasks offer several key advantages. First, we can compute the exact generation probability of each valid sequence, since the entire valid set is known. Second, by design, every valid sequence should receive the same generation probability, so any deviation from uniformity induces a systematic validity–diversity trade-off. Moreover, the unconstrained random-number generation task satisfies Assumption [5.1](https://arxiv.org/html/2605.11128#S5.Thmtheorem1), serving as a proper testbed for our theorem.

#### Sequence validity–diversity trade\-offs\.

For each candidate sequence in $V$, we feed the corresponding prefixes into the LLM and extract the next-token logits. We then apply different temperatures to these logits and compute, exactly over the known sequence space, both the probability mass assigned to the valid region and the entropy of the model's distribution conditioned on validity. This gives the validity–diversity curve induced by temperature scaling alone.
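The exact computation described above can be sketched for a tiny synthetic task. The per-step logits and valid sets below stand in for the LLM outputs and the paper's prompts, and are purely illustrative:

```python
import math
from itertools import product

def softmax(logits, T):
    m = max(logits)
    e = [math.exp((z - m) / T) for z in logits]
    s = sum(e)
    return [x / s for x in e]

def validity_diversity(step_logits, valid_tokens, T):
    """Exact sequence-level validity and exponentiated-entropy diversity
    for a small task whose valid set is the product of per-step valid
    token sets. step_logits[t] is the full next-token logit vector at
    step t; valid_tokens[t] lists the valid token indices at step t."""
    step_probs = [softmax(z, T) for z in step_logits]
    # Probability of each valid sequence = product of its per-step probs.
    seq_probs = [math.prod(step_probs[t][tok] for t, tok in enumerate(seq))
                 for seq in product(*valid_tokens)]
    validity = sum(seq_probs)
    # Entropy of the model distribution conditioned on validity.
    cond = [p / validity for p in seq_probs]
    H = -sum(q * math.log(q) for q in cond if q > 0)
    return validity, math.exp(H)

# Toy setup: 2 steps, a 12-token vocabulary with "digits" 0-9 valid at
# each step, and sharp geometric logits.
logits = [[-0.8 * k for k in range(12)]] * 2
valid = [list(range(10))] * 2
low_T = validity_diversity(logits, valid, T=1.0)
high_T = validity_diversity(logits, valid, T=5.0)
```

Raising the temperature increases the conditional-entropy diversity but lowers the valid mass, tracing out the validity–diversity curve.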

The left panel of Figure [5](https://arxiv.org/html/2605.11128#S5.F5) shows the validity–diversity frontier for random-number generation across sequence lengths $d \in \{2, 3, 4\}$. As the length increases, the frontier becomes sharper: maintaining high validity requires a larger reduction in diversity. This is consistent with Theorem [5.2](https://arxiv.org/html/2605.11128#S5.Thmtheorem2), where each branching position contributes a local entropy loss and these losses compound across the sequence. The constrained setting exhibits an even stronger trade-off, despite violating Assumption [5.1](https://arxiv.org/html/2605.11128#S5.Thmtheorem1), suggesting the broader applicability of our result.

#### Shape calibration at the sequence level.

Although we have studied the shape of the conditional distribution, its implications at the sequence level are less transparent. The right panel of Figure [5](https://arxiv.org/html/2605.11128#S5.F5) provides a sequence-level view of the temperature-scaling trade-off. We plot the probability of each valid sequence in the unconstrained length-3 task, together with the magnitude of the invalid sequence mass. The distribution is already highly concentrated: a small number of sequences receive orders-of-magnitude larger probability than the rest, while many sequences lie in a long, low-probability tail. Raising the temperature can indeed move probability mass toward this tail, increasing diversity among rare valid outputs. However, the invalid sequence mass grows even faster as temperature increases, so shifting mass toward the tail improves diversity only at the cost of reduced validity. The empirical behavior thus mirrors the theory: temperature can flatten the distribution, but it cannot selectively recover valid diversity.

### 5.3 When shape and order miscalibration interact

Top-token filtering is often applied after temperature scaling to suppress invalid tail mass. However, this does not remove the calibration problem; it couples shape calibration with order calibration. A cutoff rule retains a prefix of the ranked distribution, so to preserve valid diversity while maintaining validity, this prefix must approximate the valid-token set $G(y_{<t})$. This requires two conditions: valid tokens must be concentrated near the top of the ranked distribution, and the cutoff rule must adapt to the local boundary of the valid set. Different methods encode different boundary assumptions: top-$k$ assumes a roughly fixed support size, top-$p$ assumes a stable cumulative-mass boundary, and min-$p$ assumes a stable relative-probability gap from the top token.
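The three boundary assumptions can be sketched as the token-retention rules each method induces (a minimal sketch of the standard definitions; the probability vector is illustrative):

```python
def topk_keep(probs, k):
    """Top-k: keep the k highest-probability tokens (fixed support size)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return set(order[:k])

def topp_keep(probs, p):
    """Top-p (nucleus): keep the smallest ranked prefix whose cumulative
    mass reaches p (stable cumulative-mass boundary)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

def minp_keep(probs, r):
    """Min-p: keep tokens whose probability is at least r times the top
    token's probability (stable relative-probability gap)."""
    top = max(probs)
    return {i for i, q in enumerate(probs) if q >= r * top}

probs = [0.5, 0.25, 0.15, 0.06, 0.04]
```

Each rule draws the valid-token boundary differently, which is why they succeed or fail on different distribution shapes.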

#### Comparing validity–diversity trade\-offs\.

We test these assumptions on controlled tasks, including the random-number and random-state generation tasks, where the valid set is known exactly. Figure [6](https://arxiv.org/html/2605.11128#S5.F6.3) reports the trade-offs on both tasks; further results appear in Appendix [C](https://arxiv.org/html/2605.11128#A3). For each cutoff strategy, we sweep both temperature and the method-specific parameter, since different methods can achieve their best validity–diversity trade-off at different temperatures [[50](https://arxiv.org/html/2605.11128#bib.bib42)]. Comparing methods at a single fixed temperature can therefore be misleading: poor performance may reflect a bad choice of temperature rather than a limitation of the filtering rule itself.

#### Cutoff Oracle Strategy\.

To study the miscalibration, we include a cutoff oracle strategy in Figure [6](https://arxiv.org/html/2605.11128#S5.F6.3). Because our generation task has a known ground truth, we can directly compute the valid-set size at each generation step. At each step, the oracle cutoff rule knows only the number of valid tokens, $g_t^{\star} = |G(y_{<t})|$, and retains the top $g_t^{\star}$ ranked tokens. However, the oracle does *not* know which tokens are valid, and therefore remains a rank-based cutoff rule. On the random-state generation task, the cutoff oracle achieves the ideal point $(1.0, 1.0)$, indicating that the model is order-calibrated in this controlled setting. The remaining gap therefore reflects a failure of the methods' implicit shape assumptions: fixed top-$k$, cumulative-mass, or relative-probability thresholds do not identify the valid-token boundary. On the random-number generation task, the cutoff oracle also does not immediately achieve high diversity, showing that shape miscalibration is coupled with order miscalibration.
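The oracle-size cutoff above is a one-liner once phrased as a rank rule; the probability vectors below are illustrative, with the "miscalibrated" case placing a valid token (index 4, by assumption) outside the top ranks:

```python
def oracle_size_cutoff(probs, g_star):
    """Rank-based oracle cutoff: retain the top g* ranked tokens, where
    g* = |G(y_<t)| is the known valid-set size. The oracle knows the
    *size* of the valid set but not its membership, so it recovers all
    valid tokens only when valid tokens occupy the top ranks."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return set(order[:g_star])

# Order-calibrated case: suppose the valid tokens are {0, 1, 2} and they
# occupy the top 3 ranks; the oracle recovers them exactly.
calibrated = oracle_size_cutoff([0.4, 0.3, 0.2, 0.06, 0.04], g_star=3)
# Miscalibrated case: suppose token 4 is valid but ranked below an
# invalid token; the oracle necessarily misses it.
miscal = oracle_size_cutoff([0.4, 0.3, 0.15, 0.1, 0.05], g_star=3)
```

This illustrates why the oracle isolates shape miscalibration: any residual failure under the oracle must come from order miscalibration.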

![Refer to caption](https://arxiv.org/html/2605.11128v1/x9.png)

Figure 6: Validity–diversity Pareto frontiers for top-token filtering methods on generation tasks from Llama-3.1-8B-Instruct. (Left) Name a random US state. (Right) Unconstrained random number. Each sampling method is swept over both temperature and its own truncation parameter. The oracle-size cutoff retains the top $|G(y_{<t})|$ tokens at each prefix.

## 6 Future Work

This paper attributes the LLM validity–diversity trade-off to two distributional properties: order and shape calibration. Through empirical demonstrations and theoretical analysis, we show that miscalibration is a recurring bottleneck across model families, sizes, and controlled generation tasks. Finally, as a preliminary direction, we observe in Appendix [H](https://arxiv.org/html/2605.11128#A8) that order and shape calibration can provide insights into domains beyond open-ended diverse generation.

Several directions remain open\. First, our results suggest that future decoding methods should move beyond top\-token filtering rules\. Calibration\-aware decoders could instead incorporate auxiliary validity signals\. Second, it remains important to understand where these calibration failures come from\. Pretraining, instruction tuning, preference optimization, and safety alignment may each affect the sharpness of the distribution and the rank ordering of valid alternatives\. Studying these effects could suggest training objectives that preserve broader valid support without sacrificing quality\.

## References

- \[1\]D\. H\. Ackley, G\. E\. Hinton, and T\. J\. Sejnowski\(1985\)A learning algorithm for boltzmann machines\.Cognitive Science9\(1\),pp\. 147–169\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[2\]D\. Alihosseini, E\. Montahaei, and M\. Soleymani Baghshah\(2019\)Jointly measuring diversity and quality in text generation models\.InProceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation,pp\. 90–98\.Cited by:[§B\.4](https://arxiv.org/html/2605.11128#A2.SS4.SSS0.Px2)\.
- \[3\]A\. Banayeeanzade, Q\. Yang, D\. Fu, S\. Hong, E\. Babinsky, A\. Samuel, A\. Kumar, R\. Jia, and S\. P\. Karimireddy\(2026\)EPSVec: efficient and private synthetic data generation via dataset vectors\.External Links:2602\.21218Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p1.1)\.
- \[4\]S\. Basu, G\. S\. Ramachandran, N\. S\. Keskar, and L\. R\. Varshney\(2021\)Mirostat: a neural text decoding algorithm that directly controls perplexity\.InInternational Conference on Learning Representations,Cited by:[§5](https://arxiv.org/html/2605.11128#S5.p1.6)\.
- \[5\]H\. Chang and A\. McCallum\(2022\)Softmax bottleneck makes language models unable to represent multi\-mode word distributions\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 8048–8073\.Cited by:[§5](https://arxiv.org/html/2605.11128#S5.p3.1)\.
- \[6\]J\. Chen, C\. Gao, S\. Yuan, S\. Liu, Q\. Cai, and P\. Jiang\(2025\)DLCRec: a novel approach for managing diversity in llm\-based recommender systems\.InProceedings of the Eighteenth ACM International Conference on Web Search and Data Mining,pp\. 857–865\.Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p1.1)\.
- \[7\]J\. J\. Y\. Chung, V\. Padmakumar, M\. Roemmele, Y\. Sun, and M\. Kreminski\(2025\)Modifying large language model post\-training for diverse creative writing\.InSecond Conference on Language Modeling,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[8\]A\. Fan, M\. Lewis, and Y\. Dauphin\(2018\)Hierarchical neural story generation\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 889–898\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[9\]M\. Finlayson, J\. Hewitt, A\. Koller, S\. Swayamdipta, and A\. Sabharwal\(2024\)Closing the curious case of neural text degeneration\.InThe Twelfth International Conference on Learning Representations,Cited by:[§5](https://arxiv.org/html/2605.11128#S5.p3.1)\.
- \[10\]B\. Ghafouri\(2026\)The variance paradox: how ai reduces diversity but increases novelty\.External Links:2508\.19264Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[11\]J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, S\. Wang, K\. Zhang, Z\. Lin, B\. Zhang, L\. Ni, W\. Gao, Y\. Wang, and J\. Guo\(2026\)A survey on llm\-as\-a\-judge\.The Innovation,pp\. 101253\.Cited by:[§4\.1](https://arxiv.org/html/2605.11128#S4.SS1.SSS0.Px1.p1.6)\.
- \[12\]Y\. Guo, G\. Shang, and C\. Clavel\(2025\)Benchmarking linguistic diversity of large language models\.Transactions of the Association for Computational Linguistics13,pp\. 1507–1526\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[13\]J\. Hewitt, C\. Manning, and P\. Liang\(2022\)Truncation sampling as language model desmoothing\.InFindings of the Association for Computational Linguistics: EMNLP 2022,pp\. 3414–3427\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3),[§5](https://arxiv.org/html/2605.11128#S5.p1.6)\.
- \[14\]M\. O\. Hill\(1973\)Diversity and evenness: a unifying notation and its consequences\.Ecology54\(2\),pp\. 427–432\.Cited by:[Definition 3\.1](https://arxiv.org/html/2605.11128#S3.Thmtheorem1.p3.1)\.
- \[15\]A\. Holtzman, J\. Buys, L\. Du, M\. Forbes, and Y\. Choi\(2020\)The curious case of neural text degeneration\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3),[§5](https://arxiv.org/html/2605.11128#S5.p1.6)\.
- \[16\]M\. Ismayilzada, A\. Laverghetta Jr\., S\. A\. Luchini, R\. Patel, A\. Bosselut, L\. V\. D\. Plas, and R\. E\. Beaty\(2025\)Creative preference optimization\.InFindings of the Association for Computational Linguistics: EMNLP 2025,pp\. 9580–9609\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[17\]N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica\(2025\)LiveCodeBench: holistic and contamination free evaluation of large language models for code\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Figure 12](https://arxiv.org/html/2605.11128#A8.F12),[Figure 12](https://arxiv.org/html/2605.11128#A8.F12.3.2),[Appendix H](https://arxiv.org/html/2605.11128#A8.p2.1)\.
- \[18\]L\. Jiang, Y\. Chai, M\. Li, M\. Liu, R\. Fok, N\. Dziri, Y\. Tsvetkov, M\. Sap, and Y\. Choi\(2025\)Artificial hivemind: the open\-ended homogeneity of language models \(and beyond\)\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p1.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1)\.
- \[19\]L\. Jost\(2006\)Entropy and diversity\.Oikos113\(2\),pp\. 363–375\.Cited by:[Definition 3\.1](https://arxiv.org/html/2605.11128#S3.Thmtheorem1.p3.1)\.
- \[20\]A\. Karan and Y\. Du\(2026\)Reasoning with sampling: your base model is smarter than you think\.InThe Fourteenth International Conference on Learning Representations,Cited by:[Appendix H](https://arxiv.org/html/2605.11128#A8.p2.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[21\]R\. Karanjai, Y\. Lu, R\. Chodavarapu, L\. Xu, and W\. Shi\(2025\)Evaluating the quality of randomness and entropy in tasks supported by large language models\.External Links:2510\.12080Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1)\.
- \[22\]R\. Kirk, I\. Mediratta, C\. Nalmpantis, J\. Luketina, E\. Hambro, E\. Grefenstette, and R\. Raileanu\(2024\)Understanding the effects of RLHF on LLM generalisation and diversity\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[23\]K\. Krippendorff\(1980\)Content analysis: an introduction to its methodology\.Cited by:[§A\.2](https://arxiv.org/html/2605.11128#A1.SS2.p2.1)\.
- \[24\]W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica\(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,Cited by:[Appendix K](https://arxiv.org/html/2605.11128#A11.p1.1)\.
- \[25\]T\. Li, Y\. Zhang, P\. Yu, S\. Saha, D\. Khashabi, J\. Weston, J\. Lanchantin, and T\. Wang\(2025\)Jointly reinforcing diversity and quality in language model generations\.External Links:2509\.02534Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[26\]Z\. Li, C\. Chen, T\. Xu, Z\. Qin, J\. Xiao, Z\. Luo, and R\. Sun\(2025\)Preserving diversity in supervised fine\-tuning of large language models\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[27\]C\. Meister, T\. Pimentel, G\. Wiher, and R\. Cotterell\(2023\)Locally typical sampling\.Transactions of the Association for Computational Linguistics11\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[28\]N\. N\. Minh, A\. Baker, C\. Neo, A\. G\. Roush, A\. Kirsch, and R\. Shwartz\-Ziv\(2025\)Turning up the heat: min\-p sampling for creative and coherent LLM outputs\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p3.2),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[29\]K\. Misaki and T\. Akiba\(2026\)String seed of thought: prompting LLMs for distribution\-faithful and diverse generation\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[30\]S\. K\. Murthy, T\. Ullman, and J\. Hu\(2025\)One fish, two fish, but not the whole sea: alignment reduces language models’ conceptual diversity\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p1.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[31\]K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu\(2002\)Bleu: a method for automatic evaluation of machine translation\.InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics,pp\. 311–318\.Cited by:[§B\.4](https://arxiv.org/html/2605.11128#A2.SS4.SSS0.Px2.p1.3)\.
- \[32\]M\. Peeperkorn, T\. Kouwenhoven, D\. Brown, and A\. Jordanous\(2025\)Mind the gap: conformative decoding to improve output diversity of instruction\-tuned large language models\.External Links:2507\.20956Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[33\]E\. B\. Potraghloo, S\. Azizi, S\. Kundu, and M\. Pedram\(2026\)Top\-h decoding: adapting the creativity and coherence with bounded entropy in text generation\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[34\]R\. Schaeffer, J\. Kazdan, and Y\. Denisov\-Blanch\(2025\)Min\-p, max exaggeration: a critical analysis of min\-p sampling in language models\.External Links:2506\.13681Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1)\.
- \[35\]A\. Shypula, S\. Li, B\. Zhang, V\. Padmakumar, K\. Yin, and O\. Bastani\(2025\)Evaluating the diversity and quality of LLM generated content\.InSecond Conference on Language Modeling,Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p3.2),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1)\.
- \[36\]Z\. Sourati, F\. Karimi\-Malekabadi, M\. Ozcan, C\. McDaniel, A\. Ziabari, J\. Trager, A\. Tak, M\. Chen, F\. Morstatter, and M\. Dehghani\(2025\)The shrinking landscape of linguistic diversity in the age of large language models\.External Links:2502\.11266Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p1.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[37\]Y\. Su and N\. Collier\(2023\)Contrastive search is what you need for neural text generation\.Transactions on Machine Learning Research\.Note:Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[38\]S\. Troshin, W\. Mohammed, Y\. Meng, C\. Monz, A\. Fokkens, and V\. Niculae\(2025\)Control the temperature: selective sampling for diverse and high\-quality LLM outputs\.InSecond Conference on Language Modeling,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1)\.
- \[39\]G\. Tu, S\. Zhang, T\. Zhang, Y\. Zhang, and D\. Yang\(2026\)Shared nature, unique nurture: prism for pluralistic reasoning via in\-context structure modeling\.External Links:2602\.21317Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p1.1)\.
- \[40\]A\. Vijayakumar, M\. Cogswell, R\. Selvaraju, Q\. Sun, S\. Lee, D\. Crandall, and D\. Batra\(2018\-Apr\.\)Diverse beam search for improved description of complex scenes\.Proceedings of the AAAI Conference on Artificial Intelligence\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[41\]Q\. Wang, S\. Pan, T\. Linzen, and E\. Black\(2025\)Multilingual prompting for improving LLM generation diversity\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 6367–6389\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[42\]Y\. Wang, C\. Yang, T\. Huang, M\. Chen, J\. May, and M\. Lee\(2025\)Optimizing diversity and quality through base\-aligned model collaboration\.External Links:2511\.05650Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[43\]P\. West and C\. Potts\(2025\)Base models beat aligned models at randomness and creativity\.InSecond Conference on Language Modeling,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[44\]W\. Xu, N\. Jojic, S\. Rao, C\. Brockett, and B\. Dolan\(2025\)Echoes in ai: quantifying lack of plot diversity in llm outputs\.Proceedings of the National Academy of Sciences\.Cited by:[§1](https://arxiv.org/html/2605.11128#S1.p1.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[45\]C\. Yang, S\. Li, and A\. Holtzman\(2026\)LLM probability concentration: how alignment shrinks the generative horizon\.External Links:2506\.17871Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1),[Definition 3\.1](https://arxiv.org/html/2605.11128#S3.Thmtheorem1.p3.1)\.
- \[46\]S\. Yu, A\. Jabbar, R\. Hawkins, D\. Jurafsky, and M\. Cheng\(2025\)Generation space size: understanding and calibrating open\-endedness of llm generations\.External Links:2510\.12699Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1)\.
- \[47\]Y\. Yue, Z\. Chen, R\. Lu, A\. Zhao, Z\. Wang, Y\. Yue, S\. Song, and G\. Huang\(2026\)Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[Appendix H](https://arxiv.org/html/2605.11128#A8.p1.1),[§1](https://arxiv.org/html/2605.11128#S1.p1.1)\.
- \[48\]J\. Zhang, S\. Yu, D\. Chong, A\. Sicilia, M\. R\. Tomz, C\. D\. Manning, and W\. Shi\(2025\)Verbalized sampling: how to mitigate mode collapse and unlock llm diversity\.External Links:2510\.01171Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.
- \[49\]Y\. Zhang, H\. Diddee, S\. Holm, H\. Liu, X\. Liu, V\. Samuel, B\. Wang, and D\. Ippolito\(2025\)NoveltyBench: evaluating creativity and diversity in language models\.InSecond Conference on Language Modeling,Cited by:[§B\.1](https://arxiv.org/html/2605.11128#A2.SS1.p1.1),[§1](https://arxiv.org/html/2605.11128#S1.p1.1),[§1](https://arxiv.org/html/2605.11128#S1.p2.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.11128#S4.SS1.SSS0.Px3.p1.2)\.
- \[50\]Y\. Zhou, M\. Keuper, and M\. Fritz\(2025\)Balancing diversity and risk in LLM sampling: how to select your method and parameter for open\-ended text generation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 26352–26365\.Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px2.p1.1),[§5\.3](https://arxiv.org/html/2605.11128#S5.SS3.SSS0.Px1.p1.1)\.
- \[51\]W\. Zhu, H\. Hao, Z\. He, Y\. Ai, and R\. Wang\(2024\)Improving open\-ended text generation via adaptive decoding\.InForty\-first International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2605.11128#S2.SS0.SSS0.Px3.p1.3)\.

## Appendix A LLM-as-a-judge Details

### A.1 Prompt and Model

Token validity is scored based on the greedy-decoding completion. We use Qwen3.5-35B-A3B-FP8 with thinking enabled as our judge. The prompts used are documented below.

`LLM\-as\-a\-judge Prompts`

`A\.2 Evaluating LLM\-as\-a\-judge To assess the inter\-annotator reliability, three authors independently annotated 100 generations using the rubric in Appendix A\.1\. We compute inter\-annotator agreement on the overall validity scores using Krippendorff’s alpha \[23\], obtaining α=0\.759\\alpha=0\.759\. This indicates reasonably strong agreement among annotators, suggesting that the rubric yields consistent human judgments\. To contrast our LLM\-based evaluation against human judgment, we compared Qwen3\.5\-35B\-A3B\-FP8 predictions against human labels from different annotators for 100 generations\. We use a threshold of 99 on grammar, semantic and overall score\. We treat the judge model score above this threshold as a positive prediction \(valid token\) and scores below the threshold as negative predictions \(invalid tokens\)\. Under this rule, Qwen3\.5\-35B\-A3B\-FP8 achieves 80\.9080\.90% accuracy and 79\.6079\.60% F1 when compared to humans\. We also compare GPT\-5\.4 against Qwen3\.5\-35B\-A3B\-FP8 under the same rule, obtaining 73\.4073\.40% accuracy and 61\.4561\.45% F1\. These results suggest that our LLM\-based evaluation can approximate human judgment and GPT\-5\.4\. A\.3 Greedy Decoding Robustness Our token\-validity annotation procedure uses greedy decoding after forcing a candidate token, treating the validity of the resulting completion as a proxy for whether that token preserves access to a valid continuation\. One concern is that a token deemed invalid under greedy decoding may still admit a valid sampled continuation, and vice versa\. To test this, Figure 7 reports an ablation in which we replace the greedy continuation with model suggested sampling parameters when constructing token\-level validity labels\. We sample 10 continuations for each token, and then recompute the resulting precision–recall curves under the same cutoff strategies\. 
The curves are nearly unchanged, indicating that the estimated precision and recall are not driven by the particular choice of greedy decoding\. This supports using greedy continuation as a computationally efficient proxy for token validity in the main experiments\. Figure 7: Robustness of token\-validity estimates to the continuation procedure\. We compare local precision–recall curves computed from token labels obtained by greedy decoding after each forced candidate token with labels obtained from stochastic sampling\. The resulting curves nearly overlap across cutoff indices, indicating that our estimated precision–recall trade\-off is not sensitive to using greedy decoding as the continuation procedure\. Appendix B Order Calibration Trade\-offs B\.1 Generation Setup We use a variety of LLMs in our experiments, each documented in the subsection below\. We randomly select 5 open\-ended generation categories from NoveltyBench, ranging from story\-telling, joke\-telling, poems, and item selection \[49\]\. B\.2 Sweep Generation Experiment Experiments were repeated 10 times for reproducibility\. The cost of exhaustive sweeping grows as NdN^\{d\}, where dd is the sweeping depth and NN is the number of candidate tokens evaluated at each conditional\. To keep the oracle evaluation computationally tractable, we sweep only three consecutive conditionals\. At each conditional, we consider candidates up to rank 10001000 and subsample every tenth token, yielding N=100N=100 candidates per conditional and 1003100^\{3\} evaluated branches in total\. In Table 2, we provide examples of tokens swept in the story generation task\. We sweep up to depth 3 from randomly selected starting token positions\. All examples are generated with Qwen3\.5\-35B\. Table 2: Generation examples from the creative story generation task\. The sweeping is performed on the second token after “In”\. 
For each candidate token, we force the model to continue from the prefix with that token, greedily decode the remaining sequence, and evaluate the resulting completion with the judge.

B.3 Model Size Experiments

Figure 8: Detailed view of Figure 4. We evaluate models across 3 families with various sizes. Diamonds represent pre-trained-only models; circles represent post-trained Qwen models; squares represent post-trained Olmo models; triangles represent post-trained Llama models.

Figure 8 provides a detailed view of the models used for our experiment.

B.4 Oracle Sampling Experiments

For all our oracle generations, we only sweep the first two decoding steps. We sweep the first conditional to obtain 1000 valid tokens. We then uniformly sample from these 1000 tokens to obtain the second-step conditionals and repeat the sweep once. In Table 3, we report the parameter ranges we swept for each cutoff strategy. For each parameter, we sample 1000 generations to compute the semantic and lexical diversity. While there are multiple methods for measuring semantic and lexical diversity, we select embedding diversity and Self-BLEU as representatives.

Table 3: Parameter grids used for decoding sweeps. For each parameter setting, we sample 1000 generations to report semantic and lexical diversity.

Embedding Diversity. For each generated sequence, we obtain an embedding vector $\mathbf{e}_i$ from Qwen3-Embedding-8B. We compute pairwise cosine distances and define embedding diversity as
$$\mathrm{Embedding\ Diversity} = \frac{2}{n(n-1)} \sum_{i<j} \left(1 - \cos(\mathbf{e}_i, \mathbf{e}_j)\right).$$
The range is between $[0,1]$, with higher values indicating higher semantic diversity.
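The embedding-diversity formula above transcribes directly into code; here random Gaussian vectors stand in for the Qwen3-Embedding-8B embeddings used in the paper:

```python
# Mean pairwise cosine distance: 2/(n(n-1)) * sum_{i<j} (1 - cos(e_i, e_j)).
import math, random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def embedding_diversity(embs):
    n = len(embs)
    total = sum(1 - cosine(embs[i], embs[j])
                for i in range(n) for j in range(i + 1, n))
    return 2 * total / (n * (n - 1))

random.seed(0)
# Stand-in embeddings; the paper uses Qwen3-Embedding-8B vectors instead.
embs = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
d = embedding_diversity(embs)
assert 0.0 <= d <= 2.0  # cosine distance lies in [0, 2] for general vectors
```

Identical generations give a diversity of 0; orthogonal embeddings give 1.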
Self-BLEU [2]. Given generations $\{y_1, \dots, y_n\}$ for a single task, we compute
$$\mathrm{Self\text{-}BLEU} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{BLEU}\left(y_i, \{y_j\}_{j \neq i}\right) \in [0,1],$$
where BLEU [31] measures the $n$-gram overlap between a candidate output and a set of reference outputs. In Self-BLEU, lower values indicate higher lexical diversity. We use $n = 4$ in our experiments.

Appendix C Shape Calibration Trade-offs

C.1 Random Number Generation

Random Number Generation Prompts

Figure 9: Validity–Diversity trade-off frontiers for constrained and unconstrained random number generation across 4 model families.

Figure 9 shows the validity–diversity trade-off frontiers across all four model families. The cutoff oracle helps disentangle the effects of shape and order calibration. In the unconstrained setting, the oracle often approaches the ideal point $(1.0, 1.0)$, suggesting that the main gap stems from standard top-token filters' inability to infer the correct support size. In contrast, in the sum-constrained setting, the oracle itself remains separated from $(1.0, 1.0)$, indicating stronger order miscalibration. The additional gap between standard top-token filtering methods and the oracle reflects shape miscalibration, since fixed top-$k$, cumulative-mass, and relative-probability thresholds do not reliably recover the correct valid-token boundary. Together, these results show that shape and order miscalibration jointly contribute to the validity–diversity trade-off, with their interaction becoming more pronounced under compositional constraints.

C.2 Random State Generation

Name a Random State Prompt

Figure 10 shows sequence-level probabilities under different temperatures on the random state generation task.
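The qualitative effect of temperature here can be illustrated with a toy next-token distribution (the logits are hypothetical, not model outputs): raising $T$ flattens the valid head, but it also leaks probability mass into the invalid tail.

```python
import math

def softmax_T(logits, T):
    """Temperature-scaled softmax over a list of logits."""
    exps = [l and math.exp(l / T) or math.exp(l / T) for l in logits]
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# First 3 outcomes are "valid" (sharp head); the remaining 5 form an
# "invalid" tail, mimicking a concentrated, heavy-tailed distribution.
logits = [5.0, 3.0, 2.5] + [0.0] * 5

def invalid_mass(T):
    return sum(softmax_T(logits, T)[3:])

def valid_entropy(T):
    """Entropy over the renormalized valid head (a proxy for valid diversity)."""
    head = softmax_T(logits, T)[:3]
    s = sum(head)
    return -sum(p / s * math.log(p / s) for p in head)

# Higher temperature flattens the valid head (more diversity) but also
# shifts probability mass into the invalid region (less validity).
assert valid_entropy(1.0) < valid_entropy(2.0) < valid_entropy(4.0)
assert invalid_mass(1.0) < invalid_mass(2.0) < invalid_mass(4.0)
```

Both monotone trends hold simultaneously, which is exactly the validity–diversity tension that temperature sweeps trace out.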
As in Figure 5, the distribution is highly concentrated and heavy-tailed: a small number of valid sequences receive much larger probability than the rest. Increasing temperature flattens the valid-sequence distribution, but it also shifts substantial probability mass into the invalid region before the model approaches a uniform distribution over valid states. Thus, temperature scaling again improves valid diversity only by sacrificing validity.

Figure 10: Sequence probability for the random state generation task. Sequences are sorted by probability, and probabilities are plotted in log-space. The "tail" section represents the total probability mass of the invalid region.

Appendix D Formal Definitions of Calibration

A decoding rule achieves high validity if it assigns high probability only to valid tokens at each time step, and it achieves high diversity if it explores many distinct tokens in $G$ rather than concentrating on only a few of them. Therefore, we define order and shape calibration as follows:

Definition D.1 (Calibration). Given $y_{<t} \in \mathcal{V}^{t-1}$ and an LLM conditional distribution $p$,

1. (Order Calibration) $p$ is order calibrated if for any valid token $v \in \mathcal{V}$ and invalid token $w \in \mathcal{V}$, it assigns a higher probability to the valid token, i.e., $p(v \mid y_{<t}) \geq p(w \mid y_{<t})$.

2. (Shape Calibration) $p$ is shape calibrated if for any token $v \in \mathcal{V}$, it assigns probability mass to $v$ in proportion to the number of valid continuations starting with $v$, i.e., $p(v \mid y_{<t}) \propto N(y_{<t} \circ v)$.

Note that shape calibration is stronger than order calibration: even when order calibration holds, shape miscalibration can still persist, so achieving perfect shape calibration is strictly harder.
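The two conditions of Definition D.1 can be checked mechanically on a toy conditional distribution; the probabilities and continuation counts below are made up for illustration:

```python
# Toy check of Definition D.1 over a 5-token vocabulary.
def order_calibrated(p, valid):
    """Every valid token's probability is at least every invalid token's."""
    invalid = set(p) - set(valid)
    return all(p[v] >= p[w] for v in valid for w in invalid)

def shape_calibrated(p, counts, tol=1e-9):
    """p(v) proportional to the number of valid continuations N(y ∘ v)."""
    total_p = sum(p[v] for v in counts)
    total_n = sum(counts.values())
    return all(abs(p[v] / total_p - counts[v] / total_n) < tol for v in counts)

p = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.06, "e": 0.04}
valid = ["a", "b", "c"]
counts = {"a": 3, "b": 2, "c": 1}  # hypothetical continuation counts N(y ∘ v)

assert order_calibrated(p, valid)       # valid tokens all outrank invalid ones
assert not shape_calibrated(p, counts)  # but mass is not proportional to counts
```

This distribution illustrates the distinction drawn above: it is order calibrated yet shape miscalibrated, since its head over valid tokens does not match the continuation counts.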
Our notion of "calibration" is not directly related to the LLM confidence-calibration literature.

Appendix E Analysis of Theorem 4.2

In this section, we provide a formal analysis of how local truncation decisions affect sequence-level behavior. We restate the key definitions for completeness and give full proofs of the results in Section 4.

E.1 Setup and notation

Fix a valid set $V \subseteq \mathcal{V}^d$. At each decoding step $t$, a rule retains a nonempty subset $S_t(y_{<t}) \subseteq \mathcal{V}$ and samples
$$Y_t \sim \mathrm{Unif}(S_t(Y_{<t})).$$
Let $Q_S$ denote the induced distribution over full sequences. Let $G(y_{<t})$ denote the set of valid next tokens, and let $N(y_{<t})$ denote the number of valid completions extending $y_{<t}$.

Local precision and recall. For a prefix $y_{<t}$, define
$$\mathrm{Prec}_t(S; y_{<t}) := \frac{|S_t(y_{<t}) \cap G(y_{<t})|}{|S_t(y_{<t})|}, \qquad \mathrm{Rec}_t(S; y_{<t}) := \frac{\sum_{v \in S_t(y_{<t}) \cap G(y_{<t})} N(y_{<t} \circ v)}{N(y_{<t})}.$$

Sequence-level precision and recall.
$$\mathrm{Prec}_{\mathrm{seq}}(S) = Q_S(Y \in V), \qquad \mathrm{Rec}_{\mathrm{seq}}(S) = \frac{|\{y \in V : y_t \in S_t(y_{<t})\ \forall t\}|}{|V|}.$$

E.2 Multiplicative decomposition

Theorem E.1 (Exact multiplicative decomposition).
$$\mathrm{Prec}_{\mathrm{seq}}(S) = \prod_{t=1}^{d} \alpha_t(S), \qquad \mathrm{Rec}_{\mathrm{seq}}(S) = \prod_{t=1}^{d} \beta_t(S),$$
where
$$\alpha_t(S) := \mathbb{E}_{Q_S}\left[\mathrm{Prec}_t(S; Y_{<t}) \mid Y_s \in G(Y_{<s})\ \forall s < t\right]$$
and
$$\beta_t(S) := \mathbb{E}_{U_V}\left[\mathrm{Rec}_t(S; Y_{<t}) \mid Y_s \in S_s(Y_{<s})\ \forall s < t\right].$$

Proof. Define
$$F_t := \{Y_s \in G(Y_{<s}) \text{ for all } s \leq t\}.$$
A sequence is valid if and only if it preserves access to a valid continuation at every step, hence $\{Y \in V\} = F_d$. By the chain rule,
$$\mathrm{Prec}_{\mathrm{seq}}(S) = Q_S(F_d) = \prod_{t=1}^{d} Q_S(F_t \mid F_{t-1}).$$
Conditioned on $F_{t-1}$ and $Y_{<t}$, the next token is sampled uniformly from $S_t(y_{<t})$, yielding
$$Q_S(F_t \mid F_{t-1}, Y_{<t}) = \mathrm{Prec}_t(S; y_{<t}).$$
Taking expectations gives $\alpha_t(S)$. For recall, let $U_V$ denote the uniform distribution over $V$ and define
$$E_t := \{Y_s \in S_s(Y_{<s}) \text{ for all } s \leq t\}.$$
Then
$$\mathrm{Rec}_{\mathrm{seq}}(S) = U_V(E_d) = \prod_{t=1}^{d} U_V(E_t \mid E_{t-1}).$$
Conditioned on $Y_{<t}$, the next token under $U_V$ is distributed proportionally to continuation counts, yielding $\mathrm{Rec}_t(S; y_{<t})$.
Taking expectations gives $\beta_t(S)$. ∎

Local trade-off view. The multiplicative decomposition shows that sequence-level precision and recall are governed by accumulated local log-losses,
$$u_t(S) := -\log \alpha_t(S), \qquad v_t(S) := -\log \beta_t(S).$$
Thus, maintaining high sequence-level precision imposes a small total budget on the precision losses $\sum_t u_t(S)$. Theorem E.3 formalizes the consequence: if many steps necessarily incur nontrivial recall loss whenever their precision loss is small, then sequence-level recall must decay exponentially.

E.3 Compounding effect of hard decoding steps

Formalization of Theorem 4.2. The main-text theorem states that if many decoding steps incur an unavoidable loss in recall when enforcing high precision, then sequence-level recall decays exponentially. We now formalize this statement.

Definition E.2 ($(\eta, \rho)$-hard step). Fix $\eta, \rho > 0$. A step $t$ is $(\eta, \rho)$-hard if, for every decoding rule $S$,
$$u_t(S) \leq \eta \quad \Longrightarrow \quad v_t(S) \geq \rho.$$
Equivalently, whenever the local precision loss at step $t$ is at most $\eta$, the local recall loss is at least $\rho$.

Theorem E.3 (Compounding effect of hard decoding steps). Suppose at least $m$ decoding steps are $(\eta, \rho)$-hard. Then any decoding rule satisfying $\mathrm{Prec}_{\mathrm{seq}}(S) \geq 1 - \delta$ also satisfies
$$\mathrm{Rec}_{\mathrm{seq}}(S) \leq \exp\left(-\rho \left(m - \frac{-\log(1-\delta)}{\eta}\right)_+\right).$$

Proof.
By the multiplicative decomposition,
$$\sum_{t=1}^{d} u_t(S) = -\log \mathrm{Prec}_{\mathrm{seq}}(S) \leq -\log(1-\delta).$$
Among the $m$ hard steps, fewer than $-\log(1-\delta)/\eta$ steps can have $u_t(S) > \eta$; otherwise the total precision loss would exceed the budget. Therefore at least
$$\left(m - \frac{-\log(1-\delta)}{\eta}\right)_+$$
hard steps satisfy $u_t(S) \leq \eta$. For each such step, hardness implies $v_t(S) \geq \rho$. Hence
$$-\log \mathrm{Rec}_{\mathrm{seq}}(S) = \sum_{t=1}^{d} v_t(S) \geq \rho \left(m - \frac{-\log(1-\delta)}{\eta}\right)_+.$$
Exponentiating gives the result. ∎

Interpretation. Maintaining high sequence-level precision imposes a constant total precision-loss budget across all decoding steps. As the sequence length grows, most steps must operate in a near-perfect regime. If many such steps still incur a fixed recall loss, these losses accumulate multiplicatively, causing an exponential decay in the set of reachable valid sequences.

Appendix F Analysis of Theorem 5.2

The main point is simple: if a model must place very high probability on valid tokens at each step, then its next-token distribution must become sharper. But sharper distributions are less uniform over the valid choices, which reduces validity-conditioned diversity. Since entropy losses add over sequence positions, the diversity loss compounds with length.

Definition F.1 (Discrete geometric ranked model). Fix a temperature $T > 0$.
At each valid prefix $y_{<t}$, assume that after sorting tokens by decreasing probability, the next-token distribution is geometric in rank:
$$P_t^{(T)}(i \mid y_{<t}) = (1 - q_t) q_t^i, \qquad i = 0, 1, 2, \dots, |\mathcal{V}| - 1,$$
where
$$q_t = \exp(-\lambda_t / T), \qquad \lambda_t > 0.$$
Equivalently, $P_t^{(T)}(i \mid y_{<t}) \propto \exp(-\lambda_t i / T)$. For each position $t$, define the normalized sharpness
$$z_t := \frac{\lambda_t v_t}{T}.$$
This quantity measures how much the ranked distribution decays across the valid interval $\{0, \dots, v_t - 1\}$. For $v \in \mathbb{N}_+$ and $a > 0$, define $H_v(a)$ as the entropy of the tilted distribution
$$p_{v,a}(i) = \frac{\exp(-ai/v)}{\sum_{j=0}^{v-1} \exp(-aj/v)}, \qquad i = 0, \dots, v-1.$$
When $a = 0$, this distribution is uniform over $v$ tokens and has entropy $\ln v$. When $a > 0$, it is tilted toward smaller ranks, so its entropy is smaller. Define the per-step entropy loss
$$\Delta_v(a) := \ln v - H_v(a).$$
This measures how much diversity is lost, at one step, relative to being uniform over the $v$ valid choices.

Definition F.2 (Diversity). Let $Y \sim P^{(T)}(\cdot \mid x)$, and let $V$ be the set of valid sequences. Define
$$\mathrm{Div}(P^{(T)}) := \frac{\exp(H(Y \mid Y \in V))}{|V|}.$$
This quantity equals 1 when the model is uniform over valid sequences after conditioning on validity. It decreases when the conditional distribution over valid sequences becomes more concentrated.

Lemma F.3 (Entropy loss increases with sharpness).
For every fixed $v$, the entropy loss $\Delta_v(a) = \ln v - H_v(a)$ is nondecreasing in $a$. Moreover, if $v \geq 2$ and $a > 0$, then $\Delta_v(a) > 0$.

Proof. Write $\theta = a/v$ and
$$Z(\theta) = \sum_{j=0}^{v-1} e^{-\theta j}.$$
For $p_\theta(i) \propto e^{-\theta i}$,
$$H_v(a) = \log Z(\theta) + \theta \, \mathbb{E}_\theta[i].$$
Hence
$$\frac{d}{d\theta} H_v(a) = \theta \frac{d}{d\theta} \mathbb{E}_\theta[i] = -\theta \operatorname{Var}_\theta(i) \leq 0.$$
Since $\theta = a/v$, $H_v(a)$ is nonincreasing in $a$, so $\Delta_v(a) = \log v - H_v(a)$ is nondecreasing. If $v \geq 2$ and $a > 0$, then $p_{v,a}$ is nonuniform, so $H_v(a) < \log v$ and $\Delta_v(a) > 0$. ∎

Theorem F.4 (Discrete validity–diversity trade-off). Assume the discrete geometric ranked model and invariant valid branching. Let $L := \ln \epsilon^{-1}$. If $\mathrm{Val}(P^{(T)}) \geq 1 - \epsilon$, then
$$\mathrm{Div}(P^{(T)}) \leq \exp\left(-\sum_{t=1}^{d} \Delta_{v_t}(L)\right).$$
Equivalently,
$$\mathrm{Div}(P^{(T)}) \leq \exp\left(-\sum_{t=1}^{d} \left(\ln v_t - H_{v_t}(\ln \epsilon^{-1})\right)\right).$$
In particular, stricter validity requirements force a smaller upper bound on validity-conditioned diversity: as $\epsilon$ decreases, $L = \ln \epsilon^{-1}$ increases, and each entropy-loss term $\Delta_{v_t}(L)$ increases.

Proof. At position $t$, the valid next tokens are the first $v_t$ ranked tokens.
Therefore, the local probability of choosing a valid next token is
$$P_t^{(T)}(Y_t \in G_t(y_{<t}) \mid y_{<t}) = \sum_{i=0}^{v_t - 1} (1 - q_t) q_t^i = 1 - q_t^{v_t}.$$
Since $q_t = \exp(-\lambda_t / T)$, we have $q_t^{v_t} = \exp(-\lambda_t v_t / T) = e^{-z_t}$, so the local validity probability is $1 - e^{-z_t}$. By invariant valid branching, the full-sequence validity factorizes:
$$\mathrm{Val}(P^{(T)}) = \prod_{t=1}^{d} (1 - e^{-z_t}).$$
Suppose $\mathrm{Val}(P^{(T)}) \geq 1 - \epsilon$. Since the product is no larger than any individual factor, for every $t$,
$$1 - \epsilon \leq \prod_{s=1}^{d} (1 - e^{-z_s}) \leq 1 - e^{-z_t}.$$
Therefore $e^{-z_t} \leq \epsilon$, which implies $z_t \geq \ln \epsilon^{-1} = L$. Now condition on the event that the generated sequence is valid. At position $t$, conditioned on choosing one of the $v_t$ valid tokens, the rank distribution is
$$p_{v_t, z_t}(i) = \frac{\exp(-z_t i / v_t)}{\sum_{j=0}^{v_t - 1} \exp(-z_t j / v_t)}, \qquad i = 0, \dots, v_t - 1.$$
Its entropy is $H_{v_t}(z_t)$. By the chain rule for entropy,
$$H(Y \mid Y \in V) = \sum_{t=1}^{d} H_{v_t}(z_t).$$
Also, by invariant valid branching, $|V| = \prod_{t=1}^{d} v_t$.
Therefore
$$\mathrm{Div}(P^{(T)}) = \frac{\exp(H(Y \mid Y \in V))}{|V|} = \exp\left(\sum_{t=1}^{d} H_{v_t}(z_t) - \sum_{t=1}^{d} \ln v_t\right) = \exp\left(-\sum_{t=1}^{d} \left(\ln v_t - H_{v_t}(z_t)\right)\right).$$
By Lemma F.3, entropy loss is nondecreasing in sharpness. Since $z_t \geq L$, we have
$$\ln v_t - H_{v_t}(z_t) \geq \ln v_t - H_{v_t}(L) = \Delta_{v_t}(L).$$
Plugging this into the previous display gives
$$\mathrm{Div}(P^{(T)}) \leq \exp\left(-\sum_{t=1}^{d} \Delta_{v_t}(L)\right).$$
This proves the desired bound. Finally, because each $\Delta_{v_t}(L)$ is nondecreasing in $L$, and because $L = \ln \epsilon^{-1}$ increases as $\epsilon$ decreases, stricter validity requirements force a smaller upper bound on validity-conditioned diversity. ∎

Corollary F.5 (Exponential diversity loss in branching length). Let $m := |\{t \in [d] : v_t \geq 2\}|$ be the branching length, and define
$$c_{\mathcal{V}}(\epsilon) := \min_{2 \leq v \leq |\mathcal{V}|} \Delta_v(\ln \epsilon^{-1}).$$
For every $\epsilon \in (0, 1)$, we have $c_{\mathcal{V}}(\epsilon) > 0$ and
$$\mathrm{Div}(P^{(T)}) \leq \exp\left(-m\, c_{\mathcal{V}}(\epsilon)\right).$$
In particular, if every position has at least two valid next-token choices, then $m = d$, and validity-conditioned diversity decays exponentially in sequence length.

Proof.
Theorem F.4 gives
$$\mathrm{Div}(P^{(T)}) \leq \exp\left(-\sum_{t=1}^{d} \Delta_{v_t}(\ln \epsilon^{-1})\right).$$
Terms with $v_t = 1$ contribute no branching diversity. For every term with $v_t \geq 2$,
$$\Delta_{v_t}(\ln \epsilon^{-1}) \geq c_{\mathcal{V}}(\epsilon).$$
There are $m$ such terms, so
$$\sum_{t=1}^{d} \Delta_{v_t}(\ln \epsilon^{-1}) \geq m\, c_{\mathcal{V}}(\epsilon).$$
Finally, $c_{\mathcal{V}}(\epsilon) > 0$ because it is the minimum of finitely many strictly positive entropy losses. This proves the claim. ∎

Corollary F.6 (Two regimes of the diversity loss). Let $L := \ln \epsilon^{-1}$, and define
$$c_{\mathcal{V}}(\epsilon) := \min_{2 \leq v \leq |\mathcal{V}|} \Delta_v(L).$$
Then
$$\mathrm{Div}(P^{(T)}) \leq \exp\left(-m\, c_{\mathcal{V}}(\epsilon)\right).$$
Moreover, $c_{\mathcal{V}}(\epsilon)$ has the following two regimes:
$$c_{\mathcal{V}}(\epsilon) = \frac{1}{32} L^2 + O(L^3) \quad \text{as } L \to 0, \qquad c_{\mathcal{V}}(\epsilon) \to \ln 2 \quad \text{as } L \to \infty.$$
Equivalently, in the weak-validity regime,
$$\mathrm{Div}(P^{(T)}) \leq \exp\left(-m \left(\frac{1}{32} L^2 + O(L^3)\right)\right),$$
while in the stringent-validity regime,
$$\mathrm{Div}(P^{(T)}) \leq \exp\left(-m (\ln 2 - o(1))\right) = 2^{-m + o(m)}.$$

Proof.
The first inequality follows directly from Corollary F.5 with $c(\epsilon)$ replaced by the finite-vocabulary minimum $c_{\mathcal{V}}(\epsilon)$. We now prove the two asymptotic regimes. Recall that
$$\Delta_v(L) = D_{\mathrm{KL}}\left(p_{v,L} \,\middle\|\, U_v\right),$$
where $U_v$ is uniform on $\{0, \dots, v-1\}$ and $p_{v,L}(i) \propto e^{-Li/v}$. For $L \to 0$, $p_{v,L}$ is a small exponential tilt of $U_v$. The standard second-order expansion of KL divergence gives
$$\Delta_v(L) = \frac{L^2}{2} \operatorname{Var}_{U_v}\left(\frac{i}{v}\right) + O(L^3).$$
Since
$$\operatorname{Var}_{U_v}\left(\frac{i}{v}\right) = \frac{v^2 - 1}{12 v^2},$$
we obtain
$$\Delta_v(L) = \frac{v^2 - 1}{24 v^2} L^2 + O(L^3).$$
The coefficient $\frac{v^2 - 1}{24 v^2}$ is increasing in $v \geq 2$, so the minimum over $2 \leq v \leq |\mathcal{V}|$ is attained at $v = 2$. Hence
$$c_{\mathcal{V}}(\epsilon) = \Delta_2(L) = \frac{1}{32} L^2 + O(L^3).$$
For $L \to \infty$, fix any finite $v$. Then $p_{v,L}$ concentrates on the top-ranked token, so
$$H_v(L) \to 0, \qquad \Delta_v(L) = \ln v - H_v(L) \to \ln v.$$
Because the minimum is over the finite set $\{2, \dots, |\mathcal{V}|\}$, we may pass the limit through the minimum:
$$c_{\mathcal{V}}(\epsilon) = \min_{2 \leq v \leq |\mathcal{V}|} \Delta_v(L) \to \min_{2 \leq v \leq |\mathcal{V}|} \ln v = \ln 2.$$
Substituting these two asymptotics into $\mathrm{Div}(P^{(T)}) \leq \exp(-m\, c_{\mathcal{V}}(\epsilon))$ gives the claimed diversity bounds. ∎

Interpretation. The result shows that high validity requires every local invalid-token probability to be small. In the geometric ranked model, this forces large normalized sharpness $z_t$. After conditioning on validity, the distribution over the $v_t$ valid choices is therefore tilted rather than uniform, causing an entropy loss $\Delta_{v_t}(z_t)$ at each branching position. Since entropy losses add over positions, the exponentiated diversity decays multiplicatively, yielding $\mathrm{Div}(P^{(T)}) \leq \exp(-m\, c_{\mathcal{V}}(\epsilon))$.

Appendix G Experiments with Production-Level Models

In Figure 11, we prompted GPT-5.5 with the prompt "Name a random city in the world", using the default thinking level and temperature. The model's responses almost always lie in a limited set of a few cities.

Figure 11: Prompting GPT-5.5 to randomly name a city in the world. The vast majority of answers, with or without user chat history, collapse to "Valparaíso, Chile." This shows a strong collapse in diversity.

Appendix H Case Study: Calibration in Coding

We examine the distributional properties of LLMs in a simple coding setting. Although diversity is not inherently the primary objective in coding tasks, prior work has shown that greedy decoding is often suboptimal. Instead, Yue et al. [47] advocate evaluating models using the pass@k metric, which measures whether at least one of $k$ sampled solutions correctly solves the task. This highlights the importance of effective sampling even in domains where a single correct answer suffices. Figure 12 illustrates sequence-level probability distributions on a coding task from LiveCodeBench [17].
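The pass@k metric mentioned above is commonly computed with the unbiased estimator of Chen et al. (2021): given $n$ samples of which $c$ are correct, pass@k $= 1 - \binom{n-c}{k}/\binom{n}{k}$. A sketch using the numerically stable product form (the sample counts below are hypothetical):

```python
def pass_at_k(n, c, k):
    """Probability that at least one of k samples (drawn without replacement
    from n, of which c are correct) solves the task."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    result = 1.0
    for i in range(n - c + 1, n + 1):
        result *= 1 - k / i  # stable form of C(n-c, k) / C(n, k)
    return 1 - result

assert pass_at_k(10, 10, 1) == 1.0            # every sample correct
assert abs(pass_at_k(10, 1, 1) - 0.1) < 1e-12  # one correct in ten
assert pass_at_k(10, 0, 5) == 0.0             # no correct samples
```

Under order and shape miscalibration, the correct solutions occupy low-probability ranks, so the empirical $c$ stays small even as sampling increases, which is why higher pass@k remains hard to reach in practice.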
We observe clear evidence of both order and shape miscalibration. Valid solutions are interspersed with invalid ones throughout the ranked distribution [20], indicating poor order calibration. At the same time, probability mass is unevenly concentrated across valid solutions, with a small subset dominating the distribution, reflecting shape miscalibration. These results show that our framework helps diagnose model failures beyond diversity alone, providing a lens to understand why sampling-based improvements such as higher pass@k remain difficult to achieve in practice.

Figure 12: Sequence-level probability distribution for a coding task from LiveCodeBench [17], with sequences sorted by log-probability. Valid and invalid solutions are intermixed across the ranking (order miscalibration), and probability mass is concentrated on a small subset of valid solutions (shape miscalibration).

Appendix I Empirical Analysis of Logits

To investigate the logit distribution, we fit the logits at each generation conditional to a piecewise model, defined as
$$f(k) = \begin{cases} mk + b, & k \leq c, \\ A + B \log(k + C), & k > c, \end{cases}$$
where $k$ is the logit index and $f(k)$ the corresponding logit value. To find $c$, we sweep over all token indices. Figures 13, 14, and 15 show the curve fits, along with the MSE and $R^2$ of the fits, on various tasks and conditionals. Results show that LLM conditionals are consistently sharp-headed and heavy-tailed.

Figure 13: Logit fitting on Llama-3.1-8B-Instruct
Figure 14: Logit fitting on Qwen3.5-35B-A3B
Figure 15: Logit fitting on Olmo-3-7B-Instruct

Appendix J Limitations

Our work studies diversity collapse through the lens of validity–diversity calibration. This perspective is useful because it connects output-level diversity failures to local properties of the model's conditional distributions. However, several limitations should be noted.
Controlled tasks are diagnostic rather than exhaustive. A substantial part of our empirical analysis uses controlled random-number generation tasks, where the valid set can be characterized exactly. This allows us to compute validity and diversity without relying on noisy semantic judgments, and makes it possible to isolate the effects of shape and order calibration. However, these tasks are not intended to capture the full complexity of open-ended generation. In domains such as creative writing, scientific ideation, dialogue, or planning, validity is semantic, context-dependent, and often graded rather than binary. Therefore, our controlled experiments should be interpreted as diagnostics for specific distributional mechanisms, not as a complete account of all diversity failures in realistic generation settings.

Token-level validity labels are approximate in open-ended settings. For open-ended tasks, the valid-token set is not directly observable. Our empirical procedure approximates token validity by extending a prefix with a candidate token, greedily decoding a completion, and then evaluating the final sequence with an LLM judge. This provides a practical estimate of whether a token preserves access to a valid continuation, but it is not an exact characterization of the true valid-token set: a token that leads to an invalid greedy continuation may still admit valid continuations under another decoding path. To assess the reliability of this approximation, we conduct a human validation study comparing judge-based labels against human annotations. This helps quantify the extent to which the LLM judge agrees with human judgments and reduces concern that our precision–recall estimates are artifacts of a particular judge model. Nevertheless, the labels remain approximate, since both human and model judgments depend on the task rubric, validity threshold, and the particular continuation used for evaluation.
Our controlled random-number experiments avoid this source of noise by using algorithmically known validity.

The theoretical models isolate mechanisms under stylized assumptions. Our theoretical results are designed to formalize clean mechanisms rather than to fully model all LLM distributions. In particular, the shape calibration analysis uses a ranked geometric distribution and, in its cleanest form, assumes invariant valid branching across prefixes. These assumptions make it possible to show how local sharpness and entropy losses compound across sequence positions. Real LLM conditionals may have heterogeneous branching factors, non-geometric tails, prefix-dependent valid sets, and interactions between syntax, semantics, and instruction-following constraints. The theory should therefore be read as a mechanistic explanation of why validity–diversity trade-offs can arise, rather than as a literal generative model of all LLM behavior.

Oracle baselines are diagnostic, not deployable. Several experiments use oracle information, such as the ground-truth valid-token set size or exact validity constraints in controlled tasks. These oracle baselines are not meant to be practical decoding methods. Their purpose is to separate failure modes. For example, an oracle-size cutoff tests whether a rank-based method would improve if it knew the correct local support size, while still failing when valid and invalid tokens are interleaved in rank. Thus, oracle performance should be interpreted as evidence about the source of the bottleneck, not as a directly available inference-time algorithm.

Sequence-level experiments are limited by sequence depth. Exact validity–diversity computation becomes expensive as sequence length grows, because the number of possible continuations increases rapidly. This limits the lengths and branching structures that can be exhaustively evaluated in controlled settings.
Our experiments therefore emphasize short-to-moderate horizons where exact computation is feasible. The theory predicts that the relevant losses compound with the number of branching positions, but larger-scale empirical validation over longer sequences remains an important direction for future work.

This work diagnoses rather than solves the bottleneck. Our goal is to identify the distributional mechanisms that constrain validity and diversity during decoding. We do not propose a new decoding algorithm or training objective that fully resolves these issues. The results suggest that effective solutions may need to improve the model's conditional calibration directly, or use auxiliary validity signals that go beyond probability rank. Designing such calibration-aware training or decoding methods is left for future work.

Appendix K: Compute Resources

We conduct our experiments on 8 NVIDIA A6000 GPUs. For experiments on commercial models, we use the corresponding official API endpoints. Open-source LLM inference is conducted through the vLLM [24] package. The total computing time required to reproduce all our generations and evaluate the results is around one week.

Appendix L: Societal Impacts

Our work addresses the frontiers of LLM inference; in particular, it aims to understand the validity and diversity of LLM-generated outputs through fine-grained token control. Although we observe no evidence of harmful content in our experiments, token-level control could in principle produce sensitive or harmful content. However, our method does not attempt to jailbreak or induce unsafe behavior in any model. At a high level, work that incentivizes diverse LLM generation can yield novel ideas and solutions to societal problems, but it can also increase the risk of biased content. Advancing model diversity and capability while guaranteeing trustworthiness and safety remains a high priority for this and future work.
