Fine-Tuning Improves Information Conveyance in Language Models

arXiv cs.CL 06/01/26, 04:00 AM Papers
fine-tuning information-conveyance language-models entropy alignment semantic-diversity
Summary
This paper introduces Canopy Entropy (CE⋆) to measure the effective size of the generation space in language models, and finds that fine-tuning reorganizes uncertainty into more informative and semantically meaningful outputs, nearly tripling the correlation between entropy rate and semantic diversity.
arXiv:2605.30844v1 Announce Type: new Abstract: Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy ($\mathrm{CE}^\star$), a measure that views language generation from a tree perspective, where ``canopy'' represents the space of all possible rollouts, making $\mathrm{CE}^\star$ naturally quantify the effective size of the generation space. $\mathrm{CE}^\star$ jointly captures uncertainty in both the output length $N$ and the generated sequence $Y_{1:N}$ -- indeed, we show that it equals to total Shannon entropy $H(N, Y_{1:N}\mid X)$, where $X$ denotes the prompt. This formulation yields interpretable metrics, including a length-entropy correlation term $\rho(N, r_N)$, where $r_N$ is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine-tuned models consistently exhibit stronger positive correlation $\rho(N, r_N)$, even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output-length effects, we find that fine-tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine-tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at https://github.com/WeiyiTian/canopy-entropy.
Cached at: 06/01/26, 09:29 AM
# Fine-Tuning Improves Information Conveyance in Language Models
Source: [https://arxiv.org/html/2605.30844](https://arxiv.org/html/2605.30844)
Yuwei Cheng (Department of Statistics, University of Chicago, Chicago, IL 60637; yuweicheng@uchicago.edu), Weiyi Tian (Department of Data Science, University of Chicago, Chicago, IL 60637; weiyitian@uchicago.edu), Haifeng Xu (Department of Computer Science, University of Chicago, Chicago, IL 60637; haifengxu@uchicago.edu)

###### Abstract

Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose *Canopy Entropy* ($\mathrm{CE}^\star$), a measure that views language generation from a tree perspective, where the “canopy” represents the space of all possible rollouts, so that $\mathrm{CE}^\star$ naturally quantifies the *effective size of the generation space*. $\mathrm{CE}^\star$ jointly captures uncertainty in both the output length $N$ and the generated sequence $Y_{1:N}$; indeed, we show that it equals the total Shannon entropy $H(N, Y_{1:N}\mid X)$, where $X$ denotes the prompt. This formulation yields interpretable metrics, including a length-entropy correlation term $\rho(N, r_N)$, where $r_N$ is the entropy rate, quantifying *information conveyance efficiency* by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine-tuned models consistently exhibit a stronger positive correlation $\rho(N, r_N)$, even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output-length effects, we find that fine-tuning nearly *triples* the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine-tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at [https://github.com/WeiyiTian/canopy-entropy](https://github.com/WeiyiTian/canopy-entropy).

## 1 Introduction

Large language models (LLMs) have achieved strong performance across tasks such as natural language understanding ([brown2020language,](https://arxiv.org/html/2605.30844#bib.bib2); [hendrycks2020measuring,](https://arxiv.org/html/2605.30844#bib.bib12)), code generation ([chen2021evaluating,](https://arxiv.org/html/2605.30844#bib.bib3); [roziere2023code,](https://arxiv.org/html/2605.30844#bib.bib24)), and mathematical reasoning ([ouyang2022training,](https://arxiv.org/html/2605.30844#bib.bib22); [lu2025fine,](https://arxiv.org/html/2605.30844#bib.bib19)). A key driver of this success is fine-tuning, including instruction tuning ([wei2021finetuned,](https://arxiv.org/html/2605.30844#bib.bib36); [chung2024scaling,](https://arxiv.org/html/2605.30844#bib.bib4)) and alignment with human preferences ([ouyang2022training,](https://arxiv.org/html/2605.30844#bib.bib22); [rafailov2023direct,](https://arxiv.org/html/2605.30844#bib.bib23)). While fine-tuning substantially improves helpfulness and task performance, a growing body of work suggests that it may also reduce output diversity and compress the effective generation space of language models ([wang2025optimizing,](https://arxiv.org/html/2605.30844#bib.bib34); [yang2025llm,](https://arxiv.org/html/2605.30844#bib.bib39); [lake2025distributional,](https://arxiv.org/html/2605.30844#bib.bib17); [west2025base,](https://arxiv.org/html/2605.30844#bib.bib37)). Prior studies report that base models exhibit greater randomness and creativity, whereas aligned models produce more concentrated and similar responses across samples ([west2025base,](https://arxiv.org/html/2605.30844#bib.bib37)). 
Similar conclusions have been drawn using branching factor\([yang2025llm,](https://arxiv.org/html/2605.30844#bib.bib39)\), lexical diversity\([lake2025distributional,](https://arxiv.org/html/2605.30844#bib.bib17)\), and token\-level uncertainty measures\([agarwal2025unreasonable,](https://arxiv.org/html/2605.30844#bib.bib1);[wang2026entropy,](https://arxiv.org/html/2605.30844#bib.bib33)\), reinforcing the view that fine\-tuning makes language models more deterministic and less diverse\.

However, existing analyses suffer from an important limitation–they do not control for output length, despite length being a major confounder in diversity measurement\([shaib2024standardizing,](https://arxiv.org/html/2605.30844#bib.bib26)\)\. Classical metrics such as type\-token ratio are biased by text length, often making shorter responses appear more diverse\([shaib2024standardizing,](https://arxiv.org/html/2605.30844#bib.bib26)\)\. Such metrics that ignore generation length cannot distinguish genuinely diverse outputs from statistical artifacts induced by shorter outputs\. This limitation is especially problematic in aligned language models because fine\-tuning changes not only what models generate, but also how long they generate\. Base models often produce shorter responses, while instruction\-tuned models generate longer and more structured outputs\([singhal2023long,](https://arxiv.org/html/2605.30844#bib.bib29)\)\. As illustrated in[Figure 2](https://arxiv.org/html/2605.30844#S5.F2), base models typically exhibit decreasing entropy rates over generation, implying that longer continuations become increasingly predictable and redundant\. In contrast, fine\-tuned models tend to maintain more stable entropy rates across longer trajectories, suggesting that additional tokens remain informative rather than collapsing into low\-information continuation\.

More importantly, prior work largely treats diversity as a static property of outputs, without studying how uncertainty evolves along generation trajectories\. To the best of our knowledge, no existing work analyzes the relationship between generation length and entropy rate \(see[Definition 3](https://arxiv.org/html/2605.30844#Thmdefinition3)\)—that is, whether longer generations become more informative or increasingly redundant\. Yet this interaction is crucial, since fine\-tuning learns not only*what*to say, but also*when*to elaborate and*when*to stop\. As a result, output length itself becomes part of the learned generation strategy, making diversity comparisons incomplete without accounting for how uncertainty is allocated across trajectories\.

Meanwhile, recent studies suggest a more nuanced picture than simple diversity reduction\. Fine\-tuning can increase certain forms of diversity, such as semantic diversity in code generation, even while reducing lexical variation\([shypuladoes,](https://arxiv.org/html/2605.30844#bib.bib28)\)\. These findings suggest that fine\-tuning may reorganize uncertainty and generation structure, rather than uniformly suppress diversity, leaving the relationship between uncertainty allocation, semantic diversity, and generation trajectories still poorly understood\.

Our contribution. In this work, we revisit diversity in language models through a principled information-theoretic lens. We propose *Canopy Entropy* ($\mathrm{CE}^\star$), an information-theoretic measure that views language generation from a tree perspective. Just as a biological canopy represents the total spread and reach of a tree’s branches, our metric views the “canopy” as the set of all potential leaves (finished rollouts) available to the model. This allows $\mathrm{CE}^\star$ to naturally quantify the effective volume of the generation space by accounting for both how *widely* the model branches and how *deeply* it explores. $\mathrm{CE}^\star$ jointly captures output length and content uncertainty, and we show that it admits an exact characterization as the total Shannon entropy $H(N, Y_{1:N}\mid X)$, where $X$ denotes the prompt, $N$ is the generated length, and $Y_{1:N}$ is the generated sequence. Building on this formulation, we derive interpretable metrics and introduce a length-entropy correlation term that measures how uncertainty is allocated across generation trajectories (see [section 3](https://arxiv.org/html/2605.30844#S3)). Since $\mathrm{CE}^\star$ is defined over the entire space of rollouts, a key challenge is to estimate it efficiently. To that end, we develop Monte Carlo estimators and establish their consistency (see [section 4](https://arxiv.org/html/2605.30844#S4)).

Empirically, we evaluate multiple model families across sentence completion, mathematical reasoning, coding, and story generation, while explicitly controlling for output length. Our results reveal three consistent findings: (i) fine-tuning generally reduces token-level and trajectory-level uncertainty, although the magnitude of the reduction is highly task-dependent; (ii) fine-tuning systematically shifts $\rho(N, r_N)$ from negative toward positive, indicating that longer generations become more informative rather than redundant, as is often the case in base models; and (iii) entropy rate becomes a substantially stronger predictor of semantic diversity after fine-tuning, suggesting that aligned models utilize uncertainty more efficiently and translate it more effectively into meaningful semantic diversity (see [section 5](https://arxiv.org/html/2605.30844#S5)). Overall, our results suggest that fine-tuning does not simply reduce uncertainty or shrink the generation space. Rather, it restructures how uncertainty is allocated across generation trajectories, producing outputs that are more organized, semantically meaningful, and information-efficient.

## 2 Related work

Length as a fundamental confounder in diversity measurement. Recent work ([singhal2023long,](https://arxiv.org/html/2605.30844#bib.bib29)) shows that RLHF and instruction tuning systematically increase LLM output length, since longer responses are often positively correlated with human preference scores. This makes output length a critical confounder in diversity evaluation. Metrics that favor shorter generations may therefore unfairly penalize instruct models simply because they produce longer responses. For example, n-gram diversity metrics ([li2016diversity,](https://arxiv.org/html/2605.30844#bib.bib18)) naturally decrease with sequence length, as longer texts inevitably reuse a finite vocabulary, reducing the ratio of unique n-grams even when the underlying content remains semantically diverse. Embedding-based semantic diversity measures ([tevet2021evaluating,](https://arxiv.org/html/2605.30844#bib.bib32)) suffer from a related issue: longer generation trajectories must be compressed into fixed-dimensional representations, often through mean pooling, which can dilute semantic diversity. Moreover, many existing diversity metrics rely on heuristic sampling procedures and lack a unified probabilistic foundation ([west2025base,](https://arxiv.org/html/2605.30844#bib.bib37)). Consequently, current approaches cannot rigorously distinguish whether the increased verbosity of instruct models reflects genuine information gain or merely redundant continuation, motivating the need for a principled length-aware and model-intrinsic measure of generation diversity.

From local branching to global canopy. Conceptualizing autoregressive generation as a stochastic, path-dependent branching tree, Branching Factor (BF) ([yang2025llm,](https://arxiv.org/html/2605.30844#bib.bib39)) was introduced to quantify the probability concentration induced by model alignment. However, their framework lacks a formal heuristic and focuses primarily on the *width* of the generation tree (branching) while ignoring its *depth* (length) and the coupling between the two. $\mathrm{CE}^\star$ addresses this gap by providing a model-intrinsic characterization of the generation space that jointly models content uncertainty and stochastic stopping behavior. By deriving the BF as a normalized version of $\mathrm{CE}^\star$, we provide a principled theoretical foundation for its use. Crucially, our length-uncertainty correlation demonstrates that while fine-tuning reduces the absolute BF, it restructures the generation tree to sustain information density across longer paths while translating uncertainty into semantic diversity more efficiently. To the best of our knowledge, this is the first work to provide a unified measure of the global LLM generation space that accounts for both the local branching uncertainty and the stochasticity of trajectory length, while also studying the redistribution of uncertainty across generation trajectories.

## 3 Methodology

Notation convention. For any discrete random variable $Z \in \mathcal{Z}$ with probability $q(z)$ for each $z \in \mathcal{Z}$, its Shannon entropy ([shannon1948mathematical,](https://arxiv.org/html/2605.30844#bib.bib27)) is $H(Z) = -\sum_{z\in\mathcal{Z}} q(z)\log q(z)$. Let $X$ be another random variable such that $X, Z$ have joint probability $q(z,x)$; then $H(Z\mid X=x) = -\sum_{z\in\mathcal{Z}} q(z\mid x)\log q(z\mid x)$ is the conditional entropy of $Z$ given the realized $x$, whereas $H(Z\mid X) = \sum_{x} H(Z\mid X=x)\,q(x)$ is the expected conditional entropy of $Z$.

### 3.1 Measuring the size of LLM generation space from a tree perspective

![Refer to caption](https://arxiv.org/html/2605.30844v1/x1.png)

Figure 1: Path-dependent generation tree induced by autoregressive LLMs.

Autoregressive generation as a stochastic tree. Autoregressive language models induce a stochastic generation process that can naturally be represented as a rooted, path-dependent tree. Let $X \sim p_X$ denote a prompt sampled from a prompt distribution over a space $\mathcal{X}$, and let $\mathcal{V}$ denote the vocabulary. Conditioned on $X=x$ and a decoding policy (e.g., temperature, top-$k$, or nucleus sampling), generation proceeds sequentially by sampling tokens from a conditional distribution. Formally, let $Y_1, Y_2, \ldots$ denote the autoregressive token process generated by the model. For each step $t \geq 1$, define the generation history as $G_t := (Y_1, \ldots, Y_{t-1})$, with $G_1 := \emptyset$. The next-step random variable is denoted by $Z_t \in \mathcal{Z} := \mathcal{V} \cup \{\texttt{EOS}\}$, where $\texttt{EOS}$ denotes termination. The conditional distribution $q(Z_t \mid G_t, X)$ governs the next-step generation process (see footnote 1). Under this construction, each node of the generation tree (see [Figure 1](https://arxiv.org/html/2605.30844#S3.F1)) corresponds to a history $G_t$. Each outgoing edge corresponds to a possible realization of $Z_t$. A complete generated sequence therefore corresponds to a root-to-leaf trajectory. A key feature of this construction is its path dependence. The next-step distribution $q$ depends on the entire generation history $G_t$, not merely the most recent token. Thus, two trajectories ending with the same token may evolve very differently if they arise from different histories.

Footnote 1: This formulation is equivalent to a two-stage stochastic decision process. Specifically, the model first makes a binary decision between $\texttt{STOP}$ (terminate) and $\texttt{CONT}$ (continue). Conditional on $\texttt{CONT}$, it then samples the next token from the vocabulary $\mathcal{V}$. We provide a formal proof of this equivalence in [subsection A.1](https://arxiv.org/html/2605.30844#A1.SS1).

Width and depth of the generation tree. The size of the generation tree is determined by two complementary factors. First, at each history $G_t$, the spread of $q(\cdot \mid G_t, X)$ determines the number of plausible continuations, capturing the local branching behavior (*width*) of the tree. Second, generation proceeds for a random number of steps before termination, inducing variability in trajectory depth. We define the stopping time $N := \inf\{t \geq 1 : Z_t = \texttt{EOS}\}$, which represents the generation length (*depth*). Under this formulation, the stochastic process jointly models both continuation uncertainty and stopping behavior, together determining the overall size of the generation space.

A unified measure of generation space. To quantify local uncertainty, we use Shannon entropy ([shannon1948mathematical,](https://arxiv.org/html/2605.30844#bib.bib27)). For a given prompt $X=x$ and a realized generation history $G_t=g_t$, define

$$\widetilde{H}(g_t\mid x) := H(Z_t\mid G_t=g_t, X=x) = -\sum_{z\in\mathcal{Z}} q(z\mid g_t,x)\log q(z\mid g_t,x). \tag{1}$$

This quantity measures uncertainty via entropy in the immediate next generation step, including both continuation-token selection and termination. Note that it summarizes only the uncertainty of the next-step generation conditioned on a realized history $g_t$ and prompt $x$. In the following, we slightly overload the notation by also using uppercase $G_t$ and $X$ to denote realized histories and prompts whenever the distinction between random variables and their realizations is clear from context. To jointly account for local branching diversity and random trajectory length, we introduce the following uncertainty measure.
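As a minimal sketch of Eq. (1), the next-step entropy can be computed from a vector of next-step logits by applying a softmax and summing $-q\log q$. The function name `next_step_entropy` and the example logits are illustrative assumptions, not part of the paper's released code; the last logit is assumed to stand for $\texttt{EOS}$.

```python
import math

def next_step_entropy(logits):
    """Entropy in nats of softmax(logits), taken over the vocabulary plus EOS."""
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-step logits over three tokens plus EOS (last entry).
flat = next_step_entropy([0.0, 0.0, 0.0, 0.0])     # uniform over 4 outcomes
peaked = next_step_entropy([10.0, 0.0, 0.0, 0.0])  # nearly deterministic
print(f"flat: {flat:.4f} nats, peaked: {peaked:.4f} nats")
```

A flat distribution over four outcomes gives $\log 4$ nats, while a sharply peaked one gives an entropy close to zero, which is the sense in which Eq. (1) measures local branching width.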

###### Definition 1 (Canopy Entropy).

We define *Canopy Entropy* as $\mathrm{CE}^\star := \mathbb{E}\left[\sum_{t=1}^{N(x)} \widetilde{H}(g_t\mid x)\right]$. More formally,

$$\mathrm{CE}^\star = \mathbb{E}_{x,\{g_t\}_{t=1}^{N(x)}}\left[\sum_{t=1}^{N(x)} H(Z_t\mid G_t=g_t, X=x)\right],$$

where the expectation $\mathbb{E}$ is over the randomness of the prompt $x$, the length $N(x)$, and $\{g_t\}_{t=1}^{N(x)}$ (all possible partial histories of rollouts).

That is, the quantity $\mathrm{CE}^\star$ aggregates entropy-based uncertainty along generation trajectories and across prompts. A single generated sequence corresponds to a specific traversal from the tree root (prompt $X$) to a leaf ($\texttt{EOS}$ termination). The summation $\sum_{t=1}^{N(X)}$ thus represents the total uncertainty accumulated along that entire path. By taking the expectation over all trajectories, $\mathrm{CE}^\star$ captures the aggregate “canopy” of the model’s output distribution across all possible rollouts. As such, it provides a unified measure of generation space by accounting for both *local branching width* (the diversity of choices at each node) and *trajectory depth* (the variability in generation length).

$\mathrm{CE}^\star$ as the joint entropy of length and content. Besides being an intuitive measure, the above construction also admits a clean information-theoretic characterization. In particular, the cumulative uncertainty along a trajectory can be interpreted as the total entropy of the generation process. For ease of reference, we call $H(N, Y_{1:N}\mid X)$ the *total uncertainty*, $H(N\mid X)$ the *length uncertainty*, and $H(Y_{1:N}\mid N, X)$ the *content uncertainty*. The following theorem formalizes the connection.

###### Theorem 1 (Equivalence of $\mathrm{CE}^\star$ and total uncertainty).

$$\mathrm{CE}^\star = H(N, Y_{1:N}\mid X).$$

Theorem [1](https://arxiv.org/html/2605.30844#Thmtheorem1) (see [subsection A.2](https://arxiv.org/html/2605.30844#A1.SS2) for a detailed proof) shows that the Canopy Entropy coincides with the total uncertainty of the generation process. By the chain rule of entropy, $H(N, Y_{1:N}\mid X) = H(N\mid X) + H(Y_{1:N}\mid N, X)$, which decomposes the total uncertainty into two components. The *length uncertainty* $H(N\mid X)$ captures how unpredictable the stopping time is, while the *content uncertainty* $H(Y_{1:N}\mid N, X)$ captures how diverse the generated sequences are once the length is fixed. This decomposition aligns naturally with the tree perspective: length uncertainty corresponds to variability in the depth of trajectories, while content uncertainty corresponds to the diversity of paths of a given length. We can interpret $\exp(\mathrm{CE}^\star)$ as the effective number of plausible full sequences, analogous to how perplexity represents the effective branching at the token level.
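The equivalence and the chain-rule decomposition can be checked numerically on a small tree. The sketch below is an illustration under stated assumptions, not the authors' code: `q` is a made-up path-dependent next-step distribution over two tokens plus EOS for a single fixed prompt, with termination forced at depth 3 so the tree is finite. It computes $\mathrm{CE}^\star$ node by node as in Definition 1, and separately computes the entropy of the distribution over complete rollouts.

```python
import math
from collections import defaultdict

EOS = "<eos>"

def q(history):
    """Hypothetical path-dependent next-step distribution over {a, b, EOS}."""
    if len(history) >= 3:  # force termination so the toy tree is finite
        return {EOS: 1.0}
    stop = 0.2 + 0.2 * len(history)  # stopping gets likelier with depth
    p_a = 0.7 if history.count("a") % 2 == 0 else 0.3  # depends on the whole path
    return {"a": (1 - stop) * p_a, "b": (1 - stop) * (1 - p_a), EOS: stop}

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def canopy_entropy(history=(), reach=1.0):
    """Sum of local entropies, each weighted by the probability of reaching the node."""
    dist = q(history)
    total = reach * entropy(dist)
    for tok, p in dist.items():
        if tok != EOS and p > 0:
            total += canopy_entropy(history + (tok,), reach * p)
    return total

def leaves(history=(), reach=1.0):
    """Yield (N, probability) for every complete rollout; N counts the EOS step."""
    for tok, p in q(history).items():
        if p == 0:
            continue
        if tok == EOS:
            yield len(history) + 1, reach * p
        else:
            yield from leaves(history + (tok,), reach * p)

leaf_list = list(leaves())
joint = -sum(p * math.log(p) for _, p in leaf_list)  # H(N, Y | X = x)
length_marginal = defaultdict(float)
for n, p in leaf_list:
    length_marginal[n] += p
h_length = entropy(length_marginal)   # H(N | X = x)
h_content = joint - h_length          # H(Y | N, X = x) via the chain rule
ce = canopy_entropy()
print(f"CE* = {ce:.6f}, H(N,Y|X) = {joint:.6f}")
print(f"H(N|X) = {h_length:.6f}, H(Y|N,X) = {h_content:.6f}, exp(CE*) = {math.exp(ce):.3f}")
```

On this toy tree the two quantities agree up to floating-point error, as the theorem states. Exhaustive enumeration is only feasible because the tree is tiny; for real models the number of rollouts grows exponentially with length.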

### 3.2 Two normalized variants of Canopy Entropy

While $\mathrm{CE}^\star = H(N, Y_{1:N}\mid X)$ provides a clean characterization of the generation space, it aggregates uncertainty across both sequence lengths and content diversity, making its direct interpretation difficult. To obtain more interpretable metrics, we introduce two normalized variants of $\mathrm{CE}^\star$, respectively coined *Generation Perplexity* and *Branching Factor* (as introduced in ([yang2025llm,](https://arxiv.org/html/2605.30844#bib.bib39))). Intuitively, the former corresponds to the token-level normalization of $\mathrm{CE}^\star$, whereas the latter is trajectory-level normalization. In both definitions, the $\exp$ function converts entropy uncertainty to bit diversity.

###### Definition 2 (GenPPL).

Generation Perplexity is defined as

$$\mathrm{GenPPL} := \exp\left(\frac{\sum_{X\sim p_X} H(N, Y_{1:N}\mid X=x)}{\sum_{X\sim p_X} \mathbb{E}[N\mid X=x]}\right).$$

###### Definition 3 (BF, [yang2025llm,](https://arxiv.org/html/2605.30844#bib.bib39)).

Branching Factor is defined as

$$\mathrm{BF} := \exp\left(\mathbb{E}_X\,\mathbb{E}_N[r_N\mid X=x]\right),$$

where $r_N := \frac{H(Y_{1:N}\mid N=n, X=x)}{n}$ is the entropy rate (i.e., per-token content uncertainty; see footnote 2).

Footnote 2: In information theory, the entropy rate quantifies the average amount of information produced per generated token by a stochastic process, and can be interpreted as a measure of information density or non-redundancy.

Invoking Theorem [1](https://arxiv.org/html/2605.30844#Thmtheorem1), we observe that, setting aside the $\exp$ function, the only difference between Canopy Entropy $\mathrm{CE}^\star$ and GenPPL or BF is that $\mathrm{CE}^\star$ lacks the normalization term $1/(\sum_{X\sim p_X}\mathbb{E}[N\mid X=x])$ or $1/n$. The difference between GenPPL and BF, which stems from their different normalization factors, deserves a closer look. It becomes prominent when the generated sequence length $N$ varies widely. In that case, GenPPL *treats each token equally*, regardless of whether it comes from a short or a long sequence, through a “global” normalization factor $1/(\sum_{X\sim p_X}\mathbb{E}[N\mid X=x])$. BF, as the expectation of $r_N$, instead *treats each trajectory equally* and consequently discounts the influence of tokens in long sequences much more than that of tokens in short sequences. Therefore, at a high level, GenPPL reflects the diversity of a *typical token* in the generation whereas BF reflects the diversity of a *typical trajectory*. For further intuition, we refer readers to [subsection B.1](https://arxiv.org/html/2605.30844#A2.SS1) for a concrete example illustrating the distinction between GenPPL and BF.
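To make the two normalizations concrete, here is a small sketch with made-up numbers; it is not the example from the paper's appendix. Each prompt is represented, hypothetically, as a mapping from a length $n$ to a pair (probability of that length, content entropy $H(Y_{1:N}\mid N=n, X=x)$ in nats). Prompts are weighted equally, which is an assumption standing in for $p_X$.

```python
import math

def gen_ppl(prompts):
    """Token-level normalization: total entropy divided by total expected length."""
    total_h = 0.0
    total_len = 0.0
    for lengths in prompts:
        h_len = -sum(p * math.log(p) for p, _ in lengths.values() if p > 0)
        h_content = sum(p * h for p, h in lengths.values())
        total_h += h_len + h_content  # chain rule: H(N|x) + H(Y|N,x)
        total_len += sum(p * n for n, (p, _) in lengths.items())
    return math.exp(total_h / total_len)

def branching_factor(prompts):
    """Trajectory-level normalization: average of the entropy rate r_N = H_n / n."""
    rates = [sum(p * h / n for n, (p, h) in lengths.items()) for lengths in prompts]
    return math.exp(sum(rates) / len(rates))

# One hypothetical prompt: half the rollouts are short and dense (2 tokens,
# 1.0 nat per token), half are long and redundant (100 tokens, 0.1 nat per token).
toy = [{2: (0.5, 2.0), 100: (0.5, 10.0)}]
gen = gen_ppl(toy)
bf = branching_factor(toy)
print(f"GenPPL = {gen:.3f}, BF = {bf:.3f}")
```

In this toy setting BF comes out larger than GenPPL: the short, dense trajectory counts as much as the long one under BF, whereas under GenPPL the many low-entropy tokens of the long trajectory dominate.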

GenPPL and BF together provide a more complete view of generation behavior across tasks. In particular, we observe that structured tasks such as mathematical reasoning or coding are better characterized by token-level uncertainty (GenPPL), whereas open-ended tasks like story generation are better captured by trajectory-level diversity (BF). Our framework provides a principled formulation of GenPPL as a generative analogue of perplexity ([jelinek1977perplexity,](https://arxiv.org/html/2605.30844#bib.bib15)) for variable-length processes, and offers a unified information-theoretic interpretation of BF, initially proposed by Yang et al. ([yang2025llm,](https://arxiv.org/html/2605.30844#bib.bib39)), showing that both arise naturally as normalized variants of Canopy Entropy. See Appendix [B.2](https://arxiv.org/html/2605.30844#A2.SS2) for guidance on when to use BF or GenPPL, and Appendix [B.3](https://arxiv.org/html/2605.30844#A2.SS3) for a comparison with perplexity.

### 3.3 The length-entropy correlation: how entropy evolves with generation length

Aggregate measures such as Canopy Entropy, GenPPL, and BF quantify the overall magnitude of uncertainty\. However, they fail to capture how uncertainty evolves across generation rollouts or how effectively uncertainty is translated into meaningful variation as generation length increases\. Whether longer generations remain information\-rich or become progressively redundant\([singhal2023long,](https://arxiv.org/html/2605.30844#bib.bib29);[shaib2024standardizing,](https://arxiv.org/html/2605.30844#bib.bib26)\)remains poorly understood despite rich previous studies on the diversity of language generation\([tevet2021evaluating,](https://arxiv.org/html/2605.30844#bib.bib32);[li2016diversity,](https://arxiv.org/html/2605.30844#bib.bib18);[friedman2022vendi,](https://arxiv.org/html/2605.30844#bib.bib8)\)\. Understanding this interaction is crucial for characterizing the*information\-conveying efficiency*of language models—namely, how much uncertainty is maintained per token as generation progresses\.

To study this phenomenon, we introduce the prompt-controlled correlation $\rho(N, r_N)$ between the length $N$ and the entropy rate $r_N$, defined below.

###### Definition 4 (The length-entropy correlation).

The length-entropy correlation is computed for each prompt and averaged over the prompt distribution:

$$\rho(N, r_N) = \mathbb{E}_{X\sim p_X}\Big\{\texttt{Corr}(N, r_N)\mid X=x\Big\}, \quad\text{where } r_N = \frac{H(Y_{1:N}\mid N=n, X=x)}{n}$$

is the entropy rate and $\texttt{Corr}$ is any standard correlation notion from the set {Pearson correlation, Spearman correlation, Kendall correlation}.

A positive correlation $\rho(N, r_N)$ indicates that longer sequences remain informative per token, while a negative correlation indicates that longer outputs become increasingly predictable. This quantity characterizes uncertainty allocation along trajectories and is later used in [section 5](https://arxiv.org/html/2605.30844#S5) to compare base and fine-tuned models across tasks and model families. In our evaluation, we observe consistent trends across the three standard correlation notions mentioned in the definition above, which illustrates the robustness of the discovered correlations between generation length and entropy rate. Notably, the correlation $\rho(N, r_N)$ is evaluated over the randomness of rollouts drawn from an exponentially large space; we discuss efficient estimation methods in the section below, with full details deferred to [subsection A.6](https://arxiv.org/html/2605.30844#A1.SS6).
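The definition above can be sketched as a per-prompt Pearson correlation averaged over prompts. The sketch assumes that each rollout has already been summarized as a pair (length, entropy-rate estimate); how the entropy rate is estimated is left open here, since the paper defers its estimator to the appendix. The function names and the numbers are hypothetical and are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def length_entropy_correlation(rollouts_by_prompt):
    """Average, over prompts, of the within-prompt correlation of length and rate."""
    per_prompt = []
    for pairs in rollouts_by_prompt:
        rho = pearson([n for n, _ in pairs], [r for _, r in pairs])
        if not math.isnan(rho):  # skip prompts where the correlation is undefined
            per_prompt.append(rho)
    return sum(per_prompt) / len(per_prompt) if per_prompt else float("nan")

# Made-up (length, entropy rate) pairs for two prompts each.
falling = [[(10, 1.2), (40, 0.8), (90, 0.5)], [(20, 1.0), (60, 0.7), (120, 0.6)]]
rising = [[(10, 0.4), (40, 0.5), (90, 0.6)], [(20, 0.3), (60, 0.45), (120, 0.5)]]
rho_falling = length_entropy_correlation(falling)
rho_rising = length_entropy_correlation(rising)
print(f"falling rates: {rho_falling:.3f}, rising rates: {rho_rising:.3f}")
```

Computing the correlation within each prompt before averaging is what makes the quantity prompt-controlled: differences in typical length or entropy between prompts do not contribute. Swapping `pearson` for a rank-based correlation would give the Spearman or Kendall variants.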

## 4 Efficient and consistent estimation of $\mathrm{CE}^\star$

The previous section introduced $\mathrm{CE}^\star = H(N, Y_{1:N}\mid X)$ and its derived metrics such as GenPPL, BF, and the length-entropy correlation. However, efficiently computing these metrics in practice is challenging since they are expectations over exponentially many trajectories ([hu2023amortizing,](https://arxiv.org/html/2605.30844#bib.bib13); [yang2025llm,](https://arxiv.org/html/2605.30844#bib.bib39)). This section develops computationally tractable approaches to estimate $\mathrm{CE}^\star$, which also generalize to estimating the derived metrics. In particular, we introduce a Monte Carlo estimator under a maximum length constraint $T_{\max}$, and analyze the consistency of this estimator.

For a random prompt $X$, let $N(X)$ be the generation’s (random) stopping time. Given a maximum length $T_{\max}$, define the truncated stopping time $N_{\max}(X) := \min\{N(X), T_{\max}\}$. The corresponding truncated functional is $\mathrm{CE}^\star_{\max} := \mathbb{E}\left[\sum_{t=1}^{N_{\max}(X)} \widetilde{H}(G_t\mid X)\right]$. Now suppose $x_1, \dots, x_P$ are sampled i.i.d. from the prompt distribution $p_X$. For each prompt $x_p$, we generate $M$ independent rollouts. For the $i$-th rollout, define the truncated cumulative local entropy by $S_{\max}^{(p,i)} := \sum_{t=1}^{N_{\max}^{(p,i)}} \widetilde{H}(G_t^{(p,i)}\mid x_p)$, which measures the total branching uncertainty along the generated trajectory up to truncation. We then define the Monte Carlo estimator (see [algorithm 1](https://arxiv.org/html/2605.30844#algorithm1) for more detail)

$$\widehat{\mathrm{CE}}^\star_{\max,P,M} := \frac{1}{PM} \sum_{p=1}^{P} \sum_{i=1}^{M} S_{\max}^{(p,i)}.$$

The estimator's convergence and consistency require two minor technical assumptions. First, let $g(T_{\max}) = \mathrm{CE}^\star - \mathrm{CE}^\star_{\max}$ denote the truncation bias, which always satisfies $g(T_{\max}) \geq 0$. Convergence requires *diminishing truncation bias*, i.e., $g(T_{\max}) \to 0$ as $T_{\max} \to \infty$. Second, in our estimation, $T_{\max}$ increases as the parameters $P$ and $M$ increase. Convergence requires a *modest truncation length*, i.e., the length cap $T_{\max}(P, M)$ is picked such that $\nu_{\max}^2 \left(\frac{1}{P} + \frac{1}{PM}\right) \to 0$ as $T_{\max} = T_{\max}(P, M) \to \infty$, where $\nu_{\max}^2 := \mathbb{E}\left[N_{\max}(X)^2\right]$ is the second raw moment of the truncated stopping time. Both conditions are easily satisfied when the tail of $N$ is not too heavy, which we observe in our empirical evaluation (see [Figure 3](https://arxiv.org/html/2605.30844#S5.F3)). The following theorem characterizes the convergence of $\widehat{\mathrm{CE}}^\star_{\max,P,M}$ under these technical conditions.

###### Theorem 2 (Consistency and non-asymptotic error bounds of the Monte Carlo estimator).

Assume diminishing truncation bias $g(T)$ and modest truncation length. Then $\widehat{\mathrm{CE}}^\star_{\max,P,M}$ converges to $\mathrm{CE}^\star$ in probability as $P, M \to \infty$. Moreover,

$$\left|\widehat{\mathrm{CE}}^\star_{\max,P,M} - \mathrm{CE}^\star\right| = O_{\mathbb{P}}\left(\nu_{\max}\sqrt{\frac{1}{P} + \frac{1}{PM}} + g(T_{\max})\right),$$

where $\nu_{\max}^2 := \mathbb{E}\left[N_{\max}(X)^2\right]$.

[Theorem 2](https://arxiv.org/html/2605.30844#Thmtheorem2) shows that the estimator $\widehat{\mathrm{CE}}^\star_{\max,P,M}$ is consistent, with estimation error naturally decomposing into a Monte Carlo error term and a truncation bias term. The Monte Carlo error is of order $O_{\mathbb{P}}\left(\frac{1}{\sqrt{P}} + \frac{1}{\sqrt{PM}}\right)$, up to the factor $\nu_{\max}$, arising from sampling $P$ prompts and $M$ rollouts per prompt.

The second component is a truncation bias $g(T_{\max})$, which captures the bias introduced by truncating trajectories at length $T_{\max}$ instead of allowing full generations. Empirically, we observe that the generation length distribution of LLMs typically exhibits exponential decay across tasks and model architectures (see [Figure 3](https://arxiv.org/html/2605.30844#S5.F3)). This implies that long trajectories are increasingly rare, so the truncation bias $g(T_{\max})$ decays rapidly with $T_{\max}$. We defer the proof of [Theorem 2](https://arxiv.org/html/2605.30844#Thmtheorem2) to Appendix [A.5](https://arxiv.org/html/2605.30844#A1.SS5).
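As a worked illustration of why a light-tailed length distribution makes the truncation bias small, consider a hypothetical toy model (an assumption made here for illustration, not a setting taken from the experiments) in which every step has the same local entropy $h$ and stops with the same probability $p \in (0, 1]$, so that $N$ is geometric with $\mathbb{P}(N > k) = (1-p)^k$. Then

$$g(T_{\max}) = \mathbb{E}\left[\sum_{t=N_{\max}+1}^{N} h\right] = h\,\mathbb{E}\left[(N - T_{\max})^{+}\right] = h \sum_{k \ge T_{\max}} \mathbb{P}(N > k) = \frac{h\,(1-p)^{T_{\max}}}{p}.$$

Under this assumption the bias decays geometrically in $T_{\max}$. For instance, with $p = 0.01$ (mean length $100$) and $T_{\max} = 4000$, the bias is below $10^{-15}\,h$.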

Notably, the same Monte Carlo framework extends to the estimation of GenPPL, BF, and prompt-controlled correlation. Their consistency follows by the continuous mapping theorem ([mann1943stochastic,](https://arxiv.org/html/2605.30844#bib.bib20)) together with mild integrability and truncation-bias conditions, which are empirically supported by the observed low truncation rates and rapidly decaying active-rollout distributions (see Appendix [A.6](https://arxiv.org/html/2605.30844#A1.SS6) for detailed proofs and discussions).
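The Monte Carlo estimator of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the callable `next_token_probs`, the function names, and the three-symbol `toy_model` are all hypothetical stand-ins for a real language model, and the EOS symbol is assumed to sit at index 0 of the extended vocabulary.

```python
import math
import random

EOS = 0  # assumed index of the end-of-sequence symbol in the toy extended vocabulary


def local_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def truncated_rollout_entropy(next_token_probs, prompt, t_max, rng):
    """Sample one rollout; return (S_max, N_max) for that rollout.

    S_max accumulates the local entropy at every visited step, including
    the step at which EOS is drawn, and stops after at most t_max steps.
    """
    history = []
    total = 0.0
    for step in range(1, t_max + 1):
        probs = next_token_probs(prompt, history)
        total += local_entropy(probs)
        token = rng.choices(range(len(probs)), weights=probs)[0]
        if token == EOS:
            return total, step
        history.append(token)
    return total, t_max


def estimate_canopy_entropy(next_token_probs, prompts, m, t_max, seed=0):
    """Average S_max over P prompts and M rollouts per prompt."""
    rng = random.Random(seed)
    sums = [
        truncated_rollout_entropy(next_token_probs, x, t_max, rng)[0]
        for x in prompts
        for _ in range(m)
    ]
    return sum(sums) / len(sums)


def toy_model(prompt, history):
    """Hypothetical model: the same distribution over {EOS, a, b} at every step."""
    return [0.2, 0.4, 0.4]


if __name__ == "__main__":
    estimate = estimate_canopy_entropy(toy_model, prompts=list(range(20)), m=200, t_max=200)
    # For this toy model the length is geometric with mean 1 / 0.2 = 5,
    # so the exact value is 5 * local_entropy([0.2, 0.4, 0.4]).
    print(estimate, 5.0 * local_entropy([0.2, 0.4, 0.4]))
```

The toy model is chosen because its exact value is available in closed form, which makes it easy to check that the estimate concentrates around the target as the number of rollouts grows.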

## 5 Empirical studies

In this section, we present our experimental results, organized into two complementary parts. First, we examine the global uncertainty structure of the LLM generation space using the metrics introduced in [section 3](https://arxiv.org/html/2605.30844#S3), including $\mathrm{CE}^\star$, GenPPL, BF, and length–entropy correlation. Second, building on this analysis, we investigate how uncertainty is translated into semantic diversity through a regression framework that links entropy rate to semantic diversity. We evaluate three model families: LLaMA-3.1-8B ([grattafiori2024llama,](https://arxiv.org/html/2605.30844#bib.bib9)), Qwen3-8B ([yang2025qwen3,](https://arxiv.org/html/2605.30844#bib.bib38)), and Gemma-3-12B ([gemmateam2025gemma3technicalreport,](https://arxiv.org/html/2605.30844#bib.bib31)), testing both their pre-trained (base) and instruction-tuned (instruct) versions. We evaluate these models on four tasks spanning a spectrum from highly structured settings, including mathematical reasoning ([cobbe2021gsm8k,](https://arxiv.org/html/2605.30844#bib.bib5)), coding ([zhuo2024bigcodebench,](https://arxiv.org/html/2605.30844#bib.bib41)), and sentence completion ([zellers2019hellaswag](https://arxiv.org/html/2605.30844#bib.bib40)), to more open-ended creative story generation ([ismayilzada2025creativepreferenceoptimization,](https://arxiv.org/html/2605.30844#bib.bib14)). For each task, we sample $P = 100$ prompts, and for each sampled prompt $X_p$, we generate $M = 100$ independent rollouts using stochastic decoding with temperature $1$ and top-$k$ sampling with $k = 100$ (renormalized next-token probabilities; see footnote 3), where each rollout produces a sequence $Y_{1:N}$ with stopping time $N$ up to a maximum of 4000 new tokens, i.e., $T_{\max} = 4000$ (see [subsection C.1](https://arxiv.org/html/2605.30844#A3.SS1) for a detailed description). To ensure consistent evaluation across model families, we disable the thinking mode in Qwen3-8B Instruct.

Footnote 3: Specifically, the one-step local entropy $\widetilde{H}(G_t \mid X)$ is computed as $-\sum_{z \in \mathcal{Z}_{100}} q(z \mid G_t, X) \log q(z \mid G_t, X)$, where $\mathcal{Z}_{100}$ is the set of the top-100 tokens (including EOS if present) and $q$ is the renormalized distribution over the sampled subset.
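The top-$k$ local entropy used in the decoding setup above can be sketched as follows. This is a minimal illustration assuming the raw next-token logits are available as a plain list; the function name `topk_local_entropy` is hypothetical, and a real pipeline would operate on model tensors rather than Python lists.

```python
import math


def topk_local_entropy(logits, k=100):
    """Entropy (in nats) of the next-token distribution after keeping the
    k largest logits and renormalizing with a softmax over that subset."""
    top = sorted(logits, reverse=True)[:k]
    peak = top[0]  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # Full distribution (0.5, 0.25, 0.125, 0.125); top-2 renormalizes to (2/3, 1/3).
    logits = [math.log(0.5), math.log(0.25), math.log(0.125), math.log(0.125)]
    print(topk_local_entropy(logits, k=2))
```

Restricting to the top-$k$ subset can only remove probability mass from the tail, so the value is computed with respect to the distribution that is actually sampled from rather than the model's full next-token distribution.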

### 5.1 Global effects of fine-tuning on uncertainty

Across all model families and tasks, we observe three consistent patterns\.

**Fine-tuning reduces both token-level and trajectory-level normalized uncertainty.** Instruction fine-tuning consistently reduces both token-level (measured by GenPPL) and trajectory-level (measured by BF) normalized uncertainty, as evidenced by systematic declines in GenPPL and BF across nearly all models and tasks (see [Table 1](https://arxiv.org/html/2605.30844#S5.T1)). Nevertheless, the magnitude of this reduction varies substantially across settings. For example, Qwen3-8B on mathematical reasoning exhibits only a mild decrease of roughly 5%, whereas many other model–task combinations show much larger reductions, often in the range of 60%–80%. These results suggest that fine-tuning contracts the effective generation space at both the token and trajectory levels, producing outputs that are more concentrated and deterministic. At the same time, the heterogeneous magnitude of the reduction suggests that alignment does not impose a uniform compression of uncertainty. Instead, the extent of contraction depends on the underlying task structure and model behavior, a phenomenon we discuss in the following paragraph. Overall, our findings are consistent with prior observations that alignment reduces diversity and increases probability concentration ([wang2025optimizing,](https://arxiv.org/html/2605.30844#bib.bib34); [yang2025llm,](https://arxiv.org/html/2605.30844#bib.bib39); [lake2025distributional,](https://arxiv.org/html/2605.30844#bib.bib17); [west2025base,](https://arxiv.org/html/2605.30844#bib.bib37)), although the shrinkage we observe is less extreme than the order-of-magnitude reductions reported in the previous study ([yang2025llm](https://arxiv.org/html/2605.30844#bib.bib39)).

Table 1: Task-dependent changes in generation-space uncertainty after instruction tuning across different model families. We report three complementary uncertainty measures: GenPPL, BF, and $\mathrm{CE}^\star$ (see [subsection A.6](https://arxiv.org/html/2605.30844#A1.SS6)). For each task and model family, we compare the base and instruct variants and report the relative percentage change after instruction tuning. Standard errors (SE) are estimated via prompt-cluster bootstrapping (2K iterations, resampled with replacement) as in [subsection C.3](https://arxiv.org/html/2605.30844#A3.SS3). Blue cells indicate reductions in uncertainty after instruction tuning, while red cells indicate increases. Darker color intensity corresponds to larger magnitude changes.

| Task | Type | GenPPL: Qwen3-8B | GenPPL: Llama-3.1-8B | GenPPL: Gemma-3-12B | BF: Qwen3-8B | BF: Llama-3.1-8B | BF: Gemma-3-12B | $\mathrm{CE}^\star$: Qwen3-8B | $\mathrm{CE}^\star$: Llama-3.1-8B | $\mathrm{CE}^\star$: Gemma-3-12B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Math | Base | 1.23 ± 0.0052 | 3.57 ± 0.041 | 4.56 ± 0.052 | 1.23 ± 0.0064 | 5.31 ± 0.12 | 4.98 ± 0.055 | 55.7 ± 2.0 | 458 ± 18 | 487 ± 11 |
| Math | Instruct | 1.17 ± 0.0067 | 1.74 ± 0.063 | 1.17 ± 0.0098 | 1.16 ± 0.0057 | 1.57 ± 0.024 | 1.15 ± 0.0054 | 40.5 ± 2.5 | 98.7 ± 9.1 | 59.1 ± 5.1 |
| Math | % Change | -4.44 ± 0.47% | -51.2 ± 2.0% | -74.4 ± 0.34% | -5.76 ± 0.38% | -70.5 ± 0.73% | -76.9 ± 0.24% | -27.3 ± 3.3% | -78.4 ± 2.2% | -87.9 ± 1.1% |
| Coding | Base | 1.68 ± 0.017 | 3.16 ± 0.055 | 2.60 ± 0.041 | 2.08 ± 0.034 | 4.96 ± 0.057 | 3.24 ± 0.038 | 207 ± 4.8 | 326 ± 13 | 492 ± 25 |
| Coding | Instruct | 1.23 ± 0.0053 | 1.31 ± 0.0088 | 1.05 ± 0.0039 | 1.21 ± 0.0061 | 1.31 ± 0.0081 | 1.05 ± 0.0021 | 81.2 ± 4.0 | 135 ± 4.3 | 17.7 ± 1.6 |
| Coding | % Change | -27.1 ± 0.79% | -58.4 ± 0.74% | -59.7 ± 0.64% | -41.9 ± 0.99% | -73.7 ± 0.35% | -67.8 ± 0.38% | -60.7 ± 1.6% | -58.6 ± 1.7% | -96.4 ± 0.39% |
| Sentence Completion | Base | 4.77 ± 0.16 | 6.73 ± 0.12 | 8.81 ± 0.11 | 4.20 ± 0.11 | 7.80 ± 0.15 | 9.49 ± 0.11 | 826 ± 55 | 969 ± 40 | 1084 ± 36 |
| Sentence Completion | Instruct | 1.68 ± 0.025 | 2.56 ± 0.069 | 1.80 ± 0.019 | 1.55 ± 0.018 | 2.75 ± 0.045 | 1.74 ± 0.024 | 112 ± 11 | 201 ± 11 | 515 ± 29 |
| Sentence Completion | % Change | -64.9 ± 1.1% | -62.0 ± 1.3% | -79.6 ± 0.30% | -63.0 ± 0.88% | -64.8 ± 0.83% | -81.6 ± 0.30% | -86.5 ± 1.0% | -79.3 ± 1.3% | -52.5 ± 2.8% |
| Story Generation | Base | 3.57 ± 0.11 | 7.40 ± 0.071 | 7.98 ± 0.062 | 4.03 ± 0.092 | 8.24 ± 0.073 | 8.68 ± 0.070 | 445 ± 20 | 1320 ± 41 | 692 ± 23 |
| Story Generation | Instruct | 2.28 ± 0.037 | 3.24 ± 0.092 | 2.22 ± 0.023 | 2.11 ± 0.039 | 3.37 ± 0.083 | 2.17 ± 0.025 | 454 ± 28 | 543 ± 22 | 925 ± 40 |
| Story Generation | % Change | -36.0 ± 1.3% | -56.2 ± 1.4% | -72.2 ± 0.28% | -47.6 ± 1.3% | -59.1 ± 1.2% | -75.0 ± 0.28% | 1.92 ± 3.5% | -58.9 ± 1.6% | 33.7 ± 8.3% |

**The effect of fine-tuning on canopy entropy is *task-dependent* and can exhibit opposite trends.** In structured domains such as math, coding, and sentence completion, $\mathrm{CE}^\star$ decreases significantly after fine-tuning. In contrast, story generation exhibits a different pattern: Qwen3-8B shows a slight increase of roughly 2%, while Gemma-3-12B shows a much larger increase of about 34% (see [Table 1](https://arxiv.org/html/2605.30844#S5.T1)). This suggests that fine-tuning does not uniformly reduce canopy entropy, but instead behaves differently in constrained versus open-ended generation settings.

For deterministic tasks such as math and coding, the valid generation space is naturally constrained, so fine\-tuning acts as a pruning force that collapses the model’s generation space into a small set of correct trajectories\. Consequently, both stopping\-time variance and local token\-level entropy are significantly reduced\. By contrast, on creative tasks, fine\-tuned models exhibit a notable rightward shift in sequence length distributions \(see[Figure 3](https://arxiv.org/html/2605.30844#S5.F3)\), reflecting a tendency toward longer and more elaborative generations\. In the same domain, the corresponding base models frequently fall into “completion traps,” treating prompts as short text continuation tasks rather than directly answering the underlying questions \(see[subsection C\.2](https://arxiv.org/html/2605.30844#A3.SS2)\)\.

Although instruction tuning reduces local branching diversity, it also increases stopping-time uncertainty and sustains generation over much deeper trajectories. As illustrated in [Figure 2](https://arxiv.org/html/2605.30844#S5.F2), while the base model has higher local entropy rates (width), it lacks enough trajectory length (depth) to accumulate a large Canopy Entropy ($\mathrm{CE}^\star$). Since $\mathrm{CE}^\star$ acts as a measure of the total *volume* of the generation tree by accumulating uncertainty across the entire rollout, including stopping behavior, the increase in trajectory depth can mathematically outweigh the reduction in local branching.

These findings suggest that fine\-tuning changes not only the magnitude of uncertainty, but also how uncertainty is distributed across the trajectory, particularly through its interaction with generation length and stopping behavior\.
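The width-versus-depth trade-off can be made explicit with a short identity. This is a sketch under an assumption not restated in this section: that the per-rollout entropy rate is $r_N = S_N / N$, where $S_N = \sum_{t=1}^{N} \widetilde{H}(G_t \mid X)$ is the cumulative local entropy of the rollout, and that the correlation below is the pooled Pearson correlation (the definition used for the reported results may additionally control for prompts). Then

$$\mathrm{CE}^\star = \mathbb{E}[S_N] = \mathbb{E}[N\, r_N] = \mathbb{E}[N]\,\mathbb{E}[r_N] + \operatorname{Cov}(N, r_N) = \mathbb{E}[N]\,\mathbb{E}[r_N] + \rho(N, r_N)\,\sigma_N\,\sigma_{r_N}.$$

With purely hypothetical numbers and the covariance term ignored, a base model with $\mathbb{E}[N] = 100$ and $\mathbb{E}[r_N] = 2.0$ nats gives $200$ nats, while a fine-tuned model with $\mathbb{E}[N] = 500$ and $\mathbb{E}[r_N] = 0.8$ nats gives $400$ nats: the lower entropy rate is more than offset by the greater depth. The covariance term shows why the sign of the length–entropy correlation also matters for the total.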

![Refer to caption](https://arxiv.org/html/2605.30844v1/x2.png)

Figure 2: Running entropy rate vs. token position. We plot the mean running entropy rate $\bar{R}_{\leq t} = \frac{1}{t} \sum_{s=1}^{t} H(Z_s \mid X, G_s)$ averaged across active rollouts at position $t$, with the corresponding active rollout count beneath. Dashed curves denote base models and solid curves denote fine-tuned models.

![Refer to caption](https://arxiv.org/html/2605.30844v1/x3.png)

Figure 3: Kernel density estimates of generated sequence lengths across task domains, with sequence length shown on a logarithmic scale.

**Fine-tuning consistently increases length–entropy correlation.** To better understand how uncertainty is distributed across trajectories, we analyze the relationship between output length $N$ and entropy rate $r_N$. Empirically, fine-tuning consistently shifts the correlation (see [Definition 4](https://arxiv.org/html/2605.30844#Thmdefinition4)) between $N$ and $r_N$ from negative toward positive values in structured tasks, while making the correlation less negative in open-ended tasks such as story generation. [Table 2](https://arxiv.org/html/2605.30844#S5.T2) shows the concrete results under Pearson correlation. Results based on Spearman correlation and Kendall's $\tau$ exhibit the same trends and are deferred to [Table 6](https://arxiv.org/html/2605.30844#A3.T6) in the appendix to avoid repetition.

This reveals a fundamental change in generation behavior\. In base models, longer outputs are often associated with lower entropy rates, meaning later tokens become increasingly redundant and contribute little new information\. In contrast, fine\-tuned models exhibit much stronger positive length–entropy correlation: longer generations tend to sustain higher entropy rates and remain informative throughout the trajectory \(see[Figure 2](https://arxiv.org/html/2605.30844#S5.F2)\)\.

The distinction is especially important in the presence of length confounding\. As shown in[Figure 3](https://arxiv.org/html/2605.30844#S5.F3), fine\-tuned models generally produce longer outputs than their base counterparts, consistent with observations in prior work\([singhal2023long,](https://arxiv.org/html/2605.30844#bib.bib29)\)\. Under conventional diversity metrics that do not explicitly account for length\([tevet2021evaluating,](https://arxiv.org/html/2605.30844#bib.bib32);[li2016diversity,](https://arxiv.org/html/2605.30844#bib.bib18)\), these longer generations can artificially appear less diverse due to normalization effects or repetition accumulation, potentially biasing diversity comparisons\([friedman2022vendi,](https://arxiv.org/html/2605.30844#bib.bib8);[shaib2024standardizing,](https://arxiv.org/html/2605.30844#bib.bib26)\)\. The positive length–entropy correlation observed here demonstrates that increased length in fine\-tuned models reflects sustained information content rather than simple redundancy\.

Overall, these results suggest that fine\-tuning improves not merely the quantity of uncertainty, but also its organization and efficiency\. Rather than uniformly suppressing diversity, fine\-tuning restructures uncertainty to better align with task requirements—reducing unnecessary randomness while preserving informative variation across longer trajectories\.

Table 2: Base–instruct comparison of the length–entropy Pearson correlation, measured by $\rho(N, r_N)$ (see [Definition 4](https://arxiv.org/html/2605.30844#Thmdefinition4)), for the Qwen3-8B, LLaMA-3.1-8B, and Gemma-3-12B model families. Each entry reports the estimate ± standard error, where the absolute change is computed as the instruct value minus the corresponding base value.

| Task | Model | Base | Instruct | Abs. Change |
| --- | --- | --- | --- | --- |
| Math | Qwen | -0.018 ± 0.016 | 0.087 ± 0.037 | 0.11 ± 0.042 |
| Math | Llama | -0.27 ± 0.016 | 0.38 ± 0.040 | 0.65 ± 0.044 |
| Math | Gemma | -0.13 ± 0.014 | 0.19 ± 0.040 | 0.32 ± 0.043 |
| Coding | Qwen | -0.41 ± 0.014 | 0.61 ± 0.034 | 1.0 ± 0.037 |
| Coding | Llama | -0.38 ± 0.011 | 0.29 ± 0.021 | 0.67 ± 0.021 |
| Coding | Gemma | -0.27 ± 0.012 | 0.39 ± 0.035 | 0.67 ± 0.037 |
| Sentence Completion | Qwen | 0.062 ± 0.017 | 0.52 ± 0.024 | 0.46 ± 0.027 |
| Sentence Completion | Llama | -0.26 ± 0.015 | -0.20 ± 0.024 | 0.053 ± 0.028 |
| Sentence Completion | Gemma | -0.23 ± 0.015 | 0.12 ± 0.038 | 0.36 ± 0.041 |
| Story Generation | Qwen | -0.27 ± 0.017 | 0.16 ± 0.021 | 0.43 ± 0.027 |
| Story Generation | Llama | -0.24 ± 0.012 | -0.10 ± 0.020 | 0.14 ± 0.022 |
| Story Generation | Gemma | -0.17 ± 0.012 | -0.049 ± 0.027 | 0.12 ± 0.032 |
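A length–entropy correlation of the kind reported in Table 2 can be sketched as follows. This is a simplified illustration, not the paper's estimator: it assumes the per-rollout entropy rate is the cumulative local entropy divided by the length, pools all rollouts into one plain Pearson correlation, and uses hypothetical function names. The definition behind the reported numbers may additionally control for prompt effects.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def length_entropy_correlation(rollouts):
    """rollouts: list of (length N, cumulative local entropy S) pairs.

    The entropy rate of each rollout is taken to be S / N (an assumption).
    """
    lengths = [n for n, _ in rollouts]
    rates = [s / n for n, s in rollouts]
    return pearson(lengths, rates)


if __name__ == "__main__":
    # Hypothetical rollouts whose entropy rate rises with length.
    print(length_entropy_correlation([(10, 5.0), (20, 14.0), (30, 27.0)]))
    # Hypothetical rollouts whose entropy rate falls with length.
    print(length_entropy_correlation([(10, 9.0), (20, 14.0), (30, 15.0)]))
```

A positive value corresponds to longer rollouts sustaining a higher entropy rate, and a negative value to longer rollouts becoming more predictable per token.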

### 5.2 Fine-tuning more effectively converts token uncertainty into semantic diversity

After establishing how fine\-tuning reshapes entropy allocation, we next investigate how these changes in entropy rate translate into semantic diversity\([guo2025benchmarking,](https://arxiv.org/html/2605.30844#bib.bib10)\)\. Specifically, we study how generation uncertainty is converted into semantic diversity across sampled trajectories\.

For each task–prompt–model tuple $(t, p, m)$, we measure semantic diversity by computing the average pairwise cosine distance between generated outputs ([tevet2021evaluating,](https://arxiv.org/html/2605.30844#bib.bib32)):

$$D_{tpm} = \frac{2}{M(M-1)} \sum_{i<j} d(e_{t,p,m,i}, e_{t,p,m,j}),$$

where $e_{t,p,m,i}$ denotes the embedding of the $i$-th Monte Carlo rollout and $d(\cdot, \cdot)$ denotes cosine distance. Embeddings are computed with the ModernBERT embedding model (modernbert-embed-large) ([modernbert,](https://arxiv.org/html/2605.30844#bib.bib35)), with each rollout prepended with the retrieval prefix ("search_document:") to ensure alignment with the encoder's training objective. Since cosine distance is bounded in $[0, 1]$, we employ Beta mixed-effects regression, a standard method that accommodates bounded responses while allowing flexible modeling of dispersion and hierarchical variation ([figueroa2013mixed,](https://arxiv.org/html/2605.30844#bib.bib6)) (see footnote 4).

Footnote 4: In contrast, Gaussian linear models are inappropriate for bounded responses and exhibit clear heteroscedasticity and residual misspecification in diagnostic analyses.
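The diversity score defined above can be sketched as follows. This minimal illustration takes embeddings as plain lists of floats and leaves out the embedding step itself; the function names are hypothetical, and in practice the vectors would come from the sentence encoder.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_diversity(embeddings):
    """Average pairwise cosine distance over all unordered pairs i < j.

    Averaging over the M(M-1)/2 pairs is the same as multiplying the
    pairwise sum by 2 / (M(M-1)).
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Two identical rollouts and one orthogonal rollout (toy 2-d embeddings).
    print(semantic_diversity([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]))
```

In the toy example, two of the three pairs are orthogonal and one is identical, so the score is the mean of the distances 1, 0, and 1.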

Formally, we model

$$D_{tpm} \sim \operatorname{Beta}\left(\mu_{tpm}\phi_{tpm},\ (1-\mu_{tpm})\phi_{tpm}\right),$$

where $\mu_{tpm} \in [0, 1]$ denotes the conditional mean and $\phi_{tpm} > 0$ is the precision parameter. The conditional mean model is specified as

$$\operatorname{logit}(\mu_{tpm}) = \alpha + \beta_1 R_{tpm} + \beta_2 \mathbf{1}\{m = \mathrm{FT}\} + f\left(\mathrm{invN}_{tpm}\right) + \sum_k \tau_k \mathbf{1}\{t = k\} + \beta_3 \left(R_{tpm} \cdot \mathbf{1}\{m = \mathrm{FT}\}\right) + g\left(\mathrm{invN}_{tpm}\right) \mathbf{1}\{m = \mathrm{FT}\} + u_p + v_m,$$

where $R_{tpm}$ is the entropy rate, $\mathrm{invN}_{tpm}$ is the inverse sequence length, and $f(\cdot)$, $g(\cdot)$ are natural spline functions corresponding to nonlinear length effects and their interaction with instruction tuning. For consistency with the measurement of semantic diversity, the reported sequence length and entropy rate are likewise computed by averaging over the same set of $M$ rollouts for each task–prompt–model combination. The coefficients $\tau_k$ represent fixed task effects. To account for hierarchical dependence, we include random intercepts $u_p \sim \mathcal{N}(0, \sigma_p^2)$ and $v_m \sim \mathcal{N}(0, \sigma_m^2)$ for prompt identity and model family, respectively. To capture heteroscedasticity, we also model the precision parameter $\phi_{tpm}$ as a function of task, model variant, entropy rate, sequence length, and model family.

Residual diagnostics (see [Figure 6](https://arxiv.org/html/2605.30844#A3.F6)) indicate good calibration of the mean structure (Kolmogorov-Smirnov test ([massey1951kolmogorov,](https://arxiv.org/html/2605.30844#bib.bib21)): $p = 0.55$), no significant dispersion issues ($p = 0.068$), and no outliers ($p = 0.16$).
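The mean-precision parameterization of the Beta distribution used in the model can be checked numerically. The sketch below is an illustration only, with hypothetical function names and arbitrary parameter values; it relies on the standard facts that $\operatorname{Beta}(\mu\phi, (1-\mu)\phi)$ has mean $\mu$ and variance $\mu(1-\mu)/(1+\phi)$.

```python
import random
import statistics


def sample_beta_mean_precision(mu, phi, size, seed=0):
    """Draw from Beta(mu * phi, (1 - mu) * phi), i.e. mean mu, precision phi."""
    rng = random.Random(seed)
    shape_a = mu * phi
    shape_b = (1.0 - mu) * phi
    return [rng.betavariate(shape_a, shape_b) for _ in range(size)]


def beta_variance(mu, phi):
    """Closed-form variance under the mean-precision parameterization."""
    return mu * (1.0 - mu) / (1.0 + phi)


if __name__ == "__main__":
    draws = sample_beta_mean_precision(mu=0.3, phi=50.0, size=20000)
    print(statistics.fmean(draws), 0.3)
    print(statistics.pvariance(draws), beta_variance(0.3, 50.0))
```

A larger precision $\phi$ shrinks the variance around the same mean, which is why modeling $\phi_{tpm}$ as a function of covariates is a way to capture heteroscedasticity.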

**Main findings.** First, entropy rate exhibits a strong and highly significant positive association with semantic diversity. The coefficient on entropy rate is large and significant ($\beta_1 = 0.708$, $p < 2 \times 10^{-16}$), indicating that higher per-token uncertainty is associated with substantially greater semantic diversity across generated outputs. More importantly, the interaction between entropy rate and fine-tuning is strongly positive ($\beta_3 = 1.183$, $p < 2 \times 10^{-16}$). Under the logit link, this implies a substantial amplification of the slope:

$$\frac{\partial \operatorname{logit}(\mu)}{\partial R} = 0.708 \ \text{(base)}, \qquad 0.708 + 1.183 = 1.891 \ \text{(fine-tuned)}.$$

Thus, fine-tuning nearly triples the sensitivity of semantic diversity to entropy rate (a ratio of about $2.7$). This indicates that uncertainty in fine-tuned models is significantly more *semantically productive*: small increases in token-level uncertainty translate into much larger increases in trajectory-level diversity.

At the same time, the main effect of fine-tuning is negative ($\beta_2 = -0.377$, $p < 2 \times 10^{-5}$), indicating that at fixed entropy rate and length, fine-tuned models produce less baseline semantic diversity. This is consistent with alignment compressing the generation space and enforcing more structured outputs. However, the strong positive interaction shows that the remaining uncertainty is used more efficiently. We also observe a nonlinear and generally negative relationship between semantic diversity and sequence length, making length control essential in diversity analysis (see [Table 3](https://arxiv.org/html/2605.30844#S5.T3)).

Table 3: Beta GLMM with dispersion modeling and interaction effects. For completeness, we direct readers to [subsection C.4](https://arxiv.org/html/2605.30844#A3.SS4) for the full conditional mean and dispersion regression tables, along with additional model specifications and regression details.

| Conditional model | Estimate | Std. Error | $p$-value |
| --- | --- | --- | --- |
| Intercept | -0.880 | 0.070 | $<2 \times 10^{-16}$ |
| $R$ | 0.708 | 0.033 | $<2 \times 10^{-16}$ |
| Instruct | -0.377 | 0.086 | $1.23 \times 10^{-5}$ |
| $\mathrm{ns}(\mathrm{invN}, 4)_1$ | 0.152 | 0.041 | $1.99 \times 10^{-4}$ |
| $R \times$ Instruct | 1.183 | 0.046 | $<2 \times 10^{-16}$ |
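The slope arithmetic behind the "nearly triples" statement can be reproduced from the two coefficients in Table 3. The sketch below is illustrative and uses hypothetical function names; the second helper applies the standard logit-link identity $\partial \mu / \partial R = (\partial \operatorname{logit}(\mu) / \partial R)\,\mu(1-\mu)$, evaluated at an arbitrary example value of $\mu$ that is not taken from the paper.

```python
def entropy_rate_slope(fine_tuned, coef_rate=0.708, coef_interaction=1.183):
    """Logit-scale slope of semantic diversity in the entropy rate R."""
    return coef_rate + (coef_interaction if fine_tuned else 0.0)


def mean_scale_slope(mu, logit_slope):
    """Slope on the mean scale under a logit link, at conditional mean mu."""
    return logit_slope * mu * (1.0 - mu)


if __name__ == "__main__":
    base = entropy_rate_slope(fine_tuned=False)
    tuned = entropy_rate_slope(fine_tuned=True)
    print(base, tuned, tuned / base)
    # Example only: mean-scale slopes at an arbitrary conditional mean of 0.3.
    print(mean_scale_slope(0.3, base), mean_scale_slope(0.3, tuned))
```

Because the factor $\mu(1-\mu)$ is shared, the ratio of the two slopes is the same on the logit scale and on the mean scale when both are evaluated at the same $\mu$.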

## 6 Conclusion and limitations

In this work, we revisit diversity in large language models through a trajectory\-level information\-theoretic perspective\. By introducing Canopy Entropy and length–entropy coupling, we show that fine\-tuning does not simply reduce uncertainty or shrink the generation space\. Instead, aligned models reorganize uncertainty more efficiently across generation trajectories, producing longer generations that remain more semantically informative\. Our results highlight the importance of modeling how uncertainty evolves throughout the rollout process, beyond aggregate diversity metrics alone\.

Our work also has several limitations\. Entropy\-based measures characterize uncertainty allocation, but do not directly capture factuality or human preference quality\. Our framework is descriptive rather than causal: although we observe systematic changes in length–entropy coupling after fine\-tuning, the underlying mechanisms causing this effect remain an important direction for future work\.

## References

- (1) Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025.
- (2) Tom B. Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. NeurIPS, 2020.
- (3) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- (4) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
- (5) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- (6) Jorge I Figueroa-Zúñiga, Reinaldo B Arellano-Valle, and Silvia LP Ferrari. Mixed beta regression: A bayesian perspective. Computational Statistics & Data Analysis, 61:137–147, 2013.
- (7) Gerald B Folland. Real analysis: modern techniques and their applications. John Wiley & Sons, 1999.
- (8) Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022.
- (9) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- (10) Yanzhu Guo, Guokan Shang, and Chloé Clavel. Benchmarking linguistic diversity of large language models. Transactions of the Association for Computational Linguistics, 13:1507–1526, 2025.
- (11) Florian Hartig. Dharma: residual diagnostics for hierarchical (multi-level/mixed) regression models. CRAN: contributed packages, 2016.
- (12) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- (13) Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. Amortizing intractable inference in large language models. arXiv preprint arXiv:2310.04363, 2023.
- (14) Mete Ismayilzada, Antonio Laverghetta Jr., Simone A. Luchini, Reet Patel, Antoine Bosselut, Lonneke van der Plas, and Roger E. Beaty. Creative preference optimization, 2025.
- (15) Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. Perplexity—a measure of the difficulty of speech recognition tasks. The journal of the Acoustical Society of America, 62(S1):S63–S63, 1977.
- (16) Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938.
- (17) Thom Lake, Eunsol Choi, and Greg Durrett. From distributional to overton pluralism: Investigating large language model alignment. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6794–6814, 2025.
- (18) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 110–119, 2016.
- (19) Wei Lu, Rachel K Luu, and Markus J Buehler. Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities. npj Computational Materials, 11(1):84, 2025.
- (20) Henry B Mann and Abraham Wald. On stochastic limit and order relationships. The Annals of Mathematical Statistics, 14(3):217–226, 1943.
- (21) Frank J Massey Jr. The kolmogorov-smirnov test for goodness of fit. Journal of the American statistical Association, 46(253):68–78, 1951.
- (22) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
- (23) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023.
- (24) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- (25) Walter Rudin. Principles of mathematical analysis. 2021.
- (26) Chantal Shaib, Venkata S Govindarajan, Joe Barrow, Jiuding Sun, Alexa F Siu, Byron C Wallace, and Ani Nenkova. Standardizing the measurement of text diversity: A tool and a comparative analysis of scores. arXiv preprint arXiv:2403.00553, 2024.
- (27) Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
- (28) Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, and Osbert Bastani. Does instruction tuning reduce diversity? a case study using code generation.
- (29) Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf. arXiv preprint arXiv:2310.03716, 2023.
- (30) Charles Spearman. The proof and measurement of association between two things. 1961.
- (31) Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, et al. Gemma 3 technical report, 2025.
- (32) Guy Tevet and Jonathan Berant. Evaluating the evaluation of diversity in natural language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 326–346, 2021.
- (33) Shumin Wang, Yuexiang Xie, Wenhao Zhang, Yuchang Sun, Yanxi Chen, Yaliang Li, and Yanyong Zhang. On the entropy dynamics in reinforcement fine-tuning of large language models. arXiv preprint arXiv:2602.03392, 2026.
- (34) Yichen Wang, Chenghao Yang, Tenghao Huang, Muhao Chen, Jonathan May, and Mina Lee. Optimizing diversity and quality through base-aligned model collaboration. arXiv preprint arXiv:2511.05650, 2025.
- (35) Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, 2024.
- (36) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- (37) Peter West and Christopher Potts. Base models beat aligned models at randomness and creativity. arXiv preprint arXiv:2505.00047, 2025.
- (38) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- (39) Chenghao Yang, Sida Li, and Ari Holtzman. Llm probability concentration: How alignment shrinks the generative horizon. arXiv preprint arXiv:2506.17871, 2025.
- (40) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- (41) Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.

## Appendix A Missing proofs and algorithms

### A.1 Equivalence to a two-stage stochastic decision process

###### Theorem 3 (Equivalence between autoregressive generation and a two-stage decision process).

Fix a prompt $X = x$ and a generation history $G_t = g_t = (y_1, \ldots, y_{t-1})$. Suppose the autoregressive model defines a probability distribution over the extended vocabulary $\mathcal{Z} := \mathcal{V} \cup \{\texttt{EOS}\}$, denoted by $q(z \mid g_t, x)$, $z \in \mathcal{Z}$. Then this one-step generation rule is equivalent to the following two-stage stochastic decision process. First, draw a decision $A_t \in \{\textsf{STOP}, \textsf{CONT}\}$, where

$$\mathbb{P}(A_t = \textsf{STOP} \mid g_t, x) = q(\texttt{EOS} \mid g_t, x), \qquad \mathbb{P}(A_t = \textsf{CONT} \mid g_t, x) = 1 - q(\texttt{EOS} \mid g_t, x).$$

Second, conditional on $A_t = \textsf{CONT}$, the next token $Y_t \in \mathcal{V}$ is sampled from

$$\mathbb{P}(Y_t = v \mid g_t, x, A_t = \textsf{CONT}) = \frac{q(v \mid g_t, x)}{1 - q(\texttt{EOS} \mid g_t, x)}, \quad v \in \mathcal{V},$$

whenever $1 - q(\texttt{EOS} \mid g_t, x) > 0$. This two-stage process induces the same distribution over generated sequences as the original autoregressive model.

###### Proof\.

Fix a prompt $x$ and history $g_t$. Under the original autoregressive model, the next outcome is sampled directly from $q(z \mid g_t, x)$, $z \in \mathcal{V} \cup \{\texttt{EOS}\}$. In particular, generation terminates at step $t$ with probability $q(\texttt{EOS} \mid g_t, x)$. Now consider the two-stage process. The probability of termination is $\mathbb{P}(A_t = \textsf{STOP} \mid g_t, x) = q(\texttt{EOS} \mid g_t, x)$, which matches the probability assigned to $\texttt{EOS}$ by the original model. Next, for any token $v \in \mathcal{V}$, the probability that the two-stage process generates $v$ is

$$\begin{aligned}
\mathbb{P}(Y_t = v \mid g_t, x) &= \mathbb{P}(A_t = \textsf{CONT} \mid g_t, x)\,\mathbb{P}(Y_t = v \mid g_t, x, A_t = \textsf{CONT}) \\
&= \left(1 - q(\texttt{EOS} \mid g_t, x)\right)\frac{q(v \mid g_t, x)}{1 - q(\texttt{EOS} \mid g_t, x)} \\
&= q(v \mid g_t, x).
\end{aligned}$$

Thus, the two-stage process assigns exactly the same probability to every possible next outcome:

$$\mathbb{P}(\textsf{STOP} \mid g_t, x) = q(\texttt{EOS} \mid g_t, x), \qquad \mathbb{P}(Y_t = v \mid g_t, x) = q(v \mid g_t, x), \quad v \in \mathcal{V}.$$

Therefore, the one-step transition distribution is identical under the two formulations. Since this equality holds for every prompt $x$, every generation history $g_t$, and every generation step $t$, the induced probability of any finite sequence $y_{1:N}$ terminating at length $N$ is the same under both processes. Thus, the two-stage stochastic decision process is equivalent to the original autoregressive generation model. ∎
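The equivalence in Theorem 3 can be illustrated with a minimal sketch. The toy distribution `q`, the token names, and the helper functions below are hypothetical and only serve to show that the STOP/CONT decomposition reproduces the one-step probabilities.

```python
import random

EOS = "<EOS>"


def two_stage_probs(q):
    """Outcome distribution induced by the two-stage STOP/CONT process."""
    p_stop = q[EOS]
    p_cont = 1.0 - p_stop
    induced = {EOS: p_stop}
    for token, prob in q.items():
        if token == EOS:
            continue
        # P(CONT) * P(token | CONT), with P(token | CONT) = q(token) / (1 - q(EOS)).
        induced[token] = p_cont * (prob / p_cont) if p_cont > 0 else 0.0
    return induced


def sample_two_stage(q, rng):
    """Draw one outcome: first decide STOP vs CONT, then pick a token."""
    if rng.random() < q[EOS]:
        return EOS
    tokens = [t for t in q if t != EOS]
    # random.choices normalises the weights, i.e. divides by 1 - q(EOS).
    return rng.choices(tokens, weights=[q[t] for t in tokens], k=1)[0]


# Hypothetical one-step distribution over the extended vocabulary.
q = {"a": 0.5, "b": 0.3, EOS: 0.2}
induced = two_stage_probs(q)
```

Here `induced` matches `q` entry by entry (up to floating-point rounding), which is the one-step identity used in the proof.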

### A.2 Proof of [Theorem 1](https://arxiv.org/html/2605.30844#Thmtheorem1)

###### Proof\.

Let $X \sim p_X$ denote the prompt distribution. For each realization $X = x$, consider the sequence of random variables $(Z_t)_{t \geq 1}$ with $Z_t \in \mathcal{Z} := \mathcal{V} \cup \{\texttt{EOS}\}$, where selecting $\texttt{EOS}$ indicates termination. Define the stopping time $N := \min\{t : Z_t = \texttt{EOS}\}$, and let $Y_t = Z_t$ for $t < N$. Adopt the convention that $Z_t = \texttt{EOS}$ for all $t > N$, which induces a one-to-one correspondence between the infinite sequence $(Z_1, Z_2, \ldots)$ and the pair $(N, Y_{1:N})$, conditional on $X$. Therefore,

$$H(N, Y_{1:N} \mid X) = H(Z_1, Z_2, \ldots \mid X).$$

By the chain rule of entropy,

$$H(Z_1, Z_2, \ldots \mid X) = \sum_{t=1}^{\infty} H(Z_t \mid Z_{1:t-1}, X).$$

Since the history $G_t$ is completely determined by $(Z_{1:t-1}, X)$, we have

$$H(Z_t \mid Z_{1:t-1}, X) = H(Z_t \mid G_t, X) = \mathbb{E}_{G_t, X}\big[\widetilde{H}(G_t \mid X)\big].$$

Therefore,

$$H(Z_1, Z_2, \ldots \mid X) = \sum_{t=1}^{\infty} \mathbb{E}_{G_t, X}\big[\widetilde{H}(G_t \mid X)\big].$$

Since $\widetilde{H}(G_t \mid X) \geq 0$, we can apply Tonelli's theorem [[7](https://arxiv.org/html/2605.30844#bib.bib7)] to exchange the sum and expectation:

$$\sum_{t=1}^{\infty} \mathbb{E}_{G_t, X}\big[\widetilde{H}(G_t \mid X)\big] = \mathbb{E}_{G_t, X}\left[\sum_{t=1}^{\infty} \widetilde{H}(G_t \mid X)\right].$$

Moreover, for all $t > N$, $Z_t = \texttt{EOS}$ deterministically, so $\widetilde{H}(G_t \mid X) = 0$. Therefore, $\sum_{t=1}^{\infty} \widetilde{H}(G_t \mid X) = \sum_{t=1}^{N(X)} \widetilde{H}(G_t \mid X)$ almost surely. Combining the above, we have

$$H(Z_1, Z_2, \ldots \mid X) = \mathbb{E}_{G_t, X}\left[\sum_{t=1}^{N(X)} \widetilde{H}(G_t \mid X)\right].$$

Finally, since $(Z_1, Z_2, \ldots)$ is in one-to-one correspondence with $(N, Y_{1:N})$ given $X$, we obtain

$$H(N, Y_{1:N} \mid X) = \mathbb{E}_{G_t, X}\left[\sum_{t=1}^{N(X)} \widetilde{H}(G_t \mid X)\right] = \mathrm{CE}^{\star}.$$

∎
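The identity just proved can be checked exactly on a small example. The following is a minimal sketch for a single fixed prompt, assuming a hypothetical history-dependent toy model `q` that always stops by step `MAX_DEPTH`, so that every rollout can be enumerated. It compares the Shannon entropy of the distribution over complete rollouts with the expected sum of local entropies.

```python
import math

EOS = "<EOS>"
MAX_DEPTH = 6  # assumption: the toy model always emits EOS by this step


def q(history):
    """Hypothetical next-outcome distribution given the history (a tuple of tokens)."""
    t = len(history) + 1
    if t >= MAX_DEPTH:
        return {EOS: 1.0}
    p_eos = 0.1 * t
    bias = 0.7 if (history and history[-1] == "a") else 0.4
    return {"a": (1 - p_eos) * bias, "b": (1 - p_eos) * (1 - bias), EOS: p_eos}


def local_entropy(dist):
    """One-step entropy of the next-outcome distribution, in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def rollouts(history=(), prob=1.0):
    """Yield (token sequence, probability) for every complete rollout."""
    for z, p in q(history).items():
        if p == 0:
            continue
        if z == EOS:
            yield history, prob * p
        else:
            yield from rollouts(history + (z,), prob * p)


def expected_local_entropy_sum(history=()):
    """Expected sum of local entropies up to termination, by recursion on the tree."""
    dist = q(history)
    total = local_entropy(dist)
    for z, p in dist.items():
        if z != EOS and p > 0:
            total += p * expected_local_entropy_sum(history + (z,))
    return total


shannon = -sum(p * math.log(p) for _, p in rollouts())
canopy = expected_local_entropy_sum()
```

Under these assumptions `shannon` and `canopy` agree up to floating-point error, since both expand the same chain rule over the generation tree.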

### A.3 Algorithms

Input: prompt distribution $p_X$; number of prompts $P$; rollout count per prompt $M$; maximum length $T_{\max}$; sampling policy $\pi$ (e.g. temperature/top-$p$/top-$k$).

- for $p = 1$ to $P$ do
  - Sample prompt $x_p \sim p_X$;
  - for $i = 1$ to $M$ do
    - Initialize history $G_1^{(p,i)} \leftarrow x_p$; set $S_{\max}^{(p,i)} \leftarrow 0$;
    - for $t = 1$ to $T_{\max}$ do
      - Compute model distribution $q_t^{(p,i)}(\cdot \mid G_t^{(p,i)}, x_p)$ under sampling policy $\pi$, including $\texttt{EOS}$ as an outcome;
      - Compute one-step local entropy $\tilde{g}_t^{(p,i)} \leftarrow -\sum_{z \in \mathcal{Z}} q_t^{(p,i)}(z \mid G_t^{(p,i)}, x_p) \log q_t^{(p,i)}(z \mid G_t^{(p,i)}, x_p)$;
      - Update $S_{\max}^{(p,i)} \leftarrow S_{\max}^{(p,i)} + \tilde{g}_t^{(p,i)}$;
      - Sample next outcome $z_t^{(p,i)} \sim q_t^{(p,i)}(\cdot \mid G_t^{(p,i)}, x_p)$;
      - If $z_t^{(p,i)} = \texttt{EOS}$ then set $N_{\max}^{(p,i)} \leftarrow t$ and break;
      - Else update history $G_{t+1}^{(p,i)} \leftarrow G_t^{(p,i)} \oplus z_t^{(p,i)}$.
    - end for
    - If no $\texttt{EOS}$ is sampled by $T_{\max}$: set $N_{\max}^{(p,i)} \leftarrow T_{\max}$.
  - end for
- end for
- return $\widehat{\mathrm{CE}}^{\star}_{\max,P,M} = \frac{1}{PM}\sum_{p=1}^{P}\sum_{i=1}^{M} S_{\max}^{(p,i)}$, $\;\widehat{\mathbb{E}}[N_{\max}] = \frac{1}{PM}\sum_{p=1}^{P}\sum_{i=1}^{M} N_{\max}^{(p,i)}$.

Algorithm 1: Estimating $\mathrm{CE}^{\star}_{\max}$ over a prompt distribution
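The loop structure of Algorithm 1 can be sketched in a few lines. This is an illustrative sketch only: `toy_model` is a hypothetical stand-in for the model distribution under a fixed sampling policy, prompts are represented by small integers, and the prompt is passed separately rather than stored as the first element of the history.

```python
import math
import random

EOS = "<EOS>"


def toy_model(prompt, history):
    """Hypothetical next-outcome distribution given a prompt id and the history."""
    p_eos = min(0.9, 0.05 * (len(history) + 1) + 0.1 * prompt)
    return {"a": 0.6 * (1 - p_eos), "b": 0.4 * (1 - p_eos), EOS: p_eos}


def estimate_ce_max(sample_prompt, model, P, M, T_max, rng):
    """Monte Carlo estimates of the truncated canopy entropy and the mean truncated length."""
    total_S, total_N = 0.0, 0
    for _ in range(P):
        x = sample_prompt(rng)
        for _ in range(M):
            history, S, N = [], 0.0, T_max  # N stays at T_max if EOS is never sampled
            for t in range(1, T_max + 1):
                dist = model(x, history)
                # Accumulate the one-step local entropy before sampling the outcome.
                S += -sum(p * math.log(p) for p in dist.values() if p > 0)
                outcomes = list(dist)
                z = rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]
                if z == EOS:
                    N = t
                    break
                history.append(z)
            total_S += S
            total_N += N
    return total_S / (P * M), total_N / (P * M)


rng = random.Random(0)
ce_hat, n_hat = estimate_ce_max(
    lambda r: r.randint(0, 2), toy_model, P=50, M=20, T_max=64, rng=rng
)
```

Because each local entropy is at most $\log|\mathcal{Z}|$, the returned pair always satisfies `ce_hat <= math.log(3) * n_hat` for this three-outcome toy vocabulary.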

### A.4 Examples of prompt-averaged bias decay rates

###### Corollary 1\.

By Remark [4.1](https://arxiv.org/html/2605.30844#S4.Thmremark1), $g(T_{\max})$ admits the following rates.

Uniform exponential tail. If there exist constants $c, \lambda > 0$ such that, for all prompts $x \in \mathcal{X}$ and all $t \geq 1$, $\mathbb{P}(N(x) \geq t \mid X = x) \leq c e^{-\lambda t}$, then

$$g(T_{\max}) \leq c \log|\mathcal{Z}| \sum_{t=T_{\max}+1}^{\infty} e^{-\lambda t} = c \log|\mathcal{Z}| \, \frac{e^{-\lambda (T_{\max}+1)}}{1 - e^{-\lambda}} = O\!\left(e^{-\lambda T_{\max}}\right).$$

Uniform polynomial tail. If there exist constants $c > 0$ and $\alpha > 1$ such that, for all prompts $x \in \mathcal{X}$ and all $t \geq 1$, $\mathbb{P}(N(x) \geq t \mid X = x) \leq c t^{-\alpha}$, then

$$g(T_{\max}) \leq c \log|\mathcal{Z}| \sum_{t=T_{\max}+1}^{\infty} t^{-\alpha} \leq \frac{c \log|\mathcal{Z}|}{\alpha - 1} \, T_{\max}^{1-\alpha} = O\!\left(T_{\max}^{1-\alpha}\right).$$
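The two tail sums in Corollary 1 can be checked numerically. The sketch below uses hypothetical parameter values and omits the common factor $c \log|\mathcal{Z}|$. It compares a long partial sum of the exponential tail with its closed form, and a partial sum of the polynomial tail with the integral bound $T_{\max}^{1-\alpha}/(\alpha-1)$.

```python
import math


def exp_tail_sum(lam, T_max, terms=2000):
    """Partial sum of exp(-lam * t) for t = T_max + 1, ..., T_max + terms."""
    return sum(math.exp(-lam * t) for t in range(T_max + 1, T_max + 1 + terms))


def exp_tail_closed_form(lam, T_max):
    """Geometric-series value exp(-lam * (T_max + 1)) / (1 - exp(-lam))."""
    return math.exp(-lam * (T_max + 1)) / (1 - math.exp(-lam))


def poly_tail_sum(alpha, T_max, terms=100_000):
    """Partial sum of t ** (-alpha) for t = T_max + 1, ..., T_max + terms."""
    return sum(t ** (-alpha) for t in range(T_max + 1, T_max + 1 + terms))


def poly_tail_bound(alpha, T_max):
    """Integral bound T_max ** (1 - alpha) / (alpha - 1), valid for alpha > 1."""
    return T_max ** (1 - alpha) / (alpha - 1)


exp_gap = abs(exp_tail_sum(0.1, 50) - exp_tail_closed_form(0.1, 50))
poly_ok = poly_tail_sum(2.0, 50) <= poly_tail_bound(2.0, 50)
```

With these values the exponential partial sum agrees with the closed form to within rounding, and the polynomial partial sum stays below the bound.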

### A.5 Proof of [Theorem 2](https://arxiv.org/html/2605.30844#Thmtheorem2) and MSE analysis

###### Proof\.

Given any fixed prompt $x \in \mathcal{X}$, denote $\mu_{\max}(x) := \mathbb{E}[S_{\max} \mid X = x]$, where $S_{\max} = \sum_{t=1}^{N_{\max}(x)} \widetilde{H}(G_t \mid x)$. Then $\mathrm{CE}^{\star}_{\max} = \mathbb{E}_{X \sim p_X}[\mu_{\max}(X)]$. Since $|\mathcal{Z}| < \infty$, for all $g_t$ we have $0 \leq \widetilde{H}(g_t \mid x) \leq H_{\max} := \log|\mathcal{Z}|$, and hence $0 \leq S_{\max} \leq H_{\max} N_{\max}(x)$. Therefore,

$$\mathbb{E}[S_{\max}^{2}] \leq H_{\max}^{2}\, \mathbb{E}[N_{\max}(X)^{2}] = H_{\max}^{2} \nu_{\max}^{2},$$

where $\nu_{\max}^{2} := \mathbb{E}[N_{\max}(X)^{2}]$.

Now decompose the variance of $\widehat{\mathrm{CE}}^{\star}_{\max,P,M}$. Conditional on $X_1, \dots, X_P$, the $M$ rollouts for each prompt are independent, so by the law of total variance,

$$\operatorname{Var}\!\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M}\right) = \operatorname{Var}\!\left(\frac{1}{P}\sum_{p=1}^{P} \mu_{\max}(X_p)\right) + \mathbb{E}\!\left[\operatorname{Var}\!\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M} \mid X_1, \dots, X_P\right)\right].$$

Since $0 \leq \mu_{\max}(X) \leq H_{\max}\, \mathbb{E}[N_{\max}(X) \mid X]$, Jensen's inequality gives $\mu_{\max}(X)^{2} \leq H_{\max}^{2} \left(\mathbb{E}[N_{\max}(X) \mid X]\right)^{2} \leq H_{\max}^{2}\, \mathbb{E}[N_{\max}(X)^{2} \mid X]$, and taking expectations over $X$ yields $\mathbb{E}[\mu_{\max}(X)^{2}] \leq H_{\max}^{2} \nu_{\max}^{2}$. Because the prompts are i.i.d.,

$$\operatorname{Var}\!\left(\frac{1}{P}\sum_{p=1}^{P} \mu_{\max}(X_p)\right) = \frac{1}{P}\operatorname{Var}(\mu_{\max}(X)) \leq \frac{1}{P}\mathbb{E}[\mu_{\max}(X)^{2}] \leq \frac{H_{\max}^{2} \nu_{\max}^{2}}{P}.$$

Moreover, conditional on the prompts the estimator is an average of $PM$ independent rollouts, each with conditional variance at most $\mathbb{E}[S_{\max}^{2} \mid X_p]$, so

$$\mathbb{E}\left[\operatorname{Var}\!\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M} \mid X_1, \dots, X_P\right)\right] \leq \frac{H_{\max}^{2} \nu_{\max}^{2}}{PM}.$$

Thus we have

$$\operatorname{Var}\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M}\right) \leq H_{\max}^{2} \nu_{\max}^{2} \left(\frac{1}{P} + \frac{1}{PM}\right).$$

Since $\mathbb{E}\left[\widehat{\mathrm{CE}}^{\star}_{\max,P,M}\right] = \mathrm{CE}^{\star}_{\max}$, Chebyshev's inequality gives that, for any $\delta \in (0,1)$, with probability at least $1 - \delta$,

$$\left|\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star}_{\max}\right| \leq H_{\max} \nu_{\max} \sqrt{\frac{1}{\delta}\left(\frac{1}{P} + \frac{1}{PM}\right)}.$$

By the truncation bias assumption, $0 \leq \mathrm{CE}^{\star} - \mathrm{CE}^{\star}_{\max} \leq g(T_{\max})$. Thus, by the triangle inequality,

$$\left|\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star}\right| \leq H_{\max} \nu_{\max} \sqrt{\frac{1}{\delta}\left(\frac{1}{P} + \frac{1}{PM}\right)} + g(T_{\max}).$$

In particular,

$$\left|\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star}\right| = O_{\mathbb{P}}\left(\nu_{\max} \sqrt{\frac{1}{P} + \frac{1}{PM}} + g(T_{\max})\right).$$

Thus, if $T_{\max} = T_{\max}(P,M) \to \infty$, $g(T_{\max}) \to 0$, and $\nu_{\max}(P,M) \sqrt{\frac{1}{P} + \frac{1}{PM}} \to 0$, then $\widehat{\mathrm{CE}}^{\star}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathrm{CE}^{\star}$. ∎
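To get a feel for the Chebyshev bound above, the following sketch evaluates its statistical half-width for hypothetical values of $H_{\max}$, $\nu_{\max}$, $P$, $M$, and $\delta$. The function name and the example numbers are illustrative assumptions, and the truncation term $g(T_{\max})$ is left out.

```python
import math


def chebyshev_half_width(H_max, nu_max, P, M, delta):
    """H_max * nu_max * sqrt((1/delta) * (1/P + 1/(P*M))), the high-probability deviation bound."""
    return H_max * nu_max * math.sqrt((1.0 / delta) * (1.0 / P + 1.0 / (P * M)))


# Hypothetical settings: unit H_max and nu_max, 100 prompts, 95% confidence.
w_one_rollout = chebyshev_half_width(1.0, 1.0, P=100, M=1, delta=0.05)
w_many_rollouts = chebyshev_half_width(1.0, 1.0, P=100, M=1000, delta=0.05)
floor = 1.0 * 1.0 * math.sqrt(1.0 / (0.05 * 100))  # limit as M grows with P fixed
```

Adding rollouts per prompt shrinks the half-width only down to `floor`, which depends on $P$ alone. The $1/P$ term therefore dominates once $M$ is moderately large.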

###### Theorem 4 (MSE of the prompt-averaged truncated CE estimator).

$$\mathbb{E}\left[\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star}\right)^{2}\right] \leq \frac{H_{\max}^{2}}{P} \operatorname{Var}_{X \sim p_X}\!\left(\mathbb{E}[N_{\max} \mid X]\right) + \frac{H_{\max}^{2}}{PM}\, \mathbb{E}[N_{\max}^{2}] + g^{2}(T_{\max}).$$

In [Theorem 4](https://arxiv.org/html/2605.30844#Thmtheorem4), we further characterize the mean squared error (MSE) of $\widehat{\mathrm{CE}}^{\star}_{\max,P,M}$ as the sum of the squared truncation bias and the variance of the estimator. The variance decreases with both the number of prompts $P$ and the total number of rollouts $PM$, highlighting two distinct sources of statistical efficiency: diversity across prompts and repeated sampling within each prompt.

The variance admits a hierarchical decomposition with two distinct sources of randomness: a $1/P$ term due to variability across prompts, and a $1/(PM)$ term arising from rollout-level stochasticity within each prompt. Importantly, the prompt-level variance cannot be reduced by increasing the number of rollouts per prompt, reflecting the intrinsic hierarchical structure of the estimator. Moreover, the Monte Carlo error is scaled by the effective trajectory length (captured by $T_{\max}$ or, more precisely, $N_{\max}$), since longer generations accumulate more uncertainty along the trajectory.

These results reveal a fundamental trade-off. Increasing $T_{\max}$ reduces the truncation bias by capturing more of the generation process, but also increases computational cost and variance through longer trajectories.

###### Proof\.

Define the per-prompt rollout average $\bar{S}_{\max}^{(p)} := \frac{1}{M}\sum_{i=1}^{M} S_{\max}^{(p,i)}$, so that $\widehat{\mathrm{CE}}^{\star}_{\max,P,M} = \frac{1}{P}\sum_{p=1}^{P} \bar{S}_{\max}^{(p)}$, and write $\sigma_{\max}^{2}(x) := \operatorname{Var}(S_{\max} \mid X = x)$ for the prompt-conditional rollout variance. Conditional on $X_p$, $\mathbb{E}[\bar{S}_{\max}^{(p)} \mid X_p] = \mu_{\max}(X_p)$, and, since the $M$ rollouts are conditionally independent, $\operatorname{Var}(\bar{S}_{\max}^{(p)} \mid X_p) = \frac{1}{M}\sigma_{\max}^{2}(X_p)$. By the law of total variance, we have

$$\operatorname{Var}(\bar{S}_{\max}^{(p)}) = \operatorname{Var}_{X_p}\left(\mathbb{E}[\bar{S}_{\max}^{(p)} \mid X_p]\right) + \mathbb{E}_{X_p}\left[\operatorname{Var}(\bar{S}_{\max}^{(p)} \mid X_p)\right].$$

Substituting the conditional mean and variance gives

$$\operatorname{Var}(\bar{S}_{\max}^{(p)}) = \operatorname{Var}_{X \sim p_X}(\mu_{\max}(X)) + \frac{1}{M}\mathbb{E}_{X \sim p_X}[\sigma_{\max}^{2}(X)].$$

Since the prompts $X_1, \dots, X_P$ are i.i.d., the variables $\bar{S}_{\max}^{(1)}, \dots, \bar{S}_{\max}^{(P)}$ are i.i.d. Therefore,

$$\operatorname{Var}\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M}\right) = \frac{1}{P}\operatorname{Var}(\bar{S}_{\max}^{(p)}) = \frac{1}{P}\operatorname{Var}_{X \sim p_X}(\mu_{\max}(X)) + \frac{1}{PM}\mathbb{E}_{X \sim p_X}[\sigma_{\max}^{2}(X)].$$

For the MSE, let $b := \mathrm{CE}^{\star} - \mathrm{CE}^{\star}_{\max}$ denote the truncation bias, which satisfies $0 \leq b \leq g(T_{\max})$, and write

$$\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star} = \left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star}_{\max}\right) - b.$$

Taking squares and expectations, the cross term vanishes because $\mathbb{E}[\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star}_{\max}] = 0$, which yields

$$\mathbb{E}\left[\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star}\right)^{2}\right] = \mathbb{E}\left[\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M} - \mathrm{CE}^{\star}_{\max}\right)^{2}\right] + b^{2} \leq \operatorname{Var}\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M}\right) + g^{2}(T_{\max}).$$

Pathwise, for any given history $g_t$ and prompt $x$, we have $0 \leq S_{\max}(x) = \sum_{t=1}^{N_{\max}(x)} \widetilde{H}(g_t \mid x) \leq H_{\max} N_{\max}(x)$. Thus, conditional on $X = x$,

$$\operatorname{Var}(S_{\max}(x) \mid X = x) \leq \mathbb{E}[S_{\max}^{2}(x) \mid X = x] \leq H_{\max}^{2}\, \mathbb{E}[N_{\max}^{2}(x) \mid X = x].$$

Averaging over prompts gives

$$\mathbb{E}_{X}[\sigma_{\max}^{2}(X)] \leq H_{\max}^{2}\, \mathbb{E}_{X}\mathbb{E}[N_{\max}^{2}(X) \mid X] = H_{\max}^{2}\, \mathbb{E}[N_{\max}^{2}].$$

Similarly, $\mu_{\max}(X) = \mathbb{E}[S_{\max}(X) \mid X] \leq H_{\max}\, \mathbb{E}[N_{\max} \mid X]$. Since both quantities are nonnegative, this gives the prompt-level bound $0 \leq \mu_{\max}(X) \leq H_{\max}\, \mathbb{E}[N_{\max} \mid X]$. Therefore, $\operatorname{Var}_{X}(\mu_{\max}(X)) \leq H_{\max}^{2} \operatorname{Var}_{X}\!\left(\mathbb{E}[N_{\max} \mid X]\right)$. Combining these two refined bounds with the exact variance decomposition gives

$$\operatorname{Var}\left(\widehat{\mathrm{CE}}^{\star}_{\max,P,M}\right) \leq \frac{H_{\max}^{2}}{P} \operatorname{Var}_{X \sim p_X}\!\left(\mathbb{E}[N_{\max} \mid X]\right) + \frac{H_{\max}^{2}}{PM}\, \mathbb{E}[N_{\max}^{2}].$$

Adding the bound $g^{2}(T_{\max})$ on the squared truncation bias gives the stated refined MSE upper bound. ∎
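The exact variance decomposition used in this proof, $\operatorname{Var}(\widehat{\mathrm{CE}}^{\star}_{\max,P,M}) = \frac{1}{P}\operatorname{Var}_X(\mu_{\max}(X)) + \frac{1}{PM}\mathbb{E}_X[\sigma_{\max}^{2}(X)]$, can be verified by brute-force enumeration on a tiny example. The two-prompt hierarchy below (`PROMPTS`, `ROLLOUT`) is entirely hypothetical and chosen so that every outcome of the estimator can be listed.

```python
import itertools

# Hypothetical toy hierarchy: the prompt X is 0 or 1 with equal probability;
# given X = x, a rollout's S_max takes each listed value with the given probability.
PROMPTS = {0: 0.5, 1: 0.5}
ROLLOUT = {0: {1.0: 0.5, 3.0: 0.5}, 1: {2.0: 0.25, 6.0: 0.75}}


def moments(dist):
    """Mean and variance of a finite distribution given as {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    var = sum((v - mean) ** 2 * p for v, p in dist.items())
    return mean, var


def formula_variance(P, M):
    """(1/P) Var_X(mu(X)) + (1/(P*M)) E_X[sigma^2(X)]."""
    mu = {x: moments(ROLLOUT[x])[0] for x in PROMPTS}
    sigma2 = {x: moments(ROLLOUT[x])[1] for x in PROMPTS}
    mean_mu = sum(mu[x] * p for x, p in PROMPTS.items())
    var_mu = sum((mu[x] - mean_mu) ** 2 * p for x, p in PROMPTS.items())
    mean_sigma2 = sum(sigma2[x] * p for x, p in PROMPTS.items())
    return var_mu / P + mean_sigma2 / (P * M)


def enumerated_variance(P, M):
    """Variance of the estimator obtained by listing every prompt/rollout outcome."""
    first = second = 0.0
    for xs in itertools.product(PROMPTS, repeat=P):
        p_x = 1.0
        for x in xs:
            p_x *= PROMPTS[x]
        per_rollout = [list(ROLLOUT[x].items()) for x in xs for _ in range(M)]
        for combo in itertools.product(*per_rollout):
            prob, total = p_x, 0.0
            for s, p in combo:
                prob *= p
                total += s
            est = total / (P * M)
            first += prob * est
            second += prob * est * est
    return second - first ** 2
```

For this toy hierarchy the prompt-level term is $2.25/P$ and the rollout-level term is $2/(PM)$, so increasing $M$ with $P$ fixed cannot push the variance below $2.25/P$.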

### A.6 Estimation and consistency of GenPPL, BF, and length–entropy correlation

###### Lemma 1 (Consistency of the truncated length estimator).

Let $\widehat{N}_{\max,P,M} := \frac{1}{PM}\sum_{p=1}^{P}\sum_{i=1}^{M} N_{\max}^{(p,i)}$, where $N_{\max}^{(p,i)} := \min\{N^{(p,i)}, T_{\max}\}$. Assume $\mathbb{E}[N] < \infty$. If $T_{\max} = T_{\max}(P,M) \to \infty$ and $\frac{\sqrt{\mathbb{E}[N_{\max}^{2}]}}{\sqrt{P}} \to 0$, then $\widehat{N}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathbb{E}[N]$.

###### Proof\.

Let $\eta_{\max}(x) := \mathbb{E}[N_{\max} \mid X = x]$. By the law of total variance,

$$\operatorname{Var}(\widehat{N}_{\max,P,M}) = \frac{1}{P}\operatorname{Var}_{X \sim p_X}(\eta_{\max}(X)) + \frac{1}{PM}\mathbb{E}_{X \sim p_X}\left[\operatorname{Var}(N_{\max} \mid X)\right].$$

Using $\operatorname{Var}(\eta_{\max}(X)) \leq \mathbb{E}[\eta_{\max}(X)^{2}] \leq \mathbb{E}[N_{\max}^{2}]$ and $\mathbb{E}[\operatorname{Var}(N_{\max} \mid X)] \leq \mathbb{E}[N_{\max}^{2}]$, we obtain

$$\operatorname{Var}(\widehat{N}_{\max,P,M}) \leq \mathbb{E}[N_{\max}^{2}] \left(\frac{1}{P} + \frac{1}{PM}\right).$$

Therefore, under the stated condition, $\operatorname{Var}(\widehat{N}_{\max,P,M}) \to 0$.

In addition, $\mathbb{E}[\widehat{N}_{\max,P,M}] = \mathbb{E}[N_{\max}]$. Since $N_{\max} = \min\{N, T_{\max}\} \uparrow N$ as $T_{\max} \to \infty$, the monotone convergence theorem [[25](https://arxiv.org/html/2605.30844#bib.bib25)] gives $\mathbb{E}[N_{\max}] \to \mathbb{E}[N]$. Hence,

$$\widehat{N}_{\max,P,M} - \mathbb{E}[N] = \left(\widehat{N}_{\max,P,M} - \mathbb{E}[N_{\max}]\right) + \left(\mathbb{E}[N_{\max}] - \mathbb{E}[N]\right).$$

The first term converges to $0$ in probability by Chebyshev's inequality, and the second term converges to $0$ deterministically. Therefore, $\widehat{N}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathbb{E}[N]$. ∎

###### Corollary 2 (Consistency of GenPPL).

Let $\widehat{\mathrm{CE}}^{\star}_{\max,P,M} = \frac{1}{PM}\sum_{p=1}^{P}\sum_{i=1}^{M} S_{\max}^{(p,i)}$ be the Monte Carlo estimator of $\mathrm{CE}^{\star}$, and let $\widehat{N}_{\max,P,M} := \frac{1}{PM}\sum_{p=1}^{P}\sum_{i=1}^{M} N_{\max}^{(p,i)}$ be the corresponding estimator of $\mathbb{E}[N]$. Suppose that $\widehat{\mathrm{CE}}^{\star}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathrm{CE}^{\star}$ (by [Theorem 2](https://arxiv.org/html/2605.30844#Thmtheorem2)) and $\widehat{N}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathbb{E}[N]$ (by Lemma [1](https://arxiv.org/html/2605.30844#Thmlemma1)). Define

$$\widehat{\mathrm{GenPPL}}_{\max,P,M} := \exp\left(\frac{\widehat{\mathrm{CE}}^{\star}_{\max,P,M}}{\widehat{N}_{\max,P,M}}\right), \qquad \mathrm{GenPPL} := \exp\left(\frac{\mathrm{CE}^{\star}}{\mathbb{E}[N]}\right).$$

Then $\widehat{\mathrm{GenPPL}}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathrm{GenPPL}$.

###### Proof\.

Since $\widehat{\mathrm{CE}}^{\star}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathrm{CE}^{\star}$ and $\widehat{N}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathbb{E}[N]$, and since $\mathbb{E}[N] > 0$, the function $f(a,b) = \exp(a/b)$ is continuous at $(\mathrm{CE}^{\star}, \mathbb{E}[N])$. Therefore, by the continuous mapping theorem [[20](https://arxiv.org/html/2605.30844#bib.bib20)],

$$\exp\left(\frac{\widehat{\mathrm{CE}}^{\star}_{\max,P,M}}{\widehat{N}_{\max,P,M}}\right) \xrightarrow{\mathbb{P}} \exp\left(\frac{\mathrm{CE}^{\star}}{\mathbb{E}[N]}\right).$$

Hence, $\widehat{\mathrm{GenPPL}}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathrm{GenPPL}$, which completes the proof. ∎

###### Corollary 3 (Consistency of the BF estimator).

Let

$$\widehat{L}_{\max,P,M} := \frac{1}{PM}\sum_{p=1}^{P}\sum_{i=1}^{M} \frac{S_{\max}^{(p,i)}}{N_{\max}^{(p,i)}}, \qquad \widehat{\mathrm{BF}}_{\max,P,M} := \exp\!\left(\widehat{L}_{\max,P,M}\right),$$

where $S_{\max}^{(p,i)} = \sum_{t=1}^{N_{\max}^{(p,i)}} \widetilde{H}(G_t^{(p,i)} \mid x_p)$ and $N_{\max}^{(p,i)} = \min\{N^{(p,i)}, T_{\max}\}$. Assume that $N \geq 1$ almost surely and that the normalized truncated entropy variables $r_{\max} := \frac{S_{\max}}{N_{\max}}$ are integrable. Then, for fixed $T_{\max}$, as $P, M \to \infty$,

$$\widehat{L}_{\max,P,M} \xrightarrow{\mathbb{P}} L_{\max} := \mathbb{E}\!\left[\frac{S_{\max}}{N_{\max}}\right],$$

and therefore, by the continuous mapping theorem, $\widehat{\mathrm{BF}}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathrm{BF}_{\max} := \exp(L_{\max})$. Moreover, if $\frac{S_{\max}}{N_{\max}} \xrightarrow[T_{\max} \to \infty]{\text{a.s.}} \frac{S}{N}$ and the sequence $\{S_{\max}/N_{\max}\}_{T_{\max} \geq 1}$ is uniformly integrable, then

$$L_{\max} \to L := \mathbb{E}\!\left[\frac{1}{N}\sum_{t=1}^{N} \widetilde{H}(G_t \mid X)\right],$$

and consequently $\mathrm{BF}_{\max} \to \mathrm{BF} := \exp(L)$. Thus, under these conditions, $\widehat{\mathrm{BF}}_{\max,P,M} \xrightarrow{\mathbb{P}} \mathrm{BF}$ as $P, M \to \infty$ and $T_{\max} \to \infty$.

###### Proof\.

For fixed $T_{\max}$, the variables $\hat{r}_{\max}^{(p,i)} = \frac{S_{\max}^{(p,i)}}{N_{\max}^{(p,i)}}$ are integrable rollout-level observations. Since $N \geq 1$ almost surely, the denominator is bounded away from zero. Applying the law of large numbers to the empirical average over prompts and rollouts gives

$$\widehat{L}_{\max,P,M} = \frac{1}{PM}\sum_{p=1}^{P}\sum_{i=1}^{M} \hat{r}_{\max}^{(p,i)} \xrightarrow{\mathbb{P}} \mathbb{E}[r_{\max}] = L_{\max}.$$

Since the exponential map is continuous,

$$\widehat{\mathrm{BF}}_{\max,P,M} = \exp(\widehat{L}_{\max,P,M}) \xrightarrow{\mathbb{P}} \exp(L_{\max}) = \mathrm{BF}_{\max}.$$

Finally, if $S_{\max}/N_{\max} \to S/N$ almost surely and the normalized entropy ratios are uniformly integrable, then convergence of expectations follows, yielding $L_{\max} \to L$. Another application of continuity of the exponential function gives $\mathrm{BF}_{\max} \to \mathrm{BF}$. ∎
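The GenPPL and BF estimators differ only in where the averaging happens: GenPPL exponentiates a ratio of averages, while BF exponentiates an average of per-rollout ratios. The sketch below makes this concrete for pooled rollout-level lists `S` and `N`. The function names and the two-rollout example are hypothetical.

```python
import math


def gen_ppl_estimate(S, N):
    """exp(mean(S) / mean(N)): ratio of the averaged entropy to the averaged length."""
    return math.exp((sum(S) / len(S)) / (sum(N) / len(N)))


def bf_estimate(S, N):
    """exp(mean(S_i / N_i)): exponential of the averaged per-rollout entropy rate."""
    return math.exp(sum(s / n for s, n in zip(S, N)) / len(S))


# Two hypothetical rollouts: accumulated entropies (nats) and truncated lengths.
S = [2.0, 12.0]
N = [2, 6]
gen_ppl = gen_ppl_estimate(S, N)  # exp((14 / 2) / (8 / 2)) = exp(1.75)
bf = bf_estimate(S, N)            # exp((1.0 + 2.0) / 2)   = exp(1.5)
```

The two quantities coincide only when the entropy rate does not vary with length, which is why they are analysed separately above.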

#### Empirical support for the truncation condition\.

The consistency result above requires the truncation bias induced by $T_{\max}$ to vanish, or at least to be negligible in finite-sample estimation. While this is a population-level condition and cannot be proven from data alone, our diagnostics provide empirical support for it in our experimental setting. With $T_{\max} = 4000$, the observed truncation rates are uniformly small across tasks, model families, and variants: most are below $1\%$, and the largest observed rate is $4.32\%$ (see [subsection C.1](https://arxiv.org/html/2605.30844#A3.SS1)). In addition, the empirical length distributions concentrate well below the truncation threshold (see [Figure 3](https://arxiv.org/html/2605.30844#S5.F3)), and the number of active rollouts decays rapidly with token position (see [Figure 2](https://arxiv.org/html/2605.30844#S5.F2)). Together with the bounded observed entropy-rate trajectories (see [Figure 2](https://arxiv.org/html/2605.30844#S5.F2)), these results suggest that the normalized entropy ratios are well behaved and that the finite-$T_{\max}$ truncation bias is small in practice.

#### Estimating entropy rate\.

For each rollout $(p,i)$, Algorithm [1](https://arxiv.org/html/2605.30844#algorithm1) returns the truncated stopping time $N_{\max}^{(p,i)} = \min\{N^{(p,i)}, T_{\max}\}$ and the accumulated truncated canopy entropy $S_{\max}^{(p,i)} = \sum_{t=1}^{N_{\max}^{(p,i)}} \widetilde{H}(G_t^{(p,i)} \mid x_p)$. We estimate the rollout-level entropy rate by

$$\widehat{r}_{\max}^{(p,i)} := \frac{S_{\max}^{(p,i)}}{N_{\max}^{(p,i)}}.$$

Thus, for each prompt $x_p$, the prompt-specific length–entropy relationship is estimated from the rollout-level pairs $\left\{\left(N_{\max}^{(p,i)}, \widehat{r}_{\max}^{(p,i)}\right)\right\}_{i=1}^{M}$.

#### Length–entropy correlation\.

[Definition 4](https://arxiv.org/html/2605.30844#Thmdefinition4) introduced the length–entropy correlation $\rho(N,r_{N})$ at a conceptual level. We now provide its statistically rigorous formulation together with the corresponding aggregation procedure.

###### Definition 5\(length–entropy correlation\)\.

For a fixed prompt $X=x$, define the prompt-conditional Pearson correlation

$$\rho_{X}:=\rho(N,r_{N}\mid X=x)=\frac{\operatorname{Cov}(N,r_{N}\mid X=x)}{\sqrt{\operatorname{Var}(N\mid X=x)\operatorname{Var}(r_{N}\mid X=x)}}.$$

Since the prompt-conditional correlation $\rho_{X}$ is bounded and nonlinear, directly averaging correlations across prompts may introduce bias. Therefore, we aggregate the prompt-conditional Pearson correlations on the Fisher-$z$ scale (see footnote 5) to obtain a prompt-averaged correlation measure. The (prompt-averaged) length–entropy correlation is then defined as

$$\rho(N,r_{N}):=\tanh\left(\frac{\mathbb{E}_{X\sim p_{X}}\left[w_{X}\operatorname{atanh}(\rho_{X})\right]}{\mathbb{E}_{X\sim p_{X}}[w_{X}]}\right),$$

where $w_{X}\geq 0$ is a prompt-specific reliability weight.

Footnote 5: The Fisher-$z$ transformation maps correlation coefficients from $(-1,1)$ to $\mathbb{R}$ and stabilizes their variance, making them approximately normally distributed with variance $1/(n_{p}-3)$, where $n_{p}$ is the number of rollouts associated with prompt $p$. See [algorithm 2](https://arxiv.org/html/2605.30844#algorithm2) for details.
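The Fisher-$z$ aggregation in Definition 5 can be sketched in a few lines. The snippet below is illustrative only: it uses the weights $w_p=\max\{n_p-3,0\}$ and the clipping step from Algorithm 2, while the function name `fisher_z_average` and the default `eps` value are assumptions made for this example.

```python
import math
from typing import Sequence


def fisher_z_average(rhos: Sequence[float], n_rollouts: Sequence[int], eps: float = 1e-6) -> float:
    """Weighted average of per-prompt correlations on the Fisher-z scale."""
    num = 0.0
    den = 0.0
    for rho, n in zip(rhos, n_rollouts):
        w = max(n - 3, 0)                            # inverse of the approx. variance 1/(n - 3)
        clipped = min(max(rho, -1 + eps), 1 - eps)   # keep atanh finite at +/-1
        num += w * math.atanh(clipped)
        den += w
    if den == 0:
        raise ValueError("need at least one prompt with more than 3 rollouts")
    return math.tanh(num / den)                      # map back to the correlation scale


# Two prompts with equal weight: the Fisher-z average differs from the plain mean 0.45.
avg = fisher_z_average([0.9, 0.0], [10, 10])
```

Because `atanh` stretches values near $\pm 1$, the aggregate here (about 0.63) sits above the arithmetic mean of the two correlations, which illustrates why averaging on the raw correlation scale and on the $z$ scale can disagree.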

#### Rank\-based length–entropy correlations\.

In addition to Pearson correlation, which measures linear association, we also report two rank-based measures. Spearman correlation \[[30](https://arxiv.org/html/2605.30844#bib.bib30)\] is Pearson correlation applied to the ranked variables and therefore captures monotone dependence. Kendall's $\tau_{b}$ \[[16](https://arxiv.org/html/2605.30844#bib.bib16)\] measures ordinal association by comparing concordant and discordant rollout pairs, with tie adjustment. Since Kendall's $\tau_{b}$ is defined through pairwise comparisons, we aggregate it using pair-count weights rather than Fisher-$z$ averaging. We estimate and report rank-based length–entropy correlations to assess the robustness of our findings. The consistently positive relationship across different correlation measures suggests that the observed trend is not an artifact of any particular notion of correlation.
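To make the two rank-based measures concrete, here is a small standard-library sketch of Spearman correlation with mid-ranks, Kendall's $\tau_b$ with tie adjustment, and the pair-count weighted aggregate. It is a plain $O(n^2)$ illustration under the textbook definitions, not the authors' implementation; all function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the mid-ranks."""
    return pearson(midranks(x), midranks(y))


def _sign(v: float) -> int:
    return (v > 0) - (v < 0)


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (C - D) / sqrt((pairs untied in x) * (pairs untied in y))."""
    conc = disc = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        sx, sy = _sign(x[i] - x[j]), _sign(y[i] - y[j])
        if sx * sy > 0:
            conc += 1
        elif sx * sy < 0:
            disc += 1
        elif sx == 0 and sy != 0:
            only_x_tied += 1
        elif sy == 0 and sx != 0:
            only_y_tied += 1
        # pairs tied in both variables contribute to neither factor
    untied_x = conc + disc + only_y_tied
    untied_y = conc + disc + only_x_tied
    return (conc - disc) / math.sqrt(untied_x * untied_y)


def pair_weighted_tau(taus: Sequence[float], n_rollouts: Sequence[int]) -> float:
    """Aggregate per-prompt tau values with weights n_p (n_p - 1) / 2."""
    weights = [n * (n - 1) / 2 for n in n_rollouts]
    return sum(w * t for w, t in zip(weights, taus)) / sum(weights)


lengths = [1, 2, 3, 4]            # toy rollout lengths
rates = [1.0, 3.0, 2.0, 4.0]      # toy entropy rates
rho_s = spearman(lengths, rates)
tau_b = kendall_tau_b(lengths, rates)
```

In the toy data one of the six rollout pairs is discordant, so $\tau_b = (5-1)/6$, while the Spearman value is the Pearson correlation of the two rank vectors.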

Input: prompt distribution $p_{X}$; number of prompts $P$; rollout counts $\{n_{p}\}_{p=1}^{P}$; maximum length $T_{\max}$; sampling policy $\pi$; clipping constant $\epsilon>0$.

1. For $p=1$ to $P$:
   - Sample prompt $x_{p}\sim p_{X}$.
   - For $i=1$ to $n_{p}$: generate one rollout using Algorithm [1](https://arxiv.org/html/2605.30844#algorithm1), record
     $$N_{\max}^{(p,i)}=\min\{N^{(p,i)},T_{\max}\},\quad S_{\max}^{(p,i)}=\sum_{t=1}^{N_{\max}^{(p,i)}}\widetilde{H}(G_{t}^{(p,i)}\mid x_{p}),$$
     and compute the rollout-level entropy rate
     $$r_{\max}^{(p,i)}=\frac{S_{\max}^{(p,i)}}{N_{\max}^{(p,i)}}.$$
   - Compute the prompt-specific Pearson correlation
     $$\widehat{\rho}^{\mathrm{P}}_{p}=\operatorname{Corr}\left(\{N_{\max}^{(p,i)}\}_{i=1}^{n_{p}},\{r_{\max}^{(p,i)}\}_{i=1}^{n_{p}}\right).$$
   - Compute the prompt-specific Spearman correlation, using mid-ranks for ties:
     $$\widehat{\rho}^{\mathrm{S}}_{p}=\operatorname{Corr}\left(\{\operatorname{rank}(N_{\max}^{(p,i)})\}_{i=1}^{n_{p}},\{\operatorname{rank}(r_{\max}^{(p,i)})\}_{i=1}^{n_{p}}\right).$$
   - Compute the prompt-specific Kendall's $\tau_{b}$
     $$\widehat{\tau}^{\mathrm{K}}_{p}=\operatorname{KendallTauB}\left(\{N_{\max}^{(p,i)}\}_{i=1}^{n_{p}},\{r_{\max}^{(p,i)}\}_{i=1}^{n_{p}}\right).$$
   - Compute the prompt-specific covariance
     $$\widehat{\operatorname{Cov}}_{p}=\widehat{\operatorname{Cov}}\left(\{N_{\max}^{(p,i)}\}_{i=1}^{n_{p}},\{r_{\max}^{(p,i)}\}_{i=1}^{n_{p}}\right).$$
2. Aggregate Pearson and Spearman using Fisher-$z$ weighted averages with $w^{\mathrm{F}}_{p}=\max\{n_{p}-3,0\}$:
   $$\widehat{\rho}^{\mathrm{P}}_{\mathrm{pc}}=\tanh\left(\frac{\sum_{p=1}^{P}w^{\mathrm{F}}_{p}\operatorname{atanh}\left(\operatorname{clip}(\widehat{\rho}^{\mathrm{P}}_{p},-1+\epsilon,1-\epsilon)\right)}{\sum_{p=1}^{P}w^{\mathrm{F}}_{p}}\right),$$
   $$\widehat{\rho}^{\mathrm{S}}_{\mathrm{pc}}=\tanh\left(\frac{\sum_{p=1}^{P}w^{\mathrm{F}}_{p}\operatorname{atanh}\left(\operatorname{clip}(\widehat{\rho}^{\mathrm{S}}_{p},-1+\epsilon,1-\epsilon)\right)}{\sum_{p=1}^{P}w^{\mathrm{F}}_{p}}\right).$$
3. Aggregate Kendall's $\tau_{b}$ using pair-count weights:
   $$w^{\mathrm{K}}_{p}=\max\left\{\frac{n_{p}(n_{p}-1)}{2},0\right\},\quad\widehat{\tau}^{\mathrm{K}}_{\mathrm{pc}}=\frac{\sum_{p=1}^{P}w^{\mathrm{K}}_{p}\widehat{\tau}^{\mathrm{K}}_{p}}{\sum_{p=1}^{P}w^{\mathrm{K}}_{p}}.$$
4. Aggregate covariance using degrees-of-freedom weights:
   $$w^{\mathrm{Cov}}_{p}=\max\{n_{p}-1,0\},\quad\widehat{\operatorname{Cov}}_{\mathrm{pc}}=\frac{\sum_{p=1}^{P}w^{\mathrm{Cov}}_{p}\widehat{\operatorname{Cov}}_{p}}{\sum_{p=1}^{P}w^{\mathrm{Cov}}_{p}}.$$
5. Return $\widehat{\rho}^{\mathrm{P}}_{\mathrm{pc}},\;\widehat{\rho}^{\mathrm{S}}_{\mathrm{pc}},\;\widehat{\tau}^{\mathrm{K}}_{\mathrm{pc}},\;\widehat{\operatorname{Cov}}_{\mathrm{pc}}$.

Algorithm 2: Estimating prompt-averaged length–entropy correlations

###### Proposition 1 (Consistency of prompt-averaged length–entropy correlation estimators).

For each prompt $X=x$, define

$$r_{\max}(x):=\frac{S_{\max}(x)}{N_{\max}(x)},\quad S_{\max}(x)=\sum_{t=1}^{N_{\max}(x)}\widetilde{H}(G_{t}\mid x),\quad N_{\max}(x)=\min\{N(x),T_{\max}\}.$$

Note that $S_{\max}(x)$ and $N_{\max}(x)$ are random variables that depend on the generated history $G_{t}$. Let $\rho^{\mathrm{P}}_{\max}(x):=\operatorname{Corr}(N_{\max}(x),r_{\max}(x)\mid X=x)$ denote the Pearson correlation, let $\rho^{\mathrm{S}}_{\max}(x):=\operatorname{Corr}\left(F_{N_{\max}\mid x}(N_{\max}),F_{r_{\max}\mid x}(r_{\max})\mid X=x\right)$ denote the population Spearman rank correlation, and let $\tau^{\mathrm{K}}_{\max}(x)$ denote Kendall's $\tau_{b}$ between $N_{\max}$ and $r_{\max}$ conditional on $X=x$. For sampled prompts $x_{1},\ldots,x_{P}$, suppose that for each prompt $x_{p}$ we observe $n_{p}$ conditionally i.i.d. rollouts and compute the prompt-specific estimates $\widehat{\rho}^{\mathrm{P}}_{p},\;\widehat{\rho}^{\mathrm{S}}_{p},\;\widehat{\tau}^{\mathrm{K}}_{p},\;\widehat{\operatorname{Cov}}_{p}$ as in Algorithm [2](https://arxiv.org/html/2605.30844#algorithm2).

Assume that $N_{\max}\geq 1$ almost surely. For Pearson and covariance, assume that the conditional fourth moments of $N_{\max}$ and $r_{\max}$ are uniformly bounded and that $0<c\leq\operatorname{Var}(N_{\max}\mid X=x)$ and $0<c\leq\operatorname{Var}(r_{\max}\mid X=x)$ uniformly over $x$. For Spearman and Kendall, assume the corresponding rank-correlation functionals are continuous at the conditional law of $(N_{\max},r_{\max})\mid X=x$, uniformly over $x$; in particular, ties are either absent or handled by the mid-rank and $\tau_{b}$ conventions used in Algorithm [2](https://arxiv.org/html/2605.30844#algorithm2). Finally, assume $|\rho^{\mathrm{P}}_{\max}(x)|\leq 1-\epsilon_{0}$ and $|\rho^{\mathrm{S}}_{\max}(x)|\leq 1-\epsilon_{0}$ for some $\epsilon_{0}>0$, and that $\min_{p}n_{p}\to\infty$ as $P\to\infty$.

Then, as $P\to\infty$ and $\min_{p}n_{p}\to\infty$,

$$\widehat{\rho}^{\mathrm{P}}_{\mathrm{pc},\max}\xrightarrow{\mathbb{P}}\rho^{\mathrm{P}}_{\mathrm{pc},\max},\quad\widehat{\rho}^{\mathrm{S}}_{\mathrm{pc},\max}\xrightarrow{\mathbb{P}}\rho^{\mathrm{S}}_{\mathrm{pc},\max},$$

where

$$\rho^{\mathrm{P}}_{\mathrm{pc},\max}=\tanh\left(\frac{\mathbb{E}_{X}[w^{\mathrm{F}}(X)\operatorname{atanh}\{\rho^{\mathrm{P}}_{\max}(X)\}]}{\mathbb{E}_{X}[w^{\mathrm{F}}(X)]}\right),\quad\rho^{\mathrm{S}}_{\mathrm{pc},\max}=\tanh\left(\frac{\mathbb{E}_{X}[w^{\mathrm{F}}(X)\operatorname{atanh}\{\rho^{\mathrm{S}}_{\max}(X)\}]}{\mathbb{E}_{X}[w^{\mathrm{F}}(X)]}\right).$$

Also,

$$\widehat{\tau}^{\mathrm{K}}_{\mathrm{pc},\max}\xrightarrow{\mathbb{P}}\tau^{\mathrm{K}}_{\mathrm{pc},\max}:=\frac{\mathbb{E}_{X}[w^{\mathrm{K}}(X)\tau^{\mathrm{K}}_{\max}(X)]}{\mathbb{E}_{X}[w^{\mathrm{K}}(X)]},$$

and

$$\widehat{\operatorname{Cov}}_{\mathrm{pc},\max}\xrightarrow{\mathbb{P}}\operatorname{Cov}_{\mathrm{pc},\max}:=\frac{\mathbb{E}_{X}[w^{\mathrm{Cov}}(X)\operatorname{Cov}(N_{\max},r_{\max}\mid X)]}{\mathbb{E}_{X}[w^{\mathrm{Cov}}(X)]}.$$

If additionally $T_{\max}\to\infty$ and the corresponding truncated targets converge to their untruncated counterparts, then the same estimators are consistent for the corresponding prompt-averaged untruncated Pearson, Spearman, Kendall, and covariance targets.

###### Proof\.

We prove the result in two steps: first within each prompt, and then across prompts.

For a fixed prompt $x_{p}$, the rollout-level pairs $\left(N_{\max}^{(p,i)},r_{\max}^{(p,i)}\right)$, $i=1,\ldots,n_{p}$, are conditionally i.i.d. By Algorithm [2](https://arxiv.org/html/2605.30844#algorithm2), both coordinates are observed from the same rollout, with $r_{\max}^{(p,i)}=S_{\max}^{(p,i)}/N_{\max}^{(p,i)}$. The moment and nondegeneracy assumptions imply consistency of the sample covariance and variances. Hence,

$$\widehat{\rho}^{\mathrm{P}}_{p}\xrightarrow{\mathbb{P}}\rho^{\mathrm{P}}_{\max}(x_{p}),\quad\widehat{\operatorname{Cov}}_{p}\xrightarrow{\mathbb{P}}\operatorname{Cov}(N_{\max},r_{\max}\mid x_{p}).$$

Similarly, by consistency of empirical ranks and the assumed continuity of the rank-correlation functionals,

$$\widehat{\rho}^{\mathrm{S}}_{p}\xrightarrow{\mathbb{P}}\rho^{\mathrm{S}}_{\max}(x_{p}),\quad\widehat{\tau}^{\mathrm{K}}_{p}\xrightarrow{\mathbb{P}}\tau^{\mathrm{K}}_{\max}(x_{p}).$$

Since $|\rho^{\mathrm{P}}_{\max}(x)|$ and $|\rho^{\mathrm{S}}_{\max}(x)|$ are bounded away from one, and the clipping constant is chosen smaller than this margin, the Fisher transform is continuous in a neighborhood of the limiting values. Therefore,

$$\operatorname{atanh}\left(\operatorname{clip}(\widehat{\rho}^{\mathrm{P}}_{p},-1+\epsilon,1-\epsilon)\right)\xrightarrow{\mathbb{P}}\operatorname{atanh}\{\rho^{\mathrm{P}}_{\max}(x_{p})\},$$

and

$$\operatorname{atanh}\left(\operatorname{clip}(\widehat{\rho}^{\mathrm{S}}_{p},-1+\epsilon,1-\epsilon)\right)\xrightarrow{\mathbb{P}}\operatorname{atanh}\{\rho^{\mathrm{S}}_{\max}(x_{p})\}.$$

Now consider the aggregation across prompts. For Pearson, define $Z^{\mathrm{P}}_{p}=\operatorname{atanh}\{\rho^{\mathrm{P}}_{\max}(x_{p})\}$. By the previous step and the uniform assumptions,

$$\frac{\sum_{p=1}^{P}w^{\mathrm{F}}_{p}\left[\operatorname{atanh}\left(\operatorname{clip}(\widehat{\rho}^{\mathrm{P}}_{p},-1+\epsilon,1-\epsilon)\right)-Z^{\mathrm{P}}_{p}\right]}{\sum_{p=1}^{P}w^{\mathrm{F}}_{p}}\xrightarrow{\mathbb{P}}0.$$

Since prompts are sampled i.i.d., the weighted law of large numbers gives

$$\frac{\sum_{p=1}^{P}w^{\mathrm{F}}_{p}Z^{\mathrm{P}}_{p}}{\sum_{p=1}^{P}w^{\mathrm{F}}_{p}}\xrightarrow{\mathbb{P}}\frac{\mathbb{E}_{X}[w^{\mathrm{F}}(X)\operatorname{atanh}\{\rho^{\mathrm{P}}_{\max}(X)\}]}{\mathbb{E}_{X}[w^{\mathrm{F}}(X)]}.$$

Applying the continuous mapping theorem with $\tanh(\cdot)$ yields $\widehat{\rho}^{\mathrm{P}}_{\mathrm{pc},\max}\xrightarrow{\mathbb{P}}\rho^{\mathrm{P}}_{\mathrm{pc},\max}$. The Spearman result is identical after replacing $\rho^{\mathrm{P}}_{\max}$ by $\rho^{\mathrm{S}}_{\max}$.

For Kendall's $\tau_{b}$, Algorithm [2](https://arxiv.org/html/2605.30844#algorithm2) aggregates directly on the $\tau$ scale using pair-count weights $w^{\mathrm{K}}_{p}=n_{p}(n_{p}-1)/2$. Thus, by the prompt-level consistency of $\widehat{\tau}^{\mathrm{K}}_{p}$ and the weighted law of large numbers,

$$\widehat{\tau}^{\mathrm{K}}_{\mathrm{pc},\max}=\frac{\sum_{p=1}^{P}w^{\mathrm{K}}_{p}\widehat{\tau}^{\mathrm{K}}_{p}}{\sum_{p=1}^{P}w^{\mathrm{K}}_{p}}\xrightarrow{\mathbb{P}}\frac{\mathbb{E}_{X}[w^{\mathrm{K}}(X)\tau^{\mathrm{K}}_{\max}(X)]}{\mathbb{E}_{X}[w^{\mathrm{K}}(X)]}.$$

The covariance result follows in the same way, using weights $w^{\mathrm{Cov}}_{p}=(n_{p}-1)_{+}$ and the prompt-level consistency of $\widehat{\operatorname{Cov}}_{p}$.

Finally, if $T_{\max}\to\infty$ and the truncated conditional covariance, variance, and rank-association functionals converge to their untruncated counterparts, then the truncated prompt-averaged targets converge to the corresponding untruncated targets. Combining this approximation step with the fixed-$T_{\max}$ consistency above completes the proof. ∎

#### Empirical support for the consistency assumptions\.

The assumptions underlying [Proposition 1](https://arxiv.org/html/2605.30844#Thmproposition1) are supported by several empirical diagnostics observed in our experiments. First, the truncation rates under $T_{\max}=4000$ are uniformly small across model families, tasks, and variants, with most rates below $1\%$ and the maximum observed rate equal to $4.32\%$ (see [Table 5](https://arxiv.org/html/2605.30844#A3.T5)). This suggests that the truncated quantities $(N_{\max},r_{\max})$ closely approximate their untruncated counterparts $(N,r_{N})$ in practice. Second, the empirical sequence-length distributions exhibit rapidly decaying tails, and the number of active rollouts decreases quickly as generation proceeds, providing evidence that the stopping-time distribution is sufficiently light-tailed and that truncation bias is negligible for the chosen $T_{\max}$ (see [Figure 3](https://arxiv.org/html/2605.30844#S5.F3)).

Third, the rollout-level entropy rates remain bounded and stable throughout generation trajectories, supporting the bounded-moment and continuity assumptions required for consistency of the covariance and correlation estimators (see [Figure 2](https://arxiv.org/html/2605.30844#S5.F2)). Finally, substantial variability is consistently observed in both generation length and entropy rate across prompts and rollouts, indicating that the conditional variances of $N_{\max}$ and $r_{\max}$ are bounded away from zero with high probability (see [Figure 4](https://arxiv.org/html/2605.30844#A1.F4) and [Figure 5](https://arxiv.org/html/2605.30844#A1.F5)). Together, these empirical observations suggest that the regularity conditions required for the consistency of the prompt-averaged Pearson, Spearman, Kendall, and covariance estimators are well approximated in our experimental regime.

![Refer to caption](https://arxiv.org/html/2605.30844v1/x4.png)

Figure 4: Gaussian KDEs of $\log\widehat{\mathrm{Var}}(r\mid x_{p})$ over $P=100$ prompts. The dotted vertical line marks $\min_{p}\widehat{\mathrm{Var}}(r\mid x_{p})\approx 4\times 10^{-6}$. All densities lie at or above this minimum, so every estimated variance is strictly positive, providing empirical support for the bounded-away-from-zero assumption.

![Refer to caption](https://arxiv.org/html/2605.30844v1/x5.png)

Figure 5: Gaussian KDEs of $\log\widehat{\mathrm{Var}}(N\mid x_{p})$ over $P=100$ prompts. The dotted vertical line marks $\min_{p}\widehat{\mathrm{Var}}(N\mid x_{p})\approx 13$, the smallest per-prompt variance observed across all model-dataset combinations. All densities concentrate at $\widehat{\mathrm{Var}}(N\mid x_{p})\gg 0$, empirically supporting the bounded-away-from-zero assumption. Instruct variants consistently shift to smaller values, indicating that fine-tuning produces tighter length distributions.

## Appendix B Discussions

### B.1 How GenPPL and BF capture different notions of diversity: an essay analogy case study

We illustrate the difference between GenPPL and BF through an essay–word analogy, where trajectories correspond to essays and tokens correspond to individual words. For a fixed prompt $x$, suppose a model generates two types of essays. With probability $1/2$, it produces a short essay of length $N=2$, where each word is selected uniformly from $100$ plausible words. With probability $1/2$, it produces a long essay of length $N=10$, where each word is selected uniformly from only $2$ plausible words. For the short essays,

$$H(Y_{1:2}\mid N=2,x)=2\log 100,\quad r_{2}=\log 100.$$

For the long essays,

$$H(Y_{1:10}\mid N=10,x)=10\log 2,\quad r_{10}=\log 2.$$

BF treats each *trajectory* equally:

$$\log\mathrm{BF}=\frac{1}{2}\log 100+\frac{1}{2}\log 2,$$

which yields $\mathrm{BF}\approx 14$. This means that a typical generated trajectory behaves as if each token has roughly $14$ plausible continuations. Because BF weights trajectories equally, short but highly diverse generations contribute substantially even if they contain relatively few tokens overall. In contrast, GenPPL treats each *token* equally. Since most generated tokens come from the long low-diversity essays, we obtain $\mathrm{GenPPL}\approx 3$. Thus, a token on average has only about $3$ plausible continuations.
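The BF arithmetic in this analogy is easy to check numerically. The sketch below recomputes the per-essay entropy rates, the trajectory-averaged BF, and the share of generated tokens that come from the long essays. It deliberately does not recompute GenPPL, whose exact definition is given in the main text rather than in this appendix; the variable names are illustrative.

```python
import math

# (probability, essay length N, plausible words per position) for each essay type
essay_types = [(0.5, 2, 100), (0.5, 10, 2)]

# A uniform choice among k words carries log k nats per token.
entropy_rates = [math.log(k) for _, _, k in essay_types]

# BF: exponentiated trajectory-level average of the entropy rate.
log_bf = sum(p * r for (p, _, _), r in zip(essay_types, entropy_rates))
bf = math.exp(log_bf)                      # equals sqrt(100 * 2), about 14.14

# Share of all generated tokens contributed by the long, low-diversity essays.
expected_tokens = sum(p * n for p, n, _ in essay_types)
long_token_share = (0.5 * 10) / expected_tokens   # 5/6 of tokens
```

The long essays account for five out of every six generated tokens, which is why any token-weighted average is pulled toward their low entropy rate while the trajectory-weighted BF is not.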

The distinction arises from the averaging mechanism\. BF averages uncertainty at the trajectory level, while GenPPL averages at the token level and therefore assigns greater weight to longer generations\. Consequently, BF reflects the diversity of a*typical trajectory*, whereas GenPPL reflects the uncertainty of a*typical token*\. These two measures therefore provide complementary views of generation space that cannot be captured by a single metric alone\.

### B.2 When to use BF vs. GenPPL across different tasks

We refer readers to[Table 4](https://arxiv.org/html/2605.30844#A2.T4)for a concise practical comparison of when to use BF versus GenPPL across different tasks\.

Table 4: When to use BF vs. GenPPL across different tasks.

| Task | Preferred Metric | Level | Reason |
| --- | --- | --- | --- |
| MMLU (QA) | GenPPL | Token | Requires selecting a single correct answer. Performance depends on calibrated token probabilities. Diversity across full trajectories is unnecessary. |
| Story Generation | BF | Trajectory | Outputs are evaluated as complete units. Diversity across stories reflects creativity and coverage of different narrative possibilities. |
| Math | GenPPL, BF | Token, Trajectory | GenPPL is primary because correctness depends on precise step-by-step reasoning. BF is secondary and useful for analyzing diversity across alternative solution strategies. |
| Coding | GenPPL, BF | Token, Trajectory | GenPPL is primary because code correctness requires token-level precision. BF is secondary and useful for comparing multiple valid implementations. |
| Sentence Completion | GenPPL | Token | This is a next-token prediction task, so GenPPL aligns directly with likelihood and perplexity-based evaluation. |

### B.3 Comparison with perplexity

Perplexity \[[15](https://arxiv.org/html/2605.30844#bib.bib15)\] is a widely used evaluation metric in the LLM literature. To avoid confusion, we briefly contrast it with GenPPL and BF. Perplexity is defined on a fixed dataset $(y_{1},\ldots,y_{T})$, where $y_{t}\in\mathcal{V}$ is the observed token and $y_{<t}$ denotes its context. Given the model distribution $q(\cdot\mid y_{<t})$, it is defined as

$$\mathrm{PPL}=\exp\left(\frac{1}{T}\sum_{t=1}^{T}-\log q(y_{t}\mid y_{<t})\right),$$

where $-\log q(y_{t}\mid y_{<t})$ measures how surprising the true token is under the model. Thus, perplexity evaluates predictive accuracy, with lower values indicating better fit to the data. In contrast, GenPPL and BF are defined with respect to the model's generative distribution and measure generation capacity, namely the size and diversity of the space of possible outputs, rather than accuracy on the observed data.
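For readers who want the perplexity formula in executable form, the following sketch evaluates it from the probabilities a model assigns to the observed tokens. The function name and the toy probabilities are assumptions for illustration; in practice the values $q(y_t\mid y_{<t})$ would come from a model's forward pass over a fixed dataset.

```python
import math
from typing import Sequence


def perplexity(observed_token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability of the observed tokens."""
    if not observed_token_probs:
        raise ValueError("need at least one observed token")
    surprisals = [-math.log(q) for q in observed_token_probs]  # -log q(y_t | y_<t)
    return math.exp(sum(surprisals) / len(surprisals))


# A model that always assigns probability 1/4 to the true token has perplexity 4.
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# Perplexity is a geometric mean: probabilities 1/2 and 1/8 also give 4.
ppl_mixed = perplexity([0.5, 0.125])
```

Note that the inputs here are probabilities of tokens that were actually observed, which is what distinguishes perplexity from GenPPL and BF, both of which are computed from the model's own sampled generations.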

## Appendix C Experiment details

In this section, we provide additional experimental and implementation details underlying our empirical analysis\. In particular, we discuss the stopping behavior of autoregressive generation and the resulting truncation bias, as well as the statistical modeling procedures used throughout the paper\. We further present detailed outputs from the Beta mixed\-effects regression analysis, including coefficient estimates, interaction effects, and dispersion modeling results, together with additional discussions on uncertainty allocation, semantic diversity, and length–entropy dynamics across model families and tasks\.

### C.1 Stopping criteria and empirical truncation bias

To ensure the generative process remains consistent with the theoretical framework, we include the `EOS` token in the sampling vocabulary $\mathcal{V}\cup\{\texttt{EOS}\}$. Crucially, during top-$k$ sampling, a rollout terminates only if `EOS` is among the $k$ most probable tokens and is subsequently sampled.
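The stopping rule can be illustrated with a toy sampler. The sketch below is not the paper's generation code: the `EOS` string, the `sample_top_k` and `rollout` helpers, and the constant toy next-token distribution are all hypothetical, and real rollouts would query a language model for the next-token probabilities.

```python
import random
from typing import Callable, Dict, List

EOS = "<eos>"  # hypothetical end-of-sequence symbol used only in this sketch


def sample_top_k(probs: Dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token from the k most probable entries of `probs`."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]
    # random.choices normalizes the weights, i.e. renormalizes over the top k.
    return rng.choices(tokens, weights=weights, k=1)[0]


def rollout(next_probs: Callable[[List[str]], Dict[str, float]],
            k: int, t_max: int, rng: random.Random) -> List[str]:
    """Generate tokens until EOS is sampled or the budget t_max is reached."""
    tokens: List[str] = []
    while len(tokens) < t_max:
        tok = sample_top_k(next_probs(tokens), k, rng)
        if tok == EOS:  # stopping requires EOS to be in the top k AND to be drawn
            break
        tokens.append(tok)
    return tokens


def toy_model(history: List[str]) -> Dict[str, float]:
    """Toy next-token distribution (an assumption): EOS is always third most likely."""
    return {"a": 0.5, "b": 0.4, EOS: 0.1}


rng = random.Random(0)
hits_budget = rollout(toy_model, k=2, t_max=5, rng=rng)  # EOS outside the top 2
may_stop = rollout(toy_model, k=3, t_max=5, rng=rng)     # EOS inside the top 3
```

With `k=2` the toy rollout can never terminate on its own and always runs to the budget, which is the kind of rollout counted as truncated in the analysis that follows.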

While our theoretical framework allows for infinite generation length, practical estimation requires a finite truncation point $T_{\max}$. To ensure that our estimator $\widehat{\mathrm{CE}}^{*}_{\max,P,M}$ remains a consistent proxy for the true Canopy Entropy $\mathrm{CE}^{*}$, we set a generous token budget of $T_{\max}=4000$. As shown in Table [5](https://arxiv.org/html/2605.30844#A3.T5), the empirical truncation rate remains low across all model families and tasks. Notably, fine-tuned models exhibit a 0.00% truncation rate in almost all settings, indicating that aligned models naturally terminate well within the provided budget. While truncation rates are higher for base models, they remain low (at most $0.0432$, i.e., $4.32\%$). Overall, these results suggest that the truncation bias term $g(T_{\max})$ defined in [Theorem 2](https://arxiv.org/html/2605.30844#Thmtheorem2) is small, so that the observed differences in generation space are unlikely to be artifacts of the maximum length constraint.

Table 5: Empirical truncation rates. We report the proportion of Monte Carlo rollouts that reached the maximum generation budget ($T_{\max}=4000$ tokens) without sampling an `EOS` token. Rates are calculated over $P=100$ prompts with $M=100$ rollouts per prompt.

| Model | MATH Base | MATH Instruct | STORY Base | STORY Instruct | CODING Base | CODING Instruct | COMPLETION Base | COMPLETION Instruct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B | 0.0073 | 0.0000 | 0.0005 | 0.0000 | 0.0008 | 0.0004 | 0.0313 | 0.0000 |
| Llama-3.1-8B | 0.0105 | 0.0006 | 0.0325 | 0.0000 | 0.0095 | 0.0000 | 0.0174 | 0.0000 |
| Gemma-3-12B | 0.0036 | 0.0000 | 0.0066 | 0.0000 | 0.0432 | 0.0000 | 0.0119 | 0.0000 |

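The truncation rate in Table 5 is a simple proportion over the $P\times M = 10{,}000$ rollouts of each configuration. As a hedged illustration, the sketch below computes it from per-rollout flags; the function name and the synthetic flags are assumptions, chosen so that 432 truncated rollouts reproduce the largest tabulated rate of 0.0432.

```python
from typing import Sequence


def truncation_rate(stopped_by_eos: Sequence[bool]) -> float:
    """Fraction of rollouts that hit the token budget without sampling EOS."""
    if not stopped_by_eos:
        raise ValueError("need at least one rollout")
    truncated = sum(1 for ended in stopped_by_eos if not ended)
    return truncated / len(stopped_by_eos)


P, M = 100, 100                                        # prompts and rollouts per prompt
flags = [False] * 432 + [True] * (P * M - 432)         # synthetic: 432 truncated rollouts
rate = truncation_rate(flags)
```

Read this way, a rate of 0.0004 in the table corresponds to only four truncated rollouts out of ten thousand.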
### C.2 Completion traps

Instead of answering the instruction, base models sometimes treat the prompt as an unfinished text prefix and continue it with a short completion fragment. Some empirical examples of base models falling into such “completion traps” are listed below.

Prompt:

> Come up with an original and creative solution for the following real\-world problem: Clara, a junior pre\-med student, is working part\-time and taking a 15 hour credit load at school\[…\]Clara is not sure how to solve her problem\.\.

Qwen3\-8B Base:

> "In order to entice donors, you would like to take the following approach:"

Llama\-3\.1\-8B Base:

> "How can she decide what steps she should take to solve her problem?"

Gemma\-3\-12B Base:

> "There will be partial credit for good ideas that might just work \(even if they’re not perfect\)\. Your solution will be judged on its creativity and how well it works\."

Coding prompts were especially vulnerable to premature termination caused by structural delimiters. Prompts ending with a closing code block delimiter (three backticks) frequently produced empty responses from base models, likely because these models interpret the closing delimiter as an end-of-document signal learned from the training distribution. To ensure valid evaluations, we manually stripped these trailing delimiters so that the model generates the intended logic rather than treating the task as already finished.

Table 6: Base–instruct comparison of length–entropy correlation measured by Spearman's $\rho$ and Kendall's $\tau$. Each entry reports estimate ± standard error, and absolute change is computed as instruct minus base. All numbers are rounded to two significant digits. Warm colors indicate increases, while cold colors indicate decreases; darker colors correspond to larger magnitude changes.

| Task | Type | Spearman $\rho$: Qwen3-8B | Spearman $\rho$: Llama-3.1-8B | Spearman $\rho$: Gemma-3-12B | Kendall $\tau$: Qwen3-8B | Kendall $\tau$: Llama-3.1-8B | Kendall $\tau$: Gemma-3-12B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MATH | Base | -0.19 ± 0.023 | -0.56 ± 0.031 | -0.27 ± 0.023 | -0.12 ± 0.015 | -0.36 ± 0.024 | -0.17 ± 0.015 |
| MATH | Instruct | 0.039 ± 0.032 | 0.13 ± 0.029 | 0.096 ± 0.034 | 0.026 ± 0.020 | 0.082 ± 0.019 | 0.056 ± 0.021 |
| MATH | Abs. Change | 0.23 ± 0.037 | 0.68 ± 0.046 | 0.36 ± 0.045 | 0.15 ± 0.024 | 0.44 ± 0.033 | 0.22 ± 0.028 |
| CODING | Base | -0.28 ± 0.018 | -0.66 ± 0.014 | -0.40 ± 0.011 | -0.18 ± 0.013 | -0.47 ± 0.012 | -0.26 ± 0.0079 |
| CODING | Instruct | 0.49 ± 0.030 | 0.30 ± 0.023 | 0.38 ± 0.032 | 0.32 ± 0.021 | 0.20 ± 0.015 | 0.25 ± 0.022 |
| CODING | Abs. Change | 0.76 ± 0.035 | 0.97 ± 0.022 | 0.78 ± 0.033 | 0.50 ± 0.024 | 0.67 ± 0.016 | 0.51 ± 0.023 |
| SENTENCE COMPLETION | Base | -0.0049 ± 0.025 | -0.39 ± 0.019 | -0.32 ± 0.016 | 0.0026 ± 0.017 | -0.26 ± 0.013 | -0.22 ± 0.011 |
| SENTENCE COMPLETION | Instruct | 0.52 ± 0.024 | -0.22 ± 0.025 | 0.13 ± 0.035 | 0.36 ± 0.018 | -0.14 ± 0.017 | 0.086 ± 0.021 |
| SENTENCE COMPLETION | Abs. Change | 0.53 ± 0.033 | 0.17 ± 0.029 | 0.46 ± 0.037 | 0.35 ± 0.023 | 0.12 ± 0.019 | 0.30 ± 0.023 |
| STORY | Base | -0.36 ± 0.015 | -0.32 ± 0.015 | -0.30 ± 0.014 | -0.25 ± 0.011 | -0.21 ± 0.011 | -0.20 ± 0.0099 |
| STORY | Instruct | 0.16 ± 0.021 | -0.12 ± 0.019 | -0.049 ± 0.027 | 0.10 ± 0.014 | -0.081 ± 0.012 | -0.032 ± 0.017 |
| STORY | Abs. Change | 0.52 ± 0.025 | 0.20 ± 0.024 | 0.25 ± 0.032 | 0.35 ± 0.017 | 0.13 ± 0.016 | 0.16 ± 0.021 |

### C.3 Bootstrap procedure

All standard errors and confidence intervals reported in [Table 1](https://arxiv.org/html/2605.30844#S5.T1), [Table 2](https://arxiv.org/html/2605.30844#S5.T2), and [Table 6](https://arxiv.org/html/2605.30844#A3.T6) are estimated using a prompt-level cluster bootstrap with $B=2000$ replicates. In each replicate, we sample $P=100$ prompt indices uniformly with replacement from the original set of prompts. All $M=100$ rollouts associated with a sampled prompt are included jointly as a single cluster, preserving the within-prompt rollout structure.

Point estimates are always computed on the original, non-resampled data; bootstrap replicates are used only for uncertainty quantification. Reported standard errors are the sample standard deviation of the bootstrap replicates, computed with the $n-1$ (degrees-of-freedom) denominator.
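A prompt-level cluster bootstrap of this kind can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not the authors' implementation: `cluster_bootstrap_se`, `pooled_mean`, and the toy data are hypothetical, and any statistic that accepts a list of per-prompt clusters could be substituted.

```python
import random
import statistics
from typing import Callable, Dict, List


def cluster_bootstrap_se(
    clusters: Dict[int, List[float]],
    statistic: Callable[[List[List[float]]], float],
    n_boot: int = 2000,
    seed: int = 0,
) -> float:
    """Standard error of `statistic` from a prompt-level cluster bootstrap."""
    rng = random.Random(seed)
    prompt_ids = list(clusters)
    replicates = []
    for _ in range(n_boot):
        # Resample prompt indices with replacement; each sampled prompt
        # brings all of its rollouts along as one intact cluster.
        sampled = rng.choices(prompt_ids, k=len(prompt_ids))
        replicates.append(statistic([clusters[p] for p in sampled]))
    return statistics.stdev(replicates)  # sample std. dev. with n - 1 denominator


def pooled_mean(sampled_clusters: List[List[float]]) -> float:
    """Example statistic: mean over all rollouts in the resampled clusters."""
    values = [v for cluster in sampled_clusters for v in cluster]
    return sum(values) / len(values)


# Toy data: two prompts with two rollouts each (hypothetical values).
toy = {0: [0.0, 0.0], 1: [1.0, 1.0]}
point_estimate = pooled_mean(list(toy.values()))      # computed on the original data
se = cluster_bootstrap_se(toy, pooled_mean, n_boot=200)
```

Resampling whole prompts rather than individual rollouts keeps the within-prompt dependence intact, which is the reason the procedure treats each prompt as a cluster.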

### C.4 Beta mixed-effects model for semantic diversity

The analysis is based on 2400 observations, corresponding to $4$ tasks, $3$ model families, and $2$ model variants (base and instruct), with $100$ prompts evaluated for each configuration.

Table 7:Full table of conditional mean model estimates for the Beta mixed\-effects regression\.TermEstimateStd\. Errorpp\-valueIntercept\-0\.879530\.07007<2e−16<2\\mathrm\{e\}\{\-16\}RR0\.708030\.03340<2e−16<2\\mathrm\{e\}\{\-16\}Instruct\-0\.376500\.086091\.23e−051\.23\\mathrm\{e\}\{\-05\}ns\(invN,4\)1\\mathrm\{ns\}\(\\mathrm\{invN\},4\)\_\{1\}0\.151970\.040850\.0001990\.000199ns\(invN,4\)2\\mathrm\{ns\}\(\\mathrm\{invN\},4\)\_\{2\}0\.396900\.238900\.0966410\.096641ns\(invN,4\)3\\mathrm\{ns\}\(\\mathrm\{invN\},4\)\_\{3\}0\.620651\.470240\.6729240\.672924ns\(invN,4\)4\\mathrm\{ns\}\(\\mathrm\{invN\},4\)\_\{4\}0\.610223\.100320\.8439660\.843966Coding\-0\.037210\.036400\.3067260\.306726Math0\.169480\.034488\.83e−078\.83\\mathrm\{e\}\{\-07\}Stories\-0\.275630\.02864<2e−16<2\\mathrm\{e\}\{\-16\}R×InstructR\\times\\mathrm\{Instruct\}1\.182540\.04642<2e−16<2\\mathrm\{e\}\{\-16\}Instruct×ns\(invN,4\)1\\times\\mathrm\{ns\}\(\\mathrm\{invN\},4\)\_\{1\}\-0\.130290\.078680\.0977340\.097734Instruct×ns\(invN,4\)2\\times\\mathrm\{ns\}\(\\mathrm\{invN\},4\)\_\{2\}1\.353210\.317011\.97e−051\.97\\mathrm\{e\}\{\-05\}Instruct×ns\(invN,4\)3\\times\\mathrm\{ns\}\(\\mathrm\{invN\},4\)\_\{3\}1\.006501\.483030\.4973400\.497340Instruct×ns\(invN,4\)4\\times\\mathrm\{ns\}\(\\mathrm\{invN\},4\)\_\{4\}0\.814213\.102030\.7929540\.792954Conditional mean modeling\.Formally, we modelDtpm∼Beta⁡\(μtpmϕtpm,\(1−μtpm\)ϕtpm\),D\_\{tpm\}\\sim\\operatorname\{Beta\}\\\!\\left\(\\mu\_\{tpm\}\\phi\_\{tpm\},\(1\-\\mu\_\{tpm\}\)\\phi\_\{tpm\}\\right\),whereμtpm∈\(0,1\)\\mu\_\{tpm\}\\in\(0,1\)denotes the conditional mean andϕtpm\>0\\phi\_\{tpm\}\>0is the precision parameter\. 
The conditional mean model is specified as

$$
\begin{aligned}
\operatorname{logit}(\mu_{tpm}) ={}& \alpha+\beta_{1}R_{tpm}+\beta_{2}\mathbf{1}\{m=\mathrm{FT}\}+f\left(\mathrm{invN}_{tpm}\right)+\sum_{k}\tau_{k}\mathbf{1}\{t=k\}\\
&+\beta_{3}\left(R_{tpm}\cdot\mathbf{1}\{m=\mathrm{FT}\}\right)+g\left(\mathrm{invN}_{tpm}\right)\mathbf{1}\{m=\mathrm{FT}\}+u_{p}+v_{m},
\end{aligned}
$$

where $R_{tpm}$ is the entropy rate, $\mathrm{invN}_{tpm}$ is the inverse sequence length, and $f(\cdot)$, $g(\cdot)$ are natural spline functions corresponding to nonlinear length effects and their interaction with instruction tuning. The coefficients $\tau_{k}$ represent fixed task effects. To account for hierarchical dependence, we include random intercepts

$$u_{p}\sim\mathcal{N}(0,\sigma_{p}^{2}),\qquad v_{m}\sim\mathcal{N}(0,\sigma_{m}^{2}),$$

for prompt identity and model family, respectively.
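To make the roles of $\beta_{1}$ and $\beta_{3}$ concrete, the sketch below combines the two point estimates from Table 7 into a logit-scale entropy-rate slope for each model variant, and shows how a mean-precision pair $(\mu,\phi)$ maps to Beta shape parameters. The helper names, the linear predictor value $-0.5$, and $\phi=20$ are hypothetical choices for illustration only; they are not fitted quantities.

```python
import math

# Logit-scale fixed-effect estimates copied from Table 7.
BETA_R = 0.70803             # beta_1: entropy rate R (base models)
BETA_R_X_INSTRUCT = 1.18254  # beta_3: R x Instruct interaction


def inv_logit(eta):
    """Map a linear predictor on the logit scale to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_shapes(mu, phi):
    """Convert the mean-precision pair (mu, phi) to Beta(a, b) shapes."""
    return mu * phi, (1.0 - mu) * phi


# Entropy-rate slope on the logit scale for each model variant.
slope_base = BETA_R
slope_instruct = BETA_R + BETA_R_X_INSTRUCT
ratio = slope_instruct / slope_base

# Hypothetical linear predictor and precision, chosen only to show the mapping.
mu = inv_logit(-0.5)
a, b = beta_shapes(mu, phi=20.0)

print(f"slope (base)     = {slope_base:.5f}")
print(f"slope (instruct) = {slope_instruct:.5f}")
print(f"ratio            = {ratio:.2f}")
print(f"mu = {mu:.4f}, Beta shapes = ({a:.3f}, {b:.3f})")
```

Under these point estimates the instruct slope is about 2.67 times the base slope on the logit scale, which appears consistent with the "nearly triples" description in the abstract. The comparison ignores estimation uncertainty in both coefficients.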

In addition to the finding presented in [section 5](https://arxiv.org/html/2605.30844#S5), we also observe strong nonlinear dependence on inverse sequence length. Among the spline terms, the first basis term ($p\approx 2\times 10^{-4}$) and the interaction of fine-tuning with the second basis term ($p\approx 2\times 10^{-5}$) are individually significant (see [Table 7](https://arxiv.org/html/2605.30844#A3.T7)), confirming that the relationship between diversity and length is highly nonlinear and differs between base and fine-tuned models. This reinforces that simple linear controls for length are insufficient.

Task-level effects are also pronounced. Relative to completion, mathematical reasoning exhibits significantly higher semantic diversity ($+0.169$, $p<10^{-6}$), while story generation shows substantially lower embedding-based diversity ($-0.276$, $p<2\times 10^{-16}$), despite appearing lexically diverse. This suggests that open-ended generation can remain semantically concentrated even when surface variation is high (see [Table 7](https://arxiv.org/html/2605.30844#A3.T7)).

Finally, random effects indicate that prompt-level variability remains larger than model-level variability, highlighting the strong influence of prompt semantics on generation diversity (see [Table 8](https://arxiv.org/html/2605.30844#A3.T8)).

Table 8: Estimated random intercept variances…

| Random effect | Group | Variance | Std. Dev. |
| --- | --- | --- | --- |
| Intercept | prompt_uid | 0.013937 | 0.11806 |
| Intercept | model_name | 0.006817 | 0.08256 |

Table 9: Dispersion model estimates for the Beta mixed-effects regression. The dispersion submodel uses a log link for the precision parameter $\phi_{tpm}$.

| Term | Estimate | Std. Error | $p$-value |
| --- | --- | --- | --- |
| Intercept | 2.7544 | 0.2328 | $<2\mathrm{e}{-16}$ |
| Instruct | 1.4153 | 0.2452 | $7.83\mathrm{e}{-09}$ |
| Coding | 2.5511 | 0.2186 | $<2\mathrm{e}{-16}$ |
| Math | 2.1418 | 0.2149 | $<2\mathrm{e}{-16}$ |
| Stories | 1.6534 | 0.1479 | $<2\mathrm{e}{-16}$ |
| $\mathrm{ns}(\mathrm{invN},3)_{1}$ | -3.3780 | 0.5323 | $2.22\mathrm{e}{-10}$ |
| $\mathrm{ns}(\mathrm{invN},3)_{2}$ | -2.0039 | 0.8870 | 0.0239 |
| $\mathrm{ns}(\mathrm{invN},3)_{3}$ | 2.6200 | 1.8507 | 0.1569 |
| $R$ | 0.9751 | 0.1290 | $4.12\mathrm{e}{-14}$ |
| LLaMA-3.1-8B | 1.0091 | 0.1162 | $<2\mathrm{e}{-16}$ |
| Qwen3-8B | 0.7270 | 0.1151 | $2.71\mathrm{e}{-10}$ |
| Instruct $\times$ Coding | -2.1969 | 0.2852 | $1.32\mathrm{e}{-14}$ |
| Instruct $\times$ Math | -1.5440 | 0.2861 | $6.76\mathrm{e}{-08}$ |
| Instruct $\times$ Stories | -2.9273 | 0.2085 | $<2\mathrm{e}{-16}$ |
| Instruct $\times R$ | -1.6085 | 0.2047 | $3.96\mathrm{e}{-15}$ |

Dispersion modeling. The dispersion model specifies how the precision parameter $\phi_{tpm}$ varies across model variants, tasks, sequence length, entropy rate, and model family.
Specifically, we use the log-link specification

$$
\begin{aligned}
\log(\phi_{tpm}) ={}& \gamma_{0}+\gamma_{1}\mathbf{1}\{m=\mathrm{FT}\}+\sum_{k}\delta_{k}\mathbf{1}\{t=k\}+h\!\left(\mathrm{invN}_{tpm}\right)+\gamma_{2}R_{tpm}\\
&+\sum_{j}\eta_{j}\mathbf{1}\{\mathrm{model}=j\}+\sum_{k}\zeta_{k}\mathbf{1}\{m=\mathrm{FT}\}\mathbf{1}\{t=k\}+\gamma_{3}\left(R_{tpm}\cdot\mathbf{1}\{m=\mathrm{FT}\}\right),
\end{aligned}
$$

where $h(\cdot)$ is a natural spline function capturing nonlinear length-dependent dispersion effects. The task indicators $\delta_{k}$ allow the residual precision to vary across tasks, while the model-family indicators $\eta_{j}$ account for systematic differences in dispersion across model families. The interaction terms between model variant and task allow fine-tuning to affect dispersion differently across tasks, and the interaction between entropy rate and model variant tests whether the relationship between entropy rate and precision changes after instruction tuning.
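Because the precision enters through a log link, a shift in the linear predictor acts multiplicatively on $\phi_{tpm}$, and for a mean-precision Beta distribution the variance is $\mu(1-\mu)/(1+\phi)$. The sketch below illustrates this with the intercept and Instruct estimates from Table 9. It is a simplified illustration: every other term (task, spline, entropy rate, model family, and all interactions) is set to zero, and the conditional mean `mu = 0.3` is a hypothetical value, so the printed numbers are not model predictions.

```python
import math


def beta_variance(mu, log_phi):
    """Variance of Beta(mu * phi, (1 - mu) * phi) with phi = exp(log_phi)."""
    phi = math.exp(log_phi)
    return mu * (1.0 - mu) / (1.0 + phi)


# Point estimates copied from Table 9 (log-precision scale).
GAMMA_0 = 2.7544  # Intercept
GAMMA_1 = 1.4153  # Instruct main effect

mu = 0.3  # hypothetical conditional mean, for illustration only

# All remaining dispersion terms are held at zero in this simplified sketch.
var_base = beta_variance(mu, GAMMA_0)
var_instruct = beta_variance(mu, GAMMA_0 + GAMMA_1)

print(f"phi multiplier from Instruct main effect = {math.exp(GAMMA_1):.2f}")
print(f"variance (base)     = {var_base:.5f}")
print(f"variance (instruct) = {var_instruct:.5f}")
```

Higher precision at a fixed mean gives a smaller variance, which is the sense in which a positive dispersion coefficient indicates lower conditional variability.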

Empirically, the dispersion model reveals substantial heterogeneity in residual precision. Instruction-tuned models have significantly higher baseline precision than base models ($\hat{\gamma}_{1}=1.415$, $p<0.001$), indicating lower conditional variability after accounting for the mean structure. Precision also differs strongly by task, with coding, math, and story generation all showing significantly higher precision relative to the reference task. The spline terms for inverse length indicate that dispersion varies nonlinearly with output length, while the positive main effect of entropy rate ($\hat{\gamma}_{2}=0.975$, $p<0.001$) suggests that higher entropy rate is associated with greater precision among base models. However, the negative interaction between instruction tuning and entropy rate ($\hat{\gamma}_{3}=-1.609$, $p<0.001$) shows that this relationship is substantially weakened or reversed for fine-tuned models. The significant model-variant-by-task interactions further indicate that fine-tuning changes residual variability in a task-dependent manner (see [Table 9](https://arxiv.org/html/2605.30844#A3.T9)).
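As a small worked check using only the point estimates in Table 9, the coefficient on entropy rate in the log-precision predictor for each model variant follows from the main effect and the interaction:

$$
\frac{\partial \log\phi_{tpm}}{\partial R_{tpm}} =
\begin{cases}
\hat{\gamma}_{2} = 0.9751, & m = \text{base},\\[2pt]
\hat{\gamma}_{2}+\hat{\gamma}_{3} = 0.9751 - 1.6085 = -0.6334, & m = \mathrm{FT}.
\end{cases}
$$

The point estimate therefore changes sign for fine-tuned models. This calculation does not give a standard error for the combined slope, since that would require the covariance between the two estimates, which is not reported here.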

Residual diagnostics. [Figure 6](https://arxiv.org/html/2605.30844#A3.F6) presents a DHARMa residual diagnostic QQ plot [[11](https://arxiv.org/html/2605.30844#bib.bib11)] for the fitted Beta mixed-effects regression model, comparing the empirical residual distribution against the expected uniform distribution. The residuals closely follow the diagonal reference line, indicating good overall model calibration and no substantial systematic deviation from the assumed distribution. The associated diagnostic tests further support model adequacy: the Kolmogorov–Smirnov test ($p=0.551$) suggests no significant deviation from uniformity, the dispersion test ($p=0.068$) indicates no significant over- or under-dispersion, and the outlier test ($p=0.160$) shows no evidence of excessive outliers. Overall, these diagnostics suggest that the fitted model provides an adequate characterization of the observed data and that the regression results are statistically reliable.
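The uniformity check behind such a QQ plot can be illustrated with a plain Kolmogorov–Smirnov distance against Uniform(0, 1). The sketch below is not DHARMa's implementation: the function names are hypothetical, the residuals are a made-up evenly spaced grid standing in for scaled residuals, and the p-value uses the asymptotic Kolmogorov series with a standard small-sample correction rather than an exact computation.

```python
import math


def ks_uniform_statistic(residuals):
    """Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    xs = sorted(residuals)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Compare the uniform CDF (F(x) = x) with the empirical CDF just
        # after and just before each sorted point.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


def ks_pvalue_asymptotic(d, n, terms=100):
    """Approximate p-value from the Kolmogorov limiting distribution."""
    # Small-sample correction applied to sqrt(n) * d (an approximation).
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


# Hypothetical residuals on an even grid, i.e. an idealized calibrated fit.
n = 100
residuals = [(i + 0.5) / n for i in range(n)]
d = ks_uniform_statistic(residuals)
p = ks_pvalue_asymptotic(d, n)
print(f"KS distance = {d:.4f}, approximate p-value = {p:.3f}")
```

A large p-value here only means the check found no evidence against uniformity; as in the text, it does not by itself establish that the model is correct.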

![Refer to caption](https://arxiv.org/html/2605.30844v1/x6.png)

Figure 6: DHARMa residual diagnostics for the fitted Beta mixed-effects regression model. The QQ plot shows no significant deviation from the expected uniform residual distribution.

### C.5 Compute resources and runtime

All experiments were run on a single internal SLURM cluster\. Each generation job uses one NVIDIA GPU \(A100, H100, or H200\), 8 CPU cores, and 100 GB host RAM, with a 6\-hour wall\-clock timeout per job\. Overall, we estimate total project compute at approximately 30\-40 GPU\-hours, overwhelmingly dominated by rollout generation; metric computation and statistical analysis were negligible by comparison and can be completed on CPU\.