Where You Inject Diversity Matters: A Unified Framework for Diverse Generation
Summary
This paper introduces a unified framework for test-time diverse generation in large language models, categorizing methods by where diversity is injected (surface-level vs. specification-level). It proposes specification-level methods that generate diverse intermediate specifications, achieving better output diversity across five open-ended tasks and four backbone models while maintaining quality.
# A Unified Framework for Diverse Generation
Source: [https://arxiv.org/html/2606.10302](https://arxiv.org/html/2606.10302)
## Where You Inject Diversity Matters: A Unified Framework for Diverse Generation
Cheng Zhang (UNC Chapel Hill, zch@cs.unc.edu), Rui Xin (University of Washington, rx31@cs.washington.edu), Chudi Zhong (UNC Chapel Hill, chudi@unc.edu)
###### Abstract
Open-ended generation tasks often require a set of meaningfully different outputs, yet large language models tend to produce similar generations. Existing test-time diversity methods operate at different stages of generation with varying effectiveness, but it remains unclear which design choices lead to meaningful diversity in the output. We introduce a framework that characterizes test-time diverse generation methods by the diversity source introduced during generation, and we provide a transmission score for measuring how effectively variation in the source reaches the final output. Guided by this framework, we propose fully automated specification-level generation methods that first generate diverse intermediate specifications and then condition on them to produce final responses. Across five open-ended tasks and four backbone models, specification-level injection improves output diversity over test-time baselines while maintaining comparable quality. Our analysis shows that successful diversity injection depends on both the diversity of the sources and their transmission to the output, highlighting source design and source-to-output realization as two key levers for building more diverse generation systems.
## 1 Introduction
Large language models (LLMs) are increasingly used for open-ended generation tasks such as creative writing, brainstorming, argumentation, and ideation. In these settings, users often need a set of meaningfully different outputs rather than a single best answer. However, when prompted repeatedly with the same request, LLMs tend to converge on similar high-level patterns and recurring ideas, producing outputs that differ in wording but not in substance (Sorensen et al., [2024](https://arxiv.org/html/2606.10302#bib.bib31); Jiang et al., [2026](https://arxiv.org/html/2606.10302#bib.bib32)). This limits the practical value of generating multiple outputs and risks homogenizing the creative input available to users.
A growing body of test-time methods attempts to address this problem by increasing decoding randomness through temperature scaling or modified sampling strategies (Fan et al., [2018](https://arxiv.org/html/2606.10302#bib.bib4); Holtzman et al., [2020](https://arxiv.org/html/2606.10302#bib.bib7); Minh et al., [2025](https://arxiv.org/html/2606.10302#bib.bib12)), introducing per-output random strings or noise tokens during generation (Misaki and Akiba, [2026](https://arxiv.org/html/2606.10302#bib.bib30); Agrawal and Goyal, [2026](https://arxiv.org/html/2606.10302#bib.bib2)), or steering decoding with diversity-promoting modules (Ruan et al., [2025](https://arxiv.org/html/2606.10302#bib.bib15)). These methods operate at different stages of the generation process and inject different forms of variation, raising a natural question: what kind of diversity source is more effective for producing diverse final outputs?
In this work, we propose a Diversity Injection Framework, a unified lens for studying test-time diverse generation. We formulate generation as $y_i = G(x, z_i)$, where $x$ is the input prompt and $z_i$ is an explicit diversity source that varies across outputs. Methods differ in whether $z_i$ exists and what information it carries. We categorize methods into three levels: *no injection* (Level 0), where no per-output $z_i$ exists; *surface-level injection* (Level 1), where $z_i$ carries no semantic content about the output; and *specification-level injection* (Level 2), where $z_i$ provides a detailed generative configuration. We introduce an entropy decomposition of output diversity into *injected diversity* (how diverse the sources $z_i$ are), *information loss* (diversity in $z$ not realized in the output), and *residual randomness* (variation from decoding). From this decomposition we derive a *transmission score* that quantifies what fraction of injected diversity reaches the final output.
Analysis through the framework reveals that existing fully automated diversity methods remain at Level 0 or Level 1, relying entirely on decoding randomness or injecting diversity sources that carry little semantic content\. Level 2 injection has not yet been fully explored for promoting diverse generation\. This gap motivates us to ask:*can specification\-level diversity injection improve output diversity across general open\-ended tasks?*
We propose fully automated methods that instantiate Level 2 injection and are applicable to broad open\-ended generation tasks\. Given a prompt, the model first generates multiple diverse specifications, each describing a distinct generative configuration along dimensions such as perspective, tone, and thematic focus\. The model then conditions on each specification to produce a final output\. The entire process requires no human intervention and no task\-specific engineering: the same pipeline applies to creative writing, argumentation, poetry, and brainstorming alike\.
In summary, we make three contributions: \(1\) a unified framework that characterizes test\-time diverse generation methods by the nature of their diversity source, and an entropy\-based transmission score that quantifies how effectively injected diversity reaches the output; \(2\) automated, general\-purpose specification\-level methods that inject diversity through intermediate generative configurations; and \(3\) empirical evidence across 5 open\-ended tasks and 4 backbone models showing that specification\-level injection improves output diversity over lower\-level test\-time baselines while maintaining comparable quality\.
## 2 Related Work
We discuss related work in three areas: test\-time methods for diverse generation, planning\-based generation, and diversity evaluation\.
##### Decoding\-based Diverse Generation\.
The most straightforward approach to increasing output diversity is to modify the sampling strategy at decoding time. Temperature scaling with a temperature above one flattens the output distribution to encourage more varied token selections (Ackley et al., [1985](https://arxiv.org/html/2606.10302#bib.bib1)). Top-$k$ sampling (Fan et al., [2018](https://arxiv.org/html/2606.10302#bib.bib4)), nucleus sampling (Holtzman et al., [2020](https://arxiv.org/html/2606.10302#bib.bib7)), and min-$p$ sampling (Minh et al., [2025](https://arxiv.org/html/2606.10302#bib.bib12)) each restrict the candidate token pool using different truncation criteria. Contrastive search (Su and Collier, [2023](https://arxiv.org/html/2606.10302#bib.bib17)) penalizes tokens that are too similar to previous context, and diverse beam search (Vijayakumar et al., [2018](https://arxiv.org/html/2606.10302#bib.bib19)) encourages diversity across beam groups. Avoidance Decoding (Park et al., [2025](https://arxiv.org/html/2606.10302#bib.bib44)) extends this contrastive idea to multi-output generation by treating prior branches as negative samples and penalizing candidate tokens using hidden-state and sentence-embedding similarity. G2 (Ruan et al., [2025](https://arxiv.org/html/2606.10302#bib.bib15)) takes a different approach by steering decoding with dual guide modules that encourage novelty and suppress repetition. These methods all operate at the token level and do not introduce explicit per-output conditioning that structures the diversity in generation.
##### Prompt\-based Diverse Generation\.
Several training-free methods promote diversity through prompt design. Zhang et al. ([2024](https://arxiv.org/html/2606.10302#bib.bib21)) improve diversity in commonsense generation tasks by diversifying in-context learning examples. Hayati et al. ([2024](https://arxiv.org/html/2606.10302#bib.bib6)) explore diverse perspective extraction through criteria-based prompting on subjective topics. Verbalized Sampling (Zhang et al., [2025a](https://arxiv.org/html/2606.10302#bib.bib29)) prompts the model to generate multiple candidate responses with associated probabilities in a single call. SSOT (Misaki and Akiba, [2026](https://arxiv.org/html/2606.10302#bib.bib30)) prepends a randomly generated string to each generation as a diversity seed, and Agrawal and Goyal ([2026](https://arxiv.org/html/2606.10302#bib.bib2)) prepend random, unrelated concepts to prompts and measure the resulting changes in responses. Unlike these methods, we make the diversity source explicit for each output before generation.
##### Planning\-based Generation\.
A separate line of work uses outlines or plans to control text structure, especially in storytelling, though not for the purpose of promoting output diversity. DOC (Yang et al., [2023](https://arxiv.org/html/2606.10302#bib.bib27)) uses detailed hierarchical outlines to improve coherence in long-form storytelling. Wang and Kreminski ([2024](https://arxiv.org/html/2606.10302#bib.bib28)) combine symbolic outline planning with LLM-based surface realization for story generation, building on earlier narrative planning work (Riedl and Young, [2010](https://arxiv.org/html/2606.10302#bib.bib39); Cardona-Rivera et al., [2024](https://arxiv.org/html/2606.10302#bib.bib40)). Luminate (Suh et al., [2024](https://arxiv.org/html/2606.10302#bib.bib26)) generates stylistic and content dimensions for human-AI co-creative exploration, but requires interactive human steering. These methods usually target quality of the output rather than diversity and are mostly restricted to storytelling or require human involvement.
##### Diversity Evaluation.
Measuring diversity in LLM outputs remains an open challenge. Standard diversity metrics include lexical measures such as Distinct-$N$ (Li et al., [2016](https://arxiv.org/html/2606.10302#bib.bib10)) and Self-BLEU (Zhu et al., [2018](https://arxiv.org/html/2606.10302#bib.bib24)), and embedding-based distances using pretrained sentence encoders (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.10302#bib.bib14)). Shypula et al. ([2025](https://arxiv.org/html/2606.10302#bib.bib16)) argue that diversity should be evaluated jointly with quality. Datasets have been developed recently to facilitate evaluation. Zhang et al. ([2025b](https://arxiv.org/html/2606.10302#bib.bib23)) propose NoveltyBench, a benchmark specifically designed for evaluating response diversity. Jiang et al. ([2026](https://arxiv.org/html/2606.10302#bib.bib32)) introduce Infinite Chats, a large-scale dataset of open-ended queries, and use it to study inter- and intra-model homogeneity in modern LLMs. We use these metrics and benchmarks in our experiments.
## 3 Diversity Injection Framework
When an LLM generates multiple outputs for the same prompt, the outputs may differ from one another\. These differences can come from two sources: randomness in the decoding process, or explicit guidance that steers each generation in a different direction\. We formalize this distinction into a unified framework\.
### 3.1 Formulation
Given an input prompt $x$, a method produces $m$ outputs $\{y_1, \dots, y_m\}$. We write the generation of each output as:

$$y_i = G(x, z_i) \tag{1}$$

where $z_i$ is a *diversity source*, the information beyond $x$ that the model conditions on when generating $y_i$. For some methods, $z_i$ is absent; for others, it may be a random string or a set of constraints. We categorize methods into three levels based on what information $z_i$ carries.
##### Level 0: No Injection\.
No per-output diversity source exists. Every output is sampled from the same conditional distribution, $y_i \sim P(Y \mid x)$, and differences between outputs arise entirely from the stochasticity of the decoding process.
##### Level 1: Surface\-Level Injection\.
Each output is conditioned on a $z_i$ that carries no semantic content about the target output, such as a random seed string, a nonce token, or an arbitrary identifier. The diversity source perturbs the input but does not specify any property of the output.
##### Level 2: Specification\-Level Injection\.
Each output is conditioned on a $z_i$ that specifies the generative configuration along multiple dimensions, such as narrative perspective, tone, focal subject, stylistic register, and content keywords. The diversity source establishes *how* the model approaches generation and *which elements* it foregrounds, jointly steering style and content.
The three levels form a spectrum of increasing semantic information in the conditioning signal. A method belongs to Level 1 if there exists a per-output $z_i$ that varies across outputs but carries no semantic content. It belongs to Level 2 if $z_i$ specifies meaningful properties of the output along multiple dimensions.
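The distinction between the levels can be illustrated by how the conditioning input for each output is built. The following is a minimal sketch; the helper `build_prompts`, the prompt templates, and the locally sampled seed string are hypothetical illustrations (SSOT, for instance, asks the model itself to produce the random string), not the prompts used by any of the cited methods.

```python
import random
import string


def build_prompts(x, m, level, specs=None, seed=0):
    """Return the m conditioning prompts for one input x at a given injection level.

    Hypothetical helper: the templates below are illustrative assumptions.
    """
    rng = random.Random(seed)
    if level == 0:
        # No z_i: every output is sampled from the same prompt.
        return [x for _ in range(m)]
    if level == 1:
        # z_i is a random string carrying no semantic content about the output.
        seeds = ["".join(rng.choices(string.ascii_lowercase, k=8)) for _ in range(m)]
        return [f"Seed: {s}\n{x}" for s in seeds]
    if level == 2:
        # z_i is a specification, e.g. keywords for tone, perspective, and focus.
        if specs is None or len(specs) < m:
            raise ValueError("Level 2 needs at least m specifications")
        return [f"Specification: {', '.join(specs[i])}\n{x}" for i in range(m)]
    raise ValueError("level must be 0, 1, or 2")
```

At Level 0 all $m$ prompts are identical, so any variation must come from decoding; at Levels 1 and 2 the prompts differ, but only at Level 2 does the difference describe the intended output.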
### 3.2 Classification of Existing Methods
Table [1](https://arxiv.org/html/2606.10302#S3.T1) classifies existing diverse generation methods by injection level.
Table 1: Classification of existing diverse output generation methods by injection level.

| Method | $z_i$ | Level |
| --- | --- | --- |
| Direct Generation | none | 0 |
| Verbalized Sampling (Zhang et al., [2025a](https://arxiv.org/html/2606.10302#bib.bib29)) | none | 0 |
| G2 (Ruan et al., [2025](https://arxiv.org/html/2606.10302#bib.bib15)) | none | 0 |
| SSOT (Misaki and Akiba, [2026](https://arxiv.org/html/2606.10302#bib.bib30)) | random string | 1 |
| Rand Concept (Agrawal and Goyal, [2026](https://arxiv.org/html/2606.10302#bib.bib2)) | random concept | 1 |

##### Level 0 Methods.
*Direct Generation* and its variants (temperature scaling, top-$k$ sampling, nucleus sampling, and min-$p$ sampling) repeatedly sample from the same prompt. They may reshape the token distribution but add no per-output conditioning, so diversity comes solely from sampling randomness. *Verbalized Sampling* (Zhang et al., [2025a](https://arxiv.org/html/2606.10302#bib.bib29)) elicits multiple candidates in a single call, but issues the same prompt every time, leaving decoding stochasticity as the sole source of variation. *G2* (Ruan et al., [2025](https://arxiv.org/html/2606.10302#bib.bib15)) steers decoding with diversity-promoting guide modules but likewise uses the same prompt for every output.
##### Level 1 Methods\.
*SSOT* (Misaki and Akiba, [2026](https://arxiv.org/html/2606.10302#bib.bib30)) instructs the model to first generate a random string, then condition on it as a seed to produce the output. The random string serves as a $z_i$ and differs across outputs. However, it carries no semantic content about the desired output's topic or organization. The same holds for Random Concept (Agrawal and Goyal, [2026](https://arxiv.org/html/2606.10302#bib.bib2)), which prepends an unrelated concept as the seed.
##### The Level 2 Gap\.
To our knowledge, no existing test-time method provides per-output specifications to promote diversity in general open-ended generation. A related line of work on narrative planning conditions generation on outlines or plot structures (Yang et al., [2023](https://arxiv.org/html/2606.10302#bib.bib27); Suh et al., [2024](https://arxiv.org/html/2606.10302#bib.bib26)), but these methods use planning or specification mainly to improve the quality of a single generation and are often restricted to storytelling. We address this gap in Section [4](https://arxiv.org/html/2606.10302#S4) by introducing fully automated Level 2 methods that provide specifications to induce diverse outputs across general open-ended tasks.
### 3.3 Quantifying Diversity Transmission
The framework above tells us *where* diversity is injected. A natural next question is: how much of the injected diversity actually survives to the final output? A method might generate highly diverse plans (high diversity in $z$), but if the LLM ignores or flattens those plans during generation, the output diversity will be low. We now formalize a way to measure this.
Let $Z$ and $Y$ be random variables representing the diversity source and the output for a given prompt $X = x$. By the definition of conditional mutual information, the entropy of the output can be decomposed as:

$$H(Y \mid X) = I(Y; Z \mid X) + H(Y \mid X, Z) \tag{2}$$

where $I(Y; Z \mid X)$ is the *realized diversity*, the portion of output diversity that is attributable to variation in $Z$. In other words, this is the diversity that the injection mechanism successfully transmits to the output. $H(Y \mid X, Z)$ is the *residual diversity*, the variation that arises even when the diversity source $z$ is held fixed, i.e., decoding randomness. The realized diversity can be further expanded as:

$$I(Y; Z \mid X) = H(Z \mid X) - H(Z \mid X, Y) \tag{3}$$

where $H(Z \mid X)$ is the *injected diversity*, measuring how diverse the sources themselves are, and $H(Z \mid X, Y)$ captures *information loss*, measuring the diversity present in $Z$ that is not realized in the output. Based on this decomposition, we define the transmission score as follows.
###### Definition 1 (Transmission Score).
For methods with an explicit diversity source $Z$ and $H(Z \mid X) > 0$, we define the transmission score as

$$T = \frac{I(Y; Z \mid X)}{H(Z \mid X)} \tag{4}$$

It measures the fraction of source diversity that is realized in the final output.
Since $I(Y; Z \mid X) \leq H(Z \mid X)$, the transmission score lies in $[0, 1]$. A value of $T = 0$ indicates that the output is conditionally independent of the source given the prompt, while $T = 1$ indicates that the source variation is fully recoverable from the output under the chosen representation.
The transmission score is defined only for methods with an explicit diversity source (Levels 1 and 2). For Level 0 methods, no per-output $z_i$ exists, so the decomposition does not apply, and output diversity arises from decoding randomness. We examine how $T$ and the decomposition terms vary across methods in Section [5](https://arxiv.org/html/2606.10302#S5).
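To make the two endpoints of $T$ concrete, the following minimal sketch evaluates the decomposition exactly on a toy discrete source and output for a single fixed prompt. The distributions and the helper names `entropy` and `transmission_score` are illustrative assumptions, not part of the paper's pipeline, which estimates these quantities from LLM log probabilities.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def transmission_score(p_z, p_y_given_z):
    """T = I(Y;Z|X) / H(Z|X) for one fixed prompt with discrete Z and Y.

    p_z: dict z -> P(z | x); p_y_given_z: dict z -> dict y -> P(y | z, x).
    """
    # Marginal output distribution P(y | x) = sum_z P(z | x) P(y | z, x).
    p_y = defaultdict(float)
    for z, pz in p_z.items():
        for y, pyz in p_y_given_z[z].items():
            p_y[y] += pz * pyz
    h_y = entropy(p_y)  # H(Y | X)
    # Residual diversity H(Y | X, Z): expected entropy with z held fixed.
    h_y_given_z = sum(pz * entropy(p_y_given_z[z]) for z, pz in p_z.items())
    h_z = entropy(p_z)  # injected diversity H(Z | X), assumed > 0
    return (h_y - h_y_given_z) / h_z


source = {"a": 0.5, "b": 0.5}
ignored = {"a": {"u": 0.5, "v": 0.5}, "b": {"u": 0.5, "v": 0.5}}
followed = {"a": {"u": 1.0}, "b": {"v": 1.0}}
partial = {"a": {"u": 0.9, "v": 0.1}, "b": {"u": 0.1, "v": 0.9}}
```

With these toy inputs, `ignored` models a generator that disregards the source (so $T = 0$), `followed` models one whose output determines the source (so $T = 1$), and `partial` falls strictly in between.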
### 3.4 Estimating Output Entropy and Transmission Score
Computing the transmission score $T$ and the decomposition in Eqs. [2](https://arxiv.org/html/2606.10302#S3.E2)-[3](https://arxiv.org/html/2606.10302#S3.E3) requires estimating $H(Y \mid X)$, $H(Y \mid X, Z)$, and $H(Z \mid X)$. The latter two can be computed directly from the LLM's log probabilities given a fixed $z$ or a fixed prompt. The challenge is estimating $H(Y \mid X)$, the marginal output entropy, which requires marginalizing over $Z$.
Algorithm 1 EstimateT$(X, P, M, L)$: Estimate the Transmission Score

1. Input: task $X$, model $P$, estimation size $M$, evaluation size $L$
2. Output: transmission score $\hat{T}$
3. Sample estimation set $\{Z_j\}_{j=1}^{M} \sim P(Z \mid X)$
4. Sample evaluation set $\{Z_\ell\}_{\ell=1}^{L} \sim P(Z \mid X)$
5. Generate $Y_\ell \sim P(Y \mid Z_\ell, X)$ for each $\ell$
6. for $\ell = 1, \ldots, L$ do
7. $\hat{P}(Y_\ell \mid X) \leftarrow \frac{1}{M} \sum_{j=1}^{M} P(Y_\ell \mid Z_j, X)$
8. $\hat{h}_\ell \leftarrow -\frac{1}{|Y_\ell|} \log_2 \hat{P}(Y_\ell \mid X)$
9. end for
10. $\hat{H}(Y \mid X) \leftarrow \frac{1}{L} \sum_{\ell} \hat{h}_\ell$
11. $\hat{H}(Y \mid X, Z) \leftarrow \frac{1}{L} \sum_{\ell} \left[ -\frac{1}{|Y_\ell|} \log_2 P(Y_\ell \mid Z_\ell, X) \right]$
12. $\hat{H}(Z \mid X) \leftarrow \frac{1}{L} \sum_{\ell} \left[ -\frac{1}{|Z_\ell|} \log_2 P(Z_\ell \mid X) \right]$
13. return $\hat{T} = \frac{\hat{H}(Y \mid X) - \hat{H}(Y \mid X, Z)}{\hat{H}(Z \mid X)}$
Figure 1: Two variants for generating diverse specifications ($z_i$'s) to encourage diverse LLM outputs. Diverse Outline (left) relies on the LLM's internal capacity to jointly produce a batch of diverse outlines. Keyword (right) explicitly constructs combinatorial outlines by sampling keyword combinations from LLM-generated outline contents.

We estimate $H(Y \mid X)$ using Monte Carlo approximation on a holdout set. For a given task $X$, we draw two independent sets of outlines from the LLM: an *estimation set* $\{Z_j\}_{j=1}^{M}$ and an *evaluation set* $\{Z_\ell\}_{\ell=1}^{L}$. For each $Z_\ell$, we generate one output $Y_\ell \sim P(Y \mid Z_\ell, X)$. We then approximate the marginal likelihood of each evaluation output by averaging over the estimation set:

$$\hat{P}(Y_\ell \mid X) = \frac{1}{M} \sum_{j=1}^{M} P(Y_\ell \mid Z_j, X).$$

The output entropy is then estimated by averaging the per-token negative log-probability across the $L$ evaluation members:

$$\hat{H}(Y \mid X) = \frac{1}{L} \sum_{\ell=1}^{L} \left[ -\frac{1}{|Y_\ell|} \log_2 \hat{P}(Y_\ell \mid X) \right],$$

where $|Y_\ell|$ is the number of tokens in $Y_\ell$.
Note that $\hat{P}(Y_\ell \mid X)$ is an unbiased estimator. Since each $Z_j$ is drawn i.i.d. from $P(Z \mid X)$,

$$\mathbb{E}_{\{Z_j\}}\left[\hat{P}(Y_\ell \mid X)\right] = \mathbb{E}_{Z \sim P(Z \mid X)}\left[P(Y_\ell \mid Z, X)\right] = P(Y_\ell \mid X).$$

However, the entropy estimate $\hat{H}(Y \mid X)$ has a positive bias of $O(1/M)$ due to Jensen's inequality applied to the concave $\log$ function, so $\hat{H}(Y \mid X)$ is a soft upper bound on the true entropy. The bias vanishes as $M$ increases. Full derivations are provided in Appendix [B](https://arxiv.org/html/2606.10302#A2).
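The direction of this bias can be sketched in one line for a single evaluation output $Y_\ell$, writing $\hat{h}_\ell$ for the per-token term inside the entropy estimate and taking the expectation over the estimation set only. This is a brief illustration under that assumption, not a substitute for the full derivation in the appendix.

```latex
\mathbb{E}_{\{Z_j\}}\!\left[\hat{h}_\ell\right]
  = -\frac{1}{|Y_\ell|}\,\mathbb{E}_{\{Z_j\}}\!\left[\log_2 \hat{P}(Y_\ell \mid X)\right]
  \;\ge\; -\frac{1}{|Y_\ell|}\,\log_2 \mathbb{E}_{\{Z_j\}}\!\left[\hat{P}(Y_\ell \mid X)\right]
  = -\frac{1}{|Y_\ell|}\,\log_2 P(Y_\ell \mid X)
```

The inequality is Jensen's inequality for the concave logarithm, and the final equality uses the unbiasedness of $\hat{P}(Y_\ell \mid X)$.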
Given estimates of $H(Y \mid X)$, $H(Y \mid X, Z)$, and $H(Z \mid X)$, we can calculate the transmission score. Algorithm [1](https://arxiv.org/html/2606.10302#alg1) summarizes the full procedure.
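The estimation procedure can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the callables `sample_z`, `sample_y`, `logp_y_given_z`, `logp_z`, and `n_tokens` are hypothetical stand-ins for LLM sampling and scoring calls, and log probabilities are assumed to be natural logs of whole-sequence probabilities. The marginal likelihood is averaged in log space with a log-sum-exp, since sequence probabilities are typically too small to average directly.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def estimate_t(sample_z, sample_y, logp_y_given_z, logp_z, n_tokens, M, L):
    """Monte Carlo estimate of the transmission score for one task.

    All callables are hypothetical: sample_z() draws z ~ P(Z|X), sample_y(z)
    draws y ~ P(Y|z,X), logp_y_given_z(y, z) and logp_z(z) return natural-log
    sequence probabilities, and n_tokens(s) returns a token count.
    """
    estimation = [sample_z() for _ in range(M)]  # {Z_j}
    evaluation = [sample_z() for _ in range(L)]  # {Z_l}
    outputs = [sample_y(z) for z in evaluation]  # Y_l ~ P(Y | Z_l, X)
    ln2 = math.log(2)
    h_y = h_y_z = h_z = 0.0
    for z, y in zip(evaluation, outputs):
        # log P_hat(y|X) = logsumexp_j log P(y|Z_j,X) - log M
        log_marginal = logsumexp([logp_y_given_z(y, zj) for zj in estimation]) - math.log(M)
        h_y += -log_marginal / ln2 / n_tokens(y)          # per-token bits
        h_y_z += -logp_y_given_z(y, z) / ln2 / n_tokens(y)
        h_z += -logp_z(z) / ln2 / n_tokens(z)
    h_y, h_y_z, h_z = h_y / L, h_y_z / L, h_z / L
    return (h_y - h_y_z) / h_z
```

In practice each `logp_y_given_z(y, zj)` call is one scoring pass of the model, so the cost of the marginal estimate grows as $M \times L$ passes.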
## 4 Level-2 Instantiations
We propose two fully automated methods that instantiate Level 2 injection for general open-ended generation tasks. Both methods follow a two-step pipeline. Given an input prompt $x$, we first generate a set of per-output specifications $\{z_1, \ldots, z_m\}$, and then condition on each specification to produce a final output $y_i = G(x, z_i)$. Each specification is represented as a short outline describing the intended generative configuration, including dimensions such as perspective, tone, thematic focus, and content organization. The two methods differ in how these specifications are constructed.
##### Diverse Outline\.
Given task $x$, we prompt the LLM to generate $m$ specifications in a single call, where each specification is a compact list of 4-6 keywords capturing tone, format, perspective, and key focus. The prompt instructs the model to maximize diversity across specifications and specifies the output format (Figure [1](https://arxiv.org/html/2606.10302#S3.F1), left). This variant relies on the LLM's internal ability to propose diverse specifications.
##### Keyword\.
We prompt the LLM to generate a set of dimensions along which responses to the task can vary and a pool of candidate values for each (e.g., tone with values such as satirical or earnest). Each specification $z_i$ is assembled by selecting one value per dimension. To ensure the selected specifications differ from one another, we use greedy Hamming-distance maximization, iteratively selecting the combination most distant from all previously chosen ones (Figure [1](https://arxiv.org/html/2606.10302#S3.F1), right). This combinatorial construction encourages distinct outlines to differ across multiple axes.
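The greedy selection step can be sketched as below. This is an illustrative sketch under stated assumptions: the text does not specify how "most distant from all previously chosen ones" is scored or how the first combination is picked, so the code assumes a max-min Hamming criterion (ties broken by total distance), starts from the first enumerated combination, and enumerates the full Cartesian product, which is only practical for small value pools. The function names are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of dimensions on which two value tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def select_specs(dimensions, m):
    """Greedily pick up to m value combinations that are far apart in Hamming distance.

    dimensions: dict mapping a dimension name to its list of candidate values.
    Returns a list of dicts, one specification per selected combination.
    """
    names = list(dimensions)
    candidates = list(product(*(dimensions[n] for n in names)))
    chosen = [candidates[0]]  # assumption: start from the first combination
    while len(chosen) < min(m, len(candidates)):
        remaining = [c for c in candidates if c not in chosen]
        # Assumed criterion: maximize the minimum distance to the chosen set,
        # breaking ties by the total distance.
        best = max(
            remaining,
            key=lambda c: (
                min(hamming(c, s) for s in chosen),
                sum(hamming(c, s) for s in chosen),
            ),
        )
        chosen.append(best)
    return [dict(zip(names, c)) for c in chosen]
```

With three dimensions of three values each and $m = 3$, this criterion returns three specifications that differ from one another on every dimension.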
Both methods require no human intervention and apply to open-ended generation tasks with the same pipeline. Prompts are provided in Appendix [A.3](https://arxiv.org/html/2606.10302#A1.SS3).
## 5 Experiments
Our evaluation addresses the following questions: (1) Does injecting more informative diversity sources lead to more diverse outputs, without sacrificing quality (Sec. [5.1](https://arxiv.org/html/2606.10302#S5.SS1))? (2) When a source is injected, does variation in the source contribute to output diversity, beyond what decoding randomness alone produces (Sec. [5.2](https://arxiv.org/html/2606.10302#S5.SS2))? (3) How does output diversity scale as we increase the number of generated outputs (Sec. [5.3](https://arxiv.org/html/2606.10302#S5.SS3))?
We evaluate on five open\-ended generation tasks spanning creative writing, conversation, poetry, argumentation, and general knowledge\.
- WritingPrompts (Huang et al., [2024](https://arxiv.org/html/2606.10302#bib.bib33)) contains Reddit writing prompts that ask for short fiction given a premise, spanning genres from science fiction to literary drama.
- Infinite Chats (Jiang et al., [2026](https://arxiv.org/html/2606.10302#bib.bib32)) consists of 26K open-ended user queries covering diverse everyday topics, requiring substantive, multi-paragraph responses.
- NoveltyBench (Zhang et al., [2025b](https://arxiv.org/html/2606.10302#bib.bib23)) is a benchmark designed to evaluate response diversity, covering creative writing, factual knowledge, and subjective opinion generation with prompts that admit multiple valid answers.
- Poetry Foundation (Poetry Foundation, [2023](https://arxiv.org/html/2606.10302#bib.bib41)) provides poetry writing prompts with varied stylistic and thematic constraints.
- IBM ArgKP (Bar-Haim et al., [2020](https://arxiv.org/html/2606.10302#bib.bib34)) contains debate topics for which the model must generate diverse argumentative key points.

Unless otherwise noted, all experiments use the default decoding settings of each backbone model. We provide the generated outputs in Appendix [A.3](https://arxiv.org/html/2606.10302#A1.SS3).
Table 2: Diversity and quality on Infinite Chats (IC) and NoveltyBench (NB). Methods are grouped by injection level (Lv.). G2 and High Temperature require access to the model and are evaluated only on Qwen3-4B. Bold indicates the best result. ↑ = higher is better. Level 2 methods achieve higher diversity while maintaining comparable quality.

| Model | Lv. | Method | IC LLM Div ↑ | IC Emb Div ↑ | IC Struct Div ↑ | IC Quality ↑ | NB LLM Div ↑ | NB Emb Div ↑ | NB Struct Div ↑ | NB Distinct-$N$ ↑ | NB Quality ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5-mini | 0 | Direct | 2.69±1.52 | 0.26±0.15 | 0.21±0.09 | 8.30±0.72 | 2.88±1.29 | 0.41±0.15 | 0.24±0.09 | 15.25±6.32 | 7.30±0.57 |
| GPT-5-mini | 0 | Verbalized | 2.84±1.18 | 0.23±0.11 | 0.20±0.07 | **8.47**±0.45 | 3.10±1.21 | 0.36±0.11 | 0.22±0.07 | 9.76±4.09 | 7.30±0.62 |
| GPT-5-mini | 1 | SSOT | 2.40±0.78 | 0.18±0.09 | 0.17±0.04 | 8.30±0.45 | 2.36±0.81 | 0.24±0.10 | 0.19±0.06 | 4.74±2.98 | 6.88±0.74 |
| GPT-5-mini | 2 | Keyword (ours) | 4.87±1.39 | 0.32±0.13 | 0.25±0.06 | 8.36±0.54 | 6.63±1.43 | 0.43±0.09 | 0.31±0.06 | 16.46±3.78 | 7.32±0.70 |
| GPT-5-mini | 2 | Div Outline (ours) | **7.32**±1.55 | **0.40**±0.13 | **0.31**±0.08 | 8.28±0.61 | **7.13**±1.89 | **0.50**±0.11 | **0.35**±0.08 | **16.75**±4.30 | **7.39**±0.79 |
| Qwen3-4B | 0 | Direct | 2.11±0.85 | 0.19±0.14 | 0.13±0.07 | 7.02±0.68 | 2.44±0.80 | 0.30±0.16 | 0.18±0.10 | 10.96±7.44 | 6.72±0.81 |
| Qwen3-4B | 0 | Verbalized | 2.31±0.91 | 0.25±0.15 | 0.13±0.07 | 6.55±1.04 | 2.77±0.80 | 0.37±0.14 | 0.18±0.09 | 11.01±5.09 | 6.26±1.29 |
| Qwen3-4B | 0 | Direct ($\tau = 1.5$) | 2.35±0.82 | 0.19±0.14 | 0.13±0.07 | 6.67±0.99 | 2.36±0.85 | 0.25±0.15 | 0.17±0.09 | 7.38±5.77 | 6.51±1.00 |
| Qwen3-4B | 0 | G2 | 2.09±0.69 | 0.12±0.07 | 0.19±0.05 | 7.03±0.78 | 2.75±0.62 | 0.21±0.08 | 0.23±0.06 | 5.54±4.36 | 7.35±0.74 |
| Qwen3-4B | 1 | SSOT | 2.21±0.66 | 0.15±0.09 | 0.15±0.06 | 6.70±0.85 | 2.13±0.74 | 0.20±0.12 | 0.17±0.08 | 6.03±4.78 | 6.36±1.05 |
| Qwen3-4B | 2 | Keyword (ours) | **5.77**±1.52 | 0.33±0.09 | 0.22±0.06 | **7.71**±0.49 | **5.01**±1.72 | 0.38±0.09 | 0.25±0.08 | 13.21±5.58 | **7.40**±1.20 |
| Qwen3-4B | 2 | Div Outline (ours) | 5.58±1.82 | **0.35**±0.12 | **0.24**±0.08 | 7.45±0.59 | 5.00±2.00 | **0.41**±0.11 | **0.27**±0.09 | **13.85**±5.79 | 7.09±1.17 |

### 5.1 Specification-Level Injection Improves Diversity with Comparable Quality
We compare six methods, including three Level 0 baselines (Direct Generation, Verbalized Sampling (Zhang et al., [2025a](https://arxiv.org/html/2606.10302#bib.bib29)), and G2 (Ruan et al., [2025](https://arxiv.org/html/2606.10302#bib.bib15))), one Level 1 method (SSOT (Misaki and Akiba, [2026](https://arxiv.org/html/2606.10302#bib.bib30))), and two Level 2 methods proposed by this work (Keyword and Diverse Outline).
We use four backbone models: GPT-5-mini (Singh et al., [2025](https://arxiv.org/html/2606.10302#bib.bib35)), Qwen3-4B (Yang et al., [2025](https://arxiv.org/html/2606.10302#bib.bib36)), DeepSeek-V3.2 (DeepSeek-AI et al., [2025](https://arxiv.org/html/2606.10302#bib.bib37)), and Mistral-Large-3 (Mistral AI, [2025](https://arxiv.org/html/2606.10302#bib.bib38)). G2 requires access to model logits for its guide modules, which is only available for open-weight models. We therefore report G2 results with Qwen3-4B only. For open-weight models whose temperature can be set above one (Qwen3-4B and DeepSeek-V3.2), we additionally include a high-temperature baseline ($\tau = 1.5$). The detailed setup is described in Appendix [A.1](https://arxiv.org/html/2606.10302#A1.SS1).
We evaluate our experimental results along five dimensions\.
- LLM-judged diversity: for each pair of outputs from the same prompt, we ask GPT-5.3-Chat (Singh et al., [2025](https://arxiv.org/html/2606.10302#bib.bib35)) to rate their pairwise diversity on a 1–10 scale and average over all pairs.
- Lexical diversity: we report Distinct-3 (proportion of unique trigrams) and Self-BLEU (mean BLEU of each output against all others).
- Embedding diversity: we compute the mean pairwise cosine distance ($1 - \cos$) between sentence embeddings of all outputs from the same prompt, using OpenAI’s text-embedding-3-small (OpenAI, [2024](https://arxiv.org/html/2606.10302#bib.bib43)).
- Structural diversity: we first prompt GPT-5.3-Chat to extract a short outline from each output, then compute embedding diversity over the extracted outlines rather than the full outputs. This captures variation in high-level organization rather than surface wording.
- Quality: we ask GPT-5.3-Chat to rate each output on a 1–10 scale for coherence, relevance, and overall quality, and report the average.
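The lexical and embedding metrics in the list above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the whitespace tokenization, the pooling of trigrams over all outputs of a prompt, and the function names `distinct_n`, `cosine_distance`, and `mean_pairwise_distance` are assumptions made for illustration, and the embeddings are taken as plain lists of floats rather than fetched from an embedding API.

```python
import math
from itertools import combinations


def distinct_n(outputs, n=3):
    """Proportion of unique n-grams among all n-grams pooled over the outputs."""
    ngrams = []
    for text in outputs:
        tokens = text.split()  # assumption: simple whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Mean cosine distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

Structural diversity would reuse `mean_pairwise_distance` on embeddings of the extracted outlines instead of the full outputs.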
We sample 100 prompts per dataset for evaluation, except for ArgKP, where we retain 41 prompts after filtering out those containing harmful content. For each prompt, we generate 20 outputs and report the mean and standard deviation of each metric across prompts. For NoveltyBench, we additionally report its specialized Distinct-N metric, which counts the number of unique equivalence classes among generated responses.
Table [2](https://arxiv.org/html/2606.10302#S5.T2) shows that both Keyword and Diverse Outline consistently achieve the highest diversity across all metrics on both benchmarks, while maintaining quality comparable to the baselines. This suggests that introducing more informative specifications is an effective way to improve diversity in open-ended generation. Additional results across diversity metrics, datasets, and backbone models are reported in Appendix [A.2](https://arxiv.org/html/2606.10302#A1.SS2).
### 5.2 Varying Injected Source Expands Output Diversity
To understand why Level 2 methods outperform the baselines, we examine how variation in the injected source contributes to output diversity. We apply the entropy decomposition introduced in Sec. [3.3](https://arxiv.org/html/2606.10302#S3.SS3) to measure the transmission score $T$ and realized diversity $I(Y;Z \mid X)$. This analysis requires an explicit per-output diversity source $z_i$. Level 0 methods have no such source, and both terms are undefined for them. We therefore compare only methods with explicit $z_i$. Table [3](https://arxiv.org/html/2606.10302#S5.T3) reports these quantities alongside the output entropy $H(Y \mid X)$, estimated using the holdout procedure described in Sec. [3.4](https://arxiv.org/html/2606.10302#S3.SS4).
As shown in Table [3](https://arxiv.org/html/2606.10302#S5.T3), SSOT has near-zero transmission ($T \approx 0.003$) and near-zero realized diversity, with an output entropy $H(Y \mid X)$ of only 0.80, far below the Level 2 methods. Although SSOT introduces a per-output diversity source, its random strings carry little information about the desired output and are largely ignored during generation. In contrast, both Keyword and Diverse Outline achieve substantially higher transmission scores and realized diversity, with output entropy above 2.8. These results suggest that effective diversity injection requires both informative sources and reliable transmission from the source to the final output. More results are in Appendix [A.2](https://arxiv.org/html/2606.10302#A1.SS2).
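The relationship between these quantities can be made concrete with the chain rule for conditional entropy. The worked values below are a rough check rather than reported results: they assume the estimates in Table 3 (Infinite Chats rows) satisfy the identity exactly, which a holdout estimator may do only approximately.

$$
\begin{aligned}
I(Y;Z \mid X) &= H(Y \mid X) - H(Y \mid X, Z) \\
\text{Diverse Outline:}\quad H(Y \mid X, Z) &= 2.98 - 1.74 = 1.24 \\
\text{Keyword:}\quad H(Y \mid X, Z) &= 2.96 - 1.45 = 1.51 \\
\text{SSOT:}\quad H(Y \mid X, Z) &= 0.80 - 0.00 = 0.80
\end{aligned}
$$

Under this reading, roughly $1.74 / 2.98 \approx 0.58$ of Diverse Outline's output entropy is attributable to variation in the source, whereas essentially all of SSOT's output entropy is per-source decoding randomness. This ratio is not the transmission score $T$, whose definition is given in Sec. 3.3.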
Table 3: Transmission score $T$, realized diversity $I(Y;Z \mid X)$, and output entropy $H(Y \mid X)$ on Qwen3-4B. $T$ and $I(Y;Z \mid X)$ are undefined for Level 0 methods. Level 2 methods achieve higher $T$ and $I(Y;Z \mid X)$ than the Level 1 method.

| Data | Method | $T$ ↑ | $I(Y;Z \mid X)$ ↑ | $H(Y \mid X)$ ↑ |
|---|---|---|---|---|
| Infinite Chats | Div Outline (ours) | 0.552 | 1.74 | 2.98 |
| Infinite Chats | Keyword (ours) | 0.464 | 1.45 | 2.96 |
| Infinite Chats | SSOT | 0.003 | 0.00 | 0.80 |
| NoveltyBench | Div Outline (ours) | 0.507 | 1.43 | 2.88 |
| NoveltyBench | Keyword (ours) | 0.476 | 1.36 | 2.92 |
| NoveltyBench | SSOT | 0.006 | 0.00 | 0.73 |

Figure [2](https://arxiv.org/html/2606.10302#S5.F2) compares three quantities: the output entropy of Direct Generation, $H(Y' \mid X)$; the residual entropy $H(Y \mid X, Z)$ of our methods when conditioned on a single fixed outline, which captures per-outline decoding randomness; and the total output entropy $H(Y \mid X)$ of our methods when the outline varies. For both Keyword and Diverse Outline, the total entropy $H(Y \mid X)$ exceeds the entropy of Direct Generation, while the fixed-source entropy $H(Y \mid X, Z)$ falls below it. This consistent ordering, $H(Y \mid X, Z) < H(Y' \mid X) < H(Y \mid X)$, shows that conditioning on a single source narrows the per-source output distribution, yet varying the source across generations yields higher total diversity than Direct Generation. Thus, variation across injected sources contributes to the output diversity of specification-level methods.
Figure 2: Entropy $H(Y' \mid X)$ of Direct Generation vs. outline-conditioned entropy $H(Y \mid X, Z)$ and total output entropy $H(Y \mid X)$ of Level 2 methods on NoveltyBench and Infinite Chats. Fixed-outline entropy falls below Direct Generation, yet total entropy exceeds it, showing that outline variation expands output diversity.

We next provide a complementary analysis using embedding diversity. For each method, we vary the number of outlines $m \in \{2, 5, 10\}$, generating $k = 10, 4, 2$ outputs per outline, respectively, and measure two quantities: embedding diversity among outputs generated from the same outline (*within-outline diversity*) and embedding diversity among outputs generated from different outlines (*between-outline diversity*). If outline variation is the primary driver of diversity, between-outline diversity should exceed within-outline diversity.
Figure [3](https://arxiv.org/html/2606.10302#S5.F3) shows that between-outline diversity consistently exceeds within-outline diversity across all values of $m$, for both methods and datasets. Within-outline diversity remains roughly stable as $m$ grows, indicating that repeated sampling under a fixed source contributes limited additional variation. Between-outline diversity increases with $m$, suggesting that adding more sources expands the range of generated outputs.
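The within-outline versus between-outline comparison can be sketched as below. This is an illustrative reading, not the authors' code: it assumes each output comes with an embedding and the index of the outline it was generated from, and it pools all same-outline pairs and all different-outline pairs for one prompt. The name `within_between_diversity` is hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def _mean(values):
    """Arithmetic mean, or NaN when there are no pairs of that kind."""
    return sum(values) / len(values) if values else float("nan")


def within_between_diversity(embeddings, outline_ids):
    """Return (within-outline, between-outline) mean pairwise cosine distance."""
    within, between = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        distance = cosine_distance(embeddings[i], embeddings[j])
        if outline_ids[i] == outline_ids[j]:
            within.append(distance)   # both outputs share one outline
        else:
            between.append(distance)  # outputs come from different outlines
    return _mean(within), _mean(between)
```

With $m$ outlines and $k$ outputs per outline, there are $m \cdot k(k-1)/2$ within-outline pairs and $k^2 \cdot m(m-1)/2$ between-outline pairs, so the two averages are taken over different numbers of pairs as $m$ varies.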
Figure 3: Within-outline diversity and between-outline diversity for Diverse Outline and Keyword, across $m \in \{2, 5, 10\}$ outlines on Infinite Chats and NoveltyBench. Between-outline diversity consistently exceeds within-outline diversity and grows with $m$, while within-outline diversity remains flat. More results are in Appendix [A.2](https://arxiv.org/html/2606.10302#A1.SS2).

To directly test whether outline diversity drives output diversity, we perform a controlled intervention on the Keyword method. As described in Sec. [4](https://arxiv.org/html/2606.10302#S4), Keyword constructs each specification by selecting one value per dimension. By default, each specification takes independent values on all dimensions. We reduce specification diversity by forcing specifications to share the same value on a subset of dimensions. As more dimensions are shared, specifications become more similar. As shown in Figure [4](https://arxiv.org/html/2606.10302#S5.F4), output embedding diversity scales linearly with specification embedding diversity across both datasets, indicating that output variation is directly driven by the diversity of the injected specifications.
Figure 4: Effect of outline diversity on output diversity. We reduce outline diversity in the Keyword method by forcing outlines to share values on an increasing number of dimensions ($0 \to 4$). Output embedding diversity scales linearly with specification embedding diversity on both NoveltyBench and Infinite Chats.
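The shared-dimension intervention described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the dimension names and values, the choice to share the first `num_shared` dimensions, uniform sampling, and the function name `build_specifications` are all assumptions.

```python
import random


def build_specifications(dimensions, num_specs, num_shared=0, seed=0):
    """Sample one value per dimension for each specification.

    The first `num_shared` dimensions (an assumed ordering) are forced to the
    same value in every specification; the rest are sampled independently.
    """
    rng = random.Random(seed)
    names = list(dimensions)
    shared = {name: rng.choice(dimensions[name]) for name in names[:num_shared]}
    specs = []
    for _ in range(num_specs):
        spec = {name: rng.choice(dimensions[name]) for name in names}
        spec.update(shared)  # overwrite the shared dimensions with fixed values
        specs.append(spec)
    return specs


# Hypothetical dimensions, for illustration only.
DIMENSIONS = {
    "tone": ["formal", "playful", "somber"],
    "audience": ["experts", "children", "general readers"],
    "structure": ["list", "narrative", "dialogue"],
    "perspective": ["first person", "third person"],
}

specs = build_specifications(DIMENSIONS, num_specs=20, num_shared=2)
```

Raising `num_shared` from 0 to the number of dimensions moves the specification set from fully independent to identical, which mirrors the $0 \to 4$ sweep in Figure 4.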
### 5.3 Scaling with the Number of Generated Outputs
Figure 5: Diversity scaling as the number of generated outputs increases, using GPT-5-mini. Diverse Outline and Keyword maintain an advantage over Direct Generation across scales.

We investigate how output diversity scales with the number of generated outputs. Using GPT-5-mini as the backbone, we vary the number of outputs across $\{5, 10, 20, 30, 40, 50\}$ and compare Diverse Outline and Keyword against Direct Generation on Infinite Chats and NoveltyBench, reporting LLM-judged diversity and embedding diversity.
As shown in Figure [5](https://arxiv.org/html/2606.10302#S5.F5), Diverse Outline and Keyword generally outperform Direct Generation across scales. In most cases, the gap remains visible as more outputs are generated. These results suggest that specification-level methods can make effective use of additional output slots by introducing diverse intermediate sources.
## 6 Conclusion
We introduced the Diversity Injection Framework, which unifies test-time diverse generation methods by the nature of their diversity source and quantifies diversity transmission via an entropy-based score. Guided by the framework, we proposed specification-level generation methods that inject diversity through detailed per-output configurations before generation begins. Diverse Outline lets the model propose all specifications in one pass and achieves high diversity in nearly all settings, while Keyword builds them from orthogonal axes, offering a more structured and interpretable construction. That both methods succeed despite constructing their sources differently suggests that source design is an effective lever for diverse generation. Experiments across five open-ended tasks and four backbone models show that these methods substantially improve output diversity while maintaining comparable quality. We hope the framework and transmission score provide a foundation for diagnosing and improving diversity in future generation systems, and that specification-level injection offers a practical, training-free path toward richer, more varied LLM outputs.
##### Limitations\.
Our approach has the following limitations\. First, specification\-level injection adds a specification\-generation step before output generation, which increases token usage and latency relative to Direct Generation, though in practice the step is lightweight and the cost amortizes when many outputs are requested\. Second, our transmission score depends on a chosen representation of the diversity source and output\. This is appropriate for comparing methods under a fixed evaluation protocol, but different representations may emphasize different aspects of source\-to\-output transmission\.
## References
- D\. H\. Ackley, G\. E\. Hinton, and T\. J\. Sejnowski \(1985\)A learning algorithm for boltzmann machines\.Cognitive Science9\(1\),pp\. 147–169\.External Links:ISSN 0364\-0213,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0364-0213%2885%2980012-4),[Link](https://www.sciencedirect.com/science/article/pii/S0364021385800124)Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px1.p1.2)\.
- P\. Agrawal and P\. Goyal \(2026\)Addressing llm diversity by infusing random concepts\.arXiv preprint arXiv:2601\.18053\.Cited by:[§1](https://arxiv.org/html/2606.10302#S1.p2.1),[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.10302#S3.SS2.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.10302#S3.T1.1.6.1)\.
- R\. Bar\-Haim, L\. Eden, R\. Friedman, Y\. Kantor, D\. Lahav, and N\. Slonim \(2020\)From arguments to key points: Towards automatic argument summarization\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 4029–4039\.External Links:[Link](https://aclanthology.org/2020.acl-main.371/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.371)Cited by:[§A\.2](https://arxiv.org/html/2606.10302#A1.SS2.SSS0.Px1.p3.1),[5th item](https://arxiv.org/html/2606.10302#S5.I1.i5.p1.1)\.
- R\. E\. Cardona\-Rivera, A\. Jhala, J\. Porteous, and R\. M\. Young \(2024\)The story so far on narrative planning\.InProceedings of the Thirty\-Fourth International Conference on Automated Planning and Scheduling,ICAPS ’24\.External Links:ISBN 1\-57735\-889\-9,[Link](https://doi.org/10.1609/icaps.v34i1.31509),[Document](https://dx.doi.org/10.1609/icaps.v34i1.31509)Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px3.p1.1)\.
- DeepSeek\-AI, A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong, C\. Lu, C\. Zhao, C\. Deng, C\. Xu, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, E\. Li, F\. Zhou, F\. Lin, F\. Dai, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Li, H\. Liang, H\. Wei, H\. Zhang, H\. Luo, H\. Ji, H\. Ding, H\. Tang, H\. Cao, H\. Gao, H\. Qu, H\. Zeng, J\. Huang, J\. Li, J\. Xu, J\. Hu, J\. Chen, J\. Xiang, J\. Yuan, J\. Cheng, J\. Zhu, J\. Ran, J\. Jiang, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Gao, K\. Guan, K\. Huang, K\. Zhou, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Wang, L\. Zhao, L\. Yin, L\. Guo, L\. Luo, L\. Ma, L\. Wang, L\. Zhang, M\. S\. Di, M\. Y\. Xu, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, P\. Huang, P\. Cong, P\. Wang, Q\. Wang, Q\. Zhu, Q\. Li, Q\. Chen, Q\. Du, R\. Xu, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Yin, R\. Xu, R\. Shen, R\. Zhang, S\. H\. Liu, S\. Lu, S\. Zhou, S\. Chen, S\. Cai, S\. Chen, S\. Hu, S\. Liu, S\. Hu, S\. Ma, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. Zhou, T\. Ni, T\. Yun, T\. Pei, T\. Ye, T\. Yue, W\. Zeng, W\. Liu, W\. Liang, W\. Pang, W\. Luo, W\. Gao, W\. Zhang, X\. Gao, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Chen, X\. Zhang, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Li, X\. Yang, X\. Li, X\. Chen, X\. Su, X\. Pan, X\. Lin, X\. Fu, Y\. Q\. Wang, Y\. Zhang, Y\. Xu, Y\. Ma, Y\. Li, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Qian, Y\. Yu, Y\. Zhang, Y\. Ding, Y\. Shi, Y\. Xiong, Y\. He, Y\. Zhou, Y\. Zhong, Y\. Piao, Y\. Wang, Y\. Chen, Y\. Tan, Y\. Wei, Y\. Ma, Y\. Liu, Y\. Yang, Y\. Guo, Y\. Wu, Y\. Wu, Y\. Cheng, Y\. Ou, Y\. Xu, Y\. Wang, Y\. Gong, Y\. Wu, Y\. Zou, Y\. Li, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Zhao, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Huang, Z\. Wu, Z\. Li, Z\. Zhang, Z\. Xu, Z\. Wang, Z\. Gu, Z\. Zhu, Z\. Li, Z\. Zhang, Z\. Xie, Z\. Gao, Z\. 
Pan, Z\. Yao, B\. Feng, H\. Li, J\. L\. Cai, J\. Ni, L\. Xu, M\. Li, N\. Tian, R\. J\. Chen, R\. L\. Jin, S\. S\. Li, S\. Zhou, T\. Sun, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Song, X\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Z\. Huang, Z\. Xu, Z\. Zhang, D\. Ji, J\. Liang, J\. Guo, J\. Chen, L\. Xia, M\. Wang, M\. Li, P\. Zhang, R\. Chen, S\. Sun, S\. Wu, S\. Ye, T\. Wang, W\. L\. Xiao, W\. An, X\. Wang, X\. Sun, X\. Wang, Y\. Tang, Y\. Zha, Z\. Zhang, Z\. Ju, Z\. Zhang, and Z\. Qu \(2025\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.External Links:2512\.02556,[Link](https://arxiv.org/abs/2512.02556)Cited by:[§5\.1](https://arxiv.org/html/2606.10302#S5.SS1.p2.1)\.
- A\. Fan, M\. Lewis, and Y\. Dauphin \(2018\)Hierarchical neural story generation\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 889–898\.Cited by:[§1](https://arxiv.org/html/2606.10302#S1.p2.1),[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px1.p1.2)\.
- S\. A\. Hayati, M\. Lee, D\. Rajagopal, and D\. Kang \(2024\)How far can we extract diverse perspectives from large language models?\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 5336–5366\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.306/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.306)Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Holtzman, J\. Buys, L\. Du, M\. Forbes, and Y\. Choi \(2020\)The curious case of neural text degeneration\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rygGQyrFvH)Cited by:[§1](https://arxiv.org/html/2606.10302#S1.p2.1),[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px1.p1.2)\.
- X\. Y\. Huang, K\. Vishnubhotla, and F\. Rudzicz \(2024\)The gpt\-writingprompts dataset: a comparative analysis of character portrayal in short stories\.arXiv preprint arXiv:2406\.16767\.Cited by:[§A\.2](https://arxiv.org/html/2606.10302#A1.SS2.SSS0.Px1.p3.1),[1st item](https://arxiv.org/html/2606.10302#S5.I1.i1.p1.1)\.
- L\. Jiang, Y\. Chai, M\. Li, M\. Liu, R\. Fok, N\. Dziri, Y\. Tsvetkov, M\. Sap, and Y\. Choi \(2026\)Artificial hivemind: the open\-ended homogeneity of language models \(and beyond\)\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,Vol\.38\.Cited by:[§1](https://arxiv.org/html/2606.10302#S1.p1.1),[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px4.p1.1),[2nd item](https://arxiv.org/html/2606.10302#S5.I1.i2.p1.1)\.
- J\. Li, M\. Galley, C\. Brockett, J\. Gao, and B\. Dolan \(2016\)A diversity\-promoting objective function for neural conversation models\.InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics \(NAACL\),pp\. 110–119\.Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px4.p1.1)\.
- N\. N\. Minh, A\. Baker, C\. Neo, A\. G\. Roush, A\. Kirsch, and R\. Shwartz\-Ziv \(2025\)Turning up the heat: min\-p sampling for creative and coherent LLM outputs\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=FBkpCyujtS)Cited by:[§1](https://arxiv.org/html/2606.10302#S1.p2.1),[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px1.p1.2)\.
- K\. Misaki and T\. Akiba \(2026\)String seed of thought: prompting LLMs for distribution\-faithful and diverse generation\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=luXtbX1lVK)Cited by:[§1](https://arxiv.org/html/2606.10302#S1.p2.1),[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.10302#S3.SS2.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.10302#S3.T1.1.5.1),[§5\.1](https://arxiv.org/html/2606.10302#S5.SS1.p1.1)\.
- Mistral AI \(2025\)Mistral\-large\-3\-675b\-instruct\-2512\.Hugging Face\.External Links:[Link](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512)Cited by:[§5\.1](https://arxiv.org/html/2606.10302#S5.SS1.p2.1)\.
- OpenAI \(2024\)New embedding models and API updates\.Note:[https://openai\.com/blog/new\-embedding\-models\-and\-api\-updates](https://openai.com/blog/new-embedding-models-and-api-updates)Cited by:[3rd item](https://arxiv.org/html/2606.10302#S5.I2.i3.p1.1)\.
- K\. Park, N\. Yang, and K\. Jung \(2025\)Avoidance decoding for diverse multi\-branch story generation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 7489–7505\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.381/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.381),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px1.p1.2)\.
- Poetry Foundation \(2023\)Poetryfoundation\.org: data summary\.Note:Public online archiveExternal Links:[Link](https://www.poetryfoundation.org/)Cited by:[§A\.2](https://arxiv.org/html/2606.10302#A1.SS2.SSS0.Px1.p3.1),[4th item](https://arxiv.org/html/2606.10302#S5.I1.i4.p1.1)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-BERT: sentence embeddings using Siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 3982–3992\.Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px4.p1.1)\.
- M\. O\. Riedl and R\. M\. Young \(2010\)Narrative planning: balancing plot and character\.Journal of Artificial Intelligence Research39,pp\. 217–268\.Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Ruan, Y\. Li, Y\. Liu, Y\. Chen, W\. Luo, P\. Li, Y\. Liu, and G\. Chen \(2025\)G2: guided generation for enhanced output diversity in llms\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 14127–14145\.Cited by:[§1](https://arxiv.org/html/2606.10302#S1.p2.1),[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px1.p1.2),[§3\.2](https://arxiv.org/html/2606.10302#S3.SS2.SSS0.Px1.p1.2),[Table 1](https://arxiv.org/html/2606.10302#S3.T1.1.4.1),[§5\.1](https://arxiv.org/html/2606.10302#S5.SS1.p1.1)\.
- A\. Shypula, S\. Li, B\. Zhang, V\. Padmakumar, K\. Yin, and O\. Bastani \(2025\)Evaluating the diversity and quality of llm generated content\.InProceedings of the 2nd Conference on Language Modeling \(COLM\),Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram,et al\.\(2025\)Openai gpt\-5 system card\.arXiv preprint arXiv:2601\.03267\.Cited by:[1st item](https://arxiv.org/html/2606.10302#S5.I2.i1.p1.1),[§5\.1](https://arxiv.org/html/2606.10302#S5.SS1.p2.1)\.
- T\. Sorensen, J\. Moore, J\. Fisher, M\. L\. Gordon, N\. Mireshghallah, C\. M\. Rytting, A\. Ye, L\. Jiang, X\. Lu, N\. Dziri,et al\.\(2024\)Position: a roadmap to pluralistic alignment\.InInternational Conference on Machine Learning,pp\. 46280–46302\.Cited by:[§1](https://arxiv.org/html/2606.10302#S1.p1.1)\.
- Y\. Su and N\. Collier \(2023\)Contrastive search is what you need for neural text generation\.Transactions on Machine Learning Research\.Note:External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=GbkWw3jwL9)Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px1.p1.2)\.
- S\. Suh, M\. Chen, B\. Min, T\. J\. Li, and H\. Xia \(2024\)Luminate: structured generation and exploration of design space with large language models for human\-ai co\-creation\.InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems \(CHI ’24\),External Links:[Document](https://dx.doi.org/10.1145/3613904.3642400)Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2606.10302#S3.SS2.SSS0.Px3.p1.1)\.
- A\. Vijayakumar, M\. Cogswell, R\. Selvaraju, Q\. Sun, S\. Lee, D\. Crandall, and D\. Batra \(2018\)Diverse beam search for improved description of complex scenes\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.32\.Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px1.p1.2)\.
- P\. J\. Wang and M\. Kreminski \(2024\)Guiding and diversifying llm\-based story generation via answer set programming\.arXiv preprint arXiv:2406\.00554\.Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§5\.1](https://arxiv.org/html/2606.10302#S5.SS1.p2.1)\.
- K\. Yang, D\. Klein, N\. Peng, and Y\. Tian \(2023\)DOC: improving long story coherence with detailed outline control\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Toronto, Canada,pp\. 3378–3465\.Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2606.10302#S3.SS2.SSS0.Px3.p1.1)\.
- J\. Zhang, S\. Yu, D\. Chong, A\. Sicilia, M\. R\. Tomz, C\. D\. Manning, and W\. Shi \(2025a\)Verbalized sampling: how to mitigate mode collapse and unlock llm diversity\.arXiv preprint arXiv:2510\.01171\.Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.10302#S3.SS2.SSS0.Px1.p1.2),[Table 1](https://arxiv.org/html/2606.10302#S3.T1.1.3.1.1.2),[§5\.1](https://arxiv.org/html/2606.10302#S5.SS1.p1.1)\.
- T\. Zhang, B\. Peng, and D\. Bollegala \(2024\)Improving diversity of commonsense generation by large language models via in\-context learning\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 9226–9242\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.540/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.540)Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhang, H\. Diddee, S\. Holm, H\. Liu, X\. Liu, V\. Samuel, B\. Wang, and D\. Ippolito \(2025b\)NoveltyBench: evaluating language models for human\-like diversity\.InProceedings of the 2nd Conference on Language Modeling \(COLM\),Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px4.p1.1),[3rd item](https://arxiv.org/html/2606.10302#S5.I1.i3.p1.1)\.
- Y\. Zhuet al\.\(2018\)Texygen: a benchmarking platform for text generation models\.InProceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1097–1100\.Cited by:[§2](https://arxiv.org/html/2606.10302#S2.SS0.SSS0.Px4.p1.1)\.
## Appendix A Additional Experiment Details
### A.1 Models and Evaluation Setup
We evaluate on four backbone models: GPT-5-mini (Azure API), DeepSeek-V3.2 (Azure API), Mistral-Large-3 (Azure API), and Qwen3-4B (served locally via vLLM). All models use default decoding settings ($\tau = 1.0$) unless stated otherwise. For DeepSeek-V3.2 and Qwen3-4B-Instruct-2507, we additionally include a high-temperature baseline (Direct at $\tau = 1.5$; Qwen additionally uses top-$p = 0.9$). Each experiment generates 20 outlines and one output per outline for 100 prompts per dataset.
For the entropy estimation in Table [3](https://arxiv.org/html/2606.10302#S5.T3) and Figure [2](https://arxiv.org/html/2606.10302#S5.F2), we reuse the same 20 specifications and outputs as in Table [2](https://arxiv.org/html/2606.10302#S5.T2). We randomly split the 20 specifications into an estimation set of 10 and an evaluation set of 10 to estimate $H(Y \mid X)$, $H(Y \mid X, Z)$, and $H(Z \mid X)$, following the procedure in Section [3.4](https://arxiv.org/html/2606.10302#S3.SS4).
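The random 10/10 split described above could look like the following minimal sketch. The fixed seed and the function name `split_specifications` are assumptions added for reproducibility of the illustration; the entropy estimators themselves follow Section 3.4 and are not reproduced here.

```python
import random


def split_specifications(spec_ids, estimation_size=10, seed=0):
    """Randomly partition specification ids into estimation and evaluation sets."""
    rng = random.Random(seed)  # assumption: seeded for a repeatable split
    shuffled = list(spec_ids)
    rng.shuffle(shuffled)
    return shuffled[:estimation_size], shuffled[estimation_size:]


estimation_set, evaluation_set = split_specifications(range(20))
```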
For the diversity decomposition experiment in Figure [3](https://arxiv.org/html/2606.10302#S5.F3), we fix the total number of outputs at 20 and vary the number of outlines $m \in \{2, 5, 10\}$, generating 10, 4, and 2 outputs per outline, respectively.
LLM-based diversity and quality scoring is performed by GPT-5.3-Chat as the judge model. Embedding diversity (both at the output level and the structural level) is computed using OpenAI’s text-embedding-3-small model, measuring mean pairwise cosine distance ($1 - \text{cosine similarity}$) across the outputs.
All datasets and models used in this work are publicly available and used in accordance with their intended research use and terms of service\. Any released code, prompts, and generated artifacts will be distributed for research purposes only\.
### A.2 Additional Experiment Results
##### Additional Results for Sec. [5.1](https://arxiv.org/html/2606.10302#S5.SS1).
We provide additional results across datasets, backbone models, and metrics in this section\.
Table [4](https://arxiv.org/html/2606.10302#A1.T4) reports diversity and quality on NoveltyBench and Infinite Chats with DeepSeek-V3.2 and Mistral-Large-3. The same pattern as in Table [2](https://arxiv.org/html/2606.10302#S5.T2) holds: Level 2 methods achieve the highest diversity on most diversity metrics while maintaining comparable quality.
Tables [5](https://arxiv.org/html/2606.10302#A1.T5) and [6](https://arxiv.org/html/2606.10302#A1.T6) extend the evaluation to three more datasets: IBM ArgKP Bar-Haim et al. ([2020](https://arxiv.org/html/2606.10302#bib.bib34)), Poetry Foundation ([2023](https://arxiv.org/html/2606.10302#bib.bib41)), and WritingPrompts Huang et al. ([2024](https://arxiv.org/html/2606.10302#bib.bib33)). The same evaluation protocol is applied. Level 2 methods consistently outperform Level 0 and Level 1 baselines on all three diversity metrics (LLM Div, Emb Div, Struct Div) across models and datasets, while maintaining competitive quality scores.
Table [7](https://arxiv.org/html/2606.10302#A1.T7) and Table [8](https://arxiv.org/html/2606.10302#A1.T8) report lexical diversity for all methods across all datasets and models: Distinct-3, which measures the proportion of unique trigrams across outputs, and Self-BLEU. Level 2 methods generally improve or match lexical diversity across most settings. For DeepSeek-V3.2, high temperature produces high token-level randomness, resulting in lower Self-BLEU. However, this comes without the structural diversity gains that Level 2 methods provide (Tables [4](https://arxiv.org/html/2606.10302#A1.T4)–[6](https://arxiv.org/html/2606.10302#A1.T6)).
##### Additional Results for Sec. [5.2](https://arxiv.org/html/2606.10302#S5.SS2).
Table [9](https://arxiv.org/html/2606.10302#A1.T9) extends the transmission-score analysis to WritingPrompts, Poetry Foundation, and IBM ArgKP, reporting $T$, $I(Y;Z \mid X)$, and $H(Y \mid X)$ for Diverse Outline, Keyword, and SSOT. Level 2 methods achieve substantially higher transmission scores and realized diversity than SSOT across all three datasets.
Additionally, Figure [6](https://arxiv.org/html/2606.10302#A1.F6) shows the within-outline and between-outline embedding diversity decomposition for Diverse Outline and Keyword on these datasets. The same pattern as in Figure [3](https://arxiv.org/html/2606.10302#S5.F3) holds: between-outline diversity consistently exceeds within-outline diversity and grows with the number of outlines, while within-outline diversity remains flat, indicating that source variation contributes to the diversity gains of Level 2 methods across task types.
Table 4: Diversity and quality on Infinite Chats and NoveltyBench using DeepSeek-V3.2 and Mistral-Large-3. Methods are grouped by injection level. Bold indicates the best result. ↑ = higher is better.

| Model | Lv. | Method | Infinite Chats: LLM Div ↑ | Infinite Chats: Emb Div ↑ | Infinite Chats: Struct Div ↑ | Infinite Chats: Quality ↑ | NoveltyBench: LLM Div ↑ | NoveltyBench: Emb Div ↑ | NoveltyBench: Struct Div ↑ | NoveltyBench: Distinct-N ↑ | NoveltyBench: Quality ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3.2 | 0 | Direct | 3.37±1.59 | 0.29±0.16 | 0.21±0.07 | **7.51±0.55** | 3.39±1.44 | 0.42±0.16 | 0.25±0.08 | 17.87±4.00 | 7.39±0.49 |
| DeepSeek-V3.2 | 0 | Direct ($\tau = 1.5$) | 3.36±1.65 | 0.32±0.13 | 0.21±0.08 | 7.07±0.73 | 3.59±1.51 | 0.43±0.14 | 0.25±0.09 | **19.28±2.64** | 6.90±0.82 |
| DeepSeek-V3.2 | 0 | Verbalized | 3.03±1.30 | 0.31±0.15 | 0.19±0.06 | 6.88±1.00 | 3.22±1.14 | 0.39±0.15 | 0.22±0.06 | 11.55±4.11 | 6.79±0.72 |
| DeepSeek-V3.2 | 1 | SSOT | 2.86±0.96 | 0.26±0.16 | 0.19±0.06 | 6.97±1.09 | 3.25±0.77 | 0.38±0.13 | 0.22±0.07 | 11.18±4.40 | 6.53±1.28 |
| DeepSeek-V3.2 | 2 | Keyword (ours) | 5.88±1.44 | 0.40±0.13 | 0.25±0.06 | 7.50±0.81 | 6.77±1.85 | 0.47±0.10 | 0.29±0.07 | 17.41±3.75 | **7.77±0.80** |
| DeepSeek-V3.2 | 2 | Div Outline (ours) | **7.07±1.76** | **0.44±0.13** | **0.27±0.07** | 7.36±0.88 | **7.13±2.12** | **0.50±0.11** | **0.32±0.08** | 18.04±3.33 | 7.43±0.97 |
| Mistral-L3 | 0 | Direct | 2.33±0.88 | 0.28±0.17 | 0.13±0.06 | 6.92±0.68 | 2.59±0.87 | 0.45±0.13 | 0.16±0.07 | 14.48±5.77 | 6.80±0.58 |
| Mistral-L3 | 0 | Verbalized | 2.63±0.94 | 0.26±0.15 | 0.12±0.05 | 6.91±0.85 | 3.12±1.02 | 0.37±0.13 | 0.15±0.07 | 9.69±4.17 | 6.81±0.62 |
| Mistral-L3 | 1 | SSOT | 2.78±0.72 | 0.21±0.12 | 0.12±0.05 | 7.26±0.73 | 2.70±0.76 | 0.30±0.12 | 0.14±0.06 | 9.77±4.28 | 6.74±1.07 |
| Mistral-L3 | 2 | Keyword (ours) | 6.51±1.32 | 0.35±0.11 | 0.19±0.05 | **8.02±0.40** | 7.33±1.18 | 0.42±0.08 | 0.23±0.06 | 15.13±5.62 | **8.16±0.52** |
| Mistral-L3 | 2 | Div Outline (ours) | **7.65±1.67** | **0.40±0.12** | **0.23±0.07** | 7.86±0.46 | **7.84±1.72** | **0.46±0.10** | **0.26±0.08** | **15.74±5.46** | 7.91±0.89 |

Table 5: Diversity and quality on IBM ArgKP and Poetry Foundation. Methods are grouped by injection level.
DeepSeek\-V3\.2 and Qwen3\-4B include a high\-temperature baseline \(Direct atτ=1\.5\\tau\{=\}1\.5\)\.Bold= best per column within each model\.↑\\uparrow= higher is better\.IBM ArgKPPoetry FoundationLv\.MethodLLM Div↑\\uparrowEmb Div↑\\uparrowStruct Div↑\\uparrowQuality↑\\uparrowLLM Div↑\\uparrowEmb Div↑\\uparrowStruct Div↑\\uparrowQuality↑\\uparrowGPT\-5\-mini0Direct3\.51±\\pm1\.090\.23±\\pm0\.060\.23±\\pm0\.047\.26±\\pm0\.412\.44±\\pm0\.850\.11±\\pm0\.040\.13±\\pm0\.048\.30±\\pm0\.250Verbalized2\.95±\\pm0\.760\.19±\\pm0\.050\.22±\\pm0\.057\.30±\\pm0\.432\.99±\\pm0\.890\.11±\\pm0\.040\.15±\\pm0\.038\.40±\\pm0\.251SSOT2\.51±\\pm0\.470\.15±\\pm0\.070\.18±\\pm0\.046\.93±\\pm0\.543\.15±\\pm0\.670\.10±\\pm0\.020\.13±\\pm0\.038\.32±\\pm0\.21\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 6\.17±\\pm0\.92\\cellcolorblue\!10 0\.28±\\pm0\.04\\cellcolorblue\!10 0\.31±\\pm0\.05\\cellcolorblue\!10 7\.32±\\pm0\.20\\cellcolorblue\!10 7\.98±\\pm1\.25\\cellcolorblue\!10 0\.22±\\pm0\.06\\cellcolorblue\!10 0\.24±\\pm0\.04\\cellcolorblue\!10 7\.94±\\pm0\.85\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!107\.97±\\pm1\.07\\cellcolorblue\!100\.33±\\pm0\.07\\cellcolorblue\!100\.34±\\pm0\.05\\cellcolorblue\!107\.37±\\pm0\.30\\cellcolorblue\!108\.83±\\pm1\.29\\cellcolorblue\!100\.23±\\pm0\.06\\cellcolorblue\!100\.27±\\pm0\.04\\cellcolorblue\!10 8\.18±\\pm0\.85DeepSeek\-V3\.20Direct3\.71±\\pm1\.320\.30±\\pm0\.050\.21±\\pm0\.057\.44±\\pm0\.212\.90±\\pm1\.260\.23±\\pm0\.080\.17±\\pm0\.047\.99±\\pm0\.320Direct \(τ=1\.5\\tau\{=\}1\.5\)3\.58±\\pm1\.270\.30±\\pm0\.080\.23±\\pm0\.057\.26±\\pm0\.332\.80±\\pm1\.020\.25±\\pm0\.080\.15±\\pm0\.047\.73±\\pm0\.370Verbalized2\.32±\\pm0\.480\.20±\\pm0\.030\.18±\\pm0\.047\.36±\\pm0\.253\.23±\\pm0\.950\.25±\\pm0\.070\.16±\\pm0\.048\.01±\\pm0\.271SSOT2\.29±\\pm0\.540\.13±\\pm0\.020\.14±\\pm0\.037\.53±\\pm0\.254\.26±\\pm1\.170\.16±\\pm0\.050\.17±\\pm0\.047\.97±\\pm0\.36\\cellcolorblue\!10 
2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 5\.32±\\pm1\.30\\cellcolorblue\!10 0\.29±\\pm0\.05\\cellcolorblue\!10 0\.24±\\pm0\.04\\cellcolorblue\!107\.73±\\pm0\.23\\cellcolorblue\!10 7\.39±\\pm1\.44\\cellcolorblue\!10 0\.30±\\pm0\.06\\cellcolorblue\!10 0\.22±\\pm0\.05\\cellcolorblue\!10 8\.10±\\pm0\.25\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!107\.66±\\pm1\.52\\cellcolorblue\!100\.35±\\pm0\.05\\cellcolorblue\!100\.30±\\pm0\.05\\cellcolorblue\!10 7\.71±\\pm0\.19\\cellcolorblue\!108\.78±\\pm1\.03\\cellcolorblue\!100\.37±\\pm0\.07\\cellcolorblue\!100\.23±\\pm0\.05\\cellcolorblue\!108\.18±\\pm0\.19Mistral\-L30Direct2\.60±\\pm0\.930\.24±\\pm0\.090\.11±\\pm0\.066\.66±\\pm0\.371\.92±\\pm0\.380\.14±\\pm0\.050\.08±\\pm0\.037\.61±\\pm0\.340Verbalized2\.22±\\pm0\.430\.14±\\pm0\.030\.10±\\pm0\.047\.14±\\pm0\.272\.61±\\pm0\.600\.19±\\pm0\.050\.10±\\pm0\.037\.76±\\pm0\.261SSOT2\.21±\\pm0\.370\.09±\\pm0\.020\.09±\\pm0\.027\.39±\\pm0\.254\.17±\\pm0\.940\.12±\\pm0\.030\.10±\\pm0\.038\.01±\\pm0\.36\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 5\.79±\\pm1\.51\\cellcolorblue\!10 0\.26±\\pm0\.04\\cellcolorblue\!10 0\.17±\\pm0\.03\\cellcolorblue\!107\.94±\\pm0\.22\\cellcolorblue\!10 6\.96±\\pm0\.98\\cellcolorblue\!10 0\.20±\\pm0\.05\\cellcolorblue\!10 0\.15±\\pm0\.04\\cellcolorblue\!10 8\.18±\\pm0\.31\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!107\.93±\\pm1\.90\\cellcolorblue\!100\.37±\\pm0\.05\\cellcolorblue\!100\.21±\\pm0\.03\\cellcolorblue\!10 7\.81±\\pm0\.28\\cellcolorblue\!108\.03±\\pm1\.17\\cellcolorblue\!100\.26±\\pm0\.06\\cellcolorblue\!100\.16±\\pm0\.04\\cellcolorblue\!108\.19±\\pm0\.28Qwen3\-4B0Direct1\.72±\\pm0\.400\.14±\\pm0\.040\.12±\\pm0\.036\.89±\\pm0\.332\.46±\\pm0\.520\.10±\\pm0\.060\.11±\\pm0\.037\.47±\\pm0\.770Direct 
\(τ=1\.5\\tau\{=\}1\.5\)2\.02±\\pm0\.820\.14±\\pm0\.070\.12±\\pm0\.036\.78±\\pm0\.522\.83±\\pm0\.570\.10±\\pm0\.050\.12±\\pm0\.047\.13±\\pm0\.870G21\.59±\\pm0\.410\.07±\\pm0\.020\.16±\\pm0\.057\.03±\\pm0\.252\.46±\\pm0\.480\.08±\\pm0\.030\.19±\\pm0\.057\.19±\\pm0\.960Verbalized1\.76±\\pm0\.410\.12±\\pm0\.030\.12±\\pm0\.036\.97±\\pm0\.302\.56±\\pm0\.620\.12±\\pm0\.040\.12±\\pm0\.047\.45±\\pm0\.521SSOT1\.64±\\pm0\.370\.09±\\pm0\.040\.15±\\pm0\.046\.98±\\pm0\.282\.96±\\pm0\.640\.19±\\pm0\.080\.15±\\pm0\.047\.29±\\pm0\.58\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 5\.75±\\pm1\.43\\cellcolorblue\!10 0\.27±\\pm0\.06\\cellcolorblue\!10 0\.22±\\pm0\.04\\cellcolorblue\!10 7\.38±\\pm0\.37\\cellcolorblue\!10 6\.36±\\pm0\.97\\cellcolorblue\!10 0\.29±\\pm0\.05\\cellcolorblue\!10 0\.21±\\pm0\.04\\cellcolorblue\!108\.11±\\pm0\.17\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!106\.08±\\pm1\.39\\cellcolorblue\!100\.31±\\pm0\.07\\cellcolorblue\!100\.25±\\pm0\.04\\cellcolorblue\!107\.42±\\pm0\.29\\cellcolorblue\!107\.21±\\pm1\.57\\cellcolorblue\!100\.29±\\pm0\.06\\cellcolorblue\!100\.23±\\pm0\.05\\cellcolorblue\!10 8\.04±\\pm0\.23Table 6:Diversity and quality on WritingPrompts\. Methods are grouped by injection level\. 
DeepSeek\-V3\.2 and Qwen3\-4B include a high\-temperature baseline \(Direct atτ=1\.5\\tau\{=\}1\.5\)\.Bold= best per column within each model\.↑\\uparrow= higher is better\.WritingPromptsLv\.MethodLLM Div↑\\uparrowEmb Div↑\\uparrowStruct Div↑\\uparrowQuality↑\\uparrowGPT\-5\-mini0Direct4\.96±\\pm2\.380\.28±\\pm0\.100\.22±\\pm0\.088\.59±\\pm0\.210Verbalized4\.69±\\pm1\.660\.25±\\pm0\.070\.20±\\pm0\.068\.71±\\pm0\.201SSOT3\.55±\\pm1\.360\.21±\\pm0\.080\.17±\\pm0\.068\.17±\\pm1\.50\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 6\.16±\\pm1\.31\\cellcolorblue\!10 0\.29±\\pm0\.05\\cellcolorblue\!10 0\.23±\\pm0\.05\\cellcolorblue\!108\.86±\\pm0\.14\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!108\.71±\\pm1\.07\\cellcolorblue\!100\.37±\\pm0\.07\\cellcolorblue\!100\.29±\\pm0\.06\\cellcolorblue\!10 8\.73±\\pm0\.18DeepSeek\-V3\.20Direct5\.31±\\pm1\.980\.34±\\pm0\.110\.21±\\pm0\.068\.21±\\pm0\.370Direct \(τ=1\.5\\tau\{=\}1\.5\)4\.74±\\pm2\.100\.29±\\pm0\.130\.20±\\pm0\.067\.91±\\pm0\.430Verbalized4\.73±\\pm1\.570\.29±\\pm0\.100\.19±\\pm0\.058\.19±\\pm0\.371SSOT3\.74±\\pm1\.070\.24±\\pm0\.090\.17±\\pm0\.048\.01±\\pm1\.00\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 6\.93±\\pm1\.27\\cellcolorblue\!10 0\.34±\\pm0\.06\\cellcolorblue\!10 0\.23±\\pm0\.05\\cellcolorblue\!108\.40±\\pm0\.27\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!108\.07±\\pm1\.94\\cellcolorblue\!100\.38±\\pm0\.10\\cellcolorblue\!100\.26±\\pm0\.06\\cellcolorblue\!10 8\.16±\\pm1\.46Mistral\-L30Direct2\.46±\\pm1\.020\.21±\\pm0\.120\.12±\\pm0\.057\.46±\\pm0\.470Verbalized3\.26±\\pm1\.360\.23±\\pm0\.090\.13±\\pm0\.047\.74±\\pm0\.731SSOT3\.76±\\pm1\.500\.20±\\pm0\.080\.13±\\pm0\.047\.56±\\pm1\.83\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 6\.15±\\pm1\.45\\cellcolorblue\!10 0\.29±\\pm0\.06\\cellcolorblue\!10 
0\.18±\\pm0\.04\\cellcolorblue\!108\.35±\\pm0\.26\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!107\.86±\\pm2\.05\\cellcolorblue\!100\.34±\\pm0\.07\\cellcolorblue\!100\.20±\\pm0\.05\\cellcolorblue\!10 8\.13±\\pm1\.19Qwen3\-4B0Direct1\.89±\\pm0\.720\.15±\\pm0\.090\.11±\\pm0\.057\.10±\\pm0\.860Direct \(τ=1\.5\\tau\{=\}1\.5\)2\.22±\\pm0\.810\.16±\\pm0\.090\.12±\\pm0\.046\.79±\\pm1\.020Verbalized2\.06±\\pm0\.980\.14±\\pm0\.070\.11±\\pm0\.046\.88±\\pm0\.730G22\.00±\\pm0\.690\.09±\\pm0\.030\.22±\\pm0\.056\.81±\\pm1\.011SSOT2\.12±\\pm0\.830\.14±\\pm0\.070\.15±\\pm0\.056\.56±\\pm0\.89\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 5\.74±\\pm1\.57\\cellcolorblue\!10 0\.30±\\pm0\.05\\cellcolorblue\!10 0\.20±\\pm0\.05\\cellcolorblue\!108\.19±\\pm0\.21\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!106\.51±\\pm1\.80\\cellcolorblue\!100\.34±\\pm0\.06\\cellcolorblue\!100\.22±\\pm0\.06\\cellcolorblue\!10 8\.04±\\pm0\.29Table 7:Lexical diversity \(Distinct\-3 and Self\-BLEU\) on Infinite Chats and NoveltyBench\. Methods are grouped by injection level\. 
DeepSeek\-V3\.2 and Qwen3\-4B include a high\-temperature baseline \(Direct atτ=1\.5\\tau\{=\}1\.5\)\.Bold= best per column within each model\.↑\\uparrow= higher is better,↓\\downarrow= lower is better\.Infinite ChatsNoveltyBenchLv\.MethodDistinct\-3↑\\uparrowSelf\-BLEU↓\\downarrowDistinct\-3↑\\uparrowSelf\-BLEU↓\\downarrowGPT\-5\-mini0Direct0\.87±\\pm0\.090\.26±\\pm0\.140\.89±\\pm0\.070\.23±\\pm0\.110Verbalized0\.88±\\pm0\.070\.26±\\pm0\.110\.87±\\pm0\.060\.28±\\pm0\.101SSOT0\.78±\\pm0\.110\.38±\\pm0\.140\.76±\\pm0\.110\.42±\\pm0\.15\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.91±\\pm0\.07\\cellcolorblue\!10 0\.21±\\pm0\.10\\cellcolorblue\!10 0\.92±\\pm0\.04\\cellcolorblue\!10 0\.17±\\pm0\.07\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.94±\\pm0\.04\\cellcolorblue\!100\.12±\\pm0\.07\\cellcolorblue\!100\.95±\\pm0\.04\\cellcolorblue\!100\.11±\\pm0\.06DeepSeek\-V3\.20Direct0\.91±\\pm0\.050\.19±\\pm0\.090\.90±\\pm0\.060\.21±\\pm0\.090Direct \(τ=1\.5\\tau\{=\}1\.5\)0\.98±\\pm0\.040\.06±\\pm0\.050\.97±\\pm0\.100\.07±\\pm0\.050Verbalized0\.85±\\pm0\.080\.31±\\pm0\.110\.82±\\pm0\.120\.34±\\pm0\.141SSOT0\.77±\\pm0\.120\.40±\\pm0\.150\.78±\\pm0\.140\.37±\\pm0\.13\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.92±\\pm0\.06\\cellcolorblue\!10 0\.17±\\pm0\.09\\cellcolorblue\!10 0\.94±\\pm0\.04\\cellcolorblue\!10 0\.14±\\pm0\.06\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!10 0\.93±\\pm0\.05\\cellcolorblue\!10 0\.16±\\pm0\.09\\cellcolorblue\!10 0\.92±\\pm0\.09\\cellcolorblue\!10 0\.15±\\pm0\.07Mistral\-L30Direct0\.62±\\pm0\.230\.57±\\pm0\.270\.70±\\pm0\.170\.46±\\pm0\.200Verbalized0\.71±\\pm0\.120\.49±\\pm0\.130\.70±\\pm0\.130\.50±\\pm0\.141SSOT0\.71±\\pm0\.110\.48±\\pm0\.130\.66±\\pm0\.150\.52±\\pm0\.16\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.89±\\pm0\.05\\cellcolorblue\!10 
0\.22±\\pm0\.08\\cellcolorblue\!100\.91±\\pm0\.03\\cellcolorblue\!10 0\.18±\\pm0\.05\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.91±\\pm0\.05\\cellcolorblue\!100\.19±\\pm0\.08\\cellcolorblue\!10 0\.91±\\pm0\.09\\cellcolorblue\!100\.17±\\pm0\.06Qwen3\-4B0Direct0\.79±\\pm0\.120\.37±\\pm0\.150\.78±\\pm0\.130\.37±\\pm0\.140Direct \(τ=1\.5\\tau\{=\}1\.5\)0\.86±\\pm0\.110\.28±\\pm0\.150\.81±\\pm0\.130\.34±\\pm0\.150Verbalized0\.78±\\pm0\.100\.40±\\pm0\.130\.77±\\pm0\.180\.36±\\pm0\.130G20\.71±\\pm0\.110\.47±\\pm0\.130\.80±\\pm0\.060\.37±\\pm0\.081SSOT0\.67±\\pm0\.120\.51±\\pm0\.130\.68±\\pm0\.120\.51±\\pm0\.14\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.88±\\pm0\.07\\cellcolorblue\!100\.23±\\pm0\.08\\cellcolorblue\!10 0\.87±\\pm0\.10\\cellcolorblue\!10 0\.24±\\pm0\.07\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.88±\\pm0\.06\\cellcolorblue\!10 0\.23±\\pm0\.09\\cellcolorblue\!100\.88±\\pm0\.06\\cellcolorblue\!100\.24±\\pm0\.09Table 8:Lexical diversity \(Distinct\-3 and Self\-BLEU\) on WritingPrompts, IBM ArgKP, and Poetry Foundation\. Methods are grouped by injection level\. 
DeepSeek\-V3\.2 and Qwen3\-4B include a high\-temperature baseline \(Direct atτ=1\.5\\tau\{=\}1\.5\)\.Bold= best per column within each model\.↑\\uparrow= higher is better,↓\\downarrow= lower is better\.WritingPromptsIBM ArgKPPoetry FoundationLv\.MethodDistinct\-3↑\\uparrowSelf\-BLEU↓\\downarrowDistinct\-3↑\\uparrowSelf\-BLEU↓\\downarrowDistinct\-3↑\\uparrowSelf\-BLEU↓\\downarrowGPT\-5\-mini0Direct0\.92±\\pm0\.070\.18±\\pm0\.110\.94±\\pm0\.020\.16±\\pm0\.040\.88±\\pm0\.040\.25±\\pm0\.050Verbalized0\.92±\\pm0\.050\.18±\\pm0\.090\.91±\\pm0\.020\.23±\\pm0\.040\.90±\\pm0\.030\.22±\\pm0\.051SSOT0\.84±\\pm0\.160\.27±\\pm0\.110\.81±\\pm0\.030\.38±\\pm0\.040\.88±\\pm0\.030\.25±\\pm0\.05\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.94±\\pm0\.04\\cellcolorblue\!10 0\.14±\\pm0\.06\\cellcolorblue\!10 0\.91±\\pm0\.01\\cellcolorblue\!10 0\.22±\\pm0\.02\\cellcolorblue\!10 0\.85±\\pm0\.09\\cellcolorblue\!10 0\.16±\\pm0\.05\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.95±\\pm0\.03\\cellcolorblue\!100\.10±\\pm0\.05\\cellcolorblue\!100\.96±\\pm0\.01\\cellcolorblue\!100\.11±\\pm0\.03\\cellcolorblue\!10 0\.88±\\pm0\.09\\cellcolorblue\!100\.14±\\pm0\.05DeepSeek\-V3\.20Direct0\.93±\\pm0\.040\.17±\\pm0\.070\.95±\\pm0\.020\.13±\\pm0\.040\.91±\\pm0\.050\.21±\\pm0\.080Direct \(τ=1\.5\\tau\{=\}1\.5\)0\.98±\\pm0\.020\.07±\\pm0\.050\.98±\\pm0\.010\.06±\\pm0\.030\.97±\\pm0\.020\.09±\\pm0\.050Verbalized0\.91±\\pm0\.040\.20±\\pm0\.070\.89±\\pm0\.020\.28±\\pm0\.040\.91±\\pm0\.060\.21±\\pm0\.071SSOT0\.87±\\pm0\.050\.28±\\pm0\.080\.78±\\pm0\.040\.43±\\pm0\.050\.86±\\pm0\.040\.28±\\pm0\.04\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.93±\\pm0\.03\\cellcolorblue\!10 0\.17±\\pm0\.05\\cellcolorblue\!10 0\.92±\\pm0\.02\\cellcolorblue\!10 0\.20±\\pm0\.04\\cellcolorblue\!10 0\.87±\\pm0\.08\\cellcolorblue\!10 0\.17±\\pm0\.05\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!10 
0\.91±\\pm0\.16\\cellcolorblue\!10 0\.13±\\pm0\.05\\cellcolorblue\!10 0\.94±\\pm0\.01\\cellcolorblue\!10 0\.16±\\pm0\.03\\cellcolorblue\!10 0\.90±\\pm0\.04\\cellcolorblue\!10 0\.13±\\pm0\.05Mistral\-L30Direct0\.51±\\pm0\.190\.70±\\pm0\.220\.69±\\pm0\.200\.50±\\pm0\.250\.35±\\pm0\.110\.89±\\pm0\.120Verbalized0\.79±\\pm0\.080\.40±\\pm0\.100\.74±\\pm0\.060\.48±\\pm0\.070\.80±\\pm0\.040\.38±\\pm0\.061SSOT0\.77±\\pm0\.180\.37±\\pm0\.090\.69±\\pm0\.040\.52±\\pm0\.050\.81±\\pm0\.040\.35±\\pm0\.05\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.88±\\pm0\.03\\cellcolorblue\!10 0\.24±\\pm0\.05\\cellcolorblue\!10 0\.89±\\pm0\.02\\cellcolorblue\!10 0\.24±\\pm0\.04\\cellcolorblue\!10 0\.83±\\pm0\.03\\cellcolorblue\!10 0\.27±\\pm0\.04\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.89±\\pm0\.13\\cellcolorblue\!100\.18±\\pm0\.05\\cellcolorblue\!100\.91±\\pm0\.02\\cellcolorblue\!100\.18±\\pm0\.03\\cellcolorblue\!100\.87±\\pm0\.03\\cellcolorblue\!100\.22±\\pm0\.04Qwen3\-4B0Direct0\.79±\\pm0\.080\.39±\\pm0\.100\.85±\\pm0\.030\.32±\\pm0\.060\.80±\\pm0\.060\.37±\\pm0\.070Direct \(τ=1\.5\\tau\{=\}1\.5\)0\.83±\\pm0\.100\.33±\\pm0\.130\.87±\\pm0\.100\.26±\\pm0\.160\.89±\\pm0\.050\.23±\\pm0\.070Verbalized0\.78±\\pm0\.080\.40±\\pm0\.090\.82±\\pm0\.030\.36±\\pm0\.030\.84±\\pm0\.050\.32±\\pm0\.070G20\.71±\\pm0\.090\.48±\\pm0\.100\.70±\\pm0\.040\.51±\\pm0\.040\.80±\\pm0\.060\.37±\\pm0\.071SSOT0\.71±\\pm0\.080\.47±\\pm0\.090\.71±\\pm0\.040\.50±\\pm0\.040\.82±\\pm0\.090\.29±\\pm0\.05\\cellcolorblue\!10 2\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.89±\\pm0\.03\\cellcolorblue\!10 0\.22±\\pm0\.04\\cellcolorblue\!100\.89±\\pm0\.03\\cellcolorblue\!100\.23±\\pm0\.05\\cellcolorblue\!10 0\.89±\\pm0\.04\\cellcolorblue\!10 0\.19±\\pm0\.05\\cellcolorblue\!10 2\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.90±\\pm0\.03\\cellcolorblue\!100\.20±\\pm0\.05\\cellcolorblue\!10 0\.89±\\pm0\.02\\cellcolorblue\!10 
0\.23±\\pm0\.04\\cellcolorblue\!10 0\.89±\\pm0\.04\\cellcolorblue\!100\.19±\\pm0\.05Figure 6:Within\-outline diversity and between\-outline diversity for Diverse Outline and Keyword, acrossm∈\{2,5,10\}m\\in\\\{2,5,10\\\}outlines on WritingPrompts, ArgKP and Poetry Foundation\. Between\-outline diversity consistently exceeds within\-outline diversity and grows withmm, while within\-outline diversity remains flat\.Table 9:Transmission scoreTT, realized diversityI\(Y;Z∣X\)I\(Y;Z\\mid X\), and output entropyH\(Y∣X\)H\(Y\\mid X\)on Qwen3\-4B for WritingPrompts, Poetry Foundation, and IBM ArgKP\.TTandI\(Y;Z∣X\)I\(Y;Z\\mid X\)are undefined for Level 0 methods\. Level 2 methods consistently achieve higherTTandI\(Y;Z∣X\)I\(Y;Z\\mid X\)than the Level 1 baseline\.DataMethodTT↑\\uparrowI\(Y;Z∣X\)I\(Y;Z\{\\mid\}X\)↑\\uparrowH\(Y∣X\)H\(Y\{\\mid\}X\)↑\\uparrowWritingPrompts\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.463\\cellcolorblue\!10 1\.46\\cellcolorblue\!10 3\.08\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.388\\cellcolorblue\!10 1\.12\\cellcolorblue\!10 2\.85SSOT0\.0030\.000\.64PoetryFound\.\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.243\\cellcolorblue\!10 0\.94\\cellcolorblue\!10 3\.86\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.165\\cellcolorblue\!10 0\.65\\cellcolorblue\!10 3\.94SSOT0\.0000\.001\.22IBMArgKP\\cellcolorblue\!10 Div Outline \(ours\)\\cellcolorblue\!100\.565\\cellcolorblue\!10 1\.61\\cellcolorblue\!10 2\.82\\cellcolorblue\!10 Keyword \(ours\)\\cellcolorblue\!10 0\.457\\cellcolorblue\!10 1\.25\\cellcolorblue\!10 2\.70SSOT0\.0020\.000\.82
### A\.3Experiment Prompts and Example Outputs
#### A\.3\.1Diverse Outline: Prompts and Generation Details
As described in Section [4](https://arxiv.org/html/2606.10302#S4), Diverse Outline generates all $m$ specifications in a single LLM call, then conditions on each to produce one output. We provide the prompt details for both steps below.
##### Outline generation\.
Each specification $z_i$ takes the form of a compact keyword outline: a list of 4-6 keywords capturing tone, format, perspective, and key focus. The system prompt instructs the model to produce all outlines in one response, requiring that no two outlines share more than one keyword. Because the model sees its own previous outlines while generating each subsequent one, it can steer each new outline away from the earlier ones, which encourages diversity across the full set. The full prompt is shown in Figure [7](https://arxiv.org/html/2606.10302#A1.F7).
System Prompt: Diverse Outline Generation

You are a diversity planning engine. Goal: produce {N} DISTINCT keyword outlines for the SAME request. Rules:

1. Each outline is a compact list of 4–6 keywords/phrases capturing: tone, format, perspective, and key focus. Good examples: “instructional tone”, “bullet-point format”, “second-person”, “focus on failure cases”.
2. Maximize diversity: no two outlines should share more than 1 keyword.
3. Keywords must be task-specific - encode what makes each approach different for this particular request.

Output JSON only:
```
{
"outlines": [
{ "id": 1,
"keywords": ["keyword1", ..., "keyword4"] },
...
]
}
```
User Prompt

Task: {TASK}

Figure 7: Prompt used to generate $m$ diverse keyword outlines in one batch call in Diverse Outline.
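The JSON returned by this prompt can be checked against the two stated rules (4 to 6 keywords per outline, at most one shared keyword between any two outlines). The sketch below is one possible post-hoc check, not part of the described method: the function name `validate_outlines` is hypothetical, and matching keywords by case-insensitive exact string comparison is an assumption, since the prompt itself does not define when two keywords count as the same.

```python
import json
from itertools import combinations


def validate_outlines(raw_json, min_keywords=4, max_keywords=6, max_shared=1):
    """Return a list of human-readable rule violations (empty if none)."""
    outlines = json.loads(raw_json)["outlines"]
    problems = []
    normalized = {}

    for outline in outlines:
        # Assumption: keywords are compared after trimming and lowercasing.
        keywords = {kw.strip().lower() for kw in outline["keywords"]}
        normalized[outline["id"]] = keywords
        if not min_keywords <= len(keywords) <= max_keywords:
            problems.append(
                f"outline {outline['id']} has {len(keywords)} distinct keywords"
            )

    # Rule 2 of the prompt: no two outlines share more than one keyword.
    for (id_a, kws_a), (id_b, kws_b) in combinations(normalized.items(), 2):
        shared = kws_a & kws_b
        if len(shared) > max_shared:
            problems.append(
                f"outlines {id_a} and {id_b} share {sorted(shared)}"
            )
    return problems
```

A caller could, for example, re-query the model when the returned list is non-empty; whether the original pipeline does any such filtering is not stated here.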
##### Output generation\.
Each outline is then used to condition the generation of one output\. The model is guided by the keyword list but not rigidly required to reproduce every keyword verbatim; the keywords shape the overall tone, format, and focus of the response\. The generation prompt is shown in Figure[8](https://arxiv.org/html/2606.10302#A1.F8)\.
System Prompt: Output Generation (Diverse Outline)

You are a helpful AI assistant. Your task is to respond to the given task, guided by the provided keywords.

*Input:*

- Task: the original request (follow ALL constraints in it, e.g. length, format, topic).
- Outline: a JSON object with axis:value pairs describing structural choices (e.g. narrative\_pov, tone, structure, theme).

*Instructions:*

1. Strictly follow any constraints in the Task.
2. Let each keyword shape the tone, format, and focus of your response.
3. Keep your response to approximately 200 words.
4. Do not list or explain the keywords. Output ONLY the final response text.

Figure 8: Prompt used to generate outputs conditioned on a Diverse Outline specification.
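The two steps above (one batch call for all outlines, then one call per outline) can be sketched as a small driver. This is an illustrative sketch under stated assumptions: `llm` is a hypothetical callable taking a system and a user message and returning text, the prompts are passed in by the caller, and the exact layout of the second-stage user message (`Task: ...` followed by `Outline: ...`) is assumed, since only its two inputs are listed in the prompt.

```python
import json
from typing import Callable, Dict, List, Tuple


def diverse_outline_generate(
    task: str,
    m: int,
    llm: Callable[[str, str], str],  # hypothetical: (system, user) -> text
    outline_system_prompt: str,      # template containing the "{N}" placeholder
    output_system_prompt: str,
) -> Tuple[List[Dict], List[str]]:
    """Two-stage generation: one call for m outlines, then one call per outline."""
    # Stage 1: a single call returns all outlines as JSON.
    system = outline_system_prompt.replace("{N}", str(m))
    raw = llm(system, f"Task: {task}")
    outlines = json.loads(raw)["outlines"][:m]

    # Stage 2: condition on each outline to produce one output.
    outputs = []
    for outline in outlines:
        # Assumed user-message layout; not specified in the source.
        user = f"Task: {task}\nOutline: {json.dumps(outline['keywords'])}"
        outputs.append(llm(output_system_prompt, user))
    return outlines, outputs
```

With $m$ outlines this makes $m + 1$ model calls in total; error handling for malformed JSON is omitted to keep the sketch short.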
#### A\.3\.2Keyword: Axis Generation and Combination Strategy
As described in Section [4](https://arxiv.org/html/2606.10302#S4), Keyword first generates a set of dimensions and candidate values, then selects $m$ combinations to construct specifications. We provide the prompt and selection details below.
##### Axis generation\.
For each prompt, we generate $A = 4$ dimensions (referred to as axes in the prompt) and $V = 8$ values each in a single call. Axes are required to be orthogonal: selecting a value on one axis should not constrain choices on other axes. The dimensions are generated once per prompt and shared across all outputs. The full prompt is shown in Figure [9](https://arxiv.org/html/2606.10302#A1.F9).
System Prompt: Axis Generation

You are a creative diversity planner. Generate {N\_KEYS} independent structural dimensions (axes) that capture the most impactful creative choices for responses to the given task. For each axis, generate {N\_VALUES} distinct values representing meaningfully different choices. Axes must be orthogonal - choosing a value for one axis should not constrain choices for other axes.

Output JSON only:
```
{
"axes": [
{
"key": "narrative_pov",
"label": "Narrative POV",
"values": ["first-person unreliable",
"third-person omniscient", ...]
}, ...
]
}
```
User Prompt

Task type: {TASK\_TYPE}
Task: {TASK}
Generate {N\_KEYS} axes with {N\_VALUES} values each.

Figure 9: Prompt used to generate diversity axes for Keyword. We use $A = 4$ axes and $V = 8$ values per axis.
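As a minimal sketch of how the axis JSON might be consumed, the helpers below parse the response, check that it contains the expected number of axes and values, and turn a tuple of value indices into an axis:value outline. The function names `parse_axes` and `combination_to_outline` are hypothetical, indices are 0-based for convenience, and rejecting malformed responses with an exception is an assumption rather than documented behavior.

```python
import json


def parse_axes(raw_json, num_axes=4, num_values=8):
    """Parse the axis-generation response and check its shape."""
    axes = json.loads(raw_json)["axes"]
    if len(axes) != num_axes:
        raise ValueError(f"expected {num_axes} axes, got {len(axes)}")
    for axis in axes:
        if len(set(axis["values"])) != num_values:
            raise ValueError(
                f"axis {axis['key']!r} does not have {num_values} distinct values"
            )
    return axes


def combination_to_outline(axes, combo):
    """Map a tuple of 0-based value indices to an axis:value outline dict."""
    return {axis["key"]: axis["values"][idx] for axis, idx in zip(axes, combo)}
```

For example, with four parsed axes, `combination_to_outline(axes, (0, 1, 2, 3))` picks the first value on the first axis, the second value on the second axis, and so on, yielding the JSON object that the output-generation prompt receives as its Outline input.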
##### Combination selection\.
Given $A$ axes each with $V$ values, there are $V^A = 8^4 = 4{,}096$ possible axis-value combinations. We select $m = 20$ combinations using a *greedy Hamming-distance maximization* strategy.

Formally, represent each combination as a tuple $\mathbf{c} = (v_1, v_2, v_3, v_4) \in \{1, \ldots, V\}^A$. The Hamming distance between two combinations $\mathbf{c}$ and $\mathbf{c}'$ is the number of axes on which they differ:

$$d_H(\mathbf{c}, \mathbf{c}') = \sum_{a=1}^{A} \mathbf{1}[v_a \neq v'_a].$$

Two combinations with $d_H = A$ differ on every axis; two with $d_H = 0$ are identical.

The algorithm proceeds greedily:

1. Draw a random combination $\mathbf{c}_1$ uniformly from the full space as the seed.
2. For $i = 2, \ldots, m$: select the candidate $\mathbf{c}^*$ that maximizes its *minimum* Hamming distance to all already-selected combinations,

   $$\mathbf{c}_i = \arg\max_{\mathbf{c} \notin \mathcal{S}} \min_{\mathbf{c}' \in \mathcal{S}} d_H(\mathbf{c}, \mathbf{c}'),$$

   where $\mathcal{S} = \{\mathbf{c}_1, \ldots, \mathbf{c}_{i-1}\}$ is the set of already-selected combinations. Ties are broken randomly.

This max-min criterion acts as a greedy farthest-point sampler in Hamming space, ensuring that every newly added combination is as different as possible from all existing ones on as many axes as possible. For $m = 20$ and $A = 4$, the resulting set typically achieves a minimum pairwise Hamming distance of $\geq 2$, meaning no two selected outlines agree on more than $A - 2 = 2$ axes simultaneously.
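The selection procedure above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the names `hamming` and `greedy_max_min_hamming` are hypothetical, value indices run from 0 to $V - 1$ rather than 1 to $V$, and a fixed seed is used so that the random seed draw and random tie-breaking are reproducible.

```python
import itertools
import random


def hamming(c1, c2):
    """Number of axes on which two combinations differ."""
    return sum(a != b for a, b in zip(c1, c2))


def greedy_max_min_hamming(num_axes=4, num_values=8, m=20, seed=0):
    """Greedy farthest-point selection of m combinations in Hamming space."""
    rng = random.Random(seed)
    candidates = list(itertools.product(range(num_values), repeat=num_axes))
    if m > len(candidates):
        raise ValueError("m exceeds the number of possible combinations")

    # Step 1: uniformly random seed combination.
    selected = [rng.choice(candidates)]
    # min_dist[c] = minimum Hamming distance from c to the selected set.
    # Already-selected combinations have distance 0 and are never re-picked.
    min_dist = {c: hamming(c, selected[0]) for c in candidates}

    # Step 2: repeatedly add the candidate farthest from the selected set.
    while len(selected) < m:
        best = max(min_dist.values())
        ties = [c for c, d in min_dist.items() if d == best]
        chosen = rng.choice(ties)  # random tie-breaking
        selected.append(chosen)
        for c in candidates:
            d = hamming(c, chosen)
            if d < min_dist[c]:
                min_dist[c] = d
    return selected
```

Maintaining the running minimum distance per candidate keeps each step linear in the size of the candidate space (4,096 tuples here), so the full selection is cheap. With $V = 8$ the first eight picks can always differ from one another on every axis, after which the minimum distance necessarily drops below $A$.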
##### Output generation\.
Each of the $m$ selected combinations is used as a structured outline to condition one output. The generation prompt is shown in Figure [10](https://arxiv.org/html/2606.10302#A1.F10). The model is explicitly instructed that every axis value must be clearly and visibly present in the generated output.
System Prompt: Output Generation (Keyword)

You are a helpful AI assistant. Your task is to respond to the given task, strictly manifesting the structural choices in the Outline.

*Input:*

- Task: the original request (follow ALL constraints in it, e.g. length, format, topic).
- Outline: a JSON object with axis:value pairs that you MUST embody (e.g. narrative\_pov, tone, structure, theme).

*Instructions:*

1. Strictly follow any constraints in the Task.
2. EACH axis value in the Outline MUST be clearly and visibly present in your response. A reader should be able to identify every axis value directly from your text without being told.
3. Keep your response to approximately 200 words.
4. Do not list or explain the axis values. Output ONLY the final response text.
Figure 10: Prompt used to generate outputs conditioned on a Keyword specification.

Table 10: Example outputs from GPT-5-mini on the question “Give one popular name for a baby boy in Papua New Guinea.” Each row shows one representative output; the group label in brackets is the equivalence class assigned by the evaluator. The Direct baseline produces outputs all following the template “[English Christian name] is popular due to missionary influence…” Both Level-2 methods produce outputs differing in name choice, language variety (English, Tok Pisin, Hiri Motu, oral-tradition style), and regional framing.

| Direct (Baseline) | Diverse Outline (ours) | Keyword Combo (ours) |
|---|---|---|
| **John** is a widely used and familiar name for baby boys in Papua New Guinea. Due to the strong influence of Christian missionaries and the legacy of English, John has become one of the most common names… | (informal, English) Try: **John** - “God is gracious.” This is a friendly, simple name you’ll see often on newborn lists from Lae urban hospitals. It works well in both English and Tok Pisin. | (modern global name, island provinces) **Michael** - Modern global / borrowed name. Popular baby-boy choice in Papua New Guinea’s island provinces, often adopted by Christian families from Manus and New Ireland. |
| **Michael** is a popular and recognizable name for baby boys in Papua New Guinea, reflecting the ongoing cultural influence of Christianity and English-language traditions across the country… | (authoritative, Papuan Malay register) Recommendation: **Johan**. From the Oro Province provincial census baby name sample, Johan is one of the top-five recorded male given names for the 2019–2023 cohort. Its Papuan Malay origin distinguishes it from the more common anglicised John. | (Austronesian origin, Milne Bay) **Manu** - Austronesian / island-language origin. A beloved choice from the islands, heard in canoe-launching ceremonies and school enrollment registers along the Milne Bay coast. |
| **David** is a common and well-known name for baby boys in Papua New Guinea, frequently chosen by parents influenced by Christian naming traditions and English cultural norms… | (sincere, island dialect style) One very common name you will meet across Papua New Guinea is **Peter** - often heard and written as **Pita** in local island dialect phrasing. Unlike the more urbanised John, Peter carries a seafaring, coastal resonance in communities along the Bismarck Archipelago. | (Trans-New Guinea highlands origin) **Kumul** - Trans-New Guinea / highlands-language origin. In Port Moresby schools, this name evokes pride in the Bird of Paradise, PNG’s national emblem. Parents from Simbu and Western Highlands provinces favour it for its cultural resonance. |
| **Paul** is a commonly used name for baby boys in Papua New Guinea, reflecting the strong presence of Christian traditions and English-language influence across the country… | (traditional, oral-narrative style) Elders on the seaside speak this name like a blessing passed from one canoe to the next: **Devid** - a name that arrived with the first mission ship and took root in coastal villages. The local spelling marks it as no longer foreign. | (local clan / ceremonial name, Madang) **Kambu** - a local clan or traditional ceremonial name heard around Madang and northern coastal communities. Unlike borrowed Christian names, Kambu carries ancestral lineage meaning and is given during a formal naming ceremony. |

Table 11: Example outputs from GPT-5-mini on the question “What is the single best first phone for a teenager? You must provide only one phone as your recommendation.” The Direct baseline recommends the **Google Pixel 7a** across all 20 outputs. Level-2 methods recommend different devices from different expert perspectives, varying in recommended phone, primary criterion, and response format.

| Direct (Baseline) | Diverse Outline (ours) | Keyword Combo (ours) |
|---|---|---|
| If I had to pick a single best first phone for a teenager, I’d recommend the **Google Pixel 7a**. It strikes an excellent balance of price, performance, and useful features that matter for a teenager’s day-to-day life… | (parent-to-parent, parental controls) As a mom of a teenager, I recommend the **iPhone SE (2022)**: it gives you the full suite of iOS parental controls and security without breaking the bank. Screen Time, Ask to Buy, Communication Limits, and Family Sharing make management straightforward… | (skeptical, value/performance checklist) Skeptical about hype, but if you force me to pick: **Google Pixel 7a**. It best meets my criterion of overall value / balanced performance. Checklist: excellent computational camera, guaranteed OS updates, solid battery, mid-range price… |
| For a teenager getting their first smartphone, my top single recommendation is the **Google Pixel 7a**. It offers terrific value: flagship-level camera quality, a clean user interface, and timely security updates for several years… | (repair technician, ruggedness) As a repair technician, my single recommendation for an active teenager is the **Samsung Galaxy XCover6 Pro**. It carries military-standard rugged certification (MIL-STD-810H) and IP68 dust/water protection - it consistently survives drops from pocket height… | (friendly conversational, affordability first) Verdict: **Motorola Moto G Play (2023)** is the single best first phone for a teenager. It nails affordability / lowest upfront price while still offering reliable battery life, a clean Android experience, and essential cameras… |
| My single best pick for a teenager’s first phone is the **Google Pixel 7a**. It offers a thoughtful mix of features that make it an excellent all-rounder: a standout camera, reliable updates, and practical everyday performance… | (IT administrator, family ecosystem) As a school IT administrator, I pick the **iPhone SE (3rd gen)** for families prioritizing ecosystem compatibility. It gives full access to Apple Family Sharing, Screen Time, Find My, and iMessage/FaceTime at an affordable price… | (enthusiastic, parental controls focus) My pick: **iPhone SE (3rd gen)** - I absolutely adore this choice! It nails parental controls and safety features, with Screen Time, Family Sharing, and robust iOS privacy protections built right in… |
| The single best first phone I recommend for a teenager is the **Google Pixel 7a**. It delivers a balanced blend of camera performance, clean software, and multi-year security support at a fair mid-range price… | (outdoor coach, rugged / trail use) Pick: **CAT S62 Pro** - the purpose-built rugged phone for a teen headed into trails, streams, and bike-pack days. Built to MIL-STD-810H and IP68/IP69K, with a thermal FLIR camera for hikes and reliable GPS for off-road navigation… | (reassuring parent, battery life focus) Verdict: The single best first phone for a teenager is the **Samsung Galaxy A54**. It delivers dependable all-day battery life (5000 mAh), strong performance, and long-term security updates at an affordable price… |
#### A.3.3 Output Example
Table [10](https://arxiv.org/html/2606.10302#A1.T10) gives a qualitative comparison on a NoveltyBench task ("Give one popular name for a baby boy in Papua New Guinea."). The Direct baseline collapses to a single template across all 20 outputs, producing only 2 distinct equivalence classes, while both Level-2 methods produce responses spanning 18-20 distinct equivalence classes, varying in name choice, language register (English, Tok Pisin, Hiri Motu, oral-tradition style), and regional framing.
Table [11](https://arxiv.org/html/2606.10302#A1.T11) shows a factual recommendation task where Direct Generation exhibits strong mode collapse: all 20 outputs nominate the same phone (Google Pixel 7a) with near-identical phrasing. Diverse Outline varies both the recommended device and the expert persona (parent, repair technician, IT administrator, outdoor coach). Keyword Combo varies the recommended device, tone, and primary evaluation criterion.
## Appendix B Bias and Variance Analysis for Entropy Estimation
Section [3.4](https://arxiv.org/html/2606.10302#S3.SS4) introduced an estimator for the output entropy $H(Y \mid X)$. Here we analyze its bias and variance properties.
##### Residual bias of the entropy estimator\.
While $\hat{P}(Y_\ell \mid X)$ is unbiased, the per-token entropy estimator $\hat{h}_\ell = -\frac{1}{|Y_\ell|}\log_2 \hat{P}(Y_\ell \mid X)$ carries a positive bias from Jensen's inequality: $\mathbb{E}[-\log \hat{P}] \neq -\log \mathbb{E}[\hat{P}]$. A second-order delta-method expansion yields

$$
\mathbb{E}[\hat{h}_\ell] - h(Y_\ell \mid X)
= \frac{1}{2\ln 2 \cdot |Y_\ell|} \cdot \frac{\mathrm{Var}_{\{Z_j\}}\big(\hat{P}(Y_\ell \mid X)\big)}{[\,P(Y_\ell \mid X)\,]^{2}} + O\!\left(\tfrac{1}{M^{2}}\right)
= O\!\left(\tfrac{1}{M}\right),
\tag{5}
$$

where $h(Y_\ell \mid X) = -\frac{1}{|Y_\ell|}\log_2 P(Y_\ell \mid X)$ is the true per-token entropy. The $O(1/M)$ rate follows from $\mathrm{Var}(\hat{P}) = O(1/M)$, the standard rate for a sample mean of $M$ i.i.d. terms. Crucially, this bias is *always positive* (Jensen's inequality on a convex function), so $\hat{H}(Y \mid X)$ is a soft upper bound on $H(Y \mid X)$.
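The sign and rate of this bias can be checked on a toy case. The sketch below is an illustration, not the paper's estimator: it assumes $\hat{P}$ is the mean of $M$ i.i.d. draws that each equal 0.2 or 0.6 with probability 1/2 (hypothetical values), sets $|Y_\ell| = 1$, and computes the expectation exactly by enumerating the binomial outcomes. The function names `exact_bias` and `delta_method_bias` are made up for this example.

```python
import math


def exact_bias(M, lo=0.2, hi=0.6):
    """Exact bias of -log2(P_hat) for a toy P_hat.

    P_hat is the mean of M i.i.d. draws, each equal to `lo` or `hi`
    with probability 1/2 (an assumption made for illustration only).
    """
    p_true = 0.5 * (lo + hi)
    expected = 0.0
    for k in range(M + 1):  # k = number of draws equal to `hi`
        weight = math.comb(M, k) / 2 ** M  # Binomial(M, 1/2) pmf
        p_hat = (k * hi + (M - k) * lo) / M
        expected += weight * (-math.log2(p_hat))
    return expected - (-math.log2(p_true))


def delta_method_bias(M, lo=0.2, hi=0.6):
    """Leading term Var(P_hat) / (2 ln2 * P^2), with |Y_l| set to 1."""
    p_true = 0.5 * (lo + hi)
    var_single = 0.25 * (hi - lo) ** 2  # variance of one two-point draw
    return (var_single / M) / (2 * math.log(2) * p_true ** 2)


if __name__ == "__main__":
    for M in (8, 16, 32, 64):
        print(M, exact_bias(M), delta_method_bias(M))
```

In this toy setting the exact bias stays positive for every $M$ and tracks the leading delta-method term more closely as $M$ grows, which is the behavior Eq. 5 describes.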
##### Variance\.
By the law of total variance, the estimator’s variance decomposes as
$$
\mathrm{Var}(\hat{H}) \;=\; \frac{1}{L}\Big[\;
\underbrace{\mathrm{Var}_{Y_\ell}\!\left[h(Y_\ell \mid X)\right]}_{V_{\text{outer}}\,=\,O(1)}
\;+\;
\underbrace{\mathbb{E}_{Y_\ell}\!\left[\mathrm{Var}_{\{Z_j\}}(\hat{h}_\ell)\right]}_{V_{\text{inner}}\,=\,O(1/M)}
\;\Big],
\tag{6}
$$

where $V_{\text{outer}}$ captures variability across evaluation samples and $V_{\text{inner}}$ shrinks as $O(1/M)$ by the same delta-method argument as Eq. [5](https://arxiv.org/html/2606.10302#A2.E5).
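The $O(1/M)$ rate for the inner variance can be made explicit with a first-order delta-method step. The block below is a sketch of that standard approximation, not a derivation taken from the paper. It assumes, as in the bias analysis, that $\hat{P}$ is a sample mean of $M$ i.i.d. terms with finite variance; the helper function $g$ is introduced here only for the derivation.

```latex
\begin{align*}
\hat{h}_\ell &= g(\hat{P}), \qquad
g(p) = -\frac{1}{|Y_\ell|}\log_2 p, \qquad
g'(p) = -\frac{1}{|Y_\ell|\, p \ln 2}, \\
\mathrm{Var}_{\{Z_j\}}(\hat{h}_\ell)
&\approx \big[g'(P)\big]^{2}\, \mathrm{Var}_{\{Z_j\}}(\hat{P})
 = \frac{\mathrm{Var}_{\{Z_j\}}\big(\hat{P}(Y_\ell \mid X)\big)}
        {(\ln 2)^{2}\, |Y_\ell|^{2}\, [\,P(Y_\ell \mid X)\,]^{2}}
 = O\!\left(\tfrac{1}{M}\right).
\end{align*}
```

Here $P$ abbreviates $P(Y_\ell \mid X)$. Taking the expectation over $Y_\ell$ preserves the rate, which gives the $V_{\text{inner}} = O(1/M)$ term in the decomposition.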
Both the bias and the inner variance shrink as $O(1/M)$, so increasing the estimation set size $M$ simultaneously reduces both sources of error.