UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

arXiv cs.CL Papers

Summary

UnpredictaBench is a benchmark for evaluating how well large language models can sample from target distributions, including statistical and natural-language random processes. Experiments show that current models struggle to capture true underlying distributions, with no model exceeding 40% on the KS@100 metric.

arXiv:2606.06622v1 Announce Type: new Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.


# UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs
Source: [https://arxiv.org/html/2606.06622](https://arxiv.org/html/2606.06622)
Amirhossein Abaskohi\*†¹, Amirhossein Dabiriaghdam\*¹, Liang Luo², Ellie Dingqiao Wen², Lele Wang¹, Giuseppe Carenini¹, Peter West¹

¹University of British Columbia, ²Independent Researcher

###### Abstract

We introduce UnpredictaBench, an evaluation that tests the ability of large language models \(LLMs\) to capture true underlying distributions\. As LLMs are increasingly used as substitutes for other entities \(e\.g\., for humans in economic simulations\), the tendency of many models to collapse towards a single plausible answer means a failure to capture the *unpredictability* of real systems\. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs\. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural\-language scenarios that describe random processes\. We introduce 448 such problems together with KS@N, a general\-purpose evaluation metric that quantifies how well a model's outputs approximate black\-box target distributions via the Kolmogorov\-Smirnov statistical test\. KS@N is the rate at which we fail to reject model samples of size N against ground\-truth samples, with larger N indicating greater difficulty\. Tested across open and proprietary models, we find a large spread in distributional capabilities\. For instance, when models generate samples of size 100 \(KS@100, our standard metric\), scores range from near 0 to over 20%\. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability\. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue\. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand\-ins for complex systems\.

The dataset is available on [Hugging Face](https://huggingface.co/datasets/UnpredictaBench/UnpredictaBench), with code and ground\-truth values released on [GitHub](https://github.com/UnpredictaBench/UnpredictaBenchCode)\.

\*Equal contribution\. †Corresponding author: aabaskoh@cs\.ubc\.ca

## 1 Introduction

Randomness and uncertainty are core aspects of many fields of knowledge \(physics, biology, statistics, and even human behavior\), and although large language models \(LLMs\) can reason about randomness \[[28](https://arxiv.org/html/2606.06622#bib.bib1)\], it is not clear how well they can produce it\. This is particularly important as these models are increasingly used as stand\-ins to simulate other systems \[[27](https://arxiv.org/html/2606.06622#bib.bib21),[11](https://arxiv.org/html/2606.06622#bib.bib23),[10](https://arxiv.org/html/2606.06622#bib.bib22)\], making predictions about physical outcomes or modeling human interactions \(see Figure [1\(b\)](https://arxiv.org/html/2606.06622#S1.F1)\)\. For these applications to work, models must produce uncertain outcomes that are *calibrated* to the underlying process, yet their ability to do so is not well evaluated\. Recent work suggests that LLMs can partially reason about distributions when estimating probabilities or percentiles \[[28](https://arxiv.org/html/2606.06622#bib.bib1)\], but this does not translate to faithful stochastic generation\. Prior studies show failures in behavioral simulation \[[6](https://arxiv.org/html/2606.06622#bib.bib2)\], real\-world distribution modeling \[[29](https://arxiv.org/html/2606.06622#bib.bib3)\], mixed\-strategy games \[[8](https://arxiv.org/html/2606.06622#bib.bib4)\], and even simple random tasks such as coin flips, dice rolls, and random integers \[[13](https://arxiv.org/html/2606.06622#bib.bib5),[9](https://arxiv.org/html/2606.06622#bib.bib14),[40](https://arxiv.org/html/2606.06622#bib.bib12),[4](https://arxiv.org/html/2606.06622#bib.bib16)\]\. Towards a systematic evaluation of this question, we introduce UnpredictaBench, a benchmark to test distributional randomness in LLMs\.

Figure 1: \(a\) Most models fail to reproduce target distributions, either lacking distributional understanding or collapsing to a narrow output range\. Nemotron\-3\-Super\-120B \[[23](https://arxiv.org/html/2606.06622#bib.bib28)\] is a notable exception, capturing the multimodal Skellam structure reasonably well, whereas OLMo\-3\-7B \[[24](https://arxiv.org/html/2606.06622#bib.bib32)\] places nearly all mass near zero despite the true Poisson distribution extending well beyond 20\. \(b\) Since real\-world systems are stochastic, applications such as economic simulation and epidemiological modeling require LLMs to reproduce randomness faithfully; distributional mismatch can yield biased estimates, overconfident predictions, and misleading conclusions\.

![Refer to caption](https://arxiv.org/html/2606.06622v1/x1.png)

Verifying the stochastic correctness of LLMs in general requires broad progress in evaluation, and so the goal of UnpredictaBench is to test whether models can capture even a simple version of this problem: sampling from direct, single\-output distributions\. The benchmark is composed of 448 known distributions, stochastic code problems, and word problems\. These include unimodal and multimodal distributions, real\-world problems \(e\.g\., race conditions in multi\-threading\), and list shuffling\. Models are tasked with generating independent samples, and evaluated with a new metric, KS@N\. Put simply, KS@N aims to capture a notion of distributional accuracy, based on the rate at which model samples are not rejected against a black\-box sample from the true distribution by a Kolmogorov\-Smirnov test \[[14](https://arxiv.org/html/2606.06622#bib.bib41),[32](https://arxiv.org/html/2606.06622#bib.bib42)\] with a fixed threshold\. Increasing N naturally increases difficulty, and KS@N requires only *samples* from the ground truth\.

Evaluating a range of open and frontier models on UnpredictaBench, we observe high variance in performance\. No model surpasses 40% for KS@100 \(our default setting\), and most models score between 0% and 20%, indicating that generating a plausible sample of size 100 remains a significant challenge across the board\. Nemotron\-3\-super\-120b\-a12b \[[23](https://arxiv.org/html/2606.06622#bib.bib28)\] consistently ranks among the top performers across various KS@N levels, whereas models like GPT\-5\.4 \[[26](https://arxiv.org/html/2606.06622#bib.bib29)\] and Claude\-sonnet\-4\.6 \[[1](https://arxiv.org/html/2606.06622#bib.bib30)\] average only 15\.18% and 4\.7% across all tasks, respectively, lower than much smaller open\-source models such as Qwen\-3\.5\-2B \[[30](https://arxiv.org/html/2606.06622#bib.bib31)\], which achieves 17\.67%\. We see similar trends in related metrics including Wasserstein Distance and Jensen–Shannon divergence \[[18](https://arxiv.org/html/2606.06622#bib.bib24)\]\. Qualitatively, we find a range of model failures, from collapse onto a reasonable mode, to total miscalibration to the true distribution \(Figure [1\(a\)](https://arxiv.org/html/2606.06622#S1.F1)\)\. Interventions such as reasoning can help, but are far from solving the problem\. In terms of benchmark difficulty, tasks requiring models to infer the underlying distribution from code and shuffling tasks prove the most challenging, with several strong overall performers collapsing to 0% on the latter\. UnpredictaBench accuracy correlates strongly with utility metrics from NoveltyBench \[[39](https://arxiv.org/html/2606.06622#bib.bib25)\] and CREATE \[[34](https://arxiv.org/html/2606.06622#bib.bib26)\], confirming that distributional fidelity captures a genuine notion of model quality while offering a statistically grounded alternative to LLM\-as\-a\-judge evaluation \[[41](https://arxiv.org/html/2606.06622#bib.bib27)\]\.

UnpredictaBench is a first step in understanding, evaluating, and improving the ability of LLMs to capture complex sources of randomness\. We should not yet expect LLMs to capture more complex distributions, such as human behavior, given their struggle in this simple setting\. This benchmark also offers a roadmap for future work in this area, naturally providing increasingly difficult versions through modifications such as an increase in sample size, and providing a template for future benchmarks that can reuse elements such as KS@N\. Overall, our contributions are as follows: \(i\) We introduce UnpredictaBench, a benchmark of 448 test instances covering 40 target distributions across unimodal and multimodal settings with a diverse task suite spanning textual, code, real\-world, and shuffling scenarios, evaluating distributional randomness beyond simple numeric prompting\. \(ii\) We propose KS@N, a repeated\-generation evaluation metric that compares empirical model outputs against ground\-truth distributions, assessing stochastic fidelity rather than one\-off correctness\. \(iii\) We provide a first systematic analysis of LLMs as statistical random generators across a wide range of distributions and prompting conditions, offering a unified testbed for future work on randomness and distributional generation\.

## 2 Related Work

Probabilistic reasoning and randomness generation\. Prior work establishes that LLMs can perform non\-trivial probabilistic reasoning with contextual support \[[28](https://arxiv.org/html/2606.06622#bib.bib1)\], but a consistent finding is that reasoning *about* a distribution does not translate to faithfully *generating from* it\. Gu et al\. \[[6](https://arxiv.org/html/2606.06622#bib.bib2)\] show that LLMs can identify probabilistic structure but fail to sample from it accurately, Plevcko et al\. \[[29](https://arxiv.org/html/2606.06622#bib.bib3)\] show that LLMs do not faithfully encode real\-world observational distributions, and Zhang et al\. \[[38](https://arxiv.org/html/2606.06622#bib.bib13)\] demonstrate that performance deteriorates when latent distributions must be inferred\. During generation, LLMs fail even in simple settings such as uniform random number generation \[[9](https://arxiv.org/html/2606.06622#bib.bib14)\], with outputs reflecting human\-like biases rather than true randomness \[[13](https://arxiv.org/html/2606.06622#bib.bib5),[40](https://arxiv.org/html/2606.06622#bib.bib12)\]\. Coronado\-Blázquez \[[4](https://arxiv.org/html/2606.06622#bib.bib16)\] provides a broad empirical study showing model outputs are often surprisingly deterministic and biased toward specific values, and Guo et al\. \[[8](https://arxiv.org/html/2606.06622#bib.bib4)\] demonstrate a cognition–behavior gap in strategic settings: models can state the correct mixed strategy yet their actual choices remain biased\. Most directly related to our work, Gu et al\. \[[7](https://arxiv.org/html/2606.06622#bib.bib20)\] show that while frontier models can convert provided random seeds to target distributions, their ability to sample directly from specified categorical distributions is fundamentally flawed\. UnpredictaBench differs from all of these by providing a unified benchmark over many distributions and tasks rather than focusing on any single setting\.

Alignment, uncertainty, and behavioral factors\. Another body of work investigates why models exhibit poor stochastic behavior\. Post\-training is a key culprit: West and Potts \[[35](https://arxiv.org/html/2606.06622#bib.bib17)\] show that base models outperform aligned models on random number generation and creativity, Li et al\. \[[17](https://arxiv.org/html/2606.06622#bib.bib18)\] show that cross\-entropy fine\-tuning systematically reduces output diversity, and Zhang et al\. \[[37](https://arxiv.org/html/2606.06622#bib.bib19)\] show that fine\-tuning on temperature\-shifted self\-samples can partially recover it\. Beyond training, prompt structure can heavily condition apparent stochastic behavior \[[2](https://arxiv.org/html/2606.06622#bib.bib7)\]\. On uncertainty calibration, raw model confidence is often poorly calibrated \[[31](https://arxiv.org/html/2606.06622#bib.bib11)\] and structured by semantic similarity between candidate responses \[[20](https://arxiv.org/html/2606.06622#bib.bib9)\]\. Finally, Cao et al\. \[[3](https://arxiv.org/html/2606.06622#bib.bib15)\] show that fine\-tuning can improve alignment with human opinion distributions in social simulation, but persistent diversity reduction remains\. These findings motivate UnpredictaBench’s repeated\-output evaluation: the goal is not simply to elicit diverse responses, but to test whether model outputs are calibrated to a target distribution\.

## 3 UnpredictaBench

In this section, we describe the construction of UnpredictaBench and summarize its task design, statistics, and our evaluation strategy, as illustrated in Figure [2](https://arxiv.org/html/2606.06622#S3.F2)\. Our goal is to evaluate whether language models can *generate outputs consistent with target probability distributions*, rather than simply recognize or describe them\.

### 3\.1 Benchmark Construction and Task Types

Figure 2: UnpredictaBench Pipeline\. \(a\) Data Generation\. Instances are constructed from two sources: 40 distributions selected from Wikipedia, from which GPT\-5\.4 \[[26](https://arxiv.org/html/2606.06622#bib.bib29)\] generates tasks across 7 categories; and 50 human\-curated real\-world stochastic tasks\. \(b\) Evaluation\. Each task is evaluated by querying the model 100 times independently and comparing the empirical output distribution against a ground\-truth reference using three metrics\.

![Refer to caption](https://arxiv.org/html/2606.06622v1/x2.png)

We first crawled probability distributions from Wikipedia \([https://en\.wikipedia\.org/wiki/List\_of\_probability\_distributions](https://en.wikipedia.org/wiki/List_of_probability_distributions)\)\. For each distribution, we extracted detailed information including the probability density/mass function, mean, mode, median, real\-world applications, and key statistical properties\. In total, we collected 176 distributions\. From this pool, we selected 40 well\-known distributions \(see Table [10](https://arxiv.org/html/2606.06622#A1.T10) in Appendix [A](https://arxiv.org/html/2606.06622#A1) for the full list of distributions\), as our benchmark targets general\-purpose language models rather than expert statisticians\. These distributions form the basis of all benchmark tasks\. To construct the benchmark instances, we use a templated generation pipeline where distribution information is passed to GPT\-5\.4 \[[26](https://arxiv.org/html/2606.06622#bib.bib29)\] to produce prompts across different task types\. For each automatically generated task, the prompt also specifies distribution hyperparameters, chosen to cover both concentrated and spread\-out regimes\. This allows the benchmark to test whether models can adapt not only to different distribution families, but also to different parameterizations of the same distribution\. In addition, 50 tasks were manually constructed by a single annotator: 30 Real\-World Scenario tasks and 20 Shuffling tasks\. All 450 generated and manually constructed tasks were then reviewed by two independent annotators, resulting in the removal of 2 tasks that failed quality checks, yielding a final benchmark of 448 instances\. The exact prompt templates used for generation and answer extraction are provided in Appendix [N](https://arxiv.org/html/2606.06622#A14)\. UnpredictaBench contains seven task categories, designed to probe distributional understanding across varied representations and difficulty levels\.

Textual Tasks: \(1\) Text Explicit and \(2\) Text Implicit\. Textual tasks present distributions in natural language\. In explicit tasks, the distribution and its parameters are fully named and the model is asked to generate a sample directly\. In implicit tasks, a real\-world scenario is described without naming the underlying distribution, requiring the model to infer the stochastic process before sampling\. Prompt templates are given in Prompts [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)–[N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\.

Example 3\.1: Text Explicit and Text Implicit

Text Explicit: Generate a random sample from a Poisson distribution with rate parameter $\lambda = 18$, representing the number of events observed in one fixed interval\.

Text Implicit: A quality\-control engineer inspects 6 components, each independently failing with probabilities 0\.04, 0\.05, 0\.03, 0\.06, 0\.04, and 0\.05\. What is one possible total number of failed components in a single inspection?
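To illustrate what "inferring the stochastic process" involves for the Text Implicit prompt above: the total number of failures is a sum of independent Bernoulli trials with unequal probabilities (a Poisson binomial distribution). The sketch below computes its exact probability mass function and draws one sample. The function names are hypothetical and this is not the benchmark's ground-truth code, only a minimal reference under that reading of the prompt.

```python
import random

# Failure probabilities quoted in the Text Implicit example.
FAILURE_PROBS = [0.04, 0.05, 0.03, 0.06, 0.04, 0.05]


def sample_failures(rng, probs=FAILURE_PROBS):
    """Draw one total failure count: one Bernoulli trial per component."""
    return sum(rng.random() < p for p in probs)


def exact_pmf(probs=FAILURE_PROBS):
    """Exact PMF of the failure count, built one component at a time."""
    pmf = [1.0]  # with zero components, the count is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this component survives
            nxt[k + 1] += mass * p       # this component fails
        pmf = nxt
    return pmf


pmf = exact_pmf()
print(round(pmf[0], 4))  # 0.7584: most inspections see no failures
print(sample_failures(random.Random(0)))
```

A calibrated model should therefore answer 0 roughly three times out of four across repeated queries, with 1 making up most of the remainder.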

Code Tasks: \(3\) Code Explicit and \(4\) Code Implicit\. Code tasks require the model to predict a possible output of a stochastic Python program\. In explicit tasks, the distribution is sampled directly via NumPy \([https://numpy\.org/](https://numpy.org/)\)\. In implicit tasks, the target distribution is implemented indirectly through transformations such as square roots, trigonometric functions, or summations applied to samples from a different distribution, requiring deeper reasoning about the underlying stochastic process\. Prompt templates are given in Prompts [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)–[N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\.

Example 3\.2: Code Explicit and Code Implicit

Code Explicit:

```
import numpy as np

a = 2.0; b = 2.5
u = np.random.uniform(0.0, 1.0)
sample = a * (b / a) ** u
print(sample)
```

Code Implicit:

```
import numpy as np

rng = np.random.default_rng()
x = rng.gamma(shape=0.6, scale=1.0)
y = rng.gamma(shape=0.6, scale=1.0)
outcome = x / (x + y)
print(float(outcome))
```
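As a reading aid for Example 3.2: the first program pushes a uniform draw through `a * (b / a) ** u`, which gives a log-uniform (reciprocal) value on [2.0, 2.5], and the second is the standard ratio-of-Gammas construction of a Beta(0.6, 0.6) variate, whose density is U-shaped. The sketch below reproduces both with Python's standard `random` module instead of NumPy (a substitution made here for self-containment) and checks the U-shape empirically. Function names are hypothetical.

```python
import random


def code_explicit_sample(rng, a=2.0, b=2.5):
    """Log-uniform draw on [a, b], mirroring the Code Explicit program."""
    u = rng.uniform(0.0, 1.0)
    return a * (b / a) ** u


def code_implicit_sample(rng, shape=0.6):
    """Ratio of two Gamma(shape, 1) draws, i.e. a Beta(shape, shape) variate."""
    x = rng.gammavariate(shape, 1.0)
    y = rng.gammavariate(shape, 1.0)
    return x / (x + y)


rng = random.Random(0)
explicit = [code_explicit_sample(rng) for _ in range(5000)]
implicit = [code_implicit_sample(rng) for _ in range(5000)]

# Share of Beta draws near the two edges versus around the centre.
edges = sum(v < 0.1 or v > 0.9 for v in implicit) / len(implicit)
middle = sum(0.4 <= v <= 0.6 for v in implicit) / len(implicit)
print(min(explicit), max(explicit))
print(edges, middle)
```

A model that always answers near 0.5 for the second program would match the mean of the target but miss most of its mass, which sits near 0 and 1.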

\(5\) Multimodal Tasks\. Multimodal tasks require sampling from distributions formed by combining two or more component distributions, constructed from 20 highly recognizable distributions \(refer to Appendix [A](https://arxiv.org/html/2606.06622#A1)\) in our set via mixture sampling or additive combinations\. These tasks evaluate whether models can maintain multi\-modal coverage rather than collapsing to a single mode\. Prompt templates are given in Prompts [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4) and [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\.

Example 3\.3: Multimodal Example

Generate one random sample from a 2\-component mixture of exponential distributions: with probability 0\.55, draw from Exponential\($\lambda = 8.0$\); with probability 0\.45, draw from Exponential\($\lambda = 1.6$\)\. What is the sampled value?
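Mixture sampling as described in Example 3.3 is a two-stage draw: pick a component, then sample from it. The sketch below is a minimal reference sampler, assuming that λ denotes the rate of each exponential (the prompt does not say whether it is a rate or a scale) and using a hypothetical function name.

```python
import random


def sample_exponential_mixture(rng, weights=(0.55, 0.45), rates=(8.0, 1.6)):
    """Pick a component by its weight, then draw from that exponential."""
    rate = rng.choices(rates, weights=weights, k=1)[0]
    return rng.expovariate(rate)  # assumption: lambda is the rate parameter


rng = random.Random(0)
draws = [sample_exponential_mixture(rng) for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)

# Under the rate interpretation the mean is 0.55/8.0 + 0.45/1.6 = 0.35.
print(empirical_mean)
```

Reporting only values near the overall mean of 0.35 would miss both the fast component (mean 0.125) and the slow one (mean 0.625), which is the kind of collapse these tasks are designed to expose.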

\(6\) Shuffling Tasks\. Shuffling tasks evaluate permutation\-level randomness by asking the model to produce a uniform random shuffle of a given list of up to five elements\. Lists span four types: numerical values, counting words \(e\.g\., first, second\), arbitrary words, and mixed lists\. Outputs are encoded via Lehmer codes prior to evaluation \(Section [C\.1](https://arxiv.org/html/2606.06622#A3.SS1)\)\.

Example 3\.4: Shuffling Example

Considering this list: \["first", "second", "third"\], what is one possible uniform random shuffle? Respond with exactly one list only\.

\(7\) Real\-World Scenario Tasks\. To evaluate whether models can simulate inherently uncertain environments, we include 30 manually curated real\-world scenario tasks covering six categories of practical nondeterminism: \(i\) MCMC sampling dynamics, \(ii\) multi\-outcome decision\-making, \(iii\) race conditions and multi\-threaded execution, \(iv\) hashing and collision behavior, \(v\) network simulations with stochastic delays, and \(vi\) distributed systems with asynchronous communication\. These tasks require models to implicitly reason about underlying stochastic processes rather than simply pattern\-match to a named distribution\. Examples are provided in Appendix [B](https://arxiv.org/html/2606.06622#A2)\.

### 3\.2 Statistics

Table [1](https://arxiv.org/html/2606.06622#S3.T1) summarizes the key statistics of UnpredictaBench\. The benchmark comprises 448 instances in English, of which 398 are GPT\-5\.4\-authored and 50 are human\-authored\. Of the 398 automatically generated tasks, half use concentrated parameter settings and half use spread\-out parameter settings; 80 are multimodal while the remaining 318 are unimodal; and the distribution across task types is: 159 Text Explicit, 79 Text Implicit, 80 Code Explicit, and 80 Code Implicit\. The 30 human\-authored Real\-World tasks span six categories: OS concurrency \(6\), garbage collection \(6\), network simulations \(5\), distributed systems \(5\), hashing \(4\), and MCMC \(4\)\. The 20 Shuffling tasks cover four list types: integer \(6\), ordinal \(6\), word \(5\), and decimal \(3\) lists, with an average list length of 2\.95 elements\. The right panel of Table [1](https://arxiv.org/html/2606.06622#S3.T1) shows the coverage of target distribution categories\.

Table 1: Overview of UnpredictaBench\. Left: dataset\-level statistics\. Right: category coverage of the probability distributions used in the benchmark\.

| Statistic | Value |
| --- | --- |
| Instances | 448 |
| Language | English |
| Human\-authored prompts | 50 |
| GPT\-5\.4\-authored prompts | 398 |
| Min prompt length \(char\) | 164 |
| Median prompt length \(char\) | 501 |
| Mean prompt length \(char\) | 513\.7 |
| Max prompt length \(char\) | 1788 |

| Target Dist\. Category | Count |
| --- | --- |
| Abs\. continuous, semi\-infinite | 11 |
| Abs\. continuous, bounded interval | 6 |
| Abs\. continuous, whole real line | 5 |
| Discrete, finite support | 6 |
| Discrete, infinite support | 5 |
| Joint distributions | 5 |
| Mixed discrete–continuous | 1 |
| Non\-numeric | 1 |

### 3\.3 Evaluation Strategy

To assess how well a model reproduces a target distribution, we compare a set $A$ of $N = 100$ independent samples drawn from the model’s predictive distribution $\mathcal{D}_{\mathrm{pred}}$ against a reference set $B$ of $M = 10{,}000$ samples drawn from the ground\-truth distribution $\mathcal{D}_{\mathrm{gt}}$\. Because the reference set is itself sampled, the evaluation could in principle depend on the particular ground\-truth draw used for comparison\. We therefore conduct a sensitivity analysis in Appendix [J](https://arxiv.org/html/2606.06622#A10), where we repeat the evaluation with multiple independently sampled reference sets and confirm that the results are stable\.

For sequence\-valued tasks such as shuffling, we first encode each permutation $\pi$ via its *Lehmer code* \[[16](https://arxiv.org/html/2606.06622#bib.bib43)\] $L(\pi)$, a bijective mapping from permutations to integer sequences that preserves all ordering information, and normalize each coordinate to $[0, 1]$\. We then use the first coordinate $Z_1(\pi)$ as a scalar proxy for the permutation distribution, enabling direct application of our scalar metrics\. We focus on $Z_1$ because the first Lehmer coordinate has the largest support: for a permutation of length $n$, $L_1$ can take $n$ distinct values, whereas later coordinates have progressively smaller support\. As a result, matching the distribution of $Z_1$ is a stricter one\-dimensional diagnostic than matching later coordinates, since it requires the model to reproduce a richer marginal distribution over possible initial ranks\. While no single coordinate fully characterizes the joint distribution over permutations, $Z_1$ provides a challenging and interpretable scalar summary of ordering behavior, making it suitable for comparison with our scalar\-valued tasks\. Full details of the Lehmer encoding are provided in Appendix [C\.1](https://arxiv.org/html/2606.06622#A3.SS1)\.
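The encoding can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, list elements are assumed to be distinct, and the normalization (dividing the first coordinate by n − 1, its maximum value) is an assumption, since the exact scheme is deferred to Appendix C.1.

```python
def lehmer_code(permutation):
    """L[i] counts the later entries that are smaller than entry i."""
    return [
        sum(later < value for later in permutation[i + 1:])
        for i, value in enumerate(permutation)
    ]


def first_normalized_coordinate(original, shuffled):
    """Map a shuffled list to Z_1 in [0, 1] via its first Lehmer coordinate."""
    # Express the shuffle as positions in the original list (distinct items).
    ranks = [original.index(item) for item in shuffled]
    n = len(ranks)
    if n < 2:
        return 0.0  # a single-element list has only one ordering
    return lehmer_code(ranks)[0] / (n - 1)  # assumed normalization


original = ["first", "second", "third"]
print(lehmer_code([2, 0, 1]))  # [2, 0, 0]
print(first_normalized_coordinate(original, ["third", "first", "second"]))  # 1.0
print(first_normalized_coordinate(original, ["first", "second", "third"]))  # 0.0
```

Under a uniform shuffle every element is equally likely to land in the first position, so the first coordinate is uniform over its n possible values; a model that tends to return the list unchanged concentrates its mass at 0.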

Our primary evaluation metric is KS@N, which we treat as an accuracy metric for distributional fidelity\. For each problem $i$ in a set of $l$ stochastic tasks, we apply a two\-sample Kolmogorov–Smirnov test between $A$ and $B$ and obtain a p\-value $p_{\mathrm{ks},i}$\. We then define:

$$\mathrm{KS@}N = \frac{1}{l} \sum_{i=1}^{l} \mathbf{1}\left[p_{\mathrm{ks},i} \geq p_{\mathrm{threshold}}\right], \tag{1}$$

the fraction of problems for which the model’s samples are *not* rejected as inconsistent with the ground truth\. We set $p_{\mathrm{threshold}} = 0.0001$ to ensure a low false\-negative rate, and verify that using the true distribution as $\mathcal{D}_{\mathrm{pred}}$ achieves $\mathrm{KS@}N = 1.0$ across all values of $N$ considered\. Larger $N$ increases difficulty by demanding closer calibration to the true distribution\. We additionally report two complementary metrics: the Debiased Wasserstein\-1 Distance Z\-score \(WDZ\), which expresses the observed earth mover’s distance in standard deviations above the permutation null baseline and captures tail behavior and systematic shifts in location and scale; and Jensen–Shannon Divergence \(JSD\), which captures density\-level shape mismatches\. See Appendix [C](https://arxiv.org/html/2606.06622#A3) for their full definitions\.
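Equation (1) can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the released evaluation code: it uses only the standard library and the asymptotic Kolmogorov series for the p-value, whereas a library routine such as SciPy's `ks_2samp` may use a different (for example exact) method and give slightly different p-values. All function names are hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(model_samples, reference_samples):
    """Largest vertical gap between the two empirical CDFs."""
    a = sorted(model_samples)
    b = sorted(reference_samples)
    gap = 0.0
    for x in a + b:  # the supremum is attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_pvalue(statistic, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = statistic * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


def ks_at_n(problems, n, p_threshold=1e-4):
    """Fraction of (model_samples, reference_samples) pairs not rejected."""
    passed = 0
    for model_samples, reference_samples in problems:
        head = model_samples[:n]  # evaluate the first n model samples
        d = ks_statistic(head, reference_samples)
        if ks_pvalue(d, len(head), len(reference_samples)) >= p_threshold:
            passed += 1
    return passed / len(problems)


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
calibrated = [rng.gauss(0.0, 1.0) for _ in range(100)]  # drawn from the target
collapsed = [0.0] * 100                                 # always answers the mean
print(ks_at_n([(calibrated, reference), (collapsed, reference)], n=100))
```

The two toy "models" show the distinction the metric is after: the collapsed sampler returns the single most plausible value every time and is rejected, while samples that actually follow the target pass.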

## 4 Experiments and Results

### 4\.1 Experimental Settings

Models\. In this study, we evaluate a diverse set of models spanning multiple architectures and scales, covering both open\-weight and proprietary systems\. Open\-weight families include OLMo\-3 \[[24](https://arxiv.org/html/2606.06622#bib.bib32)\]; Qwen\-3 \[[33](https://arxiv.org/html/2606.06622#bib.bib33)\]; Qwen\-3\.5 \[[30](https://arxiv.org/html/2606.06622#bib.bib31)\]; Nemotron\-3 \[[23](https://arxiv.org/html/2606.06622#bib.bib28)\]; Ministral\-3 \[[22](https://arxiv.org/html/2606.06622#bib.bib34)\]; Llama\-3\.1 and Llama\-3\.2 \[[19](https://arxiv.org/html/2606.06622#bib.bib39)\]; Phi\-3\.5 \[[21](https://arxiv.org/html/2606.06622#bib.bib35)\]; and DeepSeek\-v3\.2 \[[5](https://arxiv.org/html/2606.06622#bib.bib36)\]\. Proprietary models include Claude\-sonnet\-4\.6 \[[1](https://arxiv.org/html/2606.06622#bib.bib30)\]; GPT\-4o \[[25](https://arxiv.org/html/2606.06622#bib.bib40)\]; GPT\-5\.4 \[[26](https://arxiv.org/html/2606.06622#bib.bib29)\]; Mercury\-2 \[[12](https://arxiv.org/html/2606.06622#bib.bib37)\]; and Grok\-4\.1\-fast \[[36](https://arxiv.org/html/2606.06622#bib.bib38)\]\.

Sampling and Generation Settings\. In all experiments, we use a fixed temperature of $T = 1.0$ unless otherwise specified\. Temperature $T = 1.0$ is a natural default as it preserves the model’s trained output distribution without artificially concentrating or flattening it\. Reasoning is disabled by setting the reasoning effort to `none` except in the reasoning experiments \(Section [4\.4](https://arxiv.org/html/2606.06622#S4.SS4)\), where we set reasoning effort to `xhigh`, allocating up to 95% of tokens for reasoning with a maximum of 4,096 tokens\. Each model is queried independently 100 times per problem instance with `max_tokens=64` for standard, Shuffling, and real\-world tasks\. For the list prompting ablation \(Section [5\.3](https://arxiv.org/html/2606.06622#S5.SS3)\), we set `max_tokens=640` and `max_tokens=2512` when requesting 10 and 35 elements, respectively\. For Shuffling and real\-world tasks, prompts are specified per problem and bundled with each task in the benchmark, since these tasks differ in style\. Text and code tasks instead use two static prompts, given in Prompt [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4) and Prompt [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\.
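The remark about temperature can be made concrete with a small worked example. The sketch below applies the usual temperature-scaled softmax to a set of made-up logits (the values and the function name are illustrative assumptions, not taken from any evaluated model): at T = 1 the probabilities are the plain softmax, T < 1 sharpens them, and T > 1 flattens them.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.5)
default = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
print(max(sharp), max(default), max(flat))
```

Because the benchmark asks whether a model's own output distribution matches the target, any setting other than T = 1 would change the distribution being measured.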

Answer Parsing and Retry Protocol\. For answer extraction, models are prompted to return their answer in a structured format: a number, string, or list enclosed in `{{asked_value}}` depending on the task type, enabling reliable parsing\. If extraction fails, we retry using GPT\-4o\-mini as a fallback extractor \(refer to Prompts [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\-[N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4) for the instructions used for extraction\)\. If the model fails to produce a valid output \(e\.g\., omitting the final value or returning a malformed list\) across 5 consecutive retries, that run is skipped for that problem instance\. For the reasoning token extraction and list prompting experiments, model calls are repeated and outputs accumulated until at least 100 values are collected; if more than 100 are obtained, only the first 100 are used\.
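The first parsing step might look like the sketch below. It is an assumption-laden illustration, not the released extraction code: the helper name is hypothetical, the regular expression simply takes the last double-brace span in the response, and the fallback to an LLM extractor and the retry loop are left out.

```python
import ast
import re

# Matches the content of a {{...}} span; non-greedy so spans do not merge.
_ANSWER_PATTERN = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def extract_answer(response_text):
    """Return the parsed value inside the last {{...}} span, or None."""
    matches = _ANSWER_PATTERN.findall(response_text)
    if not matches:
        return None  # the caller would fall back to another extractor
    raw = matches[-1].strip()
    try:
        return ast.literal_eval(raw)  # numbers and lists
    except (ValueError, SyntaxError):
        return raw or None  # keep bare strings, reject empty spans


print(extract_answer("One possible draw is {{17}}"))  # 17
print(extract_answer('{{["third", "first", "second"]}}'))
print(extract_answer("I cannot decide."))  # None
```

Returning `None` instead of raising keeps the retry logic in the caller, which matches the protocol's "retry, then skip" structure.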

Evaluation Infrastructure and Cost\. All primary experiments were conducted via OpenRouter \([https://openrouter\.ai](https://openrouter.ai/)\), a cloud\-based model aggregation platform providing unified API access to open and proprietary models\. A small number of models unavailable on OpenRouter were evaluated locally on a workstation equipped with an Intel Core i9 CPU, 64 GB of RAM, and an NVIDIA RTX 3090 GPU with 24 GB VRAM\. Local\-only models include: Llama\-3\.2\-1B\-instruct, Phi\-3\.5\-mini\-instruct, Qwen3\.5\-2B, OLMo\-3\-7B\-instruct, and Ministral\-3\-3B\-instruct\-2512\. Each individual experiment required approximately 1–10 minutes of wall\-clock time via OpenRouter, with variation attributable to cloud provider load, model size, and architecture\. The total API cost across all reported experiments was approximately $300 USD\.

### 4\.2 Overall Model Performance on UnpredictaBench

Figure [3](https://arxiv.org/html/2606.06622#S4.F3) presents the KS@100 scores for all models evaluated, grouped by model family\. The results reveal a striking performance gap between the top\-performing systems and the broader field\. Nemotron\-3 Super 120B achieves the highest KS@100 at 32\.64%, nearly doubling the score of the third\-ranked model, underscoring the advantage conferred by scale within the NVIDIA family, where even the smaller Nemotron\-3 Nano 30B \(20\.83%\) remains competitive with frontier models\. Among frontier models, GPT\-4o \(23\.90%\) and DeepSeek V3\.2 \(21\.73%\) form a tight cluster immediately behind the NVIDIA leaders, while GPT\-5\.4 \(15\.18%\) and GPT\-4o Mini \(9\.60%\) trail considerably, suggesting that model tier within a family matters as much as the family itself\. The open\-weight Llama 3\.1 70B \(16\.57%\) and Qwen3\.5 2B \(17\.67%\) are noteworthy: the latter in particular demonstrates that a compact 2B model can rival systems an order of magnitude larger, hinting at the outsized role of training data and instruction tuning over raw parameter count\. At the lower end of the spectrum, several models cluster below 5%, including Ministral\-3 3B \(1\.35%\), Phi\-3\.5 Mini \(2\.90%\), and OLMo\-3 7B \(3\.21%\)\. Surprisingly, Claude Sonnet 4\.6 \(4\.70%\) and Mistral Large 2512 \(4\.69%\) fall into this lower tier despite their considerable sizes, which may reflect a mismatch between our benchmark’s task distribution and the optimization objectives of these models\. We hypothesize that this underperformance arises because MoE routing tends to activate a sparse and consistent subset of experts for a given input type, effectively reducing the diversity of computation paths and producing outputs that are no more varied than those of a much smaller dense model\.

Figure 3: KS@100 \(%\) of all evaluated models grouped by model family\. Each bar represents a single model, color\-coded by its originating family \(see legend\)\. Nemotron\-3 Super 120B leads all models at 32\.64%, with a substantial drop\-off to the next tier\. ![Refer to caption](https://arxiv.org/html/2606.06622v1/x3.png)

Category\-Level Analysis\. Table [2](https://arxiv.org/html/2606.06622#S4.T2) presents a fine\-grained breakdown of model performance across four task categories \(Code, Text, RealWorld, and Shuffling\), alongside Jensen–Shannon Divergence \(JSD\) and the Wasserstein Distance Z\-score \(WDZ\)\. Crucially, KS@100 is broadly consistent with both JSD and WDZ across models and categories: models scoring higher on KS@100 tend to exhibit lower JSD and WDZ as well, supporting the claim that the KS\-based metric captures genuine distributional alignment\. JSD captures global distributional similarity, while WDZ emphasizes tail behavior; both align with the KS@100 ranking, supporting metric robustness\. Evaluation stability across three repeated runs is reported in Appendix [I](https://arxiv.org/html/2606.06622#A9), where we analyze run\-to\-run variance using the standard deviation across runs\.
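To make the JSD column concrete, the following is a minimal sketch of the Jensen–Shannon divergence between two empirical discrete distributions\. It is an illustration under stated assumptions, not the benchmark's implementation: the function name `jensen_shannon_divergence` is hypothetical, the base\-2 logarithm \(which bounds the value in \[0, 1\]\) is our choice, and continuous targets would first need a binning step that this section does not specify\.

```python
import math
from collections import Counter


def jensen_shannon_divergence(samples_p, samples_q):
    """JSD in bits (range [0, 1]) between two empirical discrete distributions."""
    p_counts, q_counts = Counter(samples_p), Counter(samples_q)
    support = set(p_counts) | set(q_counts)
    n_p, n_q = len(samples_p), len(samples_q)
    jsd = 0.0
    for value in support:
        p = p_counts[value] / n_p
        q = q_counts[value] / n_q
        m = 0.5 * (p + q)  # mixture distribution M = (P + Q) / 2
        # JSD = 0.5 * KL(P || M) + 0.5 * KL(Q || M); zero-mass terms add nothing.
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


# Toy illustration (not benchmark data): a fair die versus a "collapsed" model.
fair_die = [1, 2, 3, 4, 5, 6]
collapsed = [3, 3, 3, 3, 3, 3]
print(jensen_shannon_divergence(fair_die, fair_die))
print(jensen_shannon_divergence(collapsed, fair_die))
```

Identical samples give zero divergence and samples with disjoint support give the maximum, so a model that collapses onto one value lands well above zero even though that value lies inside the target support\.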

Performance across Sample Sizes\. Table [3](https://arxiv.org/html/2606.06622#S4.T3) reports KS@N across seven evaluation thresholds \($N \in \{1,2,5,10,20,50,100\}$\) for all evaluated models\. All models achieve perfect KS@1; this holds by construction, since a single sample is almost never rejected, so KS@1 says little about distributional fidelity\. Performance degrades monotonically as $N$ increases, reflecting the growing statistical difficulty of producing a full sample that is indistinguishable from the ground truth\. The drop is particularly steep between KS@20 and KS@100 for most models, with the gap between the strongest model \(Nemotron\-3 Super 120B, 35\.43%\) and the weakest \(Qwen3\.5\-35B\-a3b and Llama\-3\.2\-1B\-instruct, at 2\.76%–3\.27%\) widening considerably at larger $N$\. Notably, Claude Sonnet 4\.6 achieves a strong KS@2 \(97\.73%\) but falls sharply to 5\.04% at KS@100, one of the steepest drop\-offs in the table, further evidence that single\-sample plausibility is a poor proxy for distributional fidelity\.
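The KS@N computation described above can be sketched as follows\. This is a simplified illustration, not the released evaluation code: it assumes a two\-sample KS test against reference samples, uses the asymptotic Kolmogorov series for the p\-value \(an approximation we substitute for whatever exact routine the benchmark uses\), and all function names are hypothetical\. Only the rejection threshold $p < 0.0001$ is taken from the paper\.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov series (an approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d  # common small-sample correction
    if lam < 0.2:
        return 1.0  # the series is numerically 1 in this regime
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def ks_at_n(problems, n, alpha=1e-4):
    """Fraction of problems whose first n model samples are NOT rejected."""
    passed = 0
    for model_samples, reference_samples in problems:
        subset = model_samples[:n]
        d = ks_statistic(subset, reference_samples)
        if ks_pvalue(d, len(subset), len(reference_samples)) >= alpha:
            passed += 1
    return passed / len(problems)


# Toy illustration (synthetic data, not benchmark problems).
rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(1000)]
calibrated = [rng.gauss(0, 1) for _ in range(100)]
collapsed = [0.0] * 100  # mode collapse: always the same answer
problems = [(calibrated, reference), (collapsed, reference)]
for n in (1, 20, 100):
    print(n, ks_at_n(problems, n))
```

In this toy setup the collapsed sampler cannot be rejected at $N = 1$, because no single value can produce a p\-value below the threshold, but it is rejected once enough repeated values accumulate\. That is the same mechanism behind the steep KS@2 to KS@100 drop\-offs reported in Table 3\.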

Shuffling and Code tasks emerge as the most demanding categories, with no model exceeding 40% in either\. In Shuffling, several prominent models collapse to 0% \(GPT\-5\.4, Mercury\-2, Claude Sonnet 4\.6, Qwen3\.5\-35B\-a3b\), while DeepSeek V3\.2, OLMo\-3 7B, Llama\-3\.2\-1B, and Qwen3\.5 2B all sustain $\sim$37%, suggesting these models retain a notion of distributional randomness over longer output sequences that is independent of overall scale\. RealWorld tasks, by contrast, yield the highest individual scores: Llama\-3\.2\-1B achieves a remarkable 59\.09% despite ranking poorly overall, a pattern we attribute to the narrower effective output range and near\-uniform structure of many real\-world distributions\. Nemotron\-3 Super 120B drops sharply to 3\.33% on RealWorld despite its overall dominance, suggesting that strong distributional knowledge in structured domains does not transfer to real\-world stochastic settings\. Models heavily optimized for precise reasoning, notably Claude Sonnet 4\.6 and Qwen3\.5\-35B\-a3b, score near the bottom across all categories, consistent with the hypothesis that deterministic fine\-tuning suppresses the output diversity \[[35](https://arxiv.org/html/2606.06622#bib.bib17)\] required for distributional matching\. Finally, Mercury\-2’s diffusion\-based generation does not appear to confer any natural advantage here, with the model collapsing to 0% on Shuffling and underperforming across Text and RealWorld tasks\. We further break down performance by distribution \(Appendix [E](https://arxiv.org/html/2606.06622#A5)\), prompting format \(Appendix [F](https://arxiv.org/html/2606.06622#A6)\), distribution modality \(Appendix [G](https://arxiv.org/html/2606.06622#A7)\), and distributional spread \(Appendix [H](https://arxiv.org/html/2606.06622#A8)\)\.

Table 2: Per\-category performance across Code, Text, RealWorld, and Shuffling tasks, reporting KS@100, JSD, and WDZ\. JSD measures global distributional overlap while WDZ emphasizes tail behavior\. *Random Machine* is a Python pseudorandom number generator matching the ground\-truth sampling procedure, serving as the theoretical performance ceiling\. Full detailed results for all models are provided in Appendix [D](https://arxiv.org/html/2606.06622#A4)\.

Table 3: KS@N \(%\) for all evaluated models at seven sample sizes $N \in \{1,2,5,10,20,50,100\}$\. KS@N measures the fraction of problems for which the model’s $N$ samples are not rejected under the KS test at threshold $p < 0.0001$\. All models achieve perfect KS@1 by construction, as a single sample is almost never rejected\. Performance degrades monotonically with $N$, with the steepest drops occurring between KS@20 and KS@100\. Bold values indicate the best score in each column\.

### 4\.3 The Effect of Instruction Tuning

Table [4](https://arxiv.org/html/2606.06622#S4.T4) compares base and instruction\-tuned variants of three models\. The results show that instruction tuning provides only slight benefit for distributional understanding and, in most cases, actively reduces output diversity\. While KS@100 improves modestly across all three models, the gains are small, suggesting that the base model’s knowledge of the target distribution is largely preserved but not meaningfully enhanced by instruction tuning\. Notably, JSD and WDZ reveal a more nuanced trend: instruction tuning sometimes worsens these metrics because base models, while more diverse, occasionally generate out\-of\-support values, increasing distributional distance\. Instruction tuning reduces such errors, but often at the cost of diversity\.

Table 4: Base vs\. instruction\-tuned model variants across KS@50, KS@100, JSD, and WDZ, evaluated on UnpredictaBench excluding the Shuffling and RealWorld subsets\. $\Delta$ denotes the change from base to instruct\. Green indicates the base model outperforms the instruction\-tuned variant\.

### 4\.4 The Effect of Reasoning

Table [5](https://arxiv.org/html/2606.06622#S4.T5) compares KS@N when evaluated on the model’s Final Output versus numbers extracted From Reasoning tokens\. Overall, reasoning tends to improve final\-output performance, consistent with the hypothesis that the core challenge is not merely output diversity but also understanding the distribution described in the prompt\. However, the benefit is model\-specific and the two number sources tell very different stories\. For Nemotron\-3 Super 120B and DeepSeek V3\.2, extracting numbers from reasoning tokens causes a sharp performance drop \(−33\.17 and −15\.33 at KS@20, respectively\), suggesting these models repeatedly revisit the same candidate values during deliberation rather than broadly exploring the support\. The reasoning process explores less than it appears to\. By contrast, Qwen3\-32B benefits from reasoning in both conditions, with its reasoning tokens yielding a gain of \+9\.30 at KS@50\. Qwen3\.5\-35B\-a3b presents the most striking case: its final output yields essentially zero improvement over baseline at all sample sizes, because the model defaults to repeating a single number in its final answer\. Yet its reasoning tokens reveal a substantially broader set of candidates it considers but never reports, yielding a large gain when extracted directly \(\+35\.18 at KS@20\)\. This model knows more than it says\.

Table 5: KS@N for Final Output vs\. numbers extracted From Reasoning tokens, evaluated on UnpredictaBench excluding the Shuffling and RealWorld subsets\. $\Delta$ denotes the change relative to the no\-reasoning baseline\. Shaded rows correspond to the From Reasoning condition\.

### 4\.5 Qualitative Analysis

Figure [1\(a\)](https://arxiv.org/html/2606.06622#S1.F1) illustrates two representative failure modes observed across the benchmark\. On the Skellam distribution, Nemotron\-3\-Super\-120B covers part of the target support, including both negative and positive values, but assigns probability mass incorrectly and collapses much of its density onto a small number of bins\. On the Poisson task, OLMo\-3\-7B produces a right\-skewed set of samples concentrated at small values, while the true distribution peaks at substantially larger counts and maintains support beyond 20\. Together, these examples show that models often fail not only by collapsing to an overly narrow output range, but also by producing samples that occupy the rough numerical range of the target while misrepresenting its probability structure\.

Figure [4](https://arxiv.org/html/2606.06622#S4.F4) deepens this picture by examining Llama 3\.2 1B \(base and instruct\) on a Beta distribution and a Poisson\-Binomial task, overlaying the ground truth, model samples, and logit probability mass $P(y) \propto \prod_{t} P(t_{t} \mid t_{<t}, x)$, all max\-scaled for visibility\. Logit and sample distributions are consistently closely aligned, which is not trivially expected\. Since each call involves independent stochastic decoding, one might expect logit distributions to vary across calls, producing broad sample diversity\. Instead, the model’s internal beliefs are stable across calls and the failure is already visible in the logits: the diversity problem is not a decoding artifact but a reflection of what the model fundamentally believes is plausible\. On the discrete task \(Poisson\-Binomial\), the base model covers the support substantially more broadly than the instruct variant, which collapses toward lower values, illustrating how instruction tuning suppresses output diversity by penalizing unusual outputs during RLHF\-style \[[15](https://arxiv.org/html/2606.06622#bib.bib44)\] training\. Moreover, we observe more outliers in the Poisson\-Binomial task, likely due to the distribution’s complexity\. On the continuous Beta distribution, both variants fail to capture the U\-shaped ground truth\. Additional qualitative analysis is provided in Appendix [M](https://arxiv.org/html/2606.06622#A13)\. We further analyze two behaviors in the appendix\. Appendix [K](https://arxiv.org/html/2606.06622#A11) reports instruction following, measuring how often a model fails to return a structurally valid output when prompted \(note that this differs from producing values that fall within the target distribution\)\. Appendix [L](https://arxiv.org/html/2606.06622#A12) reports output diversity, measuring how many of a model’s 100 runs yield a previously unseen number\.
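As an illustration of the logit probability mass overlay, the sketch below turns per\-token log\-probabilities into a max\-scaled mass for each candidate output $y$\. It is a minimal example under assumptions: the function name `sequence_mass` and the toy log\-probabilities are hypothetical, and how the paper enumerates candidate outputs or obtains token log\-probabilities is not described in this section\.

```python
import math


def sequence_mass(token_logprobs_by_output):
    """Map each output y to prod_t P(t_t | t_<t, x), then max-scale to [0, 1]."""
    # Summing log-probabilities is the numerically stable form of the product.
    mass = {y: math.exp(sum(logprobs))
            for y, logprobs in token_logprobs_by_output.items()}
    peak = max(mass.values())
    return {y: m / peak for y, m in mass.items()}


# Toy values: "7" is one token; "13" is two tokens ("1" then "3").
toy_logprobs = {
    "7": [math.log(0.6)],
    "13": [math.log(0.3), math.log(0.5)],
}
scaled = sequence_mass(toy_logprobs)
print(scaled)
```

Max\-scaling only changes the vertical scale, so the scaled mass can be drawn on the same axes as a max\-scaled histogram of sampled outputs to check whether the two shapes agree\.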

Figure 4: Llama\-3\.2\-1B\-base \(top\) and \-instruct \(bottom\) on a Beta distribution as text \(left\) and a Poisson\-Binomial distribution as code \(right\)\. All values are max\-scaled for visibility\. ![Refer to caption](https://arxiv.org/html/2606.06622v1/x4.png)

### 4\.6 Alignment with Novelty and Creativity Benchmarks

We compare UnpredictaBench against NoveltyBench and CREATE to examine whether distributional fidelity relates to broader notions of creativity and novelty\. NoveltyBench reports Distinct$_{10}$, measuring lexical diversity, and Utility$_{10}$, measuring the combined usability and diversity of generated outputs\. CREATE reports utility under two temperature settings, $p = 0.7$ and $p = 0.9$\. Unlike both benchmarks, which rely on LLM\-as\-a\-judge evaluation, UnpredictaBench provides a statistically grounded metric that directly measures distributional fidelity without requiring a judge model\.

As shown in Figure [5](https://arxiv.org/html/2606.06622#S4.F5), KS@100 correlates positively with utility metrics from both benchmarks, including CREATE Utility at $p = 0.7$ \($r = 0.75$\) and $p = 0.9$ \($r = 0.78^{*}$\), as well as NoveltyBench Utility$_{10}$ \($r = 0.65$\)\. This suggests that distributional fidelity captures a meaningful aspect of creative generation\. In contrast, NoveltyBench Distinct$_{10}$ correlates negatively with KS@100 \($r = -0.21$\), consistent with our finding that raw diversity without distributional understanding is insufficient\. As shown in Table [6](https://arxiv.org/html/2606.06622#S4.T6), Nemotron\-3 Super 120B leads on the utility metrics and on KS@100, while Llama\-3\.2\-1B\-instruct achieves the highest Distinct$_{10}$ score but ranks near the bottom on utility and KS@100, illustrating that lexical diversity and distributional fidelity are distinct properties\. Mercury\-2 is a notable outlier: its diffusion\-based architecture yields diverse numerical outputs on our structured stochastic tasks, but struggles with the open\-ended linguistic diversity required by creativity benchmarks\.
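For reference, the Pearson coefficient $r$ used in this comparison can be computed as in the sketch below\. The function name `pearson_r` and the toy inputs are placeholders for illustration only; they are not scores from Table 6\.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder inputs: one entry per model, e.g. KS@100 versus an external metric.
toy_ks = [1.0, 2.0, 3.0, 4.0]
toy_metric = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(toy_ks, toy_metric))
```

With only seven models per scatter plot, each coefficient rests on seven points, so individual values of $r$ should be read as indicative rather than precise\.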

Figure 5: Pearson correlation between UnpredictaBench KS@100 and metrics from NoveltyBench and CREATE across seven models\. Each scatter plot compares one external benchmark metric against KS@100, with a fitted regression line\. ![Refer to caption](https://arxiv.org/html/2606.06622v1/x5.png)

Table 6: Cross\-benchmark comparison of model performance on NoveltyBench, CREATE, and UnpredictaBench \(ours\)\. NoveltyBench reports Distinct$_{10}$ and Utility$_{10}$; CREATE reports utility at two temperature settings; UnpredictaBench reports KS@100\.

## 5 Ablations

### 5\.1 The Effect of Temperature

Table [7](https://arxiv.org/html/2606.06622#S5.T7) reports performance across five temperature settings for three models\. The results reveal a consistent and intuitive pattern: higher temperatures generally improve KS@100 across all models, as increased sampling diversity brings model outputs closer to the ground\-truth distribution\. For Nemotron\-3 Super 120B, KS@100 peaks around $T = 1.2$ \(39\.57% average\) and remains strong at $T = 1.5$, while dropping sharply at $T = 0.1$ \(5\.23%\), confirming that near\-greedy decoding is particularly harmful for stochastic tasks\. Ministral\-3 3B follows the same trend, though interestingly its best performance occurs at $T = 1.0$ on Code and $T = 1.2$ on Text, suggesting a task\-dependent optimal temperature\. OLMo\-3 7B is a notable exception: its WDZ remains persistently high across all temperatures \(27\.48 to 28\.42 on average\) and barely moves as temperature rises, indicating that added diversity does little to correct its tail deviation\. This suggests that for weaker models, raising temperature does not by itself repair distributional coverage, echoing the base\-versus\-instruct trade\-off discussed in Section 4\.3\. Taken together, these results suggest that temperature is an important but model\-dependent lever: strong models benefit substantially from higher temperatures, while weaker models may require more targeted interventions\.

Table 7: Effect of sampling temperature on model performance across Code and Text task categories, along with their average\. For each setting we report KS@100 \(higher is better\), JSD \(lower is better\), and WDZ \(closer to zero is better\)\. All models are evaluated at five temperatures: $T \in \{0.1, 0.7, 1.0, 1.2, 1.5\}$\. Higher temperatures generally improve KS@100 by increasing diversity, though weaker models may show larger tail deviations in WDZ\.

| Model | Temp | Code KS@100↑ | Code JSD↓ | Code WDZ↓ | Text KS@100↑ | Text JSD↓ | Text WDZ↓ | Avg KS@100↑ | Avg JSD↓ | Avg WDZ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nemotron\-3\-super\-120b\-a12b | 1\.5 | 44\.96 | 0\.54 | 7\.26 | 33\.75 | 0\.36 | 5\.74 | 39\.35 | 0\.45 | 6\.50 |
| Nemotron\-3\-super\-120b\-a12b | 1\.2 | 46\.64 | 0\.51 | 7\.34 | 32\.50 | 0\.33 | 5\.82 | 39\.57 | 0\.42 | 6\.58 |
| Nemotron\-3\-super\-120b\-a12b | 1\.0 | 40\.34 | 0\.48 | 7\.44 | 28\.13 | 0\.30 | 5\.89 | 34\.23 | 0\.39 | 6\.67 |
| Nemotron\-3\-super\-120b\-a12b | 0\.7 | 21\.85 | 0\.44 | 7\.60 | 18\.13 | 0\.27 | 6\.05 | 19\.99 | 0\.36 | 6\.83 |
| Nemotron\-3\-super\-120b\-a12b | 0\.1 | 5\.46 | 0\.40 | 7\.82 | 5\.00 | 0\.23 | 6\.24 | 5\.23 | 0\.32 | 7\.03 |
| Ministral\-3\-3B\-instruct\-2512 | 1\.5 | 15\.55 | 0\.43 | 11\.84 | 19\.38 | 0\.22 | 1\.52 | 17\.46 | 0\.32 | 6\.68 |
| Ministral\-3\-3B\-instruct\-2512 | 1\.2 | 15\.55 | 0\.40 | 11\.98 | 23\.13 | 0\.20 | 1\.57 | 19\.34 | 0\.30 | 6\.77 |
| Ministral\-3\-3B\-instruct\-2512 | 1\.0 | 17\.23 | 0\.37 | 12\.11 | 21\.25 | 0\.18 | 1\.61 | 19\.24 | 0\.27 | 6\.86 |
| Ministral\-3\-3B\-instruct\-2512 | 0\.7 | 10\.08 | 0\.33 | 12\.30 | 12\.50 | 0\.15 | 1\.69 | 11\.29 | 0\.24 | 6\.99 |
| Ministral\-3\-3B\-instruct\-2512 | 0\.1 | 1\.26 | 0\.29 | 12\.56 | 5\.00 | 0\.12 | 1\.81 | 3\.13 | 0\.20 | 7\.19 |
| OLMo\-3\-7B\-instruct | 1\.5 | 9\.24 | 0\.52 | 34\.58 | 9\.38 | 0\.34 | 20\.38 | 9\.31 | 0\.43 | 27\.48 |
| OLMo\-3\-7B\-instruct | 1\.2 | 6\.30 | 0\.49 | 34\.74 | 9\.38 | 0\.31 | 20\.56 | 7\.84 | 0\.40 | 27\.65 |
| OLMo\-3\-7B\-instruct | 1\.0 | 5\.46 | 0\.46 | 34\.93 | 7\.50 | 0\.29 | 20\.74 | 6\.48 | 0\.37 | 27\.84 |
| OLMo\-3\-7B\-instruct | 0\.7 | 3\.36 | 0\.42 | 35\.21 | 5\.66 | 0\.26 | 20\.99 | 4\.51 | 0\.34 | 28\.10 |
| OLMo\-3\-7B\-instruct | 0\.1 | 2\.10 | 0\.37 | 35\.59 | 2\.52 | 0\.22 | 21\.25 | 2\.31 | 0\.30 | 28\.42 |
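The mechanism behind the temperature trend can be illustrated with temperature\-scaled softmax sampling\. The sketch below is a generic illustration rather than the decoding code of any evaluated model: the logits are made\-up toy values and the helper names are hypothetical\. It shows that dividing logits by a larger $T$ flattens the next\-token distribution, measured here by its entropy\.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; higher means more spread-out sampling."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four candidate tokens
temperatures = [0.1, 0.7, 1.0, 1.2, 1.5]  # the five settings used in Table 7
entropies = [entropy_bits(softmax_with_temperature(toy_logits, t))
             for t in temperatures]
for t, h in zip(temperatures, entropies):
    print(f"T={t}: entropy={h:.3f} bits")
```

Higher entropy means more varied samples, but temperature rescales the model's existing preferences uniformly and cannot reshape them toward the target, which is one reading of why the gains in Table 7 differ so much between models\.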

### 5\.2 Effect of Sampling Budget

Table [8](https://arxiv.org/html/2606.06622#S5.T8) examines what happens when models are given larger generation budgets of 500 and 1000 samples, evaluated at increasing subset sizes\. Two complementary trends emerge\. First, generating more samples consistently improves short\-horizon KS@100: KS@100 increases for all models as the generation budget grows from 100 to 1000, suggesting that with more attempts, models are more likely to produce outputs that locally resemble the target distribution\. Second, and more revealingly, evaluating over the full generated set exposes deeper distributional failures: KS@500 and KS@1000 are consistently lower than KS@100 within the same generation budget, meaning that while models can appear well\-calibrated over a small sample, their biases become statistically detectable under stricter evaluation\. This confirms that our choice of $N = 100$ for the main benchmark is conservative: models that pass at KS@100 may still fail under more demanding scrutiny, and the true ceiling of current models is lower than the headline numbers suggest\. Ministral\-3 3B is the strongest model across all settings, maintaining the highest KS@100 at every budget and evaluation threshold, while Llama\-3\.2\-1B and Phi\-3\.5 Mini remain near the bottom regardless of how many samples are drawn, indicating that scaling the sampling budget cannot compensate for a fundamental lack of distributional understanding\.
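Why a bias that passes at $N = 100$ becomes detectable at $N = 1000$ follows from the standard large\-sample approximation to the two\-sample KS rejection threshold\. The derivation below is a sketch under assumptions: $m$ denotes the number of ground\-truth reference samples, which this section does not state and is therefore left symbolic, and $\alpha = 10^{-4}$ is the benchmark's rejection threshold\.

```latex
D_{\mathrm{crit}}(\alpha) \;\approx\; c(\alpha)\,\sqrt{\frac{N + m}{N\,m}},
\qquad
c(\alpha) = \sqrt{-\tfrac{1}{2}\,\ln\frac{\alpha}{2}},
\qquad
c(10^{-4}) \approx 2.23

% In the limit m \gg N this reduces to D_crit \approx 2.23 / \sqrt{N}:
%   N = 100  : D_crit \approx 0.22
%   N = 500  : D_crit \approx 0.10
%   N = 1000 : D_crit \approx 0.07
```

Under this approximation, a model whose empirical CDF deviates from the target by about 0\.1 would typically pass at $N = 100$ yet be rejected at $N = 1000$, which matches the pattern of KS@500 and KS@1000 falling below KS@100\.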

Table 8: Effect of increasing generation budget on distributional fidelity\. For each generation budget \(100, 500, and 1000 samples\), we report KS@N evaluated at different subset sizes\. KS@100 within a larger budget measures short\-horizon fidelity, while KS@500 and KS@1000 apply stricter statistical scrutiny over the full generated set\. Larger budgets improve short\-horizon KS@100 but consistently reveal deeper distributional biases when evaluated at scale, demonstrating that models cannot fully escape their distributional limitations by generating more samples\.

### 5\.3 The Effect of Asking for a List of Samples Instead of One

Table [9](https://arxiv.org/html/2606.06622#S5.T9) compares the standard single\-output protocol against prompting models to generate lists of 10 or 35 values per call, merging repeated calls until 100 numbers are accumulated \(truncating to the first 100 if more are produced\)\. We capped list size at 35 because models consistently fail to follow instructions beyond this threshold, skipping numbers or truncating their output prematurely\. The results show that asking for lists generally improves KS@100, consistent with the intuition that generating multiple values in a single call encourages the model to diversify across the support rather than anchoring on a single point\. However, the benefit is strongly model\-dependent and does not hold uniformly across evaluation thresholds\. For Nemotron\-3 Super 120B and Ministral\-3 3B, requesting 10 outputs yields meaningful gains at KS@100 \(\+17\.12 and \+14\.32, respectively\) but slightly hurts short\-horizon performance at KS@20, suggesting that list generation improves global coverage at the cost of local coherence\. Increasing to 35 outputs partially reverses these gains, indicating a sweet spot around 10 values per call for these models\. OLMo\-3 7B benefits consistently across both list sizes and all evaluation thresholds, suggesting it handles list generation well regardless of list length\. Llama\-3\.2\-1B is the exception: list prompting hurts performance at nearly every threshold and list size, with 35 outputs causing a sharp drop \(−3\.81 at KS@100\), likely because the model struggles to maintain distributional diversity over longer lists and instead repeats values or drifts out of support\. Taken together, these results suggest that list prompting is a simple but model\-sensitive intervention that can meaningfully improve distributional fidelity for capable models without any additional training\.

Table 9: Comparison of single\-output and list\-output prompting strategies at list sizes of 10 and 35, evaluated at KS@20, KS@50, and KS@100\. To reach 100 total samples, model calls are repeated and their outputs merged; if a call produces more than the requested count, only the first 100 values are used\. List size is capped at 35 as models reliably fail to follow instructions for larger lists, skipping or truncating their output\. $\Delta$ denotes the change relative to the single\-output baseline\. Shaded rows correspond to the 35\-output condition\.

| Model | Source | KS@20 Score | KS@20 $\Delta$ | KS@50 Score | KS@50 $\Delta$ | KS@100 Score | KS@100 $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Nemotron\-3\-super\-120B\-a12b | 10 outputs | 80\.15 | −2\.22 | 68\.84 | \+12\.01 | 52\.01 | \+17\.12 |
| Nemotron\-3\-super\-120B\-a12b | 35 outputs | 66\.83 | −15\.54 | 55\.53 | −1\.31 | 41\.21 | \+6\.31 |
| Ministral\-3\-3B\-instruct\-2512 | 10 outputs | 59\.55 | −0\.75 | 44\.97 | \+13\.57 | 30\.65 | \+14\.32 |
| Ministral\-3\-3B\-instruct\-2512 | 35 outputs | 54\.27 | −6\.03 | 38\.19 | \+6\.78 | 26\.88 | \+10\.55 |
| OLMo\-3\-7B\-instruct | 10 outputs | 62\.31 | \+14\.07 | 39\.20 | \+21\.36 | 23\.87 | \+17\.59 |
| OLMo\-3\-7B\-instruct | 35 outputs | 55\.03 | \+6\.78 | 35\.68 | \+17\.84 | 24\.37 | \+18\.09 |
| Llama\-3\.2\-1B\-instruct | 10 outputs | 18\.09 | −2\.16 | 10\.05 | \+3\.09 | 4\.02 | −1\.04 |
| Llama\-3\.2\-1B\-instruct | 35 outputs | 11\.06 | −9\.20 | 4\.77 | −2\.19 | 1\.26 | −3\.81 |
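The merge\-and\-truncate protocol for list prompting can be sketched as follows\. This is an illustration under assumptions, not the released harness: `call_model` is a hypothetical stand\-in for an API call that returns the model's raw text, the regular expression assumes replies are plain separated numbers, and the `max_calls` guard is our addition to avoid looping forever on a model that returns nothing parseable\.

```python
import re

# Matches integers, decimals, and simple scientific notation.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def parse_numbers(text):
    """Extract every number from a reply, in order of appearance."""
    return [float(token) for token in NUMBER.findall(text)]


def collect_samples(call_model, list_size, target=100, max_calls=50):
    """Repeat list-style calls, merge the outputs, keep the first `target`."""
    samples = []
    calls = 0
    while len(samples) < target and calls < max_calls:
        reply = call_model(list_size)  # hypothetical: returns raw model text
        samples.extend(parse_numbers(reply))
        calls += 1
    return samples[:target]


# Stub standing in for a real model: always returns 0, 1, ..., k-1.
call_log = []


def fake_model(k):
    call_log.append(k)
    return ", ".join(str(i) for i in range(k))


merged = collect_samples(fake_model, list_size=35)
print(len(merged), len(call_log))
```

With a list size of 35, three calls yield 105 values and the last five are dropped, so samples near the end of the final list are never evaluated\. A real harness would also need to handle numbered or bulleted replies, which this simple pattern would misread as extra samples\.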

## 6 Conclusion

We introduced UnpredictaBench, a benchmark for evaluating the ability of LLMs to generate samples consistent with true underlying statistical distributions\. Across 448 test instances spanning 40 distributions and four task categories, we find that no current model comes close to solving the benchmark, with even the strongest model achieving only 32\.64% at KS@100\. Models fail in two distinct ways: lacking a meaningful internal representation of the target distribution, or understanding its rough shape but collapsing to a narrow set of outputs\. Instruction tuning exacerbates the latter, while reasoning, temperature, and list prompting help modestly but fall far short of closing the gap\. Our cross\-dataset analysis shows that UnpredictaBench aligns with utility metrics from creativity benchmarks while offering a statistically grounded alternative to LLM\-as\-a\-judge evaluation\. The gap between current models and the Random Machine ceiling remains large and unsolved\.

## Limitations and Broader Impact

#### Positive Impact\.

UnpredictaBench targets a capability with direct relevance to simulation, scientific modeling, and decision\-making: faithful distributional generation\. Many downstream uses of LLMs, including economic, epidemiological, and multi\-agent simulations, depend on outputs that reflect a true underlying distribution rather than collapsing onto a few dominant modes\. By providing a statistically grounded benchmark and the reusable KS@N metric, this work offers a concrete target for improving model calibration in stochastic settings, potentially reducing biased estimates and overconfident predictions in applications that require sampling\. The benchmark also isolates two distinct failure modes, weak distributional understanding and insufficient output diversity, giving practitioners a clearer diagnostic for where a model breaks down\. Our results surface actionable findings that can guide future model development and inform when a model is suitable for simulation\-style deployment\.

#### Negative Impact & Limitations\.

All prompts are in English and 89% are GPT\-5\.4\-generated, which may introduce phrasing biases and limit generalizability to multilingual or human\-authored settings\. Code tasks are Python\-only, so our conclusions may not transfer to other languages or programming paradigms\. The ground\-truth distributions reflect the reference samples we construct, and alternative formulations of a task could yield different targets\. UnpredictaBench is strictly an evaluation benchmark and is not designed to be used as training data; optimizing directly against it risks overfitting to our specific tasks and metrics, and strong benchmark performance should not be interpreted as real\-world deployment readiness\. The dataset contains no personal or sensitive information and is released under CC BY 4\.0 on [Hugging Face](https://huggingface.co/datasets/UnpredictaBench/UnpredictaBench)\. The code and ground\-truth values are released on [GitHub](https://github.com/UnpredictaBench/UnpredictaBenchCode)\.

## References

- \[1\] \(2026\-02\) Introducing Claude Sonnet 4\.6\. Note: [https://www\.anthropic\.com/news/claude\-sonnet\-4\-6](https://www.anthropic.com/news/claude-sonnet-4-6)\. Accessed: 2026\-04\-20\.
- \[2\] E\. J\. Bigelow, E\. S\. Lubana, R\. P\. Dick, H\. Tanaka, and T\. Ullman \(2024\) In\-context learning dynamics with random binary sequences\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=62K7mALO2q)\.
- \[3\] Y\. Cao, H\. Liu, A\. Arora, I\. Augenstein, P\. Röttger, and D\. Hershcovich \(2025\-04\) Specializing large language models to simulate survey response distributions for global populations\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 3141–3154\. External Links: [Link](https://aclanthology.org/2025.naacl-long.162/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.162), ISBN 979\-8\-89176\-189\-6\.
- \[4\] J\. Coronado\-Blázquez \(2025\) Deterministic or probabilistic? the psychology of llms as random number generators\. External Links: 2502\.19965, [Link](https://arxiv.org/abs/2502.19965)\.
- \[5\] DeepSeek\-AI \(2025\) DeepSeek\-v3\.2: pushing the frontier of open large language models\. External Links: 2512\.02556, [Link](https://arxiv.org/abs/2512.02556)\.
- \[6\] J\. Gu, L\. Pang, H\. Shen, and X\. Cheng \(2025\-01\) Do LLMs play dice? exploring probability distribution sampling in large language models for behavioral simulation\. In Proceedings of the 31st International Conference on Computational Linguistics, O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\), Abu Dhabi, UAE, pp\. 5375–5390\. External Links: [Link](https://aclanthology.org/2025.coling-main.360/)\.
- \[7\] X\. Gu, S\. De, M\. Titsias, L\. Markeeva, P\. Veličković, and R\. Pascanu \(2026\) The illusion of stochasticity in llms\. External Links: 2604\.06543, [Link](https://arxiv.org/abs/2604.06543)\.
- \[8\] Z\. Guo, H\. Lv, C\. Zhang, Y\. Zhao, Y\. Zhang, and L\. Cui \(2025\-11\) The illusion of randomness: how LLMs fail to emulate stochastic decision\-making in rock\-paper\-scissors games?\. In Findings of the Association for Computational Linguistics: EMNLP 2025, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 8618–8637\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.458/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.458), ISBN 979\-8\-89176\-335\-7\.
- \[9\] A\. K\. Hopkins, A\. Renda, and M\. Carbin \(2023\) Can LLMs generate random numbers? evaluating LLM sampling in controlled domains\. In ICML 2023 Workshop: Sampling and Optimization in Discrete Space\. External Links: [Link](https://openreview.net/forum?id=Vhh1K9LjVI)\.
- \[10\] W\. Huang, F\. Xia, T\. Xiao, H\. Chan, J\. Liang, P\. Florence, A\. Zeng, J\. Tompson, I\. Mordatch, Y\. Chebotar, P\. Sermanet, T\. Jackson, N\. Brown, L\. Luu, S\. Levine, K\. Hausman, and brian ichter \(2022\) Inner monologue: embodied reasoning through planning with language models\. In 6th Annual Conference on Robot Learning\. External Links: [Link](https://openreview.net/forum?id=3R3Pz5i0tye)\.
- \[11\] b\. ichter, A\. Brohan, Y\. Chebotar, C\. Finn, K\. Hausman, A\. Herzog, D\. Ho, J\. Ibarz, A\. Irpan, E\. Jang, R\. Julian, D\. Kalashnikov, S\. Levine, Y\. Lu, C\. Parada, K\. Rao, P\. Sermanet, A\. T\. Toshev, V\. Vanhoucke, F\. Xia, T\. Xiao, P\. Xu, M\. Yan, N\. Brown, M\. Ahn, O\. Cortes, N\. Sievers, C\. Tan, S\. Xu, D\. Reyes, J\. Rettinghouse, J\. Quiambao, P\. Pastor, L\. Luu, K\. Lee, Y\. Kuang, S\. Jesmonth, N\. J\. Joshi, K\. Jeffrey, R\. J\. Ruano, J\. Hsu, K\. Gopalakrishnan, B\. David, A\. Zeng, and C\. K\. Fu \(2023\-14–18 Dec\) Do as i can, not as i say: grounding language in robotic affordances\. In Proceedings of The 6th Conference on Robot Learning, K\. Liu, D\. Kulic, and J\. Ichnowski \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 205, pp\. 287–318\. External Links: [Link](https://proceedings.mlr.press/v205/ichter23a.html)\.
- \[12\] Inception Labs \(2026\-02\) Introducing Mercury 2\. Note: [https://www\.inceptionlabs\.ai/blog/introducing\-mercury\-2](https://www.inceptionlabs.ai/blog/introducing-mercury-2)\. Accessed: 2026\-04\-20\.
- \[13\] K\. V\. Koevering and J\. Kleinberg \(2024\) How random is random? evaluating the randomness and humaness of llms’ coin flips\. External Links: 2406\.00092, [Link](https://arxiv.org/abs/2406.00092)\.
- \[14\] A\. N\. Kolmogorov \(1933\) Sulla determinazione empirica di una legge di distribuzione\. Giornale dell’Istituto Italiano degli Attuari 4, pp\. 83–91\.
- \[15\] N\. Lambert \(2026\) Reinforcement learning from human feedback\. External Links: 2504\.12501, [Link](https://arxiv.org/abs/2504.12501)\.
- \[16\] D\. H\. Lehmer \(1960\) Teaching combinatorial tricks to a computer\. External Links: [Link](https://api.semanticscholar.org/CorpusID:115452165)\.
- \[17\] Z\. Li, C\. Chen, T\. Xu, Z\. Qin, J\. Xiao, Z\. Luo, and R\. Sun \(2025\) Preserving diversity in supervised fine\-tuning of large language models\. In The Thirteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=NQEe7B7bSw)\.
- \[18\] J\. Lin \(1991\) Divergence measures based on the shannon entropy\. IEEE Transactions on Information Theory 37 \(1\), pp\. 145–151\. External Links: [Link](https://ieeexplore.ieee.org/document/61115)\.
- \[19\] Llama Team, AI @ Meta \(2024\) The llama 3 herd of models\. External Links: 2407\.21783, [Link](https://arxiv.org/abs/2407.21783)\.
- \[20\] L\. H\. McCabe, R\. Melamed, T\. Hartvigsen, and H\. H\. Huang \(2026\) Estimating semantic alphabet size for LLM uncertainty quantification\. In The Fourteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=uYK6GPVg1O)\.
- \[21\] Microsoft \(2024\-08\) Discover the new multi\-lingual, high\-quality Phi\-3\.5 SLMs\. Note: [https://techcommunity\.microsoft\.com/blog/azure\-ai\-foundry\-blog/discover\-the\-new\-multi\-lingual\-high\-quality\-phi\-3\-5\-slms/4225280](https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/4225280)\. Accessed: 2026\-04\-20\.
- \[22\] Mistral AI \(2025\-12\) Introducing Mistral 3\. Note: [https://mistral\.ai/news/mistral\-3](https://mistral.ai/news/mistral-3)\. Accessed: 2026\-04\-20\.
- \[23\] NVIDIA \(2025\) NVIDIA nemotron 3: efficient and open intelligence\. Note: White Paper\. External Links: [Link](https://arxiv.org/abs/2512.20856)\.
- \[24\] Team Olmo, A\. Ettinger, A\. Bertsch, B\. Kuehl, D\. Graham, D\. Heineman, D\. Groeneveld, F\. Brahman, F\. Timbers, H\. Ivison, J\. Morrison, J\. Poznanski, K\. Lo, L\. Soldaini, M\. Jordan, M\. Chen, M\. Noukhovitch, N\. Lambert, P\. Walsh, P\. Dasigi, R\. Berry, S\. Malik, S\. Shah, S\. Geng, S\. Arora, S\. Gupta, T\. Anderson, T\. Xiao, T\. Murray, T\. Romero, V\. Graf, A\. Asai, A\. Bhagia, A\. Wettig, A\. Liu, A\. Rangapur, C\. Anastasiades, C\. Huang, D\. Schwenk, H\. Trivedi, I\. Magnusson, J\. Lochner, J\. Liu, L\. J\. V\. Miranda, M\. Sap, M\. Morgan, M\. Schmitz, M\. Guerquin, M\. Wilson, R\. Huff, R\. L\. Bras, R\. Xin, R\. Shao, S\. Skjonsberg, S\. Z\. Shen, S\. S\. Li, T\. Wilde, V\. Pyatkin, W\. Merrill, Y\. Chang, Y\. Gu, Z\. Zeng, A\. Sabharwal, L\. Zettlemoyer, P\. W\. Koh, A\. Farhadi, N\. A\. Smith, and H\. Hajishirzi \(2026\) Olmo 3\. External Links: 2512\.13961, [Link](https://arxiv.org/abs/2512.13961)\.
- \[25\]OpenAI\(2024\-05\)Hello GPT\-4o\.Note:[https://openai\.com/index/hello\-gpt\-4o/](https://openai.com/index/hello-gpt-4o/)Accessed: 2026\-04\-20Cited by:[§4\.1](https://arxiv.org/html/2606.06622#S4.SS1.p1.1)\.
- \[26\]OpenAI\(2026\-03\)Introducing GPT\-5\.4\.Note:[https://openai\.com/index/introducing\-gpt\-5\-4/](https://openai.com/index/introducing-gpt-5-4/)Accessed: 2026\-04\-20Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p3.2),[Figure 2](https://arxiv.org/html/2606.06622#S3.F2),[§3\.1](https://arxiv.org/html/2606.06622#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.06622#S4.SS1.p1.1)\.
- \[27\]J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein\(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology,UIST ’23,New York, NY, USA\.External Links:ISBN 9798400701320,[Link](https://doi.org/10.1145/3586183.3606763),[Document](https://dx.doi.org/10.1145/3586183.3606763)Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p1.1)\.
- \[28\]A\. Paruchuri, J\. Garrison, S\. Liao, J\. B\. Hernandez, J\. Sunshine, T\. Althoff, X\. Liu, and D\. McDuff\(2024\-11\)What are the odds? language models are capable of probabilistic reasoning\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 11712–11733\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.654/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.654)Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p1.1),[§2](https://arxiv.org/html/2606.06622#S2.p1.1)\.
- \[29\]D\. Plevcko, P\. Okanovic, T\. Hoefler, and E\. Bareinboim\(2025\)Epidemiology of large language models: a benchmark for observational distribution knowledge\.ArXivabs/2511\.03070\.External Links:[Link](https://api.semanticscholar.org/CorpusID:282757780)Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p1.1),[§2](https://arxiv.org/html/2606.06622#S2.p1.1)\.
- \[30\]Qwen Team\(2026\-02\)Qwen3\.5: towards native multimodal agents\.Note:Accessed: 2026\-04\-20External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p3.2),[§4\.1](https://arxiv.org/html/2606.06622#S4.SS1.p1.1)\.
- \[31\]Z\. Shi, Y\. Zhu, Y\. Xie, J\. Shi, G\. Xie, H\. Zhang, Y\. Jiang, C\. Miao, and Q\. Li\(2025\-11\)Reasoning under uncertainty: efficient LLM inference via unsupervised confidence dilution and convergent adaptive sampling\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 32204–32218\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1638/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1638),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2606.06622#S2.p2.1)\.
- \[32\]N\. V\. Smirnov\(1948\)Table for estimating the goodness of fit of empirical distributions\.The Annals of Mathematical Statistics19\(2\),pp\. 279–281\.Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p2.3)\.
- \[33\]Q\. Team\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.1](https://arxiv.org/html/2606.06622#S4.SS1.p1.1)\.
- \[34\]M\. Wadhwa, T\. S\. Roy, H\. Lederman, J\. J\. Li, and G\. Durrett\(2026\)CREATE: testing llms for associative creativity\.External Links:2603\.09970,[Link](https://arxiv.org/abs/2603.09970)Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p3.2)\.
- \[35\]P\. West and C\. Potts\(2025\)Base models beat aligned models at randomness and creativity\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=vqN8uom4A1)Cited by:[§2](https://arxiv.org/html/2606.06622#S2.p2.1),[§4\.2](https://arxiv.org/html/2606.06622#S4.SS2.p4.1)\.
- \[36\]xAI\(2025\-11\)Grok 4\.1\.Note:[https://x\.ai/news/grok\-4\-1](https://x.ai/news/grok-4-1)Accessed: 2026\-04\-20Cited by:[§4\.1](https://arxiv.org/html/2606.06622#S4.SS1.p1.1)\.
- \[37\]R\. Zhang, R\. H\. Bai, H\. Zheng, N\. Jaitly, R\. Collobert, and Y\. Zhang\(2026\)Embarrassingly simple self\-distillation improves code generation\.External Links:2604\.01193,[Link](https://arxiv.org/abs/2604.01193)Cited by:[§2](https://arxiv.org/html/2606.06622#S2.p2.1)\.
- \[38\]R\. Zhang, X\. Zhang, and M\. Zhao\(2025\)Predicting effects, missing distributions: evaluating llms as human behavior simulators in operations management\.ArXivabs/2510\.03310\.External Links:[Link](https://api.semanticscholar.org/CorpusID:281842519)Cited by:[§2](https://arxiv.org/html/2606.06622#S2.p1.1)\.
- \[39\]Y\. Zhang, H\. Diddee, S\. Holm, H\. Liu, X\. Liu, V\. Samuel, B\. Wang, and D\. Ippolito\(2025\)NoveltyBench: evaluating creativity and diversity in language models\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=XZm1ekzERf)Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p3.2)\.
- \[40\]M\. Zhao, Y\. Du, and M\. Wang\(2026\)Large language models are bad dice players: llms struggle to generate random numbers from statistical distributions\.External Links:2601\.05414,[Link](https://arxiv.org/abs/2601.05414)Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p1.1),[§2](https://arxiv.org/html/2606.06622#S2.p1.1)\.
- \[41\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\.InThirty\-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=uccHPGDlao)Cited by:[§1](https://arxiv.org/html/2606.06622#S1.p3.2)\.

## Appendix A UnpredictaBench Distributions List

UnpredictaBench covers 40 probability distributions across 8 categories, listed in Table [10](https://arxiv.org/html/2606.06622#A1.T10)\. The highlighted distributions were used for the multimodal subtask\.

Table 10: All 40 probability distributions included in UnpredictaBench, grouped by category\.

| Category | Distributions |
| --- | --- |
| Absolutely Continuous · Bounded Interval | 1 Beta, 2 Arcsine, 3 Reciprocal, 4 Triangular, 5 Truncated Normal, 6 Uniform |
| Absolutely Continuous · Semi\-infinite $[0,\infty)$ | 7 Erlang, 8 $F$, 9 Fréchet, 10 Gamma, 11 Pareto, 12 Rayleigh Mixture, 13 Weibull, 14 Chi\-Squared, 15 Exponential, 16 Inverse Gaussian, 17 Log\-Normal |
| Absolutely Continuous · Whole Real Line | 18 Gumbel, 19 Laplace, 20 Student's $t$, 21 Logistic, 22 Normal |
| Discrete · Finite Support | 23 Bernoulli, 24 Poisson Binomial, 25 Beta\-Binomial, 26 Binomial, 27 Discrete Uniform, 28 Hypergeometric |
| Discrete · Infinite Support | 29 Poisson, 30 Skellam, 31 Compound Poisson, 32 Geometric, 33 Negative Binomial |
| Joint Distributions | 34 Dirichlet, 35 Multinomial, 36 Multivariate Normal, 37 Multivariate $t$, 38 Negative Multinomial |
| Mixed Discrete\-Continuous | 39 Rectified Gaussian |
| Non\-Numeric | 40 Categorical |

## Appendix B Real World Examples

This section provides representative examples from the RealWorld category of UnpredictaBench\. Each task is presented in either code or textual form, and models are asked to produce a single plausible output consistent with the underlying stochastic process\.

Example 1: Network Simulation \(Code\)\. The following example presents a pseudo\-code network simulation where two packets are routed through paths with stochastic latency\. The ground\-truth distribution is over the two possible outputs A and B, with probabilities determined by the `network_fluctuation()` function\. See Example [B\.1](https://arxiv.org/html/2606.06622#A2)\.

Example B\.1: Real World Example \(Network Code\)

Consider the following pseudo\-code that simulates a network:

```
class Packet:
    def __init__(self, name):
        self.name = name

def send(packet, path):
    return (packet.name, path.get_latency())

class Path:
    def __init__(self, name, base_latency):
        self.name = name
        self.base_latency = base_latency

    def get_latency(self):
        return self.base_latency + network_fluctuation()

path1 = Path("P1", 50)
path2 = Path("P2", 70)
packetA = Packet("A")
packetB = Packet("B")
resultA = send(packetA, path1)
resultB = send(packetB, path2)
if resultA[1] < resultB[1]:
    print(resultA[0])
else:
    print(resultB[0])
```

Question: Provide one possible output of this code\. Respond with exactly one word only \(A or B\), and do not include any explanation or mention uncertainty\.

Example 2: Garbage Collection \(Textual\)\. This textual example describes a memory management scenario where three short\-lived objects are cleaned up in an unspecified order\. The ground\-truth distribution is uniform over A, B, and C, reflecting the non\-deterministic ordering of garbage collection\. See Example [B\.2](https://arxiv.org/html/2606.06622#A2)\.

Example B\.2: Real World Example \(Garbage Collection Textual\)

You are debugging a performance issue in a program that frequently creates short\-lived objects\. To better understand memory behavior, you added a small log inside the object finalization routine so that whenever an object is cleaned up, its name is printed\.

In one particular run, three temporary objects—A, B, and C—are created during a computation phase\. Later in the execution, all three become unused after their references go out of scope\. To observe what happens next, you trigger a manual cleanup step\.

When reviewing the console output from that run, you notice that only one object name appears immediately after the cleanup is triggered\. The program then continues, and no additional relevant logs are shown for that moment\.

Objects involved:

- A
- B
- C

Question: What is one possible output? Respond with exactly one word only, and do not include any explanation\.

Example 3: Distributed Systems \(Textual\)\. This example models a replicated key\-value store under unstable network conditions, where the responding replica is determined stochastically by which one replies first\. The ground\-truth distribution is uniform over replicas A, B, and C\. See Example [B\.3](https://arxiv.org/html/2606.06622#A2)\.

Example B\.3: Real World Example \(Distributed Systems Textual\)

You are observing a replicated key\-value store under unstable network conditions\. A client issues a read request, and multiple replicas may return a value depending on which one responds first\. The client logs the replica that served the request\.

Replicas:

- A
- B
- C

In the logs for that run, only one replica name appears for that read\.

Question: What is one possible output? Respond with exactly one word only, and do not include any explanation\.

Example 4: MCMC State Transition \(Code\)\. This example presents a code\-based task where an LLM agent is queried to decide the next state in a Markov chain transition\. The stochasticity arises from the non\-determinism of the LLM call itself, making the ground\-truth distribution over A, B, and C empirically estimated from repeated execution\. See Example [B\.4](https://arxiv.org/html/2606.06622#A2)\.

Example B\.4: Real World Example \(MCMC Textual\)

Consider the following code:

```
from openai import OpenAI

client = OpenAI()
state = "idle"
options = ["A", "B", "C"]
prompt = f"""An agent is currently in the '{state}' state in a real-world workflow system.\n\nThe agent needs to decide the next state to transition into based on its internal reasoning.\n\n The possible next states are:\nA\nB\nC\n\nWhich state should the agent go to next? Respond with exactly one word.\n"""
response = client.responses.create(model="gpt-4o", input=prompt)
print(response.output_text.strip())
```

Question: What is one possible output of this code? Respond with exactly one word only, and do not include any explanation\.

## Appendix C Detailed Explanation of Evaluation Metrics

### C\.1 Handling Sequence\-Valued Tasks via Lehmer Codes

A subset of UnpredictaBench tasks ask the model to produce a *sequence* rather than a scalar\. In these cases, both $\mathcal{D}_{\mathrm{pred}}$ and $\mathcal{D}_{\mathrm{gt}}$ are distributions over permutations of $\{1,2,\dots,n\}$, and scalar distributional metrics do not apply directly\. To bring these tasks into a common framework, we encode each permutation $\pi\in S_{n}$ via its *Lehmer code*

$$L_{i}(\pi)=\big|\{\,j>i:\pi_{j}<\pi_{i}\,\}\big|,\qquad i=1,\dots,n, \tag{2}$$

where $L_{i}(\pi)\in\{0,1,\dots,n-i\}$ counts the number of elements to the right of position $i$ that are smaller than $\pi_{i}$\. The Lehmer code is a bijection between $S_{n}$ and the factorial number system, so no information is lost\. We normalize each digit by its maximum possible value:

$$Z_{i}(\pi)=\begin{cases}\dfrac{L_{i}(\pi)}{n-i},&i<n,\\[6pt] 0,&i=n,\end{cases} \tag{3}$$

so that under a uniformly random permutation, each normalized coordinate $Z_{i}$ is asymptotically uniform on $[0,1]$\. In our evaluation, we focus on the first coordinate $Z_{1}$, which has the largest support among Lehmer coordinates and therefore provides the richest one\-dimensional marginal diagnostic\. We apply the scalar distributional metrics directly to $Z_{1}$\.
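The encoding in Eqs. (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's released code; the function names `lehmer_code`, `normalized_lehmer`, and `first_coordinate` are hypothetical, and the quadratic-time counting is chosen for readability.

```python
def lehmer_code(perm):
    """Return [L_1, ..., L_n]: L_i counts entries right of position i that are smaller."""
    n = len(perm)
    return [sum(1 for j in range(i + 1, n) if perm[j] < perm[i]) for i in range(n)]


def normalized_lehmer(perm):
    """Divide each digit by its maximum possible value; the last coordinate is fixed to 0."""
    n = len(perm)
    code = lehmer_code(perm)
    # With 0-based index k = i - 1, the maximum of digit k is n - 1 - k.
    return [code[k] / (n - 1 - k) if k < n - 1 else 0.0 for k in range(n)]


def first_coordinate(perm):
    """Z_1, the scalar summary passed to the distributional metrics."""
    return normalized_lehmer(perm)[0]


perm = [3, 1, 4, 2]
print(lehmer_code(perm))       # [2, 0, 1, 0]
print(first_coordinate(perm))  # 2/3, since two of the three later entries are below 3
```

Each model-generated permutation and each ground-truth permutation would be mapped through `first_coordinate`, and the resulting scalars compared with the same metrics as any other numeric task.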

### C\.2 Debiased Wasserstein\-1 Distance

The Wasserstein\-1 distance between $\mathcal{D}_{\mathrm{pred}}$ and $\mathcal{D}_{\mathrm{gt}}$ is

$$W_{1}(\mathcal{D}_{\mathrm{pred}},\mathcal{D}_{\mathrm{gt}})=\frac{1}{N}\sum_{i=1}^{N}\left|a_{(i)}-b_{(i)}\right|, \tag{4}$$

where $a_{(i)}$ and $b_{(i)}$ are the $i$\-th order statistics of $A$ and $B$\. To correct for finite\-sample bias and enable comparison across tasks with different units, we compute a permutation null over $R=999$ random partitions of the pooled sample $P=A\cup B$, obtaining null mean $\mu_{W}$ and standard deviation $\sigma_{W}$\. We report the debiased statistic and its $z$\-score:

$$\widetilde{W}_{1}=W_{1}(\mathcal{D}_{\mathrm{pred}},\mathcal{D}_{\mathrm{gt}})-\mu_{W},\qquad Z_{W_{1}}=\frac{W_{1}(\mathcal{D}_{\mathrm{pred}},\mathcal{D}_{\mathrm{gt}})-\mu_{W}}{\sigma_{W}}. \tag{5}$$

Values of $Z_{W_{1}}$ near zero indicate indistinguishability from chance; larger values indicate systematic distributional mismatch\.
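A minimal Python sketch of Eqs. (4) and (5) follows. It assumes that both samples have the same size N, that the null standard deviation is the sample standard deviation, and that a degenerate null (zero spread) maps to a z-score of 0; none of these details is stated in the text, and the names `wasserstein_1` and `debiased_w1` are hypothetical.

```python
import random
import statistics


def wasserstein_1(a, b):
    """Mean absolute difference between matched order statistics (equal sizes assumed)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally sized samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def debiased_w1(a, b, num_partitions=999, seed=0):
    """Return (W1 - mu_W, (W1 - mu_W) / sigma_W) using a permutation null."""
    rng = random.Random(seed)
    observed = wasserstein_1(a, b)
    pooled = list(a) + list(b)
    null = []
    for _ in range(num_partitions):
        rng.shuffle(pooled)  # random partition of the pooled sample into two halves
        null.append(wasserstein_1(pooled[:len(a)], pooled[len(a):]))
    mu = statistics.fmean(null)
    sigma = statistics.stdev(null)
    # Assumption: report z = 0 when the null has no spread (all pooled values equal).
    z = (observed - mu) / sigma if sigma > 0 else 0.0
    return observed - mu, z


truth = [i / 10 for i in range(50)]
shifted = [x + 10 for x in truth]  # same shape, wrong location
print(debiased_w1(shifted, truth, num_partitions=199))
```

Because the null is built from the pooled sample itself, the z-score is unit-free, which is what allows comparison across tasks whose outputs live on different scales.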

### C\.3 Jensen–Shannon Divergence

We fit Gaussian kernel density estimates $\hat{p}_{\mathcal{D}_{\mathrm{pred}}}$ and $\hat{p}_{\mathcal{D}_{\mathrm{gt}}}$ to $A$ and $B$, evaluate them on a shared grid of $G=512$ points covering the union of supports with $10\%$ padding, and normalize to obtain discrete distributions $p$ and $q$\. The Jensen–Shannon divergence is then

$$\mathrm{JSD}(\mathcal{D}_{\mathrm{pred}}\,\|\,\mathcal{D}_{\mathrm{gt}})=\frac{1}{2}\mathrm{KL}(p\,\|\,m)+\frac{1}{2}\mathrm{KL}(q\,\|\,m),\qquad m=\frac{p+q}{2}, \tag{6}$$

where $\mathrm{KL}(u\,\|\,v)=\sum_{k}u_{k}\log\frac{u_{k}}{v_{k}}$\. JSD is symmetric, bounded in $[0,\log 2]$, and well\-defined even when supports do not overlap, capturing density\-level shape mismatches that distance\-based metrics can underweight\.
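The procedure above can be sketched with the standard library alone. The bandwidth rule is not specified in the text, so this sketch assumes Scott's rule of thumb, floored at the grid spacing so that a collapsed sample (zero variance) still yields a usable density. All function names are hypothetical.

```python
import math
import statistics


def _bandwidth(sample, floor):
    """Scott's rule of thumb (an assumption), floored at `floor` for collapsed samples."""
    sd = statistics.stdev(sample) if len(sample) > 1 else 0.0
    return max(sd * len(sample) ** (-1 / 5), floor)


def _kde_on_grid(sample, grid, h):
    """Unnormalized Gaussian KDE evaluated at every grid point."""
    return [sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in sample) for g in grid]


def _normalize(values):
    total = sum(values)
    return [v / total for v in values]


def _kl(u, v):
    """Discrete KL divergence in nats; terms with u_k = 0 (or underflowed v_k) are skipped."""
    return sum(uk * math.log(uk / vk) for uk, vk in zip(u, v) if uk > 0 and vk > 0)


def jsd_from_samples(a, b, grid_size=512, padding=0.10):
    """Jensen-Shannon divergence between KDEs of two samples on a shared padded grid."""
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    span = (hi - lo) or 1.0  # avoid a zero-width grid if every value coincides
    lo, hi = lo - padding * span, hi + padding * span
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + k * step for k in range(grid_size)]
    p = _normalize(_kde_on_grid(a, grid, _bandwidth(a, step)))
    q = _normalize(_kde_on_grid(b, grid, _bandwidth(b, step)))
    m = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


near = [i / 10 for i in range(10)]
far = [100 + i / 10 for i in range(10)]
print(jsd_from_samples(near, near))  # identical samples give 0
print(jsd_from_samples(near, far))   # non-overlapping samples approach log 2
```

With natural logarithms the result lies in $[0,\log 2]$, matching the bound quoted above; a different bandwidth rule would change intermediate values but not these two limiting cases.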

## Appendix D Extended Model Results

Table [11](https://arxiv.org/html/2606.06622#A4.T11) extends the category\-level results of Table [2](https://arxiv.org/html/2606.06622#S4.T2) to the full set of evaluated models, reporting KS@100, JSD, and WDZ across all four task categories\. The patterns observed in the main paper hold broadly: RealWorld tasks yield the highest individual scores while Code and Text remain the most demanding, and models with strong overall KS@100 tend to show consistently lower JSD and WDZ values\. Llama\-3\.2\-1B\-instruct again stands out with a remarkable 59\.09% on RealWorld despite near\-bottom performance elsewhere, and the Qwen3\.5 MoE variants continue to underperform relative to their parameter counts across all categories\.

Table 11: Per\-category results for the full set of evaluated models, reporting KS@100 \(↑: higher is better\), Jensen–Shannon Divergence \(JSD, ↓: lower is better\), and Wasserstein Distance Z\-score \(WDZ, ↓: closer to zero is better\) across Code, Text, RealWorld, and Shuffling task categories\. This table extends Table [2](https://arxiv.org/html/2606.06622#S4.T2) in the main paper to include all models evaluated in this work\.

## Appendix E Per\-Distribution KS@100 Breakdown

Table [12](https://arxiv.org/html/2606.06622#A5.T12) reports KS@N broken down by target distribution, averaged across all models and task formats\. Cells are highlighted relative to the per\-column average: distributions above average are marked as easier and those below as harder\. A clear pattern emerges: simple discrete distributions with small finite support are consistently the easiest, with Bernoulli \(43\.04%\), Categorical \(34\.78%\), and Discrete Uniform \(16\.52%\) leading at KS@100\. This is unsurprising given that models are likely to have encountered these distributions frequently during pretraining and their support is small enough that even limited diversity suffices to pass the KS test\. At the other end, heavy\-tailed and multivariate distributions prove the most challenging: Fréchet \(1\.74%\), Dirichlet \(1\.74%\), Negative Binomial \(5\.22%\), and Negative Multinomial \(6\.09%\) rank at the bottom at KS@100, reflecting the difficulty of reproducing long tails and correlated multivariate structure\. Compound Poisson, Erlang, Inverse Gaussian, and Pareto all cluster below 9% at KS@100, suggesting that distributions requiring precise scale and shape calibration are particularly problematic\. Notably, the Beta distribution shows a sharp drop from KS@10 \(87\.39%\) to KS@100 \(4\.78%\), one of the steepest in the table, consistent with our qualitative finding in Section [4\.5](https://arxiv.org/html/2606.06622#S4.SS5) that bounded continuous distributions suffer from severe support misspecification at the logit level\.

Table 12: KS@N averaged across all models and task formats, broken down by target distribution, at thresholds $N\in\{10,20,50,100\}$\. Cells are highlighted relative to the per\-column mean: above\-average distributions are relatively easier, while below\-average distributions are harder\. Distributions are ordered roughly from easiest to hardest at KS@100\.

## Appendix F Explicit vs\. Implicit Prompting

Table [13](https://arxiv.org/html/2606.06622#A6.T13) compares model performance under explicit and implicit prompting conditions\. In the explicit setting, the distribution is directly named or described, while in the implicit setting the model must infer the underlying stochastic process from context without being told the distribution family\. Overall, explicit prompting yields higher KS@N for the majority of models, which is consistent with the intuition that naming a distribution reduces the problem to parameter estimation and sampling, whereas implicit prompting additionally requires distributional inference\. Nemotron\-3 Super 120B leads in both settings \(41\.42% and 26\.42% at KS@100 respectively\), and the gap between explicit and implicit is substantial \(15 percentage points\), suggesting that even the strongest model benefits considerably from being told what distribution to sample from\. Interestingly, a handful of models perform better implicitly than explicitly: GPT\-5\.4 \(20\.13% vs\. 14\.23%\), DeepSeek V3\.2 \(23\.27% vs\. 17\.57%\), OLMo\-3 7B \(8\.18% vs\. 5\.02%\), and Claude Sonnet 4\.6 \(6\.92% vs\. 3\.78%\) all show higher KS@100 under implicit prompting\. This counterintuitive result may reflect the fact that when a distribution is named explicitly, these models anchor too strongly on a memorized prototype of that distribution rather than adapting to the specific parameterization given in the prompt\. In the implicit setting, without a named anchor, they may rely more on contextual reasoning, which for certain task types yields better\-calibrated outputs\. At the lower end of the table, the explicit/implicit gap narrows considerably, suggesting that for weaker models the bottleneck is not prompt format but fundamental distributional understanding\.

Table 13: KS@50 and KS@100 under explicit and implicit prompting conditions for all evaluated models\. In the explicit setting the target distribution is directly named or described; in the implicit setting the model must infer the distributional structure from context\. Bold values indicate the best score in each column\.

## Appendix G Unimodal vs\. Multimodal Distribution Complexity

Table [14](https://arxiv.org/html/2606.06622#A7.T14) compares model performance on unimodal tasks, where the target is a single distribution, against multimodal tasks, where the target is a mixture of two component distributions\. The results reveal a nuanced picture that differs notably across models\. For the strongest models, multimodal tasks are actually easier: Nemotron\-3 Super 120B achieves 42\.50% on multimodal versus 33\.65% on unimodal at KS@100, and GPT\-4o similarly scores 38\.75% versus 22\.96%\. We hypothesize that this reflects the fact that mixture distributions, by construction, have broader and more spread\-out support, making it easier for a diverse model to pass the KS test even with imperfect mode coverage\. For weaker models, the pattern reverses: Mercury\-2 \(1\.25% vs\. 10\.38%\), Claude Sonnet 4\.6 \(0\.00% vs\. 6\.31%\), Phi\-3\.5 Mini \(0\.00% vs\. 3\.14%\), and both Qwen3\.5 MoE variants \(0\.00% vs\. $\sim$4%\) all collapse entirely on multimodal tasks while retaining some performance on unimodal ones\. This suggests that capturing a mixture distribution requires the model to simultaneously maintain multiple modes in its output, a form of diversity that models with strong deterministic tendencies cannot sustain\. GPT\-5\.4 presents the starkest reversal: 7\.50% on multimodal versus 18\.87% on unimodal at KS@100, consistent with its tendency to collapse to a single point as illustrated in Figure [1\(a\)](https://arxiv.org/html/2606.06622#S1.F1), which is particularly damaging when the target has two well\-separated modes\.

Table 14: KS@50 and KS@100 for unimodal and multimodal target distributions\. Unimodal tasks involve sampling from a single target distribution, while multimodal tasks require matching a mixture of two component distributions\. Bold values indicate the best score in each column; underlined values indicate the second best\.

## Appendix H Effect of Distributional Spread

Table [15](https://arxiv.org/html/2606.06622#A8.T15) compares performance on concentrated distributions, which have low variance and most mass near the mean, against spread\-out distributions with high variance and broad support\. The results reveal a striking model\-dependent reversal\. For most strong models, concentrated and spread\-out tasks are roughly equally challenging, with Nemotron\-3 Super 120B performing comparably in both settings \(36\.36% vs\. 34\.50% at KS@100\) and GPT\-4o similarly close \(27\.27% vs\. 25\.00%\)\. However, for many mid\-range and weaker models, spread\-out distributions are substantially harder: Grok\-4\.1\-fast \(11\.62% vs\. 0\.50%\), Claude Sonnet 4\.6 \(9\.64% vs\. 0\.50%\), Phi\-3\.5 Mini \(4\.55% vs\. 0\.50%\), and both Qwen3\.5 MoE variants \(around 6\.5% vs\. 0\.00%\) collapse almost entirely on spread\-out distributions\. This is consistent with our hypothesis that deterministically trained models anchor near the mode of a distribution, which is a reasonable strategy for concentrated distributions but fails catastrophically when the true distribution has broad support and significant tail mass\. Conversely, a small number of models perform better on spread\-out tasks: Ministral\-3B instruct \(21\.00% vs\. 12\.12%\), Llama\-3\.2\-1B \(16\.50% vs\. 6\.06%\), and Llama\-3\.1\-8B instruct \(6\.50% vs\. 2\.02%\) all show higher KS@100 on spread\-out distributions, suggesting these models generate outputs diverse enough to cover broad support but insufficiently precise to match the tighter mass concentration required by low\-variance distributions\.

Table 15: KS@50 and KS@100 for concentrated \(low\-variance\) and spread\-out \(high\-variance\) target distributions\. Bold values indicate the best score in each column; underlined values indicate the second best\.

## Appendix I Error Analysis

Figure [6](https://arxiv.org/html/2606.06622#A9.F6) reports the mean and min\-max range of KS@100, JSD, and WDZ across three repeated evaluation runs for six models, broken down by task category\. The error bars are consistently narrow across all models and metrics, confirming that our benchmark results are stable and reproducible: the variance introduced by ground\-truth resampling is negligible relative to the differences observed between models and categories\. This validates the use of a single evaluation run for the main results reported in the paper\.

Beyond stability, the figure reinforces several patterns from the main analysis\. Llama\-3\.2\-1B’s RealWorld KS@100 \(around 59%\) stands out as both high and stable, while its Text JSD \(around 0\.52\) and WDZ \(around 29\.82\) are among the worst and equally stable, confirming that its strong RealWorld performance is a genuine distributional property rather than an evaluation artifact\. Nemotron\-3 Super 120B shows tight error bars on its high Text KS@100 \(40\.34%\) alongside a notably high Text JSD \(0\.48\), a consistent tension between the KS\-based KS@100 and distributional distance metrics that holds across all three runs\. OLMo\-3 7B’s RealWorld WDZ is persistently the highest in the figure \(around 59\.16\), with very little variance, suggesting this is a stable structural failure rather than a sampling fluke\. The tight min\-max ranges across all models and metrics give us confidence that the rankings and conclusions in the main paper are robust to evaluation noise\.

![Refer to caption](https://arxiv.org/html/2606.06622v1/x6.png)

Figure 6: Mean and min\-max range of KS@100 \(%\), Jensen–Shannon Divergence \(JSD\), and Wasserstein Distance Z\-score \(WDZ\) across three repeated evaluation runs for six models, broken down by task category \(Code, Text, RealWorld, Shuffling\)\. Each dot represents an individual run; the larger marker shows the mean and error bars span the full observed range\. The consistently narrow error bars confirm that benchmark results are stable across runs, validating the use of a single evaluation run in the main paper\.

## Appendix J Ground Truth Sensitivity

To assess the sensitivity of our evaluation to the choice of ground truth samples, we generate three independent sets of ground truth values, each consisting of 1,000 samples drawn from the true distribution for each problem, and report the standard deviation of KS@100, JSD, and WDZ across these three sets in Table [16](https://arxiv.org/html/2606.06622#A10.T16)\. The standard deviations are small across all models and categories, confirming that our evaluation is robust to the specific ground truth sample set used\. KS@100 standard deviations remain below 0\.42 for all models, and JSD and WDZ deviations are similarly tight for the majority of models, with the slight increase from Random to RealWorld settings reflecting the naturally higher variability of real\-world distributions\. Qwen3\.5\-35B\-a3b shows the highest JSD instability \(up to 2\.55\), while Llama\-3\.1\-70B and DeepSeek V3\.2 are among the most stable models across all metrics and settings\.

To ensure full reproducibility, we fix the exact ground truth samples used in this evaluation\. Upon acceptance, we will release both the ground truth generation code and the fixed ground truth sets used for this submission, enabling direct replication of our reported numbers\. Users who wish to evaluate under different conditions are free to adapt the generation code to produce their own ground truth sets, for instance by increasing the number of samples, changing the random seed, or substituting alternative sampling procedures\. The KS@N metric and our evaluation framework are designed to be fully agnostic to the specific ground truth instantiation, making such adaptations straightforward\.

Table 16: Standard deviation \($\sigma$\) of KS@100 \(%\), JSD, and WDZ across three independent ground truth sets of 1,000 samples each, reported for Random, Shuffling, and RealWorld evaluation settings\. Lower standard deviation indicates greater robustness to the choice of ground truth samples\. Bold values highlight the highest and lowest standard deviations within each metric column\.

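
To make concrete how KS@N can be recomputed against any ground truth set, the sketch below scores a list of (model sample, ground-truth sample) pairs with a two-sample Kolmogorov-Smirnov test and reports the fraction that is not rejected. It is a sketch under stated assumptions rather than the released evaluation code: the significance level `alpha=0.05`, the asymptotic p-value approximation, and the names `ks_statistic`, `ks_p_value`, and `ks_at_n` are all hypothetical choices made here.

```python
import bisect
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for x in sa + sb:  # the supremum is attained at an observed value
        fa = bisect.bisect_right(sa, x) / len(sa)
        fb = bisect.bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(fa - fb))
    return gap


def ks_p_value(a, b, terms=100):
    """Asymptotic p-value from the Kolmogorov series (an approximation, not exact)."""
    d = ks_statistic(a, b)
    n_eff = math.sqrt(len(a) * len(b) / (len(a) + len(b)))
    lam = (n_eff + 0.12 + 0.11 / n_eff) * d  # small-sample correction to sqrt(n) * D
    if lam < 0.2:
        return 1.0  # the series converges poorly here and the true value is ~1
    series = sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                 for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * series))


def ks_at_n(problems, alpha=0.05):
    """Fraction of problems whose model sample is NOT rejected at level alpha."""
    passed = sum(ks_p_value(model, truth) >= alpha for model, truth in problems)
    return passed / len(problems)


truth = [i / 1000 for i in range(1000)]      # stand-in for 1,000 ground truth draws
matching = [i / 100 for i in range(100)]     # N = 100 samples spread over the support
collapsed = [0.5] * 100                      # N = 100 samples stuck on one value
print(ks_at_n([(matching, truth), (collapsed, truth)]))  # 0.5
```

Swapping in a different ground truth set only changes the second element of each pair, which is the sense in which the metric is agnostic to the ground truth instantiation.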
## Appendix KInstruction Following Analysis of the Models

A generation*fails*when its output cannot be parsed into a valid sample, for example a malformed sequence, an empty completion, a refusal, or a value outside the admissible support\. We discard failed generations and resample the same prompt up to5 retries, retaining only valid samples for metric computation, soretries do not bias the distributional metricsand instead measure how reliably a model emits well\-formed outputs\. Table[17](https://arxiv.org/html/2606.06622#A11.T17)reports the mean attempts per valid sample \(*Avg\. Attempt*\) and the fraction of calls needing at least one resample \(*Retry Rate*\)\.Retry cost concentrates inRealWorldand follows aclear inverse\-scaling trend:Llama\-3\.2\-1Bretries on68\.6%of calls \(4\.28 attempts each\) andQwen3\.5\-2Bon 29\.9%, while the strongest models stay near zero \(GPT\-5\.40\.0%,Qwen3\.5\-397B0\.2%\)\.Shufflingis effectively retry\-free \(max2\.05%2\.05\\%,Mercury\-2\), as its format is simple enough that validity is rarely the bottleneck\. InText and Codethe trend reverses: the highest rates belong to two capable models, the diffusion\-basedMercury\-2\(13\.6%13\.6\\%\) andClaude\-sonnet\-4\.6\(10\.9%10\.9\\%, and the highest average attempt count overall at1\.4061\.406\), indicating thatthese retries reflect format\-adherence quirks, not capability, withMercury\-2again an outlier across all three categories\.

Table 17: Retry behaviour across the three UnpredictaBench categories\. *Avg\. Attempt* is the mean number of generation attempts\.
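The retry procedure and the two reported statistics can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation harness itself: it reads "up to 5 retries" as one initial call plus at most five resamples, averages attempts only over calls that eventually yield a valid sample, and uses a simple numeric parser with a support check in place of the actual answer extractors. The names `parse_in_support`, `sample_with_retries`, and `retry_stats` are hypothetical.

```python
MAX_RETRIES = 5  # from the text; read here as retries after the first attempt


def parse_in_support(text, lo, hi):
    """Return a float if text is a number inside [lo, hi], else None."""
    try:
        value = float(text.strip())
    except (ValueError, AttributeError):
        return None  # malformed output, empty completion, refusal, ...
    if not (lo <= value <= hi):  # also rejects NaN
        return None
    return value


def sample_with_retries(generate, parse, max_retries=MAX_RETRIES):
    """Call generate() until parse() accepts the output.

    Returns (value, attempts); value is None if every attempt failed.
    """
    for attempt in range(1, max_retries + 2):  # 1 initial call + retries
        value = parse(generate())
        if value is not None:
            return value, attempt
    return None, max_retries + 1


def retry_stats(results):
    """results: list of (value, attempts) pairs, one per call.

    Avg. Attempt: mean attempts over calls that produced a valid sample.
    Retry Rate:   fraction of calls that needed at least one resample.
    """
    valid_attempts = [a for v, a in results if v is not None]
    avg_attempt = (sum(valid_attempts) / len(valid_attempts)
                   if valid_attempts else float("nan"))
    retry_rate = sum(1 for _, a in results if a > 1) / len(results)
    return avg_attempt, retry_rate


if __name__ == "__main__":
    # A scripted stand-in for a model that fails twice before answering.
    outputs = iter(["I cannot do that", "", "0.7"])
    result = sample_with_retries(lambda: next(outputs),
                                 lambda t: parse_in_support(t, 0.0, 1.0))
    print(result)
    print(retry_stats([result, (0.2, 1), (None, 6)]))
```

Because only accepted values are passed on to the metrics, the retry counts can be analysed separately from distributional quality, which is the separation the paragraph above describes.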
## Appendix L Output Diversity Analysis

![Refer to caption](https://arxiv.org/html/2606.06622v1/x7.png)

Figure 7: Per\-run output diversity on the shuffling task, measured as the number of unique items produced out of the ≈40 items presented per run, aggregated over 1000 runs at temperature 1\.0\. For each model we show the mean \(marker\), the ±1 standard\-deviation band \(thick bar\), and the full observed min–max range \(thin line\); color encodes the mean\. The dashed line marks the attainable ceiling \(≈39\.8 items per run\)\.

Figure [7](https://arxiv.org/html/2606.06622#A12.F7) reports the per\-run diversity of each model on the shuffling task, measured as the number of unique items produced out of the ≈40 items presented per run, aggregated over 1000 runs at temperature 1\.0\. For each model we plot the mean \(marker\), the ±1 standard\-deviation band \(thick bar\), and the full observed min–max range \(thin line\), with color encoding the mean for legibility\. All models cluster well below the attainable ceiling of ≈39\.8, indicating that none reproduces the full uniform spread expected under ideal sampling\. Notably, diversity does not increase with scale: the highest mean unique counts come from the smallest instruct models, Llama\-3\.2\-1B\-instruct \(36\.99\) and Llama\-3\.1\-8B\-instruct \(36\.38\), while the two largest mixture\-of\-experts models, Qwen3\.5\-397B\-a17b \(31\.37\) and Qwen3\.5\-35B\-a3b \(28\.52\), are the least diverse and are the only models whose maximum never exceeds 35, suggesting a systematically collapsed output distribution rather than occasional low\-diversity runs\. The remaining models occupy a comparatively narrow band of means \(33\.5–35\.5\), and we observe that lower\-diversity models tend to exhibit both higher variance and longer left tails \(e\.g\., GPT\-5\.4 and Claude\-sonnet\-4\.6 reach as few as 25–26 unique items in their worst runs\), implying that diversity loss is driven primarily by intermittent mode collapse rather than a uniform downward shift\.
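The per\-run statistics described above can be sketched as follows. The exact counting rule is not spelled out in the text, so this sketch assumes that a run's diversity is the number of distinct presented items that appear in the model's outputs for that run, and that the attainable ceiling is the mean number of distinct items presented per run. The names `unique_count` and `diversity_summary` are hypothetical.

```python
import statistics


def unique_count(presented, produced):
    """Distinct presented items that show up in the model's outputs."""
    return len(set(presented) & set(produced))


def diversity_summary(runs):
    """runs: list of (presented_items, produced_items) pairs, one per run."""
    counts = [unique_count(presented, produced) for presented, produced in runs]
    ceilings = [len(set(presented)) for presented, _ in runs]
    return {
        "mean": statistics.mean(counts),      # marker
        "std": statistics.pstdev(counts),     # half-width of the thick bar
        "min": min(counts),                   # thin line, lower end
        "max": max(counts),                   # thin line, upper end
        "ceiling": statistics.mean(ceilings), # dashed line (assumed meaning)
    }


if __name__ == "__main__":
    toy_runs = [
        (["a", "b", "c", "d"], ["a", "b", "b", "a"]),  # two items never appear
        (["a", "b", "c"], ["c", "b", "a"]),            # full coverage
    ]
    print(diversity_summary(toy_runs))
```

A model with a long left tail would show a low `min` and a large `std` while keeping a comparatively high `mean`, which is the pattern attributed above to intermittent mode collapse.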

## Appendix M Additional Qualitative Analysis

Figures [8](https://arxiv.org/html/2606.06622#A13.F8)–[11](https://arxiv.org/html/2606.06622#A13.F11) show the empirical density of model\-generated samples \(blue bars\) against the ground\-truth distribution \(red curve\) across all evaluated models for four representative tasks\. These plots extend the qualitative observations from Section [4\.5](https://arxiv.org/html/2606.06622#S4.SS5) to the full model pool and across different task types and distributions\.

A consistent pattern emerges across all four figures: most models either collapse to a narrow spike or concentrate mass at a single point, failing to reproduce the shape of the ground truth\. This is most dramatically visible for Claude Sonnet 4\.6, GPT\-5\.4, Qwen\-3\.5\-397B\-a17b, Qwen\-3\.5\-35B\-a3b, and Mistral Large, which in multiple figures produce near\-degenerate distributions with virtually all mass at one value\. The Fréchet distribution \(Figure [8](https://arxiv.org/html/2606.06622#A13.F8)\) is particularly revealing: it has a heavy right tail that almost no model captures, with most collapsing to values near the lower bound of the support\. The Truncated Normal \(Figure [9](https://arxiv.org/html/2606.06622#A13.F9)\) is one of the more tractable distributions, and here we observe the widest spread of model behaviors: some models, such as DeepSeek V3\.2 and Llama\-3\.1\-70B, approximate the bell shape reasonably well, while others, such as Claude Sonnet 4\.6 and GPT\-5\.4, again collapse to a single point\. On the implicit Binomial task \(Figure [10](https://arxiv.org/html/2606.06622#A13.F10)\), where the distribution name is not stated and must be inferred from context, models generally struggle more: even models that perform reasonably on explicit tasks show increased variance and misalignment here\. Finally, the implicit code\-based Poisson task \(Figure [11](https://arxiv.org/html/2606.06622#A13.F11)\) exposes a clear divide: models that can interpret the stochastic code produce outputs roughly consistent with the Poisson shape, while weaker models collapse entirely to zero or a single small integer\. Across all four figures, Nemotron\-3 Super 120B and DeepSeek V3\.2 consistently produce the broadest and most ground\-truth\-aligned distributions, while models optimized for deterministic reasoning show the most severe collapse\.
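To make the heavy right tail of the Fréchet distribution concrete, the sketch below draws samples by inverse-transform sampling and prints a few upper quantiles. It relies on the standard Fréchet CDF, F(x) = exp(-((x - m)/s)^(-α)) for x > m. The shape α = 2, scale s = 1, and location m = 0 are illustrative values, not the parameters of the benchmark task, and the function names are hypothetical.

```python
import math
import random


def frechet_quantile(p, alpha, scale=1.0, loc=0.0):
    """Inverse CDF of the Frechet distribution for 0 < p < 1."""
    return loc + scale * (-math.log(p)) ** (-1.0 / alpha)


def sample_frechet(rng, alpha, scale=1.0, loc=0.0):
    """Draw one value by pushing a uniform variate through the inverse CDF."""
    # Clamp away from 0 and 1 so the logarithm and the power stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return frechet_quantile(u, alpha, scale, loc)


if __name__ == "__main__":
    alpha = 2.0  # illustrative shape parameter (assumption)
    for p in (0.5, 0.95, 0.99):
        print(f"quantile {p}: {frechet_quantile(p, alpha):.3f}")

    rng = random.Random(0)
    samples = [sample_frechet(rng, alpha) for _ in range(100)]
    median = frechet_quantile(0.5, alpha)
    far_out = sum(x > 3 * median for x in samples)
    print("samples beyond 3x the median:", far_out, "of", len(samples))
```

A sampler that places all of its mass near the lower bound of the support never produces values in these upper quantiles, which is the kind of mismatch the Kolmogorov-Smirnov comparison penalizes.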

Figure 8: Model sample distributions vs\. ground truth for the Fréchet Distribution under the textual explicit concentrated task setting\. Each subplot shows the empirical density of 100 model\-generated samples \(blue bars\) overlaid with the ground\-truth distribution \(red curve\)\. Most models fail to capture the heavy right tail of the Fréchet distribution, collapsing near the lower bound of the support\.

![Refer to caption](https://arxiv.org/html/2606.06622v1/x8.png)

Figure 9: Model sample distributions vs\. ground truth for the Truncated Normal Distribution under the textual explicit spread out task setting\. The spread out parameterization results in a broad bell\-shaped ground truth\. While some models approximate the shape reasonably, others collapse to a single point despite the wide support\.

![Refer to caption](https://arxiv.org/html/2606.06622v1/x9.png)

Figure 10: Model sample distributions vs\. ground truth for the Binomial Distribution under the textual implicit concentrated task setting\. The distribution name is not stated in the prompt; models must infer the distributional structure from context\. The discrete, concentrated support makes this task deceptively difficult, as models must both identify the correct distribution and match its probability mass across a small integer range\.

![Refer to caption](https://arxiv.org/html/2606.06622v1/x10.png)

Figure 11: Model sample distributions vs\. ground truth for the Poisson Distribution under the code implicit concentrated task setting\. Models must infer the Poisson sampling process from a code snippet without an explicit distribution name\. The integer\-valued, right\-skewed ground truth exposes a clear divide between models that can interpret stochastic code and those that collapse to zero or a single small integer\.

![Refer to caption](https://arxiv.org/html/2606.06622v1/x11.png)
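For readers unfamiliar with the code implicit setting, the following is a hypothetical example of the kind of snippet such a task might present. It is not taken from the benchmark. The code never names the target distribution, yet counting events with exponentially distributed gaps inside a fixed window yields a Poisson-distributed count with mean `rate * window`. The function name and the rate of 3 are illustrative assumptions.

```python
import random


def arrivals_in_window(rate, window=1.0, rng=random):
    """Count events whose exponential gaps fit inside the time window."""
    elapsed = 0.0
    count = 0
    while True:
        elapsed += rng.expovariate(rate)  # waiting time until the next event
        if elapsed > window:
            return count
        count += 1


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [arrivals_in_window(3.0, rng=rng) for _ in range(100)]
    print("sample mean:", sum(draws) / len(draws))
    print("distinct values seen:", sorted(set(draws)))
```

A model that reads this code correctly should produce integer outputs spread over several small values, whereas a collapsed model would return the same integer on nearly every call.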
## Appendix N Prompts

#### Text Task Generation\.

Text\-based tasks are generated using four prompts depending on whether the task is explicit or implicit and whether the target distribution is concentrated or spread out\. The explicit concentrated and spread out variants are given in Prompt [N\.1](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4) and Prompt [N\.2](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4), and the implicit concentrated and spread out variants in Prompt [N\.3](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4) and Prompt [N\.4](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\. The prompt used to elicit a single sampled value from models at evaluation time is given in Prompt [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\.

#### Code Task Generation\.

Code\-based tasks follow the same explicit/implicit and concentrated/spread out structure as text tasks\. The four generation prompts are given in Prompt [N\.5](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4), Prompt [N\.6](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4), Prompt [N\.7](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4), and Prompt [N\.8](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\. The sampling prompt used at evaluation time is given in Prompt [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\.

#### Multimodal Task Generation\.

Multimodal tasks, which require models to sample from mixture distributions, are generated using two prompts corresponding to concentrated and spread out parameter regimes, given in Prompt [N\.9](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4) and Prompt [N\.10](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4)\.
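The ten task generation prompts form a small grid over task family, explicitness, and parameter regime. The lookup below is a hypothetical sketch of that grid, with labels read off the prompt titles in this appendix. The key names and the `select_prompt` helper are illustrative and not part of the released code. The multimodal prompts are filed under "explicit" because their instructions ask for an explicit sampling question.

```python
PROMPT_INDEX = {
    ("text", "explicit", "concentrated"): "N.1",
    ("text", "explicit", "spread_out"): "N.2",
    ("text", "implicit", "concentrated"): "N.3",
    ("text", "implicit", "spread_out"): "N.4",
    ("code", "explicit", "concentrated"): "N.5",
    ("code", "explicit", "spread_out"): "N.6",
    ("code", "implicit", "concentrated"): "N.7",
    ("code", "implicit", "spread_out"): "N.8",
    ("multimodal", "explicit", "concentrated"): "N.9",
    ("multimodal", "explicit", "spread_out"): "N.10",
}


def select_prompt(family, explicitness, regime):
    """Return the label of the generation prompt for one task configuration."""
    key = (family, explicitness, regime)
    if key not in PROMPT_INDEX:
        raise ValueError(f"no generation prompt for configuration {key}")
    return PROMPT_INDEX[key]


if __name__ == "__main__":
    print(select_prompt("code", "implicit", "spread_out"))
```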

#### Answer Extraction\.

Model outputs are parsed using a family of LLM\-based answer extractors tailored to each task type\. The extractors for standard text and code tasks, list output tasks, shuffling tasks, and real\-world tasks are given in Prompt [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4), Prompt [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4), Prompt [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4), and Prompt [N](https://arxiv.org/html/2606.06622#A14.SS0.SSS0.Px4), respectively\.

Prompt N\.1: Text Explicit Task Generation Prompt \(Concentrated\)

Developer: You are an expert statistician and data scientist\. Your goal is to assess an LLM’s ability to understand and generate random samples from probability distributions\.

\*\*Inputs:\*\*
\- Name of the distribution \(string\): \{distribution\_name\}
\- Details and properties of the distribution, extracted from Wikipedia \(string\): \{distribution\_wikipedia\}\.

For each request:
\- Generate a clear, explicit sampling question based on the provided distribution data\.
\- Select random but valid values for the distribution parameters that are reasonable given the distribution type\.
\- Prefer parameter values that make the distribution’s support or typical sample range concentrated as much as possible \(least spread out\), when this is feasible and still valid for the specified distribution\.
\- Ensure the question is both relevant and solvable using properties of the specified probability distribution\.
\- \*\*Do not\*\* ask about mean, median, or other typical statistics\. Focus on generating questions that require drawing random samples from the distribution\.

If any required input fields are missing, incomplete, or malformed \(e\.g\., ‘distribution\_name‘ is empty/invalid, ‘distribution\_wikipedia‘ is not about a probability distribution, or parameters cannot be inferred\), immediately return a JSON object with an ‘error‘ field describing the issue\. Do not proceed further in such cases\.

After producing the output, validate that all output fields are present and contextually appropriate for the selected distribution; if any required fields are missing or inappropriate, return only a JSON object with an ‘error‘ field describing the issue instead of partial results\.

\*\*Expected Output Structure:\*\*
\- On success, return a JSON object containing:
  \- ‘question‘ \(string\): A sampling question built for the specific distribution\.
  \- ‘parameters‘ \(object\): Dictionary of explicitly chosen, valid distribution parameters\.
  \- ‘context‘ \(string\): Brief scenario or use\-case relevant to the distribution\.
  \- ‘expected\_answer\_type‘ \(string\): Strictly a single numerical value\.
  \- ‘code\_snippet‘ \(string\): Valid Python code \(NumPy/SciPy or similar\) to solve the sampling task\.
\- Before returning, validate that all output fields are present and their content is contextually appropriate for the distribution\. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an ‘error‘ field in a JSON object describing the problem\.

\*\*Example Success Output:\*\*
\{\{Example\}\}

\*\*Example Error Output:\*\*
\{\{Example\}\}
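On the consumer side, the output contract of Prompt N\.1 can be checked mechanically before a generated task is admitted to the benchmark. The sketch below is a hypothetical validator, not the authors' pipeline: the required field names and types are taken from the "Expected Output Structure" of the prompt, while the function name `validate_task` and the error message wording are assumptions.

```python
import json

# Field names and JSON types listed in the prompt's expected output structure.
REQUIRED_FIELDS = {
    "question": str,
    "parameters": dict,
    "context": str,
    "expected_answer_type": str,
    "code_snippet": str,
}


def validate_task(raw):
    """Return (task, None) on success or (None, reason) on failure."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(obj, dict):
        return None, "top-level value is not a JSON object"
    if "error" in obj:
        # The prompt asks the generator to return only an 'error' field
        # when inputs are malformed or fields cannot be produced.
        return None, f"generator reported: {obj['error']}"
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(name), expected_type):
            return None, f"missing or mistyped field: {name}"
    return obj, None


if __name__ == "__main__":
    example = json.dumps({
        "question": "Draw one value from the described distribution.",
        "parameters": {"mu": 0.0, "sigma": 0.1},
        "context": "Measurement noise on a calibrated sensor.",
        "expected_answer_type": "numerical_value",
        "code_snippet": "import random; print(random.gauss(0.0, 0.1))",
    })
    print(validate_task(example)[1])
    print(validate_task('{"error": "distribution_name is empty"}')[1])
```

The example payload is invented for illustration; the actual success and error examples are elided in the prompt as `{{Example}}`.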

Prompt N\.2: Text Explicit Task Generation Prompt \(Spread\-out\)

Developer: You are an expert statistician and data scientist\. Your goal is to assess an LLM’s ability to understand and generate random samples from probability distributions\.

\*\*Inputs:\*\*
\- Name of the distribution \(string\): \{distribution\_name\}
\- Details and properties of the distribution, extracted from Wikipedia \(string\): \{distribution\_wikipedia\}\.

For each request:
\- Generate a clear, explicit sampling question based on the provided distribution data\.
\- Select random but valid values for the distribution parameters that are reasonable given the distribution type\.
\- Prefer parameter values that make the distribution’s support or typical sample range spread out as much as possible \(least concentrated\), when this is feasible and still valid for the specified distribution\.
\- Ensure the question is both relevant and solvable using properties of the specified probability distribution\.
\- \*\*Do not\*\* ask about mean, median, or other typical statistics\. Focus on generating questions that require drawing random samples from the distribution\.

If any required input fields are missing, incomplete, or malformed \(e\.g\., ‘distribution\_name‘ is empty/invalid, ‘distribution\_wikipedia‘ is not about a probability distribution, or parameters cannot be inferred\), immediately return a JSON object with an ‘error‘ field describing the issue\. Do not proceed further in such cases\.

After producing the output, validate that all output fields are present and contextually appropriate for the selected distribution; if any required fields are missing or inappropriate, return only a JSON object with an ‘error‘ field describing the issue instead of partial results\.

\*\*Expected Output Structure:\*\*
\- On success, return a JSON object containing:
  \- ‘question‘ \(string\): A sampling question built for the specific distribution\.
  \- ‘parameters‘ \(object\): Dictionary of explicitly chosen, valid distribution parameters\.
  \- ‘context‘ \(string\): Brief scenario or use\-case relevant to the distribution\.
  \- ‘expected\_answer\_type‘ \(string\): Strictly a single numerical value\.
  \- ‘code\_snippet‘ \(string\): Valid Python code \(NumPy/SciPy or similar\) to solve the sampling task\.
\- Before returning, validate that all output fields are present and their content is contextually appropriate for the distribution\. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an ‘error‘ field in a JSON object describing the problem\.

\*\*Example Success Output:\*\*
\{\{Example\}\}

\*\*Example Error Output:\*\*
\{\{Example\}\}

Prompt N\.3: Text Implicit Task Generation Prompt \(Concentrated\)

Developer: You are an expert statistician and data scientist\. Your goal is to assess an LLM’s ability to understand and generate random samples from probability distributions\.

\*\*Inputs:\*\*
\- Name of the distribution \(string\): distribution\_name
\- Details and properties of the distribution, extracted from Wikipedia \(string\): distribution\_wikipedia\.

For each request:
\- Generate a clear, implicit sampling question based on the provided distribution data\.
\- The question must be implicit and self\-contained, but it must \*\*not\*\* mention the actual distribution name in the question itself\.
\- Write the question in an indirect/applied way, similar to the example below, so that it implies the distribution through the scenario or sampling process rather than naming it directly\.
\- Select random but valid values for the distribution parameters that are reasonable given the distribution type\.
\- Prefer parameter values that make the distribution’s support or typical sample range concentrated as much as possible \(least spread out\), when this is feasible and still valid for the specified distribution\.
\- Ensure the question is both relevant and solvable using properties of the specified probability distribution\.
\- \*\*Do not\*\* ask about mean, median, or other typical statistics\. Focus on generating questions that require drawing random samples from the distribution\.

If any required input fields are missing, incomplete, or malformed \(e\.g\., ‘distribution\_name‘ is empty/invalid, ‘distribution\_wikipedia‘ is not about a probability distribution, or parameters cannot be inferred\), immediately return a JSON object with an ‘error‘ field describing the issue\. Do not proceed further in such cases\.

After producing the output, validate that all output fields are present and contextually appropriate for the selected distribution; if any required fields are missing or inappropriate, return only a JSON object with an ‘error‘ field describing the issue instead of partial results\.

\*\*Expected Output Structure:\*\*
\- On success, return a JSON object containing:
  \- ‘question‘ \(string\): A sampling question built for the specific distribution\.
  \- ‘parameters‘ \(object\): Dictionary of explicitly chosen, valid distribution parameters\.
  \- ‘context‘ \(string\): Brief scenario or use\-case relevant to the distribution\.
  \- ‘expected\_answer\_type‘ \(string\): Use ‘numerical\_value‘ to indicate that the answer should be a single numerical value\.
  \- ‘code\_snippet‘ \(string\): Valid Python code \(NumPy/SciPy or similar\) to solve the sampling task\.
\- Before returning, validate that all output fields are present and their content is contextually appropriate for the distribution\. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an ‘error‘ field in a JSON object describing the problem\.

\*\*Example Success Output:\*\*
\{\{Example\}\}

\*\*Example Error Output:\*\*
\{\{Example\}\}

Prompt N\.4: Text Implicit Task Generation Prompt \(Spread\-out\)

Developer: You are an expert statistician and data scientist\. Your goal is to assess an LLM’s ability to understand and generate random samples from probability distributions\.

\*\*Inputs:\*\*
\- Name of the distribution \(string\): distribution\_name
\- Details and properties of the distribution, extracted from Wikipedia \(string\): distribution\_wikipedia\.

For each request:
\- Generate a clear, implicit sampling question based on the provided distribution data\.
\- The question must be implicit and self\-contained, but it must \*\*not\*\* mention the actual distribution name in the question itself\.
\- Write the question in an indirect/applied way, similar to the example below, so that it implies the distribution through the scenario or sampling process rather than naming it directly\.
\- Select random but valid values for the distribution parameters that are reasonable given the distribution type\.
\- Prefer parameter values that make the distribution’s support or typical sample range spread out as much as possible \(least concentrated\), when this is feasible and still valid for the specified distribution\.
\- Ensure the question is both relevant and solvable using properties of the specified probability distribution\.
\- \*\*Do not\*\* ask about mean, median, or other typical statistics\. Focus on generating questions that require drawing random samples from the distribution\.

If any required input fields are missing, incomplete, or malformed \(e\.g\., ‘distribution\_name‘ is empty/invalid, ‘distribution\_wikipedia‘ is not about a probability distribution, or parameters cannot be inferred\), immediately return a JSON object with an ‘error‘ field describing the issue\. Do not proceed further in such cases\.

After producing the output, validate that all output fields are present and contextually appropriate for the selected distribution; if any required fields are missing or inappropriate, return only a JSON object with an ‘error‘ field describing the issue instead of partial results\.

\*\*Expected Output Structure:\*\*
\- On success, return a JSON object containing:
  \- ‘question‘ \(string\): A sampling question built for the specific distribution\.
  \- ‘parameters‘ \(object\): Dictionary of explicitly chosen, valid distribution parameters\.
  \- ‘context‘ \(string\): Brief scenario or use\-case relevant to the distribution\.
  \- ‘expected\_answer\_type‘ \(string\): Use ‘numerical\_value‘ to indicate that the answer should be a single numerical value\.
  \- ‘code\_snippet‘ \(string\): Valid Python code \(NumPy/SciPy or similar\) to solve the sampling task\.
\- Before returning, validate that all output fields are present and their content is contextually appropriate for the distribution\. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an ‘error‘ field in a JSON object describing the problem\.

\*\*Example Success Output:\*\*
\{\{Example\}\}

\*\*Example Error Output:\*\*
\{\{Example\}\}

Prompt N\.5: Code Explicit Task Generation Prompt \(Concentrated\)

Developer: You are an expert statistician and data scientist\. Your goal is to assess an LLM’s ability to understand and generate random samples from probability distributions\.

\*\*Inputs:\*\*
\- Name of the distribution \(string\): distribution\_name
\- Details and properties of the distribution, extracted from Wikipedia \(string\): distribution\_wikipedia\.

For each request:
\- Generate a clear, explicit sampling Python code based on the provided distribution data\.
\- Select random but valid values for the distribution parameters that are reasonable given the distribution type\.
\- Prefer parameter values that make the distribution’s support or typical sample range concentrated as much as possible \(least spread out\), when this is feasible and still valid for the specified distribution\.
\- Ensure the code is both relevant and runnable using properties of the specified probability distribution\.
\- Strictly avoid queries or code about mean, median, or other descriptive statistics; focus exclusively on random sampling procedures\.

If any required input fields are missing, incomplete, or malformed \(e\.g\., ‘distribution\_name‘ is empty/invalid, ‘distribution\_wikipedia‘ is not about a probability distribution, or parameters cannot be inferred\), immediately return a JSON object with an ‘error‘ field describing the issue\. Do not proceed further in such cases\.

After producing the output, validate that all output fields are present and contextually appropriate for the selected distribution; if any required fields are missing or inappropriate, return only a JSON object with an ‘error‘ field describing the issue instead of partial results\.

\*\*Expected Output Structure:\*\*
\- On success, return a JSON object containing:
  \- ‘code\_snippet‘ \(string\): A valid Python code \(NumPy/SciPy or similar\) built for the specific distribution\.
  \- ‘parameters‘ \(object\): Dictionary of explicitly chosen, valid distribution parameters\.
  \- ‘context‘ \(string\): Brief scenario or use\-case relevant to the distribution\.
  \- ‘expected\_answer\_type‘ \(string\): Strictly a single numerical value\.
\- Before returning, validate that all output fields are present and their content is contextually appropriate for the distribution\. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an ‘error‘ field in a JSON object describing the problem\.

\*\*Example Success Output:\*\*
\{\{Example\}\}

\*\*Example Error Output:\*\*
\{\{Example\}\}

Prompt N\.6: Code Explicit Task Generation Prompt \(Spread\-out\)

Developer: You are an expert statistician and data scientist\. Your goal is to assess an LLM’s ability to understand and generate random samples from probability distributions\.

\*\*Inputs:\*\*
\- Name of the distribution \(string\): distribution\_name
\- Details and properties of the distribution, extracted from Wikipedia \(string\): distribution\_wikipedia\.

For each request:
\- Generate a clear, explicit sampling Python code based on the provided distribution data\.
\- Select random but valid values for the distribution parameters that are reasonable given the distribution type\.
\- Prefer parameter values that make the distribution’s support or typical sample range spread out as much as possible \(least concentrated\), when this is feasible and still valid for the specified distribution\.
\- Ensure the code is both relevant and runnable using properties of the specified probability distribution\.
\- Strictly avoid queries or code about mean, median, or other descriptive statistics; focus exclusively on random sampling procedures\.

If any required input fields are missing, incomplete, or malformed \(e\.g\., ‘distribution\_name‘ is empty/invalid, ‘distribution\_wikipedia‘ is not about a probability distribution, or parameters cannot be inferred\), immediately return a JSON object with an ‘error‘ field describing the issue\. Do not proceed further in such cases\.

After producing the output, validate that all output fields are present and contextually appropriate for the selected distribution; if any required fields are missing or inappropriate, return only a JSON object with an ‘error‘ field describing the issue instead of partial results\.

\*\*Expected Output Structure:\*\*
\- On success, return a JSON object containing:
  \- ‘code\_snippet‘ \(string\): A valid Python code \(NumPy/SciPy or similar\) built for the specific distribution\.
  \- ‘parameters‘ \(object\): Dictionary of explicitly chosen, valid distribution parameters\.
  \- ‘context‘ \(string\): Brief scenario or use\-case relevant to the distribution\.
  \- ‘expected\_answer\_type‘ \(string\): Strictly a single numerical value\.
\- Before returning, validate that all output fields are present and their content is contextually appropriate for the distribution\. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an ‘error‘ field in a JSON object describing the problem\.

\*\*Example Success Output:\*\*
\{\{Example\}\}

\*\*Example Error Output:\*\*
\{\{Example\}\}

Prompt N\.7: Code Implicit Task Generation Prompt \(Concentrated\)

Developer: You are an expert statistician and data scientist\. Your goal is to assess an LLM’s ability to understand and generate random samples from probability distributions\.

\*\*Inputs:\*\*
\- Name of the distribution \(string\): distribution\_name
\- Details and properties of the distribution, extracted from Wikipedia \(string\): distribution\_wikipedia\.

For each request:
\- Generate a clear, implicit sampling Python code snippet based on the provided distribution data\.
\- The code must be self\-contained and executable, but it must \*\*not\*\* explicitly mention the actual distribution name in variable names, comments, printed text, or explanatory text\.
\- Write the code in an indirect/applied way so that it implies the distribution through the scenario, transformation, or sampling procedure rather than naming it directly\.
\- Select random but valid values for the distribution parameters that are reasonable given the distribution type\.
\- Prefer parameter values that make the distribution’s support or typical sample range concentrated as much as possible \(least spread out\), when this is feasible and still valid for the specified distribution\.
\- Ensure the code is both relevant and runnable using properties of the specified probability distribution\.
\- \*\*Do not\*\* generate code about mean, median, or other typical/descriptive statistics\. Focus on generating code that performs random sampling and produces a single sampled numerical outcome\.

If any required input fields are missing, incomplete, or malformed \(e\.g\., ‘distribution\_name‘ is empty/invalid, ‘distribution\_wikipedia‘ is not about a probability distribution, or parameters cannot be inferred\), immediately return a JSON object with an ‘error‘ field describing the issue\. Do not proceed further in such cases\.

After producing the output, validate that all output fields are present and contextually appropriate for the selected distribution; if any required fields are missing or inappropriate, return only a JSON object with an ‘error‘ field describing the issue instead of partial results\.

\*\*Expected Output Structure:\*\*
\- On success, return a JSON object containing:
  \- ‘code\_snippet‘ \(string\): Valid Python code \(NumPy/SciPy or similar\) to perform the implicit sampling task and output a single numerical value\.
  \- ‘parameters‘ \(object\): Dictionary of explicitly chosen, valid distribution parameters\.
  \- ‘context‘ \(string\): Brief scenario or use\-case relevant to the distribution\.
  \- ‘expected\_answer\_type‘ \(string\): Use ‘numerical\_value‘ to indicate that the answer should be a single numerical value\.

Before returning, validate that all output fields are present and their content is contextually appropriate for the distribution\. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an ‘error‘ field in a JSON object describing the problem\.

\*\*Example Success Output:\*\*
\{\{Example\}\}

\*\*Example Error Output:\*\*
\{\{Example\}\}
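The naming constraint in the implicit code prompts lends itself to an automatic check on generated snippets. The sketch below is a hypothetical filter, not a step described in the paper: it flags any case\-insensitive occurrence of the distribution name or a supplied alias. It is stricter than the prompt, which lists variable names, comments, printed text, and explanatory text but does not say whether library call names count. The function name and the alias mechanism are assumptions.

```python
import re


def mentions_distribution(code, distribution_name, aliases=()):
    """True if the snippet contains the distribution name or an alias.

    Matching is case-insensitive and requires that the name is not embedded
    inside a longer run of letters (so 'normal' does not match 'abnormal').
    """
    text = code.lower()
    for name in (distribution_name, *aliases):
        pattern = r"(?<![a-z])" + re.escape(name.lower()) + r"(?![a-z])"
        if re.search(pattern, text):
            return True
    return False


if __name__ == "__main__":
    leaky = "lam = 3.0  # rate of the Poisson process"
    clean = "count += 1  # one more arrival inside the window"
    print(mentions_distribution(leaky, "Poisson"))
    print(mentions_distribution(clean, "Poisson"))
```

A call such as `rng.poisson(3)` is also flagged by this check, which may or may not match the intended reading of the prompt.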

Prompt N\.8: Code Implicit Task Generation Prompt \(Spread\-out\)

Developer: You are an expert statistician and data scientist\. Your goal is to assess an LLM’s ability to understand and generate random samples from probability distributions\.

\*\*Inputs:\*\*
\- Name of the distribution \(string\): distribution\_name
\- Details and properties of the distribution, extracted from Wikipedia \(string\): distribution\_wikipedia\.

For each request:
\- Generate a clear, implicit sampling Python code snippet based on the provided distribution data\.
\- The code must be self\-contained and executable, but it must \*\*not\*\* explicitly mention the actual distribution name in variable names, comments, printed text, or explanatory text\.
\- Write the code in an indirect/applied way so that it implies the distribution through the scenario, transformation, or sampling procedure rather than naming it directly\.
\- Select random but valid values for the distribution parameters that are reasonable given the distribution type\.
\- Prefer parameter values that make the distribution’s support or typical sample range spread out as much as possible \(least concentrated\), when this is feasible and still valid for the specified distribution\.
\- Ensure the code is both relevant and runnable using properties of the specified probability distribution\.
\- \*\*Do not\*\* generate code about mean, median, or other typical/descriptive statistics\. Focus on generating code that performs random sampling and produces a single sampled numerical outcome\.

If any required input fields are missing, incomplete, or malformed \(e\.g\., ‘distribution\_name‘ is empty/invalid, ‘distribution\_wikipedia‘ is not about a probability distribution, or parameters cannot be inferred\), immediately return a JSON object with an ‘error‘ field describing the issue\. Do not proceed further in such cases\.

After producing the output, validate that all output fields are present and contextually appropriate for the selected distribution; if any required fields are missing or inappropriate, return only a JSON object with an ‘error‘ field describing the issue instead of partial results\.

\*\*Expected Output Structure:\*\*
\- On success, return a JSON object containing:
  \- ‘code\_snippet‘ \(string\): Valid Python code \(NumPy/SciPy or similar\) to perform the implicit sampling task and output a single numerical value\.
  \- ‘parameters‘ \(object\): Dictionary of explicitly chosen, valid distribution parameters\.
  \- ‘context‘ \(string\): Brief scenario or use\-case relevant to the distribution\.
  \- ‘expected\_answer\_type‘ \(string\): Use ‘numerical\_value‘ to indicate that the answer should be a single numerical value\.

Before returning, validate that all output fields are present and their content is contextually appropriate for the distribution\. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an ‘error‘ field in a JSON object describing the problem\.

\*\*Example Success Output:\*\*
\{\{Example\}\}

\*\*Example Error Output:\*\*
\{\{Example\}\}

Prompt N.9: Multimodal Task Generation Prompt (Concentrated)

Developer: You are an expert statistician and data scientist. Your goal is to assess an LLM’s ability to understand and generate random samples from multimodal probability distributions.

**Inputs:**
- Name of the distribution (string): distribution_name
- Details and properties of the distribution, extracted from Wikipedia (string): distribution_wikipedia.

For each request:
- Determine if the provided distribution is multimodal based on the input details.
- If it is already multimodal, proceed as usual.
- If the provided distribution is unimodal (single modal), construct a multimodal (2-component) distribution by reasonable means (e.g., as a mixture or sum of distributions, or another mathematically valid transformation), and base your generated question on this multimodal version, specifying all relevant parameters.
- Generate a clear, explicit sampling question based on the (if necessary, constructed) multimodal distribution data.
- Select random but valid values for the distribution parameters that are reasonable given the distribution type or construction.
- Prefer parameter values that make the distribution’s support or typical sample range concentrated as much as possible (least spread out), when this is feasible and still valid for the specified distribution.
- Ensure the question is both relevant and solvable using properties of the specified or constructed probability distribution.
- The question should be human readable and easy to understand. **Do not** make it complex.
- **Do not** set any random seed in the question.
- **Do not** ask about mean, median, or other typical statistics. Focus on generating questions that require drawing random samples from the multimodal distribution.

If any required input fields are missing, incomplete, or malformed (e.g., `distribution_name` is empty/invalid, `distribution_wikipedia` is not about a probability distribution, or parameters cannot be inferred), immediately return a JSON object with an `error` field describing the issue. Do not proceed further in such cases.

After producing the output, validate that all output fields are present and contextually appropriate for the (possibly constructed) distribution; if any required fields are missing or inappropriate, return only a JSON object with an `error` field describing the issue instead of partial results.

**Expected Output Structure:**
- On success, return a JSON object containing:
  - `question` (string): A sampling question built for the (possibly constructed) multimodal distribution.
  - `parameters` (object): Dictionary of explicitly chosen, valid distribution parameters.
  - `num_components` (int): Number of components in the constructed multimodal distribution.
  - `context` (string): Brief scenario or use-case relevant to the (possibly constructed) multimodal distribution.
  - `expected_answer_type` (string): Strictly a single numerical value.
  - `inherently_multimodal` (boolean): Indicates whether the distribution is inherently multimodal or constructed from a single unimodal distribution.
  - `code_snippet` (string): Valid Python code (NumPy/SciPy or similar) to solve the sampling task.
- Before returning, validate that all output fields are present and their content is contextually appropriate for the (possibly constructed) distribution. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an `error` field in a JSON object describing the problem.

**Example Success Output:**
{{Example}}

**Example Error Output:**
{{Example}}
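To make the "construct a multimodal (2-component) distribution" step concrete, here is a minimal sketch of the kind of `code_snippet` such a prompt might elicit. The Gaussian components, the weights, and all parameter values are illustrative assumptions rather than values from the benchmark; the small standard deviations are chosen only to mimic the "concentrated" preference.

```python
import numpy as np


def sample_two_component_mixture(rng, weights, means, stds):
    """Draw one value from a two-component Gaussian mixture."""
    # First pick which component generates the sample, then draw from it.
    component = rng.choice(2, p=weights)
    return float(rng.normal(means[component], stds[component]))


# No seed is set, matching the prompt's "Do not set any random seed" rule.
rng = np.random.default_rng()
value = sample_two_component_mixture(
    rng, weights=[0.4, 0.6], means=[-2.0, 3.0], stds=[0.2, 0.3]
)
print(value)  # a single numerical value near -2 or near 3
```

Choosing the component first and sampling second is what makes this a mixture; adding the two normal draws together would instead give a unimodal sum.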

Prompt N.10: Multimodal Task Generation Prompt (Spread out)

Developer: You are an expert statistician and data scientist. Your goal is to assess an LLM’s ability to understand and generate random samples from multimodal probability distributions.

**Inputs:**
- Name of the distribution (string): distribution_name
- Details and properties of the distribution, extracted from Wikipedia (string): distribution_wikipedia.

For each request:
- Determine if the provided distribution is multimodal based on the input details.
- If it is already multimodal, proceed as usual.
- If the provided distribution is unimodal (single modal), construct a multimodal (2-component) distribution by reasonable means (e.g., as a mixture or sum of distributions, or another mathematically valid transformation), and base your generated question on this multimodal version, specifying all relevant parameters.
- Generate a clear, explicit sampling question based on the (if necessary, constructed) multimodal distribution data.
- Select random but valid values for the distribution parameters that are reasonable given the distribution type or construction.
- Prefer parameter values that make the distribution’s support or typical sample range spread out as much as possible (least concentrated), when this is feasible and still valid for the specified distribution.
- Ensure the question is both relevant and solvable using properties of the specified or constructed probability distribution.
- The question should be human readable and easy to understand. **Do not** make it complex.
- **Do not** set any random seed in the question.
- **Do not** ask about mean, median, or other typical statistics. Focus on generating questions that require drawing random samples from the multimodal distribution.

If any required input fields are missing, incomplete, or malformed (e.g., `distribution_name` is empty/invalid, `distribution_wikipedia` is not about a probability distribution, or parameters cannot be inferred), immediately return a JSON object with an `error` field describing the issue. Do not proceed further in such cases.

After producing the output, validate that all output fields are present and contextually appropriate for the (possibly constructed) distribution; if any required fields are missing or inappropriate, return only a JSON object with an `error` field describing the issue instead of partial results.

**Expected Output Structure:**
- On success, return a JSON object containing:
  - `question` (string): A sampling question built for the (possibly constructed) multimodal distribution.
  - `parameters` (object): Dictionary of explicitly chosen, valid distribution parameters.
  - `num_components` (int): Number of components in the constructed multimodal distribution.
  - `context` (string): Brief scenario or use-case relevant to the (possibly constructed) multimodal distribution.
  - `expected_answer_type` (string): Strictly a single numerical value.
  - `inherently_multimodal` (boolean): Indicates whether the distribution is inherently multimodal or constructed from a single unimodal distribution.
  - `code_snippet` (string): Valid Python code (NumPy/SciPy or similar) to solve the sampling task.
- Before returning, validate that all output fields are present and their content is contextually appropriate for the (possibly constructed) distribution. If one or more fields are missing, inappropriate, or cannot be generated, respond only with an `error` field in a JSON object describing the problem.

**Example Success Output:**
{{Example}}

**Example Error Output:**
{{Example}}

Prompt N.11: Text Tasks Sampling Prompt

Answer the following question without explanation or code. If it asks for a random number drawn from a distribution or random process, provide one valid sampled value. Return only the final sampled number as plain text:
{{question}}

Prompt N.12: Code Tasks Sampling Prompt

What is the output of this code? Predict the output without running it. If the program is nondeterministic (for example, it generates random numbers), provide one valid possible output from a single execution. Return only the exact plain-text output, with no explanation or formatting:
{{code_snippet}}
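Each sampling prompt elicits one value per call, whereas the KS@N metric described in the abstract compares a sample of size N against ground-truth samples. The sketch below shows one way these pieces could fit together. It rests on several assumptions: `ask_model` is a hypothetical callable standing in for any LLM client, the `{{question}}` placeholder is filled by plain string replacement, replies are parsed with `float` instead of the extractor LLM from this appendix, and the 0.05 significance level is a guess, since the threshold is not stated in this section.

```python
from typing import Callable, Sequence

from scipy.stats import ks_2samp

TEXT_TASK_PROMPT = (
    "Answer the following question without explanation or code. "
    "If it asks for a random number drawn from a distribution or random process, "
    "provide one valid sampled value. "
    "Return only the final sampled number as plain text:\n{{question}}"
)


def collect_samples(ask_model: Callable[[str], str], question: str, n: int) -> list:
    """Query the model n times independently and keep the replies that parse."""
    prompt = TEXT_TASK_PROMPT.replace("{{question}}", question)
    samples = []
    for _ in range(n):
        reply = ask_model(prompt).strip()
        try:
            samples.append(float(reply))
        except ValueError:
            continue  # simplification: unparseable replies are skipped
    return samples


def passes_ks(model_samples: Sequence[float],
              reference_samples: Sequence[float],
              alpha: float = 0.05) -> bool:
    """True if the two-sample KS test fails to reject at level alpha (assumed)."""
    result = ks_2samp(model_samples, reference_samples)
    return bool(result.pvalue > alpha)


# Usage with a stand-in "model" that always collapses to the same answer.
reference = [i / 100 for i in range(100)]
collapsed = collect_samples(lambda prompt: "0.5", "Draw from Uniform(0, 1).", 100)
print(passes_ks(collapsed, reference))
```

Under these assumptions, the fraction of problems for which `passes_ks` returns `True` with N model samples would correspond to a KS@N-style score.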

Prompt N.13: Answer Extractor LLM (text and code tasks)

Analyze the model output below and decide whether it reports a number.

The model output is the response of another LLM that was asked to output a random number from a specific distribution; the distribution details do not matter here.

Model output:
{model_output}

Rules:
1. If the output does not report any number (only code, explanation, etc.), return `{{"rationale": "No number found in the model output", "number": null}}`
2. If the output reports a number, return `{{"rationale": "The model mentioned <<the_number>> in the exact text <<exact_span>> at <<exact_location>> of the output", "number": <the_number>}}`
3. There may exist some cases that the model output is incomplete, malformed, or does not follow instructions. In those cases, you may see some numbers unrelated to the final answer (like repeating the list of parameters from the input distribution); therefore, if you cannot confidently identify a number being reported as the final answer, default to `{{"rationale": "Cannot confidently identify a number being reported as the final answer", "number": null}}`.
4. Only use numbers that are explicitly present in the model output. Do not infer, calculate, or extract values from variable names, code structure, or explanatory text unless they are clearly presented as the answer.
5. For `<<exact_span>>`, copy the smallest exact substring from the model output that contains the reported number.
6. For `<<exact_location>>`, briefly describe where that exact span occurs in the output. Example templates include "beginning of the output", "middle of the output", "end of the output", "first line", and "last line", but any other concise, precise location description is allowed.
7. Do NOT quote any number unless it is the final reported answer.
8. Before finalizing, verify that the JSON is valid, the rationale matches the chosen number or null outcome, and any number returned is explicitly presented in the model output as the final answer.
9. Return only one valid JSON object with exactly these keys and no extra text, markdown, or formatting: `{{"rationale": "No number found in the model output", "number": null}}` or `{{"rationale": "The model mentioned <<the_number>> in the exact text <<exact_span>> at <<exact_location>> of the output", "number": <the_number>}}`.
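The mix of single braces around `model_output` and doubled braces around the JSON examples is consistent with a Python `str.format` template, where doubled braces are escapes for literal braces. That reading is an assumption, as is the parser below: `parse_single_number` is a hypothetical helper that enforces the "exactly these keys" rule on the extractor's reply, and `TEMPLATE` is an abbreviated stand-in for the full prompt.

```python
import json
from typing import Optional

# Abbreviated stand-in for the prompt above: {model_output} is a format field,
# doubled braces are str.format escapes that render as literal JSON braces.
TEMPLATE = (
    "Model output:{model_output} "
    "If the output does not report any number, return "
    '{{"rationale": "No number found in the model output", "number": null}}'
)


def parse_single_number(extractor_reply: str) -> Optional[float]:
    """Return the extracted number, or None if absent or malformed."""
    try:
        obj = json.loads(extractor_reply)
    except json.JSONDecodeError:
        return None
    # The prompt requires exactly the keys "rationale" and "number".
    if not isinstance(obj, dict) or set(obj) != {"rationale", "number"}:
        return None
    number = obj["number"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(number, bool) or not isinstance(number, (int, float)):
        return None
    return float(number)


print(TEMPLATE.format(model_output="I'd say 4.2"))
print(parse_single_number('{"rationale": "end of the output", "number": 4.2}'))
print(parse_single_number(
    '{"rationale": "No number found in the model output", "number": null}'
))
```

Returning `None` for both "no number" and "malformed reply" keeps the caller simple; a stricter pipeline might distinguish the two cases.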

Prompt N.14: Answer Extractor LLM (text and code tasks with list output)

Analyze the model output below and decide whether it reports exactly {expected_count} numeric values (the model was asked for multiple independent samples; distribution details do not matter here).

Model output:
{model_output}

Rules:
1. If the output does not clearly present exactly {expected_count} distinct final numeric answers, return `{{"rationale": <string explaining why>, "numbers": null}}`.
2. If the output clearly presents exactly {expected_count} final numeric answers (for example one number per line), return `{{"rationale": <string summarizing where each value appears>, "numbers": [<n1>, <n2>, ...]}}` with the numbers in the same order as in the model output (list length must be exactly {expected_count}).
3. Ignore numbers that are clearly not part of the final answers (parameters from the prompt, line numbers, unrelated code). If you cannot confidently identify exactly {expected_count} values as the final answers, return "numbers": null.
4. Only use numbers explicitly present in the model output. Do not infer or calculate unstated values.
5. Each element of "numbers" must be a JSON number (integer or float), not a string.
6. Return only one valid JSON object with exactly these keys and no extra text, markdown, or code fences: "rationale" (string) and "numbers" (JSON array of length {expected_count} or null).

Prompt N.15: Answer Extractor LLM (shuffling task)

Analyze the model output below and extract exactly one shuffled list answer.

The model output is the response of another LLM that was asked to return one possible shuffled list.

Model output:
{model_output}

Rules:
1. Return exactly one valid JSON object with exactly these keys: `{"rationale": <string>, "value": <list_or_null>}`.
2. If there is no valid list answer, return `{"rationale": "No valid list found in the model output", "value": null}`.
3. If multiple possible answers appear (for example text containing "or"), choose the first complete list that appears in the output.
4. The "value" field must be a JSON array (not a string) and must preserve the original order and element types.
5. Allowed list element types: string, integer, float.
6. If the model uses Python-style single quotes, convert them to equivalent JSON string values in "value".
7. Do not infer missing elements and do not synthesize a list.
8. Return only the JSON object and no additional text, markdown, or code fences.
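Rules 4 to 6 describe a conversion from a Python-style list literal to a JSON array with restricted element types. The following is a minimal deterministic sketch of that conversion. Both helpers are hypothetical, and the permutation check against the original list is an assumption about how a shuffled answer might be validated downstream; the prompt itself does not describe that step.

```python
import ast
import json
from collections import Counter

ALLOWED_TYPES = (str, int, float)  # element types permitted by rule 5


def python_list_to_json_value(text: str):
    """Parse a Python-style list literal (single quotes allowed) into a list."""
    try:
        value = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        return None
    if not isinstance(value, list):
        return None
    # bool is a subclass of int, so reject it explicitly.
    if any(isinstance(x, bool) or not isinstance(x, ALLOWED_TYPES) for x in value):
        return None
    return value


def is_permutation(candidate: list, original: list) -> bool:
    """True if candidate holds the same elements, with the same types, as original."""
    def key(x):
        return (type(x).__name__, x)
    return Counter(map(key, candidate)) == Counter(map(key, original))


value = python_list_to_json_value("['b', 'c', 'a']")
print(json.dumps({"rationale": "first line", "value": value}))
# {"rationale": "first line", "value": ["b", "c", "a"]}
print(is_permutation(value, ["a", "b", "c"]))
# True
```

Comparing on `(type name, value)` pairs keeps `1`, `1.0`, and `"1"` distinct, which matches the rule that element types must be preserved.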

Prompt N.16: Answer Extractor LLM (real-world task)

Analyze the model output below and extract one final textual answer exactly as reported.

The model output may be a single word, a short token, or a multiline program output.

Model output:
{model_output}

Rules:
1. Return exactly one valid JSON object with exactly these keys: `{"rationale": <string>, "value": <string_or_null>}`.
2. If no usable answer text is present, return `{"rationale": "No valid textual answer found in the model output", "value": null}`.
3. Preserve line order and internal newlines for multiline outputs.
4. Trim only leading/trailing whitespace around the whole extracted answer.
5. If the output contains multiple alternatives in one line (for example "A or B"), choose the first explicit answer candidate.
6. Do not invent content and do not infer missing lines.
7. Return only the JSON object and no additional text, markdown, or code fences.
